Text Preprocessing and Entity Extraction

This vignette explains the text preprocessing and entity extraction capabilities of the LBDiscover package, which are fundamental steps in the literature-based discovery process.

Introduction

Before applying discovery models, we need to preprocess the text data and extract the entities of interest. These steps transform raw text into structured information that can be used for discovering relationships between biomedical concepts.

Loading the Package

library(LBDiscover)

Data Retrieval

First, let’s retrieve some sample articles:

# Search for articles about migraines
migraine_articles <- pubmed_search(
  query = "migraine pathophysiology",
  max_results = 100
)
#> Created pubmed_cache environment for result caching
#> Searching PubMed for: migraine pathophysiology
#> Found 12213 results, retrieving 100 records
#> Fetching batch 1 of 1 (records 1-100)
#> Processing 99 articles
#> Cached search results for future use

# View the first three articles
head(migraine_articles[, c("pmid", "title")], 3)
#>       pmid
#> 1 42220293
#> 2 42215868
#> 3 42209735
#>                                                                                                                                                  title
#> 1 Identifying neural oscillation and phase synchronization abnormalities in migraine and their predictive values in transcranial magnetic stimulation.
#> 2                                                  CGRP-targeted migraine treatment and early pathophysiology in experimental subarachnoid hemorrhage.
#> 3         Multidisciplinary management of headache in developmental age: effects on school performance, quality of life, and therapeutic perspectives.

Basic Text Preprocessing

The first step is to preprocess the text data to extract meaningful terms:

# Preprocess the abstracts
preprocessed_data <- preprocess_text(
  migraine_articles,
  text_column = "abstract",
  remove_stopwords = TRUE,
  custom_stopwords = c("study", "patient", "result", "conclusion"),
  min_word_length = 3,
  max_word_length = 25
)
#> Tokenizing text...

# View terms extracted from the first document
head(preprocessed_data$terms[[1]], 10)
#>          word count
#> 1    abnormal     1
#> 2    addition     1
#> 3         age     1
#> 4       aimed     1
#> 5       alpha     2
#> 6   analgesic     1
#> 7       areas     1
#> 8    assessed     1
#> 9  background     1
#> 10       band     1
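
To make the parameters above concrete, here is a minimal base R sketch of what tokenizing, stopword filtering, and length filtering can look like. It is not LBDiscover's implementation: `count_terms()` is a hypothetical helper, and its argument names merely mirror `min_word_length` and `max_word_length` from the call above.

```r
# Hypothetical helper, not part of LBDiscover: count filtered terms in one text
count_terms <- function(text, stopwords = character(0),
                        min_word_length = 3, max_word_length = 25) {
  # Lowercase, then split on anything that is not a letter
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  # Keep words inside the length bounds that are not stopwords
  len <- nchar(tokens)
  keep <- len >= min_word_length & len <= max_word_length &
    !(tokens %in% stopwords)
  tokens <- tokens[keep]
  # Tabulate into a word/count data frame (sorted alphabetically by table())
  counts <- table(tokens)
  data.frame(
    word = as.character(names(counts)),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

example_text <- "Migraine is a prevalent disorder. The study found alpha band changes in the alpha range."
terms <- count_terms(example_text, stopwords = c("the", "study"))
print(terms)
```

Short words such as "is" and "in" drop out because of the minimum length, while "the" and "study" are removed as stopwords.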

Optimized Preprocessing for Large Datasets

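For larger datasets, vec_preprocess() offers a vectorized alternative that processes the documents in chunks. On the 100-article sample used here, the two functions run in about the same time (see the timings below), so any benefit should be expected only on larger corpora: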

# Use optimized vectorized preprocessing
opt_preprocessed_data <- vec_preprocess(
  migraine_articles,
  text_column = "abstract",
  remove_stopwords = TRUE,
  min_word_length = 3,
  chunk_size = 50  # Process in chunks of 50 documents
)
#> Processing text in 2 chunks...

# Compare processing times
system.time({
  preprocess_text(
    migraine_articles,
    text_column = "abstract",
    remove_stopwords = TRUE
  )
})
#> Tokenizing text...
#>    user  system elapsed 
#>   0.067   0.000   0.068

system.time({
  vec_preprocess(
    migraine_articles,
    text_column = "abstract",
    remove_stopwords = TRUE,
    chunk_size = 50
  )
})
#> Processing text in 2 chunks...
#>    user  system elapsed 
#>   0.067   0.000   0.067
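
The idea behind `chunk_size` can be illustrated with a small base R sketch. This is only an assumption about how chunked processing might be organized, not the package's code; `process_in_chunks()` and its `fun` argument are hypothetical names.

```r
# Hypothetical helper, not part of LBDiscover: apply a function chunk by chunk
process_in_chunks <- function(texts, chunk_size, fun) {
  n <- length(texts)
  if (n == 0) return(list())
  # Assign each document to a chunk: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_len(n) / chunk_size)
  chunks <- split(texts, chunk_id)
  message("Processing text in ", length(chunks), " chunks...")
  # Process each chunk separately, then flatten to one result per document
  results <- lapply(chunks, function(chunk) lapply(chunk, fun))
  unlist(results, recursive = FALSE, use.names = FALSE)
}

texts <- rep("migraine aura and cortical spreading depression", 100)
word_counts <- process_in_chunks(
  texts,
  chunk_size = 50,
  fun = function(x) length(strsplit(x, " ", fixed = TRUE)[[1]])
)
length(word_counts)
```

With 100 documents and a chunk size of 50, the documents fall into two chunks, matching the "2 chunks" message in the output above. Chunking mainly bounds how much intermediate data is held at once, which matters more as the corpus grows.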

Advanced Text Analysis

N-gram Extraction

We can extract n-grams (sequences of n words) to capture multi-word concepts:

# Extract bigrams (2-word sequences)
bigrams <- extract_ngrams(
  migraine_articles$abstract,
  n = 2,
  min_freq = 2
)

# View the most frequent bigrams
head(bigrams, 10)
#>                 ngram frequency
#> 11816             p 0        89
#> 8358           in the        81
#> 11386          of the        65
#> 2990  associated with        61
#> 8274      in migraine        58
#> 9971     migraine and        49
#> 11267     of migraine        48
#> 17839   with migraine        48
#> 2846             as a        46
#> 8933             is a        44
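
For intuition, n-gram counting can be sketched in base R with a sliding window over the tokens. This is an illustrative sketch, not the implementation of extract_ngrams(); `count_ngrams()` is a hypothetical helper whose `n` and `min_freq` arguments simply mirror the call above.

```r
# Hypothetical helper, not part of LBDiscover: count word n-grams across texts
count_ngrams <- function(texts, n = 2, min_freq = 1) {
  ngrams_per_text <- lapply(texts, function(text) {
    # Lowercase and split on anything that is not a letter or digit
    tokens <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    # Slide a window of width n over the tokens
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  counts <- table(as.character(unlist(ngrams_per_text)))
  out <- data.frame(
    ngram = as.character(names(counts)),
    frequency = as.integer(counts),
    stringsAsFactors = FALSE
  )
  out <- out[out$frequency >= min_freq, , drop = FALSE]
  # Most frequent first, ties broken alphabetically
  out[order(-out$frequency, out$ngram), , drop = FALSE]
}

texts <- c(
  "Patients with migraine reported aura.",
  "Aura is common in patients with migraine.",
  "Results were significant (p<0.01)."
)
bigrams_demo <- count_ngrams(texts, n = 2, min_freq = 2)
print(bigrams_demo)
```

In this sketch, splitting on punctuation turns "p<0.01" into the tokens "p", "0", and "01", which produces a "p 0" bigram. That is one plausible origin of the top entry in the output above. Because stopwords are not removed here, pairs such as "in the" can dominate the counts; filtering them first is a common way to surface multi-word concepts.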

Sentence Segmentation

Segmenting text into sentences can be useful for more granular analysis:

# Extract sentences from the first abstract
abstracts <- migraine_articles$abstract
first_abstract <- abstracts[1]

# Make sure we have a valid abstract
if(length(first_abstract) == 0 || is.na(first_abstract) || nchar(first_abstract) == 0) {
  # Find the first non-empty abstract
  valid_idx <- which(!is.na(abstracts) & nchar(abstracts) > 0)
  if(length(valid_idx) > 0) {
    first_abstract <- abstracts[valid_idx[1]]
    cat("First abstract was empty, using abstract #", valid_idx[1], "instead.\n")
  } else {
    # Create a sample abstract for demonstration
    first_abstract <- "This is a sample abstract for demonstration. It contains multiple sentences. Each sentence will be extracted separately."
    cat("No valid abstracts found. Using a sample abstract for demonstration.\n")
  }
}

# Now segment the valid abstract
sentences <- segment_sentences(first_abstract)

# Check if sentences list has elements before trying to access them
if(length(sentences) > 0 && length(sentences[[1]]) > 0) {
  # View the first few sentences
  head(sentences[[1]], min(3, length(sentences[[1]])))
} else {
  cat("No sentences could be extracted. The abstract might be too short or formatted incorrectly.\n")
}
#> [1] "BACKGROUND: Migraine is a prevalent neurological disorder that may substantially disrupt daily function."
#> [2] "Repetitive transcranial magnetic stimulation (rTMS) has been shown to be a promising treatment for migraine, but its treatment efficacy still needs to be optimised."
#> [3] "This study aimed to identify abnormal neural oscillations in patients with migraines and evaluate their predictive value for TMS treatments to obtain insights for the optimisation of rTMS paradigms."
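
As a rough illustration of rule-based sentence splitting, the base R sketch below protects a few abbreviations and then splits at sentence-ending punctuation. It is a simplified sketch under stated assumptions, not the implementation of segment_sentences(); `split_sentences()`, the abbreviation list, and the placeholder string are all hypothetical.

```r
# Hypothetical helper, not part of LBDiscover: split text into sentences
split_sentences <- function(text, abbreviations = c("i.e.", "e.g.", "vs.")) {
  # Protect abbreviations by swapping their periods for a placeholder.
  # fixed = TRUE matches literally; as a regex, "i.e." would also match "ine ".
  placeholder <- "<DOT>"
  for (abbr in abbreviations) {
    protected <- gsub(".", placeholder, abbr, fixed = TRUE)
    text <- gsub(abbr, protected, text, fixed = TRUE)
  }
  # Split at whitespace that follows ., ! or ? and precedes an uppercase letter
  parts <- strsplit(text, "(?<=[.!?])\\s+(?=[A-Z])", perl = TRUE)[[1]]
  # Restore the protected periods
  gsub(placeholder, ".", parts, fixed = TRUE)
}

demo <- "Migraine is common, i.e. highly prevalent. It disrupts daily function. Treatment efficacy varies."
sentences_demo <- split_sentences(demo)
print(sentences_demo)
```

Matching abbreviations literally matters because an unescaped period in a regular expression matches any character, so a pattern like "i.e." would otherwise alter ordinary words such as "Migraine".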

Language Detection

To handle multilingual corpora, we can detect the language of each document:

# Take the first five abstracts and drop missing values
abstracts <- migraine_articles$abstract[1:5]
valid_abstracts <- abstracts[!is.na(abstracts)]

# Apply language detection to valid abstracts
if (length(valid_abstracts) > 0) {
  # USE.NAMES = FALSE keeps the abstract text out of the row names
  languages <- sapply(valid_abstracts, detect_lang, USE.NAMES = FALSE)
  
  # View results
  data.frame(
    abstract_id = which(!is.na(abstracts)),
    language = languages
  )
} else {
  message("No valid abstracts found for language detection")
}
#>   abstract_id language
#> 1           1       en
#> 2           2       en
#> 3           3       en
#> ...
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              en
#> Background and Clinical Significance: Migraine is a common yet debilitating condition that significantly impacts personal lives, productivity, and the healthcare system. Pharmacological interventions provide relief for some migraine sufferers, but for others, are ineffective or accompanied by side effects. Emerging evidence implicates autonomic nervous system dysfunction in migraine pathophysiology, suggesting that mind-body interventions may offer a simple, cost-free therapeutic option. Case Presentation: A 61-year-old woman presented with severe daily migraines that had persisted for years despite medication and dietary changes. Upon starting a regular 10 min slow diaphragmatic breathing practice, her migraines ceased immediately. At a 12-month follow-up, she had only experienced two minor headaches and reported improvements in both daily functioning and quality of life. Conclusions: These findings underscore the potential role of autonomic imbalance in chronic migraine and the preliminary feasibility of breathing interventions as an accessible, low-risk treatment that may, for some, surpass medication in efficacy. Breathing practices may offer a viable alternative to pharmaceutical interventions that benefits both patients and healthcare systems alike.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              en

Entity Extraction

After preprocessing, the next step is to extract biomedical entities from the text.

Loading Entity Dictionaries

First, load the dictionaries used for entity recognition. Each call below queries MeSH for the requested entity type and then sanitizes the retrieved terms:

# Load a disease dictionary
disease_dict <- load_dictionary(
  dictionary_type = "disease",
  source = "mesh"
)
#> Searching MeSH database for: disease[MeSH]
#> Found 194731 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#>   Removed 56 terms that did not match their claimed entity types
#>   Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)

# Load a drug dictionary
drug_dict <- load_dictionary(
  dictionary_type = "drug",
  source = "mesh"
)
#> Searching MeSH database for: pharmaceutical preparations[MeSH]
#> Found 1029007 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 2
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 2 of 2
#> Extracted 3 unique terms from MeSH text format
#> Retrieved 26 unique terms from MeSH
#> Sanitizing dictionary with 26 terms...
#>   Removed 26 terms that did not match their claimed entity types
#> Sanitization complete. 0 terms remaining (0% of original)

# View a sample of each dictionary
head(disease_dict, 3)
#>               term            id    type    source
#> 10     Lobomycosis       MESH_10 disease mesh_text
#> 20         Disease MESH_ENTRY_19 disease mesh_text
#> 48 Osteochondrosis        MESH_7 disease mesh_text
head(drug_dict, 3)
#> [1] term   id     type   source
#> <0 rows> (or 0-length row.names)
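
Sanitization can leave a dictionary empty, as happened with drug_dict in this run, so it can help to check dictionary sizes before combining them. The following is a minimal base R sketch under stated assumptions: combine_nonempty() is a hypothetical helper that is not part of LBDiscover, and the toy data frames only mimic the four columns printed above.

```r
# Hypothetical helper (not part of LBDiscover): combine only the
# dictionaries that still contain terms, and warn about empty ones.
combine_nonempty <- function(dicts) {
  sizes <- vapply(dicts, nrow, integer(1))
  empty <- names(dicts)[sizes == 0]
  if (length(empty) > 0) {
    warning("Empty dictionaries skipped: ", paste(empty, collapse = ", "))
  }
  kept <- dicts[sizes > 0]
  if (length(kept) == 0) {
    stop("All dictionaries are empty; nothing to combine.")
  }
  combined <- do.call(rbind, unname(kept))
  rownames(combined) <- NULL
  combined
}

# Toy dictionaries with the same columns as the printed output
toy_disease <- data.frame(
  term = c("Lobomycosis", "Osteochondrosis"),
  id = c("MESH_10", "MESH_7"),
  type = "disease",
  source = "mesh_text",
  stringsAsFactors = FALSE
)
toy_drug <- toy_disease[0, ]  # zero rows, same columns

# The warning about the empty drug dictionary is suppressed here
combined <- suppressWarnings(
  combine_nonempty(list(disease = toy_disease, drug = toy_drug))
)
nrow(combined)
```

Failing early when every dictionary is empty avoids running an extraction that cannot return any mentions.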

Basic Entity Extraction

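Now we can extract entities from the text using these dictionaries. Because the drug dictionary is empty after sanitization in this run, the combined dictionary contains only the 8 disease terms, and every extracted mention is a disease: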

# Extract disease and drug entities
entities <- extract_entities(
  preprocessed_data,
  text_column = "abstract",
  dictionary = rbind(disease_dict, drug_dict),
  case_sensitive = FALSE,
  overlap_strategy = "priority"
)
#> Sanitizing dictionary with 8 terms...
#> Sanitization complete. 8 terms remaining (100% of original)
#> Extracting entities from 98 documents...
#> Extracted 37 entity mentions:
#>   disease: 37

# View some extracted entities
head(entities[, c("doc_id", "entity", "entity_type", "sentence")], 10)
#>    doc_id  entity entity_type
#> 5       2 Disease     disease
#> 6       2 Disease     disease
#> 7       9 Disease     disease
#> 8       9 Disease     disease
#> 9       9 Disease     disease
#> 10     10 Disease     disease
#> 11     11 Disease     disease
#> 12     14 Disease     disease
#> 13     14 Disease     disease
#> 14     16 Disease     disease
#>                                                                                                                                                                                                                                                                                                 sentence
#> 5                                                                                                               Pre-existing CGRP blockade did not worsen overall 14-day outcome in this model of SAH, but seemingly improved early functional recovery and the temporal pattern of disease progression.
#> 6                                                                                                                                               These findings suggest stage-dependent effects of CGRP signaling in SAH and support further study of CGRP-targeted therapies in cerebrovascular disease.
#> 7                                                        This review synthesizes evidence on the prevalence, outcomes, and pathophysiology of RLS in various neurological disorders, including Parkinson's disease, multiple sclerosis, migraine, dementia, stroke, epilepsy, and peripheral neuropathy.
#> 8                                                                                                                                                                                                                 In Parkinson's disease, RLS is linked to disease progression and dopaminergic therapy.
#> 9                                                                                                                                                                                                                 In Parkinson's disease, RLS is linked to disease progression and dopaminergic therapy.
#> 10 A major focus is to highlight the translational potential of the experimental models and how they can help bridge the gap between preclinical research and clinical application and how increased understanding of fundamental disease mechanisms can form the basis of improved migraine treatments.
#> 11                                                                                                                                                           While the molecular basis of the disease is well established, the mechanism(s) underlying the paroxysmal nature of attacks remains unclear.
#> 12                                                                                           This study aimed to characterize ocular surface findings in POTS patients to clarify whether these symptoms reflect classic dry eye disease or altered sensory processing related to autonomic dysfunction.
#> 13                                                                               These findings underscore the need for ocular surface staining to distinguish dry eye disease from neuropathic ocular pain, suggesting that altered corneal nerve function in autonomic dysfunction may drive symptoms.
#> 14                                                                                                                                                   RESULTS: Non-response to antimigraine treatment is closely linked to clinical and neurobiological factors that increase the overall disease burden.
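
To make the idea of dictionary-based matching concrete, here is a small base R sketch. It is not the LBDiscover implementation: match_terms() is a hypothetical helper that reports at most one hit per term and sentence, whereas the output above lists repeated mentions within a sentence separately.

```r
# Hypothetical helper (not the LBDiscover implementation): find
# case-insensitive whole-word matches of dictionary terms in sentences.
match_terms <- function(sentences, terms) {
  hits <- lapply(terms, function(term) {
    # \Q...\E quotes the term literally; \b enforces word boundaries
    pattern <- paste0("\\b\\Q", term, "\\E\\b")
    found <- grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)
    data.frame(
      doc_id = which(found),
      entity = rep(term, sum(found)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, hits)
}

sentences <- c(
  "In Parkinson's disease, RLS is linked to disease progression.",
  "CGRP signaling was studied in an animal model.",
  "Diseased tissue was excluded."
)

hits <- match_terms(sentences, c("disease", "CGRP"))
hits
```

The word-boundary anchors keep "Diseased" in the third sentence from matching the term "disease", which is one reason whole-word matching is usually preferred over plain substring search.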

Complete Entity Extraction Workflow

For a more comprehensive approach, extract_entities_workflow() combines dictionary loading, sanitization, and extraction in a single call:

# Detect whether the code is running under R CMD check.
# Sys.getenv() returns "" (never NULL) for an unset variable, so compare
# the value directly or test it with nzchar() instead of is.null().
is_check <- !interactive() &&
  identical(Sys.getenv("R_CHECK_RUNNING"), "true")

# More robust check for testing environment
if (!is_check && nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_"))) {
  is_check <- TRUE
}

# Set number of cores based on environment
num_cores_to_use <- if (is_check) 1 else 4

# Extract entities using the complete workflow
entities_workflow <- extract_entities_workflow(
  preprocessed_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene", "protein", "pathway"),
  dictionary_sources = c("local", "mesh"),
  sanitize = TRUE,
  parallel = !is_check,           # Disable parallel in check environment
  num_cores = num_cores_to_use    # Use 1 core in check environment
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Loading dictionaries sequentially...
#>   Using cached dictionary for disease (local)
#>   Using cached dictionary for drug (local)
#>   Using cached dictionary for gene (local)
#> Searching MeSH database for: proteins[MeSH]
#> Found 7816950 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 24 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 19 unique terms from MeSH text format
#> Retrieved 105 unique terms from MeSH
#>   Added 105 terms from protein (mesh)
#> Searching MeSH database for: metabolic networks and pathways[MeSH]
#> Found 196184 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 1
#> Extracted 6 unique terms from MeSH text format
#> Retrieved 6 unique terms from MeSH
#>   Added 6 terms from pathway (mesh)
#> Created combined dictionary with 135 unique terms
#> Sanitizing dictionary with 135 terms...
#>   Removed 8 terms with numbers followed by special characters
#>   Removed 78 terms that did not match their claimed entity types
#>   Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 7 terms remaining (5.2% of original)
#> Extracting entities from 98 documents...
#> Processing batch 1/1
#> Extracting entities from 98 documents...
#> Extracted 481 entity mentions:
#>   disease: 477
#>   protein: 4
#> Extracted 481 entity mentions in 0.16 minutes
#>   disease: 477
#>   protein: 4

# View summary of entity types
table(entities_workflow$entity_type)
#> 
#> disease protein 
#>     477       4
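
Environment checks like the one in the workflow above depend on how Sys.getenv() reports unset variables. The sketch below shows that behavior and a cautious way to pick a core count. LBD_DEMO_FLAG is a made-up variable name and choose_cores() is a hypothetical helper; neither comes from the package.

```r
# Sys.getenv() returns "" (never NULL) for an unset variable,
# so nzchar() is the reliable test, not is.null().
Sys.unsetenv("LBD_DEMO_FLAG")           # hypothetical variable name
unset_value <- Sys.getenv("LBD_DEMO_FLAG")
is.null(unset_value)   # FALSE
nzchar(unset_value)    # FALSE

Sys.setenv(LBD_DEMO_FLAG = "true")
set_value <- Sys.getenv("LBD_DEMO_FLAG")
nzchar(set_value)      # TRUE
Sys.unsetenv("LBD_DEMO_FLAG")

# Hypothetical helper: never request more cores than the machine
# reports, and fall back to 1 when detection fails or during checks.
choose_cores <- function(requested, in_check) {
  if (in_check) {
    return(1L)
  }
  available <- parallel::detectCores()
  if (is.na(available)) {
    return(1L)
  }
  as.integer(max(1L, min(requested, available)))
}

cores_in_check <- choose_cores(4, in_check = TRUE)
cores_in_check
```

Capping the request at the detected core count avoids oversubscribing small machines when a fixed value such as 4 is hard-coded.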

Customizing Entity Extraction

We can customize the entity extraction process by providing additional MeSH queries or custom dictionaries:

# Define custom MeSH queries for different entity types
mesh_queries <- list(
  "disease" = "migraine disorders[MeSH] OR headache disorders[MeSH]",
  "drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
  "gene" = "genes[MeSH] OR channelopathy[MeSH]"
)

# Create a custom dictionary
custom_dict <- data.frame(
  term = c("CGRP", "trigeminal nerve", "cortical spreading depression"),
  type = c("protein", "anatomy", "biological_process"),
  id = c("CUSTOM_1", "CUSTOM_2", "CUSTOM_3"),
  source = rep("custom", 3),
  stringsAsFactors = FALSE
)

# Extract entities with custom settings
custom_entities <- extract_entities_workflow(
  preprocessed_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene", "protein", "pathway"),
  dictionary_sources = c("local", "mesh"),
  additional_mesh_queries = mesh_queries,
  custom_dictionary = custom_dict,
  sanitize = TRUE
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Adding 3 terms from custom dictionary
#> Loading dictionaries sequentially...
#>   Using cached dictionary for disease (local)
#>   Using cached dictionary for drug (local)
#>   Using cached dictionary for gene (local)
#>   Using cached dictionary for protein (mesh)
#>   Using cached dictionary for pathway (mesh)
#> Created combined dictionary with 138 unique terms
#> Sanitizing dictionary with 135 terms...
#>   Removed 8 terms with numbers followed by special characters
#>   Removed 78 terms that did not match their claimed entity types
#>   Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 7 terms remaining (5.2% of original)
#> Extracting entities from 98 documents...
#> Processing batch 1/1
#> Extracting entities from 98 documents...
#> Extracted 571 entity mentions:
#>   anatomy: 3
#>   biological_process: 3
#>   disease: 477
#>   protein: 88
#> Extracted 571 entity mentions in 0.01 minutes
#>   anatomy: 3
#>   biological_process: 3
#>   disease: 477
#>   protein: 88

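# View entities that came from the custom dictionary.
# The extraction result has no `source` column (its columns are entity,
# entity_type, doc_id, start_pos, end_pos, sentence, frequency), so
# match on the custom terms instead of filtering by source.
custom_rows <- tolower(custom_entities$entity) %in% tolower(custom_dict$term)
head(custom_entities[custom_rows, c("doc_id", "entity", "entity_type")])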
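
Before passing a custom dictionary to the workflow, a quick structural check can catch problems early. The sketch below assumes that the four columns used in custom_dict above (term, type, id, source) are the ones required; validate_custom_dict() is a hypothetical helper, not an LBDiscover function.

```r
# Hypothetical helper (not part of LBDiscover): check that a custom
# dictionary has the expected columns and no empty or duplicated terms.
validate_custom_dict <- function(dict,
                                 required = c("term", "type", "id", "source")) {
  missing_cols <- setdiff(required, names(dict))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (anyNA(dict$term) || any(!nzchar(trimws(dict$term)))) {
    stop("Dictionary contains empty or missing terms.")
  }
  dup <- duplicated(tolower(dict$term))
  if (any(dup)) {
    stop("Duplicated terms: ", paste(dict$term[dup], collapse = ", "))
  }
  invisible(TRUE)
}

custom_dict <- data.frame(
  term = c("CGRP", "trigeminal nerve", "cortical spreading depression"),
  type = c("protein", "anatomy", "biological_process"),
  id = c("CUSTOM_1", "CUSTOM_2", "CUSTOM_3"),
  source = rep("custom", 3),
  stringsAsFactors = FALSE
)

# Passes silently for a well-formed dictionary
ok <- validate_custom_dict(custom_dict)

# A dictionary without id and source columns is rejected
incomplete <- custom_dict[, c("term", "type")]
problem <- tryCatch(
  validate_custom_dict(incomplete),
  error = function(e) conditionMessage(e)
)
problem
```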

Dictionary Sanitization

Extraction quality depends heavily on dictionary quality. sanitize_dictionary() removes problematic entries, such as common non-medical words, before they can produce spurious matches:

# Create a raw dictionary with some problematic entries
raw_dict <- data.frame(
  term = c("migraine", "5-HT", "headache", "the", "and", "patient", "inflammation", "study"),
  type = c("disease", "chemical", "symptom", "NA", "NA", "NA", "biological_process", "NA"),
  id = paste0("ID_", 1:8),
  source = rep("example", 8),
  stringsAsFactors = FALSE
)

# Sanitize the dictionary
sanitized_dict <- sanitize_dictionary(
  raw_dict,
  term_column = "term",
  type_column = "type",
  validate_types = TRUE,
  verbose = TRUE
)
#> Sanitizing dictionary with 8 terms...
#>   Removed 1 terms with numbers followed by special characters
#>   Removed 3 common non-medical terms, conjunctive adverbs, and general terms
#> Sanitization complete. 4 terms remaining (50% of original)

# View the sanitized dictionary
sanitized_dict
#>           term               type   id  source
#> 1     migraine            disease ID_1 example
#> 3     headache            symptom ID_3 example
#> 6      patient                 NA ID_6 example
#> 7 inflammation biological_process ID_7 example
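
In the output above, "patient" survives sanitization even though its type is the string "NA". If untyped entries are not wanted, they can be dropped in a follow-up step. This is a base R sketch on a toy copy of the printed result, not a feature of sanitize_dictionary().

```r
# Toy copy of the sanitized dictionary printed above
sanitized_toy <- data.frame(
  term = c("migraine", "headache", "patient", "inflammation"),
  type = c("disease", "symptom", "NA", "biological_process"),
  id = c("ID_1", "ID_3", "ID_6", "ID_7"),
  source = "example",
  stringsAsFactors = FALSE
)

# Drop rows whose type is a real missing value or the literal string "NA"
has_type <- !is.na(sanitized_toy$type) & sanitized_toy$type != "NA"
typed_dict <- sanitized_toy[has_type, ]
typed_dict$term
```

Using NA_character_ instead of the string "NA" when building a dictionary makes this kind of filter simpler, because is.na() alone is then sufficient.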

Mapping Terms to Biomedical Ontologies

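We can map extracted terms to standard biomedical ontologies such as MeSH or UMLS. In the run shown here, the MeSH dictionary retained only 8 terms after sanitization, and none of the four input terms was matched: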

# Extract terms to map
terms_to_map <- c("migraine", "headache", "CGRP", "serotonin")

# Map to MeSH
mesh_mappings <- map_ontology(
  terms_to_map,
  ontology = "mesh",
  fuzzy_match = TRUE,
  similarity_threshold = 0.8
)
#> Searching MeSH database for: disease[MeSH]
#> Found 194731 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#>   Removed 56 terms that did not match their claimed entity types
#>   Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
#> No matches found for the input terms in the mesh ontology

# View MeSH mappings
mesh_mappings
#> [1] term          ontology_id   ontology_term match_type   
#> <0 rows> (or 0-length row.names)
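
To illustrate what a similarity threshold of 0.8 can mean in fuzzy matching, the sketch below scores term pairs with a normalized Levenshtein similarity from utils::adist(). This is an assumption made for illustration: the source does not state which similarity measure map_ontology() uses, string_similarity() is a hypothetical helper, and the ontology terms are toy values.

```r
# Hypothetical helper: similarity in [0, 1] derived from Levenshtein
# distance (utils::adist), compared case-insensitively.
string_similarity <- function(a, b) {
  a <- tolower(a)
  b <- tolower(b)
  distances <- utils::adist(a, b)
  longest <- outer(nchar(a), nchar(b), pmax)
  1 - distances / longest
}

terms <- c("migraine", "headache", "serotonine")
ontology_terms <- c("Migraine Disorders", "Headache", "Serotonin")  # toy values

# Rows are input terms, columns are ontology terms
sim <- string_similarity(terms, ontology_terms)
best <- max.col(sim, ties.method = "first")
best_sim <- sim[cbind(seq_along(terms), best)]

fuzzy_matches <- data.frame(
  term = terms,
  ontology_term = ontology_terms[best],
  similarity = round(best_sim, 2),
  matched = best_sim >= 0.8,
  stringsAsFactors = FALSE
)
fuzzy_matches
```

Under this measure, a misspelling such as "serotonine" clears the 0.8 threshold, while "migraine" does not match the much longer "Migraine Disorders", so multi-word ontology labels may need a lower threshold or token-level matching.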

Topic Modeling

We can also apply topic modeling to discover the main themes in the corpus. The call below runs on the raw abstracts in migraine_articles rather than on preprocessed_data:

# Extract topics from the corpus
topics <- extract_topics(
  migraine_articles,
  text_column = "abstract",
  n_topics = 5,
  max_terms = 10
)
#> Tokenizing text...

# View top terms for each topic
topics$topics
#> $`Topic 1`
#>                  term    weight
#> migraine     migraine 53.829608
#> associated associated  9.843197
#> patients     patients  7.526765
#> related       related  7.218425
#> group           group  7.175990
#> clinical     clinical  7.084185
#> frequency   frequency  6.236584
#> compared     compared  5.653727
#> results       results  5.430704
#> reduced       reduced  5.401042
#> 
#> $`Topic 2`
#>                term   weight
#> pain           pain 76.99004
#> patients   patients 57.46340
#> migraine   migraine 48.47225
#> headache   headache 45.07473
#> body           body 25.24152
#> treatment treatment 23.63553
#> tth             tth 22.87561
#> symptoms   symptoms 21.07916
#> study         study 20.84757
#> between     between 20.84551
#> 
#> $`Topic 3`
#>                      term    weight
#> migraine         migraine 28.685966
#> headache         headache 11.550284
#> light               light  9.826823
#> 001                   001  8.688372
#> participants participants  5.964117
#> type                 type  5.832478
#> intensity       intensity  5.755576
#> not                   not  5.544141
#> cgrp                 cgrp  4.773276
#> study               study  4.709048
#> 
#> $`Topic 4`
#>                  term    weight
#> migraine     migraine 154.06984
#> clinical     clinical  32.86714
#> patients     patients  26.92056
#> between       between  24.72872
#> evidence     evidence  23.31417
#> may               may  22.95471
#> reported     reported  22.83974
#> these           these  22.53759
#> concussion concussion  18.88280
#> treatment   treatment  17.90422
#> 
#> $`Topic 5`
#>                    term    weight
#> cgrp               cgrp 20.025789
#> memory           memory 11.960472
#> migraine       migraine 11.563287
#> sah                 sah  7.247311
#> aura               aura  7.220514
#> performance performance  7.165361
#> cognitive     cognitive  5.767783
#> after             after  5.406079
#> hmgb1             hmgb1  5.394555
#> using             using  5.196714
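
Several topics above contain function words ("not", "these", "may") and numeric tokens ("001"). One option is to filter a topic's term table after the fact. The sketch below does this in base R on a toy copy of Topic 3; the stopword list is a small illustrative assumption and not the list used by the package.

```r
# Toy copy of the Topic 3 term table printed above
topic3 <- data.frame(
  term = c("migraine", "headache", "light", "001", "participants",
           "type", "intensity", "not", "cgrp", "study"),
  weight = c(28.685966, 11.550284, 9.826823, 8.688372, 5.964117,
             5.832478, 5.755576, 5.544141, 4.773276, 4.709048),
  stringsAsFactors = FALSE
)

# Small illustrative stopword list (an assumption, not the package's list)
extra_stopwords <- c("not", "these", "may", "between", "after", "using", "study")

# Drop purely numeric tokens and the extra stopwords
is_numeric_token <- grepl("^[0-9]+$", topic3$term)
keep <- !is_numeric_token & !(topic3$term %in% extra_stopwords)
topic3_clean <- topic3[keep, ]
topic3_clean$term
```

Filtering after modeling only tidies the display. Removing such tokens before fitting the model would also change the topic weights themselves.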