This vignette explains the text preprocessing and entity extraction
capabilities of the LBDiscover package, which are
fundamental steps in the literature-based discovery process.
Before applying discovery models, we need to preprocess the text data and extract the entities of interest. These steps transform raw text into structured information that can be used for discovering relationships between biomedical concepts.
First, let’s retrieve some sample articles:
# Search for articles about migraines
migraine_articles <- pubmed_search(
query = "migraine pathophysiology",
max_results = 100
)
#> Created pubmed_cache environment for result caching
#> Searching PubMed for: migraine pathophysiology
#> Found 12213 results, retrieving 100 records
#> Fetching batch 1 of 1 (records 1-100)
#> Processing 99 articles
#> Cached search results for future use
# View the first three articles
head(migraine_articles[, c("pmid", "title")], 3)
#> pmid
#> 1 42220293
#> 2 42215868
#> 3 42209735
#> title
#> 1 Identifying neural oscillation and phase synchronization abnormalities in migraine and their predictive values in transcranial magnetic stimulation.
#> 2 CGRP-targeted migraine treatment and early pathophysiology in experimental subarachnoid hemorrhage.
#> 3 Multidisciplinary management of headache in developmental age: effects on school performance, quality of life, and therapeutic perspectives.
The first step is to preprocess the text data to extract meaningful terms:
# Preprocess the abstracts
preprocessed_data <- preprocess_text(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE,
custom_stopwords = c("study", "patient", "result", "conclusion"),
min_word_length = 3,
max_word_length = 25
)
#> Tokenizing text...
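As a rough illustration of what this kind of preprocessing involves, the following base R sketch lowercases a text, splits it into word tokens, drops stop words and tokens outside a length range, and counts what remains. It is not the implementation of `preprocess_text()`: the helper `count_terms()`, its tiny stop word list, and the sample sentence are all hypothetical.

```r
# Hypothetical helper: a simplified stand-in for term extraction,
# not the code used inside preprocess_text().
count_terms <- function(text,
                        stopwords = c("the", "of", "in", "a", "and", "is", "with"),
                        min_word_length = 3,
                        max_word_length = 25) {
  # Lowercase, then split on anything that is not a letter
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[nzchar(tokens)]

  # Remove stop words and tokens outside the allowed length range
  keep <- !(tokens %in% stopwords) &
    nchar(tokens) >= min_word_length &
    nchar(tokens) <= max_word_length
  tokens <- tokens[keep]

  # Count occurrences; table() orders the words alphabetically
  counts <- table(tokens)
  data.frame(
    word = names(counts),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Invented sample sentence
terms <- count_terms("Alpha band power and alpha frequency in migraine")
terms
```

The result has one row per remaining term, with a `word` and a `count` column, which is the same shape as the per-document term tables printed in this vignette.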
# View terms extracted from the first document
head(preprocessed_data$terms[[1]], 10)
#> word count
#> 1 abnormal 1
#> 2 addition 1
#> 3 age 1
#> 4 aimed 1
#> 5 alpha 2
#> 6 analgesic 1
#> 7 areas 1
#> 8 assessed 1
#> 9 background 1
#> 10 band 1
For larger datasets, we can use the optimized vectorized preprocessing function:
# Use optimized vectorized preprocessing
opt_preprocessed_data <- vec_preprocess(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE,
min_word_length = 3,
chunk_size = 50 # Process in chunks of 50 documents
)
#> Processing text in 2 chunks...
#> |======================================================================| 100%
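The idea behind `chunk_size` can be sketched in base R: document indices are split into consecutive groups, each group is processed on its own, and the results are combined. The helper `make_chunks()` and the toy texts are hypothetical, and the sketch makes no claim about how `vec_preprocess()` is implemented internally.

```r
# Hypothetical helper: split document indices into consecutive chunks.
# This mirrors the idea behind chunk_size, not vec_preprocess() itself.
make_chunks <- function(n_docs, chunk_size) {
  idx <- seq_len(n_docs)
  # Documents 1..chunk_size go to chunk 1, the next chunk_size to chunk 2, ...
  split(idx, ceiling(idx / chunk_size))
}

chunks <- make_chunks(n_docs = 99, chunk_size = 50)
length(chunks)   # number of chunks
lengths(chunks)  # documents per chunk

# Process each chunk separately, then combine the per-document results
texts <- rep("migraine aura", 99)  # invented toy corpus
n_tokens <- unlist(lapply(chunks, function(i) lengths(strsplit(texts[i], " "))))
```

With 99 documents and a chunk size of 50, this yields two chunks of 50 and 49 documents. Chunking mainly bounds how much intermediate data is held in memory at once; it does not by itself make each document faster to process.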
# Compare processing times
system.time({
preprocess_text(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE
)
})
#> Tokenizing text...
#> user system elapsed
#> 0.067 0.000 0.068
system.time({
vec_preprocess(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE,
chunk_size = 50
)
})
#> Processing text in 2 chunks...
#> |======================================================================| 100%
#> user system elapsed
#> 0.067 0.000 0.067
We can extract n-grams (sequences of n words) to capture multi-word concepts:
# Extract bigrams (2-word sequences)
bigrams <- extract_ngrams(
migraine_articles$abstract,
n = 2,
min_freq = 2
)
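For intuition, n-gram counting can be sketched in a few lines of base R: tokenize each document, slide a window of `n` words over the tokens, and tabulate the resulting strings. The helper `count_ngrams()` and the toy texts are hypothetical; this is a simplified sketch, not the code behind `extract_ngrams()`.

```r
# Hypothetical helper: count word n-grams across a character vector.
count_ngrams <- function(texts, n = 2, min_freq = 1) {
  per_doc <- lapply(texts, function(txt) {
    words <- unlist(strsplit(tolower(txt), "[^a-z0-9]+"))
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    # Slide a window of n words over the document
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  counts <- table(unlist(per_doc))
  counts <- counts[counts >= min_freq]
  out <- data.frame(
    ngram = names(counts),
    frequency = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Most frequent n-grams first
  out[order(-out$frequency), , drop = FALSE]
}

# Invented toy texts
toy <- c("Migraine with aura", "Patients with migraine", "migraine with aura is common")
toy_bigrams <- count_ngrams(toy, n = 2, min_freq = 2)
toy_bigrams
```

Only "migraine with" and "with aura" occur at least twice in the toy texts, so only those two bigrams survive the `min_freq = 2` filter. Note that this sketch does not remove stop words before forming n-grams.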
# View the most frequent bigrams
head(bigrams, 10)
#> ngram frequency
#> 11816 p 0 89
#> 8358 in the 81
#> 11386 of the 65
#> 2990 associated with 61
#> 8274 in migraine 58
#> 9971 migraine and 49
#> 11267 of migraine 48
#> 17839 with migraine 48
#> 2846 as a 46
#> 8933 is a 44
Segmenting text into sentences can be useful for more granular analysis:
# Extract sentences from the first abstract
abstracts <- migraine_articles$abstract
first_abstract <- abstracts[1]
# Make sure we have a valid abstract
if(is.na(first_abstract) || length(first_abstract) == 0 || nchar(first_abstract) == 0) {
# Find the first non-empty abstract
valid_idx <- which(!is.na(abstracts) & nchar(abstracts) > 0)
if(length(valid_idx) > 0) {
first_abstract <- abstracts[valid_idx[1]]
cat("First abstract was empty, using abstract #", valid_idx[1], "instead.\n")
} else {
# Create a sample abstract for demonstration
first_abstract <- "This is a sample abstract for demonstration. It contains multiple sentences. Each sentence will be extracted separately."
cat("No valid abstracts found. Using a sample abstract for demonstration.\n")
}
}
# Now segment the valid abstract
sentences <- segment_sentences(first_abstract)
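A rule-based sentence splitter can be sketched in base R as follows. The helper `split_sentences()`, its abbreviation list, and the sample text are hypothetical, and the approach (protect abbreviations, split on sentence-ending punctuation, restore abbreviations) is an assumption for illustration rather than a description of `segment_sentences()`.

```r
# Hypothetical helper: rule-based sentence splitting in base R.
split_sentences <- function(text, abbreviations = c("i.e.", "e.g.", "vs.")) {
  # Protect abbreviations so their periods are not treated as sentence ends.
  # fixed = TRUE matches the literal string; without it, "." is a regex
  # wildcard and "i.e." would also match substrings such as "ine ".
  for (k in seq_along(abbreviations)) {
    placeholder <- paste0("<ABBR", k, ">")
    text <- gsub(abbreviations[k], placeholder, text, fixed = TRUE)
  }

  # Split after ., ! or ? when followed by whitespace and an uppercase letter
  parts <- unlist(strsplit(text, "(?<=[.!?])\\s+(?=[A-Z])", perl = TRUE))

  # Restore the protected abbreviations
  for (k in seq_along(abbreviations)) {
    placeholder <- paste0("<ABBR", k, ">")
    parts <- gsub(placeholder, abbreviations[k], parts, fixed = TRUE)
  }
  trimws(parts)
}

# Invented sample text
demo_text <- "Migraine is common, i.e. prevalent. It affects daily life. Treatment vs. placebo was tested."
demo_sentences <- split_sentences(demo_text)
demo_sentences
```

The literal matching in both `gsub()` calls is the important design choice here: it keeps ordinary words such as "Migraine" intact while still preventing a split after "i.e." or "vs.".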
# Check if sentences list has elements before trying to access them
if(length(sentences) > 0 && length(sentences[[1]]) > 0) {
# View the first few sentences
head(sentences[[1]], min(3, length(sentences[[1]])))
} else {
cat("No sentences could be extracted. The abstract might be too short or formatted incorrectly.\n")
}
#> [1] "BACKGROUND: Migraine is a prevalent neurological disorder that may substantially disrupt daily function."
#> [2] "Repetitive transcranial magnetic stimulation (rTMS) has been shown to be a promising treatment for migraine, but its treatment efficacy still needs to be optimised."
#> [3] "This study aimed to identify abnormal neural oscillations in patients with migraines and evaluate their predictive value for TMS treatments to obtain insights for the optimisation of rTMS paradigms."
For dealing with multilingual corpora, we can detect the language of each document:
# Filter out NA values from abstracts and detect language
abstracts <- migraine_articles$abstract[1:5]
valid_abstracts <- abstracts[!is.na(abstracts)]
# Apply language detection to valid abstracts
if (length(valid_abstracts) > 0) {
# USE.NAMES = FALSE: otherwise sapply() names each result after its abstract,
# and data.frame() would print the full abstract text as row names
languages <- sapply(valid_abstracts, detect_lang, USE.NAMES = FALSE)
# View results
data.frame(
abstract_id = which(!is.na(abstracts)),
language = languages
)
} else {
message("No valid abstracts found for language detection")
}
#> abstract_id language
#> 1 1 en
#> 2 2 en
#> 3 3 en
#> 4 4 en
#> 5 5 en
After preprocessing, the next step is to extract biomedical entities from the text.
First, let’s load entity dictionaries that will be used for entity recognition:
# Load a disease dictionary
disease_dict <- load_dictionary(
dictionary_type = "disease",
source = "mesh"
)
#> Searching MeSH database for: disease[MeSH]
#> Found 194731 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#> Removed 56 terms that did not match their claimed entity types
#> Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
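The log above shows that sanitization can discard most of a dictionary. The general idea of such filtering can be sketched in base R. The helper `sanitize_terms()`, its two rules (a minimum length and a restricted character set), and the toy dictionary are assumptions made for illustration; the package's actual sanitization rules are not shown in this vignette.

```r
# Hypothetical helper: drop dictionary terms that are very short or contain
# characters other than letters, digits, spaces, and hyphens.
# These rules are assumptions for illustration, not the package's own.
sanitize_terms <- function(dict, min_chars = 3) {
  ok <- nchar(dict$term) >= min_chars &
    grepl("^[A-Za-z0-9 -]+$", dict$term)
  cleaned <- dict[ok, , drop = FALSE]
  message(sprintf(
    "%d terms remaining (%.1f%% of original)",
    nrow(cleaned), 100 * nrow(cleaned) / max(nrow(dict), 1)
  ))
  cleaned
}

# Invented toy dictionary with the same columns as the dictionaries in this vignette
toy_dict <- data.frame(
  term = c("Lobomycosis", "Osteochondrosis", "Disease, Chronic", "[MeSH]", "A"),
  id = paste0("TOY_", 1:5),
  type = "disease",
  source = "toy",
  stringsAsFactors = FALSE
)
clean_dict <- sanitize_terms(toy_dict)
clean_dict$term
```

Whatever the exact rules, it is worth checking how many terms survive before relying on a dictionary, since an aggressive filter combined with a small retrieved term list can leave very little to match against.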
# Load a drug dictionary
drug_dict <- load_dictionary(
dictionary_type = "drug",
source = "mesh"
)
#> Searching MeSH database for: pharmaceutical preparations[MeSH]
#> Found 1029007 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 2
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 2 of 2
#> Extracted 3 unique terms from MeSH text format
#> Retrieved 26 unique terms from MeSH
#> Sanitizing dictionary with 26 terms...
#> Removed 26 terms that did not match their claimed entity types
#> Sanitization complete. 0 terms remaining (0% of original)
# View a sample of each dictionary
head(disease_dict, 3)
#> term id type source
#> 10 Lobomycosis MESH_10 disease mesh_text
#> 20 Disease MESH_ENTRY_19 disease mesh_text
#> 48 Osteochondrosis MESH_7 disease mesh_text
head(drug_dict, 3)
#> [1] term id type source
#> <0 rows> (or 0-length row.names)
Now we can extract entities from the text using these dictionaries:
# Extract disease and drug entities
entities <- extract_entities(
preprocessed_data,
text_column = "abstract",
dictionary = rbind(disease_dict, drug_dict),
case_sensitive = FALSE,
overlap_strategy = "priority"
)
#> Sanitizing dictionary with 8 terms...
#> Sanitization complete. 8 terms remaining (100% of original)
#> Extracting entities from 98 documents...
#> Extracted 37 entity mentions:
#> disease: 37
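Dictionary-based entity recognition boils down to searching each text for each dictionary term. The base R sketch below does this with case-insensitive, whole-word matching. The helper `match_terms()`, its output columns, and the toy inputs are hypothetical; it is a simplified illustration, not the algorithm used by `extract_entities()`, and it ignores overlap handling entirely.

```r
# Hypothetical helper: find dictionary terms in texts with
# case-insensitive, whole-word matching. A simplified sketch only.
match_terms <- function(sentences, dict) {
  rows <- list()
  for (i in seq_along(sentences)) {
    if (is.na(sentences[i])) next
    for (j in seq_len(nrow(dict))) {
      # \Q...\E quotes the term literally; \b requires word boundaries
      pattern <- paste0("\\b\\Q", dict$term[j], "\\E\\b")
      found <- gregexpr(pattern, sentences[i], ignore.case = TRUE, perl = TRUE)[[1]]
      n_found <- sum(found > 0)  # gregexpr() returns -1 when nothing matches
      if (n_found > 0) {
        rows[[length(rows) + 1]] <- data.frame(
          doc_id = i,
          entity = dict$term[j],
          entity_type = dict$type[j],
          frequency = n_found,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(rows) == 0) return(data.frame())
  do.call(rbind, rows)
}

# Invented toy inputs
toy_sentences <- c(
  "In Parkinson's disease, RLS is linked to disease progression.",
  "CGRP is a central mediator in migraine.",
  "Diseased tissue was excluded."
)
toy_terms <- data.frame(
  term = c("Disease", "CGRP"),
  type = c("disease", "protein"),
  stringsAsFactors = FALSE
)
toy_hits <- match_terms(toy_sentences, toy_terms)
toy_hits
```

The word-boundary requirement is why "Diseased" in the third sentence is not reported as a match for "Disease", while both occurrences in the first sentence are counted.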
# View some extracted entities
head(entities[, c("doc_id", "entity", "entity_type", "sentence")], 10)
#> doc_id entity entity_type
#> 5 2 Disease disease
#> 6 2 Disease disease
#> 7 9 Disease disease
#> 8 9 Disease disease
#> 9 9 Disease disease
#> 10 10 Disease disease
#> 11 11 Disease disease
#> 12 14 Disease disease
#> 13 14 Disease disease
#> 14 16 Disease disease
#> sentence
#> 5 Pre-existing CGRP blockade did not worsen overall 14-day outcome in this model of SAH, but seemingly improved early functional recovery and the temporal pattern of disease progression.
#> 6 These findings suggest stage-dependent effects of CGRP signaling in SAH and support further study of CGRP-targeted therapies in cerebrovascular disease.
#> 7 This review synthesizes evidence on the prevalence, outcomes, and pathophysiology of RLS in various neurological disorders, including Parkinson's disease, multiple sclerosis, migraine, dementia, stroke, epilepsy, and peripheral neuropathy.
#> 8 In Parkinson's disease, RLS is linked to disease progression and dopaminergic therapy.
#> 9 In Parkinson's disease, RLS is linked to disease progression and dopaminergic therapy.
#> 10 A major focus is to highlight the translational potential of the experimental models and how they can help bridge the gap between preclinical research and clinical application and how increased understanding of fundamental disease mechanisms can form the basis of improved migraine treatments.
#> 11 While the molecular basis of the disease is well established, the mechanism(s) underlying the paroxysmal nature of attacks remains unclear.
#> 12 This study aimed to characterize ocular surface findings in POTS patients to clarify whether these symptoms reflect classic dry eye disease or altered sensory processing related to autonomic dysfunction.
#> 13 These findings underscore the need for ocular surface staining to distinguish dry eye disease from neuropathic ocular pain, suggesting that altered corneal nerve function in autonomic dysfunction may drive symptoms.
#> 14 RESULTS: Non-response to antimigraine treatment is closely linked to clinical and neurobiological factors that increase the overall disease burden.
For a more comprehensive approach, we can use the complete entity extraction workflow:
# Check if running in R CMD check environment.
# Sys.getenv() returns "" (never NULL) for unset variables, so test the value directly
is_check <- !interactive() &&
identical(Sys.getenv("R_CHECK_RUNNING"), "true")
# More robust check for testing environment
if (!is_check && nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_"))) {
is_check <- TRUE
}
# Set number of cores based on environment
num_cores_to_use <- if(is_check) 1 else 4
# Extract entities using the complete workflow
entities_workflow <- extract_entities_workflow(
preprocessed_data,
text_column = "abstract",
entity_types = c("disease", "drug", "gene", "protein", "pathway"),
dictionary_sources = c("local", "mesh"),
sanitize = TRUE,
parallel = !is_check, # Disable parallel in check environment
num_cores = num_cores_to_use # Use 1 core in check environment
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Loading dictionaries sequentially...
#> Using cached dictionary for disease (local)
#> Using cached dictionary for drug (local)
#> Using cached dictionary for gene (local)
#> Searching MeSH database for: proteins[MeSH]
#> Found 7816950 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 24 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 19 unique terms from MeSH text format
#> Retrieved 105 unique terms from MeSH
#> Added 105 terms from protein (mesh)
#> Searching MeSH database for: metabolic networks and pathways[MeSH]
#> Found 196184 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 1
#> Extracted 6 unique terms from MeSH text format
#> Retrieved 6 unique terms from MeSH
#> Added 6 terms from pathway (mesh)
#> Created combined dictionary with 135 unique terms
#> Sanitizing dictionary with 135 terms...
#> Removed 8 terms with numbers followed by special characters
#> Removed 78 terms that did not match their claimed entity types
#> Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 7 terms remaining (5.2% of original)
#> Extracting entities from 98 documents...
#> Processing batch 1/1
#> Extracting entities from 98 documents...
#> Extracted 481 entity mentions:
#> disease: 477
#> protein: 4
#> Extracted 481 entity mentions in 0.16 minutes
#> disease: 477
#> protein: 4
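Once entity mentions are available as a table, they can be turned into the kind of structured input that discovery models work with. The base R sketch below builds a document-by-entity presence matrix and derives entity co-occurrence counts from it. The toy table is invented; only its column names (`doc_id`, `entity`, `entity_type`) follow the extraction output shown in this vignette, and the co-occurrence step is a generic illustration rather than a specific LBDiscover function.

```r
# Toy entity table; the column names mirror the extraction output,
# the rows are invented for illustration
toy_entities <- data.frame(
  doc_id = c(1, 1, 2, 2, 3),
  entity = c("migraine", "CGRP", "migraine", "CGRP", "migraine"),
  entity_type = c("disease", "protein", "disease", "protein", "disease"),
  stringsAsFactors = FALSE
)

# Document-by-entity presence matrix (1 if the entity occurs in the document)
doc_entity <- unclass(table(toy_entities$doc_id, toy_entities$entity))
presence <- (doc_entity > 0) * 1

# Entity-by-entity co-occurrence: number of documents mentioning both entities.
# The diagonal holds the number of documents mentioning each entity.
cooc <- crossprod(presence)
cooc
```

In the toy data, "CGRP" and "migraine" co-occur in two documents, and "migraine" appears in three documents overall.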
# View summary of entity types
table(entities_workflow$entity_type)
#>
#> disease protein
#> 477 4
We can customize the entity extraction process by providing additional MeSH queries or custom dictionaries:
# Define custom MeSH queries for different entity types
mesh_queries <- list(
"disease" = "migraine disorders[MeSH] OR headache disorders[MeSH]",
"drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
"gene" = "genes[MeSH] OR channelopathy[MeSH]"
)
# Create a custom dictionary
custom_dict <- data.frame(
term = c("CGRP", "trigeminal nerve", "cortical spreading depression"),
type = c("protein", "anatomy", "biological_process"),
id = c("CUSTOM_1", "CUSTOM_2", "CUSTOM_3"),
source = rep("custom", 3),
stringsAsFactors = FALSE
)
# Extract entities with custom settings
custom_entities <- extract_entities_workflow(
preprocessed_data,
text_column = "abstract",
entity_types = c("disease", "drug", "gene", "protein", "pathway"),
dictionary_sources = c("local", "mesh"),
additional_mesh_queries = mesh_queries,
custom_dictionary = custom_dict,
sanitize = TRUE
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Adding 3 terms from custom dictionary
#> Loading dictionaries sequentially...
#> Using cached dictionary for disease (local)
#> Using cached dictionary for drug (local)
#> Using cached dictionary for gene (local)
#> Using cached dictionary for protein (mesh)
#> Using cached dictionary for pathway (mesh)
#> Created combined dictionary with 138 unique terms
#> Sanitizing dictionary with 135 terms...
#> Removed 8 terms with numbers followed by special characters
#> Removed 78 terms that did not match their claimed entity types
#> Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 7 terms remaining (5.2% of original)
#> Extracting entities from 98 documents...
#> Processing batch 1/1
#> Extracting entities from 98 documents...
#> Extracted 571 entity mentions:
#> anatomy: 3
#> biological_process: 3
#> disease: 477
#> protein: 88
#> Extracted 571 entity mentions in 0.01 minutes
#> anatomy: 3
#> biological_process: 3
#> disease: 477
#> protein: 88
# View custom entities
custom_entities[custom_entities$source == "custom", ]
#> [1] entity entity_type doc_id start_pos end_pos sentence
#> [7] frequency
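#> <0 rows> (or 0-length row.names)
The filter returns no rows because the result has no source column: its columns are entity, entity_type, doc_id, start_pos, end_pos, sentence, and frequency. To inspect matches for the custom terms, filter on entity_type or match entity against custom_dict$term instead.
The quality of entity extraction depends heavily on the quality of the dictionaries. We can sanitize a dictionary to remove problematic entries before extraction: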
# Create a raw dictionary with some problematic entries
raw_dict <- data.frame(
term = c("migraine", "5-HT", "headache", "the", "and", "patient", "inflammation", "study"),
type = c("disease", "chemical", "symptom", "NA", "NA", "NA", "biological_process", "NA"),
id = paste0("ID_", 1:8),
source = rep("example", 8),
stringsAsFactors = FALSE
)
# Sanitize the dictionary
sanitized_dict <- sanitize_dictionary(
raw_dict,
term_column = "term",
type_column = "type",
validate_types = TRUE,
verbose = TRUE
)
#> Sanitizing dictionary with 8 terms...
#> Removed 1 terms with numbers followed by special characters
#> Removed 3 common non-medical terms, conjunctive adverbs, and general terms
#> Sanitization complete. 4 terms remaining (50% of original)
# View the sanitized dictionary
sanitized_dict
#> term type id source
#> 1 migraine disease ID_1 example
#> 3 headache symptom ID_3 example
#> 6 patient NA ID_6 example
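#> 7 inflammation biological_process ID_7 example
In this example the number rule removes "5-HT" and the common-term filter removes "the", "and", and "study", while "patient" is kept even though its type is the string "NA". It is worth reviewing a sanitized dictionary by hand before using it.
We can map extracted terms to standard biomedical ontologies like MeSH or UMLS: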
# Extract terms to map
terms_to_map <- c("migraine", "headache", "CGRP", "serotonin")
# Map to MeSH
mesh_mappings <- map_ontology(
terms_to_map,
ontology = "mesh",
fuzzy_match = TRUE,
similarity_threshold = 0.8
)
#> Searching MeSH database for: disease[MeSH]
#> Found 194731 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#> Removed 56 terms that did not match their claimed entity types
#> Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
#> No matches found for the input terms in the mesh ontology
# View MeSH mappings
mesh_mappings
#> [1] term ontology_id ontology_term match_type
#> <0 rows> (or 0-length row.names)
We can also apply topic modeling to discover the main themes in the corpus:
# Extract topics from the corpus
topics <- extract_topics(
migraine_articles,
text_column = "abstract",
n_topics = 5,
max_terms = 10
)
#> Tokenizing text...
# View top terms for each topic
topics$topics
#> $`Topic 1`
#> term weight
#> migraine migraine 53.829608
#> associated associated 9.843197
#> patients patients 7.526765
#> related related 7.218425
#> group group 7.175990
#> clinical clinical 7.084185
#> frequency frequency 6.236584
#> compared compared 5.653727
#> results results 5.430704
#> reduced reduced 5.401042
#>
#> $`Topic 2`
#> term weight
#> pain pain 76.99004
#> patients patients 57.46340
#> migraine migraine 48.47225
#> headache headache 45.07473
#> body body 25.24152
#> treatment treatment 23.63553
#> tth tth 22.87561
#> symptoms symptoms 21.07916
#> study study 20.84757
#> between between 20.84551
#>
#> $`Topic 3`
#> term weight
#> migraine migraine 28.685966
#> headache headache 11.550284
#> light light 9.826823
#> 001 001 8.688372
#> participants participants 5.964117
#> type type 5.832478
#> intensity intensity 5.755576
#> not not 5.544141
#> cgrp cgrp 4.773276
#> study study 4.709048
#>
#> $`Topic 4`
#> term weight
#> migraine migraine 154.06984
#> clinical clinical 32.86714
#> patients patients 26.92056
#> between between 24.72872
#> evidence evidence 23.31417
#> may may 22.95471
#> reported reported 22.83974
#> these these 22.53759
#> concussion concussion 18.88280
#> treatment treatment 17.90422
#>
#> $`Topic 5`
#> term weight
#> cgrp cgrp 20.025789
#> memory memory 11.960472
#> migraine migraine 11.563287
#> sah sah 7.247311
#> aura aura 7.220514
#> performance performance 7.165361
#> cognitive cognitive 5.767783
#> after after 5.406079
#> hmgb1 hmgb1 5.394555
#> using using 5.196714
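The topic lists above still contain function words ("not", "between", "may", "these") and a numeric token ("001"), which make the themes harder to read. The following is a minimal base R sketch of one way to post-filter a topic's term table. The helper drop_uninformative_terms() and the stopword list are hypothetical and not part of LBDiscover; the data frame is copied from the Topic 3 output above.

```r
# Top terms for Topic 3, copied from the output above
topic3 <- data.frame(
  term = c("migraine", "headache", "light", "001", "participants",
           "type", "intensity", "not", "cgrp", "study"),
  weight = c(28.685966, 11.550284, 9.826823, 8.688372, 5.964117,
             5.832478, 5.755576, 5.544141, 4.773276, 4.709048),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not an LBDiscover function): drop purely numeric
# tokens and any term listed in `extra_stopwords` (case-insensitive)
drop_uninformative_terms <- function(topic_df, extra_stopwords = character()) {
  is_numeric_token <- grepl("^[0-9]+$", topic_df$term)
  is_stopword <- tolower(topic_df$term) %in% tolower(extra_stopwords)
  topic_df[!(is_numeric_token | is_stopword), , drop = FALSE]
}

# Illustrative stopword list, chosen by reading the topic output above
extra_stopwords <- c("not", "between", "may", "these", "after", "using", "study")

topic3_clean <- drop_uninformative_terms(topic3, extra_stopwords)
topic3_clean$term
```

Assuming every element of topics$topics is a data frame with a term column, as the printed output suggests, the same helper could be applied to all topics with lapply(topics$topics, drop_uninformative_terms, extra_stopwords = extra_stopwords). Filtering after the fact only tidies the display. To change the topics themselves, the stopwords would have to be removed before the model is fitted.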