---
title: "Getting Started with Literature-Based Discovery"
author: "Chao Liu"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    css: custom.css
vignette: >
  %\VignetteIndexEntry{Getting Started with Literature-Based Discovery}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 7,
fig.height = 5
)
# Create custom CSS file for better layout
cat("
body {
max-width: 1200px;
margin: 0 auto;
padding: 0 10px;
}
.main-container {
max-width: 100%;
width: 100%;
}
table {
width: 100%;
max-width: 100%;
font-size: 0.9em;
overflow-x: auto;
display: block;
}
th, td {
padding: 4px 8px;
white-space: nowrap;
}
pre {
max-width: 100%;
overflow-x: auto;
}
iframe {
width: 100%;
height: 800px;
border: none;
}
/* Remove outer box lines */
.section {
border: none !important;
padding: 0 !important;
margin-bottom: 20px !important;
}
/* Fix title and nav overlap */
h1.title {
margin-top: 50px !important;
}
.navbar {
margin-bottom: 40px !important;
}
#header {
margin-bottom: 40px;
}
#TOC {
margin-top: 30px;
position: relative;
z-index: 100;
}
/* Fix embedded report styles */
#migraine_discoveries {
margin-top: 30px;
}
#migraine_discoveries .section {
border: none !important;
padding: 0 !important;
}
#migraine_discoveries .nav {
position: relative !important;
margin-bottom: 40px !important;
}
#migraine_discoveries .content {
margin-top: 20px !important;
}
", file = "custom.css")
```
# Introduction
Literature-based discovery (LBD) identifies hidden connections between pieces of knowledge that are already published but scattered across the scientific literature. This vignette introduces the `LBDiscover` package, which provides tools for automated literature-based discovery from biomedical publications.
## Installation
You can install the package from CRAN:
```{r cran_installation, eval = FALSE}
install.packages("LBDiscover")
```
Alternatively, install the development version of `LBDiscover` from GitHub:
```{r github_installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("chaoliu-cl/LBDiscover")
```
## Basic Workflow
The typical workflow for literature-based discovery with this package consists of:

1. Retrieving publications from PubMed or other sources
2. Preprocessing text data
3. Extracting biomedical entities
4. Creating a co-occurrence matrix
5. Applying discovery models
6. Visualizing and evaluating results

Let's walk through a comprehensive example exploring connections in migraine research.
## Example: Exploring Migraine Research
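In this example, we'll look for potential discoveries in migraine research using the package's improved ABC model together with several utility functions. The ABC model infers a candidate link between a starting term A and a target term C when both are connected to a shared intermediate term B, even if A and C are never linked directly.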
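
A toy example can make the A-B-C pattern concrete. The snippet below is a minimal base-R sketch, not the package's implementation: the matrix values, the term names, and the helper `find_c_terms()` are all hypothetical. It treats two terms as linked if they co-occur at least once, then lists C terms that share an intermediate B term with A but never co-occur with A directly.

```r
# Toy symmetric co-occurrence counts (hypothetical values for illustration only)
terms <- c("migraine", "serotonin", "cgrp", "drug_x", "drug_y")
co <- matrix(0, nrow = 5, ncol = 5, dimnames = list(terms, terms))
co["migraine", "serotonin"] <- co["serotonin", "migraine"] <- 12
co["migraine", "cgrp"]      <- co["cgrp", "migraine"]      <- 8
co["serotonin", "drug_x"]   <- co["drug_x", "serotonin"]   <- 5
co["cgrp", "drug_x"]        <- co["drug_x", "cgrp"]        <- 3
co["migraine", "drug_y"]    <- co["drug_y", "migraine"]    <- 4
co["cgrp", "drug_y"]        <- co["drug_y", "cgrp"]        <- 2

# Hypothetical helper: C terms reachable from `a` only through a B term
find_c_terms <- function(co, a) {
  others  <- setdiff(rownames(co), a)
  b_terms <- others[co[a, others] > 0]    # terms that co-occur with A
  c_cands <- others[co[a, others] == 0]   # terms never seen with A
  out <- lapply(c_cands, function(c_term) {
    linking_b <- b_terms[co[b_terms, c_term] > 0]
    if (length(linking_b) == 0) return(NULL)
    data.frame(a_term = a, b_term = linking_b, c_term = c_term,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

toy_abc <- find_c_terms(co, "migraine")
print(toy_abc)
```

Here `drug_y` is excluded because it already co-occurs with `migraine`, while `drug_x` is reported once per linking B term.
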
### 1. Load the package
```{r load-package}
library(LBDiscover)
```
### 2. Define the primary term of interest
```{r define-primary-term}
# Define the primary term of interest for our analysis
primary_term <- "migraine"
```
### 3. Retrieve publications
We'll search for articles about migraine pathophysiology and treatment.
```{r retrieve-publications, eval = TRUE, message=FALSE}
# Search for migraine-related articles
migraine_articles <- pubmed_search(
query = paste0(primary_term, " pathophysiology"),
max_results = 1000
)
# Search for treatment-related articles
drug_articles <- pubmed_search(
query = "neurological drugs pain treatment OR migraine therapy OR headache medication",
max_results = 1000
)
# Combine and remove duplicates
all_articles <- merge_results(migraine_articles, drug_articles)
cat("Retrieved", nrow(all_articles), "unique articles\n")
```
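
What "combine and remove duplicates" means can be sketched in base R. This is an illustration only: it assumes each result set is a data frame with a `pmid` column, which may not match the columns `pubmed_search()` actually returns, and `combine_unique()` is a hypothetical helper rather than the implementation of `merge_results()`.

```r
# Two small result sets with one overlapping article (made-up identifiers)
set_a <- data.frame(pmid = c("101", "102", "103"),
                    title = c("Article A", "Article B", "Article C"),
                    stringsAsFactors = FALSE)
set_b <- data.frame(pmid = c("103", "104"),
                    title = c("Article C", "Article D"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: stack both sets and keep the first row per identifier
combine_unique <- function(x, y, id_col = "pmid") {
  shared_cols <- intersect(names(x), names(y))
  combined <- rbind(x[, shared_cols, drop = FALSE],
                    y[, shared_cols, drop = FALSE])
  combined <- combined[!duplicated(combined[[id_col]]), , drop = FALSE]
  rownames(combined) <- NULL
  combined
}

toy_articles <- combine_unique(set_a, set_b)
nrow(toy_articles)
```

Deduplicating on a stable identifier rather than on the title avoids treating reprints or retitled records as distinct articles.
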
### 4. Extract variations of the primary term
```{r extract-term-variations}
# Extract variations of our primary term using the utility function
primary_term_variations <- get_term_vars(all_articles, primary_term)
cat("Found", length(primary_term_variations), "variations of", primary_term, "in the corpus:\n")
print(head(primary_term_variations, 10))
```
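
One simple way to think about term variations is as words in the corpus that start with the primary term. The sketch below illustrates that idea with a regular expression. It is an assumption about the general technique, not a description of how `get_term_vars()` works, and `find_variations()` and the sample abstracts are hypothetical.

```r
# Made-up abstracts for illustration
abstracts <- c(
  "Migraine with aura affects many patients.",
  "Migraines and migraineurs were studied over time.",
  "Tension headache differs from migraine."
)

# Hypothetical helper: whole words that begin with `term`, excluding the term itself
find_variations <- function(texts, term) {
  pattern <- paste0("\\b", term, "[a-z]*\\b")
  hits <- regmatches(texts, gregexpr(pattern, texts, ignore.case = TRUE))
  variations <- unique(tolower(unlist(hits)))
  setdiff(variations, tolower(term))
}

find_variations(abstracts, "migraine")
```

A prefix match like this only finds suffixed forms; a form such as "migrainous" would need a shorter stem or a stemming step.
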
### 5. Preprocess text data
```{r preprocess-text}
# Preprocess text
preprocessed_articles <- preprocess_text(
all_articles,
text_column = "abstract",
remove_stopwords = TRUE,
min_word_length = 2 # Set min_word_length to capture short terms
)
```
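
The effect of `remove_stopwords` and `min_word_length` can be illustrated with a small base-R tokenizer. This is a hedged sketch of the general preprocessing steps, not the internals of `preprocess_text()`; the stop-word list and the helper `tokenize_text()` are hypothetical.

```r
# A tiny stop-word list for illustration; real lists are much longer
stop_words <- c("the", "of", "and", "in", "is", "a", "with", "to")

# Hypothetical helper: lowercase, strip punctuation, drop short words and stop words
tokenize_text <- function(text, stop_words, min_word_length = 2) {
  text   <- tolower(text)
  text   <- gsub("[^a-z0-9 ]+", " ", text)   # replace punctuation with spaces
  tokens <- strsplit(text, " +")[[1]]
  tokens <- tokens[nchar(tokens) >= min_word_length]
  tokens[!tokens %in% stop_words]
}

tokens <- tokenize_text("The role of 5-HT receptors in migraine is unclear.",
                        stop_words)
tokens
```

With `min_word_length = 2`, a short but meaningful token such as `ht` survives, whereas a stricter minimum of 3 would discard it.
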
### 6. Create a custom dictionary
```{r create-custom-dictionary}
# Create a custom dictionary with all variations of our primary term
custom_dictionary <- data.frame(
term = c(primary_term, primary_term_variations),
type = rep("disease", length(primary_term_variations) + 1),
id = paste0("CUSTOM_", 1:(length(primary_term_variations) + 1)),
source = rep("custom", length(primary_term_variations) + 1),
stringsAsFactors = FALSE
)
# Define additional MeSH queries for extended dictionaries
# (used only by the optional MeSH/UMLS workflow shown commented out in step 7)
mesh_queries <- list(
"disease" = paste0(primary_term, " disorders[MeSH] OR headache disorders[MeSH]"),
"protein" = "receptors[MeSH] OR ion channels[MeSH]",
"chemical" = "neurotransmitters[MeSH] OR vasoactive agents[MeSH]",
"pathway" = "signal transduction[MeSH] OR pain[MeSH]",
"drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
"gene" = "genes[MeSH] OR channelopathy[MeSH]"
)
# Sanitize the custom dictionary
custom_dictionary <- sanitize_dictionary(
custom_dictionary,
term_column = "term",
type_column = "type",
validate_types = FALSE # Don't validate custom terms as they're trusted
)
```
### 7. Extract biomedical entities
```{r extract-entities, message=FALSE}
# Extract entities using our custom dictionary
custom_entities <- extract_entities(
preprocessed_articles,
text_column = "abstract",
dictionary = custom_dictionary,
case_sensitive = FALSE,
overlap_strategy = "priority",
sanitize_dict = FALSE # Already sanitized
)
# Extract entities using the standard workflow with improved entity validation
# Check if running in an R CMD check environment.
# Sys.getenv() returns "" (never NULL) for unset variables, so compare values
# and test for non-empty strings instead of using is.null().
is_check <- !interactive() &&
  identical(Sys.getenv("R_CHECK_RUNNING"), "true")
# Checks that limit cores set _R_CHECK_LIMIT_CORES_; treat that as a check run too
if (!is_check && nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_"))) {
  is_check <- TRUE
}
# Set number of cores based on environment
num_cores_to_use <- if (is_check) 1 else 4
standard_entities <- extract_entities_workflow(
preprocessed_articles,
text_column = "abstract",
entity_types = c("disease", "drug", "gene"),
parallel = !is_check, # Disable parallel in check environment
num_cores = num_cores_to_use, # Use 1 core in check environment
batch_size = 500 # Process 500 documents per batch
)
# Uncomment to include UMLS entities
# standard_entities <- extract_entities_workflow(
# preprocessed_articles,
# text_column = "abstract",
# entity_types = c("disease", "drug", "gene", "protein", "pathway", "chemical"),
# dictionary_sources = c("local", "mesh", "umls"), # Including UMLS
# additional_mesh_queries = mesh_queries,
# sanitize = TRUE,
# api_key = "your-umls-api-key", # Your UMLS API key here
# parallel = TRUE,
# num_cores = 4
# )
# Combine entity datasets using our utility function
entities <- merge_entities(
custom_entities,
standard_entities,
primary_term
)
# Filter entities to ensure only relevant biomedical terms are included
filtered_entities <- valid_entities(
entities,
primary_term,
primary_term_variations,
validation_function = is_valid_biomedical_entity
)
# View the first few extracted entities
head(filtered_entities)
```
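
Dictionary-based entity extraction can be pictured as looking up each dictionary term in each document. The following base-R sketch shows that idea under simplifying assumptions: it uses plain substring matching and ignores overlap handling. The data and the helper `match_dictionary()` are hypothetical and do not reflect how `extract_entities()` is implemented. The output columns are named to mirror those passed to `create_comat()` in the next step.

```r
# Made-up documents and dictionary
docs <- data.frame(
  doc_id = c(1, 2),
  abstract = c("sumatriptan relieves migraine attacks",
               "cgrp levels rise during migraine"),
  stringsAsFactors = FALSE
)
dictionary <- data.frame(
  term = c("migraine", "sumatriptan", "cgrp"),
  type = c("disease", "drug", "protein"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per (document, matched term) pair
match_dictionary <- function(docs, dictionary) {
  rows <- lapply(seq_len(nrow(dictionary)), function(i) {
    # fixed = TRUE treats the term literally; substring matching is a simplification
    hit <- grepl(dictionary$term[i], tolower(docs$abstract), fixed = TRUE)
    if (!any(hit)) return(NULL)
    data.frame(doc_id = docs$doc_id[hit],
               entity = dictionary$term[i],
               entity_type = dictionary$type[i],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

toy_entities <- match_dictionary(docs, dictionary)
toy_entities
```
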
### 8. Create co-occurrence matrix
```{r co-occurrence-matrix}
# Create co-occurrence matrix with validated entities
co_matrix <- create_comat(
filtered_entities,
doc_id_col = "doc_id",
entity_col = "entity",
type_col = "entity_type",
normalize = TRUE,
normalization_method = "cosine"
)
# Find our primary term in the co-occurrence matrix
a_term <- find_term(co_matrix, primary_term)
# Check matrix dimensions
dim(co_matrix)
```
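
To show what a normalized co-occurrence matrix contains, here is a minimal base-R sketch built from document-entity pairs. It assumes that cosine normalization divides each co-occurrence count by the square root of the product of the two entities' document frequencies; whether `create_comat()` computes exactly this is an assumption, and the data are made up.

```r
# Made-up entity table: one row per (document, entity) pair
toy_entities <- data.frame(
  doc_id = c(1, 1, 2, 2, 3, 3),
  entity = c("migraine", "sumatriptan", "migraine", "cgrp", "cgrp", "sumatriptan"),
  stringsAsFactors = FALSE
)

# Document-by-entity incidence matrix (1 if the entity appears in the document)
incidence <- unclass(table(toy_entities$doc_id, toy_entities$entity))
incidence[incidence > 0] <- 1

# Entity-by-entity counts: number of documents in which both entities appear
counts <- crossprod(incidence)

# Cosine normalization: divide each count by sqrt(freq_i * freq_j)
freq <- diag(counts)
cosine <- counts / sqrt(outer(freq, freq))
diag(cosine) <- 0   # self co-occurrence is not informative here
round(cosine, 2)
```

Normalizing keeps very frequent entities from dominating the scores simply because they appear in many documents.
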
### 9. Apply the improved ABC model
```{r abc-model}
# Apply the improved ABC model with enhanced term filtering and type validation
abc_results <- abc_model(
co_matrix,
a_term = a_term,
c_term = NULL, # Allow all potential C terms
min_score = 0.001, # Lower threshold to capture more potential connections
n_results = 500, # Increase to get more candidates before filtering
scoring_method = "combined",
# Focus on biomedically relevant entity types
b_term_types = c("protein", "gene", "pathway", "chemical"),
c_term_types = c("drug", "chemical", "protein", "gene"),
exclude_general_terms = TRUE, # Enable enhanced term filtering
filter_similar_terms = TRUE, # Remove terms too similar to migraine
similarity_threshold = 0.7, # Relatively strict similarity threshold
enforce_strict_typing = TRUE # Enable strict entity type validation
)
# If we don't have enough results, try with less stringent criteria
min_desired_results <- 10
if (nrow(abc_results) < min_desired_results) {
cat("Not enough results with strict filtering. Trying with less stringent criteria...\n")
abc_results <- abc_model(
co_matrix,
a_term = a_term,
c_term = NULL,
min_score = 0.0005, # Even lower threshold
n_results = 500,
scoring_method = "combined",
b_term_types = NULL, # No type constraints
c_term_types = NULL, # No type constraints
exclude_general_terms = TRUE,
filter_similar_terms = TRUE,
similarity_threshold = 0.8, # More lenient similarity threshold
enforce_strict_typing = FALSE # Disable strict type validation as fallback
)
}
# Check if we have results and what columns are available
if (nrow(abc_results) > 0) {
cat("Available columns in abc_results:", paste(colnames(abc_results), collapse = ", "), "\n")
# Safely display results with available columns
display_cols <- intersect(c("a_term", "b_term", "c_term", "abc_score", "b_type", "c_type"),
colnames(abc_results))
if (length(display_cols) > 0) {
# View top results
head(abc_results[, display_cols, drop = FALSE])
} else {
# If specific columns aren't available, show first few columns
head(abc_results[, 1:min(6, ncol(abc_results)), drop = FALSE])
}
} else {
cat("No ABC results found\n")
}
```
### 10. Apply statistical validation to the results
```{r validate-results}
# Apply statistical validation to the results
validated_results <- tryCatch({
validate_abc(
abc_results,
co_matrix,
alpha = 0.1, # More lenient significance threshold
correction = "BH", # Benjamini-Hochberg correction for multiple testing
filter_by_significance = FALSE # Keep all results but mark significant ones
)
}, error = function(e) {
cat("Error in statistical validation:", e$message, "\n")
cat("Using original results without validation...\n")
# Fallback only: add placeholder values derived from ABC scores.
# These are rank-like heuristics, not true statistical p-values.
abc_results$p_value <- 1 - abc_results$abc_score / max(abc_results$abc_score, na.rm = TRUE)
abc_results$significant <- abc_results$p_value < 0.1
return(abc_results)
})
# Sort by ABC score and take top results
validated_results <- validated_results[order(-validated_results$abc_score), ]
top_n <- min(100, nrow(validated_results)) # Larger top N for diversification
top_results <- head(validated_results, top_n)
# Check available columns and display safely
if (nrow(top_results) > 0) {
display_cols <- intersect(c("a_term", "b_term", "c_term", "abc_score", "p_value", "significant"),
colnames(top_results))
if (length(display_cols) > 0) {
# View top validated results
head(top_results[, display_cols, drop = FALSE])
} else {
# Show available columns
head(top_results[, 1:min(6, ncol(top_results)), drop = FALSE])
}
} else {
cat("No validated results available\n")
}
```
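
The Benjamini-Hochberg correction requested with `correction = "BH"` is available in base R through `p.adjust()`. The sketch below applies it to hypothetical raw p-values and flags results at the same 0.1 threshold used above. How `validate_abc()` derives its raw p-values is not shown here and is not assumed.

```r
# Hypothetical raw p-values for five candidate A-C connections
p_raw <- c(0.001, 0.02, 0.03, 0.20, 0.65)

# Benjamini-Hochberg adjustment controls the false discovery rate
p_adj <- p.adjust(p_raw, method = "BH")

toy_validation <- data.frame(
  p_value = p_raw,
  p_adjusted = p_adj,
  significant = p_adj < 0.1
)
toy_validation
```

Because many candidate connections are tested at once, comparing adjusted rather than raw p-values to the threshold limits the share of false discoveries among the flagged results.
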
### 11. Diversify and ensure minimum results
```{r diversify-results}
# Diversify results using our utility function
diverse_results <- safe_diversify(
top_results,
diversity_method = "both",
max_per_group = 5,
min_score = 0.0001,
min_results = 5
)
# Ensure we have enough results for visualization
diverse_results <- min_results(
diverse_results,
top_results,
a_term,
min_results = 3
)
```
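
One way to read `max_per_group` is as a cap on how many results may share the same intermediate term. The base-R sketch below illustrates that idea by keeping only the top-scoring rows per B term. It is an assumption about the general technique, not the logic of `safe_diversify()`, and the helper `cap_per_group()` and the data are hypothetical.

```r
# Made-up results, several of which share the same B term
toy_results <- data.frame(
  b_term = c("serotonin", "serotonin", "serotonin", "cgrp", "cgrp", "dopamine"),
  c_term = c("drug_a", "drug_b", "drug_c", "drug_d", "drug_e", "drug_f"),
  abc_score = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep at most `max_per_group` top-scoring rows per group
cap_per_group <- function(results, group_col = "b_term", max_per_group = 2) {
  results <- results[order(-results$abc_score), , drop = FALSE]
  # Rank of each row within its group, in descending score order
  rank_in_group <- ave(seq_len(nrow(results)), results[[group_col]],
                       FUN = seq_along)
  results[rank_in_group <= max_per_group, , drop = FALSE]
}

toy_diverse <- cap_per_group(toy_results)
toy_diverse$c_term
```

Capping per group trades a few high-scoring but redundant rows for broader coverage of intermediate mechanisms.
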
### 12. Visualize the results
```{r visualization}
# Create heatmap visualization
plot_heatmap(
diverse_results,
output_file = "migraine_heatmap.png",
width = 1200,
height = 900,
top_n = 15,
min_score = 0.0001,
color_palette = "blues",
show_entity_types = TRUE
)
# Create network visualization
plot_network(
diverse_results,
output_file = "migraine_network.png",
width = 1200,
height = 900,
top_n = 15,
min_score = 0.0001,
node_size_factor = 5,
color_by = "type",
title = "Migraine Treatment Network",
show_entity_types = TRUE,
label_size = 1.0
)
```
### 13. Create interactive visualizations
```{r interactive-visualizations}
# Create interactive HTML network visualization
export_network(
diverse_results,
output_file = "migraine_network.html",
top_n = min(30, nrow(diverse_results)),
min_score = 0.0001,
open = FALSE # Don't automatically open in browser
)
# Create interactive chord diagram
export_chord(
diverse_results,
output_file = "migraine_chord.html",
top_n = min(30, nrow(diverse_results)),
min_score = 0.0001,
open = FALSE
)
```
### 14. Evaluate literature support and generate report
```{r evaluation-report, message=FALSE}
# Evaluate literature support for top connections
evaluation <- eval_evidence(
diverse_results,
max_results = 5,
base_term = primary_term, # Reuse the primary term defined in step 2
max_articles = 5
)
# Prepare articles for report generation
articles_with_years <- prep_articles(all_articles)
# Store results for report
results_list <- list(abc = diverse_results)
# Store visualization paths
visualizations <- list(
heatmap = "migraine_heatmap.png",
network = "migraine_network.html",
chord = "migraine_chord.html"
)
# Create comprehensive report
gen_report(
results_list = results_list,
visualizations = visualizations,
articles = articles_with_years,
output_file = "migraine_discoveries.html"
)
cat("\nDiscovery analysis complete!\n")
```
## Interactive Visualizations and Report
The workflow above writes two interactive visualizations (`migraine_network.html` and `migraine_chord.html`) and a combined report (`migraine_discoveries.html`). The report is embedded below.
```{r preprocess-report, echo=FALSE, results='hide', eval=TRUE}
if(file.exists("migraine_discoveries.html")) {
# Read the HTML file
report_html <- readLines("migraine_discoveries.html", warn = FALSE)
if(file.exists("migraine_heatmap.png")) {
# Encode the image to base64
base64_img <- base64enc::base64encode("migraine_heatmap.png")
report_html <- gsub(
"
",
paste0("
"),
report_html,
fixed = TRUE
)
}
# Add JavaScript to handle link clicks directly in the report
js_code <- '
'
# Insert the JavaScript before the closing body tag
report_html <- gsub('