This vignette demonstrates how to use the searchAnalyzeR
package to conduct a comprehensive analysis of systematic review search
strategies using real PubMed data. We’ll walk through a complete
workflow for analyzing search performance, from executing searches to
generating publication-ready reports.
The searchAnalyzeR package provides tools for:
For this demonstration, we’ll analyze a search strategy designed to identify literature on the long-term effects of COVID-19, commonly known as “Long COVID.” This topic represents a rapidly evolving area of research that presents typical challenges faced in systematic reviews.
First, let’s load the required packages for our analysis:
# Load required packages
library(searchAnalyzeR)
library(rentrez) # For PubMed API access
library(xml2) # For XML parsing
library(dplyr)
library(ggplot2)
library(lubridate)
cat("=== searchAnalyzeR: Real PubMed Search Example ===\n")
#> === searchAnalyzeR: Real PubMed Search Example ===
cat("Topic: Long-term effects of COVID-19 (Long COVID)\n")
#> Topic: Long-term effects of COVID-19 (Long COVID)
cat("Objective: Demonstrate search strategy analysis with real data\n\n")
#> Objective: Demonstrate search strategy analysis with real dataA well-defined search strategy is crucial for systematic reviews. Here we define our search parameters including terms, databases, date ranges, and filters:
# Define our search strategy
search_strategy <- list(
terms = c(
"long covid",
"post-covid syndrome",
"covid-19 sequelae",
"post-acute covid-19",
"persistent covid symptoms"
),
databases = c("PubMed"),
date_range = as.Date(c("2020-01-01", "2024-12-31")),
filters = list(
language = "English",
article_types = c("Journal Article", "Review", "Clinical Trial")
),
search_date = Sys.time()
)
cat("Search Strategy:\n")
#> Search Strategy:
cat("Terms:", paste(search_strategy$terms, collapse = " OR "), "\n")
#> Terms: long covid OR post-covid syndrome OR covid-19 sequelae OR post-acute covid-19 OR persistent covid symptoms
cat("Date range:", paste(search_strategy$date_range, collapse = " to "), "\n\n")
#> Date range: 2020-01-01 to 2024-12-31The search strategy includes:
The searchAnalyzeR package provides convenient functions
to search PubMed and retrieve article metadata:
# Execute the search using the package function
cat("Searching PubMed for real articles...\n")
#> Searching PubMed for real articles...
raw_results <- search_pubmed(
search_terms = search_strategy$terms,
max_results = 150,
date_range = search_strategy$date_range,
language = "English"
)
#> PubMed Query: ( "long covid"[Title/Abstract] OR "post-covid syndrome"[Title/Abstract] OR "covid-19 sequelae"[Title/Abstract] OR "post-acute covid-19"[Title/Abstract] OR "persistent covid symptoms"[Title/Abstract] ) AND ("2020/01/01"[Date - Publication] : "2024/12/31"[Date - Publication]) AND English [Language]
#> Found 150 articles
#> Retrieving batch 1 of 3
#> Retrieving batch 2 of 3
#> Retrieving batch 3 of 3
cat("\nRaw search completed. Retrieved", nrow(raw_results), "articles.\n")
#>
#> Raw search completed. Retrieved 150 articles.Raw search results from different databases often have varying
formats. The std_search_results() function standardizes the
data structure:
# Standardize the results using searchAnalyzeR functions
cat("\nStandardizing search results...\n")
#>
#> Standardizing search results...
standardized_results <- std_search_results(raw_results, source_format = "pubmed")This standardization ensures that:
Duplicate detection is critical in systematic reviews, especially when searching multiple databases. The package provides sophisticated algorithms for identifying duplicates:
# Detect and remove duplicates
cat("Detecting duplicates...\n")
#> Detecting duplicates...
dedup_results <- detect_dupes(standardized_results, method = "exact")
cat("Duplicate detection complete:\n")
#> Duplicate detection complete:
cat("- Total articles:", nrow(dedup_results), "\n")
#> - Total articles: 150
cat("- Unique articles:", sum(!dedup_results$duplicate), "\n")
#> - Unique articles: 150
cat("- Duplicates found:", sum(dedup_results$duplicate), "\n\n")
#> - Duplicates found: 0The detect_dupes() function offers several methods:
Basic statistics help assess the overall quality of the search results:
# Calculate search statistics
search_stats <- calc_search_stats(dedup_results)
cat("Search Statistics:\n")
#> Search Statistics:
cat("- Date range:", paste(search_stats$date_range, collapse = " to "), "\n")
#> - Date range: 2024-01-01 to 2026-01-01
cat("- Missing abstracts:", search_stats$missing_abstracts, "\n")
#> - Missing abstracts: 0
cat("- Missing dates:", search_stats$missing_dates, "\n\n")
#> - Missing dates: 0For performance evaluation, we need a “gold standard” of known relevant articles. In a real systematic review, this would be your manually identified relevant articles. For this demonstration, we create a simplified gold standard:
# Create a gold standard for demonstration
# In a real systematic review, this would be your known relevant articles
# For this example, we'll identify articles that contain key terms in titles
cat("Creating demonstration gold standard...\n")
#> Creating demonstration gold standard...
long_covid_terms <- c("long covid", "post-covid", "post-acute covid", "persistent covid", "covid sequelae")
pattern <- paste(long_covid_terms, collapse = "|")
gold_standard_ids <- dedup_results %>%
filter(!duplicate) %>%
filter(grepl(pattern, tolower(title))) %>%
pull(id)
cat("Gold standard created with", length(gold_standard_ids), "highly relevant articles\n\n")
#> Gold standard created with 83 highly relevant articlesNote: In practice, your gold standard would be created through:
The SearchAnalyzer class provides comprehensive tools
for evaluating search performance:
The analyzer calculates a comprehensive set of performance metrics:
# Calculate comprehensive metrics
cat("Calculating performance metrics...\n")
#> Calculating performance metrics...
metrics <- analyzer$calculate_metrics()
# Display key metrics
cat("\n=== SEARCH PERFORMANCE METRICS ===\n")
#>
#> === SEARCH PERFORMANCE METRICS ===
if (!is.null(metrics$precision_recall$precision)) {
cat("Precision:", round(metrics$precision_recall$precision, 3), "\n")
cat("Recall:", round(metrics$precision_recall$recall, 3), "\n")
cat("F1 Score:", round(metrics$precision_recall$f1_score, 3), "\n")
cat("Number Needed to Read:", round(metrics$precision_recall$number_needed_to_read, 1), "\n")
}
#> Precision: 0.553
#> Recall: 1
#> F1 Score: 0.712
#> Number Needed to Read: 1.8
cat("\n=== BASIC METRICS ===\n")
#>
#> === BASIC METRICS ===
cat("Total Records:", metrics$basic$total_records, "\n")
#> Total Records: 150
cat("Unique Records:", metrics$basic$unique_records, "\n")
#> Unique Records: 150
cat("Duplicates:", metrics$basic$duplicates, "\n")
#> Duplicates: 0
cat("Sources:", metrics$basic$sources, "\n")
#> Sources: 97The package generates publication-ready visualizations to assess search performance:
# Generate visualizations
cat("\nGenerating visualizations...\n")
#>
#> Generating visualizations...
# Overview plot
overview_plot <- analyzer$visualize_performance("overview")
print(overview_plot)
# Temporal distribution plot
temporal_plot <- analyzer$visualize_performance("temporal")
print(temporal_plot)
# Precision-recall curve (if gold standard available)
if (length(gold_standard_ids) > 0) {
pr_plot <- analyzer$visualize_performance("precision_recall")
print(pr_plot)
}The package can generate data for PRISMA flow diagrams, essential for systematic review reporting:
# Generate PRISMA flow diagram data
cat("\nCreating PRISMA flow data...\n")
#>
#> Creating PRISMA flow data...
screening_data <- data.frame(
id = dedup_results$id[!dedup_results$duplicate],
identified = TRUE,
duplicate = FALSE,
title_abstract_screened = TRUE,
full_text_eligible = runif(sum(!dedup_results$duplicate)) > 0.7, # Simulate screening
included = runif(sum(!dedup_results$duplicate)) > 0.85, # Simulate final inclusion
excluded_title_abstract = runif(sum(!dedup_results$duplicate)) > 0.3,
excluded_full_text = runif(sum(!dedup_results$duplicate)) > 0.15
)# Generate PRISMA diagram
reporter <- PRISMAReporter$new()
prisma_plot <- reporter$generate_prisma_diagram(screening_data)
print(prisma_plot)The package supports exporting results in various formats commonly used in systematic reviews:
# Export results in multiple formats
cat("\nExporting results...\n")
#>
#> Exporting results...
output_dir <- tempdir()
export_files <- export_results(
search_results = filter(dedup_results, !duplicate),
file_path = file.path(output_dir, "covid_long_term_search"),
formats = c("csv", "xlsx", "ris"),
include_metadata = TRUE
)
cat("Files exported:\n")
#> Files exported:
for (file in export_files) {
cat("-", file, "\n")
}
#> - /tmp/RtmpMm5rJm/covid_long_term_search.csv
#> - /tmp/RtmpMm5rJm/covid_long_term_search.xlsx
#> - /tmp/RtmpMm5rJm/covid_long_term_search.risPerformance metrics can also be exported for further analysis or reporting:
For reproducibility, create a complete data package containing all analysis components:
# Create a complete data package
cat("\nCreating comprehensive data package...\n")
#>
#> Creating comprehensive data package...
package_dir <- create_data_package(
search_results = filter(dedup_results, !duplicate),
analysis_results = list(
metrics = metrics,
search_strategy = search_strategy,
screening_data = screening_data
),
output_dir = output_dir,
package_name = "covid_long_term_systematic_review"
)
cat("Data package created at:", package_dir, "\n")
#> Data package created at: /tmp/RtmpMm5rJm/covid_long_term_systematic_reviewThe package includes tools for validating search strategies against established benchmarks:
# Demonstrate benchmark validation (simplified)
cat("\nDemonstrating benchmark validation...\n")
#>
#> Demonstrating benchmark validation...
validator <- BenchmarkValidator$new()
# Add our search as a custom benchmark
validator$add_benchmark(
name = "covid_long_term",
corpus = filter(dedup_results, !duplicate),
relevant_ids = gold_standard_ids
)
# Validate the strategy
validation_results <- validator$validate_strategy(
search_strategy = search_strategy,
benchmark_name = "covid_long_term"
)
cat("Validation Results:\n")
#> Validation Results:
cat("- Precision:", round(validation_results$precision, 3), "\n")
#> - Precision: 0.686
cat("- Recall:", round(validation_results$recall, 3), "\n")
#> - Recall: 0.843
cat("- F1 Score:", round(validation_results$f1_score, 3), "\n")
#> - F1 Score: 0.757Analyze how well the retrieved abstracts match the search terms:
# Text similarity analysis on abstracts
cat("\nAnalyzing abstract similarity to search terms...\n")
#>
#> Analyzing abstract similarity to search terms...
search_term_text <- paste(search_strategy$terms, collapse = " ")
similarity_scores <- sapply(dedup_results$abstract[!dedup_results$duplicate], function(abstract) {
if (is.na(abstract) || abstract == "") return(0)
calc_text_sim(search_term_text, abstract, method = "jaccard")
})
cat("Average abstract similarity to search terms:", round(mean(similarity_scores, na.rm = TRUE), 3), "\n")
#> Average abstract similarity to search terms: 0.013
cat("Abstracts with high similarity (>0.1):", sum(similarity_scores > 0.1, na.rm = TRUE), "\n")
#> Abstracts with high similarity (>0.1): 0Evaluate which search terms are most effective:
# Analyze term effectiveness
cat("\nAnalyzing individual term effectiveness...\n")
#>
#> Analyzing individual term effectiveness...
term_analysis <- term_effectiveness(
terms = search_strategy$terms,
search_results = filter(dedup_results, !duplicate),
gold_standard = gold_standard_ids
)
print(term_analysis)
#> Term Effectiveness Analysis
#> ==========================
#> Search Results: 150 articles
#> Gold Standard: 83 relevant articles
#> Fields Analyzed: title, abstract
#>
#> term articles_with_term relevant_with_term precision
#> long covid 87 60 0.690
#> post-covid syndrome 4 4 1.000
#> covid-19 sequelae 6 1 0.167
#> post-acute covid-19 15 11 0.733
#> persistent covid symptoms 0 0 0.000
#> coverage
#> 0.723
#> 0.048
#> 0.012
#> 0.133
#> 0.000# Calculate term effectiveness scores
term_scores <- calc_tes(term_analysis)
cat("\nTerm Effectiveness Scores (TES):\n")
#>
#> Term Effectiveness Scores (TES):
print(term_scores[order(term_scores$tes, decreasing = TRUE), ])
#> Term Effectiveness Analysis
#> ==========================
#> Search Results: 150 articles
#> Gold Standard: 83 relevant articles
#> Fields Analyzed: title, abstract
#>
#> term articles_with_term relevant_with_term precision
#> long covid 87 60 0.690
#> post-acute covid-19 15 11 0.733
#> post-covid syndrome 4 4 1.000
#> covid-19 sequelae 6 1 0.167
#> persistent covid symptoms 0 0 0.000
#> coverage tes
#> 0.723 0.70588235
#> 0.133 0.22448980
#> 0.048 0.09195402
#> 0.012 0.02247191
#> 0.000 0.00000000# Find top performing terms
top_terms <- find_top_terms(term_analysis, n = 3, plot = TRUE, plot_type = "precision_coverage")
cat("\nTop 3 performing terms:", paste(top_terms$terms, collapse = ", "), "\n")
#>
#> Top 3 performing terms: long covid, post-acute covid-19, post-covid syndrome
if (!is.null(top_terms$plot)) {
print(top_terms$plot)
}Based on the calculated metrics, the package can provide automated recommendations:
# Final summary and recommendations
cat("\n=== FINAL SUMMARY AND RECOMMENDATIONS ===\n")
#>
#> === FINAL SUMMARY AND RECOMMENDATIONS ===
cat("Search Topic: Long-term effects of COVID-19\n")
#> Search Topic: Long-term effects of COVID-19
cat("Articles Retrieved:", sum(!dedup_results$duplicate), "\n")
#> Articles Retrieved: 150
cat("Search Date Range:", paste(search_strategy$date_range, collapse = " to "), "\n")
#> Search Date Range: 2020-01-01 to 2024-12-31
if (!is.null(metrics$precision_recall$precision)) {
cat("Search Precision:", round(metrics$precision_recall$precision, 3), "\n")
if (metrics$precision_recall$precision < 0.1) {
cat("RECOMMENDATION: Low precision suggests search may be too broad. Consider:\n")
cat("- Adding more specific terms\n")
cat("- Using MeSH terms\n")
cat("- Adding study type filters\n")
} else if (metrics$precision_recall$precision > 0.5) {
cat("RECOMMENDATION: High precision suggests good specificity. Consider:\n")
cat("- Broadening search if recall needs improvement\n")
cat("- Adding synonyms or related terms\n")
}
}
#> Search Precision: 0.553
#> RECOMMENDATION: High precision suggests good specificity. Consider:
#> - Broadening search if recall needs improvement
#> - Adding synonyms or related termsLet’s examine some of the retrieved articles to understand the search results:
# Show some example retrieved articles
cat("\n=== SAMPLE RETRIEVED ARTICLES ===\n")
#>
#> === SAMPLE RETRIEVED ARTICLES ===
sample_articles <- filter(dedup_results, !duplicate) %>%
arrange(desc(date)) %>%
head(3)
for (i in 1:nrow(sample_articles)) {
article <- sample_articles[i, ]
cat("\n", i, ". ", article$title, "\n", sep = "")
cat(" Journal:", article$source, "\n")
cat(" Date:", as.character(article$date), "\n")
cat(" PMID:", gsub("PMID:", "", article$id), "\n")
cat(" Abstract:", substr(article$abstract, 1, 200), "...\n")
}
#>
#> 1. A qualitative analysis of lived experiences, mental health treatment needs, and psychotherapeutic applications among veterans with long COVID.
#> Journal: PubMed: Psychological services
#> Date: 2026-01-01
#> PMID: 40244965
#> Abstract: Long COVID remains a pressing health concern among Americans, with current data suggesting that 45% of those infected by COVID-19 experience at least one symptom of Long COVID. Veterans are a particul ...
#>
#> 2. Combined oral supplementation with magnesium plus vitamin D alleviates mild to moderate depressive symptoms related to long-COVID: an open-label randomized, controlled clinical trial.
#> Journal: PubMed: Magnesium research
#> Date: 2026-01-01
#> PMID: 39846976
#> Abstract: Individuals with long-COVID exhibit a higher frequency of hypomagnesemia, vitamin D deficiency, and depression. Objective. To evaluate the efficacy and safety of oral supplementation with magnesium ch ...
#>
#> 3. Quantifying the Degree of Fatigue in People Reporting Symptoms of Post-COVID-19 Syndrome: Results from a Rasch Analysis.
#> Journal: PubMed: Physiotherapy Canada. Physiotherapie Canada
#> Date: 2025-05-01
#> PMID: 42078679
#> Abstract: Purpose: Fatigue is a defining feature of post-COVID-19 syndrome (PCS), yet there is no accepted measure of this life-altering consequence. The aim here was to create a measure fit for the purposes of ...This vignette demonstrated a complete workflow using the
searchAnalyzeR package:
cat("\n=== ANALYSIS COMPLETE ===\n")
#>
#> === ANALYSIS COMPLETE ===
cat("This example demonstrated:\n")
#> This example demonstrated:
cat("1. Real PubMed search execution using search_pubmed()\n")
#> 1. Real PubMed search execution using search_pubmed()
cat("2. Data standardization and deduplication\n")
#> 2. Data standardization and deduplication
cat("3. Performance metric calculation\n")
#> 3. Performance metric calculation
cat("4. Visualization generation\n")
#> 4. Visualization generation
cat("5. Multi-format export capabilities\n")
#> 5. Multi-format export capabilities
cat("6. PRISMA diagram creation\n")
#> 6. PRISMA diagram creation
cat("7. Benchmark validation\n")
#> 7. Benchmark validation
cat("8. Term effectiveness analysis\n")
#> 8. Term effectiveness analysis
cat("9. Comprehensive reporting\n")
#> 9. Comprehensive reporting
cat("\nAll outputs saved to:", output_dir, "\n")
#>
#> All outputs saved to: /tmp/RtmpMm5rJmThe analysis generates numerous output files for different purposes:
# Clean up and provide final file locations
list.files(output_dir, pattern = "covid", full.names = TRUE, recursive = TRUE)
#> [1] "/tmp/RtmpMm5rJm/covid_long_term_search.csv"
#> [2] "/tmp/RtmpMm5rJm/covid_long_term_search.ris"
#> [3] "/tmp/RtmpMm5rJm/covid_long_term_search.xlsx"
#> [4] "/tmp/RtmpMm5rJm/strategy_A_broad_covid.csv"
#> [5] "/tmp/RtmpMm5rJm/strategy_A_broad_covid.xlsx"
#> [6] "/tmp/RtmpMm5rJm/strategy_B_targeted_covid.csv"
#> [7] "/tmp/RtmpMm5rJm/strategy_B_targeted_covid.xlsx"After completing this analysis, typical next steps would include:
For more advanced features and customization options, consult:
help(package = "searchAnalyzeR")?search_pubmed,
?SearchAnalyzer, etc.This comprehensive workflow demonstrates how
searchAnalyzeR can streamline and enhance the systematic
review search process, providing objective metrics and visualizations to
support evidence-based search strategy development and optimization.