| Title: | Advanced Analytics and Testing Framework for Systematic Review Search Strategies |
|---|---|
| Description: | Provides comprehensive analytics, reporting, and testing capabilities for systematic review search strategies. The package focuses on validating search performance, generating standardized 'PRISMA'-compliant reports, and ensuring reproducibility in evidence synthesis. Features include precision-recall analysis, cross-database performance comparison, benchmark validation against gold standards, sensitivity analysis, temporal coverage assessment, automated report generation, and statistical comparison of search strategies. Supports multiple export formats including 'CSV', 'Excel', 'RIS', 'BibTeX', and 'EndNote'. Includes tools for duplicate detection, search strategy optimization, cross-validation frameworks, meta-analysis of benchmark results, power analysis for study design, and reproducibility package creation. Optionally connects to 'PubMed' for direct database searching and real-time strategy comparison using the 'E-utilities' 'API'. Enhanced with bootstrap comparison methods, 'McNemar' test for strategy evaluation, and comprehensive visualization tools for performance assessment. Methods based on Manning et al. (2008) for information retrieval metrics, Moher et al. (2009) for 'PRISMA' guidelines, and Sampson et al. (2006) for systematic review search methodology. |
| Authors: | Chao Liu [aut, cre] (ORCID: <https://orcid.org/0000-0002-9979-8272>) |
| Maintainer: | Chao Liu <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.0 |
| Built: | 2026-06-02 09:18:58 UTC |
| Source: | https://github.com/chaoliu-cl/searchanalyzer |
Creates a temporary environment for analysis that isolates objects from the global environment. This helps prevent memory leaks and allows for easy cleanup after analysis.
analysis_env(parent_env = parent.frame(), cleanup = TRUE)
parent_env |
Environment to use as parent (default: parent.frame()) |
cleanup |
Logical, whether to automatically clean up on exit |
New environment for analysis
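The isolation pattern described here can be sketched in base R. This is an illustration only, not the package's implementation: `with_scratch_env` is a hypothetical helper that evaluates an expression in a fresh environment and removes its objects on exit.

```r
# Hypothetical helper illustrating the pattern; not part of the package.
with_scratch_env <- function(expr, parent_env = parent.frame()) {
  env <- new.env(parent = parent_env)
  # Remove every object created during the analysis once evaluation ends
  on.exit(rm(list = ls(env, all.names = TRUE), envir = env), add = TRUE)
  eval(substitute(expr), envir = env)
}

total <- with_scratch_env({
  big_vector <- seq_len(1000)   # exists only inside the scratch environment
  sum(big_vector)
})
print(total)
print(exists("big_vector"))     # the intermediate object did not leak out
```

Only the value of the expression is returned to the caller; intermediate objects never reach the calling environment.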
Auto-detect Column Mappings
auto_detect_columns(results)
results |
Data frame to analyze |
Named vector of column mappings
A comprehensive validation framework for testing search strategies against established benchmark datasets across multiple domains.
The BenchmarkValidator class provides tools for:
Cross-domain validation across medical, environmental, social science domains
Sensitivity analysis for search parameters
Statistical comparison of strategy performance
Reproducible benchmark testing
benchmarks: List of benchmark datasets with known relevant articles
new(): Initialize a new BenchmarkValidator instance
validate_strategy(search_strategy, benchmark_name): Validate against a specific benchmark
cross_domain_validation(search_strategy): Test across multiple domains
sensitivity_analysis(base_strategy, parameter_ranges): Parameter sensitivity testing
benchmarks: List of benchmark datasets
new()
Creates a new BenchmarkValidator instance and loads benchmark datasets.
This method is called automatically when creating a new validator with
BenchmarkValidator$new().
BenchmarkValidator$new()
No return value, called for side effects (loading benchmarks)
Add a custom benchmark dataset
add_benchmark()
BenchmarkValidator$add_benchmark(name, corpus, relevant_ids)
name: Name of the benchmark
corpus: Data frame with article corpus
relevant_ids: Vector of relevant article IDs
No return value, called for side effects
Validate search strategy against benchmarks
validate_strategy()
BenchmarkValidator$validate_strategy(search_strategy, benchmark_name = "all")
search_strategy: Search strategy object
benchmark_name: Name of benchmark dataset
Validation results
Validate against single benchmark (public method)
validate_single_benchmark()
BenchmarkValidator$validate_single_benchmark(search_strategy, benchmark_name)
search_strategy: Search strategy object
benchmark_name: Name of benchmark dataset
Validation results
Cross-domain validation
cross_domain_validation()
BenchmarkValidator$cross_domain_validation(search_strategy)
search_strategy: Search strategy object
Cross-domain validation results
Sensitivity analysis for search parameters
sensitivity_analysis()
BenchmarkValidator$sensitivity_analysis(base_strategy, parameter_ranges)
base_strategy: Base search strategy
parameter_ranges: List of parameter ranges to test
Sensitivity analysis results
clone()
The objects of this class are cloneable with this method.
BenchmarkValidator$clone(deep = FALSE)
deep: Whether to make a deep clone.
    # Create validator
    validator <- BenchmarkValidator$new()

    # Check available benchmarks
    print(names(validator$benchmarks))

    # Define search strategy
    strategy <- list(
      terms = c("systematic review", "meta-analysis"),
      databases = c("PubMed", "Embase")
    )

    # Create sample data for validation
    sample_data <- data.frame(
      id = paste0("art", 1:20),
      title = paste("Article", 1:20),
      abstract = paste("Abstract", 1:20),
      source = "Journal",
      date = Sys.Date()
    )

    # Add custom benchmark
    validator$add_benchmark("custom", sample_data, paste0("art", 1:5))

    # Validate against custom benchmark
    results <- validator$validate_strategy(strategy, "custom")
Bootstrap Comparison of Search Strategies
bootstrap_compare(strategy1_results, strategy2_results, gold_standard, n_bootstrap = 1000)
strategy1_results |
Results from first strategy |
strategy2_results |
Results from second strategy |
gold_standard |
Vector of relevant article IDs |
n_bootstrap |
Number of bootstrap samples |
Bootstrap comparison results
Manages a cache of search results to avoid redundant database queries while keeping memory usage under control.
cache_manage(operation, key = NULL, value = NULL, max_size = 500, max_items = 50)
operation |
Operation to perform ("add", "get", "clear", "status") |
key |
Cache key (usually search query) |
value |
Value to cache (for "add" operation) |
max_size |
Maximum cache size in MB (default: 500) |
max_items |
Maximum number of items to cache (default: 50) |
Varies by operation
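As a rough illustration of the four operations, the sketch below builds a small in-memory cache with first-in, first-out eviction once `max_items` is exceeded. `make_cache` is a hypothetical helper, the eviction policy is an assumption, and the size limit in MB is only reported, not enforced; the package's own behavior may differ.

```r
# Hypothetical cache illustrating "add", "get", "clear" and "status".
make_cache <- function(max_items = 50) {
  store <- list()  # named list; insertion order doubles as age
  function(operation, key = NULL, value = NULL) {
    switch(operation,
      add = {
        store[[key]] <<- value
        # Assumed policy: drop the oldest entry when the limit is exceeded
        if (length(store) > max_items) store <<- store[-1]
        invisible(TRUE)
      },
      get = store[[key]],  # NULL when the key is not cached
      clear = {
        store <<- list()
        invisible(TRUE)
      },
      status = list(
        items = length(store),
        size_bytes = as.numeric(object.size(store))
      ),
      stop("Unknown operation: ", operation)
    )
  }
}

cache <- make_cache(max_items = 2)
cache("add", "query one", 1:3)
cache("add", "query two", 4:6)
cache("add", "query three", 7:9)  # evicts "query one"
print(cache("status")$items)
```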
Calculate Confidence Intervals
calc_ci(x, conf_level = 0.95, method = "normal")
x |
Numeric vector |
conf_level |
Confidence level (0-1) |
method |
Method for calculation ("normal", "bootstrap") |
List with lower and upper bounds
Calculate Cosine Similarity
calc_cosine(text1, text2)
text1 |
First text string |
text2 |
Second text string |
Cosine similarity score
Calculate Coverage Metrics Across Databases
calc_coverage(results_by_database, gold_standard)
results_by_database |
List of result sets by database |
gold_standard |
Vector of relevant article IDs |
Calculates coverage metrics for each database and overall:
coverage_count: Number of relevant articles found by each database
coverage_rate: Proportion of relevant articles found by each database
unique_coverage: Number of relevant articles found only by this database
total_coverage: Overall proportion of relevant articles found by all databases
redundancy_rate: Proportion of duplicate results across databases
List containing coverage statistics
    # Create sample data
    results_db1 <- c("art1", "art2", "art3", "art4")
    results_db2 <- c("art2", "art3", "art5", "art6")
    results_by_db <- list("Database1" = results_db1, "Database2" = results_db2)
    gold_standard <- c("art1", "art3", "art5", "art7", "art8")

    coverage <- calc_coverage(results_by_db, gold_standard)
    print(coverage$total_coverage)
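To make the listed metrics concrete, here is a base-R sketch that computes them with set operations. `coverage_sketch` is a hypothetical function, not the package's code, and the `redundancy_rate` definition (share of retrieved records that repeat an earlier record) is an assumption.

```r
# Hypothetical re-implementation for illustration only.
coverage_sketch <- function(results_by_database, gold_standard) {
  # Relevant articles found by each database
  found <- lapply(results_by_database, intersect, gold_standard)
  coverage_count <- vapply(found, length, integer(1))
  # Relevant articles found by one database and no other
  unique_coverage <- vapply(names(found), function(db) {
    others <- unlist(found[names(found) != db], use.names = FALSE)
    length(setdiff(found[[db]], others))
  }, integer(1))
  all_results <- unlist(results_by_database, use.names = FALSE)
  list(
    coverage_count = coverage_count,
    coverage_rate = coverage_count / length(gold_standard),
    unique_coverage = unique_coverage,
    total_coverage = length(unique(unlist(found))) / length(gold_standard),
    # Assumed definition of redundancy
    redundancy_rate = sum(duplicated(all_results)) / length(all_results)
  )
}

by_db <- list(
  Database1 = c("art1", "art2", "art3", "art4"),
  Database2 = c("art2", "art3", "art5", "art6")
)
gold <- c("art1", "art3", "art5", "art7", "art8")
cov <- coverage_sketch(by_db, gold)
print(cov$total_coverage)  # 3 of 5 relevant articles found: 0.6
```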
Calculate Search Efficiency Metrics
calc_efficiency(search_time, results_count, relevant_count)
search_time |
Time taken to execute search (in seconds) |
results_count |
Number of results retrieved |
relevant_count |
Number of relevant results |
Calculates various efficiency metrics for search performance:
time_per_result: Average time to retrieve each result
time_per_relevant: Average time to retrieve each relevant result
relevant_ratio: Proportion of results that are relevant
efficiency_score: Overall efficiency combining time and relevance
List containing efficiency metrics
    efficiency <- calc_efficiency(search_time = 30, results_count = 100, relevant_count = 15)
    print(paste("Efficiency score:", round(efficiency$efficiency_score, 4)))
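The three ratio metrics follow directly from their descriptions and can be sketched as below. `efficiency_sketch` is a hypothetical function; `efficiency_score` is deliberately left out because its formula is not stated here.

```r
# Hypothetical helper; covers only the metrics whose definitions are stated.
efficiency_sketch <- function(search_time, results_count, relevant_count) {
  list(
    time_per_result = search_time / results_count,     # seconds per result
    time_per_relevant = search_time / relevant_count,  # seconds per relevant result
    relevant_ratio = relevant_count / results_count    # share of results that are relevant
  )
}

eff <- efficiency_sketch(search_time = 30, results_count = 100, relevant_count = 15)
print(eff$time_per_result)    # 0.3
print(eff$time_per_relevant)  # 2
print(eff$relevant_ratio)     # 0.15
```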
Calculate Jaccard Similarity
calc_jaccard(text1, text2)
text1 |
First text string |
text2 |
Second text string |
Jaccard similarity score
Calculate Precision and Recall Metrics
calc_precision_recall(retrieved, relevant, total_relevant = NULL)
retrieved |
Vector of retrieved article IDs |
relevant |
Vector of relevant article IDs (gold standard) |
total_relevant |
Total number of relevant articles in corpus |
Calculates standard information retrieval metrics:
Precision: TP/(TP+FP) - proportion of retrieved articles that are relevant
Recall: TP/(TP+FN) - proportion of relevant articles that were retrieved
F1 Score: Harmonic mean of precision and recall
Number Needed to Read: 1/precision - articles needed to read to find one relevant
where TP = True Positives, FP = False Positives, FN = False Negatives
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval.
    retrieved_ids <- c("art1", "art2", "art3", "art4", "art5")
    relevant_ids <- c("art1", "art3", "art6", "art7")

    metrics <- calc_precision_recall(retrieved_ids, relevant_ids)
    print(paste("Precision:", round(metrics$precision, 3)))
    print(paste("Recall:", round(metrics$recall, 3)))
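The formulas listed for this function can be worked through by hand with set operations. The sketch below is a hypothetical base-R version (`pr_sketch`), not the package's code; it ignores the `total_relevant` argument and returns `NA` where a ratio is undefined, which is an assumption about edge-case handling.

```r
# Hypothetical helper implementing the stated formulas.
pr_sketch <- function(retrieved, relevant) {
  retrieved <- unique(retrieved)
  relevant <- unique(relevant)
  tp <- length(intersect(retrieved, relevant))  # relevant and retrieved
  fp <- length(setdiff(retrieved, relevant))    # retrieved but not relevant
  fn <- length(setdiff(relevant, retrieved))    # relevant but missed
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (isTRUE(precision + recall > 0)) {
    2 * precision * recall / (precision + recall)
  } else {
    NA_real_
  }
  nnr <- if (isTRUE(precision > 0)) 1 / precision else NA_real_
  list(precision = precision, recall = recall, f1_score = f1,
       number_needed_to_read = nnr)
}

m <- pr_sketch(
  retrieved = c("art1", "art2", "art3", "art4", "art5"),
  relevant = c("art1", "art3", "art6", "art7")
)
print(m$precision)              # TP = 2, FP = 3: 0.4
print(m$recall)                 # TP = 2, FN = 2: 0.5
print(m$number_needed_to_read)  # 1 / 0.4 = 2.5
```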
Power Analysis for Search Strategy Evaluation
calc_sample_size(effect_size = 0.1, alpha = 0.05, power = 0.8, baseline_f1 = 0.7)
effect_size |
Expected effect size (difference in F1 scores) |
alpha |
Significance level |
power |
Desired statistical power |
baseline_f1 |
Baseline F1 score |
Required sample size
Calculate Search Result Statistics
calc_search_stats(search_results)
search_results |
Data frame with search results |
List of summary statistics
Calculate Strategy Comparison Metrics
calc_strategy_comparison(strategy1_results, strategy2_results, gold_standard)
strategy1_results |
Vector of article IDs from strategy 1 |
strategy2_results |
Vector of article IDs from strategy 2 |
gold_standard |
Vector of relevant article IDs |
Compares two search strategies across multiple dimensions:
overlap_analysis: Articles found by both, one, or neither strategy
performance_comparison: Precision, recall, F1 for each strategy
complementarity: How well strategies complement each other
efficiency_comparison: Relative efficiency metrics
List containing comparison metrics
    strategy1 <- c("art1", "art2", "art3", "art4", "art5")
    strategy2 <- c("art3", "art4", "art5", "art6", "art7")
    gold_standard <- c("art1", "art3", "art5", "art8", "art9")

    comparison <- calc_strategy_comparison(strategy1, strategy2, gold_standard)
    print(comparison$overlap_analysis)
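The overlap and complementarity ideas reduce to a few set operations. The sketch below uses a hypothetical function and hypothetical element names (`both`, `only_strategy1`, and so on); the structure of the package's actual `overlap_analysis` element is not documented here and may differ.

```r
# Hypothetical helper; element names are illustrative assumptions.
overlap_sketch <- function(s1, s2, gold) {
  list(
    both = intersect(s1, s2),
    only_strategy1 = setdiff(s1, s2),
    only_strategy2 = setdiff(s2, s1),
    relevant_missed_by_both = setdiff(gold, union(s1, s2)),
    # Recall achieved when the two strategies are combined
    combined_recall = length(intersect(union(s1, s2), gold)) / length(gold)
  )
}

ov <- overlap_sketch(
  s1 = c("art1", "art2", "art3", "art4", "art5"),
  s2 = c("art3", "art4", "art5", "art6", "art7"),
  gold = c("art1", "art3", "art5", "art8", "art9")
)
print(ov$both)             # "art3" "art4" "art5"
print(ov$combined_recall)  # 0.6
```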
Calculate Temporal Coverage Metrics
calc_temporal_coverage(search_results, target_date_range = NULL)
search_results |
Data frame with search results including date column |
target_date_range |
Vector of two dates defining the target time period |
Analyzes the temporal distribution of search results:
coverage_by_year: Number of articles by publication year
target_period_coverage: Proportion of results in target date range
temporal_gaps: Years with no results in the target period
peak_years: Years with highest number of results
List containing temporal coverage statistics
    # Create sample data
    search_results <- data.frame(
      id = paste0("art", 1:20),
      date = seq(as.Date("2010-01-01"), as.Date("2023-12-31"), length.out = 20)
    )
    target_range <- c(as.Date("2015-01-01"), as.Date("2020-12-31"))

    temporal_metrics <- calc_temporal_coverage(search_results, target_range)
    print(temporal_metrics$target_period_coverage)
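A base-R sketch of the four temporal metrics is shown below, using a small hand-made date vector so the results are easy to verify. `temporal_sketch` is a hypothetical function, and treating a "gap" as a calendar year inside the target range with zero results is an assumption.

```r
# Hypothetical helper for illustration only.
temporal_sketch <- function(dates, target_date_range) {
  years <- as.integer(format(dates, "%Y"))
  by_year <- table(years)
  in_target <- dates >= target_date_range[1] & dates <= target_date_range[2]
  target_years <- seq(
    as.integer(format(target_date_range[1], "%Y")),
    as.integer(format(target_date_range[2], "%Y"))
  )
  list(
    coverage_by_year = by_year,
    target_period_coverage = mean(in_target),
    temporal_gaps = setdiff(target_years, years),  # target years with no results
    peak_years = as.integer(names(by_year)[by_year == max(by_year)])
  )
}

dates <- as.Date(c("2014-06-01", "2015-03-01", "2015-09-01",
                   "2017-01-15", "2020-12-31", "2021-01-01"))
target <- as.Date(c("2015-01-01", "2020-12-31"))
tm <- temporal_sketch(dates, target)
print(tm$temporal_gaps)  # 2016 2018 2019
print(tm$peak_years)     # 2015
```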
Calculates a balanced effectiveness score for individual search terms using the harmonic mean of precision and coverage. This provides a single metric to evaluate how well each term performs in retrieving relevant articles.
calc_tes(term_analysis, score_name = "tes")
term_analysis |
Data frame from term_effectiveness() function |
score_name |
Name for the new score column (default: "tes") |
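The Term Effectiveness Score (TES) is calculated as the harmonic mean of precision and coverage:
$$\mathrm{TES} = \frac{2 \times \mathrm{Precision} \times \mathrm{Coverage}}{\mathrm{Precision} + \mathrm{Coverage}}$$
Where: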
Precision: Proportion of retrieved articles that are relevant
Coverage: Proportion of term-specific relevant articles that were retrieved
This differs from the traditional F1 score in that it uses coverage (term-specific relevance) rather than recall (overall strategy relevance).
Key Differences from F1 Score:
F1 Score: Harmonic mean of precision and recall (strategy-level performance)
TES: Harmonic mean of precision and coverage (term-level performance)
Recall: Relevant articles found / All relevant articles
Coverage: Relevant articles found / Term-specific relevant articles
Data frame with added effectiveness score column
term_effectiveness for calculating term precision and coverage
calc_precision_recall for strategy-level F1 scores
    # Create sample term analysis
    terms <- c("diabetes", "treatment", "clinical")
    search_results <- data.frame(
      id = paste0("art", 1:20),
      title = paste("Study on", sample(terms, 20, replace = TRUE)),
      abstract = paste("Research about", sample(terms, 20, replace = TRUE))
    )
    gold_standard <- paste0("art", c(1, 3, 5, 7, 9))

    # Analyze term effectiveness
    term_analysis <- term_effectiveness(terms, search_results, gold_standard)

    # Calculate effectiveness scores
    term_scores <- calc_tes(term_analysis)
    print(term_scores[order(term_scores$tes, decreasing = TRUE), ])
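Since the description defines TES as the harmonic mean of precision and coverage, the arithmetic can be checked with a few hand-picked values. The sketch below is hypothetical: `tes_sketch` and the `precision` and `coverage` column names are assumptions, not the documented output of `term_effectiveness()`.

```r
# Hypothetical helper: harmonic mean of precision and coverage, 0 when both are 0.
tes_sketch <- function(precision, coverage) {
  denom <- precision + coverage
  ifelse(denom > 0, 2 * precision * coverage / denom, 0)
}

term_table <- data.frame(
  term = c("diabetes", "treatment", "clinical"),
  precision = c(0.8, 0.5, 0.2),  # illustrative values, not real results
  coverage = c(0.4, 0.5, 0.9)
)
term_table$tes <- tes_sketch(term_table$precision, term_table$coverage)
print(term_table[order(term_table$tes, decreasing = TRUE), ])
```

The harmonic mean penalizes imbalance: "clinical" has the highest coverage but the lowest score because its precision is poor.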
Calculate Text Similarity
calc_text_sim(text1, text2, method = "jaccard")
text1 |
First text string |
text2 |
Second text string |
method |
Similarity method ("jaccard", "cosine", "jaro_winkler") |
Similarity score between 0 and 1
This file contains general utility functions used throughout the package.
check_deps(required_packages, install_missing = FALSE)
required_packages |
Character vector of required package names |
install_missing |
Logical, whether to suggest installing missing packages |
This function checks if required packages are installed using requireNamespace
to check availability without loading packages. For CRAN compliance, this function
does not automatically install packages.
Logical vector indicating which packages are available
    # Check if packages are available
    required <- c("ggplot2", "dplyr")
    availability <- check_deps(required)
    print(availability)

    # Get suggestions for missing packages
    required_with_missing <- c("ggplot2", "dplyr", "nonexistent_package")
    availability <- check_deps(required_with_missing, install_missing = TRUE)
    print(availability)
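The availability check described above can be approximated in a single base-R call. `deps_sketch` is a hypothetical helper; it only reports availability and never installs anything.

```r
# Hypothetical helper: TRUE/FALSE per package, without attaching any of them.
deps_sketch <- function(required_packages) {
  vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)
}

status <- deps_sketch(c("stats", "utils", "nonexistent_package"))
print(status)
```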
Generic function to process a large dataset in manageable chunks to reduce memory usage.
chunk_process(data, chunk_size = 10000, fn, combine_fn = rbind, ...)
data |
Large data frame to process |
chunk_size |
Number of rows per chunk |
fn |
Function to apply to each chunk |
combine_fn |
Function to combine results from chunks |
... |
Additional arguments passed to fn |
Combined results after processing all chunks
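A minimal sketch of the chunking pattern, assuming row-wise chunks that are processed independently and then combined: `chunk_sketch` is a hypothetical function that mirrors the documented signature but is not the package's implementation.

```r
# Hypothetical helper illustrating chunked processing of a data frame.
chunk_sketch <- function(data, chunk_size = 10000, fn, combine_fn = rbind, ...) {
  if (nrow(data) == 0) return(fn(data, ...))
  starts <- seq(1, nrow(data), by = chunk_size)
  pieces <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(data))
    fn(data[rows, , drop = FALSE], ...)  # process one chunk at a time
  })
  do.call(combine_fn, pieces)
}

df <- data.frame(id = 1:25, value = (1:25) * 2)
add_flag <- function(chunk, cutoff) {
  chunk$high <- chunk$value > cutoff
  chunk
}
out <- chunk_sketch(df, chunk_size = 10, fn = add_flag, cutoff = 20)
print(nrow(out))  # 25 rows, processed as chunks of 10, 10 and 5
```

This only helps memory usage when `fn` produces something smaller than, or no larger than, its input; the combined result still has to fit in memory.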
Clean Column Names
clean_col_names(names)
names |
Character vector of column names |
Cleaned column names
Clean Text Fields
clean_text(text)
text |
Character vector to clean |
Cleaned character vector
This file contains advanced benchmark testing capabilities including cross-validation, statistical testing, and performance comparison methods.
Statistical Significance Testing for Search Performance
compare_strategies(strategy1_results, strategy2_results, gold_standard, test_type = "mcnemar", alpha = 0.05)
strategy1_results |
Results from first search strategy |
strategy2_results |
Results from second search strategy |
gold_standard |
Vector of relevant article IDs |
test_type |
Type of statistical test ("mcnemar", "paired_t", "wilcoxon") |
alpha |
Significance level |
Statistical test results
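For the default `"mcnemar"` test type, one plausible setup pairs the two strategies on each gold-standard article (found or not found) and applies `stats::mcnemar.test()` to the resulting 2x2 table. The pairing scheme below is an assumption about how such a comparison could be framed, not a description of the package's internals.

```r
# Illustrative data: strategy 1 finds 8 of 10 relevant articles, strategy 2 finds 3.
strategy1 <- paste0("art", 1:8)
strategy2 <- paste0("art", 1:3)
gold_standard <- paste0("art", 1:10)

# One paired observation per relevant article: was it found by each strategy?
hit1 <- factor(gold_standard %in% strategy1, levels = c(FALSE, TRUE))
hit2 <- factor(gold_standard %in% strategy2, levels = c(FALSE, TRUE))
paired <- table(strategy1 = hit1, strategy2 = hit2)
print(paired)

# McNemar's test uses only the discordant pairs (found by one strategy, not the other)
test <- mcnemar.test(paired)
alpha <- 0.05
significant <- test$p.value < alpha
print(test$p.value)
print(significant)
```

With small gold standards the chi-squared approximation is rough, so an exact binomial test on the discordant pairs may be preferable.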
Compares the effectiveness of terms across multiple search strategies to identify which terms perform best in different contexts.
compare_terms(term_list, top_n = 5)
term_list |
Named list of term_analysis objects from different strategies |
top_n |
Number of top terms to compare (default: 5) |
This function:
Calculates effectiveness scores for each strategy
Identifies top terms in each strategy
Creates a comparison matrix showing performance across strategies
Data frame comparing term effectiveness across strategies
Perform a complete workflow: search databases, analyze results, generate reports.
complete_search_workflow(search_terms, databases = "pubmed", gold_standard = NULL, max_results = 100, date_range = NULL, output_dir = NULL)
search_terms |
Character vector of search terms |
databases |
Vector of databases to search |
gold_standard |
Optional vector of known relevant article IDs |
max_results |
Maximum results to retrieve |
date_range |
Optional date range for search |
output_dir |
Directory for reports (uses tempdir() by default) |
List containing search results, analysis, and report paths
    # Complete workflow
    results <- complete_search_workflow(
      search_terms = "diabetes treatment clinical trial",
      databases = "pubmed",
      max_results = 50,
      date_range = c("2022/01/01", "2023/12/31")
    )

    # View summary
    print(results$summary)

    # Access detailed metrics
    print(results$analysis$metrics)
Create Analysis Template Script
create_analysis_template(file_path)
file_path |
Output file path |
Create Data Dictionary
create_data_dictionary(file_path, search_results)
file_path |
Output file path |
search_results |
Search results data |
Create Data Package for Sharing
create_data_package(search_results, analysis_results = NULL, output_dir = NULL, package_name = "search_analysis_package")
search_results |
Data frame with search results |
analysis_results |
List of analysis results |
output_dir |
Directory to create the package (defaults to tempdir()) |
package_name |
Name of the package |
Path to created package directory
    # Create sample data
    search_results <- data.frame(
      id = paste0("art", 1:10),
      title = paste("Study", 1:10),
      abstract = paste("Abstract", 1:10),
      source = "Journal",
      date = Sys.Date(),
      stringsAsFactors = FALSE
    )

    # Create data package (writes to tempdir())
    package_path <- create_data_package(search_results)
    print(package_path)
Create Package Manifest
create_package_manifest(package_dir)
package_dir |
Package directory |
Create Package README
create_package_readme(package_dir, search_results, analysis_results)
package_dir |
Package directory |
search_results |
Search results data |
analysis_results |
Analysis results |
Create PRISMA Flow Diagram with Proper Spacing and Text Enclosure
create_prisma(flow_data)
flow_data |
List containing PRISMA flow numbers |
ggplot object
Create Progress Bar for Long Operations
create_progress_bar(total, format = "[:bar] :percent :elapsed")
total |
Total number of iterations |
format |
Progress bar format string |
Progress bar object
Create Default Search Strategy Template
create_strategy(terms, databases, date_range = NULL, filters = NULL)
terms |
Character vector of search terms |
databases |
Character vector of databases |
date_range |
Date vector of length 2 (start, end) |
filters |
List of additional filters |
Search strategy list
Create Summary Statistics Table
create_summary(data, numeric_vars = NULL, categorical_vars = NULL)
data |
Data frame |
numeric_vars |
Character vector of numeric variable names |
categorical_vars |
Character vector of categorical variable names |
Summary statistics data frame
Cross-Validation Framework for Search Strategies
cv_strategy(search_strategy, validation_corpus, gold_standard, k_folds = 5, stratified = TRUE)
search_strategy |
Search strategy object |
validation_corpus |
Full corpus for validation |
gold_standard |
Vector of relevant article IDs |
k_folds |
Number of folds for cross-validation |
stratified |
Whether to use stratified sampling |
Cross-validation results
Detect DOI-based Duplicates
detect_doi_dupes(results)
results |
Data frame with search results |
Data frame with DOI duplicates marked
Detect and Remove Duplicate Records
detect_dupes(results, method = "exact", similarity_threshold = 0.85)
results |
Standardized search results data frame |
method |
Method for duplicate detection ("exact", "fuzzy", "doi") |
similarity_threshold |
Threshold for fuzzy matching (0-1) |
This function provides three methods for duplicate detection:
exact: Matches on title and first 100 characters of abstract
fuzzy: Uses Jaro-Winkler string distance for similarity matching
doi: Matches based on cleaned DOI strings
For fuzzy matching, similarity_threshold should be between 0 and 1, where 1 means identical strings. A threshold of 0.85 typically works well for academic titles.
Data frame with duplicates marked and removed
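The "exact" method, as described, keys each record on its title plus the first 100 characters of its abstract. A base-R sketch of that idea follows; `exact_dupes_sketch` and the `duplicate` column are hypothetical, and lowercasing and trimming before comparison are assumptions rather than documented behavior.

```r
# Hypothetical helper: mark records whose title + abstract prefix repeats an earlier record.
exact_dupes_sketch <- function(results) {
  key <- paste(
    tolower(trimws(results$title)),
    substr(tolower(trimws(results$abstract)), 1, 100),
    sep = "||"
  )
  results$duplicate <- duplicated(key)  # TRUE for second and later occurrences
  results
}

records <- data.frame(
  id = paste0("art", 1:4),
  title = c("Insulin therapy in adults", "Metformin outcomes",
            "INSULIN THERAPY IN ADULTS", "Diet and exercise"),
  abstract = c("A randomized trial.", "A cohort study.",
               "A randomized trial.", "A review."),
  stringsAsFactors = FALSE
)
marked <- exact_dupes_sketch(records)
deduped <- marked[!marked$duplicate, ]
print(marked$duplicate)  # FALSE FALSE TRUE FALSE
```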
Detect Exact Duplicates
detect_exact_dupes(results)
results |
Data frame with search results |
Data frame with exact duplicates marked
Detect Fuzzy Duplicates
detect_fuzzy_dupes(results, threshold = 0.85)
results |
Data frame with search results |
threshold |
Similarity threshold |
Data frame with fuzzy duplicates marked
Export Analysis Metrics
export_metrics(metrics, file_path, format = "xlsx")
metrics |
List of calculated metrics |
file_path |
Output file path |
format |
Export format ("csv", "xlsx", "json") |
File path of created file
    # Create sample metrics
    metrics <- list(
      basic = list(total_records = 100, unique_records = 95),
      precision_recall = list(precision = 0.8, recall = 0.6, f1_score = 0.69)
    )

    # Export metrics (writes to tempdir())
    output_file <- export_metrics(metrics, file.path(tempdir(), "metrics.xlsx"))
    print(output_file)
Export Metrics to CSV
export_metrics_csv(metrics, file_path)
metrics |
List of calculated metrics |
file_path |
Output file path |
File path of created file
Export Metrics to JSON
export_metrics_json(metrics, file_path)
metrics |
List of calculated metrics |
file_path |
Output file path |
File path of created file
Export Metrics to Excel
export_metrics_xlsx(metrics, file_path)
metrics |
List of calculated metrics |
file_path |
Output file path |
File path of created file
This file contains functions for exporting search analysis results, reports, and data in various formats.
Export Search Results to Multiple Formats
export_results(search_results, file_path = NULL, formats = c("csv", "xlsx"), include_metadata = TRUE)
search_results |
Data frame with search results |
file_path |
Base file path (without extension). If NULL, uses tempdir() |
formats |
Vector of formats to export ("csv", "xlsx", "ris", "bibtex") |
include_metadata |
Logical, whether to include metadata sheets/files |
This function exports search results to multiple standard formats used in systematic reviews and reference management. Supported formats include:
CSV: Comma-separated values for data analysis
Excel: Multi-sheet workbook with metadata
RIS: Reference Information Systems format for reference managers
BibTeX: LaTeX bibliography format
EndNote: Thomson Reuters EndNote format
Vector of created file paths
    # Create sample search results
    search_results <- data.frame(
      id = paste0("article_", 1:5),
      title = paste("Sample Article", 1:5),
      abstract = paste("Abstract for article", 1:5),
      source = "Sample Journal",
      date = Sys.Date(),
      stringsAsFactors = FALSE
    )

    # Export to multiple formats (writes to tempdir())
    output_files <- export_results(search_results, formats = c("csv", "xlsx"))
    print(output_files)
Export to BibTeX Format
export_to_bibtex(search_results, file_path)
search_results |
Data frame with search results |
file_path |
Output file path |
File path of created file
Export to CSV Format
export_to_csv(search_results, file_path, include_metadata = TRUE)
search_results |
Data frame with search results |
file_path |
Output file path |
include_metadata |
Logical, whether to create metadata file |
File path of created file
Export to EndNote Format
export_to_endnote(search_results, file_path)
search_results |
Data frame with search results |
file_path |
Output file path |
File path of created file
Export to RIS Format
export_to_ris(search_results, file_path)
search_results |
Data frame with search results |
file_path |
Output file path |
File path of created file
Export to Excel Format with Multiple Sheets
export_to_xlsx(search_results, file_path, include_metadata = TRUE)
search_results |
Data frame with search results |
file_path |
Output file path |
include_metadata |
Logical, whether to include metadata sheets |
File path of created file
Export Validation Results
export_validation(validation_results, file_path, format = "xlsx")
validation_results |
Results from benchmark validation |
file_path |
Output file path |
format |
Export format ("xlsx", "csv", "json") |
File path of created file
    # Create sample validation results
    validation_results <- list(
      precision = 0.8,
      recall = 0.6,
      f1_score = 0.69,
      true_positives = 24,
      false_positives = 6,
      false_negatives = 16
    )

    # Export validation results (writes to tempdir())
    output_file <- export_validation(
      validation_results,
      file.path(tempdir(), "validation.xlsx")
    )
    print(output_file)
Export Validation Results to CSV
export_validation_csv(validation_results, file_path)
validation_results |
Validation results |
file_path |
Output file path |
File path of created file
Export Validation Results to JSON
export_validation_json(validation_results, file_path)
validation_results |
Validation results |
file_path |
Output file path |
File path of created file
Export Validation Results to Excel
export_validation_xlsx(validation_results, file_path)
validation_results |
Validation results |
file_path |
Output file path |
File path of created file
Extract Screening Data Structure
extract_screening(search_results, screening_decisions = NULL)
search_results |
Combined search results |
screening_decisions |
Optional data frame with screening decisions |
Data frame with screening structure for PRISMA
Identifies the top-performing search terms based on their effectiveness scores and optionally creates highlighted visualizations.
find_top_terms(term_analysis, n = 3, score_col = "tes", plot = TRUE, plot_type = "precision_only")
term_analysis |
Data frame from term_effectiveness() function |
n |
Number of top terms to identify (default: 3) |
score_col |
Name of the score column to use for ranking (default: "tes") |
plot |
Whether to create a highlighted plot (default: TRUE) |
plot_type |
Type of plot for highlighting ("precision_only", "coverage_only", "precision_coverage") |
This function:
Calculates effectiveness scores if not already present
Identifies the top N performing terms
Optionally creates a visualization highlighting these terms
List containing top terms and optionally a highlighted plot
Format Numbers for Display
format_numbers(x, digits = 3, percent = FALSE)
x |
Numeric vector |
digits |
Number of decimal places |
percent |
Logical, whether to format as percentage |
Formatted character vector
Generate Reproducible Random Seed
gen_repro_seed(base_string = "searchAnalyzeR")
base_string |
Base string for seed generation |
This function generates a reproducible seed from a string input. It does not set the seed itself; call set.seed() with the returned value to use it.
Integer seed value
# Generate a seed value
seed_value <- gen_repro_seed("my_analysis")

# User can choose to set it
set.seed(seed_value)
sample(1:10, 3)
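The idea of deriving a seed from a string can be illustrated in base R. The sketch below is only an assumption about how such a mapping might work: `string_to_seed()` is a hypothetical helper, not the package's implementation of gen_repro_seed(), and the weighting scheme is chosen purely for illustration.

```r
# Hypothetical helper: NOT the package's implementation of gen_repro_seed().
# It maps a string to an integer deterministically, so the same string
# always yields the same seed.
string_to_seed <- function(base_string) {
  codes <- utf8ToInt(base_string)     # integer code points of the string
  weights <- seq_along(codes)         # position weights so character order matters
  # Work in doubles to avoid integer overflow, then reduce modulo 2^31 - 1
  as.integer(sum(as.numeric(codes) * weights) %% 2147483647)
}

seed_a <- string_to_seed("my_analysis")
seed_b <- string_to_seed("my_analysis")
identical(seed_a, seed_b)             # the same input gives the same seed

# Setting the same seed twice reproduces the same random draw
set.seed(seed_a)
draw_1 <- sample(1:10, 3)
set.seed(seed_a)
draw_2 <- sample(1:10, 3)
identical(draw_1, draw_2)
```

Keeping seed generation separate from set.seed(), as the function above describes, leaves the global random number state untouched until the user opts in.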
Extract Package Version Information
get_pkg_versions(packages = c("searchAnalyzeR", "ggplot2", "lubridate", "openxlsx"))
packages |
Character vector of package names |
Data frame with package version information
Check if Object is Empty
is_empty(x)
x |
Object to check |
Logical indicating if object is empty
Removes intermediate and temporary objects created during analysis to free memory. This is particularly useful for large-scale analyses.
mem_cleanup(keep_results = TRUE, verbose = TRUE, env = parent.frame())
keep_results |
Logical, whether to keep final results |
verbose |
Logical, whether to print memory freed information |
env |
Environment to clean (defaults to parent.frame()) |
Amount of memory freed in MB
Wraps a function call with memory usage monitoring, reporting memory usage before, during, and after execution.
mem_monitor(fn, interval = 1, ...)
fn |
Function to execute |
interval |
Time interval in seconds for memory checks during execution |
... |
Arguments passed to fn |
Result of fn with memory usage statistics as an attribute
Reports the current memory usage of the R session.
mem_usage(units = "MB", include_gc = FALSE)
units |
Units for reporting memory usage ("MB", "GB", or "KB") |
include_gc |
Logical, whether to run garbage collection before measuring |
Named list with memory usage information
Merge Search Results from Multiple Sources
merge_results(result_list, deduplicate = TRUE, dedup_method = "exact")
result_list |
List of standardized search result data frames |
deduplicate |
Logical, whether to remove duplicates |
dedup_method |
Method for duplicate detection |
Combined and deduplicated data frame
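The merge-then-deduplicate step can be sketched in base R. This is a minimal illustration, assuming that exact deduplication means dropping rows whose normalized title repeats; the matching rule actually used by merge_results() is not documented here, and the column names below are hypothetical.

```r
# Two hypothetical standardized result sets that share one article ("Study two")
pubmed <- data.frame(
  id = c("a1", "a2"),
  title = c("Study one", "Study two"),
  source = "pubmed",
  stringsAsFactors = FALSE
)
embase <- data.frame(
  id = c("a2", "a3"),
  title = c("study two ", "Study three"),
  source = "embase",
  stringsAsFactors = FALSE
)

# Stack the data frames row-wise
combined <- do.call(rbind, list(pubmed, embase))

# Assumed matching rule: titles compared after lower-casing and trimming whitespace
title_key <- tolower(trimws(combined$title))

# Keep the first occurrence of each title
merged <- combined[!duplicated(title_key), ]
nrow(merged)  # 3 rows remain out of 4
```

Because duplicated() keeps the first occurrence, the order of the input list decides which source's record survives.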
Meta-Analysis of Benchmark Results
meta_analyze(benchmark_results, strategy_name, metric = "f1_score")
benchmark_results |
List of benchmark result objects |
strategy_name |
Name of strategy to analyze across benchmarks |
metric |
Metric to meta-analyze ("precision", "recall", "f1_score") |
Meta-analysis results
Converts a data frame to a memory-efficient format by optimizing column types.
opt_df(df, compress_strings = FALSE, verbose = TRUE)
df |
Data frame to optimize |
compress_strings |
Logical, whether to convert character columns to factors |
verbose |
Logical, whether to print memory savings information |
Memory-efficient version of the input data frame
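Column-type optimization of this kind can be illustrated with base R alone. The sketch below assumes two common conversions (whole-number doubles to integer, repeated strings to factor); it is not the package's opt_df() implementation, and the sample data are made up.

```r
# Hypothetical data: a repetitive character column and whole-number years as doubles
df <- data.frame(
  database = rep(c("PubMed", "Embase", "Scopus"), times = 1000),
  year = rep(c(2020, 2021, 2022), times = 1000),
  stringsAsFactors = FALSE
)

optimized <- df
# Whole-number doubles fit in integer storage (4 bytes per value instead of 8)
optimized$year <- as.integer(optimized$year)
# Low-cardinality character columns can be stored as integer codes plus levels
optimized$database <- factor(optimized$database)

size_before <- as.numeric(object.size(df))
size_after <- as.numeric(object.size(optimized))
size_after < size_before  # the optimized copy is smaller
```

Converting strings to factors changes how the column behaves in comparisons and joins, which is presumably why compress_strings defaults to FALSE.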
Create Database Performance Comparison
plot_db_performance(results_by_database, gold_standard = NULL)
results_by_database |
List of result sets by database |
gold_standard |
Vector of relevant article IDs |
ggplot object
Create Keyword Effectiveness Analysis Plot
plot_keyword_eff(search_results, search_terms, gold_standard = NULL)
search_results |
Data frame with search results |
search_terms |
Vector of search terms |
gold_standard |
Vector of relevant article IDs |
ggplot object
This file contains all visualization functions used by the SearchAnalyzer class and other components of the searchAnalyzeR package.
Create Overview Performance Plot
plot_overview(metrics)
metrics |
List of calculated metrics from SearchAnalyzer |
Creates a focused overview plot displaying the core search performance metrics:
Precision: Proportion of retrieved articles that are relevant
Recall: Proportion of relevant articles that were retrieved
F1 Score: Harmonic mean of precision and recall
The plot uses color coding to distinguish between metric types and displays exact values on top of each bar.
ggplot object showing key performance indicators
# Assume you have calculated metrics
metrics <- list(
  precision_recall = list(precision = 0.8, recall = 0.6, f1_score = 0.69)
)
overview_plot <- plot_overview(metrics)
print(overview_plot)
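The three metrics shown in the plot follow directly from the overlap between retrieved and relevant article IDs. The sketch below works through the arithmetic in base R with made-up IDs, chosen so the counts match the sample validation results used elsewhere in this manual (24 true positives, 6 false positives, 16 false negatives); it does not call any package function.

```r
# Hypothetical IDs: 30 retrieved articles, 40 relevant ones, 24 in common
retrieved <- paste0("art", 1:30)
relevant <- c(paste0("art", 1:24), paste0("missed", 1:16))

tp <- length(intersect(retrieved, relevant))  # retrieved and relevant: 24
fp <- length(setdiff(retrieved, relevant))    # retrieved but not relevant: 6
fn <- length(setdiff(relevant, retrieved))    # relevant but not retrieved: 16

precision <- tp / (tp + fp)                   # 24 / 30 = 0.8
recall <- tp / (tp + fn)                      # 24 / 40 = 0.6
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 2)                                  # 0.69
```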
Create Precision-Recall Curve
plot_pr_curve(retrieved, relevant, thresholds = seq(0, 1, 0.05))
retrieved |
Vector of retrieved article IDs |
relevant |
Vector of relevant article IDs |
thresholds |
Vector of threshold values |
ggplot object
Create Sensitivity Analysis Heatmap
plot_sensitivity(sensitivity_results)
sensitivity_results |
Results from sensitivity analysis |
ggplot object
Create Temporal Coverage Plot
plot_temporal(search_results, gold_standard = NULL)
search_results |
Data frame with search results including date column |
gold_standard |
Vector of relevant article IDs |
ggplot object
Plot Term Effectiveness Results
plot_term_effectiveness(term_analysis, plot_type = "precision_coverage", highlight_terms = NULL, title_override = NULL, show_values = TRUE)
term_analysis |
Result from term_effectiveness function |
plot_type |
Type of plot to create ("precision_coverage", "counts", "comparison", "precision_only", "coverage_only") |
highlight_terms |
Optional character vector of terms to highlight |
title_override |
Optional custom title for the plot |
show_values |
Logical, whether to show values on bars/points (default: TRUE) |
This function visualizes term effectiveness results. In addition to the combined "precision_coverage", "counts", and "comparison" views, the "precision_only" and "coverage_only" plot types each produce a single, uncluttered plot for focused analysis.
A ggplot object if ggplot2 is available, otherwise NULL with a message
# Create sample data for demonstration
search_results <- data.frame(
  id = paste0("art", 1:10),
  title = c("Diabetes treatment", "Clinical trial", "Diabetes study",
            "Treatment options", "New therapy", "Glucose control",
            "Insulin therapy", "Management of diabetes", "Clinical study",
            "Therapy comparison"),
  abstract = c("This study examines diabetes treatments.",
               "A clinical trial on new treatments.",
               "Diabetes research findings.",
               "Comparison of treatment options.",
               "Novel therapy approach.",
               "Methods to control glucose levels.",
               "Insulin therapy effectiveness.",
               "Managing diabetes effectively.",
               "Clinical research protocols.",
               "Comparing therapy approaches.")
)

# Define search terms and gold standard
terms <- c("diabetes", "treatment", "clinical", "therapy")
gold_standard <- c("art1", "art3", "art7", "art8")

# First analyze term effectiveness
term_metrics <- term_effectiveness(terms, search_results, gold_standard)

# Create individual plots
precision_plot <- plot_term_effectiveness(term_metrics, "precision_only")
coverage_plot <- plot_term_effectiveness(term_metrics, "coverage_only")
bubble_plot <- plot_term_effectiveness(term_metrics, "precision_coverage")
Print Method for Term Comparison
## S3 method for class 'term_comparison'
print(x, ...)
x |
A term_comparison object |
... |
Further arguments passed to or from other methods |
Invisibly returns the input object
Print Method for term_effectiveness Objects
## S3 method for class 'term_effectiveness'
print(x, ...)
x |
A term_effectiveness object |
... |
Further arguments passed to or from other methods |
Invisibly returns the input object
A comprehensive reporting system for generating PRISMA-compliant reports from systematic review search analyses.
The PRISMAReporter class provides tools for:
Generating comprehensive search strategy reports
Creating PRISMA flow diagrams
Documenting search strategies
Exporting reports in multiple formats (HTML, PDF, Word)
new(): Initialize a new PRISMAReporter instance
generate_report(search_analysis, output_format, template_type): Generate comprehensive search strategy report
generate_prisma_diagram(screening_data): Generate PRISMA flow diagram
document_search_strategy(search_strategy): Generate search strategy documentation
new()
Creates a new PRISMAReporter instance for generating PRISMA-compliant reports. Sets up the necessary template paths and configuration.
PRISMAReporter$new()
No return value, called for side effects (initialization)
Generate comprehensive search strategy report
generate_report()
PRISMAReporter$generate_report(search_analysis, output_format = "html", template_type = "comprehensive")
search_analysis: SearchAnalyzer object
output_format: Output format ("html", "pdf", "word")
template_type: Type of report template
Path to generated report
Generate PRISMA flow diagram
generate_prisma_diagram()
PRISMAReporter$generate_prisma_diagram(screening_data)
screening_data: Data frame with screening results
ggplot object
Generate search strategy documentation
document_search_strategy()
PRISMAReporter$document_search_strategy(search_strategy)
search_strategy: Search strategy object
Formatted documentation
clone()
The objects of this class are cloneable with this method.
PRISMAReporter$clone(deep = FALSE)
deep: Whether to make a deep clone.
# Create reporter
reporter <- PRISMAReporter$new()

# Create sample search strategy for documentation
search_strategy <- list(
  terms = c("systematic review", "meta-analysis", "evidence synthesis"),
  databases = c("PubMed", "Embase", "Cochrane"),
  date_range = as.Date(c("2020-01-01", "2023-12-31")),
  filters = list(language = "English", study_type = "RCT")
)

# Generate search strategy documentation
strategy_docs <- reporter$document_search_strategy(search_strategy)
print(strategy_docs)

# Create sample screening data for PRISMA diagram
screening_data <- data.frame(
  id = 1:100,
  duplicate = c(rep(FALSE, 80), rep(TRUE, 20)),
  title_abstract_screened = c(rep(TRUE, 80), rep(FALSE, 20)),
  full_text_eligible = c(rep(TRUE, 25), rep(FALSE, 75)),
  included = c(rep(TRUE, 15), rep(FALSE, 85)),
  excluded_title_abstract = c(rep(FALSE, 25), rep(TRUE, 55), rep(FALSE, 20)),
  excluded_full_text = c(rep(FALSE, 15), rep(TRUE, 10), rep(FALSE, 75))
)

# Generate PRISMA diagram
prisma_plot <- reporter$generate_prisma_diagram(screening_data)
print("PRISMA diagram created successfully")
A class for connecting to and searching the PubMed database directly, then formatting results for analysis with searchAnalyzeR.
This module provides functionality to search PubMed directly and integrate the results with searchAnalyzeR's analysis capabilities.
PubMed Search Interface
This class uses the rentrez package to interface with NCBI's E-utilities to search PubMed and retrieve article metadata. Results are automatically formatted for use with SearchAnalyzer. If rentrez is not available, it provides simulated data for demonstration purposes.
new(): Initialize a new PubMedConnector instance
search(query, max_results, date_range): Search PubMed database
get_details(pmids): Get detailed information for specific PMIDs
format_for_analysis(): Format results for SearchAnalyzer
last_search_results: Raw results from last search
formatted_results: Formatted results ready for analysis
search_metadata: Metadata about the last search
use_simulation: Flag indicating if simulation mode is active
new()
Initialize a new PubMedConnector instance
PubMedConnector$new()
No return value, called for side effects
Search PubMed database
search()
PubMedConnector$search(query, max_results = 100, date_range = NULL, retmode = "xml")
query: PubMed search query string
max_results: Maximum number of results to retrieve (default: 100)
date_range: Optional date range as c("YYYY/MM/DD", "YYYY/MM/DD")
retmode: Return mode ("xml" or "text")
Number of results found
Get detailed information for specific PMIDs
get_details()
PubMedConnector$get_details(pmids, retmode = "xml")
pmids: Vector of PubMed IDs
retmode: Return mode ("xml" or "text")
Detailed article information
Format results for SearchAnalyzer
format_for_analysis()
PubMedConnector$format_for_analysis()
Data frame formatted for searchAnalyzeR analysis
Get search summary
get_search_summary()
PubMedConnector$get_search_summary()
List with search summary information
clone()
The objects of this class are cloneable with this method.
PubMedConnector$clone(deep = FALSE)
deep: Whether to make a deep clone.
# Create PubMed connector
pubmed <- PubMedConnector$new()

# Search for diabetes studies
results <- pubmed$search(
  query = "diabetes[Title/Abstract] AND clinical trial[Publication Type]",
  max_results = 100,
  date_range = c("2020/01/01", "2023/12/31")
)

# Format for analysis
search_data <- pubmed$format_for_analysis()

# Use with SearchAnalyzer
analyzer <- SearchAnalyzer$new(search_data)
metrics <- analyzer$calculate_metrics()
Rename Columns Based on Mapping
rename_columns(df, mapping)
df |
Data frame to rename |
mapping |
Named vector of column mappings |
Data frame with renamed columns
A comprehensive system for managing and validating the reproducibility of systematic review search strategies and analyses.
The ReproducibilityManager class provides tools for:
Creating reproducible search packages
Validating reproducibility of existing packages
Generating audit trails
Ensuring transparency and reproducibility in evidence synthesis
new(): Initialize a new ReproducibilityManager instance
create_repro_package(search_strategy, results, analysis_config): Create reproducible search package
validate_repro(package_path): Validate reproducibility of existing package
gen_audit_trail(search_analysis): Generate audit trail
new()
Creates a new ReproducibilityManager instance for managing search reproducibility. Sets up necessary configurations and validates system requirements.
ReproducibilityManager$new()
No return value, called for side effects (initialization)
Create reproducible search package
create_repro_package()
ReproducibilityManager$create_repro_package(search_strategy, results, analysis_config)
search_strategy: Search strategy object
results: Search results
analysis_config: Analysis configuration
Path to reproducibility package
Validate reproducibility of existing package
validate_repro()
ReproducibilityManager$validate_repro(package_path)
package_path: Path to reproducibility package
Validation results
Generate audit trail
gen_audit_trail()
ReproducibilityManager$gen_audit_trail(search_analysis)
search_analysis: SearchAnalyzer object
Audit trail object
clone()
The objects of this class are cloneable with this method.
ReproducibilityManager$clone(deep = FALSE)
deep: Whether to make a deep clone.
# Create reproducibility manager
manager <- ReproducibilityManager$new()

# Create sample search strategy
search_strategy <- list(
  terms = c("systematic review", "meta-analysis"),
  databases = c("PubMed", "Embase"),
  timestamp = Sys.time(),
  date_range = as.Date(c("2020-01-01", "2023-12-31"))
)

# Create sample search results
search_results <- data.frame(
  id = paste0("article_", 1:20),
  title = paste("Research Study", 1:20),
  abstract = paste("Abstract for study", 1:20),
  source = "Journal of Research",
  date = Sys.Date() - sample(1:365, 20, replace = TRUE),
  stringsAsFactors = FALSE
)

# Create sample analysis configuration
analysis_config <- list(
  gold_standard = paste0("article_", sample(1:20, 5)),
  method = "precision_recall",
  parameters = list(threshold = 0.8)
)

# Create reproducible package (writes to tempdir())
package_path <- manager$create_repro_package(
  search_strategy = search_strategy,
  results = search_results,
  analysis_config = analysis_config
)
print(paste("Package created at:", package_path))

# Generate audit trail (create mock analyzer object for demonstration)
mock_analysis <- list(
  search_results = search_results,
  metadata = list(timestamp = Sys.time())
)
class(mock_analysis) <- "mock_analyzer"
audit_trail <- manager$gen_audit_trail(mock_analysis)
print("Audit trail generated successfully")
Benchmark Suite Execution
run_benchmarks(search_strategies, benchmark_datasets, metrics_to_calculate = c("precision", "recall", "f1", "efficiency"))
search_strategies |
List of search strategy objects |
benchmark_datasets |
List of benchmark datasets |
metrics_to_calculate |
Vector of metrics to calculate |
Comprehensive benchmark results
Safe Division Function
safe_divide(numerator, denominator, default_value = 0)
numerator |
Numerator value |
denominator |
Denominator value |
default_value |
Value to return if denominator is 0 |
Division result or default value
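Guarded division of this kind matters when a metric's denominator can be zero, for example precision when nothing was retrieved. The sketch below shows one way to write it in base R; `safe_divide_sketch()` is a hypothetical stand-in, not the package's safe_divide(), and it assumes the denominator is either a scalar or the same length as the numerator.

```r
# Hypothetical stand-in for illustration; not the package's safe_divide()
safe_divide_sketch <- function(numerator, denominator, default_value = 0) {
  # Flag zero denominators (missing values are left to propagate as NA)
  is_zero <- !is.na(denominator) & denominator == 0
  result <- numerator / denominator
  # Replace the Inf/NaN produced by division by zero with the default
  result[is_zero] <- default_value
  result
}

safe_divide_sketch(24, 30)                 # ordinary division: 0.8
safe_divide_sketch(5, 0)                   # zero denominator: returns 0
safe_divide_sketch(c(1, 2), c(0, 4), -1)   # element-wise: -1 and 0.5
```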
Convert List to Data Frame Safely
safe_list_to_df(x)
x |
List to convert |
Data frame or NULL if conversion fails
Search multiple databases and combine results for comprehensive analysis.
search_multiple_databases(search_strategy, databases = c("pubmed"), max_results_per_db = 100)
search_strategy |
List containing search parameters |
databases |
Vector of databases to search ("pubmed", "pmc", etc.) |
max_results_per_db |
Maximum results per database |
Combined search results from all databases
# Define search strategy
strategy <- list(
  terms = "diabetes AND treatment",
  date_range = c("2020/01/01", "2023/12/31"),
  max_results = 50
)

# Search multiple databases
results <- search_multiple_databases(
  search_strategy = strategy,
  databases = c("pubmed"),
  max_results_per_db = 100
)

# Analyze results
analyzer <- SearchAnalyzer$new(results)
metrics <- analyzer$calculate_metrics()
Searches PubMed using the provided search terms and retrieves article metadata in a format compatible with searchAnalyzeR analysis functions.
search_pubmed(search_terms, max_results = 200, date_range = NULL, language = "English")
search_terms |
Character vector of search terms to use in PubMed query |
max_results |
Maximum number of results to retrieve (default: 200) |
date_range |
Optional date range as c("YYYY-MM-DD", "YYYY-MM-DD") |
language |
Optional language filter (default: "English") |
This function connects to PubMed using the rentrez package (if available) or provides simulated data if the package is not installed. Results are returned as a standardized data frame ready for use with SearchAnalyzer.
Data frame containing standardized search results
# Search for diabetes clinical trials
results <- search_pubmed(
  search_terms = c("diabetes", "clinical trial"),
  max_results = 100,
  date_range = c("2020-01-01", "2023-12-31")
)

# Use with SearchAnalyzer
analyzer <- SearchAnalyzer$new(results)
metrics <- analyzer$calculate_metrics()
The SearchAnalyzer class provides a comprehensive framework for analyzing the performance of systematic review search strategies. It calculates precision, recall, and other performance metrics, generates visualizations, and supports validation against gold standard datasets.
Core class for analyzing systematic review search strategies
This R6 class encapsulates all functionality needed for search strategy analysis. Key capabilities include:
Performance metric calculation (precision, recall, F1, efficiency)
Temporal and database coverage analysis
Visualization generation for reports
Gold standard validation
new(search_results, gold_standard, search_strategy): Initialize analyzer
calculate_metrics(): Calculate comprehensive performance metrics
visualize_performance(type): Generate performance visualizations
search_results: Data frame containing search results
gold_standard: Reference set of relevant articles
metadata: Search strategy metadata
new()
Initialize the analyzer with search results and optional gold standard.
SearchAnalyzer$new(search_results, gold_standard = NULL, search_strategy = NULL)
search_results: Data frame with search results
gold_standard: Vector of known relevant article IDs
search_strategy: List containing search parameters
No return value, called for side effects
Calculate comprehensive performance metrics
calculate_metrics()
SearchAnalyzer$calculate_metrics()
List of performance metrics
Generate performance visualization
visualize_performance()
SearchAnalyzer$visualize_performance(type = "overview")
type: Type of visualization
ggplot object
clone()
The objects of this class are cloneable with this method.
SearchAnalyzer$clone(deep = FALSE)
deep: Whether to make a deep clone.
Simulate Search Strategy Execution
simulate_search_execution(strategy, corpus)
strategy |
Search strategy object |
corpus |
Data frame with article corpus |
Vector of retrieved article IDs
Standardize Date Formats
standardize_date(dates)
dates |
Character or Date vector |
Date vector
Standardize Cochrane Results
std_cochrane_results(results)
results |
Data frame with Cochrane results |
Standardized data frame
Standardize Embase Results
std_embase_results(results)
results |
Data frame with Embase results |
Standardized data frame
Standardize Generic Results
std_generic_results(results)
results |
Data frame with generic results |
Standardized data frame
Standardize PubMed Results
std_pubmed_results(results)
results |
Data frame with PubMed results |
Standardized data frame
Standardize Scopus Results
std_scopus_results(results)
results |
Data frame with Scopus results |
Standardized data frame
Standardize Search Results Format
std_search_results(results, source_format = "generic")
results |
Data frame with search results |
source_format |
Character indicating the source format |
Standardized data frame
Standardize Web of Science Results
std_wos_results(results)
results |
Data frame with Web of Science results |
Standardized data frame
Stream Process Large Files
stream_file(file_path, process_fn, chunk_size = 10000, skip = 0, max_lines = NULL, progress = TRUE)
file_path |
Path to the file to process |
process_fn |
Function to process each chunk/line |
chunk_size |
Number of lines to read at once |
skip |
Number of lines to skip at beginning of file |
max_lines |
Maximum number of lines to process (NULL = all) |
progress |
Logical, whether to show progress |
Result of processing
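Chunked processing keeps memory use bounded by reading a fixed number of lines at a time from an open connection. The sketch below shows the general pattern in base R; `read_in_chunks()` is a hypothetical helper written for this illustration and is not the package's stream_file() (it omits skip, max_lines, and progress reporting).

```r
# Write a small temporary file so the sketch is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(paste("record", 1:25), tmp)

# Hypothetical chunked reader; not the package's stream_file() implementation
read_in_chunks <- function(file_path, process_fn, chunk_size = 10) {
  con <- file(file_path, open = "r")
  on.exit(close(con))                      # always release the connection
  results <- list()
  repeat {
    # An open connection remembers its position between readLines() calls
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break          # end of file reached
    results[[length(results) + 1]] <- process_fn(chunk)
  }
  results
}

# Count the lines in each chunk: 25 lines in chunks of 10 gives 10, 10, 5
chunk_sizes <- unlist(read_in_chunks(tmp, length, chunk_size = 10))
chunk_sizes

unlink(tmp)
```

Only one chunk is held in memory at a time, so the peak footprint depends on chunk_size and on what process_fn returns, not on the file size.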
Analyzes the effectiveness of individual search terms by calculating precision, coverage, and other relevant metrics for each term. This provides insight into which terms are most effective at retrieving relevant articles.
term_effectiveness(terms, search_results, gold_standard = NULL, text_fields = c("title", "abstract"))
terms |
Character vector of search terms to analyze |
search_results |
Data frame with search results |
gold_standard |
Optional vector of relevant article IDs |
text_fields |
Character vector of column names to search for terms (default: c("title", "abstract")) |
For each term, this function calculates:
Number of articles containing the term
Number of relevant articles containing the term (if gold_standard provided)
Precision (proportion of retrieved articles that are relevant)
Coverage (proportion of relevant articles retrieved by the term)
Data frame with term effectiveness metrics
# Create sample data
search_results <- data.frame(
  id = paste0("art", 1:10),
  title = c("Diabetes treatment", "Clinical trial", "Diabetes study",
            "Treatment options", "New therapy", "Glucose control",
            "Insulin therapy", "Management of diabetes", "Clinical study",
            "Therapy comparison"),
  abstract = c("This study examines diabetes treatments.",
               "A clinical trial on new treatments.",
               "Diabetes research findings.",
               "Comparison of treatment options.",
               "Novel therapy approach.",
               "Methods to control glucose levels.",
               "Insulin therapy effectiveness.",
               "Managing diabetes effectively.",
               "Clinical research protocols.",
               "Comparing therapy approaches.")
)

# Define search terms
terms <- c("diabetes", "treatment", "clinical", "therapy")

# Define gold standard (relevant articles)
gold_standard <- c("art1", "art3", "art7", "art8")

# Analyze term effectiveness
term_metrics <- term_effectiveness(terms, search_results, gold_standard)
print(term_metrics)
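The per-term quantities listed above can be computed by hand for a single term, which makes the definitions of precision and coverage concrete. The sketch below uses base R only and assumes case-insensitive literal matching in the title field; how term_effectiveness() matches terms internally is not documented here, and the four-article data set is made up.

```r
# Hypothetical mini data set: four articles, three of them relevant
search_results <- data.frame(
  id = paste0("art", 1:4),
  title = c("Diabetes treatment", "Clinical trial", "Diabetes study", "New therapy"),
  stringsAsFactors = FALSE
)
gold_standard <- c("art1", "art3", "art4")

term <- "diabetes"

# Assumed matching rule: literal substring match, ignoring case
hits <- grepl(tolower(term), tolower(search_results$title), fixed = TRUE)

retrieved_ids <- search_results$id[hits]                # art1, art3
relevant_hits <- sum(retrieved_ids %in% gold_standard)  # 2

precision <- relevant_hits / length(retrieved_ids)      # 2 / 2 = 1
coverage <- relevant_hits / length(gold_standard)       # 2 / 3
```

A term can therefore be perfectly precise yet still miss relevant articles, which is why both columns are reported.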
Validate Date Range
validate_date_range(date_range, allow_future = TRUE)
date_range |
Date vector of length 2 |
allow_future |
Logical, whether future dates are allowed |
Logical indicating if valid
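The checks such a validator might perform can be sketched in base R. The rules below (two non-missing Date values, start not after end, optional rejection of future dates) are assumptions for illustration; `check_date_range()` is a hypothetical helper, not the package's validate_date_range().

```r
# Hypothetical validator; the package's actual rules may differ
check_date_range <- function(date_range, allow_future = TRUE) {
  # Must be a Date vector with exactly two non-missing values
  if (!inherits(date_range, "Date")) return(FALSE)
  if (length(date_range) != 2 || anyNA(date_range)) return(FALSE)
  # The start date must not come after the end date
  if (date_range[1] > date_range[2]) return(FALSE)
  # Optionally reject ranges that extend beyond today
  if (!allow_future && any(date_range > Sys.Date())) return(FALSE)
  TRUE
}

valid_range <- check_date_range(as.Date(c("2020-01-01", "2023-12-31")))
reversed_range <- check_date_range(as.Date(c("2023-12-31", "2020-01-01")))
future_range <- check_date_range(c(Sys.Date(), Sys.Date() + 30), allow_future = FALSE)

c(valid = valid_range, reversed = reversed_range, future = future_range)
```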
Validate Search Strategy Object
validate_strategy(search_strategy)
search_strategy |
Search strategy object to validate |
Logical indicating if valid, with warnings for issues