TIGA Workflow

Steps for updating the TIGA dataset from sources.

Dependencies

  • R 4.2+; readr, data.table, igraph, muStat, RMySQL (Webapp: shiny, DT, shinyBS, shinysky, plotly)
  • Python 3.9+; pandas, venv, BioClients
  • Java 8+; Jena, IU_IDSL_JENA

Steps

  1. Download latest files from the NHGRI-EBI GWAS Catalog. See FTP site for latest and all releases. Required files:
    • gwas-catalog-studies_ontology-annotated.tsv
    • gwas-catalog-associations_ontology-annotated.tsv
  2. Download from Experimental Factor Ontology (EFO):
    • efo.owl
  3. Edit LATEST_RELEASE_GWC.txt and LATEST_RELEASE_EFO.txt accordingly.
  4. Create venv virtual environment for Python.
  5. mkdir venv
  6. cd venv
  7. venv -i ../venv_requirements.txt
  8. Run Go_TIGA_Workflow.sh. The commands can also be run manually, as described in the following steps.
  9. Clean studies:
    • gwascat_gwas.R
  10. Clean, separate OR_or_beta into oddsratio, beta columns:
    • gwascat_assn.R
  11. Convert EFO OWL to TSV:
    • java -jar iu_idsl_jena-0.0.1-SNAPSHOT-jar-with-dependencies.jar
  12. From EFO TSV create GraphML:
    • efo_graph.R
  13. Clean traits:
    • gwascat_trait.R
  14. MAPPED GENES: Separate mapped into up-/down-stream.
    • snp2gene_mapped.pl
  15. Get iCite RCRs for studies via PMIDs:
    • python3 -m BioClients.icite.Client get_stats
  16. Get Ensembl annotations for mapped genes via EnsemblIds:
    • python3 -m BioClients.ensembl.Client get_info
  17. Get IDG TCRD gene annotations:
    • python3 -m BioClients.idg.tcrd.Client listTargets
  18. Run the commands in Go_gwascat_DbCreate.sh to build the MySQL database. This writes the file gwas_counts.tsv.
  19. Pre-process and filter. Studies, genes and traits may be removed due to insufficient evidence, with reasons recorded.
    • tiga_gt_prepfilter.R
  20. Provenance for gene-trait pairs (STUDY_ACCESSION, PUBMEDID).
    • tiga_gt_provenance.R
  21. Generate variables, statistics, evidence features for gene-trait pairs.
    • tiga_gt_variables.R
  22. Score and rank gene-trait pairs based on selected variables.
    • tiga_gt_stats.R
  23. TIGA web app requires files:
    1. gwascat_gwas.tsv
    2. filtered_genes.tsv
    3. filtered_studies.tsv
    4. filtered_traits.tsv
    5. gt_provenance.tsv.gz
    6. gt_stats.tsv.gz
    7. efo_graph.graphml.gz
    8. gwascat_release.txt
    9. efo_release.txt
    10. tcrd_info.tsv
  24. TIGA download files should be copied to the TIGA Download Directory for automated access.
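
Before deploying the web app, it can help to confirm that every file listed in step 23 was actually produced. The following is a minimal sketch, not part of the TIGA repository: the function name `missing_app_files` and the idea of a single output directory are assumptions made for illustration.

```python
from pathlib import Path

# File names copied from step 23 of the workflow.
REQUIRED_FILES = [
    "gwascat_gwas.tsv",
    "filtered_genes.tsv",
    "filtered_studies.tsv",
    "filtered_traits.tsv",
    "gt_provenance.tsv.gz",
    "gt_stats.tsv.gz",
    "efo_graph.graphml.gz",
    "gwascat_release.txt",
    "efo_release.txt",
    "tcrd_info.tsv",
]


def missing_app_files(data_dir):
    """Return the required file names that are absent from data_dir.

    data_dir is a hypothetical directory holding the workflow outputs;
    the actual location depends on how the workflow scripts are configured.
    """
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

Calling `missing_app_files` on the output directory returns an empty list when all ten files are present, so the result can gate a deployment script.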

Notes

  • Split comma-separated fields and convert characters to UTF-8.
  • Gene-trait association variables:
    • N_study: studies supporting gene-trait association
    • N_snp: SNPs involved with gene-trait association
    • N_snpw(*): SNPs involved with gene-trait association weighted by genomic distance
    • RCRAS(*): RCR Aggregated Score
    • pValue(*): max SNP pValues
    • OR: median(OR), where OR = odds ratio
    • N_beta: count of supporting beta values
    • geneNtrait: total traits associated with gene
    • traitNgene: total genes associated with trait
  • Gene-trait scores and ranks:
    • meanRank: mean of the ranks of the variables marked (*), which were selected by benchmark validation.
    • meanRankScore: 100 - Percentile(meanRank)
  • The MySQL database is currently not required for the TIGA app.
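
The meanRank and meanRankScore definitions in the notes above can be illustrated with a small standard-library sketch. It is not the code in tiga_gt_stats.R, and several choices are assumptions: the ranked variables are taken to be the three marked (*) (N_snpw, RCRAS, pValue), larger N_snpw and RCRAS and smaller pValue are treated as stronger evidence, ties share an average rank, and Percentile(meanRank) is taken to be the percentage of pairs whose meanRank is less than or equal to the given one. The function names are hypothetical.

```python
from statistics import mean


def average_ranks(values, descending=False):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def mean_rank_scores(pairs):
    """Compute (meanRank, meanRankScore) lists for gene-trait pairs.

    pairs is a list of dicts with keys N_snpw, RCRAS and pValue.
    Rank 1 is the strongest evidence under the assumed directions.
    """
    r_snpw = average_ranks([p["N_snpw"] for p in pairs], descending=True)
    r_rcras = average_ranks([p["RCRAS"] for p in pairs], descending=True)
    r_pval = average_ranks([p["pValue"] for p in pairs], descending=False)
    mean_rank = [mean(t) for t in zip(r_snpw, r_rcras, r_pval)]
    n = len(mean_rank)
    # Assumed percentile definition: share of pairs with meanRank <= this one.
    percentile = [100.0 * sum(other <= m for other in mean_rank) / n for m in mean_rank]
    score = [100.0 - p for p in percentile]
    return mean_rank, score
```

Under this percentile definition the best-ranked pair of n scores 100 - 100/n rather than exactly 100; a different percentile convention would shift the scale but not the ordering.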