This repository contains demo data and code to process and analyse Chemical Genomics experiments performed in the Ehrt Schnappinger lab. The experimental setup used was developed by the Broad Institute (https://www.nature.com/articles/s41586-019-1315-z).
For the purposes of this demo, the data has been encoded to mask drug and strain names.
- Tuberculosis (TB) is a lung disease and the leading cause of death per year from an infectious agent.
- It is caused by the bacterium Mycobacterium tuberculosis.
- In 2018, the disease infected ~10 million people and caused ~1.5 million deaths.
- 20% of cases exhibit resistance to one or more drugs.
- Understanding the mechanism of action of new anti-TB drugs helps speed up drug discovery.
We developed a chemical-genetic approach to predict the mechanism of action of a new drug.
The overarching goal of this project was to predict mechanisms of action of new drugs by using a library of strains with varying drug susceptibilities as input.
1-Overview_and_goals.ipynb
provides a detailed explanation of the format of the input data and the specific goals addressed by this dataset.
2-Clean_data.ipynb
processes the counts files into cleaned dataframes. Each step writes its intermediate files to a folder under Clean_data_outputs.
3-QC
performs several QC analyses on the processed data and examines possible reasons for missing data points.
4-Model_fitting.ipynb
uses the processed data to fit supervised machine learning models and describes methods for narrowing the number of features required for the analysis.
In each experiment, pools of M. tuberculosis strains depleted of essential targets are screened against compound libraries to determine chemical-genetic interactions. An overview of the experimental setup is as follows:
- Anti-mycobacterial drugs are pipetted at varying concentrations into 96-well plates, with each plate corresponding to one drug. For each drug, seven concentrations are used (0.125X MIC to 8X MIC) in addition to a no-drug control (0.000X MIC). Each drug-concentration combination has six replicates. These details are recorded in
Raw_data/HypoIII_all_drugs_encoded.csv
under the columns 'Dispensedwell', 'Dispensedrow', 'Dispensedcol'.
- Strain pools consist of ~400 M. tuberculosis strains, each containing an inducible depletion system targeting an essential gene. These are pipetted into the drug plates. Simultaneously, depletion of the target gene is induced.
- After the desired incubation time, the optical density (OD) of the plates is recorded. These readings are stored in
Raw_data/181120_HypoIII_Ods_encoded.xlsx
with the ODs under the column 'RawData'.
- Strains are harvested and the barcode-containing region is amplified by PCR. To enable multiplexing, a different p5 index is used for each well and a different p7 index is used for each plate.
- Amplicons are submitted for Illumina sequencing. The resulting FASTQ files are converted into counts files (the code for this is not shown, since it was developed by Tom Ioerger (Texas A&M) and remains unpublished).
- Each Illumina run generates a .counts file. This is a tab-delimited file containing demultiplexed counts for each p5-p7 (and therefore each plate-well) combination and for each strain in the mix. Refer to the files within
Raw_data/Counts_files_encoded
for reference.
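
To illustrate the last step, the sketch below reads a tab-delimited counts table into a dictionary keyed by (p7, p5) index pair, which stands in for a plate-well combination. The actual column layout of the .counts files is not described in this README, so the header names (`p7_index`, `p5_index`, `strain_001`, ...), the example values, and the helper `read_counts` are all assumptions made for illustration. Check them against the files in Raw_data/Counts_files_encoded before reusing any of this.

```python
import csv
import io

# Hypothetical example content. The real .counts layout may differ:
# here each row is one p7/p5 (plate/well) pair and each remaining
# column holds the read count for one strain barcode.
EXAMPLE_COUNTS = (
    "p7_index\tp5_index\tstrain_001\tstrain_002\tstrain_003\n"
    "P7_01\tP5_A01\t120\t0\t45\n"
    "P7_01\tP5_A02\t98\t12\t51\n"
    "P7_02\tP5_A01\t7\t230\t19\n"
)


def read_counts(handle, index_columns=("p7_index", "p5_index")):
    """Return {(p7, p5): {strain: count}} from a tab-delimited table.

    `index_columns` names the columns that identify the plate (p7)
    and the well (p5); every other column is treated as a strain.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {}
    for row in reader:
        key = tuple(row[col] for col in index_columns)
        counts[key] = {
            strain: int(value)
            for strain, value in row.items()
            if strain not in index_columns
        }
    return counts


# Parse the in-memory example; for a real file, pass an open file handle.
counts = read_counts(io.StringIO(EXAMPLE_COUNTS))

# Look up the per-strain counts for one plate-well combination.
well = counts[("P7_01", "P5_A01")]
total_reads = sum(well.values())
```

For a file on disk, the same function could be called as `read_counts(open(path, newline=""))`, where `path` points at one of the .counts files. Keying on the index pair keeps the plate and well identity attached to every count, which is what later joins against the drug layout and OD tables would need.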