add additional columns from the big gnomad VCF file (#14)
* generalize column level

* fix tests

* adjust version

* fix oom error

* add column information from the gnomad vcf

* new columns

* new create file

* genome version

* data folder

* add info for config

* Update README.md

* Update README.md

* Update README.md

* reduce number of columns to reduce db size

* columns

* new test set

* Update README.md

* Update README.md

* Update setup.py

Co-authored-by: Kalin Nonchev <[email protected]>
4 people authored Dec 12, 2021
1 parent 2009f6c commit fea2b54
Showing 16 changed files with 242 additions and 123 deletions.
36 changes: 22 additions & 14 deletions README.md
@@ -1,24 +1,31 @@
# gnomAD_MAF
# gnomAD_VCF

### NEW version (December 2021)
- more variant features available, check [here](https://github.com/KalinNonchev/gnomAD_MAF/blob/gnomad_vcf/gnomad_db/pkgdata/gnomad_columns.yaml)
- `get_maf_from_df` renamed to `get_info_from_df`
- `get_maf_from_str` renamed to `get_info_from_str`


[The Genome Aggregation Database (gnomAD)](https://gnomad.broadinstitute.org) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.

This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 56G for WGS v3.1.1 (about 760.000.000 variants), and allows scientists to look for minor allele frequencies of variants really fast (A query containing 300.000 variants takes ~40s.)
This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 34G for WGS v2.1.1 (261.942.336 variants) and 99G for WGS v3.1.1 (759.302.267 variants), and allows scientists to look up various variant annotations present in gnomAD (e.g. Allele Count, Depth, Minor Allele Frequency; [here](https://github.com/KalinNonchev/gnomAD_MAF/blob/gnomad_vcf/gnomad_db/pkgdata/gnomad_columns.yaml) you can find all selected features given the genome version). A query containing 300.000 variants takes ~40s.

It extracts from a gnomAD vcf the ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"] columns.
It extracts from a gnomAD vcf about 23 variant annotations. You can find further information about the exact fields [here](https://github.com/KalinNonchev/gnomAD_MAF/blob/gnomad_vcf/gnomad_db/gnomad_columns.yaml).

###### The package works for all currently available gnomAD releases.(November 2021)
###### The package works for all currently available gnomAD releases (January 2022).

## 1. Download SQLite preprocessed files

I have preprocessed and created sqlite3 files for gnomAD v2.1.1 and 3.1.1 for you, which can be downloaded below. They contain all variants on the 24 standard chromosomes.

gnomAD v3.1.1 (hg38, **759'302'267** variants) 25G zipped, 56G in total - https://zenodo.org/record/5045170/files/gnomad_db_v3.1.1.sqlite3.gz?download=1 \
gnomAD v2.1.1 (hg19, **261'942'336** variants) 9G zipped, 20G in total - https://zenodo.org/record/5045102/files/gnomad_db_v2.1.1.sqlite3.gz?download=1
gnomAD v3.1.1 (hg38, **759'302'267** variants) 46.9G zipped, 99G in total - https://zenodo.org/record/5758663/files/gnomad_db_v3.1.1.sqlite3.gz?download=1 \
gnomAD v2.1.1 (hg19, **261'942'336** variants) 16.1G zipped, 48G in total - https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1

You can download it as:

```python
from gnomad_db.database import gnomAD_DB
download_link = "https://zenodo.org/record/5045102/files/gnomad_db_v2.1.1.sqlite3.gz?download=1"
download_link = "https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1"
output_dir = "test_dir" # database_location
gnomAD_DB.download_and_unzip(download_link, output_dir)
```
@@ -41,6 +48,7 @@ database_location: "test_out" # where to create the database, make sure you have
gnomad_vcf_location: "data" # where are your *.vcf.bgz located
tables_location: "test_out" # where to store the preprocessed intermediate files, you can leave it like this
script_locations: "test_out" # where to store the scripts, where you can check the progress of your jobs, you can leave it like this
genome: "Grch37" # genome version of the gnomAD vcf file (2.1.1 = Grch37, 3.1.1 = Grch38)
```
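The Snakefile reads these keys at the top of its workflow; a minimal sketch (assuming PyYAML and the field names shown above — the inline YAML is a stand-in, not the shipped file) of loading and sanity-checking the config:

```python
import yaml

# Hypothetical inline copy of script_config.yaml, for illustration only
config_text = """
database_location: "test_out"
gnomad_vcf_location: "data"
tables_location: "test_out"
script_locations: "test_out"
genome: "Grch37"
KERNEL: "gnomad_db"
"""

config = yaml.safe_load(config_text)

# The genome key must match a top-level key in gnomad_columns.yaml
assert config["genome"] in ("Grch37", "Grch38"), "unsupported genome version"
print(config["genome"])  # -> Grch37
```

If the `genome` value does not match the database you downloaded, the column schema will not line up, so checking it early is cheap insurance.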

Once this is done, run
@@ -77,11 +85,11 @@ from gnomad_db.database import gnomAD_DB
```python
# pass dir
database_location = "test_dir"
db = gnomAD_DB(database_location)
db = gnomAD_DB(database_location, genome="Grch37")
```

3. Insert some test variants to run the examples below \
**If you have downloaded the preprocessed sqlite3 files, you can skip this step as you already have variants**
**If you have downloaded the preprocessed sqlite3 files, you can skip this step as you already have variants; just make sure to use the correct genome version!**
```python
# get some variants
var_df = pd.read_csv("data/test_vcf_gnomad_chr21_10000.tsv.gz", sep="\t", names=db.columns, index_col=False)
@@ -101,21 +109,21 @@ dummy_var_df = pd.DataFrame({
"alt": ["G", "T"]})

# query from dataframe AF column
db.get_maf_from_df(dummy_var_df, "AF")
db.get_info_from_df(dummy_var_df, "AF")

# query from dataframe AF and AF_popmax columns
db.get_maf_from_df(dummy_var_df, "AF, AF_popmax")
db.get_info_from_df(dummy_var_df, "AF, AF_popmax")

# query from dataframe all columns
db.get_maf_from_df(dummy_var_df, "*")
db.get_info_from_df(dummy_var_df, "*")

# query from string
db.get_maf_from_str("21:9825790:C>T", "AF")
db.get_info_from_str("21:9825790:C>T", "AF")
```
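The string form accepted by `get_info_from_str` is `chrom:pos:ref>alt`. A small sketch of how such a string can be split into its parts (a hypothetical helper for illustration, not part of the package API):

```python
def parse_variant(var: str):
    # Split "chrom:pos:ref>alt" into its four components
    chrom, pos, refalt = var.split(":")
    ref, alt = refalt.split(">")
    return chrom, int(pos), ref, alt

print(parse_variant("21:9825790:C>T"))  # -> ('21', 9825790, 'C', 'T')
```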

5. You can also query intervals of minor allele frequencies
```python
db.get_mafs_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
db.get_info_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
```

For more information on how to use the package, look into the GettingStartedwithGnomAD_DB.ipynb notebook!
28 changes: 15 additions & 13 deletions Snakefile
@@ -14,12 +14,14 @@ database_location = config['database_location']
gnomad_vcf_location = config['gnomad_vcf_location']
tables_location = config['tables_location']
script_locations = config['script_locations']
genome = config['genome']
KERNEL = config['KERNEL']


rule all:
input:
script_locations + "/scripts/GettingStartedwithGnomAD_DB.ipynb"
script_locations + "/scripts/insertVariants.ipynb"


# -------------------------- EXTRACT VARIANTS WITH MAF FROM gnomAD VCF --------------------------
rule extract_tables:
@@ -30,7 +32,7 @@ rule extract_tables:
message:
"Running createTSVtables notebook..."
shell:
"papermill {input.notebook} {output.notebook} -p gnomad_vcf_location {gnomad_vcf_location} -p tables_location {tables_location} -k {KERNEL}"
"papermill {input.notebook} {output.notebook} -p gnomad_vcf_location {gnomad_vcf_location} -p tables_location {tables_location} -p genome {genome} -k {KERNEL}"


# -------------------------- INSERT VARIANTS WITH MAF TO DATABASE ------------------------------
@@ -43,16 +45,16 @@ rule insert_variants:
message:
"Running insertVariants notebook..."
shell:
"papermill {input.notebook} {output.notebook} -p database_location {database_location} -p tables_location {tables_location} -k {KERNEL}"
"papermill {input.notebook} {output.notebook} -p database_location {database_location} -p tables_location {tables_location} -p genome {genome} -k {KERNEL}"

# -------------------------- INSERT VARIANTS WITH MAF TO DATABASE ------------------------------
rule create_GettingStartedNB:
input:
script_locations + "/scripts/insertVariants.ipynb",
notebook = "scripts/GettingStartedwithGnomAD_DB.ipynb"
output:
notebook = script_locations + "/scripts/GettingStartedwithGnomAD_DB.ipynb"
message:
"Running GettingStartedwithGnomAD_DB notebook.... Take a look here!"
shell:
"papermill {input.notebook} {output.notebook} -p database_location {database_location} -k {KERNEL}"
#rule create_GettingStartedNB:
# input:
# script_locations + "/scripts/insertVariants.ipynb",
# notebook = "scripts/GettingStartedwithGnomAD_DB.ipynb"
# output:
# notebook = script_locations + "/scripts/GettingStartedwithGnomAD_DB.ipynb"
# message:
# "Running GettingStartedwithGnomAD_DB notebook.... Take a look here!"
# shell:
# "papermill {input.notebook} {output.notebook} -p database_location {database_location} -k {KERNEL}"
Binary file modified data/test_vcf_gnomad_chr21_10000.tsv.gz
2 changes: 1 addition & 1 deletion environment.yaml
@@ -22,4 +22,4 @@ dependencies:
- joblib
- pytest
- nbformat>=5.1
- joblib
- joblib
51 changes: 27 additions & 24 deletions gnomad_db/database.py
@@ -5,21 +5,29 @@
import multiprocessing
from joblib import Parallel, delayed
from . import utils

import yaml
import pkg_resources

class gnomAD_DB:

def __init__(self, genodb_path, parallel=False, cpu_count=None):
def __init__(self, genodb_path, genome="Grch38", parallel=False, cpu_count=None):


self.parallel = parallel
self.genome = genome

if self.parallel:
self.cpu_count = cpu_count if isinstance(cpu_count, int) else int(multiprocessing.cpu_count())

self.db_file = os.path.join(genodb_path, 'gnomad_db.sqlite3')

self.columns = ["chrom", "pos", "ref", "alt", "AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"]
columns_path = pkg_resources.resource_filename("gnomad_db", "pkgdata/gnomad_columns.yaml")

with open(columns_path) as f:
columns = yaml.load(f, Loader=yaml.FullLoader)

self.columns = list(map(lambda x: x.lower(), columns["base_columns"])) + columns[self.genome]
self.dict_columns = columns

if not os.path.exists(self.db_file):
if not os.path.exists(genodb_path):
@@ -33,20 +41,15 @@ def open_dbconn(self):


def create_table(self):
sql_create = """
value_columns = ",".join([f"{col} REAL" for col in self.dict_columns[self.genome]])
sql_create = f"""
CREATE TABLE gnomad_db (
chrom TEXT,
pos INTEGER,
ref TEXT,
alt TEXT,
AF REAL,
AF_afr REAL,
AF_eas REAL,
AF_fin REAL,
AF_nfe REAL,
AF_asj REAL,
AF_oth REAL,
AF_popmax REAL,
filter TEXT,
{value_columns},
PRIMARY KEY (chrom, pos, ref, alt));
"""

@@ -63,8 +66,6 @@ def insert_variants(self, var_df: pd.DataFrame):
), "Columns are missing. The dataframe should contain: " + ", ".join(self.columns)




## sort and process var_df
var_df = var_df.reindex(self.columns, axis=1)
var_df = self._sanitize_variants(var_df)
@@ -77,26 +78,28 @@

rows = [tuple(x) for x in var_df.to_numpy()]

num_values = ",".join(["?" for col in self.columns])

sql_input = f"""
INSERT OR REPLACE INTO gnomad_db({", ".join(self.columns)})
VALUES (?,?,?,?,?,?,?,?,?,?,?,?)
VALUES ({num_values})
"""

with self.open_dbconn() as conn:
c = conn.cursor()
c.executemany(sql_input, rows)


def _sanitize_variants(self, var_df: pd.DataFrame) -> pd.DataFrame:
var_df = var_df.replace(".", np.NaN)
var_df["pos"] = var_df["pos"].astype(int)
var_df["chrom"] = var_df["chrom"].astype(str)
var_df["chrom"] = var_df.chrom.apply(lambda x: x.replace("chr", ""))
return var_df

def _pack_var_args(self, var: pd.Series) -> str:
return (var.chrom, var.pos, var.ref, var.alt)

def _get_maf_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:

def _get_info_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:
var_df = self._sanitize_variants(var_df)

rows = [self._pack_var_args(var) for _, var in var_df.iterrows()]
@@ -133,22 +136,22 @@ def _get_maf_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:



def get_maf_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:
def get_info_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:
if var_df.empty:
return var_df

if self.parallel and len(var_df) > 100 * self.cpu_count:
out = np.array_split(var_df, self.cpu_count)
assert len(out) == self.cpu_count

delayed_get_maf_from_df = delayed(self._get_maf_from_df)
delayed_get_maf_from_df = delayed(self._get_info_from_df)
out = Parallel(self.cpu_count, prefer="threads")(delayed_get_maf_from_df(df, query) for df in out)

out = pd.concat(out)
out.set_index(var_df.index, inplace=True)
assert len(var_df) == len(out)
else:
out = self._get_maf_from_df(var_df, query)
out = self._get_info_from_df(var_df, query)

return out

@@ -175,7 +178,7 @@ def query_direct(self, sql_query: str):
c = conn.cursor()
return pd.read_sql_query(sql_query, conn)

def get_mafs_for_interval(self, chrom: str, interval_start: int, interval_end: int, query: str="AF") -> pd.DataFrame:
def get_info_for_interval(self, chrom: str, interval_start: int, interval_end: int, query: str="AF") -> pd.DataFrame:

query = self._query_columns(query)

@@ -188,9 +191,9 @@ def get_mafs_for_interval(self, chrom: str, interval_start: int, interval_end: i



def get_maf_from_str(self, var: str, query: str="AF") -> float:
def get_info_from_str(self, var: str, query: str="AF") -> float:
"""
get AF for variant in form chrom:pos:ref>alt
get columns for variant in form chrom:pos:ref>alt
"""

chrom, pos, ref, alt = self._pack_from_str(var)
46 changes: 46 additions & 0 deletions gnomad_db/pkgdata/gnomad_columns.yaml
@@ -0,0 +1,46 @@
base_columns:
- CHROM
- POS
- REF
- ALT
- FILTER
Grch37:
- AC # Alternate allele count for samples
- AN # Total number of alleles in samples
- AF # Alternate allele frequency in samples
- rf_tp_probability # Random forest prediction probability for a site being a true variant
- MQ # Root mean square of the mapping quality of reads across all samples
- QD # Variant call confidence normalized by depth of sample reads supporting a variant
- ReadPosRankSum # Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias
- DP # Depth of informative coverage for each sample; reads with MQ=255 or with bad mates are filtered
- VQSLOD # Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model
- AC_popmax # Allele count in the population with the maximum AF
- AN_popmax # Total number of alleles in the population with the maximum AF
- AF_popmax # Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry)
- AF_eas
- AF_oth
- AF_nfe
- AF_fin
- AF_afr
- AF_asj
Grch38:
- AC # Alternate allele count for samples
- AN # Total number of alleles in samples
- AF # Alternate allele frequency in samples
- InbreedingCoeff # Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation
- MQ # Root mean square of the mapping quality of reads across all samples
- QD # Variant call confidence normalized by depth of sample reads supporting a variant
- ReadPosRankSum # Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias
# - DP # Depth of informative coverage for each sample; reads with MQ=255 or with bad mates are filtered
- VarDP
- AS_VQSLOD
# - VQSLOD # Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model
- AC_popmax # Allele count in the population with the maximum AF
- AN_popmax # Total number of alleles in the population with the maximum AF
- AF_popmax # Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry)
- AF_eas
- AF_oth
- AF_nfe
- AF_fin
- AF_afr
- AF_asj
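`database.py` assembles its table columns from this file: base columns are lowercased, then the genome-specific annotation columns are appended. A sketch of that assembly (the inline YAML is a trimmed stand-in for `pkgdata/gnomad_columns.yaml`, not the shipped file):

```python
import yaml

# Trimmed inline stand-in for pkgdata/gnomad_columns.yaml
columns_yaml = """
base_columns: [CHROM, POS, REF, ALT, FILTER]
Grch37: [AC, AN, AF, AF_popmax]
Grch38: [AC, AN, AF, AF_popmax]
"""

columns = yaml.safe_load(columns_yaml)
genome = "Grch37"

# Mirrors gnomAD_DB.__init__: lowercase base columns, then genome-specific ones
table_columns = [c.lower() for c in columns["base_columns"]] + columns[genome]
print(table_columns)
# -> ['chrom', 'pos', 'ref', 'alt', 'filter', 'AC', 'AN', 'AF', 'AF_popmax']
```

Keeping the YAML as the single source of truth means adding or dropping an annotation only requires editing this file, not the SQL in `create_table`.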
11 changes: 6 additions & 5 deletions script_config.yaml
@@ -1,5 +1,6 @@
database_location: "test_out"
gnomad_vcf_location: "data"
tables_location: "test_out"
script_locations: "test_out"
KERNEL: "gnomad_db"
database_location: "test_out" # where to create the database, make sure you have space on your device.
gnomad_vcf_location: "data" # where are your *.vcf.bgz located
tables_location: "test_out" # where to store the preprocessed intermediate files, you can leave it like this
script_locations: "test_out" # where to store the scripts, where you can check the progress of your jobs, you can leave it like this
genome: "Grch37" # genome version of the gnomAD vcf file (2.1.1 = Grch37, 3.1.1 = Grch38)
KERNEL: "gnomad_db"