add additional columns from the big gnomad VCF file (#14)
* generalize column level

* fix tests

* adjust version

* fix oom error

* add column information from the gnomad vcf

* new columns

* new create file

* genome version

* data folder

* add info for config

* Update README.md

* Update README.md

* Update README.md

* reduce number of columns to reduce db size

* columns

* new test set

* Update README.md

* Update README.md

* Update setup.py

Co-authored-by: Kalin Nonchev <[email protected]>
4 people authored Dec 12, 2021
1 parent 2009f6c commit fea2b54
Showing 16 changed files with 242 additions and 123 deletions.
36 changes: 22 additions & 14 deletions README.md
@@ -1,24 +1,31 @@
# gnomAD_MAF
# gnomAD_VCF

### NEW version (December 2021)
- more variant features available, check [here](https://github.com/KalinNonchev/gnomAD_MAF/blob/gnomad_vcf/gnomad_db/pkgdata/gnomad_columns.yaml)
- `get_maf_from_df` renamed to `get_info_from_df`
- `get_maf_from_str` renamed to `get_info_from_str`


[The Genome Aggregation Database (gnomAD)](https://gnomad.broadinstitute.org) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.

This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 56G for WGS v3.1.1 (about 760.000.000 variants), and allows scientists to look for minor allele frequencies of variants really fast (A query containing 300.000 variants takes ~40s.)
This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 34G for WGS v2.1.1 (261.942.336 variants) and 99G for WGS v3.1.1 (759.302.267 variants), and allows scientists to look up various variant annotations present in gnomAD (e.g. Allele Count, Depth, Minor Allele Frequency; [here](https://github.com/KalinNonchev/gnomAD_MAF/blob/gnomad_vcf/gnomad_db/pkgdata/gnomad_columns.yaml) you can find all selected features given the genome version). A query containing 300.000 variants takes ~40s.

It extracts from a gnomAD vcf the ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"] columns.
It extracts from a gnomAD vcf about 23 variant annotations. You can find further information about the exact fields [here](https://github.com/KalinNonchev/gnomAD_MAF/blob/gnomad_vcf/gnomad_db/gnomad_columns.yaml).

###### The package works for all currently available gnomAD releases.(November 2021)
###### The package works for all currently available gnomAD releases (January 2022).

## 1. Download SQLite preprocessed files

I have preprocessed and created sqlite3 files for gnomAD v2.1.1 and 3.1.1 for you, which can be downloaded below. They contain all variants on the 24 standard chromosomes.

gnomAD v3.1.1 (hg38, **759'302'267** variants) 25G zipped, 56G in total - https://zenodo.org/record/5045170/files/gnomad_db_v3.1.1.sqlite3.gz?download=1 \
gnomAD v2.1.1 (hg19, **261'942'336** variants) 9G zipped, 20G in total - https://zenodo.org/record/5045102/files/gnomad_db_v2.1.1.sqlite3.gz?download=1
gnomAD v3.1.1 (hg38, **759'302'267** variants) 46.9G zipped, 99G in total - https://zenodo.org/record/5758663/files/gnomad_db_v3.1.1.sqlite3.gz?download=1 \
gnomAD v2.1.1 (hg19, **261'942'336** variants) 16.1G zipped, 48G in total - https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1

You can download it as:

```python
from gnomad_db.database import gnomAD_DB
download_link = "https://zenodo.org/record/5045102/files/gnomad_db_v2.1.1.sqlite3.gz?download=1"
download_link = "https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1"
output_dir = "test_dir" # database_location
gnomAD_DB.download_and_unzip(download_link, output_dir)
```
@@ -41,6 +48,7 @@ database_location: "test_out" # where to create the database, make sure you have
gnomad_vcf_location: "data" # where are your *.vcf.bgz located
tables_location: "test_out" # where to store the preprocessed intermediate files, you can leave it like this
script_locations: "test_out" # where to store the scripts, where you can check the progress of your jobs, you can leave it like this
genome: "Grch37" # genome version of the gnomAD vcf file (2.1.1 = Grch37, 3.1.1 = Grch38)
```
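The Snakefile reads these keys at the top of its workflow; a minimal sketch (assuming PyYAML and the field names shown above — the inline YAML is a stand-in, not the shipped file) of loading and sanity-checking the config:

```python
import yaml

# Hypothetical inline copy of script_config.yaml, for illustration only
config_text = """
database_location: "test_out"
gnomad_vcf_location: "data"
tables_location: "test_out"
script_locations: "test_out"
genome: "Grch37"
KERNEL: "gnomad_db"
"""

config = yaml.safe_load(config_text)

# The genome key must match a top-level key in gnomad_columns.yaml
assert config["genome"] in ("Grch37", "Grch38"), "unsupported genome version"
print(config["genome"])  # -> Grch37
```

If the `genome` value does not match the database you downloaded, the column schema will not line up, so checking it early is cheap insurance.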

Once this is done, run
@@ -77,11 +85,11 @@ from gnomad_db.database import gnomAD_DB
```python
# pass dir
database_location = "test_dir"
db = gnomAD_DB(database_location)
db = gnomAD_DB(database_location, genome="Grch37")
```

3. Insert some test variants to run the examples below \
**If you have downloaded the preprocessed sqlite3 files, you can skip this step as you already have variants**
**If you have downloaded the preprocessed sqlite3 files, you can skip this step as you already have variants; just make sure to use the correct genome version!**
```python
# get some variants
var_df = pd.read_csv("data/test_vcf_gnomad_chr21_10000.tsv.gz", sep="\t", names=db.columns, index_col=False)
@@ -101,21 +109,21 @@ dummy_var_df = pd.DataFrame({
"alt": ["G", "T"]})

# query from dataframe AF column
db.get_maf_from_df(dummy_var_df, "AF")
db.get_info_from_df(dummy_var_df, "AF")

# query from dataframe AF and AF_popmax columns
db.get_maf_from_df(dummy_var_df, "AF, AF_popmax")
db.get_info_from_df(dummy_var_df, "AF, AF_popmax")

# query from dataframe all columns
db.get_maf_from_df(dummy_var_df, "*")
db.get_info_from_df(dummy_var_df, "*")

# query from string
db.get_maf_from_str("21:9825790:C>T", "AF")
db.get_info_from_str("21:9825790:C>T", "AF")
```
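The string form accepted by `get_info_from_str` is `chrom:pos:ref>alt`. A small sketch of how such a string can be split into its parts (a hypothetical helper for illustration, not part of the package API):

```python
def parse_variant(var: str):
    # Split "chrom:pos:ref>alt" into its four components
    chrom, pos, refalt = var.split(":")
    ref, alt = refalt.split(">")
    return chrom, int(pos), ref, alt

print(parse_variant("21:9825790:C>T"))  # -> ('21', 9825790, 'C', 'T')
```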

5. You can also query intervals of minor allele frequencies
```python
db.get_mafs_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
db.get_info_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
```

For more information on how to use the package, look into the GettingStartedwithGnomAD_DB.ipynb notebook!
28 changes: 15 additions & 13 deletions Snakefile
@@ -14,12 +14,14 @@ database_location = config['database_location']
gnomad_vcf_location = config['gnomad_vcf_location']
tables_location = config['tables_location']
script_locations = config['script_locations']
genome = config['genome']
KERNEL = config['KERNEL']


rule all:
input:
script_locations + "/scripts/GettingStartedwithGnomAD_DB.ipynb"
script_locations + "/scripts/insertVariants.ipynb"


# -------------------------- EXTRACT VARIANTS WITH MAF FROM gnomAD VCF --------------------------
rule extract_tables:
@@ -30,7 +32,7 @@ rule extract_tables:
message:
"Running createTSVtables notebook..."
shell:
"papermill {input.notebook} {output.notebook} -p gnomad_vcf_location {gnomad_vcf_location} -p tables_location {tables_location} -k {KERNEL}"
"papermill {input.notebook} {output.notebook} -p gnomad_vcf_location {gnomad_vcf_location} -p tables_location {tables_location} -p genome {genome} -k {KERNEL}"


# -------------------------- INSERT VARIANTS WITH MAF TO DATABASE ------------------------------
@@ -43,16 +45,16 @@ rule insert_variants:
message:
"Running insertVariants notebook..."
shell:
"papermill {input.notebook} {output.notebook} -p database_location {database_location} -p tables_location {tables_location} -k {KERNEL}"
"papermill {input.notebook} {output.notebook} -p database_location {database_location} -p tables_location {tables_location} -p genome {genome} -k {KERNEL}"

# -------------------------- INSERT VARIANTS WITH MAF TO DATABASE ------------------------------
rule create_GettingStartedNB:
input:
script_locations + "/scripts/insertVariants.ipynb",
notebook = "scripts/GettingStartedwithGnomAD_DB.ipynb"
output:
notebook = script_locations + "/scripts/GettingStartedwithGnomAD_DB.ipynb"
message:
"Running GettingStartedwithGnomAD_DB notebook.... Take a look here!"
shell:
"papermill {input.notebook} {output.notebook} -p database_location {database_location} -k {KERNEL}"
#rule create_GettingStartedNB:
# input:
# script_locations + "/scripts/insertVariants.ipynb",
# notebook = "scripts/GettingStartedwithGnomAD_DB.ipynb"
# output:
# notebook = script_locations + "/scripts/GettingStartedwithGnomAD_DB.ipynb"
# message:
# "Running GettingStartedwithGnomAD_DB notebook.... Take a look here!"
# shell:
# "papermill {input.notebook} {output.notebook} -p database_location {database_location} -k {KERNEL}"
Binary file modified data/test_vcf_gnomad_chr21_10000.tsv.gz
2 changes: 1 addition & 1 deletion environment.yaml
@@ -22,4 +22,4 @@ dependencies:
- joblib
- pytest
- nbformat>=5.1
- joblib
- joblib
51 changes: 27 additions & 24 deletions gnomad_db/database.py
@@ -5,21 +5,29 @@
import multiprocessing
from joblib import Parallel, delayed
from . import utils

import yaml
import pkg_resources

class gnomAD_DB:

def __init__(self, genodb_path, parallel=False, cpu_count=None):
def __init__(self, genodb_path, genome="Grch38", parallel=False, cpu_count=None):


self.parallel = parallel
self.genome = genome

if self.parallel:
self.cpu_count = cpu_count if isinstance(cpu_count, int) else int(multiprocessing.cpu_count())

self.db_file = os.path.join(genodb_path, 'gnomad_db.sqlite3')

self.columns = ["chrom", "pos", "ref", "alt", "AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"]
columns_path = pkg_resources.resource_filename("gnomad_db", "pkgdata/gnomad_columns.yaml")

with open(columns_path) as f:
columns = yaml.load(f, Loader=yaml.FullLoader)

self.columns = list(map(lambda x: x.lower(), columns["base_columns"])) + columns[self.genome]
self.dict_columns = columns

if not os.path.exists(self.db_file):
if not os.path.exists(genodb_path):
@@ -33,20 +41,15 @@ def open_dbconn(self):


def create_table(self):
sql_create = """
value_columns = ",".join([f"{col} REAL" for col in self.dict_columns[self.genome]])
sql_create = f"""
CREATE TABLE gnomad_db (
chrom TEXT,
pos INTEGER,
ref TEXT,
alt TEXT,
AF REAL,
AF_afr REAL,
AF_eas REAL,
AF_fin REAL,
AF_nfe REAL,
AF_asj REAL,
AF_oth REAL,
AF_popmax REAL,
filter TEXT,
{value_columns},
PRIMARY KEY (chrom, pos, ref, alt));
"""

@@ -63,8 +66,6 @@ def insert_variants(self, var_df: pd.DataFrame):
), "Columns are missing. The dataframe should contain: " + ", ".join(self.columns)




## sort and process var_df
var_df = var_df.reindex(self.columns, axis=1)
var_df = self._sanitize_variants(var_df)
@@ -77,26 +78,28 @@

rows = [tuple(x) for x in var_df.to_numpy()]

num_values = ",".join(["?" for col in self.columns])

sql_input = f"""
INSERT OR REPLACE INTO gnomad_db({", ".join(self.columns)})
VALUES (?,?,?,?,?,?,?,?,?,?,?,?)
VALUES ({num_values})
"""

with self.open_dbconn() as conn:
c = conn.cursor()
c.executemany(sql_input, rows)


def _sanitize_variants(self, var_df: pd.DataFrame) -> pd.DataFrame:
var_df = var_df.replace(".", np.NaN)
var_df["pos"] = var_df["pos"].astype(int)
var_df["chrom"] = var_df["chrom"].astype(str)
var_df["chrom"] = var_df.chrom.apply(lambda x: x.replace("chr", ""))
return var_df

def _pack_var_args(self, var: pd.Series) -> str:
return (var.chrom, var.pos, var.ref, var.alt)

def _get_maf_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:

def _get_info_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:
var_df = self._sanitize_variants(var_df)

rows = [self._pack_var_args(var) for _, var in var_df.iterrows()]
@@ -133,22 +136,22 @@ def _get_maf_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:



def get_maf_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:
def get_info_from_df(self, var_df: pd.DataFrame, query: str="AF") -> pd.Series:
if var_df.empty:
return var_df

if self.parallel and len(var_df) > 100 * self.cpu_count:
out = np.array_split(var_df, self.cpu_count)
assert len(out) == self.cpu_count

delayed_get_maf_from_df = delayed(self._get_maf_from_df)
delayed_get_maf_from_df = delayed(self._get_info_from_df)
out = Parallel(self.cpu_count, prefer="threads")(delayed_get_maf_from_df(df, query) for df in out)

out = pd.concat(out)
out.set_index(var_df.index, inplace=True)
assert len(var_df) == len(out)
else:
out = self._get_maf_from_df(var_df, query)
out = self._get_info_from_df(var_df, query)

return out

@@ -175,7 +178,7 @@ def query_direct(self, sql_query: str):
c = conn.cursor()
return pd.read_sql_query(sql_query, conn)

def get_mafs_for_interval(self, chrom: str, interval_start: int, interval_end: int, query: str="AF") -> pd.DataFrame:
def get_info_for_interval(self, chrom: str, interval_start: int, interval_end: int, query: str="AF") -> pd.DataFrame:

query = self._query_columns(query)

@@ -188,9 +191,9 @@ def get_mafs_for_interval(self, chrom: str, interval_start: int, interval_end: i



def get_maf_from_str(self, var: str, query: str="AF") -> float:
def get_info_from_str(self, var: str, query: str="AF") -> float:
"""
get AF for variant in form chrom:pos:ref>alt
get columns for variant in form chrom:pos:ref>alt
"""

chrom, pos, ref, alt = self._pack_from_str(var)
46 changes: 46 additions & 0 deletions gnomad_db/pkgdata/gnomad_columns.yaml
@@ -0,0 +1,46 @@
base_columns:
- CHROM
- POS
- REF
- ALT
- FILTER
Grch37:
- AC # Alternate allele count for samples
- AN # Total number of alleles in samples
- AF # Alternate allele frequency in samples
- rf_tp_probability # Random forest prediction probability for a site being a true variant
- MQ # Root mean square of the mapping quality of reads across all samples
- QD # Variant call confidence normalized by depth of sample reads supporting a variant
- ReadPosRankSum # Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias
- DP # Depth of informative coverage for each sample; reads with MQ=255 or with bad mates are filtered
- VQSLOD # Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model
- AC_popmax # Allele count in the population with the maximum AF
- AN_popmax # Total number of alleles in the population with the maximum AF
- AF_popmax # Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry)
- AF_eas
- AF_oth
- AF_nfe
- AF_fin
- AF_afr
- AF_asj
Grch38:
- AC # Alternate allele count for samples
- AN # Total number of alleles in samples
- AF # Alternate allele frequency in samples
- InbreedingCoeff # Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation
- MQ # Root mean square of the mapping quality of reads across all samples
- QD # Variant call confidence normalized by depth of sample reads supporting a variant
- ReadPosRankSum # Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias
# - DP # Depth of informative coverage for each sample; reads with MQ=255 or with bad mates are filtered
- VarDP
- AS_VQSLOD
# - VQSLOD # Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model
- AC_popmax # Allele count in the population with the maximum AF
- AN_popmax # Total number of alleles in the population with the maximum AF
- AF_popmax # Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry)
- AF_eas
- AF_oth
- AF_nfe
- AF_fin
- AF_afr
- AF_asj
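`database.py` assembles its table columns from this file: base columns are lowercased, then the genome-specific annotation columns are appended. A sketch of that assembly (the inline YAML is a trimmed stand-in for `pkgdata/gnomad_columns.yaml`, not the shipped file):

```python
import yaml

# Trimmed inline stand-in for pkgdata/gnomad_columns.yaml
columns_yaml = """
base_columns: [CHROM, POS, REF, ALT, FILTER]
Grch37: [AC, AN, AF, AF_popmax]
Grch38: [AC, AN, AF, AF_popmax]
"""

columns = yaml.safe_load(columns_yaml)
genome = "Grch37"

# Mirrors gnomAD_DB.__init__: lowercase base columns, then genome-specific ones
table_columns = [c.lower() for c in columns["base_columns"]] + columns[genome]
print(table_columns)
# -> ['chrom', 'pos', 'ref', 'alt', 'filter', 'AC', 'AN', 'AF', 'AF_popmax']
```

Keeping the YAML as the single source of truth means adding or dropping an annotation only requires editing this file, not the SQL in `create_table`.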
11 changes: 6 additions & 5 deletions script_config.yaml
@@ -1,5 +1,6 @@
database_location: "test_out"
gnomad_vcf_location: "data"
tables_location: "test_out"
script_locations: "test_out"
KERNEL: "gnomad_db"
database_location: "test_out" # where to create the database, make sure you have space on your device.
gnomad_vcf_location: "data" # where are your *.vcf.bgz located
tables_location: "test_out" # where to store the preprocessed intermediate files, you can leave it like this
script_locations: "test_out" # where to store the scripts, where you can check the progress of your jobs, you can leave it like this
genome: "Grch37" # genome version of the gnomAD vcf file (2.1.1 = Grch37, 3.1.1 = Grch38)
KERNEL: "gnomad_db"