Skip to content

Commit

Permalink
update v4 (#24)
Browse files Browse the repository at this point in the history
* update v4

---------

Co-authored-by: Kalin Nonchev
  • Loading branch information
KalinNonchev authored Nov 4, 2023
1 parent afdfef2 commit df7b61c
Show file tree
Hide file tree
Showing 17 changed files with 313 additions and 354 deletions.
28 changes: 18 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,33 +2,41 @@

# gnomAD_DB

### Changelog
#### Changelog

#### NEW version (July 2022)
#### NEW version (November 2023)
- release gnomAD WGS v4.0 and WES v4.0
- `gnomad_version`=["v2"|"v3"|"v4"] argument has to be specified when initializing the database
- minor fixes

#### version (July 2022)
- release gnomAD WGS v3.1.2
- minor bug fixes

#### version (December 2021)
- more available variant features present, check [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml)
- `get_maf_from_df` renamed to `get_info_from_df`
- `get_maf_from_str` renamed to `get_info_from_str`
- `genome`=["Grch37"|"Grch38"] argument have to be specified, when initializing the database
- [DEPRECATED 11.2023]`genome`=["Grch37"|"Grch38"] argument has to be specified when initializing the database

## Why and What

[The Genome Aggregation Database (gnomAD)](https://gnomad.broadinstitute.org) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.

This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 34G for WGS v2.1.1 (261.942.336 variants) and 98G for WGS v3.1.2 (about 759.302.267 variants), and allows scientists to look for various variant annotations present in gnomAD (i.e. Allele Count, Depth, Minor Allele Frequency, etc. - [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml) you can find all selected features given the genome version). (A query containing 300.000 variants takes ~40s.)

It extracts from a gnomAD vcf about 23 variant annotations. You can find further infromation about the exact fields [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml).
It extracts from a gnomAD vcf about 23 variant annotations. You can find further information about the exact fields [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml).

###### The package works for all currently available gnomAD releases.(July 2022)

## 1. Download SQLite preprocessed files

I have preprocessed and created sqlite3 files for gnomAD v2.1.1 and 3.1.2 for you, which can be easily downloaded from here. They contain all variants on the 24 standard chromosomes.
I have preprocessed and created sqlite3 files for gnomAD for you, which can be easily downloaded from here. They contain all variants on the 24 standard chromosomes.

gnomAD v3.1.2 (hg38, **759'302'267** variants) 46.2G zipped, 98G in total - https://zenodo.org/record/6818606/files/gnomad_db_v3.1.2.sqlite3.gz?download=1 \
gnomAD v2.1.1 (hg19, **261'942'336** variants) 16.1G zipped, 48G in total - https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1
- WGS gnomAD v4.0 (hg38, **759'302'267** variants) 36.1G zipped, 74G in total - https://zenodo.org/records/10066323/files/gnomad_db_wgs_v4.0.sqlite3.gz?download=1
- WES gnomAD v4.0 (hg38, **161'417'006** variants) 7.3G zipped, 17G in total - https://zenodo.org/records/10066310/files/gnomad_db_wes_v4.0.sqlite3.gz?download=1
- WGS gnomAD v3.1.2 (hg38, **759'302'267** variants) 46.2G zipped, 98G in total - https://zenodo.org/record/6818606/files/gnomad_db_v3.1.2.sqlite3.gz?download=1
- WGS gnomAD v2.1.1 (hg19, **261'942'336** variants) 16.1G zipped, 48G in total - https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1

You can download it as:

Expand All @@ -41,7 +49,7 @@ gnomAD_DB.download_and_unzip(download_link, output_dir)
#### NB this would take ~30min (network speed 10mb/s)


or you can create the database by yourself. **However, I recommend to use the preprocessed files to save ressources and time**. If you do so, you can go to **2. API usage** and explore the package and its great features!
or you can create the database by yourself. **However, I recommend using the preprocessed files to save resources and time**. If you do so, you can go to **2. API usage** and explore the package and its great features!


## 2. API usage
Expand All @@ -62,11 +70,11 @@ from gnomad_db.database import gnomAD_DB
```

2. Initialize database connection \
**Make sure to have the correct genome version!**
**Make sure to have the correct gnomad version!**
```python
# pass dir
database_location = "test_dir"
db = gnomAD_DB(database_location, genome="Grch38")
db = gnomAD_DB(database_location, gnomad_version="v3")
```

3. Insert some test variants to run the examples below \
Expand Down
6 changes: 3 additions & 3 deletions Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ database_location = config['database_location']
gnomad_vcf_location = config['gnomad_vcf_location']
tables_location = config['tables_location']
script_locations = config['script_locations']
genome = config['genome']
gnomad_version = config['gnomad_version']
KERNEL = config['KERNEL']


Expand All @@ -32,7 +32,7 @@ rule extract_tables:
message:
"Running createTSVtables notebook..."
shell:
"papermill {input.notebook} {output.notebook} -p gnomad_vcf_location {gnomad_vcf_location} -p tables_location {tables_location} -p genome {genome} -k {KERNEL}"
"papermill {input.notebook} {output.notebook} -p gnomad_vcf_location {gnomad_vcf_location} -p tables_location {tables_location} -p gnomad_version {gnomad_version} -k {KERNEL}"


# -------------------------- INSSERT VARIANTS WITH MAF TO DATABASE ------------------------------
Expand All @@ -45,7 +45,7 @@ rule insert_variants:
message:
"Running insertVariants notebook..."
shell:
"papermill {input.notebook} {output.notebook} -p database_location {database_location} -p tables_location {tables_location} -p genome {genome} -k {KERNEL}"
"papermill {input.notebook} {output.notebook} -p database_location {database_location} -p tables_location {tables_location} -p gnomad_version {gnomad_version} -k {KERNEL}"

# -------------------------- INSSERT VARIANTS WITH MAF TO DATABASE ------------------------------
#rule create_GettingStartedNB:
Expand Down
17 changes: 13 additions & 4 deletions gnomad_db/database.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,13 @@
import yaml
import pkg_resources


class gnomAD_DB:

def __init__(self, genodb_path, genome="Grch38", parallel=False, cpu_count=None):
def __init__(self, genodb_path, gnomad_version, parallel=False, cpu_count=None):


self.parallel = parallel
self.genome = genome

if self.parallel:
self.cpu_count = cpu_count if isinstance(cpu_count, int) else int(multiprocessing.cpu_count())
Expand All @@ -26,7 +26,10 @@ def __init__(self, genodb_path, genome="Grch38", parallel=False, cpu_count=None)
with open(columns_path) as f:
columns = yaml.load(f, Loader=yaml.FullLoader)

self.columns = list(map(lambda x: x.lower(), columns["base_columns"])) + columns[self.genome]

self.gnomad_version = self._parse_gnomad_version(gnomad_version, list(columns.keys())[1:])

self.columns = list(map(lambda x: x.lower(), columns["base_columns"])) + columns[self.gnomad_version]
self.dict_columns = columns

if not os.path.exists(self.db_file):
Expand All @@ -41,7 +44,7 @@ def open_dbconn(self):


def create_table(self):
value_columns = ",".join([f"{col} REAL" for col in self.dict_columns[self.genome]])
value_columns = ",".join([f"{col} REAL" for col in self.dict_columns[self.gnomad_version]])
sql_create = f"""
CREATE TABLE gnomad_db (
chrom TEXT,
Expand Down Expand Up @@ -171,6 +174,12 @@ def _pack_from_str(self, var: str) -> str:
ref = var[2].split(">")[0]
alt = var[2].split(">")[1]
return chrom, pos, ref, alt

def _parse_gnomad_version(self, gnomad_version: str, supported_gnomad_versions: list) -> str:
gnomad_version = str(gnomad_version)
gnomad_version = gnomad_version.split(".")[-1]
assert gnomad_version in supported_gnomad_versions, f"We don't support this version: {gnomad_version}. Please select one fo the following ones: {supported_gnomad_versions}"
return gnomad_version


def query_direct(self, sql_query: str):
Expand Down
25 changes: 20 additions & 5 deletions gnomad_db/pkgdata/gnomad_columns.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ base_columns:
- REF
- ALT
- FILTER
Grch37:
v2:
- AC # Alternate allele count for samples
- AN # Total number of alleles in samples
- AF # Alternate allele frequency in samples
Expand All @@ -23,23 +23,38 @@ Grch37:
- AF_fin # Alternate allele frequency in XX samples of Finnish ancestry
- AF_afr # Alternate allele frequency in samples of African/African-American ancestry
- AF_asj # Alternate allele frequency in samples of Ashkenazi Jewish ancestry
Grch38:
v3:
- AC # Alternate allele count for samples
- AN # Total number of alleles in samples
- AF # Alternate allele frequency in samples
- InbreedingCoeff # Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation
- MQ # Root mean square of the mapping quality of reads across all samples
- QD # Variant call confidence normalized by depth of sample reads supporting a variant
- ReadPosRankSum # Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias
# - DP # Depth of informative coverage for each sample; reads with MQ=255 or with bad mates are filtered
- VarDP
- AS_VQSLOD
# - VQSLOD # Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model
- AC_popmax # Allele count in the population with the maximum AF
- AN_popmax # Total number of alleles in the population with the maximum AF
- AF_popmax # Maximum allele frequency across populations (excluding samples of Ashkenazi
- AF_eas # Alternate allele frequency in samples of East Asian ancestry
# - AF_oth # Alternate allele frequency in XY samples of Other ancestry # not supported anymore 9.07.22
- AF_nfe # Alternate allele frequency in XY samples of Non-Finnish European ancestry
- AF_fin # Alternate allele frequency in XX samples of Finnish ancestry
- AF_afr # Alternate allele frequency in samples of African/African-American ancestry
- AF_asj # Alternate allele frequency in samples of Ashkenazi Jewish ancestry

v4:
- AC # Alternate allele count for samples
- AN # Total number of alleles in samples
- AF # Alternate allele frequency in samples
- MQ # Root mean square of the mapping quality of reads across all samples
- QD # Variant call confidence normalized by depth of sample reads supporting a variant
- ReadPosRankSum # Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias
- VarDP
- AS_VQSLOD
- AC_grpmax # Allele count in the population with the maximum AF
- AN_grpmax # Total number of alleles in the population with the maximum AF
- AF_grpmax # Maximum allele frequency across populations (excluding samples of Ashkenazi
- AF_eas # Alternate allele frequency in samples of East Asian ancestry
- AF_nfe # Alternate allele frequency in XY samples of Non-Finnish European ancestry
- AF_fin # Alternate allele frequency in XX samples of Finnish ancestry
- AF_afr # Alternate allele frequency in samples of African/African-American ancestry
Expand Down
2 changes: 1 addition & 1 deletion script_config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,5 @@ database_location: "test_out" # where to create the database, make sure you have
gnomad_vcf_location: "data" # where are your *.vcf.bgz located
tables_location: "test_out" # where to store the preprocessed intermediate files, you can leave it like this
script_locations: "test_out" # where to store the scripts, where you can check the progress of your jobs, you can leave it like this
genome: "Grch37" # genome version of the gnomAD vcf file (2.1.1 = Grch37, 3.1.1 = Grch38)
gnomad_version: "v2" # main gnomad_version version of the gnomAD vcf file (e.g., v2, v3, v4)
KERNEL: "gnomad_db"
Loading

0 comments on commit df7b61c

Please sign in to comment.