The purpose of this project is to convert databases that are required by varfish into a format that can be easily imported (i.e. TSV files with a header containing the column names of the corresponding varfish database table).
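As a rough illustration of the target format, the sketch below writes a small TSV file whose first line is a header of column names and then reads it back. The column names (`release`, `chromosome`, `position`) and the sample row are placeholders chosen for this example, not the actual columns of any varfish table.

```python
import csv
import io

# Hypothetical column names; the real ones come from the varfish table.
HEADER = ["release", "chromosome", "position"]
ROWS = [["GRCh37", "1", "12345"]]


def write_tsv(handle, header, rows):
    """Write a header line followed by tab-separated data rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_tsv(handle):
    """Return (header, rows) from a TSV stream whose first line is the header."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    return header, list(reader)


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_tsv(buffer, HEADER, ROWS)
buffer.seek(0)
header, rows = read_tsv(buffer)
```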
- bcftools
- bedops
- samtools
We recommend the installation of the requirements via conda:
conda install bcftools bedops samtools
git clone [email protected]:bihealth/varfish-anno.git
cd varfish-anno
make init
This step creates the folder structure in `databases/`.
make download
The downloads will be stored in `databases/<database_name>/download/`.
The reference is placed in `downloads/`.
Please note that the download routine is not sophisticated. You may want to double-check the process, especially if something breaks. It is intended as a detailed set of instructions for downloading the required databases. If the files are already available to you, you can place them in the corresponding download folder and skip this step. Note that in this case the conversion scripts might need some adaptation to match the correct file names (see next section).
The download links are defined in `downloads/Makefile`, and the variable names are prefixed with `URL_`. Those variables are safe to change (provided the downloaded file has the expected format).
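To get an overview of which links can be changed, a small helper along these lines could list the `URL_` variables found in a Makefile. This is a minimal sketch that assumes plain assignments of the form `URL_NAME = value` or `URL_NAME := value`; the variable names and URLs in the example text are made up and do not come from the actual `downloads/Makefile`.

```python
import re

# Matches simple assignments such as "URL_X = ..." or "URL_X := ...".
URL_ASSIGNMENT = re.compile(r"^(URL_\w+)\s*[:?]?=\s*(.*)$")


def list_url_variables(makefile_text):
    """Return a dict mapping URL_ variable names to their assigned values."""
    variables = {}
    for line in makefile_text.splitlines():
        match = URL_ASSIGNMENT.match(line.strip())
        if match:
            variables[match.group(1)] = match.group(2).strip()
    return variables


# Made-up Makefile content for illustration only.
example = """\
URL_EXAMPLE_DB = https://example.org/example_db.vcf.gz
OTHER = ignored
URL_EXAMPLE_REF := https://example.org/reference.fa.gz
"""
found = list_url_variables(example)
```

To inspect the real file, the same function can be given the contents of `downloads/Makefile` read from disk.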
- The KEGG database is not automatically downloadable. Instructions for obtaining the required files are printed (or see the instructions below). The files need to be placed in `databases/kegg/downloads`.
- The case files are in `.ped` format and are specific to your project. You need to place them in `databases/case/download`.
- Copy the resulting `-vars` file from Varhab to `databases/annotation`.
- Copy the resulting `-gts` file from Varhab to `databases/smallvariant`.
Note that the ExAC, gnomAD and dbSNP databases are rather large and will take time to download.
- https://genome.ucsc.edu
- Tools -> Table Browser
- group: All tables
- assembly: GRCh37
- table: keggPathway
- output format: selected fields from primary and related tables
- output file: `genetokegg.tsv`
- get output
- Linked Tables -> knownGene -> allow selection from checked tables
- Linked Tables -> ensGtp -> allow selection from checked tables
- Select Fields from keggPathway -> mapID
- ensGtp fields -> gene
- get output
- https://genome.ucsc.edu
- Tools -> Table Browser
- group: All tables
- assembly: GRCh37
- table: keggMapDesc
- output format: selected fields from primary and related tables
- output file: `kegginfo.tsv`
- get output
- Select Fields from keggMapDesc -> mapID
- Select Fields from keggMapDesc -> description
- get output
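Once both files are saved, a quick sanity check can confirm that they fit together. The sketch below joins gene rows to pathway descriptions on the shared `mapID`. It assumes that both files are tab-separated with exactly two columns in the order the fields were selected above (`mapID` first), and that header lines start with `#`. These are assumptions about the export, so verify them against your own files. The sample rows are made up.

```python
import csv
import io


def read_two_columns(handle):
    """Yield (first, second) pairs, skipping blank lines and '#' lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        yield row[0], row[1]


def genes_with_description(gene_handle, info_handle):
    """Join gene rows to pathway descriptions on the shared mapID."""
    descriptions = dict(read_two_columns(info_handle))
    return [
        (gene, map_id, descriptions.get(map_id))
        for map_id, gene in read_two_columns(gene_handle)
    ]


# Made-up sample content standing in for genetokegg.tsv and kegginfo.tsv.
genetokegg = io.StringIO("#mapID\tgene\nmap_a\tGENE_1\nmap_b\tGENE_2\n")
kegginfo = io.StringIO("#mapID\tdescription\nmap_a\tExample pathway\n")
joined = genes_with_description(genetokegg, kegginfo)
```

A `None` description in the result marks a `mapID` that appears in the gene file but not in the description file.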
make convert
Every script defines a `HEADER`, `INPUT` and `OUTPUT` variable and, if needed, a `REF` variable. The names should be self-explanatory. They are preset to the downloaded files. You can change `INPUT` and `REF` if needed.
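The role of these variables can be pictured with the following sketch. It is not one of the project's conversion scripts, which may be written in a different language. It only illustrates how a header, an input and an output relate to each other. The column names and data rows are hypothetical, in-memory buffers stand in for files, and `REF` is left out.

```python
import csv
import io

# Hypothetical values; the real scripts preset these to the downloaded files.
HEADER = ["chromosome", "position", "reference", "alternative"]
INPUT = io.StringIO("1\t100\tA\tG\n1\t200\tC\tT\n")  # stands in for an input file
OUTPUT = io.StringIO()  # stands in for the output TSV


def convert(source, target, header):
    """Copy tab-separated rows from source to target, prepending the header."""
    reader = csv.reader(source, delimiter="\t")
    writer = csv.writer(target, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    count = 0
    for row in reader:
        # Reject rows that would not line up with the header columns.
        if len(row) != len(header):
            raise ValueError(f"row has {len(row)} fields, expected {len(header)}")
        writer.writerow(row)
        count += 1
    return count


converted = convert(INPUT, OUTPUT, HEADER)
```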
Note that the ExAC, gnomAD and dbSNP databases are rather large; dbSNP in particular will take time to convert.