- Introduction
- Dependencies
- Installation
- Stargazer
- Sun Grid Engine (SGE)
- SNP Callers
- Running in Command Line
- Running within Python
PyPGx is a Python package for pharmacogenomics (PGx) research, which can be used as a standalone program and as a Python module. Documentation is available at Read the Docs.
PyPGx requires Python 3 and the following Python packages:
requests>=2
pandas>=1.0.0
bs4>=0.0.1
lxml>=4.5.0
pysam>=0.16.0
vcfgo>=0.0.10
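Before installing, it can be useful to see which of these packages are already present. The following is a minimal sketch, not part of PyPGx: the helper names (`installed_versions`, `meets_minimum`) are made up for illustration, it needs Python 3.8 or newer for `importlib.metadata`, and it compares only the leading numeric parts of each version string.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements list above.
REQUIREMENTS = {
    "requests": "2",
    "pandas": "1.0.0",
    "bs4": "0.0.1",
    "lxml": "4.5.0",
    "pysam": "0.16.0",
    "vcfgo": "0.0.10",
}


def installed_versions(requirements):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in requirements:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def version_tuple(text):
    """Keep only the leading numeric components, e.g. '1.0.0rc1' -> (1, 0, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Simplified comparison: pad both versions with zeros, then compare."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    for name, found in installed_versions(REQUIREMENTS).items():
        minimum = REQUIREMENTS[name]
        if found is None:
            print(f"{name}: missing (need >={minimum})")
        elif meets_minimum(found, minimum):
            print(f"{name}: {found} ok")
        else:
            print(f"{name}: {found} is older than {minimum}")
```

In practice `pip` resolves these requirements on its own, so a check like this is mainly a diagnostic aid.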
The easiest way to install PyPGx and all of its dependencies is to use pip:
$ pip install pypgx
For genotype analyses, PyPGx relies on Stargazer, a bioinformatics tool for
calling star alleles (haplotypes) in PGx genes using data from
next-generation sequencing (NGS) or single nucleotide polymorphism (SNP)
arrays. Stargazer must therefore be installed before running PyPGx
commands such as bam2gt. For more information on Stargazer, please visit
its official webpage and GitHub repository.
Many PyPGx commands, such as bam2gt2, rely on a Sun Grid Engine (SGE)
cluster to distribute their tasks across multiple machines for speed. These
commands are marked with [SGE] and generate a shell script, which
can be run like this:
$ sh example-qsub.sh
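If the script is launched from Python rather than typed by hand, a small guard can report a missing SGE installation early. This is a sketch under two assumptions: that the generated script submits jobs with the standard SGE command `qsub` (suggested by the file name above, but not stated), and that `sh` is available. The function names are hypothetical and not part of PyPGx.

```python
import shutil
import subprocess


def sge_available(which=shutil.which):
    """Return True if the SGE submission command `qsub` is found on PATH."""
    return which("qsub") is not None


def run_generated_script(script_path, which=shutil.which):
    """Run a generated shell script, refusing to start if `qsub` is missing."""
    if not sge_available(which):
        raise RuntimeError("qsub not found on PATH; is SGE installed?")
    # Equivalent to typing: sh <script_path>
    return subprocess.run(["sh", script_path], check=True)
```

The `which` parameter exists only so the lookup can be replaced in tests; normal callers can ignore it.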
One major input for the Stargazer program is a Variant Call Format (VCF) file,
which is a standard file format for storing SNP calls. Currently, PyPGx
relies on two SNP callers to make VCF files: Genome Analysis Toolkit (GATK)
and BCFtools. When running PyPGx commands like bam2vcf, you can pick
which SNP calling algorithm to use; it is assumed that you have already
installed the corresponding SNP caller.
Generally speaking, GATK is considered more accurate but much slower than BCFtools. For instance, without an SGE cluster, SNP calling for the CYP2D6 gene in 70 WGS samples takes 19 min with GATK, but only 2 min with BCFtools. Therefore, if you have many samples and no access to SGE for running parallel jobs, BCFtools may be a better choice. If SGE is available on your server, GATK is strongly recommended.
For more information on the SNP callers, please visit the GATK website and the BCFtools website.
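The trade-off described above can be written down as a small rule of thumb. The sketch below only restates this section's recommendation and timings; `suggest_snp_caller` is a hypothetical helper, and the strings it returns are illustrative labels, not confirmed argument values for bam2vcf.

```python
# Timings quoted above: 70 WGS samples, CYP2D6 gene, no SGE cluster.
GATK_MINUTES = 19
BCFTOOLS_MINUTES = 2
N_SAMPLES = 70


def suggest_snp_caller(has_sge):
    """Return 'gatk' when jobs can run in parallel on SGE, else 'bcftools'."""
    return "gatk" if has_sge else "bcftools"


# How much slower GATK was in the quoted benchmark.
slowdown = GATK_MINUTES / BCFTOOLS_MINUTES

# Average wall-clock seconds per sample for each caller.
seconds_per_sample = {
    "gatk": GATK_MINUTES * 60 / N_SAMPLES,
    "bcftools": BCFTOOLS_MINUTES * 60 / N_SAMPLES,
}

if __name__ == "__main__":
    print(f"GATK was {slowdown:.1f}x slower in this benchmark")
    for caller, seconds in seconds_per_sample.items():
        print(f"{caller}: about {seconds:.1f} s per sample")
```

These figures come from a single benchmark on one gene, so they indicate the rough scale of the difference rather than a general performance guarantee.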
To get help:
$ pypgx -h
usage: pypgx [-h] [-v] tool ...

positional arguments:
  tool           name of the tool
    bam2gt       convert BAM files to a genotype file
    bam2gt2      convert BAM files to genotype files [SGE]
    gt2pt        convert a genotype file to phenotypes
    bam2vcf      convert BAM files to a VCF file
    bam2vcf2     convert BAM files to a VCF file [SGE]
    bam2gdf      convert BAM files to a GDF file
    gt2html      convert a genotype file to an HTML report
    bam2html     convert a BAM file to an HTML report [SGE]
    fq2bam       convert FASTQ files to BAM files [SGE]
    bam2bam      realign BAM files to another reference genome [SGE]
    bam2sdf      convert BAM files to a SDF file
    sdf2gdf      convert a SDF file to a GDF file
    pgkb         extract CPIC guidelines using PharmGKB API
    minivcf      slice VCF file
    mergevcf     merge VCF files
    summary      create summary file using Stargazer data
    meta         create meta file from summary files
    compare      compare genotype files
    check        check table files for Stargazer
    liftover     convert variants in SNP table from hg19 to hg38
    peek         find all possible star alleles from VCF file
    viewsnp      view SNP data for pairs of sample/star allele
    compgt       compute the concordance between two genotype files
    compvcf      compute the concordance between two VCF files
    unicov       compute the uniformity of sequencing coverage

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  print the PyPGx version number and exit
To get tool-specific help:
$ pypgx bam2gdf -h
usage: pypgx bam2gdf [-h] [--bam_dir DIR] [--bam_list FILE]
                     genome_build target_gene control_gene output_file
                     [bam_file [bam_file ...]]

positional arguments:
  genome_build     genome build ('hg19' or 'hg38')
  target_gene      name of target gene (e.g. 'cyp2d6')
  control_gene     name or region of control gene (e.g. ‘vdr’, ‘chr12:48232319-48301814’)
  output_file      write output to this file
  bam_file         input BAM files

optional arguments:
  -h, --help       show this help message and exit
  --bam_dir DIR    treat any BAM files in DIR as input
  --bam_list FILE  read BAM files from FILE, one file path per line
To run a tool from the command line:
$ pypgx bam2gdf hg19 cyp2d6 vdr out.gdf in1.bam in2.bam
The output GDF file will look like:
Locus           Total_Depth  Average_Depth_sample  Depth_for_S1  Depth_for_S2  ...
chr22:42539471  190          95                    53            137
chr22:42539472  192          96                    54            138
chr22:42539473  190          95                    53            137
...
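The columns in the example above appear to be related: in every row shown, Total_Depth equals the sum of the per-sample depths and Average_Depth_sample equals their mean. The sketch below reads such a table and checks that relationship. It is an illustration rather than PyPGx code: it assumes the file is whitespace- or tab-delimited with one header line, it uses only the two sample columns shown, and `read_gdf` and `row_is_consistent` are hypothetical names.

```python
import io

# The three example rows shown above, with the two visible sample columns.
SAMPLE_GDF = """\
Locus\tTotal_Depth\tAverage_Depth_sample\tDepth_for_S1\tDepth_for_S2
chr22:42539471\t190\t95\t53\t137
chr22:42539472\t192\t96\t54\t138
chr22:42539473\t190\t95\t53\t137
"""


def read_gdf(handle):
    """Parse a GDF-like table into a list of dicts (depths as floats)."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = {header[0]: fields[0]}
        for name, value in zip(header[1:], fields[1:]):
            row[name] = float(value)
        rows.append(row)
    return rows


def row_is_consistent(row):
    """Check total == sum of sample depths and average == their mean."""
    depths = [v for k, v in row.items() if k.startswith("Depth_for_")]
    total_ok = row["Total_Depth"] == sum(depths)
    mean_ok = row["Average_Depth_sample"] == sum(depths) / len(depths)
    return total_ok and mean_ok


rows = read_gdf(io.StringIO(SAMPLE_GDF))

if __name__ == "__main__":
    for row in rows:
        print(row["Locus"], "consistent:", row_is_consistent(row))
```

To read a real file, replace the `io.StringIO` object with an open file handle, for example `open("out.gdf")`.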
To run within Python:
from pypgx.phenotyper import phenotyper
phenotyper("cyp2d6", "*1", "*1")
phenotyper("cyp2d6", "*1", "*4")
phenotyper("cyp2d6", "*1", "*2x2")  # *2x2 is gene duplication.
phenotyper("cyp2d6", "*4", "*5")    # *5 is gene deletion.
This gives:
'normal_metabolizer'
'intermediate_metabolizer'
'ultrarapid_metabolizer'
'poor_metabolizer'
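When many diplotypes need phenotypes, the calls above can be wrapped in a loop. The sketch below is hypothetical helper code, not part of PyPGx: `phenotype_diplotypes` accepts any callable with the same `(gene, allele1, allele2)` signature as the example above, and `documented_call` is a stand-in that only looks up the four results listed in this section so that the block runs without PyPGx installed.

```python
# Results copied from the example above; used only by the stand-in below.
DOCUMENTED_RESULTS = {
    ("cyp2d6", "*1", "*1"): "normal_metabolizer",
    ("cyp2d6", "*1", "*4"): "intermediate_metabolizer",
    ("cyp2d6", "*1", "*2x2"): "ultrarapid_metabolizer",
    ("cyp2d6", "*4", "*5"): "poor_metabolizer",
}


def documented_call(gene, allele1, allele2):
    """Stand-in for a real phenotyper: a lookup, not PyPGx's actual logic."""
    return DOCUMENTED_RESULTS[(gene, allele1, allele2)]


def phenotype_diplotypes(gene, diplotypes, call):
    """Map each (allele1, allele2) pair to the phenotype returned by `call`."""
    return {pair: call(gene, *pair) for pair in diplotypes}


diplotypes = [("*1", "*1"), ("*1", "*4"), ("*1", "*2x2"), ("*4", "*5")]
table = phenotype_diplotypes("cyp2d6", diplotypes, documented_call)

if __name__ == "__main__":
    for (a1, a2), phenotype in table.items():
        print(f"{a1}/{a2}: {phenotype}")
```

With PyPGx installed, the `phenotyper` function imported in the example above can be passed as `call` in place of the stand-in.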