deeprho is a method for estimating recombination rate given population genetic data

haotianzh/deeprho


██████╗ ███████╗███████╗██████╗ ██████╗ ██╗  ██╗ ██████╗ 
██╔══██╗██╔════╝██╔════╝██╔══██╗██╔══██╗██║  ██║██╔═══██╗
██║  ██║█████╗  █████╗  ██████╔╝██████╔╝███████║██║   ██║
██║  ██║██╔══╝  ██╔══╝  ██╔═══╝ ██╔══██╗██╔══██║██║   ██║
██████╔╝███████╗███████╗██║     ██║  ██║██║  ██║╚██████╔╝
╚═════╝ ╚══════╝╚══════╝╚═╝     ╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝    v2.0

DeepRho: software accompaniment for "DeepRho: Accurate Estimation of Recombination Rate from Inferred Genealogies using Deep Learning", Haotian Zhang and Yufeng Wu, manuscript, 2021.

DeepRho constructs images from population genetic data and uses the strength of convolutional neural networks (CNNs) in image classification to estimate recombination rates. The key idea of DeepRho is to generate genetically informative images from inferred gene genealogies and linkage disequilibrium in population genetic data.

Code

deeprho is open-source software for per-base recombination rate estimation from inferred genealogies using deep learning. It makes estimates based on LD patterns and local genealogical trees inferred by RENT+.


Prerequisites

Installations

  1. Clone from GitHub: git clone https://github.com/haotianzh/deeprho_v2.git or download & unzip the file to your local directory.
  2. Enter root directory: cd deeprho_v2
  3. Create a virtual environment through conda: conda create -n deeprho python=3.7 openjdk=11 msprime
  4. Activate conda environment: conda activate deeprho
  5. Install: pip install .
  6. Validate: deeprho -v
  7. [Optional] See GPU Support if you want to use a GPU

Input Formats

  • ms-formatted input (the first line lists positions, separated by spaces, followed by the haplotype sequences; see examples/data.ms for details)
  • VCF file (check examples/data.vcf)
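
As a rough illustration of the ms-style layout described above, the sketch below parses text whose first line holds space-separated positions and whose remaining lines are haplotype strings. The function name `read_ms_like` and the toy input are hypothetical, and the exact layout of examples/data.ms may differ (for instance, it may carry header lines), so check that file before relying on this.

```python
from typing import List, Tuple


def read_ms_like(text: str) -> Tuple[List[float], List[str]]:
    """Parse positions (first line) and haplotypes (remaining lines).

    Assumption: no header lines; positions are space-separated numbers and
    every haplotype has one character per position.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    positions = [float(tok) for tok in lines[0].split()]
    haplotypes = lines[1:]
    for hap in haplotypes:
        if len(hap) != len(positions):
            raise ValueError("haplotype length does not match number of positions")
    return positions, haplotypes


# Hypothetical toy input: 4 SNP positions, 3 haplotypes.
sample = "10 250 900 1500\n0101\n1100\n0011\n"
positions, haplotypes = read_ms_like(sample)
print(len(positions), "sites,", len(haplotypes), "haplotypes")
```

The length check is there because a mismatch between the position line and a haplotype row is the most likely formatting mistake in hand-built input.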

Usages (Examples)

  • # save a precalculated lookup table for a user provided demography 
    deeprho maketable --demography examples/YRI_pop_sizes.csv --out YRI_pop_table
  • # estimate recombination rates
    deeprho estimate --file examples/example_YRI.vcf --ploidy 2 --table YRI_pop_table --num-thread 8 --plot --verbose 
  • # generate a test case under a given evolutionary setting
    deeprho test --demography examples/YRI_pop_sizes.csv --rate-map examples/test_recombination_map.txt --npop 50 --ploidy 2 --out test.vcf
    demography is a .csv file that contains at least three columns: label, x (time) and y (size). label is the population name, and a single file should contain only one population. Time is measured in generations; see examples/ACB_pop_sizes.csv for an example.
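
A quick sanity check of a demography file can catch problems before running maketable. The sketch below verifies only what the description above states: the columns label, x and y exist, x and y are numeric, and exactly one population label is present. The helper name `check_demography` and the toy values are hypothetical and not taken from the examples directory.

```python
import csv
import io


def check_demography(csv_text: str) -> str:
    """Return the single population label, or raise ValueError.

    Checks only the documented constraints: columns label, x, y exist and
    exactly one population label appears.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"label", "x", "y"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    labels = set()
    for row in reader:
        float(row["x"])  # time in generations must be numeric
        float(row["y"])  # population size must be numeric
        labels.add(row["label"])
    if len(labels) != 1:
        raise ValueError(f"expected one population, found {sorted(labels)}")
    return labels.pop()


# Hypothetical toy demography with a single population.
toy = "label,x,y\nYRI,0,20000\nYRI,1000,15000\nYRI,5000,10000\n"
print(check_demography(toy))
```

To check a real file, pass its contents, for example `check_demography(open("examples/YRI_pop_sizes.csv").read())`.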

Outputs

Default output name is formatted as <FILE>.rate[.txt|.png|.npy] in the same directory as your input.

  • .txt file consists of three columns, Start, End and Rate, separated by tabs. A sample output looks like:

    # your_vcf_file_name.rate.txt
    Start	End	Rate
    0	8	0.0
    8	1822	2.862294427352283e-08
    1822	4321	2.3297465959039865e-08
    4321	7125	1.6098357471351787e-08
    7125	10570	4.027717518356611e-09
    10570	14312	2.1394376828669226e-09
    14312	17689	2.2685986706092933e-09
    17689	19928	1.6854787948356243e-09
  • .png file shows a simple plot of the estimated recombination map.

  • .npy file stores an ndarray object recording the recombination rate per base; the i-th element of the ndarray denotes the rate from base i to base (i+1).
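
To show how the .txt and .npy outputs relate, the sketch below parses a Start/End/Rate table and expands it into a per-base list in which element i is the rate from base i to base i+1. It assumes intervals are half-open and contiguous from 0, as in the sample above, and it skips lines starting with `#` in case the file-name line is part of the file. The function names and toy values are hypothetical.

```python
import csv
import io
from typing import List, Tuple


def read_rate_table(text: str) -> List[Tuple[int, int, float]]:
    """Parse tab-separated Start/End/Rate rows, skipping '#' comment lines."""
    rows = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(rows)), delimiter="\t")
    return [(int(r["Start"]), int(r["End"]), float(r["Rate"])) for r in reader]


def per_base_rates(intervals: List[Tuple[int, int, float]]) -> List[float]:
    """Expand intervals so element i is the rate from base i to base i + 1.

    Assumption: intervals are half-open [Start, End) and contiguous from 0.
    """
    rates: List[float] = []
    for start, end, rate in intervals:
        rates.extend([rate] * (end - start))
    return rates


# Hypothetical toy table in the same layout as the sample output.
sample = "# toy.rate.txt\nStart\tEnd\tRate\n0\t8\t0.0\n8\t12\t2.5e-08\n"
intervals = read_rate_table(sample)
rates = per_base_rates(intervals)
print(len(rates), rates[7], rates[8])
```

For whole chromosomes a NumPy array is far more memory-efficient than a Python list; the list is used here only to keep the sketch dependency-free.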

GPU Support (more)

  1. First check if your graphics card is CUDA-enabled.
  2. Check the compatibility table to find an appropriate combination of Python, TensorFlow, CUDA and cuDNN versions.
  3. Install cudatoolkit and cudnn: conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
  4. (For Linux) Set env: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/ (repeat this step every time you restart the session)
  5. Verify install: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Docs

  • Make lookup table

      deeprho maketable [-h] [--ne NE] [--demography DEMOGRAPHY] [--npop NPOP] [--ploidy PLOIDY] [--rmin RMIN] \
                        [--rmax RMAX] [--repeat REPEAT] [--draw DRAW] [--num-thread NUM_THREAD] [--verbose]  
    Arguments Descriptions
    --ploidy <PLOIDY> Ploidy (default 2)
    --ne <NE> Effective population size (default 10^5)
    --demography <DEMOGRAPHY> Demography file if no lookup table provided
    --npop <NPOP> Number of individuals or samples
    --num-thread <NUM_THREAD> Number of workers for parallel execution (default 4)
    --rmin <RMIN> Min of recombination rate per base per generation
    --rmax <RMAX> Max of recombination rate per base per generation
    --repeat <REPEAT> Number of repeats in simulation
    --draw <DRAW> Number of repeats in simulation
    --verbose Show logs in the console
    --help, -h Show usage
  • Estimate

      deeprho estimate [-h] [--file FILE] [--length LENGTH] [--ne NE] [--ploidy PLOIDY] [--res RES] \
                        [--threshold THRESHOLD] [--gws GWS] [--ws WS] [--ss SS] [--m1 MODEL_FINE] \
                        [--m2 MODEL_LARGE] [--num-thread NUM_THREAD] [--plot] [--savenp] [--verbose]
    Arguments Descriptions
    --file <FILE> Input file
    --ploidy <PLOIDY> Ploidy (default 1)
    --ne <NE> Effective population size (default 10^5)
    --demography <DEMOGRAPHY> Demography file if no lookup table provided
    --gws <GWS> Window size for inferring genealogy (default 10^3 SNPs)
    --ws <WS> Window size for performing deeprho (fixed at 50 SNPs)
    --ss <SS> Step size for performing deeprho (default 25 SNPs)
    --length <LENGTH> Length of chromosome
    --m1 <MODEL_FINE> Path to the fine model
    --m2 <MODEL_LARGE> Path to the large model
    --threshold <THRESHOLD> Threshold for recombination hotspots (default 5x10^-8)
    --savenp Save estimated rates as numpy ndarray (saved as <FILE>.out.npy)
    --plot Plot recombination map (saved as <FILE>.out.png)
    --num-thread <NUM_THREAD> Number of workers for parallel execution (default 4)
    --verbose Show logs in the console
    --help, -h Show usage
    • <LENGTH> can be either specified explicitly or inferred from the input. If inferred, <LENGTH> = S_n - S_1, where S_n is the physical position of the last SNP site and S_1 is the position of the first SNP site.
    • <MODEL_FINE> and <MODEL_LARGE> are two pretrained models. deeprho uses a two-stage strategy to estimate recombination rates: <MODEL_FINE> is applied to estimate rates in recombination background regions, while <MODEL_LARGE> is used to fine-tune hotspot regions. Two default models trained under a constant demographic model are included in this repo; users can also train their own models as described in the following sections.
    • <THRESHOLD> defines the rate above which a region is regarded as a hotspot. The default is 5x10^-8.
    • <GWS> controls how large a region the genealogies are inferred from. In our tests, 1000 SNPs is a good choice, as it includes as much information as possible for improving local genealogical inference.
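
The notes above can be made concrete with a small sketch: computing the inferred length S_n - S_1, enumerating SNP windows of size 50 with step 25, and flagging intervals whose rate exceeds the hotspot threshold. This is not deeprho's internal code. The function names are hypothetical, and the handling of trailing SNPs that do not fill a window is an assumption.

```python
from typing import List, Tuple

Interval = Tuple[int, int, float]  # (start, end, rate)


def inferred_length(positions: List[int]) -> int:
    """LENGTH = S_n - S_1, the span between the last and first SNP."""
    return positions[-1] - positions[0]


def snp_windows(n_snps: int, ws: int = 50, ss: int = 25) -> List[Tuple[int, int]]:
    """Half-open SNP-index windows of size ws advanced by step ss.

    Assumption: trailing SNPs that do not fill a whole window are dropped;
    the tool itself may handle the tail differently.
    """
    return [(start, start + ws) for start in range(0, n_snps - ws + 1, ss)]


def hotspots(intervals: List[Interval], threshold: float = 5e-8) -> List[Interval]:
    """Keep intervals whose rate is strictly above the threshold."""
    return [iv for iv in intervals if iv[2] > threshold]


# Hypothetical toy values for illustration only.
print(inferred_length([120, 480, 2300, 9100]))
print(snp_windows(120))
print(hotspots([(0, 500, 1e-8), (500, 800, 9e-8)]))
```

With a step of half the window size, every interior SNP falls into two windows, which is why neighbouring estimates overlap.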
  • Test

      deeprho test [-h] [--ne NE] [--demography DEMOGRAPHY] [--npop NPOP] [--ploidy PLOIDY] [--rate-map RATEMAP] \
                        [--recombination-rate RATE] [--sequence-length LENGTH] [--num-thread NUM_THREAD] [--verbose]  
    Arguments Descriptions
    --ploidy <PLOIDY> Ploidy (default 2)
    --ne <NE> Effective population size (default 10^5)
    --demography <DEMOGRAPHY> Demography file if no lookup table provided
    --npop <NPOP> Number of individuals or samples
    --sequence-length <LENGTH> Length of simulated genome
    --recombination-rate <RRATE> Recombination rate
    --rate-map <RATEMAP> Recombination rate map
    --mutation-rate <MRATE> Mutation rate (default 2.5x10^-8)
    --help, -h Show usage
  • Demography settings: several tools can infer demographic history, such as PSMC, SMC++ and MSMC. deeprho takes SMC++ output as its input, but the file must contain only one population; see the SMC++ documentation for more information about its output.

TIPS: If you are not familiar with these parameter settings, leave them at their defaults where possible.

Contact:

Feel free to email us at [email protected].
