
Haplotype Function Score (HFS)

This repository provides code for calculating the Haplotype Function Score (HFS) and conducting subsequent analyses. The HFS computation process involves several key steps:

Step 0: Divide the hg38 reference genome into blocks of 4096 base pairs using a sliding window approach with a 2048 base pair overlap. The division is detailed in data/loci_sliding_full.bed.

Step 1: Within each block, transform the genotype matrix into a haplotype matrix using HFS_first.slurm.

Step 2: Convert each haplotype into a DNA sequence, represented in FASTA format using HFS_second.slurm.

Step 3: Use the sei framework to transform the DNA sequence into a sequence class score using HFS_third.slurm. The Haplotype Function Score is defined as the score corresponding to the reference sequence class.

Step 4: HFS-based association analysis. Example code is in code/.

Requirements

For HFS_third.slurm, create a conda environment with conda env create -f sei.yml

For other steps, create a conda environment with conda env create -f hfs.yml

Download and decompress the necessary reference data from https://data.mendeley.com/datasets/hvjsxwnptk/2

Download and decompress the per-block SNP list from UKB (see the liftover note under Step 1 for the download link).

Note: Steps 1 and 3 are CPU- and GPU-intensive, respectively, and Step 2 has a very heavy I/O burden. Although Step 2 requires only one CPU core per job, submitting more than two jobs on the same node will dramatically slow down all of them, depending on the I/O limits of your server. If possible, send each Step 2 job to a different node.

Details

Step 0:

All subsequent code assumes the following directory structure: ~/HFS/$project_id/$chr/$chunk/$block, where $chunk is one of the independent chunks of hg38 as defined here. There is generally no LD between different chunks, and all subsequent analyses are conducted separately for each chunk. Thus, each .slurm script submits 1,361 jobs, one job per chunk. $block is the ID of each 4096 bp block.
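For concreteness, here is a minimal sketch of how one such per-block working directory could be constructed in R; all values other than the ~/HFS root are hypothetical placeholders.

```r
## Hypothetical example of the per-block directory layout described above.
project_id <- "my_project"   # your project name
chr        <- 1              # chromosome
chunk      <- 3              # one of the 1,361 independent hg38 chunks
block      <- 12             # id of one 4096 bp block within the chunk
block_dir  <- file.path("~", "HFS", project_id, chr, chunk, block)
dir.create(block_dir, recursive = TRUE, showWarnings = FALSE)
```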

data/loci_sliding_full.bed contains the block division used in our eLife publication. Its columns are chr, start, end, blockid and chunkid.
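As a quick sanity check, the block table can be loaded in R and compared against the Step 0 definition (4096 bp blocks on a 2048 bp sliding step); this is a minimal sketch using the column names described above.

```r
## Read the block division shipped with the repository.
blocks <- read.table("data/loci_sliding_full.bed", header = FALSE,
                     stringsAsFactors = FALSE)
names(blocks)[1:5] <- c("chr", "start", "end", "blockid", "chunkid")

## Block spans should be (almost always) 4096 bp, and consecutive blocks on a
## chromosome should start 2048 bp apart (the sliding-window overlap).
table(blocks$end - blocks$start)
head(blocks)
```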

Step 1:

HFS_first.slurm filters, phases and transforms the bgen genotypes into a haplotype matrix for all blocks in a chunk. Within each ~/HFS/$project_id/$chr/$chunk/$block, HFS_first.slurm writes three files:

hap.index: an index file used by Step 2.

hapsgeno.gz: an n x 2 haplotype matrix (one row per sample, one column per haplotype copy). Each element is a haplotype ID. The sample order is the same as in the $keep_path file, so there is no sample ID column.

haplotype.vcf.gz: a VCF file recording the exact genotypes of all haplotypes. The first nine columns are standard VCF columns; from the 10th column onward are the genotypes of each haplotype, with column names corresponding to the haplotype IDs in hapsgeno.gz (see the sketch after this list for how the two files line up).
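To make the output layout concrete, here is a minimal R sketch that re-attaches sample IDs to hapsgeno.gz and tabulates haplotype frequencies for one block; the file paths are hypothetical and the files are assumed to be whitespace-delimited as described above.

```r
keep_path <- "keep.txt"      # the sorted "#FID IID" file passed to plink2 --keep
hap_path  <- "hapsgeno.gz"   # n x 2 matrix of haplotype IDs, no sample column

## Samples, in the same (LC_ALL=C sorted) order used by HFS_first.slurm.
samples <- read.table(keep_path, header = TRUE,
                      comment.char = "", check.names = FALSE)

## Two haplotype IDs per sample, rows aligned with the keep file.
haps <- read.table(hap_path, header = FALSE, col.names = c("hap1", "hap2"))
stopifnot(nrow(haps) == nrow(samples))

## Haplotype frequencies in this block; IDs match the haplotype.vcf.gz columns.
hap_freq <- sort(table(c(haps$hap1, haps$hap2)), decreasing = TRUE)
head(hap_freq)
```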

Variables to be modified in HFS_first.slurm (examples are provided within the script):

Liftover issue: since UKB imputed data are on hg19, I first generate a list of SNP IDs corresponding to each hg38 block (download here), then extract them and lift them over to hg38 for each block, as in HFS_first_liftover.slurm. If you have hg38 .bgen data, use HFS_first.slurm directly.

project_id: name of your project. All analyses will be carried out in ~/HFS/$project_id.

bgen_path: path to bgen file.

sample_path: path to bgen .sample file.

keep_path: path to a two-column sample ID file defining which samples are kept; it is passed to plink2 --keep. Remember to SORT this file with LC_ALL=C sort and then add the header #FID IID (see the sketch after this list). hapsgeno.gz will be matched to the order of this file with LC_ALL=C join.

map_path: genetic map files for SHAPEIT5. Leave blank if your bgen is already phased and does not need phasing.

shapeit5_path: path to the shapeit5_common executable. Leave blank if your bgen is already phased and does not need phasing.
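As a minimal sketch of preparing $keep_path (the input data frame and the file names here are hypothetical), the byte-wise sort can be delegated to the same LC_ALL=C sort that the pipeline relies on:

```r
## 'samples' is assumed to be a data.frame with FID and IID columns.
tmp <- "keep_unsorted.txt"
write.table(samples[, c("FID", "IID")], tmp, sep = " ",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

## Byte-wise (C locale) sort, as required for the later LC_ALL=C join.
system(paste("LC_ALL=C sort", tmp, "> keep_sorted.txt"))

## Add the header expected by plink2 --keep; this is the final $keep_path file.
writeLines(c("#FID IID", readLines("keep_sorted.txt")), "keep.txt")
```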

Step 2:

Within each chunk, HFS_second.slurm pools all haplotypes from all blocks and generates multiple .fa.gz files, each containing the DNA sequences of 30,000 haplotypes. This step is very I/O-intensive. Below is an example R snippet that submits the jobs for different chunks (chunk = 1, 2, ..., 1361) to different nodes, one core per node, to avoid hitting the I/O limit of a single node.

```r
N      <- c(1, 2)   # available node numbers; replace with your cluster's own
remain <- c(1, 2)   # unsubmitted chunk ids, anywhere in 1, 2, 3, ..., 1361
d <- readLines("HFS_second.slurm")
for (x in seq_along(remain)) {
  node <- N[x]
  id   <- remain[x]
  ## lines 11 and 12 of HFS_second.slurm hold the array and nodelist directives
  d[11] <- paste0("#SBATCH --array=", id, "-", id, "%1")
  d[12] <- paste0("#SBATCH --nodelist=node[", node, "]")
  writeLines(d, "test.slurm")
  system("sbatch test.slurm")
}
## Modify this script to submit more than one job per node if your server has
## a higher per-node I/O limit.
```
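To inspect one of the pooled FASTA outputs, a minimal sketch using the Biostrings package (an assumption; any FASTA reader works) could look like this; the file name is a placeholder.

```r
## Read one pooled .fa.gz produced by HFS_second.slurm (hypothetical file name).
library(Biostrings)
haps <- readDNAStringSet("1.fa.gz")
length(haps)       # number of haplotype sequences in this file (up to 30,000)
width(haps)[1:5]   # sequence lengths
names(haps)[1:5]   # sequence identifiers
```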

Variables to be modified in HFS_second.slurm:

project_id: name of your project. All analyses will be carried out in ~/HFS/$project_id.

Step 3:

HFS_third.slurm calls the sei framework on GPU to transform all FASTA DNA sequences into sequence class scores, then calls ~/HFS/code/splitsei.r to write the sequence class scores to $blockid/sei. Remember to modify code/seqs_fasta.yml and sc.py: I wrote a default path of ~/sei-framework/model in these files; change it according to your sei installation. The first column of $blockid/sei is the haplotype ID (corresponding to hapsgeno.gz and haplotype.vcf.gz), followed by 39 columns of sequence class scores, as defined on the main page of the sei framework. Note: sequence class 40 is discarded since it is too sparse and biologically trivial.
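The sketch below, under the column layout just described, reads the per-block sei output and extracts an HFS column; the reference sequence class index is a placeholder, since identifying it depends on your reference sequence.

```r
## Read the sei output for one block: column 1 is the haplotype ID, followed by
## 39 sequence class scores (class 40 is discarded by the pipeline).
sei    <- read.table("sei", header = FALSE)   # path inside one $blockid directory
hap_id <- sei[[1]]
scores <- as.matrix(sei[, -1])
stopifnot(ncol(scores) == 39)

## HFS is the score corresponding to the reference sequence class (see overview);
## 'ref_class' below is a hypothetical placeholder for that class's column index.
ref_class <- 1
hfs <- setNames(scores[, ref_class], hap_id)
```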

Variables to be modified in HFS_third.slurm:

project_id: name of your project. All analyses will be carried out in ~/HFS/$project_id.

Step 4:

Now you have calculated the HFS matrix for the entire cohort. You can link this matrix to phenotype data and apply various analyses as in GWAS. I provide two example scripts that we applied in the eLife publication: susie.r calculates the marginal and posterior associations between HFS and phenotypes, and the output posterior effect sizes are used in polygenic prediction. Note: susie.r is also run separately for each chunk. enrichment_linear.r calculates the association between posterior inclusion probability (PIP) and biological annotations to reveal the pathways, cells and tissues underlying the phenotypes. There are also other scripts for the sensitivity analyses in our paper.
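For orientation only, here is a minimal sketch of a SuSiE-style association analysis on an HFS matrix using the susieR package directly; it is not the repository's susie.r, and the object names (hfs_mat, pheno) are hypothetical.

```r
## hfs_mat: samples x blocks HFS matrix for one chunk; pheno: phenotype vector.
library(susieR)

fit <- susie(hfs_mat, pheno, L = 10)

pip  <- fit$pip         # per-block posterior inclusion probability
beta <- coef(fit)[-1]   # posterior effect sizes (drop the intercept)

## The posterior effects could then feed a simple polygenic score:
## pgs <- as.vector(hfs_mat %*% beta)
```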
