- python3, plink, bcftools, bash, nextflow
- singularity / docker images : not tested yet
- select, for each chromosome, variants on imputation quality : minimum info score
- convert each vcf to plink format
- rename duplicated rs ids or "." using chromosome and position
- add cM to the bim files if a genetic map file is given as argument (genetic_maps)
- merge all chromosomes into one plink data set
- give a report with an analysis of frequencies and info scores
- file_listvcf : file containing the bgzipped vcf files to merge, one per line [default : none]
- min_scoreinfo : minimum imputation info score to keep [default : 0.6]
- output_pat : pattern (basename) of the final bed output [default : out]
- output_dir : output directory [default : plink]
- score_imp : header of the imputation score in the vcf [default : INFO]; the name of the score depends on the imputation software used :
  - PBWT : INFO
- Filters done by plink :
  - cut_maf : [default : 0]
  - cut_hwe : [default : 0]
  - cut_geno : [default : 0]
  - cut_mind
- file to extract rs information with position :
  - file_ref_gzip : must be gzipped; an example of the file used : here (format sketched below)
  - poshead_chro_inforef : position of the chromosome column in the file [default : 0]
  - poshead_bp_inforef : position of the bp column in the file [default : 1]
  - poshead_rs_inforef : position of the rs column in the file [default : 2]
  - poshead_a1_inforef : position of the A1 column in the file [default : 3]
  - poshead_a2_inforef : position of the A2 column in the file [default : 4]
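For reference, the sketch below shows what such a reference file can look like when inspected with zcat; the file name, delimiter and variants are hypothetical, but the column order matches the default poshead_* values above (chromosome, bp, rs, A1, A2).

```bash
# hypothetical reference file; columns 0..4 = chromosome, bp, rs, A1, A2
zcat file_ref.gz | head -3
# 1  752566  rs3094315   G  A
# 1  776546  rs12124819  A  G
# 1  832918  rs28765502  T  C
```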
- do_stat : compute statistics using frequencies and info scores [default : true]
- statfreq_vcf : pattern used in the INFO field to compute frequencies [default : "%AN %AC", with AN the total allele number and AC the alternative allele count]
  - can be two values, NAll NAlt, where frequencies are computed as NAlt/NAll
  - can be one value, the frequencies themselves
- genetic_maps : genetic maps used to add the cM map to the bim file; if not provided, the map is not added to the bim; the files must not be compressed :
  - file for hg19
  - file for hg17
  - file for hg18
  - file for hg38
- memory and cpu :
  - max_plink_cores_merge, plink_mem_req_merge : two new parameters defining the cpus and memory for the merge process
  - plink_mem_req : memory used for plink and bcftools
  - other_mem_req : memory for other processes
  - max_plink_cores : cpus for plink and bcftools
Expected format of the genetic map file (genetic_maps) :
chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
1 55550 0 0
1 568322 0 0
1 568527 0 0
- data and command lines can be found in h3agwas-examples
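As an illustration, a minimal run of vcf_in_plink.nf could look like the sketch below; the vcf paths, output names and profile are placeholders to adapt to your environment, and the other parameters keep their defaults.

```bash
# list the bgzipped imputed vcf files, one per line (paths are placeholders)
ls /path/to/imputed/chr*.vcf.gz > listvcf

# filter on info score, convert to plink, merge chromosomes and build the report
nextflow run h3abionet/h3agwas/formatdata/vcf_in_plink.nf \
   --file_listvcf listvcf --min_scoreinfo 0.6 \
   --output_pat myimputed --output_dir plink_imputed/ \
   -profile singularity -resume
```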
vcf_in_bgen.nf : converts each filtered vcf to bgen, without merging, in a single process
vcf_in_bgen_merge.nf :
- vcf are filtered, merged and formatted in bgen
vcf_in_bgen_merge_chro.nf :
- vcf are filtered
- each vcf is formatted in bgen
- bgen files are merged
- plink, bcftools, bash, qctools, nextflow
- singularity / docker images : not tested yet
- Initial data : vcf files in the Sanger imputation output format
- select, for each chromosome, variants on imputation quality : minimum info score
- convert each vcf to bgen (usable, for instance, by bolt-lmm)
- output for each chromosome is named after the basename of the initial file
- file_listvcf : file containing the bgzipped vcf files to merge, one per line [default : none]
- min_scoreinfo : minimum imputation info score to keep [default : 0.6]
- output_dir : output directory [default : bgen]
- qctoolsv2_bin : binary for qctool v2
- genotype_field : genotype field to transform [default : GP]
- bgen_type : bgen type, see the [qctool manual](https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/alphabetical_options.html) :
  - default : bgen (other : "bgen_v1.2", "bgen_v1.1")
- other_option : other options to pass to qctool
For instance, for bolt-lmm the bgen format must be bgen_v1.2 :
~/nextflow ~/Travail/git/h3agwas/formatdata/vcf_in_bgen_merge.nf --file_listvcf listvcf --output_pat exampledata2_imp --output_dir ./ -profile slurmSingularity -resume --bgen_type bgen_v1.2
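For per-chromosome bgen files without merging, a similar sketch with vcf_in_bgen.nf (assuming it sits in formatdata/ next to vcf_in_bgen_merge.nf) would be :

```bash
# one bgen file per filtered vcf, no merging; output_dir keeps its default (bgen)
nextflow run h3abionet/h3agwas/formatdata/vcf_in_bgen.nf \
   --file_listvcf listvcf --min_scoreinfo 0.6 \
   --genotype_field GP --bgen_type bgen_v1.2 \
   -profile singularity -resume
```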
- plink, bcftools, bash, nextflow
- singularity / docker images : not tested yet
- Initial data : vcf files in the Sanger imputation output format
- select, for each chromosome, variants on imputation quality : minimum info score
- convert each vcf to the impute2 format used by bolt-lmm
- output for each chromosome is the basename of the initial file with .impute2.gz
- file_listvcf : file containing the bgzipped vcf files to merge, one per line [default : none]
- min_scoreinfo : minimum imputation info score to keep [default : 0.6]
- output_dir : output directory [default : impute2]
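A minimal sketch, assuming the conversion script is formatdata/vcf_in_impute2.nf (check the repository for the exact script name) :

```bash
# filter each vcf on info score and write one .impute2.gz file per chromosome
nextflow run h3abionet/h3agwas/formatdata/vcf_in_impute2.nf \
   --file_listvcf listvcf --min_scoreinfo 0.6 \
   --output_dir impute2/ \
   -profile singularity -resume
```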
- plink, bash, nextflow, python3 (library : pandas)
- singularity / docker images : not tested yet
- initial data : a gwas summary statistics file to transform into other formats
- search rs ids in a reference file and add the new rs id at each position (if not found, chro:pos is added)
- add N and frequency values if needed and a plink data set is given
- change header, separator... etc
- file_gwas : gwas summary statistics file
  - initial header of your file :
    - head_pval [optional]
    - head_freq [optional]
    - head_bp
    - head_chr
    - head_beta [optional]
    - head_se [optional]
    - head_A1 [optional]
    - head_A2 [optional]
    - head_N [optional]
    - sep : separator, by default space or tab [optional]; for comma : COM, tabulation : TAB and space : WHI
  - header of your output :
    - out_gc : prepare data for submission to the gwas catalog
    - new header names (if not initialised, the header names of your initial file are used) :
      - headnew_pval [optional]
      - headnew_freq [optional]
      - headnew_bp [optional]
      - headnew_chr [optional]
      - headnew_beta [optional]
      - headnew_se [optional]
      - headnew_A1 [optional]
      - headnew_A2 [optional]
      - headnew_N [optional]
    - sepout : separator for your summary statistics output [optional : default " "]
- file to extract rs information with position :
  - file_ref_gzip : must be gzipped; an example of the file used : here
  - poshead_chro_inforef : position of the chromosome column in the file [default : 0]
  - poshead_bp_inforef : position of the bp column in the file [default : 1]
  - poshead_rs_inforef : position of the rs column in the file [default : 2]
- other options :
  - add some N and frequency values to the gwas file :
    - plink information is used to compute frequencies and N and add them to the gwas file if head_N and/or head_freq are not initialised
    - input_dir : plink directory
    - input_pat : plink basename
  - mem_req : memory requested for the processes
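A minimal sketch, assuming the formatting script is formatdata/format_gwasfile.nf (check the repository for the exact script name); the column names below are examples and must match your own summary statistics file :

```bash
# rename headers of a tab-separated summary-statistics file (column names are examples)
nextflow run h3abionet/h3agwas/formatdata/format_gwasfile.nf \
   --file_gwas mygwas.tsv --sep TAB --sepout TAB \
   --head_chr CHR --head_bp BP --head_A1 A1 --head_A2 A2 \
   --head_beta BETA --head_se SE --head_pval P \
   --headnew_chr chromosome --headnew_bp base_pair_location --headnew_pval p_value \
   -profile singularity -resume
```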
nextflow run convert_posversiongenome.nf
- if no file is given, the gwas catalog is downloaded
- extract positions of interest
- rs ids are used to search for positions : see the file_ref_gzip argument
- crossmap is used to define the positions (and strand) not found previously : see bin_crossmap and data_crossmap
- return a file with the new positions
- output_dir : output directory [default : output]
- output : output basename [default : out]
- file_toconvert : file to convert; if empty, the gwas catalog is downloaded
- link_gwas_cat : link used to download the gwas catalog [default : https://www.ebi.ac.uk/gwas/api/search/downloads/alternative]
- head_rs : rs header of the file to convert [default : SNPS (gwas catalog)]
- head_bp : bp header of the file to convert [default : the corresponding gwas catalog header]
- head_chro : chromosome header of the file to convert [default : the corresponding gwas catalog header]
- sep : separator used : TAB, SPACE, "," [default : TAB] (not allowed : ;)
- file to extract rs information with position :
  - file_ref_gzip : must be gzipped; an example of the file used : here
  - poshead_chro_inforef : position of the chromosome column in the file [default : 0]
  - poshead_bp_inforef : position of the bp column in the file [default : 1]
  - poshead_rs_inforef : position of the rs column in the file [default : 2]
- bin_crossmap : crossmap binary [default : ~/.local/bin/CrossMap.py]
- data_crossmap : chain file used by crossmap for the conversion [default : ""]; if no argument is given, it will be downloaded :
  - hg38 to hg19 : link_data_crossmap (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz)
### Output
- {out}.tsv : final file
- {out}.multi.tsv : positions where more than one position has been found
- {out}.detail.tsv : file before cleaning
- {out}.notfound.tsv : positions not found
- folder datai : contains the downloaded files
- folder datatmp : contains temporary files (extract of the rs file)
Requirements :
- R : library
- python : pip3.6 install CrossMap --user ; pip3.6 install numpy==1.16.1 --user ; chmod +x ~/.local/bin/CrossMap.py
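A minimal sketch, using the documented defaults (the gwas catalog is downloaded when --file_toconvert is not given, and the hg38 to hg19 chain file is fetched automatically); the reference file path is a placeholder :

```bash
# convert gwas catalog positions using the rs reference file (placeholder path)
nextflow run convert_posversiongenome.nf \
   --file_ref_gzip file_ref.gz \
   --output_dir convert_pos/ --output gwascat_converted \
   -profile singularity -resume
```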
Transform vcf files into bimbam format after quality filtering.
### Arguments
- file_listvcf : file containing the bgzipped vcf files to merge, one per line [default : none]
- min_scoreinfo : minimum imputation info score to keep [default : 0.6]
- output_dir : output directory [default : impute2]
- genotype_field : genotype field in the vcf file [default : GP]
- qctoolsv2_bin : qctool v2 binary [default : qctool_v2]
- bcftools_bin : bcftools binary [default : bcftools]
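A minimal sketch, assuming the script is formatdata/vcf_in_bimbam.nf (check the repository for the exact script name) :

```bash
# filter each vcf on info score and convert it to bimbam
nextflow run h3abionet/h3agwas/formatdata/vcf_in_bimbam.nf \
   --file_listvcf listvcf --min_scoreinfo 0.6 \
   --genotype_field GP \
   -profile singularity -resume
```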
- input_dir : plink directory
- input_pat : plink basename
- output_dir : output directory [default : output]
- file to extract rs information with position :
  - file_ref_gzip : must be gzipped; an example of the file used : here
  - poshead_chro_inforef : position of the chromosome column in the file [default : 0]
  - poshead_bp_inforef : position of the bp column in the file [default : 1]
  - poshead_rs_inforef : position of the rs column in the file [default : 2]
- deleted_notref : delete positions not found in file_ref_gzip
- reffasta : fasta reference; if present, the vcf files are checked with :
  - checkVCF.py
  - bcftools : the +fixref plugin is used, see BCFTOOLS_PLUGINS=bcftools/plugins/
- michigan_qc : apply the michigan qc [default : 0], see : prepare your data
- dataref_michigan : reference data used by the michigan qc; if empty, it is downloaded [default : ""]
- ftp_dataref_michigan : download link for the michigan reference data [default : ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz]
- bin_checkmich : perl script [default : "HRC-1000G-check-bim.pl"]
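A minimal sketch, assuming the script preparing plink data for imputation is formatdata/plk_in_vcf_imp.nf (check the repository for the exact script name); the paths are placeholders :

```bash
# convert a plink data set to vcf and check it against the fasta reference;
# assumption : --michigan_qc 1 switches the michigan qc on (default is 0)
nextflow run h3abionet/h3agwas/formatdata/plk_in_vcf_imp.nf \
   --input_dir plink_data/ --input_pat mydata \
   --output_dir vcf_for_imputation/ \
   --reffasta Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz \
   --michigan_qc 1 \
   -profile singularity -resume
```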
### General requirement
* bcftools
* plink
* R
* python
* qctools (v2)
* samtools
* for the control of vcf files :
* checkVCF.py is included in the bin/ of the nextflow pipeline (https://github.com/zhanxw/checkVCF)
* if the michigan qc is applied, the michigan data set and perl script are needed (see [here](https://imputationserver.readthedocs.io/en/latest/prepare-your-data/))
### Example
see [h3agwas-example github](https://github.com/h3abionet/h3agwas-examples)
nextflow run h3abionet/h3agwas/formatdata/vcf_in_plink.nf --file_listvcf utils/listvcf --output_pat kgp_imputed --output_dir plink_imputed/ --reffasta utils_data/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz -profile singularity