GitHub - alexmondaini/Genome_GATK: for HKU colleagues

GATK4 for Hong Kong University

This repo is designed to help HKU colleagues run workflows written in WDL, using Cromwell as the workflow management system. Some familiarity with WDL and Cromwell is highly desirable before proceeding.

Checking if your pipeline has the correct syntax, and creating inputs from WDL files:

WOMtool is a jar application that parses a WDL file and prints a template of all its inputs in JSON format. It is the best way to start creating your first inputs file.
You can download the WOMtool jar file from this link. Once downloaded, generate an input template for the mutect2.wdl pipeline with: java -jar womtool-${version}.jar inputs mutect2.wdl. This prints a template to stdout whose values you fill in. To save the template as a JSON file, redirect the output, for example with version 59: java -jar womtool-59.jar inputs mutect2.wdl > mutect2.json
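The heading above also mentions syntax checking, which WOMtool performs with its validate subcommand. Below is a minimal sketch that validates a WDL file and then writes its inputs template. The function name check_and_template and the jar path argument are illustrative assumptions, not part of this repo.

```shell
#!/bin/bash
# Sketch: validate a WDL file, then write its inputs template next to it.
# check_and_template is a hypothetical helper; adjust the jar path to your download.
check_and_template() {
    local wdl="$1"
    local jar="$2"
    local json="${wdl%.wdl}.json"   # mutect2.wdl -> mutect2.json

    if [ ! -f "$jar" ]; then
        echo "WOMtool jar not found: $jar" >&2
        return 1
    fi

    # 'validate' checks the syntax; 'inputs' prints the JSON template to stdout.
    # The template is only written when validation succeeds.
    java -jar "$jar" validate "$wdl" && java -jar "$jar" inputs "$wdl" > "$json"
}

# Example:
# check_and_template mutect2.wdl /path/to/womtool-59.jar
```

The template is written to mutect2.json in the current directory, ready to be filled in with values.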

Running workflows in high-throughput.

Once you have the basics of how pipelines are run with WDL and Cromwell, you can start executing them in a high-throughput manner. Cromwell is the execution engine that runs WDL files; it requires a backend configuration to run on our cluster (xomics). The configuration files that get Cromwell working on our cluster are:
- test.conf is a configuration file that helps you get familiar with Cromwell's default environment variables and how the engine works in the background. Use test.conf along with the hello_world directory; this is a great first step to learn and get started.
- application.conf is the production configuration file of the engine (Cromwell); use this for production workflows.
Furthermore, Cromwell can be executed in two different modes, run and server:
- run mode is a good way to get started with Cromwell and experiment quickly. Run mode launches a single workflow from the command line and exits with a 0 or 1 code to indicate the result.
- server mode is the mode you want for most applications of Cromwell and is suitable for production use (high-throughput). Server mode starts Cromwell as a web server that exposes REST endpoints.
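As a minimal sketch of run mode, the helper below launches a single workflow with a chosen configuration file and reports the result from the exit code. The function name run_workflow and the file names in the usage comment (hello.wdl, hello.json) are assumptions for illustration; check the hello_world directory for the actual names.

```shell
#!/bin/bash
# Sketch: launch one workflow in run mode and report success or failure.
# run_workflow is a hypothetical helper; all paths are placeholders.
run_workflow() {
    local jar="$1"
    local conf="$2"
    local wdl="$3"
    local inputs="${4:-}"   # optional inputs JSON

    local args=(-Dconfig.file="$conf" -jar "$jar" run "$wdl")
    # Only pass --inputs when an inputs file was given
    if [ -n "$inputs" ]; then
        args+=(--inputs "$inputs")
    fi

    # Run mode exits with 0 on success and a non-zero code on failure
    if java "${args[@]}"; then
        echo "Workflow succeeded"
    else
        echo "Workflow failed" >&2
        return 1
    fi
}

# Example (file names are assumed):
# run_workflow /path/to/cromwell-83.jar test.conf hello_world/hello.wdl hello_world/hello.json
```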
To start server mode, you can use a script such as the following one:

#!/bin/bash
#PBS -N job_cromwell_server
#PBS -l walltime=440:00:00
#PBS -l select=1:ncpus=6:mem=60gb
#PBS -q cgsd
#PBS -k oe

# choose your cromwell version
VERSION=83
# choose your local port 
PORT=8000

cat 1>&2 <<END
1. SSH tunnel from your workstation using the following command:

   ssh -N -f -L ${PORT}:${HOSTNAME}:8000 ${USER}@xomics.cpos.hku.hk

   and point your web browser to http://localhost:${PORT}
   
END


module load java/11.0.9
java -Xms10G -Xmx60G -Dconfig.file=/path/to/application.conf \
-jar /path/to/cromwell-${VERSION}.jar server

When you submit the script above as a job on the cluster, check the stderr file it produces. The file is named job_cromwell_server.eXXXX, where XXXX is your job number. Copy the ssh line, which has been expanded with the hostname and port, and paste it into a terminal on your local machine to create the tunnel.
This command forwards port 8000 (Cromwell's default port) on your node to a port on your local machine (also 8000 in this example). Once the tunnel is open, go to your browser and type http://localhost:8000/ and you will be presented with the Swagger interface that Cromwell uses to launch workflows in server mode.
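To avoid copying the ssh line by hand, it can be pulled out of the stderr file with grep. This is a small sketch; the function name tunnel_command is an assumption, and it relies on the ssh line having the form printed by the script above.

```shell
#!/bin/bash
# Sketch: print the expanded ssh tunnel command found in a PBS stderr file.
# tunnel_command is a hypothetical helper name.
tunnel_command() {
    local stderr_file="$1"
    # Take the first line containing the port-forward command and
    # strip the indentation added by the here-document.
    grep -m 1 'ssh -N -f -L' "$stderr_file" | sed 's/^[[:space:]]*//'
}

# Example (XXXX is your job number):
# tunnel_command job_cromwell_server.eXXXX
```

The printed line can then be pasted into a terminal on your local machine.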
At this point, you will notice that the Swagger interface picks the files used to launch workflows from your local filesystem. For that reason I keep a copy of this GitHub repository on my local machine, since the WDL and JSON files are very light, and use them to launch workflows on xomics. Note that all file paths in the JSON files are relative to the filesystem of xomics, and Cromwell resolves them on xomics, not on your local machine. This is convenient because it lets you store the heavy data on the cluster rather than on your local machine.
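The REST endpoints behind Swagger can also be called from a terminal. The sketch below assumes curl is installed locally and the tunnel is open on localhost:8000; the function names submit_workflow and workflow_status are illustrative, not part of this repo. It uses Cromwell's POST /api/workflows/v1 endpoint with the workflowSource and workflowInputs form fields.

```shell
#!/bin/bash
# Sketch: submit a workflow and query its status through the SSH tunnel.
# Assumes the tunnel forwards Cromwell to localhost:8000 (override with CROMWELL_URL).
CROMWELL_URL="${CROMWELL_URL:-http://localhost:8000}"

submit_workflow() {
    local wdl="$1"
    local inputs="$2"
    # Both files are uploaded from the local machine; the paths written
    # inside the JSON are resolved by Cromwell on the cluster.
    curl -sS -X POST "${CROMWELL_URL}/api/workflows/v1" \
        -F "workflowSource=@${wdl}" \
        -F "workflowInputs=@${inputs}"
}

workflow_status() {
    local id="$1"   # workflow id returned by the submission
    curl -sS "${CROMWELL_URL}/api/workflows/v1/${id}/status"
}

# Example:
# submit_workflow mutect2.wdl mutect2.json
# workflow_status <id from the submission response>
```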

Finally, if you wish to use git and GitHub as the distributed version control system for organizing and sharing code, I would advise creating your own repository on GitHub or creating a branch of the current one, so you have different versions to compare against.

Happy coding ! 😎

