Fastq input strategy to BALSAMIC #1098
Closed
mathiasbio started this conversation in Polls
Defining paths used in discussion:
background
Currently (v11.2.0), fastqs from different lanes are concatenated by CG and written to the fastqinputfolder, for example:
HNN57DRX2_ACCXXXXX_S12_L001_R1_001.fastq.gz
HNN57DRX2_ACCXXXXX_S12_L001_R2_001.fastq.gz
HNN57DRX2_ACCXXXXX_S12_L002_R1_001.fastq.gz
HNN57DRX2_ACCXXXXX_S12_L002_R2_001.fastq.gz
---------->>>
concatenated_ACCXXXXX_XXXXXX_R_1.fastq.gz
concatenated_ACCXXXXX_XXXXXX_R_2.fastq.gz
With pull request #1090, this concatenation will be removed, and the start of the pipeline will instead be parallelised to run per lane. This primarily affects trimming and mapping.
When the concatenation step is removed from CG, the intention is for CG to instead create links in the fastqinputfolder that point to the fastq-files in housekeeper. CG gives these links a new name that differs from the original fastq-name, for example:
original: HYMT5DSX3_[LIMS-ID]_S297_L001_R1_001.fastq.gz
link: 1_171015_HYMT5DSX3_[LIMS-ID]_XXXXXX_R_1.fastq.gz
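To make the link naming concrete, here is a minimal Python sketch that splits a link name into its parts. The field names (lane, date, flowcell, sample, index, read) are my reading of the example above and are assumptions, not a documented CG specification; `parse_link_name` is a hypothetical helper.

```python
import re

# Assumed layout of a CG link name (an interpretation of the example above):
#   <lane>_<date>_<flowcell>_<sample>_<index>_R_<read>.fastq.gz
LINK_NAME = re.compile(
    r"^(?P<lane>\d+)_(?P<date>\d+)_(?P<flowcell>[A-Z0-9]+)_"
    r"(?P<sample>[^_]+)_(?P<index>[^_]+)_R_(?P<read>[12])\.fastq\.gz$"
)


def parse_link_name(name: str) -> dict:
    """Split a link name into its assumed components (hypothetical helper)."""
    match = LINK_NAME.match(name)
    if match is None:
        raise ValueError(f"Unexpected fastq link name: {name}")
    return match.groupdict()


fields = parse_link_name("1_171015_HYMT5DSX3_ACCXXXXX_XXXXXX_R_1.fastq.gz")
print(fields["lane"], fields["flowcell"], fields["sample"], fields["read"])
```

A name that does not follow this layout, such as the old concatenated files, raises a `ValueError` instead of being silently accepted.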
With the update in PR #1069, the path to the fastq-files in housekeeper will be added to the singularity bind path so that the rules in snakemake have access to them.
the question
What we are discussing now is the best strategy for how to input and use this fastq info in the "map per lane" PR #1090.
As an example, below is a list of fastq-links for a T / N case in the fastqinputfolder:
1_171015_HYMT5DSX3_[LIMS-ID-TUMOR]XXXXXX_R_1.fastq.gz
2_171015_HYMT5DSX3[LIMS-ID-TUMOR]XXXXXX_R_1.fastq.gz
3_171015_HYMT5DSX3[LIMS-ID-TUMOR]XXXXXX_R_1.fastq.gz
4_171015_HYMT5DSX3[LIMS-ID-TUMOR]XXXXXX_R_1.fastq.gz
1_171015_HYMT5DSX3[LIMS-ID-TUMOR]XXXXXX_R_2.fastq.gz
2_171015_HYMT5DSX3[LIMS-ID-TUMOR]XXXXXX_R_2.fastq.gz
3_171015_HYMT5DSX3[LIMS-ID-TUMOR]XXXXXX_R_2.fastq.gz
4_171015_HYMT5DSX3[LIMS-ID-TUMOR]XXXXXX_R_2.fastq.gz
1_171015_HYMYKDSX3[LIMS-ID-NORMAL]XXXXXX_R_1.fastq.gz
2_171015_HYMYKDSX3[LIMS-ID-NORMAL]XXXXXX_R_1.fastq.gz
3_171015_HYMYKDSX3[LIMS-ID-NORMAL]XXXXXX_R_1.fastq.gz
4_171015_HYMYKDSX3[LIMS-ID-NORMAL]XXXXXX_R_1.fastq.gz
1_171015_HYMYKDSX3[LIMS-ID-NORMAL]XXXXXX_R_2.fastq.gz
2_171015_HYMYKDSX3[LIMS-ID-NORMAL]XXXXXX_R_2.fastq.gz
3_171015_HYMYKDSX3[LIMS-ID-NORMAL]XXXXXX_R_2.fastq.gz
4_171015_HYMYKDSX3[LIMS-ID-NORMAL]_XXXXXX_R_2.fastq.gz
option 1
As currently implemented, the path to the fastq-files (fastqinputfolder) is passed to BALSAMIC when you configure the case.
It is then included in the [case].json file, which BALSAMIC uses to configure the workflow.
In the balsamic.smk file a function is imported:
It takes as input the path to the fastqinputdir and the samplename, and returns a dictionary populated with "fastqpatterns": the names of the fastq-files with the fastq-suffix and read-number removed. Each pattern holds the paths to the fwd and rev reads for that sample. For example:
["fastqpair_patterns"][4_171015_HYMYKDSX3_{LIMS-ID-NORMAL}_XXXXXX_R]
["fwd"][fastqpath]
["rev"][fastqpath]
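A minimal Python sketch of a function that builds a dictionary of this shape is shown here. It is not the function imported in balsamic.smk; `fastq_patterns_for_sample` is a hypothetical name, and the sketch assumes every link name contains `_<samplename>_` and ends in `_R_1.fastq.gz` or `_R_2.fastq.gz`.

```python
from pathlib import Path


def fastq_patterns_for_sample(fastq_dir: str, sample_name: str) -> dict:
    """Collect fwd/rev fastq paths per fastqpattern for one sample.

    Hypothetical helper: a fastqpattern is the file name with the read
    number and the fastq-suffix removed.
    """
    patterns = {}
    # Assumes sample_name has no glob special characters such as [ or *.
    for fwd in sorted(Path(fastq_dir).glob(f"*_{sample_name}_*_R_1.fastq.gz")):
        pattern = fwd.name[: -len("_1.fastq.gz")]
        rev = fwd.with_name(f"{pattern}_2.fastq.gz")
        if not rev.exists():
            raise FileNotFoundError(f"Missing reverse read for {fwd.name}")
        patterns[pattern] = {"fwd": str(fwd), "rev": str(rev)}
    return {"fastqpair_patterns": patterns}
```

Because the lookup only depends on the sample name and the read suffix, any number of lanes or flowcells per sample is picked up without the rules needing to know how the rest of the file name is built.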
This information is returned to a sample_dict in balsamic.smk and used in the rules as a fastqpattern wildcard.
Example trimming (start):
Example aligning per lane:
Example unifying into sample wildcard in dedup step:
As a side-note, this preparation of the fastqpattern dict can be moved to the previous CONFIG step
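The step from per-lane files back to one file per sample can be sketched in plain Python as the kind of input function a dedup rule could call. This is an illustration only: `lane_bams_for_sample`, the layout of `sample_dict`, and the BAM file naming are all assumptions, not the rules from the PR.

```python
def lane_bams_for_sample(sample_dict: dict, sample: str, bam_dir: str) -> list:
    """Return the per-lane BAM paths that belong to one sample.

    Assumes sample_dict[sample]["fastqpair_patterns"] holds one entry per
    lane, and that per-lane alignment wrote <bam_dir>/<sample>_<pattern>.bam
    (hypothetical naming).
    """
    patterns = sample_dict[sample]["fastqpair_patterns"]
    return [f"{bam_dir}/{sample}_{pattern}.bam" for pattern in sorted(patterns)]


# Example with two lanes for one sample; the fastq paths are placeholders.
sample_dict = {
    "ACCXXXXX": {
        "fastqpair_patterns": {
            "1_171015_HYMT5DSX3_ACCXXXXX_XXXXXX_R": {"fwd": "...", "rev": "..."},
            "2_171015_HYMT5DSX3_ACCXXXXX_XXXXXX_R": {"fwd": "...", "rev": "..."},
        }
    }
}
for bam in lane_bams_for_sample(sample_dict, "ACCXXXXX", "bam"):
    print(bam)
```

The per-lane rules keep the fastqpattern wildcard, while the dedup rule only needs the sample wildcard and asks this function for all matching inputs.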
option 2
Instead of relying on the fastqinputfolder and sample-name:
CG or the BALSAMIC config step would provide, in the [case].json configfile passed to balsamic.smk, an extended dictionary with information about flowcells and lanes. The snakemake workflow can then parse this dictionary and combine it with some hard-coded components of an expected fastq-name.
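As a rough illustration of this idea, the Python sketch below builds the expected fastq paths from flowcell and lane information. The config layout, its key names, and `expected_fastq_pairs` are all hypothetical and chosen only for the example; the hard-coded part is the assumed link-name template `<lane>_<date>_<flowcell>_<sample>_<index>_R_<read>.fastq.gz`.

```python
from pathlib import Path

# Hypothetical [case].json fragment; key names are assumptions for illustration.
case_config = {
    "analysis": {"fastq_path": "/path/to/fastqinputfolder"},
    "samples": {
        "ACCXXXXX": {
            "type": "tumor",
            "flowcells": {
                "HYMT5DSX3": {"date": "171015", "index": "XXXXXX", "lanes": [1, 2, 3, 4]}
            },
        }
    },
}


def expected_fastq_pairs(config: dict, sample: str) -> list:
    """Build the expected fwd/rev fastq paths per flowcell and lane."""
    fastq_dir = Path(config["analysis"]["fastq_path"])
    pairs = []
    for flowcell, info in config["samples"][sample]["flowcells"].items():
        for lane in info["lanes"]:
            # Hard-coded name template: this is the part option 2 relies on.
            stem = f"{lane}_{info['date']}_{flowcell}_{sample}_{info['index']}_R"
            pairs.append(
                {
                    "flowcell": flowcell,
                    "lane": lane,
                    "fwd": str(fastq_dir / f"{stem}_1.fastq.gz"),
                    "rev": str(fastq_dir / f"{stem}_2.fastq.gz"),
                }
            )
    return pairs


for pair in expected_fastq_pairs(case_config, "ACCXXXXX"):
    print(pair["lane"], pair["fwd"])
```

The trade-off is visible in the `stem` line: flowcell and lane become explicit values the rules can use, but any change to the link naming in CG has to be mirrored in the template.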
The [case].json configfile would be modified something like this:
Now it looks like this:
Into:
At the moment I am not sure how to implement this in snakemake, but I think one strategy could be to create a function like this:
Keep this rule:
Modify this:
And this to something like:
comment
Bear in mind these are examples, and either option could be changed code-wise. I think the major decision is whether we want to hard-code fastq-names in the rules or rely on a larger, more expansive wildcard (fastqpatterns).