generated from sib-swiss/course_website_template
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathquality_control.qmd
281 lines (210 loc) · 13.3 KB
/
quality_control.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
---
title: Setup & quality control
---
## Presentations
{{< downloadthis assets/pdf/01_course_intro.pdf dname="01_introduction_course" label="Course introduction" icon="filetype-pdf" >}}
{{< downloadthis assets/pdf/02_introduction_cancer_variants.pdf dname="02_introduction_cancer_variants" label="Introduction to cancer variant analysis" icon="filetype-pdf" >}}
## Exercises
### First login
If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: `http://12.345.678.91:10002`) in your browser. This should result in the following page:
![](assets/images/vscode_login_page.png){width=300}
::: {.callout-note}
The link gives you access to a web version of [Visual Studio Code](https://code.visualstudio.com). This is a powerful code editor that you can also use as a local application on your computer.
:::
Type in the password that was provided to you by the teacher. Now let's open the terminal. You can do that by clicking **Application menu** > **Terminal** > **New Terminal**:
![](assets/images/open_terminal.gif){width=500}
For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the 'new file' button:
![](assets/images/new_file.gif){width=500}
### Setup
We will start the exercises with pre-aligned bam files. To start, download and extract the course_data folder:
```sh
wget https://cancer-variants-training.s3.eu-central-1.amazonaws.com/course_data.tar.gz
tar -xvzf course_data.tar.gz
rm course_data.tar.gz
```
Now, check out the directory `course_data` and see what's in there (e.g. with `tree`):
```
course_data/
├── alignments
│ ├── normal.recal.bai
│ ├── normal.recal.bam
│ ├── tumor.recal.bai
│ └── tumor.recal.bam
├── reference
│ ├── exome_regions.bed
│ ├── exome_regions.bed.interval_list
│ ├── ref_genome.dict
│ └── ref_genome.fa
└── resources
├── 1000G_phase1.snps.high_confidence.hg38.subset.vcf.gz
├── 1000G_phase1.snps.high_confidence.hg38.subset.vcf.gz.tbi
├── 1000g_pon.hg38.subset.vcf.gz
├── 1000g_pon.hg38.subset.vcf.gz.tbi
├── af-only-gnomad.hg38.subset.vcf.gz
├── af-only-gnomad.hg38.subset.vcf.gz.tbi
├── alphamissense
│ ├── AlphaMissense_hg38.tsv.gz
│ └── AlphaMissense_hg38.tsv.gz.tbi
├── clinvar
│ ├── clinvar.vcf.gz
│ └── clinvar.vcf.gz.tbi
├── CNVkit_data
│ └── cnvkit.tar.gz
├── exe1
│ ├── output.txt_summary_everything.html
│ └── output.txt_summary.html
├── Mills_and_1000G_gold_standard.indels.hg38.subset.vcf.gz
├── Mills_and_1000G_gold_standard.indels.hg38.subset.vcf.gz.tbi
├── Mutect2_data
│ ├── somatic.filtered.PASS.vcf.gz
│ ├── somatic.filtered.PASS.vcf.gz.tbi
│ └── variants.tar.gz
├── README.md.docx
├── refFlat.txt
└── revel
├── new_tabbed_revel_grch38.tsv.gz
└── new_tabbed_revel_grch38.tsv.gz.tbi
10 directories, 30 files
```
Showing us that we have three directories:
- `alignments`: containing bam files of tumor and normal
- `reference`: containing the genome fasta file and target intervals
- `resources`: containing amongst other variant files (`vcf`) from amongst other the [1000 genomes](https://www.internationalgenome.org/) project and [gnomAD](https://gnomad.broadinstitute.org/), and subsets of the clinvar and alphamissense databases.
The dataset we're working with is prepared by the developers of the [Precision medicine bioinformatics course]() by the Griffith lab. It is whole exome sequencing data of cell lines derived from a tumor of triple negative breast cancer ([HCC1395](https://www.cellosaurus.org/CVCL_1249)) and derived from normal tissue ([HCC1395 BL](https://www.cellosaurus.org/CVCL_1250)).
You are strongly encouraged to your work with scripts during the course, which you store in the directory `scripts`. Therefore create a scripts directory:
```sh
cd ~/project/
mkdir scripts
```
### Quality control of the bam files
First we use the information in the header to get information about the contents of the bam file and how it was generated. In order to use the tools installed for this course activate the mamba environment (do this every time you open a new terminal):
```sh
mamba activate ngs-tools
```
::: {.callout-important}
## Exercise
Check out the contents of the headers of the bam files with `samtools view -H`, and answer the following questions:
a. How is the bam file sorted?
b. Which chromosomes were used as reference?
c. How many read groups are in there, what is the readgroup ID and what is the sample name?
d. Which aligner was used?
e. If there are duplicates in there, how are they marked?
f. Which other programs were run to create the bam file?
:::
::: {.callout-tip collapse="true"}
## Answer
To get information on the bam header we run
```sh
cd ~/project/course_data/alignments
samtools view -H normal.recal.bam
```
Which returns:
```
@HD VN:1.6 SO:coordinate
@SQ SN:chr6 LN:170805979
@SQ SN:chr17 LN:83257441
@RG ID:HWI-ST466.C1TD1ACXX.normal LB:normal PL:ILLUMINA SM:normal PU:HWI-ST466.C1TD1ACXX
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /config/data/reference//ref_genome.fa /config/data/reads/normal_R1.fastq.gz /config/data/reads/normal_R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.21 CL:samtools sort
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.21 CL:samtools view -bh
@PG ID:MarkDuplicates VN:Version:4.5.0.0 CL:MarkDuplicates --INPUT /config/data/alignments/normal.rg.bam --OUTPUT /config/data/alignments/normal.rg.md.bam --METRICS_FILE /config/data/alignments/marked_dup_metrics_normal.txt ... PN:MarkDuplicates
@PG ID:GATK ApplyBQSR VN:4.5.0.0 CL:ApplyBQSR --output /config/project/data/alignments/.recal.bam --bqsr-recal-file /config/project/data/alignments/normal.bqsr.recal.table --input /config/project/data/alignments/normal.rg.md.bam ... PN:GATK ApplyBQSR
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.21 CL:samtools view -H tumor.recal.bam
```
a. From the `@HD` tag, we can see that the bam file is sorted by coordinate.
b. We have two lines starting with `@SQ`, which gives us information about the reference used. The reference genome used contains chromosomes 6 and 17.
c. We have one line starting with `@RG`. Specifying there is a read group with ID `HWI-ST466.C1TD1ACXX.normal` and sample name (`SM`) `normal`.
d. In order to know which programs were run to generate this bam file, we can use the `@PG` tags. Here, we see that the aligner used was `bwa mem`
e. The duplicates were marked with `MarkDuplicates`.
f. To sort and compress `samtools sort` and `samtools view` were used. For base quality score recalibration (BQSR) `GATK ApplyBQSR` was used.
:::
Now we extract information from the alignments themselves. We first have a look at few alignments and then get a summary of the alignments with `samtools flagstat`.
::: {.callout-important}
## Exercise
Check out the first few alignments with `samtools view` and `head`.
a. What is likely the read length used?
b. Are the reads single-end or paired-end?
c. Are the first reads tagged with the readgroup ID?
:::
::: {.callout-tip collapse="true"}
## Answer
The command:
```sh
samtools view normal.recal.bam | head -3
```
Returns the first three alignments:
```
HWI-ST466:135068617:C1TD1ACXX:7:1114:9733:82689 163 chr6 60001 60 100M = 60106 205 GATCTTATATAACTGTGAGATTAATCTCAGATAATGACACAAAATATAGTGAAGTTGGTAAGTTATTTAGTAAAGCTCATGAAAATTGTGCCCTCCATTC ;B@EDB@A@A@DDDGBGBFCBC?DBEEGCGCAADAGCECDCDDDBABAGBECEGCBHH@?EGBCBCDDBHBAEEHGDFBBHBDDDBCHBGEFFEGEDBBG MC:Z:100M MD:Z:100 PG:Z:MarkDuplicates RG:Z:HWI-ST466.C1TD1ACXX.normal NM:i:0 AS:i:100 XS:i:0
HWI-ST466:135068617:C1TD1ACXX:7:1303:2021:90688 99 chr6 60001 60 100M = 60104 203 GATCTTATATAACTGTGAGATTAATCTCAGATAATGACACAAAATATAGTGAAGTTGGTAAGTTATTTAGTAAAGCTCATGAAAATTGTGCCCTCCATTC >A@FDB@A?@>DDCE@FAF@?BAC>ECFCGBAADBEADBDCDDD@@A@FAFADD?ABF@@EF?@ABCDAFB?DCGEBEB@EBCDCBCHBFEFFDGFCBBG MC:Z:100M MD:Z:100 PG:Z:MarkDuplicates RG:Z:HWI-ST466.C1TD1ACXX.normal NM:i:0 AS:i:100 XS:i:0
HWI-ST466:135068617:C1TD1ACXX:7:2304:7514:30978 113 chr6 60001 60 2S98M = 61252 1194 TAGATCTTATATAACTGTGAGATTAATCTCAGATAATGACACAAAATATAGTGAAGTTGGTAAGTTATTTAGTAAAGCTCATGAAAATTGTGCCCTCCAT >DHABFEACBBBCBGCECHEGBEACBDHCGCHCBDBBFAEAGBCCB@BAEECGBEEDBED>@EDDABDDADE@CBDFFBFBCFCCBADBDBDFFFAFF?@ MC:Z:60S40M MD:Z:98 PG:Z:MarkDuplicates RG:Z:HWI-ST466.C1TD1ACXX.normal NM:i:0 AS:i:98 XS:i:0
```
a. At the CIGAR strings we see `100M` and `2S98M`, meaning that the original read had 100 base pairs. So it's likely a read length of 100 bp has been used.
b. This question might be a bit more challenging. In the 5th column we see an equal sign (`=`). This shows that the mate is mapped to same chromsome, so suggesting we are working with paired-end reads. Secondly, at the `@PG` header tag, we saw that `bwa mem` took two fastq files (`normal_R1.fastq.gz` and `normal_R2.fastq.gz`) as input. Lastly, we can use the sam flags in the second column to figure that out. If you paste the first flag, `163`, in the [explain sam flags website](https://broadinstitute.github.io/picard/explain-flags.html), we can see that the read is paired.
c. Yes, the read group tag starts with `RG`, and is specified for all three alignments.
:::
::: {.callout-important}
## Exercise
Create a script called `01_samtools_flagstat.sh` and store it in `~/project/scripts`. Use `samtools flagstat` to summarize the alignments in both bam files. How many reads are in the bam files? And how many alignments are marked as duplicate?
:::
::: {.callout-tip collapse="true"}
## Answer
```sh
cd ~/project/course_data/alignments
for sample in tumor normal
do
samtools flagstat "$sample".recal.bam > "$sample".recal.bam.flagstat
done
```
returns for `normal.recal.bam`:
```
12744793 + 0 in total (QC-passed reads + QC-failed reads)
12733034 + 0 primary
0 + 0 secondary
11759 + 0 supplementary
1397598 + 0 duplicates
1397598 + 0 primary duplicates
12671529 + 0 mapped (99.43% : N/A)
12659770 + 0 primary mapped (99.42% : N/A)
12733034 + 0 paired in sequencing
6366517 + 0 read1
6366517 + 0 read2
12515974 + 0 properly paired (98.30% : N/A)
12586532 + 0 with itself and mate mapped
73238 + 0 singletons (0.58% : N/A)
20414 + 0 with mate mapped to a different chr
13194 + 0 with mate mapped to a different chr (mapQ>=5)
```
Showing us that we have 12,744,793 reads, and all those were paired. Of the alignments, 1,397,598 were marked as duplicate. If we do the same for the tumor bam file we see that it has 16,674,562 reads and 1,871,521 alignment marked as duplicate.
:::
So, all looks good until now. We have many reads and high alignment rates. The bam files seem to be ready for variant analysis, because they have read groups, duplicates are marked and they are sorted by coordinate. However, since we are working with whole exome sequencing (WES) data, we would like to know whether the reads align to the target regions and what kind of coverage we have. For this, we use `gatk CollectHsMetrics`.
::: {.callout-important}
## Exercise
Create a script called `02_gatk_collecthsmetrics.sh` in `~/project/scripts` to run `gatk CollectHsMetrics` on the two bam files. Here's an example:
```sh
ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources
for sample in tumor normal
do
gatk CollectHsMetrics \
-I "$ALIGNDIR"/"$sample".recal.bam \
-O "$ALIGNDIR"/"$sample".recal.bam_hs_metrics.txt \
-R "$REFDIR"/ref_genome.fa \
--BAIT_INTERVALS "$REFDIR"/exome_regions.bed.interval_list \
--TARGET_INTERVALS "$REFDIR"/exome_regions.bed.interval_list
done
```
The produced reports are not so easy to read. To parse the metrics file, we run `multiqc` on the directory:
```sh
cd ~/project/course_data/alignments
multiqc .
```
Download the multiqc report (`multiqc_report.html`) by right-clicking on the file and select 'Download'. Check out the hybrid-selection metrics. How did the hybrid capture go? Are most bases on-target? What kind of coverages can we expect in the target regions?
:::
::: {.callout-note}
You can find the documentation on the different metrics [here](https://broadinstitute.github.io/picard/picard-metric-definitions.html#HsMetrics). Note that the descriptions of the metrics are not always correct. The metrics with `ON_BAIT` contain information including duplicates, while `ON_TARGET` without duplicates. This has been an issue [since 2020](https://github.com/broadinstitute/picard/issues/1494), and hasn't been worked on unfortunately.
:::
::: {.callout-tip collapse="true"}
## Answer
We see that we have a fold enrichment of around 22 and about 45% usable bases on-target. About 90% were 'selected bases', which means bases aligning within 250 bp of the target region. We're looking at a de-duplicated coverage of 84x for the normal sample and 114.3x for the tumor sample. So, on average, we seem to have acceptable coverages for both samples.
:::