quality_control.qmd

---
title: Setup & quality control
---

## Presentations

{{< downloadthis assets/pdf/01_course_intro.pdf dname="01_introduction_course" label="Course introduction" icon="filetype-pdf" >}}

{{< downloadthis assets/pdf/02_introduction_cancer_variants.pdf dname="02_introduction_cancer_variants" label="Introduction to cancer variant analysis" icon="filetype-pdf" >}}

## Exercises 

### First login

If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: `http://12.345.678.91:10002`) in your browser. This should result in the following page:

![](assets/images/vscode_login_page.png){width=300}


::: {.callout-note}
The link gives you access to a web version of [Visual Studio Code](https://code.visualstudio.com). This is a powerful code editor that you can also use as a local application on your computer. 
::: 

Type in the password that was provided to you by the teacher. Now let's open the terminal. You can do that by clicking **Application menu** > **Terminal** > **New Terminal**:

![](assets/images/open_terminal.gif){width=500}


For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the 'new file' button:

![](assets/images/new_file.gif){width=500}

### Setup

We will start the exercises with pre-aligned bam files. To start, download and extract the course_data folder:

```sh
wget https://cancer-variants-training.s3.eu-central-1.amazonaws.com/course_data.tar.gz
tar -xvzf course_data.tar.gz
rm course_data.tar.gz
```

Now, check out the directory `course_data` and see what's in there (e.g. with `tree`):

```
course_data/
├── alignments
│   ├── normal.recal.bai
│   ├── normal.recal.bam
│   ├── tumor.recal.bai
│   └── tumor.recal.bam
├── reference
│   ├── exome_regions.bed
│   ├── exome_regions.bed.interval_list
│   ├── ref_genome.dict
│   └── ref_genome.fa
└── resources
    ├── 1000G_phase1.snps.high_confidence.hg38.subset.vcf.gz
    ├── 1000G_phase1.snps.high_confidence.hg38.subset.vcf.gz.tbi
    ├── 1000g_pon.hg38.subset.vcf.gz
    ├── 1000g_pon.hg38.subset.vcf.gz.tbi
    ├── af-only-gnomad.hg38.subset.vcf.gz
    ├── af-only-gnomad.hg38.subset.vcf.gz.tbi
    ├── alphamissense
    │   ├── AlphaMissense_hg38.tsv.gz
    │   └── AlphaMissense_hg38.tsv.gz.tbi
    ├── clinvar
    │   ├── clinvar.vcf.gz
    │   └── clinvar.vcf.gz.tbi
    ├── CNVkit_data
    │   └── cnvkit.tar.gz
    ├── exe1
    │   ├── output.txt_summary_everything.html
    │   └── output.txt_summary.html
    ├── Mills_and_1000G_gold_standard.indels.hg38.subset.vcf.gz
    ├── Mills_and_1000G_gold_standard.indels.hg38.subset.vcf.gz.tbi
    ├── Mutect2_data
    │   ├── somatic.filtered.PASS.vcf.gz
    │   ├── somatic.filtered.PASS.vcf.gz.tbi
    │   └── variants.tar.gz
    ├── README.md.docx
    ├── refFlat.txt
    └── revel
        ├── new_tabbed_revel_grch38.tsv.gz
        └── new_tabbed_revel_grch38.tsv.gz.tbi

10 directories, 30 files
```

Showing us that we have three directories:

- `alignments`: containing bam files of tumor and normal
- `reference`: containing the genome fasta file and target intervals
- `resources`: containing amongst other variant files (`vcf`) from amongst other the [1000 genomes](https://www.internationalgenome.org/) project and [gnomAD](https://gnomad.broadinstitute.org/), and subsets of the clinvar and alphamissense databases. 

The dataset we're working with is prepared by the developers of the [Precision medicine bioinformatics course]() by the Griffith lab. It is whole exome sequencing data of cell lines derived from a tumor of triple negative breast cancer ([HCC1395](https://www.cellosaurus.org/CVCL_1249)) and derived from normal tissue ([HCC1395 BL](https://www.cellosaurus.org/CVCL_1250)). 

You are strongly encouraged to your work with scripts during the course, which you store in the directory `scripts`. Therefore create a scripts directory:

```sh
cd ~/project/
mkdir scripts 
```

### Quality control of the bam files

First we use the information in the header to get information about the contents of the bam file and how it was generated. In order to use the tools installed for this course activate the mamba environment (do this every time you open a new terminal):

```sh
mamba activate ngs-tools 
```

::: {.callout-important}
## Exercise

Check out the contents of the headers of the bam files with `samtools view -H`, and answer the following questions:

a. How is the bam file sorted?
b. Which chromosomes were used as reference?
c. How many read groups are in there, what is the readgroup ID and what is the sample name? 
d. Which aligner was used? 
e. If there are duplicates in there, how are they marked?
f. Which other programs were run to create the bam file? 

::: 

::: {.callout-tip collapse="true"}
## Answer

To get information on the bam header we run 

```sh
cd ~/project/course_data/alignments
samtools view -H normal.recal.bam
```

Which returns:

```
@HD     VN:1.6  SO:coordinate
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr17        LN:83257441
@RG     ID:HWI-ST466.C1TD1ACXX.normal   LB:normal       PL:ILLUMINA     SM:normal       PU:HWI-ST466.C1TD1ACXX
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem /config/data/reference//ref_genome.fa /config/data/reads/normal_R1.fastq.gz /config/data/reads/normal_R2.fastq.gz
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.21 CL:samtools sort
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.21 CL:samtools view -bh
@PG     ID:MarkDuplicates       VN:Version:4.5.0.0      CL:MarkDuplicates --INPUT /config/data/alignments/normal.rg.bam --OUTPUT /config/data/alignments/normal.rg.md.bam --METRICS_FILE /config/data/alignments/marked_dup_metrics_normal.txt ...   PN:MarkDuplicates
@PG     ID:GATK ApplyBQSR       VN:4.5.0.0      CL:ApplyBQSR --output /config/project/data/alignments/.recal.bam --bqsr-recal-file /config/project/data/alignments/normal.bqsr.recal.table --input /config/project/data/alignments/normal.rg.md.bam ...        PN:GATK ApplyBQSR
@PG     ID:samtools.2   PN:samtools     PP:samtools.1   VN:1.21 CL:samtools view -H tumor.recal.bam
```

a. From the `@HD` tag, we can see that the bam file is sorted by coordinate.
b. We have two lines starting with `@SQ`, which gives us information about the reference used. The reference genome used contains chromosomes 6 and 17.
c. We have one line starting with `@RG`. Specifying there is a read group with ID `HWI-ST466.C1TD1ACXX.normal` and sample name (`SM`) `normal`. 
d. In order to know which programs were run to generate this bam file, we can use the `@PG` tags. Here, we see that the aligner used was `bwa mem` 
e. The duplicates were marked with `MarkDuplicates`. 
f. To sort and compress `samtools sort` and `samtools view` were used. For base quality score recalibration (BQSR) `GATK ApplyBQSR` was used. 

:::

Now we extract information from the alignments themselves. We first have a look at few alignments and then get a summary of the alignments with `samtools flagstat`. 

::: {.callout-important}
## Exercise

Check out the first few alignments with `samtools view`  and `head`. 

a. What is likely the read length used? 
b. Are the reads single-end or paired-end? 
c. Are the first reads tagged with the readgroup ID?

::: 

::: {.callout-tip collapse="true"}
## Answer

The command:

```sh
samtools view normal.recal.bam | head -3
```

Returns the first three alignments:

```
HWI-ST466:135068617:C1TD1ACXX:7:1114:9733:82689 163     chr6    60001   60      100M    =       60106   205     GATCTTATATAACTGTGAGATTAATCTCAGATAATGACACAAAATATAGTGAAGTTGGTAAGTTATTTAGTAAAGCTCATGAAAATTGTGCCCTCCATTC     ;B@EDB@A@A@DDDGBGBFCBC?DBEEGCGCAADAGCECDCDDDBABAGBECEGCBHH@?EGBCBCDDBHBAEEHGDFBBHBDDDBCHBGEFFEGEDBBG     MC:Z:100M       MD:Z:100        PG:Z:MarkDuplicates     RG:Z:HWI-ST466.C1TD1ACXX.normal NM:i:0  AS:i:100        XS:i:0
HWI-ST466:135068617:C1TD1ACXX:7:1303:2021:90688 99      chr6    60001   60      100M    =       60104   203     GATCTTATATAACTGTGAGATTAATCTCAGATAATGACACAAAATATAGTGAAGTTGGTAAGTTATTTAGTAAAGCTCATGAAAATTGTGCCCTCCATTC     >A@FDB@A?@>DDCE@FAF@?BAC>ECFCGBAADBEADBDCDDD@@A@FAFADD?ABF@@EF?@ABCDAFB?DCGEBEB@EBCDCBCHBFEFFDGFCBBG     MC:Z:100M       MD:Z:100        PG:Z:MarkDuplicates     RG:Z:HWI-ST466.C1TD1ACXX.normal NM:i:0  AS:i:100        XS:i:0
HWI-ST466:135068617:C1TD1ACXX:7:2304:7514:30978 113     chr6    60001   60      2S98M   =       61252   1194    TAGATCTTATATAACTGTGAGATTAATCTCAGATAATGACACAAAATATAGTGAAGTTGGTAAGTTATTTAGTAAAGCTCATGAAAATTGTGCCCTCCAT     >DHABFEACBBBCBGCECHEGBEACBDHCGCHCBDBBFAEAGBCCB@BAEECGBEEDBED>@EDDABDDADE@CBDFFBFBCFCCBADBDBDFFFAFF?@     MC:Z:60S40M     MD:Z:98 PG:Z:MarkDuplicates     RG:Z:HWI-ST466.C1TD1ACXX.normal NM:i:0  AS:i:98 XS:i:0
```

a. At the CIGAR strings we see `100M` and `2S98M`, meaning that the original read had 100 base pairs. So it's likely a read length of 100 bp has been used. 
b. This question might be a bit more challenging. In the 5th column we see an equal sign (`=`). This shows that the mate is mapped to same chromsome, so suggesting we are working with paired-end reads. Secondly, at the `@PG` header tag, we saw that `bwa mem` took two fastq files (`normal_R1.fastq.gz` and `normal_R2.fastq.gz`) as input. Lastly, we can use the sam flags in the second column to figure that out. If you paste the first flag, `163`, in the [explain sam flags website](https://broadinstitute.github.io/picard/explain-flags.html), we can see that the read is paired.
c. Yes, the read group tag starts with `RG`, and is specified for all three alignments. 
:::

::: {.callout-important}
## Exercise

Create a script called `01_samtools_flagstat.sh` and store it in `~/project/scripts`. Use `samtools flagstat` to summarize the alignments in both bam files. How many reads are in the bam files? And how many alignments are marked as duplicate? 

::: 

::: {.callout-tip collapse="true"}
## Answer

```sh 
cd ~/project/course_data/alignments

for sample in tumor normal
do
    samtools flagstat "$sample".recal.bam > "$sample".recal.bam.flagstat
done
```

returns for `normal.recal.bam`:

```
12744793 + 0 in total (QC-passed reads + QC-failed reads)
12733034 + 0 primary
0 + 0 secondary
11759 + 0 supplementary
1397598 + 0 duplicates
1397598 + 0 primary duplicates
12671529 + 0 mapped (99.43% : N/A)
12659770 + 0 primary mapped (99.42% : N/A)
12733034 + 0 paired in sequencing
6366517 + 0 read1
6366517 + 0 read2
12515974 + 0 properly paired (98.30% : N/A)
12586532 + 0 with itself and mate mapped
73238 + 0 singletons (0.58% : N/A)
20414 + 0 with mate mapped to a different chr
13194 + 0 with mate mapped to a different chr (mapQ>=5)
```

Showing us that we have 12,744,793 reads, and all those were paired. Of the alignments, 1,397,598 were marked as duplicate. If we do the same for the tumor bam file we see that it has 16,674,562 reads and 1,871,521 alignment marked as duplicate. 

:::

So, all looks good until now. We have many reads and high alignment rates. The bam files seem to be ready for variant analysis, because they have read groups, duplicates are marked and they are sorted by coordinate. However, since we are working with whole exome sequencing (WES) data, we would like to know whether the reads align to the target regions and what kind of coverage we have. For this, we use `gatk CollectHsMetrics`. 

::: {.callout-important}
## Exercise

Create a script called `02_gatk_collecthsmetrics.sh` in `~/project/scripts` to run `gatk CollectHsMetrics` on the two bam files. Here's an example:

```sh
ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources

for sample in tumor normal
do
    gatk CollectHsMetrics \
    -I "$ALIGNDIR"/"$sample".recal.bam \
    -O "$ALIGNDIR"/"$sample".recal.bam_hs_metrics.txt \
    -R "$REFDIR"/ref_genome.fa \
    --BAIT_INTERVALS "$REFDIR"/exome_regions.bed.interval_list \
    --TARGET_INTERVALS "$REFDIR"/exome_regions.bed.interval_list
done 
```
The produced reports are not so easy to read. To parse the metrics file, we run `multiqc` on the directory:

```sh 
cd ~/project/course_data/alignments
multiqc .
```

Download the multiqc report (`multiqc_report.html`) by right-clicking on the file and select 'Download'. Check out the hybrid-selection metrics. How did the hybrid capture go? Are most bases on-target? What kind of coverages can we expect in the target regions?

:::

::: {.callout-note}
You can find the documentation on the different metrics [here](https://broadinstitute.github.io/picard/picard-metric-definitions.html#HsMetrics). Note that the descriptions of the metrics are not always correct. The metrics with `ON_BAIT` contain information including duplicates, while `ON_TARGET` without duplicates. This has been an issue [since 2020](https://github.com/broadinstitute/picard/issues/1494), and hasn't been worked on unfortunately. 
::: 

::: {.callout-tip collapse="true"}
## Answer

We see that we have a fold enrichment of around 22 and about 45% usable bases on-target. About 90% were 'selected bases', which means bases aligning within 250 bp of the target region. We're looking at a de-duplicated coverage of 84x for the normal sample and 114.3x for the tumor sample. So, on average, we seem to have acceptable coverages for both samples. 
:::