08 Add Read Group Information

Neranjan Perera edited this page Dec 6, 2018 · 3 revisions

In this section we add metadata about the sample. Metadata is very important for downstream analysis of your data, and this information is visible to the GATK analysis tools. Here we use a minimal set of read group tags for the samples; the important ones are described below.

In a SAM/BAM file, read group information is stored in header lines that begin with @RG, which stands for "read group".

  • ID : a globally unique string that identifies this run. It is usually linked to the lane on which the data was run.

  • SM : the name of the DNA sample. This is the sample identifier and the most important tag. GATK performs all analysis by sample, and this tag determines which sample a read group belongs to.

  • PL : the platform used, e.g. "Illumina", "Pacbio", "iontorrent".

  • LB : an identifier of the library from which this DNA was sequenced. This field is important for future reference and quality control. If errors are traced to DNA preparation, it links the data to the laboratory preparation step.

  • PU : the platform unit identifier for the run. This identifier makes it possible to trace the data back to the machine, time, and place of the run. Usually it is a unique flowcell-barcode-lane identifier.
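As a rough illustration of how ID and PU can be tied to the flowcell and lane, the sketch below pulls both fields out of an Illumina-style read name. The read name is made up for this example, and the `flowcell.lane` naming is one common convention rather than a requirement; it assumes the original Illumina read names (instrument:run:flowcell:lane:tile:x:y) are still present in your FASTQ files, which is often not the case for data downloaded from SRA.

```shell
# Hypothetical Illumina read name (not taken from the tutorial data).
# Colon-separated fields: instrument:run:flowcell:lane:tile:x:y
header='@HWI-ST1234:100:C1ABCACXX:3:1101:1234:5678 1:N:0:ATCACG'

# Strip the leading '@' and everything after the first space
name=${header#@}
name=${name%% *}

# Field 3 is the flowcell ID, field 4 is the lane
flowcell=$(printf '%s\n' "$name" | cut -d: -f3)
lane=$(printf '%s\n' "$name" | cut -d: -f4)

# Assumed convention: use flowcell.lane for both ID and PU
RGID="${flowcell}.${lane}"
RGPU="${flowcell}.${lane}"
echo "RGID=${RGID} RGPU=${RGPU}"
```

Values built this way could be passed to the RGID and RGPU arguments of the Picard command in place of fixed placeholders.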

To learn more about SAM tags, please refer to the SAM format specification.

The following Picard tools command will add the read group information to each sample.

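# Load Picard and point Java's temporary directory at scratch space
module load picard/2.9.2
export _JAVA_OPTIONS=-Djava.io.tmpdir=/scratch

# Move into the output directory for this step
cd ../${d4}

# Add read group tags to the _nodup.bam file and index the result
java -jar $PICARD AddOrReplaceReadGroups \
        INPUT=../${d3}/${INPUT_FILE_NAME}_nodup.bam \
        OUTPUT=${INPUT_FILE_NAME}_rg.bam \
        RGID=group1 \
        RGSM=${INPUT_FILE_NAME} \
        RGPL=illumina \
        RGLB=1 \
        RGPU=barcode \
        CREATE_INDEX=True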

The above command adds read groups to each sample and creates the following BAM files and their indexes:

readgroup/
├── SRR1517848_rg.bai
├── SRR1517848_rg.bam
├── SRR1517878_rg.bai
├── SRR1517878_rg.bam
├── SRR1517884_rg.bai
├── SRR1517884_rg.bam
├── SRR1517906_rg.bai
├── SRR1517906_rg.bam
├── SRR1517991_rg.bai
├── SRR1517991_rg.bam
├── SRR1518011_rg.bai
├── SRR1518011_rg.bam
├── SRR1518158_rg.bai
├── SRR1518158_rg.bam
├── SRR1518253_rg.bai
└── SRR1518253_rg.bam
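One way to apply the same command to several samples is a simple loop. The sketch below is a dry run: it only prints the command for each sample so it can be inspected before anything is executed. The helper name `print_rg_cmd`, the `IN_DIR` variable (standing in for `../${d3}`), and the fallback value of `PICARD` are assumptions made for this example; the read group values mirror the command shown earlier.

```shell
# PICARD is normally set by `module load picard`; the fallback is a placeholder
PICARD=${PICARD:-picard.jar}
# Assumption: directory holding the *_nodup.bam files (stands in for ../${d3})
IN_DIR=${IN_DIR:-../nodup}

# Print (do not run) the AddOrReplaceReadGroups command for one sample.
# $1 = sample name, used for the file names and for RGSM
print_rg_cmd() {
    echo "java -jar $PICARD AddOrReplaceReadGroups" \
        "INPUT=${IN_DIR}/${1}_nodup.bam" \
        "OUTPUT=${1}_rg.bam" \
        "RGID=group1 RGSM=${1} RGPL=illumina RGLB=1 RGPU=barcode" \
        "CREATE_INDEX=True"
}

# Subset of the sample names listed above
for s in SRR1517848 SRR1517878 SRR1517884; do
    print_rg_cmd "$s"
done
```

Once the printed commands look right, they can be run by piping the loop's output to `sh`.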

How to check that the reads have read group information?

You can check this quickly with samtools and Unix commands:
samtools view -H SRR1517848_rg.bam | grep '^@RG'
which will give you:

@RG	ID:group1	LB:1	PL:illumina	SM:SRR1517848	PU:barcode

The presence of @RG lines in the header indicates that read groups are defined. Each read group has an SM tag, indicating the sample from which the reads belonging to that read group originate.
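If you only need the sample name, the SM value can be pulled out of an @RG line with awk. The sketch below works on a copy of the header line shown above, so it runs without samtools; in practice the same awk command could be placed at the end of the `samtools view -H ... | grep '^@RG'` pipeline. The variable names are chosen for this example only.

```shell
# An @RG header line (tab-separated), as printed by:
#   samtools view -H SRR1517848_rg.bam | grep '^@RG'
rg_line=$(printf '@RG\tID:group1\tLB:1\tPL:illumina\tSM:SRR1517848\tPU:barcode')

# Split on tabs, find the field starting with "SM:", print its value
sample=$(printf '%s\n' "$rg_line" |
    awk -F'\t' '{ for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) print substr($i, 4) }')

echo "$sample"
```

Searching for the `SM:` prefix rather than a fixed column keeps the lookup independent of the order in which the tags appear in the header.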

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,