Skip to content

Lab 02: Trimming

Beant Kapoor edited this page Aug 14, 2023 · 3 revisions

Trimming

Today we'll use the program Cutadapt to trim portions of sequences that have potential adapter contamination.

Helpful links:

Note:

An overview of how we'll manage the Arabidopsis data in our directories is shown below:

raw_data/
    reads/
         SRR17062759_1.fastq.gz
         SRR17062759_2.fastq.gz
         SRR17062760_1.fastq.gz
         SRR17062760_2.fastq.gz
         SRR17062762_1.fastq.gz
         SRR17062762_2.fastq.gz
         SRR17062763_1.fastq.gz
         SRR17062763_2.fastq.gz
    reference/
          Athaliana_447_Araport11.gene_exons.gff3
          Athaliana_447_TAIR10.fa

analysis/
    username/
        01_fastqc/
        02_trimming/ (we'll make this today)

Steps:

1. Consider the metadata for your sequences and their library prep

Original publication for our dataset:

Illumina provides all adapter sequences: https://support-docs.illumina.com/SHARE/AdapterSeq/Content/SHARE/AdapterSeq/AdapterSequencesIntro.htm

Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

2. Prepare your project directory

Go to your personal project directory.

Change to the raw_data directory in your analysis directory, then create a trimming_example subdirectory:

# go to your personal analysis directory

cd analysis
mkdir 02_trimming
cd 02_trimming

Create symbolic links to the raw data:

ln -s /pickett_shared/teaching/RNASeq_workshop/raw_data/reads/SRR17062759* .

3. Explore the files manually for adapters and quality

First, let's manually look to see if Illumina adapter sequences are present. Run grep using the "Read 1" adapter:

zcat SRR17062759_1.fastq.gz | grep "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

Congratulations! You've found adapter contamination! Although these reads have been largely adapter trimmed before they were uploaded to the SRA database, some contamination persists, and these are the perfect matches.

Let's allow bioinformatic tools to perform the work for us to make sure there are no other sequences with adapters with sequencing errors in them.

4. Run Cutadapt

Now that we've seen the quality metrics for our data, we'll clean it up using Cutadapt.

To run Cutadapt on the files:

spack load py-cutadapt

cutadapt \
  -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
  -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  -m 50 \
  -o cut_SRR17062759_1.fastq.gz \
  -p cut_SRR17062759_2.fastq.gz \
  SRR17062759_1.fastq.gz SRR17062759_2.fastq.gz

note: this took approximately 16 minutes to run; consider looking at fastqc results

5. Prepare raw dataset for tomorrow's lab:

For the sake of illustration and computation, we used a small dataset for this lab. Continuing with the Arabidopsis data, we'll use pre-trimmed files for all downstream steps. Create symbolic links in your 2a_trimming folder by running the following command:

cd 02_trimming
ln -s /pickett_shared/teaching/RNASeq_workshop/final_outputs/02_trimming/*.fastq ./

If you'd like to follow the exact steps used in preparing these files, see the Appendix below.

Appendix

The following settings were used to trim the Arabidopsis data using Cutadapt. For now, we already have these completed files in our analysis directory, but the data can be processed automatically using a for-loop (we'll discuss this on Thursday).

for R1 in /pickett_shared/teaching/RNASeq_workshop/raw_data/reads/*_1.fastq.gz; do
  cutadapt \
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    -m 50 \
    -o cut_$(basename ${R1}) \
    -p cut_$(basename ${R1%%_1.fastq.gz})_2.fastq.gz \
    $R1 ${R1%%_1.fastq.gz}_2.fastq.gz
done