-
Notifications
You must be signed in to change notification settings - Fork 0
Lab 02: Trimming
Today we'll use the program Cutadapt to trim portions of sequences that have potential adapter contamination.
Helpful links:
An overview of how we'll manage the Arabidopsis data in our directories is shown below:
raw_data/
reads/
SRR17062759_1.fastq.gz
SRR17062759_2.fastq.gz
SRR17062760_1.fastq.gz
SRR17062760_2.fastq.gz
SRR17062762_1.fastq.gz
SRR17062762_2.fastq.gz
SRR17062763_1.fastq.gz
SRR17062763_2.fastq.gz
reference/
Athaliana_447_Araport11.gene_exons.gff3
Athaliana_447_TAIR10.fa
analysis/
username/
01_fastqc/
02_trimming/ (we'll make this today)
Original publication for our dataset:
Illumina provides all adapter sequences: https://support-docs.illumina.com/SHARE/AdapterSeq/Content/SHARE/AdapterSeq/AdapterSequencesIntro.htm
Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Go to your personal project directory.
Change to the raw_data directory in your analysis directory, then create a trimming_example subdirectory:
# go to your personal analysis directory
cd analysis
mkdir 02_trimming
cd 02_trimming
Create symbolic links to the raw data:
ln -s /pickett_shared/teaching/RNASeq_workshop/raw_data/reads/SRR17062759* .
First, let's manually look to see if Illumina adapter sequences are present. Run grep using the "Read 1" adapter:
zcat SRR17062759_1.fastq.gz | grep "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
Congratulations! You've found adapter contamination! Although these reads have been largely adapter trimmed before they were uploaded to the SRA database, some contamination persists, and these are the perfect matches.
Let's allow bioinformatic tools to perform the work for us to make sure there are no other sequences with adapters with sequencing errors in them.
Now that we've seen the quality metrics for our data, we'll clean it up using Cutadapt.
To run Cutadapt on the files:
spack load py-cutadapt
cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-m 50 \
-o cut_SRR17062759_1.fastq.gz \
-p cut_SRR17062759_2.fastq.gz \
SRR17062759_1.fastq.gz SRR17062759_2.fastq.gz
note: this took approximately 16 minutes to run; consider looking at fastqc results
For the sake of illustration and computation, we used a small dataset for this lab. Continuing with the Arabidopsis data, we'll use pre-trimmed files for all downstream steps. Create symbolic links in your 2a_trimming folder by running the following command:
cd 02_trimming
ln -s /pickett_shared/teaching/RNASeq_workshop/final_outputs/02_trimming/*.fastq ./
If you'd like to follow the exact steps used in preparing these files, see the Appendix below.
The following settings were used to trim the Arabidopsis data using Cutadapt. For now, we already have these completed files in our analysis directory, but the data can be processed automatically using a for-loop (we'll discuss this on Thursday).
for R1 in /pickett_shared/teaching/RNASeq_workshop/raw_data/reads/*_1.fastq.gz; do
cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-m 50 \
-o cut_$(basename ${R1}) \
-p cut_$(basename ${R1%%_1.fastq.gz})_2.fastq.gz \
$R1 ${R1%%_1.fastq.gz}_2.fastq.gz
done