You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Given a gtf specification of transcript exon/intron structures and the genome sequence, construct the nucleotide sequence of the transcripts and add poly(A) tails.
Input:
Gtf file with exon/intron structures of transcripts
File with genome sequence
Length of the poly(A) tail
Dictionary of expected nucleotide frequencies in poly(A) tail
Output:
For each transcript, the list of exons should be traversed from 5' to 3', the sequences of the exons need to be extracted from the genome given the coordinates and then pasted together. At the end, a tail of the specified length should be added at the 3' end of the transcript, given a vector of mono-nucleotide frequencies (of course, the frequency of A's will be much higher than of any other nucleotide).
Design plan
1- Obtain gtf file and also generate a test file for code validation (sampled transcript gtf, from Group 2) :
a. To use as main test file: Reference gtf file
b. Subset the gtf file so that only the rows labelled exon remain
c. Extract gtf file data to make a BED file (bedtools getfasta -split only works on BED12 files)
d. For the sample test file, use just one gene based on gene id. Later on, the sample test file can be expanded as required
2- To run bedtools on the genome with data from sample/final bed file and extract exon sequences for all transcripts (bedtools getfasta) :
3-Function to add poly A tail at the 3’ end (right side) of RNA sequence:
a. Input: The output transcripts from the previous step
b. This is based on length of tail (sample a length of 250 nucleotides)
c. It needs to take into account the probability (weight/relative frequency) of each nucleotide (with A having the largest frequency defined)- this will be saved as a dictionary. {‘A’:freqA,‘U’:freqU,‘G’:freqG,‘C’:freqC}
d. Append via e.g. str.join(), str.ljust()
e. Output the final transcript sequences as a .fasta file. (Final Output)
Given a gtf specification of transcript exon/intron structures and the genome sequence, construct the nucleotide sequence of the transcripts and add poly(A) tails.
Input:
Gtf file with exon/intron structures of transcripts
File with genome sequence
Length of the poly(A) tail
Dictionary of expected nucleotide frequencies in poly(A) tail
Output: fasta-formatted file of transcript sequences
For each transcript, the list of exons should be traversed from 5' to 3', the sequences of the exons need to be extracted from the genome given the coordinates and then pasted together. At the end, a tail of the specified length should be added at the 3' end of the transcript, given a vector of mono-nucleotide frequencies (of course, the frequency of A's will be much higher than of any other nucleotide).
Note, the poly(A) tail can be added by a padding function, which could be used on other contexts as well.
One could use bedtools getfasta for extracting the transcript sequences. The gene annotations would need to be filtered first to only include the features we want to extract. This could easily done with awk, sed or even grep. See this blog post on how to filter GTF files with awk.
This issue then does not require any actual coding, one would just need to add 3 steps to the Nextflow workflow (or, perhaps better, create a Nextflow subworkflow with these 3 steps, then import it in the main one):
Filter gene annotations, keep only features of type transcript
Extract transcript sequences with bedtools getfasta
README description
Extract transcript sequences
Project aim:
Given a gtf specification of transcript exon/intron structures and the genome sequence, construct the nucleotide sequence of the transcripts and add poly(A) tails.
Input:
Output:
For each transcript, the list of exons should be traversed from 5' to 3', the sequences of the exons need to be extracted from the genome given the coordinates and then pasted together. At the end, a tail of the specified length should be added at the 3' end of the transcript, given a vector of mono-nucleotide frequencies (of course, the frequency of A's will be much higher than of any other nucleotide).
Design plan
1- Obtain gtf file and also generate a test file for code validation (sampled transcript gtf, from Group 2) :
a. To use as main test file: Reference gtf file
b. Subset the gtf file so that only the rows labelled exon remain
c. Extract gtf file data to make a BED file (bedtools getfasta -split only works on BED12 files)
d. For the sample test file, use just one gene based on gene id. Later on, the sample test file can be expanded as required
2- To run bedtools on the genome with data from sample/final bed file and extract exon sequences for all transcripts (bedtools getfasta) :
a. To use as genome sequence: https://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz (Human Genome release 107 from Ensembl)
b. Using bedtools getfasta and its options, get the output
c. Output will be: transcript sequences containing all exon data in 5’ to 3’ direction without poly A tail
3-Function to add poly A tail at the 3’ end (right side) of RNA sequence:
a. Input: The output transcripts from the previous step
b. This is based on length of tail (sample a length of 250 nucleotides)
c. It needs to take into account the probability (weight/relative frequency) of each nucleotide (with A having the largest frequency defined)- this will be saved as a dictionary. {‘A’:freqA,‘U’:freqU,‘G’:freqG,‘C’:freqC}
d. Append via e.g. str.join(), str.ljust()
e. Output the final transcript sequences as a .fasta file. (Final Output)
Installation
TBA
Usage/Examples
Original issue description
https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/3
Extract transcript sequences
Given a gtf specification of transcript exon/intron structures and the genome sequence, construct the nucleotide sequence of the transcripts and add poly(A) tails.
Input:
Output: fasta-formatted file of transcript sequences
For each transcript, the list of exons should be traversed from 5' to 3', the sequences of the exons need to be extracted from the genome given the coordinates and then pasted together. At the end, a tail of the specified length should be added at the 3' end of the transcript, given a vector of mono-nucleotide frequencies (of course, the frequency of A's will be much higher than of any other nucleotide).
Note, the poly(A) tail can be added by a padding function, which could be used on other contexts as well.
One could use
bedtools getfasta
for extracting the transcript sequences. The gene annotations would need to be filtered first to only include the features we want to extract. This could easily done withawk
,sed
or evengrep
. See this blog post on how to filter GTF files withawk
.This issue then does not require any actual coding, one would just need to add 3 steps to the Nextflow workflow (or, perhaps better, create a Nextflow subworkflow with these 3 steps, then import it in the main one):
transcript
bedtools getfasta
Pipeline overview description
https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation
To start the "sample preparation" process, the transcript sequences need to be constructe based on the structure specified in the gff/gtf file and the genome sequence. Poly(A) tails are added to transcripts accoroding to input #10.
The text was updated successfully, but these errors were encountered: