test: #5 cDNA generator #25

Open
ninsch3000 opened this issue Oct 27, 2023 · 0 comments
README description

cDNA Generator module

Generate cDNA based on mRNA transcript sequences and the corresponding priming probabilities.

Example usage

A simple example can be run from the test_files directory:

cdna-generator -ifa tests/cdna_generator/files/transcript.fasta -igtf tests/cdna_generator/files/Example_GTF_Input.GTF -icpn tests/cdna_generator/files/copy_number_input.csv -ofa cdna_seq.fa -ocsv cdna_counts.csv

Installation

pip install .

Docker

A Docker image is available. To fetch this image:

docker pull ericdb/my-image

To run a simple example using this image:

docker run my-image python cdna/cli.py -ifa test_files/yeast_example.fa -icpn test_files/copy_number_input.csv -igt test_files/Example_GTF_Input.GTF -ofa test_files/cDNA.fasta -ocsv test_files/cDNA.csv

License

MIT license, Copyright (c) 2022 Zavolan Lab, Biozentrum, University of Basel

Contributors

Eric Boittier, Bastian Wagner, Quentin Badolle

More info:

Input files

transcript_copies (csv-formatted) containing:

  • ID of transcript
  • ID of parent transcript
  • transcript copy number

transcript_sequences (fasta-formatted) containing:

  • ID of transcript
  • transcript-sequence

priming_sites (gtf-formatted) containing:

  • ID of transcript
  • Position of priming site
  • Binding likelihood of priming site
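
As an illustration of the transcript_copies layout described above, here is a minimal sketch of a reader. It assumes a header-less CSV with the three fields in the order listed (transcript ID, parent transcript ID, copy number); the function name read_transcript_copies and the example rows are hypothetical and not taken from the tool.

```python
import csv
import io


def read_transcript_copies(handle):
    """Return {transcript_id: (parent_id, copy_number)} from a CSV handle.

    Assumption: no header row, columns in the order
    transcript ID, parent transcript ID, copy number.
    """
    copies = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        transcript_id, parent_id, copy_number = row[:3]
        copies[transcript_id] = (parent_id, int(copy_number))
    return copies


# Made-up rows, for illustration only.
example = io.StringIO("tr1_a,tr1,100\ntr2_a,tr2,40\n")
print(read_transcript_copies(example))
```

If the real file carries a header row, csv.DictReader would be the more natural choice.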

Output files

cDNA_sequences (fasta-formatted) containing:

  • cDNA sequence ID
  • cDNA-sequence

cDNA_counts (csv-formatted) containing:

  • cDNA sequence ID
  • cDNA-counts

Original issue description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/5

Generate cDNAs

Generate cDNA copies of transcripts, allowing for the priming of DNA synthesis at transcript-internal sites.

Input:

  1. fasta-formatted file of transcript sequences
  2. gtf-formatted file with potential priming sites for individual transcripts, with associated probabilities
  3. file with the copy number of each unique transcript subjected to the cDNA synthesis

Output:

  1. fasta-formatted file with DNA copies of the transcripts, each ending at one of the possible priming sites for that transcript. Priming sites are sampled in proportion to their probability of being used within a transcript. Each copy of a unique transcript is independently sampled, but only unique DNA sequences are saved to the output file.
  2. Csv-formatted file with the copy number of each unique DNA copy.
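
The sampling described in the output above could look roughly like the following sketch. It draws one priming site per transcript copy with random.choices, weighted by the site probabilities, and tallies how often each site was picked. The function name sample_priming_sites is hypothetical, and this is not necessarily how the tool itself implements the step; the numbers are the illustrative values from the design notes.

```python
import random
from collections import Counter


def sample_priming_sites(sites, probabilities, copy_number, rng=None):
    """Independently sample one priming site per transcript copy.

    Returns a Counter mapping each chosen site to the number of copies
    primed there. Sites that were never drawn do not appear.
    """
    rng = rng or random.Random()
    chosen = rng.choices(sites, weights=probabilities, k=copy_number)
    return Counter(chosen)


# Illustrative values: three candidate sites for one transcript, 100 copies.
counts = sample_priming_sites(
    sites=[220, 260, 390],
    probabilities=[0.33, 0.27, 0.40],
    copy_number=100,
    rng=random.Random(0),  # fixed seed only to make the run repeatable
)
print(sorted(counts.items()))
```

Because every copy is drawn independently, the counts fluctuate around copy_number × probability rather than matching it exactly. Copies that share a priming site yield the same cDNA, which is why only unique sequences plus their counts need to be stored.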

Pipeline overview description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation
The possible priming sites are sampled with the probabilities computed at the previous step, to pick a site for generating the complementary DNA.

Project design description

https://git.scicore.unibas.ch/zavolan_group/tools/cdna-generator/-/wikis/Project-Design
Input:

  1. fasta-formatted file of transcript sequences
  2. gtf-formatted file with potential priming sites for individual transcripts, with associated probabilities
  3. file with the copy number of each unique transcript subjected to the cDNA synthesis

Output:

  1. fasta-formatted file with DNA copies of the transcripts, each ending at one of the possible priming sites for that transcript. Priming sites are sampled in proportion to their probability of being used within a transcript. Each copy of a unique transcript is independently sampled, but only unique DNA sequences are saved to the output file.
  2. Csv-formatted file with the copy number of each unique DNA copy.

Simulating cDNA synthesis

This is done by reverse transcribing starting from the primer sequence. For each transcript we have the sequence and the copy number. For each copy of the transcript, we therefore sample a priming site in proportion to its probability, calculated at the previous step. The cDNAs are then all the sequences generated from the initial pool of transcripts by copying the initial transcript sequence up to the chosen priming site.
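
A minimal sketch of the copying step, under stated assumptions: the priming site is taken to be a 1-based position marking the 3'-most transcript base covered by the cDNA, synthesis runs from there to the 5' end of the transcript, and the cDNA is reported 5'→3' as the reverse complement of that segment. The function name cdna_up_to_site and the coordinate convention are assumptions for illustration, not confirmed details of the tool.

```python
# Complement table mapping RNA (or DNA) template bases to DNA bases.
COMPLEMENT = str.maketrans("ACGUT", "TGCAA")


def cdna_up_to_site(transcript, priming_site):
    """Return the cDNA for a transcript primed at a 1-based position.

    The template is the transcript from its 5' end up to and including
    priming_site; the cDNA is the reverse complement of that template.
    """
    if not 1 <= priming_site <= len(transcript):
        raise ValueError("priming_site must lie within the transcript")
    template = transcript[:priming_site].upper()
    return template.translate(COMPLEMENT)[::-1]


# Toy transcript, for illustration only.
print(cdna_up_to_site("AUGGCC", 4))  # template AUGG
```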

project_schema

cDNA Generator Design

  1. Extract transcript_sequences, transcript_copy_number, priming_sites and priming_probabilities from input files.
     • transcript_sequences = GAUGCGG… , UAGCGCUG…, CUCUUGCGG… [...]
     • transcript_copy_number = 100, 40, 30 [...]
     • priming_sites = 220, 260, 390 [...]
     • priming_probabilities = 0.33, 0.27, 0.40 [...]
  2. Generate a list of unique_transcripts based on transcript_sequences + priming_sites and add the list to the FASTA output file. mRNA -> cDNA
     • TTTACGGT…
     • CCATACGG…
     • CGGGGCG…
  3. Generate a list of copy numbers for each unique transcript based on priming_probabilities + transcript_copy_number
     • TTTACGGT… 33
     • CCATACGG… 27
     • CGGGGCG… 40
  4. Iterate 1-3 and extend lists
  5. Write unique_transcripts output FASTA file and copy_number_transcripts output CSV file
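
The final write step could be sketched as follows. The cDNA IDs (cdna_1, cdna_2, ...), the function name write_cdna_outputs, and the header-less two-column CSV layout are assumptions for illustration. The sequences are made-up toy strings, and the counts reuse the illustrative 33 / 27 / 40 split from the list above.

```python
import csv
import io


def write_cdna_outputs(cdna_counts, fasta_handle, csv_handle):
    """Write unique cDNA sequences to FASTA and their counts to CSV.

    cdna_counts maps each unique cDNA sequence to its copy number.
    IDs are generated here (hypothetical naming scheme).
    """
    writer = csv.writer(csv_handle)
    for index, (sequence, count) in enumerate(cdna_counts.items(), start=1):
        cdna_id = f"cdna_{index}"
        fasta_handle.write(f">{cdna_id}\n{sequence}\n")
        writer.writerow([cdna_id, count])


# In-memory handles stand in for the -ofa and -ocsv output files.
fasta_out, csv_out = io.StringIO(), io.StringIO()
write_cdna_outputs({"AAAC": 33, "CCGT": 27, "GGTA": 40}, fasta_out, csv_out)
print(fasta_out.getvalue())
print(csv_out.getvalue())
```

Keying the dictionary by sequence means identical cDNAs arising from different copies are merged automatically, which matches the "only unique DNA sequences are saved" requirement.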

Open questions:

  • What if the reverse transcriptase falls off before reaching the 5' end of the transcript? The current design only considers potential start sites (priming sites).