Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create new reference files for marmoset genome #177

Closed
kopardev opened this issue Nov 26, 2024 · 6 comments · Fixed by #185
Closed

create new reference files for marmoset genome #177

kopardev opened this issue Nov 26, 2024 · 6 comments · Fixed by #185
Assignees
Labels
enhancement New feature or request RENEE RepoName
Milestone

Comments

@kopardev
Copy link
Member

@kelly-sovacool can you create the reference files for:
Fasta: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/011/100/555/GCF_011100555.1_mCalJa1.2.pat.X/GCF_011100555.1_mCalJa1.2.pat.X_genomic.fna.gz
GTF:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/011/100/555/GCF_011100555.1_mCalJa1.2.pat.X/GCF_011100555.1_mCalJa1.2.pat.X_genomic.gtf.gz

@kopardev kopardev added the enhancement New feature or request label Nov 26, 2024
@kopardev kopardev added this to the 2024-11 milestone Nov 26, 2024
@kopardev kopardev added the RENEE RepoName label Nov 26, 2024
@kelly-sovacool kelly-sovacool changed the title create new reference files create new reference files for marmoset genome Nov 26, 2024
@kelly-sovacool
Copy link
Member

https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_011100555.1/

will use mCalJac1 as the name

@kelly-sovacool
Copy link
Member

will need to edit the gtf

/data/CCBR_Pipeliner/db/PipeDB/Indices/mCalJac1_2021/qualimap_error.log

Traceback (most recent call last):
  File "/data/CCBR_Pipeliner/db/PipeDB/Indices/mCalJac1_2021/workflow/scripts/builder/generate_qualimap_ref.py", line 98, in <module>
    biotype = exons[0].attr["gene_type"]
KeyError: 'gene_type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/CCBR_Pipeliner/db/PipeDB/Indices/mCalJac1_2021/workflow/scripts/builder/generate_qualimap_ref.py", line 100, in <module>
    biotype = exons[0].attr["gene_biotype"]
KeyError: 'gene_biotype'

@kopardev
Copy link
Member Author

kopardev commented Dec 3, 2024

@kelly-sovacool Seems like the exon level GTF entries do not have gene_type or gene_biotype information. In fact, the entire file does not have gene_type... only has gene_biotype ... which actually makes sense as they are interchange-able.

The way GTFs are structured is that they have are:

  • gene ... represented by gene_id
    • transcript ... represented by transcript_id ... also includes gene_id of the gene it is a transcript of
      • exon .. represented by a exon_number ... also includes the transcript_id it belongs to ... sometimes also includes the gene_id the transcript belongs to

Generally all the above have the gene_biotype (or gene_type) information duplicated, but does not seem to be the case in here. The gene_biotype information is only available at the gene level. So you may have to go to the attributes column (column 9 of the GTF) extract the value ofgene_id for a particular exon .. then go to the line with feature (column 3 of the GTF) == "gene" and its gene_id matches the value of gene_id from the exon .. then retrieve the gene_biotype from that line. Does that make sense??

@kopardev
Copy link
Member Author

kopardev commented Dec 3, 2024

@kelly-sovacool Another simpler way (I would probably do it this way):

write a new script to

  • crawl through the GTF to extract feature (column 3) == "gene"
  • extract gene_id and its corresponding gene_biotype from this line
  • save 2 columns gene_id and gene_biotype to a lookup table file

Then read the GTF line by line a second time and for each line

  • if feature == "gene" spite out the line as is
  • if feature != "gene" .. then
    • get its gene_id attribute and from the lookup table file lookup its gene_biotype and its value ... add this key-value to the attributes ... attributes are separated by "; " ... you can append it to the end as gene_biotype "blahblah";
    • spit out the line
  • save output to new GTF

New GTF will now have a gene_biotype attribute for all features (aka all lines) ...Using this new GTF with the legacy generate_qualimap_ref.py script should work!

@kelly-sovacool
Copy link
Member

@kopardev yes makes sense, I should be able to whip up a python script to do this fairly easily

@kopardev
Copy link
Member Author

kopardev commented Dec 3, 2024

@kelly-sovacool Aim to include this logic into a new generate_qualimap_ref.py itself ... to make this future-proof!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request RENEE RepoName
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants