-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Final annotation protein sequence length includes exceedingly short proteins #125
Comments
Hi @rosscrowhurst , I remember we had similar discussions many years ago. Originally, we tried a cutoff for 120nts? However, we were notified that effector genes in fungal could be very short (90nts or even shorteer?), so the threshold could be kingdom-dependent. This post mentioned the shortest manual curated gene is 7nts in human, but it doesn't look like a protein-coding one. I'm more concerned about gene models predicted with multiple exons, while one of the CDS is just 3nts long. Should we discard those gene models or mark them as 'to be further checked'? |
A paper in 1994 described a very short gene of 21nts |
@CeciliaDeng - check the shortest plant protein in NCBI refseq. A paper from 1994 does not count - find more recent pubs from last few years. I routinely use 24 aa as cut off (or 25 depending on my typing!) so a cds total length of 72 bases and protein of 24 aa. However most proteins of less than 60 aa are hypothetical and only a few have a named function/description. I see 24 aa (72 nt) as a minimum for me as anything shorter never hit anything in NCBI. And I am basing this on my own emphirical research of short peps in Red5, Russell, CK6901 and ME02. Take it or leave it as you wish |
@CeciliaDeng "I'm more concerned about gene models predicted with multiple exons, while one of the CDS is just 3nts long. Should we discard those gene models or mark them as 'to be further checked'?" Shortest CDS in Arabidopsis is 1 nt. You can remove short protein sequence genes but you can not remove genes just if the cds/exon is short. Remember that phasing in determines the overall CDS length. The caveat here is that there are correct introns each side with correct splice sites and not a 1 base exon due to some lift over glitch |
Following candidates are available from AGAT, I prefer I propose to implement the following,
|
@GallVp - just a note that AGAT scripts work on gtf or gff. If you are going to use AGAT to cull sort genes then you need to do it before the pep. cDNA, CDS is exported or produced since the AGAT tools you listed do not handle fasta as well. If you wait until the end of the pipeline then filter out then you also have to filter the fasta and re-run any gene set metric extraction steps. I am putting this workflow (that I put in TEAM) here so that you can see the requirements for post pipeline processing and how you need to handle all the fasta as well as the GFF3. So best it to run any culling step earlier than later. Also all culled models need to be removed from any separate files (like emapper results etc) for sanity. Just mentioning these things here as a reminder :)
|
Thank you @rosscrowhurst The filter will be applied right after BRAKER/LIFTOFF merge which precedes eggnog and protein/cds/cdna extraction. It will become part of our genepal/subworkflows/local/gff_merge_cleanup.nf Lines 13 to 29 in ee702d7
|
|
I have tested GFFREAD using a small test case. You can see the details here, genepal/modules/local/tests/gffread/main.nf.test Lines 12 to 32 in f980772
GFFREAD does support transcript-level filtering by CDS length. |
Description of the bug
The final proteins (and therefore gff3) appear to be allowing proteins of any size to be included and not implementing a minimum protein length threshold (which should be able to be set by user).
Here are the lengths of proteins in final .pep file sorted by size and the number of proteins for each size. Only protein shorter than 101 aa are included here (number of proteins then length of proteins:
There should be a minimum threshold cut off so why are proteins of 2,3,4,5 .. 24 being included. This means the end user need to filter results files to remove them and that should be a feature of the pipeline. So this is both a "bug" (logic bug not coding/nf bug) as well as request for feature setting to enforce a cut off at an end user specified value (or 10 perhaps as a minimum)
Command used and terminal output
No response
Relevant files
No response
System information
No response
The text was updated successfully, but these errors were encountered: