Final annotation protein sequence length includes exceedingly short proteins #125

rosscrowhurst · 2024-12-08T20:41:48Z

Description of the bug

The final proteins (and therefore gff3) appear to be allowing proteins of any size to be included and not implementing a minimum protein length threshold (which should be able to be set by user).

Here are the lengths of proteins in final .pep file sorted by size and the number of proteins for each size. Only protein shorter than 101 aa are included here (number of proteins then length of proteins:

There should be a minimum threshold cut off so why are proteins of 2,3,4,5 .. 24 being included. This means the end user need to filter results files to remove them and that should be a feature of the pipeline. So this is both a "bug" (logic bug not coding/nf bug) as well as request for feature setting to enforce a cut off at an end user specified value (or 10 perhaps as a minimum)

Command used and terminal output

No response

Relevant files

No response

System information

No response

The text was updated successfully, but these errors were encountered:

CeciliaDeng · 2024-12-08T22:10:43Z

Hi @rosscrowhurst , I remember we had similar discussions many years ago. Originally, we tried a cutoff for 120nts? However, we were notified that effector genes in fungal could be very short (90nts or even shorteer?), so the threshold could be kingdom-dependent.

This post mentioned the shortest manual curated gene is 7nts in human, but it doesn't look like a protein-coding one.

I'm more concerned about gene models predicted with multiple exons, while one of the CDS is just 3nts long. Should we discard those gene models or mark them as 'to be further checked'?

CeciliaDeng · 2024-12-08T22:18:59Z

A paper in 1994 described a very short gene of 21nts

rosscrowhurst · 2024-12-08T23:40:28Z

@CeciliaDeng - check the shortest plant protein in NCBI refseq. A paper from 1994 does not count - find more recent pubs from last few years. I routinely use 24 aa as cut off (or 25 depending on my typing!) so a cds total length of 72 bases and protein of 24 aa. However most proteins of less than 60 aa are hypothetical and only a few have a named function/description. I see 24 aa (72 nt) as a minimum for me as anything shorter never hit anything in NCBI. And I am basing this on my own emphirical research of short peps in Red5, Russell, CK6901 and ME02. Take it or leave it as you wish

rosscrowhurst · 2024-12-08T23:41:36Z

@CeciliaDeng "I'm more concerned about gene models predicted with multiple exons, while one of the CDS is just 3nts long. Should we discard those gene models or mark them as 'to be further checked'?" Shortest CDS in Arabidopsis is 1 nt. You can remove short protein sequence genes but you can not remove genes just if the cds/exon is short. Remember that phasing in determines the overall CDS length. The caveat here is that there are correct introns each side with correct splice sites and not a 1 base exon due to some lift over glitch

GallVp · 2024-12-10T02:19:29Z

Following candidates are available from AGAT,

I prefer agat_sp_filter_by_ORF_size.pl because it seems from its perl code that it works on AA length derived from [CDS Length/3 - 1]. Moreover, a gene is removed even if one of its isoforms has short length.

I propose to implement the following,

Include a parameter filter_genes_by_aa_length and set its default value to 24 as suggested by @rosscrowhurst
filter_genes_by_aa_length` minimum allowed value will be 3 (arbitrary, any suggestion?) and max value can be any positive integer
If filter_genes_by_aa_length is set to null, no filter is applied.

rosscrowhurst · 2024-12-10T02:48:33Z

@GallVp - just a note that AGAT scripts work on gtf or gff. If you are going to use AGAT to cull sort genes then you need to do it before the pep. cDNA, CDS is exported or produced since the AGAT tools you listed do not handle fasta as well. If you wait until the end of the pipeline then filter out then you also have to filter the fasta and re-run any gene set metric extraction steps.

I am putting this workflow (that I put in TEAM) here so that you can see the requirements for post pipeline processing and how you need to handle all the fasta as well as the GFF3. So best it to run any culling step earlier than later. Also all culled models need to be removed from any separate files (like emapper results etc) for sanity. Just mentioning these things here as a reminder :)

# Set protein fasta file path
GENOME_NAME=ck6901m
PEP_FASTA=/workspace/hrarnc/ck6901m/genepal/with_liftoff/results/annotations/ck6901m/${GENOME_NAME}.pep.fasta





# get sequence sizes of protein
/output/hrarnc/software/bin/get_sequence_sizes.pl -f=${PEP_FASTA} -o=${GENOME_NAME}.pep



# min protein size
MIN_PEP_LENGTH=24



# Get list of proteins less than MIN_PEP_LENGTH
CULL_LIST=${GENOME_NAME}.pep.cull.list
cat ${GENOME_NAME}.pep.sorted.sequence.lengths.txt | awk -v var=${MIN_PEP_LENGTH} -F"\t" '{ if ($2 < var){print $1; }}' > ${CULL_LIST}



# count list
NUM_SHORTS_TO_REMOVE=`cat ${CULL_LIST} | wc -l`; echo -e "NUMBER SHORT PROTEINS\t${NUM_SHORTS_TO_REMOVE}"



# remove from fasta
for TYPE in cds cdna pep
do
  IN_FASTA=${GENOME_NAME}.${TYPE}.fasta
  echo ${IN_FASTA}
  if [ -e ${IN_FASTA} ]
  then
    if [ "$TYPE" = "pep" ]
    then
      OUT_FASTA=${GENOME_NAME}.${TYPE}.minlen${MIN_PEP_LENGTH}.fasta
      /output/hrarnc/software/bin/remove_fasta_seqs_with_list.pl -f=${IN_FASTA} -o=${OUT_FASTA} -l=${CULL_LIST}
    else
      let "CML=3*${MIN_PEP_LENGTH}"
      OUT_FASTA=${GENOME_NAME}.${TYPE}.minlen${CML}.fasta
      /output/hrarnc/software/bin/remove_fasta_seqs_with_list.pl -f=${IN_FASTA} -o=${OUT_FASTA} -l=${CULL_LIST}
    fi
  fi
done



# remove gff3 records - fasta uses mRNA feature name so use the with_mRNAID variant script
/output/hrarnc/software/bin/remove_from_gff3_with_mRNAid_list.pl -i=ck6901m.gt.gff3 -o=ck6901m.filteredbyminsize.gff3 -l=${CULL_LIST}

GallVp · 2024-12-10T02:55:49Z

Thank you @rosscrowhurst

The filter will be applied right after BRAKER/LIFTOFF merge which precedes eggnog and protein/cds/cdna extraction. It will become part of our gff_merge_cleanup sub workflow.

genepal/subworkflows/local/gff_merge_cleanup.nf

Lines 13 to 29 in ee702d7

    
           ch_gff_branch               = ch_braker_gff 
        
                                       | join(ch_liftoff_gff, remainder:true) 
        
                                       | branch { meta, braker_gff, liftoff_gff -> 
        
                                           both        : (     braker_gff      &&      liftoff_gff ) 
        
                                           braker_only : (     braker_gff      && ( !  liftoff_gff ) ) 
        
                                           liftoff_only: ( ( ! braker_gff )    &&      liftoff_gff ) 
        
                                       } 
        
           // MODULE: AGAT_SPMERGEANNOTATIONS 
        
           AGAT_SPMERGEANNOTATIONS( 
        
               ch_gff_branch.both.map { meta, bg, lg -> [ meta, [ bg, lg ] ] }, 
        
               [] 
        
           ) 
        
           ch_merged_gff               = AGAT_SPMERGEANNOTATIONS.out.gff 
        
                                       | mix ( ch_gff_branch.liftoff_only.map { meta, braker_gff, liftoff_gff -> [ meta, liftoff_gff ] } ) 
        
                                       | mix ( ch_gff_branch.braker_only.map { meta, braker_gff, liftoff_gff -> [ meta, braker_gff ] } )

GallVp · 2024-12-11T03:53:21Z

agat_sp_filter_by_ORF_size.pl is not isoform aware and fails to filter out a gene when the first isoform passes the filter test even when the other isoforms fail the test. TODOs,

Unit test this behaviour and raise issue on the AGAT repo -- agat_sp_filter_by_ORF_size.pl is not isoform aware NBISweden/AGAT#512
Investigate whether GFFREAD can be used for filter short isoforms as suggested by @rosscrowhurst

GallVp · 2024-12-11T08:58:24Z

Hi @rosscrowhurst

I have tested GFFREAD using a small test case. You can see the details here,

genepal/modules/local/tests/gffread/main.nf.test

Lines 12 to 32 in f980772

    
           test("filter by length") { 
        
               when { 
        
                   process { 
        
                       """ 
        
                       input[0] = [ 
        
                           [id: 'test'], 
        
                           file("$baseDir" + '/modules/local/tests/gffread/testdata/t.gff', checkIfExists: true) 
        
                       ] 
        
                       input[1] = [] 
        
                       """ 
        
                   } 
        
               } 
        
               then { 
        
                   assertAll ( 
        
                       { assert process.success }, 
        
                       { assert snapshot(process.out).match() }, 
        
                       { assert file(process.out.gffread_gff[0][1]).text.contains('gene19851') }, 
        
                       { assert file(process.out.gffread_gff[0][1]).text.contains('gene19851.t1') }, 
        
                       { assert ! file(process.out.gffread_gff[0][1]).text.contains('gene19851.t2') } // This is the only transcript which is being knocked out

GFFREAD does support transcript-level filtering by CDS length.

rosscrowhurst added the bug Something isn't working label Dec 8, 2024

GallVp self-assigned this Dec 8, 2024

GallVp added this to the 0.6.0 milestone Dec 8, 2024

GallVp linked a pull request Dec 10, 2024 that will close this issue

Added parameter filter_genes_by_aa_length #127

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Final annotation protein sequence length includes exceedingly short proteins #125

Final annotation protein sequence length includes exceedingly short proteins #125

rosscrowhurst commented Dec 8, 2024 •

edited

Loading

CeciliaDeng commented Dec 8, 2024

CeciliaDeng commented Dec 8, 2024

rosscrowhurst commented Dec 8, 2024

rosscrowhurst commented Dec 8, 2024 •

edited

Loading

GallVp commented Dec 10, 2024 •

edited

Loading

rosscrowhurst commented Dec 10, 2024

GallVp commented Dec 10, 2024

GallVp commented Dec 11, 2024 •

edited

Loading

GallVp commented Dec 11, 2024

Final annotation protein sequence length includes exceedingly short proteins #125

Final annotation protein sequence length includes exceedingly short proteins #125

Comments

rosscrowhurst commented Dec 8, 2024 • edited Loading

Description of the bug

Command used and terminal output

Relevant files

System information

CeciliaDeng commented Dec 8, 2024

CeciliaDeng commented Dec 8, 2024

rosscrowhurst commented Dec 8, 2024

rosscrowhurst commented Dec 8, 2024 • edited Loading

GallVp commented Dec 10, 2024 • edited Loading

rosscrowhurst commented Dec 10, 2024

GallVp commented Dec 10, 2024

GallVp commented Dec 11, 2024 • edited Loading

GallVp commented Dec 11, 2024

rosscrowhurst commented Dec 8, 2024 •

edited

Loading

rosscrowhurst commented Dec 8, 2024 •

edited

Loading

GallVp commented Dec 10, 2024 •

edited

Loading

GallVp commented Dec 11, 2024 •

edited

Loading