Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final annotation protein sequence length includes exceedingly short proteins #125

Open
rosscrowhurst opened this issue Dec 8, 2024 · 9 comments · Fixed by #127
Open
Assignees
Labels
bug Something isn't working
Milestone

Comments

@rosscrowhurst
Copy link

rosscrowhurst commented Dec 8, 2024

Description of the bug

The final proteins (and therefore gff3) appear to be allowing proteins of any size to be included and not implementing a minimum protein length threshold (which should be able to be set by user).

Here are the lengths of proteins in final .pep file sorted by size and the number of proteins for each size. Only protein shorter than 101 aa are included here (number of proteins then length of proteins:

 92 100
 77 99
 90 98
 78 97
 71 96
 85 95
 71 94
 87 93
 95 92
 71 91
 77 90
 68 89
 72 88
 76 87
 61 86
 60 85
 78 84
 70 83
 75 82
 70 81
 73 80
 64 79
 76 78
 83 77
 79 76
 79 75
 80 74
 89 73
 75 72
 85 71
101 70
104 69
 81 68
 68 67
 93 66
 26 65
 22 64
 19 63
 18 62
 19 61
 17 60
 16 59
 18 58
 18 57
 23 56
 15 55
 17 54
 20 53
 11 52
 16 51
 11 50
 13 49
 17 48
 10 47
 16 46
 16 45
 12 44
  9 43
  6 42
  9 41
 10 40
  5 39
 12 38
 11 37
  5 36
  6 35
 12 34
  2 33
 10 32
  8 31
  5 30
  5 29
  3 28
  5 27
  4 26
  7 25
  5 24
  9 23
 12 22
  4 21
  2 20
  4 19
  8 18
  5 17
  8 16
  6 15
 10 14
  5 13
 16 12
  4 11
  8 10
  9 9
  4 8
  8 7
  6 6
 14 5
  9 4
  2 3
  2 2

There should be a minimum threshold cut off so why are proteins of 2,3,4,5 .. 24 being included. This means the end user need to filter results files to remove them and that should be a feature of the pipeline. So this is both a "bug" (logic bug not coding/nf bug) as well as request for feature setting to enforce a cut off at an end user specified value (or 10 perhaps as a minimum)

Command used and terminal output

No response

Relevant files

No response

System information

No response

@rosscrowhurst rosscrowhurst added the bug Something isn't working label Dec 8, 2024
@GallVp GallVp self-assigned this Dec 8, 2024
@GallVp GallVp added this to the 0.6.0 milestone Dec 8, 2024
@CeciliaDeng
Copy link
Collaborator

Hi @rosscrowhurst , I remember we had similar discussions many years ago. Originally, we tried a cutoff for 120nts? However, we were notified that effector genes in fungal could be very short (90nts or even shorteer?), so the threshold could be kingdom-dependent.

This post mentioned the shortest manual curated gene is 7nts in human, but it doesn't look like a protein-coding one.

I'm more concerned about gene models predicted with multiple exons, while one of the CDS is just 3nts long. Should we discard those gene models or mark them as 'to be further checked'?

@CeciliaDeng
Copy link
Collaborator

A paper in 1994 described a very short gene of 21nts

@rosscrowhurst
Copy link
Author

@CeciliaDeng - check the shortest plant protein in NCBI refseq. A paper from 1994 does not count - find more recent pubs from last few years. I routinely use 24 aa as cut off (or 25 depending on my typing!) so a cds total length of 72 bases and protein of 24 aa. However most proteins of less than 60 aa are hypothetical and only a few have a named function/description. I see 24 aa (72 nt) as a minimum for me as anything shorter never hit anything in NCBI. And I am basing this on my own emphirical research of short peps in Red5, Russell, CK6901 and ME02. Take it or leave it as you wish

@rosscrowhurst
Copy link
Author

rosscrowhurst commented Dec 8, 2024

@CeciliaDeng "I'm more concerned about gene models predicted with multiple exons, while one of the CDS is just 3nts long. Should we discard those gene models or mark them as 'to be further checked'?" Shortest CDS in Arabidopsis is 1 nt. You can remove short protein sequence genes but you can not remove genes just if the cds/exon is short. Remember that phasing in determines the overall CDS length. The caveat here is that there are correct introns each side with correct splice sites and not a 1 base exon due to some lift over glitch

@GallVp
Copy link
Member

GallVp commented Dec 10, 2024

Following candidates are available from AGAT,

  1. agat_sp_filter_by_ORF_size.pl
  2. agat_sp_filter_gene_by_length.html

I prefer agat_sp_filter_by_ORF_size.pl because it seems from its perl code that it works on AA length derived from [CDS Length/3 - 1]. Moreover, a gene is removed even if one of its isoforms has short length.

I propose to implement the following,

  • Include a parameter filter_genes_by_aa_length and set its default value to 24 as suggested by @rosscrowhurst
  • filter_genes_by_aa_length` minimum allowed value will be 3 (arbitrary, any suggestion?) and max value can be any positive integer
  • If filter_genes_by_aa_length is set to null, no filter is applied.

@rosscrowhurst
Copy link
Author

@GallVp - just a note that AGAT scripts work on gtf or gff. If you are going to use AGAT to cull sort genes then you need to do it before the pep. cDNA, CDS is exported or produced since the AGAT tools you listed do not handle fasta as well. If you wait until the end of the pipeline then filter out then you also have to filter the fasta and re-run any gene set metric extraction steps.

I am putting this workflow (that I put in TEAM) here so that you can see the requirements for post pipeline processing and how you need to handle all the fasta as well as the GFF3. So best it to run any culling step earlier than later. Also all culled models need to be removed from any separate files (like emapper results etc) for sanity. Just mentioning these things here as a reminder :)

# Set protein fasta file path
GENOME_NAME=ck6901m
PEP_FASTA=/workspace/hrarnc/ck6901m/genepal/with_liftoff/results/annotations/ck6901m/${GENOME_NAME}.pep.fasta





# get sequence sizes of protein
/output/hrarnc/software/bin/get_sequence_sizes.pl -f=${PEP_FASTA} -o=${GENOME_NAME}.pep



# min protein size
MIN_PEP_LENGTH=24



# Get list of proteins less than MIN_PEP_LENGTH
CULL_LIST=${GENOME_NAME}.pep.cull.list
cat ${GENOME_NAME}.pep.sorted.sequence.lengths.txt | awk -v var=${MIN_PEP_LENGTH} -F"\t" '{ if ($2 < var){print $1; }}' > ${CULL_LIST}



# count list
NUM_SHORTS_TO_REMOVE=`cat ${CULL_LIST} | wc -l`; echo -e "NUMBER SHORT PROTEINS\t${NUM_SHORTS_TO_REMOVE}"



# remove from fasta
for TYPE in cds cdna pep
do
  IN_FASTA=${GENOME_NAME}.${TYPE}.fasta
  echo ${IN_FASTA}
  if [ -e ${IN_FASTA} ]
  then
    if [ "$TYPE" = "pep" ]
    then
      OUT_FASTA=${GENOME_NAME}.${TYPE}.minlen${MIN_PEP_LENGTH}.fasta
      /output/hrarnc/software/bin/remove_fasta_seqs_with_list.pl -f=${IN_FASTA} -o=${OUT_FASTA} -l=${CULL_LIST}
    else
      let "CML=3*${MIN_PEP_LENGTH}"
      OUT_FASTA=${GENOME_NAME}.${TYPE}.minlen${CML}.fasta
      /output/hrarnc/software/bin/remove_fasta_seqs_with_list.pl -f=${IN_FASTA} -o=${OUT_FASTA} -l=${CULL_LIST}
    fi
  fi
done



# remove gff3 records - fasta uses mRNA feature name so use the with_mRNAID variant script
/output/hrarnc/software/bin/remove_from_gff3_with_mRNAid_list.pl -i=ck6901m.gt.gff3 -o=ck6901m.filteredbyminsize.gff3 -l=${CULL_LIST}

@GallVp
Copy link
Member

GallVp commented Dec 10, 2024

Thank you @rosscrowhurst

The filter will be applied right after BRAKER/LIFTOFF merge which precedes eggnog and protein/cds/cdna extraction. It will become part of our gff_merge_cleanup sub workflow.

ch_gff_branch = ch_braker_gff
| join(ch_liftoff_gff, remainder:true)
| branch { meta, braker_gff, liftoff_gff ->
both : ( braker_gff && liftoff_gff )
braker_only : ( braker_gff && ( ! liftoff_gff ) )
liftoff_only: ( ( ! braker_gff ) && liftoff_gff )
}
// MODULE: AGAT_SPMERGEANNOTATIONS
AGAT_SPMERGEANNOTATIONS(
ch_gff_branch.both.map { meta, bg, lg -> [ meta, [ bg, lg ] ] },
[]
)
ch_merged_gff = AGAT_SPMERGEANNOTATIONS.out.gff
| mix ( ch_gff_branch.liftoff_only.map { meta, braker_gff, liftoff_gff -> [ meta, liftoff_gff ] } )
| mix ( ch_gff_branch.braker_only.map { meta, braker_gff, liftoff_gff -> [ meta, braker_gff ] } )

@GallVp GallVp linked a pull request Dec 10, 2024 that will close this issue
10 tasks
@GallVp
Copy link
Member

GallVp commented Dec 11, 2024

agat_sp_filter_by_ORF_size.pl is not isoform aware and fails to filter out a gene when the first isoform passes the filter test even when the other isoforms fail the test. TODOs,

@GallVp
Copy link
Member

GallVp commented Dec 11, 2024

Hi @rosscrowhurst

I have tested GFFREAD using a small test case. You can see the details here,

test("filter by length") {
when {
process {
"""
input[0] = [
[id: 'test'],
file("$baseDir" + '/modules/local/tests/gffread/testdata/t.gff', checkIfExists: true)
]
input[1] = []
"""
}
}
then {
assertAll (
{ assert process.success },
{ assert snapshot(process.out).match() },
{ assert file(process.out.gffread_gff[0][1]).text.contains('gene19851') },
{ assert file(process.out.gffread_gff[0][1]).text.contains('gene19851.t1') },
{ assert ! file(process.out.gffread_gff[0][1]).text.contains('gene19851.t2') } // This is the only transcript which is being knocked out

GFFREAD does support transcript-level filtering by CDS length.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants