I'm using GetOrganelle to extract mitochondrial genomes from insects. I first do an assembly with SPAdes (because we use that output for other downstream tasks anyway), and then I run get_organelle_from_assembly.py with `-F animal_mt` on the assembly_graph.fastg output from SPAdes.
I'm running this on a lot of genomes, and on maybe 10% of them GetOrganelle just gets stuck on the `INFO: Slimming assembly graph ...` step (while it's running slim_graph.py), pegging one CPU core at 100% for hours and hours until I kill it. There is no other log output, even with --verbose.
When it works, it usually finishes in 5-20 minutes for assemblies of similar genomes. But sometimes it takes hours, and sometimes it never converges.
My genome assemblies from SPAdes are far from good: the contigs.fasta output is a boatload of small contigs of 5-10 kb. But most of the time GetOrganelle still finds a circular genome.
Apart from setting a timer and simply killing GetOrganelle if it hasn't found anything within, say, an hour... is there anything to check? Could this be a bug, or simply an artefact of some kind of complicated/buggy .fastg file from SPAdes? Are there internal cutoffs/limits to set in GetOrganelle to avoid it getting stuck like this?
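For the stopgap timer approach, a minimal watchdog sketch in Python — the commented-out GetOrganelle invocation is illustrative only (the flags are the documented ones, but the input/output paths are placeholders for your own setup):

```python
import subprocess

def run_with_timeout(cmd, seconds):
    """Run a command, killing it if it exceeds `seconds`.

    Returns the CompletedProcess on normal exit, or None on timeout
    (subprocess.run kills the child before raising TimeoutExpired).
    """
    try:
        return subprocess.run(cmd, timeout=seconds)
    except subprocess.TimeoutExpired:
        return None

# Illustrative invocation (placeholder paths):
# result = run_with_timeout(
#     ["get_organelle_from_assembly.py",
#      "-F", "animal_mt",
#      "-g", "assembly_graph.fastg",
#      "-o", "mito_out"],
#     seconds=3600)  # give up after one hour
# if result is None:
#     print("GetOrganelle timed out, moving on to next genome")
```

This keeps a batch run moving past the stuck 10% without babysitting each job.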
I can share an assembly_graph.fastg file somewhere if someone wants to test it.
BTW, I switched from running get_organelle_from_reads to running on the SPAdes graph output because I found the latter was much better on average at finding circular mitogenomes.
A little update: the same genome, processed through GetOrganelle's from_reads mode, works fine and finds a circular genome after just 15 minutes or so. I guess the graph output from SPAdes in this case is simply too massive. I had a look in Bandage and it's around a million nodes, most of them small unconnected islands, and I guess the two modes handle this complexity differently. In from_reads mode, all those bad short reads presumably never enter the pipeline in the first place.
As I haven't found any sane options in SPAdes to limit this in the graph output, it might be a good idea for GetOrganelle to be able to limit/threshold the graph input. Maybe there is an option I'm just not aware of. Something similar to how the from_reads mode "tastes" the input reads?
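As a workaround outside GetOrganelle, one could pre-filter the FASTG before handing it over. A rough sketch, assuming SPAdes-style headers of the form `>EDGE_<id>_length_<n>_cov_<x>...;` (simplified: adjacency lists in kept headers may still name dropped edges, which a production version would also prune):

```python
import re

# SPAdes FASTG headers encode length and coverage in the edge name.
COV_RE = re.compile(r"length_(\d+)_cov_([\d.]+)")

def filter_fastg(lines, min_cov=2.0, min_len=200):
    """Yield FASTG lines, keeping only records whose own edge meets
    the depth and length cutoffs.

    Caveat (sketch only): links to dropped edges are not pruned from
    the adjacency part of surviving headers.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The part before the first ":" describes this record's edge;
            # anything after it lists linked edges.
            m = COV_RE.search(line.split(":", 1)[0])
            keep = (m is not None
                    and int(m.group(1)) >= min_len
                    and float(m.group(2)) >= min_cov)
        if keep:
            yield line
```

On a graph dominated by tiny unconnected islands, dropping short low-coverage edges up front should shrink the input to slim_graph.py dramatically.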
Further update: I discovered that --min-depth has an effect in this mode (from_assembly). It seems to do some kind of filtering before slimming, which reduces the load significantly when the graph contains a lot of unassembled reads, as in my case. Setting min-depth to anything over 1.0 made it converge, though it took 12 hours of processing; setting min-depth to 10 let it find a solution in 1 hour.
Unfortunately it's not possible to use one value consistently: other genome assembly graphs I have require a min-depth of 1.2 to converge to a circular genome (anything above that and you get a bunch of separate scaffolds instead).
So I guess what could be done here is some kind of analysis at the start of a from_assembly run that tries to figure out an adaptive min-depth threshold of some sort?
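One possible heuristic for such an adaptive threshold (purely a sketch, not anything GetOrganelle does internally): bin the per-edge coverages on a log scale and put the cutoff at the first valley after the low-coverage noise peak, falling back to a conservative default when the distribution has no clear valley:

```python
from collections import Counter
import math

def suggest_min_depth(covs, fallback=1.0):
    """Heuristic sketch: histogram edge coverages in log2 bins and
    return the coverage at the first valley after the noise peak.
    Returns `fallback` if no clear valley exists."""
    bins = Counter(int(math.log2(c)) if c >= 1 else -1 for c in covs)
    lo, hi = min(bins), max(bins)
    counts = [bins.get(b, 0) for b in range(lo, hi + 1)]
    i = 0
    # Climb the low-coverage noise peak...
    while i + 1 < len(counts) and counts[i + 1] >= counts[i]:
        i += 1
    # ...then descend until the histogram rises again (the valley).
    while i + 1 < len(counts) and counts[i + 1] <= counts[i]:
        i += 1
    if i + 1 >= len(counts):
        return fallback  # monotone distribution, no valley to cut at
    return float(2 ** (lo + i))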