Long read compatibility by tweaking the pipeline #332

Open
varun8476 opened this issue Feb 14, 2023 · 1 comment

Comments

@varun8476

Hi,
Can we make ARIBA compatible with long reads by changing the mapping and assembly approach?
I am planning to do this as my master's thesis project. I am a bioinformatics student, and my first hunch is to use minimap2 to map the reads to the clusters and a long-read assembler such as Flye or Miniasm to assemble them.
Any leads on whether this approach is feasible, or pointers to related research, would be helpful.
Thanks in advance.

@martinghunt
Collaborator

This hasn't been tried. At the time ARIBA was made, long read assemblies were too low quality (in particular, indel errors), which would have led to too many errors. ARIBA is made to be quite conservative, which is fine for Illumina but not for data with a higher error rate.

But I'm not sure it's worth trying, because these days long reads and their assemblies are significantly better (although I'd still be wary of indel errors). If it was me, I would assemble all the reads (using flye/unicycler/whatever works) and then use abritamr for the AMR predictions: https://github.com/MDU-PHL/abritamr

Sorry if that sounds too negative, but realistically I expect that would be the best method. Happy to be proven wrong! That said, if you really want to do it then this is what I can think of that will need changing, and there's probably more that I haven't thought of. Basically, there's a bunch of places where read pairs are assumed, and it'll be a fair bit of work to deal with going from paired to unpaired:

  • change the initial mapping to not assume read pairs and probably change the k-mer, step and minimizer sizes. If you want to update minimap -> minimap2 then fine, but either way you'll have to faff with the C++ code so that the mapping works on both paired and unpaired reads
  • all the reads allocated to clusters are stored in a single tabix indexed file. As each cluster is run it retrieves the reads from that file. All this code will need editing to handle unpaired reads.
  • getting an assembly method (good combination of assembler and command line options) that reliably works. This could turn out to be a massive pain. Expect the unexpected where one assembler may work perfectly on one sample and not on another sample.
  • after assembly, it uses read pairs to make a scaffold graph and checks for nodes with >1 edge. Would need to either skip this completely or reimplement by looking for long reads joining contigs.
  • also after assembly it maps the reads back to get pileup info, so that needs changing for unpaired as well
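
For the scaffold graph point, here is a rough sketch of what "looking for long reads joining contigs" could look like. It is not ARIBA code: the function names, the input format (read name mapped to the contigs it aligns to) and the `min_support` threshold are all made up for illustration, and it assumes the read-to-contig alignments have already been parsed from whatever mapper is used.

```python
from collections import defaultdict
from itertools import combinations


def build_contig_graph(read_to_contigs, min_support=2):
    """Link two contigs when at least `min_support` reads align to both.

    read_to_contigs: dict of read name -> iterable of contig names
    (hypothetical input format, e.g. parsed from mapper output).
    Simplification: a read hitting three contigs links every pair.
    """
    support = defaultdict(int)
    for contigs in read_to_contigs.values():
        # Count each unordered contig pair once per read.
        for a, b in combinations(sorted(set(contigs)), 2):
            support[(a, b)] += 1

    graph = defaultdict(set)
    for (a, b), n_reads in support.items():
        if n_reads >= min_support:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def branching_contigs(graph):
    """Return contigs linked to more than one other contig (>1 edge)."""
    return sorted(c for c, neighbours in graph.items() if len(neighbours) > 1)


# Toy data: ctg1 is joined to both ctg2 and ctg3 by two reads each.
example_hits = {
    "read1": ["ctg1", "ctg2"],
    "read2": ["ctg1", "ctg2"],
    "read3": ["ctg1", "ctg3"],
    "read4": ["ctg1", "ctg3"],
    "read5": ["ctg2", "ctg4"],  # only one read, so below min_support
}
graph = build_contig_graph(example_hits)
print(branching_contigs(graph))
```

Requiring more than one supporting read is a guess at how to keep chimeric or spuriously mapped long reads from creating edges. The right threshold would need testing on real data.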
