Adding UMI-tools #49

Closed
pfeiferl opened this issue Jul 5, 2020 · 13 comments

@pfeiferl

pfeiferl commented Jul 5, 2020

Hi all!
First, thank you for this awesome work. Second, I have a request:
Would it be possible to add UMI-tools to the environment (Docker) so the pipeline can work with UMIs?
Thank you for your answer.

@ewels ewels added the enhancement New feature or request label Nov 7, 2020
@ewels
Member

ewels commented Nov 7, 2020

I think this would be a great addition 👍🏻 If you have any details (kits used for example), that would be helpful.

@ewels
Member

ewels commented Nov 9, 2020

QIAGEN kit details: https://resources.qiagenbioinformatics.com/manuals/biomedicalgenomicsanalysis/current/index.php?manual=Create_UMI_Reads_miRNA.html

[Image: structure]

Regarding our slightly weird sequencing setup at the NGI:

The Qiagen smRNA-seq kit places the UMI at the end of a 75 bp read. We sequence 2x50 bp, which does capture UMI data in read 2 but doesn't allow us to use the analysis tool from Qiagen.

@pfeiferl
Author

pfeiferl commented Nov 9, 2020

Hi, I am using the QIAGEN kit you found (by the way, it may be handy for some people to know that the common sequence is the Qiagen adapter sequence).
But when using UMI-tools in the smrnaseq pipeline I found a tricky thing: running umi_tools extract and dedup and then running mirtop on the deduplicated reads loses a lot of information because of mirtop's own deduplication process (and in some cases ALL reads are lost because mirtop somehow interprets them as duplicates).
So I recommend running umi_tools dedup on the alignment against the host reference genome and then calling featureCounts (unfortunately this is not possible on BAM files obtained by aligning against mature and hairpin sequences, because the position field is missing in those BAM files).

@pcantalupo

pcantalupo commented Nov 9, 2020

@pfeiferl to be clear for future readers of the thread, that is the 3' adapter sequence as described in the Qiagen miRNAseq manual on page 53 (07/2020 version) and specified on these lines in the pipeline. Please correct me if I'm wrong.

I'm currently working with a client who did 75bp sequencing so that the UMIs are in the sequence reads. I was going to try using UMI tools but haven't looked into it in depth. I'm a bit confused on how you specify the regex for extracting. Are you using the regular expression mode? Can you share your code for how you did the extract and dedup? Thank you

@pfeiferl
Author

pfeiferl commented Nov 9, 2020

@pcantalupo you are correct

And for umi_tools extract: yes, I am using regex. First, the raw FASTQ files must NOT have the adapter trimmed, because you search for the UMI right after it.

umi_tools extract --stdin=in.fastq.gz --stdout=out.fastq.gz --extract-method=regex --bc-pattern='.+AACTGTAGGCACCATCAAT{s<=2}(?P<umi_1>.{12})(?P<discard_2>.*)'
  • .+ keeps everything before the adapter sequence (AACTGTAGGCACCATCAAT) as the read
  • {s<=2} allows up to 2 mismatches in the adapter sequence
  • (?P<umi_1>.{12}) is the UMI (it will end up in the read header)
  • (?P<discard_2>.*) discards the rest of the read; you must do this because UMI-tools normally works with single-cell data, where the adapter comes first, then the UMI, then the read, so it keeps whatever follows the UMI unless you tell it otherwise
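
To make the pattern's logic concrete, here is a minimal Python sketch of how such a regex splits one read. It is an illustration under stated assumptions, not what umi_tools runs internally: it uses the standard-library `re` module, which has no `{s<=2}` fuzzy matching, so the adapter must match exactly; the `insert` group name, the `split_read` helper and the example read are hypothetical. The only behaviour borrowed from umi_tools is that the extracted UMI is appended to the read name after an underscore.

```python
import re

ADAPTER = "AACTGTAGGCACCATCAAT"  # Qiagen 3' adapter quoted in the thread

# Simplification: exact adapter match only (stdlib `re` cannot do {s<=2}).
# The `insert` group is a hypothetical name standing in for the bare `.+`.
PATTERN = re.compile(
    r"(?P<insert>.+)" + ADAPTER + r"(?P<umi_1>.{12})(?P<discard_2>.*)"
)


def split_read(name, seq):
    """Split a read into insert, 12-base UMI and discarded tail.

    Returns None when the adapter (followed by at least 12 bases) is absent.
    """
    match = PATTERN.fullmatch(seq)
    if match is None:
        return None
    return {
        "name": f"{name}_{match.group('umi_1')}",  # UMI moved into the header
        "insert": match.group("insert"),
        "umi": match.group("umi_1"),
        "discarded": match.group("discard_2"),
    }


# Hypothetical read: 22-base insert + adapter + 12-base UMI + trailing bases.
example = "TAGCTTATCAGACTGATGTTGA" + ADAPTER + "ACGTTGCAACGT" + "GGAT"
print(split_read("read1", example))
```

A read that was already adapter-trimmed returns `None` here, which is why the raw FASTQ files must still contain the adapter.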

Dedup is simply:

umi_tools dedup --method=unique -I in.bam -S out.bam
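
As a rough illustration of what `--method=unique` means conceptually, the sketch below treats reads as duplicates only when they share the same mapping position and the exact same UMI. It is a simplification under assumptions: the record layout and the `dedup_unique` helper are hypothetical, and it keeps the first read seen per group, whereas the real tool works on BAM records and applies its own rules for choosing a representative read.

```python
def dedup_unique(reads):
    """Keep one read per (contig, position, strand, UMI) combination.

    `reads` is an iterable of (name, contig, pos, strand, umi) tuples.
    Hypothetical simplification: the first read seen in each group is kept.
    """
    seen = set()
    kept = []
    for name, contig, pos, strand, umi in reads:
        key = (contig, pos, strand, umi)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


# Hypothetical alignments: r1 and r2 are duplicates, r3 has a different UMI,
# r4 shares the UMI of r1 but maps to a different position.
alignments = [
    ("r1", "chr1", 100, "+", "ACGTTGCAACGT"),
    ("r2", "chr1", 100, "+", "ACGTTGCAACGT"),
    ("r3", "chr1", 100, "+", "TTTTGGGGCCCC"),
    ("r4", "chr1", 250, "+", "ACGTTGCAACGT"),
]
print(dedup_unique(alignments))
```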

@ewels
Member

ewels commented Nov 10, 2020

Note that the main nf-core/rnaseq pipeline already supports UMI tools: https://github.com/nf-core/rnaseq/pull/435/files#diff-6401496ba455b9488ffa902a6e4d7732b2c60ff2d77c5c3ef96b28a7ac7d3b28R1023

The rnaseq pipeline has just moved to DSL2, meaning that this functionality has been ported to nf-core/modules: https://github.com/nf-core/modules/tree/master/software/umitools

Once the DSL2 stuff settles down we will want to start migrating all pipelines to DSL2. When we come to do this pipeline that'll mean that we can reuse the same modules to also run UMI tools in this pipeline.

If we don't want to wait that long, we could always copy over code from the above ☝🏻 (but if we're not in a rush then I think it'd be better to wait).

@lpantano
Contributor

I have looked into that. While extraction seems compatible with the pipeline and deduplicated BAM files can be used for some tools, some parts will need a custom dedup step before aligning. If somebody has time for this, I can help guide the implementation. I don't expect to have much time to do it myself in the next month, but who knows!

CKComputomics pushed a commit to CKComputomics/smrnaseq that referenced this issue Apr 26, 2022
Integrate the umi tools module already existing in nf-core into the
smrnaseq pipeline. See Issue nf-core#49
@CKComputomics
Contributor

@lpantano I'm currently looking into this. The UMI extract part is pretty straightforward to implement, but which steps do you mean when you talk about custom deduplication before aligning? Is a dedup step for the mapped hairpin and mature BAM files (and possibly for the genome alignment) not sufficient?

@lpantano
Contributor

Hi, dedup on BAM files is not going to help some tools. If dedup happens on the FASTQ files, then it is fine. The tools aimed at better miRNA quantification do a 'collapsing' step at the FASTQ level, where each identical sequence is reported once in the output file (normally a FASTA file), with the number of times that sequence appeared stored in the read name.

mirdeep, mirtop and mirtrace do this. None of them works at the BAM file level in the same way as rnaseq. So, bottom line: if this can be done at the FASTQ level then it is fine, but if not, I don't think it will be that useful. The quantification from the BAM file shouldn't be used for anything other than statistics about how many reads map to mature, hairpin or genome.

Happy to set up a call to talk more; it is a little confusing, just because of the history of smrnaseq analysis. Thanks!
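
The 'collapsing' step described above can be sketched in a few lines of Python. This is only an illustration: the `collapse` helper and the `seq_<index>_x<count>` header format are hypothetical and are not the naming scheme of mirdeep, mirtop or mirtrace.

```python
from collections import Counter


def collapse(sequences):
    """Collapse identical sequences into FASTA records carrying their counts.

    Each distinct sequence is reported once; the number of times it was seen
    is written into the record name (hypothetical `seq_<i>_x<count>` format).
    """
    counts = Counter(sequences)
    records = []
    for index, (seq, n) in enumerate(counts.most_common(), start=1):
        records.append(f">seq_{index}_x{n}\n{seq}")
    return records


# Hypothetical reads after adapter trimming.
reads = ["AAA", "CCC", "AAA", "GGG", "AAA", "CCC"]
print("\n".join(collapse(reads)))
```

Because the collapse key is the sequence alone, any UMI stored in the read name plays no role at this step, which is why deduplicating only at the BAM level does not change what these tools count.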

@CKComputomics
Contributor

So adding an additional step in genome mode that maps all reads, dedups the bam and converts it back to fastq would be an option?
Setting up a call would be great.

@ewels
Member

ewels commented May 19, 2022

Assuming that UMI sequences are carried into aligned BAM files in the read headers, it should be fine to do alignment+UMI based deduplication, no? I don't really see why it has to be raw sequence based only?
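
For reference, a small sketch of reading a UMI back out of a read header, assuming the UMI was appended after the last underscore (the default separator used by umi_tools). The `umi_from_read_name` helper and the example read name are hypothetical.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return (original_name, umi) from a name such as 'read1_ACGTTGCAACGT'.

    Assumes the UMI is everything after the last separator in the name.
    """
    base, found, umi = read_name.rpartition(separator)
    if not found:
        raise ValueError(f"no UMI separator in read name: {read_name!r}")
    return base, umi


print(umi_from_read_name("read1_ACGTTGCAACGT"))
```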

@apeltzer
Member

I think #164 adds exactly that now - so everyone could take a look at the feature and test it for inclusion 👍🏻

@apeltzer
Member

#303 adds UMI handling, please test this thoroughly!
