-M 2 Question #6

Open
lroppolo opened this issue Sep 14, 2022 · 9 comments

@lroppolo

Hello there!

I have a question about the -M 2 option for identifying sample types using the barcodes.

I see that option 2 finds matched barcodes on both ends of the sequence and identifies pairs that match a sample ID. I am using a dual-indexed barcode set on 16 samples, and when I select this option for demultiplexing, I end up with nearly 18,000 individual fastq files as output instead of 16 individual bins for my samples. Am I doing something wrong here? I just need some guidance on how this works so I can be sure I'm doing the right thing.

Thanks!

Lauren

@jbh-cas
Collaborator

jbh-cas commented Sep 16, 2022 via email

@lroppolo
Author

Hi Jim,

Thank you very much for your reply!

I'm going to attach the barcode file below:
longreads_demux.csv

For reference, I am using the Twist UDI primer sets, and I will also attach a screen grab of the reference sheet below so that you can get an idea of where I am getting the pieces of the barcode file from:
Screen Shot 2022-09-17 at 8 09 02 PM

I am getting errors when I try to run the command you provided to look at the headers of the fastq files; evidently the files don't have at least 10 lines. On seeing this error, I chose to look at my smallest output file in its entirety, and have pasted it below:

@8c581613-6b2c-4a30-b990-7660417f51d7 runid=534c0b0e13044fa9b3e67426197b926f4230572b read=3690 ch=998 start_time=2022-08-18T19:55:25.874796+00:00 flow_cell_id=PAM31864 protocol_group_id=CN_longReads sample_id=no_sample parent_read_id=8c581613-6b2c-4a30-b990-7660417f51d7 basecall_model_version_id=2021-05-05_dna_r9.4.1_promethion_384_dd219f32 h+(3),h-(3) 97230 97232 97123 97131 97042 97049 97231
ACGCTTCGATCTCTCTCTCTTTCCTTCTCTCTGTCTCTCTCTGCCTGTCTCTCTCACTCTGTCTTCTGTCTTACACTCTCTCTCTGCCTGCCTGTCTCTCTCACTCTCTCTCTCTGTGTGTCTCTCTCTCTCTTTCTGTTTCTCTCTGTCTCTCTCTGTCTGTCTCTGTCTTTCTCTGTCTGTCTCTTTGTCTGTCTGTCTTTGTCTTTCCTTCT
+
04*'%')&&*+/{{{{{{{850/('(+.1{{<4447{{{{{420103887{2{{3/..362445=<)((((511+*+.>{{{{{{4320/4777879{{{{6,,,/{{{{{{{{@68={{{;;?{{{{{{{{{322210003+*{{{5455{{{{{7121,*''++)*+++//02{{44599889=998889<=<=:9:=>CA@?>BACC:89>:

Please let me know if I can provide you with anything else, and again I very much appreciate your willingness to help me work through this!

-Lauren

@jbh-cas
Collaborator

jbh-cas commented Sep 21, 2022 via email

@jbh-cas
Collaborator

jbh-cas commented Sep 22, 2022

Try the embedded longreads_demux.tsv file for your barcode input and see how that works.

Also, if you could, run wc -l on your original barcode file to count the number of lines and see what that tells us.

Here it is for file longreads_demux.tsv showing it has 17 lines:

$ wc -l longreads_demux.tsv
17 longreads_demux.tsv

longreads_demux.tsv

SampleID	FwIndex	FwPrimer	RvIndex	RvPrimer
Samp_1	CACTCAAGAA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TGGATGGCAA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_2	AGAGCCATTC	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TTCACCAGCT	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_3	CACGATTCCG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CCTGAGTAGC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_4	TTGGAGCCTG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGGTGTCCGT	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_5	TTACGACTTG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	GTCTGGTTGC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_6	TTAAGGTCGG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CTCTTAGATG	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_7	GGTTCTGTCA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TATCACCTGC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_8	GATACGCACC	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CAGAGGCAAG	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_9	TCGCGAAGCT	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CCGGTCAACA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_10	GTTAAGACGG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TCACGAGGTG	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_11	CCGGTCATAC	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CCATAGACAA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_12	GTCAGCTTAA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	GAGCTTGGAC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_13	ACCGCGGATA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TACGGTGTTG	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_14	GTTGCATCAA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TTCAACTCGA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_15	TGTGCACCAA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AAGGCAGGTA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_16	ATCTGTGGTC	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CGGCCAATTC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
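
Beyond counting lines with wc -l, a quick structural check can show whether a barcode file really is tab-separated with five fields per row. The following is a minimal sketch, not part of minibar: the function name `check_barcode_rows` is hypothetical, and it assumes the header row shown in longreads_demux.tsv above.

```python
import csv
import io

# Header taken from the longreads_demux.tsv listing in this thread.
EXPECTED_COLUMNS = ["SampleID", "FwIndex", "FwPrimer", "RvIndex", "RvPrimer"]


def check_barcode_rows(text):
    """Return a list of problems found in tab-separated barcode text.

    Hypothetical helper for illustration; it is not part of minibar.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows:
        return ["file is empty"]

    problems = []
    if rows[0] != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {rows[0]}")

    seen_pairs = set()
    # Row numbers count non-blank rows, with the header as row 1.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(
                f"row {row_number}: {len(row)} column(s), expected {len(EXPECTED_COLUMNS)}"
            )
            continue
        sample, fw_index, _fw_primer, rv_index, _rv_primer = row
        if (fw_index, rv_index) in seen_pairs:
            problems.append(f"row {row_number}: duplicate index pair for {sample}")
        seen_pairs.add((fw_index, rv_index))
    return problems
```

To use it, pass in the file contents, for example `check_barcode_rows(open("longreads_demux.tsv").read())`. An empty list means the layout looks as expected; a file that was silently saved with commas instead of tabs would show up as rows with a single column.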

@lroppolo
Author

Hello there Jim!

Thank you so much for your reply! I have been using a .tsv file; for whatever reason the file converted to a .csv when I pulled it down, but I was able to check and confirm that my file does have 17 lines as well.

When I run the script with the .tsv file again, the same issue occurs. I will attach my script below:

FILES="/myDirectory/fastq_pass/*"

for file in $FILES
do
     echo "processing $file file..."
     python3 /mydirectory/minibar.py longreads_demux.tsv "$file" -T -M 2 -F -P CN_
done

Essentially, I am getting output files that begin with the "CN_" prefix, and have every combination possible of Samp_1 through Samp_16. The filenames look something like this:

CN_Samp_1_Samp_15_Samp_14_Samp_7_Samp_3.fastq
CN_Samp_2_Samp_16_Samp_4_Samp_9.fastq
CN_Samp_3_Samp_8.fastq

And the output follows the format that I pasted in my last comment, if that is helpful. Please let me know if this helps make any sense of what I'm doing and we can go from there! I appreciate your willingness to help me troubleshoot.
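
One way to get a sense of how the reads are being split is to tally the output files by how many sample IDs appear in each name. This is a rough sketch rather than anything provided by minibar; it assumes the `Samp_N` naming from the barcode file and the filename pattern shown above, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Assumes sample IDs of the form Samp_<number>, as in the barcode file.
SAMPLE_PATTERN = re.compile(r"Samp_\d+")


def samples_in_name(filename):
    """Return the sample IDs embedded in an output filename, in order."""
    return SAMPLE_PATTERN.findall(filename)


def tally_by_sample_count(filenames):
    """Count how many files carry 1, 2, 3, ... sample IDs in their name."""
    return Counter(len(samples_in_name(name)) for name in filenames)


if __name__ == "__main__":
    # The three example filenames quoted in this comment.
    names = [
        "CN_Samp_1_Samp_15_Samp_14_Samp_7_Samp_3.fastq",
        "CN_Samp_2_Samp_16_Samp_4_Samp_9.fastq",
        "CN_Samp_3_Samp_8.fastq",
    ]
    for count, files in sorted(tally_by_sample_count(names).items()):
        print(f"{files} file(s) name {count} sample ID(s)")
```

In a real run the list of names could come from `os.listdir()` on the output directory. A file whose name carries more than one sample ID appears to hold reads that matched several samples at once, so a tally dominated by such files would point to ambiguous barcode matching rather than a problem with the barcode file itself.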

All the best,

Lauren

@jbh-cas
Collaborator

jbh-cas commented Oct 10, 2022

Lauren,

The Twist barcodes are 10 bp, which is 2 more than standard Illumina barcodes but quite a bit smaller than typical Nanopore barcodes. The ones used in the minibar paper were 15 bp each. I also checked with a colleague who recently used the Nanopore 96-plex kit for 40+ photobacteria species; she pointed out this quote: "An ONT native barcode is 40 bp in length (24 bp for the barcode itself plus 8 bp of flanking sequence on each side)", and even so she mentioned there were a lot of unclassified reads.

You can try -e 2 to reduce the error tolerance for the 10bp Twist barcodes but I do worry with the R9 chemistry that the error rate might be too great to separate out a large number of the reads. I'm hopeful that R10 chemistry improves things but I don't have any experience with it -- that is, R10, plenty with hope :)
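
To see why short barcodes and a generous error tolerance interact badly, it can help to compare the indexes against each other. The sketch below is an illustration only and does not reproduce minibar's matching logic: it computes the Levenshtein distance between every pair of indexes and flags pairs at distance d <= 2e, since by the triangle inequality a read can sit within e edits of two barcodes only if those barcodes are at most 2e edits apart. The function names are hypothetical, and the three example indexes are the first three FwIndex values from longreads_demux.tsv.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def ambiguous_pairs(barcodes, max_errors):
    """Return (name_a, name_b, distance) for pairs with distance <= 2 * max_errors.

    Such pairs could, in principle, both be matched by a single noisy read
    when up to max_errors edits are tolerated per barcode.
    """
    risky = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(barcodes.items(), 2):
        distance = edit_distance(seq_a, seq_b)
        if distance <= 2 * max_errors:
            risky.append((name_a, name_b, distance))
    return risky


if __name__ == "__main__":
    # First three FwIndex values from longreads_demux.tsv in this thread.
    fw_indexes = {
        "Samp_1": "CACTCAAGAA",
        "Samp_2": "AGAGCCATTC",
        "Samp_3": "CACGATTCCG",
    }
    for tolerance in (1, 2, 3, 4):
        print(tolerance, ambiguous_pairs(fw_indexes, tolerance))
```

Running this over all 16 forward and reverse indexes at a few tolerance values would give a rough idea of which -e setting leaves the pairs distinguishable, independent of the read error rate.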

best,
Jim H.

@jbh-cas
Collaborator

jbh-cas commented Oct 11, 2022 via email

@jbh-cas
Collaborator

jbh-cas commented Oct 19, 2022

Lauren,

I just put a new version, 0.23, up on GitHub that should reduce the number of files you are getting. The sequences will go into a Multiple_Matches.fastq file, though, so the -e test is still worth trying.

best,
Jim Henderson

@lroppolo
Author

lroppolo commented Oct 19, 2022 via email
