-M 2 Question #6

lroppolo · 2022-09-14T16:14:27Z

Hello there!

I have a question about the -M 2 option for identifying sample types using the barcodes.

I see that option 2 finds matched barcodes on both ends of sequence, and identifies pairs that match a sample ID. I am using a dual-indexed barcode set on 16 samples, and when I select this option for demultiplexing, I end up with nearly 18,000 individual fastq files as output instead of 16 individual bins for my samples. Am I doing something wrong here? I just need some guidance on how this works so I can be sure I'm doing the right thing.

Thanks!

Lauren

jbh-cas · 2022-09-16T23:35:09Z

That does seem excessive. Let's look at some of the headers for the output files by doing this:

head -n 1 *fastq | head -n 30

That will show the file names and first lines for 10 of the fastq files to give a sense of what is being called. You can just paste that into a response. Also, if you could share the barcode file as an attachment, that would be helpful.

My first guess is that there is something in the barcode file that causes each record to be seen as a sample, though I don't know what that might be. That's why taking a look at the barcode file and the headers will be helpful.

lroppolo · 2022-09-18T00:12:28Z

Hi Jim,

Thank you very much for your reply!

I'm going to attach the barcode file below:
longreads_demux.csv

For reference, I am using the Twist UDI primer sets, and I will also attach a screen grab of the reference sheet below so that you can get an idea of where I am getting the pieces of the barcode file from:

I am getting errors when I try to run the command you provided to look at the headers of the fastq files- evidently the files don't have at least 10 lines. On seeing this error, I chose to look at my smallest output file in its entirety, and have pasted it below:

@8c581613-6b2c-4a30-b990-7660417f51d7 runid=534c0b0e13044fa9b3e67426197b926f4230572b read=3690 ch=998 start_time=2022-08-18T19:55:25.874796+00:00 flow_cell_id=PAM31864 protocol_group_id=CN_longReads sample_id=no_sample parent_read_id=8c581613-6b2c-4a30-b990-7660417f51d7 basecall_model_version_id=2021-05-05_dna_r9.4.1_promethion_384_dd219f32 h+(3),h-(3) 97230 97232 97123 97131 97042 97049 97231
ACGCTTCGATCTCTCTCTCTTTCCTTCTCTCTGTCTCTCTCTGCCTGTCTCTCTCACTCTGTCTTCTGTCTTACACTCTCTCTCTGCCTGCCTGTCTCTCTCACTCTCTCTCTCTGTGTGTCTCTCTCTCTCTTTCTGTTTCTCTCTGTCTCTCTCTGTCTGTCTCTGTCTTTCTCTGTCTGTCTCTTTGTCTGTCTGTCTTTGTCTTTCCTTCT
+
04*'%')&&*+/{{{{{{{850/('(+.1{{<4447{{{{{420103887{2{{3/..362445=<)((((511+*+.>{{{{{{4320/4777879{{{{6,,,/{{{{{{{{@68={{{;;?{{{{{{{{{322210003+*{{{5455{{{{{7121,*''++)*+++//02{{44599889=998889<=<=:9:=>CA@?>BACC:89>:

Please let me know if I can provide you with anything else, and again I very much appreciate your willingness to help me work through this!

-Lauren

jbh-cas · 2022-09-21T19:01:22Z

This is a csv, which means the fields are separated by commas. We need a tsv, where tabs are the separator character. This is the message I get running the command on your one-record example, which I screen scraped:

$ minibar.py longreads_demux.csv email_rec.fq -D
Need at least 5 tab delimited columns in the barcode_file. Here is the first line of 'longreads_demux.csv':
SampleID,FwIndex,FwPrimer,RvIndex,RvPrimer

I am going to change the commas to tabs, and I am also going to add Samp_ in front of the number at the beginning of each line so that the sample name stands out more at the end of the header. But I am thinking this is not exactly the file set you are using. I'll send the tsv version back to you and you can let me know what it does. If it does the same thing, perhaps you can send a handful of the input records to test against.
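The comma-to-tab conversion described in this comment can be sketched in Python. This is only an illustration and not part of minibar: the function name `csv_to_tsv` is hypothetical, and the example row assumes the original file used bare sample numbers in the first column.

```python
import csv
import io

def csv_to_tsv(csv_text, prefix="Samp_"):
    """Convert comma-separated barcode text to tab-separated text.

    The header row is kept as-is. The first field of every other row
    gets `prefix` prepended (an assumed naming rule for illustration).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for i, row in enumerate(rows):
        if not row:  # skip blank lines
            continue
        if i > 0:
            row = [prefix + row[0]] + row[1:]
        writer.writerow(row)
    return out.getvalue()

# Header plus one data row, assumed to mirror the comma-separated file.
example = (
    "SampleID,FwIndex,FwPrimer,RvIndex,RvPrimer\n"
    "1,CACTCAAGAA,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,"
    "TGGATGGCAA,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT\n"
)
tsv_text = csv_to_tsv(example)
print(tsv_text, end="")
```

Every output line should then split into 5 fields on tabs, which is what the error message above asks for.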

jbh-cas · 2022-09-22T22:06:48Z

Try the embedded longreads_demux.tsv file for your barcode input and see how that works.

Also, if you could, on your original barcode file, do a wc -l command to count the number of lines and see what that tells us.

Here it is for file longreads_demux.tsv showing it has 17 lines:

$ wc -l longreads_demux.tsv
17 longreads_demux.tsv

longreads_demux.tsv

SampleID	FwIndex	FwPrimer	RvIndex	RvPrimer
Samp_1	CACTCAAGAA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TGGATGGCAA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_2	AGAGCCATTC	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TTCACCAGCT	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_3	CACGATTCCG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CCTGAGTAGC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_4	TTGGAGCCTG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGGTGTCCGT	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_5	TTACGACTTG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	GTCTGGTTGC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_6	TTAAGGTCGG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CTCTTAGATG	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_7	GGTTCTGTCA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TATCACCTGC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_8	GATACGCACC	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CAGAGGCAAG	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_9	TCGCGAAGCT	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CCGGTCAACA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_10	GTTAAGACGG	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TCACGAGGTG	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_11	CCGGTCATAC	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CCATAGACAA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_12	GTCAGCTTAA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	GAGCTTGGAC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_13	ACCGCGGATA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TACGGTGTTG	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_14	GTTGCATCAA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	TTCAACTCGA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_15	TGTGCACCAA	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AAGGCAGGTA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_16	ATCTGTGGTC	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	CGGCCAATTC	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
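A line count alone does not show whether the columns are really tab-delimited, so a slightly fuller check may help. The sketch below is an assumption-laden illustration, not minibar code: `check_barcode_text` is a hypothetical helper. It reports the newline count (which is what `wc -l` counts, so a file with no final newline shows one fewer than its number of lines), the number of lines, and any lines with fewer than 5 tab-separated fields.

```python
def check_barcode_text(text, min_columns=5):
    """Return (newline_count, line_count, bad_line_numbers) for barcode text.

    newline_count matches what `wc -l` reports; bad_line_numbers lists
    1-based numbers of non-blank lines with too few tab-separated fields.
    """
    newline_count = text.count("\n")
    lines = text.splitlines()
    bad = [
        n for n, line in enumerate(lines, start=1)
        if line.strip() and len(line.split("\t")) < min_columns
    ]
    return newline_count, len(lines), bad

# Header and first data row from the file above, with no final newline.
header = "\t".join(["SampleID", "FwIndex", "FwPrimer", "RvIndex", "RvPrimer"])
row = "\t".join([
    "Samp_1", "CACTCAAGAA", "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "TGGATGGCAA", "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
])
result = check_barcode_text(header + "\n" + row)
print(result)
```

To run it on a real file, read the file's contents into a string first and pass that string to the function.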

lroppolo · 2022-09-27T20:44:11Z

Hello there Jim!

Thank you so much for your reply. I have been using a .tsv file; for whatever reason the file converted to a .csv when I pulled it down, but I was able to check and confirm that my file does have 17 lines as well.

When I run the script with the .tsv file again, I am having the same issue happen. I will attach my script below:

FILES="/myDirectory/fastq_pass/*"

for file in $FILES
do
     echo "processing $file file..."
     python3 /mydirectory/minibar.py longreads_demux.tsv $file -T -M 2 -F -P CN_
done

Essentially, I am getting output files that begin with the "CN_" prefix, and have every combination possible of Samp_1 through Samp_16. The filenames look something like this:

CN_Samp_1_Samp_15_Samp_14_Samp_7_Samp_3.fastq
CN_Samp_2_Samp_16_Samp_4_Samp_9.fastq
CN_Samp_3_Samp_8.fastq

And the output follows the format that I pasted in my last comment, if that is helpful. Please let me know if this helps make any sense of what I'm doing and we can go from there! I appreciate your willingness to help me troubleshoot.
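To get a feel for how the output is distributed, the filenames can be grouped by how many sample names each one contains. The following is a rough sketch, assuming the `Samp_<number>` naming shown above; `samples_in_name` is a hypothetical helper and the list holds only the three example names from this comment.

```python
import re
from collections import Counter

# Matches sample names of the form Samp_<number> inside a filename.
SAMPLE_RE = re.compile(r"Samp_\d+")

def samples_in_name(filename):
    """Return the sample names embedded in an output filename, in order."""
    return SAMPLE_RE.findall(filename)

names = [
    "CN_Samp_1_Samp_15_Samp_14_Samp_7_Samp_3.fastq",
    "CN_Samp_2_Samp_16_Samp_4_Samp_9.fastq",
    "CN_Samp_3_Samp_8.fastq",
]

# Tally files by the number of sample names in each filename.
counts = Counter(len(samples_in_name(n)) for n in names)
for k in sorted(counts):
    print(f"{k} sample names: {counts[k]} file(s)")
```

With the full directory listing in `names`, a large share of files carrying two or more sample names would suggest that reads are matching several barcodes at once.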

All the best,

Lauren

jbh-cas · 2022-10-10T20:22:35Z

Lauren,

The Twist barcodes are 10 bp, which is 2 more than standard Illumina barcodes but quite a bit shorter than typical Nanopore barcodes. The ones used in the minibar paper were 15 bp each. I also checked with a colleague who recently used the Nanopore 96-plex kit for 40+ photobacteria species; she pointed out this quote: "An ONT native barcode is 40 bp in length (24 bp for the barcode itself plus 8 bp of flanking sequence on each side)", and even so she mentioned there were a lot of unclassified reads.

You can try -e 2 to reduce the error tolerance for the 10bp Twist barcodes but I do worry with the R9 chemistry that the error rate might be too great to separate out a large number of the reads. I'm hopeful that R10 chemistry improves things but I don't have any experience with it -- that is, R10, plenty with hope :)
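One way to see why short barcodes and a loose error tolerance interact badly is to measure how far apart the barcodes are from each other. By the triangle inequality, a read can be within e edits of two different barcodes only when those barcodes are at most 2e edits apart. The sketch below uses plain Levenshtein distance as an assumption; minibar's own matching may differ, and `edit_distance` and `closest_pairs` are hypothetical helpers. The four indexes are the first four FwIndex values from the barcode file.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def closest_pairs(barcodes):
    """Return (min_distance, [(name_a, name_b), ...]) over all pairs."""
    best, pairs = None, []
    for (na, sa), (nb, sb) in combinations(barcodes.items(), 2):
        d = edit_distance(sa, sb)
        if best is None or d < best:
            best, pairs = d, [(na, nb)]
        elif d == best:
            pairs.append((na, nb))
    return best, pairs

fw_index = {
    "Samp_1": "CACTCAAGAA",
    "Samp_2": "AGAGCCATTC",
    "Samp_3": "CACGATTCCG",
    "Samp_4": "TTGGAGCCTG",
}
print(closest_pairs(fw_index))
```

Running this over all 16 forward and reverse indexes gives the smallest pairwise distance, which can then be compared against twice the value passed to -e.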

best,
Jim H.

jbh-cas · 2022-10-11T07:48:00Z

Try the attached longreads_demux.tsv file for your barcode input and see how that works.

Also, if you could, on your original barcode file, do a wc -l command to count the number of lines and see what that tells us.

Here it is for the attached file longreads_demux.tsv showing it has 16 lines:

$ wc -l longreads_demux.tsv
16 longreads_demux.tsv

jbh-cas · 2022-10-19T18:05:30Z

Lauren,

I just put a new version, 0.23, up on GitHub that should reduce the number of files that you are getting. The sequences will go into a Multiple_Matches.fastq file, though, so the -e test is still worth trying.

best,
Jim Henderson

lroppolo · 2022-10-19T22:05:32Z

Hello Jim! I am about to download the latest version and try this. Will keep you posted on how things are going. Thank you for your help! Lauren

