-M 2 Question #6
That does seem excessive. Let's look at some of the headers for the output files by doing this:
head -n 1 *fastq | head -n 30
That will show the file names and first lines for 10 of the fastq files to give a sense of what is being called. You can just paste that into a response.
Also, if you could share the barcode file as an attachment, that would be helpful. My first guess is that there is something in the barcode file that causes each record to be seen as a sample, though I don't know what that might be. That's why taking a look at the barcode file and the headers will be helpful.
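If the shell pipeline gives trouble, the same check can be sketched in Python. This is only an illustration of the idea above: the function name `first_header_lines` and its parameters are made up for this example and are not part of minibar.

```python
from pathlib import Path


def first_header_lines(directory=".", pattern="*fastq", limit=10):
    """Return (file name, first line) pairs for up to `limit` matching files."""
    results = []
    # Sort so the same files are reported on every run.
    for path in sorted(Path(directory).glob(pattern))[:limit]:
        with path.open() as handle:
            # The first line of a FASTQ file is the header of its first record.
            results.append((path.name, handle.readline().rstrip("\n")))
    return results


# Example use, run from the directory holding the demultiplexed output:
#   for name, header in first_header_lines():
#       print(name, header, sep="\n")
```

Unlike the `head` pipeline, this reads exactly one line per file, so very short output files are not a problem.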
---------------------------- Original Message ----------------------------
Subject: [calacademy-research/minibar] -M 2 Question (Issue #6)
From: "LR Brazell" ***@***.***>
Date: Wed, September 14, 2022 9:14 am
To: "calacademy-research/minibar" ***@***.***>
Cc: "Subscribed" ***@***.***>
--------------------------------------------------------------------------
Hello there!
I have a question about the -M 2 option for identifying sample types using the barcodes.
I see that option 2 finds matched barcodes on both ends of sequence, and identifies pairs that match a sample ID. I am using a dual-indexed barcode set on 16 samples, and when I select this option for demultiplexing, I end up with nearly 18,000 individual fastq files as output instead of 16 individual bins for my samples. Am I doing something wrong here? I just need some guidance on how this works so I can be sure I'm doing the right thing.
…
Thanks!
Lauren
|
Hi Jim, Thank you very much for your reply! I'm going to attach the barcode file below: For reference, I am using the Twist UDI primer sets, and I will also attach a screen grab of the reference sheet below so that you can get an idea of where I am getting the pieces of the barcode file from: I am getting errors when I try to run the command you provided to look at the headers of the fastq files- evidently the files don't have at least 10 lines. On seeing this error, I chose to look at my smallest output file in its entirety, and have pasted it below:
Please let me know if I can provide you with anything else, and again I very much appreciate your willingness to help me work through this! -Lauren |
This is a csv, which means the fields are separated by commas. We need a tsv, where tabs are the separator character. This is the message I get running the command on the one-record example I screen scraped from your comment:
$ minibar.py longreads_demux.csv email_rec.fq -D
Need at least 5 tab delimited columns in the barcode_file.
Here is the first line of 'longreads_demux.csv':
SampleID,FwIndex,FwPrimer,RvIndex,RvPrimer
I am going to change the commas to tabs, and I am also going to add Samp_ in front of the number at the beginning of each line so that the sample name stands out more at the end of the header.
But I am thinking this is not exactly the file set you are using. I'll send the tsv version back to you and you can let me know what it does. If it does the same thing, perhaps you can send a handful of the input records to test against.
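For anyone following along, the conversion described above (commas to tabs, plus a Samp_ prefix on numeric sample IDs) could look roughly like this in Python. It is a minimal sketch, assuming no field contains a quoted comma; `csv_to_tsv` is a hypothetical helper, not something minibar provides, and the barcode values in the usage comment are placeholders.

```python
def csv_to_tsv(lines, prefix="Samp_"):
    """Convert comma-separated barcode lines to tab-separated lines.

    A first field made up only of digits gets `prefix` put in front of it,
    so the header row (starting with SampleID) is left alone.
    Assumption: no field contains a comma of its own.
    """
    converted = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(",")
        if fields[0].isdigit():
            fields[0] = prefix + fields[0]
        converted.append("\t".join(fields))
    return converted


# Example use (file names as in the thread):
#   with open("longreads_demux.csv") as src, open("longreads_demux.tsv", "w") as dst:
#       dst.write("\n".join(csv_to_tsv(src)) + "\n")
```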
---------------------------- Original Message ----------------------------
Subject: Re: [calacademy-research/minibar] -M 2 Question (Issue #6)
From: "LR Brazell" ***@***.***>
Date: Sat, September 17, 2022 5:12 pm
To: "calacademy-research/minibar" ***@***.***>
Cc: "Jim Henderson" ***@***.***>
"Comment" ***@***.***>
--------------------------------------------------------------------------
Hi Jim,
Thank you very much for your reply!
I'm going to attach the barcode file below:
[longreads_demux.csv](https://github.com/calacademy-research/minibar/files/9592764/longreads_demux.csv)
For reference, I am using the Twist UDI primer sets, and I will also attach a screen grab of the reference sheet below so that you can get an idea of where I am getting the pieces of the barcode file from:
<img width="1416" alt="Screen Shot 2022-09-17 at 8 09 02 PM" src="https://user-images.githubusercontent.com/43578691/190880258-a5d5377a-41ce-47cf-b25b-0f1d16405bd9.png">
I am getting errors when I try to run the command you provided to look at the headers of the fastq files- evidently the files don't have at least 10 lines. On seeing this error, I chose to look at my smallest output file in its entirety, and have pasted it below:
```
@8c581613-6b2c-4a30-b990-7660417f51d7 runid=534c0b0e13044fa9b3e67426197b926f4230572b read=3690 ch=998 start_time=2022-08-18T19:55:25.874796+00:00 flow_cell_id=PAM31864 protocol_group_id=CN_longReads sample_id=no_sample parent_read_id=8c581613-6b2c-4a30-b990-7660417f51d7 basecall_model_version_id=2021-05-05_dna_r9.4.1_promethion_384_dd219f32 h+(3),h-(3) 97230 97232 97123 97131 97042 97049 97231
… ACGCTTCGATCTCTCTCTCTTTCCTTCTCTCTGTCTCTCTCTGCCTGTCTCTCTCACTCTGTCTTCTGTCTTACACTCTCTCTCTGCCTGCCTGTCTCTCTCACTCTCTCTCTCTGTGTGTCTCTCTCTCTCTTTCTGTTTCTCTCTGTCTCTCTCTGTCTGTCTCTGTCTTTCTCTGTCTGTCTCTTTGTCTGTCTGTCTTTGTCTTTCCTTCT
+
***@***.***={{{;;?{{{{{{{{{322210003+*{{{5455{{{{{7121,*''++)*+++//02{{44599889=998889<=<=:9:=>CA@?>BACC:89>:
```
Please let me know if I can provide you with anything else, and again I very much appreciate your willingness to help me work through this!
-Lauren
|
Try the embedded longreads_demux.tsv file for your barcode input and see how that works. Also, if you could, on your original barcode file, do a wc -l command to count the number of lines and see what that tells us. Here it is for file longreads_demux.tsv showing it has 17 lines:
$ wc -l longreads_demux.tsv
17 longreads_demux.tsv
|
Hello there Jim! Thank you so much for your reply-- I have been using a .tsv file, for whatever reason the file converted to a .csv when I pulled it down, but I was able to check and confirm that my file does have 17 lines as well. When I run the script with the .tsv file again, I am having the same issue happen. I will attach my script below:
Essentially, I am getting output files that begin with the "CN_" prefix, and have every combination possible of Samp_1 through Samp_16. The filenames look something like this: CN_Samp_1_Samp_15_Samp_14_Samp_7_Samp_3.fastq And the output follows the format that I pasted in my last comment, if that is helpful. Please let me know if this helps make any sense of what I'm doing and we can go from there! I appreciate your willingness to help me troubleshoot. All the best, Lauren |
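One way to get a feel for how the output is distributed is to count how many sample names appear in each output file name. The sketch below assumes the file names follow the pattern quoted above (sample names of the form Samp_<number>); `samples_in_name` and `tally_by_sample_count` are hypothetical helpers written for this thread.

```python
import re
from collections import Counter

# Sample names as they appear in the renamed barcode file: Samp_1 ... Samp_16.
SAMPLE_PATTERN = re.compile(r"Samp_\d+")


def samples_in_name(filename):
    """Return the sample names embedded in one output file name, in order."""
    return SAMPLE_PATTERN.findall(filename)


def tally_by_sample_count(filenames):
    """Count output files by how many sample names each file name contains."""
    return Counter(len(samples_in_name(name)) for name in filenames)


# Example use on the current directory:
#   import os
#   print(tally_by_sample_count(n for n in os.listdir(".") if n.endswith(".fastq")))
```

A tally dominated by file names with two or more sample names would be consistent with reads matching the barcodes of several samples at once.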
Lauren, The Twist barcodes are 10bp which is 2 more than standard Illumina barcodes but quite a bit smaller than typical Nanopore barcodes. The ones used in the minibar paper were 15bp each and I checked with a colleague who used the nanopore 96-plex kit recently for 40+ photobacteria species and pointed out this quote: "An ONT native barcode is 40 bp in length (24 bp for the barcode itself plus 8 bp of flanking sequence on each side)" and even so she mentioned there were a lot of unclassified reads. You can try -e 2 to reduce the error tolerance for the 10bp Twist barcodes but I do worry with the R9 chemistry that the error rate might be too great to separate out a large number of the reads. I'm hopeful that R10 chemistry improves things but I don't have any experience with it -- that is, R10, plenty with hope :) best, |
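To illustrate why short barcodes and a generous error tolerance interact badly, here is a rough sketch that lists barcode pairs which could be confused at a given tolerance. It uses plain Levenshtein distance and the observation that a sequence can sit within e edits of two barcodes whenever those barcodes are at most 2e edits apart. This is not a description of how minibar matches barcodes internally, and the barcode sequences in the usage comment are invented placeholders, not Twist UDI sequences.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # match or substitute
            ))
        previous = current
    return previous[-1]


def ambiguous_pairs(barcodes, max_errors):
    """Return (name1, name2, distance) for barcode pairs at distance <= 2 * max_errors.

    `barcodes` maps a sample name to its barcode sequence.
    """
    pairs = []
    for (name1, seq1), (name2, seq2) in combinations(barcodes.items(), 2):
        distance = edit_distance(seq1, seq2)
        if distance <= 2 * max_errors:
            pairs.append((name1, name2, distance))
    return pairs


# Example use with placeholder barcodes:
#   ambiguous_pairs({"Samp_1": "AAAAAAAAAA", "Samp_2": "AAAAAAAATT"}, max_errors=2)
```

Running this over the index columns of the barcode file for a few values of the tolerance gives a quick sense of how much headroom the 10bp barcodes leave.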
Try the attached longreads_demux.tsv file for your barcode input and see how that works.
Also, if you could, on your original barcode file, do a wc -l command to count the number of lines and see what that tells us.
Here it is for the attached file longreads_demux.tsv showing it has 16 lines:
$ wc -l longreads_demux.tsv
16 longreads_demux.tsv
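The line count and the column requirement from the earlier error message can be checked together. Below is a small sketch, assuming the only requirement is at least 5 tab-delimited fields per line, as that message states; `check_barcode_lines` is a hypothetical helper, and it counts a final line without a trailing newline, which wc -l does not.

```python
def check_barcode_lines(lines, min_columns=5):
    """Return (line_count, problems) for the lines of a barcode file.

    `problems` lists (line_number, field_count) for every line that has
    fewer than `min_columns` tab-delimited fields.
    """
    line_count = 0
    problems = []
    for number, line in enumerate(lines, start=1):
        line_count += 1
        field_count = len(line.rstrip("\r\n").split("\t"))
        if field_count < min_columns:
            problems.append((number, field_count))
    return line_count, problems


# Example use:
#   with open("longreads_demux.tsv") as handle:
#       print(check_barcode_lines(handle))
```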
|
Lauren, I just put a new version 0.23 up on github that should reduce the number of files that you are getting. Though the sequences will go into a Multiple_Matches.fastq file, so the -e test is still worth trying. best, |
Hello Jim!
I am about to download the latest version and try this. Will keep you posted on how things are going.
Thank you for your help!
Lauren
…On Wed, Oct 19, 2022 at 2:05 PM Jim Henderson ***@***.***> wrote:
Lauren,
I just put a new version 0.23 up on github that should reduce the number of files that you are getting. Though the sequences will go into a Multiple_Matches.fastq file, so the -e test is still worth trying.
best,
Jim Henderson
|