Does it work with non-model species? #3

Open
DuttaAnik opened this issue Jul 11, 2024 · 5 comments
Labels
question Further information is requested

Comments

@DuttaAnik

DuttaAnik commented Jul 11, 2024

Hello,
Thanks for developing the tool. Does this tool work with non-model species of different ploidy?

@JustinChu
Owner

Unfortunately, the code was designed specifically for diploid genomes. It considers whether each site is homozygous or heterozygous, and it can also handle missing sites. If you fed in only sites with two alleles at roughly equal frequencies (as a hack), it may provide some results, but I cannot guarantee that they will make sense.

This does have me wondering whether we could create a model that handles genomes of arbitrary ploidy without sacrificing statistical power.
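
One possible reading of the hack above is to pre-filter the VCF so that only biallelic SNPs with a balanced allele frequency remain. The following is a minimal sketch under several assumptions that do not come from the thread: the VCF is uncompressed, it carries an `AF` tag in the INFO column, and 0.4 to 0.6 counts as "roughly equal". `filter_balanced_biallelic` is a hypothetical helper name, not part of ntsm.

```shell
# Hypothetical helper: keep header lines plus biallelic SNPs whose INFO AF
# value lies within [min, max]. The defaults 0.4 and 0.6 are arbitrary.
# Usage: filter_balanced_biallelic in.vcf [min_af] [max_af] > filtered.vcf
filter_balanced_biallelic() {
    awk -F'\t' -v lo="${2:-0.4}" -v hi="${3:-0.6}" '
        /^#/ { print; next }                      # pass header lines through
        # Single-base REF and ALT: drops indels and multi-allelic records.
        $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ {
            # Locate the AF tag (assumed present) without matching e.g. MAF=
            if (match($8, /(^|;)AF=[0-9.eE+-]+/)) {
                af = substr($8, RSTART, RLENGTH)
                sub(/^.*AF=/, "", af)
                if (af + 0 >= lo && af + 0 <= hi) print
            }
        }
    ' "$1"
}
```

Records without an `AF` tag are dropped by this sketch. Whether the filtered sites give meaningful results for a non-diploid genome is exactly the open question in this thread.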

@DuttaAnik
Author

Thanks for the reply. Although it is a far-fetched idea, it would be really cool to have this option in the tool, along with support for multi-allelic sites. To my knowledge, no good tools are available for detecting sample swaps in non-model organisms.

JustinChu added the "question" label (Further information is requested) on Jul 11, 2024
@JustinChu
Owner

JustinChu commented Jul 11, 2024

I would be interested to know whether the tool gives back any meaningful results in your case if you run it (with the hack). If I were to guess, it might work given enough sites with high enough variability. In the worst case, I think it will say everything is unrelated, so I don't think it would hurt to try.

@DuttaAnik
Author

Hi, I have a few questions. First, thanks for fixing the parsing bug. It works now.

So, in the following command:
scripts/generateSites name=prefix ref=reference.fa vcf=snps.vcf
I should use the multisample VCF file that contains SNPs from all the samples, right?

Then, in this command:
ntsmVCF -p prefix -s sites.fa -r reference.fa multiVCF.vcf
Should I use the same VCF that I used in the first command? This is a bit confusing. And I assume sites.fa is created by the first command, right?

Lastly, can I pass a list of raw FASTQ files instead of writing them out one by one as in the command below? If yes, what should the format of the list file be?
I have hundreds of FASTQ files.
ntsmCount -t 2 -s sites.fa sample_part1.fq sample_part2.fq > counts.txt

Thank you.

@JustinChu
Owner

JustinChu commented Jul 15, 2024

> So, in this following command: scripts/generateSites name=prefix ref=reference.fa vcf=snps.vcf I should use the multisample VCF file that contains SNPs from all the samples, right?

Edit: Actually, the VCF used here doesn't need to be a multisample VCF. It just needs to contain the biallelic variants.

> Then, in this command: ntsmVCF -p prefix -s sites.fa -r reference.fa multiVCF.vcf Should I use the same VCF that I used in the first command? This is a bit confusing. And the sites.fa I assume is created from the first command, right?

Edit: The multi-VCF file here must be a multisample VCF with reliable genotyping results from a reliable set of samples, so that it captures the population structure. It can be, but does not have to be, the same VCF as above. Also, ideally the multisample VCF should not contain any of the samples used in the sample swap detection process downstream. Your assumption about sites.fa is correct. I've changed the README to clarify where sites.fa comes from, and I've added text to mention that using a rotation matrix is optional.
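
To check that a VCF really is multisample before passing it to ntsmVCF, one option is to count the sample columns in its `#CHROM` header line. This is a minimal sketch that assumes an uncompressed VCF; `vcf_sample_count` is a hypothetical helper name, not part of ntsm.

```shell
# Hypothetical helper: print the number of sample columns in a VCF.
# A VCF header line has 8 fixed columns, then FORMAT, then one column
# per sample, so the sample count is (number of fields - 9).
# Usage: vcf_sample_count multiVCF.vcf
vcf_sample_count() {
    awk -F'\t' '
        /^#CHROM/ {
            n = NF - 9
            print (n > 0 ? n : 0)   # sites-only VCFs have no sample columns
            exit
        }
    ' "$1"
}
```

A result of 0 or 1 would suggest the file is a sites-only or single-sample VCF, which would still be acceptable for generateSites but not for the ntsmVCF step described above.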

> Lastly, can I use a list of raw fastq files instead of writing them one by one in the code below? If yes, what should be the format of the list file? Because I have more than 100s of fastq files. ntsmCount -t 2 -s sites.fa sample_part1.fq sample_part2.fq > counts.txt

At the moment there is no support for a file list. However, Unix globs (i.e. wildcards such as *) should work. Also, to be clear, each sample needs its own count file and thus a separate ntsmCount command.
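
For hundreds of FASTQ files, the combination of globs and one ntsmCount command per sample can be wrapped in a loop. This is a minimal sketch that assumes files are named `<sample>_part<N>.fq`, following the example command earlier in the thread; the function name `count_all_samples` and the output naming are hypothetical.

```shell
# Hypothetical helper: run ntsmCount once per sample, writing one count
# file per sample. Assumes FASTQ files are named <sample>_part<N>.fq.
# Usage: count_all_samples sites.fa fastq_dir counts_dir
count_all_samples() {
    sites="$1"
    fq_dir="$2"
    out_dir="$3"
    mkdir -p "$out_dir"
    for first in "$fq_dir"/*_part1.fq; do
        [ -e "$first" ] || continue    # the glob matched no files
        sample="$(basename "$first" _part1.fq)"
        # The glob below expands to all parts of this one sample only.
        ntsmCount -t 2 -s "$sites" "$fq_dir/$sample"_part*.fq \
            > "$out_dir/$sample.counts.txt"
    done
}
```

Calling `count_all_samples sites.fa fastq/ counts/` would then produce `counts/<sample>.counts.txt` for every sample that has a `_part1.fq` file. The naming pattern would need adjusting for other layouts, such as gzipped or paired-end file names.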

DuttaAnik changed the title from "Does it work with plant samples?" to "Does it work with non-model species?" on Jul 19, 2024