
Overhaul the downsampling to deal with duplicates #13

Open
eriqande opened this issue Oct 23, 2024 · 0 comments

Comments

@eriqande (Owner)

Will pointed out an issue with the downsampling---it does not take into account whether reads are marked as duplicates. Ideally we would downsample everything to the same number of primary (non-duplicate) reads. As it is now, however, it downsamples to the same number of total reads (primary plus duplicate), which leads to inconsistent depths of usable reads when the fraction of duplicated reads varies.

Two possible avenues:

  1. Simply remove all duplicates. This would be easy but violates the GATK principle of keeping everything for possible later inspection.
  2. Don't remove duplicates, but account for them in the average depth of coverage (for example, use samtools stats with the -d option in the calculations that determine the downsampling fraction) and then also in the actual downsampling (e.g., by using samtools view -F to exclude duplicates, and then piping that result to samtools view --subsample (unless --subsample requires an indexed bam)). A rough sketch of this pipeline is below the list. Also, I know CH and Matt have used GATK tools for downsampling, and those might work a little better with regard to duplicates.
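
Here is a minimal sketch of what option 2 could look like, assuming samtools >= 1.13 (for --subsample). The target depth, reference length, and file names are placeholders, and estimating depth from the "bases mapped (cigar)" field of samtools stats is just one possible way to do it:

```sh
#!/usr/bin/env bash
# Sketch of option 2: compute mean depth from non-duplicate reads only,
# then downsample while excluding duplicates.
# Assumes samtools >= 1.13; IN, OUT, TARGET_DEPTH, and GENOME_LENGTH are placeholders.

IN=sample.bam
OUT=sample.downsampled.bam
TARGET_DEPTH=2.0          # desired mean depth of non-duplicate reads
GENOME_LENGTH=2400000000  # total reference length used in the depth calculation

# Mean depth of non-duplicate reads: `samtools stats -d` excludes reads
# marked as duplicates, and "bases mapped (cigar)" counts aligned bases.
BASES=$(samtools stats -d "$IN" | awk -F '\t' '$2 == "bases mapped (cigar):" {print $3}')
DEPTH=$(echo "$BASES / $GENOME_LENGTH" | bc -l)

# Fraction of non-duplicate reads to keep, capped at 1.
FRAC=$(echo "f = $TARGET_DEPTH / $DEPTH; if (f > 1) f = 1; f" | bc -l)

# Exclude duplicates (flag 0x400) and subsample what remains.
samtools view -h -F 1024 "$IN" \
  | samtools view -b --subsample "$FRAC" --subsample-seed 42 -o "$OUT" -
```

If I recall correctly, --subsample decides per read name (so mate pairs are kept or dropped together) and does not need an index, so it should work on a stream like this, but that is worth double-checking on whatever samtools version the workflow pins.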