
Overhaul the downsampling to deal with duplicates #13

Open
eriqande opened this issue Oct 23, 2024 · 0 comments

Comments

@eriqande (Owner)

Will pointed out an issue with the downsampling---it does not take into account whether reads are marked as duplicates. Ideally we would downsample everything to the same number of primary (non-duplicate) reads. As it is now, however, it downsamples to the same number of total reads (primary plus duplicate), which leads to inconsistent depths of usable reads when the fraction of duplicated reads varies.

Two possible avenues:

  1. Simply remove all duplicates. This would be easy but violates the GATK principle of keeping everything for possible later inspection.
  2. Don't remove duplicates, but account for them in the average depth of coverage (for example, use samtools stats with the -d option in the calculations that determine the downsampling fraction) and then also in the actual downsampling (e.g., by using samtools view -F to exclude duplicates, and then piping that result to samtools view --subsample (unless --subsample requires an indexed bam)). A rough sketch of this pipeline is below the list. Also, I know CH and Matt have used GATK tools for downsampling, and those might work a little better with regard to duplicates.
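
Here is a minimal sketch of what option 2 could look like, assuming samtools >= 1.13 (for --subsample). The target depth, reference length, and file names are placeholders, and estimating depth from the "bases mapped (cigar)" field of samtools stats is just one possible way to do it:

```sh
#!/usr/bin/env bash
# Sketch of option 2: compute mean depth from non-duplicate reads only,
# then downsample while excluding duplicates.
# Assumes samtools >= 1.13; IN, OUT, TARGET_DEPTH, and GENOME_LENGTH are placeholders.

IN=sample.bam
OUT=sample.downsampled.bam
TARGET_DEPTH=2.0          # desired mean depth of non-duplicate reads
GENOME_LENGTH=2400000000  # total reference length used in the depth calculation

# Mean depth of non-duplicate reads: `samtools stats -d` excludes reads
# marked as duplicates, and "bases mapped (cigar)" counts aligned bases.
BASES=$(samtools stats -d "$IN" | awk -F '\t' '$2 == "bases mapped (cigar):" {print $3}')
DEPTH=$(echo "$BASES / $GENOME_LENGTH" | bc -l)

# Fraction of non-duplicate reads to keep, capped at 1.
FRAC=$(echo "f = $TARGET_DEPTH / $DEPTH; if (f > 1) f = 1; f" | bc -l)

# Exclude duplicates (flag 0x400) and subsample what remains.
samtools view -h -F 1024 "$IN" \
  | samtools view -b --subsample "$FRAC" --subsample-seed 42 -o "$OUT" -
```

If I recall correctly, --subsample decides per read name (so mate pairs are kept or dropped together) and does not need an index, so it should work on a stream like this, but that is worth double-checking on whatever samtools version the workflow pins.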