Will pointed out an issue with the downsampling: it does not take into account whether reads are marked as duplicates. Ideally we would downsample everything to the same number of primary (non-duplicated) reads. As it is now, however, it is downsampling to the same number of total reads (primary plus duplicated). This leads to inconsistent depths of actual reads used when there is variation in the fraction of duplicated reads.
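For a quick sense of how far apart the two counts are per sample, a hedged sketch (the file name is a placeholder; `3328` is one way to define "primary, non-duplicate," i.e., excluding secondary, duplicate, and supplementary alignments):

```bash
samtools view -c input.bam            # all records, including duplicates
samtools view -c -F 3328 input.bam    # primary, non-duplicate reads only
                                      # (3328 = 0x100 secondary + 0x400 duplicate + 0x800 supplementary)
```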
Two possible avenues:
Simply remove all duplicates. This would be easy but violates the GATK principle of keeping everything for possible later inspection.
Don't remove duplicates, but account for them in the average-depth calculation used for downsampling (for example, run `samtools stats` with the `-d` option so duplicates are excluded from the statistics) and also in the actual downsampling (e.g., by using `samtools view -F` to exclude duplicates, then piping that result to `samtools view --subsample`, unless `--subsample` requires an indexed bam); a rough sketch follows below. Also, I know CH and Matt have used GATK tools for downsampling, and those might work a little better with regard to duplicates.
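A minimal sketch of the second avenue, assuming samtools >= 1.13 (for `--subsample`) and that the subsampling fraction has already been computed from the duplicate-excluded depth reported by `samtools stats -d`; the file names, fraction, and seed are placeholders:

```bash
FRACTION=0.37   # hypothetical: target_depth / current primary-read depth from `samtools stats -d`

# Drop duplicate (0x400), secondary (0x100), and supplementary (0x800) alignments,
# then subsample the remaining primary reads to the target fraction.
samtools view -u -F 3328 input.bam \
  | samtools view -b --subsample "$FRACTION" --subsample-seed 42 \
      -o downsampled.bam -
```

My understanding is that `--subsample` keeps or drops reads based on a hash of the read name, so it should work on a stream without an index, but that is worth verifying.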