Using the master version of pysam and the following benchmark script (https://github.com/astaric/pysam/blob/speed-up-bcf-genotype-count/tests/VariantRecordPL_bench.py), I get the following runtimes for accessing the PL field for all samples in a single record:
When the number of samples gets large, having 10x the samples results in 100x slower execution, which indicates quadratic complexity of the operation.
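To make the 10x to 100x relationship concrete, here is a minimal cost model in Python. It is not pysam code: `cost_of_reading_all_samples` is a hypothetical helper, and it assumes that each per-sample field access does work proportional to the total number of samples in the record.

```python
def cost_of_reading_all_samples(n_samples: int) -> int:
    """Count unit operations when every per-sample access touches all samples."""
    operations = 0
    for _current_sample in range(n_samples):    # one field access per sample
        for _other_sample in range(n_samples):  # assumed: each access scans every sample
            operations += 1
    return operations


small = cost_of_reading_all_samples(100)
large = cost_of_reading_all_samples(1000)
ratio = large // small
print(ratio)  # 100: ten times the samples, one hundred times the work
```

Under that assumption the total cost grows with the square of the sample count, which matches the observed scaling.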
I looked into the codebase and found that most of the time is spent in `bcf_get_genotypes`. This function is called as part of computing the number of elements in the PL field before each access to the field. `bcf_get_genotypes` creates an array of GT information for all samples, which is then used to compute `max_ploidy`. The only part of the array being accessed is the GT information for the "current" sample; the rest is discarded.

I am not really sure what the idea behind this `max_ploidy` check is. If it could be removed, one could just access the GT array for the current sample and count the number of values to get the ploidy, which should be much faster than creating the array for all samples.
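The per-sample alternative could look roughly like the sketch below. It is a plain-Python model, not a patch against pysam or htslib: `sample_ploidy`, the flat list layout, and the `width` parameter are hypothetical, and the `VECTOR_END` sentinel is assumed to mirror the padding value htslib uses for genotypes shorter than the widest one. Whether the `max_ploidy` check can actually be dropped remains an open question.

```python
# Assumed sentinel, modeled on htslib's bcf_int32_vector_end padding value.
VECTOR_END = -2147483647


def sample_ploidy(gt_values, width, sample_index):
    """Count the GT values stored for one sample in a flat per-record array.

    gt_values: flat list with `width` slots per sample (hypothetical layout).
    Only the slots of the requested sample are read.
    """
    start = sample_index * width
    ploidy = 0
    for value in gt_values[start:start + width]:
        if value == VECTOR_END:  # padding marks the end of a shorter genotype
            break
        ploidy += 1
    return ploidy


# Three samples, two slots each; the second sample is haploid and padded.
gt = [2, 4, 2, VECTOR_END, 4, 4]
ploidies = [sample_ploidy(gt, 2, i) for i in range(3)]
print(ploidies)  # [2, 1, 2]
```

Each call reads only `width` values, so the cost per access no longer depends on the number of samples in the record.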