Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accessing PL field is slow when number of samples gets large #1280

Open
astaric opened this issue Apr 22, 2024 · 0 comments
Open

Accessing PL field is slow when number of samples gets large #1280

astaric opened this issue Apr 22, 2024 · 0 comments

Comments

@astaric
Copy link

astaric commented Apr 22, 2024

Using the master version of pysam and the following benchmark script: https://github.com/astaric/pysam/blob/speed-up-bcf-genotype-count/tests/VariantRecordPL_bench.py I get the following runtimes for accessing the PL field for all samples in a single record:

---------------------------------------------------------------------------------------------------------------- benchmark: 5 tests ----------------------------------------------------------------------------------------------------------------
Name (time in us)                                    Min                        Max                       Mean                StdDev                     Median                   IQR            Outliers          OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_access_pl_values_10_samples                 38.2920 (1.0)             185.3330 (1.0)              41.2211 (1.0)          8.9669 (inf)              39.9170 (1.0)          2.3340 (inf)         15;33  24,259.4229 (1.0)         733           1
test_access_pl_values_100_samples               136.3750 (3.56)            546.6249 (2.95)            142.9587 (3.47)        12.5355 (inf)             139.7920 (3.50)         3.6876 (inf)       119;289   6,995.0286 (0.29)       3296           1
test_access_pl_values_1000_samples            3,246.3749 (84.78)         8,651.0000 (46.68)         3,340.0859 (81.03)      425.3084 (inf)           3,273.8131 (82.02)       87.7501 (inf)           2;8     299.3935 (0.01)        274           1
test_access_pl_values_10000_samples         257,407.6669 (>1000.0)     270,355.5422 (>1000.0)     261,211.9690 (>1000.0)  6,149.1067 (inf)         258,542.3335 (>1000.0)  7,216.5631 (inf)           1;0       3.8283 (0.00)          4           1
test_access_pl_values_100000_samples     25,141,500.8749 (>1000.0)  25,141,500.8749 (>1000.0)  25,141,500.8749 (>1000.0)      0.0000 (1.0)      25,141,500.8749 (>1000.0)      0.0000 (1.0)           0;0       0.0398 (0.00)          1           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

When number of samples gets large, having 10x the samples results in 100x slower execution which indicates quadratic complexity of the operation.

I looked into the codebase and found that most of the time is spent in bcf_get_genotypes. This function is called as part of computing the number of elements in the PL field before each access to the field. bcf_get_genotypes creates an array of GT information for all samples, which is then used to compute max_ploidy. The only part of the array being accessed is the GT information for "current" sample, the rest is discarded.

I am not really sure what the idea behind this max_ploidy check is, if it could be removed one could just access the GT array for the current sample and count the number of values to get the ploidy, which should be much faster than creating the array for all samples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant