Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is the pipeline idempotent? #201

Open
VJalili opened this issue Jul 20, 2021 · 3 comments
Open

Is the pipeline idempotent? #201

VJalili opened this issue Jul 20, 2021 · 3 comments
Assignees

Comments

@VJalili
Copy link
Member

VJalili commented Jul 20, 2021

There is a small discrepancy between multiple executions of GATKSVPipelineSingleSample on the same input with the same parameters. The discrepancy is observed in the following files:

FinalVCFCleanup.out
Module08Annotation.output_vcf
FilterSample.out
ResetPESRTGTOverdispersionFilter.out
ResetBothsidesSupportFilter.out
ResetHighSRBackgroundFilter.out
FilterVcfWithReferencePanelCalls.out
FilterVcfForCaseSampleGenotype.out
FilterVcfDepthLt5kb.out
GetUniqueNonGenotypedDepthCalls.out
Module0506.vcf
Module0506.complex_genotype_vcf
ConvertCNVsWithoutDepthSupportToBNDs.out_vcf
Module04.genotyped_pesr_vcf
Module04.genotyped_depth_vcf
RewriteSRCoords.annotated_vcf
FilterLargePESRCallsWithoutRawDepthSupport.out
MergePesrVcfs.merged_pesr_vcf
FilterWham.out
Module01.wham_vcf
Module00c.std_wham_vcf
Module00a.wham_vcf

It seems the discrepancy originates from the Whamg workflow. The workflow uses whamg (whamg source code), which seems to have some randomness involved (zeeev/wham#51).

I ran the GATKSVPipelineSingleSample workflow multiple times without call-caching (i.e., setting "read_from_cache" : false and "write_to_cache" : false in the options files), and ran svtest vcf on the final VCFs of the separate executions comparing all to a common baseline. If the pipeline was idempotent, I would expect the same difference/similarity to the baseline in all the executions, however, the executions slightly differ from each other compared to the baseline. Some of the output metrics of svtest vcf comparing a test run with a given UUID vs. the baseline are as follows.

5700c10e-8db3-477f-9ea0-2cef381a8108 829d3048-9a91-4fe3-ae5b-7549cd6e50f8 b6a0c977-8de6-481d-b58d-c9d6d182db2f
final_cleanup_vcf_DEL_tp 4236 4233 4242
final_cleanup_vcf_DEL_fp 9 13 10
final_cleanup_vcf_DEL_fn 21 26 18
final_cleanup_vcf_DUP_tp 1351 1356 1356
final_cleanup_vcf_DUP_fp 16 11 11
final_cleanup_vcf_DUP_fn 13 8 8
final_cleanup_vcf_INS_tp 2442 2446 2444
final_cleanup_vcf_INS_fp 5 5 3
final_cleanup_vcf_INS_fn 16 12 14
final_cleanup_vcf_INV_tp 9 9 9
final_cleanup_vcf_INV_fp 0 0 0
final_cleanup_vcf_INV_fn 0 0 0
final_cleanup_vcf_BND_tp 1397 1395 1397
final_cleanup_vcf_BND_fp 3 1 3
final_cleanup_vcf_BND_fn 5 7 4
final_cleanup_vcf_CPX_tp 83 84 84
final_cleanup_vcf_CPX_fp 14 13 12
final_cleanup_vcf_CPX_fn 1 0 0
final_cleanup_vcf_CNV_tp 198 198 198
final_cleanup_vcf_CNV_fp 0 0 0
final_cleanup_vcf_CNV_fn 0 0 0
@mwalker174
Copy link
Collaborator

Also svtk's pe-test and sr-test tools do not set PRNG seeds.

@epiercehoffman
Copy link
Collaborator

epiercehoffman commented Feb 7, 2023

Also missing random seed in svtk adjudicate for subsampling function https://github.com/broadinstitute/gatk-sv/blob/main/src/svtk/svtk/adjudicate/random_forest.py#L82 (and 86)

@VJalili
Copy link
Member Author

VJalili commented Feb 7, 2023

The pipeline's Wham-related irreproducible output is related to using a random number generator with a floating seed in the Wham source code. The steps we took to fix this issue are:

  • We reported the issue on Wham's GitHub repository: Stochastic output zeeev/wham#51
  • I made a pull request that sets a fixed seed to the random number generator: Fix the seed of random num generator zeeev/wham#62
  • I tried updating Wham's docker image to include a version of Wham with a fixed seed by updating Wham's Dockerfile:
    • The current Docker image is built using a compiled version of Wham; hence I tried recompiling from https://github.com/VJalili/wham/tree/fix-seed, where the seed of the random number generator is fixed.
    • The build process failed for reasons similar to those reported on Wham's GitHub page: Wham installation fails zeeev/wham#56
    • The make process of Wham includes building bamtools from an older version, which fails to compile on Ubuntu 18 (and all the newer versions).
    • The referenced version of bamtools seemingly depends on older versions of the built environment (e.g., gcc) that are not included in the Ubuntu 18 docker image. I explored options for installing older compiler versions, explicitly trying gcc-4.x and gcc-5, but bamtools failed to compile with these versions.
    • I tried installing bamtools from apt (apt-get install -y libbamtools-dev) to avoid the build-related issues and removed bamtools from Wham's make process. Wham failed to compile with this approach.
    • I tried replacing the cloned version of bamtools (fetched using git clone --recursive) with (a) current HEAD and (b) older commits (e.g., HEAD of the master branch in 2019). With this approach, bamtools compiles successfully, but Wham's make continues to fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants