Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Cochran-Mantel-Haenszel statistical test for association #14255

Merged
merged 5 commits into from
Mar 11, 2024

Conversation

Will-Tyler
Copy link
Contributor

@Will-Tyler Will-Tyler commented Feb 6, 2024

Description

In this pull request, I add a function to perform a Cochran-Mantel-Haenszel statistical test for association.

This pull request closes #13481.

Testing

I add unit tests. Since I have not used R before (the associated GitHub issue suggests using R to create test cases), I created the unit tests from examples that I found on the internet. I linked these sources in the code for the unit tests.

I built the documentation locally and inspected it to confirm that it matches my expectations.

I am having trouble testing the docstring examples locally. When I run make -C hail doctest-query, the tests error due to a checksum exception.

Discussion

I have not added an example to the documentation that uses a matrix table yet. (This is an acceptance criteria in #13481.) I wanted to get some advice about the best way to do this. I think ideally, the example would have a binary phenotype, an allele to test for association, and some stratifying variable. I tried to search through the existing code to find suitable example matrix tables in the docstrings, but I didn't find anything promising. I would appreciate help here.

Update: thanks to @patrick-schultz's recommendation, I have added an example using a matrix table.

@patrick-schultz
Copy link
Collaborator

This is great, thanks for working on this!

I have not added an example to the documentation that uses a matrix table yet. (This is an #13481.) I wanted to get some advice about the best way to do this. I think ideally, the example would have a binary phenotype, an allele to test for association, and some stratifying variable. I tried to search through the existing code to find suitable example matrix tables in the docstrings, but I didn't find anything promising. I would appreciate help here.

The code that sets up the doctest environment is here. In particular, the matrixtable available in doc examples as ds lives at hail/hail/python/hail/docs/data/example.mt. It has lots of fields you can use:

In [3]: mt = hl.read_matrix_table('python/hail/docs/data/example.mt')

In [4]: mt.describe()
----------------------------------------
Global fields:
    'global_field_1': int32
    'global_field_2': int32
    'pli': dict<str, float64>
    'populations': array<str>
----------------------------------------
Column fields:
    's': str
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        gq_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        call_rate: float64,
        n_called: int64,
        n_not_called: int64,
        n_filtered: int64,
        n_hom_ref: int64,
        n_het: int64,
        n_hom_var: int64,
        n_non_ref: int64,
        n_singleton: int64,
        n_snp: int64,
        n_insertion: int64,
        n_deletion: int64,
        n_transition: int64,
        n_transversion: int64,
        n_star: int64,
        r_ti_tv: float64,
        r_het_hom_var: float64,
        r_insertion_deletion: float64
    }
    'is_case': bool
    'pheno': struct {
        is_case: bool,
        is_female: bool,
        age: float64,
        height: float64,
        blood_pressure: float64,
        cohort_name: str
    }
    'cov': struct {
        PC1: float64
    }
    'cov1': float64
    'cov2': float64
    'cohort': str
    'cohorts': array<str>
    'pop': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        NEGATIVE_TRAIN_SITE: bool,
        HWP: float64,
        AC: array<int32>,
        culprit: str,
        MQ0: int32,
        ReadPosRankSum: float64,
        AN: int32,
        InbreedingCoeff: float64,
        AF: array<float64>,
        GQ_STDDEV: float64,
        FS: float64,
        DP: int32,
        GQ_MEAN: float64,
        POSITIVE_TRAIN_SITE: bool,
        VQSLOD: float64,
        ClippingRankSum: float64,
        BaseQRankSum: float64,
        MLEAF: array<float64>,
        MLEAC: array<int32>,
        MQ: float64,
        QD: float64,
        END: int32,
        DB: bool,
        HaplotypeScore: float64,
        MQRankSum: float64,
        CCC: int32,
        NCC: int32,
        DS: bool
    }
    'use_as_marker': bool
    'panel_maf': float64
    'anno1': int32
    'anno2': int32
    'consequence': str
    'gene': array<str>
    'score': float64
    'a_index': int32
    'variant_qc': struct {
        dp_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        gq_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        AC: array<int32>,
        AF: array<float64>,
        AN: int32,
        homozygote_count: array<int32>,
        call_rate: float64,
        n_called: int64,
        n_not_called: int64,
        n_filtered: int64,
        n_het: int64,
        n_non_ref: int64,
        het_freq_hwe: float64,
        p_value_hwe: float64,
        p_value_excess_het: float64
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

@Will-Tyler
Copy link
Contributor Author

In particular, the matrixtable available in doc examples as ds lives at hail/hail/python/hail/docs/data/example.mt.

Thanks, this is exactly what I was looking for. I am having trouble testing my new example locally though. When I run make -C hail doctest-query, the tests fail with a checksum error. I tried running make -C hail clean and retrying, but I still get the same error.

E           hail.utils.java.FatalError: ChecksumException: Checksum error: file:/Users/willtyler/Desktop/hail/hail/python/hail/docs/data/example.8bits.bgen.idx2/metadata.json.gz at 0 exp: 982431825 got: -2031629660
E
E           Java stack trace:
E           org.apache.hadoop.fs.ChecksumException: Checksum error: file:/Users/willtyler/Desktop/hail/hail/python/hail/docs/data/example.8bits.bgen.idx2/metadata.json.gz at 0 exp: 982431825 got: -2031629660
E           	at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:347)
E           	at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:303)
E           	at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:252)
E           	at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:197)
E           	at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
E           	at is.hail.io.fs.HadoopFS$$anon$2.read(HadoopFS.scala:58)
E           	at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
E           	at org.apache.commons.compress.utils.CountingInputStream.read(CountingInputStream.java:56)
E           	at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
E           	at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
E           	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.init(GzipCompressorInputStream.java:185)
E           	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.<init>(GzipCompressorInputStream.java:168)
E           	at is.hail.io.fs.GZipCompressionCodec$.makeInputStream(FS.scala:125)
E           	at is.hail.io.fs.FS.open(FS.scala:563)
E           	at is.hail.io.fs.FS.open$(FS.scala:560)
E           	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:85)
E           	at is.hail.io.fs.FS.open(FS.scala:578)
E           	at is.hail.io.fs.FS.open$(FS.scala:577)
E           	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:85)
E           	at is.hail.io.fs.FS.open(FS.scala:572)
E           	at is.hail.io.fs.FS.open$(FS.scala:571)
E           	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:85)
E           	at is.hail.io.fs.FS.open(FS.scala:569)
E           	at is.hail.io.fs.FS.open$(FS.scala:569)
E           	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:85)
E           	at is.hail.io.index.IndexReader$.readTypes(IndexReader.scala:65)
E           	at is.hail.io.bgen.LoadBgen$.$anonfun$getBgenFileMetadata$2(LoadBgen.scala:208)
E           	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
E           	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
E           	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
E           	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
E           	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
E           	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
E           	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
E           	at is.hail.io.bgen.LoadBgen$.getBgenFileMetadata(LoadBgen.scala:207)
E           	at is.hail.io.bgen.MatrixBGENReader$.apply(LoadBgen.scala:387)
E           	at is.hail.io.bgen.MatrixBGENReader$.fromJValue(LoadBgen.scala:365)
E           	at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:116)
E           	at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1815)
E           	at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1738)
E           	at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
E           	at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
E           	at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
E           	at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:2164)
E           	at is.hail.expr.ir.IRParser$.parse(Parser.scala:2136)
E           	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:2164)
E           	at is.hail.backend.Backend.$anonfun$matrixTableType$2(Backend.scala:186)
E           	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:78)
E           	at is.hail.utils.package$.using(package.scala:664)
E           	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:78)
E           	at is.hail.utils.package$.using(package.scala:664)
E           	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:13)
E           	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:65)
E           	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$2(SparkBackend.scala:407)
E           	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:55)
E           	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:62)
E           	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:393)
E           	at is.hail.backend.Backend.$anonfun$matrixTableType$1(Backend.scala:185)
E           	at is.hail.backend.Backend.jsonToBytes(Backend.scala:175)
E           	at is.hail.backend.Backend.matrixTableType(Backend.scala:185)
E           	at is.hail.backend.BackendHttpHandler.handle(BackendServer.scala:104)
E           	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
E           	at jdk.httpserver/sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:82)
E           	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:80)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:848)
E           	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:817)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:201)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:560)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:526)
E           	at java.base/java.lang.Thread.run(Thread.java:829)
E
E
E
E           Hail version: 0.2.127-e81ad92151ab
E           Error summary: ChecksumException: Checksum error: file:/Users/willtyler/Desktop/hail/hail/python/hail/docs/data/example.8bits.bgen.idx2/metadata.json.gz at 0 exp: 982431825 got: -2031629660

@patrick-schultz
Copy link
Collaborator

Unfortunately, running doctests locally has always been a bit of a mess. I don't think any of us ever run them locally, we just let CI handle it. If you want to sort out local doctests I'd suggest starting a thread in zulip asking for help. But it looks like your latest commit has passed all tests in CI. Just dismiss my review when you're ready for another look. I think this is looking ready to merge.

Copy link
Collaborator

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

@Will-Tyler
Copy link
Contributor Author

Thanks for reviewing!

@danking
Copy link
Contributor

danking commented Feb 23, 2024

Unfortunately, the CI results are private to protect against inadvertent secret leaks. It looks like the docs are failing.

The Locus should be hl.Locus.

I suspect the test will pass with that change.

=================================== FAILURES ===================================
__________ [doctest] hail.expr.functions.cochran_mantel_haenszel_test __________
834     Examples
835     --------
836     >>> a = [56, 61, 73, 71]
837     >>> b = [69, 257, 65, 48]
838     >>> c = [40, 57, 71, 55]
839     >>> d = [77, 301, 79, 48]
840     >>> hl.eval(hl.cochran_mantel_haenszel_test(a, b, c, d))
841     Struct(test_statistic=5.0496881823306765, p_value=0.024630370456863417)
842 
843     >>> mt = ds.filter_rows(mt.locus == Locus(20, 10633237))
UNEXPECTED EXCEPTION: NameError("name 'Locus' is not defined")
Traceback (most recent call last):
  File "/usr/lib/python3.9/doctest.py", line 1334, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest hail.expr.functions.cochran_mantel_haenszel_test[5]>", line 1, in <module>
NameError: name 'Locus' is not defined
/usr/local/lib/python3.9/dist-packages/hail/expr/functions.py:843: UnexpectedException
835     --------
836     >>> a = [56, 61, 73, 71]
837     >>> b = [69, 257, 65, 48]
838     >>> c = [40, 57, 71, 55]
839     >>> d = [77, 301, 79, 48]
840     >>> hl.eval(hl.cochran_mantel_haenszel_test(a, b, c, d))
841     Struct(test_statistic=5.0496881823306765, p_value=0.024630370456863417)
842 
843     >>> mt = ds.filter_rows(mt.locus == Locus(20, 10633237))
844     >>> mt.count_rows()
Expected:
    1
Got:
    9

/usr/local/lib/python3.9/dist-packages/hail/expr/functions.py:844: DocTestFailure
845     1
846     >>> a, b, c, d = mt.aggregate_entries(
847     ...     hl.tuple([
848     ...         hl.array([hl.agg.count_where(mt.GT.is_non_ref() & mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(mt.GT.is_non_ref() & mt.pheno.is_case & ~mt.pheno.is_female)]),
849     ...         hl.array([hl.agg.count_where(mt.GT.is_non_ref() & ~mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(mt.GT.is_non_ref() & ~mt.pheno.is_case & ~mt.pheno.is_female)]),
850     ...         hl.array([hl.agg.count_where(~mt.GT.is_non_ref() & mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(~mt.GT.is_non_ref() & mt.pheno.is_case & ~mt.pheno.is_female)]),
851     ...         hl.array([hl.agg.count_where(~mt.GT.is_non_ref() & ~mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(~mt.GT.is_non_ref() & ~mt.pheno.is_case & ~mt.pheno.is_female)])
852     ...     ])
853     ... )
854     >>> hl.eval(hl.cochran_mantel_haenszel_test(a, b, c, d))
Expected:
    Struct(test_statistic=0.2188830334629822, p_value=0.6398923118508772)
Got:
    Struct(test_statistic=0.0015755655579811829, p_value=0.9683375680436793)

/usr/local/lib/python3.9/dist-packages/hail/expr/functions.py:854: DocTestFailure

@Will-Tyler
Copy link
Contributor Author

I was able to get the doctests running locally, and I got PASSED hail/expr/functions.py::hail.expr.functions.cochran_mantel_haenszel_test.

As Dan mentioned, I cannot see the output of the CI tests. I will try rebasing onto the main branch.

@Will-Tyler Will-Tyler force-pushed the cochran-mantel-haenszel-test branch from 7af9cab to 270c303 Compare March 8, 2024 02:19
@Will-Tyler Will-Tyler force-pushed the cochran-mantel-haenszel-test branch from 181c5f8 to c4df2d1 Compare March 9, 2024 02:25
@hail-ci-robot hail-ci-robot merged commit 9d10a2b into hail-is:main Mar 11, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[query] implement the Cochran–Mantel–Haenszel test for repeated tests of independence
4 participants