Add Cochran-Mantel-Haenszel statistical test for association #14255

Will-Tyler · 2024-02-06T05:38:50Z

Description

In this pull request, I add a function to perform a Cochran-Mantel-Haenszel statistical test for association.

This pull request closes #13481.

Testing

I add unit tests. Since I have not used R before (the associated GitHub issue suggests using R to create test cases), I created the unit tests from examples that I found on the internet. I linked these sources in the code for the unit tests.

I built the documentation locally and inspected it to confirm that it matches my expectations.

I am having trouble testing the docstring examples locally. When I run make -C hail doctest-query, the tests error due to a checksum exception.

Discussion

I have not added an example to the documentation that uses a matrix table yet. (This is an acceptance criteria in #13481.) I wanted to get some advice about the best way to do this. I think ideally, the example would have a binary phenotype, an allele to test for association, and some stratifying variable. I tried to search through the existing code to find suitable example matrix tables in the docstrings, but I didn't find anything promising. I would appreciate help here.

Update: thanks to @patrick-schultz's recommendation, I have added an example using a matrix table.

patrick-schultz · 2024-02-07T20:29:52Z

This is great, thanks for working on this!

I have not added an example to the documentation that uses a matrix table yet. (This is an #13481.) I wanted to get some advice about the best way to do this. I think ideally, the example would have a binary phenotype, an allele to test for association, and some stratifying variable. I tried to search through the existing code to find suitable example matrix tables in the docstrings, but I didn't find anything promising. I would appreciate help here.

The code that sets up the doctest environment is here. In particular, the matrixtable available in doc examples as ds lives at hail/hail/python/hail/docs/data/example.mt. It has lots of fields you can use:

In [3]: mt = hl.read_matrix_table('python/hail/docs/data/example.mt')

In [4]: mt.describe()
----------------------------------------
Global fields:
    'global_field_1': int32
    'global_field_2': int32
    'pli': dict<str, float64>
    'populations': array<str>
----------------------------------------
Column fields:
    's': str
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        gq_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        call_rate: float64,
        n_called: int64,
        n_not_called: int64,
        n_filtered: int64,
        n_hom_ref: int64,
        n_het: int64,
        n_hom_var: int64,
        n_non_ref: int64,
        n_singleton: int64,
        n_snp: int64,
        n_insertion: int64,
        n_deletion: int64,
        n_transition: int64,
        n_transversion: int64,
        n_star: int64,
        r_ti_tv: float64,
        r_het_hom_var: float64,
        r_insertion_deletion: float64
    }
    'is_case': bool
    'pheno': struct {
        is_case: bool,
        is_female: bool,
        age: float64,
        height: float64,
        blood_pressure: float64,
        cohort_name: str
    }
    'cov': struct {
        PC1: float64
    }
    'cov1': float64
    'cov2': float64
    'cohort': str
    'cohorts': array<str>
    'pop': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        NEGATIVE_TRAIN_SITE: bool,
        HWP: float64,
        AC: array<int32>,
        culprit: str,
        MQ0: int32,
        ReadPosRankSum: float64,
        AN: int32,
        InbreedingCoeff: float64,
        AF: array<float64>,
        GQ_STDDEV: float64,
        FS: float64,
        DP: int32,
        GQ_MEAN: float64,
        POSITIVE_TRAIN_SITE: bool,
        VQSLOD: float64,
        ClippingRankSum: float64,
        BaseQRankSum: float64,
        MLEAF: array<float64>,
        MLEAC: array<int32>,
        MQ: float64,
        QD: float64,
        END: int32,
        DB: bool,
        HaplotypeScore: float64,
        MQRankSum: float64,
        CCC: int32,
        NCC: int32,
        DS: bool
    }
    'use_as_marker': bool
    'panel_maf': float64
    'anno1': int32
    'anno2': int32
    'consequence': str
    'gene': array<str>
    'score': float64
    'a_index': int32
    'variant_qc': struct {
        dp_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        gq_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        AC: array<int32>,
        AF: array<float64>,
        AN: int32,
        homozygote_count: array<int32>,
        call_rate: float64,
        n_called: int64,
        n_not_called: int64,
        n_filtered: int64,
        n_het: int64,
        n_non_ref: int64,
        het_freq_hwe: float64,
        p_value_hwe: float64,
        p_value_excess_het: float64
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

hail/python/hail/expr/functions.py

Will-Tyler · 2024-02-08T04:28:57Z

In particular, the matrixtable available in doc examples as ds lives at hail/hail/python/hail/docs/data/example.mt.

Thanks, this is exactly what I was looking for. I am having trouble testing my new example locally though. When I run make -C hail doctest-query, the tests fail with a checksum error. I tried running make -C hail clean and retrying, but I still get the same error.

E           hail.utils.java.FatalError: ChecksumException: Checksum error: file:/Users/willtyler/Desktop/hail/hail/python/hail/docs/data/example.8bits.bgen.idx2/metadata.json.gz at 0 exp: 982431825 got: -2031629660
E
E           Java stack trace:
E           org.apache.hadoop.fs.ChecksumException: Checksum error: file:/Users/willtyler/Desktop/hail/hail/python/hail/docs/data/example.8bits.bgen.idx2/metadata.json.gz at 0 exp: 982431825 got: -2031629660
E           	at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:347)
E           	at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:303)
E           	at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:252)
E           	at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:197)
E           	at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
E           	at is.hail.io.fs.HadoopFS$$anon$2.read(HadoopFS.scala:58)
E           	at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
E           	at org.apache.commons.compress.utils.CountingInputStream.read(CountingInputStream.java:56)
E           	at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
E           	at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
E           	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.init(GzipCompressorInputStream.java:185)
E           	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.<init>(GzipCompressorInputStream.java:168)
E           	at is.hail.io.fs.GZipCompressionCodec$.makeInputStream(FS.scala:125)
E           	at is.hail.io.fs.FS.open(FS.scala:563)
E           	at is.hail.io.fs.FS.open$(FS.scala:560)
E           	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:85)
E           	at is.hail.io.fs.FS.open(FS.scala:578)
E           	at is.hail.io.fs.FS.open$(FS.scala:577)
E           	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:85)
E           	at is.hail.io.fs.FS.open(FS.scala:572)
E           	at is.hail.io.fs.FS.open$(FS.scala:571)
E           	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:85)
E           	at is.hail.io.fs.FS.open(FS.scala:569)
E           	at is.hail.io.fs.FS.open$(FS.scala:569)
E           	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:85)
E           	at is.hail.io.index.IndexReader$.readTypes(IndexReader.scala:65)
E           	at is.hail.io.bgen.LoadBgen$.$anonfun$getBgenFileMetadata$2(LoadBgen.scala:208)
E           	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
E           	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
E           	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
E           	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
E           	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
E           	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
E           	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
E           	at is.hail.io.bgen.LoadBgen$.getBgenFileMetadata(LoadBgen.scala:207)
E           	at is.hail.io.bgen.MatrixBGENReader$.apply(LoadBgen.scala:387)
E           	at is.hail.io.bgen.MatrixBGENReader$.fromJValue(LoadBgen.scala:365)
E           	at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:116)
E           	at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1815)
E           	at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1738)
E           	at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
E           	at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
E           	at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
E           	at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:2164)
E           	at is.hail.expr.ir.IRParser$.parse(Parser.scala:2136)
E           	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:2164)
E           	at is.hail.backend.Backend.$anonfun$matrixTableType$2(Backend.scala:186)
E           	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:78)
E           	at is.hail.utils.package$.using(package.scala:664)
E           	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:78)
E           	at is.hail.utils.package$.using(package.scala:664)
E           	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:13)
E           	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:65)
E           	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$2(SparkBackend.scala:407)
E           	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:55)
E           	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:62)
E           	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:393)
E           	at is.hail.backend.Backend.$anonfun$matrixTableType$1(Backend.scala:185)
E           	at is.hail.backend.Backend.jsonToBytes(Backend.scala:175)
E           	at is.hail.backend.Backend.matrixTableType(Backend.scala:185)
E           	at is.hail.backend.BackendHttpHandler.handle(BackendServer.scala:104)
E           	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
E           	at jdk.httpserver/sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:82)
E           	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:80)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:848)
E           	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:817)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:201)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:560)
E           	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:526)
E           	at java.base/java.lang.Thread.run(Thread.java:829)
E
E
E
E           Hail version: 0.2.127-e81ad92151ab
E           Error summary: ChecksumException: Checksum error: file:/Users/willtyler/Desktop/hail/hail/python/hail/docs/data/example.8bits.bgen.idx2/metadata.json.gz at 0 exp: 982431825 got: -2031629660

patrick-schultz · 2024-02-12T16:14:54Z

Unfortunately, running doctests locally has always been a bit of a mess. I don't think any of us ever run them locally, we just let CI handle it. If you want to sort out local doctests I'd suggest starting a thread in zulip asking for help. But it looks like your latest commit has passed all tests in CI. Just dismiss my review when you're ready for another look. I think this is looking ready to merge.

patrick-schultz

Thanks for the contribution!

Will-Tyler · 2024-02-12T18:02:24Z

Thanks for reviewing!

danking · 2024-02-23T20:06:01Z

Unfortunately, the CI results are private to protect against inadvertent secret leaks. It looks like the docs are failing.

The Locus should be hl.Locus.

I suspect the test will pass with that change.

=================================== FAILURES ===================================
__________ [doctest] hail.expr.functions.cochran_mantel_haenszel_test __________
834     Examples
835     --------
836     >>> a = [56, 61, 73, 71]
837     >>> b = [69, 257, 65, 48]
838     >>> c = [40, 57, 71, 55]
839     >>> d = [77, 301, 79, 48]
840     >>> hl.eval(hl.cochran_mantel_haenszel_test(a, b, c, d))
841     Struct(test_statistic=5.0496881823306765, p_value=0.024630370456863417)
842 
843     >>> mt = ds.filter_rows(mt.locus == Locus(20, 10633237))
UNEXPECTED EXCEPTION: NameError("name 'Locus' is not defined")
Traceback (most recent call last):
  File "/usr/lib/python3.9/doctest.py", line 1334, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest hail.expr.functions.cochran_mantel_haenszel_test[5]>", line 1, in <module>
NameError: name 'Locus' is not defined
/usr/local/lib/python3.9/dist-packages/hail/expr/functions.py:843: UnexpectedException
835     --------
836     >>> a = [56, 61, 73, 71]
837     >>> b = [69, 257, 65, 48]
838     >>> c = [40, 57, 71, 55]
839     >>> d = [77, 301, 79, 48]
840     >>> hl.eval(hl.cochran_mantel_haenszel_test(a, b, c, d))
841     Struct(test_statistic=5.0496881823306765, p_value=0.024630370456863417)
842 
843     >>> mt = ds.filter_rows(mt.locus == Locus(20, 10633237))
844     >>> mt.count_rows()
Expected:
    1
Got:
    9

/usr/local/lib/python3.9/dist-packages/hail/expr/functions.py:844: DocTestFailure
845     1
846     >>> a, b, c, d = mt.aggregate_entries(
847     ...     hl.tuple([
848     ...         hl.array([hl.agg.count_where(mt.GT.is_non_ref() & mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(mt.GT.is_non_ref() & mt.pheno.is_case & ~mt.pheno.is_female)]),
849     ...         hl.array([hl.agg.count_where(mt.GT.is_non_ref() & ~mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(mt.GT.is_non_ref() & ~mt.pheno.is_case & ~mt.pheno.is_female)]),
850     ...         hl.array([hl.agg.count_where(~mt.GT.is_non_ref() & mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(~mt.GT.is_non_ref() & mt.pheno.is_case & ~mt.pheno.is_female)]),
851     ...         hl.array([hl.agg.count_where(~mt.GT.is_non_ref() & ~mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(~mt.GT.is_non_ref() & ~mt.pheno.is_case & ~mt.pheno.is_female)])
852     ...     ])
853     ... )
854     >>> hl.eval(hl.cochran_mantel_haenszel_test(a, b, c, d))
Expected:
    Struct(test_statistic=0.2188830334629822, p_value=0.6398923118508772)
Got:
    Struct(test_statistic=0.0015755655579811829, p_value=0.9683375680436793)

/usr/local/lib/python3.9/dist-packages/hail/expr/functions.py:854: DocTestFailure

Will-Tyler · 2024-03-08T01:51:05Z

I was able to get the doctests running locally, and I got PASSED hail/expr/functions.py::hail.expr.functions.cochran_mantel_haenszel_test.

As Dan mentioned, I cannot see the output of the CI tests. I will try rebasing onto the main branch.

Add docs Add unit tests Closes hail-is#13481

daniel-goldstein assigned patrick-schultz Feb 6, 2024

patrick-schultz requested changes Feb 7, 2024

View reviewed changes

hail/python/hail/expr/functions.py Outdated Show resolved Hide resolved

Will-Tyler requested a review from patrick-schultz February 12, 2024 17:15

patrick-schultz approved these changes Feb 12, 2024

View reviewed changes

Will-Tyler force-pushed the cochran-mantel-haenszel-test branch from 7af9cab to 270c303 Compare March 8, 2024 02:19

Will-Tyler added 5 commits March 9, 2024 02:25

Add Cochran Mantel Haenszel statistical test for association

e47b49c

Add docs Add unit tests Closes hail-is#13481

Use explicit array expression in unit test

a365632

Add example using matrix table

9b192ac

Explicitly use hail functions

6e491c0

Fix docstring error

c4df2d1

Will-Tyler force-pushed the cochran-mantel-haenszel-test branch from 181c5f8 to c4df2d1 Compare March 9, 2024 02:25

hail-ci-robot merged commit 9d10a2b into hail-is:main Mar 11, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Cochran-Mantel-Haenszel statistical test for association #14255

Add Cochran-Mantel-Haenszel statistical test for association #14255

Will-Tyler commented Feb 6, 2024 •

edited

Loading

patrick-schultz commented Feb 7, 2024

Will-Tyler commented Feb 8, 2024

patrick-schultz commented Feb 12, 2024

patrick-schultz left a comment

Will-Tyler commented Feb 12, 2024

danking commented Feb 23, 2024

Will-Tyler commented Mar 8, 2024

Add Cochran-Mantel-Haenszel statistical test for association #14255

Add Cochran-Mantel-Haenszel statistical test for association #14255

Conversation

Will-Tyler commented Feb 6, 2024 • edited Loading

Description

Testing

Discussion

patrick-schultz commented Feb 7, 2024

Will-Tyler commented Feb 8, 2024

patrick-schultz commented Feb 12, 2024

patrick-schultz left a comment

Choose a reason for hiding this comment

Will-Tyler commented Feb 12, 2024

danking commented Feb 23, 2024

Will-Tyler commented Mar 8, 2024

Will-Tyler commented Feb 6, 2024 •

edited

Loading