Fix and test remaining dataset definitions in ehrQL examples #1697

StevenMaude · 2023-11-02T13:29:30Z

#1648 adds tests for most of the dataset definitions in the examples page. There were a couple left as it wasn't immediately clear how best to fix them. We should fix and test these.

Specifically these are the two sections below.

"What is the earliest/latest hospitalisation event matching some criteria?"

ehrql/docs/how-to/examples.md

Lines 493 to 527 in 260caf1

    
    ### What is the earliest/latest hospitalisation event matching some criteria?

    ```python
    from ehrql import create_dataset, codelist_from_csv
    from ehrql.tables.tpp import apcs, patients

    cardiac_diagnosis_codes = codelist_from_csv("XXX", column="YYY")

    dataset = create_dataset()
    dataset.first_cardiac_hospitalisation_date = apcs.where(
            apcs.snomedct_code.is_in(cardiac_diagnosis_codes)
    ).where(
            apcs.date.is_on_or_after("2022-07-01")
    ).sort_by(
            apcs.date
    ).first_for_patient().date
    dataset.define_population(patients.exists_for_patient())
    ```

    ```ehrql
    from ehrql import create_dataset, codelist_from_csv
    from ehrql.tables.core import medications, patients

    cardiac_diagnosis_codes = codelist_from_csv("XXX", column="YYY")

    dataset = create_dataset()
    dataset.last_cardiac_hospitalisation_date = medications.where(
            medications.dmd_code.is_in(cardiac_diagnosis_codes)
    ).where(
            medications.date.is_on_or_after("2022-07-01")
    ).sort_by(
            medications.date
    ).last_for_patient().date
    dataset.define_population(patients.exists_for_patient())
    ```

Issues

- The first dataset definition of the two in this section uses `apcs.snomedct_code`, which doesn't exist. This could be switched to `primary_diagnosis`, which is a single ICD-10 code, or `all_diagnoses`, which is a string of diagnosis codes separated by semi-colons.
- The second dataset definition is valid ehrQL, but isn't really a semantically sensible example:
  - A variable refers to "hospitalisation date" when actually dealing with medications (and the example is supposed to be dealing with hospitalisations).
  - It checks whether diagnosis codes are in the `dmd_code`.
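To make the choice between the two columns concrete, here is a rough plain-Python sketch (not ehrQL) of how the result could differ. Everything in it is illustrative: the record layout, the field names, the codes and the dates are made up, and the semi-colon separator is taken from the description above rather than checked against real data.

```python
# Plain-Python illustration only; this is not ehrQL and the data is invented.
cardiac_codes = {"I210", "I500"}  # hypothetical ICD-10 codelist

# Hypothetical admissions for a single patient.
admissions = [
    {"admission_date": "2022-06-15", "primary_diagnosis": "I210", "all_diagnoses": "I210;E119"},
    {"admission_date": "2022-08-03", "primary_diagnosis": "J189", "all_diagnoses": "J189;I500"},
    {"admission_date": "2022-09-20", "primary_diagnosis": "I500", "all_diagnoses": "I500"},
]


def matches_primary(admission, codelist):
    """True if the single primary diagnosis code is in the codelist."""
    return admission["primary_diagnosis"] in codelist


def matches_any(admission, codelist, sep=";"):
    """True if any code in the separated diagnosis string is in the codelist."""
    codes = {code.strip() for code in admission["all_diagnoses"].split(sep) if code.strip()}
    return not codes.isdisjoint(codelist)


def earliest_matching(admissions, predicate, codelist, on_or_after):
    """Earliest admission date on or after the cutoff that satisfies the predicate."""
    dates = sorted(
        a["admission_date"]
        for a in admissions
        if a["admission_date"] >= on_or_after and predicate(a, codelist)
    )
    return dates[0] if dates else None


first_by_primary = earliest_matching(admissions, matches_primary, cardiac_codes, "2022-07-01")
first_by_any = earliest_matching(admissions, matches_any, cardiac_codes, "2022-07-01")
```

With this made-up data the two approaches pick different admissions, so whichever column the fixed example uses, the surrounding text should probably say which meaning of "cardiac hospitalisation" is intended.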

"Finding the observed value of clinical events matching some criteria expressed relative to another value"

ehrql/docs/how-to/examples.md

Lines 661 to 684 in 260caf1

    
    ### Finding the observed value of clinical events matching some criteria expressed relative to another value

    ```python
    from ehrql import create_dataset, codelist_from_csv
    from ehrql.tables.core import clinical_events, patients

    hba1c_codelist = codelist_from_csv("XXX", column="YYY")

    dataset = create_dataset()

    mean_hba1c = clinical_events.where(
            clinical_events.snomedct_code.is_in(hba1c_codelist)
    ).where(
            clinical_events.date.is_on_or_after("2022-07-01")
    ).numeric_value.maximum_for_patient()

    dataset.mean_max_hbac_difference = max_hba1c - (
    clinical_events.where(clinical_events.snomedct_code.is_in(hba1c_codelist)
    ).where(
            clinical_events.numeric_value == max_hba1c
    ).sort_by(
            clinical_events.date
    ).numeric_value.mean_for_patient())
    dataset.define_population(patients.exists_for_patient())
    ```

Issues

- The dataset definition creates a variable called `mean_hba1c` but does this by calculating the `maximum_for_patient`.
- The dataset definition uses `max_hba1c`, which doesn't exist.
- The dataset definition finds the value of clinical events matching the `hba1c_codelist` where the value is the same as `max_hba1c`, finds the mean (which will be the same value) and then subtracts this from `max_hba1c`.
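The last point can be checked with a few lines of plain Python (not ehrQL). This is only a sketch of the arithmetic: the values are invented, and it assumes `max_hba1c` had been defined as the patient's maximum, which is what the name suggests.

```python
from statistics import mean


def difference_as_written(values):
    """Mimic the example's logic: max minus the mean of the values equal to the max."""
    max_value = max(values)
    values_equal_to_max = [v for v in values if v == max_value]
    # Every remaining value equals max_value, so their mean is max_value too.
    return max_value - mean(values_equal_to_max)


hba1c_values = [48.0, 53.0, 61.0, 61.0]  # made-up results for one patient
difference = difference_as_written(hba1c_values)
```

Whatever values go in, the result is zero, so the variable would carry no information even once the naming errors are fixed.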

What this might have been intended as

Maybe something like the following?

from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients

hba1c_codelist = codelist_from_csv("XXX", column="YYY")

dataset = create_dataset()

max_hba1c = clinical_events.where(
        clinical_events.snomedct_code.is_in(hba1c_codelist)
).where(
        clinical_events.date.is_on_or_after("2022-07-01")
).numeric_value.maximum_for_patient()

mean_hba1c = clinical_events.where(
        clinical_events.snomedct_code.is_in(hba1c_codelist)
).where(
        clinical_events.date.is_on_or_after("2022-07-01")
).numeric_value.mean_for_patient()

dataset.mean_max_hbac_difference = max_hba1c - mean_hba1c
dataset.define_population(patients.exists_for_patient())
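For comparison, the same calculation in plain Python (not ehrQL) on invented data gives a non-trivial answer. The event tuples and the choice to return `None` when a patient has no qualifying events are assumptions made for the sketch, not a statement about what ehrQL returns.

```python
from statistics import mean

# Made-up (date, numeric_value) events for one patient.
events = [
    ("2022-05-10", 70.0),  # before the cutoff, so ignored
    ("2022-07-15", 48.0),
    ("2022-09-01", 54.0),
    ("2022-11-20", 60.0),
]


def max_mean_difference(events, on_or_after):
    """Maximum minus mean of the values recorded on or after the cutoff date."""
    values = [value for date, value in events if date >= on_or_after]
    if not values:
        return None  # assumption: no qualifying events means no result
    return max(values) - mean(values)


difference = max_mean_difference(events, "2022-07-01")
```

Here the maximum is 60.0 and the mean is 54.0, so the difference is 6.0.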


This code has been subject to considerable work to get it into this form. However, it did not seem useful to retain the various approaches and versions of the code before this state.

A quick guide to this code:

* It finds any Markdown files in `docs/`.
* It uses the SuperFences extension, as we do in the MkDocs configuration, to extract Markdown code blocks labelled with `ehrql` syntax. These are assumed to be self-contained dataset definitions.
* The code blocks that will be tested should appear as code blocks in the documentation, by default (provided the CSS isn't changed to modify the appearance of code blocks somehow, which shouldn't be the case, because why would you?). They are identified in the parametrized tests by their ordinal fence number in the source file.
* It finds any Python modules indicated by a `.py` extension. Python modules are assumed to be self-contained dataset definitions.
* The found dataset definitions are run to generate a dataset, and the output checked to see if it's a CSV.

There is some monkeypatching necessary to make this work:

* `codelist_from_csv()` relies on having CSV data available, and the checks on valid codelist codes are patched out. Without further work, we don't have any direct way of including data for inline dataset definitions in Markdown source, or specifying which mock CSV data to use without any established convention for examples to use. #1697 proposes ideas to remove this monkeypatching further.
* The sandboxing code is monkeypatched out to use "unsafe" loading of dataset definitions. Without doing so, it is not possible to monkeypatch any other ehrQL code: the ehrQL is run in a subprocess otherwise.

For more details and discussion, see the related PR for this code (#1648) and the previous PR (#1475) which this approach replaces.
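As a rough illustration of the "ordinal fence number" idea, here is a minimal standard-library sketch. It is not the actual test code, which uses the SuperFences extension; the function name is hypothetical, and counting every fence in the file (rather than only the `ehrql` ones) is an assumption about how the ordinals are assigned.

```python
FENCE = "`" * 3  # built this way to avoid a literal fence inside this example


def find_ehrql_fences(markdown_text):
    """Return (ordinal, source) pairs for fenced blocks labelled `ehrql`.

    Ordinals count every fenced block in the file, starting at 1.
    Simplified: nested or tilde fences are not handled.
    """
    found = []
    fence_number = 0
    label = None  # None means we are currently outside a fence
    body = []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if label is None:
            if stripped.startswith(FENCE):
                fence_number += 1
                label = stripped[len(FENCE):].strip()
                body = []
        elif stripped == FENCE:
            if label == "ehrql":
                found.append((fence_number, "\n".join(body)))
            label = None
        else:
            body.append(line)
    return found


example_markdown = "\n".join(
    [
        "# Examples",
        FENCE + "ehrql",
        "dataset = create_dataset()",
        FENCE,
        FENCE + "python",
        "print('not tested')",
        FENCE,
        FENCE + "ehrql",
        "dataset.define_population(patients.exists_for_patient())",
        FENCE,
    ]
)
ehrql_fences = find_ehrql_fences(example_markdown)
```

In this example the first and third fences are picked up and the `python` fence is skipped, which matches the behaviour described above where only `ehrql`-labelled blocks are tested.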

StevenMaude · 2023-11-07T15:24:02Z

This doesn't cover examples in the surrounding OpenSAFELY documentation, for which there's another issue: opensafely/documentation#1373.

rebkwok · 2023-11-13T17:17:41Z

Just a note as a reminder - currently the tests assume that all ehrql snippets are dataset definitions. I added a measures definition in #1723, which failed because the tests tried to test it as a dataset definition using generate_dataset. Possibly it'll be addressed in #1357.
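One possible way for the tests to tell the two kinds of snippet apart, sketched with the standard library only. This is a hypothetical heuristic, not something the test suite does: it assumes a measures definition calls `create_measures()` and a dataset definition calls `create_dataset()`, and the function name `definition_kind` is made up.

```python
import ast


def definition_kind(source):
    """Guess whether a snippet is a dataset or measures definition.

    Heuristic only: looks at which functions the snippet calls.
    """
    called = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                called.add(func.id)
            elif isinstance(func, ast.Attribute):
                called.add(func.attr)
    if "create_measures" in called:
        return "measures"
    if "create_dataset" in called:
        return "dataset"
    return "unknown"


dataset_snippet = "from ehrql import create_dataset\ndataset = create_dataset()\n"
measures_snippet = "from ehrql import create_measures\nmeasures = create_measures()\n"

kinds = [definition_kind(dataset_snippet), definition_kind(measures_snippet)]
```

The snippets are only parsed, never executed, so this could run before deciding which generate function to call. An explicit fence label or marker would be more robust than inspecting calls.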

StevenMaude · 2023-11-14T10:07:58Z

I opened a separate issue for measure definitions not yet being tested: #1726.

Fixes #1697. These were left in, but not all of these were tested. We had a case where an ehrQL user actually tried to follow one of these broken examples, which led to some confusion.

StevenMaude mentioned this issue Nov 2, 2023

test: Check complete ehrQL dataset definitions in docs #1648

Merged

inglesp added this to the P2 milestone Nov 3, 2023

inglesp added this to Data Team Nov 3, 2023

StevenMaude added a commit that referenced this issue Dec 13, 2023

Remove non-working examples

5f6f08a

Fixes #1697. These were left in, but not all of these were tested. We had a case where an ehrQL user actually tried to follow one of these broken examples, which led to some confusion.

StevenMaude mentioned this issue Dec 13, 2023

docs: Remove non-working examples #1839

Merged

StevenMaude added a commit that referenced this issue Dec 13, 2023

docs: Remove non-working examples

e0b2c09


StevenMaude closed this as completed in #1839 Dec 13, 2023

github-project-automation bot moved this to Done in Data Team Dec 13, 2023
