Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and test remaining dataset definitions in ehrQL examples #1697

Closed
StevenMaude opened this issue Nov 2, 2023 · 3 comments · Fixed by #1839
Closed

Fix and test remaining dataset definitions in ehrQL examples #1697

StevenMaude opened this issue Nov 2, 2023 · 3 comments · Fixed by #1839
Milestone

Comments

@StevenMaude
Copy link
Contributor

StevenMaude commented Nov 2, 2023

#1648 adds tests for most of the dataset definitions in the examples page. There were a couple left as it wasn't immediately clear how best to fix them. We should fix and test these.

Specifically these are the two sections below.

"What is the earliest/latest hospitalisation event matching some criteria?"

### What is the earliest/latest hospitalisation event matching some criteria?
```python
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.tpp import apcs, patients
cardiac_diagnosis_codes = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.first_cardiac_hospitalisation_date = apcs.where(
apcs.snomedct_code.is_in(cardiac_diagnosis_codes)
).where(
apcs.date.is_on_or_after("2022-07-01")
).sort_by(
apcs.date
).first_for_patient().date
dataset.define_population(patients.exists_for_patient())
```
```ehrql
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import medications, patients
cardiac_diagnosis_codes = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.last_cardiac_hospitalisation_date = medications.where(
medications.dmd_code.is_in(cardiac_diagnosis_codes)
).where(
medications.date.is_on_or_after("2022-07-01")
).sort_by(
medications.date
).last_for_patient().date
dataset.define_population(patients.exists_for_patient())
```

Issues

  • The first dataset definition of the two in this section uses apcs.snomedct_code which doesn't exist. This could be switched to primary_diagnosis which is a single ICD-10 code, or all_diagnoses which is a string of diagnosis codes separated by semi-colons.
  • The second dataset definition is valid ehrQL, but isn't really a semantically sensible example:
    • A variable refers to "hospitalisation date" when actually dealing with medications (and the example is supposed to be dealing with hospitalisations)
    • It checks whether diagnosis codes are in the dmd_code.

"Finding the observed value of clinical events matching some criteria expressed relative to another value"

### Finding the observed value of clinical events matching some criteria expressed relative to another value
```python
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
hba1c_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
mean_hba1c = clinical_events.where(
clinical_events.snomedct_code.is_in(hba1c_codelist)
).where(
clinical_events.date.is_on_or_after("2022-07-01")
).numeric_value.maximum_for_patient()
dataset.mean_max_hbac_difference = max_hba1c - (
clinical_events.where(clinical_events.snomedct_code.is_in(hba1c_codelist)
).where(
clinical_events.numeric_value == max_hba1c
).sort_by(
clinical_events.date
).numeric_value.mean_for_patient())
dataset.define_population(patients.exists_for_patient())
```

Issues

  • The dataset definition creates a variable called mean_hba1c but does this by calculating the maximum_for_patient.
  • The dataset definition uses max_hba1c, which doesn't exist.
  • The dataset definition finds the value of clinical events matching the hba1c_codelist where the value is the same as max_hba1c, finds the mean (which will be the same value) and then subtracts this from max_hba1c.

What this might have been intended as

Maybe something like the following?

from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients

hba1c_codelist = codelist_from_csv("XXX", column="YYY")

dataset = create_dataset()

max_hba1c = clinical_events.where(
        clinical_events.snomedct_code.is_in(hba1c_codelist)
).where(
        clinical_events.date.is_on_or_after("2022-07-01")
).numeric_value.maximum_for_patient()

mean_hba1c = clinical_events.where(
        clinical_events.snomedct_code.is_in(hba1c_codelist)
).where(
        clinical_events.date.is_on_or_after("2022-07-01")
).numeric_value.mean_for_patient()

dataset.mean_max_hbac_difference = max_hba1c - mean_hba1c
dataset.define_population(patients.exists_for_patient())
StevenMaude added a commit that referenced this issue Nov 2, 2023
This code has been subject to considerable work to get it into this
form. However, it did not seem useful to retain the various approaches
and versions of the code before this state.

A quick guide to this code:

* It finds any Markdown files in `docs/`.
* It uses the SuperFences extension, as we do in the MkDocs
  configuration, to extract Markdown code blocks labelled with `ehrql`
  syntax. These are assumed to be self-contained dataset definitions.
* The code blocks that will be tested should appear as code blocks in
  the documentation, by default (provided the CSS isn't changed to
  modify the appearance of code blocks somehow, which shouldn't be the
  case, because why would you?). They are identified in the parametrized
  tests by their ordinal fence number in the source file.
* It finds any Python modules indicated by a `.py` extension. Python
  modules are assumed to be self-contained dataset definitions.
* The found dataset definitions are run to generate a dataset, and the
  output checked to see if it's a CSV.

There is some monkeypatching necessary to make this work:

* `codelist_from_csv()` relies on having CSV data available, and the
  checks on valid codelist codes are patched out.  Without further work,
  we don't have any direct way of including data for inline dataset
  definitions in Markdown source, or specifying which mock CSV data to
  use without any established convention for examples to use.  #1697
  proposes ideas to remove this monkeypatching further.
* The sandboxing code is monkeypatched out to use "unsafe" loading of
  dataset definitions. Without doing so, it is not possible to
  monkeypatch any other ehrQL code: the ehrQL is run in a subprocess
  otherwise.

For more details and discussion, see the related PR for this code
(#1648) and the previous PR (#1475) which this approach replaces.
StevenMaude added a commit that referenced this issue Nov 2, 2023
This code has been subject to considerable work to get it into this
form. However, it did not seem useful to retain the various approaches
and versions of the code before this state.

A quick guide to this code:

* It finds any Markdown files in `docs/`.
* It uses the SuperFences extension, as we do in the MkDocs
  configuration, to extract Markdown code blocks labelled with `ehrql`
  syntax. These are assumed to be self-contained dataset definitions.
* The code blocks that will be tested should appear as code blocks in
  the documentation, by default (provided the CSS isn't changed to
  modify the appearance of code blocks somehow, which shouldn't be the
  case, because why would you?). They are identified in the parametrized
  tests by their ordinal fence number in the source file.
* It finds any Python modules indicated by a `.py` extension. Python
  modules are assumed to be self-contained dataset definitions.
* The found dataset definitions are run to generate a dataset, and the
  output checked to see if it's a CSV.

There is some monkeypatching necessary to make this work:

* `codelist_from_csv()` relies on having CSV data available, and the
  checks on valid codelist codes are patched out.  Without further work,
  we don't have any direct way of including data for inline dataset
  definitions in Markdown source, or specifying which mock CSV data to
  use without any established convention for examples to use.  #1697
  proposes ideas to remove this monkeypatching further.
* The sandboxing code is monkeypatched out to use "unsafe" loading of
  dataset definitions. Without doing so, it is not possible to
  monkeypatch any other ehrQL code: the ehrQL is run in a subprocess
  otherwise.

For more details and discussion, see the related PR for this code
(#1648) and the previous PR (#1475) which this approach replaces.
StevenMaude added a commit that referenced this issue Nov 2, 2023
This code has been subject to considerable work to get it into this
form. However, it did not seem useful to retain the various approaches
and versions of the code before this state.

A quick guide to this code:

* It finds any Markdown files in `docs/`.
* It uses the SuperFences extension, as we do in the MkDocs
  configuration, to extract Markdown code blocks labelled with `ehrql`
  syntax. These are assumed to be self-contained dataset definitions.
* The code blocks that will be tested should appear as code blocks in
  the documentation, by default (provided the CSS isn't changed to
  modify the appearance of code blocks somehow, which shouldn't be the
  case, because why would you?). They are identified in the parametrized
  tests by their ordinal fence number in the source file.
* It finds any Python modules indicated by a `.py` extension. Python
  modules are assumed to be self-contained dataset definitions.
* The found dataset definitions are run to generate a dataset, and the
  output checked to see if it's a CSV.

There is some monkeypatching necessary to make this work:

* `codelist_from_csv()` relies on having CSV data available, and the
  checks on valid codelist codes are patched out.  Without further work,
  we don't have any direct way of including data for inline dataset
  definitions in Markdown source, or specifying which mock CSV data to
  use without any established convention for examples to use.  #1697
  proposes ideas to remove this monkeypatching further.
* The sandboxing code is monkeypatched out to use "unsafe" loading of
  dataset definitions. Without doing so, it is not possible to
  monkeypatch any other ehrQL code: the ehrQL is run in a subprocess
  otherwise.

For more details and discussion, see the related PR for this code
(#1648) and the previous PR (#1475) which this approach replaces.
StevenMaude added a commit that referenced this issue Nov 2, 2023
This code has been subject to considerable work to get it into this
form. However, it did not seem useful to retain the various approaches
and versions of the code before this state.

A quick guide to this code:

* It finds any Markdown files in `docs/`.
* It uses the SuperFences extension, as we do in the MkDocs
  configuration, to extract Markdown code blocks labelled with `ehrql`
  syntax. These are assumed to be self-contained dataset definitions.
* The code blocks that will be tested should appear as code blocks in
  the documentation, by default (provided the CSS isn't changed to
  modify the appearance of code blocks somehow, which shouldn't be the
  case, because why would you?). They are identified in the parametrized
  tests by their ordinal fence number in the source file.
* It finds any Python modules indicated by a `.py` extension. Python
  modules are assumed to be self-contained dataset definitions.
* The found dataset definitions are run to generate a dataset, and the
  output checked to see if it's a CSV.

There is some monkeypatching necessary to make this work:

* `codelist_from_csv()` relies on having CSV data available, and the
  checks on valid codelist codes are patched out.  Without further work,
  we don't have any direct way of including data for inline dataset
  definitions in Markdown source, or specifying which mock CSV data to
  use without any established convention for examples to use.  #1697
  proposes ideas to remove this monkeypatching further.
* The sandboxing code is monkeypatched out to use "unsafe" loading of
  dataset definitions. Without doing so, it is not possible to
  monkeypatch any other ehrQL code: the ehrQL is run in a subprocess
  otherwise.

For more details and discussion, see the related PR for this code
(#1648) and the previous PR (#1475) which this approach replaces.
@inglesp inglesp added this to the P2 milestone Nov 3, 2023
@inglesp inglesp added this to Data Team Nov 3, 2023
@StevenMaude
Copy link
Contributor Author

StevenMaude commented Nov 7, 2023

This doesn't cover examples in the surrounding OpenSAFELY documentation, for which there's another issue: opensafely/documentation#1373.

@rebkwok
Copy link
Contributor

rebkwok commented Nov 13, 2023

Just a note as a reminder - currently the tests assume that all ehrql snippets are dataset definitions. I added a measures definition in #1723 which failed because the tests tried to test it as a dataset definition using generate_dataset. Possibly it'll be addressed in #1357

@StevenMaude
Copy link
Contributor Author

I opened a separate issue for measure definitions not yet being tested: #1726.

StevenMaude added a commit that referenced this issue Dec 13, 2023
Fixes #1697.

These were left in, but not all of these were tested.

We had a case where an ehrQL user actually tried to follow one
of these broken examples, which led to some confusion.
StevenMaude added a commit that referenced this issue Dec 13, 2023
Fixes #1697.

These were left in, but not all of these were tested.

We had a case where an ehrQL user actually tried to follow one
of these broken examples, which led to some confusion.
StevenMaude added a commit that referenced this issue Dec 13, 2023
Fixes #1697.

These were left in, but not all of these were tested.

We had a case where an ehrQL user actually tried to follow one
of these broken examples, which led to some confusion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants