last_ons_death and categorising the cause of death #1333

p-stehlik · 2023-09-15T04:24:50Z

p-stehlik
Sep 15, 2023

Hi there

Just wondering how I can use ehrql to categorise someone as having died of a certain cause, where the ICD-10 codes are in a codelist CSV?

I've tried to use .isin() but clearly that is not working - is there a way to use case for a list of conditions?
Or should I do this post hoc?


cause_of_death = last_ons_death.cause_of_death_01

dataset.cause_of_death_cat = case(
    when(cause_of_death.isin(codes_ICD10_covid)).then("COVID-19"),
    when(~cause_of_death.isin(codes_ICD10_covid)).then("Non COVID-19"),
    default="alive",
)

Thanks

P.S. ehrql is amazing and much better than the cohort extractor - really phenomenal work guys!!

Answered by evansd

Sep 15, 2023


evansd · 2023-09-15T08:46:40Z

evansd
Sep 15, 2023
Maintainer

Ah sorry, this is our fault! There were typos in some of the documentation examples, which used isin, but the actual method name is is_in:
https://docs.opensafely.org/ehrql/reference/language/#CodePatientSeries.is_in

So if you just add underscores in two places to your code then it should all work.

You also have the option, instead of using ~, to write:

cause_of_death.is_not_in(codes_ICD10_covid)

But use whichever seems clearest to you.

And thanks for the encouraging words!


p-stehlik · 2023-09-22T02:33:28Z

p-stehlik
Sep 22, 2023
Author

Thanks that has now worked.

I have two follow up questions:

QUESTION 1:

When I run the code now, it seems that it only generates patients who have died of COVID or of an unknown cause (i.e. it only uses those two values rather than also generating some non-COVID deaths).

last_ons_death = ons_deaths.sort_by(ons_deaths.date).last_for_patient()

cause_death = last_ons_death.underlying_cause_of_death

dataset.cause_of_death = cause_death

dataset.cause_of_death_cat = case(
    when(cause_death.is_in(codes_ICD10_covid)).then("COVID-19"),
    when(cause_death.is_not_in(codes_ICD10_covid)).then("Non COVID-19"),
    default="unknown"
    )

Then when I look at the data:

df.cause_of_death_cat.value_counts()

COVID-19    290
unknown     210
Name: cause_of_death_cat, dtype: int64

I'm not sure what I am doing wrong here?

QUESTION 2

I only want a cohort of participants who have died so far (I'm looking at causes of death).
When I generate the cohort I have put:

#DEFINE POPULATION - all patients who have died
dataset.define_population(patients.date_of_death.is_on_or_between(index_date, end_date))

(NB end_date is "today")

However when I generate the date of death later on:

last_ons_death = ons_deaths.sort_by(ons_deaths.date).last_for_patient()
dataset.date_of_death = last_ons_death.date #date

It seems that I have patients with no death date. I assume this is to mimic "missing data", where the patient appears in the dataset and is known to have died (somehow?) but has no date recorded.


evansd · 2023-09-27T14:46:13Z

evansd
Sep 27, 2023
Maintainer

What's happening here is that, in both cases, you're bumping up against the limits of the dummy data generator, which is currently still quite simplistic. It's an area that we still want to do a lot of work on, but it's a fundamentally quite difficult problem so we're trying to understand how far we can get with the current system before we develop new solutions.

In the first case, the problem is that the dummy data generator doesn't "know" about any codes other than the ones it sees in your dataset definition. So when it's picking random codes to populate your data it's only picking from ones in the COVID list.

In the second case, it doesn't "know" that the ONS death certificate table should match up with date of death in the primary care record.
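
To make the first case concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not how the ehrQL generator is implemented. It assumes only that dummy codes are drawn from the codelist or left missing, and that a missing value makes both is_in and is_not_in evaluate to missing, so the row falls through to the default. The two codes are placeholders for the contents of codes_ICD10_covid, and the helper names are made up for illustration.

```python
import random
from collections import Counter

# Placeholder codelist standing in for codes_ICD10_covid (an assumption).
codes_ICD10_covid = ["U071", "U072"]


def is_in(value, codes):
    # Missing in, missing out: mimics a null-propagating membership test.
    return None if value is None else value in codes


def is_not_in(value, codes):
    result = is_in(value, codes)
    return None if result is None else not result


def categorise(cause):
    # Same shape as the case() expression: the first condition that is True wins.
    if is_in(cause, codes_ICD10_covid) is True:
        return "COVID-19"
    if is_not_in(cause, codes_ICD10_covid) is True:
        return "Non COVID-19"
    return "unknown"


def simulate(n, seed=0):
    rng = random.Random(seed)
    # The only values this simulated generator can pick: listed codes or missing.
    pool = codes_ICD10_covid + [None]
    return Counter(categorise(rng.choice(pool)) for _ in range(n))


counts = simulate(500)
print(counts)
```

Because the pool never contains a code outside the list, "Non COVID-19" cannot appear, even though categorise would return it for any other code.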

You have a few options here:

1. Live with it

Dummy data only exists to allow you to check that your code works before running it against real data. So, in a sense, it doesn't matter if it's wildly unrealistic as long as it exercises your analysis code correctly. If it's possible to tweak your code to cope with the oddities of the dummy data without doing too much violence to it then one option is just to do that and live with the dummy data being weird.
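
One way to make analysis code tolerant of these oddities is to tabulate against the full set of expected categories, so a category that never appears in the dummy data shows up as zero rather than being absent. A minimal sketch in plain Python: the category labels and counts come from the question above, and the helper name tabulate is made up for illustration.

```python
from collections import Counter

EXPECTED = ["COVID-19", "Non COVID-19", "unknown"]


def tabulate(values, expected=EXPECTED):
    counts = Counter(values)
    # Surface anything unexpected instead of silently dropping it.
    unexpected = sorted(set(counts) - set(expected), key=str)
    if unexpected:
        raise ValueError(f"unexpected categories: {unexpected}")
    # Report every expected category, including those with no rows.
    return {category: counts.get(category, 0) for category in expected}


# Stand-in for the cause_of_death_cat column from the dummy dataset.
dummy_column = ["COVID-19"] * 290 + ["unknown"] * 210
table = tabulate(dummy_column)
print(table)
```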

2. Supply your own dummy dataset

You can supply your own dummy dataset and bypass the dummy data generator entirely by using the --dummy-data-file argument. You could do this either by using generate-dataset to output a CSV which you then tweak by hand, or by using an entirely separate script to generate the data.

The downside here is that you'll be responsible for updating this file to match any changes you might make to your dataset definition. ehrQL should warn you if the dummy dataset file no longer matches the format it expects, but it won't be able to fix it for you.
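
As a sketch of the "separate script" route, the following rewrites some rows of a generated CSV so that some patients have a non-COVID cause of death. The column names match the dataset definition in the question. The sample rows, the replacement code I219, and the function name are illustrative assumptions. In practice you would read the CSV written by generate-dataset and save the result as the file you pass via --dummy-data-file.

```python
import csv
import io
import random

NON_COVID_CODE = "I219"  # illustrative placeholder, not taken from any codelist


def add_non_covid_deaths(csv_text, fraction=0.3, seed=0):
    rng = random.Random(seed)
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    for row in rows:
        # Only touch rows that already have a cause of death recorded.
        if row["cause_of_death"] and rng.random() < fraction:
            row["cause_of_death"] = NON_COVID_CODE
            row["cause_of_death_cat"] = "Non COVID-19"
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


example = (
    "patient_id,cause_of_death,cause_of_death_cat\n"
    "1,U071,COVID-19\n"
    "2,,unknown\n"
    "3,U071,COVID-19\n"
    "4,U072,COVID-19\n"
)
# fraction=1.0 rewrites every row that has a cause of death.
edited = add_non_covid_deaths(example, fraction=1.0)
print(edited)
```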

3. Supply your own dummy tables

You can use the create-dummy-tables command to create a sort of tiny, fake EHR database which you can populate with your own made up patients and then run your ehrQL queries against.

Give create-dummy-tables the path to your dataset definition and the name of a directory and it will create a bunch of CSV files in that directory, one for each table in the database, populated with dummy data.

You can edit this data however you wish and then run generate-dataset with the --dummy-tables argument pointed at this directory and it will run your query against the data in these CSV files.

The advantage of this approach is that you have more flexibility to change your dataset definition without necessarily having to make any manual changes to your dummy tables. As long as your dummy tables contain enough data to work with then ehrQL will just recompute your new query against the old dummy tables.

The downside is that it can be more fiddly to edit the dummy tables in the first place because changing the data for a single dummy patient may require edits across multiple different tables.
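
For the second case, the cross-table edit can be scripted rather than done by hand. A minimal sketch, assuming the patients table has a date_of_death column and the ons_deaths table has a date column, both keyed by patient_id; check the files that create-dummy-tables actually writes before relying on these names. It works on rows already loaded as dictionaries (for example with csv.DictReader), copies each patient's primary care date of death onto their ONS death rows, and adds a row where one is missing.

```python
def align_ons_deaths(patients, ons_deaths):
    """Return ons_deaths rows made consistent with patients' date_of_death."""
    death_dates = {
        p["patient_id"]: p["date_of_death"]
        for p in patients
        if p["date_of_death"]
    }
    aligned = []
    seen = set()
    for row in ons_deaths:
        pid = row["patient_id"]
        if pid in death_dates:
            # Overwrite the ONS date with the primary care date.
            aligned.append({**row, "date": death_dates[pid]})
            seen.add(pid)
        else:
            # No primary care death date: leave the row untouched.
            aligned.append(dict(row))
    for pid, date in death_dates.items():
        if pid not in seen:
            # New rows carry only the key and date; other columns stay empty.
            aligned.append({"patient_id": pid, "date": date})
    return aligned


patients = [
    {"patient_id": "1", "date_of_death": "2021-03-01"},
    {"patient_id": "2", "date_of_death": ""},
    {"patient_id": "3", "date_of_death": "2022-07-15"},
]
ons_deaths = [
    {"patient_id": "1", "date": "2020-01-01", "underlying_cause_of_death": "U071"},
]
aligned = align_ons_deaths(patients, ons_deaths)
print(aligned)
```

When writing the rows back out with csv.DictWriter, the missing columns on the added rows are filled with empty strings by default.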

Hope that's enough to unblock you. We know that the dummy data system needs work, and we need better how-to documentation for the workarounds above, but we're trying to focus on fixing specific blockers for our researchers at the moment.
