Add documentation for Parquet files and export process (#280)
* Add documentation for Parquet files and export process
* Formatting
* Move `TODO` to issue #306
* Remove PR references
* Formatting
* Move specific details to `pixl_core` docs and add links
* Update directory structure on the FTPS server
* Formatting
* Rename docs/data -> docs/file_types
* Link to `file_types` documentation
* Add directory structures to docstrings
* Update upload.py

  Co-authored-by: Stef Piatek <[email protected]>
* Fix docs link

  Co-authored-by: Jeremy Stein <[email protected]>
* Clarify that the radiology reports go through Cogstack

  Co-authored-by: Jeremy Stein <[email protected]>
* Add note about test files

  Co-authored-by: Jeremy Stein <[email protected]>

---------

Co-authored-by: Stef Piatek <[email protected]>
Co-authored-by: Jeremy Stein <[email protected]>
1 parent 582c995, commit c24e86c

Showing 6 changed files with 119 additions and 17 deletions.
@@ -0,0 +1,46 @@
# Parquet files you might encounter throughout PIXL

## OMOP-ES files

From
[OMOP-ES](https://github.com/UCLH-Foundry/the-rolling-skeleton/blob/main/docs/design/100-day-design.md#data-flow-through-components)
we receive parquet files defining the data we need to export. These input files come in two groups:

1. **Public** parquet files: identifiers have been removed and replaced with a sequential ID for the
   export
2. **Private** parquet files: map sequential identifiers to patient identifiers (e.g. MRNs,
   accession numbers, NHS numbers)

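
To illustrate the relationship between the two groups, here is a minimal sketch of how one set of
records could be split into a public and a private part. It is not the OMOP-ES implementation,
which is not described here; the function name, the `sequential_id` column and the per-row
numbering are all assumptions made for the example.

```python
def split_public_private(rows, identifier_columns, id_column="sequential_id"):
    """Split records into a public part (no identifiers) and a private part (the mapping).

    Hypothetical helper: the column names and the one-ID-per-row numbering are
    illustrative assumptions, not the scheme used by OMOP-ES.
    """
    public, private = [], []
    for sequential_id, row in enumerate(rows, start=1):
        # Private side: sequential ID plus the patient identifiers it stands for.
        private.append(
            {id_column: sequential_id, **{col: row[col] for col in identifier_columns}}
        )
        # Public side: sequential ID plus everything that is not an identifier.
        public.append(
            {
                id_column: sequential_id,
                **{key: value for key, value in row.items() if key not in identifier_columns},
            }
        )
    return public, private
```

Only the public part would be safe to export; the private part stays behind as the link back to
the patient.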
## Radiology reports

The PIXL pipeline generates **Radiology** parquet files, which
contain the radiology reports for the given extract. Each report is de-identified by calling the
CogStack API, which takes a full radiology report and returns a de-identified version of it.

The functionality for this is defined in the [EHR API](../../pixl_ehr/README.md), specifically in
[`PIXLDatabase.get_radiology_reports`](../../pixl_ehr/src/pixl_ehr/_databases.py), which queries the
PIXL database for the de-identified radiology reports of the current extract and collects them
in a single _parquet_ file together with the `image_identifier` and `procedure_occurrence_id`.

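
As a rough sketch of the collection step, the snippet below turns query results into rows and
writes them to one parquet file. It is not the code in `_databases.py`: the helper names, the
`image_report` column name and the tuple layout of the query results are assumptions, and writing
relies on `pyarrow` (version 7 or later for `Table.from_pylist`) being installed.

```python
from pathlib import Path


def build_radiology_rows(reports):
    """Turn (image_identifier, procedure_occurrence_id, report_text) tuples into row dicts.

    Hypothetical helper; `image_report` is an assumed column name.
    """
    columns = ("image_identifier", "procedure_occurrence_id", "image_report")
    return [dict(zip(columns, report)) for report in reports]


def write_radiology_parquet(rows, path):
    """Write all rows to a single parquet file and return its path."""
    # Imported lazily so that building rows works even where pyarrow is missing.
    import pyarrow as pa
    import pyarrow.parquet as pq

    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    pq.write_table(pa.Table.from_pylist(rows), str(path))
    return path
```

Keeping the identifiers next to each report in the same row is what lets a report be matched back
to its image and procedure downstream.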
## Exporting (copying from OMOP-ES)

As part of the PIXL pipeline, we copy the OMOP-ES public _parquet_ files to an export directory to
prepare them for upload to the DSH. The export details are in the
[`pixl_core` documentation](../../pixl_core/README.md#omop-es-files).

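
The copy step can be pictured with the short sketch below. It is only an illustration: the
function name and the `omop/public` target sub-directory are assumptions, and the
`<project>/latest/` layout is borrowed from the test export path mentioned in this document. The
authoritative directory structure is the one in the `pixl_core` documentation.

```python
import shutil
from pathlib import Path


def copy_public_parquet(omop_es_dir, export_root, project_slug):
    """Copy only the public parquet files of an OMOP-ES extract into an export directory.

    Hypothetical helper; the source and target layouts are assumed for illustration.
    """
    source = Path(omop_es_dir) / "public"
    target = Path(export_root) / project_slug / "latest" / "omop" / "public"
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for parquet_file in sorted(source.glob("*.parquet")):
        # copy2 keeps file metadata and returns the destination path.
        copied.append(Path(shutil.copy2(parquet_file, target)))
    return copied
```

Because the function only reads from `public`, the private files that link sequential IDs to
patient identifiers are never placed in the export directory.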
## Uploading to the DSH

The final step in the journey of the _parquet_ files is to upload them to the DSH. This is
implemented and documented in [`pixl_core`](../../pixl_core/README.md#uploading-to-an-ftps-server).

## Testing

Various _parquet_ files are provided throughout the repo to enable unit and system testing:

- `cli/tests/resources/omop/` contains public and private parquet files together with an
  `extract_summary.json` file to mimic the input received from OMOP-ES for the unit tests. (This
  directory is identical to the one below and should eventually be deleted.)
- `test/resources/omop/` contains public and private parquet files together with an
  `extract_summary.json` file to mimic the input received from OMOP-ES for the system tests.

During the system test, a `radiology.parquet` file is generated and temporarily stored in
`exports/test-extract-uclh-omop-cdm/latest/radiology/radiology.parquet` to check that
de-identification succeeded before the DSH upload. The file is deleted after the test.