Support for OMOP ES batches #274

jeremyestein · 2024-01-31T19:14:21Z

Will fix #231

TODO:

- Split test OMOP extracts into batches, delete stray extra copy
- Document the different bits of test data (batches) and why we have them
- Implement the multi-batch detection and reading/merging
- Tests:
  - increase test coverage a bit for the public methods in _io.py?
- Clarify batch dir naming (extract_N vs batch_N), change code and docs when decided
- Check assumption that timestamp will be identical for all batches within an extract

stefpiatek

Looking good. Before merging, we will need to either change to using extract_* or ensure that OMOP ES changes how they name their batches.

stefpiatek · 2024-02-05T09:18:59Z

cli/src/pixl_cli/_io.py

+    # should it really be 'extract_*'?
+    batch_dirs = list(extract_path.glob("batch_*"))


Yeah, we're given extract_*, though we could ask whether OMOP ES could change it. Worth dropping a message in the pixl-omop channel in the UCLH Foundry Slack.

Also, should we sort these?
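
On the sorting question: `Path.glob` makes no promise about ordering, so the batch order could differ between runs or machines. Below is a minimal sketch of one way to make it deterministic, assuming the directory names end in an integer (`extract_1`, `extract_2`, ...). The function name `sorted_batch_dirs` and its `prefix` parameter are hypothetical, not code from this PR. Sorting on the numeric suffix rather than the raw name keeps `extract_10` after `extract_2`.

```python
import re
from pathlib import Path


def sorted_batch_dirs(extract_path: Path, prefix: str = "extract_") -> list[Path]:
    """Return batch directories under extract_path, ordered by numeric suffix.

    Hypothetical helper: assumes names look like '<prefix><integer>'.
    Entries that are not directories, or that lack a numeric suffix, are skipped.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    numbered: list[tuple[int, Path]] = []
    for candidate in extract_path.glob(f"{prefix}*"):
        match = pattern.match(candidate.name)
        if candidate.is_dir() and match:
            numbered.append((int(match.group(1)), candidate))
    # Sort on the integer only, so 'extract_10' comes after 'extract_2'.
    return [path for _, path in sorted(numbered, key=lambda pair: pair[0])]
```

A plain `sorted(extract_path.glob("extract_*"))` would also be deterministic, but it orders names lexically, which only matters once an extract has ten or more batches.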

stefpiatek · 2024-02-05T09:21:18Z

cli/src/pixl_cli/_io.py

@@ -47,20 +47,75 @@ def messages_from_state_file(filepath: Path) -> list[Message]:
    return [deserialise(line) for line in filepath.open().readlines() if string_is_non_empty(line)]


-def config_from_log_file(parquet_path: Path) -> tuple[str, datetime]:
-    log_file = parquet_path / "extract_summary.json"
+def determine_batch_structure(extract_path: Path) -> tuple[str, datetime, list[Path]]:


maybe parse_batch_structure?
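
Whatever the name ends up being, the return type hints at the job: one project name and one timestamp for the whole extract, plus the list of batch directories. A rough sketch of the consistency check this implies follows; it is not the PR's implementation. The JSON keys `project_name` and `extract_datetime` are placeholders, since the real layout of `extract_summary.json` is not shown in this thread, and `read_batch_summary` and `check_batches_agree` are hypothetical helper names.

```python
import json
from datetime import datetime
from pathlib import Path


def read_batch_summary(batch_dir: Path) -> tuple[str, datetime]:
    """Read (project name, extract timestamp) from one batch's summary file.

    ASSUMPTION: the keys 'project_name' and 'extract_datetime' are placeholders;
    the real extract_summary.json may be structured differently.
    """
    summary = json.loads((batch_dir / "extract_summary.json").read_text())
    return summary["project_name"], datetime.fromisoformat(summary["extract_datetime"])


def check_batches_agree(batch_dirs: list[Path]) -> tuple[str, datetime, list[Path]]:
    """Fail fast unless every batch reports the same project name and timestamp."""
    if not batch_dirs:
        raise ValueError("No batch directories given")
    summaries = {batch_dir: read_batch_summary(batch_dir) for batch_dir in batch_dirs}
    distinct = set(summaries.values())
    if len(distinct) != 1:
        # A "dodgy" batch whose metadata differs from the others ends up here.
        raise ValueError(f"Batches disagree on project name or timestamp: {summaries}")
    project_name, extract_time = distinct.pop()
    return project_name, extract_time, list(batch_dirs)
```

Raising on the first mismatch, rather than picking one batch's metadata, matches the TODO item about checking that the timestamp is identical across batches.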

milanmlft

Thanks for the detailed docstrings! ❤️
Just one small suggestion; maybe add the example file structure from the issue to the cli/README.md as well? And clarify that we support this multi-batch structure.

└── omop_extract
    ├── extract_1
    │   ├── extract_summary.json
    │   ├── private
    │   └── public
    ├── extract_2
    │   ├── extract_summary.json
    │   ├── private
    │   └── public
    └── extract_3
        ├── extract_summary.json
        ├── private
        └── public
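
For the README, the difference between the single-batch and multi-batch layouts could be illustrated along these lines. This is a sketch under assumptions: a single-batch extract is taken to have `extract_summary.json` directly in the top-level directory (as the old `config_from_log_file` expected), a multi-batch extract has one inside each `extract_N` subdirectory as in the tree above, and `find_batch_dirs` is a hypothetical name.

```python
from pathlib import Path

SUMMARY_FILE = "extract_summary.json"


def find_batch_dirs(extract_path: Path) -> list[Path]:
    """Return the directories that each hold one batch of an extract.

    Hypothetical helper. Single-batch layout: the summary file sits directly
    in extract_path, which is then the only batch. Multi-batch layout: each
    'extract_*' subdirectory holds its own summary file.
    """
    if (extract_path / SUMMARY_FILE).is_file():
        return [extract_path]
    # Lexical sort: deterministic, but 'extract_10' would sort before 'extract_2'.
    batch_dirs = sorted(
        subdir
        for subdir in extract_path.glob("extract_*")
        if (subdir / SUMMARY_FILE).is_file()
    )
    if not batch_dirs:
        raise FileNotFoundError(f"No {SUMMARY_FILE} found in or below {extract_path}")
    return batch_dirs
```

Treating the top-level directory as a batch of one lets downstream code iterate over batches without caring which layout it was given.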

jeremyestein added 6 commits January 31, 2024 19:06

Firstly refactor tests so that each test can generate the correct batch setup as needed. For now, all tests are doing it the "old" way. Delete duplicated test resources.

a40860d

Merge branch 'main' into jeremy/omop-batch-support

cc4a3ea

Temporarily point to correct single batch directory. Will change this back to multi-batch when it's implemented.

0a39b2f

Split procedure occurrence test data between two batches. Because one person has procedures in both batches, there is some overlap in the PERSON_LINKS.parquet file

7c628c7

Dodgy batch that doesn't match the project name + timestamp of the others.

e2d0917

Update tests, implementation not quite there yet

74c1db1

jeremyestein changed the title ~~Support for OMOP ES batches~~ Support for OMOP ES batches [force-system-test] Feb 1, 2024

jeremyestein added 7 commits February 1, 2024 18:46

Need to write to batch dir on export, too

b9ce649

Reading and copying extracts must be done on a batch-by-batch basis.

f55fb92

Set up the system test so it has all the original data, which is now split over two batches.

c68d1f3

Merge branch 'main' into jeremy/omop-batch-support

e7dcd35

Explain what the different batches are for

0362496

Minor typing error

6fba6c5

Increase test coverage slightly; not sure if worth it.

0d15136

jeremyestein marked this pull request as ready for review February 2, 2024 19:20

jeremyestein changed the title ~~Support for OMOP ES batches [force-system-test]~~ Support for OMOP ES batches Feb 2, 2024

stefpiatek requested a review from a team February 5, 2024 09:43

stefpiatek approved these changes Feb 5, 2024

milanmlft approved these changes Feb 5, 2024

jeremyestein added 7 commits February 5, 2024 16:40

Review suggestions: log batch details, sort globbed batch dirs.

d253f34

Document the single and multi batch extracts that are accepted by the CLI

6cff5a2

Merge branch 'main' into jeremy/omop-batch-support

2a5e81e

Check that truncated parquet files stop the process.

21db43d

Merge branch 'main' into jeremy/omop-batch-support

c83e7d1

Fix up merge, new code in main needs to be OMOP ES batch-aware.

9cb88d1

Merge branch 'main' into jeremy/omop-batch-support

c9ccbbf

jeremyestein mentioned this pull request Feb 7, 2024

Add documentation for Parquet files and export process #280

Merged

1 task

jeremyestein added 3 commits February 7, 2024 17:05

Fix test to be batch aware. It now (correctly) fails.

2b3a33b

Capture docker output for easier debugging

99c9d58

Merge branch 'main' into jeremy/omop-batch-support

e35f76f

jeremyestein added 3 commits February 7, 2024 17:44

Merge branch 'main' into jeremy/omop-batch-support

fe9cc9d

Missed a dependency

f871583

FTP upload everything as found. Document the files present at each stage of the pipeline - probably needs some changing to make more consistent!

421f91f

jeremyestein mentioned this pull request Feb 9, 2024

FTP upload preserving directory structure #289

Merged

stefpiatek closed this Feb 12, 2024

stefpiatek deleted the jeremy/omop-batch-support branch April 19, 2024 16:40
