-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for OMOP ES batches #274
Conversation
…ch setup as needed. For now, all tests are doing it the "old" way. Delete duplicated test resources.
back to multi-batch when it's implemented.
person has procedures in both batches, there is some overlap in the PERSON_LINKS.parquet file
split over two batches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, will need to either change to using export_*
or ensure that OMOP ES changes how they do their batches before merging
cli/src/pixl_cli/_io.py
Outdated
# should it really be 'extract_*'? | ||
batch_dirs = list(extract_path.glob("batch_*")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we're given extract_*
, though we could ask if OMOP ES could change it? Worth dropping a message in pixl-omop channel in UCLH Foundry slack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should we sort these?
@@ -47,20 +47,75 @@ def messages_from_state_file(filepath: Path) -> list[Message]: | |||
return [deserialise(line) for line in filepath.open().readlines() if string_is_non_empty(line)] | |||
|
|||
|
|||
def config_from_log_file(parquet_path: Path) -> tuple[str, datetime]: | |||
log_file = parquet_path / "extract_summary.json" | |||
def determine_batch_structure(extract_path: Path) -> tuple[str, datetime, list[Path]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe parse_batch_structure
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed docstrings! ❤️
Just one small suggestion; maybe add the example file structure from the issue to the cli/README.md
as well? And clarify that we support this multi-batch structure.
└── omop_extract
├── extract_1
│ ├── extract_summary.json
│ ├── private
│ └── public
├── extract_2
│ ├── extract_summary.json
│ ├── private
│ └── public
└── extract_3
├── extract_summary.json
├── private
└── public
of the pipeline - probably needs some changing to make more consistent!
Will fix #231
TODO:
extract_N
vsbatch_N
), change code and docs when decided