
Support for OMOP ES batches #274

Closed. jeremyestein wanted to merge 26 commits from the jeremy/omop-batch-support branch.

Conversation

@jeremyestein (Contributor) commented on Jan 31, 2024:

Will fix #231

TODO:

  • Split test OMOP extracts into batches; delete the stray extra copy
  • Document the different pieces of test data (batches) and why we have them
  • Implement multi-batch detection and reading/merging
  • Tests
    • increase test coverage a bit for the public methods in _io.py?
  • Clarify batch directory naming (extract_N vs batch_N); update code and docs once decided
  • Check the assumption that the timestamp is identical for all batches within an extract (a possible check is sketched below)
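
For illustration only (not from the PR): a minimal check for that last assumption, assuming each batch directory holds an extract_summary.json with a "datetime" field (the field name is a guess).

    import json
    from pathlib import Path

    def check_batch_timestamps(batch_dirs: list[Path]) -> None:
        # Hypothetical helper: read the (assumed) "datetime" field from each
        # batch's extract_summary.json and fail loudly if any batch disagrees.
        timestamps = {
            json.loads((d / "extract_summary.json").read_text())["datetime"]
            for d in batch_dirs
        }
        if len(timestamps) > 1:
            raise ValueError(f"Batches disagree on extract timestamp: {timestamps}")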

@jeremyestein changed the title from "Support for OMOP ES batches" to "Support for OMOP ES batches [force-system-test]" on Feb 1, 2024
@jeremyestein marked this pull request as ready for review on Feb 2, 2024
@jeremyestein changed the title from "Support for OMOP ES batches [force-system-test]" to "Support for OMOP ES batches" on Feb 2, 2024
@stefpiatek requested a review from a team on Feb 5, 2024
@stefpiatek (Contributor) left a comment:

Looking good; we will need to either change to using export_* or ensure that OMOP ES changes how they do their batches before merging.

Comment on lines 69 to 70
    # should it really be 'extract_*'?
    batch_dirs = list(extract_path.glob("batch_*"))
Contributor: Yeah, we're given extract_*, though we could ask if OMOP ES could change it? Worth dropping a message in the pixl-omop channel in UCLH Foundry Slack.

Contributor: Also, should we sort these?
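
(Illustration only, not part of the PR: with numeric suffixes a plain sorted() is lexicographic, so batch_10 would sort before batch_2; keying on the trailing integer avoids that. sorted_batch_dirs is a hypothetical name.)

    from pathlib import Path

    def sorted_batch_dirs(extract_path: Path) -> list[Path]:
        # Sort batch directories by their numeric suffix rather than
        # lexicographically (which would put batch_10 before batch_2).
        return sorted(
            extract_path.glob("batch_*"),
            key=lambda p: int(p.name.rsplit("_", 1)[1]),
        )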

cli/src/pixl_cli/_io.py (resolved conversation)
@@ -47,20 +47,75 @@ def messages_from_state_file(filepath: Path) -> list[Message]:
     return [deserialise(line) for line in filepath.open().readlines() if string_is_non_empty(line)]
 
 
-def config_from_log_file(parquet_path: Path) -> tuple[str, datetime]:
-    log_file = parquet_path / "extract_summary.json"
+def determine_batch_structure(extract_path: Path) -> tuple[str, datetime, list[Path]]:
Contributor: maybe parse_batch_structure?

cli/src/pixl_cli/_io.py (resolved conversation)
cli/tests/test_messages_from_parquet.py (resolved conversation)
@milanmlft (Member) left a comment:

Thanks for the detailed docstrings! ❤️
Just one small suggestion: maybe add the example file structure from the issue to cli/README.md as well, and clarify that we support this multi-batch structure.

└── omop_extract
    ├── extract_1
    │   ├── extract_summary.json
    │   ├── private
    │   └── public
    ├── extract_2
    │   ├── extract_summary.json
    │   ├── private
    │   └── public
    └── extract_3
        ├── extract_summary.json
        ├── private
        └── public
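
For context, a minimal sketch of how a determine_batch_structure-style function could walk a layout like the one above. The JSON field names ("settings", "cdm_source_name", "datetime") and the single-batch fallback are assumptions for illustration, not the PR's actual implementation.

    import json
    from datetime import datetime
    from pathlib import Path

    def determine_batch_structure(extract_path: Path) -> tuple[str, datetime, list[Path]]:
        # Sketch only: discover per-batch directories (falling back to a
        # single-batch layout) and read shared config from the first batch.
        batch_dirs = sorted(
            extract_path.glob("extract_*"),
            key=lambda p: int(p.name.rsplit("_", 1)[1]),
        )
        if not batch_dirs:
            batch_dirs = [extract_path]
        summary = json.loads((batch_dirs[0] / "extract_summary.json").read_text())
        project_name = summary["settings"]["cdm_source_name"]  # assumed field name
        extract_datetime = datetime.fromisoformat(summary["datetime"])  # assumed field name
        return project_name, extract_datetime, batch_dirs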

@stefpiatek closed this pull request on Feb 12, 2024
@stefpiatek deleted the jeremy/omop-batch-support branch on April 19, 2024
Merging this pull request may close these issues:
Accept OMOP ES batches
3 participants