diff --git a/README.md b/README.md
index 0d5f774..4e0ba70 100644
--- a/README.md
+++ b/README.md
@@ -17,27 +17,34 @@
# Automatic Cohort Extraction System for Event-Streams
-Automatic Cohort Extraction System (ACES) is a library that streamlines the extraction of task-specific cohorts from time series datasets formatted as event-streams, such as Electronic Health Records (EHR). ACES is designed to query these EHR datasets for valid subjects, guided by various constraints and requirements defined in a YAML task configuration file. This offers a powerful and user-friendly solution to researchers and developers. The use of a human-readable YAML configuration file also eliminates the need for users to be proficient in complex dataframe querying, making the extraction process accessible to a broader audience.
+**Updates**
-There are diverse applications in healthcare and beyond. For instance, researchers can effortlessly define subsets of EHR datasets for training of foundation models. Retrospective analyses can also become more accessible to clinicians as it enables the extraction of tailored cohorts for studying specific medical conditions or population demographics.
+- **\[2024-09-01\]** Predicates can now be defined in a configuration file separate from task criteria files.
+- **\[2024-08-29\]** MEDS v0.3.3 is now supported.
+- **\[2024-08-22\]** Polars v1.5.\* is now supported.
+- **\[2024-08-10\]** Expanded predicates configuration language to support regular expressions, multi-column constraints, and multi-value constraints.
+- **\[2024-07-30\]** Added ability to place constraints on static variables, such as patient demographics.
+- **\[2024-06-28\]** Paper posted at [arXiv:2406.19653](https://arxiv.org/abs/2406.19653).
-Currently, two data standards are directly supported: the [Medical Event Data Standard (MEDS)](https://github.com/Medical-Event-Data-Standard/meds) standard and the [EventStreamGPT (ESGPT)](https://github.com/mmcdermott/EventStreamGPT) standard. You must format your in one of these two formats by following instructions in their respective repositories. ACES also supports ***any*** arbitrary dataset schema, provided you extract the necessary dataset-specific plain predicates and format it as an event-stream. More information about this is available below and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html).
+Automatic Cohort Extraction System (ACES) is a library that streamlines the extraction of task-specific cohorts from time series datasets formatted as event-streams, such as Electronic Health Records (EHR). ACES is designed to query these EHR datasets for valid subjects, guided by various constraints and requirements defined in a YAML task configuration file. This offers a powerful and user-friendly solution to researchers and developers. The use of a human-readable YAML configuration file also eliminates the need for users to be proficient in complex dataframe querying, making the extraction process accessible to a broader audience.
-This README provides an overview of this tool, instructions for use, and a description of the fields in the task configuration file (see configs in `sample_configs/`). Please refer to the [ACES Documentation](https://eventstreamaces.readthedocs.io/en/latest/) for more detailed information.
+There are diverse applications in healthcare and beyond. For instance, researchers can effortlessly define subsets of EHR datasets for training of foundation models. Retrospective analyses can also become more accessible to clinicians as ACES enables the extraction of tailored cohorts for studying specific medical conditions or population demographics. Finally, ACES can help realize a new era of benchmarking over tasks instead of data - please check out [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main)!
-## Installation
+Currently, two data standards are directly supported: the [Medical Event Data Standard (MEDS)](https://github.com/Medical-Event-Data-Standard/meds) standard and the [EventStreamGPT (ESGPT)](https://github.com/mmcdermott/EventStreamGPT) standard. You must format your data in one of these two formats by following instructions in their respective repositories. ACES also supports ***any*** arbitrary dataset schema, provided you extract the necessary dataset-specific plain predicates and format them as an event-stream. More information about this is available below and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html).
-### For MEDS v0.3.2
+This README provides a brief overview of this tool, instructions for use, and a description of the fields in the task configuration file (see representative configs in `sample_configs/`). Please refer to the [ACES Documentation](https://eventstreamaces.readthedocs.io/en/latest/) for more detailed information.
-`pip install es-aces`
+## Installation
-### For MEDS v0.3
+### For MEDS v0.3.3
-`pip install es-aces==0.3.2`
+```bash
+pip install es-aces
+```
-### For ESGPT Installation
+### For ESGPT
-1. If using the ESGPT data standard, install [EventStreamGPT (ESGPT)](https://github.com/mmcdermott/EventStreamGPT):
+1. Install [EventStreamGPT (ESGPT)](https://github.com/mmcdermott/EventStreamGPT):
Clone EventStreamGPT:
@@ -56,13 +63,13 @@
pip install -e .
## Instructions for Use
1. **Prepare a Task Configuration File**: Define your predicates and task windows according to your research needs. Please see below or [here](https://eventstreamaces.readthedocs.io/en/latest/configuration.html) for details regarding the configuration language.
-2. **Get Predicates DataFrame**: Process your dataset according to the instructions for the [MEDS](https://github.com/Medical-Event-Data-Standard/meds) (single-nested or un-nested) or [ESGPT](https://github.com/mmcdermott/EventStreamGPT) standard so you can leverage ACES to automatically create the predicates dataframe. You can also create your own predicates dataframe directly (more information below and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html)).
+2. **Prepare Dataset & Predicates DataFrame**: Process your dataset according to instructions for the [MEDS](https://github.com/Medical-Event-Data-Standard/meds) or [ESGPT](https://github.com/mmcdermott/EventStreamGPT) standard so you can leverage ACES to automatically create the predicates dataframe. Alternatively, you can create your own predicates dataframe directly (more information below and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html)).
3. **Execute Query**: A query may be executed using either the command-line interface or by importing the package in Python:
### Command-Line Interface:
```bash
-aces-cli data.path='/path/to/data/file/or/directory' data.standard='' cohort_dir='/directory/to/task/config/' cohort_name=''
+aces-cli data.path='/path/to/data/directory/or/file' data.standard='<meds|esgpt|direct>' cohort_dir='/directory/to/task/config/' cohort_name='<name_of_task>'
```
For help using `aces-cli`:
@@ -78,13 +85,13 @@
from aces import config, predicates, query
from omegaconf import DictConfig
# create task configuration object
-cfg = config.TaskExtractorConfig.load(config_path="/path/to/task/config/task.yaml")
+cfg = config.TaskExtractorConfig.load(config_path="/path/to/task/config.yaml")
# get predicates dataframe
data_config = DictConfig(
{
- "path": "/path/to/data/file/or/directory",
- "standard": "",
+ "path": "/path/to/data/directory/or/file",
+ "standard": "<meds|esgpt|direct>",
"ts_format": "%m/%d/%Y %H:%M",
}
)
@@ -94,34 +101,32 @@ predicates_df = predicates.get_predicates_df(cfg=cfg, data_config=data_config)
df_result = query.query(cfg=cfg, predicates_df=predicates_df)
```
-4. **Results**: The output will be a dataframe of subjects who satisfy the conditions defined in your task configuration file. Timestamps for the start/end boundaries of each window specified in the task configuration, as well as predicate counts for each window, are also provided. Below are sample logs for the successful extraction of an in-hospital mortality cohort using the ESGPT standard:
+4. **Results**: The output will be a dataframe of subjects who satisfy the conditions defined in your task configuration file. Timestamps for the start/end boundaries of each window specified in the task configuration, as well as predicate counts for each window, are also provided. Below are sample logs for the successful extraction of an in-hospital mortality cohort:
```log
-aces-cli cohort_name="inhospital_mortality" cohort_dir="sample_configs" data.standard="esgpt" data.path="MIMIC_ESD_new_schema_08-31-23-1/"
-2024-06-05 02:06:57.362 | INFO | aces.__main__:main:40 - Loading config from 'sample_configs/inhospital_mortality.yaml'
-2024-06-05 02:06:57.369 | INFO | aces.config:load:832 - Parsing predicates...
-2024-06-05 02:06:57.369 | INFO | aces.config:load:838 - Parsing trigger event...
-2024-06-05 02:06:57.369 | INFO | aces.config:load:841 - Parsing windows...
-2024-06-05 02:06:57.380 | INFO | aces.__main__:main:43 - Attempting to get predicates dataframe given:
-standard: esgpt
+aces-cli cohort_name="inhospital_mortality" cohort_dir="sample_configs" data.standard="meds" data.path="MEDS_DATA"
+2024-09-24 02:06:57.362 | INFO | aces.__main__:main:153 - Loading config from 'sample_configs/inhospital_mortality.yaml'
+2024-09-24 02:06:57.369 | INFO | aces.config:load:1258 - Parsing windows...
+2024-09-24 02:06:57.369 | INFO | aces.config:load:1267 - Parsing trigger event...
+2024-09-24 02:06:57.369 | INFO | aces.config:load:1282 - Parsing predicates...
+2024-09-24 02:06:57.380 | INFO | aces.__main__:main:156 - Attempting to get predicates dataframe given:
+standard: meds
ts_format: '%m/%d/%Y %H:%M'
-path: MIMIC_ESD_new_schema_08-31-23-1/
+path: MEDS_DATA/
_prefix: ''
-Updating config.save_dir from /n/data1/hms/dbmi/zaklab/RAMMS/data/MIMIC_IV/ESD_new_schema_08-31-23-1 to MIMIC_ESD_new_schema_08-31-23-1
-Loading events from MIMIC_ESD_new_schema_08-31-23-1/events_df.parquet...
-Loading dynamic_measurements from MIMIC_ESD_new_schema_08-31-23-1/dynamic_measurements_df.parquet...
-2024-06-05 02:07:01.405 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:241 - Generating plain predicate columns...
-2024-06-05 02:07:01.579 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:252 - Added predicate column 'admission'.
-2024-06-05 02:07:01.770 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:252 - Added predicate column 'discharge'.
-2024-06-05 02:07:01.925 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:252 - Added predicate column 'death'.
-2024-06-05 02:07:07.155 | INFO | aces.predicates:generate_plain_predicates_from_esgpt:273 - Cleaning up predicates dataframe...
-2024-06-05 02:07:07.156 | INFO | aces.predicates:get_predicates_df:401 - Loaded plain predicates. Generating derived predicate columns...
-2024-06-05 02:07:07.167 | INFO | aces.predicates:get_predicates_df:404 - Added predicate column 'discharge_or_death'.
-2024-06-05 02:07:07.772 | INFO | aces.predicates:get_predicates_df:413 - Generating special predicate columns...
-2024-06-05 02:07:07.841 | INFO | aces.predicates:get_predicates_df:434 - Added predicate column '_ANY_EVENT'.
-2024-06-05 02:07:07.841 | INFO | aces.query:query:32 - Checking if '(subject_id, timestamp)' columns are unique...
-2024-06-05 02:07:08.221 | INFO | aces.utils:log_tree:59 -
+2024-09-24 02:06:58.176 | INFO | aces.predicates:generate_plain_predicates_from_meds:268 - Loading MEDS data...
+2024-09-24 02:07:01.405 | INFO | aces.predicates:generate_plain_predicates_from_meds:272 - Generating plain predicate columns...
+2024-09-24 02:07:01.579 | INFO | aces.predicates:generate_plain_predicates_from_meds:276 - Added predicate column 'admission'.
+2024-09-24 02:07:01.770 | INFO | aces.predicates:generate_plain_predicates_from_meds:276 - Added predicate column 'discharge'.
+2024-09-24 02:07:01.925 | INFO | aces.predicates:generate_plain_predicates_from_meds:276 - Added predicate column 'death'.
+2024-09-24 02:07:07.155 | INFO | aces.predicates:generate_plain_predicates_from_meds:279 - Cleaning up predicates dataframe...
+2024-09-24 02:07:07.156 | INFO | aces.predicates:get_predicates_df:642 - Loaded plain predicates. Generating derived predicate columns...
+2024-09-24 02:07:07.167 | INFO | aces.predicates:get_predicates_df:645 - Added predicate column 'discharge_or_death'.
+2024-09-24 02:07:07.772 | INFO | aces.predicates:get_predicates_df:654 - Generating special predicate columns...
+2024-09-24 02:07:07.841 | INFO | aces.predicates:get_predicates_df:681 - Added predicate column '_ANY_EVENT'.
+2024-09-24 02:07:07.841 | INFO | aces.query:query:76 - Checking if '(subject_id, timestamp)' columns are unique...
+2024-09-24 02:07:08.221 | INFO | aces.utils:log_tree:57 -
trigger
┣━━ input.end
@@ -129,21 +134,22 @@ trigger
┗━━ gap.end
┗━━ target.end
-2024-06-05 02:07:08.221 | INFO | aces.query:query:43 - Beginning query...
-2024-06-05 02:07:08.221 | INFO | aces.query:query:44 - Identifying possible trigger nodes based on the specified trigger event...
-2024-06-05 02:07:08.233 | INFO | aces.constraints:check_constraints:93 - Excluding 14,623,763 rows as they failed to satisfy '1 <= admission <= None'.
-2024-06-05 02:07:08.249 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'input.end'...
-2024-06-05 02:07:13.259 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'input.start'...
-2024-06-05 02:07:26.011 | INFO | aces.constraints:check_constraints:93 - Excluding 12,212 rows as they failed to satisfy '5 <= _ANY_EVENT <= None'.
-2024-06-05 02:07:26.052 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'gap.end'...
-2024-06-05 02:07:30.223 | INFO | aces.constraints:check_constraints:93 - Excluding 631 rows as they failed to satisfy 'None <= admission <= 0'.
-2024-06-05 02:07:30.224 | INFO | aces.constraints:check_constraints:93 - Excluding 18,165 rows as they failed to satisfy 'None <= discharge <= 0'.
-2024-06-05 02:07:30.224 | INFO | aces.constraints:check_constraints:93 - Excluding 221 rows as they failed to satisfy 'None <= death <= 0'.
-2024-06-05 02:07:30.226 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'target.end'...
-2024-06-05 02:07:41.512 | INFO | aces.query:query:60 - Done. 44,318 valid rows returned corresponding to 11,606 subjects.
-2024-06-05 02:07:41.513 | INFO | aces.query:query:72 - Extracting label 'death' from window 'target'...
-2024-06-05 02:07:41.514 | INFO | aces.query:query:86 - Setting index timestamp as 'end' of window 'input'...
-2024-06-05 02:07:41.606 | INFO | aces.__main__:main:52 - Completed in 0:00:44.243514. Results saved to 'sample_configs/inhospital_mortality.parquet'.
+2024-09-24 02:07:08.221 | INFO | aces.query:query:85 - Beginning query...
+2024-09-24 02:07:08.221 | INFO | aces.query:query:89 - Static variable criteria specified, filtering patient demographics...
+2024-09-24 02:07:08.221 | INFO | aces.query:query:99 - Identifying possible trigger nodes based on the specified trigger event...
+2024-09-24 02:07:08.233 | INFO | aces.constraints:check_constraints:110 - Excluding 14,623,763 rows as they failed to satisfy '1 <= admission <= None'.
+2024-09-24 02:07:08.249 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'input.end'...
+2024-09-24 02:07:13.259 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'input.start'...
+2024-09-24 02:07:26.011 | INFO | aces.constraints:check_constraints:176 - Excluding 12,212 rows as they failed to satisfy '5 <= _ANY_EVENT <= None'.
+2024-09-24 02:07:26.052 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'gap.end'...
+2024-09-24 02:07:30.223 | INFO | aces.constraints:check_constraints:176 - Excluding 631 rows as they failed to satisfy 'None <= admission <= 0'.
+2024-09-24 02:07:30.224 | INFO | aces.constraints:check_constraints:176 - Excluding 18,165 rows as they failed to satisfy 'None <= discharge <= 0'.
+2024-09-24 02:07:30.224 | INFO | aces.constraints:check_constraints:176 - Excluding 221 rows as they failed to satisfy 'None <= death <= 0'.
+2024-09-24 02:07:30.226 | INFO | aces.extract_subtree:extract_subtree:252 - Summarizing subtree rooted at 'target.end'...
+2024-09-24 02:07:41.512 | INFO | aces.query:query:113 - Done. 44,318 valid rows returned corresponding to 11,606 subjects.
+2024-09-24 02:07:41.513 | INFO | aces.query:query:129 - Extracting label 'death' from window 'target'...
+2024-09-24 02:07:41.514 | INFO | aces.query:query:142 - Setting index timestamp as 'end' of window 'input'...
+2024-09-24 02:07:41.606 | INFO | aces.__main__:main:188 - Completed in 0:00:44.243514. Results saved to 'sample_configs/inhospital_mortality.parquet'.
```
## Task Configuration File
@@ -172,11 +178,11 @@ windows:
...
```
-Sample task configuration files for 6 common tasks are provided in `sample_configs/`. All task configurations can be directly extracted using `'direct'` model on `sample_data/sample_data.csv` as this predicates dataframe was designed specifically to capture predicates needed for all tasks. However, only `inhospital_mortality.yaml` and `imminent-mortality.yaml` would be able to be extracted on `sample_data/esgpt_sample` and `sample_data/meds_sample` due to a lack of required predicates.
+Sample task configuration files for 6 common tasks are provided in `sample_configs/`. All task configurations can be directly extracted using `'direct'` mode on `sample_data/sample_data.csv` as this predicates dataframe was designed specifically to capture concepts needed for all tasks. However, only `inhospital_mortality.yaml` and `imminent-mortality.yaml` can be extracted on `sample_data/esgpt_sample` and `sample_data/meds_sample` due to a lack of required concepts in those datasets.
### Predicates
-Predicates describe the event at a timestamp and are used to create predicate columns that contain predicate counts for each row of your dataset. If the MEDS or ESGPT data standard is used, ACES automatically computes the predicates dataframe needed for the query from the `predicates` fields in your task configuration file. However, you may also choose to construct your own predicates dataframe should you not wish to use the MEDS or ESGPT data standard.
+Predicates describe the event at a timestamp. Predicate columns are created to contain predicate counts for each row of your dataset. If the MEDS or ESGPT data standard is used, ACES automatically computes the predicates dataframe needed for the query from the `predicates` fields in your task configuration file. However, you may also choose to construct your own predicates dataframe should you not wish to use the MEDS or ESGPT data standard.
Example predicates dataframe `.csv`:
@@ -205,19 +211,24 @@ normal_spo2:
value_max: 120 # optional
value_min_inclusive: true # optional
value_max_inclusive: true # optional
+ other_cols: {} # optional
```
Fields for a "plain" predicate:
-- `code` (required): Must be a string with `//` sequence separating the column name and column value.
+- `code` (required): Must be one of the following (see the sketch after this list):
+ - a string with a `//` sequence separating the column name and column value.
+ - a list of strings as above in the form of {any: \[???, ???, ...\]}, which will match any of the listed codes.
+ - a regex in the form of {regex: "???"}, which will match any code that matches that regular expression.
- `value_min` (optional): Must be float or integer specifying the minimum value of the predicate, if the variable is presented as numerical values.
- `value_max` (optional): Must be float or integer specifying the maximum value of the predicate, if the variable is presented as numerical values.
- `value_min_inclusive` (optional): Must be a boolean specifying whether `value_min` is inclusive or not.
- `value_max_inclusive` (optional): Must be a boolean specifying whether `value_max` is inclusive or not.
+- `other_cols` (optional): Must be a 1-to-1 dictionary of column names to column values, which places additional constraints on further columns.
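+
+For illustration, the expanded `code` forms above might be combined as follows. This is a minimal sketch only; the predicate names, codes, column names, and values are hypothetical and dataset-dependent:
+
+```yaml
+predicates:
+  # plain string code: a column//value pair
+  death:
+    code: event_type//DEATH
+  # list form: matches any of the listed codes
+  discharge_or_transfer:
+    code: {any: [event_type//DISCHARGE, event_type//TRANSFER]}
+  # regex form: matches any code satisfying the regular expression
+  any_lab:
+    code: {regex: "^LAB//.*"}
+  # numeric bounds plus an additional constraint on another column
+  normal_spo2_room_air:
+    code: LAB//SpO2
+    value_min: 90
+    value_max: 120
+    value_min_inclusive: true
+    value_max_inclusive: true
+    other_cols: {unit: "%"}
+```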
#### Derived Predicates
-"Derived" predicates combine existing "plain" predicates using `and` or `or` keywords and have exactly 1 required `expr` field: For instance, the following defines a predicate representing either death or discharge (by combining "plain" predicates of `death` and `discharge`):
+"Derived" predicates combine existing "plain" predicates using `and` / `or` keywords and have exactly one required `expr` field. For instance, the following defines a predicate representing either death or discharge (by combining "plain" predicates of `death` and `discharge`):
```yaml
# plain predicates
@@ -233,9 +244,9 @@ discharge_or_death:
Field for a "derived" predicate:
-- `expr`: Must be a string with the 'and()' or 'or()' key sequences, with "plain" predicates as its constituents.
+- `expr`: Must be a string with the 'and()' / 'or()' key sequences, with "plain" predicates as its constituents.
-A special predicate `_ANY_EVENT` is always defined, which simply represents any event, as the name suggests. This predicate can be used like any other predicate manually defined (ie., setting a constraint on its occurrence or using it as a trigger, more information below).
+A special predicate `_ANY_EVENT` is always defined, which simply represents any event, as the name suggests. This predicate can be used like any other predicate manually defined (ie., setting a constraint on its occurrence or using it as a trigger - more information below!).
#### Special Predicates
@@ -249,7 +260,7 @@
There are also a few special predicates that you can use. These *do not* need to
### Trigger Event
-The trigger event is a simple field with a value of a predicate name. For each trigger event, a predication by a model can be made. For instance, in the following example, the trigger event is an admission. Therefore, in your task, a prediction by a model can be made for each valid admission (after extraction according to other task specifications). You can also simply filter to a cohort of one event (ie., just a trigger event) should you not have any further criteria in your task.
+The trigger event is a simple field with a value of a predicate name. For each trigger event, a prediction by a model can be made. For instance, in the following example, the trigger event is an admission. Therefore, in your task, a prediction by a model can be made for each valid admission (ie., samples remaining after extraction according to other task specifications are considered valid). You can also simply filter to a cohort of one event (ie., just a trigger event) should you not have any further criteria in your task.
```yaml
predicates:
@@ -277,28 +288,28 @@ input:
In this example, the window `input` begins at `NULL` (ie., the first event or the start of the time series record), and ends at 24 hours after the `trigger` event, which is specified to be a hospital admission. The window is inclusive on both ends (ie., both the first event and the event at 24 hours after the admission, if any, is included in this window). Finally, a constraint of 5 events of any kind is placed so any valid window would include sufficient data.
-Two fields (`start` and `end`) are required to define the size of a window. Both fields must be a string referencing a predicate name, or a string referencing the `start` or `end` field of another window name. In addition, it may express a temporal relationship by including a positive or negative time period expressed as a string (ie., `+ 2 days`, `- 365 days`, `+ 12h`, `- 30 minutes`, `+ 60s`). It may also express an event relationship by including a sequence with a directional arrow and a predicate name (ie., `-> predicate_1` or `<- predicate_1`). Finally, it may also contain `NULL`, indicating the first/last event for the `start`/`end` field, respectively.
+Two fields (`start` and `end`) are required to define the size of a window. Both fields must be a string referencing a predicate name, or a string referencing the `start` or `end` field of another window. In addition, it may express a temporal relationship by including a positive or negative time period expressed as a string (ie., `+ 2 days`, `- 365 days`, `+ 12h`, `- 30 minutes`, `+ 60s`). It may also express an event relationship by including a sequence with a directional arrow and a predicate name (ie., `-> predicate_1` indicating the period until the next occurrence of the predicate, or `<- predicate_1` indicating the period following the previous occurrence of the predicate). Finally, it may also contain `NULL`, indicating the first/last event for the `start`/`end` field, respectively.
-`start_inclusive` and `end_inclusive` are required booleans specifying whether the events, if any, at the `start` and `end` points of the window are included in the window.
+`start_inclusive` and `end_inclusive` are required booleans specifying whether the events, if present, at the `start` and `end` points of the window are included in the window.
The `has` field specifies constraints relating to predicates within the window. For each predicate defined previously, a constraint for occurrences can be set using a string in the format of `(<min>, <max>)`. Unbounded conditions can be specified by using `None` or leaving it empty (ie., `(5, None)`, `(8,)`, `(None, 32)`, `(,10)`).
-`label` is an optional field and can only exist in ONE window in the task configuration file if defined. It must be a string matching a defined predicate name, and is used to extract the label for the task.
+`label` is an optional field and can only exist in ONE window in the task configuration file if defined (an error is thrown otherwise). It must be a string matching a defined predicate name, and is used to extract the label for the task.
-`index_timestamp` is an optional field and can only exist in ONE window in the task configuration file if defined. It must be either `start` or `end`, and is used to create an index column used to easily manipulate the results output. Usually, one would set it to be the time at which the prediction would be made (ie., set to `end` in your window containing input data). Please ensure that you are validating your interpretation of `index_timestamp` for your task. For instance, if `index_timestamp` is set to the `end` of a particular window, the timestamp would be the event at the window boundary. However, in some cases, your task may want to exclude this boundary event, so ensure you are correctly interpreting the timestamp during extraction.
+`index_timestamp` is an optional field and can only exist in ONE window in the task configuration file if defined (an error is thrown otherwise). It must be either `start` or `end`, and is used to create an index column for easily manipulating the results output. Usually, one would set it to be the time at which the prediction would be made (ie., set to `end` in your window containing input data). Please ensure that you are validating your interpretation of `index_timestamp` for your task. For instance, if `index_timestamp` is set to the `end` of a particular window, the timestamp would be that of the event at the window boundary. However, in some cases, your task may want to exclude this boundary event, so ensure you are correctly interpreting the timestamp during extraction.
## FAQs
### Static Data
-Static data is now supported. In MEDS, static variables are simply stored in rows with `null` timestamps. In ESGPT, static variables are stored in a separate `subjects_df` table. In either case, it is feasible to express static variables as a predicate and apply the associated criteria normally using the `patient_demographics` heading of a configuration file. Please see [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/examples.html) and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html) for examples and details.
+In MEDS, static variables are simply stored in rows with `null` timestamps. In ESGPT, static variables are stored in a separate `subjects_df` table. In either case, it is feasible to express static variables as predicates and apply the associated criteria normally using the `patient_demographics` heading of a configuration file (see the sketch at the end of this FAQ section). Please see [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/examples.html) and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html) for examples and details.
### Complementary Tools
ACES is an integral part of the MEDS ecosystem. To fully leverage its capabilities, you can utilize it alongside other complementary MEDS tools, such as:
-- [MEDS-ETL](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to transform various data schemas, including some command data models, into the MEDS format.
-- [MEDS-TAB](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used generate automated tabular baseline methods (ie., XGBoost over ACES-defined tasks).
+- [MEDS-ETL](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to transform various data schemas, including some common data models, into the MEDS format.
+- [MEDS-TAB](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to generate automated tabular baseline methods (ie., XGBoost over ACES-defined tasks).
- [MEDS-Polars](https://github.com/Medical-Event-Data-Standard/meds_etl), which contains polars-based ETL scripts.
### Alternative Tools
@@ -307,23 +318,21 @@
There are existing alternatives for cohort extraction that focus on specific com
ACES serves as a middle ground between PIC-SURE and ATLAS. While it may offer less capability than PIC-SURE, it compensates with greater ease of use and improved communication value. Compared to ATLAS, ACES provides greater capability, though with slightly lower ease of use, yet it still maintains a higher communication value.
-Finally, ACES is not tied to a particular common data model. Built on a flexible event-stream format, ACES is a no-code solution with a descriptive input format, permitting easy and wide iteration over task definitions, and can be applied to a variety of schemas, making it a versatile tool suitable for diverse research needs.
+Finally, ACES is not tied to a particular common data model. Built on a flexible event-stream format, ACES is a no-code solution with a descriptive input format, permitting easy and wide iteration over task definitions. It can be applied to a variety of schemas, making it a versatile tool suitable for diverse research needs.
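+
+Returning to the Static Data FAQ above, a minimal sketch of a `patient_demographics` block is shown below; the predicate name and code are hypothetical and depend on how your dataset encodes demographics:
+
+```yaml
+patient_demographics:
+  male:
+    code: SEX//male
+```
+
+With such a block, only subjects whose static rows satisfy the `male` predicate would be retained, corresponding to the demographics-filtering step visible in the sample logs above.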
## Future Roadmap
### Usability
- Extract indexing information for easier setup of downstream tasks ([#37](https://github.com/justin13601/ACES/issues/37))
-- Allow separate predicates-only files and criteria-only files ([#42](https://github.com/justin13601/ACES/issues/42))
### Coverage
- Directly support nested configuration files ([#43](https://github.com/justin13601/ACES/issues/43))
- Support timestamp binning for use in predicates or as qualifiers ([#44](https://github.com/justin13601/ACES/issues/44))
- Support additional label types ([#45](https://github.com/justin13601/ACES/issues/45))
-- Support additional predicate types ([#47](https://github.com/justin13601/ACES/issues/47))
-- Better handle criteria for static variables ([#48](https://github.com/justin13601/ACES/issues/48))
- Allow chaining of multiple task configurations ([#49](https://github.com/justin13601/ACES/issues/49))
+- Additional predicate expansions ([#66](https://github.com/justin13601/ACES/issues/66))
### Generalizability
@@ -332,7 +341,6 @@
### Causal Usage
- Directly support case-control matching ([#51](https://github.com/justin13601/ACES/issues/51))
-- Directly support profiling of excluded populations ([#52](https://github.com/justin13601/ACES/issues/52))
### Additional Tasks
diff --git a/docs/source/conf.py b/docs/source/conf.py
index ab4cef1..fb4d690 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -28,8 +28,8 @@
copyright = "2024, Justin Xu & Matthew McDermott"
author = "Justin Xu & Matthew McDermott"
-release = "0.2.5"
-version = "0.2.5"
+# release = "0.2.5"
+# version = "0.2.5"
def ensure_pandoc_installed(_):
@@ -256,7 +256,7 @@ def ensure_pandoc_installed(_):
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
-html_title = f"ACES v{version} Documentation"
+html_title = "ACES Documentation"
# A shorter title for the navigation bar. Default is the same as html_title.
html_short_title = "ACES Documentation"
@@ -386,7 +386,7 @@ def ensure_pandoc_installed(_):
# -- Options for EPUB output
epub_show_urls = "footnote"
-print(f"loading configurations for {project} {version} ...", file=sys.stderr)
+print(f"loading configurations for {project} ...", file=sys.stderr)
def setup(app):
diff --git a/docs/source/configuration.md b/docs/source/configuration.md
index 1ad01c2..38fcc99 100644
--- a/docs/source/configuration.md
+++ b/docs/source/configuration.md
@@ -63,6 +63,8 @@ These configs consist of the following four fields:
will be used).
- `value_min_inclusive`: See `value_min`
- `value_max_inclusive`: See `value_max`
+- `other_cols`: This optional field accepts a 1-to-1 dictionary of column names to column values, and can be
+ used to specify further constraints on other columns (ie., not `code`) for this predicate.
A given observation will be gauged to satisfy or fail to satisfy this predicate in one of two ways, depending on its source format.
@@ -191,5 +193,3 @@ to achieve the result. Instead, this bound is always interpreted to be inclusive
the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate `name` in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it to be included.
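+
+For example, a window's `has` dictionary might combine several such bounds as follows (a sketch; the predicate names are hypothetical and must be defined in the configuration):
+
+```yaml
+has:
+  admission: (1, None)   # at least one admission
+  death: (None, 0)       # no deaths
+  _ANY_EVENT: (5, None)  # at least five events of any kind
+```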
-
-______________________________________________________________________
diff --git a/docs/source/index.md b/docs/source/index.md
index a052972..ab38c37 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -38,19 +38,19 @@ If you have a dataset and want to leverage it for machine learning tasks, the AC
- Task-Specific Concepts: Identify the predicates (data concepts) required for your specific machine learning tasks.
- Pre-Defined Criteria: Utilize our pre-defined criteria across various tasks and clinical areas to expedite this process.
-- [PIE-MD](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/criteria): Access our repository of tasks to find relevant predicates!
+- [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main): Access our benchmark of tasks to find relevant predicates!
### III. Set Dataset-Agnostic Criteria
- Standardization: Combine the identified predicates with standardized, dataset-agnostic criteria files.
-- Examples: Refer to the [MIMIC-IV](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/MIMIC-IV) and [eICU](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/eICU) examples for guidance on how to structure your criteria files for your private datasets!
+- Examples: Refer to the [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main/src/MEDS_DEV/tasks/criteria) examples for guidance on how to structure your criteria files for your private datasets!
### IV. Run ACES
-- Run the ACES Command-Line Interface tool (`aces-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://eventstreamaces.readthedocs.io/en/latest/usage.html)!
+- Run the ACES Command-Line Interface tool (`aces-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://eventstreamaces.readthedocs.io/en/latest/usage.html) for more information!
### V. Run MEDS-Tab
-- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!
+- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!
-By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
+By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES and MEDS ecosystems. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. Performing Steps I-V on new datasets in reasonable raw formats should reliably take no more than a week of full-time human effort!
diff --git a/docs/source/notebooks/examples.ipynb b/docs/source/notebooks/examples.ipynb
index feb2e35..4f6b128 100644
--- a/docs/source/notebooks/examples.ipynb
+++ b/docs/source/notebooks/examples.ipynb
@@ -6,8 +6,13 @@
"source": [
"# Task Examples\n",
"\n",
- "Provided below are two examples of mortality prediction tasks that ACES could easily extract subject cohorts for. The configurations have been tested all the provided synthetic data in the repository ([`sample_data/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data)), as well as the MIMIC-IV dataset loaded using MEDS & ESGPT (with very minor changes to the below predicate definition). The configuration files for both of these tasks are provided in the repository ([`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs)), and cohorts can be extracted using the `aces-cli` tool:\n",
- "\n",
+ "Provided below are two examples of mortality prediction tasks that ACES could easily extract subject cohorts for. The configurations have been tested on all the provided synthetic data in the repository ([sample_data/](https://github.com/justin13601/ACES/tree/main/sample_data)), as well as the MIMIC-IV dataset loaded using MEDS & ESGPT (with very minor changes to the below predicate definition). The configuration files for both of these tasks are provided in the repository ([sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs)), and cohorts can be extracted using the `aces-cli` tool:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"aces-cli data.path='/path/to/MIMIC/ESGPT/schema/' data.standard='esgpt' cohort_dir='sample_configs/' cohort_name='...'\n",
"```"
]
@@ -269,6 +274,7 @@
"source": [
"imminent_mortality_cfg_path = f\"{config_path}/imminent_mortality.yaml\"\n",
"cfg = config.TaskExtractorConfig.load(config_path=imminent_mortality_cfg_path)\n",
+ "\n",
"tree = cfg.window_tree\n",
"print_tree(tree)"
]
@@ -279,7 +285,7 @@
"source": [
"## Other Examples\n",
"\n",
- "A few other examples are provided in [`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs) of the repository. We will continue to add task configurations to this folder or to a benchmarking effort for EHR representation learning. More information can be found [here](https://github.com/mmcdermott/PIE_MD/tree/main) - stay tuned!"
+ "A few other examples are provided in [sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs) of the repository. We will continue to add task configurations to [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main), a benchmarking effort for EHR representation learning - stay tuned!"
]
}
],
diff --git a/docs/source/notebooks/predicates.ipynb b/docs/source/notebooks/predicates.ipynb
index 4234987..04daf7b 100644
--- a/docs/source/notebooks/predicates.ipynb
+++ b/docs/source/notebooks/predicates.ipynb
@@ -71,7 +71,7 @@
"source": [
"## Sample Predicates DataFrame\n",
"\n",
- "A sample predicates dataframe is provided in the repository ([`sample_data/sample_data.csv`](https://github.com/justin13601/ACES/blob/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data/sample_data.csv)). This dataframe holds completely synthetic data and was designed such that the accompanying sample configuration files in the repository ([`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs)) could be directly extracted."
+ "A sample predicates dataframe is provided in the repository ([sample_data/sample_data.csv](https://github.com/justin13601/ACES/blob/main/sample_data/sample_data.csv)). This dataframe holds completely synthetic data and was designed such that the accompanying sample configuration files in the repository ([sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs)) could be directly extracted."
]
},
{
@@ -100,7 +100,7 @@
"\n",
"ACES is able to automatically compute the predicates dataframe from your dataset and the fields defined in your task configuration if you are using the MEDS or ESGPT data standard. Should you choose to not transform your dataset into one of these two currently supported standards, you may also navigate the transformation yourself by creating your own predicates dataframe.\n",
"\n",
- "Again, it is acceptable if your own predicates dataframe only contains `plain` predicate columns, as ACES can automatically create `derived` predicate columns from boolean logic in the task configuration file. However, for complex predicates that would be impossible to express (outside of `and/or`) in the configuration file, we direct you to create them manually prior to using ACES. Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see [#47](https://github.com/justin13601/ACES/issues/47)).\n",
+ "Again, it is acceptable if your own predicates dataframe only contains `plain` predicate columns, as ACES can automatically create `derived` predicate columns from boolean logic in the task configuration file. However, for complex predicates that would be impossible to express (outside of `and/or`) in the configuration file, we direct you to create them manually prior to using ACES. Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see [#66](https://github.com/justin13601/ACES/issues/66)).\n",
"\n",
"**Note**: When creating `plain` predicate columns directly, you must still define them in the configuration file (they could be with an arbitrary value in the `code` field) - ACES will verify their existence after data loading (ie., by validating that a column exists with the predicate name in your dataframe). You will also need them for referencing in your windows."
]
@@ -109,7 +109,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Example of the `derived` predicate `discharge_or_death`, expressed as an `or()` relationship between `plain` predicates `discharge` and `death, which have been directly defined (ie., arbitrary values for their codes are present).\n",
+ "Example of the `derived` predicate `discharge_or_death`, expressed as an `or()` relationship between `plain` predicates `discharge` and `death`, which have been directly defined (ie., arbitrary values for their codes, in this case `defined in data`, are present).\n",
"\n",
"```yaml\n",
"predicates:\n",
diff --git a/docs/source/notebooks/tutorial.ipynb b/docs/source/notebooks/tutorial.ipynb
index 97f2eb5..67feef4 100644
--- a/docs/source/notebooks/tutorial.ipynb
+++ b/docs/source/notebooks/tutorial.ipynb
@@ -47,7 +47,7 @@
"source": [
"### Directories\n",
"\n",
- "Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the ESGPT synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in [`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs) and [`sample_data/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data) folders in the project root, respectively."
+ "Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the ESGPT synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in [sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs) and [sample_data/](https://github.com/justin13601/ACES/tree/main/sample_data) folders in the project root, respectively."
]
},
{
diff --git a/docs/source/usage.md b/docs/source/usage.md
index 245fbb3..462282b 100644
--- a/docs/source/usage.md
+++ b/docs/source/usage.md
@@ -149,43 +149,47 @@ Hydra configuration files are leveraged for cohort extraction runs. All fields c
#### Data Configuration
-To set a data standard:
+**To set a data standard**:
-`data.standard`: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'
+***`data.standard`***: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'
-To query from a single MEDS shard:
+**To query from a single MEDS shard**:
-`data.path`: Path to the `.parquet`shard file
+***`data.path`***: Path to the `.parquet` shard file
-To query from multiple MEDS shards, you must set `data=sharded`. Additionally:
+**To query from multiple MEDS shards**, you must set `data=sharded`. Additionally:
-`data.root`: Root directory of MEDS dataset containing shard directories
+***`data.root`***: Root directory of MEDS dataset containing shard directories
-`data.shard`: Expression specifying MEDS shards (`$(expand_shards /)`)
+***`data.shard`***: Expression specifying MEDS shards using [expand_shards](https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py) (`$(expand_shards <shards_dir>/<num_shards>)`)
-To query from an ESGPT dataset:
+**To query from an ESGPT dataset**:
-`data.path`: Directory of the full ESGPT dataset
+***`data.path`***: Directory of the full ESGPT dataset
-To query from a direct predicates dataframe:
+**To query from a direct predicates dataframe**:
-`data.path` Path to the `.csv` or `.parquet` file containing the predicates dataframe
+***`data.path`***: Path to the `.csv` or `.parquet` file containing the predicates dataframe
-`data.ts_format`: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
+***`data.ts_format`***: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
#### Task Configuration
-`cohort_dir`: Directory of your task configuration file
+***`cohort_dir`***: Directory of your task configuration file
+
+***`cohort_name`***: Name of the task configuration file
+
+The above two fields are used in the defaults below for automatically loading task configurations, saving results, and logging:
-`cohort_name`: Name of the task configuration file
+***`config_path`***: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`
-The above two fields are used for automatically loading task configurations, saving results, and logging:
+***`output_filepath`***: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise
-`config_path`: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`
+***`log_dir`***: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
-`output_filepath`: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise
+Additionally, predicates may be specified in a separate predicates configuration file and loaded for overrides:
-`log_dir`: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
+***`predicates_path`***: Path to the [separate predicates-only file](https://eventstreamaces.readthedocs.io/en/latest/usage.html#separate-predicates-only-file). Defaults to null
#### Tab Completion
@@ -257,6 +261,8 @@ For example, to query an in-hospital mortality task on the sample data (both the
>>> query.query(cfg=cfg, predicates_df=predicates_df)
```
+### Separate Predicates-Only File
+
For more complex tasks involving a large number of predicates, a separate predicates-only "database" file can be created and passed into `TaskExtractorConfig.load()`. Only referenced predicates will have a predicate column computed and evaluated, so one could create a dataset-specific deposit file with many predicates and
@@ -266,4 +272,8 @@ reference as needed to ensure the cleanliness of the dataset-agnostic task crite
>>> cfg = config.TaskExtractorConfig.load(config_path="criteria.yaml", predicates_path="predicates.yaml")
```
+If the same predicates are defined in both the task configuration file and the predicates-only file, the
+predicates-only definition takes precedence and will be used to override previous definitions. As such, one may
+create a predicates-only "database" file for a particular dataset, and override accordingly for various tasks.
+
______________________________________________________________________
diff --git a/sample_data/meds_sample/held_out/0.parquet b/sample_data/meds_sample/held_out/0.parquet
new file mode 100644
index 0000000..5c71d98
Binary files /dev/null and b/sample_data/meds_sample/held_out/0.parquet differ
diff --git a/sample_data/meds_sample/sample_shard.parquet b/sample_data/meds_sample/sample_shard.parquet
index af1b81f..5c71d98 100644
Binary files a/sample_data/meds_sample/sample_shard.parquet and b/sample_data/meds_sample/sample_shard.parquet differ
diff --git a/sample_data/meds_sample/test/0.parquet b/sample_data/meds_sample/test/0.parquet
deleted file mode 100644
index af1b81f..0000000
Binary files a/sample_data/meds_sample/test/0.parquet and /dev/null differ
diff --git a/sample_data/meds_sample/train/0.parquet b/sample_data/meds_sample/train/0.parquet
index ee91456..2f90ac3 100644
Binary files a/sample_data/meds_sample/train/0.parquet and b/sample_data/meds_sample/train/0.parquet differ
diff --git a/sample_data/meds_sample/train/1.parquet b/sample_data/meds_sample/train/1.parquet
index 88be651..98e4ee7 100644
Binary files a/sample_data/meds_sample/train/1.parquet and b/sample_data/meds_sample/train/1.parquet differ
diff --git a/src/aces/config.py b/src/aces/config.py
index 9ab7c87..45d76d8 100644
--- a/src/aces/config.py
+++ b/src/aces/config.py
@@ -1273,6 +1273,9 @@ def load(cls, config_path: str | Path, predicates_path: str | Path = None) -> Ta
referenced_predicates = {pred for w in windows.values() for pred in w.referenced_predicates}
referenced_predicates.add(trigger.predicate)
+ # Ensure predicates referenced only as window labels are also collected and validated
+ label_reference = [w.label for w in windows.values() if w.label]
+ if label_reference:
+ referenced_predicates.update(set(label_reference))
current_predicates = set(referenced_predicates)
special_predicates = {ANY_EVENT_COLUMN, START_OF_RECORD_KEY, END_OF_RECORD_KEY}
for pred in current_predicates - special_predicates:
diff --git a/src/aces/configs/__init__.py b/src/aces/configs/__init__.py
index 6f2de57..03ac28f 100644
--- a/src/aces/configs/__init__.py
+++ b/src/aces/configs/__init__.py
@@ -37,9 +37,17 @@
(`.csv` or `.parquet`) if using `direct`
- standard (required): data standard, one of 'meds', 'esgpt', or 'direct'
- ts_format (required if data.standard is 'direct'): timestamp format for the data
+ - root (required, applicable when data=sharded): root directory for the data shards
+ - shard (required, applicable when data=sharded): identifier of the specific MEDS shard to query.
+
+ Note: data.shard can be expanded using the `expand_shards` function. Please refer to
+ https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
+ https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
+
cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
+predicates_path (optional): path to a separate predicates-only configuration file for overriding
output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
---------------- Default Config ----------------
diff --git a/src/aces/configs/_aces.yaml b/src/aces/configs/_aces.yaml
index a28bc2b..8d957de 100644
--- a/src/aces/configs/_aces.yaml
+++ b/src/aces/configs/_aces.yaml
@@ -55,9 +55,17 @@ hydra:
(`.csv` or `.parquet`) if using `direct`
- standard (required): data standard, one of 'meds', 'esgpt', or 'direct'
- ts_format (required if data.standard is 'direct'): timestamp format for the data
+ - root (required, applicable when data=sharded): root directory for the data shards
+ - shard (required, applicable when data=sharded): identifier of the specific MEDS shard to query.
+
+ Note: data.shard can be expanded using the `expand_shards` function. Please refer to
+ https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
+ https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
+
cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
+ predicates_path (optional): path to a separate predicates-only configuration file for overriding
output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
---------------- Default Config ----------------