Possible expand_shards documentation typo #132
base: main
Conversation
`expand_shards` should take the same input as `data.root` in the documentation, right?
Just checked, I think both are correct and should probably be included? The config sets the path to `${data.root}/${data.shard}.parquet`.

Case 1: MEDS files like `dir/train/0.parquet`, `dir/train/1.parquet`, and `dir/held_out/0.parquet`.

Case 2: We also added support for just a simple directory of parquet files like `dir/0.parquet`, `dir/1.parquet`, and `dir/2.parquet`.

If this seems right I'll add both.
@@ -212,7 +212,7 @@ aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data.path="baz.p
A MEDS dataset can have multiple shards, each stored as a `.parquet` file containing a subset of the full dataset. We can make use of Hydra's launchers and multi-run (`-m`) capabilities to start an extraction job for each shard (`data=sharded`), either in series or in parallel (e.g., using `joblib`, or `submitit` for Slurm). To load data with multiple shards, a data root needs to be provided, along with an expression containing a comma-delimited list of files, one per shard. We provide a function `expand_shards` to do this, which accepts a sequence representing `<shards_location>/<number_of_shards>`. It also accepts a directory, in which case all `.parquet` files in that directory and its subdirectories will be included.
```diff
-aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data=sharded data.root="baz/" "data.shard=$(expand_shards qux/#)" -m
+aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data=sharded data.root="baz/" "data.shard=$(expand_shards baz/)" -m
```
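The two input forms `expand_shards` is documented to accept (a `<shards_location>/<number_of_shards>` count expression, or a plain directory of parquet files) can be sketched roughly as follows. This is a hypothetical reimplementation for illustration only, not the actual ACES source:

```python
from pathlib import Path


def expand_shards(*args: str) -> str:
    """Hypothetical sketch: each argument is either
    '<shards_location>/<number_of_shards>' (e.g. 'train/2' -> 'train/0,train/1')
    or a directory, in which case every .parquet file under it (recursively)
    is listed relative to that directory, without its extension."""
    shards = []
    for arg in args:
        location, _, count = arg.rpartition("/")
        if count.isdigit() and location:
            # Count form: enumerate shard indices 0 .. count-1.
            shards.extend(f"{location}/{i}" for i in range(int(count)))
        else:
            # Directory form: collect all .parquet files under the directory.
            root = Path(arg)
            shards.extend(
                str(p.relative_to(root).with_suffix(""))
                for p in sorted(root.rglob("*.parquet"))
            )
    return ",".join(shards)


print(expand_shards("train/2", "held_out/1"))
# train/0,train/1,held_out/0
```

Under this sketch, `expand_shards qux/#` (count form) and `expand_shards baz/` (directory form) would both produce a comma-delimited shard list suitable for `data.shard`.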
I think both are correct? The config sets the path to `${data.root}/${data.shard}.parquet`.

Case 1: MEDS files like `dir/train/0.parquet`, `dir/train/1.parquet`, and `dir/held_out/0.parquet`. Then `data.root` would be `dir/` and `expand_shards` would be called using `expand_shards("train/2", "held_out/0")`.

Case 2: We also added support for just a simple directory of parquet files like `dir/0.parquet`, `dir/1.parquet`, and `dir/2.parquet`. So `data.root` would be `dir/` and `expand_shards` would be called using `expand_shards("dir/")`.

So `qux/#` or `baz/` are both supported?
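To make the two cases concrete, here is a small illustration (all paths are hypothetical examples) of how the config template `${data.root}/${data.shard}.parquet` resolves for each:

```python
# Illustration only: hypothetical example paths showing how the config
# template "${data.root}/${data.shard}.parquet" resolves in each case.
data_root = "dir"

# Case 1: MEDS-style layout -- shard names are relative to the root,
# e.g. as produced by expand_shards("train/2", "held_out/1").
case1 = [f"{data_root}/{s}.parquet" for s in ["train/0", "train/1", "held_out/0"]]
print(case1)
# ['dir/train/0.parquet', 'dir/train/1.parquet', 'dir/held_out/0.parquet']

# Case 2: flat directory of parquet files,
# e.g. as produced by expand_shards("dir/").
case2 = [f"{data_root}/{s}.parquet" for s in ["0", "1", "2"]]
print(case2)
# ['dir/0.parquet', 'dir/1.parquet', 'dir/2.parquet']
```

Either way, the same interpolation yields valid shard paths, which is why both invocation styles can coexist.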
@Oufattole can we merge or close this? Not sure which would be more appropriate, but I'd like to get it closed out rather than leaving it indefinitely open. Thanks!
Summary by CodeRabbit

- Updated the command-line usage example for the `aces-cli` tool to reflect the correct directory for loading MEDS dataset shards.
- Clarified the `data.shard` parameter to ensure users reference the correct location for their dataset shards.