Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] Support yield in flat_map #49900

Closed
richardliaw opened this issue Jan 16, 2025 · 4 comments
Closed

[data] Support yield in flat_map #49900

richardliaw opened this issue Jan 16, 2025 · 4 comments
Assignees
Labels
data Ray Data-related issues enhancement Request for new feature and/or capability P2 Important issue, but not time-critical

Comments

@richardliaw
Copy link
Contributor

Description

Currently, only map_batches supports generators, but it would be really useful for map to also support generators as well.

Use case

I have a use case where it is easier to express logic as map (i.e., input row: list of files).

However, if I expand the list of files, i will end up with an explosion of data (each file is 5GB, list of 100 is 500GB).

So ideally I can yield the output along the way to avoid blowing up heap, and not starving the rest of the pipeline

@richardliaw richardliaw added data Ray Data-related issues enhancement Request for new feature and/or capability P2 Important issue, but not time-critical labels Jan 16, 2025
@richardliaw richardliaw self-assigned this Jan 16, 2025
@richardliaw
Copy link
Contributor Author

@xingyu-long will take a stab at this to start

@xingyu-long
Copy link
Contributor

@richardliaw thanks! will take this.

@richardliaw richardliaw changed the title [data] Support yield in map [data] Support yield in flat_map Jan 20, 2025
@xingyu-long
Copy link
Contributor

xingyu-long commented Jan 22, 2025

after did some investigation, it seems flat_map already supported this (note from slack conversation with @richardliaw )

In [38]: def fn(val: Dict[str, int]) -> Dict[str, int]:
    ...:     for i in range(val["id"]):
    ...:         yield {"hi": 1, "id": val["id"]}
    ...:

In [39]: ds = ray.data.range(5).flat_map(fn)

In [40]: ds.take()
✔️  Dataset execution finished in 0.05 seconds: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 10.0/10.0 [00:00<00:00, 197 row/s]
- ReadRange->SplitBlocks(4): Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 0.0B object store: 100%|██████████████████████████████████████████| 5.00/5.00 [00:00<00:00, 117 row/s]
- FlatMap(fn): Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 0.0B object store: 100%|████████████████████████████████████████████████████████| 10.0/10.0 [00:00<00:00, 233 row/s]
- limit=20: Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 160.0B object store: 100%|█████████████████████████████████████████████████████████| 10.0/10.0 [00:00<00:00, 236 row/s]
Out[40]:
[{'hi': 1, 'id': 1},
 {'hi': 1, 'id': 2},
 {'hi': 1, 'id': 2},
 {'hi': 1, 'id': 3},
 {'hi': 1, 'id': 3},
 {'hi': 1, 'id': 3},
 {'hi': 1, 'id': 4},
 {'hi': 1, 'id': 4},
 {'hi': 1, 'id': 4},
 {'hi': 1, 'id': 4}]

it seems i cannot close this issue @richardliaw could you close this? thanks!

@richardliaw
Copy link
Contributor Author

Thanks for taking a look!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues enhancement Request for new feature and/or capability P2 Important issue, but not time-critical
Projects
None yet
Development

No branches or pull requests

2 participants