[data] support bucket batch #49844

Open
qmpzzpmq opened this issue Jan 15, 2025 · 3 comments
Labels
enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@qmpzzpmq

Description

https://discuss.ray.io/t/how-to-bucket-batch-requests-on-serve/12287

Would it be possible to add a bucket batching feature to Ray Data?

Examples of existing implementations:
https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/samplers/bucket_batch_sampler.html
https://pytorch.org/data/0.4/generated/torchdata.datapipes.iter.MaxTokenBucketizer.html

Use case

Sequence-related tasks.

@qmpzzpmq added the enhancement and triage labels Jan 15, 2025
@qmpzzpmq changed the title from "[<Ray component: data>]" to "[<Ray component: data>] bucket batch" Jan 15, 2025
@richardliaw changed the title from "[<Ray component: data>] bucket batch" to "[data] support bucket batch" Jan 15, 2025
@richardliaw
Contributor

Hi @qmpzzpmq could you share an example of what you're trying to do?

@qmpzzpmq
Author

Hi @richardliaw, let's say I have a dataset like
['1', '11', '1', '1111', '111', '1', '11', '11', '111']
I would like to batch the elements like
[['1', '1', '1', '11'], ['11', '11'], ['111'], ['111'], ['1111']].
In each batch, the total length of the elements is at most 5. This is very common in seq2seq tasks, since it helps make maximum use of GPU memory.
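
To make the request concrete, here is a minimal plain-Python sketch of the behavior described above. It is not Ray Data or torchdata code. The function name `bucket_batches` and its parameters `max_tokens` and `len_fn` are hypothetical, chosen only for illustration. It assumes the whole dataset fits in memory so it can be sorted by length, and it assumes an element longer than the limit should raise an error rather than be dropped.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def bucket_batches(
    items: Iterable[T],
    max_tokens: int,
    len_fn: Callable[[T], int] = len,
) -> List[List[T]]:
    """Group items into batches whose summed length is at most max_tokens.

    Hypothetical helper for illustration only: items are sorted by length,
    then packed greedily in that order.
    """
    batches: List[List[T]] = []
    current: List[T] = []
    current_tokens = 0

    # Sorting by length keeps similarly sized sequences in the same batch.
    for item in sorted(items, key=len_fn):
        n = len_fn(item)
        if n > max_tokens:
            # Assumption: oversized elements are an error, not silently dropped.
            raise ValueError(f"element of length {n} exceeds max_tokens={max_tokens}")
        # Close the current batch if adding this item would exceed the budget.
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(item)
        current_tokens += n

    # Emit the final, possibly partial, batch.
    if current:
        batches.append(current)
    return batches


data = ['1', '11', '1', '1111', '111', '1', '11', '11', '111']
print(bucket_batches(data, max_tokens=5))
# [['1', '1', '1', '11'], ['11', '11'], ['111'], ['111'], ['1111']]
```

A built-in version would presumably need to work on a stream with a bounded buffer instead of a full sort, which is the part I cannot easily do myself on top of the existing batching.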

The example I used actually comes from https://pytorch.org/data/0.4/generated/torchdata.datapipes.iter.MaxTokenBucketizer.html.

You can also find a related function here: https://espnet.github.io/espnet/guide/espnet/utils/make_batchset.html
Pay particular attention to the parameters batch_bins and batch_frame_XX.

@richardliaw
Contributor

Is there an example dataset that I can work with?

Projects
None yet
Development

No branches or pull requests

2 participants