[data] support bucket batch #49844

qmpzzpmq · 2025-01-15T07:32:19Z

Description

https://discuss.ray.io/t/how-to-bucket-batch-requests-on-serve/12287

Is it possible to add a bucket batching feature to Ray Data?

You can find examples here:
https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/samplers/bucket_batch_sampler.html
https://pytorch.org/data/0.4/generated/torchdata.datapipes.iter.MaxTokenBucketizer.html

Use case

For sequence-related tasks.

richardliaw · 2025-01-15T23:01:41Z

Hi @qmpzzpmq could you share an example of what you're trying to do?

qmpzzpmq · 2025-01-16T06:22:03Z

Hi @richardliaw, let's say I have a dataset like
['1', '11', '1', '1111', '111', '1', '11', '11', '111']
I would like to batch it like
[['1', '1', '1', '11'], ['11', '11'], ['111'], ['111'], ['1111']].
In each batch, the total length of the items is no higher than 5. This is very common in seq2seq tasks, since it helps make maximum use of GPU memory.

The example I used can actually be found at https://pytorch.org/data/0.4/generated/torchdata.datapipes.iter.MaxTokenBucketizer.html.

You can also find a related function here: https://espnet.github.io/espnet/guide/espnet/utils/make_batchset.html
Be careful with the parameters batch_bins and batch_frame_XX
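
To make the request concrete, the batching described above can be sketched in plain Python. This is only an illustration of the desired behavior, not an existing Ray Data API and not how MaxTokenBucketizer or make_batchset are implemented. The function name `bucket_batch` and the parameter `max_total_len` are hypothetical, and the sketch assumes the whole dataset fits in memory so it can be sorted by length first.

```python
from typing import Iterable, List, Sequence


def bucket_batch(items: Iterable[Sequence], max_total_len: int) -> List[list]:
    """Group items into batches whose summed length stays <= max_total_len.

    Hypothetical helper for illustration only. An item longer than
    max_total_len ends up alone in its own batch.
    """
    batches, current, current_len = [], [], 0
    # Sorting by length keeps similarly sized sequences in the same batch.
    for item in sorted(items, key=len):
        # Close the current batch if adding this item would exceed the budget.
        if current and current_len + len(item) > max_total_len:
            batches.append(current)
            current, current_len = [], 0
        current.append(item)
        current_len += len(item)
    if current:
        batches.append(current)
    return batches


data = ['1', '11', '1', '1111', '111', '1', '11', '11', '111']
batches = bucket_batch(data, max_total_len=5)
print(batches)
# [['1', '1', '1', '11'], ['11', '11'], ['111'], ['111'], ['1111']]
```

With a budget of 5, this greedy pass over the length-sorted items reproduces the batches listed above. A streaming version would need a bounded buffer instead of a full sort.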

richardliaw · 2025-01-16T23:06:57Z

Is there an example dataset that I can work with?

qmpzzpmq added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 15, 2025

qmpzzpmq changed the title ~~[<Ray component: data>]~~ [<Ray component: data>] bucket batch Jan 15, 2025

richardliaw changed the title ~~[<Ray component: data>] bucket batch~~ [data] support bucket batch Jan 15, 2025
