-
Hi @rwightman, thanks for the continuous good work. I am playing a bit with the WebDataset format, utilizing some of the methods in: To be honest, I am having second thoughts about whether this is indeed a viable general-use format for datasets (from small to very large); maybe you could shed some light.
With this condition (applying it only to the train set), I am getting an unbounded test set: the iteration never stops.
It would be better to add an assert: P.S.: I think there might also be an issue with that implementation, since they don't do: Thanks,
-
@mrT23 hello hello, I think it is a viable general-use format, but I'd hesitate to say from small; rather, it's useful from mid-size to very, very large. In larger-scale training, WDS with appropriately sized shards scales well as you increase the number of nodes, because each training process is allocated a subset of the shards and reads them sequentially. With random access you will quickly overwhelm shared storage.
Combining small datasets with any sort of distributed training and/or a number of dataloader workers is problematic given the limitations of IterableDataset. Each of the world_size * workers readers is completely independent, but for distributed training you have the constraint that ALL such workers should produce the same number of batches. To keep my solution simple/sane, I decided the best approach was to enable wrap for the underlying iterator and round num_samples up past the amount that would (likely, but not guaranteed) cover all samples across all shards, at the cost of a bit of duplication on the wrap. You can also round down (I was going to add a floor option), in which case you'd end up missing some samples and cutting iteration short. Keep in mind you can't let the iterator hit the end naturally for each independent worker, as you'd end up with a different number of batches produced across train processes and get out of sync, unless you were doing process-to-process communication, which would be rather complex.
Re your questions: you will get 'stuck' if you don't have enough shards to distribute across all workers though. So yeah, that assert from OpenCLIP should be in there (rough sketch below)...
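To make the two points above concrete, here is a minimal sketch of the bookkeeping, not the exact timm/OpenCLIP code; the function name and the rounding details are illustrative assumptions:

```python
import math

def padded_samples_per_epoch(num_samples, batch_size, world_size, num_workers, num_shards):
    """Sketch of the shard/sample bookkeeping described above (illustrative, not the timm implementation)."""
    total_workers = world_size * max(num_workers, 1)

    # The OpenCLIP-style assert: every (process, dataloader-worker) pair needs at
    # least one shard, otherwise some workers have nothing to iterate and training stalls.
    assert num_shards >= total_workers, (
        f"Need at least {total_workers} shards for {world_size} processes x "
        f"{num_workers} workers, got {num_shards}."
    )

    # Round num_samples UP so every worker yields the same number of batches
    # (wrapping the underlying iterator duplicates a few samples), instead of
    # letting each worker hit the end of its shards naturally and fall out of sync.
    global_batch_size = batch_size * world_size
    num_batches = math.ceil(num_samples / global_batch_size)
    num_batches = math.ceil(num_batches / max(num_workers, 1)) * max(num_workers, 1)
    return num_batches * global_batch_size
```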
-
I've been working on improving the dataset readers lately and was doing some tests (on a 3x 3090 machine I have at home). Glossary: Commands:
Results (im/sec first epoch, im/sec second epoch):
-
As you can see, TFDS is the most efficient; it does the handling of shard iteration, including the JPEG decode, in multi-threaded native code, so it has the lowest CPU overhead. WDS is pretty decent, but at lower node/process counts the overhead is visible; however, if you scale up the model and node count it's no longer visible. HF datasets could use some improvement. It's easy to use and convenient, and I'm pushing for some improved support for sharded training (with the needed shuffle & seed support).
The local filesystem works well at this scale but falls apart as you scale up (unless you can copy your dataset to local SSD drives on every node, which is not feasible in all setups or at large data scale). Individual file access on NAS / remote shares does not work well, although my NAS has good caching, so after one epoch all of ImageNet is 'hot'.
WDS and TFDS also work well with cloud storage. TFDS works very smoothly with tfrecords in gs:// buckets; it's less easy to get s3 to work well. WDS seems to be a bit more reliable piping from S3 buckets (have done that on 800 8x GPU nodes for CLIP training), and gs:// is a bit less reliable (the command line util pipe breaks sometimes).
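For reference, piping shards from object storage with WebDataset typically looks something like the sketch below; the bucket and shard names are placeholders, and it assumes the aws CLI is installed and configured:

```python
import webdataset as wds

# Each worker shells out to the aws CLI and streams its assigned .tar shards over stdout.
# The {000000..000999} brace expansion is WebDataset's shard-list syntax.
url = "pipe:aws s3 cp s3://my-bucket/train-{000000..000999}.tar -"  # placeholder bucket/shard names
dataset = (
    wds.WebDataset(url)
    .shuffle(1000)                 # sample-level shuffle buffer
    .decode("pil")                 # decode image files to PIL
    .to_tuple("jpg;png", "cls")    # (image, label) pairs keyed by file extension
)
```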
-
Also on this topic, PyTorch Data is shaping up to be usable https://github.com/pytorch/data ... I'm not sure if it covers my distributed train needs but it looks like it's getting close. There is support for WDS-style tar file shards (and I think iterating over samples in any tar). I will likely experiment with it soon if I get confirmation from the authors that it covers my use cases...
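Reading samples out of tar shards with torchdata looks roughly like this; a sketch against the datapipes API, where the FileLister/FileOpener/load_from_tar names may shift between versions and the shard directory is a placeholder:

```python
from torchdata.datapipes.iter import FileLister, FileOpener

# List local .tar shards (placeholder path), open them as binary streams,
# then iterate over the individual members inside each tar archive.
shards = FileLister(root="/data/shards", masks="*.tar")
streams = FileOpener(shards, mode="b")
samples = streams.load_from_tar()  # yields (inner_path, file_stream) per tar member

for path, stream in samples:
    if path.endswith(".jpg"):
        img_bytes = stream.read()
        # decode the image / pair it with the matching label file here
```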