Is webdataset a viable format for general-use? #1524

Answered by rwightman
mrT23 asked this question in General
@mrT23 hello, I think it is a viable general-use format, though I'd hesitate to recommend it for small datasets; it's most useful from mid-size to very, very large. In larger-scale training, WDS with appropriately sized shards scales well as you increase the number of nodes, since each training process is allocated a subset of the shards and reads them sequentially. With random access you will quickly overwhelm shared storage.
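To make the shard allocation concrete, here is a minimal sketch of how a shard list might be divided so that each training process, and each dataloader worker inside it, gets its own disjoint subset to read sequentially. The stride-based split, the function name `shards_for_worker`, and the shard file names are assumptions for illustration only, not webdataset's actual implementation.

```python
def shards_for_worker(shards, rank, world_size, worker_id, num_workers):
    """Return the shards one dataloader worker in one training process reads.

    Illustrative stride-based split (an assumption, not webdataset's code):
    first divide the shard list across processes, then across the workers
    of that process.
    """
    per_rank = shards[rank::world_size]
    return per_rank[worker_id::num_workers]


# Hypothetical shard names, 16 shards in total.
shards = [f"train-{i:04d}.tar" for i in range(16)]

# 2 training processes with 2 dataloader workers each.
assignment = {
    (rank, worker): shards_for_worker(shards, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
}

for key, subset in assignment.items():
    print(key, subset)
```

Every shard lands in exactly one subset, so no two readers touch the same file, and each reader streams its files front to back rather than issuing random reads against shared storage.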

Combining small datasets with any sort of distributed training and/or multiple dataloader workers is problematic given the limitations of IterableDataset. Each of the world_size * workers data streams is completely independent. But for distributed you have the limitation where the batches produce…
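As a rough illustration of why small datasets are awkward here, the sketch below counts how many shards each of the world_size * workers independent streams would receive when there are fewer shards than streams. The stride-based split and the helper name `stream_shard_counts` are hypothetical; the real behaviour depends on how the pipeline is configured.

```python
def stream_shard_counts(num_shards, world_size, num_workers):
    """Count shards per independent stream under a simple stride split.

    Hypothetical helper: one stream per (rank, worker) pair, so there are
    world_size * num_workers streams in total.
    """
    counts = []
    for rank in range(world_size):
        per_rank = range(num_shards)[rank::world_size]
        for worker in range(num_workers):
            counts.append(len(per_rank[worker::num_workers]))
    return counts


# A small dataset: 10 shards, 4 processes, 4 dataloader workers each.
counts = stream_shard_counts(num_shards=10, world_size=4, num_workers=4)
print("streams:", len(counts))
print("shards per stream:", counts)
print("streams with no data:", counts.count(0))
```

With these numbers there are more streams than shards, so some workers have nothing to read, and because the streams do not coordinate with one another, nothing rebalances the load automatically.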

Replies: 4 comments 2 replies

Answer selected by mrT23