-
Hi @rwightman, thanks for the continuous good work. I am playing a bit with the WebDataset format, utilizing some of the methods in: To be honest, I am having second thoughts about whether this is indeed a viable general-use format for datasets (from small to very large); maybe you could shed some light.
With this condition (applying it only to the train set), I am getting an unbounded test set: the iteration never stops.
It would be better to add an assert: P.S.: I think there might also be an issue with that implementation, since they don't do: Thanks,
-
@mrT23 hello hello, I think it is a viable general-use format, but I'd hesitate to say from small; rather, it's useful from mid-size to very, very large. In larger-scale training, WDS with appropriately sized shards scales well as you increase the number of nodes, because each training process is allocated a subset of the shards and reads them sequentially. With random access you will quickly overwhelm shared storage.
Combining small datasets with any sort of distributed training and/or a number of dataloader workers is problematic given the limitations of IterableDataset. Each of the world_size * workers readers is completely independent, but for distributed training you have the constraint that ALL such workers should produce the same number of batches. To keep my solution simple/sane, I decided the best approach was to enable wrap for the underlying iterator and round num_samples up past the amount that would (likely, but not guaranteed) cover all samples across all shards, at the cost of a bit of duplication on the wrap. You can also round down (I was going to add a floor option), in which case you'd end up missing some samples and cutting iteration short. Keep in mind you can't let the iterator hit the end naturally for each independent worker, as you'd end up with a different number of batches produced across train processes and get out of sync, unless you were doing process-to-process communication, which would be rather complex.
Re your questions: you will get 'stuck' if you don't have enough shards to distribute across all workers though. So yeah, that assert from OpenCLIP should be in there (rough sketch below)...
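To make the two points above concrete, here is a minimal sketch of the bookkeeping, not the exact timm/OpenCLIP code; the function name and the rounding details are illustrative assumptions:

```python
import math

def padded_samples_per_epoch(num_samples, batch_size, world_size, num_workers, num_shards):
    """Sketch of the shard/sample bookkeeping described above (illustrative, not the timm implementation)."""
    total_workers = world_size * max(num_workers, 1)

    # The OpenCLIP-style assert: every (process, dataloader-worker) pair needs at
    # least one shard, otherwise some workers have nothing to iterate and training stalls.
    assert num_shards >= total_workers, (
        f"Need at least {total_workers} shards for {world_size} processes x "
        f"{num_workers} workers, got {num_shards}."
    )

    # Round num_samples UP so every worker yields the same number of batches
    # (wrapping the underlying iterator duplicates a few samples), instead of
    # letting each worker hit the end of its shards naturally and fall out of sync.
    global_batch_size = batch_size * world_size
    num_batches = math.ceil(num_samples / global_batch_size)
    num_batches = math.ceil(num_batches / max(num_workers, 1)) * max(num_workers, 1)
    return num_batches * global_batch_size
```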
-
I've been working on improving the dataset readers lately and was doing some tests (on a 3x 3090 machine I have at home). Glossary: Commands:
Results (im/sec first epoch, im/sec second epoch):
-
As you can see, TFDS is the most efficient; it does the handling of shard iteration, including the JPEG decode, in multi-threaded native code, so it has the lowest CPU overhead. WDS is pretty decent, but at lower node/process counts the overhead is visible; however, if you scale up the model and node count it's no longer visible. HF datasets could use some improvement. It's easy to use and convenient, and I'm pushing for some improved support for sharded training (with the needed shuffle & seed support).
The local filesystem works well at this scale but falls apart as you scale up (unless you can copy your dataset to local SSD drives on every node, which is not feasible in all setups or at large data scale). Individual file access on NAS / remote shares does not work well, although my NAS has good caching, so after one epoch all of ImageNet is 'hot'.
WDS and TFDS also work well with cloud storage. TFDS works very smoothly with tfrecords in gs:// buckets; it's less easy to get s3 to work well. WDS seems to be a bit more reliable piping from S3 buckets (have done that on 800 8x GPU nodes for CLIP training), and gs:// is a bit less reliable (the command line util pipe breaks sometimes).
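For reference, piping shards from object storage with WebDataset typically looks something like the sketch below; the bucket and shard names are placeholders, and it assumes the aws CLI is installed and configured:

```python
import webdataset as wds

# Each worker shells out to the aws CLI and streams its assigned .tar shards over stdout.
# The {000000..000999} brace expansion is WebDataset's shard-list syntax.
url = "pipe:aws s3 cp s3://my-bucket/train-{000000..000999}.tar -"  # placeholder bucket/shard names
dataset = (
    wds.WebDataset(url)
    .shuffle(1000)                 # sample-level shuffle buffer
    .decode("pil")                 # decode image files to PIL
    .to_tuple("jpg;png", "cls")    # (image, label) pairs keyed by file extension
)
```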
-
Also on this topic, PyTorch Data is shaping up to be usable https://github.com/pytorch/data ... I'm not sure if it covers my distributed train needs but it looks like it's getting close. There is support for WDS-style tar file shards (and I think iterating over samples in any tar). I will likely experiment with it soon if I get confirmation from the authors that it covers my use cases...
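Reading samples out of tar shards with torchdata looks roughly like this; a sketch against the datapipes API, where the FileLister/FileOpener/load_from_tar names may shift between versions and the shard directory is a placeholder:

```python
from torchdata.datapipes.iter import FileLister, FileOpener

# List local .tar shards (placeholder path), open them as binary streams,
# then iterate over the individual members inside each tar archive.
shards = FileLister(root="/data/shards", masks="*.tar")
streams = FileOpener(shards, mode="b")
samples = streams.load_from_tar()  # yields (inner_path, file_stream) per tar member

for path, stream in samples:
    if path.endswith(".jpg"):
        img_bytes = stream.read()
        # decode the image / pair it with the matching label file here
```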