
Wrong judgement of batch_sampler in prepare_data_loader #2091

Closed
2 of 4 tasks
XuHwang opened this issue Oct 27, 2023 · 6 comments · Fixed by #2097


XuHwang commented Oct 27, 2023

System Info

- `Accelerate` version: 0.24.0
- Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
- Python version: 3.9.0
- Numpy version: 1.26.1
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 503.79 GB
- GPU type: Tesla V100-PCIE-32GB
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I defined my own batch sampler, which yields a batch of indices on each iteration. It works well when passed as the batch_sampler argument to torch.utils.data.DataLoader, and it also worked after wrapping the dataloader with accelerate.data_loader.prepare_data_loader() in accelerate==0.20.3. After upgrading to the latest version, 0.24.0, it raises AttributeError: 'MySampler' object has no attribute 'sampler'. The code snippets are pasted below.

I have compared the source code of accelerate/data_loader.py between versions 0.20.3 and 0.24.0. The main difference relevant to this bug is the sampler_is_batch_sampler variable in the prepare_data_loader() function. In version 0.20.3 (line 718) it is hard-coded to False, while in version 0.24.0 (line 834) it is set as sampler_is_batch_sampler = isinstance(dataloader.sampler, BatchSampler). That condition does not cover my case, where no sampler is passed to the DataLoader and batch_sampler is set to my own sampler, which has no sampler attribute.
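
For clarity, here is a condensed paraphrase of the check in question. This is not the exact 0.24.0 source, only the shape of the logic around lines 834-838 (BatchSampler is torch.utils.data.BatchSampler):

sampler_is_batch_sampler = isinstance(dataloader.sampler, BatchSampler)
if sampler_is_batch_sampler:
    sampler = dataloader.sampler.sampler
else:
    # dataloader.sampler is not a BatchSampler when only batch_sampler is
    # passed, so this branch runs; a batch_sampler written from scratch has
    # no inner `.sampler`, and the attribute access raises AttributeError.
    sampler = dataloader.batch_sampler.sampler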

import accelerate
import numpy as np
from torch.utils.data import Sampler, DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, data) -> None:
        super().__init__()
        self.data = data
    
    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class MySampler:
    """
    BaseSampler is an iterator which could be used as a `batch_sampler` in DataLoader.
    It iterates a batch of sample index each time. And BaseSampler could only handle uniformly
    sampling. It works with `num_workers` in DataLoader because each worker aims to load
    a batch of samples each time.
    """
    def __init__(self, dataset_length: int, batch_size: int, shuffle:bool=True) -> None:
        self.batch_size = batch_size
        self.data_index = np.arange(dataset_length)
        self.shuffle = shuffle

    def __iter__(self):
        batch_num = len(self)
        if self.shuffle:
            index = np.random.permutation(self.data_index)
        else:
            index = self.data_index

        output = np.array_split(index, batch_num)
        yield from output


    def __len__(self):
        return (len(self.data_index) + self.batch_size - 1) // self.batch_size



dataset = MyDataset(np.arange(10000))

batch_sampler = MySampler(len(dataset), 32)
dataloader_pt = DataLoader(dataset, batch_sampler=batch_sampler)

# Original PyTorch DataLoader, no problem. PyTorch version 1.13.1
for d in dataloader_pt:
    print(d)
    break

# Accelerate DataLoader, raises AttributeError. Accelerate version 0.24.0
dataloader_al = accelerate.data_loader.prepare_data_loader(dataloader_pt)
for d in dataloader_al:
    print(d)
    break

The output of the snippets is:

tensor([3851, 7922, 4125, 2075, 4539, 6159, 9525, 8622,  967, 3022, 7877, 9807,
        5243, 3136, 1554, 5355, 4284, 3041, 5014, 4597, 7593, 1324, 4064, 4886,
        7167, 4549, 7643, 6493, 9435, 5662, 4689, 2710])

AttributeError                            Traceback (most recent call last)
/home/xxxx/xxxx.ipynb Cell 21 line 5
File ~/.conda/envs/xxx/lib/python3.9/site-packages/accelerate/data_loader.py:838, in prepare_data_loader(dataloader, device, num_processes, process_index, split_batches, put_on_device, rng_types, dispatch_batches, even_batches, slice_fn_for_dispatch)
    836     sampler = dataloader.sampler.sampler
    837 else:
--> 838     sampler = dataloader.batch_sampler.sampler
    839 if isinstance(sampler, RandomSampler) and num_processes > 1:
    840     # When iterating through the dataloader during distributed processes
    841     # we want to ensure that on each process we are iterating through the same
    842     # samples in the same order if a seed is set. This requires a tweak
    843     # to the `torch.utils.data.RandomSampler` class (if used).
    844     sampler = SeedableRandomSampler(
    845         data_source=sampler.data_source,
    846         replacement=sampler.replacement,
    847         num_samples=sampler._num_samples,
    848         generator=getattr(sampler, "generator", torch.Generator()),
    849     )

AttributeError: 'MySampler' object has no attribute 'sampler'

Expected behavior

The dataloader should iterate normally, with no error raised.

muellerzr self-assigned this Oct 27, 2023
BenjaminBossan (Member)

This is indeed due to a recent change in accelerate. As a quick fix, could you make your MySampler class inherit from PyTorch's BatchSampler and see if everything works as expected?

XuHwang (Author) commented Oct 27, 2023

Thanks for the reply. It does work if MySampler inherits from PyTorch's BatchSampler. But I'm not sure whether the logic inside BatchSampler would affect my sampler. Here is a small test; the result looks as expected.

class MySampler2(BatchSampler):
    """
    BaseSampler is an iterator which could be used as a `batch_sampler` in DataLoader.
    It iterates a batch of sample index each time. And BaseSampler could only handle uniformly
    sampling. It works with `num_workers` in DataLoader because each worker aims to load
    a batch of samples each time.
    """
    def __init__(self, sampler, dataset_length: int, batch_size: int, shuffle:bool=True) -> None:
        super().__init__(sampler, batch_size=batch_size, drop_last=True)
        self.batch_size = batch_size
        self.data_index = np.arange(dataset_length)
        self.shuffle = shuffle

    def __iter__(self):
        output = [np.arange(10, 10+self.batch_size)] * len(self)
        yield from output


    def __len__(self):
        return (len(self.data_index) + self.batch_size - 1) // self.batch_size

dataset = MyDataset(np.arange(10000))
sampler = Sampler(dataset)    # this is PyTorch's base Sampler class, used only as a placeholder
batch_sampler2 = MySampler2(sampler, len(dataset), 32)
dataloader_pt2 = DataLoader(dataset, batch_sampler=batch_sampler2)
# Original PyTorch DataLoader, no problem. PyTorch version 1.13.1
for d in dataloader_pt2:
    print(d)
    break

# Accelerate DataLoader, works now that the batch sampler inherits BatchSampler. Accelerate version 0.24.0
dataloader_al2 = accelerate.data_loader.prepare_data_loader(dataloader_pt2)
for d in dataloader_al2:
    print(d)
    break

It produces the expected output:

tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
        28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41])
tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
        28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41])

BenjaminBossan (Member)

But I'm not sure whether the logic inside BatchSampler would affect my sampler.

As long as you override __init__, __iter__ and __len__, you should be good.
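
For example, a minimal sketch of the original MySampler adapted this way (the class name is illustrative, and the SequentialSampler passed to super().__init__ is only a placeholder to satisfy BatchSampler's constructor; it is never iterated because __iter__ is overridden):

import numpy as np
from torch.utils.data import BatchSampler, SequentialSampler

class MyBatchSampler(BatchSampler):
    """Same batching logic as MySampler, but inherits BatchSampler so
    accelerate's isinstance check and `.sampler` lookup succeed."""

    def __init__(self, dataset_length: int, batch_size: int, shuffle: bool = True) -> None:
        # The inner sampler is a placeholder; __iter__ below never uses it.
        super().__init__(SequentialSampler(range(dataset_length)), batch_size=batch_size, drop_last=False)
        self.data_index = np.arange(dataset_length)
        self.shuffle = shuffle

    def __iter__(self):
        index = np.random.permutation(self.data_index) if self.shuffle else self.data_index
        yield from np.array_split(index, len(self))

    def __len__(self):
        return (len(self.data_index) + self.batch_size - 1) // self.batch_size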

XuHwang (Author) commented Oct 27, 2023

Thanks for the solution. I also wonder whether accelerate will support a batch_sampler defined from scratch, as I described initially, that has no sampler attribute?

muellerzr (Collaborator)

Yes, it’s a bug that we’ll look at fixing

muellerzr mentioned this issue Oct 27, 2023
muellerzr (Collaborator)

Thanks for the flag on the regression @XuHwang! This will be solved with #2097
