Efficiently compute total number of steps #1098

RicardoDominguez · 2024-01-11T16:26:15Z

When calculating the total number of steps, the current implementation requires iterating through the entire dataset, which can be very slow for large datasets. Instead, use sampler.num_batches(), which I believe corresponds exactly to the length of the data loader (since, when packing, the batch size of the data loader is 1).

winglian · 2024-01-11T16:36:49Z

@RicardoDominguez thanks! Looks good so far. I just need to do some sanity checks on the validity of this logic. Do you have any ideas about how we might unit test this?

winglian · 2024-01-11T16:37:31Z

I went ahead and rebased the PR onto main as there were some linting issues in the commit of main you were on.

RicardoDominguez · 2024-01-11T16:45:14Z

Hmm, I'm not sure how to unit test these changes beyond comparing sampler.num_batches() to len(dataloader).
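A rough sketch of what such a comparison could look like. `ToyPackingSampler` is a hypothetical stand-in, not the sampler in this PR, and the simple sequential packing rule is assumed purely for illustration. The idea is only that a count derived from sample lengths should match the number of batches actually yielded:

```python
from typing import Iterator, List


class ToyPackingSampler:
    """Hypothetical stand-in for a packing batch sampler (not the real one)."""

    def __init__(self, lengths: List[int], capacity: int) -> None:
        self.lengths = lengths      # token length of each sample
        self.capacity = capacity    # maximum total length per packed batch

    def num_batches(self) -> int:
        # Fast path: count bins from lengths alone, without building batches.
        count, used = 0, 0
        for length in self.lengths:
            if count == 0 or used + length > self.capacity:
                count += 1
                used = 0
            used += length
        return count

    def __iter__(self) -> Iterator[List[int]]:
        # Slow path: actually build each packed batch of sample indices.
        current: List[int] = []
        used = 0
        for idx, length in enumerate(self.lengths):
            if current and used + length > self.capacity:
                yield current
                current, used = [], 0
            current.append(idx)
            used += length
        if current:
            yield current


def count_by_iteration(sampler: ToyPackingSampler) -> int:
    # Walk every batch and count, as the pre-PR step computation would.
    return sum(1 for _ in sampler)
```

A test along these lines would assert `sampler.num_batches() == count_by_iteration(sampler)` over a few length distributions, including an empty dataset and a sample longer than the capacity.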

winglian · 2024-02-01T05:36:44Z

I also think there is a minor discrepancy because we often pass drop_last=True to the dataloader.
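To make the size of that discrepancy concrete, here is a small sketch of how drop_last changes the batch count for a plain fixed-size batch sampler. `expected_num_batches` is a hypothetical helper, and how this interacts with packing is not covered here:

```python
import math


def expected_num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    # With drop_last=True the trailing partial batch is discarded (floor);
    # otherwise it is kept as a smaller final batch (ceil).
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


print(expected_num_batches(10, 4, drop_last=False))  # 3
print(expected_num_batches(10, 4, drop_last=True))   # 2
```

So the two counts differ by at most one batch, and only when the dataset size is not a multiple of the batch size.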

winglian · 2024-02-01T06:51:38Z

I think this might not be necessary as checking the length of the sampler is already what happens under the hood.

Inspecting the torch DataLoader class, the __len__() method already returns len(self._index_sampler) for map-style datasets, where _index_sampler is defined as:

    @property
    def _index_sampler(self):
        # The actual sampler used for generating indices for `_DatasetFetcher`
        # (see _utils/fetch.py) to read data at each time. This would be
        # `.batch_sampler` if in auto-collation mode, and `.sampler` otherwise.
        # We can't change `.sampler` and `.batch_sampler` attributes for BC
        # reasons.
        if self._auto_collation:
            return self.batch_sampler
        else:
            return self.sampler
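As an illustration of why this makes len(dataloader) cheap, here is a simplified stand-in that mirrors only the length logic quoted above. `MiniLoader`, `CountingDataset`, and `FixedBatchSampler` are hypothetical names, not torch classes. The point is that asking for the length never reads a sample:

```python
import math
from typing import Iterator, List


class CountingDataset:
    """Toy map-style dataset that records how many samples were read."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.reads = 0

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, idx: int) -> int:
        self.reads += 1
        return idx


class FixedBatchSampler:
    """Toy batch sampler whose length is computed arithmetically."""

    def __init__(self, num_samples: int, batch_size: int) -> None:
        self.num_samples = num_samples
        self.batch_size = batch_size

    def __len__(self) -> int:
        return math.ceil(self.num_samples / self.batch_size)

    def __iter__(self) -> Iterator[List[int]]:
        for start in range(0, self.num_samples, self.batch_size):
            yield list(range(start, min(start + self.batch_size, self.num_samples)))


class MiniLoader:
    """Stand-in mirroring only the __len__ / _index_sampler delegation."""

    def __init__(self, dataset, batch_sampler=None, sampler=None) -> None:
        self.dataset = dataset
        self.batch_sampler = batch_sampler
        self.sampler = sampler

    @property
    def _auto_collation(self) -> bool:
        return self.batch_sampler is not None

    @property
    def _index_sampler(self):
        return self.batch_sampler if self._auto_collation else self.sampler

    def __len__(self) -> int:
        # Length comes from the sampler; the dataset is never touched.
        return len(self._index_sampler)


dataset = CountingDataset(10)
loader = MiniLoader(dataset, batch_sampler=FixedBatchSampler(len(dataset), 4))
print(len(loader), dataset.reads)  # 3 0
```

If the batch sampler's own __len__ is cheap, len(dataloader) is cheap as well, which is the point made in this comment.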

winglian force-pushed the eff_total_steps branch from 362c4c6 to 3d4e5a5 Compare January 11, 2024 16:35

RicardoDominguez and others added 2 commits February 1, 2024 01:56

efficiently compute total number of steps

5c105b1

chore: lint

6e0cd11

winglian force-pushed the eff_total_steps branch from 3d4e5a5 to 6e0cd11 Compare February 1, 2024 06:56
