
Collate missing features #1096

Open · wants to merge 1 commit into base: main
Conversation

@RicardoDominguez (Contributor) commented Jan 11, 2024

Currently, BatchSamplerDataCollatorForSeq2Seq typically requires input_ids, labels, attention_mask, and position_ids. For certain use cases (e.g., datasets of type completion), the latter three are straightforwardly derived from input_ids. Saving datasets to disk with all four features can therefore be redundant, and generally requires much more disk space (up to 5x).

The changes here implement a MissingFeaturesCollator together with the collate_missing_features config option. If labels, attention_mask, or position_ids are missing from the dataset, it replaces them with the following defaults:

  • labels: copy input_ids
  • position_ids: [0, 1, 2, ..., len(input_ids)-1]
  • attention_mask: [1, 1, 1, ...] with same length as input_ids

I've observed that eagerly filling these missing features does not lead to longer run times.
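A minimal sketch of the default-filling behavior described above (the function name `fill_missing_features` is illustrative, not the PR's actual API):

```python
def fill_missing_features(example):
    """Fill labels, position_ids, and attention_mask when only input_ids is stored."""
    input_ids = example["input_ids"]
    n = len(input_ids)
    # labels: copy input_ids
    example.setdefault("labels", list(input_ids))
    # position_ids: [0, 1, 2, ..., n-1]
    example.setdefault("position_ids", list(range(n)))
    # attention_mask: [1, 1, 1, ...] with the same length as input_ids
    example.setdefault("attention_mask", [1] * n)
    return example
```

Since `setdefault` is a no-op for keys already present, datasets that do store all four features pass through unchanged.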

@RicardoDominguez (Contributor, Author) commented Jan 11, 2024

For instance, pile-cc with input_ids, labels, attention_mask, position_ids, and length is ~1.2TB, and ~250GB with only input_ids and length.

@winglian (Collaborator)

Is there another part that goes with this to optionally have the tokenization step be a bit more sparse for this feature?

collate_missing_features:
# - labels # copy input_ids
# - position_ids # [0, 1, 2, ..., len(input_ids)-1]
# - attention_mask [1, 1, 1, ...] with same length as input_ids

Suggested change:
- # - attention_mask [1, 1, 1, ...] with same length as input_ids
+ # - attention_mask # [1, 1, 1, ...] with same length as input_ids

@winglian (Collaborator)

Looks good so far. I think a unit test on build_collator could be helpful to validate the collator that is returned based on various options, and then some sort of unit test to validate the functionality of MissingFeaturesCollator would be ideal too.
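A unit test along the lines suggested might look like the sketch below. Only the class name MissingFeaturesCollator comes from the PR; the body here is an illustrative stand-in that fills the defaults and would, in the real implementation, delegate to a wrapped base collator:

```python
class MissingFeaturesCollator:
    # Illustrative stand-in: fill defaults for absent features, then the real
    # implementation would hand the batch to the wrapped base collator (omitted).
    def __call__(self, features):
        for f in features:
            n = len(f["input_ids"])
            f.setdefault("labels", list(f["input_ids"]))
            f.setdefault("position_ids", list(range(n)))
            f.setdefault("attention_mask", [1] * n)
        return features


def test_fills_missing_defaults():
    batch = [{"input_ids": [3, 4, 5]}]
    out = MissingFeaturesCollator()(batch)
    assert out[0]["labels"] == [3, 4, 5]
    assert out[0]["position_ids"] == [0, 1, 2]
    assert out[0]["attention_mask"] == [1, 1, 1]
```

A companion test on build_collator would then assert which collator class is returned for each combination of config options.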

@RicardoDominguez (Contributor, Author)

I can write something that makes the tokenization step a bit more sparse. How about when all datasets are of type completion?
