Allow passing a custom batch sampler to the trainer #3162

Open · wants to merge 1 commit into master
Conversation

@alonme commented on Jan 11, 2025

resolves #3152

Also, I believe there was an issue with the typing and usage of the seed parameter: since 0 is falsy, an explicit seed of 0 would be ignored.
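
For illustration, the kind of pattern that causes this: a truthiness check treats seed=0 the same as "no seed". A minimal sketch of the pitfall and its fix (make_generator_buggy/make_generator_fixed are hypothetical helpers, not the exact code in sampler.py):

from __future__ import annotations

import torch


def make_generator_buggy(seed: int = 0) -> torch.Generator:
    generator = torch.Generator()
    # Buggy: `if seed` is False when seed == 0, so an explicit seed of 0
    # is silently ignored and the generator stays unseeded.
    if seed:
        generator.manual_seed(seed)
    return generator


def make_generator_fixed(seed: int | None = None) -> torch.Generator:
    generator = torch.Generator()
    # Fixed: compare against None, so 0 is accepted as a real seed.
    if seed is not None:
        generator.manual_seed(seed)
    return generator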

Missing:

  1. Documentation - I wanted to make sure this approach makes sense before writing docs.

@@ -77,21 +97,17 @@ def __init__(
    dataset: Dataset,
    batch_size: int,
    drop_last: bool,
    valid_label_columns: list[str] = None,
    generator: torch.Generator = None,
    seed: int = 0,
@alonme (Author):

I believe there was an issue with the typing and usage of the seed parameter: since 0 is falsy, an explicit seed of 0 would be ignored.
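
As an illustration of the typing fix implied here, the signature could use None defaults so that 0 remains a usable seed. A sketch under that assumption, not necessarily the PR's final diff:

from __future__ import annotations

import torch
from torch.utils.data import Dataset


class DefaultBatchSampler:
    # Signature sketch only; the full implementation is elided.
    def __init__(
        self,
        dataset: Dataset,
        batch_size: int,
        drop_last: bool,
        valid_label_columns: list[str] | None = None,
        generator: torch.Generator | None = None,
        seed: int | None = None,  # None means "unseeded", so seed=0 stays meaningful
    ) -> None:
        self.dataset = dataset
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.valid_label_columns = valid_label_columns or []
        self.generator = generator
        self.seed = seed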

Comment on lines +587 to +588
if self._batch_sampler:
    return self._batch_sampler
Collaborator:

I like the changes in sampler.py, but I'm not sure this is the best option. It requires the user to initialize the batch sampler themselves, which prevents Sentence Transformers from updating the dataset(s). That breaks the prompts feature here:

if self.train_dataset is not None:
    self.train_dataset = self.maybe_add_prompts_or_dataset_name_column(
        train_dataset, args.prompts, dataset_name="train"
    )
if self.eval_dataset is not None:
    self.eval_dataset = self.maybe_add_prompts_or_dataset_name_column(
        eval_dataset, args.prompts, dataset_name="eval"
    )

Additionally, it prevents multi-dataset training setups, because there's only one batch sampler possible.

So, I think a good solution is for the batch_sampler argument to be either 1) a class (not an instance) that subclasses DefaultBatchSampler, or 2) a function that, given dataset, batch_size, drop_last, valid_label_columns, generator, seed, and *args & **kwargs, returns an instance of a DefaultBatchSampler subclass.

What do you think?
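
To make option 2 concrete, here is a minimal sketch of a factory-style batch_sampler argument. The DefaultBatchSampler stand-in, the factory name, and the trainer wiring described in the comments are assumptions drawn from this comment, not the library's actual API:

from __future__ import annotations

from typing import Any

import torch
from torch.utils.data import Dataset


class DefaultBatchSampler:
    # Stand-in for sentence_transformers' DefaultBatchSampler; body elided.
    def __init__(
        self,
        dataset: Dataset,
        batch_size: int,
        drop_last: bool,
        valid_label_columns: list[str] | None = None,
        generator: torch.Generator | None = None,
        seed: int | None = None,
    ) -> None:
        ...


def my_batch_sampler_factory(
    dataset: Dataset,
    batch_size: int,
    drop_last: bool,
    valid_label_columns: list[str] | None = None,
    generator: torch.Generator | None = None,
    seed: int | None = None,
    *args: Any,
    **kwargs: Any,
) -> DefaultBatchSampler:
    # The trainer would call this once per dataset, after prompts or
    # dataset-name columns have been added, so multi-dataset setups and
    # the prompts feature keep working.
    return DefaultBatchSampler(
        dataset, batch_size, drop_last, valid_label_columns, generator, seed
    )

Under this design, the trainer would invoke the class or callable with the per-dataset arguments itself, instead of receiving a single pre-built sampler instance.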

Successfully merging this pull request may close issue #3152: Improve API to use a custom batch_sampler in a SentenceTransformerTrainer.