Cannot continue training for custom dataset #144

Open
alicewith opened this issue Jul 18, 2024 · 0 comments

Hi, folks. I'm struggling to get through the training instructions with a custom dataset.

The first problem I encountered: after writing a custom dataset loader script and running the accelerate launch run_pseudo_labelling.py command, the terminal output told me I needed to modify the script so that it passes the trust_remote_code=True argument.

$ pwd
C:\Users\user\Desktop\distil-whisper\training
$ pip install -e .
$ accelerate launch run_pseudo_labelling.py --model_name_or_path openai/whisper-tiny --dataset_name C:\Users\user\Desktop\distil-whisper\training\my_dataset --text_column_name sentence --id_column_name audio --output_dir my_dataset_output
[...]
ValueError: The repository for my_dataset contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/my_dataset.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.
[...]

So I modified the training/run_pseudo_labelling.py file as the error message suggested:

raw_datasets[split] = load_dataset(
    data_args.dataset_name,
    data_args.dataset_config_name,
    split=split,
    cache_dir=data_args.dataset_cache_dir,
    token=token,
    streaming=data_args.streaming,
    num_proc=data_args.preprocessing_num_workers if not data_args.streaming else None,
+    trust_remote_code=True,
)
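
For what it's worth, the loading step can also be sanity-checked outside the training script (a sketch, using the same local dataset path as in the accelerate command; decoding the mp3 additionally needs an audio backend such as soundfile or ffmpeg installed):

from datasets import load_dataset

# Sanity-check the dataset loading step on its own (sketch; same local
# path as passed to --dataset_name above).
raw = load_dataset(
    r'C:\Users\user\Desktop\distil-whisper\training\my_dataset',
    split='train',
    trust_remote_code=True,  # the flag the error message asks for
)
print(raw[0]['sentence'])
print(raw[0]['audio'])  # decodes the mp3 (needs an audio backend installed)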

With this workaround in place I can continue the training, but then I run into a second problem:

$ accelerate launch run_pseudo_labelling.py --model_name_or_path openai/whisper-tiny --dataset_name C:\Users\user\Desktop\distil-whisper\training\my_dataset --text_column_name sentence --id_column_name audio --output_dir my_dataset_output
[...]
FileNotFoundError: [Errno 2] No such file or directory: 'my_dataset-0'
[...]

I can't get past this point due to the error. My guess is that the file path, say "C:\Users\user\Desktop\distil-whisper\training\my_dataset\audio.mp3", is being unexpectedly mangled while the run_pseudo_labelling.py script runs.
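
While poking at this, I noticed that backslashes in ordinary Python string literals are treated as escape sequences, which can silently mangle Windows paths. I'm not sure this is the cause, but a minimal illustration (shortened, illustrative path):

bad = 'C:\training\my_dataset\audio.mp3'    # \t -> tab, \a -> bell
print(bad)    # C:<TAB>raining\my_dataset<BEL>udio.mp3 -- silently mangled
good = r'C:\training\my_dataset\audio.mp3'  # raw string keeps backslashes
print(good)   # C:\training\my_dataset\audio.mp3
# Note: a non-raw '\U' (as in 'C:\Users') is even a SyntaxError.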

My custom dataset loader script is:

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                'audio': datasets.Audio(),
                'sentence': datasets.Value('string'),
            }),
        )

    def _split_generators(self, dl_manager):
        # All three splits reuse the same single dummy example below.
        train_generator = datasets.SplitGenerator(name=datasets.Split.TRAIN)
        valid_generator = datasets.SplitGenerator(name=datasets.Split.VALIDATION)
        test_generator = datasets.SplitGenerator(name=datasets.Split.TEST)

        return [train_generator, valid_generator, test_generator]

    def _generate_examples(self):
        # One dummy example. The path is a raw string (r'...') so the
        # backslashes are not treated as escape sequences; without the r
        # prefix, the '\U' in 'C:\Users' is a SyntaxError.
        for index in range(1):
            result = {
                'audio': r'C:\Users\user\Desktop\distil-whisper\training\my_dataset\audio.mp3',
                'sentence': 'Every evening, the dogs in our neighbourhood are howling.',
            }

            yield index, result

Note that I placed the dummy audio file at that path before running the command.
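
Separately, in case it is relevant: would the built-in audiofolder builder let me sidestep the loader script (and trust_remote_code) entirely? A sketch of what I have in mind (the metadata.csv layout below is my assumption, not something from the repo):

from datasets import load_dataset

# Sketch using datasets' built-in 'audiofolder' builder instead of a
# custom loader script. Assumed directory layout:
#
#   my_dataset/
#     metadata.csv   <- header: file_name,sentence
#     audio.mp3
#
dataset = load_dataset(
    'audiofolder',
    data_dir=r'C:\Users\user\Desktop\distil-whisper\training\my_dataset',
)
print(dataset)  # extra metadata.csv columns (here: sentence) become features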

How could I resolve these issues? Thanks!
