Cannot continue training for custom dataset #144

Open
alicewith opened this issue Jul 18, 2024 · 0 comments

Hi, folks. I'm struggling to get through the training instructions with a custom dataset.

The first problem I encountered: after writing a custom dataset loader script and running the accelerate launch run_pseudo_labelling.py command, the terminal output told me I needed to modify the script so that it passes the trust_remote_code=True argument.

$ pwd
C:\Users\user\Desktop\distil-whisper\training
$ pip install -e .
$ accelerate launch run_pseudo_labelling.py --model_name_or_path openai/whisper-tiny --dataset_name C:\Users\user\Desktop\distil-whisper\training\my_dataset --text_column_name sentence --id_column_name audio --output_dir my_dataset_output
[...]
ValueError: The repository for my_dataset contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/my_dataset.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.
[...]

So I modified the training/run_pseudo_labelling.py file as the error message suggested:

raw_datasets[split] = load_dataset(
    data_args.dataset_name,
    data_args.dataset_config_name,
    split=split,
    cache_dir=data_args.dataset_cache_dir,
    token=token,
    streaming=data_args.streaming,
    num_proc=data_args.preprocessing_num_workers if not data_args.streaming else None,
+    trust_remote_code=True,
)
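
For what it's worth, the loading step can also be sanity-checked outside the training script (a sketch, using the same local dataset path as in the accelerate command; decoding the mp3 additionally needs an audio backend such as soundfile or ffmpeg installed):

from datasets import load_dataset

# Sanity-check the dataset loading step on its own (sketch; same local
# path as passed to --dataset_name above).
raw = load_dataset(
    r'C:\Users\user\Desktop\distil-whisper\training\my_dataset',
    split='train',
    trust_remote_code=True,  # the flag the error message asks for
)
print(raw[0]['sentence'])
print(raw[0]['audio'])  # decodes the mp3 (needs an audio backend installed)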

With this workaround in place I can continue the training, but then I run into a second problem:

$ accelerate launch run_pseudo_labelling.py --model_name_or_path openai/whisper-tiny --dataset_name C:\Users\user\Desktop\distil-whisper\training\my_dataset --text_column_name sentence --id_column_name audio --output_dir my_dataset_output
[...]
FileNotFoundError: [Errno 2] No such file or directory: 'my_dataset-0'
[...]

I can't get past this point due to the error. My guess is that the file path, say "C:\Users\user\Desktop\distil-whisper\training\my_dataset\audio.mp3", is being unexpectedly mangled while the run_pseudo_labelling.py script runs.
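
While poking at this, I noticed that backslashes in ordinary Python string literals are treated as escape sequences, which can silently mangle Windows paths. I'm not sure this is the cause, but a minimal illustration (shortened, illustrative path):

bad = 'C:\training\my_dataset\audio.mp3'    # \t -> tab, \a -> bell
print(bad)    # C:<TAB>raining\my_dataset<BEL>udio.mp3 -- silently mangled
good = r'C:\training\my_dataset\audio.mp3'  # raw string keeps backslashes
print(good)   # C:\training\my_dataset\audio.mp3
# Note: a non-raw '\U' (as in 'C:\Users') is even a SyntaxError.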

My custom dataset loader script is:

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                'audio': datasets.Audio(),
                'sentence': datasets.Value('string'),
            }),
        )

    def _split_generators(self, dl_manager):
        # All three splits reuse the same single dummy example below.
        train_generator = datasets.SplitGenerator(name=datasets.Split.TRAIN)
        valid_generator = datasets.SplitGenerator(name=datasets.Split.VALIDATION)
        test_generator = datasets.SplitGenerator(name=datasets.Split.TEST)

        return [train_generator, valid_generator, test_generator]

    def _generate_examples(self):
        # One dummy example. The path is a raw string (r'...') so the
        # backslashes are not treated as escape sequences; without the r
        # prefix, the '\U' in 'C:\Users' is a SyntaxError.
        for index in range(1):
            result = {
                'audio': r'C:\Users\user\Desktop\distil-whisper\training\my_dataset\audio.mp3',
                'sentence': 'Every evening, the dogs in our neighbourhood are howling.',
            }

            yield index, result

Note that I placed the dummy audio file at that path before running the command.
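
Separately, in case it is relevant: would the built-in audiofolder builder let me sidestep the loader script (and trust_remote_code) entirely? A sketch of what I have in mind (the metadata.csv layout below is my assumption, not something from the repo):

from datasets import load_dataset

# Sketch using datasets' built-in 'audiofolder' builder instead of a
# custom loader script. Assumed directory layout:
#
#   my_dataset/
#     metadata.csv   <- header: file_name,sentence
#     audio.mp3
#
dataset = load_dataset(
    'audiofolder',
    data_dir=r'C:\Users\user\Desktop\distil-whisper\training\my_dataset',
)
print(dataset)  # extra metadata.csv columns (here: sentence) become features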

How could I resolve these issues? Thanks!
