Hi folks, I'm struggling to get through the training guide. The first problem I encountered: after writing a custom dataset loader script and running the accelerate launch run_pseudo_labelling.py command, the terminal output told me I needed to modify the script to pass the trust_remote_code=True argument.
$ pwd
C:\Users\user\Desktop\distil-whisper\training
$ pip install -e .
$ accelerate launch run_pseudo_labelling.py --model_name_or_path openai/whisper-tiny --dataset_name C:\Users\user\Desktop\distil-whisper\training\my_dataset --text_column_name sentence --id_column_name audio --output_dir my_dataset_output
[...]
ValueError: The repository for my_dataset contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/my_dataset.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.
[...]
So I modified the training/run_pseudo_labelling.py file as the error message suggests:
raw_datasets[split] = load_dataset(
    data_args.dataset_name,
    data_args.dataset_config_name,
    split=split,
    cache_dir=data_args.dataset_cache_dir,
    token=token,
    streaming=data_args.streaming,
    num_proc=data_args.preprocessing_num_workers if not data_args.streaming else None,
+   trust_remote_code=True,
)
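For what it's worth, I believe recent versions of datasets also read the HF_DATASETS_TRUST_REMOTE_CODE environment variable as a default for this flag, but I haven't verified that on my version, so I went with the one-line code change above.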
With this workaround applied, I'm able to continue, but then I run into a second problem:
$ accelerate launch run_pseudo_labelling.py --model_name_or_path openai/whisper-tiny --dataset_name C:\Users\user\Desktop\distil-whisper\training\my_dataset --text_column_name sentence --id_column_name audio --output_dir my_dataset_output
[...]
FileNotFoundError: [Errno 2] No such file or directory: 'my_dataset-0'
[...]
I can't get past this error. My guess is that a file path such as "C:\Users\user\Desktop\distil-whisper\training\my_dataset\audio.mp3" is somehow being mangled while the run_pseudo_labelling.py script runs.
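A minimal way to check whether the path survives loading (a sketch, assuming the loader defines a train split) is to load the dataset directly, outside the training script, and inspect the resolved audio entry:

# Sanity check: does load_dataset resolve the audio file on its own?
# (Assumes the loader script defines a "train" split.)
from datasets import load_dataset

ds = load_dataset(
    r"C:\Users\user\Desktop\distil-whisper\training\my_dataset",
    split="train",
    trust_remote_code=True,
)
# Accessing the audio column decodes the file, so a broken path
# fails here with a clearer traceback than the one above.
print(ds[0]["audio"])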
My custom dataset loader script is:
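(Simplified sketch of the loader's structure; the class name, sampling rate, and transcript text below are placeholders, but the real script has the same shape.)

# my_dataset.py -- a standard datasets loading script, simplified
import os

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "sentence": datasets.Value("string"),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # all data sits next to the script itself
        data_dir = os.path.dirname(os.path.abspath(__file__))
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dir},
            )
        ]

    def _generate_examples(self, data_dir):
        # a single example pointing at the mp3 next to the script
        yield 0, {
            "audio": os.path.join(data_dir, "audio.mp3"),
            "sentence": "hello world",  # placeholder transcript
        }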
Note that I had placed the dummy audio file at the relevant path before running the script.
How can I resolve these issues? Thanks!