
Experimental Result? #3

Open
QuangDiy opened this issue May 30, 2024 · 5 comments

@QuangDiy

Hi @phineas-pta,

Recently, I experimented with fine-tuning Whisper using QLoRA. I tried the Large V3 model, fine-tuning it on four datasets: CMV-17, VIVOS, Fleurs, and 100 hours of VinAI data. The WER for the fine-tuned Large V3 model exceeded 100%. I'm curious whether you encountered a similar issue (I reran the experiments three times and the results were the same). When I switched to V2, however, the results were reasonable.

| Model | WER Fleurs | WER CMV-Vi | WER VIVOS | CER Fleurs | CER CMV | CER VIVOS |
| --- | --- | --- | --- | --- | --- | --- |
| Whisper Large V3 | 8.66 | 15.49 | 12.41 | 5.36 | 10.07 | 8.10 |
| FT Whisper Large V3 (25h VinAI data) | 10.48 | 15.19 | 9.40 | 5.64 | 8.51 | 4.69 |
| FT Whisper Large V3 (100h VinAI data) | >100 | >100 | >100 | >100 | >100 | >100 |
| Whisper Large V2 | 11.08 | 18.46 | 11.56 | 6.96 | 10.73 | 6.35 |
| FT Whisper Large V2 (100h VinAI data) | 11.36 | 17.34 | 10.98 | 6.38 | 9.70 | 5.67 |

Despite fine-tuning with over 100 hours of data, the improvements seem minimal. If you have run into a similar situation, I hope you can share your solution.

Thank you very much.
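(For readers wondering how a WER above 100% is even possible: WER = (S + D + I) / N over the N reference words, so when the model emits long runs of inserted tokens, the insertion count alone can exceed the reference length. A minimal pure-Python sketch, with hypothetical example strings:)

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via a rolling DP row.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(
                d[j] + 1,                               # deletion
                d[j - 1] + 1,                           # insertion
                prev + (ref[i - 1] != hyp[j - 1]),      # substitution / match
            )
            prev = cur
    return d[-1] / len(ref)

# A looping hypothesis: 1 substitution + 7 insertions against 4 reference words.
print(wer("hãy nghĩ hành tinh", "hãy nghĩ hành tr tr tr tr tr tr tr tr"))  # → 2.0
```

So a >100% WER is a strong hint that the hypotheses are much longer than the references, i.e. degenerate repetition rather than ordinary mistranscription.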

@phineas-pta
Owner

Double-check the transcriptions from your benchmark run; chances are the output isn't actually Vietnamese.

One known Whisper problem is that after fine-tuning, language detection degrades quite badly. I ran into the same thing, which is why I always force lang="vi".

@QuangDiy
Author

I did configure it like this when calling the model. Training V3 on about 50h of data was fine, but at 125h it behaves like this and I'm not sure why.

```python
model.generation_config.language = "<|vi|>"
model.generation_config.task = "transcribe"
```

but inference still outputs:

hãy nghĩ hành trục tuyết tương như một hành tr tr tr tr tr tr tr tr … (the tokens "tr", "th", "gi", and "w" repeat for the rest of the output)

@phineas-pta
Owner

Isn't the inference syntax different, though? 🤔 Or is it hallucination? 🤔

@QuangDiy
Author

Ah, at inference I still force language="vi". With the same settings, switching to V2 works fine.

```python
with torch.cuda.amp.autocast():
    with torch.no_grad():
        generated_tokens = (
            model.generate(
                input_features=batch["input_features"].to("cuda"),
                max_new_tokens=200,
                language="vi",
                task="transcribe",
            )
            .cpu()
            .numpy()
        )
```
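(Side note for anyone hitting the same looped output: a quick sanity check over the benchmark hypotheses can flag degenerate transcripts before they poison the WER score. A pure-Python sketch with hypothetical example strings; on the Hugging Face side, `model.generate` also accepts `no_repeat_ngram_size` and `repetition_penalty` to suppress such loops.)

```python
from collections import Counter

def repetition_ratio(text: str) -> float:
    """Fraction of the words taken up by the single most frequent word.
    Values near 1.0 indicate degenerate, looping output."""
    words = text.split()
    if not words:
        return 0.0
    return Counter(words).most_common(1)[0][1] / len(words)

clean = "hãy nghĩ hành tinh xanh của chúng ta"
degenerate = "hãy nghĩ hành " + "tr " * 80

print(repetition_ratio(clean))       # low: varied vocabulary
print(repetition_ratio(degenerate))  # near 1.0: one looped token
```

Transcripts above some threshold (say 0.5) can then be inspected or excluded separately, so the metric reflects genuine transcription quality.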

@phineas-pta
Owner

Then it's probably hallucination; V3 has this problem quite often.

Also, check the loss curve from training just to be sure.
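(A small sketch of that check, assuming the per-step losses are already collected in a list, e.g. pulled from `trainer.state.log_history` when training with the Hugging Face `Trainer`; the example loss values are made up:)

```python
def loss_diverged(losses: list[float], window: int = 5) -> bool:
    """Flag divergence: the mean of the last `window` losses is higher
    than the mean of the window before it."""
    if len(losses) < 2 * window:
        return False
    recent = sum(losses[-window:]) / window
    earlier = sum(losses[-2 * window:-window]) / window
    return recent > earlier

healthy = [2.1, 1.7, 1.4, 1.2, 1.0, 0.9, 0.8, 0.75, 0.7, 0.68]
diverging = [2.1, 1.7, 1.4, 1.2, 1.0, 1.3, 1.6, 2.0, 2.4, 2.9]

print(loss_diverged(healthy))    # False: loss keeps decreasing
print(loss_diverged(diverging))  # True: loss climbed back up
```

A curve that drops and then climbs back up late in training would fit the pattern of the model degrading between 50h and 125h of data.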
