Hallucinations #7

joseph2mi · 2023-06-26T00:31:28Z

The issue with combining WhisperTimeSync and WhisperHallu is this: if you need WhisperHallu, the audio has long silences and noise that prevent an accurate transcription. So you use WhisperHallu to cut the audio for easier transcription, but then you can't sync the result with WhisperTimeSync, because WhisperTimeSync, like the original Whisper, doesn't recognize the correct timestamps in the first place...

EtienneAb3d · 2023-06-26T05:27:07Z

@joseph2mi

The WhisperHallu option addSRT produces 2 outputs:

- one with noise and silence filtering, to get a transcription without hallucinations.
- one without cuts, to get a proper SRT with good timestamps, but possibly with hallucinations (which should not damage the timestamp quality).

You then use WhisperTimeSync to put the good timestamps over the good text.
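As an illustration of the idea of putting good timestamps over good text, here is a minimal sketch that aligns two word sequences and copies start times across the matching words. It is not WhisperTimeSync's actual algorithm: the function name `transfer_timestamps`, the word-level input format, and the use of `difflib` are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def transfer_timestamps(timed_words, clean_words):
    """Copy start times from a timed transcript onto a clean transcript.

    timed_words: list of (word, start_seconds) from the uncut transcription
                 (good timestamps, possibly with hallucinated words).
    clean_words: list of words from the filtered transcription.
    Returns a list of (word, start_seconds or None); None marks clean words
    that could not be matched to any timed word.
    """
    timed_tokens = [word.lower() for word, _ in timed_words]
    clean_tokens = [word.lower() for word in clean_words]

    # autojunk=False keeps difflib from discarding frequent words as "junk".
    matcher = SequenceMatcher(a=timed_tokens, b=clean_tokens, autojunk=False)

    result = [(word, None) for word in clean_words]
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            start = timed_words[block.a + offset][1]
            result[block.b + offset] = (clean_words[block.b + offset], start)
    return result


timed = [("hello", 0.0), ("world", 0.5), ("thanks", 3.0),
         ("for", 3.2), ("watching", 3.5), ("goodbye", 10.0)]
clean = ["hello", "world", "goodbye"]
aligned = transfer_timestamps(timed, clean)
```

In this toy input the hallucinated words "thanks for watching" have no counterpart in the clean text, so they are skipped and only the matching words carry their times over.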

joseph2mi · 2023-07-13T08:49:40Z

Hi, thanks for the response. You said "one without cut to get a proper SRT with good timestamps, but possibly with hallucinations"; the assumption there is that the timestamp quality is not affected.

The issue is that for some hallucinations, which just repeat the same line over and over, each subtitle line lasts between 5 and 30 seconds.

Therefore, when the timestamps are synced with the correct subtitles, you get extremely long chunks of subtitle text for each line, which is inaccurate and defeats the purpose of using WhisperHallu. I was wondering if there is a way, even with hallucinations, to get accurate timestamps from Whisper or Faster Whisper.

EtienneAb3d · 2023-07-13T15:27:58Z

I have never seen such a timestamp shift due to hallucinations: even if the timestamps are not always fully accurate, I never had the impression that this inaccuracy was due to hallucinations.

joseph2mi · 2023-07-13T22:26:26Z

Here is an example of timestamps for Vietnamese audio from Faster Whisper:

600
00:30:21,250 --> 00:30:24,500
Mà tôi không

601
00:30:24,500 --> 00:30:26,500
Trách ông Cát Mát

602
00:30:26,500 --> 00:30:28,500
Vì Cát Mát

603
00:30:28,500 --> 00:30:30,500
Là người đưa ra

604
00:30:30,500 --> 00:30:52,540
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

605
00:31:05,010 --> 00:31:30,350
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

606
00:31:42,370 --> 00:32:04,700
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

607
00:32:16,050 --> 00:32:37,360
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

608
00:32:48,830 --> 00:33:11,550
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

609
00:33:11,550 --> 00:33:34,560
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

610
00:33:34,560 --> 00:33:56,830
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

611
00:33:56,830 --> 00:34:18,720
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

612
00:34:28,940 --> 00:34:48,940
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

613
00:34:48,940 --> 00:35:12,620
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

614
00:35:12,620 --> 00:35:35,150
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

615
00:35:35,150 --> 00:35:55,340
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

616
00:35:55,340 --> 00:36:16,720
Cảm ơn các bạn đã theo dõi, hãy đăng ký kênh để ủng hộ kênh của mình nhé!

As you can see here, the timestamps are okay at first, with lines usually lasting 2-5 seconds. The moment it starts hallucinating (I'm still saving the timestamps so I can integrate them later with WhisperHallu and WhisperTimeSync), the lines suddenly stretch to 20-30 seconds each, which doesn't help for subtitles.
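The pattern in the example above (the same text repeated across unusually long cues) can be detected mechanically. The following is a rough sketch using only the standard library; the helper names `parse_srt` and `flag_suspect_cues` and the 10-second threshold are assumptions chosen for illustration, not part of Whisper, WhisperHallu, or WhisperTimeSync.

```python
import re

# Matches an SRT timing line such as "00:30:30,500 --> 00:30:52,540".
TIME_RE = re.compile(
    r"(\d+):(\d+):(\d+),(\d+)\s*-->\s*(\d+):(\d+):(\d+),(\d+)"
)


def parse_srt(text):
    """Parse SRT text into a list of (start_seconds, end_seconds, text)."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # need index line, timing line, and at least one text line
        match = TIME_RE.match(lines[1].strip())
        if not match:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        cues.append((start, end, " ".join(lines[2:]).strip()))
    return cues


def flag_suspect_cues(cues, max_duration=10.0):
    """Return indexes of cues that are long AND repeat a neighbour's text."""
    suspects = []
    for i, (start, end, text) in enumerate(cues):
        same_as_prev = i > 0 and cues[i - 1][2] == text
        same_as_next = i + 1 < len(cues) and cues[i + 1][2] == text
        if (same_as_prev or same_as_next) and end - start > max_duration:
            suspects.append(i)
    return suspects
```

Running `flag_suspect_cues(parse_srt(srt_text))` on the excerpt above would flag cues 604 onward, since each repeats its neighbour and lasts about 20 seconds or more, while the short cues 600-603 pass.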

My parameters on Faster Whisper are as follows:
model_size = "large-v2"
device = "cuda"
compute_type = "float32"
beam_size = 7
vad_filter = True
vad_parameters = dict(min_silence_duration_ms=50)
language = "vi"
max_initial_timestamp = 2.0
condition_on_previous_text = True
length_penalty = 1.5
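For reference, the parameters above would map onto the faster-whisper Python API roughly as follows. This is a sketch, not the script actually used: the `transcribe` wrapper and the two option dictionaries are assumptions for illustration, and the import is done lazily so the options can be inspected without a GPU or the library installed.

```python
# Options passed to the faster_whisper.WhisperModel constructor.
MODEL_OPTIONS = dict(
    model_size_or_path="large-v2",
    device="cuda",
    compute_type="float32",
)

# Options passed to WhisperModel.transcribe().
TRANSCRIBE_OPTIONS = dict(
    beam_size=7,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=50),
    language="vi",
    max_initial_timestamp=2.0,
    condition_on_previous_text=True,
    length_penalty=1.5,
)


def transcribe(audio_path):
    """Transcribe a file and return a list of (start, end, text) tuples."""
    # Imported here so that merely loading this module does not require
    # faster-whisper or CUDA to be available.
    from faster_whisper import WhisperModel

    model = WhisperModel(**MODEL_OPTIONS)
    segments, _info = model.transcribe(audio_path, **TRANSCRIBE_OPTIONS)
    # `segments` is a lazy generator; iterating it runs the transcription.
    return [(seg.start, seg.end, seg.text) for seg in segments]
```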

EtienneAb3d · 2023-07-14T08:43:06Z

In my own experiments, using the original sound file was more effective for getting proper timestamps.
In your case, you could try adapting WhisperHallu with a configuration that uses all filters (especially blank and noise removal), but without cuts.
