Add support for Wav2Vec2 / Connectionist Temporal Classification (CTC) phoneme models (Wav2Vec2ForCTC HuggingFace CTC model class)
Motivation
DistilWhisperLargeV2 shows impressive results, as far as I can see from the provided Space with the NextJS web app. The natural companion to the Whisper transcription model is a Wav2Vec2 phoneme model. WhisperX is an example of Whisper + Wav2Vec2 working together: it enables fast automatic speech recognition with word-level timestamps plus speaker diarization.
Other solutions
The wav2vec2-service provides a Wav2Vec2 implementation for fast CPU inference via ONNX.
Hi, looking at wav2vec2 params I think that a LayerNorm can cut it for the implementation.
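In the model config, GroupNorm is used in the following manner: nn.GroupNorm(num_groups=self.out_conv_dim, num_channels=self.out_conv_dim..., where out_conv_dim==in_conv_dim==512. Since num_groups equals num_channels, that means one channel per group, so each channel is normalized on its own over the time axis.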
I think a permutation of dims and LayerNorm can help. I am working on #132 but this hack could work for now 🤔
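To make the GroupNorm/LayerNorm idea concrete, here is a minimal, framework-free sketch (plain Python lists, not the actual model code) of why it can work. It assumes a single sample shaped [channels][time], no affine parameters, biased variance, and eps=1e-5; the function names are hypothetical. When num_groups equals the number of channels, each group holds one channel, so group normalization reduces to normalizing every channel over its time axis, which is what a layer norm over the time dimension computes.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Group normalization (no affine) for one sample x shaped [channels][time]."""
    channels = len(x)
    assert channels % num_groups == 0
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        # Statistics are shared by every value in the group.
        values = [v for row in group for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in row] for row in group])
    return out

def layer_norm_over_time(x, eps=1e-5):
    """Layer normalization (no affine) applied to each channel's time axis."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        scale = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * scale for v in row])
    return out

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two [channels][time] lists."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# Toy feature map: 4 channels, 5 time steps (the real layer has 512 channels).
x = [[float((c + 1) * t * t - 3 * c) for t in range(5)] for c in range(4)]

# One channel per group: matches per-channel normalization over time.
same = max_abs_diff(group_norm(x, num_groups=len(x)), layer_norm_over_time(x))
# A single group over all channels: does not match.
different = max_abs_diff(group_norm(x, num_groups=1), layer_norm_over_time(x))
print(same, different)
```

One caveat for a real port: the GroupNorm affine weight and bias are per channel, whereas a LayerNorm over the time axis would carry per-time-step parameters, so the per-channel scale and shift would likely need to be applied separately after the normalization. Whether a permutation is needed depends on which axis the LayerNorm implementation treats as the normalized one.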