Releases: KoljaB/TurnVoice
v0.0.7
v0.0.65
- added --faster parameter to select faster_whisper for timestamp transcription instead of stable_whisper (stable_whisper takes a lot of resources, especially on longer videos)
- added --model parameter to select the transcription model; can be 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2', 'large-v3', or 'large' (see the example below)
- updated to Coqui TTS v0.22.0 which enables access to 58 free predefined speaker voices
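A combined usage example for the new flags (an illustration only; the flag names and model values are the ones listed above, and the URL is reused from the examples further down):
turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM --faster --model medium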
v0.0.60
- switched from Deezer's Spleeter to Facebook's Demucs, for two reasons:
  - better vocal splitting quality
  - more reliable handling of files longer than 10 minutes
- crossfade algorithm to switch between the original and the vocal-stripped audio more seamlessly (see the sketch after this list)
- use of the stable_whisper timestamp refinement technique for higher timestamp detection precision (see the sketch after this list)
- new JavaScript Renderscript Editor to fine-tune speaking timings, text and speaker assignment
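A minimal sketch of the kind of crossfade mentioned above, assuming both tracks are time-aligned mono float numpy arrays of equal length at the same sample rate (the function and parameter names are illustrative, not TurnVoice's actual code):

```python
import numpy as np

def crossfade_switch(original: np.ndarray, stripped: np.ndarray,
                     switch_sample: int, sample_rate: int,
                     fade_seconds: float = 0.05) -> np.ndarray:
    """Play `original` up to `switch_sample`, then `stripped`, blending the two
    tracks linearly over `fade_seconds` around the switch point."""
    n = max(int(sample_rate * fade_seconds), 1)
    start = max(switch_sample - n // 2, 0)
    end = min(start + n, len(original))
    out = original.copy()
    ramp = np.linspace(0.0, 1.0, end - start)  # 0 -> 1 across the fade window
    out[start:end] = original[start:end] * (1.0 - ramp) + stripped[start:end] * ramp
    out[end:] = stripped[end:]                 # after the fade, use the stripped track
    return out
```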
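And a sketch of the timestamp refinement step as exposed by the stable-ts API (treat the exact calls as an assumption about that library, not about TurnVoice's internals; model size and file name are placeholders):

```python
import stable_whisper

# Load a Whisper model through stable-ts (model size is just an example).
model = stable_whisper.load_model("base")

# Transcribe with word-level timestamps, then refine those timestamps.
result = model.transcribe("audio.wav")
model.refine("audio.wav", result)  # adjusts the word timestamps in place

# The refined word timings can then be read from the result segments.
for segment in result.segments:
    for word in segment.words:
        print(word.word, word.start, word.end)
```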
v0.0.50
- added --prepare to write a full script including text, speakers and timestamps:
  turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM --prepare
- added --render to read back such a script and generate the final video from it:
  turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM --render "downloads\my_video_name\full_script.txt"
- improved audio output quality
v0.0.45
v0.0.41
- now using deep-translator instead of NLLB-200-600M, so we no longer need the CC-BY-NC license and no longer have to download, load and unload a heavyweight translation model
(Deep-translator seems good to use for free. I have a better and more general solution roughly in mind; there are still some problems to solve, but I expect to make quite a significant upgrade here in the coming days.)
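For reference, basic deep-translator usage looks roughly like this (a sketch of the library's documented API, not of TurnVoice's exact integration; the language codes and sample text are examples):

```python
from deep_translator import GoogleTranslator

# 'auto' lets the backend detect the source language; target is the language to dub into.
translated = GoogleTranslator(source="auto", target="en").translate("Hallo Welt")
print(translated)  # e.g. "Hello World"
```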
v0.0.40
- added ElevenLabs, Azure, OpenAI TTS and System TTS as selectable synthesis engines
- added the possibility to feed a local video instead of a YouTube video
- added the possibility to replace multiple speaker voices at once (submit more than one voice)
- added the possibility to submit your own speaker timefiles (in the format of the generated speaker1.txt, speaker2.txt etc. timefiles) to fine-tune multi-speaker rendering
v0.0.30
- added lots of stuff to the algorithm:
  - we unload the transcription model completely from the GPU after the first main transcription
  - we then load the synthesis model into freshly cleaned VRAM and let it take as much VRAM as it wants, because this is our bottleneck
  - after the first synthesis we lazy-load the transcription model again
  - we can then transcribe the synthesis and verify it by measuring text distance (with Levenshtein and Jaro-Winkler)
  - and we can detect whether the model generated hallucinations using the transcription's word timestamps (sketches below, after this release's notes)
So with this we have:
=> a massive speed gain (5x)
=> way lower VRAM usage (because the huge transcription model gets removed from VRAM, and we also unload the translation model if used)
=> way more solid synthesis via verification (reducing hallucinations and strange artifact generation by retrying the synthesis)
We can now voiceturn a 20 min video on 8 GB of VRAM in ~33 min.
- added fades at the start and end of the synthesis since it gets trimmed, so we don't clip (see the sketch below)
- autostart the finished video after rendering
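A rough sketch of the VRAM juggling described in the list above, assuming PyTorch-backed models; the model loading calls in the comments are placeholders for whatever loads the transcription and synthesis engines in the real pipeline:

```python
import gc
import torch

def clear_gpu_memory() -> None:
    """Run garbage collection and release cached CUDA memory.
    Call this after dropping the last reference to a model."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Order of operations from the notes above (model loading itself is project-specific
# and omitted here):
#
#   transcript = transcription_model.transcribe(source_audio)  # 1. main transcription
#   del transcription_model                                     #    then unload it completely
#   clear_gpu_memory()
#
#   synthesis_model = ...                                       # 2. loads into clean VRAM and
#   audio = synthesis_model.synthesize(transcript)              #    may use as much as it wants
#
#   transcription_model = ...                                   # 3. lazily reloaded afterwards
#   check = transcription_model.transcribe(audio)               #    to verify the synthesis
```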
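The verification and hallucination checks could look roughly like this; the jellyfish library is an assumed stand-in for the Levenshtein and Jaro-Winkler measures mentioned above, and all thresholds are made up for illustration:

```python
import jellyfish

def synthesis_matches(target_text: str, transcribed_text: str,
                      max_lev_ratio: float = 0.2,
                      min_jaro_winkler: float = 0.85) -> bool:
    """Compare the text the TTS was asked to speak with what a re-transcription
    of the synthesized audio actually contains."""
    a = target_text.strip().lower()
    b = transcribed_text.strip().lower()
    lev_ratio = jellyfish.levenshtein_distance(a, b) / max(len(a), 1)
    jw = jellyfish.jaro_winkler_similarity(a, b)
    return lev_ratio <= max_lev_ratio and jw >= min_jaro_winkler

def looks_like_hallucination(words, expected_end: float, slack: float = 1.0) -> bool:
    """Flag a synthesis whose word timestamps run well past the expected duration
    or contain suspiciously long gaps. `words` holds objects with .start/.end in seconds."""
    if not words:
        return True
    if words[-1].end > expected_end + slack:
        return True
    gaps = (nxt.start - cur.end for cur, nxt in zip(words, words[1:]))
    return any(gap > 2.0 for gap in gaps)

# If the texts don't match or a hallucination is detected, the synthesis is retried.
```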
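And a tiny sketch of the edge fades on a trimmed synthesis, assuming a mono float numpy array (the fade length is an example value):

```python
import numpy as np

def apply_edge_fades(audio: np.ndarray, sample_rate: int, fade_ms: float = 15.0) -> np.ndarray:
    """Apply short linear fade-in and fade-out so a trimmed clip doesn't click."""
    n = min(int(sample_rate * fade_ms / 1000.0), len(audio) // 2)
    faded = audio.copy()
    if n > 0:
        faded[:n] *= np.linspace(0.0, 1.0, n)
        faded[-n:] *= np.linspace(1.0, 0.0, n)
    return faded
```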
v0.0.22
v0.0.20
- improved sync
  We now trim silence out of the synthesized audio before starting the voice speed matching algorithm. The Coqui engine inserts ~0.3 s (varies with speed) of silent audio at the end of the synthesis. That messed a bit with the transcription timestamps before, and this upgrade is a good step towards better-synced results (see the sketch below).
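A minimal sketch of that kind of trailing-silence trim, assuming a mono float numpy array and a simple amplitude threshold (the threshold and tail length are illustrative):

```python
import numpy as np

def trim_trailing_silence(audio: np.ndarray, sample_rate: int,
                          threshold: float = 1e-3, keep_ms: float = 50.0) -> np.ndarray:
    """Cut near-silent samples off the end of a synthesized clip, keeping a short
    tail, before running the voice speed matching step."""
    loud = np.flatnonzero(np.abs(audio) > threshold)
    if loud.size == 0:
        return audio
    end = min(loud[-1] + 1 + int(sample_rate * keep_ms / 1000.0), len(audio))
    return audio[:end]
```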