Releases: KoljaB/TurnVoice

v0.0.7

10 Oct 15:37
  • added error message for missing spleeter installation
  • upgraded every dependency to latest version
  • some updates to the README (CUDA 12.1, torch version, troubleshooting)

v0.0.65

20 Dec 20:09
  • added --faster parameter to select faster_whisper for timestamp transcription instead of stable_whisper (stable_whisper takes a lot of resources, especially on longer videos)
  • added --model parameter to select model for transcription. can be 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2', 'large-v3', or 'large'
  • updated to Coqui TTS v0.22.0 which enables access to 58 free predefined speaker voices

v0.0.60

18 Dec 16:11
  • switched from Deezer's Spleeter to Facebook's Demucs, reasons:

    • better vocal splitting quality
    • handles files longer than 10 minutes more reliably
  • added a crossfade algorithm to switch between original and vocal-stripped audio more seamlessly

  • now uses the stable_whisper timestamp refinement technique for more precise timestamp detection

  • new JavaScript Renderscript-Editor to fine-tune speaking timings, text and speaker assignment
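
The crossfade between the original and the vocal-stripped track can be pictured roughly as below. This is a minimal sketch, not TurnVoice's actual code: the function name `crossfade_switch`, the linear ramp, and the assumption of two equally long mono tracks held as NumPy arrays are all illustrative choices.

```python
import numpy as np


def crossfade_switch(original, stripped, switch_at, fade_len):
    """Blend from `original` to `stripped` around sample index `switch_at`.

    Hypothetical helper: assumes two mono tracks of equal length and a
    linear fade that is `fade_len` samples long.
    """
    if original.shape != stripped.shape:
        raise ValueError("both tracks must have the same length")

    # Centre the fade on the switch point and clamp it to the track bounds.
    start = max(0, switch_at - fade_len // 2)
    end = min(len(original), start + fade_len)

    # Weight of the stripped track: 0 before the fade, 1 after it,
    # and a linear ramp in between.
    weight = np.zeros(len(original))
    weight[end:] = 1.0
    weight[start:end] = np.linspace(0.0, 1.0, end - start, endpoint=False)

    return (1.0 - weight) * original + weight * stripped
```

Switching back from the vocal-stripped track to the original works the same way with the two arguments swapped.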

v0.0.50

15 Dec 20:25
  • added --prepare to write a full script including text, speakers and timestamps

    turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM --prepare
  • added --render to read back such a script and generate the final video from it:

    turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM --render "downloads\my_video_name\full_script.txt"
  • improved audio quality output

v0.0.45

12 Dec 23:31
  • added --prompt to change speaking style

Example:

turnvoice https://www.youtube.com/watch?v=K89dChsgznw --prompt "speaking style of captain jack sparrow"

v0.0.41

12 Dec 12:22
  • now using deep-translator instead of NLLB-200-600M, so we no longer need the CC-BY-NC license and no longer have to download, load and unload a heavyweight translation model

    (deep-translator seems fine to use for free. I think there is a much better and more general solution, which I roughly have in mind. There are still some problems to solve, but I expect to make a quite significant upgrade here in the coming days.)

v0.0.40

12 Dec 00:18
  • added ElevenLabs, Azure, OpenAI TTS and System TTS as synthesis engines to select from
  • added possibility to feed a local video instead of a YouTube video
  • added possibility to replace multiple speaker voices at once (submit more than one voice)
  • added possibility to submit your own speaker timefiles (in the format of the created speaker1.txt, speaker2.txt etc. timefiles) to fine-tune multiple speaker rendering

v0.0.30

08 Dec 00:30
  • added lots of stuff to the algorithm:

    • we unload the transcription model completely from the GPU after the first main transcription
    • we then load the synthesis model into freshly cleared VRAM and let it take as much VRAM as it wants, because synthesis is our bottleneck
    • after the first synthesis we lazy-load the transcription model again
    • we can then transcribe the synthesis and verify it by measuring text distance (Levenshtein and Jaro-Winkler)
    • and we can detect whether the model generates hallucinations using the transcription word timestamps

    So with this we have
    => a massive speed gain (x5)
    => way lower VRAM usage (because the huge transcription model gets removed from VRAM; we also unload the translation model if used)
    => way more solid synthesis via verification (retrying synthesis reduces hallucinations and strange artifacts)

    We can now voiceturn a 20 min video with 8 GB of VRAM in ~33 min

  • added fades at the start and end of the synthesis, since it gets trimmed, so we don't clip

  • autostart finished video after rendering
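
The verify-and-retry idea from the list above can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain Levenshtein distance (the Jaro-Winkler measure and the timestamp-based hallucination check are left out), and the names `similarity`, `synthesize_verified`, the `synthesize`/`transcribe` callables, the 0.8 threshold and the retry count are all hypothetical, not taken from TurnVoice.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(expected: str, heard: str) -> float:
    """Normalised similarity in [0, 1], ignoring case and extra whitespace."""
    a = " ".join(expected.lower().split())
    b = " ".join(heard.lower().split())
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def synthesize_verified(text, synthesize, transcribe,
                        threshold=0.8, max_attempts=3):
    """Retry synthesis until its transcription is close enough to `text`.

    `synthesize` and `transcribe` are caller-supplied callables (assumed
    interfaces). Returns the best attempt and its similarity score.
    """
    best_audio, best_score = None, -1.0
    for _ in range(max_attempts):
        audio = synthesize(text)
        score = similarity(text, transcribe(audio))
        if score > best_score:
            best_audio, best_score = audio, score
        if score >= threshold:
            break
    return best_audio, best_score
```

Keeping the best attempt means a usable result is still returned when no attempt reaches the threshold.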

v0.0.22

05 Dec 18:59
  • can translate now
  • cleaner CLI (takes video IDs, and -u is not needed anymore - this was my facebook "the" moment)

    turnvoice RK91Ji6GCZ8

v0.0.20

05 Dec 14:10
  • improved sync

    We now trim silence out of the synthesized audio before starting the voice speed matching algorithm. The Coqui engine inserts ~0.3s (varies with speed) of silent audio at the end of the synthesis. That previously threw off the transcription timestamps a bit, and this upgrade is a good step towards better synced results.
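
A minimal sketch of such a silence trim is shown below. The release note does not say how silence is detected, so the simple amplitude threshold, the default value of 0.01 and the function name `trim_silence` are assumptions made for illustration.

```python
import numpy as np


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below `threshold`.

    Hypothetical helper: assumes mono float audio in the range [-1, 1].
    """
    loud = np.flatnonzero(np.abs(samples) > threshold)
    if loud.size == 0:
        # Nothing above the threshold: the whole clip counts as silence.
        return samples[:0]
    return samples[loud[0]:loud[-1] + 1]
```

At a 16 kHz sample rate, the ~0.3s tail mentioned above is about 4800 samples, so a trimmed clip comes out that much shorter before any speed matching is computed.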