🏠Home

Audio

Compression

  • EnCodec SOTA deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio

Multiple Tasks

  • audio-webui A web-based UI for various audio-related neural networks, with features like text-to-audio, voice cloning, and automatic speech recognition using Bark, AudioLDM, AudioCraft, RVC, coqui-ai and Whisper
  • tts-generation-webui for all things TTS, currently supports Bark v2, MusicGen, Tortoise, Vocos
  • Speechbrain A PyTorch-based Speech Toolkit for TTS, STT, etc
  • Nvidia NeMo TTS, LLM, Audio Synthesis framework
  • speech-rest-api for Speech-To-Text and Text-To-Speech with Whisper and Speechbrain
  • LangHelper language learning through text-to-speech + ChatGPT + speech-to-text to practise speaking assessments, word memorization and listening tests
  • Silero-models pre-trained speech-to-text, text-to-speech and text-enhancement for ONNX, PyTorch, TensorFlow, SSML
  • AI-Waifu-Vtuber an AI Waifu Vtuber virtual streamer. Supports multiple languages and uses VoiceVox, DeepL, Whisper, Silero TTS, and VtubeStudio, and now also supports Twitch streaming.
  • Voicebox large-scale text-guided generative speech model using non-autoregressive flow-matching, paper, demo, pytorch implementation, implementation
  • Auto-Synced-Translated-Dubs Automatic YouTube video speech to text, translation, text to speech in order to dub a whole video
  • SeamlessM4T Foundational Models for SOTA Speech and Text Translation
  • Amphion a toolkit for Audio, Music, and Speech Generation supporting TTS, SVS, VC, SVC, TTA, TTM
  • voicefixer restore human speech regardless of how severely it is degraded
  • VoiceCraft clone and edit an unseen voice from a few seconds of example audio, with text-to-speech capabilities
  • audapolis an editor for spoken-word audio/video that works like a text editor, using speech recognition

Speech Recognition

voice activity detection (VAD):

  • Silero-VAD pre-trained enterprise-grade real-time Voice Activity Detector
  • libfvad fork of WebRTC VAD engine as a standalone library independent from other WebRTC features
  • voice_activity_detection Voice Activity Detection based on Deep Learning & TensorFlow
  • rVADfast unsupervised, robust voice activity detection

subtitle generation:

  • subtitler on-device web app for audio transcribing and rendering subtitles

TextToSpeech

Voice Conversion

Video Voice Dubbing

  • weeablind dub multilingual media using modern AI speech synthesis, diarization, and language identification
  • Auto-synced-translated-dubs YouTube audio translation and dubbing pipeline using Whisper speech-to-text, Google/DeepL Translate, Azure/Google TTS
  • videodubber dub video using GCP TTS, Translate, Whisper, spaCy tokenization and syllable counting
  • TranslatorYouTuber Takes a YouTube video, clones the voice and re-creates that video in a different language
  • global-video-dubbing Using Google Cloud Video Intelligence API with Cloud Translation API and Cloud Text to Speech API to generate voice dubbing and translations in many languages automatically
  • wav2lip Lip Syncing from audio
  • video-retalking Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
  • Wunjo AI Synthesize & clone voices in English, Russian & Chinese, real-time speech recognition, deepfake face & lips animation, face swap with one photo, change video by text prompts, segmentation, and retouching. Open-source, local & free
  • YouTranslate Takes a YouTube video, clones the voice with the ElevenLabs API, translates the text with the Google Translate API and re-creates that video in a different language
  • audio2photoreal Photoreal Embodiment by Synthesizing Humans including pose, hands and face in Conversations
  • TurnVoice Dubbing via CoquiTTS, ElevenLabs, OpenAI or Azure Voices, Translation, Speaking Style changer, Precise control via Editor, Background Audio Preservation

Music Generation

  • audiocraft library for audio processing and generation with deep learning using EnCodec compressor / tokenizer and MusicGen support
    • audiocraft-infinity-webui webui supporting generation longer than 30 seconds, song continuation, seed option, loading local models from chavinlo's training repo, macOS/Linux support, running on CPU/GPU
    • musicgen_trainer simple trainer for musicgen/audiocraft
    • audiocraft-webui basic webui with support for long audio, segmented audio and processing queue
    • audiocraft-webui another basic webui, unknown feature set
    • MusicGeneration a Streamlit GUI for audiocraft and musicgen
    • audiocraftgui with wxPython supporting continuous generation by using chunks and overlaps
    • MusicGen a simple and controllable model for music generation using a Transformer model, examples, colab, colab collection
    • audiocraft-infinity-webui generation length over 30 seconds, ability to continue songs, seeds, loading of local models
    • AudioCraft Plus an all-in-one WebUI for the original AudioCraft, adding multiband diffusion, continuation, custom model support, mono to stereo and more
  • AudioLDM Generate speech, sound effects, music and beyond, with text, code, paper, HF demo
  • StableAudio Stability AI's Stable Audio, providing only training and inference code, no models

Audio Source Separation

  • Separate Anything You Describe Describe what you want to isolate from audio, Language-queried audio source separation (LASS), paper
  • Hybrid-Net Real-time audio source separation, generate lyrics, chords, beat by lamucal.ai
  • TubeSplitter Web application to extract and separate audio stems from YouTube videos using Flask, pytube, and spleeter
  • demucs Hybrid Transformer based source separation
    • streamstem web app utilizing yt-dlp, spotify-api and demucs for an end to end audio source separation pipeline
    • moseca web app for Music Source Separation & Karaoke utilizing demucs
    • MISST native Windows GUI for demucs supporting YouTube, Spotify and files

Research

  • Vocos Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
  • WavJourney Compositional Audio Creation with LLMs, github
  • PromptingWhisper Audio-Visual Speech Recognition, Code-Switched Speech Recognition, and Zero-Shot Speech Translation for Whisper
  • Translatotron 3 Unsupervised speech-to-speech translation from monolingual data

Benchmarks