The Voice AI Index

https://voice.kymatalabs.com
The living index of voice & speech AI tooling: TTS, speech recognition, voice cloning, realtime voice agents, toolkits.

ggml-org/whisper.cpp (momentum 83)
https://voice.kymatalabs.com/p/ggml-org-whisper-cpp/
Port of OpenAI's Whisper model in C/C++

fishaudio/fish-speech (momentum 81)
https://voice.kymatalabs.com/p/fishaudio-fish-speech/
SOTA Open Source TTS

OpenBMB/VoxCPM (momentum 80)
https://voice.kymatalabs.com/p/openbmb-voxcpm/
VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

cjpais/Handy (momentum 80)
https://voice.kymatalabs.com/p/cjpais-handy/
A free, open source, and extensible speech-to-text application that works completely offline.

debpalash/OmniVoice-Studio (momentum 80)
https://voice.kymatalabs.com/p/debpalash-omnivoice-studio/
The open-source ElevenLabs alternative for local voice cloning, design, create, dubbing and dictation Desktop App

MisoLabsAI/MisoTTS (momentum 80)
https://voice.kymatalabs.com/p/misolabsai-misotts/
Miso TTS is an 8 billion, highly emotive text-to-speech model

index-tts/index-tts (momentum 79)
https://voice.kymatalabs.com/p/index-tts-index-tts/
An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

DrewThomasson/ebook2audiobook (momentum 79)
https://voice.kymatalabs.com/p/drewthomasson-ebook2audiobook/
Generate audiobooks from e-books, voice cloning & 1158+ languages!

Huanshere/VideoLingo (momentum 79)
https://voice.kymatalabs.com/p/huanshere-videolingo/
Netflix-level subtitle cutting, translation, alignment, and even dubbing - one-click fully automated AI video subtitle team

k2-fsa/OmniVoice (momentum 79)
https://voice.kymatalabs.com/p/k2-fsa-omnivoice/
High-Quality Voice Cloning TTS for 600+ Languages

m-bain/whisperX (momentum 78)
https://voice.kymatalabs.com/p/m-bain-whisperx/
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

jianchang512/pyvideotrans (momentum 78)
https://voice.kymatalabs.com/p/jianchang512-pyvideotrans/
Translate the video from one language to another and embed dubbing & subtitles.

modelscope/FunASR (momentum 78)
https://voice.kymatalabs.com/p/modelscope-funasr/
Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.

NVIDIA-NeMo/NeMo (momentum 78)
https://voice.kymatalabs.com/p/nvidia-nemo-nemo/
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

openai/whisper (momentum 77)
https://voice.kymatalabs.com/p/openai-whisper/
Robust Speech Recognition via Large-Scale Weak Supervision

leon-ai/leon (momentum 77)
https://voice.kymatalabs.com/p/leon-ai-leon/
🧠 Leon is your open-source personal assistant.

k2-fsa/sherpa-onnx (momentum 77)
https://voice.kymatalabs.com/p/k2-fsa-sherpa-onnx/
Speech-to-text, text-to-speech, speaker diarization, speech enhancement, source separation, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, HarmonyOS, Raspberry Pi, RISC-V, RK NPU, Axera NPU, Ascend NPU, x86_64 servers, websocket ser

pipecat-ai/pipecat (momentum 77)
https://voice.kymatalabs.com/p/pipecat-ai-pipecat/
Open Source framework for voice and multimodal conversational AI

PaddlePaddle/PaddleSpeech (momentum 77)
https://voice.kymatalabs.com/p/paddlepaddle-paddlespeech/
Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.

Open-Less/openless (momentum 77)
https://voice.kymatalabs.com/p/open-less-openless/
Hold a key, speak, release: AI-polished text appears at your cursor in any app. Open-source voice input for macOS & Windows.

RVC-Boss/GPT-SoVITS (momentum 76)
https://voice.kymatalabs.com/p/rvc-boss-gpt-sovits/
1 min voice data can also be used to train a good TTS model! (few shot voice cloning)

FunAudioLLM/CosyVoice (momentum 76)
https://voice.kymatalabs.com/p/funaudiollm-cosyvoice/
Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.

alphacep/vosk-api (momentum 76)
https://voice.kymatalabs.com/p/alphacep-vosk-api/
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Zackriya-Solutions/meetily (momentum 76)
https://voice.kymatalabs.com/p/zackriya-solutions-meetily/
Privacy first, AI meeting assistant with 4x faster Parakeet/Whisper live transcription, speaker diarization, and Ollama summarization built on Rust. 100% local processing. No cloud required. Meetily (Meetly Ai - https://meetily.ai) is the #1 Self-hosted, Open-source Ai meeting note taker for macOS

livekit/agents (momentum 76)
https://voice.kymatalabs.com/p/livekit-agents/
A framework for building realtime voice AI agents 🤖🎙️📹

TEN-framework/ten-framework (momentum 76)
https://voice.kymatalabs.com/p/ten-framework-ten-framework/
Open-source framework for conversational voice AI agents

QuentinFuxa/WhisperLiveKit (momentum 76)
https://voice.kymatalabs.com/p/quentinfuxa-whisperlivekit/
Simultaneous speech-to-text models

pyannote/pyannote-audio (momentum 76)
https://voice.kymatalabs.com/p/pyannote-pyannote-audio/
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

KoljaB/RealtimeSTT (momentum 76)
https://voice.kymatalabs.com/p/koljab-realtimestt/
A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.

espnet/espnet (momentum 76)
https://voice.kymatalabs.com/p/espnet-espnet/
End-to-End Speech Processing Toolkit