Discover and install AI models from our curated collection
VoxCPM 1.5 is an end-to-end text-to-speech (TTS) model from ModelBest. It features zero-shot voice cloning and high-quality speech synthesis capabilities.
Links
Tags
NeuTTS Air is the world's first super-realistic, on-device TTS speech language model with instant voice cloning. Built on a 0.5B LLM backbone, it brings natural-sounding speech, real-time performance, and speaker cloning to local devices.
Links
Tags
Qwen3-TTS-12Hz-1.7B-CustomVoice via vLLM-Omni - Text-to-speech model from Alibaba Qwen team with custom voice cloning capabilities. Generates natural-sounding speech with voice personalization.
Links
Tags

VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single reference voice prompt. Default voice prompt: en-Carter_man.
Links
Tags
Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp). Native C++ text-to-speech with streaming output and zero-shot voice cloning (set `voice` to a 24kHz reference .wav). 24kHz mono, 11 languages with Mandarin dialects. Q8_0 (~0.95 GB talker).
Links
Tags
Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~0.6 GB talker). Streaming + voice cloning, 24kHz mono, 11 languages.
Links
Tags
Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q8_0 (~2.0 GB talker). Higher-quality streaming + voice cloning, 24kHz mono, 11 languages.
Links
Tags
Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~1.2 GB talker). Streaming + voice cloning, 24kHz mono, 11 languages.
Links
Tags
OmniVoice (C++ / GGML) - native text-to-speech with voice cloning and voice design. 24kHz mono output, 646 languages, streaming synthesis. Q8_0 GGUFs (~945 MB total): 612M Qwen3 backbone + RVQ audio codec.
Links
Tags
OmniVoice (C++ / GGML), BF16 high-quality variant - text-to-speech with voice cloning and voice design. 24kHz mono, 646 languages, streaming. BF16 GGUFs (~1.6 GB total).
Links
Tags

Fish Speech S2-Pro is a high-quality text-to-speech model supporting voice cloning via reference audio. Uses a two-stage pipeline: text to semantic tokens (LLaMA-based) then semantic to audio (DAC decoder).
Links
Tags