LocalAI - Models

vibevoice-cpp

VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single reference voice prompt. Default voice prompt: en-Carter_man.
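A quick way to confirm that synthesized audio matches the stated 24kHz mono format is to read the file header. The sketch below assumes the backend hands back a WAV container (the description does not say which container is used); `make_silence` is a hypothetical stand-in for real TTS output so the check can run offline.

```python
import io
import wave

EXPECTED_RATE = 24_000   # 24kHz, per the model description
EXPECTED_CHANNELS = 1    # mono


def describe_wav(data: bytes) -> dict:
    """Read a WAV header and return sample rate, channel count, and duration."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {"rate": rate, "channels": channels, "seconds": frames / rate}


def matches_vibevoice_format(data: bytes) -> bool:
    """True when the audio is 24kHz mono, as the gallery entry states."""
    info = describe_wav(data)
    return info["rate"] == EXPECTED_RATE and info["channels"] == EXPECTED_CHANNELS


def make_silence(seconds: float, rate: int = EXPECTED_RATE) -> bytes:
    """Build a silent 16-bit mono WAV (stand-in for real TTS output)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

In practice `data` would be the bytes returned by the TTS request rather than generated silence.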

vibevoice-cpp-asr

VibeVoice ASR 7B (C++ / GGML, Q4_K) - long-form speech-to-text with speaker diarization. Returns per-speaker JSON segments with start/end timestamps. English-only. ~10 GB download.
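Per-speaker segments with timestamps lend themselves to simple post-processing. The sketch below is illustrative only: the field names `speaker`, `start`, `end`, and `text` are assumptions, since the description does not document the exact JSON schema.

```python
import json
from collections import defaultdict

# Hypothetical payload: field names are assumed for illustration and may
# differ from what the backend actually returns.
SAMPLE = """[
  {"speaker": "0", "start": 0.0, "end": 4.2, "text": "Welcome back."},
  {"speaker": "1", "start": 4.2, "end": 9.0, "text": "Thanks for having me."},
  {"speaker": "0", "start": 9.0, "end": 11.5, "text": "Let's begin."}
]"""


def talk_time_by_speaker(segments):
    """Sum (end - start) in seconds for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def transcript_lines(segments):
    """Render segments as timestamped lines, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s["start"])
    return [
        f'[{s["start"]:.1f}-{s["end"]:.1f}] speaker {s["speaker"]}: {s["text"]}'
        for s in ordered
    ]


segments = json.loads(SAMPLE)
```

Swapping `SAMPLE` for a real response only requires adjusting the key names to the actual schema.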

rfdetr-cpp-nano

RF-DETR Nano object detection model, served via the native rfdetr.cpp backend (ggml + purego, no Python dependencies). Q8_0 quantization is the recommended default for CPU: same accuracy as F16/F32, ~20MB on disk, and the fastest CPU latency. Drop-in for the /v1/detection endpoint.

locate-anything-3b

NVIDIA LocateAnything-3B open-vocabulary object detection (visual grounding), served via the native locate-anything.cpp backend (C++/ggml + purego, no Python). Describe what to find in a text prompt and get labeled boxes back; separate multiple categories with a period (.). Q8_0 is the recommended default: box-identical to F16/F32, ~6.3GB, and the fastest CPU latency. Drop-in for the /v1/detection endpoint (pass the prompt in the request).
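A request to the detection endpoint might be assembled as follows. This is a minimal sketch, not the documented client: the `model` and `image` fields follow the usual LocalAI detection request shape, while the `prompt` field name, the exact spacing around the period separator, the base URL, and the image URL are all assumptions made for illustration.

```python
import json
import urllib.request

# Assumed separator: the description says categories are separated with a
# period; the surrounding spaces are a guess.
CATEGORY_SEPARATOR = " . "


def build_detection_request(base_url, model, image, categories=None):
    """Build (but do not send) a POST request for /v1/detection."""
    body = {"model": model, "image": image}
    if categories:
        # "prompt" is a hypothetical field name for the text prompt.
        body["prompt"] = CATEGORY_SEPARATOR.join(categories)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/detection",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_detection_request(
    "http://localhost:8080",               # assumed local instance
    "locate-anything-3b",
    "https://example.com/street.jpg",      # placeholder image URL
    categories=["person", "bicycle"],
)
```

Against a running instance, `urllib.request.urlopen(request)` would send it; omitting `categories` gives the prompt-free form that a closed-vocabulary detector such as rfdetr-cpp-nano would take.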

depth-anything-3-base

Depth Anything 3 (base) monocular metric depth + camera pose, served via the native depth-anything.cpp backend (C++/ggml + purego, no Python at inference). Given an image, it returns a dense depth map plus the recovered camera extrinsics (3x4) and intrinsics (3x3). Use GenerateImage (src -> normalized depth PNG at dst) or Predict (JSON depth stats + pose). Q4_K is the recommended CPU default.
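With a depth value and the 3x3 intrinsics, a pixel can be lifted to a 3D point using the standard pinhole model. The sketch below assumes the depth is measured along the optical axis (z-depth) and that the 3x4 extrinsics are a world-to-camera matrix [R|t]; neither convention is stated in the description, so both should be checked against real output. The matrices are made-up example values.

```python
def unproject(u, v, depth, K):
    """Lift pixel (u, v) with z-depth `depth` into camera coordinates."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def camera_to_world(p_cam, Rt):
    """Invert x_cam = R @ x_world + t, i.e. x_world = R^T @ (x_cam - t).

    Assumes Rt is a 3x4 world-to-camera matrix [R|t] (an assumption).
    """
    d = [p_cam[i] - Rt[i][3] for i in range(3)]
    return tuple(sum(Rt[r][c] * d[r] for r in range(3)) for c in range(3))


# Example values, not taken from a real prediction.
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
Rt = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 1.0]]

point_cam = unproject(420, 240, 2.0, K)
point_world = camera_to_world(point_cam, Rt)
```

Applying `unproject` to every pixel of the depth map yields a point cloud in the camera frame.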

rfdetr-cpp-small

RF-DETR Small object detection model (DINOv2-small backbone, 512px input, 3 decoder layers), served via the native rfdetr.cpp backend (ggml + purego, no Python). A step up from Nano in accuracy while staying lightweight on CPU. F16 quantization is the recommended default: identical accuracy to F32 at roughly half the size. Drop-in for the /v1/detection endpoint.

wan-2.1-t2v-1.3b-ggml

Wan 2.1 T2V 1.3B - text-to-video diffusion model, GGUF-quantized for the stable-diffusion.cpp backend. Generates short (33-frame) 832x480 clips from a text prompt. The cheapest Wan variant to run, suitable for CPU-offloaded inference with ~10 GB of usable RAM.
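The 33-frame, 832x480 default fits the shape constraints of a video VAE. As a hedged illustration, assuming Wan 2.1's VAE compresses 4x in time (so frame counts take the form 4n+1) and 8x in each spatial dimension, as described upstream, the latent grid the diffusion model works on can be computed like this:

```python
TEMPORAL_STRIDE = 4   # assumed VAE temporal compression
SPATIAL_STRIDE = 8    # assumed VAE spatial compression per axis


def latent_shape(frames, width, height):
    """Return (latent_frames, latent_height, latent_width) for a clip."""
    if (frames - 1) % TEMPORAL_STRIDE != 0:
        raise ValueError(f"frame count must be {TEMPORAL_STRIDE}n+1, got {frames}")
    if width % SPATIAL_STRIDE or height % SPATIAL_STRIDE:
        raise ValueError(f"width and height must be multiples of {SPATIAL_STRIDE}")
    return (
        (frames - 1) // TEMPORAL_STRIDE + 1,
        height // SPATIAL_STRIDE,
        width // SPATIAL_STRIDE,
    )


default_latents = latent_shape(33, 832, 480)
```

Under these assumptions a request for, say, 32 frames would be rejected, which is one reason the gallery defaults use 33.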

wan-2.1-i2v-14b-480p-ggml

Wan 2.1 I2V 14B 480P - image-to-video diffusion, GGUF Q4 quantization. Animates a reference image into a 33-frame 480p clip. Requires more RAM than the 1.3B T2V variant; CPU offload is enabled by default.

Links

https://huggingface.co/city96/Wan2.1-I2V-14B-480P-gguf

wan-2.1-flf2v-14b-720p-ggml

Wan 2.1 FLF2V 14B 720P - first-last-frame-to-video diffusion, GGUF Q4_K_M. Takes a start and an end reference image and interpolates a 33-frame clip between them. Unlike the plain I2V variant, this model also feeds the end frame through clip_vision, so it conditions semantically (not just in pixel space) on both endpoints. That makes it the right choice for seamless loops (start_image == end_image) and clean narrative cuts. Native 720p, but it also accepts 480p resolutions; shares the same VAE, umt5_xxl text encoder, and clip_vision_h as I2V 14B.

Links

https://huggingface.co/city96/Wan2.1-FLF2V-14B-720P-gguf

wan-2.1-i2v-14b-720p-ggml

Wan 2.1 I2V 14B 720P - image-to-video diffusion, GGUF Q4_K_M. Native 720p sibling of the 480p I2V model: animates a single reference image into a 33-frame clip at up to 1280x720. Trained purely as image-to-video (no first-last-frame interpolation path), so motion is freer and better suited to single-anchor animation than repurposing the FLF2V 720P variant for I2V. Shares the same VAE, umt5_xxl text encoder, and clip_vision_h as the I2V 14B 480P and FLF2V 14B 720P entries.

Links

https://huggingface.co/city96/Wan2.1-I2V-14B-720P-gguf

sd-1.5-ggml

Stable Diffusion 1.5

Links

https://huggingface.co/second-state/stable-diffusion-v1-5-GGUF

sd-3.5-medium-ggml

Stable Diffusion 3.5 Medium is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.

sd-3.5-large-ggml

Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.

flux.1-dev-ggml

FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, see the Black Forest Labs blog post. Key features: cutting-edge output quality, second only to the state-of-the-art FLUX.1 [pro] model; competitive prompt following, matching the performance of closed-source alternatives; trained using guidance distillation, making FLUX.1 [dev] more efficient; open weights to drive new scientific research and empower artists to develop innovative workflows; generated outputs can be used for personal, scientific, and commercial purposes as described in the flux-1-dev-non-commercial-license. This entry uses GGUF-quantized weights.

flux.1-dev-ggml-q8_0

FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, see the Black Forest Labs blog post. Key features: cutting-edge output quality, second only to the state-of-the-art FLUX.1 [pro] model; competitive prompt following, matching the performance of closed-source alternatives; trained using guidance distillation, making FLUX.1 [dev] more efficient; open weights to drive new scientific research and empower artists to develop innovative workflows; generated outputs can be used for personal, scientific, and commercial purposes as described in the flux-1-dev-non-commercial-license. This entry is the Q8_0 GGUF quantization.

flux.1-dev-ggml-abliterated-v2-q8_0

An abliterated (v2) variant of FLUX.1 [dev], provided as a Q8_0 GGUF quantization.

flux.1-krea-dev-ggml

FLUX.1 Krea [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, see the Black Forest Labs blog post and Krea's blog post. Key features: cutting-edge output quality, with a focus on aesthetic photography; competitive prompt following, matching the performance of closed-source alternatives; trained using guidance distillation, making FLUX.1 Krea [dev] more efficient; open weights to drive new scientific research and empower artists to develop innovative workflows; generated outputs can be used for personal, scientific, and commercial purposes, as described in the flux-1-dev-non-commercial-license.

flux.1-krea-dev-ggml-q8_0

FLUX.1 Krea [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, see the Black Forest Labs blog post and Krea's blog post. Key features: cutting-edge output quality, with a focus on aesthetic photography; competitive prompt following, matching the performance of closed-source alternatives; trained using guidance distillation, making FLUX.1 Krea [dev] more efficient; open weights to drive new scientific research and empower artists to develop innovative workflows; generated outputs can be used for personal, scientific, and commercial purposes, as described in the flux-1-dev-non-commercial-license. This entry is the Q8_0 GGUF quantization.

ideogram-4-iq4nl-ggml

Ideogram 4 is a text-to-image diffusion model known for state-of-the-art prompt adherence and accurate text rendering inside images. It is driven by a Qwen3-VL-8B text encoder and performs real classifier-free guidance from a separate unconditional diffusion model. This is the IQ4_NL (4-bit) quantization, a good balance of quality and footprint (~5.8GB diffusion + ~5.8GB unconditional). The bundle also pulls the Qwen3-VL-8B-Instruct text encoder and the FLUX.2 VAE. Quantized GGUF weights by stduhpf for use with stable-diffusion.cpp.

ideogram-4-q8_0-ggml

Ideogram 4 is a text-to-image diffusion model known for state-of-the-art prompt adherence and exceptional, accurate text rendering inside images. It is driven by a Qwen3-VL-8B text encoder and performs real classifier-free guidance from a separate unconditional diffusion model. This is the Q8_0 (8-bit) quantization for highest quality (~10.1GB diffusion + ~10.1GB unconditional). The bundle also pulls the Qwen3-VL-8B-Instruct text encoder and the FLUX.2 VAE. Quantized GGUF weights by stduhpf for use with stable-diffusion.cpp.
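Because classifier-free guidance here relies on a separate unconditional model, each bundle carries two sets of diffusion weights. The short sketch below totals the sizes quoted in the two Ideogram 4 entries; it deliberately leaves out the text encoder and VAE, whose sizes the descriptions do not state, so the result is a lower bound on the download.

```python
# Sizes in GB, copied from the gallery descriptions (approximate values).
BUNDLES_GB = {
    "ideogram-4-iq4nl-ggml": {"diffusion": 5.8, "unconditional": 5.8},
    "ideogram-4-q8_0-ggml": {"diffusion": 10.1, "unconditional": 10.1},
}


def diffusion_weights_gb(name):
    """Total size of the diffusion + unconditional weights for a bundle."""
    return round(sum(BUNDLES_GB[name].values()), 1)


def extra_cost_of_q8_gb():
    """How much larger the Q8_0 diffusion weights are than IQ4_NL."""
    return round(
        diffusion_weights_gb("ideogram-4-q8_0-ggml")
        - diffusion_weights_gb("ideogram-4-iq4nl-ggml"),
        1,
    )
```

On these figures the Q8_0 bundle needs roughly 8.6GB more disk for the diffusion weights alone.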

whisper-1

Port of OpenAI's Whisper model in C/C++
