Model Gallery

Discover and install AI models from our curated collection

41 models available
1 repositories
Documentation

Find Your Perfect Model

Filter by Model Type

Browse by Tags

vibevoice-cpp
VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single reference voice prompt. Default voice prompt: en-Carter_man.

Repository: localaiLicense: mit

vibevoice-cpp-asr
VibeVoice ASR 7B (C++ / GGML, Q4_K) - long-form speech-to-text with speaker diarization. Returns per-speaker JSON segments with start/end timestamps. English-only. ~10 GB download.

Repository: localaiLicense: mit

rfdetr-cpp-nano
RF-DETR Nano object detection model, served via the native rfdetr.cpp backend (ggml + purego, no Python). Q8_0 quantization is the recommended default for CPU: same accuracy as F16/F32, ~20MB on disk, fastest CPU latency. Pure C++/ggml runtime; no Python dependencies. Drop-in for the /v1/detection endpoint.

Repository: localaiLicense: apache-2.0

locate-anything-3b
NVIDIA LocateAnything-3B open-vocabulary object detection (visual grounding), served via the native locate-anything.cpp backend (C++/ggml + purego, no Python). Describe what to find in a text prompt and get labeled boxes back; separate multiple categories with . Q8_0 is the recommended default: box-identical to F16/F32, ~6.3GB, fastest CPU latency. Drop-in for the /v1/detection endpoint (pass the prompt).

Repository: localaiLicense: other

depth-anything-3-base
Depth Anything 3 (base) monocular metric depth + camera pose, served via the native depth-anything.cpp backend (C++/ggml + purego, no Python at inference). Given an image it returns a dense depth map plus the recovered camera extrinsics (3x4) and intrinsics (3x3). Use GenerateImage (src -> normalized depth PNG at dst) or Predict (JSON depth stats + pose). q4_k is the recommended CPU default.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-small
RF-DETR Small object detection model (DINOv2-small backbone, 512px input, 3 decoder layers), served via the native rfdetr.cpp backend (ggml + purego, no Python). A step up from Nano in accuracy while staying lightweight on CPU. F16 quantization is the recommended default: identical accuracy to F32 at roughly half the size. Drop-in for the /v1/detection endpoint.

Repository: localaiLicense: apache-2.0

wan-2.1-t2v-1.3b-ggml
Wan 2.1 T2V 1.3B — text-to-video diffusion model, GGUF-quantized for the stable-diffusion.cpp backend. Generates short (33-frame) 832x480 clips from a text prompt. Cheapest Wan variant, suitable for CPU-offloaded inference with ~10 GB of usable RAM.

Repository: localaiLicense: apache-2.0

wan-2.1-i2v-14b-480p-ggml
Wan 2.1 I2V 14B 480P — image-to-video diffusion, GGUF Q4 quantization. Animates a reference image into a 33-frame 480p clip. Requires more RAM than the 1.3B T2V variant; CPU offload enabled by default.

Repository: localaiLicense: apache-2.0

wan-2.1-flf2v-14b-720p-ggml
Wan 2.1 FLF2V 14B 720P — first-last-frame-to-video diffusion, GGUF Q4_K_M. Takes a start and end reference image and interpolates a 33-frame clip between them. Unlike the plain I2V variant this model feeds the end frame through clip_vision as well, so it conditions semantically (not just in pixel-space) on both endpoints. That makes it the right choice for seamless loops (start_image == end_image) and clean narrative cuts. Native 720p but accepts 480p resolutions; shares the same VAE, t5xxl text encoder, and clip_vision_h as I2V 14B.

Repository: localaiLicense: apache-2.0

wan-2.1-i2v-14b-720p-ggml
Wan 2.1 I2V 14B 720P — image-to-video diffusion, GGUF Q4_K_M. Native 720p sibling of the 480p I2V model: animates a single reference image into a 33-frame clip at up to 1280x720. Trained purely as image-to-video (no first-last-frame interpolation path), so motion is freer and better-suited to single-anchor animation than repurposing the FLF2V 720P variant for i2v. Shares the same VAE, umt5_xxl text encoder, and clip_vision_h as the I2V 14B 480P and FLF2V 14B 720P entries.

Repository: localaiLicense: apache-2.0

sd-1.5-ggml
Stable Diffusion 1.5

Repository: localaiLicense: creativeml-openrail-m

sd-3.5-medium-ggml
Stable Diffusion 3.5 Medium is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.

Repository: localaiLicense: stabilityai-ai-community

sd-3.5-large-ggml
Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.

Repository: localaiLicense: stabilityai-ai-community

flux.1-dev-ggml
FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, please read our blog post. Key Features Cutting-edge output quality, second only to our state-of-the-art model FLUX.1 [pro]. Competitive prompt following, matching the performance of closed source alternatives . Trained using guidance distillation, making FLUX.1 [dev] more efficient. Open weights to drive new scientific research, and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes as described in the flux-1-dev-non-commercial-license. This model is quantized with GGUF

Repository: localaiLicense: flux-1-dev-non-commercial-license

flux.1-dev-ggml-q8_0
FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, please read our blog post. Key Features Cutting-edge output quality, second only to our state-of-the-art model FLUX.1 [pro]. Competitive prompt following, matching the performance of closed source alternatives . Trained using guidance distillation, making FLUX.1 [dev] more efficient. Open weights to drive new scientific research, and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes as described in the flux-1-dev-non-commercial-license.

Repository: localaiLicense: flux-1-dev-non-commercial-license

flux.1-dev-ggml-abliterated-v2-q8_0
FLUX.1 [dev] is an abliterated version of FLUX.1 [dev]

Repository: localaiLicense: flux-1-dev-non-commercial-license

flux.1-krea-dev-ggml
FLUX.1 Krea [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, please read our blog post and Krea's blog post. Cutting-edge output quality, with a focus on aesthetic photography. Competitive prompt following, matching the performance of closed source alternatives. Trained using guidance distillation, making FLUX.1 Krea [dev] more efficient. Open weights to drive new scientific research, and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes, as described in the flux-1-dev-non-commercial-license.

Repository: localaiLicense: flux-1-dev-non-commercial-license

flux.1-krea-dev-ggml-q8_0
FLUX.1 Krea [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, please read our blog post and Krea's blog post. Cutting-edge output quality, with a focus on aesthetic photography. Competitive prompt following, matching the performance of closed source alternatives. Trained using guidance distillation, making FLUX.1 Krea [dev] more efficient. Open weights to drive new scientific research, and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes, as described in the flux-1-dev-non-commercial-license.

Repository: localaiLicense: flux-1-dev-non-commercial-license

ideogram-4-iq4nl-ggml
Ideogram 4 is a text-to-image diffusion model known for state-of-the-art prompt adherence and exceptional, accurate text rendering inside images. It is driven by a Qwen3-VL-8B text encoder and performs real classifier-free guidance from a separate unconditional diffusion model. This is the iQ4_NL (4-bit) quantization, a good balance of quality and footprint (~5.8GB diffusion + ~5.8GB unconditional). The bundle also pulls the Qwen3-VL-8B-Instruct text encoder and the FLUX.2 VAE. Quantized GGUF weights by stduhpf for use with stable-diffusion.cpp.

Repository: localaiLicense: ideogram-non-commercial-model-agreement

ideogram-4-q8_0-ggml
Ideogram 4 is a text-to-image diffusion model known for state-of-the-art prompt adherence and exceptional, accurate text rendering inside images. It is driven by a Qwen3-VL-8B text encoder and performs real classifier-free guidance from a separate unconditional diffusion model. This is the Q8_0 (8-bit) quantization for highest quality (~10.1GB diffusion + ~10.1GB unconditional). The bundle also pulls the Qwen3-VL-8B-Instruct text encoder and the FLUX.2 VAE. Quantized GGUF weights by stduhpf for use with stable-diffusion.cpp.

Repository: localaiLicense: ideogram-non-commercial-model-agreement

whisper-1
Port of OpenAI's Whisper model in C/C++

Repository: localaiLicense: mit

Page 1 of many