Backend Management

Discover and install AI backends to power your models

542 backends available

llama-cpp
LLM inference in C/C++

Repository: localai · License: MIT
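Backend selection happens per model in LocalAI. As an illustrative sketch (the model name, file name, and context size below are placeholders, not shipped defaults), a model configuration pinning this backend could look like:

```yaml
# Hypothetical LocalAI model definition pinning the llama-cpp backend.
# All values below are placeholders for illustration.
name: my-llama-model
backend: llama-cpp
context_size: 4096
parameters:
  model: my-model.Q4_K_M.gguf   # GGUF weights placed in the models directory
```

LocalAI then serves this model through its OpenAI-compatible API under the configured name.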

ik-llama-cpp
Fork of llama.cpp by ikawrakow, optimized for CPU performance

Repository: localai · License: MIT

turboquant
Fork of llama.cpp adding the TurboQuant KV-cache quantization scheme. Reuses the LocalAI llama.cpp gRPC server sources against the fork's libllama.

Repository: localai · License: MIT

whisper
Port of OpenAI's Whisper model in C/C++

Repository: localai · License: MIT

voxtral
Pure C speech-to-text inference engine for the Voxtral Realtime 4B model

Repository: localai · License: MIT

stablediffusion-ggml
Stable Diffusion and Flux in pure C/C++

Repository: localai · License: MIT

rfdetr
RF-DETR is a real-time, transformer-based object detection model architecture developed by Roboflow and released under the Apache 2.0 license. RF-DETR is the first real-time model to exceed 60 AP on the Microsoft COCO benchmark while remaining competitive at base sizes. It also achieves state-of-the-art performance on RF100-VL, an object detection benchmark that measures model domain adaptability to real-world problems. RF-DETR is the fastest and most accurate model for its size when compared to current real-time object detection models. It is small enough to run on the edge using Inference, making it an ideal model for deployments that need both strong accuracy and real-time performance.

Repository: localai · License: Apache-2.0

sam3-cpp
Segment Anything Model (SAM 3/2/EdgeTAM) in C/C++ using GGML. Supports text-prompted and point/box-prompted image segmentation.

Repository: localai · License: MIT

vllm
vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graphs
- Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill

Repository: localai · License: Apache-2.0
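Python-based backends follow the same per-model configuration pattern. As a hedged sketch (the model name and Hugging Face repo id are placeholders, not recommendations), routing a model to this backend might look like:

```yaml
# Hypothetical LocalAI model definition routing a Hugging Face model to vLLM.
# The repo id below is a placeholder for illustration.
name: my-vllm-model
backend: vllm
parameters:
  model: some-org/some-model
```

Since vLLM loads weights directly from the Hugging Face Hub, `parameters.model` here takes a repo id rather than a local GGUF file.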

vllm-omni
vLLM-Omni is a unified interface for multimodal generation with vLLM. It supports image generation (text-to-image, image editing), video generation (text-to-video, image-to-video), text generation with multimodal inputs, and text-to-speech generation. Only supports NVIDIA (CUDA) and ROCm platforms.

Repository: localai · License: Apache-2.0

mlx
Run LLMs with MLX

Repository: localai · License: MIT

mlx-vlm
Run Vision-Language Models with MLX

Repository: localai · License: MIT

mlx-audio
Run Audio Models with MLX

Repository: localai · License: MIT

mlx-distributed
Run distributed LLM inference with MLX across multiple Apple Silicon Macs

Repository: localai · License: MIT

rerankers
Run reranker models that score query-document relevance for retrieval pipelines

Repository: localai

tinygrad
tinygrad is a minimalist deep-learning framework with zero runtime dependencies that targets CUDA, ROCm, Metal, WebGPU and CPU (CLANG). The LocalAI tinygrad backend exposes a single multimodal runtime that covers LLM text generation (Llama / Qwen / Mistral via safetensors or GGUF) with native tool-call extraction, BERT-family embeddings, Stable Diffusion 1.x / 2 / XL image generation, and Whisper speech-to-text. A single container image covers all of this: tinygrad generates its own GPU kernels and dlopens the host driver libraries at runtime, so there is no per-toolkit build split. The same image runs CPU-only or accelerates against CUDA / ROCm / Metal when the host driver is visible.

Repository: localai · License: MIT

transformers
Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer vision, audio, video, and multimodal domains, for both inference and training. It centralizes the model definition so that this definition is agreed upon across the ecosystem. transformers is the pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, PyTorch-Lightning, ...), inference engines (vLLM, SGLang, TGI, ...), and adjacent modeling libraries (llama.cpp, mlx, ...) which leverage the model definition from transformers.

Repository: localai · License: Apache-2.0

diffusers
🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both.

Repository: localai · License: Apache-2.0

ace-step
ACE-Step 1.5 is an open-source music generation model. It supports simple mode (natural language description) and advanced mode (caption, lyrics, think, bpm, keyscale, etc.). Uses in-process acestep (LLMHandler for metadata, DiT for audio).

Repository: localai

ace-step-development
ACE-Step 1.5 is an open-source music generation model. It supports simple mode (natural language description) and advanced mode (caption, lyrics, think, bpm, keyscale, etc.). Uses in-process acestep (LLMHandler for metadata, DiT for audio).

Repository: localai

acestep-cpp
ACE-Step 1.5 C++ backend using GGML. Native C++ implementation of ACE-Step music generation with GPU support through GGML backends. Generates stereo 48kHz audio from text descriptions and optional lyrics via a two-stage pipeline: text-to-code (ace-qwen3 LLM) + code-to-audio (DiT-VAE).

Repository: localai
