Backend Management

Discover and install AI backends to power your models

542 backends available

llama-cpp
LLM inference in C/C++

Repository: localai · License: MIT
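Backend selection happens per model in LocalAI. As an illustrative sketch (the model name, file name, and context size below are placeholders, not shipped defaults), a model configuration pinning this backend could look like:

```yaml
# Hypothetical LocalAI model definition pinning the llama-cpp backend.
# All values below are placeholders for illustration.
name: my-llama-model
backend: llama-cpp
context_size: 4096
parameters:
  model: my-model.Q4_K_M.gguf   # GGUF weights placed in the models directory
```

LocalAI then serves this model through its OpenAI-compatible API under the configured name.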

ik-llama-cpp
Fork of llama.cpp by ikawrakow, optimized for CPU performance

Repository: localai · License: MIT

turboquant
Fork of llama.cpp adding the TurboQuant KV-cache quantization scheme. Reuses the LocalAI llama.cpp gRPC server sources against the fork's libllama.

Repository: localai · License: MIT

whisper
Port of OpenAI's Whisper model in C/C++

Repository: localai · License: MIT

voxtral
Pure C speech-to-text inference engine for the Voxtral Realtime 4B model

Repository: localai · License: MIT

stablediffusion-ggml
Stable Diffusion and Flux in pure C/C++

Repository: localai · License: MIT

rfdetr
RF-DETR is a real-time, transformer-based object detection model architecture developed by Roboflow and released under the Apache 2.0 license. RF-DETR is the first real-time model to exceed 60 AP on the Microsoft COCO benchmark while remaining competitive at base sizes. It also achieves state-of-the-art performance on RF100-VL, an object detection benchmark that measures model domain adaptability to real-world problems. RF-DETR is the fastest and most accurate model for its size when compared to current real-time object detection models. It is small enough to run on the edge using Inference, making it an ideal model for deployments that need both strong accuracy and real-time performance.

Repository: localai · License: Apache-2.0

sam3-cpp
Segment Anything Model (SAM 3/2/EdgeTAM) in C/C++ using GGML. Supports text-prompted and point/box-prompted image segmentation.

Repository: localai · License: MIT

vllm
vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graphs
- Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill

Repository: localai · License: Apache-2.0
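Python-based backends follow the same per-model configuration pattern. As a hedged sketch (the model name and Hugging Face repo id are placeholders, not recommendations), routing a model to this backend might look like:

```yaml
# Hypothetical LocalAI model definition routing a Hugging Face model to vLLM.
# The repo id below is a placeholder for illustration.
name: my-vllm-model
backend: vllm
parameters:
  model: some-org/some-model
```

Since vLLM loads weights directly from the Hugging Face Hub, `parameters.model` here takes a repo id rather than a local GGUF file.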

vllm-omni
vLLM-Omni is a unified interface for multimodal generation with vLLM. It supports image generation (text-to-image, image editing), video generation (text-to-video, image-to-video), text generation with multimodal inputs, and text-to-speech generation. Only supports NVIDIA (CUDA) and ROCm platforms.

Repository: localai · License: Apache-2.0

mlx
Run LLMs with MLX

Repository: localai · License: MIT

mlx-vlm
Run Vision-Language Models with MLX

Repository: localai · License: MIT

mlx-audio
Run Audio Models with MLX

Repository: localai · License: MIT

mlx-distributed
Run distributed LLM inference with MLX across multiple Apple Silicon Macs

Repository: localai · License: MIT

rerankers
Run reranker models that score query-document relevance for retrieval pipelines

Repository: localai

tinygrad
tinygrad is a minimalist deep-learning framework with zero runtime dependencies that targets CUDA, ROCm, Metal, WebGPU and CPU (CLANG). The LocalAI tinygrad backend exposes a single multimodal runtime that covers LLM text generation (Llama / Qwen / Mistral via safetensors or GGUF) with native tool-call extraction, BERT-family embeddings, Stable Diffusion 1.x / 2 / XL image generation, and Whisper speech-to-text. A single container image covers all of this: tinygrad generates its own GPU kernels and dlopens the host driver libraries at runtime, so there is no per-toolkit build split. The same image runs CPU-only or accelerates against CUDA / ROCm / Metal when the host driver is visible.

Repository: localai · License: MIT

transformers
Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer vision, audio, video, and multimodal domains, for both inference and training. It centralizes the model definition so that this definition is agreed upon across the ecosystem. transformers is the pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, PyTorch-Lightning, ...), inference engines (vLLM, SGLang, TGI, ...), and adjacent modeling libraries (llama.cpp, mlx, ...) which leverage the model definition from transformers.

Repository: localai · License: Apache-2.0

diffusers
🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both.

Repository: localai · License: Apache-2.0

ace-step
ACE-Step 1.5 is an open-source music generation model. It supports simple mode (natural language description) and advanced mode (caption, lyrics, think, bpm, keyscale, etc.). Uses in-process acestep (LLMHandler for metadata, DiT for audio).

Repository: localai

ace-step-development
ACE-Step 1.5 is an open-source music generation model. It supports simple mode (natural language description) and advanced mode (caption, lyrics, think, bpm, keyscale, etc.). Uses in-process acestep (LLMHandler for metadata, DiT for audio).

Repository: localai

acestep-cpp
ACE-Step 1.5 C++ backend using GGML. Native C++ implementation of ACE-Step music generation with GPU support through GGML backends. Generates stereo 48kHz audio from text descriptions and optional lyrics via a two-stage pipeline: text-to-code (ace-qwen3 LLM) + code-to-audio (DiT-VAE).

Repository: localai
