LocalAI - Models

lfm2.5-8b-a1b

LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.

- **On-device personal assistant**: Designed to power real-life applications, chaining tool calls, and following complex instructions on all devices.
- **Compressed performance**: Competitive with much larger dense and MoE models on instruction following and agentic tasks.
- **Unmatched throughput**: Fastest in its size class on both CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang.

Find more information about LFM2.5-8B-A1B in our blog post.

*AA-Omniscience Index (higher is better) rewards correct answers and penalizes hallucinations. Scores range from -100 to 100. See more results on Artificial Analysis.*

## 🗒️ Model Details

LFM2.5-8B-A1B is a general-purpose text-only model with the following features: ...

Links

https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF

Tags

allenai_olmo-3.1-32b-think

The **Olmo-3.1-32B-Think** model is a large language model (LLM), offered here in quantized form for efficient inference. These files are quantized versions of the original **allenai/Olmo-3.1-32B-Think** model, produced by **bartowski** using the **imatrix** quantization method.

### Key Features:

- **Base Model**: `allenai/Olmo-3.1-32B-Think` (unquantized version).
- **Quantized Versions**: Available in multiple formats with varying precision (e.g., `Q8_0`, `Q6_K_L`, `Q5_K_M`, `Q4_1`, `bf16`). These are derived from the original model using the **imatrix calibration dataset**.
- **Performance**: Optimized for low memory usage and efficient inference on GPUs/CPUs.
- **Downloads**: Available via Hugging Face CLI. Large models are split into multiple files if needed.
- **License**: Apache-2.0.

### Recommended Quantization:

- Use `Q6_K_L` for highest quality (near-perfect performance).
- Use `Q4_K_M` (the default) for balanced performance and size.
- Avoid lower-quality options (e.g., `Q3_K_S`) unless specific hardware constraints apply.

This model is well suited to deployment on GPUs/CPUs with limited memory.
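As a rough illustration of the Hugging Face CLI download mentioned above, the sketch below assembles a `huggingface-cli download` command that fetches only the files for one quantization. The helper name `build_download_command` is hypothetical, and the glob pattern assumes that file names in the repository contain the quantization label (for example `Q4_K_M`), which the listing does not state.

```python
import shlex


def build_download_command(repo_id, quant, local_dir="."):
    """Return argv for fetching one quantization from a Hugging Face repo.

    Assumption: file names in the repo contain the quant label, so a glob
    such as *Q4_K_M* also matches every part of a split model.
    """
    return [
        "huggingface-cli", "download", repo_id,
        "--include", f"*{quant}*",   # only files matching this quant
        "--local-dir", local_dir,    # where the files are written
    ]


cmd = build_download_command(
    "bartowski/allenai_Olmo-3.1-32B-Think-GGUF",  # repo from the Links section
    "Q4_K_M",                                     # the balanced default named above
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing the list to `subprocess.run(cmd, check=True)` would run the download, provided the `huggingface_hub` CLI is installed.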

Links

https://huggingface.co/bartowski/allenai_Olmo-3.1-32B-Think-GGUF

Tags

lfm2-1.2b

LFM2-1.2B is a hybrid liquid model designed for edge AI and on-device deployment, offering fast inference and multilingual support across 8 languages. It's optimized for agentic tasks, data extraction, and multi-turn conversations with efficient CPU/GPU/NPU compatibility.

Links

Tags

insightface-buffalo-s

Small insightface pack (SCRFD-500MF detector + MBF 512-d embedder + genderage, ~159MB). Good fit for mid-range CPU deployments. NON-COMMERCIAL RESEARCH USE ONLY.

Links

https://github.com/deepinsight/insightface

Tags

insightface-opencv-int8

Int8-quantized OpenCV Zoo face pair (YuNet int8 + SFace int8, ~12MB). Roughly 3x smaller and noticeably faster on CPU than the fp32 variant at comparable accuracy for face tasks. APACHE 2.0 — commercial-safe. Weights are downloaded on install via LocalAI's gallery mechanism.

Links

https://github.com/opencv/opencv_zoo

Tags

wespeaker-resnet34

Speaker recognition with WeSpeaker's ResNet34 trained on VoxCeleb, exported to ONNX. 256-d embeddings, CPU-friendly — avoids the PyTorch runtime entirely (onnxruntime only). APACHE 2.0. Pair with the `speaker-recognition` backend's OnnxDirectEngine. Use when ECAPA-TDNN's torch dependency is undesirable (small images, edge deployments).

Links

https://github.com/wenet-e2e/wespeaker

Tags

rfdetr-cpp-nano

RF-DETR Nano object detection model, served via the native rfdetr.cpp backend (ggml + purego, no Python). Q8_0 quantization is the recommended default for CPU: same accuracy as F16/F32, ~20MB on disk, fastest CPU latency. Pure C++/ggml runtime; no Python dependencies. Drop-in for the /v1/detection endpoint.
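A minimal sketch of calling the /v1/detection endpoint from Python follows. It only builds the request and does not send it. The JSON field names `model` and `image`, the base URL `http://localhost:8080`, and the sample image URL are assumptions for illustration and are not confirmed by this listing; check the LocalAI API documentation for the exact request schema.

```python
import json
import urllib.request


def build_detection_request(base_url, model, image):
    """Build (but do not send) a POST request for /v1/detection.

    Assumption: the endpoint accepts a JSON body with "model" and "image"
    keys, where "image" is a URL or base64-encoded image data.
    """
    payload = {"model": model, "image": image}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/detection",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_detection_request(
    "http://localhost:8080",           # assumed local LocalAI address
    "rfdetr-cpp-nano",                 # gallery name of this model
    "https://example.com/street.jpg",  # placeholder image URL
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

To send it against a running server, pass `req` to `urllib.request.urlopen` and decode the JSON response.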

Links

Tags

locate-anything-3b

NVIDIA LocateAnything-3B open-vocabulary object detection (visual grounding), served via the native locate-anything.cpp backend (C++/ggml + purego, no Python). Describe what to find in a text prompt and get labeled boxes back; separate multiple categories with a period ("."). Q8_0 is the recommended default: box-identical to F16/F32, ~6.3GB, fastest CPU latency. Drop-in for the /v1/detection endpoint (pass the prompt).

Links

Tags

depth-anything-3-base

Depth Anything 3 (base) monocular metric depth + camera pose, served via the native depth-anything.cpp backend (C++/ggml + purego, no Python at inference). Given an image it returns a dense depth map plus the recovered camera extrinsics (3x4) and intrinsics (3x3). Use GenerateImage (src -> normalized depth PNG at dst) or Predict (JSON depth stats + pose). q4_k is the recommended CPU default.
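To show how a 3x4 extrinsics matrix and a 3x3 intrinsics matrix fit together, here is a small pure-Python sketch of the standard pinhole projection. It assumes the extrinsics map world coordinates to camera coordinates as `[R | t]`, a convention this listing does not specify, and all matrix values below are made up for illustration.

```python
def project_point(K, Rt, point_world):
    """Project a 3D world point to pixel coordinates (pinhole model).

    K  : 3x3 intrinsics (focal lengths and principal point)
    Rt : 3x4 extrinsics [R | t], assumed to map world -> camera
    """
    # Camera-frame coordinates: X_cam = R @ X_world + t
    x_cam = [
        sum(Rt[i][j] * point_world[j] for j in range(3)) + Rt[i][3]
        for i in range(3)
    ]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Homogeneous pixel coordinates: K @ X_cam, then divide by depth
    uvw = [sum(K[i][j] * x_cam[j] for j in range(3)) for i in range(3)]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]


# Illustrative values only: 500 px focal length, principal point (320, 240),
# camera at the world origin with no rotation.
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
Rt = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 0.0]]

u, v = project_point(K, Rt, [0.2, -0.1, 2.0])
print(u, v)  # close to (370, 215) for these values
```

Running the same relation in reverse, with a depth value from the depth map, lifts a pixel back to a 3D point in the camera frame.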

Links

Tags

depth-anything-3-small

Depth Anything 3 (small / vits), f32: the smallest backbone (~131 MB) for fast CPU depth + camera pose. Returns the same outputs as base (depth map plus camera pose) at lower latency.

Links

Tags

rfdetr-cpp-base

RF-DETR Base object detection model, served via the native rfdetr.cpp backend. F16 quantization is recommended on CPU: identical accuracy to F32, half the size, fastest.

Links

Tags

rfdetr-cpp-small

RF-DETR Small object detection model (DINOv2-small backbone, 512px input, 3 decoder layers), served via the native rfdetr.cpp backend (ggml + purego, no Python). A step up from Nano in accuracy while staying lightweight on CPU. F16 quantization is the recommended default: identical accuracy to F32 at roughly half the size. Drop-in for the /v1/detection endpoint.

Links

Tags

rfdetr-cpp-medium

RF-DETR Medium object detection model (DINOv2-small backbone, 576px input, 4 decoder layers), served via the native rfdetr.cpp backend. Balanced detection quality vs. CPU latency — recommended when Base is not accurate enough but Large is too slow. F16 quantization is the recommended default: identical accuracy to F32, half the size. Drop-in for the /v1/detection endpoint.

Links

Tags

rfdetr-cpp-large

RF-DETR Large object detection model (DINOv2-small backbone, 704px input, 4 decoder layers), served via the native rfdetr.cpp backend. Highest-accuracy detection variant — best for offline workflows and high-resolution inputs where CPU latency is secondary to recall. F16 quantization is the recommended default: identical accuracy to F32, half the size. Drop-in for the /v1/detection endpoint.

Links

Tags

rfdetr-cpp-seg-nano

RF-DETR Seg-Nano instance segmentation model (DINOv2-small backbone, 312px input, 4 decoder layers, 100 queries), served via the native rfdetr.cpp backend. Smallest segmentation variant — fastest CPU latency, ideal for edge deployment. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default: identical accuracy to F32, half the size.

Links

Tags

rfdetr-cpp-seg-small

RF-DETR Seg-Small instance segmentation model (DINOv2-small backbone, 384px input, 4 decoder layers, 100 queries), served via the native rfdetr.cpp backend. Step up from Seg-Nano in mask quality while staying CPU-friendly. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default: identical accuracy to F32, half the size.

Links

Tags

rfdetr-cpp-seg-medium

RF-DETR Seg-Medium instance segmentation model (DINOv2-small backbone, 432px input, 5 decoder layers, 200 queries), served via the native rfdetr.cpp backend. Balanced segmentation quality vs. CPU latency — recommended for everyday segmentation workloads. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default.

Links

Tags

rfdetr-cpp-seg-2xlarge

RF-DETR Seg-2XLarge instance segmentation model (DINOv2-small backbone, 768px input, 6 decoder layers, 300 queries), served via the native rfdetr.cpp backend. Highest-accuracy segmentation variant — best for offline workflows and high-resolution inputs where CPU latency is secondary to mask quality. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default: identical accuracy to F32, half the size.

Links

Tags

qwen3-30b-a1.5b-high-speed

This repo contains the full-precision source code, in "safe tensors" format, to generate GGUF, GPTQ, EXL2, AWQ, HQQ and other formats. The source code can also be used directly.

This is a simple "finetune" of Qwen's "Qwen 30B-A3B" (MoE) model that reduces the experts in use from 8 to 4 (out of 128 experts). This nearly doubles the speed of the model and uses 1.5B (of 30B) parameters instead of 3B (of 30B). Depending on the application, you may want to use the regular model ("30B-A3B") and reserve this model for simpler use cases, although I did not notice any loss of function during routine (but not extensive) testing. More complex use cases may benefit from using the normal version.

Example generation (Q4KS, CPU) at the bottom of this page using 4 experts / this model.

For reference: CPU-only operation with Q4KS (Windows 11) jumps from 12 t/s to 23 t/s. GPU performance with IQ3S jumps from 75 t/s to over 125 t/s (low- to mid-level card).

Context size: 32K + 8K for output (40K total).
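The figures quoted above can be checked with a few lines of arithmetic. The sketch below takes the card's numbers at face value, including its claim that active parameters scale in proportion to the number of active experts; the function and variable names are illustrative only.

```python
def speedup(before_tps, after_tps):
    """Ratio of new to old throughput in tokens per second."""
    return after_tps / before_tps


# Active parameters, scaled by the share of experts kept (per the card).
experts_before, experts_after = 8, 4
active_params_before_b = 3.0  # "A3B": about 3B active parameters
active_params_after_b = active_params_before_b * experts_after / experts_before

cpu_speedup = speedup(12, 23)    # Q4KS, CPU only
gpu_speedup = speedup(75, 125)   # IQ3S; card says "over 125", so a lower bound

total_context_k = 32 + 8         # 32K context + 8K reserved for output

print(f"active parameters: {active_params_after_b:.1f}B")
print(f"CPU speedup: {cpu_speedup:.2f}x, GPU speedup: at least {gpu_speedup:.2f}x")
print(f"total context: {total_context_k}K")
```

On these numbers the CPU gain is close to the "nearly doubles" claim, while the GPU gain is somewhat smaller.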

Links

Tags

qwen3-55b-a3b-total-recall-deep-40x

WARNING: MADNESS - UNHINGED and... NSFW. Vivid prose. INTENSE. Visceral Details. Violence. HORROR. GORE. Swearing. UNCENSORED... humor, romance, fun.

Qwen3-55B-A3B-TOTAL-RECALL-Deep-40X-GGUF

A highly experimental model ("tamer" versions below) based on Qwen3-30B-A3B (MoE, 128 experts, 8 activated), with Brainstorm 40X (by DavidAU - details at bottom of this page). These modifications blow the model (V1) out to 87 layers, 1046 tensors and 55B parameters. Note that some versions are smaller than this, with fewer layers/tensors and smaller parameter counts.

The adapter extensively alters performance, reasoning and output generation, with exceptional changes in creative, prose and general performance. Regens of the same prompt - even with the same settings - will be very different.

THREE example generations below - creative (generated with Q3_K_M, V1 model). ONE example generation (#4) - non-creative (generated with Q3_K_M, V1 model).

You can run this model on CPU and/or GPU due to its unique construction, the size of the experts, and the 3B parameters activated in total (8 experts), which translates into roughly 6B parameters in this version.

Two quants uploaded for testing: Q3_K_M and Q4_K_M. V3, V4 and V5 are also available in these two quants. V2 and V6 are available in Q3_K_M only, as are V1.3, 1.4, 1.5, 1.7 and V7 (newest).

NOTE: V2 and up are from source model 2; V1 and V1.3, 1.4, 1.5, 1.7 are from source model 1.

Links

https://huggingface.co/DavidAU/Qwen3-55B-A3B-TOTAL-RECALL-Deep-40X-GGUF

Tags

qwen3-22b-a3b-the-harley-quinn

WARNING: MADNESS - UNHINGED and... NSFW. Vivid prose. INTENSE. Visceral Details. Violence. HORROR. GORE. Swearing. UNCENSORED... humor, romance, fun.

Qwen3-22B-A3B-The-Harley-Quinn

This repo contains the full-precision source code, in "safe tensors" format, to generate GGUF, GPTQ, EXL2, AWQ, HQQ and other formats. The source code can also be used directly.

ABOUT: A stranger, yet radically different version of Kalmaze's "Qwen/Qwen3-16B-A3B" with the experts pruned to 64 (from 128, the Qwen 3 30B-A3B version), to which I then added 19 layers (Brainstorm 20x by DavidAU, info at bottom of this page), expanding the model to 22B total parameters.

The goal: slightly alter the model, to address some odd creative thinking and output choices. Then... Harley Quinn showed up, and then it was a party! A wild, out of control (sometimes) but never boring party.

Please note that the modifications affect the entire model operation; roughly, I adjusted the model to think a little "deeper" and "ponder" a bit, but this is a very rough description. That being said, reasoning and output generation will be altered regardless of your use case(s).

These modifications push Qwen's model to the absolute limit for creative use cases. Detail, vividness, and creativity all get a boost. Prose (all) will also be very different from "default" Qwen3. Likewise, regen(s) of the same prompt - even at the same settings - will create very different version(s) too.

The Brainstorm 20x has also lightly de-censored the model under some conditions. However, this model can be prone to bouts of madness. It will not always behave, and it will sometimes go -wildly- off script. See 4 examples below.

The model retains the full reasoning and output generation of a Qwen3 MoE, but has not been tested for "non-creative" use cases.

The model is set with Qwen's default config:

- 40k context
- 8 of 64 experts activated
- Chatml OR Jinja Template (embedded)

Four example generations below.

IMPORTANT: See the usage guide / repo below to get the most out of this model, as settings are very specific. If not set correctly, this model will not work the way it should.

Critical settings:

- Chatml or Jinja Template (embedded, but updated version at repo below).
- Rep pen of 1.01 or 1.02; higher (1.04, 1.05) will result in "Harley Mode".
- Temp range of .6 to 1.2; at higher temps you may need to prompt the model to "output" after thinking.
- Experts set at 8-10; higher will result in "odder" output BUT it might be better.

That being said, "Harley Quinn" may make her presence known at any moment.

USAGE GUIDE: Please refer to this model card for specific usage, suggested settings, changing ACTIVE EXPERTS, templates, settings and the like:

- How to maximize this model in "uncensored" form, with specific notes on "abliterated" models.
- Rep pen / temp settings specific to getting the model to perform strongly.

https://huggingface.co/DavidAU/Qwen3-18B-A3B-Stranger-Thoughts-Abliterated-Uncensored-GGUF

GGUF / QUANTS / SPECIAL SHOUTOUT: Special thanks to team Mradermacher for making the quants!

https://huggingface.co/mradermacher/Qwen3-22B-A3B-The-Harley-Quinn-GGUF

KNOWN ISSUES:

- Model may "mis-capitalize" word(s) - lowercase, where uppercase should be - from time to time.
- Model may add an extra space before a word from time to time.
- Incorrect template and/or settings will result in a drop in performance / poor performance.
- Can rant at the end / repeat. Most of the time it will stop on its own.

Looking for the Abliterated / Uncensored version?

https://huggingface.co/DavidAU/Qwen3-23B-A3B-The-Harley-Quinn-PUDDIN-Abliterated-Uncensored

In some cases this "abliterated/uncensored" version may work better than this version.

EXAMPLES

Standard system prompt, rep pen 1.01-1.02, topk 100, topp .95, minp .05, rep pen range 64. Tested in LMStudio, quant Q4KS, GPU (CPU output will differ slightly). As this is a mid-range quant, expect better results from higher quants and/or with more experts activated.

NOTE: Some formatting lost on copy/paste.

Links

Tags
