LocalAI - Models

gemma-4-12b-coder-fable5-composer2.5-v1

License: Apache 2.0 | Authors: Google DeepMind

> [!Note]
> This model card is for the Gemma 4 12B Unified model, which is part of the Gemma 4 family of open models. Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution.

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

...

Links

https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

Tags

gemma-4-e2b-it-qat-q4_0

Gemma 4 E2B is a multimodal (text + image) instruction-tuned model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. E2B is a MatFormer "effective 2B" elastic variant: it carries a larger backbone but runs at an effective 2B-parameter footprint, making it well suited to lightweight and on-device deployments. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind
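To make "a fraction of the memory" concrete, here is a rough back-of-the-envelope sketch comparing bfloat16 and Q4_0 weight storage. Q4_0 packs 32 weights into an 18-byte block (16 bytes of 4-bit values plus a 2-byte scale), which works out to 4.5 bits per weight. The parameter count used below is an illustrative assumption, and real GGUF files usually keep some tensors at higher precision, so actual file sizes will differ.

```python
# Rough estimate of weight storage for a given parameter count.
# This ignores the KV cache, activations, and the multimodal projector.

BF16_BITS_PER_WEIGHT = 16.0
# Q4_0 block: 32 weights -> 16 bytes of 4-bit values + 2-byte fp16 scale = 18 bytes.
Q4_0_BITS_PER_WEIGHT = 18 * 8 / 32  # 4.5 bits per weight


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate number of bytes needed to store the weights."""
    return n_params * bits_per_weight / 8


if __name__ == "__main__":
    n_params = 5e9  # hypothetical parameter count, chosen only for illustration
    bf16 = weight_bytes(n_params, BF16_BITS_PER_WEIGHT)
    q4_0 = weight_bytes(n_params, Q4_0_BITS_PER_WEIGHT)
    print(f"bf16: {bf16 / 1e9:.2f} GB")
    print(f"Q4_0: {q4_0 / 1e9:.2f} GB")
    print(f"Q4_0 / bf16 ratio: {q4_0 / bf16:.3f}")
```

Under these assumptions the quantized weights take a little over a quarter of the bfloat16 footprint.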

Links

https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf

Tags

gemma-4-e4b-it-qat-q4_0

Gemma 4 E4B is a multimodal (text + image) instruction-tuned model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. E4B is a MatFormer "effective 4B" elastic variant, balancing quality and footprint for on-device and edge deployments. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Links

https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-gguf

Tags

step-3.7-flash

**[ModelPage]**: https://static.stepfun.com/blog/step-3.7-flash/

## 1. Introduction

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.

We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.

## 2. Capabilities & Performance

### Multimodal Perception and Verification

...

Links

https://huggingface.co/unsloth/Step-3.7-Flash-GGUF

Tags

lfm2.5-8b-a1b

# LFM2.5-8B-A1B

LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.

- **On-device personal assistant**: Designed to power real-life applications, chaining tool calls, and following complex instructions on all devices.
- **Compressed performance**: Competitive with much larger dense and MoE models on instruction following and agentic tasks.
- **Unmatched throughput**: Fastest in its size class on both CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang.

Find more information about LFM2.5-8B-A1B in our blog post.

*AA-Omniscience Index (higher is better) rewards correct answers and penalizes hallucinations. Scores range from -100 to 100. See more results on Artificial Analysis.*

## 🗒️ Model Details

LFM2.5-8B-A1B is a general-purpose text-only model with the following features:

...

Links

https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF

Tags

qwopus3.6-35b-a3b-v1

# Qwen3.6-35B-A3B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

## Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in

- **Agentic Coding:** the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-35B-A3B.

## Model Overview

...

Links

https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF

Tags

qwen3.5-9b-deepseek-v4-flash

# Qwen3.5-9B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

## Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

- **Unified Vision-Language Foundation**: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.
- **Efficient Hybrid Architecture**: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.

...

Links

https://huggingface.co/Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash-GGUF

Tags

nemotron-3-nano-omni-30b-a3b-reasoning-apex

# Model Overview

### Description:

NVIDIA Nemotron 3 Nano Omni is a multimodal large language model that unifies video, audio, image, and text understanding to support enterprise-grade Q&A, summarization, transcription, and document intelligence workflows. It extends the Nemotron Nano family with integrated video+speech comprehension, Graphical User Interface (GUI), Optical Character Recognition (OCR), and speech transcription capabilities, enabling end-to-end processing of rich enterprise content such as meeting recordings, M&E assets, training videos, and complex business documents.

NVIDIA Nemotron 3 Nano Omni was developed by NVIDIA as part of the Nemotron model family.

This model is available for commercial use.

This model was improved using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen2.5-VL-72B-Instruct, and gpt-oss-120b. For more information, please see the Training Dataset section below.

### License/Terms of Use

Governing Terms: Use of this model is governed by the NVIDIA Open Model Agreement

### Deployment Geography:

Global

...

Links

https://huggingface.co/mudler/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-APEX-GGUF

Tags

qwopus3.6-27b-v1-preview

# Qwen3.6-27B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

## Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in

- **Agentic Coding:** the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-27B.

## Model Overview

...

Links

https://huggingface.co/Jackrong/Qwopus3.6-27B-v1-preview-GGUF

Tags

qwen3.6-27b

# Qwen3.6-27B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

## Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in

- **Agentic Coding:** the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-27B.

## Model Overview

...

Links

https://huggingface.co/unsloth/Qwen3.6-27B-GGUF

Tags

qwen3.6-35b-a3b-apex

# Qwen3.6-35B-A3B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

## Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in

- **Agentic Coding:** the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-35B-A3B.

## Model Overview

...

Links

https://huggingface.co/mudler/Qwen3.6-35B-A3B-APEX-GGUF

Tags

qwen3.6-35b-a3b

# Qwen3.6-35B-A3B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

## Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in

- **Agentic Coding:** the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-35B-A3B.

## Model Overview

...

Links

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

Tags

gemma-4-e2b-it

Google Gemma 4 E2B-IT is a lightweight open-source multimodal model with 5B total parameters and 2B effective parameters using selective parameter activation. It handles text and image input, generating text output, with a 256K context window and support for 140+ languages. Optimized for efficient execution on low-resource devices including mobile and laptops.

Links

Tags

moonshine-tiny

Moonshine Tiny is a lightweight speech-to-text model optimized for fast transcription. It is designed for efficient on-device ASR with high accuracy relative to its size.

Links

https://github.com/moonshine-ai/moonshine

Tags

supertonic-3

Supertonic multilingual text-to-speech (Supertone/supertonic-3), served through the native supertonic backend via ONNX Runtime. Lightning-fast on-device flow-matching TTS with 44.1 kHz output, 31 languages, and 10 preset voice styles (F1-F5, M1-M5). No espeak-ng dependency. Defaults to voice F1; override per request with the OpenAI `voice` field, and optionally pass `language=` (e.g. en, ko, ja, it; "na" for language-agnostic).
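As a minimal sketch of the per-request override described above, the snippet below builds the JSON body for an OpenAI-style speech request. The `model`, `input`, and `voice` fields follow the OpenAI speech API shape. Sending the language hint as a top-level `language` field, and the helper name `build_speech_request`, are assumptions made for illustration, so check the LocalAI documentation for the exact request format and endpoint.

```python
import json

# Preset voice styles named in the description: F1-F5 and M1-M5.
PRESET_VOICES = {f"{prefix}{n}" for prefix in "FM" for n in range(1, 6)}


def build_speech_request(text, voice="F1", language=None, model="supertonic-3"):
    """Build the JSON body for an OpenAI-style text-to-speech request.

    `voice` defaults to F1, matching the default stated in the description.
    """
    if voice not in PRESET_VOICES:
        raise ValueError(
            f"unknown voice {voice!r}; expected one of {sorted(PRESET_VOICES)}"
        )
    body = {"model": model, "input": text, "voice": voice}
    if language is not None:
        # Assumption: the language hint is sent as a top-level "language" field.
        body["language"] = language
    return body


if __name__ == "__main__":
    request_body = build_speech_request("Hello there", voice="M2", language="en")
    print(json.dumps(request_body, indent=2))
```

The resulting body would be POSTed to the server's speech endpoint. Validating the voice name on the client side is a design choice of this sketch, not something the backend is known to require.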

Links

Tags

neutts-air

NeuTTS Air is the world's first super-realistic, on-device TTS speech language model with instant voice cloning. Built on a 0.5B LLM backbone, it brings natural-sounding speech, real-time performance, and speaker cloning to local devices.

Links

https://github.com/neuphonic/neutts-air

Tags

z-image-diffusers

Z-Image is the foundation model of the ⚡️-Image family, engineered for good quality, robust generative diversity, broad stylistic coverage, and precise prompt adherence. While Z-Image-Turbo is built for speed, Z-Image is a full-capacity, undistilled transformer designed to be the backbone for creators, researchers, and developers who require the highest level of creative freedom.

Links

https://huggingface.co/Tongyi-MAI/Z-Image

Tags

z-image-turbo-diffusers

🚀 Z-Image-Turbo – A distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers ⚡️sub-second inference latency⚡️ on enterprise-grade H800 GPUs and fits comfortably within 16G VRAM consumer devices. It excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.

Links

https://huggingface.co/Tongyi-MAI/Z-Image-Turbo

Tags

huihui-glm-4.7-flash-abliterated-i1

The model is a quantized version of **huihui-ai/Huihui-GLM-4.7-Flash-abliterated**, optimized for efficiency and deployment. It uses GGUF files with various quantization levels (e.g., IQ1_M, IQ2_XXS, Q4_K_M) and is designed for tasks requiring low-resource deployment. Key features include:

- **Base Model**: Huihui-GLM-4.7-Flash-abliterated (unmodified, original model).
- **Quantization**: Supports IQ1_M to Q4_K_M, balancing accuracy and efficiency.
- **Use Cases**: Suitable for applications needing lightweight inference, such as edge devices or resource-constrained environments.
- **Downloads**: Available in GGUF format with varying quality and size (e.g., 0.2GB to 18.2GB).
- **Tags**: Abliterated, uncensored, and optimized for specific tasks.

This model is a modified version of the original GLM-4.7, tailored for deployment with quantized weights.

Links

https://huggingface.co/mradermacher/Huihui-GLM-4.7-Flash-abliterated-i1-GGUF

Tags

tildeopen-30b-instruct-lv-i1

**TildeOpen-30B-Instruct-LV-i1-GGUF** is a quantized version of the base model **pazars/TildeOpen-30B-Instruct-LV**, optimized for deployment. It is an instruction-tuned language model trained on diverse datasets, supporting multiple languages (en, de, fr, pl, ru, it, pt, cs, nl, es, fi, tr, hu, bg, uk, bs, hr, da, et, lt, ro, sk, sl, sv, no, lv, sr, sq, mk, is, mt, ga). Licensed under CC-BY-4.0, it uses the Transformers library and is designed for efficient inference. The imatrix-quantized files are tailored for deployment on devices with limited resources, while the base model remains the original, unquantized version.

Links

https://huggingface.co/mradermacher/TildeOpen-30B-Instruct-LV-i1-GGUF

Tags

allenai_olmo-3.1-32b-think

The **Olmo-3.1-32B-Think** model is a large language model (LLM) optimized for efficient inference using quantized versions. It is a quantized version of the original **allenai/Olmo-3.1-32B-Think** model, developed by **bartowski** using the **imatrix** quantization method.

### Key Features:

- **Base Model**: `allenai/Olmo-3.1-32B-Think` (unquantized version).
- **Quantized Versions**: Available in multiple formats (e.g., `Q6_K_L`, `Q4_1`, `bf16`) with varying precision (e.g., Q8_0, Q6_K_L, Q5_K_M). These are derived from the original model using the **imatrix calibration dataset**.
- **Performance**: Optimized for low-memory usage and efficient inference on GPUs/CPUs. Recommended quantization types include `Q6_K_L` (near-perfect quality) or `Q4_K_M` (default, balanced performance).
- **Downloads**: Available via Hugging Face CLI. Split into multiple files if needed for large models.
- **License**: Apache-2.0.

### Recommended Quantization:

- Use `Q6_K_L` for highest quality (near-perfect performance).
- Use `Q4_K_M` for balanced performance and size.
- Avoid lower-quality options (e.g., `Q3_K_S`) unless specific hardware constraints apply.

This model is ideal for deploying on GPUs/CPUs with limited memory, leveraging efficient quantization for practical use cases.

Links

https://huggingface.co/bartowski/allenai_Olmo-3.1-32B-Think-GGUF

Tags
