LocalAI - Models

serenity-26b-a4b

.mc-wrap{background:#0d1117;color:#c9d1d9;font-family:'Inter',sans-serif;max-width:920px;margin:0 auto;padding:24px;border-radius:16px;box-sizing:border-box} .mc-wrap *{box-sizing:border-box} .mc-wrap h1,.mc-wrap h2,.mc-wrap h3,.mc-wrap h4{color:#e6edf3;border:none} .mc-wrap p{color:#c9d1d9} .mc-wrap strong{color:#7ee8d0} .mc-wrap a{color:#7ee8d0;text-decoration:none} .mc-wrap ul{list-style:none;padding-left:0;margin:0} .mc-wrap li{color:#c9d1d9;margin-bottom:8px;padding-left:4px} .mc-wrap code{background:#161b22;color:#7ee8d0;padding:2px 8px;border-radius:4px;font-family:'JetBrains Mono',monospace;font-size:.88em;border:1px solid rgba(126,232,208,.15)} .mc-hdr{text-align:center;padding:40px 32px;background:#0d1117;border:1px solid #21262d;border-radius:24px;margin-bottom:20px;position:relative;overflow:hidden} .mc-hdr::before{content:'';position:absolute;top:0;left:0;right:0;height:3px;background:linear-gradient(135deg,#7ee8d0,#a78bfa,#c4b5fd)} .mc-name{font-family:'Space Grotesk',sans-serif;font-size:2.8em;font-weight:800;margin:0;letter-spacing:-.02em;background:linear-gradient(135deg,#7ee8d0,#a78bfa,#c4b5fd);-webkit-background-clip:text;-webkit-text-fill-color:transparent;backg ...

Links

https://huggingface.co/ReadyArt/Serenity-26B-A4B-GGUF

Tags

melody1437-26b-a4b-v2.0

@import url('https://fonts.googleapis.com/css2?family=Poppins:wght@400;600&family=Playfair+Display:ital,wght@0,400;0,700&family=Roboto+Mono:wght@400;500&display=swap'); body { font-family: 'Poppins', sans-serif; background: #1a1a2e; background-image: radial-gradient(circle at 50% 50%, rgba(76, 201, 240, 0.05) 0%, transparent 70%), url('https://www.transparenttextures.com/patterns/cubes.png'); color: #e0e0e0; margin: 0; padding: 20px; line-height: 1.6; } .container { max-width: 900px; margin: 0 auto; background: rgba(26, 32, 44, 0.95); border-radius: 8px; padding: 40px; box-shadow: 0 4px 30px rgba(0, 0, 0, 0.5), 0 0 0 1px #2a3b55; border: 1px solid #2a3b55; position: relative; overflow: hidden; backdrop-filter: blur(5px); } .header { text-align: center; margin-bottom: 30px; position: relative; z-index: 1; border-bottom: 1px solid #2a3b55; padding-bottom: 15px; } ...

Links

https://huggingface.co/ReadyArt/Melody1437-26B-A4B-v2.0-GGUF

Tags

dark-scarlett-v0.3-26b-a4b

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI. Gemma 4 introduces key **capability and architectural advancements**: * **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes. ...

Links

https://huggingface.co/ReadyArt/Dark-Scarlett-v0.3-26B-A4B-GGUF

Tags

gemma-4-26b-a4b-it-qat

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind > [!Note] > This model card is for the new versions of the Gemma 4 family optimized with Quantization-Aware Training (QAT), which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model. > Four versions of the QAT checkpoints are available: > * **Unquantized QAT checkpoints** (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B, and their drafter models. > * **GGUF** (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B. > * **Mobile-optimized** (wNa8o8): A custom schema engineered explicitly for mobile hardware efficiency. It features targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available for Gemma 4 E2B and E4B. > * **Compressed Tensors** (w4a16): QAT checkpoints serialized in the compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B ...

Links

https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF

Tags

gemma-4-e4b-it-qat-q4_0

Gemma 4 E4B is a multimodal (text + image) instruction-tuned model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. E4B is a MatFormer "effective 4B" elastic variant, balancing quality and footprint for on-device and edge deployments. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Links

https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-gguf

Tags

gemma-4-26b-a4b-it-qat-q4_0

Gemma 4 26B-A4B is a multimodal (text + image) instruction-tuned Mixture-of-Experts model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. With 26B total parameters and ~4B active per token, it delivers large-model quality at a much lower inference cost. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Links

https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gguf

Tags

gemma-4-26b-a4b-it-apex

AI model: gemma-4-26b-a4b-it-apex

Links

https://huggingface.co/mudler/gemma-4-26B-A4B-it-APEX-GGUF

gemma-4-26b-a4b-it

Google Gemma 4 26B-A4B-IT is an open-source multimodal Mixture-of-Experts model with 26B total parameters and 4B active parameters. It handles text and image input, generating text output, with a 256K context window and support for 140+ languages. The MoE architecture provides strong performance with efficient inference. Well-suited for question answering, summarization, reasoning, and image understanding tasks.

Links

Tags

gemma-4-e4b-it

Google Gemma 4 E4B-IT is an open-source multimodal model with 8B total parameters and 4B effective parameters using selective parameter activation. It handles text and image input, generating text output, with a 256K context window and support for 140+ languages. Offers a good balance of performance and efficiency for deployment on consumer hardware.

Links

Tags

qwen_qwen3.5-4b

Qwen3.5-4B is a multimodal LLM with 4 billion parameters, optimized for chat and vision tasks. This GGUF quantized version enables efficient local inference via llama-cpp backend. Supports both text and image input for enhanced conversational capabilities.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-4B-GGUF

Tags

qwen3.5-4b-claude-4.6-opus-reasoning-distilled

Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-GGUF - A GGUF quantized model optimized for local inference. Specialized for reasoning and chain-of-thought tasks. Based on Qwen 3.5 architecture with enhanced language understanding. Available in multiple quantization levels for various hardware requirements. Distilled from Claude-style reasoning models for enhanced logical reasoning capabilities.

Links

https://huggingface.co/Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

Tags

voxtral-mini-4b-realtime

Voxtral Mini 4B Realtime is a speech-to-text model from Mistral AI. It is a 4B parameter model optimized for fast, accurate audio transcription with low latency, making it ideal for real-time applications. The model uses the Voxtral architecture for efficient audio processing.

Links

Tags

acestep-cpp-turbo-4b

ACE-Step 1.5 Turbo (C++ / GGML) with 4B LM — higher quality music generation from text and lyrics. Uses the larger 4B parameter LM for better metadata/code generation. Stereo 48kHz output.

Links

Tags

qwen3-vl-4b-instruct

Qwen3-VL-4B-Instruct is the 4B parameter model of the Qwen3-VL series.

Links

https://huggingface.co/unsloth/Qwen3-VL-4B-Instruct-GGUF

Tags

qwen3-vl-4b-thinking

Qwen3-VL-4B-Thinking is the 4B parameter model of the Qwen3-VL series that is thinking.

Links

https://huggingface.co/unsloth/Qwen3-VL-4B-Thinking-GGUF

Tags

opengvlab_internvl3_5-14b-q8_0

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 ×\times× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

Links

Tags

opengvlab_internvl3_5-14b

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 ×\times× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

Links

Tags

opengvlab_internvl3_5-4b

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 ×\times× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

Links

Tags

opengvlab_internvl3_5-4b-q8_0

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 ×\times× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

Links

Tags

qwen3-14b

Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features: Uniquely support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within single model, ensuring optimal performance across various scenarios. Significantly enhancement in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning. Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience. Expertise in agent capabilities, enabling precise integration with external tools in both thinking and unthinking modes and achieving leading performance among open-source models in complex agent-based tasks. Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation. Qwen3-14B has the following features: Type: Causal Language Models Training Stage: Pretraining & Post-training Number of Parameters: 14.8B Number of Paramaters (Non-Embedding): 13.2B Number of Layers: 40 Number of Attention Heads (GQA): 40 for Q and 8 for KV Context Length: 32,768 natively and 131,072 tokens with YaRN. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Links

Tags

qwen3-4b

Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features: Uniquely support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within single model, ensuring optimal performance across various scenarios. Significantly enhancement in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning. Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience. Expertise in agent capabilities, enabling precise integration with external tools in both thinking and unthinking modes and achieving leading performance among open-source models in complex agent-based tasks. Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation. Qwen3-4B has the following features: Type: Causal Language Models Training Stage: Pretraining & Post-training Number of Parameters: 4.0B Number of Paramaters (Non-Embedding): 3.6B Number of Layers: 36 Number of Attention Heads (GQA): 32 for Q and 8 for KV Context Length: 32,768 natively and 131,072 tokens with YaRN.

Links

Tags

Model Gallery

Find Your Perfect Model

Filter by Model Type

Browse by Tags

serenity-26b-a4b

melody1437-26b-a4b-v2.0

dark-scarlett-v0.3-26b-a4b

gemma-4-26b-a4b-it-qat

gemma-4-e4b-it-qat-q4_0

gemma-4-26b-a4b-it-qat-q4_0

gemma-4-26b-a4b-it-apex

gemma-4-26b-a4b-it

gemma-4-e4b-it

qwen_qwen3.5-4b

qwen3.5-4b-claude-4.6-opus-reasoning-distilled

voxtral-mini-4b-realtime

acestep-cpp-turbo-4b

qwen3-vl-4b-instruct

qwen3-vl-4b-thinking

opengvlab_internvl3_5-14b-q8_0

opengvlab_internvl3_5-14b

opengvlab_internvl3_5-4b

opengvlab_internvl3_5-4b-q8_0

qwen3-14b

qwen3-4b