Model Gallery

Discover and install AI models from our curated collection

7 models available
1 repository

gemma-4-12b-it-qat-q4_0
License: Apache 2.0 | Authors: Google DeepMind

> [!Note]
> This model card is for the new versions of the Gemma 4 family optimized with Quantization-Aware Training (QAT), which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model.
>
> Four versions of the QAT checkpoints are available:
> * **Unquantized QAT checkpoints** (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B, and their drafter models.
> * **GGUF** (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B.
> * **Mobile-optimized** (wNa8o8): A custom schema engineered explicitly for mobile hardware efficiency. It features targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available for Gemma 4 E2B and E4B.
> * **Compressed Tensors** (w4a16): QAT checkpoints serialized in the compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B ...

Repository: localai | License: apache-2.0

gemma-4-e2b-it-qat-q4_0
Gemma 4 E2B is a multimodal (text + image) instruction-tuned model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. E2B is a MatFormer "effective 2B" elastic variant: it carries a larger backbone but runs at an effective 2B-parameter footprint, making it well suited to lightweight and on-device deployments. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Repository: localai | License: apache-2.0

gemma-4-e4b-it-qat-q4_0
Gemma 4 E4B is a multimodal (text + image) instruction-tuned model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. E4B is a MatFormer "effective 4B" elastic variant, balancing quality and footprint for on-device and edge deployments. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Repository: localai | License: apache-2.0

gemma-4-26b-a4b-it-qat-q4_0
Gemma 4 26B-A4B is a multimodal (text + image) instruction-tuned Mixture-of-Experts model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. With 26B total parameters and ~4B active per token, it delivers large-model quality at a much lower inference cost. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Repository: localai | License: apache-2.0
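
As a rough illustration of the memory claim in the gemma-4-26b-a4b-it-qat-q4_0 card above, the sketch below estimates weight storage at bfloat16 and at Q4_0. It relies on the GGML Q4_0 layout (blocks of 32 weights sharing one fp16 scale, so 4.5 bits per weight) and treats the card's rounded figures (26B total, ~4B active) as inputs. The helper name `weight_gib` is hypothetical, and the estimate ignores the multimodal projector, the KV cache, and any tensors a real GGUF file keeps at higher precision.

```python
# Back-of-the-envelope weight-memory estimate; not a measurement of the real file.

# GGML Q4_0: each block of 32 weights stores one fp16 scale (2 bytes)
# plus 32 four-bit values (16 bytes), i.e. 18 bytes per 32 weights.
Q4_0_BITS_PER_WEIGHT = 18 * 8 / 32
BF16_BITS_PER_WEIGHT = 16.0


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Return the storage needed for n_params weights, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Rounded figures taken from the model card text (assumed exact here).
TOTAL_PARAMS = 26e9
ACTIVE_PARAMS = 4e9

bf16_gib = weight_gib(TOTAL_PARAMS, BF16_BITS_PER_WEIGHT)
q4_0_gib = weight_gib(TOTAL_PARAMS, Q4_0_BITS_PER_WEIGHT)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"bf16 weights : {bf16_gib:.1f} GiB")
print(f"Q4_0 weights : {q4_0_gib:.1f} GiB")
print(f"size ratio   : {q4_0_gib / bf16_gib:.3f}")
print(f"active share : {active_fraction:.1%} of parameters per token")
```

The estimate uses the total parameter count because, for a Mixture-of-Experts model, all experts generally still have to be loaded; the active count mainly reduces per-token compute.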

gemma-4-31b-it-qat-q4_0
Gemma 4 31B is the largest dense multimodal (text + image) instruction-tuned model in the Gemma 4 family from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality while dramatically reducing the memory required to load the model. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Repository: localai | License: apache-2.0

edgetam
EdgeTAM is an ultra-efficient variant of the Segment Anything Model (SAM) for image segmentation. It uses a RepViT backbone and is only ~16MB quantized (Q4_0), making it ideal for edge deployment. Supports point-prompted and box-prompted image segmentation via the /v1/detection endpoint. Powered by sam3.cpp (C/C++ with GGML).

Repository: localai | License: apache-2.0

google-gemma-3-27b-it-qat-q4_0-small
This is a requantized version of https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf. The official QAT weights released by Google use fp16 (instead of Q6_K) for the embeddings table, which makes the official model take significantly more memory (and storage) than a Q4_0 quant is supposed to take. Requantizing with llama.cpp achieves a very similar result. Note that this model ends up smaller than the Q4_0 from Bartowski. This is because llama.cpp sets some tensors to Q4_1 when quantizing models to Q4_0 with imatrix, but this is a static quant. The perplexity score for this model is even lower than for the original model by Google, but the results are within margin of error, so it's probably just luck. I also fixed the control token metadata, which was slightly degrading the performance of the model in instruct mode.

Repository: localai | License: gemma
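
To make the embeddings-table point in the google-gemma-3-27b-it-qat-q4_0-small card concrete, here is a small sketch of the arithmetic. It compares an fp16 table with a Q6_K one, using the GGML Q6_K layout (256 weights per 210-byte super-block, so 6.5625 bits per weight). The vocabulary and hidden sizes are illustrative assumptions, not values read from the GGUF file, and `table_bytes` is a hypothetical helper name.

```python
# Sketch: how much an embeddings table shrinks when stored as Q6_K instead of fp16.

FP16_BITS_PER_WEIGHT = 16.0
# GGML Q6_K packs 256 weights into a 210-byte super-block.
Q6_K_BITS_PER_WEIGHT = 210 * 8 / 256


def table_bytes(vocab_size: int, hidden_size: int, bits_per_weight: float) -> float:
    """Return the size in bytes of a vocab_size x hidden_size table."""
    return vocab_size * hidden_size * bits_per_weight / 8


# ASSUMPTION: illustrative dimensions, not taken from the model file.
VOCAB_SIZE = 262_144
HIDDEN_SIZE = 5_376

fp16_bytes = table_bytes(VOCAB_SIZE, HIDDEN_SIZE, FP16_BITS_PER_WEIGHT)
q6_k_bytes = table_bytes(VOCAB_SIZE, HIDDEN_SIZE, Q6_K_BITS_PER_WEIGHT)
saved_bytes = fp16_bytes - q6_k_bytes

print(f"fp16 table : {fp16_bytes / 2**30:.2f} GiB")
print(f"Q6_K table : {q6_k_bytes / 2**30:.2f} GiB")
print(f"saved      : {saved_bytes / 2**30:.2f} GiB")
```

Under these assumed dimensions the table alone accounts for well over a gibibyte of difference, which is consistent with the card's remark about the extra memory and storage.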