Model Gallery

Discover and install AI models from our curated collection

11 models available from 1 repository


nemo-parakeet-tdt-0.6b
NVIDIA NeMo Parakeet TDT 0.6B v3 is an automatic speech recognition (ASR) model from NVIDIA's NeMo toolkit. Parakeet models are state-of-the-art ASR models trained on large-scale English audio data.

Repository: localai | License: apache-2.0
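
As a quick usage sketch (not part of the model card): LocalAI exposes an OpenAI-compatible `/v1/audio/transcriptions` endpoint, so once the model is installed you could transcribe a file roughly as below. The server address, the placeholder file name, and whether this gallery entry registers under the name `nemo-parakeet-tdt-0.6b` are assumptions; check your LocalAI instance.

```python
# Minimal transcription sketch against a local LocalAI server
# (assumed at http://localhost:8080). Requires: pip install openai
from openai import OpenAI

# LocalAI speaks the OpenAI API; the api_key is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

with open("meeting.wav", "rb") as audio_file:  # hypothetical input file
    transcript = client.audio.transcriptions.create(
        model="nemo-parakeet-tdt-0.6b",  # assumed gallery model name
        file=audio_file,
    )

print(transcript.text)
```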

nvidia_llama-3_3-nemotron-super-49b-v1
Llama-3.3-Nemotron-Super-49B-v1 is a large language model (LLM) derived from Meta's Llama-3.3-70B-Instruct (the reference model). It is a reasoning model, post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling, and it supports a context length of 128K tokens.

Llama-3.3-Nemotron-Super-49B-v1 offers a great tradeoff between model accuracy and efficiency, and efficiency (throughput) directly translates to savings. Using a novel Neural Architecture Search (NAS) approach, we greatly reduce the model's memory footprint, enabling larger workloads as well as fitting the model on a single GPU (H200) at high workloads. This NAS approach enables the selection of a desired point in the accuracy-efficiency tradeoff.

The model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for math, code, reasoning, and tool calling, as well as multiple reinforcement learning (RL) stages using REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction following. The final model checkpoint is obtained after merging the final SFT and Online RPO checkpoints. For more details on how the model was trained, please see NVIDIA's accompanying blog post.

Repository: localai | License: llama3.3
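
A minimal chat sketch, assuming a LocalAI server at `localhost:8080` and that the deployed model keeps the upstream card's system-prompt reasoning toggle ("detailed thinking on" / "detailed thinking off"); verify both against your installation.

```python
# Toggling Nemotron's reasoning mode via the system prompt (per the
# upstream model card; treat the exact toggle string as an assumption).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nvidia_llama-3_3-nemotron-super-49b-v1",  # assumed gallery name
    messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "A train leaves at 09:40 and arrives at 13:05. How long is the trip?"},
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)
```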

nvidia_llama-3_3-nemotron-super-49b-genrm-multilingual
Llama-3.3-Nemotron-Super-49B-GenRM-Multilingual is a generative reward model that uses Llama-3.3-Nemotron-Super-49B-v1 as its foundation and is fine-tuned with reinforcement learning to predict the quality of LLM-generated responses. It can judge the quality of a single response, or rank two responses, given a multilingual conversation history. It first generates a reasoning trace and then outputs an integer score; a higher score means the response is of higher quality.

Repository: localai | License: llama3.3
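
Since the card says the model first emits a reasoning trace and then an integer score, a judging call could be post-processed roughly as below. The message layout and the trailing-integer parsing are illustrative assumptions; consult the upstream model card for the exact judging template.

```python
# Hypothetical sketch: ask the reward model to judge a response from a
# multilingual conversation, then extract the final integer score.
import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="nvidia_llama-3_3-nemotron-super-49b-genrm-multilingual",  # assumed name
    messages=[
        {"role": "user", "content": "¿Cuál es la capital de Francia?"},
        {"role": "assistant", "content": "La capital de Francia es París."},
    ],
)

judgement = completion.choices[0].message.content
match = re.search(r"(-?\d+)\s*$", judgement)  # assume the score is the last integer
score = int(match.group(1)) if match else None
print(score)
```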

nvidia_llama-3.1-8b-ultralong-1m-instruct
We introduce UltraLong-8B, a series of ultra-long-context language models designed to process extensive sequences of text (up to 1M, 2M, and 4M tokens) while maintaining competitive performance on standard benchmarks. Built on Llama-3.1, UltraLong-8B leverages a systematic training recipe that combines efficient continued pretraining with instruction tuning to enhance long-context understanding and instruction-following capabilities. This approach enables our models to efficiently scale their context windows without sacrificing general performance. This entry is the 1M-token variant.

Repository: localai | License: llama3.1
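
To make the 1M-token figure concrete: at a rough 4 characters per token, that is several megabytes of plain text in a single request. A sketch of a whole-document query, assuming a LocalAI server and this gallery name:

```python
# Feed an entire large document in one request (sketch; assumes the
# server is configured with a context window large enough for the file).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

text = open("big_report.txt", encoding="utf-8").read()  # hypothetical file
print(f"~{len(text) / 4:.0f} tokens (rough 4-chars-per-token estimate)")

response = client.chat.completions.create(
    model="nvidia_llama-3.1-8b-ultralong-1m-instruct",  # assumed gallery name
    messages=[
        {"role": "user", "content": text + "\n\nSummarize the key findings above."},
    ],
)
print(response.choices[0].message.content)
```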

nvidia_llama-3.1-8b-ultralong-4m-instruct
We introduce UltraLong-8B, a series of ultra-long-context language models designed to process extensive sequences of text (up to 1M, 2M, and 4M tokens) while maintaining competitive performance on standard benchmarks. Built on Llama-3.1, UltraLong-8B leverages a systematic training recipe that combines efficient continued pretraining with instruction tuning to enhance long-context understanding and instruction-following capabilities. This approach enables our models to efficiently scale their context windows without sacrificing general performance. This entry is the 4M-token variant.

Repository: localai | License: llama3.1

nvidia_llama-3.1-nemotron-nano-4b-v1.1
Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) derived from nvidia/Llama-3.1-Minitron-4B-Width-Base, which was created from Llama 3.1 8B using our LLM compression technique and offers improvements in model accuracy and efficiency. It is a reasoning model, post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling.

Llama-3.1-Nemotron-Nano-4B-v1.1 offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally, and it supports a context length of 128K tokens.

The model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for math, code, reasoning, and tool calling, as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints.

This model is part of the Llama Nemotron Collection; the other models in this family are Llama-3.3-Nemotron-Ultra-253B-v1, Llama-3.3-Nemotron-Super-49B-v1, and Llama-3.1-Nemotron-Nano-8B-v1. This model is ready for commercial use.

Repository: localai | License: llama3.1
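
The "fits on a single RTX GPU" claim is easy to sanity-check with back-of-the-envelope weight-memory arithmetic (weights only; the KV cache and activations come on top). A small sketch, with the quantization bit-widths as illustrative assumptions:

```python
# Rough weight-memory estimate for a ~4.5B-parameter model at common
# quantization levels (illustrative; real GGUF file sizes vary a bit).
PARAMS = 4.5e9  # approximate parameter count

for name, bits_per_weight in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    gib = PARAMS * bits_per_weight / 8 / 2**30
    print(f"{name:7s} ~{gib:.1f} GiB of weights")
# FP16 ~8.4 GiB, Q8_0 ~4.5 GiB, Q4_K_M ~2.5 GiB: comfortably single-GPU.
```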

sicariussicariistuff_impish_llama_4b
5th of May, 2025: Impish_LLAMA_4B. Almost a year ago, I created Impish_LLAMA_3B, the first fully coherent 3B roleplay model at the time. It was quickly adopted by some platforms, as well as being one of the go-to models for mobile. After some time, I made Fiendish_LLAMA_3B and insisted it was not an upgrade but a different flavor (which was indeed the case, as a different dataset was used to tune it).

Impish_LLAMA_4B, however, is an upgrade, a big one. I've had over a dozen 4B candidates, but none of them were 'worthy' of the Impish badge. This model has superior responsiveness and context awareness, and is able to pull off very coherent adventures. It even comes with some additional assistant capabilities. Of course, while it is exceptionally competent for its size, it is still 4B; manage expectations and all that. I, however, am very much pleased with it. It took several tries to pull off just right.

Total tokens trained: about 400M (due to being a generalist model, lots of tokens went there, despite the emphasis on roleplay & adventure). This took more effort than I thought it would. Because of course it would. This is mainly due to my refusing to release a model only 'slightly better' than my two 3B models mentioned above, because what would be the point in that?

The reason I included so many tokens for this tune is that small models are especially sensitive to many factors, including the percentage of moisture in the air and how many times I ran nvidia-smi since the system last started. It's no secret that roleplay/creative-writing tunes can reduce a model's general intelligence (any tune and RL risk this, but roleplay models are especially 'fragile'), so additional tokens of general assistant data were needed in my opinion, and they indeed seemed to help a lot with retaining intelligence.

This model is also 'built a bit different', literally, as it is based on NVIDIA's prune; from my own subjective impression, it does not 'behave' like a typical 8B. This helped a lot with keeping it smart at such a size. To be honest, my 'job' here in open source is 'done' at this point: I've achieved everything I wanted to do here, and then some.

Repository: localai | License: llama3.1

nvidia_acereason-nemotron-14b
We're thrilled to introduce AceReason-Nemotron-14B, a math and code reasoning model trained entirely through reinforcement learning (RL), starting from DeepSeek-R1-Distilled-Qwen-14B. It delivers impressive results, achieving 78.6% on AIME 2024 (+8.9%), 67.4% on AIME 2025 (+17.4%), 61.1% on LiveCodeBench v5 (+8%), 54.9% on LiveCodeBench v6 (+7%), and a 2024 rating on Codeforces (+543).

We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first RL training on math-only prompts, then RL training on code-only prompts. Notably, we find that math-only RL significantly enhances the performance of strong distilled models not only on math benchmarks but also on code reasoning tasks. In addition, extended code-only RL further improves code benchmark performance while causing minimal degradation in math results. We find that RL not only elicits the foundational reasoning capabilities acquired during pre-training and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.

Repository: localai
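
The staged recipe described above (RL on math-only prompts first, then on code-only prompts) is easy to express as a schematic loop. The following is a deliberately toy sketch that only illustrates the stage ordering with stand-in stubs; it is not NVIDIA's training code, and every name in it is hypothetical.

```python
# Toy illustration of the two-stage RL curriculum: math-only first,
# then code-only. A real trainer would update model weights here.
def rl_stage(policy, prompts, reward_fn, steps=1):
    for _ in range(steps):
        for prompt in prompts:
            rollout = policy(prompt)     # sample a reasoning trace
            reward = reward_fn(rollout)  # verifiable 0/1 reward
            # a real trainer would take a policy-gradient step here
            print(f"{prompt!r} -> reward {reward}")
    return policy

policy = lambda p: "answer: 4" if "2+2" in p else "def f(): return 1"
math_reward = lambda out: int(out.endswith("4"))  # check the final answer
code_reward = lambda out: int("return" in out)    # stand-in for unit tests

policy = rl_stage(policy, ["compute 2+2"], math_reward)  # stage 1: math-only
policy = rl_stage(policy, ["write f()"], code_reward)    # stage 2: code-only
```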

nvidia_nemotron-research-reasoning-qwen-1.5b
Nemotron-Research-Reasoning-Qwen-1.5B is the world's leading 1.5B open-weight model for complex reasoning tasks such as mathematical problems, coding challenges, scientific questions, and logic puzzles. It is trained using the ProRL algorithm on a diverse and comprehensive set of datasets. Our model has achieved impressive results, outperforming DeepSeek's 1.5B model by a large margin on a broad range of tasks, including math, coding, and GPQA. This model is for research and development only.

Repository: localai

qwen3-nemotron-32b-rlbff-i1
**Model Name:** Qwen3-Nemotron-32B-RLBFF
**Base Model:** Qwen/Qwen3-32B
**Developer:** NVIDIA
**License:** NVIDIA Open Model License

**Description:** Qwen3-Nemotron-32B-RLBFF is a high-performance, fine-tuned large language model built on the Qwen3-32B foundation. It is specifically optimized to generate high-quality, helpful responses in a default thinking mode through advanced reinforcement learning with binary flexible feedback (RLBFF). Trained on the HelpSteer3 dataset, this model excels in reasoning, planning, coding, and information-seeking tasks while maintaining strong safety and alignment with human preferences.

**Key Performance (as of Sep 2025):**
- **MT-Bench:** 9.50 (near GPT-4-Turbo level)
- **Arena Hard V2:** 55.6%
- **WildBench:** 70.33%

**Architecture & Efficiency:**
- 32 billion parameters, based on the Qwen3 Transformer architecture
- Designed for deployment on NVIDIA GPUs (Ampere, Hopper, Turing)
- Achieves performance comparable to DeepSeek R1 and O3-mini at less than 5% of the inference cost

**Use Case:** Ideal for applications requiring reliable, thoughtful, and safe responses, such as advanced chatbots, research assistants, and enterprise AI systems.

**Access & Usage:** Available on Hugging Face with support for Hugging Face Transformers and vLLM.

**Cite:** [Wang et al., 2025, RLBFF: Binary Flexible Feedback](https://arxiv.org/abs/2509.21319)

👉 *Note: The GGUF version (mradermacher/Qwen3-Nemotron-32B-RLBFF-i1-GGUF) is a user-quantized variant. The original model is available at nvidia/Qwen3-Nemotron-32B-RLBFF.*

Repository: localai | License: apache-2.0
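
A usage sketch through LocalAI's OpenAI-compatible API. That the model "thinks" by default and wraps its reasoning in Qwen3-style `<think>...</think>` tags is an assumption carried over from the Qwen3 base; adjust the parsing if your server already strips the trace.

```python
# Chat with the model and separate the (assumed) <think> reasoning block
# from the final answer.
import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3-nemotron-32b-rlbff-i1",  # assumed gallery name
    messages=[{"role": "user", "content": "Plan a 3-step study schedule for linear algebra."}],
)

raw = resp.choices[0].message.content
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
print(answer)
```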

nvidia.qwen3-nemotron-32b-rlbff
The **nvidia/Qwen3-Nemotron-32B-RLBFF** is a large language model based on the Qwen3 architecture, fine-tuned by NVIDIA using Reinforcement Learning with Binary Flexible Feedback (RLBFF) for improved alignment with human preferences. With 32 billion parameters, it excels in complex reasoning, instruction following, and natural language generation, making it suitable for advanced tasks such as code generation, dialogue systems, and content creation.

This model is part of NVIDIA's Nemotron series, designed to deliver high performance and safety in real-world applications. It is optimized for efficient deployment while maintaining strong language understanding and generation capabilities.

**Key Features:**
- **Base Model**: Qwen3-32B
- **Fine-tuning**: Reinforcement Learning with Binary Flexible Feedback (RLBFF)
- **Use Case**: Advanced text generation, coding, dialogue, and reasoning
- **License**: MIT (check Hugging Face for full details)

👉 [View on Hugging Face](https://huggingface.co/nvidia/Qwen3-Nemotron-32B-RLBFF)

*Note: The GGUF version hosted by DevQuasar is a quantized variant for efficient local inference. The original, unquantized model is available at the link above.*

Repository: localai | License: apache-2.0