Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

arXiv

A large-scale 120B-parameter hybrid model (12B active parameters) combining Mamba state-space layers, Transformer attention, and a novel LatentMoE architecture, optimized for FP4 precision and agentic reasoning.

by NVIDIA

Published Apr 14, 2026

Utility: 8.0/10

citations

co_authors

547

Platform Domination: low

Market Consolidation: medium

Displacement Horizon: 1-2 years

REASONING

Nemotron 3 Super combines several frontier architectural trends: the efficiency of Mamba state-space models (SSMs), the scaling capacity of Mixture-of-Experts (MoE), and the inference speed of Multi-Token Prediction (MTP). The 547 forks within 3 days despite 0 stars (likely due to a synchronized release/mirroring event or high-velocity institutional interest) signal strong industry attention.

Its defensibility (8/10) is rooted in hardware-software co-design: training effectively in NVFP4 (NVIDIA's 4-bit floating-point format) requires Blackwell-era hardware expertise and infrastructure that few outside NVIDIA or top-tier labs possess. Frontier labs are unlikely to make it obsolete, because NVIDIA *is* the frontier lab here, providing the open-weights alternative to GPT-4-class models. The LatentMoE component and MTP integration suggest a heavy focus on reducing the KV-cache bottleneck and inference latency, which are the primary barriers to agentic workflows. While the architecture can be replicated, the pre-training recipe at this scale (120B) serves as a formidable moat.

The primary risk is the rapid evolution of SSM-Transformer hybrids (like Jamba or Zamba), which could offer better trade-offs before this model gains deep library ecosystem support.
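To make the low-precision point concrete, here is a minimal sketch of block-scaled 4-bit floating-point "fake quantization" in plain Python. It is an illustration under stated assumptions, not NVIDIA's training recipe: it assumes the E2M1 value grid and 16-element blocks commonly described for NVFP4, keeps each block scale in full precision (real NVFP4 stores scales in a lower-precision format), and all function names are hypothetical. The last helper simply computes raw weight storage for a given bit width, ignoring scale overhead.

```python
# Magnitudes representable in E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: NVFP4-style micro-blocks that share one scale


def quantize_block(block):
    """Quantize one block to the E2M1 grid; return (scale, codes)."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest FP4 value (6.0).
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate real values from a block scale and FP4 codes."""
    return [scale * c for c in codes]


def fake_quantize(values, block_size=BLOCK_SIZE):
    """Quantize then dequantize, block by block, to expose rounding error."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(dequantize_block(scale, codes))
    return out


def weight_gigabytes(num_params, bits_per_param):
    """Raw weight storage in GB (10^9 bytes), ignoring per-block scale overhead."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    weights = [0.02, -0.31, 0.77, -1.25, 0.0, 2.9, -0.05, 0.4]
    print(fake_quantize(weights))
    print(weight_gigabytes(120_000_000_000, 4))   # 4-bit weights
    print(weight_gigabytes(120_000_000_000, 16))  # 16-bit weights
```

Because every block carries its own scale, one outlier only coarsens the 16 values it shares a block with, which is the usual argument for block scaling over a single per-tensor scale. Doing this stably during training, rather than only at inference, is the part the analysis above treats as hard to replicate.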

COMPOSABILITY

TECH STACK

Python, PyTorch, NVFP4 (NVIDIA FP4 Precision), CUDA, Megatron-LM, Mamba (SSM), Transformer

INTEGRATION

reference_implementation

mixture_of_experts, state_space_modeling, agentic_reasoning, speculative_decoding, low_precision_training

READINESS

Composability: algorithm