NVIDIA’s Extreme Co-Design: From GPU Hardware to Fully Open Nemotron LLMs
These articles are AI-generated summaries. Please check the original sources for full details.
Even the chip makers are making LLMs
NVIDIA VP Kari Briski explains why the company now works across the full stack, from GPUs to its own Nemotron family of models. Since 2018, NVIDIA has used a rapid hardware-software feedback loop in which demanding LLM workloads inform GPU architecture.
Why This Matters
A mismatch between what AI models require and what hardware handles efficiently often creates significant performance bottlenecks. By employing ‘extreme co-design,’ NVIDIA feeds model requirements into hardware planning (Blackwell's NVFP4 precision is one example) so that memory hierarchies and networking stacks are purpose-built for agentic systems. This approach moves beyond general-purpose computing toward one in which software libraries and hardware SKUs are developed in step to handle million-token context lengths and disaggregated serving.
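To make the memory argument concrete, here is a minimal sketch of block-scaled 4-bit floating-point quantization, the general idea behind formats such as NVFP4. The block size of 16, the 8-bit scale, and the magnitude grid are assumptions for illustration, and the function names are hypothetical. The sketch keeps the block scale as an ordinary Python float rather than rounding it to 8 bits, so it is not a description of NVIDIA's kernels.

```python
# Illustrative sketch of block-scaled 4-bit float quantization.
# Assumptions (not from the article): 16-value blocks, one 8-bit scale per
# block, and the magnitude grid below. Real NVFP4 kernels differ in detail.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # assumed 4-bit magnitudes
BLOCK = 16  # assumed number of values sharing one scale


def quantize_block(values):
    """Return (scale, codes); each code is a signed value from the grid."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from the shared scale and the codes."""
    return [c * scale for c in codes]


def bits_per_value(value_bits=4, scale_bits=8, block=BLOCK):
    """Average storage per value, including the shared block scale."""
    return value_bits + scale_bits / block
```

Under these assumptions each value costs 4.5 bits on average, against 16 bits for FP16 storage. The per-block scale is what lets so coarse a grid follow local changes in magnitude.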
Key Insights
- NVIDIA Blackwell supports NVFP4, a 4-bit floating-point precision that reduces memory footprint while, according to NVIDIA, retaining accuracy that post-training quantization can lose.
- The Nemotron family includes Nano, Super, and Ultra models, with Nano V3 released in late 2025 and Ultra scheduled for April 2026.
- The hybrid architecture, which combines Mamba state space layers with Transformer layers, improves token efficiency by avoiding the quadratic growth in inference time with sequence length that attention-only models incur.
- NVIDIA’s Dynamo framework enables disaggregated serving, allowing prefill and decode tasks to run on different GPU SKUs for maximum efficiency.
- A $180,000 AI robotics competition launched by Intrinsic and NVIDIA targets dexterous cable management using open-source AI tools.
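The scaling claim in the hybrid Mamba insight above can be illustrated with a toy cost model. The asymptotic forms are standard: self-attention work grows with the square of sequence length, while a state space layer updates a fixed-size state once per token. The constants `d_model` and `d_state` are hypothetical and do not describe any Nemotron model.

```python
# Toy cost model: how per-sequence work grows with context length.
# The constants (d_model, d_state) are hypothetical placeholders.


def attention_cost(n_tokens, d_model=4096):
    """Self-attention compares every token with every other: O(n^2 * d)."""
    return n_tokens * n_tokens * d_model


def ssm_cost(n_tokens, d_model=4096, d_state=128):
    """A state space layer updates a fixed-size state per token: O(n * d * s)."""
    return n_tokens * d_model * d_state


def growth_factor(cost_fn, n_small, n_large):
    """How much more work the longer sequence needs under a given cost model."""
    return cost_fn(n_large) / cost_fn(n_small)


if __name__ == "__main__":
    for n in (8_000, 64_000, 1_000_000):
        ratio = attention_cost(n) / ssm_cost(n)
        print(f"{n:>9} tokens: attention/ssm cost ratio = {ratio:,.1f}")
```

In this model a tenfold longer context costs one hundred times more attention work but only ten times more state space work. That difference is why a hybrid stack can keep some attention layers for quality while handing most of the sequence mixing to cheaper layers.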
Practical Applications
- Domain Specialization: ServiceNow utilized NVIDIA’s open data to create the Apriel model and custom ‘gym’ environments for task-specific verification.
- Agentic Memory Management: Context memory engines store and recall million-token contexts for complex coding and documentation tasks.
- Cybersecurity: Partners leverage open-source weights to build specialized verifiers that identify false positives in threat detection systems.
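As a rough illustration of why million-token contexts call for dedicated memory management, the sketch below estimates the key-value cache size of attention layers. The formula is the usual back-of-envelope one; the layer count, head count, and head dimension used in the example are hypothetical placeholders, not Nemotron specifications.

```python
# Back-of-envelope KV cache sizing for a long context.
# All model dimensions passed in are hypothetical, not Nemotron specs.


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Keys and values (the factor of 2) are stored per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def to_gib(n_bytes):
    """Convert a byte count to gibibytes."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical model: 32 attention layers, 8 KV heads of size 128, 2-byte values.
    size = kv_cache_bytes(1_000_000, 32, 8, 128, 2)
    print(f"{to_gib(size):.1f} GiB for one million tokens")
```

With these placeholder dimensions a single million-token sequence needs roughly 122 GiB of cache. In a hybrid model only the attention layers contribute to this total, so `n_layers` would count those layers alone.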
References:
- https://stackoverflow.blog/2026/03/10/even-the-chip-makers-are-making-llms/
- intrinsic.ai/stack