Nemotron 3 Ultra went live June 4. Here's the call that works.

NVIDIA Nemotron 3 Ultra GA June 4: how to call via NIM/OpenRouter, hardware floor, and the base-checkpoint caveat.

Jun 02, 2026

NVIDIA shipped Nemotron 3 Ultra on June 4, 2026 — its largest open-weights model and the new high-water mark for US open releases. Before you wire it into an agent harness, here is exactly what landed and where it sits on the leaderboard.

What NVIDIA Launched on June 4: Specs and Leaderboard Position

Nemotron 3 Ultra is a 550-billion-parameter hybrid Mamba-Transformer mixture-of-experts (MoE) model with up to roughly 55B active parameters per token (about 90% sparsity), released by NVIDIA on June 4, 2026. It tops the three-model Nemotron 3 family (Nano, Super, Ultra), ships a 1M-token context window, and uses NVFP4 (4-bit floating point) training on NVIDIA's Blackwell architecture plus a "LatentMoE" hardware-aware expert router. It is post-trained for agent harnesses including Hermes Agent, LangChain Deep Agents, OpenHands, and OpenCode.

On the Artificial Analysis Intelligence Index, Ultra scored 48, the highest of any open model from a US lab to date, but it trails the Chinese open-weights Kimi K2.6 and closed models such as Anthropic's Opus 4.8:

Model	Type	Intelligence Index
Opus 4.8 (Anthropic)	Closed	61
Kimi K2.6	Open (China)	54
Nemotron 3 Ultra	Open (US)	48

Speed is the more interesting story. Evaluating BF16 weights in partnership with NVIDIA, Artificial Analysis measured over 300 tokens/second on a pre-release DeepInfra endpoint, versus roughly 50–100 tok/s for similarly sized open models from Chinese labs such as DeepSeek and Moonshot.

"Nemotron 3 Ultra lands in what we call the most attractive intelligence-vs-speed quadrant," according to Artificial Analysis.

General availability runs through build.nvidia.com (as NIM microservices), Hugging Face, OpenRouter, ModelScope, and cloud partners. The rest of this guide covers the call that actually works.

Before Invoking Ultra: NGC Auth, Compute Minimum, and Which Checkpoint

Three things gate your first Ultra call: an account, the right hardware, and the right checkpoint. For build.nvidia.com you need an NVIDIA NGC account and an API key; the free tier covers low-volume prototyping. OpenRouter is the alternative path and uses its own account key instead — pick one, not both.

Mind the compute floor. NVIDIA benchmarked Ultra Base on GB200 NVL72, and the smaller Nemotron 3 Super (120B total / 12B active) already lists an 8×H100-80GB minimum. The 550B Ultra is larger, so plan for data-center hardware or a hosted endpoint, not a workstation.

Finally, use the post-trained instruct checkpoint, not Ultra Base. NVIDIA's own Base usage guide states the base weights have not undergone instruction tuning or alignment and are not a drop-in assistant. Ultra's final public model slug was not in Build/NIM API lists before the June 4 launch, so pull the exact identifier from the live model card before writing any code.
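To put a rough number on that floor, the arithmetic below estimates the weights-only footprint of a 550B-parameter model at a few precisions. It is a back-of-the-envelope sketch, not an NVIDIA sizing guide: the bytes-per-parameter values are nominal assumptions, NVFP4 scale factors are ignored, and KV cache, activations, and runtime overhead all come on top.

```python
import math

TOTAL_PARAMS = 550e9  # total parameter count, as stated in the article

# Assumed nominal storage cost per parameter (NVFP4 scale factors ignored).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}


def weight_footprint_gb(total_params: float, bytes_per_param: float) -> float:
    """Weights-only footprint in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9


def min_gpus_for_weights(footprint_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(footprint_gb / gpu_memory_gb)


for fmt, nbytes in BYTES_PER_PARAM.items():
    gb = weight_footprint_gb(TOTAL_PARAMS, nbytes)
    gpus = min_gpus_for_weights(gb)
    print(f"{fmt}: {gb:,.0f} GB of weights, at least {gpus} x 80 GB GPUs")
```

Even the 4-bit case works out to 275 GB of weights, well beyond a single 80 GB card. Sparsity does not shrink this number in a typical MoE deployment, because all experts usually need to be resident even though only about 55B parameters fire per token.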

Calling Ultra via Hosted NIM or OpenRouter

The fastest way to call Nemotron 3 Ultra is the OpenAI-compatible Chat Completions API: the same client works across all three delivery paths below, and only the base_url and model slug change. NVIDIA shipped Ultra on June 4, 2026 via build.nvidia.com NIM microservices, OpenRouter, and Hugging Face. Pick a path based on whether you want managed inference, a no-NGC fallback, or a self-hosted container.

Path 1 — build.nvidia.com (hosted NIM). Generate an NGC API key, then instantiate the standard OpenAI Python client with base_url="https://integrate.api.nvidia.com/v1" and api_key=<NGC key>. Set model= to the exact slug printed on the live Ultra model card, enable streaming, and read tokens from the response. The confirmed pattern from the Nemotron 3 Super Build page uses the same client with a slug such as nvidia/nemotron-3-super-120b-a12b and streamed reasoning_content chunks.

This illustrative snippet shows the same request as a minimal HTTP call using only the standard library. It has not been executed: it needs a live key and the final slug.

import json
import os
import urllib.request

# Read the key from the environment rather than hard-coding it.
api_key = os.environ.get("NVIDIA_API_KEY")
if not api_key:
    raise SystemExit("Set NVIDIA_API_KEY")

payload = {
    # Placeholder slug: replace with the exact string from the live Ultra model card.
    "model": "nvidia/nemotron-3-ultra",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
    "stream": False,  # return one JSON body instead of streamed chunks
}
req = urllib.request.Request(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Accept": "application/json",
    },
    method="POST",
)
# urlopen raises urllib.error.HTTPError on 4xx/5xx, e.g. for a bad key or unknown slug.
with urllib.request.urlopen(req, timeout=30) as r:
    data = json.load(r)
print(data["choices"][0]["message"]["content"])

Path 2 — OpenRouter. Identical client code, but point base_url at https://openrouter.ai/api/v1 with an OpenRouter key; no NGC credential is required. This is a useful fallback while the NIM slug propagates across regions.

Path 3 — self-hosted NIM container. Run docker login nvcr.io with NGC credentials, then docker run --gpus all -p 8000:8000 <NIM image> and POST a standard messages payload to http://0.0.0.0:8000/v1/chat/completions.

For inference defaults, borrow the published Nemotron 3 Super model-card values until the Ultra card states otherwise: temperature=1.0 and top_p=0.95 across reasoning, tool-calling, and general chat. Toggle extended reasoning with enable_thinking=True/False in the chat-template kwargs; reasoning tokens then arrive in the reasoning_content field of each streamed chunk.
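Since the three paths differ only in base URL, key, and slug, one request builder can cover all of them. The sketch below is one way to arrange that with the standard library. The base URLs come from the text above; the dictionary keys, the `build_chat_request` helper, and the model slugs are hypothetical placeholders, and the assumption that a local container needs no bearer token should be checked against your deployment.

```python
import json
import urllib.request

# Base URLs as given in the text. The slugs are placeholders: replace each
# with the exact string from that provider's live model card.
ENDPOINTS = {
    "nim_hosted": {"base_url": "https://integrate.api.nvidia.com/v1", "model": "nvidia/nemotron-3-ultra"},
    "openrouter": {"base_url": "https://openrouter.ai/api/v1", "model": "nvidia/nemotron-3-ultra"},
    "nim_local": {"base_url": "http://0.0.0.0:8000/v1", "model": "nvidia/nemotron-3-ultra"},
}


def build_chat_request(path, prompt, api_key=None, max_tokens=64):
    """Build (but do not send) a Chat Completions request for one path."""
    cfg = ENDPOINTS[path]
    payload = {
        "model": cfg["model"],
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if api_key:  # assumed optional for a local container
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        cfg["base_url"] + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )


req = build_chat_request("openrouter", "Say hello in one sentence.", api_key="sk-example")
print(req.get_method(), req.full_url)
```

Sending the request is then a matter of passing `req` to `urllib.request.urlopen`, so switching providers is a one-word change at the call site.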
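A streamed response has to be split into reasoning and answer text on the client side. The sketch below shows one way to do that, assuming the endpoint emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`) and accepts the thinking toggle under a top-level `chat_template_kwargs` key. Both are assumptions to verify against the live Ultra card, and `build_payload` and `collect_stream` are hypothetical helper names.

```python
import json


def build_payload(model, prompt, thinking=True):
    """Request body using the Super-card defaults quoted in the text."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "stream": True,
        # Assumption: template flags are passed under this key.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def collect_stream(lines):
    """Accumulate reasoning and answer text from server-sent-event lines."""
    reasoning, answer = [], []
    for raw in lines:
        line = raw.decode() if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        chunk = json.loads(body)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            reasoning.append(delta.get("reasoning_content") or "")
            answer.append(delta.get("content") or "")
    return "".join(reasoning), "".join(answer)


# Canned chunks standing in for a live stream.
sample = [
    b'data: {"choices":[{"delta":{"reasoning_content":"Greet the user. "}}]}\n',
    b"\n",
    b'data: {"choices":[{"delta":{"content":"Hello!"}}]}\n',
    b"data: [DONE]\n",
]
thoughts, reply = collect_stream(sample)
print("reasoning:", thoughts)
print("answer:", reply)
```

Against a live endpoint, the response object returned by `urllib.request.urlopen` can be passed to `collect_stream` directly, since it iterates line by line as bytes.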

Known Gotchas: Base vs. Instruct Checkpoint, Compute Ceiling, and Slug Lag

Three traps will burn time if you skip them. The first is the checkpoint itself: NVIDIA's Ultra Base usage guide states the 550B-total / up-to-55B-active hybrid Mamba-Transformer MoE checkpoint has not undergone instruction tuning or post-training alignment, and is a starting point for domain fine-tuning and RL, not a drop-in assistant. Prompt Base like a chatbot and it will tend to continue your text rather than follow the instruction. Wait for the post-trained card before wiring it into a pipeline.

The second is compute. Every throughput figure NVIDIA publishes references GB200 NVL72, and the smaller Super (120B/12B-active) already lists an 8×H100-80GB minimum, so full Ultra is a multi-GPU data-center workload. DGX Spark (GB10 SoC, 128 GB unified memory) targets Nano and quantized Super tiers, not Ultra. NVFP4, Ultra's intended cost-reduction path, needs Blackwell-class silicon; Ampere or Hopper clusters cannot claim the FP4 savings, so your real per-token cost on that hardware runs above NVIDIA's headline figures.

The third is slug lag. Hugging Face's NVIDIA profile still described Ultra as in development on at least one checked page before launch. Never copy a model slug from a third-party blog; read the live model card and paste the exact string.
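One way to guard against slug lag in code is to ask the endpoint which models it serves before sending a chat request. The sketch below assumes the provider exposes an OpenAI-style model listing at `<base_url>/models` that returns `{"data": [{"id": ...}]}`; confirm that for your provider. `fetch_models` and `find_slugs` are hypothetical helper names, and the Ultra id in the sample data is deliberately fake.

```python
import json
import urllib.request


def fetch_models(base_url, api_key=None, timeout=30):
    """GET the model listing (assumed OpenAI-style) and return parsed JSON."""
    headers = {"Accept": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    req = urllib.request.Request(base_url.rstrip("/") + "/models", headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.load(r)


def find_slugs(listing, needle="nemotron-3-ultra"):
    """Return the sorted model ids in the listing that contain the needle."""
    ids = [m.get("id", "") for m in listing.get("data", [])]
    return sorted(i for i in ids if needle in i.lower())


# Canned listing standing in for a live response; the Ultra id is made up.
sample_listing = {
    "data": [
        {"id": "nvidia/nemotron-3-super-120b-a12b"},
        {"id": "example-org/FAKE-nemotron-3-ultra-slug"},
    ]
}
print(find_slugs(sample_listing))
```

An empty result is the signal to stop and check the model card rather than guess at a string.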

Beyond the Call: SFT Recipes, Compound Orchestration, and Independent Evals

Once you have a working call, the deeper value is post-training and orchestration. The NVIDIA-NeMo/Nemotron repo exposes the full Pretrain → SFT → RL pipeline; teams building domain-specific variants should start with the SFT recipe under training/ and the usage-cookbook/ for tool-calling and RAG patterns. NVIDIA's recommended topology keeps cost down: use Ultra as the planner/reasoner for hard coding or research steps, and route cheaper Nano or Super sub-agents for perception, routing, and summarization (source: DataCamp, 2026).
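The planner-plus-sub-agents topology can be reduced to a routing table. The sketch below is illustrative only: the task labels, the `pick_tier` and `assign_tiers` helpers, and the choice of Nano versus Super for each cheap task are assumptions, since the text says only that Ultra plans and reasons while Nano or Super handle perception, routing, and summarization.

```python
# Tier assignments are assumptions for illustration. Map each tier name to
# the exact slug from its live model card before making any calls.
TIER_FOR_TASK = {
    "plan": "ultra",
    "code": "ultra",
    "research": "ultra",
    "perception": "nano",
    "routing": "nano",
    "summarize": "super",
}


def pick_tier(task_kind, default="super"):
    """Choose a model tier for a task, falling back to the mid tier."""
    return TIER_FOR_TASK.get(task_kind, default)


def assign_tiers(steps):
    """Pair each (task_kind, description) step with a model tier."""
    return [(desc, pick_tier(kind)) for kind, desc in steps]


workflow = [
    ("plan", "break the bug report into subtasks"),
    ("perception", "extract text from the screenshot"),
    ("code", "write the patch"),
    ("summarize", "draft the changelog entry"),
]
for desc, tier in assign_tiers(workflow):
    print(f"{tier:>5}: {desc}")
```

Keeping the mapping in data rather than in branching logic makes it easy to re-tier a task once your own evals show where the cheaper models hold up.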

Two vendor figures still need replication on your own corpora: 60% fewer reasoning tokens versus Nemotron 2 Nano and a 91% PinchBench agent-productivity score. Treat both as hypotheses until the June 4 weights and endpoints let you measure them directly. The takeaway: ship the hosted call today, but earn the cost and accuracy claims with your own evals before you wire Ultra into production.
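Checking the reasoning-token claim on your own prompts comes down to comparing mean token counts between a baseline model and Ultra. The sketch below shows only that arithmetic. The function name is hypothetical, the sample counts are invented to exercise the formula, and how you obtain per-request reasoning-token counts depends on what your endpoint reports.

```python
from statistics import mean


def reasoning_token_reduction(baseline_tokens, candidate_tokens):
    """Fractional drop in mean reasoning tokens, candidate versus baseline."""
    base = mean(baseline_tokens)
    if base <= 0:
        raise ValueError("baseline mean must be positive")
    return 1.0 - mean(candidate_tokens) / base


# Invented counts for the same prompts on two models; not measured data.
baseline = [1200, 800, 1000]
candidate = [500, 300, 400]
print(f"{reasoning_token_reduction(baseline, candidate):.0%} fewer reasoning tokens")
```

Run both models on the same prompt set with identical sampling settings; otherwise the comparison measures the prompts as much as the models.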

Frequently asked questions

Is Nemotron 3 Ultra available via API on June 4, 2026?

Yes. NVIDIA states Ultra reaches general availability on June 4, 2026, hosted via build.nvidia.com as NIM microservices, OpenRouter, Hugging Face, and select cloud partners. For the lowest-friction path, generate an NGC API key, then call the OpenAI-compatible Chat Completions endpoint at https://integrate.api.nvidia.com/v1 using the exact model slug shown on the published Ultra page.

What is the difference between the Ultra Base checkpoint and the instruct model?

Ultra Base is an unaligned pretrained checkpoint: a 550B-total, up-to-55B-active hybrid Mamba-Transformer MoE intended as a starting point for SFT and RL post-training, not a drop-in assistant. NVIDIA's own usage guide states explicitly that the base checkpoint has not undergone instruction tuning or post-training alignment and is not meant for out-of-the-box production use. For chat, reasoning, and tool-calling, call the post-trained instruct variant once its model card is live.

Can Nemotron 3 Ultra run on a DGX Spark or a single H100?

No. NVIDIA measured Ultra's throughput on the GB200 NVL72 platform, and even the smaller Super (120B/12B-active) lists an 8×H100-80GB minimum, so Ultra realistically requires multi-GPU data-center hardware. The DGX Spark (GB10 SoC, 128 GB unified memory) targets the Nano and quantized Super tiers, not full Ultra. Without that class of cluster, use a hosted endpoint.

How does Nemotron 3 Ultra compare to closed frontier models on benchmarks?

Artificial Analysis scores Ultra 48 on its Intelligence Index, the highest of any US open-weights model as of June 2026 and ahead of Gemma 4 31B (39) and Nemotron 3 Super (36). It still trails the Chinese open-weights Kimi K2.6 (54) and closed models such as Anthropic's Opus 4.8 (61). Ultra leads the US open field but is not at the closed-model frontier.

What inference defaults should I use when calling Nemotron 3 Ultra?

Until an Ultra-specific model card confirms otherwise, the best documented defaults come from the Nemotron 3 Super card: temperature=1.0 and top_p=0.95 across reasoning, tool-calling, and general chat. Toggle extended reasoning via enable_thinking=True/False in chat-template kwargs; reasoning tokens stream back in the reasoning_content field. Validate these against your own workload once the June 4 weights are live.