Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

FP4-quantized Qwen3.6-35B fits in ~23 GB on Hopper. vLLM serve commands, env vars, DGX Spark config, and gotchas.

May 31, 2026

Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

NVIDIA published nvidia/Qwen3.6-35B-A3B-NVFP4 on May 28, 2026 — a post-training FP4-quantized variant of Alibaba's 35B MoE model that fits on a single H100 by cutting VRAM from ~71 GB to ~23 GB. If you're on an A100 or consumer GPU, jump to the gotchas section first — this quantization format does not run on your hardware.

71 GB → 23 GB: What Gets Quantized and What Doesn't

NVFP4 quantization targets the weights and activations of linear operators inside transformer and MoE blocks specifically — LayerNorms, embeddings, and biases stay in BF16/F32 for numerical stability . The selective 4-bit compression yields a 3.06× reduction in disk footprint and VRAM versus the BF16 base, dropping from roughly 71 GB to ~23 GB equivalent on Hopper hardware .

Quick Answer: nvidia/Qwen3.6-35B-A3B-NVFP4 fits a 35B MoE reasoning model on a single H100 by applying 4-bit quantization to linear operator weights and activations, reducing VRAM from ~71 GB to ~23 GB (3.06×) with under 1-point accuracy loss on standard benchmarks. Hopper or Blackwell required — A100 and RTX 4090 lack FP4 compute paths entirely.

The calibration pipeline used two datasets: cnn_dailymail (300K+ English news articles) and NVIDIA's Nemotron-Post-Training-Dataset-v2 for multi-turn dialogue coverage, processed with NVIDIA Model Optimizer v0.44.0 . The dual-dataset approach is worth noting: a quantization calibrated only on news articles would likely regress on structured, multi-turn instruction-following — and the benchmark results bear that out.

NVIDIA's official eval suite shows the accuracy gap is narrow. NVFP4 stays within 0.5–0.8 points of BF16 across reasoning benchmarks, and marginally outperforms on instruction-following and multimodal tasks :

Benchmark	BF16	NVFP4	Delta
MMLU Pro	85.6	85.0	−0.6
GPQA Diamond	84.9	84.8	−0.1
AIME 2025	89.2	88.8	−0.4
τ²-Bench Telecom	95.5	94.7	−0.8
SciCode	40.8	40.6	−0.2
IFBench	62.3	62.8	+0.5
MMMU Pro	74.1	74.5	+0.4

"The NVFP4 quantized model achieves nearly identical accuracy to the BF16 original while reducing memory requirements by 3.06×, enabling deployment on hardware that would otherwise require tensor parallelism across multiple GPUs." — NVIDIA Model Optimization Team, nvidia/Qwen3.6-35B-A3B-NVFP4 model card

Hopper or Blackwell: Why Other Cards Won't Work

FP4 tensor core execution paths exist only on Hopper (H100, H200) and Blackwell (GB200, GB300, DGX Spark GB10) architectures . The RTX 4090 (Ada Lovelace, sm_89), RTX 5090, and A100 (Ampere, sm_80) have no native FP4 compute units. Passing --quantization modelopt on those cards will produce an error at load time or, worse, silently wrong output.

Your fallback options on non-Hopper/Blackwell hardware:

BF16 base model: Requires ~71 GB VRAM — an RTX PRO 6000 (96 GB) or H100/A100 80 GB
Community GGUF quantizations: Run on consumer hardware via llama.cpp. unsloth/Qwen3.6-35B-A3B-NVFP4 and AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4 offer different quantization trade-offs and broader hardware coverage

DGX Spark (Blackwell, sm_120/121a) is officially supported but needs extra setup: CUDA 13.0 and the vllm/vllm-openai:cu130-nightly Docker image . Stable vLLM releases do not yet include the FlashInfer CUTLASS MoE kernels for that architecture. Verify your vLLM build has compressed-tensors NVFP4 support before attempting to serve — a mismatched build will silently fall back or crash at model load.

The vLLM Serve Commands: Standard and DGX Spark

The minimum viable Hopper command. Two flags matter here: --quantization modelopt activates NVIDIA Model Optimizer's compressed-tensors backend, and --reasoning-parser qwen3 strips <think>...</think> chain-of-thought blocks from API responses so callers see clean completions:

vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --port 8000 \
  --quantization modelopt \
  --max-model-len 262144 \
  --reasoning-parser qwen3

DGX Spark (Blackwell) requires three environment variables set before launching. Omitting any of them causes a FlashInfer MoE kernel mismatch at startup — the error message is not always explicit about which variable is missing, so set all three :

export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_FP8_MOE_BACKEND=flashinfer_cutlass
export FLASHINFER_DISABLE_VERSION_CHECK=1

vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --moe-backend marlin \
  --gpu-memory-utilization 0.85 \
  --max-model-len 65536 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'

Flag-by-flag breakdown:

--kv-cache-dtype fp8 — halves KV-cache memory versus BF16, directly enabling longer usable context at 0.85 VRAM utilization
--moe-backend marlin — selects the Marlin MoE kernel for Blackwell; the default selection may not be optimal on this architecture
--max-num-seqs 4 — keeps total concurrent sequence memory predictable on constrained VRAM; raise cautiously and watch OOM behavior
--enable-chunked-prefill — required on DGX Spark; without it, long prompts OOM well before the 65536-token cap
--enable-prefix-caching — reduces time-to-first-token for repeated system prompts in multi-turn chat workloads
--speculative-config '{"method":"mtp",...}' — enables the built-in Multi-Token Prediction head; no separate draft model required or loaded

The snippet below (illustrative — not executed; running it requires a CUDA-enabled environment with transformers installed) shows how to verify your GPU is Hopper-class before attempting to load the model. The major < 9 check is the key gate: H100 reports sm_90, A100 reports sm_80:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3.6-35B-NVFP4"

if not torch.cuda.is_available():
    raise SystemExit("CUDA GPU required")

name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {name} (sm_{major}{minor})")
if major < 9:
    raise SystemExit("NVFP4 path requires H100-class sm_90+; A100 is sm_80")

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map={"": 0})
print(model.generate(**tok("Qwen NVFP4 fits on one H100 because", return_tensors="pt").to(0), max_new_tokens=8))

An A100 hits the SystemExit before wasting time on a multi-minute model download. Run this check before provisioning storage or bandwidth for the weights.

Where It Breaks: Consumer GPUs and Wrong Flags

Four failure modes worth knowing before you lose an hour to a non-obvious error:

--quantization modelopt on A100 or RTX hardware: The FP4 matmul path does not exist on Ampere or Ada Lovelace. You get an error at load time or, worse, silently degraded output. Use BF16 or GGUF on those cards — there is no workaround.
Missing env vars on DGX Spark: Omitting VLLM_FP8_MOE_BACKEND or FLASHINFER_DISABLE_VERSION_CHECK before launch triggers a FlashInfer MoE kernel mismatch. The startup error does not always name the specific missing variable — set all three unconditionally before touching vLLM on Blackwell.
Omitting --reasoning-parser qwen3: The model emits raw <think>...</think> blocks in every completion response. Clients parsing JSON completions will see malformed output; streaming clients will surface the thinking chain directly to end users. This flag is not optional.
No --enable-chunked-prefill on DGX Spark: Long prompts OOM well before the 65536-token cap at 0.85 VRAM utilization. Chunked prefill is not a performance optimization on that platform — it is a correctness requirement for any long-context workload.

One operational caveat for DGX Spark production: the vllm/vllm-openai:cu130-nightly image is not a stable release . Pin to a specific build hash for any deployment you need to reproduce, or wait for a stable vLLM release that includes full NVFP4 Blackwell support upstream.

Pushing Further: MTP and Longer Prompts

The built-in MTP speculative decoding head achieves an 85.4% token acceptance rate at single-user baseline (512-token outputs), rising to 92.8% at 4,096-token outputs . No second draft model to load or manage — the MTP head is baked into the base checkpoint. At concurrency 1, output throughput is 55.9 tokens/s; at concurrency 32, it scales to 433.4 tokens/s . The community AEON-7 DFlash variant reports 117 tok/s greedy decoding on DGX Spark with 62–78% draft acceptance and 2.7–4.4 mean accepted tokens per target step .

The native context window is 131K tokens, extended to 262,144 via RoPE scaling . On DGX Spark, cap --max-model-len at 65536 to stay within safe VRAM margins at 0.85 utilization. The full 262K context is accessible on H100/H200 with more VRAM headroom. Note that long-context RAG quality under NVFP4 versus BF16 at the 262K limit has not been independently benchmarked as of June 2026 — treat that range as best-effort until data appears.

The same vLLM endpoint handles image and video inputs alongside text once the server is running on Hopper or Blackwell — no additional flags needed for multimodal prompts. Multimodal inference quality under NVFP4 quantization is also unbenchmarked publicly, so evaluate against your specific workload rather than relying on text benchmark results as a proxy.

Frequently Asked Questions

Does Qwen3.6-35B-A3B-NVFP4 work on an RTX 4090 or A100?

No. FP4 tensor core paths require Hopper (H100, H200) or Blackwell (GB200, GB300, DGX Spark) architecture. The RTX 4090 is Ada Lovelace (sm_89) and the A100 is Ampere (sm_80) — neither has native FP4 compute units. On those cards, use community GGUF quantizations via llama.cpp or the BF16 base model if you have 71+ GB VRAM available (RTX PRO 6000 96 GB or H100/A100 80 GB).

What does `--quantization modelopt` actually do?

It tells vLLM to route weight loading through NVIDIA Model Optimizer's compressed-tensors backend, which understands the NVFP4 format and dispatches matrix multiplications through FP4 tensor cores. Without this flag, vLLM will not recognize the quantization scheme and will either throw an error at startup or attempt to interpret the weights as a different format — neither produces usable output.

How much accuracy do you lose with NVFP4 vs BF16?

0.5–0.8 points on most benchmarks per NVIDIA's official eval suite . MMLU Pro drops from 85.6 to 85.0; GPQA Diamond from 84.9 to 84.8; AIME 2025 from 89.2 to 88.8. On instruction-following (IFBench: 62.8 vs 62.3) and multimodal reasoning (MMMU Pro: 74.5 vs 74.1), NVFP4 marginally outperforms BF16 — likely a calibration dataset effect from the multi-turn Nemotron data.

Do I need a separate draft model for MTP speculative decoding?

No. The Multi-Token Prediction head is embedded in the Qwen3.6-35B-A3B checkpoint itself. Pass --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' to activate it — vLLM uses the model's own MTP head without downloading or loading a second checkpoint.

Which vLLM version and CUDA version are required for DGX Spark?

CUDA 13.0 and the vllm/vllm-openai:cu130-nightly Docker image . Current stable vLLM releases lack FlashInfer CUTLASS MoE kernels for Blackwell sm_120/121a. Pin to a specific nightly build hash for any production deployment — a stable vLLM release with full NVFP4 Blackwell support had not shipped as of June 2026.

What to Try Next

On a Hopper card, the path is now practical: one vllm serve command with --quantization modelopt and --reasoning-parser qwen3, and you have a 35B reasoning model with 262K context, built-in chain-of-thought handling, and native tool calling — on a single GPU. The 3.06× memory reduction is the operational threshold between needing four-way tensor parallelism and fitting on one card.

Extend the baseline from here: add --enable-auto-tool-choice --tool-call-parser qwen3 for structured tool calling in agent workloads; toggle thinking mode off for latency-sensitive paths with --default-chat-template-kwargs '{"enable_thinking": false}'; stress-test the 262K RAG path against your actual document lengths. A RedHatAI mirror is also on Hugging Face for enterprise environments with registry requirements.

On DGX Spark: the nightly image dependency is the main operational risk. Track the AEON-7/Qwen3.6-NVFP4-DFlash repository for community patch status and watch upstream vLLM releases for when Blackwell sm_120/121a kernels land in a stable build. Until then, pin your nightly image hash.

Last updated: 2026-06-01. Based on the nvidia/Qwen3.6-35B-A3B-NVFP4 model card (released 2026-05-28) and community deployment reports reviewed as of June 2026.

Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

71 GB → 23 GB: What Gets Quantized and What Doesn't

Hopper or Blackwell: Why Other Cards Won't Work

The vLLM Serve Commands: Standard and DGX Spark

Where It Breaks: Consumer GPUs and Wrong Flags

Pushing Further: MTP and Longer Prompts

Frequently Asked Questions

Does Qwen3.6-35B-A3B-NVFP4 work on an RTX 4090 or A100?

What does `--quantization modelopt` actually do?

How much accuracy do you lose with NVFP4 vs BF16?

Do I need a separate draft model for MTP speculative decoding?

Which vLLM version and CUDA version are required for DGX Spark?

What to Try Next

Featured posts

How to Add SuperGrok to Kilo Code in Any Environment

Kilo Code Gets Three Grok Models — Which Fits Your Workload

Anthropic SDK 0.105.0 Needed Two Hotfixes — What to Pin

What Gemini's Three I/O 2026 Research Tools Actually Do

grok-build-0.1 in Kilo Code, No API Key Needed for SuperGrok

SuperGrok 티어별로 Kilo Code 설정이 달라진다

grok-build-0.1과 Grok 4.3, Kilo Code에서 뭐가 다른가

Anthropic SDK 릴리즈가 PyPI 배포를 깨뜨린 이유

AlphaEvolve와 Co-Scientist, 발표대로 작동하는가

SuperGrok 구독자는 이제 API 키 없이 grok-build-0.1을 쓴다

Tags

Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

71 GB → 23 GB: What Gets Quantized and What Doesn't

Hopper or Blackwell: Why Other Cards Won't Work

The vLLM Serve Commands: Standard and DGX Spark

Where It Breaks: Consumer GPUs and Wrong Flags

Pushing Further: MTP and Longer Prompts

Frequently Asked Questions

Does Qwen3.6-35B-A3B-NVFP4 work on an RTX 4090 or A100?

What does --quantization modelopt actually do?

How much accuracy do you lose with NVFP4 vs BF16?

Do I need a separate draft model for MTP speculative decoding?

Which vLLM version and CUDA version are required for DGX Spark?

What to Try Next

Featured posts

Tags

Sign up for insights and ideas

What does `--quantization modelopt` actually do?