Meta / Llama vLLM / Ollama News & Releases Dev Tools & SDK Changelogs

RDNA3 cuts llama.cpp KV VRAM 47% — and CUDA has no equivalent

RDNA3 bit-packing cuts llama.cpp KV VRAM 47% on RX 7900. Flags, VRAM math, and TurboQuant for 4.9× compression.

Jun 01, 2026

RDNA3 cuts llama.cpp KV VRAM 47% — and CUDA has no equivalent

ml-org/llama.cpp/discussions/22411" target="_blank" rel="noopener noreferrer">llama.cpp GitHub discussions and the lemonade-sdk repository for both AMD progress and any CUDA counterpart developments.

Frequently Asked Questions

Does this bit-packing optimization work on RDNA2 or RDNA4 GPUs, or only RDNA3?

gfx1100 (RDNA3) is the validated target as of May 31, 2026. The lemonade-sdk/llamacpp-rocm build b1285 targets RDNA3 through RDNA4, indicating RDNA4 (gfx1200, RX 9000 series) support is in progress. RDNA2 (gfx1030, RX 6000 series) has not been confirmed — the register geometry differs and no benchmark for that architecture has been published. If you are on RDNA2, symmetric KV quantization will still activate the fused Flash Attention speed path, but do not expect the same 47% VRAM figure that is specific to RDNA3's register layout.

What actually happens if I use asymmetric KV quantization like q4_0 K + f16 V?

llama.cpp silently falls back to the non-fused attention path — no error, no warning, no visible indication in standard output. Both the VRAM saving and the fused-path speed gain disappear. The only way to confirm the fused path is active is to inspect llama.cpp's startup log for a "flash attention" confirmation line at initialization. If that line is absent, the flags are no-ops for VRAM purposes. Use symmetric pairs — q4_0/q4_0 or q8_0/q8_0 — to avoid this entirely.

How does the 47% KV VRAM reduction compare to TurboQuant TURBO3_0's 4.9× compression?

The 47% reduction is approximately 1.9× compression — you retain roughly 53% of the original KV cache size. TURBO3_0's 4.9× compression is substantially more aggressive, shrinking the KV cache to roughly 20% of its FP16 size. The tradeoffs are distinct: the register bit-packing is closer to mainline llama.cpp, ships pre-compiled in lemonade-sdk b1285, and carries no confirmed PPL cost at tested context lengths. TURBO3_0 requires building domvox's fork from source and accepts a measured +1.17% WikiText-2 PPL regression. Both optimizations can be combined: TurboQuant provides the larger compression multiplier; register bit-packing maximizes hardware register efficiency within that compressed representation.

Is the bit-packing optimization merged into llama.cpp mainline yet?

As of June 1, 2026, no specific GitHub PR number or merge confirmation in the main ggml-org/llama.cpp repository has been publicly documented. The optimization was community-published on May 31 via r/LocalLLaMA and corroborated by AI Weekly the same day. The lemonade-sdk/llamacpp-rocm project ships it pre-compiled in build b1285 without waiting on mainline merge. Check the ggml-org/llama.cpp pull request list for current merge status — this is likely to change in the near term given the corroboration and active upstream activity around the same date.

Can NVIDIA CUDA users benefit from symmetric KV quantization too?

Yes, partially. Symmetric KV quantization activates the fused Flash Attention path on CUDA as well, providing a speed benefit from reduced memory bandwidth during the attention kernel. What CUDA users do not get is the hardware register bit-packing that drives the 47% VRAM reduction on RDNA3. On CUDA, VRAM savings from quantized KV cache come from precision reduction in storage — fewer bits per value — without the register-packing multiplier. The fused attention speed benefit is shared across both platforms; the register-level VRAM efficiency advantage is AMD RDNA3-specific as of this writing.

Decision Framework: What to Run Today

If you are on an RX 7900 series card running llama.cpp today, the immediate action is straightforward: switch to lemonade-sdk build b1285, add --flash-attn -ctk q4_0 -ctv q4_0, and confirm "flash attention" appears in the startup log. That combination delivers 47% KV VRAM savings with near-zero quality cost and under 1% throughput loss. For most users on 24 GB cards at context lengths up to 32K–55K tokens — covering Qwen3-32B at 16K and Llama 3.3-70B dual-card setups — this is the complete solution. No fork required. No source build. No quality tradeoff that needs benchmarking.

If you are targeting 80K+ context on a 27B+ model and the bit-packing headroom is insufficient, the next step is TurboQuant TURBO3_0 from domvox's fork. Accept the +1.17% PPL cost, build from source for gfx1100, and confirm the run completes within your 24 GB budget. For 128K context, the correct stack is TurboQuant plus symmetric bit-packing flags together — neither alone is sufficient on a 24 GB card at that scale.

The broader implication for hardware selection: as of May 2026, RDNA3 holds a clear, measured VRAM-efficiency edge for KV cache in local LLM inference that has no confirmed CUDA counterpart. That edge may close if CUDA gains a register-level equivalent, but no such implementation has been documented at time of writing. For developers actively selecting hardware for long-context local inference today, this is a concrete technical differentiator — not a benchmark footnote — worth factoring into the decision.

Last updated: 2026-06-01. Based on community benchmark published May 31, 2026, corroborated by AI Weekly; lemonade-sdk build b1285; and llama.cpp GitHub Discussions #22411 and #21526. Upstream PR merge status for the register bit-packing optimization was unconfirmed at publication — verify current status at the ggml-org/llama.cpp repository before treating this as mainline behavior.