262K context on 16GB VRAM because why not

Wed, 03 Jun 2026 00:00:00 +0000

I’ve been running local LLMs for a while now and the eternal struggle is always the same: you want more context, more model, more speed, and you have none of the VRAM to support any of it. So when Gemma 4 dropped with a 262K context window I obviously had to try fitting the whole thing on my RTX A5000. 16GB. Turns out you can. And it’s actually usable.

~658 tok/s prompt eval. ~35 tok/s decode. Full 262K context window. f16 KV cache, no compression tricks.
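
To put those numbers in perspective, here is a rough back-of-the-envelope sketch of how long it would take to fill the whole window. It assumes "262K" means 262,144 tokens (2^18) and that prompt eval holds at ~658 tok/s across the entire context, which is optimistic since prefill usually slows down as the context grows. The function name is just for illustration.

```python
def prefill_seconds(context_tokens: int, prompt_tok_per_s: float) -> float:
    """Time to process a prompt, assuming constant prompt-eval throughput."""
    return context_tokens / prompt_tok_per_s


# Assumption: "262K" is 2**18 tokens; throughput figures are the ones quoted above.
CONTEXT_TOKENS = 262_144
PROMPT_TOK_PER_S = 658.0

seconds = prefill_seconds(CONTEXT_TOKENS, PROMPT_TOK_PER_S)
print(f"Full-context prefill: ~{seconds:.0f} s (~{seconds / 60:.1f} min)")
```

So a completely full window is a matter of minutes, not seconds, before the first decoded token shows up. Treat that as a lower bound rather than a measurement.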

Self-Hosted on disobey.dev
