Gemma 4

I’ve been running local LLMs for a while now, and the eternal struggle is always the same: you want more context, more model, more speed, and you have none of the VRAM to support any of it. So when Gemma 4 dropped with a 262K context window, I obviously had to try fitting the whole thing on my RTX A5000. 16GB. Turns out you can. And it’s actually usable. ~658 tok/s prompt eval. ~35 tok/s decode. Full 262K context window. f16 KV cache, no compression tricks. ...
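To put "usable" in wall-clock terms, here is a quick back-of-the-envelope sketch using the two throughput numbers above. It assumes "262K" means 262,144 tokens (2^18), and the 500-token reply length is a made-up figure purely for illustration. It also assumes throughput stays flat across the whole context, which real runs usually don't quite manage.

```python
def seconds_to_process(tokens: int, tokens_per_second: float) -> float:
    """Time to push `tokens` through at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# Assumption: "262K" is 2**18 tokens.
CONTEXT_TOKENS = 262_144
# Approximate rates quoted in the post.
PROMPT_EVAL_TPS = 658.0
DECODE_TPS = 35.0
# Hypothetical reply length, not a measured value.
REPLY_TOKENS = 500

prefill_s = seconds_to_process(CONTEXT_TOKENS, PROMPT_EVAL_TPS)
decode_s = seconds_to_process(REPLY_TOKENS, DECODE_TPS)

print(f"Filling the full context: {prefill_s:.0f} s ({prefill_s / 60:.1f} min)")
print(f"Generating {REPLY_TOKENS} tokens: {decode_s:.1f} s")
```

So a completely full prompt is a go-make-coffee wait of several minutes, while replies themselves stream at a comfortable reading pace.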