262K context on 16GB VRAM because why not

I’ve been running local LLMs for a while now and the eternal struggle is always the same: you want more context, more model, more speed, and you have none of the VRAM to support any of it. So when Gemma 4 dropped with a 262K context window, I obviously had to try fitting the whole thing on my RTX A5000. 16GB. Turns out you can. And it’s actually usable. ~658 tok/s prompt eval. ~35 tok/s decode. Full 262K context window. f16 KV cache, no compression tricks. ...

June 3, 2026

Some tools I've built

I’ve created a subpage on this site dedicated to various tools (the subset of them that are mature enough for me to dare mention them) -> you can find them here. I do not commit to any level of regular maintenance; they get fixed when they need fixing (which is usually when I’m in a rush to do something and realize everything is broken). Occasionally these things take on a life of their own. In January 2018 I wrote a simple script to enforce data retention policies in the free version of Mattermost for a project I was consulting on, and promptly forgot about it. Over the years it turned out that several people found it rather useful: it got regular updates, support for an additional database type, and more. That is a long way of saying: feel free to submit a PR.

November 28, 2025