I Tried to Run Qwen at 160K Context on a 16GB GPU. The 35B Worked — but the 9B Won.
Using multiple AI coding tools and losing track of sessions? I built VibeCockpit — one dashboard to search and resume sessions across Claude Code, Copilot, Codex, and more.
TL;DR: Qwen3.5-35B-A3B runs at 160K context on an RTX 5060 Ti (16GB VRAM) with 348 MB to spare. Generation at 47-51 tok/s. But the 9B model matches prefill speed, uses far less VRAM, and pushes to 250K+ context. The 9B became my daily driver. [Updated: the 3.6 version now reaches the full 262K context on the 35B, and with TurboQuant KV cache compression it pushes to 400K context on the same GPU. See the follow-up post.] [Latest: the 27B dense vs MoE comparison confirms the MoE is still the best choice on 16 GB.] [New: code generation showdown — Gemma 4 vs Qwen 3.6, same prompt, one shot, visual results.]
I had a perfectly good local coding model running at 85 tokens per second. Then I made the mistake every homelab person makes: I asked what the absolute limit was.
Qwen’s 35B-A3B looked irresistible on paper: 35B total parameters, but only 3B active per token thanks to Mixture-of-Experts. In theory, that made it a candidate for a 16GB GPU. The real question was not whether the weights fit. It was whether the weights, KV cache, and compute buffer could all coexist at long context without the whole setup falling apart.
They could — barely.
I got Qwen3.5-35B-A3B running at 160K context on an RTX 5060 Ti with 16GB VRAM, and Qwen3.5-9B running at 250K+ context on the same card. The final stable setup left me with 348MB of free VRAM. Then, halfway through writing this up, Qwen released a 9B model that matched the 35B on prompt processing, used far less memory, and quietly became my new daily driver.
That ended up being the real story.
This post is about what broke, what fixed it, and why the smaller model won.
Test Setup
The machine is a headless GMKtec mini PC with an AMD Ryzen 7, 128GB RAM, and an RTX 5060 Ti 16GB attached over OCuLink rather than a full internal PCIe slot. The system runs without a desktop environment, which matters more than it sounds like it should.
With no compositor and no GPU memory wasted on drawing a UI, I get roughly 15.5GB of usable VRAM. That extra ~1GB is the difference between “interesting experiment” and “hard limit reached.”
Software-wise, the stack is straightforward:
-
llama.cpp built from source with CUDA
-
Proxmox VM with
cpu: hostand hugepages -
Unsloth GGUF quantizations
-
Models tested:
Qwen3.5-35B-A3B-UD-IQ3_XXSQwen3.5-9B-UD-Q4_K_XL
My goal was simple: find the largest practical context window I could run locally on 16GB VRAM and turn it into something usable for real coding sessions. In the end, that meant two different answers: 160K with the 35B model, and 250K+ with the 9B.
The Result, Up Front
Here is the short version.
- Qwen3.5-35B-A3B can run at 160K context on 16GB VRAM, but only just.
- Qwen3.5-9B can push past 250K context on the same GPU while still leaving useful headroom.
- The final stable configuration leaves 348MB of headroom.
- Qwen3.5-9B matches the 35B on prompt processing throughput in my setup.
- The 35B is stronger on harder reasoning tasks.
- The 9B is better as a daily driver because it has dramatically more VRAM headroom and more predictable long-session behavior.
The lesson was not “bigger model wins.” It was:
On a 16GB GPU, the real constraint is not the headline parameter count. It is VRAM headroom, bandwidth, and long-session behavior.
The Benchmark That Changed My Mind
I started this experiment expecting the 35B model to be the obvious winner if I could squeeze it in. Then I tested the 9B.
Here is the comparison that broke my mental model:
| Metric | Qwen3.5-9B (Q4_K_XL) | Qwen3.5-35B-A3B (IQ3_XXS) |
|---|---|---|
| Prompt eval (75K tokens) | 64.8 sec (1,168 tok/s) | 64.8 sec (1,168 tok/s) |
| Generation speed | 40–50 tok/s | 47–51 tok/s |
| Model size | 5.55 GiB | 13.68 GiB |
| VRAM headroom | ~3,748 MiB | 348 MiB |
The prompt processing speed was not just similar. It was effectively identical.
That sounds wrong until you look at where the bottleneck actually is.
Why Prefill Matched
Prompt processing — the prefill phase — is primarily memory-bandwidth-bound, not compute-bound.
Once the model is loaded, the GPU spends prefill moving data through VRAM as fast as it can. In this regime, the limiting factor is less about the total parameter count on paper and more about how hard you are pushing the memory subsystem. The 35B-A3B only activates 3B parameters per token, while the 9B is dense and activates all 9B, but in my setup both ended up saturating the same practical bandwidth ceiling.
So the prefill numbers converged.
Where the 35B Still Wins
Decode is different.
Generation is more sensitive to per-token compute, and that is where the MoE advantage starts to matter. The 35B-A3B generates slightly faster than the 9B in my setup — roughly 47–51 tok/s versus 40–50 tok/s — because it is only doing compute on its active experts rather than the full dense parameter set.
And the 35B still feels like the better model on the right class of tasks:
- multi-step reasoning
- architecture decisions
- subtle debugging
- instruction-following under complex constraints
But the 9B gives back something the 35B cannot: room to breathe.
That extra ~3.7 GiB of VRAM headroom changes what is practical. It is the difference between barely fitting 160K on the 35B and being able to push 250K+ context on the 9B on the same 16GB card. It means larger KV precision, fewer stability problems, and the ability to run other GPU processes without living in constant fear of an out-of-memory spike.
That is why the 9B became my default.
Getting the 35B Stable: Four Things That Broke First
The final result looks neat in a table. The path to get there was not.
I started with Qwen3.5-35B-A3B, compiled llama.cpp from source with CUDA, and tried my first large context load.
It took 92 seconds just to process the initial prompt.
Not inference. Not generation. Just loading the prompt.
For reference, a cloud API roundtrip is often under two seconds. I had built a Ferrari and was somehow commuting in first gear.
Over the next few days, I ran into four separate failure modes.
Wall 1: The VRAM Crunch
My first instinct was to run IQ3_XXS for quality and call it done. That failed immediately.
At this quantization, the model weights alone were eating almost the entire GPU budget. In one early configuration the weights consumed over 15.2GB, which left less than 1GB for everything else. That is not enough margin for large batch operations, and the GPU started choking mid-processing.
The fix was ugly but clarifying: I temporarily dropped to IQ2_M. That freed roughly 3GB and proved the real issue was not “the hardware is impossible.” It was “the configuration is too tight.”
That distinction mattered. Once I knew the system could behave with more headroom, I could start tuning toward stability instead of guessing blindly.
Wall 2: Sliding Window Amnesia
The more interesting bug-shaped-not-actually-a-bug problem came from Qwen’s architecture.
Qwen3.5-35B-A3B uses hybrid Sliding Window Attention (SWA). Some layers do not retain unlimited memory. In practice, that meant the model was effectively forgetting my 18,000-token system prompt between interactions and then re-reading it from scratch.
Every single time.
The logs made it look like something was broken. It was not broken. It was doing exactly what the architecture said it would do, which was somehow more annoying.
The fix was:
--swa-full--kv-unified
Together, those forced the SWA layers to retain full context instead of behaving like a selective memory leak. Once enabled, the system prompt went into VRAM once and stayed there.
That single change turned the model from “technically runs” into “has a chance of being usable.”
Wall 3: The Timeout Death Loop
Once SWA was under control, the next problem showed up during very long initial prompt loads.
The GPU would still be processing, but the client would give up first. The request would time out, the server would restart, and I would be back at zero.
This was the kind of failure that feels random until you notice the pattern: the hardware was fine, the software stack was impatient.
The fix was enabling context checkpoints:
--ctx-checkpoints 64
This lets llama.cpp checkpoint processing state mid-prompt so that if the connection drops or the client times out, the server can resume from the last saved point instead of reprocessing the entire context from scratch.
That made a massive difference.
With checkpointing in place, the painful 92-second first load became a mostly one-time cost. After that, incremental interactions in the same session dropped to roughly 1.2–1.5 seconds.
Wall 4: The Long-Session Collapse
Even after fixing the obvious issues, multi-hour coding sessions would eventually hit a wall.
At 80K, 100K, 130K tokens into a session, one of two things would happen:
- the server would crash when the context filled up
- or it would force a large re-read and become miserable to use
The fix here was context shifting:
--context-shiftexport LLAMA_ARG_CONTEXT_SHIFT=1
Instead of dying at the context limit, llama.cpp slides the window forward — dropping the oldest tokens while preserving the system prompt and recent history.
This was the change that made 4+ hour sessions actually viable. Without it, every complicated agentic coding session eventually ended the same way: with the model face-planting into the context ceiling.
The Performance Profile After Fixing Everything
Once those four issues were addressed, the 35B configuration became surprisingly usable.
For IQ3_XXS at 160K context, the profile looked like this:
| Operation | Time | Throughput |
|---|---|---|
| First prompt load (90K tokens) | 92.7 sec | 1,123 tok/s |
| Incremental update (~1K tokens) | 1.2–1.4 sec | 800–900 tok/s |
| Token generation | 20–24 ms/tok | 41–47 tok/s |
| SWA-forced full re-process | 4.5–35 sec | depends on context size |
The headline number here is not the token speed. It is the fact that the entire setup works with so little safety margin.
The Flags That Actually Mattered
The full launch command is long, but only a handful of flags did most of the work.
Context retention and SWA behavior
--swa-full--kv-unified
These stopped the model from repeatedly re-reading large parts of the prompt due to sliding-window behavior.
Surviving long prompt loads
--ctx-checkpoints 64
This made timeouts recoverable instead of catastrophic.
Long-session stability
--context-shiftexport LLAMA_ARG_CONTEXT_SHIFT=1
These let the server slide the context window instead of crashing at the limit.
Staying inside the VRAM budget
--cache-type-k q4_0--cache-type-v q4_0-b 512-ub 256
The tight batching was critical. With only 348MB of free VRAM, one oversized batch can spike the compute buffer and immediately OOM the process. There is no graceful degradation there. It just falls over.
The Final 35B Configuration
This is the exact llama-server configuration that made the 35B stable for me:
export LLAMA_ARG_CONTEXT_SHIFT=1
./build/bin/llama-server -m ~/Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf \
-ngl 99 -fa 1 -c 163840 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--swa-full --ctx-checkpoints 64 --kv-unified \
--context-shift --cache-reuse 512 \
--reasoning-format deepseek-legacy \
--perf --no-warmup --mlock \
--slot-prompt-similarity 0.0 \
-np 1 -b 512 -ub 256 -t 8 -tb 8 \
--threads-http 8 --no-mmap --jinja \
--port 11433 --host 0.0.0.0
And this was the VRAM breakdown on exit:
CUDA0 (RTX 5060 Ti): 15,847 MiB used / 16,384 MiB total
Model weights: 13,736 MiB (IQ3_XXS)
KV cache: 962 MiB (q4_0, 160K context)
Compute buffer: 258 MiB
Free: 348 MiB
That final number is the whole story.
348MB is not a margin. It is a design constraint.
Every decision downstream — batch size, KV precision, context target, whether anything else is allowed to touch the GPU — flows from that number.
The Hidden Tax: SWA Checkpoint Invalidation
There is one penalty that does not show up cleanly in summary benchmarks: SWA checkpoint invalidation.
Even with --swa-full, the sliding-window architecture occasionally invalidates cached checkpoints as the session evolves. When that happens, the server falls back to reprocessing much more of the context than you wanted.
At 100K+ tokens, that can mean 4–35 seconds of dead time.
In my usage, it happened in about 10% of interactions during long sessions, and it was not random. It clustered after:
- large context additions, like pasting in a new file
- the first context shift event in a session
- other moments where the cache topology changed enough to force rework
This is one of the reasons the 9B became so attractive in practice. No SWA layers means no SWA weirdness. Either the cache hits or it does not. There is much less drama in the middle.
When to Use the 9B vs the 35B
After running both models daily, the division of labor became obvious.
Use the 9B when:
- you want focused coding assistance
- you are ingesting large codebases or documentation
- you care more about predictable latency than maximum reasoning depth
- you want room to run other GPU processes at the same time
- you want a 250K+ context window without living on the edge of an OOM
The 9B is not a “budget option.” It is the practical tool.
Use the 35B-A3B when:
- the task needs harder reasoning
- you are debugging subtle logic failures
- you are making architecture-level decisions
- you want stronger instruction following on multi-step tasks
- you are willing to accept a fragile VRAM budget in exchange for more capability
The 35B is the model I reach for when I want the system to think harder, not just read more.
Speed vs Headroom
I also tested the 9B in IQ3_XXS quantization alongside the Q4_K_XL:
- 9B (Q4_K_XL): ~40–50 tok/s
- 9B (IQ3_XXS): ~47–51 tok/s
That’s a nice ~10-20% speed bonus, but the Q4_K_XL gave me better quality and the headroom I needed.
The 9B won because it made the whole setup easier to live with.
Here is what changed once I switched:
- a 250K+ context window, with room to target the model’s full 262K trained context without constantly thinking about memory pressure
- ~3.7 GiB of free headroom instead of 348MB
- no SWA penalty surprises in long sessions
- the same prefill speed as the 35B in my environment
- enough intelligence for the vast majority of coding tasks
That combination matters more than benchmark ideology.
For day-to-day work, the 9B is simply the better product.
It is fast enough, smart enough, and roomy enough that I can trust it as infrastructure rather than treat it like a fragile demo.
What This Actually Costs
Power draw under sustained GPU load is about 140W.
At €0.30/kWh, the running cost is basically noise compared to cloud usage:
| Usage | Daily cost | Monthly | Rough cloud equivalent |
|---|---|---|---|
| 100 requests/day | ~€0.01 | €0.33 | €15–30 |
| 500 requests/day | ~€0.05 | €1.65 | €75–150 |
| 2,000 requests/day | ~€0.20 | €6.60 | €300–600 |
The hardware cost for this setup was in the low four figures once the mini PC, GPU, and supporting parts were accounted for.
At medium usage, that still puts break-even somewhere in the within-a-year range. At heavier usage, the payoff comes much faster. The exact number depends less on electricity and more on how often you would otherwise be paying for cloud inference.
The more you use local inference, the less the economics resemble a hobby.
What I Actually Learned
Running a 250K+ context window locally on a 16GB GPU is possible — and getting a 35B model stable at 160K is possible too. But that is not the most useful lesson.
The more interesting lessons were architectural.
1. The interconnect matters once you leave VRAM
OCuLink at PCIe 4.0 x4 is fine for a workflow that stays inside VRAM. But once a model spills weights into system RAM, that link becomes the ceiling very quickly.
This is why larger models start to prefer either:
- substantially more VRAM
- or unified-memory systems like Mac Studio or Strix Halo
If the model does not fit cleanly, the transport path starts making decisions for you.
2. Headroom is a first-class design constraint
The most important number in this whole experiment was not the model size or the token speed.
It was 348MB free.
That tiny buffer decided:
- how aggressive batch sizes could be
- what KV precision was realistic
- whether context targets were stable or reckless
- whether anything else was allowed to use the GPU at the same time
Once you are this close to the edge, “free VRAM” stops being a leftover and becomes a design variable.
3. Smaller models can be better systems
Before testing the 9B, I thought I was evaluating a fallback.
After testing it, I started deliberately routing work to it.
That is the biggest reversal in this whole project.
The 9B is not a compromise. It is the model that best fits the machine, the workflow, and the latency budget. I still use the 35B when the problem genuinely needs more reasoning depth, but the 9B is the one I trust for everyday work.
Using It in OpenCode
Once the server was stable, plugging it into OpenCode was the easy part.
The configuration is simple: point OpenCode at the local llama-server endpoint and set the context limit accordingly.
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"llamacpp": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama-server",
"options": {
"baseURL": "http://<your-server-ip>:11433/v1"
},
"models": {
"home-qwen": {
"name": "Home Qwen",
"limit": {
"context": 262144,
"output": 248320
}
}
}
}
}
}
After that, I could use Qwen locally for coding, documentation, debugging, and long-running sessions with full context intact.
That was the original goal.
The unexpected outcome was learning that the model I wanted to prove could run was not the one I ended up wanting to use.
Final Takeaway
Yes, you can run Qwen3.5-35B-A3B at 160K context on a 16GB GPU. And with the right model choice, you can push a 250K+ context window on that same 16GB card.
But the more important takeaway is this:
On constrained local hardware, the best model is not the biggest one you can force into memory. It is the one that gives you the best mix of reasoning, bandwidth behavior, headroom, and stability over long sessions.
For me, that turned out to be the 9B.
And that was the part I did not expect.
Update: This series has come a long way since. Qwen 3.6 replaced 3.5 with full 262K context, and MTP speculative decoding now pushes the MoE to 125 t/s at 98K context on the same hardware. That speed made it practical to run as a 24/7 autonomous agent with daily cron jobs, multi-platform messaging, and kernel-level sandboxing.
The views and opinions expressed here are my own and do not reflect those of my employer.