Can Qwen 3.5 35B run on a 16GB GPU?

Yes. Qwen3.5-35B-A3B runs on an RTX 5060 Ti (16GB VRAM) at 160K context with 348 MB to spare, generating at 47-51 tokens per second. The MoE architecture only activates 3B parameters per token, making it viable on consumer GPUs.

What is the maximum context window for Qwen 3.5 on 16GB VRAM?

With IQ3_XXS quantization and q4_0 KV cache, Qwen3.5-35B-A3B reaches 160K context on 16GB VRAM. The smaller 9B model pushes to 250K+ context on the same hardware.

Is Qwen 3.5 35B good for local coding?

Yes, but the 9B model is often the better daily driver. It matches the 35B on prefill speed, uses far less VRAM, and allows longer context. The 35B excels at complex reasoning tasks where the extra parameters matter.

I Tried to Run Qwen at 160K Context on a 16GB GPU. The 35B Worked


TL;DR: Qwen3.5-35B-A3B runs at 160K context on an RTX 5060 Ti (16GB VRAM) with 348 MB to spare. Generation at 47-51 tok/s. But the 9B model matches prefill speed, uses far less VRAM, and pushes to 250K+ context. The 9B became my daily driver. [Updated: the 3.6 version now reaches the full 262K context on the 35B, and with TurboQuant KV cache compression it pushes to 400K context on the same GPU. See the follow-up post.] [Latest: the 27B dense vs MoE comparison confirms the MoE is still the best choice on 16 GB.] [New: code generation showdown — Gemma 4 vs Qwen 3.6, same prompt, one shot, visual results.]

I had a perfectly good local coding model running at 85 tokens per second. Then I made the mistake every homelab person makes: I asked what the absolute limit was.

Qwen’s 35B-A3B looked irresistible on paper: 35B total parameters, but only 3B active per token thanks to Mixture-of-Experts. In theory, that made it a candidate for a 16GB GPU. The real question was not whether the weights fit. It was whether the weights, KV cache, and compute buffer could all coexist at long context without the whole setup falling apart.

They could — barely.

I got Qwen3.5-35B-A3B running at 160K context on an RTX 5060 Ti with 16GB VRAM, and Qwen3.5-9B running at 250K+ context on the same card. The final stable setup left me with 348MB of free VRAM. Then, halfway through writing this up, Qwen released a 9B model that matched the 35B on prompt processing, used far less memory, and quietly became my new daily driver.

That ended up being the real story.

This post is about what broke, what fixed it, and why the smaller model won.

Test Setup

The machine is a headless GMKtec mini PC with an AMD Ryzen 7, 128GB RAM, and an RTX 5060 Ti 16GB attached over OCuLink rather than a full internal PCIe slot. The system runs without a desktop environment, which matters more than it sounds like it should.

With no compositor and no GPU memory wasted on drawing a UI, I get roughly 15.5GB of usable VRAM. That is about 1GB more than I would have with a desktop session running, and it is the difference between “interesting experiment” and “hard limit reached.”

Software-wise, the stack is straightforward:

- llama.cpp built from source with CUDA
- Proxmox VM with cpu: host and hugepages
- Unsloth GGUF quantizations

Models tested:
- Qwen3.5-35B-A3B-UD-IQ3_XXS
- Qwen3.5-9B-UD-Q4_K_XL

My goal was simple: find the largest practical context window I could run locally on 16GB VRAM and turn it into something usable for real coding sessions. In the end, that meant two different answers: 160K with the 35B model, and 250K+ with the 9B.

The Result, Up Front

Here is the short version.

Qwen3.5-35B-A3B can run at 160K context on 16GB VRAM, but only just.
Qwen3.5-9B can push past 250K context on the same GPU while still leaving useful headroom.
The final stable configuration leaves 348MB of headroom.
Qwen3.5-9B matches the 35B on prompt processing throughput in my setup.
The 35B is stronger on harder reasoning tasks.
The 9B is better as a daily driver because it has dramatically more VRAM headroom and more predictable long-session behavior.

The lesson was not “bigger model wins.” It was:

On a 16GB GPU, the real constraint is not the headline parameter count. It is VRAM headroom, bandwidth, and long-session behavior.

The Benchmark That Changed My Mind

I started this experiment expecting the 35B model to be the obvious winner if I could squeeze it in. Then I tested the 9B.

Here is the comparison that broke my mental model:

Metric	Qwen3.5-9B (Q4_K_XL)	Qwen3.5-35B-A3B (IQ3_XXS)
Prompt eval (75K tokens)	64.8 sec (1,168 tok/s)	64.8 sec (1,168 tok/s)
Generation speed	40–50 tok/s	47–51 tok/s
Model size	5.55 GiB	13.68 GiB
VRAM headroom	~3,748 MiB	348 MiB
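
As a quick sanity check on the prompt-eval row, throughput is just tokens divided by seconds. The sketch below uses only the numbers in the table; the helper name is made up for illustration.

```python
def tokens_processed(seconds: float, tok_per_sec: float) -> float:
    """Tokens implied by a measured duration and a measured throughput."""
    return seconds * tok_per_sec


# Prompt eval row: 64.8 sec at 1,168 tok/s
implied = tokens_processed(64.8, 1168)
print(f"{implied:,.0f} tokens")  # roughly 75.7K, the "75K tokens" in the table
```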

The prompt processing speed was not just similar. It was effectively identical.

That sounds wrong until you look at where the bottleneck actually is.

Why Prefill Matched

In my setup, prompt processing (the prefill phase) behaved as if it were limited by memory bandwidth rather than by raw compute.

Once the model is loaded, the GPU spends prefill moving data through VRAM as fast as it can. In this regime, the limiting factor is less about the total parameter count on paper and more about how hard you are pushing the memory subsystem. The 35B-A3B only activates 3B parameters per token, while the 9B is dense and activates all 9B, but in my setup both ended up saturating the same practical bandwidth ceiling.

So the prefill numbers converged.

Where the 35B Still Wins

Decode is different.

Generation is more sensitive to per-token work, and that is where the MoE advantage starts to matter. The 35B-A3B generates slightly faster than the 9B in my setup (roughly 47–51 tok/s versus 40–50 tok/s) because each token only touches its active experts rather than the full dense parameter set.

And the 35B still feels like the better model on the right class of tasks:

multi-step reasoning
architecture decisions
subtle debugging
instruction-following under complex constraints

But the 9B gives back something the 35B cannot: room to breathe.

That extra ~3.7 GiB of VRAM headroom changes what is practical. It is the difference between barely fitting 160K on the 35B and being able to push 250K+ context on the 9B on the same 16GB card. It means higher KV precision, fewer stability problems, and the ability to run other GPU processes without living in constant fear of an out-of-memory spike.

That is why the 9B became my default.

Getting the 35B Stable: Four Things That Broke First

The final result looks neat in a table. The path to get there was not.

I started with Qwen3.5-35B-A3B, compiled llama.cpp from source with CUDA, and tried my first large context load.

It took 92 seconds just to process the initial prompt.

Not inference. Not generation. Just loading the prompt.

For reference, a cloud API roundtrip is often under two seconds. I had built a Ferrari and was somehow commuting in first gear.

Over the next few days, I ran into four separate failure modes.

Wall 1: The VRAM Crunch

My first instinct was to run IQ3_XXS for quality and call it done. That failed immediately.

At this quantization, the model weights alone were eating almost the entire GPU budget. In one early configuration the weights consumed over 15.2GB, which left less than 1GB for everything else. That is not enough margin for large batch operations, and the GPU started choking mid-processing.

The fix was ugly but clarifying: I temporarily dropped to IQ2_M. That freed roughly 3GB and proved the real issue was not “the hardware is impossible.” It was “the configuration is too tight.”

That distinction mattered. Once I knew the system could behave with more headroom, I could start tuning toward stability instead of guessing blindly.

Wall 2: Sliding Window Amnesia

The more interesting bug-shaped-not-actually-a-bug problem came from Qwen’s architecture.

Qwen3.5-35B-A3B uses hybrid Sliding Window Attention (SWA). Some layers do not retain unlimited memory. In practice, that meant the model was effectively forgetting my 18,000-token system prompt between interactions and then re-reading it from scratch.

Every single time.

The logs made it look like something was broken. It was not broken. It was doing exactly what the architecture said it would do, which was somehow more annoying.

The fix was:

--swa-full
--kv-unified

Together, those forced the SWA layers to retain full context instead of behaving like a selective memory leak. Once enabled, the system prompt went into VRAM once and stayed there.

That single change turned the model from “technically runs” into “has a chance of being usable.”
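
To make the forgetting concrete, here is a toy model of a sliding-window layer. The window size is a made-up number, not Qwen's actual value, and the function is purely illustrative: it only shows which earlier positions a token can still see.

```python
from typing import Optional


def visible_positions(pos: int, window: Optional[int]) -> range:
    """Positions a token at `pos` can attend to under causal attention.

    window=None models a full-attention layer; an integer models a
    sliding-window layer that only sees the last `window` positions.
    """
    start = 0 if window is None else max(0, pos - window + 1)
    return range(start, pos + 1)


SYSTEM_PROMPT_TOKENS = 18_000  # size of the system prompt mentioned above
HYPOTHETICAL_WINDOW = 4_096    # assumption, for illustration only

# The first token that arrives after the system prompt:
pos = SYSTEM_PROMPT_TOKENS
windowed = visible_positions(pos, HYPOTHETICAL_WINDOW)
full = visible_positions(pos, None)

print(windowed.start)  # 13905: positions 0..13904 are outside the window
print(full.start)      # 0: a full-attention layer still sees the whole prompt
```

With a window like this, most of the system prompt is simply not held by the windowed layers, which is consistent with the re-reading I saw before forcing full retention.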

Wall 3: The Timeout Death Loop

Once SWA was under control, the next problem showed up during very long initial prompt loads.

The GPU would still be processing, but the client would give up first. The request would time out, the server would restart, and I would be back at zero.

This was the kind of failure that feels random until you notice the pattern: the hardware was fine, the software stack was impatient.

The fix was enabling context checkpoints:

--ctx-checkpoints 64

This lets llama.cpp checkpoint processing state mid-prompt so that if the connection drops or the client times out, the server can resume from the last saved point instead of reprocessing the entire context from scratch.

That made a massive difference.

With checkpointing in place, the painful 92-second first load became a mostly one-time cost. After that, incremental interactions in the same session dropped to roughly 1.2–1.5 seconds.
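
The death loop is easy to reproduce on paper. A minimal sketch, assuming a client timeout of 60 seconds (a hypothetical value, not any particular client's default) and round numbers in the same ballpark as my measurements:

```python
def prefill_seconds(prompt_tokens: int, tok_per_sec: float) -> float:
    """Time to process a prompt at a given prefill throughput."""
    return prompt_tokens / tok_per_sec


def client_gives_up(prompt_tokens: int, tok_per_sec: float, timeout_s: float) -> bool:
    """True if processing the prompt takes longer than the client will wait."""
    return prefill_seconds(prompt_tokens, tok_per_sec) > timeout_s


CLIENT_TIMEOUT_S = 60.0  # hypothetical

# A cold load of ~100K tokens at ~1,100 tok/s takes about 91 seconds.
print(client_gives_up(100_000, 1100.0, CLIENT_TIMEOUT_S))  # True
# A ~1K-token incremental update at ~800 tok/s takes about 1.25 seconds.
print(client_gives_up(1_000, 800.0, CLIENT_TIMEOUT_S))     # False
```

Only the cold load is at risk, which is why making that one step resumable mattered so much.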

Wall 4: The Long-Session Collapse

Even after fixing the obvious issues, multi-hour coding sessions would eventually hit a wall.

At 80K, 100K, 130K tokens into a session, one of two things would happen:

the server would crash when the context filled up
or it would force a large re-read and become miserable to use

The fix here was context shifting:

--context-shift
export LLAMA_ARG_CONTEXT_SHIFT=1

Instead of dying at the context limit, llama.cpp slides the window forward — dropping the oldest tokens while preserving the system prompt and recent history.

This was the change that made 4+ hour sessions actually viable. Without it, every complicated agentic coding session eventually ended the same way: with the model face-planting into the context ceiling.
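
Conceptually, a context shift looks something like the sketch below. This is a simplified model, not llama.cpp's actual code: `n_keep` (the protected prefix, such as the system prompt) and the choice to discard half of the remaining tokens are assumptions made for illustration.

```python
from typing import List


def context_shift(tokens: List[int], n_ctx: int, n_keep: int) -> List[int]:
    """Drop the oldest non-protected tokens once the context is full.

    tokens: token ids currently in the context, oldest first
    n_ctx:  maximum number of tokens the context can hold
    n_keep: leading tokens that must never be dropped
    """
    if len(tokens) < n_ctx:
        return tokens  # still room, nothing to do
    n_left = len(tokens) - n_keep
    n_discard = n_left // 2  # assumption: free up half of the shiftable region
    return tokens[:n_keep] + tokens[n_keep + n_discard:]


# Tiny example: a 10-token context with a 2-token protected prefix.
shifted = context_shift(list(range(10)), n_ctx=10, n_keep=2)
print(shifted)  # [0, 1, 6, 7, 8, 9]
```

The prefix and the most recent history survive; the middle of the session is what gets sacrificed.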

The Performance Profile After Fixing Everything

Once those four issues were addressed, the 35B configuration became surprisingly usable.

For IQ3_XXS at 160K context, the profile looked like this:

Operation	Time	Throughput
First prompt load (~104K tokens)	92.7 sec	1,123 tok/s
Incremental update (~1K tokens)	1.2–1.4 sec	800–900 tok/s
Token generation	20–24 ms/tok	41–47 tok/s
SWA-forced full re-process	4.5–35 sec	depends on context size

The headline number here is not the token speed. It is the fact that the entire setup works with so little safety margin.

The Flags That Actually Mattered

The full launch command is long, but only a handful of flags did most of the work.

Context retention and SWA behavior

--swa-full
--kv-unified

These stopped the model from repeatedly re-reading large parts of the prompt due to sliding-window behavior.

Surviving long prompt loads

--ctx-checkpoints 64

This made timeouts recoverable instead of catastrophic.

Long-session stability

--context-shift
export LLAMA_ARG_CONTEXT_SHIFT=1

These let the server slide the context window instead of crashing at the limit.

Staying inside the VRAM budget

--cache-type-k q4_0
--cache-type-v q4_0
-b 512
-ub 256

The tight batching was critical. With only 348MB of free VRAM, one oversized batch can spike the compute buffer and immediately OOM the process. There is no graceful degradation there. It just falls over.
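
The KV cache type is the other half of that budget. In GGML's block formats, q4_0 stores 32 values in 18 bytes (4.5 bits per value) and q8_0 stores 32 values in 34 bytes (8.5 bits), versus 16 bits for f16. The sketch below rescales the 962 MiB q4_0 cache reported for the final 160K configuration to other precisions, assuming cache size scales linearly with bits per value:

```python
# Effective bits per stored value for each cache type.
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def rescale_kv_mib(measured_mib: float, measured_type: str, target_type: str) -> float:
    """Estimate KV cache size at another precision from one measurement."""
    return measured_mib * BITS_PER_VALUE[target_type] / BITS_PER_VALUE[measured_type]


MEASURED_Q4_0_MIB = 962  # KV cache at 160K context with q4_0

for cache_type in ("q4_0", "q8_0", "f16"):
    estimate = rescale_kv_mib(MEASURED_Q4_0_MIB, "q4_0", cache_type)
    print(f"{cache_type}: ~{estimate:.0f} MiB")
```

By this estimate q8_0 would need roughly 855 MiB more than q4_0, which is already well past the 348MB that was left over.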

The Final 35B Configuration

This is the exact llama-server configuration that made the 35B stable for me:

export LLAMA_ARG_CONTEXT_SHIFT=1

./build/bin/llama-server -m ~/Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf \
  -ngl 99 -fa 1 -c 163840 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --swa-full --ctx-checkpoints 64 --kv-unified \
  --context-shift --cache-reuse 512 \
  --reasoning-format deepseek-legacy \
  --perf --no-warmup --mlock \
  --slot-prompt-similarity 0.0 \
  -np 1 -b 512 -ub 256 -t 8 -tb 8 \
  --threads-http 8 --no-mmap --jinja \
  --port 11433 --host 0.0.0.0

And this was the VRAM breakdown on exit:

CUDA0 (RTX 5060 Ti): 15,847 MiB total
  Model weights:   13,736 MiB (IQ3_XXS)
  KV cache:           962 MiB (q4_0, 160K context)
  Compute buffer:     258 MiB
  Free:               348 MiB

That final number is the whole story.

348MB is not a margin. It is a design constraint.

Every decision downstream — batch size, KV precision, context target, whether anything else is allowed to touch the GPU — flows from that number.
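
The breakdown can be turned into a small budget check. This is a sketch under one assumption: it treats the 15,847 MiB figure on the CUDA0 line as the total memory visible to llama.cpp, so whatever the itemized lines do not cover shows up as unaccounted overhead. The `fits` helper and the example allocation sizes are hypothetical.

```python
TOTAL_VISIBLE_MIB = 15_847  # assumption: total VRAM visible to the process
ITEMS_MIB = {
    "model weights": 13_736,
    "kv cache": 962,
    "compute buffer": 258,
}
FREE_MIB = 348

itemized = sum(ITEMS_MIB.values())
unaccounted = TOTAL_VISIBLE_MIB - itemized - FREE_MIB


def fits(extra_mib: int, free_mib: int = FREE_MIB) -> bool:
    """Would one more allocation of `extra_mib` still fit in free VRAM?"""
    return extra_mib <= free_mib


print(itemized)     # 14956 MiB across weights, KV cache, and compute buffer
print(unaccounted)  # 543 MiB not covered by the itemized lines
print(fits(256))    # True
print(fits(512))    # False: a compute-buffer spike of this size would OOM
```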

The Hidden Tax: SWA Checkpoint Invalidation

There is one penalty that does not show up cleanly in summary benchmarks: SWA checkpoint invalidation.

Even with --swa-full, the sliding-window architecture occasionally invalidates cached checkpoints as the session evolves. When that happens, the server falls back to reprocessing much more of the context than you wanted.

At 100K+ tokens, that can mean 4–35 seconds of dead time.

In my usage, it happened in about 10% of interactions during long sessions, and it was not random. It clustered after:

large context additions, like pasting in a new file
the first context shift event in a session
other moments where the cache topology changed enough to force rework

This is one of the reasons the 9B became so attractive in practice. No SWA layers means no SWA weirdness. Either the cache hits or it does not. There is much less drama in the middle.
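
Averaged over a long session, the tax is easy to estimate from the two figures above. A rough sketch that treats every interaction as independent, which is a simplification given that the stalls cluster:

```python
def expected_stall_seconds(stall_probability: float, stall_seconds: float) -> float:
    """Average dead time per interaction from invalidation stalls."""
    return stall_probability * stall_seconds


STALL_PROBABILITY = 0.10  # about 10% of interactions in long sessions

best_case = expected_stall_seconds(STALL_PROBABILITY, 4.0)
worst_case = expected_stall_seconds(STALL_PROBABILITY, 35.0)
print(f"{best_case:.1f} to {worst_case:.1f} seconds per interaction on average")
```

On top of a 1.2 to 1.5 second incremental update, that average is not negligible, and the individual stalls feel far worse than the average suggests.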

When to Use the 9B vs the 35B

After running both models daily, the division of labor became obvious.

Use the 9B when:

you want focused coding assistance
you are ingesting large codebases or documentation
you care more about predictable latency than maximum reasoning depth
you want room to run other GPU processes at the same time
you want a 250K+ context window without living on the edge of an OOM

The 9B is not a “budget option.” It is the practical tool.

Use the 35B-A3B when:

the task needs harder reasoning
you are debugging subtle logic failures
you are making architecture-level decisions
you want stronger instruction following on multi-step tasks
you are willing to accept a fragile VRAM budget in exchange for more capability

The 35B is the model I reach for when I want the system to think harder, not just read more.

Speed vs Headroom

I also tested the 9B in IQ3_XXS quantization alongside the Q4_K_XL:

9B (Q4_K_XL): ~40–50 tok/s
9B (IQ3_XXS): ~47–51 tok/s

That’s a modest speed bonus (roughly 2–18%, going by those ranges), but the Q4_K_XL gave me better quality and still left the headroom I needed.

The 9B won because it made the whole setup easier to live with.

Here is what changed once I switched:

a 250K+ context window, with room to target the model’s full 262K trained context without constantly thinking about memory pressure
~3.7 GiB of free headroom instead of 348MB
no SWA penalty surprises in long sessions
the same prefill speed as the 35B in my environment
enough intelligence for the vast majority of coding tasks

That combination matters more than benchmark ideology.

For day-to-day work, the 9B is simply the better product.

It is fast enough, smart enough, and roomy enough that I can trust it as infrastructure rather than treat it like a fragile demo.

What This Actually Costs

Power draw under sustained GPU load is about 140W.

At €0.30/kWh, the running cost is basically noise compared to cloud usage:

Usage	Daily cost	Monthly cost	Rough cloud equivalent
100 requests/day	~€0.01	€0.33	€15–30
500 requests/day	~€0.05	€1.65	€75–150
2,000 requests/day	~€0.20	€6.60	€300–600
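
The table can be reverse-engineered into an implied GPU time per request. A rough sketch, assuming a 30-day month and that the GPU draws the full 140W only while serving a request (idle draw is ignored):

```python
def implied_seconds_per_request(
    monthly_eur: float,
    requests_per_day: int,
    watts: float = 140.0,
    eur_per_kwh: float = 0.30,
    days: int = 30,  # assumption: 30-day month
) -> float:
    """GPU-seconds per request implied by a monthly electricity cost."""
    kwh = monthly_eur / eur_per_kwh
    joules = kwh * 3_600_000  # 1 kWh = 3.6 MJ
    total_requests = requests_per_day * days
    return joules / watts / total_requests


for monthly_eur, per_day in ((0.33, 100), (1.65, 500), (6.60, 2000)):
    seconds = implied_seconds_per_request(monthly_eur, per_day)
    print(f"{per_day} requests/day: ~{seconds:.1f} s of GPU time per request")
```

All three rows work out to about 9.4 seconds of full-power GPU time per request, so the table is internally consistent.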

The hardware cost for this setup was in the low four figures once the mini PC, GPU, and supporting parts were accounted for.

At medium usage, that still puts break-even within about a year. At heavier usage, the payoff comes much faster. The exact number depends less on electricity and more on how often you would otherwise be paying for cloud inference.

The more you use local inference, the less the economics resemble a hobby.

What I Actually Learned

Running a 250K+ context window locally on a 16GB GPU is possible — and getting a 35B model stable at 160K is possible too. But that is not the most useful lesson.

The more interesting lessons were architectural.

1. The interconnect matters once you leave VRAM

OCuLink at PCIe 4.0 x4 is fine for a workflow that stays inside VRAM. But once a model spills weights into system RAM, that link becomes the ceiling very quickly.

This is why larger models start to prefer either:

substantially more VRAM
or unified-memory systems like Mac Studio or Strix Halo

If the model does not fit cleanly, the transport path starts making decisions for you.
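
A back-of-envelope calculation shows how quickly the link becomes the ceiling. PCIe 4.0 runs at 16 GT/s per lane with 128b/130b encoding, so an x4 link tops out just under 8 GB/s before protocol overhead. The amount of spilled weight data read per token below is a hypothetical figure, and the model ignores everything except raw transfer time:

```python
def pcie4_bytes_per_sec(lanes: int) -> float:
    """Theoretical PCIe 4.0 bandwidth: 16 GT/s per lane, 128b/130b encoding."""
    return 16e9 * lanes * (128 / 130) / 8


def max_tok_per_sec(spilled_bytes_per_token: float, link_bytes_per_sec: float) -> float:
    """Upper bound on decode speed if each token must pull this much over the link."""
    return link_bytes_per_sec / spilled_bytes_per_token


GIB = 2**30
link = pcie4_bytes_per_sec(lanes=4)
print(f"x4 link: ~{link / 1e9:.2f} GB/s")

# Hypothetical: 2 GiB of weights per token have to come from system RAM.
ceiling = max_tok_per_sec(2 * GIB, link)
print(f"ceiling: ~{ceiling:.1f} tok/s")
```

Under those assumptions the ceiling is a few tokens per second, an order of magnitude below what the same GPU manages when everything stays in VRAM.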

2. Headroom is a first-class design constraint

The most important number in this whole experiment was not the model size or the token speed.

It was 348MB free.

That tiny buffer decided:

how aggressive batch sizes could be
what KV precision was realistic
whether context targets were stable or reckless
whether anything else was allowed to use the GPU at the same time

Once you are this close to the edge, “free VRAM” stops being a leftover and becomes a design variable.

3. Smaller models can be better systems

Before testing the 9B, I thought I was evaluating a fallback.

After testing it, I started deliberately routing work to it.

That is the biggest reversal in this whole project.

The 9B is not a compromise. It is the model that best fits the machine, the workflow, and the latency budget. I still use the 35B when the problem genuinely needs more reasoning depth, but the 9B is the one I trust for everyday work.

Using It in OpenCode

Once the server was stable, plugging it into OpenCode was the easy part.

The configuration is simple: point OpenCode at the local llama-server endpoint and set the context limit accordingly.

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llamacpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server",
      "options": {
        "baseURL": "http://<your-server-ip>:11433/v1"
      },
      "models": {
        "home-qwen": {
          "name": "Home Qwen",
          "limit": {
            "context": 262144,
            "output": 248320
          }
        }
      }
    }
  }
}
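
Before pointing an editor at the server, it can help to confirm the OpenAI-compatible endpoint answers at all. A minimal sketch using only the standard library; the function names are my own, and it assumes the server returns the usual OpenAI-style model list (an object with a `data` array of entries that each have an `id`):

```python
import json
import urllib.request
from typing import List


def models_url(base_url: str) -> str:
    """Build the model-listing URL from the same baseURL used in the config."""
    return base_url.rstrip("/") + "/models"


def list_model_ids(base_url: str, timeout_s: float = 10.0) -> List[str]:
    """Fetch the ids of the models the server reports (needs a running server)."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout_s) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


# Usage, with the placeholder replaced by your server's address:
# print(list_model_ids("http://<your-server-ip>:11433/v1"))
```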

After that, I could use Qwen locally for coding, documentation, debugging, and long-running sessions with full context intact.

That was the original goal.

The unexpected outcome was learning that the model I wanted to prove could run was not the one I ended up wanting to use.

Final Takeaway

Yes, you can run Qwen3.5-35B-A3B at 160K context on a 16GB GPU. And with the right model choice, you can push a 250K+ context window on that same 16GB card.

But the more important takeaway is this:

On constrained local hardware, the best model is not the biggest one you can force into memory. It is the one that gives you the best mix of reasoning, bandwidth behavior, headroom, and stability over long sessions.

For me, that turned out to be the 9B.

And that was the part I did not expect.

Update: This series has come a long way since. Qwen 3.6 replaced 3.5 with full 262K context, and MTP speculative decoding now pushes the MoE to 125 tok/s at 98K context on the same hardware. That speed made it practical to run as a 24/7 autonomous agent with daily cron jobs, multi-platform messaging, and kernel-level sandboxing.

The views and opinions expressed here are my own and do not reflect those of my employer.