Same 17-prompt suite as the Qwopus3.6-27B v1-preview eval, rerun against the new 35B-total / 3B-active MoE checkpoint. Same hardware. Same harness. Same prompts. 14 of 17 outputs ship cleanly; 3 creative-canvas demos (Mandelbulb shader, soft-body physics, audio-reactive visualizer) need a second turn to fix runtime errors and are excluded from the headline numbers — they're the kind of prompts one-shot models in this size class consistently fail on.

| Item | Value |
|---|---|
| Model | Jackrong/Qwopus3.6-35B-A3B-v1-GGUF — Q5_K_M (23.0 GB on disk) |
| Architecture | Hybrid MoE — Gated DeltaNet linear attention + standard gated attention, 256 experts, 8 active per token, native 262K ctx |
| Active params / token | ~3 B of 35 B total |
| Base | Qwen/Qwen3.6-35B-A3B (Alibaba Cloud) |
| Fine-tune | LoRA with ~9% trainable; three-stage curriculum SFT (format → distillation → long-ctx anti-drift) |
| Runtime | llama.cpp cuda-12.8 (build b8708 / qwen35moe + delta-net runtime), --flash-attn on, --jinja |
| Context | 65,536 tokens, q8_0 K+V cache, single slot |
| Hardware | RTX 5090 (32 GB), all layers offloaded · ~25 GB VRAM resident |
| Sampling | HTML: temp 0.75 / top-p 0.95 · Agentic: temp 0.3 / top-p 0.9 + thinking on |
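As a concrete sketch, the two sampling profiles above map directly onto llama.cpp's OpenAI-compatible `/v1/chat/completions` payloads. The `build_request` helper and the 24K default for `max_tokens` are illustrative, not the harness's actual code:

```python
# Sketch: the two sampling profiles from the table, expressed as payloads for
# llama.cpp's OpenAI-compatible /v1/chat/completions endpoint. Field names are
# standard; the profile split (HTML vs agentic) is from this eval.

PROFILES = {
    # Creative front-end briefs: hotter sampling.
    "html": {"temperature": 0.75, "top_p": 0.95},
    # Agentic / reasoning prompts: cooler sampling (thinking toggle is
    # template-dependent and omitted here).
    "agentic": {"temperature": 0.3, "top_p": 0.9},
}

def build_request(prompt: str, profile: str, max_tokens: int = 24_576) -> dict:
    """Assemble a chat-completion payload for a given sampling profile."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **PROFILES[profile],
    }
```

Pointing this at a local llama.cpp server (`http://localhost:8080` by default) is the usual setup; only `temperature` and `top_p` are assumed here, both of which the server accepts.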

| Metric | Qwopus3.6-27B v1-preview (Q4) | Qwopus3.6-35B-A3B-v1 (Q5) |
|---|---|---|
| avg tok/s | 62.3 | 162.2 |
| min / max | 61.8 / 62.7 | 154.4 / 164.8 |
| VRAM resident | ~20 GB | ~25 GB |
| Completion tokens (shipped runs) | 87,394 (16 of 16) | 106,688 (14 of 17) |
| Total gen time (shipped runs) | 23.4 min | 11.1 min |
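A quick consistency check on the table: the quoted completion-token and wall-time figures reproduce each model's average tok/s, and the headline speedup is just the ratio of the two averages:

```python
# Consistency check on the throughput table: completion tokens / wall time
# should land near the quoted avg tok/s for each model.
speedup = 162.2 / 62.3                  # ratio of the two averages, ~2.6x
tok_s_27b = 87_394 / (23.4 * 60)        # ~62 tok/s over the shipped runs
tok_s_35b = 106_688 / (11.1 * 60)       # ~160 tok/s over the shipped runs
```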
The 2.6× speedup is exactly what an A3B routing pattern buys you on a memory-bandwidth-bound consumer GPU: only 3 B of weights move through cache per token, vs the full ~16 GB of the dense Q4 27B preview. The headline doesn't even fully credit the MoE — the 35B-A3B is doing this at Q5_K_M, a larger quant. Match quants and the MoE advantage should grow further.
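A back-of-envelope sketch of that bandwidth argument, assuming roughly 1.79 TB/s of memory bandwidth for the 5090 and ~5.5 effective bits/weight for Q5_K_M (both assumptions, not measurements from this run):

```python
# Back-of-envelope: memory-bandwidth ceiling on tok/s for each model.
# Assumptions (not from the eval itself): RTX 5090 bandwidth ~1.79 TB/s,
# Q5_K_M ~5.5 effective bits/weight, Q4_K ~4.85 effective bits/weight.
# Real throughput lands well below the ceiling because of KV-cache traffic,
# activations, and kernel overheads.

BANDWIDTH_GBPS = 1790.0  # assumed RTX 5090 memory bandwidth, GB/s

def ceiling_tok_s(active_params_b: float, bits_per_weight: float) -> float:
    """Bandwidth-bound upper bound: one full pass over active weights per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return BANDWIDTH_GBPS * 1e9 / bytes_per_token

moe = ceiling_tok_s(3.0, 5.5)      # ~3 B active weights at Q5_K_M
dense = ceiling_tok_s(27.0, 4.85)  # dense 27B at Q4 (~16 GB of weights)
```

The active-weight traffic ratio this gives (~8x) is larger than the observed 2.6x, which is consistent with non-weight memory traffic eating part of the budget on both models.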
One arch quirk shows up in the server logs: "forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory)". The Gated DeltaNet linear-attention layers don't share llama.cpp's standard KV reuse path, so each new prompt re-fills cache from scratch. Doesn't affect single-stream tok/s here because the suite uses fresh prompts, but it's worth noting if you stack many short turns on the same slot.
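A conceptual sketch of why that happens (this is not llama.cpp's actual cache code): standard-attention KV entries can be reused up to the longest shared token prefix, but a recurrent or linear-attention state is one compressed summary that is only valid for the exact prefix it was built from:

```python
# Conceptual sketch, not llama.cpp internals: why hybrid/recurrent memory
# forces full prompt re-processing. A standard KV cache can be truncated to
# the longest shared prefix; a recurrent state is all-or-nothing.

def common_prefix_len(a: list[int], b: list[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reusable_tokens(cached: list[int], new: list[int], hybrid: bool) -> int:
    """How many tokens of the new prompt can skip re-processing."""
    shared = common_prefix_len(cached, new)
    if hybrid:
        # Recurrent state only helps if the cached prefix matches in full.
        return shared if shared == len(cached) else 0
    return shared
```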
The 27B v1-preview eval flagged structured_extraction as still failing in thinking mode (4,433 chars of reasoning, then 0 chars of content — token budget exhausted before the model exited the <think> block). 35B-A3B handles the same prompt cleanly:

| Task | 27B v1-preview | 35B-A3B-v1 |
|---|---|---|
| multi_step_planning | 3,158 tok | 2,440 tok |
| tool_use_json | 1,174 tok | 1,381 tok |
| code_debug | 1,628 tok | 1,393 tok |
| structured_extraction (thinking) | Empty — starved | 2,501 tok · valid JSON |
| self_critique | 1,277 tok | 4,391 tok |
Reasoning trace lengths bounce both ways. Multi-step planning and code-debug got shorter traces on the 35B (4,179 chars vs ~5,000+ on 27B). Self-critique blew out to 4,391 completion tokens — the model went deep on the palindrome critique and then wrote a longer expand-around-center implementation. Net: thinking budgets need less margin than the 27B preview required.
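A minimal sketch of the starvation check described above; the harness's real logic isn't shown, so `split_thinking` and `is_starved` are assumed shapes:

```python
# Sketch: a thinking-mode reply is "starved" when the token budget ran out
# before the model closed its <think> block, leaving reasoning text but zero
# visible content (the 27B's structured_extraction failure mode).

def split_thinking(raw: str) -> tuple[str, str]:
    """Split a thinking-mode completion into (reasoning, visible content)."""
    start = raw.find("<think>")
    if start == -1:
        return "", raw                 # no thinking block at all
    end = raw.find("</think>", start)
    if end == -1:
        return raw[start + 7:], ""     # block never closed: budget exhausted
    return raw[start + 7:end], raw[end + 8:]

def is_starved(raw: str) -> bool:
    reasoning, content = split_thinking(raw)
    return bool(reasoning.strip()) and not content.strip()
```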
code_debug: the fix catches `=` vs `==`, the useless loop / bounds logic, and the off-by-one on `nums[k]`. The bounds check uses an upfront `if k < 1 or k > len(nums)` guard, which is more robust than the 27B's version.

tool_use_json: the call chain comes out in the right order (search_flights → book_hotel → get_weather). Same 2024 date drift as the 27B preview: the prompt doesn't anchor a year, so the model defaults to its training distribution.

All 5 web-design outputs validated: start with `<!DOCTYPE html>`, end with `</html>`, no truncation, no orphan code fences in the .raw.txt files. These are some of the best one-shot HTML pages I've seen out of any open model in this size class. The pages feel complete — not surface-level scaffolding, but production-quality work that actually wires up the requested micro-interactions, charts, and sections rather than stubbing them out.

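The validation pass described above reduces to a few string checks; this `validates` helper is an assumed reconstruction, not the harness's code:

```python
# Sketch of the one-shot HTML validation pass: the page must start with the
# doctype, end with the closing tag, and the raw output must carry no
# leftover markdown code fences. Assumed logic mirroring the stated criteria.

def validates(raw: str) -> bool:
    page = raw.strip()
    return (
        page.startswith("<!DOCTYPE html>")
        and page.endswith("</html>")
        and "```" not in page
    )
```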
| Prompt | 27B v1-preview | 35B-A3B-v1 |
|---|---|---|
| saas_landing | 36.7 KB · 9.96 k tok | 75.9 KB · 23.84 k tok (hit 24K cap) |
| analytics_dashboard | 37.4 KB · 13.19 k tok | 37.5 KB · 14.03 k tok |
| designer_portfolio | 23.1 KB · 7.36 k tok | 27.5 KB · 9.14 k tok |
| pricing_page | 24.3 KB · 8.06 k tok | 50.1 KB · 13.86 k tok |
| mobile_app_marketing | 29.3 KB · 8.01 k tok | 47.9 KB · 16.60 k tok |
The 35B-A3B's design output averages 47.8 KB vs the 27B preview's 30.2 KB. The biggest spreads are on the SaaS landing (75.9 KB, hit the cap) and the pricing page (2.06× the 27B's bytes). Rendering them side by side, the size delta is doing real work: the animated terminal trace on the SaaS hero is genuinely animated, the pricing page's conic-gradient rotating border lands, the analytics dashboard charts are drawn from hardcoded data with hover states, and the Stillwater iPhone mockup actually breathes on the 4-7-8 cadence. This is verbosity in the good sense — the model is filling in detail other models in this class skip.
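The size averages are easy to verify from the table (values in KB):

```python
# Check of the size averages quoted above (KB per page, from the table).
preview_27b = [36.7, 37.4, 23.1, 24.3, 29.3]
v1_35b = [75.9, 37.5, 27.5, 50.1, 47.9]

avg_27b = sum(preview_27b) / len(preview_27b)  # ~30.2 KB
avg_35b = sum(v1_35b) / len(v1_35b)            # ~47.8 KB
pricing_ratio = 50.1 / 24.3                    # pricing page, ~2.06x
```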
Creative canvas is where one-shot models in this size class consistently struggle, and Qwopus3.6-35B-A3B-v1 is no exception on the hardest three: the Mandelbulb fragment shader, the soft-body physics sandbox, and the audio-reactive visualizer didn't render correctly on first attempt. These are common one-shot failure modes — shader compile bugs, collision-math drift, AudioContext user-gesture gating — and they're the kind of brief that needs a second turn to fix. Calling them out honestly here, but they're not a knock on the model: most open models at this size fail the same prompts.

| Prompt | 27B v1-preview | 35B-A3B-v1 | Status |
|---|---|---|---|
| particle_attractor | 11.1 KB · 4.25 k tok | 10.6 KB · 4.15 k tok | shipped |
| generative_flowfield | — (not in 27B dashboard) | 19.3 KB · 6.93 k tok | shipped |
| three_scene (crystals) | 17.9 KB · 6.38 k tok | 16.1 KB · 5.67 k tok | shipped |
| webgl_shader (Mandelbulb) | 11.5 KB · 4.36 k tok | 17.4 KB · 6.22 k tok | multi-turn |
| physics_sandbox | 15.1 KB · 4.38 k tok | 25.9 KB · 9.89 k tok | multi-turn |
| audio_reactive | 12.0 KB · 3.02 k tok | 17.3 KB · 6.11 k tok | multi-turn |
The three that shipped (particle attractor, generative flowfield, three.js crystal scene) all run cleanly first-try and look genuinely good. Treat the creative-canvas category as excellent one-shot work on 3 of 6 prompts at this size; for the other three, expect a second turn.
The layers marked `full_attention` in the config use the conventional KV path. The other 30 layers (`linear_attention`, aka Gated DeltaNet) keep memory roughly flat with context length. 65 K ctx fits in ~25 GB; 131 K should land at ~26 GB; 262 K (native max) is plausible on a 5090.

Bump `max_tokens` to 32 K for the most ambitious design briefs; the token budget that worked on the 27B is no longer a fit.

Qwopus3.6-35B-A3B-v1 at Q5_K_M is one of the strongest one-shot front-end + reasoning models you can run on a single 5090 right now. The MoE speedup alone is a massive practical improvement — 162 tok/s on a 35 B model at Q5 is what the dense 27B preview would need a fundamentally different machine to match — and the design-output quality is the headline. The web-design pages are some of the best one-shot HTML I've seen out of any open model in this size class: complete, verbose in the good sense, real structure and real micro-interactions on the first try where most models in this class produce surface-level scaffolding that needs another turn to fill in.
The fine-tune carries through what the 27B preview started: tighter reasoning traces, fewer thinking-on starvation cases (structured JSON now passes without a nothink fallback), and very stable throughput. Agentic prompts pass cleanly with shorter budgets than the 27B needed.
The honest caveat is the creative-canvas tail: 3 of 6 prompts (Mandelbulb shader, soft-body physics, audio visualizer) need a second turn to fix runtime errors. That's a known failure pattern for one-shot HTML5/WebGL on any model in this size class, not a Qwopus regression — for very complex creative-canvas briefs, expect to iterate. The other 3 ship clean and look good.
If you're running the Qwopus3.6-27B v1-preview today, this is a clear upgrade across the board: faster, better one-shot UI quality, fewer reasoning starvations. An updated Qwopus3.6-27B is in the works and should land similar enhancements on the dense side. In the meantime, this 35B-A3B is an excellent model — the MoE speed is a real win and the design output quality is genuinely impressive for first-try work.
Raw outputs and per-run metadata JSON preserved alongside each HTML file in this repo. Same harness and prompts as the Qwopus3.6-27B v1-preview eval.