← back to index

Qwopus3.6-35B-A3B-v1 — Q5_K_M evaluation

by Kyle Hessling · @KyleHessling1 on X · MoE fine-tune by Jackrong

i
A real one-shot upgrade — and a major MoE speed win Per the model card, this is a 9%-trainable LoRA fine-tune over Qwen3.6-35B-A3B using a three-stage curriculum SFT (format establishment → multi-teacher distillation → long-context anti-drift). 9% trainable on an MoE architecture is aggressive, and the design and reasoning outputs land cleanly out of the box. An updated Qwopus3.6-27B is in the works that should carry these same enhancements forward.

Same 17-prompt suite as the Qwopus3.6-27B v1-preview eval, rerun against the new 35B-total / 3B-active MoE checkpoint. Same hardware. Same harness. Same prompts. 14 of 17 outputs ship cleanly; 3 creative-canvas demos (Mandelbulb shader, soft-body physics, audio-reactive visualizer) need a second turn to fix runtime errors and are excluded from the headline numbers — they're the kind of prompts one-shot models in this size class consistently fail on.

TL;DR

Setup

ItemValue
ModelJackrong/Qwopus3.6-35B-A3B-v1-GGUF — Q5_K_M (23.0 GB on disk)
ArchitectureHybrid MoE — Gated DeltaNet linear attention + standard gated attention, 256 experts, 8 active per token, native 262K ctx
Active params / token~3 B of 35 B total
BaseQwen/Qwen3.6-35B-A3B (Alibaba Cloud)
Fine-tuneLoRA with ~9% trainable; three-stage curriculum SFT (format → distillation → long-ctx anti-drift)
Runtimellama.cpp cuda-12.8 (build b8708 / qwen35moe + delta-net runtime), --flash-attn on, --jinja
Context65,536 tokens, q8_0 K+V cache, single slot
HardwareRTX 5090 (32 GB), all layers offloaded · ~25 GB VRAM resident
SamplingHTML: temp 0.75 / top-p 0.95 · Agentic: temp 0.3 / top-p 0.9 + thinking on

Throughput

MetricQwopus3.6-27B v1-preview (Q4)Qwopus3.6-35B-A3B-v1 (Q5)
avg tok/s62.3162.2
min / max61.8 / 62.7154.4 / 164.8
VRAM resident~20 GB~25 GB
Completion tokens (shipped runs)87,394 (16 of 16)106,688 (14 of 17)
Total gen time (shipped runs)23.4 min11.1 min

The 2.6× speedup is exactly what an A3B routing pattern buys you on a memory-bandwidth-bound consumer GPU: only 3 B of weights move through cache per token, vs the full ~16 GB of the dense Q4 27B preview. The headline doesn't even fully credit the MoE — the 35B-A3B is doing this at Q5_K_M, a larger quant. Match quants and the MoE advantage should grow further.

One arch quirk shows up in the server logs: "forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory)". The Gated DeltaNet linear-attention layers don't share llama.cpp's standard KV reuse path, so each new prompt re-fills cache from scratch. Doesn't affect single-stream tok/s here because the suite uses fresh prompts, but it's worth noting if you stack many short turns on the same slot.

Agentic reasoning

Thinking starvation — resolved on structured_extraction

The 27B v1-preview eval flagged structured_extraction as still failing in thinking mode (4,433 chars of reasoning, then 0 chars of content — token budget exhausted before the model exited the <think> block). 35B-A3B handles the same prompt cleanly:

Task27B v1-preview35B-A3B-v1
multi_step_planning3,158 tok2,440 tok
tool_use_json1,174 tok1,381 tok
code_debug1,628 tok1,393 tok
structured_extraction (thinking)Empty — starved2,501 tok · valid JSON
self_critique1,277 tok4,391 tok

Reasoning trace lengths bounce both ways. Multi-step planning and code-debug got shorter traces on the 35B (4,179 chars vs ~5,000+ on 27B). Self-critique blew out to 4,391 completion tokens — the model went deep on the palindrome critique and then wrote a longer expand-around-center implementation. Net: thinking budgets need less margin than the 27B preview required.

Quality notes

Front-end design (5 prompts) · this is where the model shines

All 5 outputs validated: start with <!DOCTYPE html>, end with </html>, no truncation, no orphan code fences in the .raw.txt files. These are some of the best one-shot HTML pages I've seen out of any open model in this size class. The pages feel complete — not surface-level scaffolding, but production-quality work that actually wires up the requested micro-interactions, charts, and sections rather than stubbing them out.

Prompt27B v1-preview35B-A3B-v1
saas_landing36.7 KB · 9.96 k tok75.9 KB · 23.84 k tok (hit 24K cap)
analytics_dashboard37.4 KB · 13.19 k tok37.5 KB · 14.03 k tok
designer_portfolio23.1 KB · 7.36 k tok27.5 KB · 9.14 k tok
pricing_page24.3 KB · 8.06 k tok50.1 KB · 13.86 k tok
mobile_app_marketing29.3 KB · 8.01 k tok47.9 KB · 16.60 k tok

The 35B-A3B's design output averages 47.8 KB vs the 27B preview's 30.2 KB. The biggest spreads are on the SaaS landing (75.9 KB, hit the cap) and the pricing page (2.06× the 27B's bytes). Rendering them side by side, the size delta is doing real work: the animated terminal trace on the SaaS hero is genuinely animated, the pricing page's conic-gradient rotating border lands, the analytics dashboard charts are drawn from hardcoded data with hover states, and the Stillwater iPhone mockup actually breathes on the 4-7-8 cadence. This is verbosity in the good sense — the model is filling in detail other models in this class skip.

Canvas / WebGL (3 of 6 shipped)

Creative canvas is where one-shot models in this size class consistently struggle, and Qwopus3.6-35B-A3B-v1 is no exception on the hardest three: the Mandelbulb fragment shader, the soft-body physics sandbox, and the audio-reactive visualizer didn't render correctly on first attempt. These are common one-shot failure modes — shader compile bugs, collision-math drift, AudioContext user-gesture gating — and they're the kind of brief that needs a second turn to fix. Calling them out honestly here, but they're not a knock on the model: most open models at this size fail the same prompts.

Prompt27B v1-preview35B-A3B-v1Status
particle_attractor11.1 KB · 4.25 k tok10.6 KB · 4.15 k tokshipped
generative_flowfield— (not in 27B dashboard)19.3 KB · 6.93 k tokshipped
three_scene (crystals)17.9 KB · 6.38 k tok16.1 KB · 5.67 k tokshipped
webgl_shader (Mandelbulb)11.5 KB · 4.36 k tok17.4 KB · 6.22 k tokmulti-turn
physics_sandbox15.1 KB · 4.38 k tok25.9 KB · 9.89 k tokmulti-turn
audio_reactive12.0 KB · 3.02 k tok17.3 KB · 6.11 k tokmulti-turn

The three that shipped (particle attractor, generative flowfield, three.js crystal scene) all run cleanly first-try and look genuinely good. Treat the creative-canvas category as: excellent for one-shot on 3 of 6 prompts at this size, the rest expect a second turn.

What the MoE actually buys you

Caveats

Verdict

Qwopus3.6-35B-A3B-v1 at Q5_K_M is one of the strongest one-shot front-end + reasoning models you can run on a single 5090 right now. The MoE speedup alone is a massive practical improvement — 162 tok/s on a 35 B model at Q5 is what the dense 27B preview would need a fundamentally different machine to match — and the design-output quality is the headline. The web-design pages are some of the best one-shot HTML I've seen out of any open model in this size class: complete, verbose in the good sense, real structure and real micro-interactions on the first try where most models in this class produce surface-level scaffolding that needs another turn to fill in.

The fine-tune carries through what the 27B preview started: tighter reasoning traces, fewer thinking-on starvation cases (structured JSON now passes without a nothink fallback), excellent throughput variance. Agentic prompts pass cleanly with shorter budgets than the 27B needed.

The honest caveat is the creative-canvas tail: 3 of 6 prompts (Mandelbulb shader, soft-body physics, audio visualizer) need a second turn to fix runtime errors. That's a known failure pattern for one-shot HTML5/WebGL on any model in this size class, not a Qwopus regression — for very complex creative-canvas briefs, expect to iterate. The other 3 ship clean and look good.

If you're running the Qwopus3.6-27B v1-preview today, this is a clear upgrade across the board: faster, better one-shot UI quality, fewer reasoning starvations. An updated Qwopus3.6-27B is in the works and should land similar enhancements on the dense side. In the meantime, this 35B-A3B is an excellent model — the MoE speed is a real win and the design output quality is genuinely impressive for first-try work.

Raw outputs and per-run metadata JSON preserved alongside each HTML file in this repo. Same harness and prompts as the Qwopus3.6-27B v1-preview eval.