Same 17-prompt suite as the Qwopus3.6-27B v1-preview eval, rerun against the new 35B-total / 3B-active MoE checkpoint. Same hardware. Same harness. Same prompts. 14 of 17 outputs ship cleanly; 3 creative-canvas demos (Mandelbulb shader, soft-body physics, audio-reactive visualizer) need a second turn to fix runtime errors and are excluded from the headline numbers — they're the kind of prompts one-shot models in this size class consistently fail on.

| Item | Value |
|---|---|
| Model | Jackrong/Qwopus3.6-35B-A3B-v1-GGUF — Q5_K_M (23.0 GB on disk) |
| Architecture | Hybrid MoE — Gated DeltaNet linear attention + standard gated attention, 256 experts, 8 active per token, native 262K ctx |
| Active params / token | ~3 B of 35 B total |
| Base | Qwen/Qwen3.6-35B-A3B (Alibaba Cloud) |
| Fine-tune | LoRA with ~9% trainable; three-stage curriculum SFT (format → distillation → long-ctx anti-drift) |
| Runtime | llama.cpp cuda-12.8 (build b8708 / qwen35moe + delta-net runtime), --flash-attn on, --jinja |
| Context | 65,536 tokens, q8_0 K+V cache, single slot |
| Hardware | RTX 5090 (32 GB), all layers offloaded · ~25 GB VRAM resident |
| Sampling | HTML: temp 0.75 / top-p 0.95 · Agentic: temp 0.3 / top-p 0.9 + thinking on |
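As a concrete sketch, the two sampling profiles above map directly onto llama.cpp's OpenAI-compatible `/v1/chat/completions` payloads. The `build_request` helper and the 24K default for `max_tokens` are illustrative, not the harness's actual code:

```python
# Sketch: the two sampling profiles from the table, expressed as payloads for
# llama.cpp's OpenAI-compatible /v1/chat/completions endpoint. Field names are
# standard; the profile split (HTML vs agentic) is from this eval.

PROFILES = {
    # Creative front-end briefs: hotter sampling.
    "html": {"temperature": 0.75, "top_p": 0.95},
    # Agentic / reasoning prompts: cooler sampling (thinking toggle is
    # template-dependent and omitted here).
    "agentic": {"temperature": 0.3, "top_p": 0.9},
}

def build_request(prompt: str, profile: str, max_tokens: int = 24_576) -> dict:
    """Assemble a chat-completion payload for a given sampling profile."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **PROFILES[profile],
    }
```

Pointing this at a local llama.cpp server (`http://localhost:8080` by default) is the usual setup; only `temperature` and `top_p` are assumed here, both of which the server accepts.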

| Metric | Qwopus3.6-27B v1-preview (Q4) | Qwopus3.6-35B-A3B-v1 (Q5) |
|---|---|---|
| avg tok/s | 62.3 | 162.2 |
| min / max | 61.8 / 62.7 | 154.4 / 164.8 |
| VRAM resident | ~20 GB | ~25 GB |
| Completion tokens (shipped runs) | 87,394 (16 of 16) | 106,688 (14 of 17) |
| Total gen time (shipped runs) | 23.4 min | 11.1 min |
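A quick consistency check on the table: the quoted completion-token and wall-time figures reproduce each model's average tok/s, and the headline speedup is just the ratio of the two averages:

```python
# Consistency check on the throughput table: completion tokens / wall time
# should land near the quoted avg tok/s for each model.
speedup = 162.2 / 62.3                  # ratio of the two averages, ~2.6x
tok_s_27b = 87_394 / (23.4 * 60)        # ~62 tok/s over the shipped runs
tok_s_35b = 106_688 / (11.1 * 60)       # ~160 tok/s over the shipped runs
```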
The 2.6× speedup is exactly what an A3B routing pattern buys you on a memory-bandwidth-bound consumer GPU: only 3 B of weights move through cache per token, vs the full ~16 GB of the dense Q4 27B preview. The headline doesn't even fully credit the MoE — the 35B-A3B is doing this at Q5_K_M, a larger quant. Match quants and the MoE advantage should grow further.
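A back-of-envelope sketch of that bandwidth argument, assuming roughly 1.79 TB/s of memory bandwidth for the 5090 and ~5.5 effective bits/weight for Q5_K_M (both assumptions, not measurements from this run):

```python
# Back-of-envelope: memory-bandwidth ceiling on tok/s for each model.
# Assumptions (not from the eval itself): RTX 5090 bandwidth ~1.79 TB/s,
# Q5_K_M ~5.5 effective bits/weight, Q4_K ~4.85 effective bits/weight.
# Real throughput lands well below the ceiling because of KV-cache traffic,
# activations, and kernel overheads.

BANDWIDTH_GBPS = 1790.0  # assumed RTX 5090 memory bandwidth, GB/s

def ceiling_tok_s(active_params_b: float, bits_per_weight: float) -> float:
    """Bandwidth-bound upper bound: one full pass over active weights per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return BANDWIDTH_GBPS * 1e9 / bytes_per_token

moe = ceiling_tok_s(3.0, 5.5)      # ~3 B active weights at Q5_K_M
dense = ceiling_tok_s(27.0, 4.85)  # dense 27B at Q4 (~16 GB of weights)
```

The active-weight traffic ratio this gives (~8x) is larger than the observed 2.6x, which is consistent with non-weight memory traffic eating part of the budget on both models.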
One arch quirk shows up in the server logs: "forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory)". The Gated DeltaNet linear-attention layers don't share llama.cpp's standard KV reuse path, so each new prompt re-fills cache from scratch. Doesn't affect single-stream tok/s here because the suite uses fresh prompts, but it's worth noting if you stack many short turns on the same slot.
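A conceptual sketch of why that happens (this is not llama.cpp's actual cache code): standard-attention KV entries can be reused up to the longest shared token prefix, but a recurrent or linear-attention state is one compressed summary that is only valid for the exact prefix it was built from:

```python
# Conceptual sketch, not llama.cpp internals: why hybrid/recurrent memory
# forces full prompt re-processing. A standard KV cache can be truncated to
# the longest shared prefix; a recurrent state is all-or-nothing.

def common_prefix_len(a: list[int], b: list[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reusable_tokens(cached: list[int], new: list[int], hybrid: bool) -> int:
    """How many tokens of the new prompt can skip re-processing."""
    shared = common_prefix_len(cached, new)
    if hybrid:
        # Recurrent state only helps if the cached prefix matches in full.
        return shared if shared == len(cached) else 0
    return shared
```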
The 27B v1-preview eval flagged structured_extraction as still failing in thinking mode (4,433 chars of reasoning, then 0 chars of content — token budget exhausted before the model exited the <think> block). 35B-A3B handles the same prompt cleanly:

| Task | 27B v1-preview | 35B-A3B-v1 |
|---|---|---|
| multi_step_planning | 3,158 tok | 2,440 tok |
| tool_use_json | 1,174 tok | 1,381 tok |
| code_debug | 1,628 tok | 1,393 tok |
| structured_extraction (thinking) | Empty — starved | 2,501 tok · valid JSON |
| self_critique | 1,277 tok | 4,391 tok |
Reasoning trace lengths bounce both ways. Multi-step planning and code-debug got shorter traces on the 35B (4,179 chars vs ~5,000+ on 27B). Self-critique blew out to 4,391 completion tokens — the model went deep on the palindrome critique and then wrote a longer expand-around-center implementation. Net: thinking budgets need less margin than the 27B preview required.
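A minimal sketch of the starvation check described above; the harness's real logic isn't shown, so `split_thinking` and `is_starved` are assumed shapes:

```python
# Sketch: a thinking-mode reply is "starved" when the token budget ran out
# before the model closed its <think> block, leaving reasoning text but zero
# visible content (the 27B's structured_extraction failure mode).

def split_thinking(raw: str) -> tuple[str, str]:
    """Split a thinking-mode completion into (reasoning, visible content)."""
    start = raw.find("<think>")
    if start == -1:
        return "", raw                 # no thinking block at all
    end = raw.find("</think>", start)
    if end == -1:
        return raw[start + 7:], ""     # block never closed: budget exhausted
    return raw[start + 7:end], raw[end + 8:]

def is_starved(raw: str) -> bool:
    reasoning, content = split_thinking(raw)
    return bool(reasoning.strip()) and not content.strip()
```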
code_debug: the fix catches `=` vs `==`, the useless loop / bounds logic, and the off-by-one on `nums[k]`. The bounds check uses an upfront `if k < 1 or k > len(nums)` guard, which is more robust than the 27B's version.

tool_use_json: the call chain comes out in the right order (search_flights → book_hotel → get_weather). Same 2024 date drift as the 27B preview: the prompt doesn't anchor a year, so the model defaults to its training distribution.

All 5 web-design outputs validated: start with `<!DOCTYPE html>`, end with `</html>`, no truncation, no orphan code fences in the .raw.txt files. These are some of the best one-shot HTML pages I've seen out of any open model in this size class. The pages feel complete — not surface-level scaffolding, but production-quality work that actually wires up the requested micro-interactions, charts, and sections rather than stubbing them out.

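The validation pass described above reduces to a few string checks; this `validates` helper is an assumed reconstruction, not the harness's code:

```python
# Sketch of the one-shot HTML validation pass: the page must start with the
# doctype, end with the closing tag, and the raw output must carry no
# leftover markdown code fences. Assumed logic mirroring the stated criteria.

def validates(raw: str) -> bool:
    page = raw.strip()
    return (
        page.startswith("<!DOCTYPE html>")
        and page.endswith("</html>")
        and "```" not in page
    )
```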
| Prompt | 27B v1-preview | 35B-A3B-v1 |
|---|---|---|
| saas_landing | 36.7 KB · 9.96 k tok | 75.9 KB · 23.84 k tok (hit 24K cap) |
| analytics_dashboard | 37.4 KB · 13.19 k tok | 37.5 KB · 14.03 k tok |
| designer_portfolio | 23.1 KB · 7.36 k tok | 27.5 KB · 9.14 k tok |
| pricing_page | 24.3 KB · 8.06 k tok | 50.1 KB · 13.86 k tok |
| mobile_app_marketing | 29.3 KB · 8.01 k tok | 47.9 KB · 16.60 k tok |
The 35B-A3B's design output averages 47.8 KB vs the 27B preview's 30.2 KB. The biggest spreads are on the SaaS landing (75.9 KB, hit the cap) and the pricing page (2.06× the 27B's bytes). Rendering them side by side, the size delta is doing real work: the animated terminal trace on the SaaS hero is genuinely animated, the pricing page's conic-gradient rotating border lands, the analytics dashboard charts are drawn from hardcoded data with hover states, and the Stillwater iPhone mockup actually breathes on the 4-7-8 cadence. This is verbosity in the good sense — the model is filling in detail other models in this class skip.
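The size averages are easy to verify from the table (values in KB):

```python
# Check of the size averages quoted above (KB per page, from the table).
preview_27b = [36.7, 37.4, 23.1, 24.3, 29.3]
v1_35b = [75.9, 37.5, 27.5, 50.1, 47.9]

avg_27b = sum(preview_27b) / len(preview_27b)  # ~30.2 KB
avg_35b = sum(v1_35b) / len(v1_35b)            # ~47.8 KB
pricing_ratio = 50.1 / 24.3                    # pricing page, ~2.06x
```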
Creative canvas is where one-shot models in this size class consistently struggle, and Qwopus3.6-35B-A3B-v1 is no exception on the hardest three: the Mandelbulb fragment shader, the soft-body physics sandbox, and the audio-reactive visualizer didn't render correctly on first attempt. These are common one-shot failure modes — shader compile bugs, collision-math drift, AudioContext user-gesture gating — and they're the kind of brief that needs a second turn to fix. Calling them out honestly here, but they're not a knock on the model: most open models at this size fail the same prompts.

| Prompt | 27B v1-preview | 35B-A3B-v1 | Status |
|---|---|---|---|
| particle_attractor | 11.1 KB · 4.25 k tok | 10.6 KB · 4.15 k tok | shipped |
| generative_flowfield | — (not in 27B dashboard) | 19.3 KB · 6.93 k tok | shipped |
| three_scene (crystals) | 17.9 KB · 6.38 k tok | 16.1 KB · 5.67 k tok | shipped |
| webgl_shader (Mandelbulb) | 11.5 KB · 4.36 k tok | 17.4 KB · 6.22 k tok | multi-turn |
| physics_sandbox | 15.1 KB · 4.38 k tok | 25.9 KB · 9.89 k tok | multi-turn |
| audio_reactive | 12.0 KB · 3.02 k tok | 17.3 KB · 6.11 k tok | multi-turn |
The three that shipped (particle attractor, generative flowfield, three.js crystal scene) all run cleanly first-try and look genuinely good. Treat the creative-canvas category as excellent one-shot work on 3 of 6 prompts at this size; for the other three, expect a second turn.
The layers marked `full_attention` in the config use the conventional KV path. The other 30 layers (`linear_attention`, aka Gated DeltaNet) keep memory roughly flat with context length. 65 K ctx fits in ~25 GB; 131 K should land at ~26 GB; 262 K (native max) is plausible on a 5090.

Bump `max_tokens` to 32 K for the most ambitious design briefs; the token budget that worked on the 27B is no longer a fit.

Qwopus3.6-35B-A3B-v1 at Q5_K_M is one of the strongest one-shot front-end + reasoning models you can run on a single 5090 right now. The MoE speedup alone is a massive practical improvement — 162 tok/s on a 35 B model at Q5 is what the dense 27B preview would need a fundamentally different machine to match — and the design-output quality is the headline. The web-design pages are some of the best one-shot HTML I've seen out of any open model in this size class: complete, verbose in the good sense, real structure and real micro-interactions on the first try where most models in this class produce surface-level scaffolding that needs another turn to fill in.
The fine-tune carries through what the 27B preview started: tighter reasoning traces, fewer thinking-on starvation cases (structured JSON now passes without a nothink fallback), and very stable throughput. Agentic prompts pass cleanly with shorter budgets than the 27B needed.
The honest caveat is the creative-canvas tail: 3 of 6 prompts (Mandelbulb shader, soft-body physics, audio visualizer) need a second turn to fix runtime errors. That's a known failure pattern for one-shot HTML5/WebGL on any model in this size class, not a Qwopus regression — for very complex creative-canvas briefs, expect to iterate. The other 3 ship clean and look good.
If you're running the Qwopus3.6-27B v1-preview today, this is a clear upgrade across the board: faster, better one-shot UI quality, fewer reasoning starvations. An updated Qwopus3.6-27B is in the works and should land similar enhancements on the dense side. In the meantime, this 35B-A3B is an excellent model — the MoE speed is a real win and the design output quality is genuinely impressive for first-try work.
Raw outputs and per-run metadata JSON preserved alongside each HTML file in this repo. Same harness and prompts as the Qwopus3.6-27B v1-preview eval.