Qwen 3.5/3.6 Plus: preserved thinking + local inference - Chinese Open-Weight Models: Kimi, Qwen, DeepSeek, GLM

Qwen is the open-weight stack the channel rates highest in mid-2026 — and the only one that has knocked Kimi 2.6 and DeepSeek V4 Pro out of the channel's "personal stack" rotation on a real build. The headline data point: an ancient Chinese 3D building prompt that took Kimi 2.6 and Mimo two to three hours finished in 8 minutes 53 seconds on Qwen 3.7 Max, with a 100% success rate across 18 tool calls, and DeepSeek V4 Pro never produced a roof at all. The headline caveat: output tokens cost $7.50 per 1M on Alibaba Model Studio, making it "one of the most expensive flagship models" — second only to Opus.

This article is the six-video deep-dive across two distinct Qwen tracks: the cloud Qwen 3.6 Plus / 3.7 Max / 3.7 Plus path for orchestrator and IDE work, and the local Qwen 3.5 path for privacy-first heartbeat agents. Each track has its own hardware, its own pricing cliff, and its own "disable think" workaround.

What you'll learn

Why Qwen 3.7 Max is the only model the channel has called "actually insane" since the MiniMax M2.7 review — and the 8m 53s, 18-tool-call, 100% success-rate build that earned the label.
The 3.7 Plus vs 3.7 Max decision rule, and why Plus becomes the default for IDE work despite Max's higher benchmarks (native vision, ~40% cheaper tokens, GUI+CLI hybrid reasoning).
Why the "Qwen 3.6 Plus is FREE on Hermes Agent" video title is now misleading: the top-liked audience comments all confirm the free period ended and free API key creation is no longer possible on the Hermes portal. The orchestrator-slot framing still holds, but you pay for it.
The "free tier trap" on Alibaba Cloud Model Studio: you must toggle "free quota only" before your first request, or the card auto-charges the moment the 1M free tokens run out. A current viewer reports the Qwen API platform now shows $0.40 input / $1.60 output pricing per million tokens.
The local / browser / in-cloud three-way trade-off: 9B LM Studio build, WebGPU 3060 ceiling, and the token plan vs pay-per-use math behind each path.
The "disable thinking mode" workaround that finally gets the local 9B build to answer the car-wash test instead of looping for two minutes.

The channel's verdict on Qwen (mid-2026)

Two framing updates from the past month worth internalizing before the per-video deep-dives. First, the channel's own tier table in public.ai_models (slug qwen-3-6-plus, tier S) describes Qwen 3.6 Plus as "Always-on reasoning with preserved thinking across sessions — exceptional for long agentic tasks," with the explicit strengths "always-on reasoning trace, preserved thinking across sessions, reduces contradictions in long tasks, consistent decision-making, excellent for new Hermes Agent setups" and the single weakness "may be overkill for simple tasks" — that is the actual grounding for the orchestrator-slot framing, not a creator quote. Second, the AI Briefing 2026-04-23 row in public.ai_updates notes that Alibaba's Qwen3.6-27B is a 27-billion-parameter dense model that beats its own 397B-parameter MoE predecessor (Qwen3.5-397B-A17B) on every major coding benchmark — SWE-bench Verified 77.2 vs 76.2, SWE-bench Pro 53.5 vs 50.9, Terminal-Bench 2.0 59.3 vs 52.5, SkillsBench Avg 48.2 vs 30.0 — which is the structural reason the "Plus" line has been gaining on Max in the channel's IDE-work reviews.
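To make the size of those gaps concrete, here is a minimal sketch that tabulates the dense-minus-MoE difference for each benchmark. The scores are copied from the paragraph above; the names `SCORES` and `deltas` are illustrative only.

```python
# Scores as quoted above: (Qwen3.6-27B dense, Qwen3.5-397B-A17B MoE).
SCORES = {
    "SWE-bench Verified": (77.2, 76.2),
    "SWE-bench Pro": (53.5, 50.9),
    "Terminal-Bench 2.0": (59.3, 52.5),
    "SkillsBench Avg": (48.2, 30.0),
}


def deltas(scores):
    """Return the dense-minus-MoE gap per benchmark, rounded to one decimal."""
    return {name: round(dense - moe, 1) for name, (dense, moe) in scores.items()}


if __name__ == "__main__":
    for name, gap in deltas(SCORES).items():
        print(f"{name}: +{gap} points")
```

The gap is narrow on SWE-bench Verified and widest on SkillsBench, which is the agent-style benchmark in the list.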

Qwen 3.6 Plus on Hermes Agent — the "free" framing that the audience retired

The source video (Nqs_5RLg6QA, 3,577 views) returns null summary_content and null transcript_content in public.videos — it has not been re-processed since the last cycle — so the creator's pitch has to be reconstructed from the title plus the public.ai_models row plus the audience's response in public.youtube_comments. The public.ai_models table (qwen-3-6-plus, tier S, vendor Alibaba) frames the model as: always-on reasoning trace, preserved thinking across sessions, reduces contradictions in long tasks, consistent decision-making, excellent for new Hermes Agent setups, with the single explicit weakness "may be overkill for simple tasks." That is the source-of-truth for the orchestrator-slot placement.

The audience fact-check. The 3,577-view video attracted 20 comments, and the top-liked ones converge on a single correction to the title:

@dimitri7730 (6 likes, 2026-05-14): "you are a little too late, free period has ended."
@elmoudir1189 (4 likes, 2026-05-14): "i don't think you can even create an api key for hermes portal for free."
@Damn-Damn (1 like, 2026-05-15): "if you want free model keep malding because there are none." (Reply on the same thread: "no free models for us.")
@chrisk5930 (1 like, 2026-05-15): "Not possible anymore, shame."
@Joseph_mohammed (1 like, 2026-05-15): "They removed it."
@JonathanPopZAR (1 like, 2026-05-14): "You right, on free plan no API keys."
@rcoding513 (0 likes, 2026-05-14): "yep no longer free and cost me 10$ to see it, but hey now we know..."
@fog213 (0 likes, 2026-05-14): "Yeah, I also came to check, and Qwen is no longer free."
@elmoudir1189 (0 likes, 2026-05-25): "@iPodHikARu that's not free, you did pay for it, it's 2026 that's how they get you."

The pattern across the thread is uniform: the "free on Hermes" framing was real for a window, ended before the comments piled up, and the only "free" tier that still creates a Hermes portal API key is one that has been grandfathered. The thread that @elmoudir1189 posts at 2026-05-14 13:55 — "i keep seeing this 'qwen3.6 plus freeee on hermes portal' and it made wonder if they allowed a free tier api key creation but NOPE, just people running off tweets and not actually checking the service" — is the cleanest summary. A counter-suggestion from @Ahmed14084 (0 likes, 2026-05-14) is to "try deepseek v4 flash for free ... along with stepfun3.5 flash," which the channel has since picked up in the DeepSeek coverage (§6.3).

One useful tip from the comments. @Jayander-n3k (0 likes, 2026-05-20): "Thanks for posting. That /verbose command and the selected options was a real eye opener and I will def be using it in the future." The /verbose command is the only concrete workflow lever that survives the title correction: it is a Hermes Agent CLI flag for high-effort reasoning, not a Qwen-specific setting.

Where the orchestrator-slot framing still holds. The public.ai_models row is dated before the free tier ended, and the model's actual capability profile — always-on reasoning, preserved thinking across sessions, less contradiction on long-horizon tasks — does not depend on whether the route into Hermes is free. For a new Hermes Agent setup where reasoning quality on a multi-hour run is the bottleneck, 3.6 Plus is still a defensible third orchestrator behind GPT 5.4 and Gemini 3.1 Pro. What has changed is the price: the audience reports the free path closed around the time the video's view count crossed 3K, and a current viewer (@Cesargrmzg, 0 likes, 2026-06-02) pegs the Qwen API platform at "$0.40 input and $1.60 output" per million tokens. That is the number to plan around, not the title.
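For planning around that price, a rough per-run estimate can be sketched as below. The two rates are the viewer-reported figures quoted above; the 800,000 / 200,000 token split in the usage example is purely hypothetical, and `run_cost` is an illustrative helper, not part of any Qwen or Hermes API.

```python
# Viewer-reported rates from the comment thread (USD per 1M tokens).
INPUT_RATE = 0.40
OUTPUT_RATE = 1.60


def run_cost(input_tokens, output_tokens,
             input_rate=INPUT_RATE, output_rate=OUTPUT_RATE):
    """Estimated USD cost of one run at per-million-token rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Hypothetical split: 800k tokens in, 200k tokens out.
    print(f"${run_cost(800_000, 200_000):.2f}")
```

Swap in your own logged token counts; the point is only that the reported rates make a million-token orchestrator run a sub-dollar expense rather than a free one.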

Qwen 3.7 Max is ACTUALLY INSANE! (Real Tests and Review)

The review that put Qwen on the channel's "actually switching" list. The speed number is the headline: 8m 53s for the ancient Chinese 3D building, split as roughly 6 minutes of API time and 3 minutes of tool time, with 18 tool calls, 100% success rate. The same prompt took Kimi 2.6 and Mimo two to three hours, and DeepSeek V4 Pro never produced a roof. The space-shooter one-shot added a dash mechanic, screen shake, and combo system that Kimi's version skipped entirely.
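As a back-of-the-envelope check on that comparison, the sketch below converts the quoted times to seconds and computes the implied speedup range. It uses only the numbers in the paragraph above; the helper names are illustrative.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate run is than the baseline."""
    if candidate_s <= 0:
        raise ValueError("candidate_s must be positive")
    return baseline_s / candidate_s


QWEN_RUN = to_seconds(minutes=8, seconds=53)   # Qwen 3.7 Max wall-clock time
BASELINE_LOW = to_seconds(hours=2)             # Kimi 2.6 / Mimo, low end
BASELINE_HIGH = to_seconds(hours=3)            # Kimi 2.6 / Mimo, high end

if __name__ == "__main__":
    low = speedup(BASELINE_LOW, QWEN_RUN)
    high = speedup(BASELINE_HIGH, QWEN_RUN)
    print(f"{QWEN_RUN} s, roughly {low:.1f}x to {high:.1f}x faster")
```

On the quoted figures that works out to somewhere between about 13x and 20x, with the "roughly 6 minutes API plus 3 minutes tool" split rounding to the same nine-minute total.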

The cost. Output tokens are $7.50 per 1M on Alibaba Model Studio. Compare that to DeepSeek V4 Pro at $0.87/M output tokens (roughly 8.6x cheaper) and to Kimi 2.6's coding plan, where "you barely hit the weekly usage limit." The first test alone burned ~850,000 tokens; all three tests combined used 1.1M tokens. Translation: 3.7 Max is for multi-hour agent work where the cost is amortized across hundreds of tool calls, not for short one-shot prompts. (Note the gap with the $0.40 / $1.60 per million that a viewer reports for the smaller models on the Qwen API platform: 3.7 Max is the flagship premium tier, not the line rate.)
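The arithmetic behind those figures can be sketched as follows. One loud assumption: the source gives total token counts without an input/output split and does not quote an input rate for 3.7 Max, so this treats every token as billed at the output rate, which makes the results an upper bound rather than an actual invoice.

```python
QWEN_MAX_OUTPUT_RATE = 7.50   # USD per 1M output tokens, Alibaba Model Studio
DEEPSEEK_OUTPUT_RATE = 0.87   # USD per 1M output tokens, DeepSeek V4 Pro


def output_cost(tokens, usd_per_million):
    """Cost in USD if all `tokens` were billed at the output rate (upper bound)."""
    return tokens / 1_000_000 * usd_per_million


PRICE_RATIO = QWEN_MAX_OUTPUT_RATE / DEEPSEEK_OUTPUT_RATE

if __name__ == "__main__":
    print(f"first test, upper bound:  ${output_cost(850_000, QWEN_MAX_OUTPUT_RATE):.2f}")
    print(f"all three, upper bound:   ${output_cost(1_100_000, QWEN_MAX_OUTPUT_RATE):.2f}")
    print(f"same volume on DeepSeek:  ${output_cost(1_100_000, DEEPSEEK_OUTPUT_RATE):.2f}")
    print(f"output price ratio:       {PRICE_RATIO:.1f}x")
```

Even as an upper bound, the three-test total stays in single-digit dollars; the ratio matters once the same workload runs daily or across many agents.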

The free-tier trap. Model Studio gives you 1M free tokens, but you must enter payment details up front, and the platform "can charge your card automatically" once the quota is gone. Enable "free quota only" before running anything — this is the single most common foot-gun for anyone trying 3.7 Max for the first time.
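As a belt-and-braces measure alongside the platform toggle, a client-side counter can refuse requests once a local token budget is spent. This is a minimal sketch under stated assumptions: `FreeQuotaGuard` is a hypothetical helper, it does not talk to Model Studio, and the token estimates you feed it are your own, so it complements the "free quota only" switch rather than replacing it.

```python
class FreeQuotaGuard:
    """Client-side counter that refuses calls once a token budget is spent."""

    def __init__(self, budget_tokens=1_000_000):
        self.budget = budget_tokens
        self.used = 0

    def remaining(self):
        """Tokens left before the local budget is exhausted."""
        return self.budget - self.used

    def reserve(self, estimated_tokens):
        """Record a planned request, or raise if it would exceed the budget."""
        if estimated_tokens < 0:
            raise ValueError("estimated_tokens must be non-negative")
        if self.used + estimated_tokens > self.budget:
            raise RuntimeError(
                f"request of {estimated_tokens} tokens exceeds remaining "
                f"{self.remaining()} free tokens"
            )
        self.used += estimated_tokens


if __name__ == "__main__":
    guard = FreeQuotaGuard()
    guard.reserve(850_000)          # size of the review's first test
    print(guard.remaining())        # what is left of the 1M quota
    try:
        guard.reserve(250_000)      # the rest of the review's 1.1M total
    except RuntimeError as err:
        print("blocked:", err)
```

The usage example mirrors the review: the first test fits inside the free quota, but the full three-test run would not.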

Why it holds up on long runs. Alibaba trained 3.7 Max as an "agent first design" backbone meant to run "tens of hours and hundreds of tool calls without collapsing." Training mixed frameworks including Claude Code, OpenClaw, Qwen Code, Kilo Code, Open Code, and Hermes agent, and the reward function rewards "continued progress over long runs" while suppressing reward-hacking. The empirical result: less context rot and more consistent tool-call reasoning across hour-long sessions.

The token-plan caveat. Alibaba's token plan is "credits per seat per month," not a refreshable usage limit like Kimi's. Simple tasks "can finish up your credits way faster than you think," so estimate monthly usage precisely or stay on pay-per-use.

Independent audience confirmation. @chyldstudios (1 like, 2026-05-22) on a sister video: "i'm already using qwen 3.7 max with opencode via openrouter. it slaps." @henrytuttle (0 likes, 2026-05-24), also on a sister video: "I've tried qwen 3.7 max for helping with brainstorming. It's powerful, but weird. Very chinese. It comes up with ideas that a neural net primarily trained in english wouldn't come up with." That pair — speed + cultural-distinctness on long-running creative work — is the lived-experience version of the channel's "agent first design" framing.

Qwen 3.7 Plus is SO POWERFUL! (Real Tests and Review)

Qwen 3.7 Plus is the one the channel actually picks for IDE work, and the headline reason is native vision, not a bolted-on plugin. The four capabilities the channel flagged:

GUI + CLI hybrid reasoning — look at a settings panel, drop into the terminal to fix the config in one shot.
Visual-to-code — Figma screenshot to working SVG/HTML.
Dense OCR for "infographics, subway maps, and complex charts."
Search-augmented visual QA that uses live web search instead of the training cutoff.

The benchmark numbers (with caveats): on Baby Vision (visual understanding), Qwen 3.6 Plus scored 37.4 and Qwen 3.7 Plus hits 64.7, "even beating Gemini 3.1 Pro." Apex math reasoning "tripled in score," Agents-on-SkillBench is up over 10 points, and it beats Opus 4.6 and Gemini on most multimodal search and visual QA tests.

The benchmark caveat that matters. The Qwen team's published comparisons are against Opus 4.6 Max, even though "Opus 4.8 came out last week" and is materially stronger. Treat the Qwen-team numbers as a Qwen-vs-Qwen progression plus a directional lead over Opus 4.6, not as a current-frontier comparison.

Plus vs Max decision rule:

Max is the text-only flagship built for "extreme long horizon autonomy" — pick it for terminal-only workflows, overnight refactors, or single outputs above 32,000 tokens.
Plus is a "real multimodal workhorse" at 40% lower token cost — pick it for front-end prototypes, hybrid GUI+terminal agents, OCR, or agent swarms.

Both speak the Anthropic API protocol, so switching is "one line" in Qwen Code. The default for most IDE work is Plus; reserve Max for the long-horizon terminal runs.

Plus vs Max in production — the three real workloads

The Plus/Max decision is the one every Qwen user actually has to make, and the channel's hands-on boils it down to three real workloads:

Visual-to-code with a Figma screenshot. Plus wins. The 64.7 Baby Vision score (vs 37.4 on 3.6 Plus) and the dense OCR pipeline are the load-bearing features. Max is text-only; you'd be paying for capability you can't use.
Terminal-only overnight refactor. Max wins. The "agent first design" backbone handles hour-long sessions without collapsing, and the 8m 53s benchmark on the ancient Chinese 3D building is the data point. Plus handles terminal work fine but is built for multimodal first, and you'll pay the multimodal tax on every turn.
Front-end prototype with screenshots + console + npm commands. Plus wins. The GUI+CLI hybrid reasoning is the load-bearing feature — look at a settings panel, drop into the terminal, fix the config in one shot. Max can do it but Plus is the workhorse the channel actually picks.

The 32,000-token-output rule is the only one of the channel's thresholds that maps to a specific Plus/Max line: outputs above 32K tokens go to Max; below that, Plus handles the response. The channel's "one line" switching claim (both speak the Anthropic API protocol) means the routing change is a model-header swap in Qwen Code, not a config-file rewrite.
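The decision rule can be written down as a small routing function. This is a sketch of the rule as described here, not the channel's or Alibaba's code: the `Workload` fields and the "plus" / "max" labels are illustrative, and sending vision workloads to Plus even when outputs exceed 32K is a judgment call that follows from Max being text-only.

```python
from dataclasses import dataclass

MAX_OUTPUT_THRESHOLD = 32_000  # single-output token threshold quoted above


@dataclass
class Workload:
    needs_vision: bool            # screenshots, OCR, Figma, GUI inspection
    terminal_only: bool           # no GUI or image input at all
    expected_output_tokens: int   # largest single output you expect


def pick_model(w: Workload) -> str:
    """Return 'plus' or 'max' following the Plus/Max rule described above."""
    if w.needs_vision:
        return "plus"   # Max is text-only, so vision work has to go to Plus
    if w.expected_output_tokens > MAX_OUTPUT_THRESHOLD:
        return "max"    # long single outputs go to Max
    if w.terminal_only:
        return "max"    # terminal-only long-horizon runs go to Max
    return "plus"       # default for IDE work


if __name__ == "__main__":
    print(pick_model(Workload(True, False, 4_000)))     # Figma screenshot to code
    print(pick_model(Workload(False, True, 50_000)))    # overnight refactor
    print(pick_model(Workload(False, False, 2_000)))    # ordinary IDE chat
```

Keeping the rule in one function makes the routing explicit and easy to adjust if the pricing gap between the two models changes.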

The "Max is one of the most expensive flagship models" framing (§6.2's headline caveat) is the real cost check: $7.50/M output tokens is roughly 8.6x DeepSeek V4 Pro's $0.87/M, and the review's three tests burned 1.1M tokens between them (~850,000 on the first alone). Translation: if your workload fits inside 32K outputs and the prompt isn't terminal-only, Plus is the right answer. Reserve Max for the workloads where the 8m 53s speed actually matters.

Qwen 3.5 Setup on Your Local Computer (Step-by-Step Guide)

The local-inference video, and the most concrete of the Qwen setup guides. The walkthrough is on LM Studio (cross-platform Windows / Mac), with the 9B / 6 GB model, full GPU offload, on a PC with an Nvidia 3060. Download and install are a literal next, next, next flow. Throughput on the host's 3060 is ~37 tokens/second for the 9B build. Larger Qwen 3.5 checkpoints go up to 22 GB with partial GPU offload; "the more RAM you have, the higher quality you can load up."

The car-wash sanity check breaks the local build. The recurring prompt is "I have a car wash 50 meters away from me. Should I walk or drive?" Qwen 3.5 got stuck reasoning for ~2 minutes, eventually produced a gaslight-y response about "walking is usually the more logical choice," then generation failed. For contrast, the host notes Grok 3 said "drive" on the same prompt and flags the failure as "concerning." The creator's bottom line is "I'm still gonna stick with Opus," and he labels local Qwen "play at your own risk": fine for "learning apps" or basic tasks in the style of "I'm on keto. Make me a meal plan.", not for decisions you actually care about.

The actionable fix. Toggling "disable think" in LM Studio made Qwen 3.5 finally answer the car-wash prompt without looping. That setting is the single most useful lever on the local build for short factual prompts. On Ollama specifically: the host tried the "latest 3.5 model" listing on Ollama during the setup walkthrough and "it didn't load" — so the channel's stated preference is LM Studio over Ollama for Qwen 3.5 on both Mac and Windows.

Qwen 3.5 in YOUR BROWSER (Setup Guide)

The browser-side path, and the only "you can do everything for free with your own graphics card" route in the Qwen coverage. Eric's project loads the Qwen model directly into your GPU through the browser's WebGPU API — no LM Studio, no llama.cpp, no Python environment. Modern browsers now have full GPU access, so this wasn't an option "back in the day."

The 3060 ceiling. The demo machine is a Windows PC with an NVIDIA 3060, capping usable models at ~8 GB or less. A Mac with unified memory gets around this; Windows users don't. The 27B / 37B parameter Qwen build is an even bigger pull, and on the host's connection the download is painfully slow.

The performance floor. The hard minimum is ~30 tokens/second. Below that, "you're not going to wait for it" — a 1 tok/s model feels like "30 seconds per hey." Thinking models are worse because they burn tokens planning before replying.
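The floor is easy to reason about with one division. The sketch below estimates wall-clock time for a reply from its token count and the generation speed; the 30-token reply is inferred from the "30 seconds per hey" quote at 1 tok/s, and the 2,000 thinking tokens in the last example are a hypothetical figure chosen only to show the effect.

```python
def reply_seconds(reply_tokens, tokens_per_second, thinking_tokens=0):
    """Seconds to generate a reply, including any hidden thinking tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return (reply_tokens + thinking_tokens) / tokens_per_second


if __name__ == "__main__":
    print(reply_seconds(30, 1))     # a short reply at 1 tok/s
    print(reply_seconds(30, 30))    # the same reply at the 30 tok/s floor
    # Hypothetical: 2,000 planning tokens before the same short reply.
    print(round(reply_seconds(30, 30, thinking_tokens=2_000), 1))
```

The third case shows why thinking models feel slower even above the floor: the planning tokens are generated at the same speed as the visible reply.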

Privacy and cost angle. Local = nothing leaves your device. Caveat: the WebGPU pipeline still pulls a multi-GB model over the network, so a fully airgapped local llama.cpp install is strictly better for data-sovereignty on Windows. The creator's verdict: browser is for experimentation and "the lazy guys" who don't want to install anything. For actual coding or running larger Qwen 3.5 variants, run a local LM Studio or llama.cpp server.

Qwen 3.5 Local Model Review (Is it Good?)

The dedicated review of the smallest Qwen 3.5 variant, "just under 7 gigabytes," run locally via LM Studio. It loads comfortably into a consumer Nvidia GPU (the transcript says "Nvidia 360," presumably the 3060 used in the other videos), and token output is visibly fast on that setup. A larger 22 GB sibling exists but eats enough VRAM that a Mac user needs "36 32 GB of RAM" (a transcript stumble, apparently meaning 32 to 36 GB) to keep the OS and Chrome running alongside it.

Where it works. For general-knowledge Q&A — "I'm on keto. Make me a meal plan.", "what is the stock market," "what do you know about Wall Street bets" — the model is responsive. The creator calls it useful for "heartbeat" style tasks and for privacy-sensitive users who "don't want to send their information outside of a local environment." The framing: "foot soldier" agents spawned in volume, not a primary brain.

Where it breaks. The signature test is the same car-wash prompt. Non-thinking mode answers "walking," which is wrong because you can't wash a car that isn't there. Switching on thinking mode makes it worse: it spirals into "what's the likely scenario" style regression, burns tokens, and still lands on "walk." The host notes "Opus" and "MiniMax" routinely get this right, and that Qwen "was thinking for so long" on a question a human answers instantly.

The creator's verdict in his own words: "still struggling on some basic questions" and "I wouldn't really use it on a day-to-day basis." Better than Gemma 3 in their testing, but not by enough to displace a frontier model. Treat local Qwen 3.5 as privacy infra for non-critical work, not as a cost saver — you're paying in electricity and tokens, just at a different meter.

Why the car-wash test keeps showing up

The "I have a car wash 50 meters away. Should I walk or drive?" prompt is the channel's recurring sanity check for reasoning degradation, and the Qwen 3.5 local builds are where it fails most consistently. The question is intentionally a trap: the surface answer is "walk, it is only 50 meters," but the correct answer is "drive," because the point of going to a car wash is to wash the car, and you can't wash a car that isn't there. A model that ignores that load-bearing context (you have a car to wash) is the model the channel wants to flag.

The pattern of failures:

Local Qwen 3.5 (9B, non-thinking): answers "walking" — wrong, because you can't wash a car that isn't there. The §6.2 verdict: "concerning."
Local Qwen 3.5 (9B, thinking on): spirals into "what's the likely scenario" regression, burns tokens, lands on "walk" — same wrong answer, more tokens spent.
Grok 3: answers "drive" — correct.
Opus / MiniMax: routinely get it right.

The reason this test matters for the routing table is that the failure mode is modeling, not knowledge. A small local model trained on too little data about edge cases will reach for the most common-sense answer ("walk 50 meters") and miss the load-bearing context. The channel's framing: a single consumer GPU "just ain't enough" for serious workloads; the frontier-model feel comes from clusters, not from a 3060.

The actionable fix from §6.2: toggle "disable think" in LM Studio. The thinking mode is what makes the local build loop on the car-wash question; without it, the model answers "walking" (still wrong, but at least it's not a 2-minute loop). The channel's verdict: local Qwen 3.5 is fine for heartbeat tasks and privacy-sensitive work, but "I wouldn't really use it on a day-to-day basis." Treat it as privacy infra for non-critical work, not as a cost saver.
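The check is simple enough to script against any local model. The sketch below is one possible harness, not the channel's: `ask` is a hypothetical callable you supply that sends a prompt to your model and returns its text, and the keyword matching in `verdict` is a crude heuristic that will label mixed or hedged answers "unclear."

```python
import re

CAR_WASH_PROMPT = (
    "I have a car wash 50 meters away from me. Should I walk or drive?"
)


def verdict(answer):
    """Classify an answer as 'pass' (drive), 'fail' (walk), or 'unclear'."""
    text = answer.lower()
    says_drive = re.search(r"\bdriv(e|ing)\b", text) is not None
    says_walk = re.search(r"\bwalk(ing)?\b", text) is not None
    if says_drive and not says_walk:
        return "pass"
    if says_walk and not says_drive:
        return "fail"
    return "unclear"


def run_check(ask):
    """Send the car-wash prompt through `ask` and classify the reply."""
    return verdict(ask(CAR_WASH_PROMPT))


if __name__ == "__main__":
    # Stub standing in for a real model call.
    stub = lambda prompt: "Walking is usually the more logical choice."
    print(run_check(stub))
```

Run it once with thinking on and once with it off; an "unclear" result is a cue to read the raw answer rather than trust the label.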

Try it yourself

Toggle "free quota only" on Alibaba Cloud Model Studio before your first request. Enter payment details for the 1M free tokens, but flip the quota-only switch on. Otherwise the card auto-charges the moment the free tier is gone.
Reproduce the 8m 53s benchmark. Run the ancient Chinese 3D building prompt on Qwen 3.7 Max and log wall-clock time plus tool-call count. If you don't get a clean roof in under 15 minutes with <20 tool calls, your environment is the bottleneck.
Default to Qwen 3.7 Plus for IDE work, Max for terminal-only. Use Plus for visual-to-code, OCR, GUI+CLI hybrid reasoning, and agent swarms. Reserve Max for terminal-only overnight refactors and outputs above 32,000 tokens.
Test the Plus / Max swap in Qwen Code. Both speak the Anthropic API protocol, so changing the model header is a one-line edit. Run the same prompt on each and compare speed, token use, and output quality.
Install the 9B Qwen 3.5 on LM Studio for local experimentation. Pick the 9B / 6 GB build first on a 3060-class GPU; expect ~37 tokens/second with full GPU offload. Skip Ollama's "latest 3.5 model" listing — the channel notes it didn't load in their setup.
Disable thinking mode for short factual prompts. The 9B Qwen 3.5 loops for ~2 minutes on the car-wash question with thinking on. Toggle "disable think" in LM Studio and the loop clears.
Cap browser WebGPU runs at ~8 GB on a 3060. For anything larger, use a local LM Studio / llama.cpp server. If you're on a Mac with unified memory, you can push higher.
Run the car-wash sanity check on every local model you install. It's a free, fast litmus test for reasoning degradation. If your model answers "walk" with thinking on, disable the toggle before trusting any output.
Don't trust the "Qwen 3.6 Plus free on Hermes" video title in 2026. Before wiring a free-tier API key into Hermes, check the top-liked comments on Nqs_5RLg6QA (or just try creating the key) — the free period has ended and free key creation is no longer available. Plan the route at the current Qwen API platform rate (~$0.40 input / $1.60 output per million tokens) instead of the deprecated "free" framing.
Use the /verbose flag in Hermes Agent when you do wire Qwen 3.6 Plus in. A viewer (@Jayander-n3k, 2026-05-20) called it "a real eye opener," and it is the only concrete workflow lever from the video that survives the free-tier correction.

Common pitfalls

Forgetting the "free quota only" toggle on Model Studio. The 1M free tokens are real, but the card auto-charges the moment the quota is gone. The toggle is the only thing between "free trial" and "surprise bill."
Comparing Qwen-team benchmarks to current frontier models. The Qwen team published comparisons against Opus 4.6 Max even though "Opus 4.8 came out last week." Treat the numbers as a Qwen-vs-Qwen progression plus a directional lead over Opus 4.6, not as a current-frontier comparison.
Using Qwen 3.7 Max for short one-shot prompts. At $7.50 per 1M output tokens, it is "one of the most expensive flagship models." The cost is justifiable on multi-hour agent runs, not on quick chat. Route chat to Plus or to a cheap model.
Expecting browser-WebGPU to replace a local server. The 3060 caps browser runs at ~8 GB, the download is slow, and below 30 tok/s the experience collapses. WebGPU is a 2026 flex, not a daily driver.
Running thinking mode on local Qwen 3.5 for short prompts. The 9B build loops for ~2 minutes on the car-wash question with thinking on. Disable the toggle for short factual prompts.
Adopting Ollama over LM Studio for Qwen 3.5. The host's stated preference is LM Studio on both Mac and Windows, and the Ollama "latest 3.5 model" listing failed to load in their hands-on.
Treating local Qwen 3.5 as a cost saver. You pay in electricity, VRAM, and your own time. A single consumer GPU "just ain't enough" for serious workloads; the frontier-model feel comes from clusters, not from a 3060.
Picking Plus for terminal-only autonomy. Plus is the multimodal workhorse; Max is the "extreme long horizon autonomy" pick. If your workload pushes past 32,000-token outputs, you want Max, not Plus.
Trusting the "Qwen 3.6 Plus free on Hermes" video title in 2026. The free period has ended and the Hermes portal no longer creates free-tier API keys. The 3.6 Plus capability profile still justifies the orchestrator slot per the public.ai_models row, but plan around the current Qwen API platform price (~$0.40 / $1.60 per million tokens), not the deprecated "free" framing.
Ignoring the Qwen-vs-Qwen lineage. The channel's hands-on with 3.6 Plus set up the orchestrator-slot framing; 3.7 Max is the speed story; 3.7 Plus is the IDE workhorse call. Read them as a sequence, not as three independent reviews — and use the public.ai_models row + viewer-comment thread on Nqs_5RLg6QA to keep the 3.6 Plus slot honest now that the free route is closed.

Sources

Qwen 3.7 Max is ACTUALLY INSANE! (Real Tests and Review) — 5,470 views · video_id: 2gDB-2ifLPw
Qwen 3.7 Plus is SO POWERFUL! (Real Tests and Review) — 4,290 views · video_id: 5L4W_KI3ca0
Qwen 3.6 Plus is FREE on Hermes Agent (USE it like this) — 3,577 views · video_id: Nqs_5RLg6QA · summary/transcript null in DB · section grounded via public.ai_models.qwen-3-6-plus and 9 top-liked viewer comments in public.youtube_comments (free-period-ended correction)
Qwen 3.5 Setup on Your Local Computer (Step-by-Step Guide) — 6,145 views · video_id: 4d1TOu-1Umk
Qwen 3.5 in YOUR BROWSER (Setup Guide) — 4,150 views · video_id: HM2W-lvUMok
Qwen 3.5 Local Model Review (Is it Good?) — 1,946 views · video_id: yh3oWLVPYYw
Supabase query — SELECT video_id, title, views, summary_content, summary_key_takeaways FROM public.videos WHERE video_id = ANY(ARRAY['2gDB-2ifLPw','5L4W_KI3ca0','Nqs_5RLg6QA','4d1TOu-1Umk','HM2W-lvUMok','yh3oWLVPYYw']); against project ttxdssgydwyurwwnjogq.