DeepSeek v4 Flash + the "thinking inside tool calls" pattern - Chinese Open-Weight Models: Kimi, Qwen, DeepSeek, GLM

DeepSeek is the family the channel credits with the structural pattern it keeps coming back to for long agent loops: "thinking inside tool calls." The model reasons while deciding which tool to invoke, self-corrects mid-execution based on tool results, and skips the redundant reasoning passes that other models run between every tool call. That's the technical reason DeepSeek wins long agent loops over heavier models, and it's the reason the channel's §6.3 verdict sharpens into: stop chasing flagship models for every agent task. V4 Flash is currently free on Nous Portal, is the most-used model on Nous, accounts for the most tokens consumed on Hermes Agent this month, and fills the executor slot the channel hands most of its refactor work to.

This article covers the single video the channel has produced on DeepSeek (4,893 views) and zooms out to the routing rule the channel has settled on for V4 Flash.

What you'll learn

Why V4 Flash is the channel's go-to for "boring but expensive" agent work: Python refactors, repo mapping, and long-document summarization — and why matching V4 Pro's output at roughly a quarter of the cost is a routing decision, not a downgrade.
The five-test hands-on that grounds the "free but capable" pitch: a 453-line Python refactor, a FastAPI repo map, the Einstein zebra puzzle, a 3-paper academic summary, and 3 parallel research sub-agents.
The "default tier" foot-gun in the Hermes TUI — and the two setting changes (reasoning xhigh, verbose verbose) that stop V4 Flash from being benchmarked below its real level.
Where V4 Flash falls apart: full-codebase analysis in one shot, vision/image work, one-shot creative prompts, and orchestration. Treat the list as a routing rule, not a vibe.
The cache trick that makes V4 Flash the top model by tokens consumed on Hermes Agent this month, and why "free" is really "free if you cache."

The channel's verdict on DeepSeek (mid-2026)

DeepSeek's coverage on the channel is a single video, but the framing is direct: stop chasing flagship models for every agent task. V4 Flash is currently free on Nous Portal, even on the free subscription plan, and is the most-used model on Nous and the top model by tokens consumed on Hermes Agent. Its intelligence rank is #10 out of 87 models (behind Gemini 2.6 at #1 and V4 Pro at #5), its speed rank is #4 at ~134 output tokens/second, and its context window is 1M tokens with hybrid attention. The channel's read: a lightweight model in the top 10 "speaks a lot of volume about its competence," and Flash winning ~35% of real tasks at roughly a quarter of V4 Pro's cost means the default executor route has shifted within DeepSeek's own lineup.

DeepSeek v4 Flash + Hermes Agent = Surprisingly STRONG

The video's anchor is a power-user test: V4 Flash wins roughly 35% of real tasks while costing roughly a quarter of what V4 Pro does. On five coding problems specifically, Pro spent roughly 4x as many output tokens but produced identical answers. That's the routing rule distilled into a single sentence: when Flash and Pro produce the same answer, Flash was the right call.
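The "same answer, fewer tokens" rule can be written down as a small comparison. This is a minimal sketch, not the channel's tooling: the `RunResult` type, the `pick_route` helper, the model labels, and the token counts are all hypothetical, and exact string equality is a stand-in for whatever "identical answers" means in your own evaluation.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    model: str          # hypothetical label, e.g. "v4-flash"
    answer: str         # final answer text produced by the run
    output_tokens: int  # output tokens the run consumed


def pick_route(flash: RunResult, pro: RunResult) -> str:
    """Route to Flash only when it matches Pro's answer without costing more."""
    same_answer = flash.answer.strip() == pro.answer.strip()
    if same_answer and flash.output_tokens <= pro.output_tokens:
        return flash.model
    return pro.model


# Illustrative numbers only: Pro spends 4x the output tokens for the same answer.
flash_run = RunResult("v4-flash", "refactor complete", 1_000)
pro_run = RunResult("v4-pro", "refactor complete", 4_000)

chosen = pick_route(flash_run, pro_run)
token_ratio = pro_run.output_tokens / flash_run.output_tokens
print(chosen, token_ratio)
```

If the answers differ, the sketch falls back to Pro, which mirrors the advice to escalate whenever Flash misses something Pro catches.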

The five-test hands-on is the part that gives the "surprisingly strong" framing teeth. Each test targets a different job the channel actually runs on Hermes, and the results map cleanly onto the routing rule above.

1. Python refactor on a 453-line data collector. No rate limit hit, and the model identified hardcoded API keys in one pass. This is the test that earns the "executor" framing — it's the kind of work you'd normally route to a flagship, and Flash handled it without escalation.

2. Repo mapping on a FastAPI GitHub repo. Completed in roughly 12 minutes at about 1,900 tokens. A local 50-file project finished in roughly 1 minute at about 1,200 tokens. The implication: for "show me how this codebase fits together" tasks on a focused, local project, Flash is fast enough that wall-clock time is not the bottleneck.

3. Einstein's zebra puzzle. Passed with long logical chains, no fact-check needed by the creator. This is the test the channel uses as a litmus for reasoning degradation — Flash didn't loop, didn't gaslight, didn't shortcut the logic.

4. Long-context summarization across 3 academic papers (2 PDF, 1 HTML). Extracted the top 5 action items from each in roughly 2 minutes and output a merged Markdown with cross-cutting themes. The 1M context window is the headline feature that makes this test possible on a single pass; hybrid attention is what keeps it cheap.

5. Sub-agent delegation. Flash spawned three parallel research sub-agents on its own (Python async, SQLite vs PostgreSQL, HTMX vs React). The output was solid, but the run took 14 minutes. This is the test the channel flags as the "where it loses" data point, and the operative rule is to feed long docs directly to the main agent instead of delegating.

The list of jobs to not route to V4 Flash is concrete and worth quoting in full: full code base analysis in one shot, vision/image tasks (called "pretty inconsistent"), one-shot prompts, creative writing, and orchestration. Treat that list as a routing rule, not a vibe.

The "thinking inside tool calls" pattern, unpacked

The single most important technical concept in this article is the "thinking inside tool calls" pattern. It's the structural reason DeepSeek's V4 Flash beats heavier models on long agent loops, and it's worth pausing on what the pattern actually is.

In a typical agent loop, the model alternates between two distinct phases: a "thinking" phase (chain-of-thought reasoning about what to do next) and an "acting" phase (issuing a tool call). Most models keep these phases separate. The model thinks, decides on a tool, issues the call, gets the result, thinks again, decides on the next tool, issues the next call, and so on. The redundant reasoning between tool calls is one of the biggest sources of latency and cost in long agent loops.

DeepSeek's "thinking inside tool calls" pattern collapses the two phases. The model reasons while deciding which tool to invoke, and self-corrects mid-execution based on tool results. The result: a single reasoning pass can inform a tool selection and a tool-result interpretation, eliminating the redundant reasoning pass that other models run between every tool call. The Hermes agent's 40+ tool environments (the typical DeepSeek workload) are where the savings compound — fewer turns, fewer redundant reasoning passes, fewer cron-job errors.
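To make the turn-count argument concrete, here is a toy counting model of the two loop shapes. It is a sketch under a stated assumption, not a description of DeepSeek's internals: it assumes a "separated" model spends two reasoning passes per tool call (one to choose the tool, one to interpret the result) while a "fused" model spends one. The function names and the 2:1 ratio are illustrative, not measured.

```python
from typing import Callable, List

Tool = Callable[[], str]


def separated_passes(tools: List[Tool]) -> int:
    """Count reasoning passes when thinking and acting are separate phases."""
    passes = 0
    for tool in tools:
        passes += 1       # pass 1: reason about which tool to call
        _result = tool()  # act: issue the tool call
        passes += 1       # pass 2: reason again to interpret the result
    return passes


def fused_passes(tools: List[Tool]) -> int:
    """Count reasoning passes when one pass covers selection and interpretation."""
    passes = 0
    for tool in tools:
        passes += 1       # single pass: choose the tool and absorb its result
        _result = tool()
    return passes


# A 50-tool-call session with stub tools that always succeed.
session: List[Tool] = [lambda: "ok"] * 50
separated = separated_passes(session)  # 100 under this toy model
fused = fused_passes(session)          # 50 under this toy model
print(separated, fused)
```

Under this assumption the gap grows linearly with the number of tool calls, which is why the savings show up most on long loops rather than short ones.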

Three concrete consequences for the routing table:

Long agent loops favor V4 Flash over heavier models. A 50-tool-call session on V4 Flash completes in fewer turns than the same session on a model that doesn't think inside tool calls. The cumulative savings are larger than the per-token price difference would suggest.
Cron jobs favor V4 Flash. Cron jobs fire on a schedule and re-run the same logical pattern. The fewer-turn pattern is structurally cheaper. The channel's framing: reliable scheduled task execution with "fewer errors than other models."
Self-correction mid-execution is a different quality axis. V4 Flash catches its own mistakes inside a tool call rather than after the call completes. That's why the Python refactor (test 1) found hardcoded API keys in one pass — the model reasoned about the data flow during the refactor tool call, not after.

The pattern isn't unique to DeepSeek — GLM 5.1's "self-tests instead of just spitting code" (§6.4) is a related idea — but the channel's framing puts DeepSeek first in this category. If your workload is an agent loop with many tool calls, "thinking inside tool calls" is the load-bearing reason V4 Flash is the right call.

The "default tier" foot-gun in the Hermes TUI

The single most actionable piece of setup advice in the video is buried in the middle: when you wire V4 Flash into the Hermes TUI, the defaults can benchmark you below V4 Flash's real level. The two setting changes the creator makes:

Reasoning: xhigh — not the default. Anything lower and the public benchmark numbers stop being representative of what your agent will produce.
Verbose: verbose — so you can see the chain-of-thought and confirm the model is actually thinking at the right tier, not silently degrading.
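A pre-benchmark check for those two settings might look like the sketch below. The dictionary shape and the key names `reasoning` and `verbose` are hypothetical; this is not Hermes's actual configuration format, only a way to make "flip these switches first" a step you cannot forget.

```python
from typing import Dict, List

# Values the video recommends before benchmarking; key names are hypothetical.
REQUIRED_SETTINGS: Dict[str, str] = {
    "reasoning": "xhigh",
    "verbose": "verbose",
}


def benchmark_setting_problems(settings: Dict[str, str]) -> List[str]:
    """Return one human-readable message per setting that is off-spec."""
    problems = []
    for key, wanted in REQUIRED_SETTINGS.items():
        actual = settings.get(key)
        if actual != wanted:
            problems.append(f"{key} is {actual!r}, expected {wanted!r}")
    return problems


# A session left on a lower reasoning tier with verbose output unset.
issues = benchmark_setting_problems({"reasoning": "medium"})
for issue in issues:
    print(issue)
```

An empty list means the run is at least configured the way the video's numbers were produced.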

This is the same class of foot-gun as the "free quota only" toggle on Alibaba Model Studio (§6.2) and the /swarm order-of-operations rule on Kimi 2.7 (§6.1) — the platform's defaults are tuned for a different user, and the channel's content is a steady stream of "flip this switch first" advice.

The cache trick that makes "free" actually free

V4 Flash accounts for the most tokens consumed on Hermes Agent this month because cached reads are nearly free. When you re-run an agent against the same long context (a repo, a contract, a paper set), the second pass hits the cache and the cost collapses. The channel's framing: "free" is really "free if you cache." The implication for routing: V4 Flash's price advantage compounds on workflows that share context across runs. If your workload is one-shot per request, the cache never kicks in and Flash is merely cheap, not free.
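The arithmetic behind "free if you cache" can be sketched as follows. The prices are made-up placeholders in hypothetical "credits per 1,000 tokens" (10 uncached, 1 cached), chosen only to show the shape of the saving; they are not Nous Portal or DeepSeek pricing, and `run_cost` is an illustrative helper.

```python
def run_cost(context_tokens: int, new_tokens: int, cache_hit: bool,
             price_per_k: int = 10, cached_price_per_k: int = 1) -> int:
    """Cost of one run in hypothetical credits, billed per 1,000 tokens.

    The shared context is billed at the cached rate on a cache hit;
    freshly generated or newly added tokens always pay the full rate.
    """
    context_rate = cached_price_per_k if cache_hit else price_per_k
    context_cost = (context_tokens // 1000) * context_rate
    new_cost = (new_tokens // 1000) * price_per_k
    return context_cost + new_cost


# Same 800k-token context, 2k new tokens per run.
first_run = run_cost(800_000, 2_000, cache_hit=False)   # 8020 credits
second_run = run_cost(800_000, 2_000, cache_hit=True)   # 820 credits
print(first_run, second_run)
```

With a one-shot workload every run looks like `first_run`, which is the case where Flash is cheap rather than free.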

Where V4 Flash loses — and how to route around it

The "where it loses" list is short and worth memorizing: full code base analysis in one shot, vision/image tasks ("pretty inconsistent"), one-shot prompts, creative writing, and orchestration. The repo-mapping test was a focused sub-task; asking Flash to swallow a whole codebase in one shot drops quality sharply. One-shot prompts are exactly the workload where cache doesn't kick in and the cost advantage disappears. The 14-minute sub-agent delegation test is the orchestration data point — Flash can run sub-agents, but it shouldn't be the orchestrator.

A useful rule the channel uses: reserve V4 Pro / Gemini 2.6 / Opus for codebases you don't understand, critical thinking, and one-shot creative work; V4 Flash's 1M context + hybrid attention handles everything else. The "everything else" category is bigger than it sounds — Python refactors, repo mapping, long-doc summarization, the Einstein-style logical puzzles, and cached re-reads of the same context. That's the bulk of agent work in practice, not the exception.

Try it yourself

Pick a representative Python refactor and run it on Flash and Pro side-by-side. The 453-line data collector in the video is the canonical test. If Flash produces the same answer at 1/4 the cost, your routing rule for that class of work is set. If Flash misses something Pro catches, route that class of work to Pro.
Set Hermes TUI reasoning to xhigh and verbose to verbose before benchmarking. Otherwise the public numbers are not representative of what your agent will produce. This is the single most common setup mistake for new V4 Flash users.
Cap a single prompt at around 60 iterations and under ~500 lines of Python. That's the safe envelope for the free tier's rate limit. Push past it and you'll get throttled mid-run.
Try a 3-paper long-doc summarization. Save the papers as PDF/HTML into a dedicated folder and pass the folder path — don't pass the URL, or Hermes scrapes only the abstract. This is the test that proves the 1M context window is real, not marketing.
Force a cache hit on your second agent run. Re-run a completed Hermes task with the same context (e.g., a repo map you already produced) and check the token counter. Cached reads are nearly free, and that's the mechanic that puts V4 Flash at the top of Hermes token consumption this month.
Compare sub-agent delegation vs direct long-doc input. The 14-minute sub-agent run vs the ~2-minute direct summarization is the headline. For research tasks on Flash, feed the docs to the main agent and skip the delegation pattern.
Build a routing table. List your common agent tasks (refactor, repo map, doc summary, creative writing, one-shot prompts, vision, full-codebase analysis) and assign each one to Flash, Pro, or "skip Flash" based on the video's wins/loses list. Treat the table as a config file, not a vibe.
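One way to treat the routing table as a config file is a plain mapping plus a lookup that refuses unlisted tasks. This is a sketch: the task keys and the `flash` / `skip_flash` labels are hypothetical names, and the assignments simply restate the video's wins and losses lists. Whether a `skip_flash` task goes to V4 Pro, Gemini 2.6, or Opus is left to you.

```python
# Assignments mirror the video's wins/loses lists; key names are hypothetical.
ROUTING_TABLE = {
    "python_refactor": "flash",
    "repo_map": "flash",
    "doc_summary": "flash",
    "logic_puzzle": "flash",
    "full_codebase_analysis": "skip_flash",
    "vision": "skip_flash",
    "one_shot_prompt": "skip_flash",
    "creative_writing": "skip_flash",
    "orchestration": "skip_flash",
}


def route(task_type: str) -> str:
    """Look up the route for a task; unlisted tasks raise instead of guessing."""
    if task_type not in ROUTING_TABLE:
        raise KeyError(f"no routing rule for task type: {task_type}")
    return ROUTING_TABLE[task_type]


print(route("python_refactor"))
print(route("vision"))
```

Raising on unknown task types is a design choice for this sketch: it forces every new class of work to get an explicit decision rather than a silent default.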

Common pitfalls

Running V4 Flash at the default Hermes TUI reasoning tier. The default benchmarks you below Flash's real level. Set reasoning to xhigh and verbose to verbose before you decide the model is bad.
Routing full-codebase analysis to Flash. The repo-mapping test was focused; one-shot full-codebase prompts degrade. Split the work or escalate to V4 Pro / Gemini 2.6 / Opus.
Pasting Google Scholar URLs into Hermes. Hermes scrapes only the abstract and misses the body. Save the paper as PDF or HTML into a dedicated folder and pass the folder path.
Delegating long-doc research to sub-agents on Flash. The 14-minute run vs the ~2-minute direct summarization is the data point. On Flash, the main agent is faster at long-doc work than sub-agents.
Treating "free" as a license to ignore rate limits. The safe envelope is around 60 iterations and under ~500 lines of Python per prompt. Push past that and the model throttles mid-run.
Using V4 Flash for vision/image, one-shot, creative, or orchestration work. All four are on the channel's "where it loses" list. Vision/image output is "pretty inconsistent"; one-shot prompts skip the cache, so the cost advantage disappears; and the 14-minute sub-agent run is the data point against using Flash as the orchestrator.
Forgetting the cache mechanic. If your workload doesn't share context across runs, you lose the price advantage that makes Flash worth picking over Pro in the first place. Confirm cache hits are actually firing on your re-runs.
Reading V4 Flash as a contradiction to the "you don't need Opus" thesis. They're the same thesis at different price points. V4 Flash is the route-against-flagship version of the Minimax 2.7 / M3 argument from Course 2 §2.3 — pick the model that wins your real tasks at the lowest cost, not the model with the highest benchmark score.
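The rate-limit envelope from the pitfalls above (around 60 iterations, under ~500 lines of Python per prompt) is easy to check before a run. A minimal sketch: the function name is hypothetical, and both thresholds are the article's approximate figures rather than documented limits, so they are exposed as parameters.

```python
def within_free_tier_envelope(source: str, iterations: int,
                              max_lines: int = 500,
                              max_iterations: int = 60) -> bool:
    """Return True if a prompt stays inside the approximate safe envelope.

    `source` is the Python text you plan to send in one prompt;
    `iterations` is the number of agent iterations you expect it to need.
    """
    line_count = len(source.splitlines())
    return line_count <= max_lines and iterations <= max_iterations


# A 453-line file (the size used in the video's refactor test) at 10 iterations.
sample = "x = 1\n" * 453
print(within_free_tier_envelope(sample, iterations=10))
```

If the check fails, splitting the file or the task is the cheaper fix compared with getting throttled mid-run.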

Sources

DeepSeek v4 Flash + Hermes Agent = Surprisingly STRONG — 4,893 views · video_id: s3Q9hvdlrmo