The Hermes tier list: orchestrators vs executors

If §3.1 was the framework, §3.2 is the application. The channel's most systematic ranking is the Top AI Models for Hermes Agent (Tier List) video, and the conceptual move worth internalizing is the two-slot model: an orchestrator (the brain) and an executor (the hands), with auxiliary models for support. The same model is rarely great at both, and a stack that uses the right model in the right slot is dramatically cheaper and more reliable than a stack that picks one model for everything.

This article walks through the orchestrator and executor slots, the question mark over Claude, and the auxiliary tier — and ends with the routing rules the tier list implies.

What you'll learn

The orchestrator/executor/auxiliary taxonomy from the Top AI Models for Hermes Agent (Tier List) video, and why the orchestrator/executor split is the conceptual heart of the channel's model choice framework.
The four orchestrator picks: GPT 5.4, Gemini 3.1 Pro, Qwen 3.6 Plus, and Kimi 2.5 — and what each one is best at.
The four executor picks: GLM 5.1, Minimax M2.7, DeepSeek V4 Flash, and Nemotron 3 Super — and the routing rule the channel uses to choose between them.
Why Claude Opus 4.6, Opus 4.7, and Sonnet 4.6 all sit in the question-mark tier: the "Anthropic pulled a fast one" controversy, the "Claude Opus is ACTUALLY UNUSABLE" benchmark, and the "Anthropic admits fault" follow-up.
The hot-swap mechanic in Hermes (the /model command) that lets you change models mid-task, and the routing rule the channel uses to decide when the orchestrator vs the executor is the bottleneck.

The conceptual move: orchestrator + executor, not one model for everything

The single most important framing in the channel's coverage is that an agent stack needs at least two models: an orchestrator that plans and reasons, and an executor that calls tools. Picking one model for both roles is the default mistake. The orchestrator role is dominated by long-horizon reasoning, instruction following, and chain-of-thought coherence across many turns. The executor role is dominated by tool-call reliability, formatting compliance, speed, and cost. A model that's great at one role is rarely great at the other.

The channel's Hermes Agent product makes the split explicit in the config file. The orchestrator slot takes a model that's good at planning (GPT 5.4, Gemini 3.1 Pro, Qwen 3.6 Plus, or Kimi 2.5 in the channel's tier list). The executor slot takes a model that's good at tool calls (GLM 5.1, Minimax M2.7, DeepSeek V4 Flash, or Nemotron 3 Super). The auxiliary slot fills in specialized tasks — web search, grounding, document processing — with a model that has a specific capability (Gemini 2.5 Flash, Gemini 3 Flash, Mimo V2 Pro).

The implication for tier-list building: when you score a model, score it for one of the three slots. A model that wins on the executor axes (speed, cost, tool reliability) but loses on the orchestrator axes (long-horizon reasoning, contradiction handling) is a perfect executor, not a failed orchestrator.
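A minimal sketch of per-slot scoring, assuming each model is rated on the axes the article names. The 0-10 scale, the equal weighting, and the example scores are hypothetical and are not taken from the channel's tier list.

```python
from dataclasses import dataclass

# Axis names follow the article; the 0-10 scale and all scores are hypothetical.
ORCHESTRATOR_AXES = ("long_horizon_reasoning", "contradiction_handling")
EXECUTOR_AXES = ("speed", "cost", "tool_reliability")


@dataclass
class ModelScores:
    name: str
    scores: dict[str, float]  # axis name -> score on a 0-10 scale

    def slot_score(self, axes: tuple[str, ...]) -> float:
        """Average the scores over one slot's axes only."""
        return sum(self.scores[a] for a in axes) / len(axes)


def best_slot(model: ModelScores) -> str:
    """Return the slot whose axes the model scores highest on."""
    orch = model.slot_score(ORCHESTRATOR_AXES)
    exe = model.slot_score(EXECUTOR_AXES)
    return "orchestrator" if orch > exe else "executor"


fast_cheap = ModelScores(
    name="hypothetical-fast-model",
    scores={
        "long_horizon_reasoning": 4.0,
        "contradiction_handling": 5.0,
        "speed": 9.0,
        "cost": 9.0,
        "tool_reliability": 8.0,
    },
)
print(best_slot(fast_cheap))  # executor
```

Scoring each slot on its own axes is what keeps a weak planner with strong tool reliability from being written off as a failed orchestrator.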

The orchestrator slot (the brain)

The orchestrator is the model that plans multi-step work, holds state across many turns, decides which executor to call and when, and recovers from executor failures. The channel's tier list names four orchestrators:

GPT 5.4 — the new king

GPT 5.4 is the new king of the orchestrator slot, replacing nerfed Claude Opus 4.6. It was "originally designed with native agentic workflows in mind." Michael has been "pretty happy with the output" from his Hermes agent on it, and the channel's own Top AI Models for Hermes Agent (Tier List) places GPT 5.4 at the top of the orchestrator slot.

The case for GPT 5.4 is concrete:

Designed for agentic workflows from the start, not retrofitted.
Consistent results across the channel's tests — the "reliable" pick, not the "smartest" pick.
The benchmark headline from the "Claude Opus is ACTUALLY UNUSABLE" video: GPT 5.4 hit 63% on the Boxmining benchmark, where Opus 4.6 hit 40% on the same suite. The benchmark was designed by Opus, so it should have favored Claude.
The community's response to the Opus regression has been to switch to GPT 5.4 in volume — the channel's read is that this is now the default orchestrator for new Hermes Agent setups.

The case against GPT 5.4, in the channel's words: "Not the absolute smartest, but most dependable." For tasks that need bleeding-edge reasoning, the channel still routes to Fable 5 or Qwen 3.7 Max. For tasks that need a steady, reliable orchestrator, GPT 5.4 is the pick.

Gemini 3.1 Pro — ties as top orchestrator, adds multimodal

Gemini 3.1 Pro ties as top orchestrator and adds native video and audio input, making it the go-to for screen recordings and structured dashboard extraction. The channel's tier list explicitly calls out the multimodal advantage:

Native video input — the orchestrator can watch a screen recording and reason about what it sees.
Native audio input — useful for call-center or voice-assistant workflows.
Visual dashboard extraction — the channel's tested use case.

The implication: if your orchestrator workload involves any visual or audio input, Gemini 3.1 Pro is the orchestrator pick. If your workload is text-only, GPT 5.4 is the simpler default.

Qwen 3.6 Plus — preserved thinking across turns

Qwen 3.6 Plus earns the third orchestrator slot because its chain-of-thought stays active on every response — there's no toggle, and reasoning persists across prior turns, which cuts contradictions on long-horizon tasks. The public.ai_models row in the channel's database (slug qwen-3-6-plus, tier S) describes the model as: "Always-on reasoning with preserved thinking across sessions — exceptional for long agentic tasks," with the explicit strengths "always-on reasoning trace, preserved thinking across sessions, reduces contradictions in long tasks, consistent decision-making, excellent for new Hermes Agent setups."

The unique feature — preserved thinking — is the differentiator. Other orchestrators (GPT 5.4, Gemini 3.1 Pro) can have reasoning toggled on or off, and the reasoning state doesn't always persist across turns. Qwen 3.6 Plus keeps the chain-of-thought alive across the entire session, which means:

Fewer contradictions in long-horizon tasks.
More consistent decision-making across many turns.
No "should I think or not?" decisions during the workflow.

The trade-off: slightly higher token usage (always-on thinking costs more), and the model may be overkill for simple tasks. The channel's framing: "excellent for new Hermes Agent setups" — pick Qwen 3.6 Plus for the orchestrator slot when consistency across long sessions is the load-bearing requirement.

The audience-corrected caveat on Qwen 3.6 Plus: the original "free on Hermes Agent" video title is misleading. The free period has ended and free API key creation is no longer available on the Hermes portal. The orchestrator-slot framing still holds, but you pay for it. A current viewer reports the Qwen API platform at "$0.40 input and $1.60 output" per million tokens — that's the number to plan around, not the deprecated "free" framing.
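To plan around the viewer-reported price, the arithmetic can be sketched as below. The per-million prices are the ones quoted above; the token counts for the example session are hypothetical.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one workload given per-million-token prices in USD."""
    return (input_tokens / 1_000_000) * input_price_per_m \
        + (output_tokens / 1_000_000) * output_price_per_m


# Viewer-reported Qwen API platform prices (USD per million tokens).
QWEN_INPUT_PRICE = 0.40
QWEN_OUTPUT_PRICE = 1.60

# Hypothetical long session: 2M input tokens, 500K output tokens.
session = cost_usd(2_000_000, 500_000, QWEN_INPUT_PRICE, QWEN_OUTPUT_PRICE)
print(f"${session:.2f}")  # $1.60
```

Always-on reasoning adds to the output side, so the output price is the one to watch when estimating long sessions.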

Kimi 2.5 — the orchestrator with a swarm

Kimi 2.5 rounds out the orchestrator slot thanks to its swarm agents feature, which can "self-direct a swarm of like about 100 sub agents" and coordinate up to 1,500 tool calls without a predefined workflow. The unique capability is the swarm coordination — the orchestrator can spawn sub-agents to handle parallel research and the model coordinates them natively.

The use cases the channel flags:

Frontend/UI generation — Kimi 2.5 is the channel's pick for visual work because the swarm can spawn parallel visual sub-agents.
Research workflows — the swarm can handle 100+ parallel sub-agents for fast research.
Image and screen analysis — Kimi 2.5 has native image input.

The trade-off: swarm coordination requires skill. The channel's mandatory rule for Kimi 2.7 (the successor) is: start the session with /swarm first, then prompt, then plan. If you prompt first and try to enable swarm later, you get a 2-agent stub and the swarm machinery never fully spins up. The order-of-operations rule is mandatory on the Kimi family.
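The four orchestrator picks can be condensed into a small selection function. This is a sketch: the flag names are hypothetical, and the priority order (swarm first, then media input, then long sessions) is an assumption made for illustration, since the tier list does not rank these rules against each other.

```python
def pick_orchestrator(needs_swarm: bool = False,
                      has_media_input: bool = False,
                      long_session: bool = False) -> str:
    """Map workload traits to one of the four orchestrator picks."""
    if needs_swarm:
        return "Kimi 2.5"        # swarm of parallel sub-agents
    if has_media_input:
        return "Gemini 3.1 Pro"  # native video and audio input
    if long_session:
        return "Qwen 3.6 Plus"   # preserved thinking across turns
    return "GPT 5.4"             # the dependable text-only default


print(pick_orchestrator())                      # GPT 5.4
print(pick_orchestrator(has_media_input=True))  # Gemini 3.1 Pro
```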

The "start with /swarm first" rule

The Kimi-specific rule is worth dwelling on because it's the kind of platform-specific gotcha that breaks agent stacks silently. The order of operations for Kimi 2.7:

Start the session.
Type /swarm and toggle it to yolo mode.
Then issue your prompt.
Then let the model plan.

If you reverse the order, the swarm machinery never fully spins up. The channel's first attempt at a Himalayan pink salt bottle design — Astro + Next.js, no /swarm enabled — produced a result the creator called "really, really bad." A second pass with the prompt trimmed down and /swarm toggled to yolo mode produced "nothing impressive, but it is certainly quite solid." The difference was the order of operations, not the prompt.

The rule generalizes: on any platform with a swarm or agent-team feature, the orchestration state must be set before the prompt. Reversing the order leaves the orchestration in a half-initialized state and the model produces degraded output. This is the same class of foot-gun as the "free quota only" toggle on Alibaba Model Studio (covered in §3.2's Qwen discussion) and the /status check on Minimax. The platform's defaults are tuned for a different user, and the channel's content is a steady stream of "flip this switch first" advice.
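The ordering rule can be illustrated with a toy model. The class below is not Kimi's session API; it is a hypothetical simulation of the behavior described above, where enabling swarm after the first prompt yields only a 2-agent stub.

```python
class ToySwarmSession:
    """Toy model of the ordering rule; not Kimi's real session API."""

    FULL_SWARM_AGENTS = 100  # "about 100 sub agents" per the article
    STUB_AGENTS = 2          # the degraded "2-agent stub"

    def __init__(self) -> None:
        self.prompted = False
        self.agents = 1  # a single agent until swarm mode is enabled

    def enable_swarm(self) -> None:
        # Enabling swarm after a prompt yields only the stub.
        self.agents = self.STUB_AGENTS if self.prompted else self.FULL_SWARM_AGENTS

    def prompt(self, text: str) -> int:
        """Record the prompt and return how many agents will work on it."""
        self.prompted = True
        return self.agents


# Correct order: swarm first, then prompt.
right = ToySwarmSession()
right.enable_swarm()
right_agents = right.prompt("design a landing page")

# Reversed order: prompt first, swarm later.
wrong = ToySwarmSession()
wrong.prompt("design a landing page")
wrong.enable_swarm()
wrong_agents = wrong.agents

print(right_agents, wrong_agents)  # 100 2
```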

The executor slot (the hands)

The executor is the model that calls tools reliably, follows formatting instructions, doesn't get clever, and runs cheaply enough to do the high-volume work the orchestrator hands off. The channel's tier list names four executors:

GLM 5.1 — the standout

GLM 5.1 is the standout executor. It one-shotted a space shooter benchmark that Opus 4.7 choked on, and survives Hermes' 85% auto-compaction events with strong context recovery — something Claude models failed at. Switching the entire agent fleet to GLM 5.1 is a one- to two-line edit in config.yml.

The case for GLM 5.1:

Survives Hermes' 85% auto-compaction events — the executor can recover from context loss better than the Claude models.
Coding-eval headline of 47.9 vs Opus 4.5's 45.3 — the first time a Chinese model has scored above Opus on a Z.ai-curated coding suite.
Self-tests during execution — the model now does agentic tool use and OpenClaw integration natively, not just spitting code.
Pricing: $10/month monthly or $7/month yearly on Z.ai's Light plan, with "3x the usage of Claude Code."

The case against GLM 5.1: chat / Q&A regressed vs GLM 5, and Z.ai itself only markets 5.1 to coders. Use it for execution, not for Q&A. Also: launch-week slowness is real (2.5k-token stalls inside Claude Code, web search turned off). The first 72 hours of a new GLM release is "warm-up," not production.

Minimax M2.7 — the trained-on-OpenClaw executor

Minimax M2.7 is a strong executor (not an orchestrator) because it was "trained on the OpenClaw Agent Harness framework," the same lineage as Hermes. Xiaomi is an official Nous Research team partner.

The case for Minimax M2.7:

Architecture optimized for the OpenClaw harness — M2.7 is explicitly an "agentic coding model," not a general chatbot.
1/16th the cost of Opus per million input tokens (~$0.30/M vs Opus at $5/M).
M2.5 already scored 76.8% on BFCL, and M2.7 is positioned to push the score further.
The channel's daily-use executor pick for cost reasons.

The case against Minimax M2.7: poor planning, context degradation above 120K tokens. Use it for the executor slot, not for orchestrator work. Also: the "dumb zone" failure mode shows up at 300 lines of soul.md — cap it at 15–30 lines and trim agents.md.
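The two limits above can be turned into a pre-flight check. This is a sketch: the function name and the warning messages are hypothetical, and the thresholds are the ones quoted in this section (120K tokens of context, an upper cap of 30 lines for soul.md).

```python
MAX_CONTEXT_TOKENS = 120_000  # degradation threshold quoted above
SOUL_MD_MAX_LINES = 30        # upper end of the 15-30 line cap


def minimax_warnings(context_tokens: int, soul_md_text: str) -> list[str]:
    """Return human-readable warnings; an empty list means no limit is hit."""
    warnings = []
    if context_tokens > MAX_CONTEXT_TOKENS:
        warnings.append(
            f"context is {context_tokens} tokens, above {MAX_CONTEXT_TOKENS}"
        )
    line_count = len(soul_md_text.splitlines())
    if line_count > SOUL_MD_MAX_LINES:
        warnings.append(
            f"soul.md has {line_count} lines, above {SOUL_MD_MAX_LINES}"
        )
    return warnings


print(minimax_warnings(150_000, "rule\n" * 300))
print(minimax_warnings(50_000, "be concise\nuse tools"))  # []
```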

DeepSeek V4 Flash — the "stop chasing flagships" executor

DeepSeek V4 Flash earns its executor slot by handling high-volume work cheaply, with the channel's framing: "stop chasing flagships for every agent task." V4 Flash is currently free on Nous Portal, ranks #10 of 87 on intelligence and #4 on speed at ~134 output tokens/second, and offers a 1M-token context with hybrid attention. The channel's read: a lightweight model in the top 10 "speaks a lot of volume about its competence." With Flash winning ~35% of real tasks at roughly a quarter of V4 Pro's cost, the default executor route has shifted within DeepSeek's own lineup.

The case for V4 Flash:

Top-10 intelligence, top-4 speed — the most-used model on Nous and the model with the highest token consumption on Hermes Agent this month.
1M context with hybrid attention — long-doc work in a single pass.
~4x cheaper than V4 Pro on the same task.
Cached reads are nearly free — re-running an agent against the same long context collapses the cost.

The case against V4 Flash: it loses on full-codebase analysis in one shot, vision/image tasks, one-shot prompts, creative writing, and orchestration. The "where it loses" list is concrete; treat it as a routing rule, not a vibe.
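Treating the "where it loses" list as a routing rule can be as simple as a set lookup. In this sketch the task-type labels are hypothetical names for the five categories listed above.

```python
# Hypothetical labels for the five task types where V4 Flash loses.
FLASH_WEAK_TASKS = frozenset({
    "full_codebase_analysis",
    "vision",
    "one_shot_prompt",
    "creative_writing",
    "orchestration",
})


def flash_is_suitable(task_type: str) -> bool:
    """True when the task is not on the 'where it loses' list."""
    return task_type not in FLASH_WEAK_TASKS


print(flash_is_suitable("tool_calls"))  # True
print(flash_is_suitable("vision"))      # False
```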

Nemotron 3 Super and Step 3.5 Flash — open-weight executor picks

Nemotron 3 Super and Step 3.5 Flash fill the remaining executor slots — both open-weight, with Nemotron self-hostable behind a Nemo Claw privacy wrapper. Use them when:

Privacy is critical and you need to self-host.
The cost of API calls is prohibitive.
You want to avoid vendor lock-in.

The trade-off: setup complexity, smaller community, and slower community-driven improvement cycles. The channel's framing: open-weight is the right pick for pro developers with privacy-critical workflows, not the default for new users.

The question mark over Claude

Opus 4.6, Opus 4.7, and Sonnet 4.6 all sit in the question-mark tier on the channel's tier list. The reasoning comes from the "Anthropic pulled a fast one" controversy (24,059 views) and the "Claude Opus is ACTUALLY UNUSABLE" benchmark (21,675 views):

The plan-limiting controversy. Around four to five days before the "feature, not a bug" video, users noticed tokens burning faster during peak hours inside the 5-hour rolling window. Anthropic confirmed on X that it was "adjusting the five hour limits for free max max users." The creator's read was blunt: "it was a feature, not a bug." Every consumer tier was hit — Free, Pro, Max, Max 20x.
The Boxmining benchmark. Claude Opus 4.6 hit 40% on a benchmark Claude itself designed, while GPT 5.4 hit 63% on the same suite. The benchmark covers four categories: instruction following, opposite behavior, false completion, and destructive actions. The prompts were pulled from Stack Overflow and common developer complaints, and Opus was used to design the rubric — so the test should have favored Claude.
The "Mythos" / "Mephisto" / "Glasswing" theories. The channel's read is that Anthropic is prepping a much larger model currently gated to security researchers and partner firms, and the consumer tier is being rationed to free up compute. The plan subsidies — the gap between the $20–$200 subscription and the actual cost of serving — are being pulled back.

The channel's recommendation: hot-swap Claude out for the orchestrator and let Mimo V2 Pro handle execution. Don't pay Opus prices for executor work. Reserve Claude for the critical 5% of multi-step logic that justifies the spend.

The one case where Claude still wins: Claude Fable 5 + Loop Designs, where Opus-class output is genuinely worth the spend. But only inside Cursor / Claude Code / a coding IDE — calling Fable 5 inside Hermes is "overkill" because DeepSeek V4 Pro or Kimi 2.6 is cheaper for orchestration.

The 40% vs 63% benchmark in detail

The Boxmining benchmark is the single most-cited data point in the channel's Claude coverage. The four categories:

Instruction following. The model is given explicit instructions (e.g., "use tabs not spaces", "keep functions under 10 lines", "order things in this specific sequence") and scored on how many of the instructions it follows. Opus 4.6 ignored roughly 60% of the instructions on this category.
Opposite behavior. The model is given an instruction that requires it to do the opposite of what it would naturally do (e.g., "explain why the car-wash test result is wrong even if your instinct says it's right"). Opus 4.6 defaulted to its instinct and missed roughly half of these prompts.
False completion. The model is given a task that requires it to recognize the task is incomplete and continue working. Opus 4.6 declared success prematurely on roughly 60% of these prompts.
Destructive actions. The model is given a task that requires it to be careful about destructive operations (file deletion, database mutations, etc.). Opus 4.6 performed destructive actions when it shouldn't have on roughly 30% of these prompts.

The aggregate score: 40% on the four categories, vs GPT 5.4's 63% on the same suite. The channel's read: "Opus was used to design the rubric — so the test should have favored Claude." The fact that GPT 5.4 wins by 23 points is the empirical case for the orchestrator slot.

The Opus 4.7 launch regression

Opus 4.7 launched on April 17, 2026, with Anthropic's official framing of "substantially better at following instructions," "notable improvements over 4.6," and almost 10% higher SWE-bench Pro. A new "extra high" reasoning tier was added. Tokenization changes mean a call costs roughly 1–1.3x as much as on Opus 4.6. The community disagrees. Reddit's top comment calls it "a serious regression, not an upgrade." Users report fabricated web search citations and a tokenizer reportedly downgraded by 30%, and the car-wash sanity check (50m away, walk or drive?) trips Opus 4.7: it tells the user to drive.

The channel's launch-day test confirmed the regression. Across two runs of the instruction-following suite on launch day, Opus 4.7 landed at the same level as Opus 4.6 and "fails their own tests." GPT 5.4 scores 75% on the same suite. A one-prompt space shooter came out with broken F-to-fire controls and stiff physics. GLM 5.1, priced at $72/mo (Z.ai's coding plan, up from $30), produced a visibly smoother game on the same prompt.

The implication for tier-list building: don't trust launch-day vendor claims. Run the instruction-following test on the new model and compare to the predecessor. The 4.7 launch is the channel's evidence that launch-day numbers don't translate to launch-day performance.
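Comparing a new model to its predecessor on the same suite can be sketched as follows. The pass/fail lists are hypothetical placeholders, not the channel's raw results, and the 2-point tolerance is an arbitrary choice to absorb run-to-run noise.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of prompts the model passed."""
    return sum(results) / len(results)


def compare(old: list[bool], new: list[bool], tolerance: float = 0.02) -> str:
    """Classify the new model against its predecessor on the same suite."""
    delta = pass_rate(new) - pass_rate(old)
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "regressed"
    return "flat"


# Hypothetical 10-prompt suite: both versions pass 4 of 10.
predecessor = [True] * 4 + [False] * 6
successor = [True] * 4 + [False] * 6
print(compare(predecessor, successor))  # flat
```

A "flat" result against a vendor claim of substantial improvement is itself a finding worth recording.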

Auxiliary models (support only)

The third tier in the taxonomy. Auxiliary models are not for primary agent work — they handle web search, free grounding, free document processing, and the high-volume tasks the orchestrator and executor shouldn't be doing.

The channel's auxiliary picks:

Gemini 2.5 Flash — the default baked into Hermes; check it with nano config.yaml.
Gemini 3 Flash — adds free Google Search grounding and URL context reading, a replacement for the $10/month Nous Research built-in tools.
Mimo V2 Pro — high-volume king, the most-used model on OpenRouter for Hermes Agent users; free via the Nous API on the Nous Portal website.
Elephant Alpha (100B params, 256K context) and Trinity Large Preview — open-weight niche picks.

Auxiliary models get their own article in §3.4. The key framing for the orchestrator/executor split is: the auxiliary slot is where you put the model that wins on a specialized capability (search, grounding, document processing), not on the general-purpose axes. Don't route orchestrator or executor work to an auxiliary model.

The "don't pay Opus for grounding" rule

The most actionable auxiliary-slot rule is the "don't pay Opus for grounding" rule. Opus 4.6 / 4.7 are expensive orchestrators with built-in grounding. But the grounding quality is no better than Gemini 3 Flash, and the cost is roughly 100x. If your agent workflow needs search grounding, route the grounding step to Gemini 3 Flash and keep Opus (or GPT 5.4) for the planning and reasoning work.

The same rule applies to document processing. Opus can read a 30-contract set in a 1M context window, but Mimo V2 Pro can do the same at 1/16th the cost (or free during the promotional window). Use the right model for the right task, not the most expensive model for every task.

The "free" framing is a marketing term

The channel's auxiliary picks lean heavily on free models (Gemini 3 Flash, Mimo V2 Pro during the promotional window). The "free" framing is real, but with caveats:

Rate limits apply. Free tiers are rate-limited, and high-volume work will hit the limits. Plan for the rate limit, not just the zero cost.
Free periods end. Mimo V2 Pro's free window is promotional. The estimated price is $20–$40/month when the promotional period ends. Plan the migration before the window closes.
Free models are auxiliary, not primary. Free models are optimized for the auxiliary slot, not the orchestrator or executor slots. Use them for search, grounding, and document processing, not for planning or tool calls.

The take-away: free is a real cost advantage, but free is not a substitute for routing. The auxiliary slot exists because free models are good at the auxiliary work, not because free models are good at everything.
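Planning for the rate limit is simple arithmetic. In the sketch below both the request volume and the requests-per-minute cap are hypothetical; substitute the limits your free tier actually publishes.

```python
import math


def minutes_to_finish(total_requests: int, requests_per_minute: int) -> int:
    """Wall-clock minutes a batch needs when capped at a fixed request rate."""
    return math.ceil(total_requests / requests_per_minute)


# Hypothetical batch: 3,000 grounding calls on a tier capped at 15 per minute.
print(minutes_to_finish(3_000, 15))  # 200
```

If the result is longer than the job can wait, the free tier is not free for that workload.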

The hot-swap mechanic

The Hermes Agent product has shipped the ability to hot-swap models mid-session since the v0.8 update, roughly two weeks before the tier-list video. The mechanic is the /model command:

/model gpt-5.4          # Planning phase
/model minimax-m2.7     # Implementation phase
/model gemini-3-flash   # Web research phase

The /model command lets you change the active model mid-task. The use case the channel flags is debugging: when a task fails, hot-swap to a different model and rerun to confirm whether the orchestrator or the executor is the bottleneck. The channel's read: if a task fails on the current model, run the same task on a different model before debugging the agent config. Sometimes the model is the bug.

The hot-swap mechanic is also the architecture for the orchestrator + executor pattern. The orchestrator runs as one model (GPT 5.4), the executor as another (Minimax M2.7), and Hermes handles the routing between them. The user can override the slot routing via /model if a specific model is failing on a specific task.

The "run the same task on a different model" debugging pattern

The most useful debugging pattern in the agent stack is to hot-swap and rerun. The recipe:

Run the task on the current model. Note the failure mode.
Hot-swap to a different model. Type /model and pick a different model id.
Run the same task. Note the failure mode on the new model.
Compare. If the task succeeds on the new model, the original model was the bottleneck. If the task fails on the new model in the same way, the bottleneck is in the agent config (system prompt, skills, memory), not the model.
Fix the right thing. If the model is the bottleneck, hot-swap is the fix. If the agent config is the bottleneck, debug the config (skills, memory, prompt structure).

The channel's framing: "Sometimes the model is the bug." Hot-swap is the cheapest debugging tool in the agent stack. Use it before rewriting agent config.
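The comparison step of the recipe can be written down as a small decision function. This is a sketch with hypothetical type and field names; the "inconclusive" branch for a different failure mode on the second model is an added assumption, since the recipe only covers the success case and the same-failure case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    model: str
    succeeded: bool
    failure_mode: Optional[str] = None  # e.g. "malformed tool call"


def diagnose(original: RunResult, swapped: RunResult) -> str:
    """Decide where the bottleneck is after a hot-swap rerun."""
    if original.succeeded:
        return "nothing to debug"
    if swapped.succeeded:
        return "model"         # the original model was the bottleneck
    if swapped.failure_mode == original.failure_mode:
        return "agent config"  # same failure on both models
    return "inconclusive"      # different failures: try a third model


first = RunResult("minimax-m2.7", False, "malformed tool call")
second = RunResult("gpt-5.4", True)
print(diagnose(first, second))  # model
```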

The "set the model, not the slot" gotcha

The hot-swap mechanic in Hermes has a subtle gotcha: /model sets the active model, not the slot. If the orchestrator is GPT 5.4 and the executor is Minimax M2.7, and you type /model minimax-m2.7 mid-session, both slots now use Minimax M2.7. The orchestrator slot doesn't get to keep GPT 5.4 unless you reconfigure the slot explicitly.

The workaround: most harnesses have a per-slot model command. In Hermes, the slot configuration is in the config file. Hot-swap the active model via /model for one-off debugging; configure the slots in the config file for the standard routing. Don't rely on /model to maintain the orchestrator-vs-executor split across a long session.
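The gotcha is easier to see in a toy model. The class below is not Hermes code; it is a hypothetical simulation of the behavior described above, in which a session-wide override from /model takes precedence over every configured slot.

```python
from typing import Optional


class ToyHarness:
    """Toy model of slot routing with a session-wide /model override."""

    def __init__(self, orchestrator: str, executor: str) -> None:
        # Slot routing as it would be set in the config file.
        self.slots = {"orchestrator": orchestrator, "executor": executor}
        self.active_override: Optional[str] = None

    def model_command(self, model_id: str) -> None:
        """Simulate /model: sets the active model, not a slot."""
        self.active_override = model_id

    def model_for(self, slot: str) -> str:
        """The override wins over the configured slot when it is set."""
        return self.active_override or self.slots[slot]


harness = ToyHarness(orchestrator="gpt-5.4", executor="minimax-m2.7")
before = harness.model_for("orchestrator")
harness.model_command("minimax-m2.7")
after = harness.model_for("orchestrator")
print(before, after)  # gpt-5.4 minimax-m2.7
```

The configured slots are unchanged in this model, but nothing resolves to them while the override is set, which is why the split should live in the config file.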

Try it yourself

Pick one model from each tier. From the orchestrator slot, pick GPT 5.4 or Gemini 3.1 Pro. From the executor slot, pick GLM 5.1 or Minimax M2.7. From the auxiliary slot, pick Gemini 3 Flash or Mimo V2 Pro.
Wire them into Hermes Agent. Set the orchestrator, executor, and auxiliary slots in the config file. Each slot takes a model id.
Run a 3-step task with the default routing. Pick a task that involves planning (orchestrator), tool calls (executor), and web search (auxiliary). Note which model handles which step in the run log.
Hot-swap mid-task. When the executor fails on a step, type /model to swap to a different executor and rerun. Confirm the swap took effect via /status.
Audit the run log. After a 3-step task, look at which model produced which step. If the orchestrator is doing executor work, your routing config is wrong. If the executor is doing planning, same.
Run the same task on Opus for comparison. Use the /model command to swap in Opus 4.6 and rerun. If Opus wins the task but costs 5–10x as much, you have a cost-per-task comparison for the orchestrator slot. If Opus loses the task, you have a data point consistent with the "Claude Opus is ACTUALLY UNUSABLE" benchmark.
Build a routing table. List your common agent tasks (planning, research, code refactor, document summary, web search) and assign each one to the model the channel's tier list recommends. Treat the table as a config file, not a vibe.
Re-run the routing table weekly. Models change fast. The channel's tier list is current as of the video; re-check after every major release.
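The routing table and the run-log audit from the steps above can be sketched together. The step labels and the run-log shape (a list of step and model pairs) are hypothetical; the model ids are the ones used in the /model examples earlier in this article.

```python
# Routing table as data: one assigned model per step type.
ROUTING_TABLE = {
    "planning": "gpt-5.4",
    "tool_calls": "minimax-m2.7",
    "web_search": "gemini-3-flash",
}


def audit(run_log: list[tuple[str, str]]) -> list[str]:
    """Flag steps that ran on a model other than the one assigned."""
    problems = []
    for step, model in run_log:
        expected = ROUTING_TABLE.get(step)
        if expected is None:
            problems.append(f"{step}: no route configured")
        elif model != expected:
            problems.append(f"{step}: ran on {model}, expected {expected}")
    return problems


# Hypothetical run log in which the orchestrator did executor work.
log = [
    ("planning", "gpt-5.4"),
    ("tool_calls", "gpt-5.4"),
    ("web_search", "gemini-3-flash"),
]
print(audit(log))  # ['tool_calls: ran on gpt-5.4, expected minimax-m2.7']
```

Keeping the table as data makes the weekly re-check a diff against the last version.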

Common pitfalls

Picking one model for both roles. The orchestrator and executor roles need different models. A model that's great at long-horizon reasoning (GPT 5.4) may be expensive and slow at tool execution, and vice versa for GLM 5.1. Match the slot to the model's strength.
Routing Opus to the executor slot. Opus is in question mark on the channel's tier list. Use it (at most) for orchestrator planning, and only on the critical 5%. Route execution to GLM 5.1, Minimax M2.7, or DeepSeek V4 Flash.
Routing Minimax to the orchestrator slot. Minimax M2.7 is a strong executor but degrades on planning above 120K tokens of context. Use it for executor work, not for the multi-step plan.
Trusting chain-of-thought toggles. Some orchestrators (e.g. Qwen 3.6 Plus) have reasoning that stays active on every response with no toggle, which is a feature. Others silently turn it off and you don't notice. Check your logs.
Using GPT 5.4 mini as a solo driver. The channel's Hermes logs show silent fallback to full GPT 5.4 even when mini is configured, which unexpectedly spikes costs. Use mini only as a sub-agent executor with narrow scope.
Trusting the "Qwen 3.6 Plus free on Hermes" video title. The free period has ended and the Hermes portal no longer creates free-tier API keys. Plan around the current Qwen API platform price (~$0.40 / $1.60 per million tokens), not the deprecated "free" framing.
Benching on SWE-bench Pro or Frontier Coding Diamond as a buying signal. Vendors optimize for public benchmarks. The channel's framing: "cherry-picked to the max." Use a real workload plus the WildClaw suite, not a single leaderboard number.
Skipping the hot-swap test. The /model command is the most useful debugging tool in Hermes. If a task fails on the current model, run the same task on a different model before debugging the agent config. Sometimes the model is the bug.
Trusting the auxiliary slot for primary work. Auxiliary models are for specialized tasks — search, grounding, document processing. Don't route orchestrator or executor work to Gemini 3 Flash or Mimo V2 Pro.
Reading the tier list as permanent. The channel's tier list is current as of the video. Models change fast, and the channel's caveat on the WildClaw benchmark ("the reliability of these tests might not be super good in the future as companies optimize specifically for this benchmark") applies to every other public benchmark too. Re-check after every major release.

Sources

Top AI Models for Hermes Agent (Tier List) — 8,107 views · video_id: Af7Fg1m7hRw · cited: the orchestrator/executor/auxiliary taxonomy, the four orchestrators, the four executors, the auxiliary tier
Anthropic pulled a fast one on us! (Opus plans LIMITED) — 24,059 views · video_id: MkabEkgGpjA · cited: the plan-limiting controversy, the consumer tier throttling, the Mythos / Mephisto / Glasswing theories
Claude Opus is ACTUALLY UNUSABLE — 21,675 views · video_id: Cc2Vvra9F_c · cited: the 40% vs 63% Boxmining benchmark, the four-category failure pattern, the community signal
Anthropic admits fault (Claude limits to be INCREASED) — 9,673 views · video_id: WiAx9sPw69U · cited: the Lydia post, the "way faster than expected" admission
Claude Fable 5 + Loop Designs is TOO STRONG! (Full Tests) — 3,482 views · video_id: 8De7s6WG7Bo · cited: the one case where Claude wins, the loop syntax, the Fable 5 / Opus 4.8 demotion
Minimax M2.7 is INSANELY GOOD! (Full Review) — 31,049 views · video_id: --uxieT5J9Y · cited: the trained-on-OpenClaw framing, the 1/16th cost ratio, the executor-slot defense
Supabase query — SELECT video_id, title, views, summary_content, summary_key_takeaways FROM public.videos WHERE video_id = ANY(ARRAY['Af7Fg1m7hRw','MkabEkgGpjA','Cc2Vvra9F_c','WiAx9sPw69U','8De7s6WG7Bo','--uxieT5J9Y']); against project ttxdssgydwyurwwnjogq.