The WildClaw benchmark - AI Model Tier List

If §3.2 was the channel's theoretical model-choice framework, §3.3 is the empirical test. WildClaw is the channel's open-source agent benchmark — a Dockerized OpenClaw suite that runs real agentic tasks (reading emails, launching tasks) instead of pure software-engineering coding tests. The numbers from the channel's run are the most concrete data the channel has published on model choice, and they're the data the orchestrator/executor split in §3.2 is built on.

This article walks through the benchmark setup, the headline numbers, the GLM 5.1 caveat, and the routing rules the benchmark implies.

What you'll learn

What WildClaw actually tests (real agentic tasks in Dockerized OpenClaw containers, not coding tests) and why that matters.
The headline numbers: Claude Opus 51% at $80, GPT 5.4 close at ~quarter cost, Mimo V2 high at $26, Minimax 2.7 cheap with visible drop-off, Grok at 94 minutes vs ~500 minutes for everyone else.
The GLM 5.1 caveat: the model was mid-test at the time of recording and is "a bit slow" to run; the 37% score is real, the zeros are a speed issue, not a capability gap.
The "use Opus only if you have an uncapped coding plan" rule the channel's coverage implies.
How to clone the open-source WildClaw suite and rerun the channel's benchmark on your own model candidates.

The WildClaw benchmark setup

The WildClaw benchmark runs real agentic tasks — reading emails, launching tasks — inside Dockerized OpenClaw containers. It is not a software-engineering coding test. The suite is open source, so you can clone it, modify the tests, and rerun on your own model candidates before committing a budget.

The key framing the channel uses: WildClaw tests whether the model can use a tool stack, not whether it can write code in a sandbox. The tests are Dockerized, so they run the same way on every machine — no GPU variance, no driver issues, no "machine diff, not model diff" excuses. The benchmark is a fair head-to-head for the executor slot specifically, which is why the channel uses it for the tier list.

The benchmark's caveats

The creator's explicit caveat in the video: "it might be the case that the reliability of these tests might not be super good in the future as companies optimize specifically for this benchmark." That caveat matters. WildClaw is now public, so vendor gaming is a real risk. Treat the score as a snapshot, not a long-term signal, and re-run the suite yourself every quarter.

The other caveat: the suite tests executor work, not orchestrator work. The long-horizon reasoning and instruction-following axes from §3.1 are outside the scope of the benchmark. A model that wins on WildClaw is a strong executor candidate, but the orchestrator slot still needs separate testing. The channel's tier list combines WildClaw scores with the orchestrator-slot reasoning (chain-of-thought persistence, contradiction handling across turns) to make the final call.

The headline numbers

The channel's WildClaw run produces the most concrete data the channel has published on model choice. The numbers, in order of the channel's framing:

Claude Opus: 51% overall on the suite, $80 to run the full thing. "The cost is very, very high."
GPT 5.4: close second at roughly a quarter of Opus's cost, and faster too. Many users are switching to it because the Claude Opus coding-plan limits were recently cut, forcing more API spend.
Mimo V2 (Xiaomi): scored high at a $26 run cost. Free extended access was available for ~6 days via Kilo Code and partner providers around the time of the video.
Minimax 2.7: used internally on Loki/Gambit agents for two months. Real-world drop-off vs Opus is visible, but cost is "really, really cheap."
Grok: full suite completed in 94 minutes vs ~500 minutes for other models — "almost five times faster."
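The ratios behind these headline figures are easy to check by hand. The sketch below only does arithmetic on the numbers quoted in the list above; the GPT 5.4 dollar figure is an assumption derived from "roughly a quarter," since the list gives no exact cost for it, and all variable names are illustrative.

```python
# Figures quoted in the list above. GPT 5.4's dollar cost is not given
# directly, only "roughly a quarter" of Opus's, so it is derived here.
OPUS_RUN_COST_USD = 80.0
MIMO_RUN_COST_USD = 26.0
GPT_COST_FRACTION_OF_OPUS = 0.25  # assumption: "roughly a quarter" taken as exactly 1/4
GROK_SUITE_MINUTES = 94
OTHER_SUITE_MINUTES = 500         # approximate ("~500 minutes")

# Derived ratios
gpt_run_cost_estimate = OPUS_RUN_COST_USD * GPT_COST_FRACTION_OF_OPUS
opus_vs_mimo_cost_ratio = OPUS_RUN_COST_USD / MIMO_RUN_COST_USD
grok_speedup = OTHER_SUITE_MINUTES / GROK_SUITE_MINUTES

print(f"GPT 5.4 estimated run cost: ${gpt_run_cost_estimate:.0f}")
print(f"Opus costs {opus_vs_mimo_cost_ratio:.1f}x a Mimo V2 run")
print(f"Grok speedup: {grok_speedup:.1f}x")
```

Because the 500-minute figure is approximate, the computed speedup should be read as "about five times," in line with the creator's wording.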

The 51% Opus / $80 story

Opus hitting 51% on WildClaw is the most-asked-about number in the channel's coverage. The framing: Opus is the capability winner, but the cost is "very, very high" relative to the alternatives. The 51% number alone would justify Opus pricing if there were no close competitors, but GPT 5.4 scores close to the same at roughly a quarter of the cost. The implication: route around Opus for executor work unless you have an uncapped coding plan.

The other half of the Opus story is the throttling controversy covered in "Anthropic pulled a fast one on us! (Opus plans LIMITED)": the consumer tier (Free, Pro, Max, Max 20x) is being throttled to free up compute. The 51% WildClaw score is what Opus can do unthrottled; on a consumer plan, throttling pulls the practical result below that. The benchmark is the ceiling; the throttle is the floor.

The "capability vs cost" plot

The most useful way to internalize the WildClaw numbers is to plot capability vs cost. The plot:

Opus 4.6/4.7: 51% capability, $80 cost. The high-capability, high-cost corner.
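GPT 5.4: just under Opus's 51% capability (a close second), ~$20 cost (roughly a quarter of Opus's). The high-capability, low-cost sweet spot.
Mimo V2 Pro: high capability (ahead of Minimax 2.7, behind Opus), $26 cost (free during the promotional window). The high-capability, free-during-promo corner.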
Minimax 2.7: ~45% capability, $8 cost. The mid-capability, very-low-cost corner.
Grok: ~40% capability, $15 cost. The lower-capability, fast corner (94 min vs 500 min).

The Pareto frontier the plot implies: GPT 5.4 wins on capability-per-dollar. Mimo V2 Pro wins on capability-per-dollar during the free window. Grok wins on capability-per-second. Opus wins on absolute capability but is dominated on every other axis.

The plot generalizes to the full model catalog:

Capability-per-dollar: GPT 5.4 wins, then Mimo V2 Pro (free), then Minimax 2.7, then GLM 5.1, then Opus.
Capability-per-second: Grok wins, then DeepSeek V4 Flash, then Minimax 2.7, then the rest.
Absolute capability: Opus wins (51% on WildClaw), then GPT 5.4, then GLM 5.1, then Mimo V2 Pro, then Minimax 2.7, then Grok.

The right model for your workload is the one closest to your actual point on the capability-per-dollar or capability-per-second frontier. The tier list is a guide, not a rule.
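To make "Pareto frontier" concrete, here is a minimal sketch of how one might compute it from (score, cost) pairs. The model names and numbers are placeholders invented for illustration, not WildClaw results, and `Candidate` and `pareto_frontier` are hypothetical helper names.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    score: float  # success rate in percent, higher is better
    cost: float   # dollars per full run, lower is better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate beats on both axes."""
    frontier = []
    for c in candidates:
        dominated = any(
            other.score >= c.score
            and other.cost <= c.cost
            and (other.score > c.score or other.cost < c.cost)
            for other in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost)


# Placeholder data only: substitute the numbers from your own run.
placeholder_runs = [
    Candidate("model_a", score=50.0, cost=80.0),
    Candidate("model_b", score=48.0, cost=20.0),
    Candidate("model_c", score=40.0, cost=26.0),
    Candidate("model_d", score=30.0, cost=8.0),
]

frontier_names = [c.name for c in pareto_frontier(placeholder_runs)]
print(frontier_names)
```

In this placeholder data, model_c drops off the frontier because model_b scores higher for less money. The highest-scoring model always stays on the frontier by definition, so the practical question is whether its marginal capability is worth the extra cost for your workload.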

GPT 5.4: close second at a quarter of the cost

GPT 5.4 is the most important data point in the benchmark because it answers the question "is Opus actually better?" with a clear "no, not enough to justify the price." The channel's framing: "close second at roughly a quarter of Opus's cost, and faster too." That's a 4x cost-per-task advantage for a marginal capability gap. Route to GPT 5.4 unless the marginal capability matters.

The community's response matches the benchmark. The channel's read: "Many users are switching to it because the Claude Opus coding-plan limits were recently cut, forcing more API spend." GPT 5.4 is the default orchestrator for new Hermes Agent setups, and the benchmark data is the empirical basis for that call.

Mimo V2 at $26: the auxiliary pick

Mimo V2 scoring high at a $26 run cost is the data point that puts Mimo in the auxiliary slot. At the time of the WildClaw video the model was free on Nous Portal and via Kilo Code and partner providers, with about six days left in the free window. The benchmark data supports the channel's "use Mimo for high-volume tasks" recommendation: the success rate is high, the cost is low, and the model is purpose-built for agentic workflows.

The detail worth internalizing: Mimo V2's score on WildClaw was high enough to put it ahead of Minimax 2.7, but the cost was ~3x Minimax's. The "free" framing changes the comparison entirely — while Mimo is free, the routing rule is "route to Mimo for high-volume work, route to Minimax for the executor slot once Mimo's free window ends." Mimo's article is §3.4.

Minimax 2.7: cheap, with visible drop-off

Minimax 2.7 is the channel's daily-use executor pick, and the WildClaw data confirms the cost-per-task win. The creator's framing: Minimax 2.7 has been used internally on Loki/Gambit agents for two months; the real-world drop-off vs Opus is visible, but the cost is "really, really cheap." The benchmark data is consistent with those two months of internal use: Minimax is reliably cheap, the capability is visibly below Opus, and the cost-per-task ratio still wins on most workloads.

The routing rule the benchmark implies: Minimax is the right executor pick when the task is well-defined and the orchestrator can plan around the capability gap. Don't route Minimax to tasks that need bleeding-edge reasoning or to tasks where the executor failure rate would compound (use Opus or GPT 5.4 for those).

Grok at 94 minutes: the speed outlier

Grok completing the full suite in 94 minutes vs ~500 minutes for everyone else is the most under-discussed data point in the benchmark. The implication: Grok is the right executor pick when latency matters more than cost. The 5x speedup compounds on workloads where the agent is in a tight feedback loop with the user (real-time chat, interactive IDE workflows, multi-step plans that the user is waiting on).

The trade-off: Grok's success rate is lower than GPT 5.4's. The benchmark's "what is Grok good for" data point is latency-critical work, not highest-capability work. If you need the answer fast and the capability gap doesn't matter, Grok wins. If you need the most capable executor regardless of latency, GPT 5.4 wins.

GLM 5.1: mid-test caveat

GLM 5.1 was mid-test at the time of recording because it had only been out for two days and "is a bit slow" to run. The creator's framing: "they tune themselves for agentic use case" and claim 90% of Opus, but slow inference kept the suite from finishing in time. The 37% score from the "Glm 5.1 Test: Making a Retro Style Game" video is real, and the zeros are a speed issue, not a capability gap.

The implication for tier-list building: the WildClaw score for GLM 5.1 is a floor, not a ceiling. The run was incomplete when the channel recorded, and the zeros came from timeouts rather than failed attempts. It is worth waiting for the next benchmark pass to confirm GLM 5.1's actual executor position. The channel's preliminary read is that GLM 5.1 is the standout executor for the next quarter.

The "use Opus only if you have an uncapped coding plan" rule

The bottom line the channel's coverage implies: use Opus only if you have an uncapped coding plan. Otherwise run WildClaw yourself against GPT 5.4, Mimo V2, or Grok and pick the one that passes your own tasks for the lowest $/run.

The rule is concrete enough to use as a routing decision:

If you have Opus on a $20–$200/mo consumer plan, you're being throttled. The WildClaw 51% is the ceiling; the throttle is the floor. Route around Opus for executor work and use it (at most) for orchestrator planning.
If you have Opus on an uncapped API plan, the 51% score justifies the spend on critical orchestrator work. Still route executor work to a cheaper model.
If you don't have Opus at all, GPT 5.4 is the default orchestrator. Mimo V2 (while free) is the default auxiliary. Minimax 2.7 is the default executor. Grok is the default when latency matters.
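The three cases above can be sketched as a small routing function. This is an illustration of the rule as stated, not code from the channel: `pick_model`, its parameter names, and the plan labels are hypothetical, and two simplifications are assumptions (a consumer plan defaults to GPT 5.4 for orchestration even though Opus is allowed "at most" for planning, and the latency override applies only to the executor slot).

```python
ROLES = {"orchestrator", "executor", "auxiliary"}
OPUS_ACCESS = {"none", "consumer", "api"}


def pick_model(role: str, opus_access: str = "none", latency_critical: bool = False) -> str:
    """Return a default model name for a slot, following the routing rule above."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    if opus_access not in OPUS_ACCESS:
        raise ValueError(f"unknown opus_access: {opus_access!r}")

    if role == "orchestrator":
        # Opus is only worth it for critical planning on an uncapped API plan.
        # Assumption: throttled consumer plans fall back to GPT 5.4.
        return "Claude Opus" if opus_access == "api" else "GPT 5.4"

    if role == "auxiliary":
        # Default holds only while the free window lasts.
        return "Mimo V2"

    # Executor work is never routed to Opus under this rule.
    # Assumption: the latency override applies to the executor slot only.
    return "Grok" if latency_critical else "Minimax 2.7"


print(pick_model("executor"))
print(pick_model("executor", latency_critical=True))
print(pick_model("orchestrator", opus_access="api"))
```

Keeping the rule in one function makes it easy to update when your own benchmark numbers disagree with the channel's defaults.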

The "use Opus only if you have an uncapped coding plan" rule is the single most important routing rule in the entire course. If you internalize nothing else, internalize this.

The uncapped API plan vs the consumer tier

The distinction between "uncapped API plan" and "consumer tier" is the most important cost distinction in the channel's coverage. The two paths:

API plan: pay per token, with no subscription-style usage cap, no consumer-tier throttling, and no 5-hour rolling window. The 51% WildClaw score is the actual capability. The cost is $5/M input, $25/M output for Opus 4.6. A WildClaw run on Opus is $80.
Consumer tier: pay a flat $20–$200/month, but usage is throttled to free up compute. The 51% WildClaw score is the ceiling; the actual capability under throttling is lower. The cost is the subscription fee, but the throttling can mean the WildClaw suite doesn't complete in one window.
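For a feel of what per-token billing means at these rates, the sketch below converts token counts into dollars using the $5/M input and $25/M output figures quoted above. The token mix in the example is hypothetical: the channel does not publish the suite's token counts, so this is just one of many mixes that would total $80. `api_cost_usd` is an illustrative helper name.

```python
OPUS_INPUT_USD_PER_MTOK = 5.0    # rate quoted in the text
OPUS_OUTPUT_USD_PER_MTOK = 25.0  # rate quoted in the text


def api_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate: float = OPUS_INPUT_USD_PER_MTOK,
    output_rate: float = OPUS_OUTPUT_USD_PER_MTOK,
) -> float:
    """Dollar cost of a run billed per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical token mix, chosen only because it lands on the $80 run cost.
example_cost = api_cost_usd(input_tokens=10_000_000, output_tokens=1_200_000)
print(f"${example_cost:.2f}")
```

Output tokens cost five times as much as input tokens at these rates, so agents that generate long responses push the bill up faster than agents that mostly read.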

The channel's rule: pay Opus prices on the API plan for critical orchestrator work; pay flat subscription fees on the consumer tier for everything else. The two plans serve different use cases. The 5-hour rolling window on the consumer tier is a feature for budget users (it caps the bill) and a bug for power users (it caps the capability).

The "Mythos is enterprise-only" caveat

The channel's read on the Mythos strategy: Anthropic is prepping a much larger model currently gated to security researchers and partner firms, and the consumer tier is being rationed to free up compute. The implication for tier-list building: the consumer tier may not be the right plan for production work. The Mythos strategy suggests Anthropic is willing to degrade the consumer tier to push users to the API plan (or to Mythos once it ships).

The channel's recommendation: watch the Mythos timeline. If Mythos ships to enterprise customers in 2026, the consumer tier may get a meaningful upgrade (the compute freed up by Mythos moves back to Opus). If Mythos stays enterprise-only, the consumer tier is permanently degraded and the tier list should reflect that.

The cost-vs-intelligence trade-off

The benchmark data makes the cost-vs-intelligence trade-off concrete. The framing the channel uses across all the model reviews: every model is a point on a 2D plane, and the "best" model is the one closest to your actual workload's Pareto frontier.

The WildClaw numbers, sorted by capability:

Opus 4.6/4.7: highest capability, highest cost.
GPT 5.4: close to Opus capability at a quarter of the cost.
Mimo V2: high capability at low cost (free during the promotional window).
GLM 5.1: high capability at moderate cost, mid-test.
Minimax 2.7: visible capability drop-off, very low cost.
Grok: low capability, very low latency.

The Pareto frontier the benchmark implies:

For capability-per-dollar, GPT 5.4 wins.
For free executor work, Mimo V2 wins (while the free window lasts).
For latency-critical work, Grok wins.
For daily-use executor work, Minimax 2.7 wins on cost, GLM 5.1 wins on capability-per-dollar.
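For orchestrator planning (which WildClaw does not test directly, so this comes from the channel's separate orchestrator-slot reasoning), GPT 5.4 wins; Gemini 3.1 Pro ties with native multimodal; Qwen 3.6 Plus wins on consistency across long sessions.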

The right model for your workload is the one on the Pareto frontier closest to your actual point. The tier list is a guide, not a rule.

Why the benchmark is public

The channel's open-sourcing of WildClaw is itself a strategic choice. By making the suite public, the channel invites the community to:

Verify the published scores on their own machines.
Add custom tests that mirror their own production workloads.
Track how scores change as new model versions drop.
Push back on the tier list with their own data.

The tradeoff: vendors will start optimizing for WildClaw specifically. The channel's explicit caveat: "it might be the case that the reliability of these tests might not be super good in the future as companies optimize specifically for this benchmark." Treat the score as a snapshot, re-run quarterly, and add at least one custom test that mirrors your real workload.

The "open benchmark" ecosystem

WildClaw is not the only open agent benchmark the channel has cited. The wider ecosystem:

SWE-bench Pro — the channel's read: "cherry-picked to the max." Use Frontier Coding Diamond instead.
Frontier Coding Diamond — the channel's preferred coding benchmark. Fable 5 hits 29.3% vs GPT 5.5 at 5.7%.
Artificial Analysis intelligence index — the channel's preferred overall intelligence benchmark. Fable 5 scores 65 with fallback (top spot).
MRCR v2 — the long-context recall benchmark. Opus 4.6 scored 78.3% (highest among frontier models at 1M context).
BFCL — the Berkeley Function Calling Leaderboard. M2.5 hit 76.8% (and 2.7 is positioned to push further).
Baby Vision — the visual understanding benchmark. Qwen 3.7 Plus hit 64.7 vs Qwen 3.6 Plus at 37.4.
WildClaw — the channel's open-source agent benchmark. Opus 51% at $80, GPT 5.4 close-second at quarter cost.

The framing: every public benchmark has a vendor-gaming risk. The channel's coverage treats WildClaw as a snapshot and the other benchmarks as directional signals. The capstone is where you build the custom test that's resistant to vendor gaming — because the custom test is your real workload, and no vendor can game your real workload.

Try it yourself

Clone WildClaw. Pull the open-source suite the channel publishes and read the task list. The creator's explicit caveat: "it might be the case that the reliability of these tests might not be super good in the future as companies optimize specifically for this benchmark" — so treat it as a starting point, not a permanent ranking.
Pick three candidate models from the orchestrator / executor / auxiliary slots. For the executor, pick GPT 5.4, Mimo V2, and GLM 5.1 if you want to test the channel's top-three for cheap agents. For the orchestrator, pick GPT 5.4, Gemini 3.1 Pro, and Qwen 3.6 Plus.
Add one custom task that mirrors something you actually run in production (a multi-email triage, a one-shot game build, a contract review). Don't benchmark in a vacuum.
Run the suite inside the Dockerized OpenClaw container and log three numbers per model: success rate, total dollars, and wall-clock minutes.
Compare to the Hermes tier list. Cross-reference your numbers against the Top AI Models for Hermes Agent (Tier List) — if your top performer isn't in the channel's slot, that's a signal either that your task is unique or that the tier list needs an update.
Time the build. If a Minimax 2.7 overnight run matches a Sonnet run on the same prompt at one-tenth the cost, you've reproduced the channel's working hypothesis. If it doesn't, escalate to Sonnet for the specific task class and keep Minimax for the rest.
Hot-swap the orchestrator vs executor. Type /model mid-session in Discord or Telegram during an active Hermes run (this has worked since the v0.8 update ~two weeks before the tier-list video) and rerun the failing task on a different model to confirm whether the orchestrator or the executor is the bottleneck.
Re-run quarterly. Vendors will optimize for the public suite. Track the score over time and add at least one custom test that mirrors your real workload.
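One way to keep the three numbers from the logging step comparable across models is a small record type. The sketch below is an assumption about how you might store and rank your own results; it does not read WildClaw's actual output format, and the candidate names and figures are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    model: str
    tasks_passed: int
    tasks_total: int
    dollars: float   # total spend for the full suite
    minutes: float   # wall-clock time for the full suite

    @property
    def success_rate(self) -> float:
        return self.tasks_passed / self.tasks_total

    @property
    def dollars_per_pass(self) -> float:
        # A model that passes nothing has no meaningful cost per pass.
        return self.dollars / self.tasks_passed if self.tasks_passed else float("inf")


def rank_by_cost_per_pass(results: List[RunResult], min_success_rate: float = 0.0) -> List[RunResult]:
    """Drop models below the success floor, then sort by cheapest passing task."""
    eligible = [r for r in results if r.success_rate >= min_success_rate]
    return sorted(eligible, key=lambda r: r.dollars_per_pass)


# Placeholder numbers only: replace with what you log from your own runs.
my_runs = [
    RunResult("candidate_a", tasks_passed=20, tasks_total=40, dollars=40.0, minutes=300),
    RunResult("candidate_b", tasks_passed=18, tasks_total=40, dollars=9.0, minutes=250),
    RunResult("candidate_c", tasks_passed=10, tasks_total=40, dollars=4.0, minutes=90),
]

ranked = rank_by_cost_per_pass(my_runs, min_success_rate=0.4)
for r in ranked:
    print(f"{r.model}: {r.success_rate:.0%}, ${r.dollars_per_pass:.2f}/pass, {r.minutes:.0f} min")
```

The success floor matters: without it, the cheapest model can top the ranking even when it fails most of your tasks.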

Common pitfalls

Treating the published scores as a permanent leaderboard. WildClaw is open source, so vendors will game it. Run your own version against your own tasks before committing budget.
Benchmarking in a vacuum. Public benchmarks get optimized for once they are public, so a published score may not carry over to your tasks. Add at least one task that mirrors your real production workload. The single most common failure mode is trusting the published number and finding it doesn't generalize.
Paying Opus prices for executor work. The headline 51% vs close-second-at-quarter-cost is enough to disqualify Opus 4.6 for executor roles. Keep it (at most) on orchestrator planning and route execution to GPT 5.4, Mimo V2, Minimax 2.7, or GLM 5.1.
Reading the GLM 5.1 mid-test score as a leaderboard ranking. In the creator's words, "a lot of puzzle-solving tests" scored zero, and the cause was timeouts, not capability failures. The 37% is a real score on the real suite; the zeros are a speed issue, not a capability gap. Worth re-running once the launch-week load clears.
Using Grok as the default executor. Grok's 5x speedup is real, but the success rate is lower than GPT 5.4's. Grok is the right pick for latency-critical work, not the default.
Relying on per-token pricing for cost decisions. The benchmark's $/run number is the cost axis that matters, not the per-token rate. A model that costs 2x more per token but needs fewer than half as many tokens to finish the suite is cheaper in the aggregate.
Micro-optimizing tokens on a per-token plan. The point of a flat-rate plan is that you stop counting tokens. If you're tuning heartbeats and trimming prompts to save tokens, you're on the wrong plan. The creator's framing: "I don't want to fix and play with my OpenClaw all the time."
Trusting SWE-bench Pro or Frontier Coding Diamond as a buying signal. The channel explicitly calls SWE-bench Pro "cherry-picked to the max." Use a real workload plus WildClaw, not a single leaderboard number.
Skipping the custom test. The single most important rule for tier-list building: add at least one task that mirrors your real production workload. Without it, you're benchmarking in a vacuum.
Reading the WildClaw score as a long-term signal. The suite is open source, so vendors will start gaming it. Re-run quarterly, treat the score as a snapshot, and add custom tests for your workload.
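The per-token pricing pitfall above comes down to simple arithmetic: run cost is tokens used times the rate, so a higher rate can still produce a lower bill. A minimal sketch with made-up token counts and rates (none of these figures come from the benchmark, and `run_cost_usd` is an illustrative name):

```python
def run_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a full run at a single blended per-token rate."""
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical models: one is twice as expensive per token but far more concise.
pricey_but_terse = run_cost_usd(total_tokens=4_000_000, usd_per_million_tokens=10.0)
cheap_but_chatty = run_cost_usd(total_tokens=12_000_000, usd_per_million_tokens=5.0)

print(f"pricey_but_terse: ${pricey_but_terse:.2f}")
print(f"cheap_but_chatty: ${cheap_but_chatty:.2f}")
```

This is why the $/run figure you log from your own suite is a better cost axis than any published price sheet.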

Sources

Best Model for Openclaw (WildClaw Benchmarks!) — 4,574 views · video_id: 31Ij4Cum5tg · cited: the benchmark setup, the 51% Opus / $80, the GPT 5.4 close-second at quarter cost, the Mimo V2 $26 / 6-day free window, the Minimax 2.7 two-month internal use, the Grok 94min vs ~500min speed data, the GLM 5.1 mid-test caveat, the "use Opus only if you have an uncapped coding plan" rule
Anthropic pulled a fast one on us! (Opus plans LIMITED) — 24,059 views · video_id: MkabEkgGpjA · cited: the consumer tier throttling that makes the WildClaw ceiling the practical floor
Glm 5.1 Test: Making a Retro Style Game — 98 views · video_id: 3N0Pe3dkwBE · cited: the GLM 5.1 37% score, the zeros due to timeouts
Top AI Models for Hermes Agent (Tier List) — 8,107 views · video_id: Af7Fg1m7hRw · cited: the GLM 5.1 standout executor framing, the Minimax 2.7 trained-on-OpenClaw framing
Supabase query — SELECT video_id, title, views, summary_content, summary_key_takeaways FROM public.videos WHERE video_id = ANY(ARRAY['31Ij4Cum5tg','MkabEkgGpjA','3N0Pe3dkwBE','Af7Fg1m7hRw']); against project ttxdssgydwyurwwnjogq.