Building your own tier list - AI Model Tier List

This is the capstone. The course is useless if you finish it still routing everything through Claude Opus because the channel said so. The exercise is to take the four-axis framework from §3.1, the orchestrator/executor/auxiliary taxonomy from §3.2, the WildClaw benchmark data from §3.3, and the auxiliary model picks from §3.4 — and end up with a tier list that matches your task mix, not the channel's.

This article walks through the capstone workflow, the hot-swap strategy the capstone implies, and the decision rules for picking the right model on the right task. The exercise mirrors the original 20-tier-lists-comparisons capstone, but with the deeper framework this course teaches.

What you'll learn

The capstone workflow: pick three candidate models, run the WildClaw benchmark on each, score on the four axes, classify into slots, and cross-reference with the channel's tier list.
The hot-swap strategy for the multi-model workflow: the /model command in Hermes, the model switch in OpenClaw, and the routing rules the channel's coverage implies.
The decision tree for picking the right model on the right task: reliability, budget, coding excellence, multimodal, privacy, hot-swap combinations.
The 7-day routing experiment that closes the loop: run your tier list for a week, log every task, every model, every cost, and every failure, and look at the data.
The platform-specific recommendations: best models for Hermes, OpenClaw, and Claude Code users, in order.

The capstone workflow

The capstone workflow is the same workflow the channel's audience has been asking for since the original tier list video dropped. It has six steps, each of which builds on the previous one:

Step 1: Pick three candidate models

Pick three candidate models from §3.2. The channel's top-three for cheap agents are GPT 5.4, Mimo V2, and GLM 5.1. The top-three for capability are GPT 5.4, Gemini 3.1 Pro, and Qwen 3.6 Plus. Pick one from each list and add a wildcard from §3.2 that the channel doesn't rank.

The framing: the three models should cover different slots. If you pick three orchestrators, you have no executor data. If you pick three executors, you have no orchestrator data. The right pick is one orchestrator, one executor, and one wildcard.

The wildcard slot

The wildcard is the model that doesn't fit cleanly into the orchestrator / executor / auxiliary taxonomy. Examples:

A model you currently use. If you're already running Minimax M2.7, pick it as the wildcard to compare against your incumbent.
A new model with buzz. If a new model just dropped (Kimi 2.6, GPT 5.5, GLM 5.2), pick it as the wildcard to see if the buzz is real.
A model from a vendor you're skeptical of. If you've avoided Anthropic since the regression, pick Opus 4.7 as the wildcard to see if the regression is still as bad as the channel's coverage suggests.
An open-weight model you can self-host. If you care about privacy, pick Qwen 3.5 or Nemotron 3 Super as the wildcard to see how the open-weight options stack up.

The wildcard's purpose is to break the confirmation bias. If you only test models the channel already endorses, you'll only confirm the channel's tier list. The wildcard forces you to look at the data with fresh eyes.

Step 2: Run the WildClaw benchmark on each

Run the WildClaw benchmark methodology on each. The suite is open source — clone it, modify the test list if you want, rerun on your own models. Don't trust the published leaderboard; vendors will optimize for it once it matters.

The critical step: add at least one custom test that mirrors something you actually run in production. The channel's caveat on every public benchmark ("vendors will optimize for it") applies to WildClaw too. The custom test is your insurance.

The "custom test" recipe

The custom test is the most important step in the capstone. The recipe:

Pick a production task you actually run. Email triage, code refactor, document summarization, game build, contract review. Pick the one you run most often.
Write a 3-5 step prompt that mirrors the production task. Include explicit formatting requirements (JSON schema, step ordering, specific tool calls) so the instruction-following axis is tested.
Add the prompt to the WildClaw suite. Save it as a custom test in the suite's test directory.
Run the suite on each of the three models. Log success rate, cost-per-task, latency, and instruction-following score.
Compare to the published scores. The custom test is the data point that matters most for your workload. The published scores are the data point that matters for the channel's workload.

The framing: the custom test is your insurance against the channel's confirmation bias. The published scores are a starting point; the custom test is the verdict.
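
One way to make the recipe concrete is a small checker for the formatting requirements in your prompt. The sketch below assumes the prompt asks for a JSON object with a "steps" list whose entries carry "order" and "tool" fields. Those field names, the EXPECTED_TOOLS list, and check_output are all hypothetical; they are not part of the WildClaw suite, whose own test format is not shown here.

```python
import json

# Hypothetical expectation for a 3-step email-triage test: tools must be
# named in this order. Replace with the tools your production task uses.
EXPECTED_TOOLS = ["fetch_inbox", "classify", "draft_reply"]


def check_output(raw: str, expected_tools=EXPECTED_TOOLS) -> dict:
    """Score one model response for instruction following.

    Returns a dict with 'valid_json', 'ordered', and 'passed' flags.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "ordered": False, "passed": False}

    steps = data.get("steps", []) if isinstance(data, dict) else []
    tools = [s.get("tool") for s in steps if isinstance(s, dict)]
    orders = [s.get("order") for s in steps if isinstance(s, dict)]

    # Steps must be numbered 1..n in sequence and name the expected tools.
    ordered = orders == list(range(1, len(expected_tools) + 1))
    passed = ordered and tools == list(expected_tools)
    return {"valid_json": True, "ordered": ordered, "passed": passed}
```

Running the same checker over every candidate's response gives a pass/fail signal for the instruction-following axis that is specific to your workload.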

Step 3: Score on the four axes, not one

Score on four axes, not one: (a) success rate, (b) cost-per-task, (c) latency, and (d) the instruction-following score you logged in Step 2. The channel's point across the whole course is that "best model" is a multi-axis answer, not a single rank.

The right way to score the first three:

Success rate. Out of 100, how many of the WildClaw tasks did the model complete correctly? Compare to the published scores from §3.3.
Cost-per-task. Total dollars to run the full suite, divided by successful tasks. The unit of value is the successful task, not the call.
Latency. Wall-clock minutes for the full suite. Compare to the Grok 94-minute outlier and the ~500-minute baseline.
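
The three measurements above reduce to a few lines of arithmetic. This is a minimal sketch; the SuiteRun record and its field names are assumptions for illustration, and the numbers in the example are made up rather than taken from any published run.

```python
from dataclasses import dataclass


@dataclass
class SuiteRun:
    model: str
    tasks_total: int       # tasks attempted
    tasks_passed: int      # tasks completed correctly
    total_cost_usd: float  # dollars spent on the whole suite
    wall_clock_min: float  # minutes for the whole suite


def success_rate(run: SuiteRun) -> float:
    """Percentage of tasks completed correctly."""
    return 100.0 * run.tasks_passed / run.tasks_total


def cost_per_successful_task(run: SuiteRun) -> float:
    """Total spend divided by successes; infinite if nothing passed."""
    if run.tasks_passed == 0:
        return float("inf")
    return run.total_cost_usd / run.tasks_passed


# Made-up numbers for illustration only.
example = SuiteRun("model-a", tasks_total=100, tasks_passed=60,
                   total_cost_usd=12.0, wall_clock_min=94.0)
```

Dividing by successful tasks rather than by calls is what makes a cheap but unreliable model look as expensive as it really is.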

The "axis weight" rule

Not all axes are equally important. The weighting depends on your workload:

For production work, success rate and cost-per-task are the dominant axes. Latency matters less because the agent runs overnight or in batch.
For real-time chat, latency and success rate are the dominant axes. Cost-per-task matters less because the user is paying for the experience.
For high-volume batch work, cost-per-task is the dominant axis. Success rate matters less because failures are acceptable in volume.
For critical orchestrator work, success rate is the dominant axis. Cost and latency are secondary because the orchestrator is the bottleneck.

The right weighting for your workload is the one that matches your actual use case. The default weighting is 50% success rate, 30% cost-per-task, 20% latency. Adjust based on your use case.
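
The default weighting can be turned into a single comparable number per model. The sketch below is one possible approach, not the channel's method: it min-max normalizes each axis across your candidates (an assumption) so that 1.0 is always the best candidate on that axis, then applies the 50/30/20 weights. The function names are hypothetical.

```python
DEFAULT_WEIGHTS = {"success": 0.5, "cost": 0.3, "latency": 0.2}


def normalize(values, lower_is_better=False):
    """Min-max scale to 0..1 so that 1.0 is always the best candidate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if lower_is_better else scaled


def weighted_scores(models, weights=DEFAULT_WEIGHTS):
    """models: dict name -> (success_rate, cost_per_task, latency_min)."""
    names = list(models)
    success = normalize([models[n][0] for n in names])
    cost = normalize([models[n][1] for n in names], lower_is_better=True)
    latency = normalize([models[n][2] for n in names], lower_is_better=True)
    return {
        n: weights["success"] * s + weights["cost"] * c + weights["latency"] * l
        for n, s, c, l in zip(names, success, cost, latency)
    }
```

To reweight for real-time chat or batch work, pass a different weights dict whose values still sum to 1.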

Step 4: Classify each model into a slot

Classify each model into a slot: orchestrator, executor, or auxiliary. A model that wins on speed but loses on cost might be the perfect auxiliary; a model that wins on reasoning but loses on tool reliability might be the perfect orchestrator with a cheap executor underneath.

The classification rules from §3.2:

Orchestrator. Long-horizon reasoning, instruction following, chain-of-thought persistence across turns. The model that wins on the "willingness to follow instructions" axis and the "long-horizon consistency" axis.
Executor. Tool-call reliability, formatting compliance, speed, cost. The model that wins on the "speed vs quality" axis and the "cost-per-task" axis.
Auxiliary. Specialized capability (search, grounding, document processing). The model that wins on a narrow axis the orchestrator and executor don't optimize for.

The "role test" recipe

The classification step is the most subjective part of the capstone. The recipe:

Run a planning task on the model. Example: "Plan a 5-step refactor of a Python data collector. Output the steps in order, with the tools needed for each step." A model that produces a coherent 5-step plan is orchestrator-quality. A model that hallucinates steps or skips tools is executor-quality.
Run a tool-call task on the model. Example: "Given the following 5-step plan, call the right tool for each step. Use this exact JSON schema." A model that calls the right tool with the right args is executor-quality. A model that calls the wrong tool or skips a step is auxiliary-quality.
Run a specialized task on the model. Example: "Search for the latest news on X and summarize the top 3 results." A model that produces a coherent search summary is auxiliary-quality. A model that hallucinates search results is not in your stack.
Assign the slot based on the strongest signal. A model that's orchestrator-quality on planning and executor-quality on tool calls is a hybrid (Kimi 2.5 sits here). A model that's orchestrator-quality on planning and auxiliary-quality on tool calls is an orchestrator (use a cheap executor underneath). A model that's executor-quality on tool calls and auxiliary-quality on planning is an executor (use an orchestrator above).

The classification is a starting point, not a permanent label. Re-test quarterly as models change.
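
The assignment rule in the recipe can be written down as a small function. This sketch simplifies each role test to a pass/fail boolean, which is an assumption; in practice you may want a graded score. The function name and the slot labels it returns are illustrative.

```python
def assign_slot(planning_ok: bool, tool_calls_ok: bool, specialized_ok: bool) -> str:
    """Map the three role-test outcomes to a slot, following the recipe above."""
    if planning_ok and tool_calls_ok:
        return "hybrid"          # strong on both planning and tool calls
    if planning_ok:
        return "orchestrator"    # pair with a cheap executor underneath
    if tool_calls_ok:
        return "executor"        # pair with an orchestrator above
    if specialized_ok:
        return "auxiliary"       # only the narrow task succeeded
    return "not in stack"        # failed every role test
```

Recording the three booleans alongside the slot makes the quarterly re-test easy to compare against the previous result.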

Step 5: Compare to the channel's tier list

Compare to the channel's Top AI Models for Hermes Agent (Tier List). If your top performer isn't in the channel's slot, that's a signal either that your task is unique or that the tier list needs an update. Either way, you now have evidence.

The channel's tier list is a starting point, not a rule. Your tier list is the result of running the WildClaw benchmark on your own workload. The two should agree on the broad strokes (GPT 5.4 is a strong orchestrator, Minimax is a strong executor, Mimo is a strong auxiliary) but may diverge on the details (your executor might be different from the channel's, your orchestrator might be different, your auxiliary might be a niche model the channel doesn't cover).

The "agree on broad strokes" rule

The capstone's comparison step has a specific decision rule:

If your tier list agrees with the channel's tier list on the broad strokes (same orchestrator, same executor, same auxiliary), the channel's tier list is a good fit for your workload. Use the channel's picks with confidence.
If your tier list diverges on the broad strokes (different orchestrator, different executor, different auxiliary), your workload is unique. Trust your data, not the channel's coverage. The channel's tier list is a starting point, not a rule.
If your tier list diverges on the details (same orchestrator, different executor, same auxiliary), the divergence is on a single slot. Investigate why. The most common reason is that the channel's coverage was published before a new model dropped, or that your custom test caught a regression the channel didn't catch.

The framing: the channel's tier list is the most systematic ranking published, but it's still a snapshot. Your data is the live ranking. When the two diverge, trust your data.
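
The comparison rule can be sketched as a slot-by-slot diff. The rule above spells out zero, one, and three diverging slots; treating two diverging slots as a broad divergence is an assumption made here. The function name and the advice strings are hypothetical.

```python
SLOTS = ("orchestrator", "executor", "auxiliary")


def compare_tier_lists(mine: dict, channel: dict) -> tuple:
    """Return (diverging_slots, advice) for two slot -> model mappings."""
    diverging = [s for s in SLOTS if mine.get(s) != channel.get(s)]
    if not diverging:
        advice = "adopt the channel's picks"
    elif len(diverging) == 1:
        advice = "investigate the diverging slot"
    else:
        advice = "trust your own data"
    return diverging, advice
```

For example, a list that matches the channel on orchestrator and auxiliary but not on executor returns (["executor"], "investigate the diverging slot").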

Step 6: Run a 7-day routing experiment

Run a 7-day routing experiment. Use the tier list you built for one week. Log every task, every model, every cost, and every failure. At the end of the week, look at the data. The channel's claim is that hot-swapping models mid-task is the actual power move — the capstone gets you to the point where you can do it.

The 7-day experiment is the loop closer. The capstone is theoretical until you actually use the tier list on real work for a week. The data you collect (success rate by model, cost by task, latency by model) is the evidence that the tier list matches your workload.

The "log every task" recipe

The 7-day experiment only works if you actually log the data. The recipe:

Create a spreadsheet or use a log file. The format doesn't matter; the data does.
For every task, log four fields: (a) the task description, (b) the model that handled it, (c) the cost in dollars, (d) the success/failure status.
Hot-swap when a model fails. Type /model and rerun. Log the swap and the result.
At the end of the week, sort by model and by task. The patterns in the data are your tier list update.
Update the tier list based on the data. If a model is consistently failing on a specific task, swap it out. If a model is consistently succeeding, keep it. The data drives the decision.

The framing: the tier list is a hypothesis. The 7-day experiment is the test. The data is the verdict. Update the tier list based on what the data says, not what the channel said.
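
If you prefer a log file to a spreadsheet, a CSV appender is enough. This is a minimal sketch: the four fields from the recipe are kept, and a timestamp and a swapped_from column are added as assumptions so that hot-swaps can be traced. The function name and column names are illustrative.

```python
import csv
import os
from datetime import datetime, timezone

FIELDS = ["timestamp", "task", "model", "cost_usd", "success", "swapped_from"]


def log_task(path, task, model, cost_usd, success, swapped_from=""):
    """Append one row to the routing log, writing a header for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "task": task,
            "model": model,
            "cost_usd": f"{cost_usd:.4f}",
            "success": int(success),       # 1 for success, 0 for failure
            "swapped_from": swapped_from,  # empty unless this was a rerun
        })
```

A failed task followed by a hot-swapped rerun becomes two rows: the failure on the first model, then the result on the second with swapped_from filled in.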

The hot-swap strategy

The capstone implies a hot-swap strategy: the ability to change models mid-task to optimize for cost, speed, or capability. The hot-swap mechanics:

In Hermes Agent: type /model mid-session to swap the active model. The swap has worked since the v0.8 update, roughly two weeks before the tier-list video.
In OpenClaw: edit the model parameter alongside the API key and endpoint. Run /status in Discord or openclaw status in the terminal to confirm the swap took.
In Claude Code: swap the three env vars in settings.json (the API key, ANTHROPIC_AUTH_TOKEN, the base URL). The same swap works for Kilo Code, Open Claude, Grok CLI, and other Anthropic-compatible clients.

The hot-swap strategy is the architecture for the orchestrator + executor pattern. The orchestrator runs as one model (GPT 5.4), the executor as another (Minimax M2.7), and the harness handles the routing between them. The user can override the slot routing via /model if a specific model is failing on a specific task.

The three hot-swap use cases

The hot-swap strategy has three concrete use cases that the channel's coverage flags:

The "this is failing" debugging use case. When a task fails on the current model, hot-swap to a different model and rerun. If the task succeeds on the new model, the original model was the bottleneck. If the task fails on the new model in the same way, the bottleneck is in the agent config (system prompt, skills, memory), not the model.
The "this is too expensive" cost use case. When a task is costing more than expected, hot-swap from a frontier model (Opus, GPT 5.4) to a cheap model (Minimax M2.7, Mimo V2 Pro). The same task runs at 1/16th the cost, with a marginal capability gap that may or may not matter for the specific task.
The "this is too slow" latency use case. When a task is taking too long, hot-swap from a slow model (Opus 4.7 under peak load, GLM 5.1 during launch week) to a fast model (Grok at 94 minutes vs 500 minutes, or DeepSeek V4 Flash at #4 speed rank). On those numbers the same task finishes roughly 5x faster, with a marginal capability gap.

The framing: hot-swap is the universal debugging tool. The channel's coverage treats it as the single most important agent feature in the tier list era. If you have a harness without hot-swap, that's a structural limitation. Switch harnesses.
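
The debugging use case is a two-outcome test, and it helps to write the interpretation down before you run it. In the sketch below, the function name and the "inconclusive" branch (a different failure on the second model) are assumptions; the text above only covers success and same-way failure.

```python
def diagnose(original_error: str, retry_succeeded: bool, retry_error: str = "") -> str:
    """Interpret a hot-swap rerun of a task that failed on the first model."""
    if retry_succeeded:
        return "model"         # the original model was the bottleneck
    if retry_error == original_error:
        return "config"        # same failure twice: check prompt, skills, memory
    return "inconclusive"      # different failure: try a third model or read the logs
```

Comparing error strings exactly is crude; in practice you would likely compare an error category rather than the raw message.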

Budget combo (the cheapest tier list)

The cheapest tier list the channel recommends:

Orchestrator: Mimo V2 Pro (FREE) — only valid while the free window lasts.
Executor: Minimax M2.7 ($10–$20/month) — the budget executor pick.
Auxiliary: Gemini 3 Flash (FREE) — the free Google Search grounding.

Total cost: $10–$20/month, with Mimo free during the promotional period. The trade-off: the orchestrator is below frontier (Mimo at ~55% WildClaw), so long-horizon reasoning will be limited. Use this combo for learning, testing, and non-critical work.

Balanced combo (the standard tier list)

The standard tier list the channel recommends:

Orchestrator: GPT 5.4 ($50–$75/month) — the new king of the orchestrator slot.
Executor: Minimax M2.7 ($10–$20/month) or GPT 5.4 — pick Minimax for cost, GPT 5.4 for capability.
Auxiliary: Gemini 3 Flash (FREE) — the free Google Search grounding.

Total cost: $60–$95/month, depending on whether you route executor work to Minimax or GPT 5.4. The trade-off: you pay $50–$75/month for the orchestrator and get frontier reasoning. Use this combo for production work where the orchestrator matters.

Premium combo (the highest-capability tier list)

The highest-capability tier list the channel recommends:

Orchestrator: GPT 5.4 ($50–$75/month) — the new king.
Executor: GLM 5.1 ($7–$72/month, depending on the Z.ai plan) — the standout executor.
Auxiliary: Gemini 3 Flash (FREE) — the free Google Search grounding.

Total cost: $57–$147/month, depending on the Z.ai plan. The trade-off: you pay $72/month for GLM 5.1 on the Pro plan and get the highest-capability executor in the channel's tier list. Use this combo for coding work where the executor matters.
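
The combo totals are sums of the per-slot price ranges quoted above, which makes them easy to recompute when a price changes. A minimal sketch, using only the prices listed in this article; the dictionary layout and function name are illustrative.

```python
# (low, high) monthly price in USD, as quoted in the combos above.
PRICES = {
    "Mimo V2 Pro": (0, 0),        # free only while the promotional window lasts
    "Gemini 3 Flash": (0, 0),
    "Minimax M2.7": (10, 20),
    "GPT 5.4": (50, 75),
    "GLM 5.1": (7, 72),           # depends on the Z.ai plan
}

# Each combo is (orchestrator, executor, auxiliary).
COMBOS = {
    "budget": ("Mimo V2 Pro", "Minimax M2.7", "Gemini 3 Flash"),
    "balanced": ("GPT 5.4", "Minimax M2.7", "Gemini 3 Flash"),
    "premium": ("GPT 5.4", "GLM 5.1", "Gemini 3 Flash"),
}


def monthly_range(combo):
    """Sum the low and high ends across the three slots of a combo."""
    low = sum(PRICES[m][0] for m in COMBOS[combo])
    high = sum(PRICES[m][1] for m in COMBOS[combo])
    return low, high
```

Calling monthly_range("premium") returns (57, 147), matching the total stated above; the balanced entry here assumes Minimax in the executor slot.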

The decision tree for picking the right model on the right task

The capstone's most actionable artifact is the decision tree for picking the right model on the right task. The decision tree, expanded from §3.2:

The "I'm new to this" branch

If you're new to the agent stack and don't know where to start:

Start with the budget combo. Mimo V2 Pro (free) as orchestrator, Minimax M2.7 ($10–$20/month) as executor, Gemini 3 Flash (free) as auxiliary. Total: $10–$20/month, with the orchestrator slot accepting the capability gap in exchange for the free price.
Run the 7-day experiment. See which tasks succeed and which fail.
Upgrade to the balanced combo if needed. Move to GPT 5.4 ($50–$75/month) as orchestrator. Total: $60–$95/month, with frontier reasoning on the orchestrator slot.

The channel's framing for new users: "If you have just one memory that stores all your preferences that updates all the time, that's powerful." Same idea for the tier list: start with the cheapest tier list that works, then upgrade based on the data.

"What is your priority?"

The decision tree starts with priority. Pick the branch that matches your answer:

Reliability and production use: GPT 5.4. Consistent results, good documentation, active community support. The default orchestrator for new Hermes Agent setups.
Budget under $30/month: Minimax M2.7 or Mimo V2 Pro (while free). Accept lower success rate, run prompts multiple times, good for learning. The cheap tier list.
Coding excellence: GLM 5.1 or DeepSeek V4 Flash. Best coding performance, self-correction capabilities, worth the spend for developers.
Multimodal tasks: Gemini 3.1 Pro or Kimi 2.5. Image/video input, screen analysis, UI generation. The orchestrator with native multimodal.
Long-horizon consistency: Qwen 3.6 Plus. Preserved thinking across turns, fewer contradictions, the third orchestrator slot.
Privacy and self-hosting: Nemotron 3 Super or Step 3.5 Flash. Open-weight models, no API calls, full control. The privacy-first pick.
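
The priority branch is a lookup table, so it can live next to your routing config. The sketch below copies the picks listed above; the short priority keys and the function name are hypothetical.

```python
# Priority -> candidate models, in the order listed above.
PRIORITY_PICKS = {
    "reliability": ["GPT 5.4"],
    "budget": ["Minimax M2.7", "Mimo V2 Pro"],
    "coding": ["GLM 5.1", "DeepSeek V4 Flash"],
    "multimodal": ["Gemini 3.1 Pro", "Kimi 2.5"],
    "long_horizon": ["Qwen 3.6 Plus"],
    "privacy": ["Nemotron 3 Super", "Step 3.5 Flash"],
}


def pick_models(priority: str) -> list:
    """Return the candidate models for a priority, or raise on an unknown one."""
    try:
        return list(PRIORITY_PICKS[priority])
    except KeyError:
        known = ", ".join(sorted(PRIORITY_PICKS))
        raise ValueError(f"unknown priority {priority!r}; choose one of: {known}")
```

Keeping the table in one place means a tier list update is a one-line change rather than a hunt through prompts and scripts.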

The "I'm switching from Claude" branch

If you're currently on Opus 4.6/4.7 and considering the switch:

Default to GPT 5.4 for orchestrator work. The 63% vs 40% on the Boxmining benchmark is the empirical case.
Route executor work to Minimax M2.7 or GLM 5.1. Don't pay Opus prices for tool calls.
Keep Opus for the critical 5%. Use the API plan, not the consumer tier, for the work that justifies the spend.
Hot-swap via /model. Run the same task on Opus and GPT 5.4 to see if the capability gap is worth the cost for your specific task.

The channel's framing: "now is probably the best time to switch if you haven't done so already." The Opus regression is the channel's evidence that the consumer tier is not worth the cost; the API plan may be worth it for the critical 5%, but only on tasks that justify the spend.

"What platform are you on?"

The decision tree branches again on platform:

Hermes Agent users: the channel's preferred order is Mimo V2 Pro (official partnership, free, high-volume), GPT 5.4 (most reliable orchestrator), Minimax M2.7 (official partnership, budget-friendly), Qwen 3.6 Plus (preserved thinking for long tasks). Avoid Claude Opus (current regression).
OpenClaw users: the channel's preferred order is GPT 5.4 (most consistent), GLM 5.1 (best coding), Minimax M2.7 (trained on OpenClaw framework), Mimo V2 Pro (high-volume tasks). Avoid models with context window issues for long sessions.
Claude Code users: the channel's recommendation is to migrate to platform-agnostic tools (Kilo Code, Cline Code) because the Claude Opus regression makes vendor lock-in risky. Alternative models: GPT 5.4 via Cline Code, GLM 5.1 via Kilo Code. Keep Claude as backup only.

The "I'm running a 24/7 agent" branch

If your agent runs continuously (overnight builds, cron jobs, always-on workflows):

Default to flat-rate plans. Token plans with per-token pricing punish you for any flapping. The channel's framing: "I don't want to fix and play with my open claw all the time."
Minimax coding plan ($10–$20/month with 100 prompts / 5-hour window) is the channel's preferred default.
Avoid the Opus consumer tier for 24/7 agents. Under throttling, the published WildClaw score is a ceiling you rarely reach in practice. The API plan at $5/M input is the only way to get the actual 51% capability.
Use GLM 5.1 for coding excellence. The Z.ai plan at $7–$10/month is the cheapest credible one-shot coder the channel has tested.
Hot-swap the orchestrator vs executor. Type /model mid-session to confirm whether the orchestrator or the executor is the bottleneck. The 24/7 use case is where hot-swap matters most.

The framing: 24/7 agents are where the cost-per-task axis dominates. Flat-rate plans and cheap executors (Minimax, GLM) are the right shape. Opus is for the critical 5%, not the 24/7 default.
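
A quick break-even calculation shows why flat-rate plans suit always-on agents. This sketch counts input tokens only and ignores output-token pricing, which is a simplifying assumption; it also compares a flat fee on one model with per-token pricing on another, so treat the result as an order-of-magnitude guide. The function names are illustrative.

```python
def monthly_api_cost(input_tokens: int, usd_per_million_input: float) -> float:
    """Input-token spend only; output tokens are ignored in this sketch."""
    return input_tokens / 1_000_000 * usd_per_million_input


def break_even_input_tokens(flat_fee_usd: float, usd_per_million_input: float) -> float:
    """Monthly input tokens at which per-token billing matches the flat fee."""
    return flat_fee_usd / usd_per_million_input * 1_000_000
```

With the figures quoted above, break_even_input_tokens(20, 5) is 4,000,000: past 4M input tokens a month, $5/M billing costs more than a $20 flat plan.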

The "I want a recommendation, not a framework" branch

If you want a single answer and don't want to think about the framework:

Orchestrator: GPT 5.4 ($50–$75/month). The new king, the most reliable, the channel's default.
Executor: GLM 5.1 ($7–$10/month on Z.ai) or Minimax M2.7 ($10–$20/month on the coding plan). Pick GLM 5.1 for coding excellence, Minimax for high-volume and OpenClaw integration.
Auxiliary: Gemini 3 Flash (free) for search grounding, Mimo V2 Pro (free) for document processing. Use both, free is the right price.
Total cost: $57–$95/month, depending on the executor pick. Swapping in Mimo V2 Pro as orchestrator drops the orchestrator cost to $0 while the free window lasts, if you accept the capability gap.

The framing: this is the "just tell me what to use" answer. The framework is the long-term answer; this is the short-term answer.

The 7-day routing experiment

The capstone's loop closer is a 7-day routing experiment. The setup:

Pick a tier list from the combos above. Start with the budget combo if you're learning, the balanced combo if you're in production, the premium combo if coding excellence matters.
Configure the tier list in your harness. Set the orchestrator, executor, and auxiliary in the config file. Each takes a model id.
Use the tier list for 7 days. Run your normal workload through the harness. Log every task, every model, every cost, and every failure.
Hot-swap when a model fails. Type /model mid-session in Hermes, edit the model parameter in OpenClaw, or swap the env vars in Claude Code. Log the swap and the result.
Look at the data at the end of the week. Sort by model, by task, by cost, by failure rate. The patterns in the data are your tier list update.
Re-run quarterly. The channel's caveat on every public benchmark applies to your tier list too. Models change fast, your workload changes fast, the tier list needs to keep up.

The 7-day experiment is the difference between a theoretical tier list and a real one. The data is the evidence that the tier list matches your workload.
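
The end-of-week sort can be a short aggregation over the log. The sketch below assumes each logged row is a dict with "model", "cost_usd", and "success" keys (for example, rows read with csv.DictReader); those key names and the function name are assumptions.

```python
from collections import defaultdict


def summarize_by_model(rows):
    """Aggregate task count, failures, cost, and failure rate per model."""
    stats = defaultdict(lambda: {"tasks": 0, "failures": 0, "cost_usd": 0.0})
    for row in rows:
        s = stats[row["model"]]
        s["tasks"] += 1
        # 'success' may be 1/0, True/False, or the strings "1"/"0".
        s["failures"] += 0 if int(row["success"]) else 1
        s["cost_usd"] += float(row["cost_usd"])
    return {
        model: {**s, "failure_rate": s["failures"] / s["tasks"]}
        for model, s in stats.items()
    }
```

Grouping by a (model, task type) pair instead of by model alone is a small change and surfaces the "fails on one task type" cases.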

The "what to look for" patterns

The 7-day experiment produces patterns that are not obvious from a single model test. The patterns to look for:

The "always fails" pattern. A model that consistently fails on a specific task type is in the wrong slot. Swap it out or move it to a slot where it succeeds.
The "always succeeds but slow" pattern. A model that succeeds but is slow is a candidate for a slot where speed matters less (orchestrator) and away from a slot where speed matters (real-time executor).
The "fast but expensive" pattern. A model that is fast but expensive is a candidate for the orchestrator slot (where the cost is amortized across many turns) and away from the executor slot (where the cost compounds).
The "free but unreliable" pattern. A free model that's unreliable is still a free model. Use it for non-critical work and have a paid backup for critical work.
The "cheap and reliable" pattern. A model that's both cheap and reliable is the workhorse. Pin it as the default executor.

The framing: the 7-day experiment is the loop that turns the tier list from an untested hypothesis into a tested one. Update the tier list based on the patterns, not based on the published numbers.
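
The five patterns can be flagged mechanically once you have per-model averages. Every threshold in this sketch (10% and 90% failure rates, 60 seconds, $0.05 per task) is an arbitrary assumption to be tuned for your workload, and the function name is hypothetical.

```python
def label_pattern(failure_rate, avg_latency_s, avg_cost_usd,
                  reliable_below=0.10, failing_above=0.90,
                  slow_above_s=60.0, expensive_above_usd=0.05):
    """Return the name of the first matching pattern from the list above."""
    reliable = failure_rate <= reliable_below
    if failure_rate >= failing_above:
        return "always fails"
    if avg_cost_usd == 0 and not reliable:
        return "free but unreliable"
    if reliable and avg_latency_s > slow_above_s:
        return "always succeeds but slow"
    if avg_latency_s <= slow_above_s and avg_cost_usd > expensive_above_usd:
        return "fast but expensive"
    if reliable and avg_cost_usd <= expensive_above_usd:
        return "cheap and reliable"
    return "no clear pattern"
```

The order of the checks matters: a model that fails almost everything is labeled "always fails" even if it is also free or fast.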

Platform-specific recommendations

The capstone's final artifact is the platform-specific recommendations. The channel's recommendations, expanded from §3.2:

For Hermes Agent users

Best models (in order):

Mimo V2 Pro — official partnership, free during the promotional window, high-volume king. Best for learning, testing, and non-critical work.
GPT 5.4 — most reliable orchestrator. The default for new Hermes Agent setups.
Minimax M2.7 — official partnership, budget-friendly. The executor pick.
Qwen 3.6 Plus — preserved thinking for long tasks. The long-horizon orchestrator.

Avoid: Claude Opus (current regression).

For OpenClaw users

Best models (in order):

GPT 5.4 — most consistent orchestrator.
GLM 5.1 — best coding executor.
Minimax M2.7 — trained on the OpenClaw framework, the executor pick.
Mimo V2 Pro — high-volume tasks, free during the promotional window.

Avoid: models with context window issues for long sessions (the "lost in the middle" trap from §3.1).

For Claude Code users

Recommendation: migrate to platform-agnostic tools (Kilo Code, Cline Code) because the Claude Opus regression makes vendor lock-in risky.

Alternative models: GPT 5.4 via Cline Code, GLM 5.1 via Kilo Code. Keep Claude as backup only.

The "configure the slots" recipe

Whichever platform you're on, the slot configuration is the same. The recipe:

Open the config file. In Hermes, it's config.yaml (open it with, for example, nano config.yaml). In OpenClaw, it's config.yml. In Claude Code, it's settings.json.
Set the orchestrator model. Pick from the channel's tier list: GPT 5.4, Gemini 3.1 Pro, Qwen 3.6 Plus, or Kimi 2.5.
Set the executor model. Pick from the channel's tier list: GLM 5.1, Minimax M2.7, DeepSeek V4 Flash, or Nemotron 3 Super.
Set the auxiliary model. Pick from the channel's tier list: Gemini 2.5 Flash, Gemini 3 Flash, or Mimo V2 Pro.
Set the API key and base URL for each model. Each model has its own API key and base URL. The OpenRouter endpoint works as a single endpoint for all of them.
Restart the harness. The config change takes effect on restart.
Test with a 3-step task. Confirm the routing is working as expected.
Hot-swap via /model. Test the hot-swap mechanic to confirm you can change models mid-session.

The framing: the slot configuration is the same across platforms. The model ids and API keys differ, but the architecture is identical. Pick the right model for the right slot, set the config, test, and you're running.
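
Before restarting the harness, it is worth checking that all three slots are fully specified. The sketch below validates a generic slot mapping; the key names ("model", "api_key_env", "base_url") are hypothetical and do not reflect the actual config schema of Hermes, OpenClaw, or Claude Code, so adapt them to your harness.

```python
REQUIRED_KEYS = {"model", "api_key_env", "base_url"}
SLOTS = ("orchestrator", "executor", "auxiliary")


def validate_slots(config: dict) -> list:
    """Return a list of problems; an empty list means every slot is complete."""
    problems = []
    for slot in SLOTS:
        entry = config.get(slot)
        if entry is None:
            problems.append(f"{slot}: missing")
            continue
        for key in sorted(REQUIRED_KEYS - set(entry)):
            problems.append(f"{slot}: missing '{key}'")
    return problems
```

Running the check before the 3-step test task separates "the config is incomplete" from "the routing is wrong", which otherwise look the same from the outside.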

The "models to watch" list

The capstone's forward-looking artifact is the "models to watch" list. The channel's list:

Claude Mythos (Mefos) — enterprise beta, expected to be more powerful than Opus 4.6, may be enterprise-only, timeline 2026. Worth watching if you're on the consumer tier.
Kimi 2.6 — previewed, expected to improve swarm agents, strength in frontend/UI generation, timeline "soon." Worth watching if you're on the Kimi ecosystem.
GPT 5.5 — rumored, expected to be incremental improvements on GPT 5.4, continued reliability, timeline unknown. Worth watching if you're on the GPT ecosystem.

The market trends the channel flags:

Chinese models rising. DeepSeek, Minimax, Kimi, Qwen, GLM are competing strongly on price-per-task. The price war is a feature for builders.
Price increases. Models raising prices as Claude degrades (Z.ai's GLM 5.1 jump from $30 to $72/mo, Anthropic's plan throttling, the Anthropic Mythos strategy). Plan for the price ceiling to keep moving.
Specialization. Models optimizing for specific use cases (GLM 5.1 for coding, Kimi 2.5 for swarms, Qwen 3.6 Plus for long-horizon consistency). The era of "one model for everything" is over.
Open-weight growth. More self-hostable options (Nemotron 3 Super, Step 3.5 Flash, Qwen 3.5, the upcoming M3 open-weight drop). The privacy-first pick is getting better.
Enterprise split. Premium models for enterprise, budget for consumers. The channel's read: Mythos and Mephisto are enterprise-only, the consumer tier gets the budget models.

The "don't wait for the next model" rule

The capstone's most important forward-looking rule: don't wait for the next model. The channel's read: the next model is always a few months away, and the models you have now are the models you can use now.

The reasoning:

Models change fast, but not instantly. A rumored GPT 5.5 might be 3 months away. A previewed Kimi 2.6 might be 2 months away. A Mythos timeline is unknown.
The current models are good enough for the current workload. GPT 5.4, GLM 5.1, Minimax M2.7, Mimo V2 Pro, Gemini 3 Flash — these are credible models for the current tasks. Waiting for the next model is procrastination.
The 7-day experiment is the loop closer. Run the experiment on the current models, get the data, update the tier list. The next model is a future-tier-list update, not a present-day reason to wait.

The framing: the channel's coverage is forward-looking, but the channel's recommendation is to act on the current models. The tier list is a snapshot. The 7-day experiment is the loop that keeps it current. Don't wait.

Try it yourself

Pick a tier list from the combos above. Start with the budget combo if you're learning, the balanced combo if you're in production, the premium combo if coding excellence matters.
Configure the tier list in your harness. Set the orchestrator, executor, and auxiliary in the config file. Each takes a model id.
Run a 3-step task with the default routing. Pick a task that involves planning (orchestrator), tool calls (executor), and web search (auxiliary). Note which model handles which step in the run log.
Hot-swap mid-task. When the executor fails on a step, type /model to swap to a different executor and rerun. Confirm the swap took via /status.
Audit the run log. After a 3-step task, look at which model produced which step. If the orchestrator is doing executor work, your routing config is wrong. If the executor is doing planning, same.
Run the WildClaw benchmark on your tier list. Clone the suite, run it on your three models, log success rate, cost-per-task, and latency. Compare to the published scores from §3.3.
Add a custom test that mirrors your real workload. The capstone's most important step. Without a custom test, you're benchmarking in a vacuum.
Run a 7-day routing experiment. Use the tier list for a week, log every task, every model, every cost, every failure. Look at the data at the end of the week.
Re-run quarterly. Models change fast, your workload changes fast, the tier list needs to keep up. The channel's caveat on every public benchmark applies to your tier list too.
Share the tier list with the community. The channel's open-sourcing of WildClaw is itself a strategic choice. Sharing your tier list (and the data behind it) helps the community verify, push back, and improve.

The "publish your tier list" recipe

The capstone's final artifact is a published tier list. The recipe:

Run the WildClaw benchmark on your three models. Log success rate, cost-per-task, latency, and instruction-following score for each.
Add a custom test that mirrors your real workload. Log the same four metrics on the custom test.
Classify each model into a slot. Use the role test from Step 4.
Compare to the channel's tier list. Note the agreements and divergences.
Run a 7-day routing experiment. Log every task, every model, every cost, every failure.
Update the tier list based on the data. The 7-day experiment is the loop closer.
Publish the tier list. Post it on the channel's community, on a blog, on X. The community will verify, push back, and improve.
Re-run quarterly. Models change fast, your workload changes fast, the tier list needs to keep up.

The framing: a published tier list is a commitment to be wrong in public. The community will point out where the data is incomplete, where the custom test is biased, where the routing is wrong. That's the value. A published tier list gets better over time because the community helps.

Common pitfalls

Trusting the channel's tier list as a permanent leaderboard. The channel's tier list is current as of the video. Models change fast, your workload changes fast, the tier list needs to keep up. The capstone is the loop closer, not the channel's word.
Skipping the custom test. The single most important rule for tier-list building: add at least one task that mirrors your real production workload. Without it, you're benchmarking in a vacuum. Vendors will optimize for the public suite; the custom test is your insurance.
Benchmarking in a vacuum. The WildClaw suite is open source, so vendors will start gaming it. The single most common failure mode is trusting the published number and then finding it doesn't generalize to your workload, which is why the custom test has to mirror real production work.
Picking one model for both roles. The orchestrator and executor roles need different models. Match the slot to the model's strength. GPT 5.4 for orchestrator, Minimax M2.7 for executor, Mimo V2 Pro for auxiliary.
Trusting the auxiliary slot for primary work. Auxiliary models are for specialized tasks — search, grounding, document processing. Don't route orchestrator or executor work to Gemini 3 Flash or Mimo V2 Pro.
Becoming dependent on Mimo's free window. Build the skill library, learn the workflows, have a backup model ready. The migration is a config change, not a rewrite.
Paying Opus prices for tasks that don't need it. Mimo V2 Pro (while its free window lasts) and Gemini 3 Flash are the free alternatives for high-volume work. If your bill doesn't reflect the auxiliary slot, you're paying for capacity you don't need.
Running the 7-day experiment on synthetic tasks. Use real production work. The whole point of the experiment is to see how the tier list performs on your actual workload. Synthetic tasks don't generalize.
Skipping the platform-specific recommendations. The best model for Hermes Agent is not the best model for OpenClaw is not the best model for Claude Code. Pick the right model for the right platform.
Treating the capstone as a one-time exercise. Models change weekly, your workload changes monthly. Re-run the WildClaw benchmark, re-do the routing experiment, re-publish the tier list. The capstone is a loop, not a milestone.
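
Several of the pitfalls above come down to keeping the slot-to-model mapping in one place, so that a swap is a config change and a backup is always listed. A minimal sketch, assuming a hypothetical ROUTING table and pick_model helper; this is not the configuration format of Hermes, OpenClaw, or Claude Code, and the choice of Gemini 3 Flash as the auxiliary backup is an assumption.

```python
# Ordered candidates per slot: the first entry is the primary, the rest are backups.
ROUTING = {
    "orchestrator": ["GPT 5.4"],
    "executor": ["Minimax M2.7"],
    "auxiliary": ["Mimo V2 Pro", "Gemini 3 Flash"],  # backup for when the free window ends
}


def pick_model(slot, unavailable=frozenset()):
    """Return the first model listed for the slot that is not marked unavailable."""
    for model in ROUTING[slot]:
        if model not in unavailable:
            return model
    raise LookupError(f"no available model for slot {slot!r}")


print(pick_model("auxiliary"))                                # primary pick
print(pick_model("auxiliary", unavailable={"Mimo V2 Pro"}))   # falls back to the backup
```

Migrating off a model then means editing one list entry, and a slot with a single candidate is a visible reminder that no backup has been chosen yet.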

The "I just want the answer" antipattern

The capstone's most common antipattern: "I just want the answer, not the framework." The risk:

The answer is a snapshot. The next model will change it.
The answer is the channel's. Your workload is different.
The answer is the easy part. The framework is the part that keeps the answer current.

The framing: the framework is the long-term asset. The answer is the short-term artifact. If you take only the answer, you'll be re-doing this capstone in a month. If you take the framework, you'll be re-doing it quarterly and the tier list will keep matching your workload.

The "I have a tight budget, what do I cut?" decision

If you can't afford the full tier list, here is the channel's order of priority for what to keep:

Keep the orchestrator. GPT 5.4 ($50–$75/month) is the most important slot. The orchestrator handles planning, reasoning, and recovery from executor failures. Without a good orchestrator, executor failures compound.
Keep at least one cheap executor. Minimax M2.7's $10–$20/month coding plan is cheap enough to keep. If you can't afford it, route executor work to Mimo V2 Pro while it's free.
Drop the auxiliary if needed. The auxiliary slot (search, grounding, document processing) is the cheapest to cut. Gemini 3 Flash is free, so cutting it saves complexity rather than money. If you need a simpler setup, drop the auxiliary and route the auxiliary work to the orchestrator.
Use the platform-agnostic tools. Kilo Code and Cline let you hot-swap mid-task. Use the budget tier list on Kilo Code and swap to the production tier list when the task is critical.

The framing: the orchestrator is the load-bearing slot. The executor is the workhorse. The auxiliary is the optimizer. If you have to cut, cut the auxiliary first.
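
The monthly figures quoted above can be added up per slot to see what each cut actually saves. A small arithmetic sketch, assuming the price ranges stated in this section; the SLOT_COST_USD table and the monthly_range helper are hypothetical names.

```python
# (low, high) monthly cost in USD per slot, taken from the figures in this section.
SLOT_COST_USD = {
    "orchestrator": (50, 75),  # GPT 5.4
    "executor": (10, 20),      # Minimax M2.7 coding plan
    "auxiliary": (0, 0),       # Gemini 3 Flash (free)
}


def monthly_range(slots):
    """Sum the low and high ends of the monthly cost for the chosen slots."""
    low = sum(SLOT_COST_USD[slot][0] for slot in slots)
    high = sum(SLOT_COST_USD[slot][1] for slot in slots)
    return low, high


print(monthly_range(["orchestrator", "executor", "auxiliary"]))  # (60, 95)
print(monthly_range(["orchestrator"]))                           # (50, 75)
```

At these prices the only paid slots are the orchestrator and the executor, so the fallback of routing executor work to Mimo V2 Pro while it's free is what brings the bill down to the orchestrator alone.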

Sources

All nine videos referenced across §3.1–§3.4 are aggregated here for the capstone. Every concrete claim in this article is grounded in one of these videos, and the capstone is the synthesis.

Top AI Models for Hermes Agent (Tier List) — 8,107 views · video_id: Af7Fg1m7hRw · cited: the orchestrator/executor/auxiliary taxonomy, the four orchestrators (GPT 5.4, Gemini 3.1 Pro, Qwen 3.6 Plus, Kimi 2.5), the four executors (GLM 5.1, Minimax M2.7, DeepSeek V4 Flash, Nemotron 3 Super), the auxiliary tier (Gemini 2.5 Flash, Gemini 3 Flash, Mimo V2 Pro, Elephant Alpha, Trinity Large Preview), the /model hot-swap mechanic
Best Model for Openclaw (WildClaw Benchmarks!) — 4,574 views · video_id: 31Ij4Cum5tg · cited: the WildClaw benchmark setup, the 51% Opus / $80, the GPT 5.4 close-second at quarter cost, the Mimo V2 $26 / 6-day free window, the Minimax 2.7 two-month internal use, the Grok 94min vs ~500min speed data, the GLM 5.1 mid-test caveat, the "use Opus only if you have an uncapped coding plan" rule
Claude 1M Context: What No One Tells You.. — 399 views · video_id: m97uC11VDtg · cited: 1M context spec, $5/$25 and $3/$15 per-million pricing, 15% compaction drop, 78.3% MRCR v2 score, "lost in the middle" framing, 200K default recommendation
Minimax M2.7 is INSANELY GOOD! (Full Review) — 31,049 views · video_id: --uxieT5J9Y · cited: the trained-on-OpenClaw framing, the 1/16th cost ratio, the executor-slot defense
Anthropic pulled a fast one on us! (Opus plans LIMITED) — 24,059 views · video_id: MkabEkgGpjA · cited: the plan-limiting controversy, the consumer tier throttling, the Mythos / Mephisto / Glasswing theories
Claude Opus is ACTUALLY UNUSABLE — 21,675 views · video_id: Cc2Vvra9F_c · cited: the 40% vs 63% Boxmining benchmark, the four-category failure pattern, the community signal
Anthropic admits fault (Claude limits to be INCREASED) — 9,673 views · video_id: WiAx9sPw69U · cited: the Lydia post, the "way faster than expected" admission
Claude Fable 5 + Loop Designs is TOO STRONG! (Full Tests) — 3,482 views · video_id: 8De7s6WG7Bo · cited: the one case where Claude wins, the loop syntax, the Fable 5 / Opus 4.8 demotion
Glm 5.1 Test: Making a Retro Style Game — 98 views · video_id: 3N0Pe3dkwBE · cited: the GLM 5.1 37% score, the zeros due to timeouts
Supabase query — SELECT video_id, title, views, summary_content, summary_key_takeaways, transcript_content FROM public.videos WHERE video_id = ANY(ARRAY['Af7Fg1m7hRw','31Ij4Cum5tg','m97uC11VDtg','--uxieT5J9Y','MkabEkgGpjA','Cc2Vvra9F_c','WiAx9sPw69U','8De7s6WG7Bo','3N0Pe3dkwBE']); against project ttxdssgydwyurwwnjogq. All 9 video_ids referenced across the course have has_transcript = true and has_summary = true; the load-bearing claims (51% Opus / $80, GPT 5.4 quarter cost, Mimo $26, Grok 94min, $5/$25 pricing on Opus 4.6, 78.3% MRCR v2, 15% compaction drop, the orchestrator/executor/auxiliary taxonomy) are all sourced directly from transcript_content and summary_content for the respective video_ids.
public.ai_models — confirmed rows include xiaomi-mimo (Mimo V2 Pro), grok (xAI), glm-5-1 (Zhipu AI), minimax (MiniMax M2.7), openai (GPT-5.4), claude-opus-4-6, claude-opus-4-7, claude-sonnet-4-6, qwen-3-6-plus (Alibaba), kimi-2-5 (Moonshot AI), qwen-3-5 (Alibaba). Vendor names used in the article cross-match these rows. The pricing_info column is null for every row pulled — token-level rates cited in the article come from the video transcripts, not from the DB.
public.ai_updates — searched for entries matching tier list / wildclaw / orchestrator / executor; closest cross-grounding is AI Briefing 2026-04-23 (Qwen3.6-27B dense-beats-397B-MoE benchmarks), which supports the Qwen 3.6 Plus orchestrator-slot framing in §3.2.