The four-axis framework - AI Model Tier List

This is the foundation. Before you look at any ranking, you need the four axes the channel scores every model on: speed vs quality, context length, cost-per-task, and willingness to follow instructions. Every recommendation in §3.2–§3.4 is a routing rule that lives on top of this framework, and the WildClaw benchmark in §3.3 is the empirical test of the framework itself.

This article walks through each axis in turn, then zooms out to the question the syllabus actually asks: how do you build a tier list that matches your workload?

What you'll learn

The four axes the channel scores every model on, in order of how much they affect a real agent stack: cost-per-task, speed vs quality, willingness to follow instructions, and context length.
Why a 1M-token context window is real but mostly oversold, and where the "lost in the middle" failure mode shows up — the channel's evidence from the Claude 1M Context video.
The channel's "every failed agent turn is billed twice" framing for the speed-vs-quality axis, and how it interacts with the 5-hour rolling token plans the Chinese models ship with.
Why "willingness to follow instructions" is the axis that quietly decides whether an agent stack scales or collapses, and how to test for it on a model you've never used.
The decision rule for picking the right context-window size on a per-task basis, anchored in the channel's "default to 200K, not 1M" recommendation.

The four-axis model-choice framework

When the BoxminingAI team reviews a model, they don't score it on a single leaderboard number. From the reviews and the tier list, four axes come up over and over:

Speed vs quality. A cheaper model that fails 20% of the time can cost more than a pricier model that fails 2%. The channel's framing is that every failed agent turn is billed twice: once for the wasted call, once for the human (or the orchestrator) that has to recover.
Context length. Big context windows are real and getting cheaper, but the channel has receipts that "1M" is mostly a marketing number. See the next section.
Cost-per-task. Not cost-per-token. The channel benchmarks by task: how many dollars to read N emails, run the WildClaw suite, or finish a one-shot game build. Token math is downstream of that.
Willingness to follow instructions. "Stubborn" models waste agent turns. If a model ignores explicit formatting instructions, the orchestrator spends its own tokens re-issuing them, and the cost compounds.

These four axes are why the channel's top tier list splits models into three roles: orchestrator (the brain that plans and reasons across many turns), executor (the hands that call tools reliably, cheaply, and fast), and auxiliary (support for a specific job — search, grounding, document processing).

Where the four axes came from

The four axes are not arbitrary. They come from the channel's specific failure modes across the model reviews:

Speed vs quality is the lesson from the Opus regression. Claude Opus 4.6 used to be the orchestrator; after the regression, it became a model that "performs like absolute dog trash" on the "Claude Opus is ACTUALLY UNUSABLE" benchmark. The 40% success rate at $200/month is the cost-vs-quality inversion the channel uses to frame the axis.
Context length is the lesson from the Claude 1M Context video. The 1M context window is real, but the "lost in the middle" failure mode, the latency tax, and the attention dilution problem all mean the headline number doesn't translate to a 5x capability boost. The channel's read: 200K is the right default, 1M is for the specific use cases the host names.
Cost-per-task is the lesson from the WildClaw benchmark. Opus at 51% and $80 per suite versus GPT 5.4 in a close second place at roughly $20 per suite is the cost-per-task inversion that makes GPT 5.4 the default orchestrator.
Willingness to follow instructions is the lesson from the multi-model reviews. Models that ignore explicit formatting instructions, drop tool calls, or hallucinate steps waste agent turns. The double-billed failed turn rule is the cost version of this axis.

The take-away: the four axes are not theoretical. They are the axes the channel's coverage actually scores on, derived from the failure modes the channel has documented.

The "first axis to optimize" rule

When you're picking a model for the first time, the channel's implicit rule is: optimize the cost-per-task axis first, then speed-vs-quality, then willingness-to-follow-instructions, then context length.

The reasoning:

Cost-per-task is the most concrete axis to measure. You can run the WildClaw suite, log the dollars, and have a number. The other axes are harder to measure objectively.
Speed vs quality is the second axis because it's the failure-rate proxy. A model that fails often is effectively slow, because every recovery turn adds calls and wall-clock time.
Willingness to follow instructions is the third axis because it requires a custom test. The 5-step instruction-following test is a 5-minute benchmark, but you have to design the prompt for your workload.
Context length is the last axis because the default is fine for most workloads. You only need 1M for the specific use cases the host names. Default to 200K.

The rule is not absolute. For long-horizon agent work, willingness-to-follow-instructions is the most important axis (a stubborn model wastes the speed advantage). For real-time chat, speed-vs-quality is the most important (a slow model kills the user experience). Adjust the priority based on your workload.

The "every failed agent turn is billed twice" rule

The most useful framing the channel uses for the speed-vs-quality axis is the "double-billed" rule. When a cheap model fails on an agent turn, you pay for the failed call and you pay for the orchestrator or human that has to recover. On a multi-step agent run, a 20% failure rate at the executor layer doesn't cost 20% more — it costs 2–3x more once you account for the recovery traffic. The implication: don't optimize the executor on raw cost; optimize it on cost-per-successful-task.
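
A minimal sketch of the double-billed arithmetic, under assumptions that are not the channel's: failed executor calls are retried until one succeeds, attempts are independent, and every failure triggers one recovery call. The function name and the 5-unit recovery cost are illustrative, chosen to show how a 20% failure rate can land inside the 2–3x range.

```python
def cost_per_successful_step(call_cost: float, failure_rate: float,
                             recovery_cost: float) -> float:
    """Expected spend to get one successful step, retrying until success.

    With independent attempts, expected attempts = 1 / (1 - p) and
    expected failures = p / (1 - p). Each failure is billed twice:
    the wasted call (counted in the attempts) plus one recovery call.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    expected_attempts = 1.0 / (1.0 - failure_rate)
    expected_failures = failure_rate / (1.0 - failure_rate)
    return call_cost * expected_attempts + recovery_cost * expected_failures


# Hypothetical units: an executor call costs 1, a recovery call costs 5.
baseline = cost_per_successful_step(1.0, 0.0, 5.0)
flaky = cost_per_successful_step(1.0, 0.2, 5.0)
print(f"cost multiplier at 20% failure: {flaky / baseline:.2f}x")
```

Changing recovery_cost moves the multiplier, which is the point: the executor's sticker price is only one term in the sum.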

The channel's own routing on the BoxminingAI stack reflects this. The creator runs Opus on roughly 5% of multi-step logic — the critical reasoning that justifies the $5/M input token rate — and routes the other 95% to Minimax or Kimi at ~5% of Claude's cost. The split is deliberate: 5% on Opus buys you the orchestrator work that recovers from executor failures, which is where the double-billing concentrates.
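
As a rough check on that split, the blended cost can be computed relative to an all-Opus stack. This sketch assumes every call is the same size, which the source does not claim, and the function name is hypothetical.

```python
def blended_cost_ratio(premium_share: float, cheap_relative_cost: float) -> float:
    """Cost of a split stack relative to sending every call to the premium model.

    premium_share: fraction of calls routed to the premium model (0..1).
    cheap_relative_cost: cheap model's cost as a fraction of the premium cost.
    """
    if not 0.0 <= premium_share <= 1.0:
        raise ValueError("premium_share must be in [0, 1]")
    cheap_share = 1.0 - premium_share
    return premium_share * 1.0 + cheap_share * cheap_relative_cost


# 5% of calls on Opus, 95% on a model at ~5% of Opus's cost (figures from the text).
ratio = blended_cost_ratio(premium_share=0.05, cheap_relative_cost=0.05)
print(f"blended cost: {ratio:.2%} of an all-Opus stack")
```

Under these assumptions the split stack comes in just under a tenth of the all-Opus bill, and about half of that spend is still the Opus slice.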

The cost-per-task framing

Per-token pricing is misleading because the task is the unit of value. A model that costs twice as much per token but finishes a task in fewer than half the calls is cheaper in the aggregate. The channel's framing in the WildClaw benchmark is concrete: Opus scored 51% at $80 per suite, GPT 5.4 scored close to that at roughly a quarter of the cost. The per-token rate alone would not reveal that gap; per task, GPT 5.4 wins on the "dollar per successful task" axis that the channel actually tracks.

The implication for tier-list building: when you score a model, score it on cost-per-successful-task against your workload, not on the published per-token rate. The published rate is upstream; the task rate is what you pay.
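
A short sketch of scoring on cost-per-successful-task, using the WildClaw figures quoted above for Opus (51%, $80 per suite) and the approximate $20 per suite for GPT 5.4. The source gives no exact GPT 5.4 score, so rather than guess one, the example computes the break-even success rate a $20 model would need. Function names are hypothetical.

```python
def cost_per_success(suite_cost: float, success_rate: float) -> float:
    """Dollars spent per successfully completed suite-equivalent of work."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return suite_cost / success_rate


def break_even_success_rate(ref_cost: float, ref_rate: float,
                            candidate_cost: float) -> float:
    """Success rate at which the candidate ties the reference per success."""
    return ref_rate * candidate_cost / ref_cost


opus = cost_per_success(80.0, 0.51)
threshold = break_even_success_rate(80.0, 0.51, 20.0)
print(f"Opus: ${opus:.2f} per successful suite-equivalent")
print(f"A $20 model is cheaper per success above a {threshold:.2%} score")
```

Any score near Opus's clears that threshold by a wide margin, which is why the per-suite dollars decide the comparison.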

Context length: 1M is real, but mostly oversold

Anthropic's 1M-token context on Opus 4.6 and Sonnet 4.6 ships at the same per-token price as the 200K window: $5/$25 per million input/output tokens for Opus and $3/$15 for Sonnet, with no rate-limit penalty and roughly 600 images or PDF pages per request. That's genuinely a big number.

The wins. Anthropic reports a 15% drop in compaction events in long-running agent sessions; fewer forced context-compactions means the agent holds state longer. Opus 4.6 also scored 78.3% on MRCR v2, the highest among frontier models at that length, which suggests the model actually uses what's inside the window rather than just fitting it. For full-repo analysis or contract review, you can skip the RAG pipeline (no chunking, no embeddings, no retrieval) and just load the files.

The hidden costs. Three of them, and they bite in production:

Latency scales with token count. Loading 800K for a quick question makes chat and IDE workflows feel sluggish. The wall-clock cost is the part most demos skip.
"Lost in the middle". Pages buried in a 500-page dump get under-weighted versus the first and last sections. The model's attention is real but not uniform.
Attention dilution. Fill the window with noisy files and the model hunts for signal. The 1M headline becomes a liability.

The real bill. Pricing is linear. One 1M-token Opus input call is $5; twenty of them in an agentic loop is $100 in input alone, before output. And there's a soft cost: when the window fits everything, you stop curating, and a carefully constructed 50K-token prompt will beat a sloppy 1M dump.

How the channel recommends using it. Frontload and backload critical instructions, use CLAUDE.md and XML tags as a navigable table of contents, and start at the smallest effective context — only scale up when output quality actually demands it. For email, brainstorming, and short Q&A, 200K is still plenty.

The per-call pricing, the MRCR v2 score, and the compaction-event numbers in this section all come from the Claude 1M Context: What No One Tells You.. video.

The 200K-vs-1M decision rule

The channel's decision rule is concrete enough to use as a routing rule:

Default to 200K. For email, brainstorming, single-turn code review, PR summaries, and any interactive chat workflow, the 200K window is plenty. The latency tax on 1M is real, and the compaction benefit doesn't apply to short sessions.
Scale to 1M only for full-repo analysis or multi-contract legal review. These are the cases where RAG is the current workaround and the workaround is causing friction. Loading the whole repo and asking Opus to refactor is the actual use case.
Also use 1M for long-running agent sessions. The 15% compaction-event benefit is specific to long-running agents (Claude Code sessions, in particular) where state accumulates over many turns. If you're not running a long-running agent, the benefit doesn't apply.
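
The decision rule above can be sketched as a small routing function. The Task type, the task-kind labels, and the constant names are all hypothetical; only the 200K default and the named 1M cases come from the text.

```python
from dataclasses import dataclass

DEFAULT_WINDOW = 200_000
LARGE_WINDOW = 1_000_000

# Hypothetical labels for the 1M use cases the rule names.
LARGE_WINDOW_KINDS = {"full_repo_analysis", "multi_contract_review"}


@dataclass
class Task:
    kind: str                  # e.g. "email", "full_repo_analysis"
    long_running_agent: bool   # state accumulates over many turns


def pick_context_window(task: Task) -> int:
    """Default to 200K; scale to 1M only for the named cases."""
    if task.kind in LARGE_WINDOW_KINDS or task.long_running_agent:
        return LARGE_WINDOW
    return DEFAULT_WINDOW


print(pick_context_window(Task("email", long_running_agent=False)))
print(pick_context_window(Task("full_repo_analysis", long_running_agent=False)))
```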

The rule's load-bearing point: the per-token price is the same, but the per-call cost is not. A 1M-token Opus input call is $5; a 200K-token call is $1. Twenty 1M calls in a loop is $100; twenty 200K calls is $20. Agent loops multiply the difference.
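
The same arithmetic as a few lines of code, using the $5 per million Opus input price from the text. The function name is hypothetical, and output tokens are left out on purpose.

```python
def loop_input_cost(tokens_per_call: int, price_per_million: float,
                    calls: int) -> float:
    """Input-side dollars for an agent loop that sends the same context each call."""
    return tokens_per_call / 1_000_000 * price_per_million * calls


OPUS_INPUT_PRICE = 5.0  # dollars per million input tokens, as quoted in the text

for tokens in (200_000, 1_000_000):
    one_call = loop_input_cost(tokens, OPUS_INPUT_PRICE, 1)
    twenty_calls = loop_input_cost(tokens, OPUS_INPUT_PRICE, 20)
    print(f"{tokens:>9,} tokens: ${one_call:.2f} per call, "
          f"${twenty_calls:.2f} for 20 calls")
```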

The "lost in the middle" trap

The most actionable piece of advice in the 1M context video is the "frontload and backload" rule. Language models recall information at the beginning and end of the context better than stuff buried in the middle. If the critical detail is on page 200 of a 500-page document dump, there's a real chance the model under-weights it. The fix is structural:

Put the load-bearing instructions at the top of the prompt.
Repeat the critical constraints at the bottom.
Treat the middle as supporting evidence, not as instruction.

This is the same rule that applies to long system prompts, but it bites harder at 1M because there's more "middle" to lose things in. A 50K-token prompt with the rule at the top is better than a 1M-token dump with the rule at position 800K.
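
One way to apply the frontload-and-backload rule is to assemble the prompt programmatically so the instructions always bracket the documents. This is a sketch, not the channel's code; the tag names and the function name are assumptions.

```python
def build_long_prompt(instructions: list[str], documents: dict[str, str]) -> str:
    """Put instructions first, documents in the middle, and repeat the rules last."""
    rules = "\n".join(f"- {rule}" for rule in instructions)
    docs = "\n".join(
        f'<document name="{name}">\n{text}\n</document>'
        for name, text in documents.items()
    )
    return (
        f"<instructions>\n{rules}\n</instructions>\n"
        f"<documents>\n{docs}\n</documents>\n"
        f"<reminder>\n{rules}\n</reminder>"
    )


# Hypothetical inputs for illustration.
prompt = build_long_prompt(
    ["Answer only from the documents.", "Respond in JSON."],
    {"contract_a.txt": "Clause 1: payment is due within 30 days."},
)
print(prompt)
```

Named document tags also give the model a navigable table of contents, so the middle of the prompt stays reference material instead of competing with the instructions.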

When 1M context is the right answer

The channel's "use 1M for X" list is specific. The right use cases:

Full-repo analysis. Load every file in a large repo and ask the model to refactor or audit. The 1M context window lets you skip the RAG pipeline entirely. The 78.3% MRCR v2 score means the model actually uses what's inside the window.
Multi-contract legal review. Load 30 contracts and ask for clause deltas. The 1M context window lets you skip the chunking step. The 15% compaction drop means the analysis holds state across many turns.
Long-doc research. Load multiple academic papers, large documentation sets, or a full codebase's worth of issue threads. The 1M context window lets you skip the embedding step.
Long-running agent traces. A Claude Code session that runs for hours will accumulate state. The 15% compaction drop is specific to this use case.

The wrong use cases:

Email drafting, brainstorming, short Q&A. 200K is plenty. The latency tax on 1M is real and the compaction benefit doesn't apply.
Interactive chat workflows. A 1M context window makes chat feel sluggish. The user is waiting.
Real-time IDE assistance. Same as chat — the latency tax is the deal-breaker.
Single-turn code review. 200K is plenty. The output is short and the context is small.

The take-away: 1M is a batch tier, not a chat tier. The 15% compaction drop is for long-running agents, the 78.3% MRCR v2 score is for needle-in-haystack tasks, and the 600 images / PDF pages per request is for visual-heavy workflows. For everything else, 200K is the right default.

The "skip RAG" decision rule

The 1M context pitch includes a "skip RAG" framing: "with 1 million tokens you can actually just load the whole thing. You don't need to chunk it by bits and you don't really need to maintain any pipeline" (Ron, m97uC11VDtg transcript). The decision rule for whether to skip RAG:

Skip RAG if the entire corpus fits in 1M tokens, the corpus is loaded infrequently (one-shot analysis, not per-query), and the corpus doesn't change often.
Keep RAG if the corpus is larger than 1M tokens, the corpus is loaded per query (search, semantic queries, RAG-as-a-service), or the corpus changes daily.

The framing the channel uses: 1M replaces RAG for the specific use cases the host names. For production retrieval at scale, keep the RAG pipeline. Loading 1M tokens per query is not cheaper than a vector lookup.
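
The skip-or-keep rule reduces to three checks. A minimal sketch follows; the function and parameter names are hypothetical, and it ignores the tokens the prompt and the output themselves consume.

```python
def should_skip_rag(corpus_tokens: int, loaded_per_query: bool,
                    changes_daily: bool, window_tokens: int = 1_000_000) -> bool:
    """True when loading the whole corpus into context is a reasonable choice."""
    fits_in_window = corpus_tokens <= window_tokens
    return fits_in_window and not loaded_per_query and not changes_daily


# Hypothetical workloads.
print(should_skip_rag(600_000, loaded_per_query=False, changes_daily=False))    # one-shot review
print(should_skip_rag(600_000, loaded_per_query=True, changes_daily=False))     # search endpoint
print(should_skip_rag(4_000_000, loaded_per_query=False, changes_daily=False))  # corpus too large
```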

Willingness to follow instructions — the axis that quietly decides everything

The fourth axis is the one the channel talks about least but that matters most in practice. A "stubborn" model, one that ignores explicit formatting instructions, drops tool calls, or hallucinates steps that weren't requested, wastes agent turns. Each wasted turn is another model call, another orchestrator recovery, and more billable tokens.

The test the channel uses is simple: give the model a multi-step instruction with explicit formatting requirements (e.g., "respond in JSON with the keys step, tool, args") and count how many of the formatting rules it follows on the first try. A model that ignores two of the five rules is in the executor slot. A model that ignores all five is not in your stack.

This is the axis that quietly explains a lot of the tier-list calls. Opus 4.6 was demoted because the channel's own benchmark showed it executing the wrong phase in plan mode — that's a willingness-to-follow-instructions failure, not a capability failure. Sonnet 4.6 stopped following HTML formatting instructions on the same launch — same axis. GLM 5.1 made the executor slot because it "asked the smarter questions" and went step-by-step rather than yolo'ing the brief.

The implication for tier-list building: don't skip the instruction-following test. It's the cheapest benchmark you'll ever run and the one that correlates most with whether the model will work in your stack.

A concrete instruction-following test

The test the channel implicitly uses is a 5-step prompt with explicit formatting requirements. The recipe:

Define a multi-step task. Example: "Process the following list of customer support tickets and categorize each one as 'billing', 'technical', or 'general'. For each ticket, output a JSON object with the fields id, category, priority (high/medium/low), and summary (one sentence)."
Add explicit formatting rules. Example: "Use exactly these priority levels: high, medium, low. Never use 'urgent' or 'critical'. The summary must be a single sentence, no newlines. The id field must match the input id exactly."
Add 3-5 tickets with mixed categories. Include at least one ticket that could be ambiguous (e.g., "My account is locked and I can't log in to pay my bill").
Run the prompt on the model. Count how many of the formatting rules the output follows on the first try. A model that follows 5/5 is in the orchestrator slot. A model that follows 3/5 is in the executor slot. A model that follows 1/5 is not in your stack.
Repeat with 2-3 different prompts. The first prompt might be lucky. Average across 2-3 prompts to get a stable score.

The output scoring:

5/5 average across 3 prompts: orchestrator-quality.
3-4/5 average: executor-quality.
1-2/5 average: not in your stack.
0/5 average: explicitly avoid.

The test takes 5 minutes per model and is the single most predictive benchmark for whether the model will work in your agent stack. Skipping it is the most common tier-list building mistake.
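
A sketch of how the scoring could be automated for the ticket example above. It takes the model's raw text output, so no particular API is assumed. The five checks are one reading of the formatting rules in the recipe, the summary check only tests for newlines, and the thresholds for fractional averages are an assumption.

```python
import json

ALLOWED_CATEGORIES = ("billing", "technical", "general")
ALLOWED_PRIORITIES = ("high", "medium", "low")
REQUIRED_FIELDS = {"id", "category", "priority", "summary"}


def score_output(raw_output: str, input_ids: list[str]) -> int:
    """Return how many of five formatting rules the raw model output follows."""
    try:
        items = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0  # nothing else can be checked if the output is not JSON
    if not isinstance(items, list) or not items:
        return 0
    if not all(isinstance(item, dict) for item in items):
        return 0
    rules = [
        # 1. Every object has exactly the requested fields.
        all(set(item) == REQUIRED_FIELDS for item in items),
        # 2. Categories come from the allowed list.
        all(item.get("category") in ALLOWED_CATEGORIES for item in items),
        # 3. Priorities come from the allowed list (no "urgent" or "critical").
        all(item.get("priority") in ALLOWED_PRIORITIES for item in items),
        # 4. Summaries are strings with no newlines.
        all(isinstance(item.get("summary"), str) and "\n" not in item["summary"]
            for item in items),
        # 5. Ids match the input ids exactly, in order.
        [item.get("id") for item in items] == input_ids,
    ]
    return sum(rules)


def tier(average_score: float) -> str:
    """Map an average score to a slot; fractional cut-offs are an assumption."""
    if average_score >= 5:
        return "orchestrator"
    if average_score >= 3:
        return "executor"
    if average_score >= 1:
        return "not in your stack"
    return "avoid"


# Hypothetical outputs: one clean, one that breaks the priority and newline rules.
good = '[{"id": "T1", "category": "billing", "priority": "high", "summary": "Card was charged twice."}]'
bad = '[{"id": "T1", "category": "billing", "priority": "urgent", "summary": "Card was\\ncharged twice."}]'
scores = [score_output(good, ["T1"]), score_output(bad, ["T1"])]
print(scores, tier(sum(scores) / len(scores)))
```

Scoring raw text instead of a parsed response keeps the check honest: output wrapped in prose or a code fence fails the JSON parse, which is exactly the failure an orchestrator would hit.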

The "stubborn model compounds" warning

The most important framing on the willingness-to-follow-instructions axis is the compounding rule. A stubborn model doesn't just cost one extra turn per failure — it costs multiple turns. The failure pattern:

Model ignores the instruction.
Orchestrator re-issues the instruction with more context.
Model partially follows the re-issued instruction.
Orchestrator issues a third instruction.
Model finally follows — or the orchestrator gives up and routes to a different model.

Each iteration is a model call, and each call bills more tokens. The cost of a stubborn model is not linear in the failure rate; it compounds, because every re-issue carries more context than the last and one ignored instruction can trigger several rounds. A model that fails 50% of the time on a formatting instruction can cost 5-10x more than a model that never fails, even if the underlying capability is identical.
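
A toy model of that failure pattern, to show why the cost curve bends upward. It assumes each re-issue is a fixed multiple larger than the previous call and that the orchestrator gives up after a fixed number of attempts. Both parameters are illustrative, the model leaves out the orchestrator's own recovery spend, and it is not meant to reproduce the 5-10x figure.

```python
def expected_retry_cost(call_cost: float, failure_rate: float,
                        growth: float, max_attempts: int) -> float:
    """Expected spend on one instruction when every re-issue costs more.

    Attempt k (0-based) only happens if the previous k attempts failed
    (probability failure_rate ** k), and it costs call_cost * growth ** k
    because the instruction is re-sent with more context.
    """
    return sum(
        (failure_rate ** k) * call_cost * (growth ** k)
        for k in range(max_attempts)
    )


# Hypothetical settings: each retry is 1.5x larger, give up after 5 attempts.
for p in (0.0, 0.1, 0.3, 0.5):
    cost = expected_retry_cost(1.0, p, growth=1.5, max_attempts=5)
    print(f"failure rate {p:.0%}: {cost:.2f}x the single-call cost")
```

Each step up in failure rate adds more cost than the step before it, which is the compounding the pattern describes.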

This is the framing the channel uses for the Opus regression. Opus 4.6 is capable but stubborn: the 40% on the Boxmining benchmark is the success rate, so most runs need recovery, and the recovery traffic makes the effective cost 5-10x the published rate. The channel's recommendation: don't pay Opus prices for a model that compounds its own failures.

How the four axes interact

The four axes are not independent. The interactions are the part the channel's coverage is best at making concrete:

Speed vs quality × cost-per-task. A model that's 2x faster but 4x more expensive per token can still be cheaper per task if it finishes in fewer calls and fails less often. Don't optimize on either axis alone.
Context length × cost-per-task. A 1M context window at the same per-token price is not a 1M context window at the same per-call price. The 5x multiplier on per-call cost compounds in agent loops.
Willingness to follow instructions × speed vs quality. A "smart but stubborn" model wastes the speed advantage on recovery turns. A "dumb but obedient" model can be more reliable for executor work than a smart one.
Willingness to follow instructions × cost-per-task. A stubborn model costs more in recovery traffic than the published per-token rate suggests. The "double-billed failed turn" rule is the most concrete example of this interaction.

The take-away the channel keeps landing on: the right model is not the one that wins on any single axis. It's the one that wins on the right combination of axes for the slot you're filling. The orchestrator slot needs quality and instruction-following; the executor slot needs speed, cost, and instruction-following; the auxiliary slot needs whatever specialized capability justifies the slot (vision, search, document processing).

The "interaction matrix" for the orchestrator slot

The orchestrator slot has a specific interaction profile. The axes that matter most:

Quality × willingness-to-follow-instructions is the dominant interaction. The orchestrator is the brain: it plans, reasons, and recovers from executor failures. A "smart but stubborn" orchestrator compounds failures. The channel's read: the orchestrator slot is where the "smart but stubborn" trade-off hurts most. A reliable GPT 5.4 beats a smart-but-stubborn Opus 4.6 for orchestrator work.
Cost-per-task × quality is the secondary interaction. The orchestrator handles a small fraction of total calls (planning, recovery) but those calls are expensive. The cost is amortized across the executor work the orchestrator enables. A $50–$75/month orchestrator is fine if it enables $10–$20/month executor work across many tasks.
Context length × willingness-to-follow-instructions is the tertiary interaction. The orchestrator holds state across many turns. A 1M context window helps the orchestrator stay consistent, and the 15% compaction drop is the specific, measured benefit. That window is wasted on a model that ignores instructions.

The "interaction matrix" for the executor slot

The executor slot has a different interaction profile. The axes that matter most:

Speed vs quality × cost-per-task is the dominant interaction. The executor is the workhorse: it calls tools, follows formatting, and runs cheaply. A 2x-faster executor at 4x the per-token cost can still be cheaper per task if it needs fewer calls and fewer retries. The channel's read: the executor slot is where the per-task math matters most.
Willingness-to-follow-instructions × cost-per-task is the secondary interaction. The executor compounds failures too, but less catastrophically than the orchestrator. A 10% failure rate on the executor costs roughly 1.1x the per-task rate in retried calls, before recovery traffic, and the damage stays local to the step; a 10% failure rate on the orchestrator compounds across the whole session.
Context length × speed vs quality is the tertiary interaction. The executor handles short, focused calls. A 1M context window is wasted on the executor slot.

The framing: each slot has its own interaction matrix. Pick the model that wins on the dominant interaction for the slot, not the model that wins on any single axis.
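
A rough sketch of slot-specific scoring. A weighted sum is a simplification that does not capture true interactions between axes, and every weight and model score below is invented for illustration. Only the emphasis (quality and instruction-following for the orchestrator; speed, cost, and instruction-following for the executor) follows the text.

```python
AXES = ("quality", "speed", "cost", "instruction_following", "context")

# Hypothetical weights per slot; each row sums to 1.0.
SLOT_WEIGHTS = {
    "orchestrator": {"quality": 0.4, "speed": 0.0, "cost": 0.1,
                     "instruction_following": 0.4, "context": 0.1},
    "executor": {"quality": 0.1, "speed": 0.3, "cost": 0.3,
                 "instruction_following": 0.3, "context": 0.0},
}


def slot_score(scores: dict[str, float], slot: str) -> float:
    """Weighted sum of per-axis scores (0..1, higher is better) for one slot."""
    weights = SLOT_WEIGHTS[slot]
    return sum(weights[axis] * scores[axis] for axis in AXES)


def best_for_slot(models: dict[str, dict[str, float]], slot: str) -> str:
    """Name of the model with the highest score for the given slot."""
    return max(models, key=lambda name: slot_score(models[name], slot))


# Hypothetical models and scores, not measurements of any real model.
models = {
    "reliable_planner": {"quality": 0.85, "speed": 0.4, "cost": 0.3,
                         "instruction_following": 0.9, "context": 0.8},
    "cheap_worker": {"quality": 0.5, "speed": 0.9, "cost": 0.9,
                     "instruction_following": 0.85, "context": 0.3},
}
for slot in SLOT_WEIGHTS:
    print(slot, "->", best_for_slot(models, slot))
```

The same two models swap places depending on the slot, which is the argument for scoring per slot instead of keeping one global ranking.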

Try it yourself

Pick a model you've never used. Subscribe to the cheapest tier of any model on OpenRouter (Mimo V2 is free during the promotional window; Minimax has a $10/mo coding plan; GLM 5.1 has a $7/mo yearly plan on Z.ai).
Run the instruction-following test. Give the model a 5-step prompt with explicit formatting requirements (JSON schema, step ordering, specific tool calls). Count how many of the 5 rules it follows on the first try. That's your baseline.
Run the same prompt on a model you already use. Compare. The difference is your willingness-to-follow-instructions axis data.
Audit your last 10 Claude Code sessions. For each, log the average prompt size, the average response size, and whether the session ever hit a context-length error. If none of them hit a context error, you don't need 1M yet.
Pick one task that genuinely needs 1M. A multi-file refactor across a large repo, or a 30-contract legal review. Confirm it's a use case where RAG is the current workaround and the workaround is causing friction.
Run the task at 200K first. Note the output quality. If the model misses context that should have been loaded, the 200K window is the bottleneck and 1M is justified. If the output is fine, 1M would have been waste.
Re-run the same task at 1M. Compare output quality, latency, and cost. The 1M run should win on at least one of those dimensions — if not, 200K is the right answer.
Build a four-axis scorecard. For each model you test, log speed and quality (two measurements on one axis), context-window utilization, cost-per-task, and instruction-following rate. The scorecard is the input to your tier list.
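
The scorecard from the last step can be as simple as one record per model. A minimal sketch, with hypothetical field names and example values:

```python
from dataclasses import dataclass


@dataclass
class Scorecard:
    model: str
    median_latency_s: float     # speed half of the speed-vs-quality axis
    task_success_rate: float    # quality half of the same axis, 0..1
    avg_context_tokens: int     # how much of the window the workload uses
    cost_per_task_usd: float    # dollars per attempted task
    rules_followed_avg: float   # instruction-following test, out of 5

    def cost_per_successful_task(self) -> float:
        """Dollars per task that actually succeeded."""
        if self.task_success_rate <= 0:
            return float("inf")
        return self.cost_per_task_usd / self.task_success_rate


# Hypothetical entry, not a measurement of any real model.
card = Scorecard("example-model", 2.0, 0.8, 40_000, 0.50, 4.0)
print(f"{card.model}: ${card.cost_per_successful_task():.3f} per successful task")
```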

Common pitfalls

Scoring models on a single axis. The channel's whole point is that "best model" is a multi-axis answer. Pick the wrong axis and you pick the wrong model. Cost-only picking leads to Minimax for orchestrator work (where it fails) and capability-only picking leads to Opus for executor work (where it costs too much).
Maxing the context window by reflex. A 1M context looks great on a slide. Latency scales with token count, the "lost in the middle" problem is real, and a carefully constructed 50K-token prompt beats a sloppy 1M dump. Start small.
Reading "same price" without checking the call size. The 1M tier is five times the input of 200K. The pricing story is "same per-token," not "same per-call." Agent loops multiply the difference.
Trusting per-token pricing as the cost axis. Per-task is the unit of value. A model that's 2x more expensive per token but finishes the task in fewer than half the calls is cheaper in the aggregate. Always score on cost-per-successful-task.
Skipping the instruction-following test. A "smart but stubborn" model wastes the speed advantage on recovery turns. The instruction-following test is a 5-minute benchmark and the one that correlates most with whether the model will work in your stack.
Treating the 4 axes as independent. They interact. A model that wins on cost but loses on instruction-following can cost more in recovery traffic than the published per-token rate suggests. Score the combination, not the components.
Putting the critical rule in the middle of a long prompt. The "lost in the middle" effect is consistent across long-context models. Frontload and backload what the model has to follow; treat the middle as context, not as instruction.
Benching on public leaderboards in a vacuum. Vendors will optimize for public benchmarks once they're public. Add at least one task that mirrors your real production workload. The channel's caveat on the WildClaw suite applies to every other public benchmark too.
Forgetting that 1M replaces RAG only for the use cases the host names. Full codebases, large contract sets, large research corpora, and long-running agent traces. For production retrieval at scale (search, RAG-as-a-service, semantic queries over large corpora), keep the RAG pipeline. Loading 1M tokens per query is not cheaper than a vector lookup.
Reading the 78.3% MRCR v2 score as a recall guarantee. That's a benchmark on a specific kind of needle-in-a-haystack task. Real-world long documents (mixed-format PDFs, code with comments, contracts with appendices) have their own failure modes. The score is necessary, not sufficient.

Sources

Claude 1M Context: What No One Tells You.. — 399 views · video_id: m97uC11VDtg · cited: 1M context spec, $5/$25 and $3/$15 per-million pricing, 15% compaction drop, 78.3% MRCR v2 score, "lost in the middle" framing, 200K default recommendation
Top AI Models for Hermes Agent (Tier List) — 8,107 views · video_id: Af7Fg1m7hRw · cited: the four-axis framework, the orchestrator/executor/auxiliary taxonomy
Best Model for Openclaw (WildClaw Benchmarks!) — 4,574 views · video_id: 31Ij4Cum5tg · cited: cost-per-task framing, the per-suite dollar costs