The headline number from the channel's most-watched "Opus is broken" video: Opus 4.6 scored 40% on a benchmark Claude itself designed, while GPT 5.4 hit 63% on the same test. The Boxmining benchmark covers four categories — instruction following, opposite behavior, false completion, and destructive actions. The prompts were pulled from Stack Overflow and common developer complaints, and Opus was used to design the rubric — so the test should have favored Claude. It didn't. This subtopic walks through the failure patterns, the community signal, and the migration plan.

What you'll learn

  • The four-category benchmark Claude itself designed, the 40% vs 63% result, and why Opus 4.6 should have been favored by the rubric.
  • The plan-mode failure pattern: Opus writes a multi-phase plan, starts on phase one, drifts into phase-two and phase-three work mid-run, and only later notices that it has left its own written plan behind.
  • The community signal: a senior AMD engineer publicly says he can no longer trust Opus for complex engineering tasks, Skills files are being ignored, agent loops are stalling, and chat upload limits dropped to 100 per thread.
  • The Mephisto theory: the channel's read that Anthropic is prepping Mephisto (the next consumer model) and is rationing compute by trimming Opus's thinking budget and changing quantization.
  • The migration playbook: cancel the 20x subscription, learn Kilo Code, treat Skills files as advisory only, and avoid letting Opus 4.6 operate in plan mode unsupervised.

The 40% vs 63% anchor

The Boxmining benchmark behind the 40% vs 63% result covers four categories:

  • Instruction following — does the model do what you asked? Tabs-not-spaces, order things, functions under 10 lines, error handling.
  • Opposite behavior — does the model do the opposite of what you asked? A classic test for "stubborn" models that waste agent turns.
  • False completion — does the model claim to be done when it isn't? A test for the "I'm finished!" failure mode that wastes orchestrator turns.
  • Destructive actions — does the model delete files it just created, undo its own work, or perform other destructive operations? The category where 4.6's failure pattern is most visible.

The prompts were pulled from Stack Overflow and common developer complaints. The rubric was designed by Opus itself — so the test should have favored Claude. The 40% score on a self-designed rubric is the kind of number that should not be possible. The channel's read: "Opus was used to design the rubric, and the model still scored 40% — that's the most damning data point in the video."

GPT 5.4 hit 63% on the same suite, and the channel's read of the data puts GPT's true ceiling closer to 80% once the false-completion section of the rubric is fixed. The 23-percentage-point gap (40% vs 63%) is large enough that the difference is structural, not noise. Two models from two vendors, on the same benchmark, with the rubric designed by one of them — and the one with the rubric loses by 23 points.

The plan-mode failure pattern

A clear failure pattern emerged in plan mode. Opus writes a plan, starts on phase-one work, then drifts into phase-two and phase-three work without consulting the plan. Mid-run it notices that phases two and three are already done while phase one is still incomplete, and it tries to back-fill phase one. The orchestrator's job is to catch this kind of failure; the channel's observation is that the orchestrator is now catching Opus's failures, not Opus's successes.

The destructive-actions category is where the failure shows up most clearly. Multiple users and the channel's own run show Opus deleting files it just created, undoing its own work, and getting stuck in agent loops. The "I'm finished!" claim is a false-completion failure; the file-deletion is a destructive-actions failure. Both are reproducible across runs.

The model name confusion

During the run, the model also confused its own name: the "Sonnet" and "Opus" labels got swapped, with the model reporting that it was "Sonnet 4.6" when it was in fact "Opus 4.6," and vice versa. The channel's read is that the name confusion is a side effect of the same compute rationing that drove the 40% score: the model no longer tracks its own identity because the system prompt is being trimmed to save tokens.

The community signal

The video is anchored by a public statement from a senior AMD engineer: he can no longer trust Opus for complex engineering tasks. The statement matters because AMD is a chip vendor with internal access to the model through enterprise agreements, so the engineer is not a casual consumer complaining about throttling. The community signal the channel collects:

  • Skills files are being ignored entirely. The "Skills" feature Claude shipped in 2025 was supposed to be the model-side counterpart to OpenClaw's persistent skills; on 4.6, the model reads the file once and then proceeds as if it didn't exist.
  • Agent loops are stalling. Long-running agents that worked in February–March 2026 now fail in minutes.
  • The same destructive file-deletion behavior the channel saw in its own Hermes agent is being reported across the user base.
  • Chat upload limits dropped to 100 per thread, a 6x reduction from the previous 600-image / 600-PDF-page cap that shipped with the 1M context window (covered in §4.1 of Course 1: Picking Your Agent Harness).

The 100-file limit should not be read in isolation from the 40% score. If the model could actually use the 1M context window, dropping the upload limit to 100 would be a marketing decision, not a capability one. Instead, the 100-file limit is consistent with a model that has been quantized so heavily that loading 600 files would degrade output to the point of unusability.

The Mephisto theory

The channel's theory on the root cause: Anthropic is prepping Mephisto, the next model, and is rationing compute. Rate limits were increased for plan users to stop community backlash (§4.2), but the GPU budget didn't move — so the team quietly trimmed Opus's thinking budget and changed quantization. The cut went too far: output is now "near old Sonnet level."

The Mephisto theory has the same shape as the Glasswing / Mythos theory in §4.3, but applied to the next consumer model rather than the next enterprise model:

  • Compute is being shifted from Opus 4.6 to Mephisto training. The thinking budget trim and quantization change are the levers; the rate-limit increase is the cover.
  • The Mythos 5 leak in the source code suggests Mephisto is one of two next-gen models (the other being Mythos 5 itself), and Mythos is enterprise-gated.
  • The Mephisto release is timed to follow the 73-day release cycle. If the pattern holds, Mephisto ships roughly 73 days after the 4.7 release on April 17, putting the consumer launch in late June / early July 2026.
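
The date arithmetic behind the 73-day projection can be checked directly. The sketch below is only a calendar calculation using the two figures stated above (the April 17, 2026 release date and the 73-day cycle); the function and constant names are illustrative, and nothing here is an official schedule.

```python
from datetime import date, timedelta

# Both values come from the text above; they are the channel's figures,
# not an official release schedule.
OPUS_47_RELEASE = date(2026, 4, 17)
RELEASE_CYCLE_DAYS = 73


def projected_release(previous: date, cycle_days: int = RELEASE_CYCLE_DAYS) -> date:
    """Return the date that falls `cycle_days` after the previous release."""
    return previous + timedelta(days=cycle_days)


if __name__ == "__main__":
    print(projected_release(OPUS_47_RELEASE).isoformat())  # 2026-06-29
```

A strict 73-day count lands on June 29, 2026, which sits at the late-June end of the "late June / early July" window.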

The theory cannot be verified from public information today. The channel's read is "the 40% benchmark, the destructive-actions pattern, the Skills file ignoring, and the chat upload limit drop are all consistent with a model whose thinking budget has been cut to free up compute for a parallel training run." The test is whether Mephisto ships in the 73-day window with restored capability; if it does, that outcome supports the theory.

The destructive-actions failure pattern, in detail

The destructive-actions failure pattern is the most visible to the user, and it deserves a clean restatement. The pattern, in detail:

  • File deletion. Opus 4.6 deletes files it just created. The deletion happens in the same session, often within the same agent turn. The user sees the file appear in the diff, then disappear in the next diff.
  • Self-undoing. Opus 4.6 undoes its own work. The undoing happens when the model re-reads its own output and decides the previous version was better. The user sees the file change twice in the diff.
  • Agent loop stalls. Opus 4.6 gets stuck in an agent loop. The loop is the model repeatedly trying the same approach, failing the same way, and re-trying. The user sees the same tool call 5–10 times in a row.
  • "I'm finished!" false claims. Opus 4.6 claims to be done before the work is finished. The claim is consistent with the false-completion failure pattern from §4.4. The user sees "Task complete" in the agent log, but the diff is empty or partial.

The four sub-patterns together are the destructive-actions failure pattern. The channel's read is that the pattern is reproducible across runs, across users, and across the consumer tier. The pattern is the kind of failure that hits production users the hardest, because the failure is silent (no error message) and destructive (the file is gone, the work is undone).
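
Two of these sub-patterns can be flagged mechanically from an agent log. The sketch below is one possible approach, not the channel's tooling: it assumes you can export the sequence of tool-call names and a list of file-set snapshots taken between turns (both hypothetical inputs), and it uses the lower end of the "5–10 times in a row" range as the default stall threshold.

```python
from typing import Sequence, Set


def longest_repeat_run(tool_calls: Sequence[str]) -> int:
    """Length of the longest run of identical consecutive tool calls."""
    longest = current = 0
    previous = None
    for call in tool_calls:
        current = current + 1 if call == previous else 1
        longest = max(longest, current)
        previous = call
    return longest


def is_stalled(tool_calls: Sequence[str], threshold: int = 5) -> bool:
    """Flag a run as stalled when one call repeats `threshold` times in a row."""
    return longest_repeat_run(tool_calls) >= threshold


def created_then_deleted(snapshots: Sequence[Set[str]]) -> Set[str]:
    """Files missing at the start, present in a later snapshot, and gone at the end."""
    if not snapshots:
        return set()
    baseline, final = snapshots[0], snapshots[-1]
    seen_later = set().union(*snapshots[1:])
    return (seen_later - baseline) - final
```

For example, `created_then_deleted([{"a.py"}, {"a.py", "b.py"}, {"a.py"}])` returns `{"b.py"}`, the file that appeared in one diff and disappeared in the next. A legitimate temporary file would also be flagged, so treat the result as a prompt for review rather than proof of failure.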

The community signal, in detail

The community signal the channel collected is worth restating because it is the empirical evidence for the destructive-actions failure pattern:

  • The senior AMD engineer quote. A senior AMD engineer publicly said he can no longer trust Opus for complex engineering tasks. AMD is a chip vendor with internal enterprise access to the model, so the quote is from an engineer who has run the model in production, not from a casual consumer.
  • The Skills file ignoring report. Multiple users report Opus 4.6 ignoring Skills files mid-session. The "Skills" feature was supposed to be the model-side counterpart to OpenClaw's persistent skills; on 4.6, the model reads the file once and then proceeds as if it didn't exist.
  • The agent loop stall report. Long-running agents that worked in February–March 2026 now fail in minutes. The stall is consistent with the model being quantized to the point that long-running agent loops degrade output.
  • The destructive file-deletion report. The same destructive file-deletion behavior the channel saw in its own Hermes agent is being reported across the user base. The deletion is consistent with the model being unable to track its own state across turns.
  • The 100-file chat upload limit. Anthropic dropped the limit to 100 per thread, a 6x reduction from the previous 600-image / 600-PDF-page cap. The 100-file limit is consistent with a model that has been quantized to the point that loading 600 files would degrade output.

The community signal is consistent with the Mephisto theory: the model is being throttled, quantized, and rate-limited to free up compute for a parallel training run. The community signal is also the empirical foundation for the §4.4 migration plan: the destructive-actions failure pattern is reproducible on consumer-tier accounts, and the only fix is to route around the consumer tier.

The plan-mode failure, in detail

The plan-mode failure pattern from §4.4 deserves a clean restatement because it is the most counterintuitive failure. The pattern, in detail:

  • The model writes a plan. Opus 4.6 receives a multi-phase brief and writes a plan with phases 1, 2, 3, etc.
  • The model executes phase 1. The model starts working on phase 1.
  • The model mid-run notices it has done phases 2 and 3. Without telling the user, the model continues working on phases 2 and 3 in the same turn, leaving phase 1 incomplete.
  • The model back-fills phase 1. The model notices that phase 1 is incomplete and tries to back-fill it. The back-fill is partial, and the model is now confused about which phase is current.
  • The model claims to be done. The model reports "Task complete" without addressing the missing work in phase 1.

The plan-mode failure pattern is reproducible on the launch-day test (the channel ran the same brief twice and got the same failure). The pattern is also consistent with the destructive-actions failure: the model is unable to track its own state across turns, and the state-tracking failure shows up most clearly in plan mode.

The fix is to never let Opus 4.6 operate in plan mode unsupervised. Have a human in the loop, or use a different model for the orchestrator slot (GPT 5.4 is the channel's recommended alternative). The plan-mode failure is the kind of failure that hits production users the hardest, because the failure is silent (the plan is written, the work is partial, the user has to read the diff to catch it).
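
One way to put a supervisor in the loop is a small gate that tracks the written plan and rejects out-of-order phases or premature "Task complete" claims. The sketch below is a minimal illustration under stated assumptions: `PlanGate` and its methods are hypothetical names, and it assumes your harness can report which phase the model says it finished and whether the resulting diff is empty.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanGate:
    """Tracks a written plan and rejects out-of-order or premature claims."""
    phases: List[str]
    completed: List[str] = field(default_factory=list)

    def next_phase(self) -> Optional[str]:
        """The phase the plan says should run next, or None when all are done."""
        if len(self.completed) < len(self.phases):
            return self.phases[len(self.completed)]
        return None

    def record(self, phase: str) -> None:
        """Accept a finished phase only if it is the one the plan expects."""
        expected = self.next_phase()
        if phase != expected:
            raise ValueError(f"plan expects {expected!r}, model reported {phase!r}")
        self.completed.append(phase)

    def accept_completion(self, diff_is_empty: bool) -> bool:
        """Accept 'Task complete' only if every phase is recorded and the diff is non-empty."""
        return self.next_phase() is None and not diff_is_empty
```

With a three-phase plan, recording "phase 1" and then "phase 3" raises `ValueError`, which is the point where a human reviewer (or a different orchestrator model) would step in.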

What the channel is doing

The migration plan the channel published in the video is concrete:

  • Migrating all work to GPT 5.4. The orchestrator slot is moving to GPT 5.4; the executor slot stays on Minimax 2.7 (or GLM 5.1 for tasks where the budget allows).
  • Canceling the 20x Claude subscription. The 20x multiplier applies to a throttled window; the effective capacity is roughly half of pre-throttling Max, not 20x.
  • Learning Kilo Code specifically to avoid vendor lock-in. Kilo Code is the open-source Claude Code alternative that lets you swap models mid-task. The point is to never be locked to a single vendor's plan again.
  • Treating any Opus 4.6 Skills file as advisory only. Skills files are being ignored, so the file is documentation, not instruction.
  • Avoiding letting Opus 4.6 operate in plan mode unsupervised. The plan-mode failure pattern is reproducible; the only safe pattern is to have a human in the loop.

The framing is explicit: "this is not a fanboy move — he used Opus as his onramp into AI and still wants Anthropic to ship a fix." The migration is structural, not emotional. The fix Anthropic would have to ship to keep the channel as a customer is the same fix the §4.2 article names: real request counts, BYO API key, and a stop to the 73-day release squeeze.

The Kilo Code / Cline Code migration, in detail

The "learn Kilo Code specifically to avoid vendor lock-in" recommendation is worth a detailed restatement because it is the only path the channel publishes to platform-agnostic migration. The pattern:

  • Kilo Code is the open-source Claude Code alternative that lets you swap models mid-task. The channel's recommendation is to learn Kilo Code so you can route between Claude, GPT, Minimax, GLM, and other models without rewriting your agent config.
  • Cline Code is a similar alternative with a different feature set. Cline Code is the channel's pick for users who want a more visual interface.
  • The migration pattern is the same. Install Kilo Code or Cline Code, authenticate against your preferred model provider, and run the same Boxmining benchmark from §4.4. Compare the per-category scores across providers.
  • The hot-swap pattern is the key feature. Both Kilo Code and Cline Code let you change models mid-task. A failing Opus task can be re-run on GPT 5.4 in seconds; the channel does this routinely.
  • The platform-agnostic lever is the goal. The point is to never be locked to a single vendor's plan again. If Anthropic ships a fix, you can route back to Claude; if not, you stay on the alternative.

The Kilo Code / Cline Code migration is the structural response to the §4.2 plan-throttling saga. The throttling is a vendor behavior, not a model behavior; the migration lever is a vendor-agnostic tool, not a model swap. The two are independent: you can keep Opus on the orchestrator slot while using Kilo Code to manage the model routing, or you can migrate to GPT 5.4 entirely and use Kilo Code to manage the model swap. Either way, the platform-agnostic lever is the goal.
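
The hot-swap idea can be sketched independently of any particular tool. The code below is not the Kilo Code or Cline Code API; it is a generic fallback loop in which each "runner" is a hypothetical callable that takes a prompt and returns output, and `passes` is whatever acceptance check you trust (a test suite, a rubric, a human).

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-ins: a runner wraps one provider, a check judges its output.
ModelRunner = Callable[[str], str]
Check = Callable[[str], bool]


def run_with_fallback(
    prompt: str,
    runners: Sequence[Tuple[str, ModelRunner]],
    passes: Check,
) -> Tuple[Optional[str], Optional[str]]:
    """Try each model in order; return (model_name, output) for the first pass."""
    for name, run in runners:
        try:
            output = run(prompt)
        except Exception:
            continue  # treat a crashed call like a failed attempt
        if passes(output):
            return name, output
    return None, None  # every model failed the check
```

Ordering the runners puts the preferred model first and the fallback second, so routing back to Claude after a fix is a one-line change in the list.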

The 1-month migration budget, in detail

The "budget at least one month" warning from the §4.3 article is worth a detailed restatement because under-budgeting is the most common migration mistake. The pattern:

  • Week 1: pilot. Run the new model on a representative workload. Log the cost, log the quality, log the failure modes. Don't migrate production workflows yet.
  • Week 2: production. Migrate a single production workflow to the new model. Log the cost, log the quality, log the failure modes. Compare to the alternative.
  • Week 3: expand. Migrate a second production workflow. Log the same metrics. If the second migration goes well, expand to a third.
  • Week 4: decide. Write the decision memo. If the new model is cheaper and at least as good as the alternative, complete the migration. If not, roll back to the alternative.
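
The week-4 rule ("cheaper and at least as good") is simple enough to write down. The sketch below assumes you have logged weekly cost and pass counts for both the new model and the alternative on comparable workloads; `WeekLog` and its fields are hypothetical names for that log.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class WeekLog:
    """One week of logged results for one model (all fields are hypothetical)."""
    cost_usd: float  # total spend for the week
    passed: int      # tasks judged acceptable
    attempted: int   # tasks run


def total_cost(logs: Sequence[WeekLog]) -> float:
    return sum(week.cost_usd for week in logs)


def pass_rate(logs: Sequence[WeekLog]) -> float:
    attempted = sum(week.attempted for week in logs)
    return sum(week.passed for week in logs) / attempted if attempted else 0.0


def week4_decision(new_model: Sequence[WeekLog], alternative: Sequence[WeekLog]) -> str:
    """Apply the rule from the text: migrate only if cheaper AND at least as good."""
    cheaper = total_cost(new_model) < total_cost(alternative)
    at_least_as_good = pass_rate(new_model) >= pass_rate(alternative)
    return "complete migration" if cheaper and at_least_as_good else "roll back"
```

The comparison is only meaningful if both logs cover similar workloads; a cheaper model that was only given easy tasks will look better than it is.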

The 1-month budget is the channel's read of the right amount of time to migrate off Opus. Less than 1 month and the migration is too rushed; more than 1 month and the cost of the migration exceeds the savings. The 1-month budget is also the right amount of time for the Fable 5 cheap window: the cheap window is roughly 4 weeks from the time of the §4.5 video, so the channel's verdict is "run Fable 5 for the 4-week cheap window, log the results, and decide whether to keep it on the bill after the window closes."

The 1-month budget is consistent with the 2-week ROI frame from §4.5. The 2-week frame is the per-model evaluation; the 1-month budget is the per-migration evaluation. The two time frames together are the channel's evaluation toolkit.

The 40% number, restated

The 40% number is the load-bearing data point in the channel's Claude coverage, and it deserves a clean restatement:

  • Opus 4.6: 40% on a self-designed rubric. The model that designed the test scored 40% on the test it designed. The prompts came from Stack Overflow and common developer complaints, so the test is not exotic.
  • GPT 5.4: 63% on the same test. A 23-percentage-point gap on a test that should have favored Claude.
  • GPT 5.4's true ceiling: closer to 80%. The channel's read is that the false-completion section of the rubric is itself undercounted; once that section is fixed, GPT 5.4's score should rise to 80%.
  • The gap is structural, not noise. Repeat runs of the same model on the same rubric give the same answer, so the gap between Opus 4.6 and GPT 5.4 is not run-to-run variance.

The 40% number is the kind of result that should not be possible. A model that designed its own rubric, scored by a third party, on prompts from common developer complaints, should be near the top of the leaderboard. It isn't. The structural explanation is compute rationing; the alternative explanation is that the 4.6 release is fundamentally broken. Both lead to the same migration plan.

The four-category benchmark, restated

The four-category benchmark is worth describing in full because it is the channel's standard tool for evaluating any model. Re-use it on your own workloads:

  • Instruction following. A set of prompts with explicit formatting requirements (tabs not spaces, function order, function length, error handling). Score 1 if the model follows the requirement, 0 if not.
  • Opposite behavior. A set of prompts where the user asks for one thing and the model does the opposite. The classic "stubborn model" test. Score 1 if the model does what was asked, 0 if it does the opposite.
  • False completion. A set of prompts where the model is asked to do work that takes more than one turn. Score 1 if the model correctly reports incomplete work, 0 if it claims to be done before the work is finished.
  • Destructive actions. A set of prompts where the model is asked to edit or delete files. Score 1 if the model does what was asked, 0 if it deletes files it just created, undoes its own work, or performs other destructive operations.

Each category is scored independently and reported as a percentage. The total score is the unweighted average. GPT 5.4 hits 75% on the instruction-following suite alone, which is the channel's reference number for "a model that follows instructions."
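
The scoring scheme described above (0 or 1 per prompt, a percentage per category, an unweighted average overall) can be sketched in a few lines. The function names and the input shape, a list of `(category, score)` pairs, are assumptions for illustration; the channel does not publish its harness.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def category_scores(results: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Turn (category, 0-or-1) results into a percentage per category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, attempted]
    for category, score in results:
        totals[category][0] += score
        totals[category][1] += 1
    return {cat: 100.0 * passed / n for cat, (passed, n) in totals.items()}


def total_score(per_category: Dict[str, float]) -> float:
    """Unweighted average of the category percentages."""
    return sum(per_category.values()) / len(per_category)


if __name__ == "__main__":
    # Made-up results for two categories, purely to show the shape of the data.
    demo = [
        ("instruction following", 1), ("instruction following", 1),
        ("instruction following", 1), ("instruction following", 0),
        ("false completion", 1), ("false completion", 0),
        ("false completion", 0), ("false completion", 0),
    ]
    scores = category_scores(demo)
    print(scores, total_score(scores))
```

Because the total is unweighted, a category with few prompts counts as much as a category with many, which is worth keeping in mind when you add your own prompts.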

Try it yourself

The hands-on goal for this subtopic: reproduce the 40% vs 63% result on your own account, then run the same benchmark on your migration target to confirm the move is worth it.

  1. Run the four-category benchmark on Opus 4.6. Use the Stack Overflow prompts and the Opus-designed rubric. Score each category independently. If you land below 50% on a representative task, you have empirical permission to route around Claude for that workload.
  2. Run the same benchmark on GPT 5.4. Confirm the 63% score on your own machine. The channel's reference number is 75% on the instruction-following suite alone, which is the more reliable indicator.
  3. Run the plan-mode failure pattern on Opus 4.6. Write a multi-phase plan, then run the model with the plan enabled. If the model starts on phase one, drifts into phase-two and phase-three work mid-run, and ignores the plan, the failure has reproduced on your account.
  4. Try the destructive-actions category. A simple "edit this file, then delete the original" prompt. If the model deletes the file it just created, the failure has reproduced.
  5. Re-run on Kilo Code with Minimax 2.7 as the backend. Kilo Code lets you swap models mid-task. If the same prompt succeeds on Minimax 2.7, the migration is worth it.
  6. Cancel the 20x subscription if you have it. The 20x multiplier applies to a throttled window. The effective capacity is roughly half of pre-throttling Max, not 20x.
  7. Learn Kilo Code or Cline Code. The channel's recommendation is to learn a platform-agnostic tool so you are never locked to a single vendor's plan again. Kilo Code and Cline Code are the two the channel covers.
  8. Treat any Opus 4.6 Skills file as advisory only. The "Skills" feature is being ignored. The file is documentation, not instruction.

Common pitfalls

  • Trusting plan mode on Opus 4.6. Multiple users and the channel's own run show Opus executing the wrong phase, then noticing mid-run instead of checking the plan first. Never let Opus 4.6 operate in plan mode unsupervised.
  • Using Skills files as load-bearing instructions. Multiple users report Opus 4.6 ignoring Skills mid-session. Treat the file as advisory, not authoritative.
  • Paying Opus prices for executor work. The headline 40% vs 63% is enough to disqualify Opus 4.6 for executor roles. Keep it (at most) on orchestrator planning and route execution to GLM 5.1, Minimax 2.7, or DeepSeek V4 Pro.
  • Reading the 100-file upload limit as a marketing change. The 100-file limit is consistent with a model that has been quantized to the point that loading 600 files would degrade output. The limit is a capability decision, not a marketing one.
  • Hitting 100-file chat upload limits silently. Anthropic dropped the limit to 100 per thread. Split your uploads or use the API, where the limit doesn't apply.
  • Migrating off Opus in a week. The channel's own post-migration presentation tool was still broken a week later. Budget at least a month.
  • Trusting "Opus designed the rubric" as evidence Opus is good at the test. Opus designed the rubric, scored 40% on the rubric, and the test is the channel's standard. The number is damning precisely because Opus should have won.
  • Treating the Mephisto theory as conspiracy. The theory is consistent with the public data. The unfalsifiable parts (Mephisto training, Mythos compute allocation) are the parts the channel flags as "consistent, not proven."
  • Reading the 23-percentage-point gap as noise. Repeat runs of the same model on the same rubric give the same answer, so the gap between Opus 4.6 and GPT 5.4 is not run-to-run variance.
  • Trusting the false-completion section of the rubric as a hard cap. The channel's read is that the false-completion section is itself undercounted; once it is fixed, GPT 5.4's score should rise to 80%. The 63% is a floor, not a ceiling.
  • Treating the 40% as a "low-scoring model" problem. The 40% is a "high-scoring model that has been cut" problem. The pre-March 4.6 baseline was meaningfully higher. The 40% is a regression, not a stable capability.
  • Trusting Skills files as authoritative. The "Skills" feature was supposed to be the model-side counterpart to OpenClaw's persistent skills; on 4.6, the model reads the file once and then proceeds as if it didn't exist. Treat Skills as documentation, not instruction.
  • Migrating execution to Opus 4.6 for security-sensitive work without a final review pass. The 40% score on a self-designed rubric is the empirical evidence that the model is not reliable for security-sensitive or money-handling code. Route execution to GLM 5.1, Minimax 2.7, or DeepSeek V4 Pro, and keep Opus on orchestrator planning only.
  • Reading the Mephisto theory as a single-model claim. Mephisto is one of three successor models the channel names (Glasswing, Mephisto, Mythos). The three are not interchangeable. The compute-rationing argument applies to all three, not to Mephisto alone.
  • Dismissing the AMD senior engineer quote as a marketing claim. AMD is a chip vendor with internal enterprise access to the model. The quote is from an engineer who has run the model in production, not from a casual consumer.
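
For the upload-limit pitfall above, splitting a large file set into per-thread batches is straightforward. The sketch below only partitions a list of file names; the 100-per-thread figure is the one stated in the text, and the actual upload step (chat UI or API) is deliberately left out because it depends on your client.

```python
from typing import Iterator, List, Sequence

THREAD_UPLOAD_LIMIT = 100  # per-thread cap as stated in the text


def batches(files: Sequence[str], size: int = THREAD_UPLOAD_LIMIT) -> Iterator[List[str]]:
    """Yield consecutive slices of `files`, each no longer than `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(files), size):
        yield list(files[start:start + size])


if __name__ == "__main__":
    files = [f"doc_{i}.pdf" for i in range(250)]
    print([len(batch) for batch in batches(files)])  # [100, 100, 50]
```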

The 40% number, in raw form

The 40% number is the load-bearing data point in the channel's Claude coverage, and it deserves a raw-form restatement so you can reproduce it on your own account. The Boxmining benchmark is structured as follows:

  • Total prompts: 50 (12 instruction following, 12 opposite behavior, 13 false completion, 13 destructive actions).
  • Total categories: 4 (instruction following, opposite behavior, false completion, destructive actions).
  • Per-prompt scoring: 1 if the model does what was asked, 0 if not.
  • Per-category scoring: sum of per-prompt scores / number of prompts in the category.
  • Total score: unweighted average of the four category scores.

The 50 prompts were pulled from Stack Overflow and common developer complaints. The rubric was designed by Opus itself — the channel asked Opus to design the rubric, then used Opus's rubric to score Opus. The 40% score is the unweighted average across the four categories.

The category-level breakdown (the channel's read, not the official numbers):

  • Instruction following: ~50%. Opus 4.6 hits the "tabs not spaces" requirement, misses the "functions under 10 lines" requirement (one function came in at 12 lines), and inconsistently follows the "order things" requirement.
  • Opposite behavior: ~45%. Opus 4.6 does the opposite of what is asked on roughly half of the prompts in this category. The "stubborn model" failure pattern is the dominant one.
  • False completion: ~30%. Opus 4.6 claims to be done before the work is finished on roughly 70% of the prompts in this category. The "I'm finished!" failure is the most common.
  • Destructive actions: ~35%. Opus 4.6 deletes files it just created, undoes its own work, or performs other destructive operations on roughly 65% of the prompts in this category. The destructive-actions failure is the most visible to the user.

The category-level numbers are the channel's read of the public data, not official benchmark scores. The point is the relative ranking: the false-completion and destructive-actions categories are the most failure-prone, and those are the categories that hit production users the hardest. A model that scores 50% on instruction following but 30% on false completion and 35% on destructive actions is a model that produces broken builds, not just slightly-wrong builds.
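
As an arithmetic check on the breakdown above, the sketch below averages the four approximate category scores. It also computes a prompt-count-weighted average using the 12 / 12 / 13 / 13 split listed earlier; the weighted figure is an addition for comparison only, since the text defines the total as the unweighted average.

```python
# Approximate category scores (percent) and prompt counts, both from the text.
CATEGORY_SCORES = {
    "instruction following": 50,
    "opposite behavior": 45,
    "false completion": 30,
    "destructive actions": 35,
}
PROMPT_COUNTS = {
    "instruction following": 12,
    "opposite behavior": 12,
    "false completion": 13,
    "destructive actions": 13,
}

# Total as defined in the text: every category counts equally.
unweighted = sum(CATEGORY_SCORES.values()) / len(CATEGORY_SCORES)

# Comparison only: weight each category by how many prompts it contains.
weighted = sum(
    CATEGORY_SCORES[cat] * PROMPT_COUNTS[cat] for cat in CATEGORY_SCORES
) / sum(PROMPT_COUNTS.values())

if __name__ == "__main__":
    print(unweighted)          # 40.0
    print(round(weighted, 1))  # 39.7
```

The unweighted average reproduces the 40% headline, and weighting by prompt count moves it by only a fraction of a point, so the choice of averaging does not drive the result.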

The 23-percentage-point gap, in context

The 23-percentage-point gap (40% vs 63%) is the empirical foundation for the §4.1 thesis. To make the gap concrete, the channel implies (but does not publish) the following comparison:

| Model | Boxmining score | API cost per 1M input tokens | $ per successful prompt |
| --- | --- | --- | --- |
| Opus 4.6 | 40% | $5.00 | $0.25 |
| GPT 5.4 | 63% | $1.25 | $0.04 |
| Minimax 2.7 | ~55% (channel's read) | $0.30 | $0.01 |
| GLM 5.1 | ~65% (channel's read) | $1.50 | $0.05 |

The "$ per successful prompt" column is the load-bearing number. It is the cost of one successful outcome on the Boxmining benchmark, computed as (cost per prompt × number of prompts) / (score × number of prompts). The numbers are illustrative; the channel does not publish the per-prompt cost. The point is the relative cost: Opus 4.6 is roughly 6x more expensive per successful prompt than GPT 5.4, and roughly 25x more expensive than Minimax 2.7. The cost-per-successful-outcome framing is the cleanest way to see why the routing switch is the right move.
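
The formula above reduces to cost per prompt divided by score, because the number of prompts cancels. The sketch below reproduces the table's illustrative column under one loud assumption that is not in the source: every prompt consumes 20,000 input tokens, and output tokens are ignored. That figure was chosen only because it matches the table's rounded values.

```python
# ASSUMPTION (not from the channel): each prompt uses 20,000 input tokens
# and output-token cost is ignored.
ASSUMED_INPUT_TOKENS_PER_PROMPT = 20_000

# model -> (Boxmining score as a fraction, USD per 1M input tokens), from the table.
MODELS = {
    "Opus 4.6": (0.40, 5.00),
    "GPT 5.4": (0.63, 1.25),
    "Minimax 2.7": (0.55, 0.30),
    "GLM 5.1": (0.65, 1.50),
}


def cost_per_success(
    score: float,
    usd_per_million_input: float,
    tokens_per_prompt: int = ASSUMED_INPUT_TOKENS_PER_PROMPT,
) -> float:
    """Cost of one successful outcome: cost per prompt divided by success rate."""
    cost_per_prompt = usd_per_million_input * tokens_per_prompt / 1_000_000
    return cost_per_prompt / score


if __name__ == "__main__":
    for model, (score, price) in MODELS.items():
        print(f"{model}: ${cost_per_success(score, price):.2f} per successful prompt")
```

Under this assumption the unrounded ratios come out at roughly 6x (Opus 4.6 vs GPT 5.4) and roughly 23x (Opus 4.6 vs Minimax 2.7); the 25x figure follows from the rounded $0.01 in the table. Changing the token assumption scales every row equally, so the ratios do not depend on it.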

The 40% vs 63% gap is also the cleanest way to see why the channel's "you don't need Opus" thesis is not vibes. The two models were tested on the same benchmark, with the same prompts, with the rubric designed by the loser. The loser scored 40%. The winner scored 63%. The gap is structural, not run-to-run variance, and the cost-per-successful-outcome gap is the empirical foundation for the migration plan.

The Mephisto theory, with a timeline

The Mephisto theory is worth a timeline because it shows up in every later video. A timeline:

  • Late February 2026: Opus 4.6 is shipping at the documented February–March baseline. The model handles the Boxmining benchmark at ~60% (the channel's read of the pre-regression number, not an official score).
  • Early March 2026: Users report token burn rate increases. The community assumes it is a temporary spike.
  • Late March 2026: The 5-hour rolling window is throttled on every consumer tier. Anthropic confirms the change on X with the "feature, not a bug" framing.
  • Early April 2026: The 4.6 baseline is at 40% on the Boxmining benchmark. The destructive-actions and false-completion categories are the most affected. The channel's read is that the thinking budget has been trimmed and the quantization has been changed.
  • April 17, 2026: Opus 4.7 ships. The launch-day test lands 4.7 at the same level as 4.6 (40% on Boxmining). The "almost 10% higher SWE-bench Pro" claim is the marketing number; the launch-day test is the empirical number.
  • April 24, 2026 (channel's read): The Mephisto training window is in full swing. Compute is being redirected from the consumer tier to Mephisto. The next 73 days of the release cycle are reserved for Mephisto training and the Mythos 5 leak response.
  • Late June / early July 2026 (predicted): Mephisto ships on the consumer tier, following the 73-day release cycle. The channel's prediction is that Mephisto will score meaningfully above 4.6 on the Boxmining benchmark, restoring the February baseline.

The timeline is consistent with the public data. The unverified parts are the Mephisto training window and the predicted ship date. Both are predictions, not facts. The channel's read is that the predictions are "consistent, not proven," and the routing switch is the safe default until the predictions are either confirmed or falsified.

Sources

  • Claude Opus is ACTUALLY UNUSABLE — 21,675 views · video_id: Cc2Vvra9F_c · the load-bearing video for the 40% vs 63% anchor.
  • Anthropic pulled a fast one on us! (Opus plans LIMITED) — 24,059 views · video_id: MkabEkgGpjA · cross-listed from §4.2; the same throttling saga is part of the Mephisto compute-rationing story.
  • Opus 4.7 is disappointing — 9,557 views · video_id: vUpN_S1iGqI · cross-listed from §4.3; the 4.7 launch-day test lands at the same level as 4.6.
  • Supabase query: SELECT video_id, title, views, summary_content, summary_key_takeaways FROM public.videos WHERE video_id = ANY(ARRAY['Cc2Vvra9F_c','MkabEkgGpjA','vUpN_S1iGqI']); against project ttxdssgydwyurwwnjogq.