Long-context gotchas - Claude Code & AI Coding

Subtopics 1.1 through 1.3 were about whether to use Claude Code, which model to put behind it, and how to wire it into OpenClaw. Subtopic 1.4 is the assumption underneath all of those decisions: that bigger context windows make coding agents better. Anthropic shipped a 1M-token context window on Opus 4.6 and Sonnet 4.6 at the same per-token price as the 200K window, and the channel's framing in this video is unambiguous — bigger is not the same as better, and the defaults that got you through 200K are actively harmful at 1M.

The video is short (399 views on a 1M-context topic is the channel telling you this is a niche audience) but the rules in it are load-bearing for any agent-driven coding workflow, which is why it gets its own subtopic instead of being folded into 1.2. If you read this and conclude "I should just use 1M for everything," you have missed the entire point of the video.

Source note. Every concrete number in this article is from the speaker (Ron, m97uC11VDtg transcript) — the $5/$25 and $3/$15 rates, the 15% compaction figure, the 78.3% MRCR v2 score, the 600 images / PDF pages per request, and the ~830K usable input tokens all come from a single announcement read-out near the top of the video. The five-step "how to actually use it" playbook is also verbatim from that transcript. The only places I extrapolate beyond the speaker are (a) cross-linking to Subtopic 1.2's $30/hour Opus burn, which the host references as the "why this matters" hook for the audience, and (b) the per-call cost worked-example showing 1M vs 200K at $5/M — both are arithmetic on the speaker's own per-token rate, not new claims.

What you'll learn

1M context on Opus 4.6 / Sonnet 4.6 ships at the same per-token price as 200K: "Opus 4.6 stays at $5 input and $25 [output] per million tokens. Sonnic [Sonnet] 4.6 is $315 [$3/$15]", with "no rate limit penalty" (Ron, m97uC11VDtg transcript). Each request has "roughly 830,000 usable tokens of input" once system prompts and output tokens are reserved, and can carry "up to 600 images or PDF pages per request" (Ron, m97uC11VDtg transcript), up from 100.
The two real wins are Anthropic's reported "15% drop in compaction events" (Ron, m97uC11VDtg transcript), so agent sessions hold state across more turns without summarising, and Opus 4.6's "78.3% on MRCV2 [MRCR v2], which is a long context recall benchmark", described as "the highest among Frontier models at this context length" (Ron, m97uC11VDtg transcript). The second matters because it means the model actually uses what's inside the window rather than merely fitting it.
The four real costs the speaker calls out: latency ("more tokens means, you know, more time before the model starts responding" — Ron, m97uC11VDtg transcript), the "lost in the middle" effect on a 500-page dump ("language models tend to recall information at the beginning and at the end of the context much better than stuff buried in the middle" — Ron, m97uC11VDtg transcript), attention dilution ("when you fill the context with irrelevant files you're sort of diluting the signal" — Ron, m97uC11VDtg transcript), and cost creep (the linear pricing math, sourced from the same transcript).
The pricing is linear: 1M is five times the input of 200K at the same $5/M input rate, so 20 calls × 1M input tokens = $100 in input alone, versus $20 for 20 × 200K calls. This is arithmetic on the speaker's own $5/M Opus rate (Ron, m97uC11VDtg transcript). The host's claim that a "carefully constructed 50k token prompt" often beats a sloppy 1M dump is also from the transcript.
The right default is 200K. The speaker is explicit: "for 90% of casual users, you'll never need 1 million tokens. If you're writing emails, brainstorming ideas, or asking quick questions, the standard context is actually more than enough" (Ron, m97uC11VDtg transcript). Scale to 1M only for full-repo analysis or multi-contract legal review, and only when output quality at the smaller size is actually insufficient.

The 1M context pitch — and what it actually costs

The video opens with the spec sheet, read out by the host: "Enthropic [Anthropic] made the 1 million token context window available for two models, Opus 4.6 and Sonnet 4.6, both the API and on Claude Pro and Max plans" (Ron, m97uC11VDtg transcript). Four key details from the announcement, again verbatim from the transcript:

Unified pricing. "There's no premium for using the extended context. Opus 4.6 stays at $5 input and $25 [output] per million tokens. Sonnic [Sonnet] 4.6 is $315 [$3/$15]. Whether your quest [request] is 9,000 tokens or 900,000 tokens, the per token rate is the same."
No rate-limit penalty. "During the beta, longer requests were in fact throttled, but now that is gone. You get full rate limits at every context length."
Media support jumped 6×. "Now you can attach up to 600 images or PDF pages per request up from 100."
Practical usable input. "After accounting for system prompts and output tokens, you're looking at roughly 830,000 usable tokens of input… enough to load a full mono repo, thousands of contract pages, or an entire agent session trace without any manual file selection."

Source: Ron, m97uC11VDtg transcript.
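The two hard limits in that read-out (usable input and media items) can be turned into a pre-flight check. This is a minimal sketch, not an official client: the constants are the speaker's figures rather than documented API limits, and `RequestPlan` and `check_request` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

# Figures quoted in the video; treat them as the speaker's numbers,
# not as limits confirmed by official documentation.
USABLE_INPUT_TOKENS = 830_000  # "roughly 830,000 usable tokens of input"
MAX_MEDIA_ITEMS = 600          # "up to 600 images or PDF pages per request"


@dataclass
class RequestPlan:
    """Hypothetical description of a request you are about to send."""
    input_tokens: int  # estimated tokens of text you intend to load
    media_items: int   # images plus PDF pages attached


def check_request(plan: RequestPlan) -> list[str]:
    """Return human-readable problems; an empty list means the plan fits."""
    problems = []
    if plan.input_tokens > USABLE_INPUT_TOKENS:
        over = plan.input_tokens - USABLE_INPUT_TOKENS
        problems.append(f"input exceeds usable budget by {over:,} tokens")
    if plan.media_items > MAX_MEDIA_ITEMS:
        over = plan.media_items - MAX_MEDIA_ITEMS
        problems.append(f"{over} media items over the per-request limit")
    return problems


print(check_request(RequestPlan(input_tokens=900_000, media_items=120)))
print(check_request(RequestPlan(input_tokens=400_000, media_items=600)))
```

A plan that passes this check only fits the window; whether it is worth sending at that size is a separate question that the rest of this article addresses.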

On paper, the four-bullet announcement is a free upgrade — same price, more room, more recall. The "what no one tells you" half of the video is the rest of the spec sheet: latency, attention, the "lost in the middle" research, and the pricing-curve math, none of which show up in the launch blog post.

The wins worth taking

Two findings actually move the needle, and two simplifications follow from them. Together they are the reason 1M context exists at all:

15% drop in compaction events. "If you've used clog code [Claude Code] or any longunning [long-running] agent, you've probably hit that wall where the model suddenly forgets something you told it earlier, right? So that's compaction. The system summarizing older context to make room. But now Anthropic is saying that this update led to a 15% drop in compaction events. This means your agent sessions can run longer and retain more nuance" (Ron, m97uC11VDtg transcript). For overnight builds (the workflow from subtopic 1.2), this means fewer mid-build summarisations that drop details you'd want preserved.
78.3% on MRCR v2. "Opus 4.6 scored 78.3% on MRCV2 [MRCR v2], which is a long context recall benchmark. And that's actually the highest among Frontier models at this context length. It's not just that the window is big. The model can actually use what's inside it" (Ron, m97uC11VDtg transcript). That's the meaningful claim: most long-context benchmarks test whether a model can find a needle, not whether it can use the whole haystack productively.
Skip RAG for many use cases. "RAG is retrieval augmented generation where you chunk documents, embed them and retrieve relevant pieces… that's the standard workaround for limited context and that's what we've been doing for the past 2 months. But with 1 million tokens you can actually just load the whole thing. You don't need to chunk it by bits and you don't really need to maintain any pipeline" (Ron, m97uC11VDtg transcript).
Simpler system overall. The host's framing: "When your context window is big enough, you eliminate entire layers of complexity such as summarization chains, context clearing workarounds, multi-step retrieval. With the 1 million context window, your system gets simpler. And simpler systems have fewer points of failure. And that's actually how Martin, one of our team members, is able to be very efficient with his workflow just because his system is very, very simple" (Ron, m97uC11VDtg transcript).

For the specific use cases the channel calls out — full-repo analysis (load every file, ask Opus to refactor) and contract review (load 30 contracts, ask for clause deltas) — you can skip the RAG pipeline entirely. No chunking, no embeddings, no retrieval step, no lost recall from bad splits. That's the actual "1M replaces RAG" pitch, and the host is explicit that it works for the use cases he names, not "RAG everywhere."

The costs people miss

The "what no one tells you" section is short because the cost list is short, but every entry is a footgun. The host numbers them, in order:

Latency scales with token count. "I think it's much more overlooked compared to cost. More tokens means, you know, more time before the model starts responding. So for interactive use cases like chat bots, you know, real time coding assistance, this can be a noticeable hit to the user experience. So if you're loading, you know, 800,000 tokens for a quick question, right, you're going to be waiting" (Ron, m97uC11VDtg transcript). The 1M tier is not a chat tier; it's a batch tier.
Lost in the middle. "This is actually well documented in research… language models tend to recall information at the beginning and at the end of the context much better than stuff buried in the middle. So if the critical detail is say for example on page 200 of a 500page [500-page] document dump, then there's a real chance the model underweighs [underweights] it" (Ron, m97uC11VDtg transcript). The video's fix is to frontload and backload critical instructions, keeping the load-bearing rule out of the middle.
Attention dilution. "Just because you can load everything doesn't mean you should, right? When you fill the context with irrelevant files you're sort of diluting the signal, right? The model has to sort through the noise to find what matters, and that kind of degrade[s] output quality" (Ron, m97uC11VDtg transcript). The cure is curation, not capacity.
Cost creep (the linear pricing math). "If you're habitually maxing out the context window across repeated calls, say in an agentic loop, then your bill adds up fast, right? A single 1 million token opus request cost $5 on the input alone. run that in a loop 20 times and you spent $100 just on input tokens" (Ron, m97uC11VDtg transcript).

The real bill

The pricing-curve math is the under-discussed part. The per-token price is the same as 200K, but the per-call cost is not, because 1M is five times the input of 200K. The channel's own example, cited above: "A single 1 million token opus request cost $5 on the input alone. Run that in a loop 20 times and you spent $100 just on input tokens" (Ron, m97uC11VDtg transcript). Output is on top of that, billed separately, and the host does not put a number on the output side.

Per-call arithmetic, worked from the speaker's $5/M input rate (Ron, m97uC11VDtg transcript): one 1M-token Opus input call = $5, twenty in an agentic loop = $100 in input alone. The same twenty calls at 200K cost $20 in input: same per-token rate, one fifth of the tokens per call. The token-plan math from subtopic 1.2 only holds if the calls themselves are cheap. Switch the loop to 1M and a single agent session is enough to erase the benefit of the same-price framing.
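The arithmetic above is easy to reproduce for your own loop sizes. The sketch below uses only the speaker's quoted input rates; the dictionary keys are illustrative labels, not API model identifiers, and `input_cost` is a hypothetical helper.

```python
# Input rates in USD per million tokens, as quoted by the speaker.
# The keys are illustrative labels, not API model identifiers.
INPUT_RATE_PER_MILLION = {"opus-4.6": 5.00, "sonnet-4.6": 3.00}


def input_cost(model: str, tokens_per_call: int, calls: int = 1) -> float:
    """Input-side cost in USD. Output tokens are billed separately and excluded."""
    rate = INPUT_RATE_PER_MILLION[model]
    return rate * tokens_per_call / 1_000_000 * calls


for tokens in (200_000, 1_000_000):
    total = input_cost("opus-4.6", tokens, calls=20)
    print(f"{tokens:>9,} tokens x 20 calls = ${total:,.2f}")
# prints:
#   200,000 tokens x 20 calls = $20.00
# 1,000,000 tokens x 20 calls = $100.00
```

The function is deliberately input-only, matching the transcript, which does not put a number on the output side.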

The second cost is soft but worse: when the window fits everything, you stop curating. The host's exact framing, in full: "When the window is big enough to fit everything, people stop curating. They stop thinking about what the model actually needs. And paradoxically, that often produces worse results than a carefully constructed 50k token prompt. Because at the end of the day, what separates a good AI user from the everyday users is their understanding of project structure, directory, what files to load in certain conversations" (Ron, m97uC11VDtg transcript). The reason isn't that the model is smarter at 50K — it's that a curated prompt is a directed question, while a 1M dump is "look at all this stuff and tell me what to do." Direction beats volume.
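Curation can be made mechanical enough to become a habit. The following is a rough sketch under loud assumptions: the four-characters-per-token estimate is a heuristic rather than a real tokenizer, and keyword counting is a crude stand-in for the project knowledge the host describes. `estimate_tokens` and `curate` are hypothetical helpers.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def curate(files: dict[str, str], keywords: list[str], budget: int) -> list[str]:
    """Pick files that mention the task keywords most often, within a token budget."""

    def score(body: str) -> int:
        lowered = body.lower()
        return sum(lowered.count(k.lower()) for k in keywords)

    # Most relevant first; files that never mention a keyword are dropped.
    ranked = sorted(
        ((score(body), path) for path, body in files.items() if score(body) > 0),
        key=lambda pair: (-pair[0], pair[1]),
    )
    chosen, used = [], 0
    for _, path in ranked:
        cost = estimate_tokens(files[path])
        if used + cost <= budget:
            chosen.append(path)
            used += cost
    return chosen


repo = {
    "billing/invoice.py": "def total(invoice): return sum(invoice.lines)  # invoice totals",
    "billing/tax.py": "def tax(invoice): return invoice.total * 0.2",
    "docs/changelog.md": "Release notes for the marketing site.",
}
print(curate(repo, keywords=["invoice"], budget=50))
```

Here the changelog is excluded because it never mentions the task, which is the directed-question behaviour the paragraph above argues for. A real workflow would replace the keyword score with your own judgement about which files matter.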

How to use 1M context well

The video's practical playbook is short and worth quoting in full. There are eight items, all grounded in the transcript: the host's "best practices" list first, then rules drawn from his cost and use-case sections:

Start at the smallest effective context. "Start with the smallest effective context. Only scale up when you genuinely need the model to see more. Don't default to dumping everything in" (Ron, m97uC11VDtg transcript). Default to 200K. Only scale to 1M when output quality at the smaller size is actually insufficient. Don't max out by reflex.
Frontload and backload critical information. "Put your most important instructions and context at the top and bottom of your prompt. Right? So don't bury key details in the middle of a massive document block" (Ron, m97uC11VDtg transcript). The middle of a long dump is the worst place to put the rule you need the model to follow. Put the rules at the top and the bottom; treat the middle as supporting evidence.
Use structured prompts (CLAUDE.md / XML tags / section headers). "Things like claw.md [CLAUDE.md] files, XML tags, clear section headers. These help the model navigate a large context. Think of it like a table of contents but for Claude to read it" (Ron, m97uC11VDtg transcript). When you load 800K+ tokens, the model needs a navigable structure. Section headers and tagged blocks function as a TOC. Without them, the model is reading a wall of text.
Test with smaller context first. "Test with smaller context first. you know if you don't get the output that you want from 200k uh context window then you can scale up to 1 million" (Ron, m97uC11VDtg transcript). Don't go straight to 1M.
Monitor token usage. "And then of course monitor token usage do not dump everything without structure" (Ron, m97uC11VDtg transcript). The 1M tier is an excuse to skip curating; don't take the excuse.
Skip 1M for casual use. "For 90% of casual users, you'll never need 1 million tokens. If you're writing emails, brainstorming ideas, or asking quick questions, the standard context is actually more than enough" (Ron, m97uC11VDtg transcript). 200K is plenty for any interactive workflow that fits in a chat window.
Cap input-heavy agentic loops on Opus. From the host's "disadvantage number four" — a 1M-token Opus input call is $5; twenty in a loop is $100 in input alone (Ron, m97uC11VDtg transcript). If you're running overnight on Opus, set a per-call or per-session token ceiling and watch the meter.
Reserve 1M for full-repo analysis or multi-contract legal review. From the host's "where this shines" list: "developers working with large code bases, legal and compliance teams analyzing hundreds of contracts at once, researchers… processing entire papers or or data sets and agentic workflows where the AI needs to maintain state across hours of autonomous operations" (Ron, m97uC11VDtg transcript). These are the cases where RAG is most often a workaround, and where direct loading actually works. For everything else, keep RAG (or keep 200K).
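The frontload/backload and structured-prompt rules combine naturally into one prompt layout. This is one possible sketch, not a format the video or Anthropic prescribes: the tag names (`instructions`, `contents`, `documents`, `question`, `reminder`) and the `build_prompt` helper are arbitrary choices made here for illustration.

```python
def build_prompt(rules: list[str], documents: dict[str, str], question: str) -> str:
    """Put the rules at the top and bottom; wrap each document in a tagged block."""
    rule_block = "\n".join(f"- {rule}" for rule in rules)
    # A short table of contents so the model can navigate the document section.
    toc = "\n".join(f"- {name}" for name in documents)
    doc_blocks = "\n".join(
        f'<document name="{name}">\n{body}\n</document>'
        for name, body in documents.items()
    )
    return "\n\n".join([
        f"<instructions>\n{rule_block}\n</instructions>",
        f"<contents>\n{toc}\n</contents>",
        f"<documents>\n{doc_blocks}\n</documents>",
        f"<question>\n{question}\n</question>",
        # The rules are repeated last so they are not buried mid-context.
        f"<reminder>\n{rule_block}\n</reminder>",
    ])


prompt = build_prompt(
    rules=["Do not change public function signatures."],
    documents={"billing/invoice.py": "def total(invoice): ..."},
    question="Propose a refactor of the billing module.",
)
print(prompt)
```

The bulk material sits in the middle as supporting evidence, while the instruction the model must follow appears at both ends of the prompt.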

The overnight-build angle, restated

Subtopic 1.2 argued for overnight Claude Code runs on a cheap model backend. The 1M context rules add a ceiling: the same overnight run on Opus that burned $30 in an hour (see 1.2) will burn proportionally more if every loop is loading 1M tokens. The fix isn't to abandon 1M in the loop — it's to gate 1M to the moment in the build where it's actually needed (the full-repo audit, the cross-file refactor review) and keep the routine turns at 200K. Compaction goes down because the long turns are long; latency stays sane because the short turns are short.
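The gating idea can be sketched as a per-turn decision. Everything named below is hypothetical: the turn labels, the `Tier` enum, and `pick_tier` are illustrations of the rule, not part of Claude Code or any API.

```python
from enum import Enum


class Tier(Enum):
    STANDARD = 200_000
    EXTENDED = 1_000_000


# Hypothetical turn labels; a real build would define its own.
NEEDS_FULL_VIEW = {"full_repo_audit", "cross_file_refactor_review"}


def pick_tier(turn_kind: str, estimated_input_tokens: int) -> Tier:
    """Use the extended window only for turns that need it and cannot fit in 200K.

    Routine turns always get STANDARD; if their input is too large,
    the caller is expected to trim it rather than escalate.
    """
    if turn_kind in NEEDS_FULL_VIEW and estimated_input_tokens > Tier.STANDARD.value:
        return Tier.EXTENDED
    return Tier.STANDARD


for kind, tokens in [("fix_lint", 40_000), ("full_repo_audit", 650_000)]:
    print(kind, pick_tier(kind, tokens).name)
```

Both conditions have to hold before a turn escalates, so an audit that already fits in 200K stays at the cheaper, faster size.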

The "15% fewer compactions" caveat

The compaction benefit is real but specific. It applies to long-running agents (Claude Code sessions, in particular) where state accumulates over many turns. If you're not running a long-running agent — if you're doing one-shot Q&A or single-turn code review — the 15% number is irrelevant. Don't pay the latency tax for a benefit that doesn't apply to your workflow.

Try it yourself

The hands-on goal for this subtopic: pick a 1M-scale task, execute it well, and prove to yourself that 200K is still the right default for everything else.

Audit your last 10 Claude Code sessions. For each, log the average prompt size, the average response size, and whether the session ever hit a context-length error. If none of them hit a context error, you don't need 1M yet.
Pick one task that genuinely needs 1M. A multi-file refactor across a large repo, or a 30-contract legal review. Confirm it's a use case where RAG is the current workaround and the workaround is causing friction.
Build the prompt with curation in mind. Don't paste the whole repo — paste the relevant files, and put the load-bearing instructions at the top and the bottom of the prompt. Use XML tags or CLAUDE.md as a TOC if the prompt is over 800K tokens.
Run the task at 200K first. Note the output quality. If the model misses context that should have been loaded, the 200K window is the bottleneck and 1M is justified. If the output is fine, 1M would have been waste.
Re-run the same task at 1M. Compare output quality, latency, and cost. Because the 1M run loads more tokens, expect it to lose on latency and cost; it has to win on output quality by enough to justify both. If it doesn't, 200K is the right answer.
Set a per-session token ceiling on Opus. If you do commit to 1M in the loop, cap the session at the dollar amount you're willing to lose on a bad day. Twenty 1M-token Opus input calls is $100 in input alone (Ron, m97uC11VDtg transcript).
Keep 200K as the default for everything else. Email drafts, brainstorming sessions, single-turn code review, PR summaries — all of these are interactive, latency-sensitive, and short. 1M is not a chat tier (Ron, m97uC11VDtg transcript).
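The audit in the first step can be kept as a small script. This sketch assumes you have already collected per-turn token counts by hand or from your own logs; the `Session` record and `audit` function are hypothetical and do not read any real Claude Code log format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    """Hypothetical record of one session, filled in from your own notes or logs."""
    prompt_tokens: list[int]    # one entry per turn
    response_tokens: list[int]  # one entry per turn
    hit_context_error: bool


def audit(sessions: list[Session]) -> tuple[list[dict], bool]:
    """Return one summary row per session, plus whether any session hit a context error."""
    rows = [
        {
            "avg_prompt": round(mean(s.prompt_tokens)),
            "avg_response": round(mean(s.response_tokens)),
            "context_error": s.hit_context_error,
        }
        for s in sessions
    ]
    any_error = any(row["context_error"] for row in rows)
    return rows, any_error


rows, any_error = audit([
    Session([40_000, 60_000], [2_000, 4_000], hit_context_error=False),
    Session([150_000, 190_000], [5_000, 7_000], hit_context_error=True),
])
for row in rows:
    print(row)
print("consider 1M" if any_error else "stay at 200K")
```

The final line applies the rule from the first step: if no session hit a context error, there is no evidence yet that you need 1M.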

Common pitfalls

Maxing out 1M by reflex. The per-token price is the same, but the per-call cost isn't. A 1M-token Opus input call is $5; a 200K-token call is $1. Twenty 1M calls in a loop is $100; twenty 200K calls is $20. Default to 200K.
Trusting "same price" without checking the call size. The 1M tier is five times the input of 200K. The pricing story is "same per-token," not "same per-call" (Ron, m97uC11VDtg transcript). Agent loops multiply the difference.
Putting the critical rule in the middle of a long prompt. The "lost in the middle" effect is "well documented in research" and consistent across long-context models (Ron, m97uC11VDtg transcript). Frontload and backload what the model has to follow; treat the middle as context, not as instruction.
Loading 800K tokens to ask a quick question. Latency scales with token count. "If you're loading, you know, 800,000 tokens for a quick question, right, you're going to be waiting" (Ron, m97uC11VDtg transcript). Use 1M for batch and audit, not for chat.
Treating 1M as a replacement for RAG across the board. 1M replaces RAG cleanly only for the use cases the host names — full codebases, large contract sets, large research corpora, and long-running agent traces. For production retrieval at scale (search, RAG-as-a-service, semantic queries over large corpora), keep the RAG pipeline. Loading 1M tokens per query is not cheaper than a vector lookup.
Reading "15% fewer compactions" as a universal win. The benefit is specific to long-running agent sessions. If you're not running a long-running agent, the 15% doesn't apply, and the latency tax is pure loss.
Stopping curation because the window is big. "When the window is big enough to fit everything, people stop curating… paradoxically, that often produces worse results than a carefully constructed 50k token prompt" (Ron, m97uC11VDtg transcript). Capacity is not a substitute for direction.
Letting Opus run 1M-token loops without a ceiling. The pricing is linear. "A single 1 million token opus request cost $5 on the input alone. Run that in a loop 20 times and you spent $100 just on input tokens" (Ron, m97uC11VDtg transcript). Set a per-session or per-call cap.
Reading the 78.3% MRCR v2 score as a recall guarantee. "Opus 4.6 scored 78.3% on MRCV2 [MRCR v2], which is a long context recall benchmark" (Ron, m97uC11VDtg transcript). That's a benchmark on a specific kind of needle-in-a-haystack task. Real-world long documents (mixed-format PDFs, code with comments, contracts with appendices) have their own failure modes. The score is necessary, not sufficient.
Skipping 1M for a full-repo audit because "it's overkill." The clearest cases where 1M is actually the right answer are full-repo analysis and multi-contract review. If you have either of those workflows running on RAG, the 1M tier is the cleaner solution. Reserve the smaller windows for everything else.
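Several of the pitfalls above come down to running a loop without a ceiling. One way to enforce a cap is a small guard that is charged before each call. This is a sketch under stated assumptions: it tracks input-side spend only, defaults to the speaker's $5/M Opus rate, and `InputBudget` and `BudgetExceeded` are hypothetical names rather than features of any SDK.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next call would push input spend past the ceiling."""


class InputBudget:
    """Tracks input-side spend for one session and refuses calls past a ceiling."""

    def __init__(self, ceiling_usd: float, rate_per_million: float = 5.00):
        self.ceiling_usd = ceiling_usd
        self.rate_per_million = rate_per_million  # speaker's quoted Opus input rate
        self.spent_usd = 0.0

    def charge(self, input_tokens: int) -> float:
        """Record a call's input cost, or raise before the ceiling is crossed."""
        cost = self.rate_per_million * input_tokens / 1_000_000
        if self.spent_usd + cost > self.ceiling_usd:
            raise BudgetExceeded(
                f"call costs ${cost:.2f}; only "
                f"${self.ceiling_usd - self.spent_usd:.2f} left in the session budget"
            )
        self.spent_usd += cost
        return self.spent_usd


budget = InputBudget(ceiling_usd=12.00)
try:
    for _ in range(20):
        budget.charge(1_000_000)  # call this before sending each request
except BudgetExceeded as stop:
    print(f"stopped after ${budget.spent_usd:.2f}: {stop}")
```

With a $12 ceiling the loop stops after two 1M-token calls instead of running all twenty, which is the difference between a $10 mistake and a $100 one.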

Sources

Claude 1M Context: What No One Tells You.. — 399 views · video_id: m97uC11VDtg · transcript fully populated as of 2026-06-17. Every verbatim quote in this article is from the transcript_content column of public.videos for that video_id.
YouTube comments — SELECT * FROM public.youtube_comments WHERE video_id = 'm97uC11VDtg'; returned 0 rows against project ttxdssgydwyurwwnjogq on 2026-06-17 — no community corroboration, so all source attribution is to the host (Ron) directly.
public.ai_models — 22 rows pulled on 2026-06-17. Cross-confirms Anthropic as the vendor of Claude Opus 4.6 and Sonnet 4.6 (rows claude-opus-4-6, claude-opus-4-7, claude-sonnet-4-6), but pricing_info is null on every row, so the per-token rates in this article remain transcript-sourced rather than DB-sourced.
public.ai_updates — searched 2026-06-17 with title ~* '(context|1M|200K|window|claude code|opus|sonnet|MRCR|pricing)' against the ai_updates table. No rows match the 1M context announcement itself; the closest is AI Briefing 2026-05-03 which notes "xAI launches Grok 4.3 with always-on reasoning, 1M context, and aggressively low pricing" — context that 1M context is now a competitive baseline, but the Opus / Sonnet 1M spec itself is not in the updates table, only in the video transcript.