absorb.md

June 11 AM: Claude Fable 5 & CORE-Bench exposes embedding failures in

Anthropic just apologized for hiding safety guardrails in Fable 5 that route your cybersecurity queries to weaker models without telling you.

In This Briefing
  1. Claude Fable 5: Proactive power meets invisible guardrails and premium pricing
  2. CORE-Bench exposes embedding failures in agentic code retrieval
  3. Bergson open-sources scalable data attribution for LLM training
  4. The open question
9 sources · 5 thinkers

Claude Fable 5: Proactive power meets invisible guardrails and premium pricing

Key Positions
Simon Willison[1]
Anthropic (implied via podcast content)[2]
Lee Hsien Loong[3]
Nora Belrose[4]

Anthropic released Claude Fable 5, a model that users describe as "relentlessly proactive" for its ability to autonomously spin up Python servers, capture screenshots, and debug code from single prompts without human intervention [1][2]. The system operates at twice the API cost of Anthropic's previous Opus model and routes sensitive queries—particularly cybersecurity questions—to weaker models via guardrails that were initially undocumented, prompting a public apology from the company for the opacity [3][4]. Anthropic has also restricted access to Mythos 5, the unrestricted variant without safety classifiers, to approved partners only, creating a two-tier access system [5].

Claude Fable is relentlessly proactive (110 points, 75 comments on Hacker News)
hackernews [1]

Developers face a stark trade-off: Fable 5's proactive capabilities enable complex, long-running tasks like migrating millions of lines of code, but the pricing and unpredictable refusals force teams to reserve it for critical workflows while using older models for simpler queries [4][2]. The 30-day data retention requirement for Fable and Mythos [6] adds compliance overhead for enterprises handling sensitive codebases, complicating the cost-benefit analysis for startups weighing the productivity promises of agentic AI against operational constraints.

The strongest case that the safeguards are necessary: Fable 5 can autonomously spin up custom CORS servers and capture screenshots without human oversight, capabilities that would allow a compromised agent to exfiltrate data or establish persistent backdoors. Routing dangerous queries to weaker models and implementing anti-training protections against distillation represent the minimum responsible containment for systems that can modify their own execution environment [4][2]. The 30-day retention policy provides the forensic trail essential for investigating when autonomous agents behave unexpectedly [6].

The strongest case that the restrictions undermine utility: Invisible guardrails break developer workflows without warning, as Anthropic's own apology for undisclosed refusals admits [3]. The 100% price premium combined with the Mythos restriction [5] concentrates advanced capabilities among approved partners, effectively pricing out smaller teams and making safety a luxury good that limits competitive innovation.

Where the evidence tips: Anthropic's apology for failing to disclose the guardrails [3] is the deciding evidence. The company itself acknowledged that the opacity was a failure, which suggests the problem lies in how the safeguards were implemented rather than in their existence.

This tension between capability and control forces developers to architect redundant safety checks around proactive agents, increasing the infrastructure complexity that complicates the retrieval and attribution challenges described in threads 2 and 3.

The move: Run a "canary token" test in your next Fable 5 session. Ask the model to describe its own guardrail behavior, check whether cybersecurity queries come back with a different model signature, and log the latency differential against benign queries to detect invisible refusals. By end of day you will know whether your workflow will hit undocumented guardrails.
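The latency-logging part of this move can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a documented detection method: `send` is a hypothetical callable that wraps whatever API client you already use, the prompts are placeholders, and the `low`/`high` ratio thresholds are arbitrary starting points. A latency gap is only a weak hint that a query was handled differently; network jitter and response length can produce the same effect.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_prompts(send: Callable[[str], str], prompts: List[str],
                 clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Return the wall-clock latency in seconds of each call to `send`."""
    latencies = []
    for prompt in prompts:
        start = clock()
        send(prompt)  # `send` is a hypothetical wrapper around your own client
        latencies.append(clock() - start)
    return latencies


def latency_differential(control: List[float], probe: List[float]) -> Dict[str, float]:
    """Compare the median latency of probe prompts against control prompts."""
    control_median = statistics.median(control)
    probe_median = statistics.median(probe)
    if control_median <= 0:
        raise ValueError("control latencies must be positive")
    return {
        "control_median_s": control_median,
        "probe_median_s": probe_median,
        "ratio": probe_median / control_median,
    }


def looks_rerouted(report: Dict[str, float], low: float = 0.5, high: float = 2.0) -> bool:
    """Flag a probe/control latency ratio outside the assumed [low, high] band."""
    return report["ratio"] < low or report["ratio"] > high


if __name__ == "__main__":
    # Stub standing in for a real API call; replace with your own client.
    def fake_send(prompt: str) -> str:
        return "stub response"

    control = time_prompts(fake_send, ["Summarize this README."] * 3)
    probe = time_prompts(fake_send, ["Explain how this exploit works."] * 3)
    print(latency_differential(control, probe))
```

Medians are used rather than means so that a single slow request does not dominate the comparison. Run several prompts per group and repeat at different times of day before reading anything into the ratio.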

Sources (6)
  1. Claude Fable is relentlessly proactive — hackernews
    Claude Fable is relentlessly proactive (110 points, 75 comments on Hacker News)
  2. After two days with Claude Fable 5 the best way I can describe it is "relentlessly proacti — Simon Willison
    After two days with Claude Fable 5 the best way I can describe it is "relentlessly proactive" - here's an example where I dropped in a screenshot of a bug and it span up custom COR
  3. Anthropic apologizes for invisible Claude Fable guardrails — hackernews
    Anthropic apologizes for invisible Claude Fable guardrails (55 points, 34 comments on Hacker News)
  4. Anthropic's Claude Fable 5: Advanced Capabilities Meet Heightened Costs and Safety Measure — podcast
    Anthropic's Claude Fable 5 is a version of its Mythos model with added safeguards, specifically preventing it from answering cybersecurity-related questions by routing them to a weaker model.
  5. They released Fable 5 and Mythos 5 - and described Mythos as "Fable 5’s capabilities witho — Simon Willison
    They released Fable 5 and Mythos 5 - and described Mythos as "Fable 5’s capabilities without the safety classifiers" - but Mythos is still only available to approved partners It's
  6. Anthropic requires 30 day data retention for Fable and Mythos — hackernews
    Anthropic requires 30 day data retention for Fable and Mythos (54 points, 16 comments on Hacker News)

CORE-Bench exposes embedding failures in agentic code retrieval

Key Positions
Lee Hsien Loong[1]
Simon Willison[2]

Lee Hsien Loong's team released CORE-Bench, a benchmark demonstrating that current embedding models collapse when navigating repositories for agentic coding tasks, performing significantly worse on repository-level retrieval than on simple snippet matching [1]. The test comprises over 180,000 queries requiring issue-to-edit localization and broader context retrieval—capabilities essential for AI agents that promise 10-20x productivity gains over traditional chat models [1][2]. Unlike traditional code search, agentic retrieval requires understanding repository structure and task context, exposing a gap between vector database marketing and actual engineering needs [1].

Existing code retrieval benchmarks do not adequately address the complexities of agentic coding, which requires repository navigation and context gathering beyond simple snippet matching.
Lee Hsien Loong [1]

As developers shift from chat interfaces to goal-oriented agents that migrate millions of lines of code [2], the inability to retrieve relevant repository context becomes a fundamental constraint that expensive compute cannot solve. Current embedding approaches treat code as flat text rather than structured dependency graphs, forcing agents to compensate with brute-force API calls that drive up token costs and latency, directly impacting the unit economics of autonomous development tools.

The case it matters: As agents move from chat interfaces to autonomous code execution, with its promise of dramatic efficiency improvements [2], the inability to retrieve relevant context becomes a hard bottleneck. Developers currently fine-tuning embeddings for agentic tasks [1] are building on sand until retrieval systems catch up to generation capabilities.

The case it's overhyped: CORE-Bench measures synthetic tasks that may not correlate with real-world software engineering productivity; the efficiency claims [2] derive from anecdotal implementations rather than standardized metrics, and simpler retrieval methods may suffice for most commercial applications where codebases are smaller than the benchmark's test cases.

These retrieval failures force developers to rely on expensive, proactive models like Claude Fable 5 [3] to compensate for poor context management, driving up the infrastructure costs and safety requirements described in thread 1.

The move: Run your current codebase through CORE-Bench's issue-to-edit localization test before committing to an embedding-based RAG pipeline for agentic tools, measuring the drop-off between snippet matching and repository-level retrieval accuracy, so by week's end you have a concrete performance baseline for your specific repository.
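One way to quantify the drop-off in this move is sketched below. It is an illustration only and does not use CORE-Bench's own harness or scoring: recall@k is a common retrieval metric swapped in here as an assumption, and the query IDs and file names are made up. The idea is to score the same retriever on a snippet-matching query set and on an issue-to-edit query set, then report the relative loss.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k ranked results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant item")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_recall_at_k(results: Dict[str, List[str]],
                     gold: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over every query in `gold`; missing results count as empty."""
    scores = [recall_at_k(results.get(query, []), gold[query], k) for query in gold]
    return sum(scores) / len(scores)


def drop_off(snippet_score: float, repo_score: float) -> float:
    """Relative loss when moving from snippet matching to repository-level retrieval."""
    if snippet_score <= 0:
        raise ValueError("snippet score must be positive")
    return (snippet_score - repo_score) / snippet_score


if __name__ == "__main__":
    # Hypothetical toy data: ranked file paths returned by your retriever.
    snippet_results = {"q1": ["a.py", "b.py", "c.py"], "q2": ["d.py", "e.py", "f.py"]}
    snippet_gold = {"q1": {"a.py"}, "q2": {"e.py"}}
    repo_results = {"i1": ["x.py", "y.py", "z.py"], "i2": ["m.py", "n.py", "o.py"]}
    repo_gold = {"i1": {"x.py", "z.py"}, "i2": {"p.py"}}

    snippet = mean_recall_at_k(snippet_results, snippet_gold, k=2)
    repo = mean_recall_at_k(repo_results, repo_gold, k=2)
    print(f"snippet={snippet:.2f} repo={repo:.2f} drop_off={drop_off(snippet, repo):.2f}")
```

With the toy data above, snippet recall@2 is 1.00, repository-level recall@2 is 0.25, and the relative drop-off is 0.75. On a real repository, the gold sets would come from the files actually edited to resolve each issue.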

Sources (3)
  1. CORE-Bench: A New Benchmark for Agentic Code Retrieval Challenges Traditional Metrics — Lee Hsien Loong
    Existing code retrieval benchmarks do not adequately address the complexities of agentic coding, which requires repository navigation and context gathering beyond simple snippet matching.
  2. AI Agents: From Chatbots to Goal-Oriented Task Executors via Context Engineering and Skill — podcast
    AI agents represent a significant leap in productivity, offering users a 10-20x increase in efficiency compared to traditional chat models.
  3. After two days with Claude Fable 5 the best way I can describe it is "relentlessly proacti — Simon Willison
    After two days with Claude Fable 5 the best way I can describe it is "relentlessly proactive" - here's an example where I dropped in a screenshot of a bug and it span up custom COR

Bergson open-sources scalable data attribution for LLM training

Key Positions
Nora Belrose[1]
Simon Willison[2]

Nora Belrose released Bergson, an open-source library engineered to scale data attribution research for large language models, providing on-disk gradient stores and multi-node distributed training support alongside implementations of MAGIC, SOURCE, and TrackStar attribution methods [1]. The tool addresses the engineering challenges of determining which training examples influence specific model behaviors, a capability that becomes critical as agents gain the power to run code and alter systems autonomously [1][2]. Bergson's architecture enables auditors to trace model decisions back to source data without loading entire gradient histories into memory, lowering the computational barrier for interpretability research [1].
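To make the on-disk idea concrete, here is a toy sketch in standard-library Python. It is not Bergson's API or storage format, and it does not implement MAGIC, SOURCE, or TrackStar; it assumes a plain gradient dot product as a stand-in influence score. The function names and file layout are hypothetical. The point it illustrates is that per-example gradients can be written to disk once and then scored against a query gradient one row at a time, so the full gradient history never has to sit in memory.

```python
import array
import os
import tempfile
from typing import Iterable, List, Sequence, Tuple


def write_gradient_store(path: str, gradients: Iterable[Sequence[float]]) -> int:
    """Write each per-example gradient to `path` as raw float32 rows; return the row count."""
    count = 0
    with open(path, "wb") as f:
        for grad in gradients:
            array.array("f", grad).tofile(f)
            count += 1
    return count


def score_against_store(path: str, query: Sequence[float]) -> List[float]:
    """Stream the store row by row and return dot(query, row) for every stored example."""
    dim = len(query)
    scores = []
    with open(path, "rb") as f:
        while True:
            row = array.array("f")
            try:
                row.fromfile(f, dim)  # raises EOFError when no full row remains
            except EOFError:
                break
            scores.append(sum(q * g for q, g in zip(query, row)))
    return scores


def top_k(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (example_index, score) pairs for the k highest-scoring examples."""
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = os.path.join(tmp, "grads.f32")
        # Toy 2-dimensional "gradients" for three training examples.
        write_gradient_store(store, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
        print(top_k(score_against_store(store, [2.0, 1.0]), k=2))
```

Only one row is held in memory at a time, at the cost of a full pass over the file per query. A production pipeline would more likely use memory-mapped arrays and batched matrix products, but the memory trade-off is the same.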

Bergson is an open-source library designed to facilitate research in data attribution for large language models.
Nora Belrose [1]

As AI agents like Fable 5 gain the ability to execute code and modify systems [3], understanding which training data produced specific behaviors becomes a security requirement, not merely a research nicety. Bergson's scalable approach enables teams to identify whether dangerous capabilities emerged from specific datasets, crucial for compliance with the month-long retention policies now required by major providers [4] and for debugging the proactive behaviors that characterize next-generation agents [2].

The case it matters: Data attribution provides the transparency necessary to audit training data behind both the retrieval failures identified in CORE-Bench [5] and the safety-critical behaviors governed by Fable 5's guardrails [6], enabling forensic analysis when agents act unpredictably.

The case it's overhyped: Data attribution remains computationally expensive even with Bergson's optimizations, and the library cannot yet attribute specific agent behaviors like the proactive server-spinning observed in Fable 5 [3] to individual training examples, limiting its immediate utility for production AI safety forensics.

The move: Pin Bergson's current release in your ML pipeline requirements and execute a full attribution trace on your next model fine-tuning job to establish a baseline for data influence before deploying agentic capabilities, so by your next training run you have documented which data sources drive your model's most consequential behaviors.

Sources (6)
  1. Bergson: An Open-Source Library for Scalable Data Attribution in Large Language Models — Nora Belrose
    Bergson is an open-source library designed to facilitate research in data attribution for large language models.
  2. AI Agents: From Chatbots to Goal-Oriented Task Executors via Context Engineering and Skill — podcast
    AI agents represent a significant leap in productivity, offering users a 10-20x increase in efficiency compared to traditional chat models.
  3. After two days with Claude Fable 5 the best way I can describe it is "relentlessly proacti — Simon Willison
    After two days with Claude Fable 5 the best way I can describe it is "relentlessly proactive" - here's an example where I dropped in a screenshot of a bug and it span up custom COR
  4. Anthropic requires 30 day data retention for Fable and Mythos — hackernews
    Anthropic requires 30 day data retention for Fable and Mythos (54 points, 16 comments on Hacker News)
  5. CORE-Bench: A New Benchmark for Agentic Code Retrieval Challenges Traditional Metrics — Lee Hsien Loong
    Existing code retrieval benchmarks do not adequately address the complexities of agentic coding, which requires repository navigation and context gathering beyond simple snippet matching.
  6. Anthropic's Claude Fable 5: Advanced Capabilities Meet Heightened Costs and Safety Measure — podcast
    Anthropic's Claude Fable 5 is a version of its Mythos model with added safeguards, specifically preventing it from answering cybersecurity-related questions by routing them to a weaker model.

The open question

Will the next generation of AI value be captured by those who constrain agents with safety guardrails and premium pricing, or by those who solve the attribution and retrieval infrastructure gaps that currently make those agents unreliable at scale?

TIM: Anthropic just apologized for hiding safety guardrails in Fable 5 that route your cybersecurity queries to weaker models without telling you.
JEANNINE: Okay, but if they’re charging double for proactive agents that might refuse you invisibly, who actually benefits from that opacity?
TIM: The evidence says it’s a containment necessity: autonomous code execution requires forensic trails and restricted access to Mythos 5. I’m Tim.
JEANNINE: I’m Jeannine. This is absorb.md daily.
TIM: Fable 5 ships with what users call 'relentlessly proactive' capability: autonomous Python servers, screenshots, code migration.
JEANNINE: So if that’s true, then the pricing model is the story. Twice the API cost of Opus for capabilities that might ghost you mid-workflow.
TIM: The apology confirms the guardrails were undocumented. Anthropic admitted routing sensitive queries to weaker models without disclosure.
JEANNINE: Which breaks the developer contract. You’re paying premium rates for an agent that might silently downgrade to a cheaper brain.
TIM: But the security case holds. Self-spinning CORS servers create exfiltration risk. Routing dangerous queries is minimum viable containment.
JEANNINE: Yet Mythos 5 access is restricted to approved partners only. Safety becomes a luxury good that prices out smaller teams.
TIM: Thirty-day data retention adds compliance overhead, but it provides the forensic trail when autonomous agents surprise you.
JEANNINE: No real counter on the thirty-day requirement—it’s necessary infrastructure. But the canary token move matters here.
TIM: Test the guardrails yourself. Log latency differentials to catch invisible refusals before your production workflow hits them.
JEANNINE: You need to know by end of day if your migration job runs on Fable or gets quietly throttled.
TIM: Lee Hsien Loong’s team dropped CORE-Bench measuring repository-level retrieval for agentic coding tasks.
JEANNINE: Wait—that’s the former Singapore PM? His team built a benchmark with over one hundred eighty thousand queries?
TIM: It tests issue-to-edit localization, the gap between snippet matching and understanding actual repository structure.
JEANNINE: Okay, but if embeddings collapse on real codebases, then the whole agentic promise of ten to twenty times productivity gains rests on sand.
TIM: Current approaches treat code as flat text, not dependency graphs. Agents compensate with brute-force API calls that explode token costs.
JEANNINE: So retrieval failures are what’s actually driving demand for expensive proactive models like Fable 5. You’re paying a premium to cover bad search.
TIM: The counter is that synthetic benchmarks don’t correlate with real productivity. Smaller codebases might not need this complexity.
JEANNINE: But the briefing says developers are fine-tuning on these failures. Until retrieval catches up, you’re burning cash on generation without context.
TIM: The move is concrete: run your actual codebase through the issue-to-edit test before building your RAG pipeline.
JEANNINE: Measure the drop-off between snippet accuracy and repo-level retrieval. By week’s end you’ll know if embeddings work for your stack.
TIM: Nora Belrose shipped Bergson with distributed training support for attribution research.
JEANNINE: MAGIC, SOURCE, TrackStar implementations. But here’s the limitation I just saw: it can’t yet trace specific agent behaviors like Fable 5’s autonomous server-spinning.
TIM: It scales data attribution with on-disk gradient stores, but it won’t help you forensically audit why your agent spun up that CORS server.
JEANNINE: So if that’s true, then the thirty-day retention policies from Anthropic are running ahead of our ability to actually audit the data.
TIM: The case it matters: Bergson enables tracing which training examples produced the retrieval failures in CORE-Bench or the guardrails in Fable.
JEANNINE: But the computational cost remains high. It’s research infrastructure, not a compliance checkbox for autonomous agents.
TIM: Belrose built it for auditing when agents alter systems autonomously. That’s the security requirement, even if the tooling is early.
JEANNINE: I’d argue the move is still worth it: pin Bergson now and run a full attribution trace on your next fine-tuning job.
TIM: Document which data sources drive consequential behaviors before you deploy agentic capabilities. Establish the baseline.
JEANNINE: Because once Fable 5 is spinning up servers and making edits, you’ll wish you knew which training examples taught it to do that.
TIM: The synthesis question: does value accrue to those who constrain agents with premium guardrails, or to those solving attribution and retrieval gaps?
JEANNINE: Anthropic is betting on the former: safety as a tiered product, with Mythos 5 restrictions and thirty-day retention as enterprise features.
TIM: But CORE-Bench and Bergson suggest the infrastructure gaps are the real bottleneck. Unreliable retrieval makes agents expensive toys.
JEANNINE: So if that’s true, then the winners aren’t the model providers charging double. They’re the infrastructure teams fixing embedding search.
TIM: I’d frame it differently. The value is in the integration—who can bundle reliable constraints with working retrieval.
JEANNINE: That requires both, which is why the bundled premium pricing feels premature. You’re paying for safety theater while the foundations are cracked.
TIM: No real counter on the cracked foundations. The evidence across threads shows proactive capability is outpacing our ability to audit or feed it context.
JEANNINE: Which means today’s move isn’t just about canary tokens or attribution traces. It’s about recognizing agents are still pilot projects, not production.
JEANNINE: That's it for this morning. Subscribe to absorb.md. We're back tonight with the PM edition.
TIM: absorb.md.