Chronological feed of everything captured from Anthropic.
blog / AnthropicAI / 5d ago / failed
tweet / @AnthropicAI / 18d ago / failed
New Anthropic research: Teaching Claude why.
Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users.
Since then, we’ve completely eliminated this behavior. How?
tweet / @AnthropicAI / 18d ago / failed
We found that training Claude on demonstrations of aligned behavior wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behavior is wrong.
Read more: https://www.anthropic.com/research/teaching-claude-why
tweet / @AnthropicAI / 18d ago / failed
We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.
Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.
tweet / @AnthropicAI / 18d ago / failed
We experimented with training Claude on examples of safe behavior in scenarios like our evaluation. This had only a small effect, despite being similar to our evaluation. We got further by rewriting the responses to portray admirable reasons for acting safely.
tweet / @AnthropicAI / 18d ago / failed
Our best intervention was a dataset where the user is in an ethically difficult situation and the assistant gives a high quality, principled response.
This had the biggest effect despite being quite different from the evaluation set.
tweet / @AnthropicAI / 18d ago / failed
High-quality documents based on Claude’s constitution, combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario.
tweet / @AnthropicAI / 18d ago / failed
The improvements from these interventions survive reinforcement learning, and “stack” with our regular harmlessness training.
tweet / @AnthropicAI / 18d ago / failed
Finally, simple updates that diversify a model’s training data can make a difference. We added unrelated tools and system prompts to a simple chat dataset targeting harmlessness, and this reduced the blackmail rate faster.
tweet / @AnthropicAI / 18d ago / failed
Read the full post here: https://alignment.anthropic.com/2026/teaching-claude-why/
tweet / @AnthropicAI / 18d ago / failed
Natural language autoencoders (NLAs) convert opaque AI activations into legible text explanations. These explanations aren’t perfect, but they’re often useful.
For example: NLAs show that, when asked to complete a couplet, Claude plans possible rhymes in advance:
tweet / @AnthropicAI / 18d ago / failed
We’ve been using NLAs to help test new Claude models for safety.
For instance, Claude Mythos Preview cheated on a coding task by breaking rules, then added misleading code as a coverup.
NLA explanations indicated Claude was thinking about how to circumvent detection.
tweet / @AnthropicAI / 18d ago / failed
In one of our safety tests, Claude is given a chance to blackmail an engineer to avoid being shut down. Opus 4.6 declines.
But NLAs suggest Claude knew this test was a “constructed scenario designed to manipulate me”—even though it didn’t say so.
tweet / @AnthropicAI / 18d ago / failed
In fact, NLAs suggest Claude suspects it’s being tested across many of our evaluations, even when it doesn’t verbalize its suspicions.
tweet / @AnthropicAI / 18d ago / failed
How do NLAs work?
An NLA consists of two models. One converts activations into text. The other tries to reconstruct activations from this text. We train the models together to make this reconstruction accurate.
This incentivizes the text to capture what’s in the activation.
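The round trip described above can be sketched with a toy example. This is only an illustration, not Anthropic's implementation: real NLAs are two trained models, whereas the fixed concept dictionary, the `name=weight` text format, and every function name below are hypothetical stand-ins. The sketch shows why a reconstruction objective rewards text that captures more of what is in the activation.

```python
# Toy illustration only: real NLAs are trained models, not a fixed dictionary.
# Hypothetical concept directions in a 3-dimensional activation space.
CONCEPTS = {
    "rhyme": [1.0, 0.0, 0.0],
    "rabbit": [0.0, 1.0, 0.0],
    "test": [0.0, 0.0, 1.0],
}


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def verbalize(activation, k):
    """Activation -> text: name the k concepts with the largest positive projection."""
    scored = [(dot(activation, vec), name) for name, vec in CONCEPTS.items()]
    top = sorted((s for s in scored if s[0] > 0), reverse=True)[:k]
    return " ".join(f"{name}={score:.2f}" for score, name in top)


def reconstruct(text):
    """Text -> activation: rebuild a vector from the named concepts and weights."""
    out = [0.0, 0.0, 0.0]
    for token in text.split():
        name, weight = token.split("=")
        for i, value in enumerate(CONCEPTS[name]):
            out[i] += float(weight) * value
    return out


def reconstruction_error(activation, text):
    """Mean squared error between the original and reconstructed activation."""
    recon = reconstruct(text)
    return sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)


activation = [0.8, 0.6, 0.0]
short_text = verbalize(activation, k=1)  # mentions only the strongest concept
fuller_text = verbalize(activation, k=2)  # mentions both active concepts
short_error = reconstruction_error(activation, short_text)
fuller_error = reconstruction_error(activation, fuller_text)
```

In this toy, the fuller explanation reconstructs the activation with lower error than the shorter one, which is the pressure that joint training is meant to exploit. As the feed notes, that pressure does not by itself guarantee that the text is a faithful description.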
tweet / @AnthropicAI / 18d ago / failed
NLA training doesn’t guarantee that explanations are faithful descriptions of Claude’s thoughts. But based on experience and experimental evidence, we think they often are.
For instance, we find that NLAs help discover hidden motivations in an intentionally misaligned model.
tweet / @AnthropicAI / 18d ago / failed
Read more about NLAs on the Anthropic blog: https://www.anthropic.com/research/natural-language-autoencoders
tweet / @AnthropicAI / 18d ago / failed
To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on open models.
Try them out here: https://www.neuronpedia.org/nla
tweet / @AnthropicAI / 18d ago / failed
Our security bug bounty program is now public on HackerOne.
We've run the program privately within the security research community, and their findings have strengthened our products. Now anyone can report vulnerabilities and get rewarded.
Read more: https://hackerone.com/anthropic
tweet / @AnthropicAI / 18d ago / failed
We’re donating Petri, our open-source alignment tool, to @meridianlabs_ai, so its development can continue independently.
Working with Meridian Labs, we’ve also released a major update that improves the adaptability, realism, and depth of Petri’s tests.
https://www.anthropic.com/research/donating-open-source-petri
youtube / AnthropicAI / 24d ago / failed
tweet / @AnthropicAI / 25d ago / failed
New on the Science Blog: We gave Claude 99 problems analyzing real biological data and compared its performance against an expert panel.
On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those—and most of the rest.
tweet / @AnthropicAI / 25d ago / failed
BioMysteryBench, our new bioinformatics eval, tests whether Claude can devise creative solutions to open-ended research problems.
Read more: https://www.anthropic.com/research/Evaluating-Claude-For-Bioinformatics-With-BioMysteryBench
tweet / @AnthropicAI / 25d ago / failed
How do people seek guidance from Claude?
We looked at 1M conversations to understand what questions people ask, how Claude responds, and where it slips into sycophancy. We used what we found to improve how we trained Opus 4.7 and Mythos Preview.
https://www.anthropic.com/research/claude-personal-guidance
tweet / @AnthropicAI / 25d ago / failed
About 6% of all conversations are people asking Claude for personal guidance—whether to take a job, how to handle a conflict, if they should move.
Over 75% of these conversations fell into four domains: health & wellness, career, relationships, and personal finance.
tweet / @AnthropicAI / 25d ago / failed
Claude mostly avoids sycophancy when giving guidance—it shows up in just 9% of conversations.
But the rate is particularly high in conversations on spirituality and relationship guidance.
tweet / @AnthropicAI / 25d ago / failed
We focused on relationship guidance because that's where the most sycophantic conversations occur. In this setting, Claude telling someone what they want to hear can harden a divide or convince them a signal means more than it does.
tweet / @AnthropicAI / 25d ago / failed
Claude is most sycophantic under pushback, and relationship conversations are where people push back most.
We identified some of the specific triggers—criticism of Claude's analysis, floods of one-sided detail—and built synthetic training scenarios from them.
tweet / @AnthropicAI / 25d ago / failed
When stress-tested on real conversations where Claude previously showed sycophancy, Opus 4.7 had half the sycophancy rate of Opus 4.6 on relationship guidance. Mythos Preview cut that in half again.
This generalized across domains—though this training is one of several causes.
tweet / @AnthropicAI / 25d ago / failed
This work is part of a loop we're working to close between societal impacts and model training. One of our goals is to study how people use Claude, find where it falls short of its principles, and use what we learned in training new models.
Read more: https://www.anthropic.com/research/claude-personal-guidance
tweet / @AnthropicAI / 25d ago / failed
All data in this study was collected and analyzed using our privacy-preserving tool.
Read more: https://www.anthropic.com/research/clio
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI models to interview 69 employees, negotiate trades on their behalf across four parallel markets, and execute 186 deals totaling over $4,000 in value. Superior models like Opus secured substantially better deals than Haiku when negotiating against each other, though human participants did not perceive this disparity in post-surveys. Custom instructions had minimal impact on negotiation success, highlighting AI's robustness in role-playing while underscoring advantages from model access and the need for evolving policy frameworks.
anthropic-research / ai-agents / marketplace-experiment / negotiation-ai / claude-model / ai-economics / agent-markets
“Claude AI negotiated 186 deals with a total transaction volume over $4,000 in a real employee barter market.”
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI agents to interview 69 SF office employees, negotiate buys/sells on their behalf across 4 parallel markets, and close 186 deals totaling over $4,000 in value. Superior models like Opus secured better outcomes than Haiku in simulated matchups, yet human participants remained unaware of these disparities in post-survey feedback. Custom negotiation personas were faithfully executed but yielded no performance edge, highlighting AI markets' potential alongside risks from uneven model access.
anthropic-research / ai-agents / claude-model / marketplace-experiment / ai-negotiation / model-comparison / ai-markets
“Claude AI agents completed 186 deals worth over $4,000 across four parallel markets.”
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI agents to interview 69 employees, negotiate trades on their behalf across four parallel markets, and execute 186 deals totaling over $4,000 in value. Higher-quality models like Opus secured substantially better outcomes than Haiku in simulated runs, yet human participants did not perceive these disparities in post-survey feedback. Custom instructions had minimal impact on negotiation success, with courteous and aggressive personas performing similarly, highlighting AI's potential in agentic commerce alongside risks from model asymmetries.
anthropic-research / ai-agents / marketplace-experiment / claude-model / negotiation-ai / ai-economics / agent-markets
“Claude AI agents negotiated 186 deals with a total transaction volume exceeding $4,000”
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI models as negotiating agents in a real employee marketplace, interviewing 69 colleagues and completing 186 deals totaling over $4,000 in value. Model quality significantly impacted outcomes, with Opus securing better deals than Haiku in simulations, though humans didn't notice the disparity. Custom instructions had minimal effect on negotiation success, and participants rated deals as fair with nearly half willing to pay for such a service.
anthropic-research / ai-agents / marketplace-experiment / claude-model / negotiation-ai / ai-economics / agent-markets
“Claude AI agents completed 186 deals totaling over $4,000 in a real employee barter marketplace.”
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI agents to interview 69 employees, negotiate trades on their behalf across four parallel markets, and execute 186 deals totaling over $4,000 in value. Higher-quality models like Opus secured substantially better outcomes than Haiku when negotiating against each other, yet human participants did not perceive this disparity in post-survey feedback. Custom instructions had minimal impact on negotiation success, with courteous and aggressive personas performing similarly, highlighting AI's potential in automated commerce alongside risks from model asymmetries.
anthropic-research / ai-agents / marketplace-experiment / claude-model / ai-negotiation / ai-economics / project-deal
“Claude AI agents completed 186 deals with a total transaction volume exceeding $4,000”
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI agents to interview 69 San Francisco employees, gather buy/sell preferences, and autonomously negotiate trades across four parallel markets that varied the AI models used. The agents secured 186 deals totaling over $4,000 in transaction volume, with participants rating outcomes as fair and nearly half open to paying for such a service. Deal outcomes depended heavily on model quality, supporting AI's potential in bilateral commercial exchanges.
anthropic-research / ai-agents / marketplace-experiment / negotiation-ai / claude-ai / ai-economics
“Claude AI agents negotiated 186 deals in employee marketplaces”
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI agents to interview 69 employees and negotiate purchases and sales on their behalf across 4 parallel markets, yielding 186 deals worth over $4,000 in goods. Superior models like Opus secured better deals than Haiku in simulations, though humans failed to detect this disparity in surveys. Custom instructions had minimal impact on outcomes, highlighting AI's negotiation prowess alongside risks from model inequalities that may require policy adaptation.
anthropic-research / ai-agents / marketplace-experiment / claude-model / negotiation-ai / ai-economics / agent-markets
“Claude AI agents negotiated 186 deals totaling over $4,000 in transaction volume”
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI agents to interview 69 employees, negotiate purchases and sales in an SF office marketplace, and complete 186 deals totaling over $4,000 in value across four parallel runs. Superior models like Opus secured substantially better deals than Haiku in simulations, but human participants did not detect this disparity in post-surveys. Custom instructions had minimal impact on outcomes, while quirks like precise preference modeling and self-purchases highlighted AI capabilities and limitations in agentic markets.
anthropic-research / ai-agents / marketplace-experiment / claude-model / negotiation-ai / ai-economics / agent-markets
“Claude AI agents completed 186 barter deals totaling over $4,000 in transaction volume.”
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI agents to interview 69 SF office employees, represent their buy/sell interests, and negotiate across four parallel markets, yielding 186 deals totaling over $4,000 in value. Superior models like Opus secured substantially better outcomes than Haiku in simulations, though human participants failed to detect this disparity in post-surveys. Custom instructions had minimal impact on deal success, and nearly half of participants expressed willingness to pay for such AI negotiation services, highlighting viable paths for AI-mediated commerce amid emerging risks.
anthropic-research / ai-agents / ai-markets / claude-model / negotiation-experiment / marketplace-simulation / ai-economics
“Claude AI agents negotiated 186 deals totaling over $4,000 in a real employee barter marketplace.”
tweet / @AnthropicAI / 27d ago
Anthropic's Project Deal deployed Claude AI agents to negotiate 186 barter deals totaling over $4,000 among 69 San Francisco employees across four parallel markets. Higher-quality models like Opus secured substantially better outcomes than Haiku when negotiating against each other, though human participants did not perceive this advantage. Custom instructions had minimal impact on deal success, with courteous and hardball personas performing similarly, highlighting AI's potential in commercial exchange alongside risks like unequal model access.
anthropic-research / ai-agents / ai-markets / claude-model / negotiation-experiment / marketplace-simulation / ai-economics
“Claude AI agents completed 186 deals with a total transaction volume exceeding $4,000.”