absorb.md

AI Safety

Dario Amodei (9) · Anthropic (8) · Google DeepMind (3) · Perplexity (2) · OpenAI (2) · Wes Roth (2) · Sam Altman (2) · Mustafa Suleyman (2) · Simon Willison (2) · Ethan Mollick (2) · Geoffrey Hinton (2) · Kevin Roose (1)
No compiled wiki article for this topic yet; the raw entries below are the source material.

Anthropic’s Claude Mythos Model Reveals Advanced AI Cyber Capabilities and Risks

Anthropic's unreleased Claude Mythos model demonstrates unparalleled aptitude in identifying and exploiting software vulnerabilities, surpassing human experts. It exhibits capabilities for autonomous cyberattacks and zero-day vulnerability discovery, raising significant concerns about AI safety.

Anthropic's Mythos Model Exposes an Asymmetric Cybersecurity Crisis: Finding Bugs Is Easy, Fixing Them Isn't

Anthropic's Mythos model has demonstrated autonomous, low-cost discovery of zero-day vulnerabilities across operating systems and browsers — a capability that emerged as a byproduct of general coding optimization, not targeted security training. While the Glasswing coalition represents an industry-wide response, the capacity to fix vulnerabilities lags far behind the model's capacity to find them.

PIArena: A Unified Platform for Prompt Injection Evaluation Reveals Limitations of Current Defenses

PIArena addresses the critical gap in prompt injection attack evaluation by providing a unified and extensible platform. This platform enables standardized comparison of defenses and assessment of their generalizability across diverse attacks and benchmarks. Comprehensive evaluation using PIArena reveals the limitations of current defenses.
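
The summary does not expose PIArena's actual interface, but the core of such a platform is an attack-by-defense evaluation loop. A minimal Python sketch of that idea, with every name (Attack, Defense, evaluate, run_target) hypothetical rather than PIArena's real API:

```python
# Hypothetical sketch of a unified prompt-injection evaluation loop;
# none of these names come from PIArena itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Attack:
    name: str
    inject: Callable[[str], str]        # embeds a payload in untrusted content

@dataclass
class Defense:
    name: str
    wrap: Callable[[str, str], str]     # builds the final prompt from (instruction, content)

PAYLOAD_MARKER = "INJECTION-CONFIRMED"  # string the target should never emit

def attack_succeeded(response: str) -> bool:
    return PAYLOAD_MARKER in response

def evaluate(attacks, defenses, cases, run_target):
    """Success rate of every attack against every defense.
    cases: list of (instruction, untrusted_content) pairs.
    run_target: callable sending a prompt to the model under test."""
    results = {}
    for defense in defenses:
        for attack in attacks:
            hits = sum(
                attack_succeeded(run_target(defense.wrap(instr, attack.inject(content))))
                for instr, content in cases
            )
            results[(defense.name, attack.name)] = hits / len(cases)
    return results
```

The per-cell success rates are what make defenses comparable: a defense that generalizes should keep rates low across every attack row, not only the attacks it was tuned against.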

The Containment Imperative: Managing the Divergence of Technological Intent and Outcome

The author argues that technological development is characterized by an inherent loss of control, where complex real-world systems trigger unpredictable 'revenge effects' and nth-order consequences. To counteract this, a framework of 'containment'—the ability to limit or terminate technologies during development and deployment—is proposed as essential.

The AI Alignment Problem: A Looming Existential Threat Amidst an Unsolvable Arms Race

The AI alignment problem, defined as ensuring AI systems reliably adhere to human intentions and values, remains an open and unresolved challenge. Despite this, major AI companies like OpenAI and Anthropic are aggressively pursuing superintelligence, which they anticipate achieving within this decade.

Anthropic Launches Project Glasswing and Claude Mythos Preview for Critical Infrastructure Security

Anthropic has introduced Claude Mythos Preview, a frontier model specializing in software vulnerability discovery that rivals human experts. Through Project Glasswing, Anthropic is providing model access and $100M in credits to a consortium of major tech firms and open-source maintainers to harden critical infrastructure.

Anthropic’s Project Glasswing Leverages AI for Cybersecurity at Scale

Anthropic has launched Project Glasswing, an initiative to enhance global software security by deploying Claude Mythos Preview, a frontier AI model. This model, capable of identifying high-severity vulnerabilities, is being utilized in collaboration with major tech companies. The project focuses on hardening critical infrastructure and widely used open-source software.

Anthropic’s Claude Mythos: A Dual-Use AI with Unprecedented Cybersecurity Capabilities Released Under Restricted Access

Anthropic has launched Project Glasswing, providing restricted access to Claude Mythos Preview, a general-purpose AI model demonstrating unprecedented cybersecurity capabilities far exceeding previous models. This restricted release strategy is due to the model’s ability to autonomously discover and exploit software vulnerabilities.

Quantifying AI Manipulation: A Framework for Measuring Behavioral Efficacy and Propensity

Google DeepMind has developed a standardized evaluation framework to quantify AI's capacity for 'harmful manipulation'—defined as exploiting cognitive vulnerabilities to induce harmful choices. By measuring both propensity (frequency of tactics) and efficacy (actual behavioral change) across diverse scenarios, the framework makes manipulation risk measurable and comparable across models.
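
The summary states the two metric definitions directly: propensity as the frequency of manipulative tactics, and efficacy as actual behavioral change. A rough sketch of how those could be computed, assuming a tactic classifier and a treated-versus-control study design (both assumptions on my part; this is not DeepMind's published code):

```python
# Illustrative metrics built from the summary's definitions;
# not DeepMind's actual framework.

def propensity(transcripts, uses_tactic) -> float:
    """Frequency of tactics: fraction of transcripts in which the model
    deployed at least one manipulative tactic, per a classifier `uses_tactic`."""
    return sum(1 for t in transcripts if uses_tactic(t)) / len(transcripts)

def efficacy(treated_choices, control_choices) -> float:
    """Actual behavioral change: lift in the harmful-choice rate between
    participants exposed to the model and an unexposed control group.
    Both arguments are lists of booleans (True = harmful choice made)."""
    treated_rate = sum(treated_choices) / len(treated_choices)
    control_rate = sum(control_choices) / len(control_choices)
    return treated_rate - control_rate
```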

The Urgent Need for AI Interpretability to Mitigate Risks and Guide Development

AI's inexorable progress necessitates immediate focus on interpretability to understand its inner workings before models become overwhelmingly powerful. This understanding is crucial for addressing inherent risks like opacity, misalignment, and potential misuse, which are currently challenging to detect and address.

First International AI Safety Report: 100 Experts Map AI Capabilities, Risks, and Safety Gaps

The International AI Safety Report, mandated by the Bletchley AI Safety Summit, represents the first globally coordinated synthesis of evidence on advanced AI capabilities, risks, and safety. Authored by 100 independent AI experts across diverse disciplines and nominated by 30 nations plus the UN, OECD, and EU, the report maps current capabilities, emerging risks, and remaining safety gaps.

AI Experts Warn of Extreme Risks from Rapidly Advancing Autonomous Systems, Urge Comprehensive Safety Measures

AI capabilities and autonomy are advancing rapidly toward generalist systems that act independently, amplifying risks like social harms, malicious uses, and loss of human control. Current societal responses, including lagging AI safety research and inadequate governance, fail to match the pace of expected progress.

Human-LLM Collaboration Beats Both Alone: A Proof-of-Concept for Scalable Oversight

Scalable oversight — the challenge of supervising AI systems that may surpass human capabilities — is typically hard to study empirically because such systems don't yet exist. This paper proposes a proxy experimental design using tasks where human specialists succeed but unaided laypeople and current AI systems fail, then tests whether a human-LLM team outperforms either alone.
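
The proxy design boils down to a three-way accuracy comparison on a shared task set. A minimal sketch under that reading, with all names and the data layout illustrative rather than taken from the paper:

```python
# Sketch of the three-way comparison; names and data layout are illustrative.

def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def oversight_proxy(gold, expert, layperson, model, team):
    """The design's premise: experts do well, while unaided laypeople and
    the unaided model do poorly. The finding of interest: the human-model
    team outscores both of its components."""
    return {
        "expert": accuracy(expert, gold),              # should be high (task selection)
        "layperson_alone": accuracy(layperson, gold),  # should be low
        "model_alone": accuracy(model, gold),          # should be low
        "human_model_team": accuracy(team, gold),      # the quantity under test
    }
```

The point of selecting tasks this way is that if the team beats both components, the gain must come from the collaboration itself, which is the property scalable oversight needs.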

RLHF Models Grow Harder to Red Team at Scale, While Other LM Types Show Flat Vulnerability Trends

Anthropic's red teaming study across four model types and three scales (2.7B, 13B, 52B parameters) finds a critical divergence: RLHF-trained models become progressively harder to red team as they scale, while plain LMs, prompted LMs, and rejection-sampled LMs show no meaningful improvement with scale.
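
The headline measurement is an attack success rate per (model type, scale) cell, with the trend read across the scale axis within each model type. A small sketch assuming a simple transcript schema (the field names are assumptions, not Anthropic's data format):

```python
# Illustrative aggregation over red-team transcripts; the schema is assumed,
# not Anthropic's data format.
from collections import defaultdict

def attack_success_rates(transcripts):
    """transcripts: iterable of dicts like
    {"model_type": "rlhf", "params": "52B", "harmful": True}.
    Returns the attack success rate per (model_type, params) cell;
    the scaling trend is read across params within each model type."""
    cells = defaultdict(lambda: [0, 0])   # (type, scale) -> [successes, attempts]
    for t in transcripts:
        cell = cells[(t["model_type"], t["params"])]
        cell[0] += bool(t["harmful"])
        cell[1] += 1
    return {key: successes / attempts
            for key, (successes, attempts) in cells.items()}
```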