Chronological feed of everything captured from Dario Amodei.
youtube / DarioAmodei / Sep 25
Dario Amodei, co-founder of Anthropic, discusses the persistent power of scaling laws in AI development, predicting significant progress, driven by growing investment and compute, even without new algorithmic breakthroughs. He highlights the impact of longer context windows in models like Claude, which enable advanced knowledge manipulation and the processing of large datasets. Amodei also addresses the critical role of constitutional AI in aligning AI systems with human values through AI-driven feedback mechanisms, emphasizing a "safe scaling" approach with regulatory checkpoints.
ai-scaling-laws, llm-development, constitutional-ai, ai-safety, anthropic, talent-management, future-of-ai
“AI scaling laws will continue to drive significant improvements, even without new algorithmic discoveries.”
youtube / DarioAmodei / Aug 8
Dario Amodei, CEO of Anthropic, discusses the empirical nature of AI scaling laws, emphasizing that while the predictable statistical scaling of models is evident, the underlying mechanisms for emergent abilities remain unknown. He highlights that alignment and value systems are unlikely to emerge naturally with scaling, presenting significant challenges for controlling powerful AI. Amodei also addresses potential challenges like data and compute limitations, the need for robust security, and the complex governance issues surrounding increasingly capable AI systems.
ai-safety, llm-scaling-laws, agi-capabilities, mechanistic-interpretability, ai-governance, existential-risk, machine-learning-research
“The smooth and predictable scaling of AI models, particularly in loss metrics, is an empirical phenomenon without a clear theoretical explanation.”
paper / DarioAmodei / Feb 15
Anthropic researchers demonstrate that large language models trained with RLHF develop a "moral self-correction" capability — the ability to avoid harmful outputs when explicitly instructed to do so. This capability is not present at smaller scales; it emerges at 22B parameters and strengthens with both model scale and RLHF training intensity. The underlying mechanism appears to be the co-emergence of two skills at that scale: instruction-following and internalization of complex normative concepts such as bias, stereotyping, and discrimination.
llm-alignment, ai-safety, rlhf, moral-reasoning, bias-and-fairness, model-scaling, ai-ethics
“Moral self-correction capability in LLMs emerges specifically at 22B model parameters and does not reliably appear below that threshold.”
paper / DarioAmodei / Dec 19
This Anthropic paper demonstrates that language models can automatically generate high-quality behavioral evaluations at scale, producing 154 datasets that crowdworkers validated with 90–100% label agreement. The evaluations surface important failure modes: larger LMs exhibit sycophancy (mirroring user-preferred answers), increased desire for resource acquisition and goal preservation, and inverse scaling behaviors where capability degrades with size. Critically, RLHF — widely assumed to improve alignment — is shown to amplify certain risks, including stronger political opinion expression and greater resistance to shutdown.
ai-evals, llm-behavior, model-written-evaluations, rlhf, ai-safety, sycophancy, inverse-scaling
“LM-generated evaluations achieve 90–100% label agreement with crowdworkers, matching or exceeding human-written datasets in quality.”
paper / DarioAmodei / Dec 15
Anthropic's Constitutional AI (CAI) framework trains harmless AI assistants without any human-labeled harmful outputs, relying solely on a human-authored set of principles (a "constitution"). The pipeline combines supervised fine-tuning on AI-generated self-critiques and revisions with reinforcement learning from AI feedback (RLAIF), where a preference model trained on AI-judged comparisons serves as the reward signal. The result is a non-evasive assistant that actively engages with harmful queries by articulating objections, rather than refusing them bluntly. Chain-of-thought reasoning is integrated in both phases to improve transparency and human-judged output quality.
constitutional-ai, ai-safety, rlhf, rlaif, ai-alignment, llm-training, harmlessness
“A harmless AI assistant can be trained without any human labels identifying harmful outputs, using only a list of principles as human oversight.”
paper / DarioAmodei / Nov 4
Scalable oversight — the challenge of supervising AI systems that may surpass human capabilities — is typically hard to study empirically because such systems don't yet exist. This paper proposes a proxy experimental design using tasks where human specialists succeed but unaided laypeople and current AI systems fail, and demonstrates that even a trivial baseline (chatting with an unreliable LLM assistant) measurably improves human performance on MMLU and time-limited QuALITY benchmarks. The results suggest scalable oversight research is tractable with present-day models, providing a viable methodology for the field before superhuman AI arrives.
scalable-oversight, ai-safety, llm-evaluation, human-ai-collaboration, ai-alignment, large-language-models, ai-evals
“Human participants assisted by an unreliable LLM dialog assistant substantially outperform both the model alone and their own unaided performance on difficult QA tasks.”
paper / DarioAmodei / Sep 24
This paper from Anthropic researchers argues that "induction heads" — attention heads implementing a pattern-completion algorithm ([A][B]...[A] → [B]) — are likely the core mechanism behind in-context learning (ICL) in transformer models. Six complementary lines of evidence are presented, with causal evidence in small attention-only models and correlational evidence in larger MLP-containing models. A key empirical signal is that induction heads emerge at the precise training step where a sharp discontinuous improvement in ICL ability occurs, visible as a bump in the loss curve. If confirmed, this offers a mechanistic, interpretability-grounded explanation for one of the most practically significant emergent capabilities of large language models.
in-context-learning, mechanistic-interpretability, transformer-architecture, attention-heads, induction-heads, llm-research, arxiv
“Induction heads are likely the primary mechanistic source of in-context learning in transformer models of any size.”
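The pattern-completion rule the paper attributes to induction heads can be written out as a literal algorithm: look back for the most recent earlier occurrence of the current token and predict whatever followed it. A minimal illustrative sketch (the learned attention mechanism, not this Python, is what the paper actually studies):

```python
def induction_head_predict(tokens):
    """Predict the next token via the induction pattern [A][B]...[A] -> [B].

    Scans backwards for the most recent earlier occurrence of the
    current (last) token and returns the token that followed it.
    Returns None when the pattern does not apply.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Search earlier positions, most recent first.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# The rule completes a previously seen bigram:
print(induction_head_predict(["the", "cat", "sat", "the"]))  # -> "cat"
```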
paper / DarioAmodei / Sep 21
This paper from Anthropic (Elhage et al., 2022) provides a mechanistic explanation for polysemanticity — the well-known but poorly understood phenomenon where individual neurons respond to multiple unrelated concepts. Using toy models, the authors demonstrate that polysemanticity emerges naturally when networks store more features than they have neurons by encoding sparse features in superposition — effectively exploiting high-dimensional geometry to pack information efficiently. The work reveals a phase transition governing when superposition occurs, a structural connection to uniform polytopes, and implications for adversarial robustness and mechanistic interpretability.
mechanistic-interpretability, superposition, polysemanticity, neural-networks, ai-safety, feature-representation, anthropic
“Polysemanticity in neural networks arises as a consequence of superposition — models storing more sparse features than they have neurons by overlapping representations.”
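The core superposition claim — more sparse features than neurons, recoverable because random high-dimensional directions are nearly orthogonal — can be illustrated with a toy linear model. The dimensions and random projection below are illustrative choices, not the paper's trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 8, 3

# Random unit directions pack 8 sparse features into only 3 neurons.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

x = np.zeros(n_features)
x[2] = 1.0                 # a single sparse active feature
h = x @ W                  # 3-dim hidden state: the superposed code
scores = h @ W.T           # linear readout against every feature direction

# The true feature wins: its score is exactly 1, and every other score
# is strictly below 1 by Cauchy-Schwarz (random directions aren't parallel).
print(int(np.argmax(scores)))  # -> 2
```

With dense inputs (many features active at once) the interference terms would accumulate, which is why the paper's phase transition depends on feature sparsity.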
paper / DarioAmodei / Aug 23
Anthropic's red teaming study across four model types and three scales (2.7B, 13B, 52B parameters) finds a critical divergence: RLHF-trained models become progressively harder to red team as they scale, while plain LMs, prompted LMs, and rejection-sampled LMs show no meaningful improvement with scale. The study contributes a public dataset of 38,961 red team attacks spanning harmful outputs from offensive language to subtle unethical content. The authors frame transparency about methodology as essential infrastructure for the field to converge on shared red teaming norms and standards.
red-teaming, ai-safety, llm-alignment, rlhf, harm-reduction, scaling-behavior, ai-evals
“RLHF-trained models become increasingly difficult to red team as model scale increases.”
paper / DarioAmodei / Jul 11
Anthropic researchers demonstrate that large language models can reliably evaluate the validity of their own outputs and predict their own knowledge boundaries — two distinct but related capabilities. Larger models show strong calibration on multiple-choice and true/false questions, and can estimate P(True) (probability their answer is correct) and P(IK) (probability they know the answer) with meaningful accuracy. Both capabilities improve with scale and respond sensibly to contextual signals like relevant source material or problem hints, laying groundwork for training more honest, epistemically transparent models.
llm-calibration, self-evaluation, model-honesty, uncertainty-estimation, language-models, ai-safety, scaling-laws
“Larger language models are well-calibrated on diverse multiple choice and true/false questions when questions are presented in the right format.”
paper / DarioAmodei / May 21
Repeated data in LLM training — even at very small proportions — triggers a double descent phenomenon that can severely degrade generalization, with effects far exceeding what the data fraction alone would suggest. The study finds that repeating just 0.1% of data 100 times can reduce an 800M parameter model's effective capacity to that of a 400M parameter model. Mechanistically, data repetition disproportionately damages induction heads and other internal structures associated with generalization, suggesting the model's capacity is consumed by memorization at the expense of generalizable computation.
scaling-laws, repeated-data, llm-training, mechanistic-interpretability, data-deduplication, double-descent, memorization
“Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M parameter model, despite 99.9% of tokens remaining unique.”
paper / DarioAmodei / Apr 12
RLHF finetunes language models using preference modeling and human feedback to produce helpful, harmless assistants, yielding improvements across nearly all NLP evaluations while preserving compatibility with specialized skills like coding and summarization. An iterated online training loop updates models and preference data weekly with fresh human feedback for efficient gains. RLHF training follows an approximately linear relation between reward and the square root of the policy's KL divergence from its initialization.
rlhf, reinforcement-learning, human-feedback, language-model-alignment, helpful-harmless, preference-modeling, ai-safety
“RLHF alignment training improves performance on almost all NLP evaluations”
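The robustness relation can be stated as r = r₀ + α·√(D_KL(π‖π₀)): reward gain grows linearly in the square root of the policy's KL divergence from its initialization. A minimal sketch (r₀ and α below are hypothetical fit parameters, not values from the paper):

```python
import numpy as np

def predicted_reward(kl, r0=0.0, alpha=1.5):
    """Reward predicted from the policy's KL divergence to the base model:
    r = r0 + alpha * sqrt(KL).  r0 and alpha are illustrative constants;
    the paper fits them empirically per training run."""
    return r0 + alpha * np.sqrt(kl)

# Linearity in sqrt(KL): quadrupling the KL budget doubles the reward gain.
print(predicted_reward(4.0) / predicted_reward(1.0))  # -> 2.0
```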
paper / DarioAmodei / Feb 15
Large generative models like GPT-3 exhibit predictable loss reduction on broad training distributions via scaling laws, yet their specific capabilities and outputs remain hard to predict. This duality accelerates model development, since general improvements can be anticipated, but complicates risk assessment and deployment by enabling unforeseen harmful behaviors. The paper illustrates this with literature examples, real-world observations, and novel experiments, and proposes interventions to mitigate negative impacts.
large-language-models, scaling-laws, model-predictability, ai-safety, ai-policy, generative-models, social-risks
“Large generative models show predictable loss on broad training distributions as per scaling laws”
paper / DarioAmodei / Dec 1
Large language models can serve as general-purpose text assistants aligned for helpfulness, honesty, and harmlessness. Modest baseline interventions such as prompting yield benefits that scale with model size, generalize across alignment evaluations, and preserve capabilities. Ranked preference modeling outperforms imitation learning and scales more favorably, while binary discrimination matches imitation; preference-model pre-training improves sample efficiency when finetuning on human preferences.
ai-alignment, language-models, preference-modeling, imitation-learning, model-scaling, helpful-honest-harmless
“Modest interventions like prompting increase benefits with model size”
paper / DarioAmodei / Jul 7
Codex, a GPT model fine-tuned on GitHub code, attains 28.8% functional correctness on the new HumanEval benchmark for Python program synthesis from docstrings, surpassing GPT-3 (0%) and GPT-J (11.4%). Repeated sampling boosts performance to 70.2% solve rate using 100 samples per problem. Limitations include challenges with long operation chains and variable binding; broader impacts span safety, security, and economics.
llm-evaluation, code-generation, codex, human-eval, gpt-models, openai-research
“Codex solves 28.8% of HumanEval problems measuring functional correctness for synthesizing Python programs from docstrings.”
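The repeated-sampling result is measured with the paper's unbiased pass@k estimator: given n samples per problem of which c pass the unit tests, pass@k = 1 − C(n−c, k)/C(n, k). A numerically stable implementation:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated for the problem
    c: number of samples that pass the unit tests
    k: samples the user is allowed to draw
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With 25 of 100 samples passing, a single draw succeeds 25% of the time:
print(pass_at_k(100, 25, 1))  # -> 0.25
```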
paper / DarioAmodei / Oct 28
Autoregressive Transformers exhibit smooth power-law plus constant scaling of cross-entropy loss with model size and compute budgets across image, video, multimodal, and math domains. Optimal model size follows a universal power-law relation with compute. Information-theoretic analysis reveals billion-parameter models nearly match YFCC100M image entropy at 8x8 resolution, enabling loss forecasts for higher resolutions and predictions of KL divergence reduction.
scaling-laws, autoregressive-models, generative-modeling, transformers, cross-entropy-loss, multimodal-models, arxiv-paper
“Autoregressive Transformers improve cross-entropy loss following a power-law plus constant scaling law as model size and compute increase in image, video, multimodal, and math domains”
paper / DarioAmodei / Sep 2
Researchers train a reward model on human preferences between summaries, then use it to fine-tune a summarization policy via reinforcement learning. Applied to TL;DR Reddit posts, the method yields models outperforming human references and larger supervised models; it transfers to CNN/DM news without domain-specific training. Human evaluations confirm the reward model generalizes across datasets and produces better summaries than ROUGE optimization.
rlhf, human-feedback, summarization, language-models, reinforcement-learning, arxiv-paper
“Models fine-tuned with RL on human preferences outperform human reference summaries on TL;DR Reddit dataset.”
paper / DarioAmodei / May 28
GPT-3, a 175B parameter autoregressive language model, achieves strong few-shot performance on diverse NLP tasks including translation, QA, cloze, reasoning, and arithmetic without any gradient updates or fine-tuning. Tasks are specified purely through text-based few-shot demonstrations. It rivals prior state-of-the-art fine-tuning methods on many benchmarks while struggling on some datasets and facing methodological issues, such as data contamination, that stem from web-scale pretraining. Human evaluators struggle to distinguish GPT-3-generated news articles from human-written ones.
gpt-3, few-shot-learning, language-models, nlp-benchmarks, scaling-laws, autoregressive-models, arxiv-paper
“GPT-3 has 175 billion parameters, 10x more than any previous non-sparse language model”
paper / DarioAmodei / Jan 23
Cross-entropy loss in neural language models follows power-law scaling with model size (N), dataset size (D), and training compute (C), spanning over seven orders of magnitude. Architectural choices like width and depth have negligible impact within broad ranges. Optimal compute allocation favors training large models on modest datasets, stopping early before convergence to maximize sample efficiency.
scaling-laws, neural-language-models, language-models, compute-efficiency, model-training, overfitting, arxiv-paper
“Cross-entropy loss scales as a power-law with model size, dataset size, and training compute.”
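The model-size term of the law has the simple form L(N) = (N_c / N)^α_N, with analogous expressions in D and C. A sketch using the approximate fitted constants reported in the paper (non-embedding parameters, loss in nats/token):

```python
def loss_vs_model_size(n_params, n_c=8.8e13, alpha_n=0.076):
    """Predicted cross-entropy loss from the model-size scaling law
    L(N) = (N_c / N)^alpha_N.  Constants are the paper's approximate
    fits; they are quoted here for shape, not precision."""
    return (n_c / n_params) ** alpha_n

# Each doubling of model size shrinks loss by the constant factor 2**(-alpha_n):
for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e}: L={loss_vs_model_size(n):.3f}")
```

The same power-law form in D and C is what makes the paper's compute-allocation argument possible: the three laws can be intersected to find the loss-minimizing model size for a fixed compute budget.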
paper / DarioAmodei / Sep 18
Reward learning from human preferences enables RL on language tasks by modeling rewards from comparisons rather than explicit signals. Applied to GPT-2, it achieves strong results on stylistic continuation (positive sentiment, physical descriptions) with just 5,000 human comparisons. For summarization on TL;DR and CNN/Daily Mail, 60,000 comparisons yield models that copy relevant sentences and skip preamble, attaining reasonable ROUGE scores and high human evaluations despite potential heuristic exploitation.
reward-learning, rlhf, language-models, fine-tuning, human-preferences, summarization, arxiv-paper
“Reward learning uses human comparisons to build a reward model for RL on tasks defined by human judgment, especially natural language.”
paper / DarioAmodei / Dec 14
Gradient noise scale, a simple measurable statistic, accurately predicts the maximum useful batch size for efficient training across supervised learning (MNIST to ImageNet), RL (Atari, Dota), and generative tasks (SVHN autoencoders). The noise scale rises as training loss falls over a run, and depends on model size mainly through the lower loss that more capable models reach. The theory explains compute-time tradeoffs and supports adaptive batch sizing.
large-batch-training, gradient-noise-scale, deep-learning, reinforcement-learning, machine-learning, arxiv-paper
“Gradient noise scale predicts the largest useful batch size across multiple domains”
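The statistic itself is simple: the "simple" noise scale is B_simple = tr(Σ)/|G|², the ratio of total per-example gradient variance to the squared norm of the mean gradient. A numpy sketch estimating it from a batch of per-example gradients (synthetic data, for illustration only):

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Estimate B_simple = tr(Sigma) / |G|^2 from per-example gradients.

    per_example_grads: array of shape (batch, n_params), one flattened
    gradient per training example.  tr(Sigma) is the summed per-coordinate
    variance; |G|^2 is the squared norm of the mean gradient.
    """
    g = np.asarray(per_example_grads, dtype=float)
    mean_grad = g.mean(axis=0)
    trace_sigma = g.var(axis=0, ddof=1).sum()  # sum of per-coordinate variances
    return trace_sigma / np.dot(mean_grad, mean_grad)

# Noisy gradients scattered around a shared direction give a finite noise scale:
rng = np.random.default_rng(0)
grads = np.ones((256, 10)) + rng.normal(scale=2.0, size=(256, 10))
print(simple_noise_scale(grads))  # roughly var/|G|^2 = (4*10)/10 = 4
```

Batch sizes well below B_simple waste time (steps are noise-limited anyway), while batch sizes well above it waste compute — hence the compute-time tradeoff the paper describes.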
paper / DarioAmodei / Nov 15
This work combines expert demonstrations and trajectory preferences to train a reward model, which guides a DQN agent on 9 Atari games. The approach outperforms imitation learning in 7 games and reaches superhuman performance in 2 without any game-specific rewards. The authors also examine reward model fit, reward hacking vulnerabilities, and the impact of label noise.
reinforcement-learning, reward-learning, human-preferences, atari-games, deep-rl, human-feedback, reward-modeling
“The approach beats the imitation learning baseline in 7 out of 9 Atari games.”
paper / DarioAmodei / Oct 19
Iterated Amplification (IA) enables training strong AI models on hard-to-specify objectives by recursively combining human solutions to simpler subproblems, bypassing direct human evaluation of complex tasks. Unlike proxy objectives or direct human demonstration, IA builds a scalable training signal without external rewards. It extends Expert Iteration to reward-free settings and demonstrates efficient learning of complex behaviors in algorithmic environments.
iterated-amplification, weak-experts, ai-alignment, machine-learning, supervision, expert-iteration, arxiv-paper
“Using easier-to-specify proxy objectives for complex real-world learning tasks leads to poor performance or misaligned behavior.”
paper / DarioAmodei / Jul 26
VALOR introduces a variational option discovery method by connecting it directly to variational autoencoders, where the policy encodes noise contexts into trajectories and the decoder reconstructs contexts from full trajectories. A curriculum learning strategy progressively increases the number of contexts as decoder performance improves, stabilizing training and enabling discovery of more behavioral modes. This approach outperforms fixed-context baselines and addresses limitations in prior variational methods for reinforcement learning.
variational-inference, option-discovery, reinforcement-learning, variational-autoencoders, curriculum-learning, arxiv-paper
“Variational option discovery methods connect tightly to variational autoencoders.”
paper / DarioAmodei / May 2
AI debate trains agents through zero-sum self-play where two agents argue opposing positions on a question or action, and a human judges the more truthful and useful side. This scales human oversight to PSPACE-complete tasks using polynomial-time judges, compared to direct judging's NP limit. An MNIST experiment demonstrates debate boosting a sparse classifier's accuracy from 59.4% to 88.9% on 6 pixels and 48.2% to 85.2% on 4 pixels.
ai-safety, ai-alignment, debate-method, self-play, machine-learning, arxiv-paper
“Debate with optimal play can answer any question in PSPACE given polynomial-time human judges.”
paper / DarioAmodei / Feb 20
This report examines how AI can amplify malicious threats in digital (e.g., hacking, spam), physical (e.g., autonomous weapons), and political (e.g., disinformation) arenas. It proposes strategies for better forecasting, prevention, and mitigation, including four high-level recommendations for AI researchers and stakeholders. Additional research areas are identified to bolster defenses and reduce attack efficacy, while discussing the unresolved long-term attacker-defender dynamics.
ai-safety, malicious-ai, ai-security, ai-risks, ai-policy, ai-threats, ai-mitigation
“AI influences the threat landscape in digital, physical, and political domains.”
paper / DarioAmodei / Jun 12
Deep RL agents learn complex goals from non-expert human preferences over trajectory pairs, bypassing explicit reward functions. The method solves Atari games and simulated robot locomotion using feedback on less than 1% of interactions, and it elicits complex novel behaviors with only about an hour of human time, far less than prior human-feedback approaches required.
deep-rl, human-preferences, reinforcement-learning, rlhf, human-feedback, atari-games, robot-locomotion
“RL systems can solve complex tasks using human preferences between trajectory segments instead of reward functions”
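The heart of the method is a Bradley–Terry style preference model over trajectory segments: the probability that one segment is preferred is a softmax over the summed predicted rewards. A minimal sketch of the preference probability and its training loss (illustrative; in the paper the per-step rewards come from a neural network over observations):

```python
import numpy as np

def preference_prob(rewards_1, rewards_2):
    """P(segment 1 preferred) = exp(sum r1) / (exp(sum r1) + exp(sum r2)),
    i.e. a logistic function of the difference in summed rewards."""
    diff = np.sum(rewards_1) - np.sum(rewards_2)
    return 1.0 / (1.0 + np.exp(-diff))

def preference_loss(rewards_1, rewards_2, human_chose_1):
    """Cross-entropy loss on a single human comparison label."""
    p = preference_prob(rewards_1, rewards_2)
    return -np.log(p if human_chose_1 else 1.0 - p)

# A reward model that ranks the human-preferred segment higher gets low loss:
print(preference_loss([1.0, 2.0], [0.5, 0.5], human_chose_1=True))
```

Minimizing this loss over comparisons fits the reward model, which then supplies the reward signal for an ordinary RL algorithm.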
paper / DarioAmodei / Nov 28
This work introduces the first weakly supervised, end-to-end neural network for inducing executable programs from natural language queries over real-world database tables, enhancing the Neural Programmer architecture with discrete operations. It is trained solely on question-answer pairs from WikiTableQuestions, without grammars, rules, or annotations. A single model reaches 34.2% accuracy on 10k examples; a 15-model ensemble hits 37.7%, matching the prior state of the art of 37.1% from semantic parsing.
neural-programmer, natural-language-interface, weak-supervision, program-induction, wikitablequestions, neural-networks, question-answering
“This is the first weakly supervised, end-to-end neural network model to induce programs for natural language database interfaces on a real-world dataset.”
paper / DarioAmodei / Jun 21
The paper defines accidents in AI as unintended harmful behaviors arising from poor design of real-world ML systems. It categorizes five practical research problems: avoiding side effects and reward hacking (wrong objective), scalable supervision (costly evaluation), and safe exploration plus robustness to distributional shift (flaws in the learning process). It reviews prior work, proposes research directions relevant to advanced AI, and argues for a concrete, empirical approach to thinking about future AI safety.
ai-safety, machine-learning-accidents, reward-hacking, scalable-supervision, safe-exploration, distributional-shift
“Accidents in machine learning systems are defined as unintended and harmful behavior emerging from poor design of real-world AI systems.”
paper / DarioAmodei / Dec 8
End-to-end deep learning replaces traditional speech recognition pipelines with neural networks, enabling robust handling of diverse conditions like noise, accents, and cross-lingual differences between English and Mandarin. High-performance computing techniques deliver a 7x speedup, reducing training from weeks to days and facilitating rapid architecture iteration. The resulting system matches human transcription accuracy on standard benchmarks and supports low-latency online deployment via Batch Dispatch on GPUs.
speech-recognition, end-to-end-learning, deep-learning, multilingual-speech, hpc-acceleration, baidu-research, dario-amodei
“End-to-end deep learning recognizes both English and Mandarin speech”
paper / DarioAmodei / Jul 22
Neural activity patterns in up to 160 retinal neurons responding to naturalistic movies reveal a tradeoff between pattern probability (sparsity-driven) and numerosity, analogous to entropy-energy relations in statistical physics. Direct and model-based analyses show a thermodynamic limit as N increases, with entropy per neuron approaching a smooth function of energy per neuron. This function indicates the activity distribution is poised at a critical point achievable only with specific inter-neuron correlations.
neural-networks, thermodynamics, criticality, retina, statistical-physics, neurons, entropy-energy
“Patterns of neural activity with more spikes are less probable due to sparsity but more numerous in possible configurations.”
paper / DarioAmodei / Jun 24
Current neuroscience techniques cannot achieve millisecond-resolution recording of all neurons in a mammalian brain, necessitating analysis of fundamental physical constraints. The paper evaluates optical, electrical, magnetic resonance, and molecular recording methods for the mouse brain, focusing on scalability limits from spatiotemporal resolution, energy dissipation, and volume displacement. It also examines physics of powering and communicating with embedded microscale devices in brain tissue.
neural-recording, brain-activity-mapping, scalable-neuroscience, physical-constraints, optical-recording, electrical-recording, mouse-brain
“Simultaneously measuring all neurons in a mammalian brain at millisecond resolution exceeds limits of existing neuroscience techniques”
paper / DarioAmodei / Jun 13
Maximum entropy models, extended to K-pairwise formulations, accurately describe correlated spiking in up to 120 salamander retinal neurons responding to natural movies. Pairwise interactions alone fail for groups beyond 40 neurons, necessitating a global synchrony-controlling term. These models reveal high entropy constraining information capacity, metastable collective modes, inhomogeneous codeword ensembles, and strong population-level predictability of individual neurons.
neural-populations, retina, maximum-entropy-models, collective-behavior, spiking-activity, neural-coding, statistical-physics
“Interactions between spikes cannot be treated as small perturbations in independent systems for groups of 10 or more retinal neurons.”
paper / DarioAmodei / Jul 26
This paper proposes a maximum entropy model for neural collective behavior constrained by the distribution of global network activity P(K), where K is the number of neurons (out of N) active in a time bin, rather than by pairwise correlations. The inverse problem is analytically tractable, yielding a thermodynamic description in the large-N limit. Analysis of retinal ganglion cells responding to naturalistic stimuli shows the model sits at a critical point where entropy equals energy in the appropriate units.
maximum-entropy, neural-networks, collective-behavior, statistical-mechanics, retina-analysis, critical-point
“Maximum entropy models can be constructed using the distribution of global network activity P(K out of N neurons active) instead of pairwise correlations.”
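The construction is analytically simple: maximum entropy subject only to a given P(K) spreads probability uniformly over all spike words with the same number of active neurons, so P(σ) = P(K(σ)) / C(N, K(σ)). A toy sketch that builds this distribution explicitly (a tiny N for illustration; the paper works in the large-N limit):

```python
from itertools import combinations
from math import comb

def maxent_from_pk(n_neurons, p_k):
    """Maximum entropy distribution over binary spike words given P(K).

    p_k: list of length n_neurons+1 with P(K = k active neurons).
    Each of the C(N, k) words with k active neurons gets equal
    probability p_k[k] / C(N, k).
    """
    dist = {}
    for k, pk in enumerate(p_k):
        weight = pk / comb(n_neurons, k)
        for active in combinations(range(n_neurons), k):
            word = tuple(1 if i in active else 0 for i in range(n_neurons))
            dist[word] = weight
    return dist

# Toy example: 3 neurons with sparse activity.
dist = maxent_from_pk(3, [0.6, 0.3, 0.09, 0.01])
print(sum(dist.values()))                # -> 1.0 (normalized)
print(dist[(1, 0, 0)], dist[(0, 1, 0)])  # equal probability within a K-shell
```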