absorb.md

Yann LeCun

Chronological feed of everything captured from Yann LeCun.

Current LLMs Face Diminishing Returns, New AI Paradigm Needed for True Intelligence

Yann LeCun, Chief AI Scientist at Meta, argues that current large language models (LLMs) are reaching a wall of diminishing returns due to data saturation and inherent architectural limitations. He asserts that LLMs, while adept at retrieval and regurgitation, lack the capability for true reasoning, planning, and understanding of the physical world. A new paradigm centered on "world models" and joint embedding predictive architectures (JEPAs) is necessary to achieve human-level AI capabilities, moving beyond generative models.

Beyond LLMs: The Path to Human-Level AI through World Models and Joint Embedding Predictive Architectures

Yann LeCun argues that current LLMs are not the path to human-level AI. Instead, he advocates for research into "world models" and Joint Embedding Predictive Architectures (JEPA) that can understand the physical world, possess persistent memory, and perform reasoning and planning in abstract mental spaces. He believes this approach could yield small-scale versions of advanced machine intelligence within 3-5 years.

Dynamic Tanh (DyT) Replaces Normalization Layers in Transformers Without Performance Loss

Normalization layers (e.g., LayerNorm) have long been treated as essential components of Transformer architectures, but this paper from Zhu, Chen, He, LeCun, and Liu challenges that assumption. They introduce Dynamic Tanh (DyT), a simple element-wise operation DyT(x) = tanh(αx), as a drop-in replacement for normalization layers. The method is motivated by the empirical observation that LayerNorm in Transformers already produces tanh-like S-shaped input-output mappings, suggesting normalization's functional role can be approximated more directly. DyT-equipped Transformers match or exceed normalized baselines across vision and language tasks, supervised and self-supervised settings, with minimal hyperparameter tuning.
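The operation is simple enough to sketch in a few lines. A minimal NumPy version, following the formula quoted above, with LayerNorm-style learnable affine terms as described in the paper (the initialization values shown are assumptions for illustration):

```python
import numpy as np

def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh: squash activations with tanh(alpha * x) instead of
    computing per-token statistics; gamma and beta mirror LayerNorm's
    learnable affine scale and shift."""
    return gamma * np.tanh(alpha * x) + beta

# Toy forward pass: one token with 4 channels.
x = np.array([-3.0, -0.5, 0.5, 3.0])
alpha = 0.5                     # learnable scalar
gamma = np.ones(4)              # learnable per-channel scale
beta = np.zeros(4)              # learnable per-channel shift
y = dyt(x, alpha, gamma, beta)  # extreme inputs saturate toward +/-1
```

Because tanh saturates, large activations are bounded much as they would be after normalization, but no mean or variance is ever computed.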

Beyond LLMs: The Necessity of World Models and Physical Intuition for AGI

Current AI progress is limited by a reliance on discrete symbolic data (text), which fails to capture the continuous, high-dimensional nature of the physical world. To achieve human-level intelligence, AI must transition from auto-regressive token prediction to Joint Embedding Predictive Architectures (JEPA) that utilize abstract world models for hierarchical planning and reasoning. This evolution requires solving the challenge of intuitive physics and sensory integration, moving beyond the efficiency limits of supervised and reinforcement learning.

MLLMs Exhibit Fundamental Deficiencies in Visual-Mathematical Reasoning, Reliant on System 1 Intuition

Multimodal Large Language Models (MLLMs) demonstrate significant shortcomings in visual-mathematical reasoning, particularly in geometric shape recognition and multi-step problem-solving. This deficiency stems from an over-reliance on System 1 (intuitive, associative) processing rather than System 2 (deliberate, reasoned) capabilities. Current models struggle even with basic tasks like counting sides of polygons, suggesting a fundamental failure to properly process visual inputs and learn geometric concepts. However, Visually Cued Chain-of-Thought (VC-CoT) prompting has shown promise in improving multi-step reasoning by integrating visual annotations.

JEPA-Based Latent Planning Outperforms Model-Free RL on Suboptimal Offline Data and Unseen Environments

This paper presents a systematic comparison of model-free RL (goal-conditioned and zero-shot methods) against model-based planning using a JEPA-trained latent dynamics model, evaluated on offline, reward-free navigation datasets of varying quality. The core finding is that model-based planning with latent dynamics generalizes better to unseen environment layouts and is more data-efficient, while model-free RL requires large volumes of high-quality data to perform well. Crucially, latent dynamics planning achieves trajectory stitching performance on par with leading model-free methods, making it a compelling alternative — especially under data scarcity or distribution shift.

Intuitive Physics Emerges from Self-Supervised Video Prediction Without Hardwired Priors

Self-supervised models trained to predict masked regions of natural videos in a learned abstract representation space spontaneously develop intuitive physics capabilities — including object permanence and shape consistency — without any explicit supervision or hardwired core knowledge. This finding, validated via the violation-of-expectation framework, directly challenges nativist theories that innate cognitive systems are necessary for physical understanding. Crucially, pixel-space video predictors and multimodal LLMs that reason through text both fail to exhibit this behavior, suggesting the key ingredient is joint learning of abstract representations alongside masked prediction (analogous to predictive coding).

Intermediate Layers in LLMs Outperform Final Layers for Richer Feature Representations

A new analysis challenges the conventional wisdom that only the final layers of large language models (LLMs) capture high-level features. This research demonstrates that intermediate layers can encode richer representations, leading to improved performance across various downstream tasks. The study proposes a unified framework to quantify these hidden-layer properties, explaining how mid-depth embeddings achieve superior results by balancing information compression and signal preservation.

Democratizing AI: Yann LeCun on Open Source, AGI, and Regulation

Yann LeCun, Chief AI Scientist at Meta, advocates for open-source AI development and challenges prevailing narratives on AI's existential risks. He argues that fostering open platforms is crucial for democratizing AI, preventing regulatory capture by a few corporations, and enabling global participation in AI development. LeCun believes current AI systems, including large language models (LLMs), are far from human-level intelligence and emphasizes the need for systems that understand the physical world through sensory input, similar to how infants learn, to achieve true artificial general intelligence (AGI).

Visual-Predictive Instruction Tuning for Unified Multimodal LLMs

VPiT extends visual instruction tuning to enable LLMs to generate both text and visual tokens autoregressively. This approach leverages an LLM's pre-existing world knowledge and reasoning to enhance visual generation, overcoming common limitations of other models. The core finding is that visual generation emerges from improved visual understanding, with understanding data being more effective for both capabilities than generation data.

VJ-VCR: Self-Supervised Video Representation Learning with Regularization

VJ-VCR is a novel joint-embedding predictive architecture designed for self-supervised video representation learning. It utilizes variance and covariance regularization to mitigate representation collapse, leading to the extraction of abstract, high-level information from video data. This method demonstrably surpasses generative baselines in tasks requiring an understanding of dynamic object behavior within videos.

Intermediate Layers Excel in LLM Representation

This paper investigates the quality of intermediate representations within various LLM architectures, including Transformers and State Space Models. The authors found that intermediate layers frequently offer more informative representations for various downstream tasks compared to the final layers. The study employs metrics like prompt entropy, curvature, and augmentation-invariance to assess representation quality, highlighting architectural differences and the evolution of representations during training.

Rate-In: Inference-Time Adaptive Dropout via Information-Theoretic Feature Map Analysis Improves Uncertainty Calibration

Rate-In addresses a core limitation of Monte Carlo Dropout (MCD): static dropout rates that cannot adapt to input variability or shifting data distributions at inference time. The method dynamically adjusts per-layer, per-input dropout rates by measuring information loss in feature maps induced by dropout, framed as controlled noise injection using information-theoretic principles — requiring no ground truth labels. Empirical validation on synthetic and real-world medical imaging benchmarks shows improved calibration and sharper uncertainty estimates over fixed or heuristic dropout rates, without degrading predictive accuracy.

Navigation World Models for Dynamic and Unfamiliar Environments

Navigation World Models (NWM) utilize a Conditional Diffusion Transformer (CDiT) to predict future visual observations based on past observations and actions, enabling robust navigation planning. The model, scaled to 1 billion parameters and trained on diverse egocentric videos, can plan trajectories in familiar settings and generate trajectories in unfamiliar environments using learned visual priors. This approach offers flexibility over fixed supervised policies by dynamically incorporating constraints during planning.

Yann LeCun on the Current State and Future of AI

Yann LeCun provides an extensive overview of AI, tracing its historical development through symbolic AI and neural networks to contemporary deep learning, large language models (LLMs), and self-supervised learning. He emphasizes the distinct challenges and limitations of current LLMs, particularly their inability to comprehend the physical world due to their textual, discrete nature, contrasting them with the potential of systems capable of learning from video to develop human-like intelligence and planning capabilities within roughly a decade.

RoboPEPP: Enhancing Robot Pose Estimation through Self-Supervised Physical Model Integration

RoboPEPP is a novel approach to vision-based robot pose and joint angle estimation that addresses limitations of existing methods by integrating the robot's physical model directly into the encoder. It uses a masking-based self-supervised embedding-predictive pre-training architecture to infer joint embeddings from unmasked regions, thereby improving the encoder's understanding of the robot's physical structure. This pre-trained model is then fine-tuned for enhanced robustness and performance in occlusion and truncation scenarios.

Enhancing SSL Embeddings via Low-Dimensional Entropy Maximization

The Effective Entropy Maximization Criterion (E2MC) addresses the failure of high-dimensional entropy estimation in self-supervised learning (SSL) by utilizing low-dimensional constraints. When applied as a brief post-training phase for existing SSL embeddings, E2MC consistently enhances downstream task performance. Ablation studies indicate these gains are specific to the proposed criterion rather than the process of continued pre-training itself.

DINO-WM Enables Zero-Shot Planning Through Pre-trained Visual Features

DINO-WM utilizes spatial patch features from DINOv2 to learn visual dynamics from offline behavioral trajectories. This approach enables task-agnostic planning by optimizing action sequences to achieve observational goals represented as prediction targets. The model demonstrates zero-shot behavioral solutions across diverse environments without relying on expert demonstrations, reward modeling, or pre-learned inverse models.

Seq-VCR: Regularization for Enhanced Transformer Reasoning

Decoder-only Transformers struggle with complex reasoning due to representation collapse in intermediate layers. Sequential Variance-Covariance Regularization (Seq-VCR) addresses this by enhancing entropy and preventing collapse. This method, combined with dummy pause tokens, significantly improves performance in arithmetic and sequential reasoning tasks without explicit Chain-of-Thought supervision.
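A variance-covariance penalty of this kind can be sketched in NumPy. This is an illustrative VICReg-style formulation applied to a batch of hidden states; the hinge target and loss weighting are assumptions, not the paper's exact coefficients:

```python
import numpy as np

def vc_penalty(h, var_target=1.0, eps=1e-4):
    """Variance-covariance penalty on hidden states h of shape (batch, dim).
    Pushes each dimension's std toward var_target and off-diagonal
    covariances toward zero, counteracting representation collapse."""
    h = h - h.mean(axis=0)
    std = np.sqrt(h.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, var_target - std))  # hinge on per-dim std
    cov = (h.T @ h) / (h.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / h.shape[1]
    return var_loss + cov_loss

rng = np.random.default_rng(0)
spread = rng.normal(size=(64, 8))     # healthy, spread-out hidden states
collapsed = np.full((64, 8), 0.1)     # collapsed hidden states
```

Collapsed states incur a penalty near the hinge target, while well-spread states incur almost none, which is the gradient signal that keeps intermediate-layer entropy high.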

Multi-Modal AI Outperforms Oncotype DX in Breast Cancer Recurrence Prediction

A vision transformer foundation model, pre-trained via self-supervised learning on pan-cancer H&E slides, extracts pathology features integrated with clinical data to predict breast cancer recurrence and death. Evaluated on 8,161 patients from 15 cohorts across seven countries, the model achieves a C-index of 0.71 on disease-free interval in held-out evaluation sets of 3,502 patients. It surpasses Oncotype DX (C-index 0.67 vs. 0.61) in direct comparison on 858 patients and provides independent prognostic value; performance holds across subtypes including TNBC (C-index 0.71).
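The C-index reported throughout is Harrell's concordance index. A minimal illustrative implementation (tie and censoring handling simplified relative to production survival-analysis libraries):

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs in which the patient with the
    higher predicted risk had the earlier observed event. A pair (i, j) is
    comparable when i had an event before j's follow-up time."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:  # i had the event first
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5              # ties count half
    return concordant / comparable

# Perfect risk ordering gives C = 1.0; random predictions hover near 0.5.
c_perfect = concordance_index([1, 2, 3], [1, 1, 1], [3, 2, 1])
c_inverted = concordance_index([1, 2, 3], [1, 1, 1], [1, 2, 3])
```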

PooDLe Unifies Pooled Invariance and Dense Equivariance for Effective Self-Supervised Learning from Naturalistic Videos

PooDLe combines an invariance-based objective on pooled representations with a dense self-supervised objective enforcing equivariance to optical flow warping, applied across multiple feature scales. This unified approach addresses challenges in naturalistic videos featuring dense scenes, multiple objects, class imbalance, and varying sizes. Experiments on BDD100K driving videos and Walking Tours first-person videos confirm superior spatial and semantic representation learning.

X-Sample Contrastive Loss Outperforms CLIP by Modeling Multi-Sample Similarity Graphs

Standard contrastive losses use binary similarity graphs with one positive per anchor, ignoring cross-sample relations. The X-Sample Contrastive loss revises this by explicitly encoding similarities across multiple samples via class or text captions. Trained on ImageNet-1k, CC3M, and CC12M, it surpasses self-supervised and vision-language models like CLIP, with major gains in low-data regimes (e.g., 16.8% over CLIP on ImageNet with CC3M) and better object-attribute separation.
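The core change is replacing a one-hot target with a soft similarity graph in the contrastive cross-entropy. A NumPy sketch under that reading; the paper's exact graph construction and normalization are assumptions here:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def x_sample_loss(emb_a, emb_b, target_sim, temp=0.1):
    """Cross-entropy between softmax similarities and a soft target graph.
    With an identity target this reduces to the usual one-positive
    contrastive loss; a soft graph credits related samples too."""
    logits = emb_a @ emb_b.T / temp
    targets = target_sim / target_sim.sum(axis=1, keepdims=True)
    logp = np.log(softmax(logits, axis=1))
    return -np.mean(np.sum(targets * logp, axis=1))

emb = np.eye(3)                     # 3 toy L2-normalized embeddings
hard = np.eye(3)                    # binary graph: one positive per anchor
soft = np.array([[1.0, 0.5, 0.0],
                 [0.5, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])  # samples 0 and 1 marked as related
```

With the soft graph, embeddings that keep related samples close are penalized less, which is the mechanism behind the reported gains in object-attribute separation.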

X-Sample Contrastive Loss: Enhancing Representation Learning through Explicit Sample Relationship Encoding

Traditional contrastive learning methods utilize a binary similarity graph, neglecting complex inter-sample relationships beyond the positive pair. X-Sample Contrastive Loss addresses this by explicitly encoding how a sample relates to multiple others, leading to richer representation learning. This novel approach demonstrates significant performance gains over existing methods, particularly in data-scarce environments and for fine-grained object-attribute separation.

LiveBench: Contamination-Resistant LLM Benchmark with Live Updates and Objective Auto-Scoring

LiveBench introduces a benchmark for LLMs that sources questions from recent math competitions, arXiv papers, news, and datasets to prevent test set contamination. It enables objective, automatic scoring via ground-truth values, avoiding biases from human or LLM judges. Covering math, coding, reasoning, language, instruction following, and data analysis, it is difficult enough that even top models score below 70% accuracy, and it receives monthly updates to track future improvements.

Cambrian-1 Advances Vision-Centric Multimodal LLMs with Superior Encoders and Spatial Integration

The Cambrian-1 family of MLLMs prioritizes vision-centric design, evaluating over 20 vision encoders—including self-supervised, strongly supervised, and hybrid models—through LLM interfaces to identify the visual representations best suited for real-world grounding. The paper introduces CV-Bench, a new vision-focused benchmark, and a Spatial Vision Aggregator (SVA) that dynamically fuses high-resolution vision features into LLMs using fewer tokens. It releases open models, code, datasets, and training recipes that achieve SOTA performance, while highlighting the importance of balanced visual instruction data curation.

Neural Networks Fall Short of Theoretical Fitting Capacity in Practice Due to Training Dynamics

Neural networks fail to reach their theoretical interpolation capacity in practice, with standard optimizers locating minima that fit far fewer training samples than parameters. Convolutional networks outperform MLPs and ViTs in parameter efficiency, even on random labels, while SGD uncovers higher-capacity minima than full-batch GD. ReLU activations enable fitting more data than smoother alternatives chosen for gradient stability, and sensitivity to label noise predicts generalization.

MMCR Aligns Embeddings for Maximal Mutual Information While Exhibiting Double Descent and Scaling Laws

MMCR, a multi-view self-supervised learning method rooted in statistical mechanics, incentivizes alignment and uniformity in embeddings, maximizing a mutual information lower bound between views. It displays non-monotonic pretraining loss akin to double descent with respect to hyperparameters, alongside compute scaling laws predicting loss from gradient steps, batch size, embedding dimension, and view count. Originally for images, MMCR extends effectively to multimodal image-text data, enhancing MVSSL performance.
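The objective at the heart of MMCR can be sketched compactly: average each sample's embeddings over views, then maximize the nuclear norm of the resulting centroid matrix. This is an illustrative reduction; the full method's augmentation pipeline and normalization details are omitted:

```python
import numpy as np

def mmcr_objective(views):
    """views: (n_views, batch, dim) array of L2-normalized embeddings.
    Averages over views and returns the nuclear norm (sum of singular
    values) of the centroid matrix, which MMCR maximizes: aligned views
    yield long centroids, and spread-out samples yield many large
    singular values."""
    centroids = views.mean(axis=0)
    return np.linalg.svd(centroids, compute_uv=False).sum()

# Two views of 3 samples: well-spread embeddings vs. total collapse.
diverse = np.stack([np.eye(3), np.eye(3)])          # distinct samples
collapsed = np.zeros((2, 3, 3)); collapsed[:, :, 0] = 1.0  # all identical
```

The collapsed configuration has rank 1 and a small nuclear norm, so maximizing the objective simultaneously enforces alignment across views and uniformity across samples.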

Hierarchical World Models Enable Data-Driven Visual Whole-Body Control for High-DoF Humanoids

A hierarchical world model uses a high-level RL agent to generate visual-observation-based commands for a low-level RL agent executing whole-body control on a 56-DoF simulated humanoid. Both agents are trained end-to-end with rewards, bypassing manual reward engineering, skill primitives, or simplifying assumptions. The approach delivers top performance across 8 diverse tasks while producing human-preferred motions.

Framework for Defining Openness Across the AI Stack in Foundation Models

This paper introduces a framework to address the challenges of defining openness for foundation models, which differ significantly from traditional software due to their scale and complexity. It reviews prior work, examines motivations for pursuing openness, and delineates how openness varies across model and system levels in the AI stack. The framework aims to enable nuanced discussions on openness, safety, and practical decisions for AI systems.

RL Fine-Tuning with CoT Boosts Small VLMs to Surpass GPT-4V in Decision-Making

A framework fine-tunes VLMs using RL by prompting chain-of-thought reasoning from task descriptions, parsing open-ended text into executable actions, and optimizing with environment rewards. This enables efficient learning of multi-step decision-making in interactive tasks, where traditional instruction fine-tuning falls short. Experiments show 7B VLMs outperforming GPT-4V and Gemini, with CoT reasoning essential for gains.

Entropy Minimization Initially Aligns Test Embeddings to Training but Later Repels Them, Enabling Label-Free Accuracy Estimation

Entropy minimization (EM) boosts test-time classification accuracy by initially embedding test images near training images in representation space. Prolonged EM optimization displaces test embeddings away from training ones, degrading performance. This dynamic allows accurate estimation of model accuracy on unlabeled datasets by tracking embedding shifts during EM, achieving state-of-the-art results with 5.75% mean absolute error across 23 datasets.
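The quantity being descended is just the mean Shannon entropy of the model's softmax outputs. A minimal sketch of that loss (the adaptation loop, optimizer, and embedding-shift tracking from the paper are omitted):

```python
import numpy as np

def mean_entropy(logits):
    """Mean Shannon entropy of softmax predictions over a batch --
    the test-time loss that entropy minimization descends."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean((p * np.log(p + 1e-12)).sum(axis=1))

flat = np.zeros((4, 10))                       # maximally uncertain: H = ln 10
sharp = np.zeros((4, 10)); sharp[:, 0] = 8.0   # confident predictions
```

Driving this loss down makes predictions more confident; the paper's finding is that the accompanying motion of test embeddings relative to training embeddings is what first helps, then hurts, accuracy.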

GAIA: A New Benchmark for General AI Assistant Capabilities

GAIA is a novel benchmark designed to evaluate General AI Assistants, focusing on real-world questions that demand reasoning, multi-modality handling, web browsing, and tool-use proficiency. Unlike benchmarks that challenge human capabilities, GAIA aims to assess an AI's robustness against average human performance. Current evaluations reveal a significant performance gap, with humans achieving 92% accuracy compared to GPT-4 with plugins at 15%. This suggests that existing advanced AIs still lack fundamental abilities crucial for general intelligence.

RayDINO: Self-Supervised Foundation Model Achieves SOTA Radiology Performance with Bias Mitigation

RayDINO is a self-supervised visual encoder pretrained on 873k chest X-rays, outperforming prior SOTA models across nine diverse radiology tasks including classification, segmentation, and text generation when paired with task-specific adapters. It demonstrates improved generalization to unseen populations and reduced biases related to population, age, and sex. These results highlight self-supervision's efficacy for building versatile, robust, patient-centric AI in clinical X-ray workflows.

EgoPet Dataset Bridges Animal Egomotion and Multi-Agent Interaction for AI Training

EgoPet introduces a novel dataset of pet egomotion imagery capturing simultaneous egomotion and multi-agent interactions from an animal's perspective, addressing the gap in existing video datasets that separate these elements. Unlike human or vehicle egocentric datasets, EgoPet provides a unique animal viewpoint. Benchmarks demonstrate its effectiveness for animal behavior tasks and as a superior pretraining resource for robotic quadruped locomotion compared to prior datasets.

Beyond Autoregressive LLMs: The Case for World Models and Joint Embedding

Current autoregressive LLMs are fundamentally limited by their lack of grounding in physical reality and their inability to plan or reason beyond token-level probability. True machine intelligence requires a transition to Joint Embedding Predictive Architectures (JEPA) that learn abstract world models from high-bandwidth sensory data rather than low-bandwidth text. This shift enables a move from 'System 1' instinctive retrieval to 'System 2' deliberate planning via optimization in latent representation space.

Image World Models Generalize JEPA to Predict Photometric Transformations for Superior Self-Supervised Representations

Image World Models (IWM) extend Joint-Embedding Predictive Architecture (JEPA) beyond masked image modeling by predicting effects of global photometric transformations in latent space. Optimal IWM performance hinges on conditioning strategies, prediction difficulty calibration, and model capacity. Fine-tuned IWM world models match or exceed prior self-supervised methods across tasks, while enabling tunable abstraction levels from invariant to equivariant representations.

Reconstruction Learning Prioritizes Variance-Explaining Subspaces with Poor Perceptual Features

Input reconstruction in representation learning focuses model capacity on data subspaces capturing high pixel variance, which contain uninformative features for downstream perception tasks. On Tiny ImageNet, the top subspace explaining 90% variance yields 45% accuracy, while the bottom subspace with 20% variance achieves 55%. Denoising strategies like masking can mitigate this misalignment depending on mask parameters and dataset, but additive Gaussian noise does not; detection methods identify ineffective noise regardless of task.

Video Feature Prediction Yields Strong Video and Image Representations Without Pretraining

V-JEPA models are trained solely on predicting features from 2 million unlabeled videos, eschewing pretrained image encoders, text, negatives, or reconstruction. These representations excel on frozen backbones across motion (Kinetics-400, Something-Something-v2) and appearance (ImageNet-1K) tasks. The largest ViT-H/16 model achieves state-of-the-art unsupervised performance without parameter adaptation.

V-JEPA: Advancing Unsupervised Video Representation Learning via Feature Prediction

V-JEPA introduces a novel approach to unsupervised visual representation learning from video by solely utilizing a feature prediction objective. This method eliminates the need for common crutches like pretrained encoders, negative examples, or reconstruction. The resulting models demonstrate strong performance on diverse downstream tasks, indicating the versatility of representations learned through this simplified paradigm.

V-JEPA: Shifting Video Understanding from Pixel Generation to Latent Predictive Architecture

V-JEPA is a non-generative, self-supervised video model that predicts masked spatio-temporal regions within an abstract latent space rather than at the pixel level. By ignoring unpredictable noise and utilizing frozen evaluations, it achieves significantly higher sample efficiency and adaptability across downstream tasks like action classification and object interaction recognition. This architecture serves as a foundational physical world model designed to move AI toward generalized reasoning and planning.

G-Retriever Enables Conversational QA on Large Textual Graphs via RAG and Steiner Tree Optimization

G-Retriever introduces the first retrieval-augmented generation (RAG) framework for question answering on textual graphs, enabling conversational interfaces that provide textual replies and highlight relevant graph parts. It formulates graph retrieval as a Prize-Collecting Steiner Tree optimization problem to handle graphs exceeding LLM context windows and mitigate hallucinations. The method outperforms baselines on a new GraphQA benchmark across domains like scene graphs, common sense, and knowledge graphs, scaling effectively with graph size via fine-tuning and soft prompting.

Exact Parallel Enumeration Enables Precise Analysis of Deep Network Partition Regions

Deep networks are modeled as piecewise affine splines, partitioning input space into regions with affine mappings. Prior methods approximated this partition via 2/3D slices or random sampling. The paper introduces the first parallel algorithm for exact enumeration of all regions, with complexity linear in input dimension and number of regions. It reveals uniform sampling efficiently finds large-volume regions but becomes exponentially inefficient for small regions in high dimensions.
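The partition in question is easy to see on a one-layer toy network: every input falls on one side or the other of each ReLU hyperplane, and inputs sharing the resulting sign pattern lie in the same affine region. The sketch below illustrates the random-sampling baseline the paper improves on (all network sizes and ranges here are arbitrary toy choices):

```python
import numpy as np

def activation_pattern(x, W, b):
    """Sign pattern of a ReLU layer at input x; points sharing a pattern
    lie in the same affine region of the piecewise-linear network."""
    return tuple((W @ x + b > 0).astype(int))

# Random sampling baseline: count distinct patterns hit by uniform
# samples. This finds large-volume regions easily but, as the paper
# notes, misses small-volume regions in high dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 2))   # 6 hidden units over a 2-D input space
b = rng.normal(size=6)
samples = rng.uniform(-3, 3, size=(5000, 2))
regions = {activation_pattern(x, W, b) for x in samples}
```

In 2-D, 6 hyperplanes can carve out at most 22 regions; exact enumeration, as proposed in the paper, would recover all of them rather than only those that samples happen to hit.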

CLIP's Visual Blind Spots Undermine Multimodal LLMs

Multimodal LLMs relying on CLIP exhibit systematic visual shortcomings due to CLIP-blind pairs—images CLIP deems similar despite evident differences. The MMVP benchmark reveals SOTA models like GPT-4V failing on basic visual patterns, with errors correlating to CLIP's weaknesses. Integrating self-supervised vision features via Mixture of Features (MoF) improves visual grounding, highlighting the need for advanced visual representations.

Gradient-Based MPC Outperforms Alternatives in Sample-Efficient Visual World Model Planning

This paper proposes gradient-based model predictive control (MPC) for planning with differentiable neural world models, in contrast with traditional gradient-free methods like CEM and MPPI. Comparative evaluation shows it matches or exceeds gradient-free MPC baselines and policy-based methods on most tasks in terms of sample efficiency. The authors also introduce a hybrid policy-gradient MPC model that surpasses pure policy networks, indicating potential for complex real-world applications.

Demystifying AI Hype: A Practitioner’s Perspective on Progress, Peril, and Open Science

This discussion with Yann LeCun critically evaluates the current AI landscape, distinguishing between public-facing hype and behind-the-scenes progress. LeCun emphasizes that true human-level AI remains elusive, highlighting the limitations of current large language models. He advocates for open research and development to foster innovation and prevent AI control by a select few, while dismissing near-term existential threats as unfounded speculation.

GAIA Benchmark Exposes AI Assistants' Shortfall in Human-Level General Task Robustness

GAIA introduces a benchmark assessing General AI Assistants on real-world questions demanding reasoning, multi-modality, web browsing, and tool use. Humans achieve 92% accuracy, while GPT-4 with plugins scores only 15%, highlighting a stark gap despite LLMs' superiority in specialized domains like law or chemistry. GAIA shifts benchmarking from superhuman tasks to human-like robustness, essential for AGI, with 466 questions released (answers withheld for 300) to drive leaderboards.

URLOST Enables Unsupervised Learning Across Non-Stationary, Topology-Agnostic Data Modalities

URLOST integrates a learnable self-organizing layer, spectral clustering, and masked autoencoder to learn representations from high-dimensional data without assuming stationarity or topology. It outperforms SimCLR and MAE on simulated biological vision, V1 neural recordings, and gene expressions. This establishes a new benchmark for modality-agnostic unsupervised learning, advancing toward biological-like generalization.

Positive Active Learning: Bridging Theory and Practice in Self-Supervised Learning

Positive Active Learning (PAL) offers a new framework beyond traditional Self-Supervised Learning (SSL) by formalizing the generation of semantically akin samples. This method embeds prior knowledge into existing SSL losses without pipeline changes and provides an active learning approach for low-cost dataset annotation by querying semantic relationships. PAL aims to close the gap between theoretical and practical aspects of active learning through simple, non-expert queries.

Stochastic Positional Embeddings Enhance MIM by Modeling Location Uncertainty

Stochastic positional embeddings (StoP) address a core challenge in masked image modeling (MIM) by introducing location uncertainty via Gaussian-distributed masked token positions. This conditioning reduces overfitting to precise locations and promotes robust semantic feature learning. StoP yields concrete gains like +1.7% on ImageNet linear probing for ViT-B and +2.5% for ViT-H with 1% data.
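The mechanism can be sketched as Gaussian noise injected into the positional embeddings of masked tokens, so the predictor cannot rely on exact target locations. This is a simplified reading; the paper's exact parameterization of the position distribution is an assumption here:

```python
import numpy as np

def stochastic_positions(pos_embed, sigma, rng):
    """Perturb masked-token positional embeddings with Gaussian noise,
    StoP-style, so predictions are conditioned on uncertain locations
    rather than exact ones."""
    return pos_embed + rng.normal(scale=sigma, size=pos_embed.shape)

rng = np.random.default_rng(0)
pos = np.linspace(0, 1, 16)[:, None] * np.ones((16, 8))  # toy (tokens, dim) grid
noisy = stochastic_positions(pos, sigma=0.1, rng=rng)
```

During training each step sees a fresh draw, so the model is discouraged from overfitting to precise coordinates and pushed toward the more semantic features the paper reports.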

MC-JEPA Unifies Self-Supervised Content Learning with Optical Flow for Motion-Aware Visual Representations

MC-JEPA introduces a joint-embedding predictive architecture that simultaneously learns content features and optical flow in a shared self-supervised encoder. By combining optical flow estimation with standard self-supervised objectives, it enables content features to incorporate motion cues, with mutual benefits between tasks. The approach matches the state of the art on unsupervised optical flow benchmarks and excels in downstream tasks like semantic segmentation on images and videos.
