absorb.md

Yann LeCun

Chronological feed of everything captured from Yann LeCun.

Current LLMs Face Diminishing Returns, New AI Paradigm Needed for True Intelligence

Yann LeCun, Chief AI Scientist at Meta, argues that current large language models (LLMs) are reaching a wall of diminishing returns due to data saturation and inherent architectural limitations. He asserts that LLMs, while adept at retrieval and regurgitation, lack the capability for true reasoning, planning, and understanding of the physical world. A new paradigm centered on "world models" and joint embedding predictive architectures (JEPAs) is necessary to achieve human-level AI capabilities, moving beyond generative models.

Beyond LLMs: The Path to Human-Level AI through World Models and Joint Embedding Predictive Architectures

Yann LeCun argues that current LLMs are not the path to human-level AI. Instead, he advocates for research into "world models" and Joint Embedding Predictive Architectures (JEPA) that can understand the physical world, possess persistent memory, and perform reasoning and planning in abstract mental spaces. He believes this approach could yield small-scale versions of advanced machine intelligence within 3-5 years.

Dynamic Tanh (DyT) Replaces Normalization Layers in Transformers Without Performance Loss

Normalization layers (e.g., LayerNorm) have long been treated as essential components of Transformer architectures, but this paper from Zhu, Chen, He, LeCun, and Liu challenges that assumption. They introduce Dynamic Tanh (DyT), a simple element-wise operation DyT(x) = tanh(αx), as a drop-in replacement for normalization layers. The method is motivated by the empirical observation that LayerNorm in Transformers already produces tanh-like S-shaped input-output mappings, suggesting normalization's functional role can be approximated more directly. DyT-equipped Transformers match or exceed normalized baselines across vision and language tasks, supervised and self-supervised settings, with minimal hyperparameter tuning.
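The operation is simple enough to sketch in a few lines. A minimal NumPy version, following the formula quoted above, with LayerNorm-style learnable affine terms as described in the paper (the initialization values shown are assumptions for illustration):

```python
import numpy as np

def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh: squash activations with tanh(alpha * x) instead of
    computing per-token statistics; gamma and beta mirror LayerNorm's
    learnable affine scale and shift."""
    return gamma * np.tanh(alpha * x) + beta

# Toy forward pass: one token with 4 channels.
x = np.array([-3.0, -0.5, 0.5, 3.0])
alpha = 0.5                     # learnable scalar
gamma = np.ones(4)              # learnable per-channel scale
beta = np.zeros(4)              # learnable per-channel shift
y = dyt(x, alpha, gamma, beta)  # extreme inputs saturate toward +/-1
```

Because tanh saturates, large activations are bounded much as they would be after normalization, but no mean or variance is ever computed.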

Beyond LLMs: The Necessity of World Models and Physical Intuition for AGI

Current AI progress is limited by a reliance on discrete symbolic data (text), which fails to capture the continuous, high-dimensional nature of the physical world. To achieve human-level intelligence, AI must transition from auto-regressive token prediction to Joint Embedding Predictive Architectures (JEPA) that utilize abstract world models for hierarchical planning and reasoning. This evolution requires solving the challenge of intuitive physics and sensory integration, moving beyond the efficiency limits of supervised and reinforcement learning.

MLLMs Exhibit Fundamental Deficiencies in Visual-Mathematical Reasoning, Reliant on System 1 Intuition

Multimodal Large Language Models (MLLMs) demonstrate significant shortcomings in visual-mathematical reasoning, particularly in geometric shape recognition and multi-step problem-solving. This deficiency stems from an over-reliance on System 1 (intuitive, associative) processing rather than System 2 (deliberate, reasoned) capabilities. Current models struggle even with basic tasks like counting sides of polygons, suggesting a fundamental failure to properly process visual inputs and learn geometric concepts. However, Visually Cued Chain-of-Thought (VC-CoT) prompting has shown promise in improving multi-step reasoning by integrating visual annotations.

JEPA-Based Latent Planning Outperforms Model-Free RL on Suboptimal Offline Data and Unseen Environments

This paper presents a systematic comparison of model-free RL (goal-conditioned and zero-shot methods) against model-based planning using a JEPA-trained latent dynamics model, evaluated on offline, reward-free navigation datasets of varying quality. The core finding is that model-based planning with latent dynamics generalizes better to unseen environment layouts and is more data-efficient, while model-free RL requires large volumes of high-quality data to perform well. Crucially, latent dynamics planning achieves trajectory stitching performance on par with leading model-free methods, making it a compelling alternative — especially under data scarcity or distribution shift.

Intuitive Physics Emerges from Self-Supervised Video Prediction Without Hardwired Priors

Self-supervised models trained to predict masked regions of natural videos in a learned abstract representation space spontaneously develop intuitive physics capabilities — including object permanence and shape consistency — without any explicit supervision or hardwired core knowledge. This finding, validated via the violation-of-expectation framework, directly challenges nativist theories that innate cognitive systems are necessary for physical understanding. Crucially, pixel-space video predictors and multimodal LLMs that reason through text both fail to exhibit this behavior, suggesting the key ingredient is joint learning of abstract representations alongside masked prediction (analogous to predictive coding).

Intermediate Layers in LLMs Outperform Final Layers for Richer Feature Representations

A new analysis challenges the conventional wisdom that only the final layers of large language models (LLMs) capture high-level features. This research demonstrates that intermediate layers can encode richer representations, leading to improved performance across various downstream tasks. The study proposes a unified framework to quantify these hidden-layer properties, explaining how mid-depth embeddings achieve superior results by balancing information compression and signal preservation.

Democratizing AI: Yann LeCun on Open Source, AGI, and Regulation

Yann LeCun, Chief AI Scientist at Meta, advocates for open-source AI development and challenges prevailing narratives on AI's existential risks. He argues that fostering open platforms is crucial for democratizing AI, preventing regulatory capture by a few corporations, and enabling global participation in AI development. LeCun believes current AI systems, including large language models (LLMs), are far from human-level intelligence and emphasizes the need for systems that understand the physical world through sensory input, similar to how infants learn, to achieve true artificial general intelligence (AGI).

Visual-Predictive Instruction Tuning for Unified Multimodal LLMs

VPiT extends visual instruction tuning to enable LLMs to generate both text and visual tokens autoregressively. This approach leverages an LLM's pre-existing world knowledge and reasoning to enhance visual generation, overcoming common limitations of other models. The core finding is that visual generation emerges from improved visual understanding, with understanding data being more effective for both capabilities than generation data.

VJ-VCR: Self-Supervised Video Representation Learning with Regularization

VJ-VCR is a novel joint-embedding predictive architecture designed for self-supervised video representation learning. It utilizes variance and covariance regularization to mitigate representation collapse, leading to the extraction of abstract, high-level information from video data. This method demonstrably surpasses generative baselines in tasks requiring an understanding of dynamic object behavior within videos.

Intermediate Layers Excel in LLM Representation

This paper investigates the quality of intermediate representations within various LLM architectures, including Transformers and State Space Models. The authors found that intermediate layers frequently offer more informative representations for various downstream tasks compared to the final layers. The study employs metrics like prompt entropy, curvature, and augmentation-invariance to assess representation quality, highlighting architectural differences and the evolution of representations during training.

Rate-In: Inference-Time Adaptive Dropout via Information-Theoretic Feature Map Analysis Improves Uncertainty Calibration

Rate-In addresses a core limitation of Monte Carlo Dropout (MCD): static dropout rates that cannot adapt to input variability or shifting data distributions at inference time. The method dynamically adjusts per-layer, per-input dropout rates by measuring information loss in feature maps induced by dropout, framed as controlled noise injection using information-theoretic principles — requiring no ground truth labels. Empirical validation on synthetic and real-world medical imaging benchmarks shows improved calibration and sharper uncertainty estimates over fixed or heuristic dropout rates, without degrading predictive accuracy.

Navigation World Models for Dynamic and Unfamiliar Environments

Navigation World Models (NWM) utilize a Conditional Diffusion Transformer (CDiT) to predict future visual observations based on past observations and actions, enabling robust navigation planning. The model, scaled to 1 billion parameters and trained on diverse egocentric videos, can plan trajectories in familiar settings and generate trajectories in unfamiliar environments using learned visual priors. This approach offers flexibility over fixed supervised policies by dynamically incorporating constraints during planning.

Yann LeCun on the Current State and Future of AI

Yann LeCun provides an extensive overview of AI, tracing its historical development through symbolic AI and neural networks to contemporary deep learning, large language models (LLMs), and self-supervised learning. He emphasizes the distinct challenges and limitations of current LLMs, particularly their inability to comprehend the physical world due to their textual, discrete nature, contrasting them with the potential of systems capable of learning from video to develop human-like intelligence and planning capabilities within roughly a decade.

RoboPEPP: Enhancing Robot Pose Estimation through Self-Supervised Physical Model Integration

RoboPEPP is a novel approach to vision-based robot pose and joint angle estimation that addresses limitations of existing methods by integrating the robot's physical model directly into the encoder. It uses a masking-based self-supervised embedding-predictive pre-training architecture to infer joint embeddings from unmasked regions, thereby improving the encoder's understanding of the robot's physical structure. This pre-trained model is then fine-tuned for enhanced robustness and performance in occlusion and truncation scenarios.

Enhancing SSL Embeddings via Low-Dimensional Entropy Maximization

The Effective Entropy Maximization Criterion (E2MC) addresses the failure of high-dimensional entropy estimation in self-supervised learning (SSL) by utilizing low-dimensional constraints. When applied as a brief post-training phase for existing SSL embeddings, E2MC consistently enhances downstream task performance. Ablation studies indicate these gains are specific to the proposed criterion rather than the process of continued pre-training itself.

DINO-WM Enables Zero-Shot Planning Through Pre-trained Visual Features

DINO-WM utilizes spatial patch features from DINOv2 to learn visual dynamics from offline behavioral trajectories. This approach enables task-agnostic planning by optimizing action sequences to achieve observational goals represented as prediction targets. The model demonstrates zero-shot behavioral solutions across diverse environments without relying on expert demonstrations, reward modeling, or pre-learned inverse models.

Seq-VCR: Regularization for Enhanced Transformer Reasoning

Decoder-only Transformers struggle with complex reasoning due to representation collapse in intermediate layers. Sequential Variance-Covariance Regularization (Seq-VCR) addresses this by enhancing entropy and preventing collapse. This method, combined with dummy pause tokens, significantly improves performance in arithmetic and sequential reasoning tasks without explicit Chain-of-Thought supervision.
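A variance-covariance penalty of this kind can be sketched in NumPy. This is an illustrative VICReg-style formulation applied to a batch of hidden states; the hinge target and loss weighting are assumptions, not the paper's exact coefficients:

```python
import numpy as np

def vc_penalty(h, var_target=1.0, eps=1e-4):
    """Variance-covariance penalty on hidden states h of shape (batch, dim).
    Pushes each dimension's std toward var_target and off-diagonal
    covariances toward zero, counteracting representation collapse."""
    h = h - h.mean(axis=0)
    std = np.sqrt(h.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, var_target - std))  # hinge on per-dim std
    cov = (h.T @ h) / (h.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / h.shape[1]
    return var_loss + cov_loss

rng = np.random.default_rng(0)
spread = rng.normal(size=(64, 8))     # healthy, spread-out hidden states
collapsed = np.full((64, 8), 0.1)     # collapsed hidden states
```

Collapsed states incur a penalty near the hinge target, while well-spread states incur almost none, which is the gradient signal that keeps intermediate-layer entropy high.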

Multi-Modal AI Outperforms Oncotype DX in Breast Cancer Recurrence Prediction

A vision transformer foundation model, pre-trained via self-supervised learning on pan-cancer H&E slides, extracts pathology features integrated with clinical data to predict breast cancer recurrence and death. Evaluated on 8,161 patients from 15 cohorts across seven countries, the model achieves a C-index of 0.71 on disease-free interval in held-out evaluation sets of 3,502 patients. It surpasses Oncotype DX (C-index 0.67 vs. 0.61) in direct comparison on 858 patients and provides independent prognostic value; performance holds across subtypes including TNBC (C-index 0.71).
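The C-index reported throughout is Harrell's concordance index. A minimal illustrative implementation (tie and censoring handling simplified relative to production survival-analysis libraries):

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs in which the patient with the
    higher predicted risk had the earlier observed event. A pair (i, j) is
    comparable when i had an event before j's follow-up time."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:  # i had the event first
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5              # ties count half
    return concordant / comparable

# Perfect risk ordering gives C = 1.0; random predictions hover near 0.5.
c_perfect = concordance_index([1, 2, 3], [1, 1, 1], [3, 2, 1])
c_inverted = concordance_index([1, 2, 3], [1, 1, 1], [1, 2, 3])
```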

PooDLe Unifies Pooled Invariance and Dense Equivariance for Effective Self-Supervised Learning from Naturalistic Videos

PooDLe combines an invariance-based objective on pooled representations with a dense self-supervised objective enforcing equivariance to optical flow warping, applied across multiple feature scales. This unified approach addresses challenges in naturalistic videos featuring dense scenes, multiple objects, class imbalance, and varying sizes. Experiments on BDD100K driving videos and Walking Tours first-person videos confirm superior spatial and semantic representation learning.

X-Sample Contrastive Loss Outperforms CLIP by Modeling Multi-Sample Similarity Graphs

Standard contrastive losses use binary similarity graphs with one positive per anchor, ignoring cross-sample relations. The X-Sample Contrastive loss revises this by explicitly encoding similarities across multiple samples via class or text captions. Trained on ImageNet-1k, CC3M, and CC12M, it surpasses self-supervised and vision-language models like CLIP, with major gains in low-data regimes (e.g., 16.8% over CLIP on ImageNet with CC3M) and better object-attribute separation.
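The core change is replacing a one-hot target with a soft similarity graph in the contrastive cross-entropy. A NumPy sketch under that reading; the paper's exact graph construction and normalization are assumptions here:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def x_sample_loss(emb_a, emb_b, target_sim, temp=0.1):
    """Cross-entropy between softmax similarities and a soft target graph.
    With an identity target this reduces to the usual one-positive
    contrastive loss; a soft graph credits related samples too."""
    logits = emb_a @ emb_b.T / temp
    targets = target_sim / target_sim.sum(axis=1, keepdims=True)
    logp = np.log(softmax(logits, axis=1))
    return -np.mean(np.sum(targets * logp, axis=1))

emb = np.eye(3)                     # 3 toy L2-normalized embeddings
hard = np.eye(3)                    # binary graph: one positive per anchor
soft = np.array([[1.0, 0.5, 0.0],
                 [0.5, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])  # samples 0 and 1 marked as related
```

With the soft graph, embeddings that keep related samples close are penalized less, which is the mechanism behind the reported gains in object-attribute separation.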

X-Sample Contrastive Loss: Enhancing Representation Learning through Explicit Sample Relationship Encoding

Traditional contrastive learning methods utilize a binary similarity graph, neglecting complex inter-sample relationships beyond the positive pair. X-Sample Contrastive Loss addresses this by explicitly encoding how a sample relates to multiple others, leading to richer representation learning. This novel approach demonstrates significant performance gains over existing methods, particularly in data-scarce environments and for fine-grained object-attribute separation.

LiveBench: Contamination-Resistant LLM Benchmark with Live Updates and Objective Auto-Scoring

LiveBench introduces a benchmark for LLMs that sources questions from recent math competitions, arXiv papers, news, and datasets to prevent test set contamination. It enables objective, automatic scoring via ground-truth values, avoiding biases from human or LLM judges. Covering math, coding, reasoning, language, instruction following, and data analysis, it is difficult enough that even top models score below 70% accuracy, and it receives monthly updates to track future improvements.

Cambrian-1 Advances Vision-Centric Multimodal LLMs with Superior Encoders and Spatial Integration

The Cambrian-1 family of MLLMs prioritizes vision-centric design, evaluating over 20 vision encoders—including self-supervised, strongly supervised, and hybrid models—through LLM interfaces to identify the visual representations best suited for real-world grounding. The paper introduces CV-Bench, a new vision-focused benchmark, and a Spatial Vision Aggregator (SVA) that dynamically fuses high-resolution vision features into LLMs using fewer tokens. It releases open models, code, datasets, and training recipes that achieve SOTA performance, while highlighting the importance of balanced visual instruction data curation.

Neural Networks Fall Short of Theoretical Fitting Capacity in Practice Due to Training Dynamics

Neural networks fail to reach their theoretical interpolation capacity in practice, with standard optimizers locating minima that fit far fewer training samples than parameters. Convolutional networks outperform MLPs and ViTs in parameter efficiency, even on random labels, while SGD uncovers higher-capacity minima than full-batch GD. ReLU activations enable fitting more data than smoother alternatives chosen for gradient stability, and sensitivity to label noise predicts generalization.

MMCR Aligns Embeddings for Maximal Mutual Information While Exhibiting Double Descent and Scaling Laws

MMCR, a multi-view self-supervised learning method rooted in statistical mechanics, incentivizes alignment and uniformity in embeddings, maximizing a mutual information lower bound between views. It displays non-monotonic pretraining loss akin to double descent with respect to hyperparameters, alongside compute scaling laws predicting loss from gradient steps, batch size, embedding dimension, and view count. Originally for images, MMCR extends effectively to multimodal image-text data, enhancing MVSSL performance.
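The objective at the heart of MMCR can be sketched compactly: average each sample's embeddings over views, then maximize the nuclear norm of the resulting centroid matrix. This is an illustrative reduction; the full method's augmentation pipeline and normalization details are omitted:

```python
import numpy as np

def mmcr_objective(views):
    """views: (n_views, batch, dim) array of L2-normalized embeddings.
    Averages over views and returns the nuclear norm (sum of singular
    values) of the centroid matrix, which MMCR maximizes: aligned views
    yield long centroids, and spread-out samples yield many large
    singular values."""
    centroids = views.mean(axis=0)
    return np.linalg.svd(centroids, compute_uv=False).sum()

# Two views of 3 samples: well-spread embeddings vs. total collapse.
diverse = np.stack([np.eye(3), np.eye(3)])          # distinct samples
collapsed = np.zeros((2, 3, 3)); collapsed[:, :, 0] = 1.0  # all identical
```

The collapsed configuration has rank 1 and a small nuclear norm, so maximizing the objective simultaneously enforces alignment across views and uniformity across samples.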

Hierarchical World Models Enable Data-Driven Visual Whole-Body Control for High-DoF Humanoids

A hierarchical world model uses a high-level RL agent to generate visual-observation-based commands for a low-level RL agent executing whole-body control on a 56-DoF simulated humanoid. Both agents are trained end-to-end with rewards, bypassing manual reward engineering, skill primitives, or simplifying assumptions. The approach delivers top performance across 8 diverse tasks while producing human-preferred motions.

Framework for Defining Openness Across the AI Stack in Foundation Models

This paper introduces a framework to address the challenges of defining openness for foundation models, which differ significantly from traditional software due to their scale and complexity. It reviews prior work, examines motivations for pursuing openness, and delineates how openness varies across model and system levels in the AI stack. The framework aims to enable nuanced discussions on openness, safety, and practical decisions for AI systems.

RL Fine-Tuning with CoT Boosts Small VLMs to Surpass GPT-4V in Decision-Making

A framework fine-tunes VLMs using RL by prompting chain-of-thought reasoning from task descriptions, parsing open-ended text into executable actions, and optimizing with environment rewards. This enables efficient learning of multi-step decision-making in interactive tasks, where traditional instruction fine-tuning falls short. Experiments show 7B VLMs outperforming GPT-4V and Gemini, with CoT reasoning essential for gains.

Entropy Minimization Initially Aligns Test Embeddings to Training but Later Repels Them, Enabling Label-Free Accuracy Estimation

Entropy minimization (EM) boosts test-time classification accuracy by initially embedding test images near training images in representation space. Prolonged EM optimization displaces test embeddings away from training ones, degrading performance. This dynamic allows accurate estimation of model accuracy on unlabeled datasets by tracking embedding shifts during EM, achieving state-of-the-art results with 5.75% mean absolute error across 23 datasets.
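The quantity being descended is just the mean Shannon entropy of the model's softmax outputs. A minimal sketch of that loss (the adaptation loop, optimizer, and embedding-shift tracking from the paper are omitted):

```python
import numpy as np

def mean_entropy(logits):
    """Mean Shannon entropy of softmax predictions over a batch --
    the test-time loss that entropy minimization descends."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean((p * np.log(p + 1e-12)).sum(axis=1))

flat = np.zeros((4, 10))                       # maximally uncertain: H = ln 10
sharp = np.zeros((4, 10)); sharp[:, 0] = 8.0   # confident predictions
```

Driving this loss down makes predictions more confident; the paper's finding is that the accompanying motion of test embeddings relative to training embeddings is what first helps, then hurts, accuracy.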

GAIA: A New Benchmark for General AI Assistant Capabilities

GAIA is a novel benchmark designed to evaluate General AI Assistants, focusing on real-world questions that demand reasoning, multi-modality handling, web browsing, and tool-use proficiency. Unlike benchmarks that challenge human capabilities, GAIA aims to assess an AI's robustness against average human performance. Current evaluations reveal a significant performance gap, with humans achieving 92% accuracy compared to GPT-4 with plugins at 15%. This suggests that existing advanced AIs still lack fundamental abilities crucial for general intelligence.

RayDINO: Self-Supervised Foundation Model Achieves SOTA Radiology Performance with Bias Mitigation

RayDINO is a self-supervised visual encoder pretrained on 873k chest X-rays, outperforming prior SOTA models across nine diverse radiology tasks including classification, segmentation, and text generation when paired with task-specific adapters. It demonstrates improved generalization to unseen populations and reduced biases related to population, age, and sex. These results highlight self-supervision's efficacy for building versatile, robust, patient-centric AI in clinical X-ray workflows.

EgoPet Dataset Bridges Animal Egomotion and Multi-Agent Interaction for AI Training

EgoPet introduces a novel dataset of pet egomotion imagery capturing simultaneous egomotion and multi-agent interactions from an animal's perspective, addressing the gap in existing video datasets that separate these elements. Unlike human or vehicle egocentric datasets, EgoPet provides a unique animal viewpoint. Benchmarks demonstrate its effectiveness for animal behavior tasks and as a superior pretraining resource for robotic quadruped locomotion compared to prior datasets.

Beyond Autoregressive LLMs: The Case for World Models and Joint Embedding

Current autoregressive LLMs are fundamentally limited by their lack of grounding in physical reality and their inability to plan or reason beyond token-level probability. True machine intelligence requires a transition to Joint Embedding Predictive Architectures (JEPA) that learn abstract world models from high-bandwidth sensory data rather than low-bandwidth text. This shift enables a move from 'System 1' instinctive retrieval to 'System 2' deliberate planning via optimization in latent representation space.

Image World Models Generalize JEPA to Predict Photometric Transformations for Superior Self-Supervised Representations

Image World Models (IWM) extend Joint-Embedding Predictive Architecture (JEPA) beyond masked image modeling by predicting effects of global photometric transformations in latent space. Optimal IWM performance hinges on conditioning strategies, prediction difficulty calibration, and model capacity. Fine-tuned IWM world models match or exceed prior self-supervised methods across tasks, while enabling tunable abstraction levels from invariant to equivariant representations.

Reconstruction Learning Prioritizes Variance-Explaining Subspaces with Poor Perceptual Features

Input reconstruction in representation learning focuses model capacity on data subspaces capturing high pixel variance, which contain uninformative features for downstream perception tasks. On Tiny ImageNet, the top subspace explaining 90% variance yields 45% accuracy, while the bottom subspace with 20% variance achieves 55%. Denoising strategies like masking can mitigate this misalignment depending on mask parameters and dataset, but additive Gaussian noise does not; detection methods identify ineffective noise regardless of task.

Video Feature Prediction Yields Strong Video and Image Representations Without Pretraining

V-JEPA models are trained solely on predicting features from 2 million unlabeled videos, eschewing pretrained image encoders, text, negatives, or reconstruction. These representations excel on frozen backbones across motion (Kinetics-400, Something-Something-v2) and appearance (ImageNet-1K) tasks. The largest ViT-H/16 model achieves state-of-the-art unsupervised performance without parameter adaptation.

V-JEPA: Advancing Unsupervised Video Representation Learning via Feature Prediction

V-JEPA introduces a novel approach to unsupervised visual representation learning from video by solely utilizing a feature prediction objective. This method eliminates the need for common crutches like pretrained encoders, negative examples, or reconstruction. The resulting models demonstrate strong performance on diverse downstream tasks, indicating the versatility of representations learned through this simplified paradigm.

V-JEPA: Shifting Video Understanding from Pixel Generation to Latent Predictive Architecture

V-JEPA is a non-generative, self-supervised video model that predicts masked spatio-temporal regions within an abstract latent space rather than at the pixel level. By ignoring unpredictable noise and utilizing frozen evaluations, it achieves significantly higher sample efficiency and adaptability across downstream tasks like action classification and object interaction recognition. This architecture serves as a foundational physical world model designed to move AI toward generalized reasoning and planning.

G-Retriever Enables Conversational QA on Large Textual Graphs via RAG and Steiner Tree Optimization

G-Retriever introduces the first retrieval-augmented generation (RAG) framework for question answering on textual graphs, enabling conversational interfaces that provide textual replies and highlight relevant graph parts. It formulates graph retrieval as a Prize-Collecting Steiner Tree optimization problem to handle graphs exceeding LLM context windows and mitigate hallucinations. The method outperforms baselines on a new GraphQA benchmark across domains like scene graphs, common sense, and knowledge graphs, scaling effectively with graph size via fine-tuning and soft prompting.

Exact Parallel Enumeration Enables Precise Analysis of Deep Network Partition Regions

Deep networks are modeled as piecewise affine splines, partitioning input space into regions with affine mappings. Prior methods approximated this partition via 2/3D slices or random sampling. The paper introduces the first parallel algorithm for exact enumeration of all regions, with complexity linear in input dimension and number of regions. It reveals uniform sampling efficiently finds large-volume regions but becomes exponentially inefficient for small regions in high dimensions.
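The partition in question is easy to see on a one-layer toy network: every input falls on one side or the other of each ReLU hyperplane, and inputs sharing the resulting sign pattern lie in the same affine region. The sketch below illustrates the random-sampling baseline the paper improves on (all network sizes and ranges here are arbitrary toy choices):

```python
import numpy as np

def activation_pattern(x, W, b):
    """Sign pattern of a ReLU layer at input x; points sharing a pattern
    lie in the same affine region of the piecewise-linear network."""
    return tuple((W @ x + b > 0).astype(int))

# Random sampling baseline: count distinct patterns hit by uniform
# samples. This finds large-volume regions easily but, as the paper
# notes, misses small-volume regions in high dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 2))   # 6 hidden units over a 2-D input space
b = rng.normal(size=6)
samples = rng.uniform(-3, 3, size=(5000, 2))
regions = {activation_pattern(x, W, b) for x in samples}
```

In 2-D, 6 hyperplanes can carve out at most 22 regions; exact enumeration, as proposed in the paper, would recover all of them rather than only those that samples happen to hit.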

CLIP's Visual Blind Spots Undermine Multimodal LLMs

Multimodal LLMs relying on CLIP exhibit systematic visual shortcomings due to CLIP-blind pairs—images CLIP deems similar despite evident differences. The MMVP benchmark reveals SOTA models like GPT-4V failing on basic visual patterns, with errors correlating to CLIP's weaknesses. Integrating self-supervised vision features via Mixture of Features (MoF) improves visual grounding, highlighting the need for advanced visual representations.

Gradient-Based MPC Outperforms Alternatives in Sample-Efficient Visual World Model Planning

This paper proposes gradient-based model predictive control (MPC) for planning with differentiable neural world models, in contrast with traditional gradient-free methods like CEM and MPPI. Comparative evaluation shows it matches or exceeds gradient-free MPC baselines and policy-based methods on most tasks in terms of sample efficiency. The authors also introduce a hybrid policy-gradient MPC model that surpasses pure policy networks, indicating potential for complex real-world applications.

Demystifying AI Hype: A Practitioner’s Perspective on Progress, Peril, and Open Science

This discussion with Yann LeCun critically evaluates the current AI landscape, distinguishing between public-facing hype and behind-the-scenes progress. LeCun emphasizes that true human-level AI remains elusive, highlighting the limitations of current large language models. He advocates for open research and development to foster innovation and prevent AI control by a select few, while dismissing near-term existential threats as unfounded speculation.

GAIA Benchmark Exposes AI Assistants' Shortfall in Human-Level General Task Robustness

GAIA introduces a benchmark assessing General AI Assistants on real-world questions demanding reasoning, multi-modality, web browsing, and tool use. Humans achieve 92% accuracy, while GPT-4 with plugins scores only 15%, highlighting a stark gap despite LLMs' superiority in specialized domains like law or chemistry. GAIA shifts benchmarking from superhuman tasks to human-like robustness, essential for AGI, with 466 questions released (answers withheld for 300) to drive leaderboards.

URLOST Enables Unsupervised Learning Across Non-Stationary, Topology-Agnostic Data Modalities

URLOST integrates a learnable self-organizing layer, spectral clustering, and masked autoencoder to learn representations from high-dimensional data without assuming stationarity or topology. It outperforms SimCLR and MAE on simulated biological vision, V1 neural recordings, and gene expressions. This establishes a new benchmark for modality-agnostic unsupervised learning, advancing toward biological-like generalization.

Positive Active Learning: Bridging Theory and Practice in Self-Supervised Learning

Positive Active Learning (PAL) offers a new framework beyond traditional Self-Supervised Learning (SSL) by formalizing the generation of semantically akin samples. This method embeds prior knowledge into existing SSL losses without pipeline changes and provides an active learning approach for low-cost dataset annotation by querying semantic relationships. PAL aims to close the gap between theoretical and practical aspects of active learning through simple, non-expert queries.

Stochastic Positional Embeddings Enhance MIM by Modeling Location Uncertainty

Stochastic positional embeddings (StoP) address a core challenge in masked image modeling (MIM) by introducing location uncertainty via Gaussian-distributed masked token positions. This conditioning reduces overfitting to precise locations and promotes robust semantic feature learning. StoP yields concrete gains like +1.7% on ImageNet linear probing for ViT-B and +2.5% for ViT-H with 1% data.
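The mechanism can be sketched as Gaussian noise injected into the positional embeddings of masked tokens, so the predictor cannot rely on exact target locations. This is a simplified reading; the paper's exact parameterization of the position distribution is an assumption here:

```python
import numpy as np

def stochastic_positions(pos_embed, sigma, rng):
    """Perturb masked-token positional embeddings with Gaussian noise,
    StoP-style, so predictions are conditioned on uncertain locations
    rather than exact ones."""
    return pos_embed + rng.normal(scale=sigma, size=pos_embed.shape)

rng = np.random.default_rng(0)
pos = np.linspace(0, 1, 16)[:, None] * np.ones((16, 8))  # toy (tokens, dim) grid
noisy = stochastic_positions(pos, sigma=0.1, rng=rng)
```

During training each step sees a fresh draw, so the model is discouraged from overfitting to precise coordinates and pushed toward the more semantic features the paper reports.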

MC-JEPA Unifies Self-Supervised Content Learning with Optical Flow for Motion-Aware Visual Representations

MC-JEPA introduces a joint-embedding predictive architecture that simultaneously learns content features and optical flow in a shared self-supervised encoder. By combining optical flow estimation with standard self-supervised objectives, it enables content features to incorporate motion cues, with mutual benefits between tasks. The approach matches the state of the art on unsupervised optical flow benchmarks and excels in downstream tasks like semantic segmentation on images and videos.
