Yann LeCun

Chronological feed of everything captured from Yann LeCun.

Lie Symmetry SSL Yields Superior Representations for Heterogeneous PDE Data

Self-supervised learning with joint embedding and Lie symmetries learns general-purpose representations of PDEs from heterogeneous or incomplete real-world data, bypassing the need for tailored simulations. These representations outperform baselines on invariant tasks like PDE coefficient regression and improve the time-stepping efficiency of neural solvers. The approach draws on SSL successes in vision, aiming toward PDE foundation models.

Augmented Language Models: Overcoming Limitations with Reasoning and Tool Use

Augmented Language Models (ALMs) enhance traditional LMs by integrating reasoning skills (complex task decomposition) and external tool utilization (e.g., code interpreters). This approach allows ALMs to expand their context processing ability beyond the pure language modeling paradigm, enabling them to address interpretability, consistency, and scalability issues inherent in conventional LMs. ALMs achieve this while maintaining a standard missing token prediction objective, and have demonstrated superior performance on various benchmarks.

Bridging the Gap in Self-Supervised Learning: Split Invariant-Equivariant Representations

This research introduces SIE, a novel approach to self-supervised learning that combines invariant and equivariant representations. By addressing the limitations of existing methods, SIE offers richer representations suitable for diverse tasks. The accompanying 3DIEBench dataset provides a controlled environment for evaluating these advancements, bridging the gap between large-scale invariant methods and smaller-scale equivariant approaches.

Understanding Generalization in Self-Supervised Learning

Self-supervised learning (SSL) is a powerful framework, but practical applications face issues like optimizer instability and representation collapse. This research introduces a theoretical framework to analyze the interplay between data augmentation, network architecture, and training algorithms. The findings provide insights for SSL practitioners to improve generalization performance in both pretraining and downstream tasks.

VCReg: Adapting VICReg's Variance-Covariance Penalty to Supervised Pretraining Boosts Transfer Learning

VCReg adapts VICReg's self-supervised variance-covariance regularization to supervised pretraining, enforcing high-variance and low-covariance representations to counter loss-minimizing feature collapse. Applied to intermediate layers, it enhances feature diversity and transferability across images and videos. Empirical results show state-of-the-art transfer performance, plus gains in long-tail and hierarchical classification, linked to mitigating gradient starvation and neural collapse.

I-JEPA: A self-supervised approach for learning semantic image representations

I-JEPA is a novel non-generative self-supervised learning method for images. It predicts representations of target image blocks from a single context block within the same image. This approach focuses on learning highly semantic image representations without relying on hand-crafted data augmentations, demonstrating scalability and strong performance across various downstream tasks.

I-JEPA: A Step Towards More Human-like AI Learning

Meta AI introduces I-JEPA, a novel Image Joint Embedding Predictive Architecture, as a first step towards Yann LeCun's vision for more human-like AI. Unlike generative models that predict pixel-level details, I-JEPA learns by creating abstract representations of images and predicting missing information at a high semantic level. This approach offers computational efficiency and strong performance across various computer vision tasks, demonstrating potential for learning general world models from unlabeled data.

LeCun's H-JEPA: Hierarchical Latent Variable Energy-Based Architecture for Autonomous AI

Yann LeCun proposes Hierarchical Joint Embedding Predictive Architecture (H-JEPA) as the core building block for future autonomous machine intelligence, addressing limitations in current AI systems. H-JEPA integrates energy-based models with latent variable models to enable learning reliable world models, reasoning, and planning complex actions. This architecture targets applications that today's systems cannot deliver, such as Level 5 self-driving cars, domestic robots, and advanced virtual assistants.

LLM Explanations Boost GNNs via Interpreter for Top TAG Performance

The method prompts LLMs for zero-shot classification and explanations on the node texts of text-attributed graphs, then uses an LLM-to-LM interpreter to convert those explanations into features for GNNs. It achieves SOTA results on Cora, PubMed, ogbn-arxiv, and the new tape-arxiv23 dataset, and delivers a 2.88x training speedup over baselines on ogbn-arxiv, with potential for broader graph-text tasks.

The Future of AI: Beyond Large Language Models

Yann LeCun, a prominent AI researcher, argues that current large language models (LLMs) are fundamentally limited and will be superseded within five years. He advocates for a new approach centered on self-supervised learning, world models that predict outcomes, and hierarchical planning to achieve human-level AI. This paradigm shift will necessitate abandoning generative and probabilistic models in favor of joint embedding architectures and regularized methods, with a strong emphasis on open-source foundational AI.

SSL Regularization Drives Semantic Clustering in Representations

Empirical analysis across SSL models shows the regularization term of the SSL objective inherently clusters representations by semantic labels, enhancing downstream classification while compressing data information. Representations align more with semantic than random classes, with alignment strengthening during training and deeper in networks. This hierarchical semantic alignment provides mechanistic insights into SSL's effectiveness.

Yann LeCun: AI as Intelligence Amplifier Ushering Human Renaissance, Not Doom

Yann LeCun views current AI progress as a continuous evolution toward human-level intelligence, with autoregressive LLMs limited by text-only training that lacks real-world understanding, planning, and controllability. Future systems will integrate objectives for safe, steerable planning, enabling open-source infrastructure such as Wikipedia-vetted assistants that empower individuals with superintelligent aides. He argues that open source prevails over proprietary models by harnessing global talent, dismisses AGI extinction scenarios as fallacies that conflate intelligence with a desire for domination, and predicts job creation in creative and service sectors despite transitional disruptions.

Self-Supervised Learning Cookbook Lowers Research Entry Barriers

This paper presents a comprehensive "cookbook" for self-supervised learning (SSL), framing it as a delicate process akin to cooking with numerous interdependent choices in pretext tasks and hyperparameters. It aims to democratize SSL research by providing foundational recipes and practical guidance on method navigation and knob tuning. The resource equips researchers to effectively train and innovate in SSL without prior deep expertise.

Information Bottleneck Guides Supervised DNNs but Remains Unclear in Self-Supervised Learning

Deep neural networks leverage the information bottleneck principle to balance compression and relevant information preservation in supervised learning. Self-supervised learning circumvents labeled data needs but lacks clarity on adapting this principle. The review proposes a unified information-theoretic framework for self-supervised methods, analyzes estimation challenges, and identifies research gaps.

EMP-SSL Achieves One-Epoch Self-Supervised Learning via Extreme Multi-Patch Cropping

EMP-SSL enables self-supervised learning convergence in one training epoch by extracting a massive number of crops per image, bypassing heuristics like weight sharing, normalization, quantization, and stop gradients. It matches or exceeds prior SSL performance on CIFAR-10 (85.1%), CIFAR-100 (58.5%), Tiny ImageNet (38.1%), and ImageNet-100 (58.5%) in one epoch, reaching 91.5%, 70.1%, 51.5%, and 78.9% respectively with linear probing in under ten epochs. The method demonstrates superior transferability to out-of-domain datasets over baselines.

Positive Active Learning Surpasses SSL with Minimal Low-Cost Semantic Queries

Positive Active Learning (PAL) generalizes self-supervised learning by using an oracle to query semantic relationships between samples, forming similarity graphs for representation learning. This framework extends to supervised and semi-supervised settings, embeds prior knowledge like labels into SSL losses without pipeline changes, and enables efficient active learning via simple semantic queries. PAL bridges theory and practice in active learning by relying on non-expert feasible annotations.

VICReg's Self-Supervised Learning Explained via Information Theory and Mutual Information Optimization

VICReg optimizes variance, invariance, and covariance to prevent representational collapse in self-supervised learning. The paper derives information-theoretic quantities for deterministic networks, linking VICReg's objective to mutual information maximization without stochastic assumptions. It provides a generalization bound showing advantages for downstream tasks and introduces superior SSL methods from these principles.
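The three VICReg terms are simple enough to sketch directly. The numpy version below follows the paper's structure (invariance as MSE between views, a hinge on per-dimension standard deviation, and a penalty on off-diagonal covariance); the coefficient values and the function name are illustrative defaults, not the exact training configuration:

```python
import numpy as np

def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0, eps=1e-4):
    """Sketch of the VICReg objective on two (N, D) batches of embeddings."""
    n, d = z1.shape
    # invariance: the two views of each sample should match
    inv = np.mean((z1 - z2) ** 2)

    def var_term(z):
        # hinge keeps each embedding dimension's std above 1 (anti-collapse)
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def cov_term(z):
        # decorrelate dimensions: penalize off-diagonal covariance entries
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d

    return float(lam * inv + mu * (var_term(z1) + var_term(z2))
                 + nu * (cov_term(z1) + cov_term(z2)))
```

Fully collapsed embeddings (every sample identical) incur the maximal variance penalty, while spread, decorrelated embeddings score near zero, which is exactly the mechanism preventing representational collapse.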

Duality in Self-Supervised Learning: Bridging Contrastive and Non-Contrastive Methods

This paper explores the theoretical and practical similarities between contrastive and non-contrastive self-supervised learning methods for image representation. By demonstrating algebraic equivalence under certain assumptions, the research challenges common assumptions about design choices, such as the need for large output dimensions in non-contrastive methods. The findings suggest that performance gaps between these approaches can be closed through optimized network design and hyperparameter tuning.

Augmented Language Models Integrate Reasoning and Tools to Overcome Traditional LM Limitations

Augmented Language Models (ALMs) enhance standard LMs by adding reasoning—decomposing complex tasks into subtasks—and tool usage, such as calling code interpreters, while retaining the missing token prediction objective. ALMs employ heuristics or learn from demonstrations to combine these capabilities, enabling expanded context processing via external modules. This paradigm improves interpretability, consistency, and scalability over pure LMs and boosts performance on benchmarks.

Split Invariant-Equivariant Representations Enable Richer Self-Supervised Learning via Hypernetwork Predictors

SIE splits representations into invariant and equivariant components, using a hypernetwork-based predictor to prevent collapse to invariance and learn diverse features. Evaluated on the new 3DIEBench dataset with over 2.5 million images from 55 3D classes under controlled transformations, SIE outperforms prior methods on equivariance tasks both qualitatively and quantitatively. This bridges the gap between large-scale invariant SSL and smaller-scale equivariant approaches, enabling richer unsupervised representations for complex scenarios.

Theoretical Analysis Reveals Interplay of Augmentations, Inductive Bias, and Algorithms in SSL Generalization

This paper provides a theoretical framework analyzing the interplay between data augmentations, network architecture (inductive bias), and training algorithms in self-supervised learning (SSL). It examines generalization on both pretraining and downstream tasks in a controlled setup, addressing practical issues like optimizer instability and representation collapse. Key insights from the analysis offer actionable guidance for SSL practitioners.

Blockwise Self-Supervised Pretraining Nearly Matches End-to-End Backpropagation on ImageNet

Researchers propose blockwise learning as an alternative to full backpropagation, training the four main layer blocks of ResNet-50 independently using the Barlow Twins self-supervised loss. This approach yields 70.48% top-1 ImageNet accuracy with a linear probe, just 1.1 percentage points below the 71.57% of an end-to-end pretrained ResNet-50. Extensive experiments analyze components and adaptations, identifying paths to scale local learning rules for large networks with implications for hardware and neuroscience.

I-JEPA Enables Semantic Image Representations via Joint-Embedding Prediction Without Data Augmentations

I-JEPA is a non-generative self-supervised learning method that predicts representations of large-scale target blocks from a spatially distributed context block within the same image. It relies on a masking strategy emphasizing semantic-scale targets and informative contexts to produce highly semantic representations, avoiding hand-crafted augmentations. When paired with Vision Transformers, it scales efficiently, training a ViT-Huge/14 on ImageNet in under 72 hours on 16 A100 GPUs with strong downstream performance in classification, object counting, and depth prediction.

Graph ViT/MLP-Mixer Overcomes GNN Limitations with Linear Efficiency and Long-Range Modeling

Graph ViT/MLP-Mixer adapts ViT/MLP-Mixer architectures to graphs, replacing local message-passing with global token mixing to capture long-range dependencies and mitigate over-squashing. It achieves linear complexity in nodes and edges, outperforming Graph Transformers in speed and memory while distinguishing 3-WL non-isomorphic graphs. Empirical results on 4 simulated datasets and 7 real-world benchmarks confirm its competitiveness.

Spectral Properties Unify Self-Supervised Learning with Supervised Tasks, Favoring Non-Contrastive Joint Embeddings

Yann LeCun and Randall discuss papers linking self-supervised learning (SSL) to spectral embeddings, showing that SSL surrogate tasks aid supervised learning when their similarity-graph matrices share spectral properties with the supervised adjacency matrix. In the infinite limit, data augmentation has analytical effects equivalent to infinite samples, and non-contrastive joint embedding architectures outperform contrastive methods by avoiding dimensional collapse, with a mathematical duality between the two via Z Z^T versus Z^T Z. LeCun advocates multi-criteria SSL, such as prediction and slow feature analysis, over RL, using differentiable surrogate costs for efficient planning in hierarchical action spaces.
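The Z Z^T versus Z^T Z duality has a concrete linear-algebra core: for an N x D embedding matrix Z, the N x N Gram matrix of sample similarities (the quantity contrastive methods work with) and the D x D feature second-moment matrix (the quantity non-contrastive methods regularize) share the same nonzero eigenvalues. A toy check, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 3))   # N=8 embeddings of dimension D=3
gram = Z @ Z.T                # N x N: sample-sample similarities ("contrastive" view)
cov = Z.T @ Z                 # D x D: feature second moments ("non-contrastive" view)
top = np.sort(np.linalg.eigvalsh(gram))[-3:]   # the D nonzero eigenvalues of gram
assert np.allclose(top, np.sort(np.linalg.eigvalsh(cov)))
```

Both matrices therefore encode the same spectrum; the two families of methods differ in which of the two objects they shape, not in the information available to them.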

JEPA Excels with Dynamic Distractors but Fails on Static Noise Due to Slow Feature Bias

JEPA methods using VICReg and SimCLR objectives match or exceed pixel reconstruction baselines for learning dot location representations in offline settings with timestep-varying distractor noise. However, they fail when noise is fixed across frames. A theoretical analysis reveals this limitation stems from JEPA's focus on invariant features rather than slow-changing dynamics.

POLICE: Provable Affine Constraint Enforcement for DNNs with Minimal Forward-Pass Changes

POLICE introduces the first provably optimal method to enforce affine constraints on DNN outputs over a specified input region without altering optimization or requiring sampling. It integrates via minimal forward-pass modifications, enabling standard gradient descent on parameters while guaranteeing constraint satisfaction throughout training and testing. This addresses modularity limitations in incorporating a priori knowledge or physical properties into DNNs.

Unsupervised CTRL Yields Unified Representations Excelling in Both Discrimination and Generation

The paper extends the CTRL framework to unsupervised learning via a constrained maximin game on a rate reduction objective, expanding features across samples while compressing augmentations per sample. This process induces discriminative low-dimensional structures in the representations. These unified representations achieve near-SOTA unsupervised classification performance and superior conditional image generation quality under matched conditions.
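The coding-rate quantity underlying the rate reduction objective can be sketched in a few lines. This is only the rate term R(Z) (the full objective plays expansion against per-sample compression); the `eps` quantization parameter and the matrix sizes below are illustrative:

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Coding rate of (N, D) features: R(Z) = 1/2 logdet(I + d/(n eps^2) Z^T Z)."""
    n, d = Z.shape
    sign, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z.T @ Z)
    return 0.5 * logdet

spread = np.eye(8)                 # decorrelated, full-rank features
collapsed = np.ones((8, 8)) / 8    # every sample identical (rank 1)
assert coding_rate(spread) > coding_rate(collapsed)
```

Maximizing this rate across samples rewards spread, high-rank features, while minimizing it within augmentation groups compresses each sample's views, producing the discriminative low-dimensional structures the summary describes.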

Optimizing Self-Supervised Learning via Decoupled Contrastive Loss

Decoupled Contrastive Learning (DCL) optimizes self-supervised learning by eliminating the negative-positive-coupling (NPC) effect found in the InfoNCE loss denominator. By removing the positive term from that denominator, DCL reduces the dependency on massive batch sizes, momentum encoders, and extended training epochs. This loss modification enables higher accuracy on ImageNet-1K benchmarks and improves robustness against suboptimal hyperparameter selection.
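A simplified cross-view-only version of the two losses makes the change concrete (the full SimCLR/DCL formulation also includes within-view negatives, omitted here for brevity; the function name is illustrative):

```python
import numpy as np

def contrastive_loss(z1, z2, tau=0.1, decoupled=False):
    """InfoNCE over two views of a batch of embeddings; with decoupled=True
    the positive pair is removed from the denominator, as in DCL."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau             # (N, N); diagonal entries are positives
    pos = np.diag(logits)
    denom = np.exp(logits).sum(axis=1)   # InfoNCE: positive + negatives
    if decoupled:
        denom = denom - np.exp(pos)      # DCL: negatives only (NPC removed)
    return float(np.mean(np.log(denom) - pos))
```

With the positive removed from the denominator, the gradient on each positive pair no longer shrinks as its similarity grows, which is why DCL tolerates small batches and short schedules.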

NeuroAI and Embodied Turing Test as Catalysts for Next-Generation AI

Neuroscience has driven AI progress; accelerating it requires investment in NeuroAI research. The embodied Turing test benchmarks AI animal models against real animals in sensorimotor interactions. This shifts focus from human-unique skills like language to evolutionarily conserved animal capabilities, providing a roadmap for future AI.

VoLTA Enables Fine-Grained Vision-Language Tasks Using Only Image-Caption Data via Weakly-Supervised Patch Alignment

VoLTA introduces a vision-language transformer that achieves fine-grained region-level understanding, such as object detection and segmentation, using only image-caption pairs without expensive bounding box annotations. It employs graph optimal transport for weakly-supervised alignment between local image patches and text tokens, creating an explicit, self-normalized matching criterion. The model integrates multi-modal fusion deeply into uni-modal backbones during pre-training, eliminating dedicated fusion layers to reduce memory usage. Experiments show VoLTA matches or exceeds prior methods on fine- and coarse-grained tasks despite using fewer annotations.
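VoLTA's weakly-supervised alignment is built on optimal transport between patch and token embeddings. The graph-OT variant used in the paper is more involved, but a plain entropic (Sinkhorn) plan between uniform marginals, sketched below with illustrative names and sizes, conveys the self-normalized matching idea:

```python
import numpy as np

def sinkhorn_plan(C, reg=0.5, iters=300):
    """Entropic OT plan between uniform marginals for an (n, m) cost matrix C."""
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m    # uniform patch / token masses
    K = np.exp(-C / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):                   # alternating marginal scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
patches, tokens = rng.normal(size=(4, 8)), rng.normal(size=(3, 8))
# cosine-distance cost between image patches and text tokens
C = 1 - (patches @ tokens.T) / (
    np.linalg.norm(patches, axis=1)[:, None]
    * np.linalg.norm(tokens, axis=1)[None, :])
plan = sinkhorn_plan(C)
# the plan is self-normalized: rows sum to 1/4, columns to 1/3
assert np.allclose(plan.sum(axis=0), 1 / 3)
assert np.allclose(plan.sum(axis=1), 1 / 4, atol=1e-6)
```

The transport plan's fixed marginals are what make the matching criterion "explicit and self-normalized": every patch and every token must distribute a fixed amount of mass, so no extra normalization layer is needed.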

RankMe: Unsupervised Rank-Based Metric Predicts JE-SSL Representation Quality

RankMe assesses Joint-Embedding Self-Supervised Learning (JE-SSL) representations using their effective rank as a simple, label-free indicator of downstream performance. This metric enables hyperparameter selection and quality evaluation across datasets without training or tuning. Extensive experiments show RankMe matches label-based validation with minimal performance loss.
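RankMe itself is essentially one line of linear algebra: the exponential of the entropy of the L1-normalized singular values of the embedding matrix. A minimal numpy sketch:

```python
import numpy as np

def rankme(Z, eps=1e-7):
    """Effective rank of an (N, D) embedding matrix: exp of the entropy
    of the normalized singular-value distribution."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum() + eps              # normalized spectrum (eps for stability)
    return float(np.exp(-np.sum(p * np.log(p))))
```

An identity-like embedding matrix (all singular values equal) scores its full dimension, while a collapsed rank-one matrix scores about 1, which is why the metric tracks downstream usability without any labels.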

VICRegL Bridges Global-Local Feature Gap in Self-Supervised Learning for Versatile Vision Tasks

VICRegL applies the VICReg variance-invariance-covariance regularization simultaneously to both global feature vectors and local feature maps from two distorted image views in a dual-branch CNN. Local features are only regularized if their L2 distance is below a threshold or their positions align with the known geometric transformation between views. This joint learning yields strong performance on detection/segmentation while preserving classification efficacy, addressing the global-local trade-off in self-supervised representation learning.

VICRegL: A Novel Approach for Simultaneous Global-Local Feature Learning in Self-Supervised Image Representation

Most self-supervised methods for image representation learning focus on either global or local features, excelling in classification or detection/segmentation, respectively. VICRegL introduces a novel approach that simultaneously learns both global and local features. This method employs two identical convolutional network branches fed distorted versions of the same image, applying the VICReg criterion to both global and local feature vectors. This allows VICRegL to achieve strong performance across classification, detection, and segmentation tasks.

Yann LeCun's Vision for Human-Level AI: World Models and JEPA

Yann LeCun proposes that human-level AI requires systems to learn "world models" for understanding environmental dynamics, a capability distinct from current data-intensive approaches. His suggested architecture includes six differentiable modules, with the Joint Embedding Predictive Architecture (JEPA) central to learning abstract world representations and handling prediction uncertainty. This framework aims to enable self-supervised learning for robust, adaptive AI.

Self-Supervised Learning as the Dark Matter Powering Efficient World Models and Animal-Level Intelligence

Self-supervised learning mimics how babies and animals acquire background knowledge—intuitive physics, object dynamics, and common sense—through passive observation, enabling rapid task learning like driving after minimal practice, unlike data-hungry supervised or reinforcement paradigms. It involves predicting future video frames, filling perceptual gaps, or continuing text sequences to build predictive world models that handle uncertainty via compressed latent distributions. Success in NLP via masked language modeling contrasts with vision's progress through non-contrastive methods like Barlow Twins, but video prediction remains unsolved. Yann LeCun posits this approach, integrated with gradient-based reasoning and hierarchical action planning, as AI's best path to matching the intelligence of a cat's roughly 800M-neuron brain.

High-Dimensional ML is Always Extrapolation: Reinterpreting Neural Nets as Piecewise Linear Space Partitioners

In high-dimensional spaces, the probability of test points lying in the convex hull of training data approaches zero, rendering traditional low-dimensional interpolation intuitions irrelevant; all ML, including deep learning, operates in an extrapolative regime. Neural networks with ReLU activations partition input space into polyhedral ReLU cells via hyperplanes, performing input-specific affine transformations rather than smooth manifold learning. This piecewise linear view demystifies NNs, aligning them with classical methods like decision trees and SVMs, while emphasizing engineered inductive biases and feature transformations for effective generalization.
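The "input-specific affine transformation" claim is easy to verify on a toy network: fix the ReLU activation pattern at a point, and the network agrees with a single affine map throughout that cell. The two-layer network and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# a tiny random two-layer ReLU network: f(x) = W2 relu(W1 x + b1) + b2
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
W2, b2 = rng.normal(size=(2, 16)), rng.normal(size=2)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x = rng.normal(size=4)
mask = (W1 @ x + b1 > 0).astype(float)   # activation pattern = cell identity
A = W2 @ (W1 * mask[:, None])            # the affine map active in this cell
c = W2 @ (b1 * mask) + b2
x_near = x + 1e-6 * rng.normal(size=4)   # nearby point in the same ReLU cell
assert np.allclose(f(x_near), A @ x_near + c)
```

Each distinct mask defines a polyhedral cell on which the network is exactly the affine map `A x + c`; generalization then hinges on how these cells and maps are arranged, not on smooth interpolation between training points.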

Yann LeCun on the Past, Present, and Future of AI

Yann LeCun discusses the historical development of deep learning frameworks, including his early work on HLM, SN, and Lush, which laid foundational concepts for modern systems like PyTorch. He emphasizes the critical need for advancements in self-supervised learning, reasoning, and action planning to achieve more intelligent AI. LeCun also advocates for an interdisciplinary approach to AI education and predicts another major AI revolution driven by self-supervised learning and leading to advanced virtual assistants and robotics.

Optimizing Latent Space for Image Generation Inspiration

This paper introduces a strategy to facilitate creative inspiration from deep generative models, specifically GANs, by optimizing latent parameters. It addresses the tediousness of extracting useful generations from these models by proposing an optimization method that finds latent parameters corresponding to the closest generation to a user-provided inspirational image. The research explores various optimization techniques, including gradient descent and gradient-free optimizers, to enhance usability and control over generated outputs for creators.
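The latent-search idea can be sketched end to end with a toy differentiable "generator" standing in for a GAN (a fixed random tanh layer here; all names and sizes are illustrative): start from a latent guess and descend on the distance to the inspirational target.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))       # toy stand-in for a trained generator

def G(z):
    return np.tanh(W @ z)          # latent z (4,) -> "image" (16,)

target = G(rng.normal(size=4))     # the user's inspirational image
z = np.zeros(4)                    # initial latent guess
loss0 = np.mean((G(z) - target) ** 2)
for _ in range(500):               # gradient descent on the latent only
    r = G(z) - target
    z -= 0.1 * W.T @ ((1 - G(z) ** 2) * r) / len(r)   # chain rule through tanh
loss1 = np.mean((G(z) - target) ** 2)
assert loss1 < loss0               # the generation moved toward the target
```

With a real GAN the generator's weights stay frozen in exactly the same way; only the latent parameters are optimized, either by gradient descent as above or by the gradient-free optimizers the paper also explores.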

Self-Supervised Learning: A Path to Common Sense AI

Self-supervised learning (SSL) is critical for advancing AI beyond the limitations of supervised learning, particularly in developing generalist models with common sense. Unlike supervised methods requiring extensive labeled data, SSL extracts supervisory signals directly from raw data, enabling models to learn more nuanced representations of reality. This approach, especially through energy-based models and joint embedding architectures, holds promise for bridging the gap to human-level intelligence by allowing AI to acquire generalized background knowledge.

Implicit Rank-Minimizing Autoencoder (IRMAE) for Compact Latent Spaces

The Implicit Rank-Minimizing Autoencoder (IRMAE) is a novel autoencoder architecture that achieves compact latent representations by implicitly minimizing the rank of the latent code's covariance matrix. This is accomplished by strategically inserting additional linear layers between the encoder and decoder, leveraging the property of gradient descent that leads to minimum-rank solutions in multi-layer linear networks. The model is characterized by its simplicity, determinism, and effectiveness in learning low-dimensional latent spaces for tasks such as image generation and representation learning.

Hierarchical Loss Functions in Neural Networks: A Critical Evaluation

This paper introduces a novel hierarchical loss metric designed to penalize classification errors proportionally to the semantic distance between classes, utilizing an ultrametric tree structure. The core finding reveals that while this hierarchical loss offers a more semantically meaningful evaluation metric, direct minimization via standard stochastic gradient descent with random initialization does not reliably outperform cross-entropy loss minimization in achieving hierarchical classification objectives. Therefore, its primary utility appears to be as a robust evaluation metric rather than an optimizable objective function.

Yann LeCun: Align AI Objectives Like Human Laws, Build World Models via Self-Supervised Learning for Reasoning and Autonomy

Yann LeCun equates AI value alignment to millennia-old human legal systems that shape objectives via constraints, treating HAL 9000-style misalignment as solvable through hardwired ethical rules akin to the Hippocratic Oath. Deep learning succeeds empirically despite textbook warnings because of its biological inspiration from brains, emphasizing gradient-based learning over discrete logic for reasoning, which requires working memory, recurrence, and energy-based planning. Self-supervised learning via predictive world models enables rapid, sample-efficient intelligence like that of human babies, forming the foundation for model-based RL, causal inference, and autonomous systems grounded in reality rather than pure language.

Latent Relational Graphs for Transfer Learning

Traditional deep transfer learning primarily focuses on transferring unary feature vectors. This research explores learning and transferring latent relational graphs that capture dependencies between data units (e.g., words, pixels) from unlabeled data. This approach demonstrates improved performance across various downstream tasks, including natural language processing and image classification, and is transferable to different embedding types and even embedding-free units.

Forecasting Convolutional Features Outperforms Baselines for Future Instance Segmentation

Predicting future instance segmentation is achieved by forecasting fixed-size convolutional features from Mask R-CNN rather than RGB frames or semantic maps. The detection head of Mask R-CNN is applied to these predicted features to generate instance masks for future frames, handling variable object counts efficiently. This method significantly surpasses baselines using optical flow and adapted segmentation architectures.