Chronological feed of everything captured from Yann LeCun.
paper / ylecun / Jul 11
Self-supervised learning with joint embedding and Lie symmetries learns general-purpose representations of PDEs from heterogeneous or incomplete real-world data, bypassing the need for simulation data tailored to each setting. These representations outperform baselines on invariant tasks like PDE coefficient regression and improve the time-stepping efficiency of neural solvers. The approach draws on SSL's successes in vision, aiming toward PDE foundation models.
self-supervised-learning, lie-symmetries, partial-differential-equations, pde-representations, neural-solvers, machine-learning, numerical-analysis
“SSL with Lie symmetries learns useful PDE representations from heterogeneous data sources without simulated training data specific to a setting”
blog / ylecun / Jul 6
Augmented Language Models (ALMs) enhance traditional LMs by integrating reasoning skills (complex task decomposition) and external tool utilization (e.g., code interpreters). This approach allows ALMs to expand their context processing ability beyond the pure language modeling paradigm, enabling them to address interpretability, consistency, and scalability issues inherent in conventional LMs. ALMs achieve this while maintaining a standard missing token prediction objective, and have demonstrated superior performance on various benchmarks.
augmented-language-models, llm-reasoning, tool-use, natural-language-processing, ai-survey, machine-intelligence
“Augmented Language Models (ALMs) improve upon traditional Language Models (LMs) by incorporating reasoning and tool-use capabilities.”
blog / ylecun / Jun 27
This research introduces SIE, a novel approach to self-supervised learning that combines invariant and equivariant representations. By addressing the limitations of existing methods, SIE offers richer representations suitable for diverse tasks. The accompanying 3DIEBench dataset provides a controlled environment for evaluating these advancements, bridging the gap between large-scale invariant methods and smaller-scale equivariant approaches.
self-supervised-learning, equivariant-representations, 3d-datasets, deep-learning, computer-vision, machine-learning, ai-research-papers
“The 3DIEBench dataset, comprising over 2.5 million images from 55 3D model classes, enables controlled evaluation of transformations applied to objects.”
blog / ylecun / Jun 26
Self-supervised learning (SSL) is a powerful framework, but practical applications face issues like optimizer instability and representation collapse. This research introduces a theoretical framework to analyze the interplay between data augmentation, network architecture, and training algorithms. The findings provide insights for SSL practitioners to improve generalization performance in both pretraining and downstream tasks.
self-supervised-learning, representation-learning, machine-learning-theory, deep-learning-challenges, ai-research-papers
“Self-supervised learning (SSL) is a robust method for learning representations from raw, unlabeled data.”
paper / ylecun / Jun 23
VCReg adapts VICReg's self-supervised variance-covariance regularization to supervised pretraining, enforcing high variance and low covariance in representations to counter the feature collapse induced by minimizing the task loss alone. Applied to intermediate layers, it enhances feature diversity and transferability across images and videos. Empirical results show state-of-the-art transfer performance, plus gains in long-tail and hierarchical classification, linked to mitigating gradient starvation and neural collapse.
transfer-learning, vc-reg, self-supervised-learning, representation-learning, neural-collapse, gradient-starvation, computer-vision
“VCReg significantly enhances transfer learning performance for images and videos”
blog / ylecun / Jun 18
I-JEPA is a novel non-generative self-supervised learning method for images. It predicts representations of target image blocks from a single context block within the same image. This approach focuses on learning highly semantic image representations without relying on hand-crafted data augmentations, demonstrating scalability and strong performance across various downstream tasks.
self-supervised-learning, image-representation, computer-vision, deep-learning, vision-transformers, machine-learning-research, yann-lecun
“I-JEPA learns highly semantic image representations without data augmentation.”
blog / ylecun / Jun 13
Meta AI introduces I-JEPA, a novel Image Joint Embedding Predictive Architecture, as a first step towards Yann LeCun's vision for more human-like AI. Unlike generative models that predict pixel-level details, I-JEPA learns by creating abstract representations of images and predicting missing information at a high semantic level. This approach offers computational efficiency and strong performance across various computer vision tasks, demonstrating potential for learning general world models from unlabeled data.
i-jepa, yann-lecun, self-supervised-learning, computer-vision, ai-models, meta-ai, image-recognition
“I-JEPA is the first AI model aligned with Yann LeCun's vision for human-like AI.”
paper / ylecun / Jun 5
Yann LeCun proposes Hierarchical Joint Embedding Predictive Architecture (H-JEPA) as the core building block for future autonomous machine intelligence, addressing limitations in current AI systems. H-JEPA integrates energy-based models with latent variable models to enable learning reliable world models, reasoning, and planning of complex actions. The architecture targets capabilities missing from today's systems, such as Level 5 self-driving cars, domestic robots, and advanced virtual assistants.
energy-based-models, latent-variable-models, hierarchical-jepa, yann-lecun, autonomous-ai, machine-learning, predictive-architecture
“Current automated systems fall short of Level 5 self-driving cars, domestic robots, and virtual assistants that learn reliable world models, reason, and plan complex action sequences.”
paper / ylecun / May 31
The method prompts LLMs for zero-shot classifications and explanations of node texts in text-attributed graphs, then uses an LLM-to-LM interpreter to convert those explanations into features for GNNs. It achieves SOTA results on Cora, PubMed, ogbn-arxiv, and the new tape-arxiv23 dataset, and delivers a 2.88x training speedup over baselines on ogbn-arxiv, with potential for broader graph-text tasks.
graph-neural-networks, text-attributed-graphs, large-language-models, llm-explanations, representation-learning, graph-representation
“Method achieves state-of-the-art results on Cora, PubMed, ogbn-arxiv, and tape-arxiv23 datasets”
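A rough pipeline sketch of the idea described above; query_llm, embed_text, and gnn are hypothetical placeholders for an LLM API, a smaller language-model encoder, and a graph network, and none of these names come from the paper's code.

```python
def build_node_features(node_texts, query_llm, embed_text):
    """Turn each node's raw text into an explanation-derived feature vector."""
    features = []
    for text in node_texts:
        # Ask the LLM for a zero-shot label plus its reasoning (the explanation).
        explanation = query_llm(
            f"Classify this paper and explain your reasoning:\n{text}"
        )
        # LLM-to-LM interpreter step: embed the explanation with a smaller LM.
        features.append(embed_text(explanation))
    return features

# Downstream (hypothetical): feed the explanation features to any standard GNN.
# node_feats = build_node_features(texts, query_llm, embed_text)
# logits = gnn(node_feats, edge_index)
```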
youtube / ylecun / May 25
Yann LeCun, a prominent AI researcher, argues that current large language models (LLMs) are fundamentally limited and will be superseded within five years. He advocates for a new approach centered on self-supervised learning, world models that predict outcomes, and hierarchical planning to achieve human-level AI. This paradigm shift will necessitate abandoning generative and probabilistic models in favor of joint embedding architectures and regularized methods, with a strong emphasis on open-source foundational AI.
ai-ethics, ai-future, llm-limitations, self-supervised-learning, cognitive-architectures, open-source-ai, ai-safety
“Current LLMs are fundamentally limited and will be replaced by superior architectures within five years.”
paper / ylecun / May 24
Empirical analysis across SSL models shows the regularization term of the SSL objective inherently clusters representations by semantic labels, enhancing downstream classification while compressing data information. Representations align more with semantic than random classes, with alignment strengthening during training and deeper in networks. This hierarchical semantic alignment provides mechanistic insights into SSL's effectiveness.
self-supervised-learning, ssl-representations, representation-learning, semantic-clustering, yann-lecun, arxiv-paper, machine-learning
“SSL training facilitates clustering of samples with respect to semantic labels”
youtube / ylecun / May 15
Yann LeCun views current AI progress as a continuous evolution toward human-level intelligence, with autoregressive LLMs limited by text-only training that lacks real-world understanding, planning, and controllability. Future systems will integrate objectives for safe, steerable planning, enabling open-source AI infrastructure and assistants vetted in the way Wikipedia is, empowering individuals with superintelligent aides. He argues that open source will prevail over proprietary models by harnessing global talent, dismisses AGI extinction risks as fallacies that conflate intelligence with a desire for domination, and expects job creation in creative and service sectors despite transitional disruptions.
yann-lecun, ai-history, neural-nets, llm-limitations, open-source-ai, ai-safety, future-of-ai
“AI will amplify human intelligence like giving everyone a staff of smarter aides, sparking a new Renaissance.”
paper / ylecun / Apr 24
This paper presents a comprehensive "cookbook" for self-supervised learning (SSL), framing it as a delicate process akin to cooking with numerous interdependent choices in pretext tasks and hyperparameters. It aims to democratize SSL research by providing foundational recipes and practical guidance on method navigation and knob tuning. The resource equips researchers to effectively train and innovate in SSL without prior deep expertise.
self-supervised-learning, ssl-methods, machine-learning, computer-vision, arxiv-paper, yann-lecun
“Self-supervised learning is a promising path to advance machine learning”
paper / ylecun / Apr 19
Deep neural networks leverage the information bottleneck principle to balance compression and preservation of relevant information in supervised learning. Self-supervised learning removes the need for labeled data, but how this principle should be adapted to it remains unclear. The review proposes a unified information-theoretic framework for self-supervised methods, analyzes estimation challenges, and identifies research gaps.
self-supervised-learning, information-theory, information-bottleneck, deep-neural-networks, machine-learning, arxiv-paper, yann-lecun
“Deep neural networks excel in supervised tasks but require extensive labeled data.”
paper / ylecun / Apr 8
EMP-SSL enables self-supervised learning convergence in one training epoch by extracting a massive number of crops per image, bypassing heuristics like weight sharing, normalization, quantization, and stop gradients. It matches or exceeds prior SSL performance on CIFAR-10 (85.1%), CIFAR-100 (58.5%), Tiny ImageNet (38.1%), and ImageNet-100 (58.5%) in one epoch, reaching 91.5%, 70.1%, 51.5%, and 78.9% respectively with linear probing in under ten epochs. The method demonstrates superior transferability to out-of-domain datasets over baselines.
self-supervised-learning, ssl-efficiency, multi-patch-learning, computer-vision, image-representation, yann-lecun, arxiv-paper
“EMP-SSL converges to 85.1% top-1 accuracy on CIFAR-10 in one training epoch”
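A loose sketch of the two ingredients described above: many crops per image, an invariance term pulling each crop toward the per-image mean embedding, and a total-coding-rate regularizer. The TCR form and the weighting are assumptions based on the paper's description, not a verified reimplementation.

```python
import torch
import torch.nn.functional as F

def total_coding_rate(z, eps=0.2):
    # z: (n_crops, d) L2-normalized embeddings of one image's crops.
    n, d = z.shape
    cov = z.T @ z
    return 0.5 * torch.logdet(torch.eye(d, device=z.device) + (d / (n * eps ** 2)) * cov)

def emp_ssl_loss(crop_embeddings, invariance_weight=200.0):
    z = F.normalize(crop_embeddings, dim=1)
    mean_z = z.mean(dim=0, keepdim=True)
    invariance = -(z * mean_z).sum(dim=1).mean()   # pull every crop toward the mean
    return -total_coding_rate(z) + invariance_weight * invariance
```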
paper / ylecun / Mar 27
Positive Active Learning (PAL) generalizes self-supervised learning by using an oracle to query semantic relationships between samples, forming similarity graphs for representation learning. This framework extends to supervised and semi-supervised settings, embeds prior knowledge like labels into SSL losses without pipeline changes, and enables efficient active learning via simple semantic queries. PAL bridges theory and practice in active learning by relying on non-expert feasible annotations.
self-supervised-learning, active-learning, positive-active-learning, similarity-graphs, yann-lecun, machine-learning
“PAL formalizes a learning framework based on similarity graphs that generalizes beyond SSL to supervised and semi-supervised learning.”
paper / ylecun / Mar 1
VICReg optimizes variance, invariance, and covariance to prevent representational collapse in self-supervised learning. The paper derives information-theoretic quantities for deterministic networks, linking VICReg's objective to mutual information maximization without stochastic assumptions. It provides a generalization bound showing advantages for downstream tasks and introduces superior SSL methods from these principles.
self-supervised-learning, vicreg, information-theory, mutual-information, generalization-bound, ssl-methods
“VICReg relates to mutual information optimization in self-supervised learning”
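For reference, a minimal sketch of a VICReg-style objective with its three terms; the coefficients and normalization below follow common implementations and are assumptions here, not the paper's verbatim code.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_coeff=25.0, std_coeff=25.0, cov_coeff=1.0, eps=1e-4):
    # z_a, z_b: (N, D) embeddings of two augmented views of the same batch.
    n, d = z_a.shape
    # Invariance: the two views' embeddings should match.
    inv_loss = F.mse_loss(z_a, z_b)
    # Variance: hinge keeping every dimension's std above 1 to prevent collapse.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var_loss = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()
    # Covariance: drive off-diagonal covariance entries toward zero (decorrelation).
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)
    cov_a = (z_a.T @ z_a) / (n - 1)
    cov_b = (z_b.T @ z_b) / (n - 1)
    off = lambda m: m - torch.diag(torch.diag(m))
    cov_loss = off(cov_a).pow(2).sum() / d + off(cov_b).pow(2).sum() / d
    return sim_coeff * inv_loss + std_coeff * var_loss + cov_coeff * cov_loss
```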
blog / ylecun / Feb 28
This paper explores the theoretical and practical similarities between contrastive and non-contrastive self-supervised learning methods for image representation. By demonstrating algebraic equivalence under limited assumptions, the research challenges common beliefs about design choices, such as the need for large output dimensions in non-contrastive methods. The findings suggest that performance gaps between these approaches can be closed through careful network design and hyperparameter tuning.
machine-learning, self-supervised-learning, contrastive-learning, non-contrastive-learning, computer-vision, representation-learning
“Contrastive and non-contrastive self-supervised learning methods are theoretically similar and can be algebraically related under limited assumptions.”
paper / ylecun / Feb 15
Augmented Language Models (ALMs) enhance standard LMs by adding reasoning—decomposing complex tasks into subtasks—and tool usage, such as calling code interpreters, while retaining the missing token prediction objective. ALMs employ heuristics or learn from demonstrations to combine these capabilities, enabling expanded context processing via external modules. This paradigm improves interpretability, consistency, and scalability over pure LMs and boosts performance on benchmarks.
augmented-language-models, language-models, reasoning-augmentation, tool-use, ai-survey, llm-enhancements
“ALMs decompose complex tasks into simpler subtasks as a form of reasoning augmentation.”
paper / ylecun / Feb 14
SIE splits representations into invariant and equivariant components, using a hypernetwork-based predictor to prevent collapse to invariance and learn diverse features. Evaluated on the new 3DIEBench dataset with over 2.5 million images from 55 3D classes under controlled transformations, SIE outperforms prior methods on equivariance tasks both qualitatively and quantitatively. This bridges the gap between large-scale invariant SSL and smaller-scale equivariant approaches, enabling richer unsupervised representations for complex scenarios.
self-supervised-learning, equivariant-representations, invariant-representations, computer-vision, hypernetworks, 3d-benchmark
“3DIEBench dataset contains renderings from 3D models across 55 classes and more than 2.5 million images with full control over applied transformations.”
paper / ylecun / Feb 6
This paper provides a theoretical framework analyzing the interplay between data augmentations, network architecture (inductive bias), and training algorithms in self-supervised learning (SSL). It examines generalization on both pretraining and downstream tasks in a controlled setup, addressing practical issues like optimizer instability and representation collapse. Key insights from the analysis offer actionable guidance for SSL practitioners.
self-supervised-learning, ssl-augmentations, inductive-bias, generalization-theory, machine-learning-theory, arxiv-paper
“Data augmentation choice interacts with network architecture and training algorithm to determine SSL performance”
paper / ylecun / Feb 3
Researchers propose blockwise learning as an alternative to full backpropagation, training the four main layer blocks of ResNet-50 independently using Barlow Twins self-supervised loss. This approach yields 70.48% top-1 ImageNet accuracy with a linear probe, just 1.1% below end-to-end pretrained ResNet-50's 71.57%. Extensive experiments analyze components and adaptations, identifying paths to scale local learning rules for large networks with implications for hardware and neuroscience.
self-supervised-learning, blockwise-learning, backpropagation-alternatives, resnet-50, barlow-twins, local-learning-rules
“Blockwise pretraining of ResNet-50's four main blocks with Barlow Twins achieves 70.48% top-1 ImageNet accuracy using a linear probe.”
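A minimal sketch of the Barlow Twins objective that serves as the per-block training signal; in the blockwise setup described above, each ResNet block would receive its own copy of this loss on a projection of its output, with the block's input detached. Hyperparameters are assumptions.

```python
import torch

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-6):
    # z_a, z_b: (N, D) projected embeddings of two augmented views.
    n = z_a.size(0)
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    c = (z_a.T @ z_b) / n                                         # cross-correlation (D, D)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                # push correlations to 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()   # decorrelate the rest
    return on_diag + lambda_offdiag * off_diag

# Blockwise usage sketch: each block trains on a detached copy of the previous
# block's output, so gradients never flow across block boundaries.
# h_a = block(prev_a.detach()); h_b = block(prev_b.detach())
# loss = barlow_twins_loss(projector(h_a), projector(h_b))
```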
paper / ylecun / Jan 19
I-JEPA is a non-generative self-supervised learning method that predicts representations of large-scale target blocks from a spatially distributed context block within the same image. It relies on a masking strategy emphasizing semantic-scale targets and informative contexts to produce highly semantic representations, avoiding hand-crafted augmentations. When paired with Vision Transformers, it scales efficiently, training a ViT-Huge/14 on ImageNet in under 72 hours on 16 A100 GPUs with strong downstream performance in classification, object counting, and depth prediction.
self-supervised-learning, joint-embedding-predictive-architecture, image-representations, computer-vision, vision-transformers, arxiv-paper
“I-JEPA learns image representations without hand-crafted data-augmentations”
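A heavily simplified sketch of an I-JEPA-style training step, assuming user-supplied encoder and predictor modules; the module interfaces, the masking logic, and the choice of smooth L1 are assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def ijepa_step(context_encoder, target_encoder, predictor,
               patches, context_idx, target_idx):
    # patches: (B, num_patches, patch_dim); context_idx/target_idx: patch index tensors.
    with torch.no_grad():                                    # target encoder is not trained directly
        targets = target_encoder(patches)[:, target_idx]     # representations to predict
    context = context_encoder(patches[:, context_idx])       # sees only the context block
    preds = predictor(context, target_idx)                   # predict target-block representations
    return F.smooth_l1_loss(preds, targets)

# The target encoder is typically an exponential moving average (EMA) of the
# context encoder, updated after each optimizer step:
# for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
#     p_t.data.mul_(momentum).add_(p_c.data, alpha=1 - momentum)
```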
paper / ylecun / Dec 27
Graph ViT/MLP-Mixer adapts ViT/MLP-Mixer architectures to graphs, replacing local message-passing with global token mixing to capture long-range dependencies and mitigate over-squashing. It achieves linear complexity in nodes and edges, outperforming Graph Transformers in speed and memory while distinguishing 3-WL non-isomorphic graphs. Empirical results on 4 simulated datasets and 7 real-world benchmarks confirm its competitiveness.
graph-neural-networks, graph-vit, mlp-mixer, long-range-dependencies, over-squashing, graph-representation
“Graph ViT/MLP-Mixer captures long-range dependencies and mitigates over-squashing”
youtube / ylecun / Dec 11
Yann LeCun and Randall discuss papers linking self-supervised learning (SSL) to spectral embeddings, showing SSL surrogate tasks aid supervised learning when their similarity graph matrices share spectral properties with the supervised adjacency matrix. Data augmentation in the infinite limit yields analytical effects equivalent to infinite samples, while non-contrastive joint embedding architectures outperform contrastive methods by avoiding dimensional collapse, with a mathematical duality between them via Z Z^T vs Z^T Z. LeCun advocates multi-criteria SSL like prediction and slow feature analysis over RL, using differentiable surrogate costs for efficient planning in hierarchical action spaces.
self-supervised-learning, data-augmentation, neural-networks, spectral-embedding, deep-learning, yann-lecun, randall-howard
“Unitary neural networks, with weight matrices constrained to unitary form, prevent exploding/vanishing gradients and model quantum computation.”
paper / ylecun / Nov 20
JEPA methods using VICReg and SimCLR objectives match or exceed pixel reconstruction baselines for learning dot location representations in offline settings with timestep-varying distractor noise. However, they fail when noise is fixed across frames. A theoretical analysis reveals this limitation stems from JEPA's focus on invariant features rather than slow-changing dynamics.
jepa, joint-embedding-predictive-architectures, vicreg, simclr, self-supervised-learning, world-model, representation-learning
“JEPA methods perform on par or better than generative reconstruction when distractor noise changes every timestep”
paper / ylecun / Nov 2
POLICE introduces the first provably optimal method to enforce affine constraints on DNN outputs over a specified input region without altering optimization or requiring sampling. It integrates via minimal forward-pass modifications, enabling standard gradient descent on parameters while guaranteeing constraint satisfaction throughout training and testing. This addresses modularity limitations in incorporating a priori knowledge or physical properties into DNNs.
deep-neural-networks, constraint-enforcement, provable-methods, affine-constraints, machine-learning, yann-lecun, arxiv-paper
“POLICE is the first provable affine constraint enforcement method for DNNs”
paper / ylecun / Oct 30
The paper extends the CTRL framework to unsupervised learning via a constrained maximin game on a rate reduction objective, expanding features across samples while compressing augmentations per sample. This process induces discriminative low-dimensional structures in the representations. These unified representations achieve near SOTA unsupervised classification performance and superior conditional image generation quality under matched conditions.
unsupervised-learning, structured-representations, closed-loop-transcription, ctrl-framework, rate-reduction, computer-vision, yann-lecun
“Unsupervised CTRL learns a single representation usable for both discriminative and generative tasks”
blog / ylecun / Oct 28
Decoupled Contrastive Learning (DCL) optimizes self-supervised learning by eliminating the negative-positive coupling (NPC) effect found in the InfoNCE loss denominator. By removing the positive pair from that denominator, DCL reduces the dependency on massive batch sizes, momentum encoders, and extended training epochs. This change to the loss yields higher accuracy on ImageNet-1K benchmarks and improves robustness to suboptimal hyperparameter choices.
contrastive-learning, self-supervised-learning, computer-vision, deep-learning, image-recognition, machine-learning-efficiency, ai-research
“The InfoNCE loss contains a negative-positive-coupling (NPC) effect that impairs learning efficiency relative to batch size.”
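A small sketch contrasting standard InfoNCE with a decoupled variant that drops the positive pair from the denominator, in the spirit of the summary above; shapes, the single cross-view negative set, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_and_dcl(z1, z2, temperature=0.1):
    # z1, z2: (N, D) embeddings of two views; row i of z1 and z2 form a positive pair.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature          # (N, N) cross-view similarities
    pos = torch.diagonal(logits)              # positive-pair logits

    # InfoNCE: the positive appears in both numerator and denominator (NPC effect).
    infonce = (-pos + torch.logsumexp(logits, dim=1)).mean()

    # Decoupled: remove the positive term from the denominator before the log-sum-exp.
    eye = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    dcl = (-pos + torch.logsumexp(logits.masked_fill(eye, float("-inf")), dim=1)).mean()
    return infonce, dcl
```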
paper / ylecun / Oct 15
Neuroscience has driven AI progress; accelerating it requires investment in NeuroAI research. The embodied Turing test benchmarks AI animal models against real animals in sensorimotor interactions. This shifts focus from human-unique skills like language to evolutionarily conserved animal capabilities, providing a roadmap for future AI.
neuroai, artificial-intelligence, neuroscience, embodied-ai, turing-test, ai-research
“Neuroscience has long been an essential driver of progress in artificial intelligence.”
paper / ylecun / Oct 9
VoLTA introduces a vision-language transformer that achieves fine-grained region-level understanding, such as object detection and segmentation, using only image-caption pairs without expensive bounding box annotations. It employs graph optimal transport for weakly-supervised alignment between local image patches and text tokens, creating an explicit, self-normalized matching criterion. The model integrates multi-modal fusion deeply into uni-modal backbones during pre-training, eliminating dedicated fusion layers to reduce memory usage. Experiments show VoLTA matches or exceeds prior methods on fine- and coarse-grained tasks despite using fewer annotations.
vision-language-transformer, weakly-supervised-alignment, volta, computer-vision, multimodal-pretraining, local-feature-alignment, arxiv-paper
“VoLTA uses only image-caption data for pre-training, eliminating the need for high-resolution image-text box data.”
paper / ylecun / Oct 5
RankMe assesses Joint-Embedding Self-Supervised Learning (JE-SSL) representations using their effective rank as a simple, label-free indicator of downstream performance. This metric enables hyperparameter selection and quality evaluation across datasets without training or tuning. Extensive experiments show RankMe matches label-based validation with minimal performance loss.
self-supervised-learning, joint-embedding-ssl, representation-quality, effective-rank, hyperparameter-selection, yann-lecun, arxiv-paper
“RankMe uses effective rank to measure JE-SSL representation quality without labels”
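A minimal sketch of the effective-rank quantity RankMe is built on, i.e. the exponential of the entropy of the normalized singular values of the embedding matrix; shapes and the smoothing constant are illustrative.

```python
import torch

def rankme(embeddings, eps=1e-7):
    # embeddings: (N, D) representations computed on unlabeled data.
    s = torch.linalg.svdvals(embeddings)      # singular values
    p = s / (s.sum() + eps)                   # normalize into a distribution
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy)                 # effective rank, between 1 and min(N, D)
```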
paper / ylecun / Oct 4
VICRegL applies the VICReg variance-invariance-covariance regularization simultaneously to the global feature vectors and the local feature maps produced from two distorted views of an image in a dual-branch CNN. Local features are matched across views either by proximity in embedding space (smallest L2 distances) or by spatial position under the known geometric transformation between the views, and the criterion is applied to these matched pairs. This joint learning yields strong performance on detection/segmentation while preserving classification accuracy, addressing the global-local trade-off in self-supervised representation learning.
self-supervised-learning, vicregl, local-features, computer-vision, image-representation, yann-lecun
“Existing self-supervised methods specialize in either global features (optimal for classification) or local features (optimal for detection/segmentation).”
blog / ylecun / Oct 4
Most self-supervised methods for image representation learning focus on either global or local features, excelling in classification or detection/segmentation, respectively. VICRegL introduces a novel approach that simultaneously learns both global and local features. This method employs two identical convolutional network branches fed distorted versions of the same image, applying the VICReg criterion to both global and local feature vectors. This allows VICRegL to achieve strong performance across classification, detection, and segmentation tasks.
computer-vision, self-supervised-learning, feature-learning, image-recognition, deep-learning, representation-learning, object-detection
“VICRegL simultaneously learns both global and local features in self-supervised image representation.”
blog / ylecun / Feb 23
Yann LeCun proposes that human-level AI requires systems to learn "world models" for understanding environmental dynamics, a capability distinct from current data-intensive approaches. His suggested architecture includes six differentiable modules, with the Joint Embedding Predictive Architecture (JEPA) central to learning abstract world representations and handling prediction uncertainty. This framework aims to enable self-supervised learning for robust, adaptive AI.
meta-ai, yann-lecun, world-models, self-supervised-learning, human-level-ai, ai-architecture
“Current AI systems, particularly in autonomous driving, are far from human-level intelligence, requiring vast amounts of data and trials compared to human learning efficiency.”
youtube / ylecun / Jan 22
Self-supervised learning mimics how babies and animals acquire background knowledge—intuitive physics, object dynamics, and common sense—through passive observation, enabling rapid task learning like driving after minimal practice, unlike data-hungry supervised or reinforcement paradigms. It involves predicting future video frames, filling perceptual gaps, or continuing text sequences to build predictive world models that handle uncertainty via compressed latent distributions. Success in NLP via masked language modeling contrasts with vision's progress via non-contrastive methods like Barlow Twins, but video prediction remains unsolved; Yann LeCun posits this approach, integrated with gradient-based reasoning and hierarchical action planning, as AI's best path to cat-level intelligence using ~800M neurons.
self-supervised-learning, world-models, ai-learning-paradigms, yann-lecun, lex-fridman-podcast, machine-intelligence, predictive-coding
“Humans learn to drive in ~20 hours due to self-supervised background knowledge from observation, while RL needs millions of simulated trials.”
youtube / ylecun / Jan 4
In high-dimensional spaces, the probability of test points lying in the convex hull of training data approaches zero, rendering traditional low-dimensional interpolation intuitions irrelevant; all ML, including deep learning, operates in an extrapolative regime. Neural networks with ReLU activations partition input space into polyhedral ReLU cells via hyperplanes, performing input-specific affine transformations rather than smooth manifold learning. This piecewise linear view demystifies NNs, aligning them with classical methods like decision trees and SVMs, while emphasizing engineered inductive biases and feature transformations for effective generalization.
deep-learning, interpolation-extrapolation, high-dimensional-learning, curse-of-dimensionality, neural-network-theory, yann-lecun, machine-learning-theory
“In high-dimensional spaces (e.g., 256x256 images with ~200k dimensions), even a million training samples cover a tiny sliver, making every new input an extrapolation outside the convex hull.”
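A quick numerical illustration of the convex-hull claim under illustrative assumptions (Gaussian data, 1,000 training points, 50 test points): membership is checked by solving a small linear program, and the fraction of test points inside the hull drops toward zero as the dimension grows.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, train):
    # point is in the hull iff some w >= 0 with sum(w) = 1 satisfies train.T @ w = point.
    n = train.shape[0]
    A_eq = np.vstack([train.T, np.ones((1, n))])
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

rng = np.random.default_rng(0)
for d in (2, 20, 200):
    train = rng.standard_normal((1000, d))
    inside = sum(in_convex_hull(rng.standard_normal(d), train) for _ in range(50))
    print(f"d={d}: {inside}/50 new points fall inside the training hull")
```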
youtube / ylecun / Dec 1
Yann LeCun discusses the historical development of deep learning frameworks, including his early work on HLM, SN, and Lush, which laid foundational concepts for modern systems like PyTorch. He emphasizes the critical need for advancements in self-supervised learning, reasoning, and action planning to achieve more intelligent AI. LeCun also advocates for an interdisciplinary approach to AI education and predicts another major AI revolution driven by self-supervised learning and leading to advanced virtual assistants and robotics.
ai-history, deep-learning-research, pytorch, machine-learning-frameworks, self-supervised-learning, ai-future, robotics
“Early deep learning frameworks developed by Yann LeCun, such as HLM, SN, and Lush, contributed fundamental ideas to modern AI systems like PyTorch.”
blog / ylecun / Oct 12
This paper introduces a strategy to facilitate creative inspiration from deep generative models, specifically GANs, by optimizing latent parameters. It addresses the tediousness of extracting useful generations from these models by proposing an optimization method that finds latent parameters corresponding to the closest generation to a user-provided inspirational image. The research explores various optimization techniques, including gradient descent and gradient-free optimizers, to enhance usability and control over generated outputs for creators.
image-generation, generative-ai, gans, computer-vision, latent-space-optimization, human-centered-ai
“A simple optimization method can find optimal latent parameters in GANs to match inspirational images.”
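A minimal sketch of the latent-optimization idea under stated assumptions: an arbitrary pretrained generator, MSE as the image distance, and Adam as the optimizer. The paper also explores gradient-free optimizers, which are not shown here.

```python
import torch
import torch.nn.functional as F

def match_inspiration(generator, target_image, latent_dim=512, steps=300, lr=0.05):
    # generator: maps a (1, latent_dim) latent to an image shaped like target_image.
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(generator(z), target_image)
        loss.backward()
        opt.step()
    return z.detach()  # latent whose generation is closest (in MSE) to the inspiration
```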
blog / ylecun / Mar 4
Self-supervised learning (SSL) is critical for advancing AI beyond the limitations of supervised learning, particularly in developing generalist models with common sense. Unlike supervised methods requiring extensive labeled data, SSL extracts supervisory signals directly from raw data, enabling models to learn more nuanced representations of reality. This approach, especially through energy-based models and joint embedding architectures, holds promise for bridging the gap to human-level intelligence by allowing AI to acquire generalized background knowledge.
self-supervised-learning, computer-vision, natural-language-processing, ai-research-trends, energy-based-models, latent-variable-models
“Supervised learning is a bottleneck for developing intelligent generalist AI models due to its reliance on massive amounts of labeled data, which is often impractical or impossible to obtain for all tasks.”
blog / ylecun / Oct 22
The Implicit Rank-Minimizing Autoencoder (IRMAE) is a novel autoencoder architecture that achieves compact latent representations by implicitly minimizing the rank of the latent code's covariance matrix. This is accomplished by strategically inserting additional linear layers between the encoder and decoder, leveraging the property of gradient descent that leads to minimum-rank solutions in multi-layer linear networks. The model is characterized by its simplicity, determinism, and effectiveness in learning low-dimensional latent spaces for tasks such as image generation and representation learning.
autoencoders, representation-learning, neural-networks, latent-spaces, computer-vision
“IRMAE minimizes the information capacity of the latent representation by implicitly minimizing the rank of the covariance matrix of the codes.”
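A minimal sketch of the IRMAE construction, assuming generic encoder/decoder modules and illustrative sizes: the only addition to a plain autoencoder is a stack of bias-free linear layers between encoder and decoder.

```python
import torch.nn as nn

class IRMAE(nn.Module):
    """Autoencoder with extra bias-free linear layers between encoder and decoder."""
    def __init__(self, encoder, decoder, latent_dim=128, num_linear=4):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        # Deep linear chain: gradient descent on it implicitly biases the latent
        # code toward low rank (the covariance of the codes becomes low-rank).
        self.linear_chain = nn.Sequential(
            *[nn.Linear(latent_dim, latent_dim, bias=False) for _ in range(num_linear)]
        )

    def forward(self, x):
        z = self.linear_chain(self.encoder(x))
        return self.decoder(z), z
```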
blog / ylecun / Jul 1
This paper introduces a novel hierarchical loss metric designed to penalize classification errors proportionally to the semantic distance between classes, utilizing an ultrametric tree structure. The core finding reveals that while this hierarchical loss offers a more semantically meaningful evaluation metric, direct minimization via standard stochastic gradient descent with random initialization does not reliably outperform cross-entropy loss minimization in achieving hierarchical classification objectives. Therefore, its primary utility appears to be as a robust evaluation metric rather than an optimizable objective function.
hierarchical-classification, neural-networks, loss-functions, machine-learning, computer-vision, nlp-research
“Existing classification metrics in neural networks fail to leverage a-priori hierarchical information.”
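An illustrative hierarchical penalty in the spirit described above (not the paper's exact formulation): each class's predicted probability is weighted by its ultrametric tree distance to the true class, so semantically close mistakes cost less than distant ones.

```python
import torch
import torch.nn.functional as F

def hierarchical_penalty(logits, target, tree_dist):
    # logits: (N, C); target: (N,) true class indices;
    # tree_dist: (C, C) ultrametric distances between classes (0 on the diagonal).
    probs = F.softmax(logits, dim=1)
    # Expected tree distance of the prediction to the true class.
    return (probs * tree_dist[target]).sum(dim=1).mean()
```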
youtube / ylecun / Aug 31
Yann LeCun likens AI value alignment to the millennia-old practice of shaping human objectives through laws and constraints, and treats HAL 9000-style misalignment as solvable through hardwired ethical rules akin to the Hippocratic Oath. Deep learning succeeds empirically despite textbook warnings thanks to biological inspiration from brains, emphasizing gradient-based learning over discrete logic for reasoning, which requires working memory, recurrence, and energy-based planning. Self-supervised learning via predictive world models enables rapid, sample-efficient intelligence like that of human babies, forming the foundation for model-based RL, causal inference, and autonomous systems grounded in reality rather than pure language.
yann-lecun, deep-learning, ai-safety, neural-networks, self-supervised-learning, agi-debate, ai-alignment
“AI value misalignment is not a novel problem but a continuation of designing human objective functions through laws and education.”
blog / ylecun / May 15
Traditional deep transfer learning primarily focuses on transferring unary feature vectors. This research explores learning and transferring latent relational graphs that capture dependencies between data units (e.g., words, pixels) from unlabeled data. This approach demonstrates improved performance across various downstream tasks, including natural language processing and image classification, and is transferable to different embedding types and even embedding-free units.
deep-learning, transfer-learning, graph-neural-networks, nlp, computer-vision, representation-learning
“Modern deep transfer learning approaches mainly transfer unary feature vectors.”
blog / ylecun / Sep 10 / failed
Predicting future instance segmentation is achieved by forecasting fixed-size convolutional features from Mask R-CNN rather than RGB frames or semantic maps. The detection head of Mask R-CNN is applied to these predicted features to generate instance masks for future frames, handling variable object counts efficiently. This method significantly surpasses baselines based on optical flow and adapted segmentation architectures.
video-forecasting, instance-segmentation, mask-r-cnn, future-prediction, computer-vision, yann-lecun, eccv-paper
“Forecasting at the semantic level is more effective than forecasting RGB frames followed by segmentation for semantic segmentation of future frames.”