Chronological feed of everything captured from Yann LeCun.
Current Large Language Models (LLMs) achieve superhuman symbol manipulation but lack the grounded understanding of physical reality necessary for general intelligence. While LLMs rely on massive datasets (trillions of tokens) to predict discrete symbols, true intelligence requires architectures capable of learning abstract representations from high-dimensional, continuous sensory data—akin to how animals learn intuitive physics with extreme sample efficiency.
VL-JEPA is a new vision-language model leveraging a Joint Embedding Predictive Architecture. It predicts continuous embeddings of target texts, allowing it to focus on task-relevant semantics while abstracting away surface-level linguistic variability. This approach leads to stronger performance with 50% fewer trainable parameters compared to classical VLMs, while also supporting selective decoding and a versatile embedding space for various tasks.
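As a rough illustration of why predicting continuous embeddings can abstract away surface wording, the sketch below scores a predicted embedding against target embeddings with a cosine distance. The loss choice, the vectors, and all names are assumptions made for illustration; the summary above does not state which objective VL-JEPA uses.

```python
import math

def cosine_distance(pred, target):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)

# Hypothetical text embeddings (made-up numbers, not model outputs).
target_a = [0.90, 0.10, 0.00]    # e.g. "a dog runs"
target_b = [0.85, 0.15, 0.05]    # e.g. "a dog is running" (paraphrase)
unrelated = [0.00, 0.20, 0.90]   # e.g. an unrelated caption
predicted = [0.88, 0.12, 0.02]   # what a predictor might output

loss_a = cosine_distance(predicted, target_a)
loss_b = cosine_distance(predicted, target_b)
loss_unrelated = cosine_distance(predicted, unrelated)

# Paraphrases sit close together in embedding space, so both incur a small
# loss, whereas a token-level loss would penalize the differing words.
print(loss_a, loss_b, loss_unrelated)
```

In this toy setting the two paraphrases receive nearly identical, near-zero losses, while the unrelated target is penalized heavily.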
World models, when combined with model predictive control (MPC), enable generalization in robotic planning tasks. While gradient-based planning offers a computationally efficient alternative to traditional MPC, its performance has lagged. This work identifies a train-test gap where world models trained on next-state prediction are used for action sequence estimation at inference. The proposed data synthesis techniques significantly improve gradient-based planning for existing world models.
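To show what gradient-based planning means in the simplest case, here is a minimal sketch that optimizes an action sequence by gradient descent through a toy world model. The one-dimensional dynamics `s_next = s + a`, the quadratic cost, and the hand-derived gradient are all assumptions for illustration; a real system would differentiate through a learned model with automatic differentiation, and this sketch does not reproduce the paper's data synthesis techniques.

```python
def rollout(state, actions):
    """Roll the stand-in world model forward: s_next = s + a (assumed dynamics)."""
    for a in actions:
        state = state + a
    return state

def plan(start, goal, horizon=5, steps=200, lr=0.05, reg=0.01):
    """Optimize an action sequence by gradient descent on a planning cost.

    cost = (s_T - goal)^2 + reg * sum(a_t^2)
    For the additive dynamics above, d cost / d a_t = 2*(s_T - goal) + 2*reg*a_t.
    """
    actions = [0.0] * horizon
    for _ in range(steps):
        err = rollout(start, actions) - goal
        actions = [a - lr * (2.0 * err + 2.0 * reg * a) for a in actions]
    return actions

actions = plan(start=0.0, goal=1.0)
final_state = rollout(0.0, actions)
print(actions, final_state)
```

Each iteration needs one rollout and one gradient, rather than the many sampled rollouts that sampling-based MPC typically evaluates. The small action penalty means the plan stops slightly short of the goal.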
A novel two-stage self-supervised framework combines JEPA and Density Adaptive Attention Mechanism (DAAM) to learn robust speech representations. It decouples semantic audio feature learning from waveform reconstruction, employing masked prediction in latent space. The model achieves efficient and reversible speech tokenization with a low frame rate, outperforming existing neural audio codecs.
Video generation models show promise for high-level robot planning, but direct imitation is hindered by noise and morphological distortions in generated videos. A two-stage pipeline is proposed to address this: first, video pixels are lifted into a 4D human representation and retargeted to a humanoid morphology. Second, a physics-aware reinforcement learning policy, GenMimic, is introduced to enable robots to mimic human actions from noisy, generated videos in a zero-shot manner.
LeJEPA offers a novel, theoretically grounded approach to Joint-Embedding Predictive Architectures (JEPAs) for self-supervised learning. By introducing Sketched Isotropic Gaussian Regularization (SIGReg), LeJEPA optimizes embedding distributions to minimize prediction risk. This framework eliminates traditional heuristics and offers linear complexity, stability across diverse architectures and domains, and efficient distributed training, making it well suited to scalable AI research.
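The idea of a sketched isotropic Gaussian regularizer can be approximated as follows: project the embeddings onto a few random directions and penalize each one-dimensional projection for deviating from a standard normal. This is a simplified stand-in that matches only the mean and variance; the summary above does not specify the statistic LeJEPA uses, and every name below is hypothetical.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sketched_gaussian_penalty(embeddings, num_directions=16, seed=0):
    """Average deviation of random 1D projections from N(0, 1).

    Assumption: only the first two moments are matched, which is weaker
    than a full test of Gaussianity.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    n = len(embeddings)
    total = 0.0
    for _ in range(num_directions):
        d = random_unit_vector(dim, rng)
        proj = [sum(e_i * d_i for e_i, d_i in zip(e, d)) for e in embeddings]
        mean = sum(proj) / n
        var = sum((p - mean) ** 2 for p in proj) / n
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_directions

data_rng = random.Random(1)
gaussian = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2000)]
collapsed = [[0.5] * 8 for _ in range(2000)]  # every embedding identical

penalty_gaussian = sketched_gaussian_penalty(gaussian)
penalty_collapsed = sketched_gaussian_penalty(collapsed)
print(penalty_gaussian, penalty_collapsed)
```

For a fixed number of directions, the cost of this sketch grows linearly with the number of samples and the embedding dimension. Collapsed embeddings have zero variance along every direction, so they are penalized heavily.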
The Cambrian-S research argues that true multimodal intelligence requires 'spatial supersensing'—moving beyond task-driven perception toward predictive world modeling. While data scaling provides marginal gains, the authors demonstrate that self-supervised predictive sensing (using prediction error for event segmentation) is necessary to solve long-horizon visual spatial recall and counting tasks.
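Using prediction error for event segmentation can be illustrated with a toy one-dimensional signal: a new event is declared whenever the observation departs sharply from what a predictor expected. The "nothing changes" predictor, the threshold, and the signal are assumptions for illustration, standing in for a learned predictive model over video.

```python
def segment_by_surprise(observations, threshold):
    """Return the indices at which a new event is judged to start."""
    boundaries = [0]  # the first observation always opens an event
    for t in range(1, len(observations)):
        predicted = observations[t - 1]  # stand-in predictor: "nothing changes"
        error = abs(observations[t] - predicted)
        if error > threshold:  # large surprise marks an event boundary
            boundaries.append(t)
    return boundaries

signal = [1.0, 1.1, 0.9, 5.0, 5.2, 5.1, 0.0, 0.1]
boundaries = segment_by_surprise(signal, threshold=2.0)
print(boundaries)  # [0, 3, 6]
```

Small fluctuations within an event stay below the threshold, so only the two large jumps open new segments.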
This paper establishes a novel connection between attention sinks and compression valleys in large language models (LLMs), attributing both phenomena to the formation of massive activations in the residual stream. A theoretical framework is provided, demonstrating how these massive activations lead to representational compression and entropy reduction. Experimental validation across various LLMs supports this unified view, leading to the "Mix-Compress-Refine" theory of information flow within Transformers, which posits distinct computational phases across layers.
Joint Embedding Predictive Architectures (JEPAs) utilize an anti-collapse term, typically considered for representation diversity. This research reveals that this term also implicitly estimates data density, a previously unrecognized capability. This density estimation is provable across datasets and architectures, enabling applications like data curation and outlier detection.
Yann LeCun clarifies the historical timeline of the MNIST dataset, stating it was introduced five years after a previously referenced event or dataset. This correction is a direct response to a social media query regarding the dataset's debut.
A video shared on a social media platform was incorrectly dated by the uploader, who attributed it to a much later year than its actual creation date. This highlights potential inaccuracies in content metadata or user-provided context on social media platforms.
Accelerating to relativistic speeds slows a traveler's personal time, but the external world's time correspondingly runs faster relative to the traveler. The effect is scientifically accurate yet self-defeating for anyone hoping to gain time: whatever is gained locally is lost relative to everyone else, which makes it an impractical solution to general time management challenges.
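The size of the effect follows from the Lorentz factor of special relativity, gamma = 1 / sqrt(1 - v^2/c^2). The sketch below works through one example, assuming a round trip at constant speed and ignoring the acceleration phases; the speed and duration are illustrative numbers, not taken from the post.

```python
import math

def lorentz_factor(beta):
    """Lorentz factor for speed v = beta * c, with 0 <= beta < 1."""
    return 1.0 / math.sqrt(1.0 - beta * beta)

beta = 0.8                # illustrative speed: 80% of the speed of light
traveler_years = 3.0      # time elapsed on the traveler's own clock
gamma = lorentz_factor(beta)
external_years = gamma * traveler_years  # time elapsed for those who stayed behind

print(round(gamma, 4), round(external_years, 4))
```

At 80% of light speed the factor is 5/3, so three years for the traveler correspond to five years for everyone else: the traveler's deadlines arrive sooner, not later, in personal time.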
Yann LeCun, a prominent AI researcher at Meta, succinctly rejected an unspecified claim. This interaction suggests that even high-profile figures in AI are directly engaging with and refuting misinformation or overblown assertions about the field.
Due to an inaccessible or deleted X post, no content could be retrieved for analysis. This entirely prevented the extraction of specific claims or insights from the intended source. The inability to access the original source material fundamentally limits knowledge compilation efforts.
Yann LeCun asserts that the Joint Embedding Predictive Architecture (JEPA) remains the primary objective for AI development, contrasting it with the Code World Model (CWM), which he characterizes as a token-generative baseline. This suggests a strategic focus on architectures capable of learning representations by predicting missing information in a joint embedding space, rather than solely relying on sequential token generation.
Yann LeCun asserts that current "coding" activities, likely referring to large language model capabilities, do not equate to Artificial General Intelligence (AGI). This distinction highlights a fundamental difference between pattern recognition/generation and genuine intelligent reasoning or understanding. His statement implies that advanced coding abilities are a narrow application of AI rather than a sufficient indicator of human-level intelligence, a claim with significant implications for the ongoing debate on AI progress and terminology.
A Code World Model operates by simulating the execution outcomes of instructions to iteratively plan and generate code. This approach shifts code generation from token prediction to a goal-oriented process based on imagined execution effects.
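A toy version of planning over imagined execution effects is sketched below: candidate instructions are simulated on a state, and the search keeps only programs whose simulated outcome moves toward the goal state. The two-instruction language, the exact simulator, and the breadth-first search are assumptions for illustration. In the system described above the world model is learned, and nothing here reflects its actual architecture.

```python
from collections import deque

# Hypothetical instruction set: each instruction maps an integer register
# to a new value. An exact simulator stands in for a learned world model.
INSTRUCTIONS = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
}

def plan_program(start, goal, max_len=8):
    """Breadth-first search over programs, guided by simulated execution."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        value, program = queue.popleft()
        if value == goal:
            return program
        if len(program) == max_len:
            continue
        for name, effect in INSTRUCTIONS.items():
            outcome = effect(value)  # imagined execution outcome
            if outcome not in seen:
                seen.add(outcome)
                queue.append((outcome, program + [name]))
    return None  # no program within max_len reaches the goal

program = plan_program(start=1, goal=10)
print(program)
```

The program is chosen by the state it is predicted to produce rather than by the likelihood of its tokens, which is the shift the entry above describes.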
This tweet from Yann LeCun, a prominent AI researcher, consists solely of emojis in response to another user. It indicates social interaction on a non-technical topic, highlighting the personal use of social media among public figures in the AI community rather than disseminating technical insights or research. The lack of textual content makes it impossible to extract any technical claims or insights from this specific interaction.
Current Large Language Model (LLM) training relies on input-space reconstruction, a method found to be suboptimal in computer vision. This research introduces LLM-JEPA, a Joint Embedding Predictive Architecture for LLMs, aiming to adapt the more effective embedding-space training used in vision to language models. Initial findings indicate that LLM-JEPA significantly outperforms standard LLM training and demonstrates robustness against overfitting.
Yann LeCun traces the arc of neural network research from its academic fringe origins in the 1980s to its current dominance, crediting deliberate rebranding as "deep learning" and a coordinated infiltration of industry speech recognition pipelines as pivotal turning points. He argues the real AI competition is not geopolitical (US vs. China vs. Europe) but structural — open-source vs. proprietary — and points to Meta's Llama as evidence that open research ecosystems outpace secretive ones. On AI safety, LeCun rejects the "uncontrollable superintelligence" narrative, framing alignment as a tractable engineering problem solvable through objective-driven architectures with enforced guardrails, not existential dread.
DINO-world proposes using DINOv2's frozen image encoder as the representational backbone for a video world model, training a future predictor in latent space on large-scale uncurated video data. This approach sidesteps the need for pixel-level prediction, instead operating in a semantically rich feature space that generalizes across diverse scene types — driving, indoor, and simulated environments. The model can be fine-tuned on action-observation trajectories to become action-conditioned, enabling latent-space trajectory simulation for planning. It outperforms prior models on video prediction benchmarks including segmentation and depth forecasting, and shows emergent understanding of intuitive physics.
In a terse post, Yann LeCun asserts that Artificial Superintelligence (ASI), not Artificial General Intelligence (AGI), has always been his north star. The distinction signals a rejection of AGI as a meaningful or sufficient milestone, implying LeCun views human-level general intelligence as an intermediate or inadequate benchmark. The brevity and conviction of the statement ("Always have") suggest this is a long-held philosophical position rather than a reactive take.
The AI talent competition is increasingly being shaped not just by compensation, but by a cultural divide between open and closed research philosophies. Meta has emerged as the standard-bearer for open science and open source AI in the U.S. — via publications and Llama — while OpenAI and Anthropic remain notably closed. Yann LeCun's endorsement of this framing signals that the migration of top talent to Meta may reflect a values-driven realignment post-DeepSeek, with implications for innovation velocity, scientific transparency, and AI safety.
Yann LeCun critically assesses the notion that 'compressibility' alone is a sufficient or unbiased metric for understanding real-world sequences. He suggests that the observed ease with which models compress select real-world data might be an artifact of how these sequences are chosen and represented, rather than an inherent, universal property. LeCun implies a potential bias in the selection of data that appears compressible, which could distort our understanding of true data complexity.
Yann LeCun, a prominent AI researcher, has acknowledged a perceived level of observation or scrutiny related to his activities. This suggests a growing awareness or concern within the AI community regarding privacy or monitoring. The nature and extent of this observation remain unspecified.
Yann LeCun comments on the inherent comprehensibility of the world, echoing Einstein's sentiment. He suggests that this comprehensibility is facilitated by brains possessing an inductive bias, enabling them to understand their localized environment. This implies a fundamental relationship between cognitive architecture and the nature of reality's perceived order.
Yann LeCun, a prominent AI researcher, expresses strong skepticism regarding the potential of Large Language Models (LLMs) to achieve Artificial General Intelligence (AGI) or Artificial Superintelligence (ASI). He sarcastically suggests directing inquiries about LLMs leading to advanced AI to the Chief AI Officer of his employer, highlighting his dismissive stance on the topic. The interaction indicates a divergence of opinions within the AI community concerning the architectural pathways to achieving AGI.
Yann LeCun, Chief AI Scientist at Meta, has maintained the same position since 2018. This indicates stability in his role within the company's AI leadership over a multi-year period. The duration suggests a consistent strategic direction or a sustained focus on his contributions in this capacity.
Yann LeCun, a prominent AI researcher at FAIR, distinguishes between Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI). He contends that AGI, defined as 'human-level AI,' is a poorly conceived notion due to the specialized nature of human intelligence. Conversely, LeCun asserts that ASI represents a valid and consistent long-term objective for AI development, a goal he and FAIR have consistently pursued.
PEVA (Predict Ego-centric Video from Actions) trains an autoregressive conditional diffusion transformer to simulate first-person visual consequences of human body actions, conditioned on relative 3D kinematic pose trajectories structured by joint hierarchy. The model is trained on Nymeria, a large-scale real-world egocentric video + body pose dataset, grounding predictions in physically realistic human motion. A hierarchical evaluation protocol stress-tests the model across increasingly complex embodied prediction and control tasks. This represents a direct push toward world models that understand how physical human actions causally shape perceived environments — a key milestone for embodied AI.
V-JEPA 2 is a self-supervised learning framework that leverages massive internet-scale video data alongside limited interaction data to enable robust understanding, prediction, and planning capabilities in AI. The architecture demonstrates strong performance across diverse tasks, from motion understanding and human action anticipation to multimodal video question-answering. Crucially, its extension, V-JEPA 2-AC, facilitates zero-shot robotic planning, showcasing the potential for real-world physical world interaction without extensive task-specific training.
V-JEPA 2 is a self-supervised joint-embedding-predictive architecture pre-trained on 1M+ hours of internet video to master physical world understanding and prediction. By post-training as an action-conditioned world model (V-JEPA 2-AC) with minimal robot interaction data, it enables zero-shot robotic planning and manipulation. The model further demonstrates high-tier video-language alignment capabilities when integrated with an 8B parameter LLM.
OSVI-WM addresses a critical gap in one-shot visual imitation learning: generalizing to unseen tasks that are visually similar to training tasks but require semantically distinct responses. The framework uses a learned world model to predict latent state-action trajectories from a single expert video and the agent's current observation, then decodes these into physical waypoints for execution — bypassing the need for fine-tuning. Evaluated across two simulated benchmarks and three real-world robotic platforms, it consistently outperforms prior methods, with gains exceeding 30% in some cases.
Using an Information Bottleneck framework across 40+ LLMs, this paper by Shani et al. (incl. LeCun & Jurafsky) finds that while LLMs broadly match human category boundaries, they aggressively over-compress representations — achieving near-optimal information-theoretic efficiency at the expense of the contextual nuance humans deliberately preserve. Encoder models outperform larger decoder models in alignment with human conceptual structure, implying that language understanding and generation recruit fundamentally different representational mechanisms. Training dynamics follow a two-phase pattern: rapid concept formation, then architectural reorganization where semantic processing migrates from deep to mid-network layers as encodings become sparser. The core implication is that human-like understanding may require models that intentionally retain representational "inefficiencies."
Yann LeCun, Turing Award laureate and Meta's Chief AI Scientist, discusses the evolution of deep learning, highlighting the critical need for AI systems to move beyond current large language models (LLMs) to achieve true intelligence. He advocates for architectures capable of understanding the physical world, planning, and reasoning with abstract mental representations, moving towards what he terms "Advanced Machine Intelligence" (AMI) rather than Artificial General Intelligence (AGI). LeCun emphasizes that breakthroughs in understanding the physical world will render current LLMs obsolete within five years, urging academia to focus on these next-generation AI systems due to the massive industry investment in LLMs.
Yann LeCun, Meta's Chief AI Scientist, argues that current AI, particularly large language models (LLMs), lack genuine intelligence due to limitations in reasoning, planning, and understanding the physical world. He advocates for open-source AI development to promote diversity and counter proprietary biases, and emphasizes self-supervised learning with novel architectures like Joint-Embedding Predictive Architecture (JEPA) as critical for future breakthroughs in AI.
Yann LeCun likens AI value alignment to millennia-old human legal systems, which shape objectives via constraints, and dismisses HAL 9000-style misalignment as solvable through hardwired ethical rules akin to the Hippocratic Oath. He argues that deep learning succeeds empirically, despite textbook warnings, because of its biological inspiration from brains, and that reasoning should build on gradient-based learning rather than discrete logic. Such reasoning, in his view, requires working memory, recurrence, and energy-based planning. Self-supervised learning via predictive world models enables rapid, sample-efficient learning like that of human babies, forming the foundation for model-based RL, causal inference, and autonomous systems grounded in reality rather than pure language.
Traditional deep transfer learning primarily focuses on transferring unary feature vectors. This research explores learning and transferring latent relational graphs that capture dependencies between data units (e.g., words, pixels) from unlabeled data. This approach demonstrates improved performance across various downstream tasks, including natural language processing and image classification, and is transferable to different embedding types and even embedding-free units.
This method predicts future instance segmentation by forecasting fixed-size convolutional features from Mask R-CNN rather than RGB frames or semantic maps. The detection head of Mask R-CNN is applied to these predicted features to generate instance masks for future frames, handling variable object counts efficiently. The approach significantly surpasses baselines using optical flow and adapted segmentation architectures.