JEPA Architecture for AI Advancement
Yann LeCun
Chronological feed of everything captured from Yann LeCun.
Yann LeCun posits that thinking primarily involves manipulating mental models in an abstract, continuous representation space, rather than relying on language. This suggests that while language models may benefit specific applications like coding and mathematics where language aids reasoning, their utility for general, abstract reasoning is inherently limited by their linguistic nature.
Yann LeCun, a prominent figure in AI, expresses strong disapproval of unspecified "BS" on his X (formerly Twitter) feed. This brief but forceful statement suggests a perceived prevalence of misinformation or low-quality content within the AI/tech discourse, though the specific targets or nature of this "BS" are not detailed. The core insight is the existence of significant, unclarified contention from an influential voice.
This analysis investigates the ability of AI models to interpret and contextualize humor, specifically focusing on the use of "😂😂😂" in social media. The core insight revolves around the limitations of current AI in discerning nuanced human communication and emotional expression. The brevity of the content makes it a challenging case for robust knowledge extraction.
Proposed US federal budget cuts under the Trump administration target critical agencies including NASA, NIH, and the NSF. Specifically, the total removal of the NSF's social, economic, and behavioral sciences directorate threatens to dismantle key pillars of the US scientific infrastructure. These measures are viewed by the scientific community as a systemic risk to global research leadership.
Model Predictive Control (MPC) with learned world models struggles with long-horizon tasks due to error accumulation and large search spaces. This work proposes hierarchical planning using latent world models at multiple temporal scales. This approach reduces inference-time complexity and enables long-horizon reasoning, improving zero-shot control capabilities.
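The reduction in search cost from planning at two temporal scales can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the integer-line dynamics in `rollout` stand in for a learned latent world model, the evenly spaced waypoints stand in for a coarse-scale planner, and the function names are hypothetical rather than taken from the work summarized above.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # toy action set (assumption)

def rollout(state, actions):
    """Toy stand-in for a world model: each action shifts the state by its value."""
    for a in actions:
        state += a
    return state

def low_level_plan(state, subgoal, horizon):
    """Exhaustive short-horizon search over len(ACTIONS)**horizon candidates."""
    return min(product(ACTIONS, repeat=horizon),
               key=lambda seq: abs(rollout(state, seq) - subgoal))

def hierarchical_plan(start, goal, segments, horizon):
    """Coarse level proposes waypoints; fine level searches a short horizon to each."""
    plan, state = [], start
    for i in range(1, segments + 1):
        subgoal = start + (goal - start) * i / segments  # coarse-scale waypoint
        seq = low_level_plan(state, subgoal, horizon)
        plan.extend(seq)
        state = rollout(state, seq)
    return plan, state

plan, final_state = hierarchical_plan(start=0, goal=8, segments=4, horizon=2)
flat_candidates = len(ACTIONS) ** 8               # one flat search over 8 steps
hierarchical_candidates = 4 * len(ACTIONS) ** 2   # four short searches of 2 steps
```

In this toy setting the hierarchical planner evaluates 36 candidate sequences where a flat 8-step exhaustive search would evaluate 6561, which is the kind of saving the summary attributes to multi-scale planning.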
Yann LeCun asserts that closed AI models unfairly profit from advancements made by open-source models without reciprocating contributions. This creates an imbalance where commercial entities leverage community efforts without giving back to the open AI ecosystem, which could stifle collaborative progress and innovation.
The Republican Party has created a new "America First Award" and presented it to Donald Trump. This move, celebrated by Speaker Mike Johnson, suggests a solidification of Trump's influence within the party and reinforces the perception of a cult of personality. The award’s presentation, described with opulent language, indicates a strategic effort to further elevate Trump.
Yann LeCun, a prominent AI researcher, commented "Tired of winning" on a post linking to an article about the US national debt reaching $39 trillion. The comment reads as ironic, though this is an interpretation and LeCun's intent could be multifaceted. It suggests a subtle critique of the economic implications of current policies, potentially hinting at a broader concern about unsustainable fiscal trends despite a superficial appearance of success. The "WELP" from the Tennessee Holler adds to the sardonic tone, implying a resigned acknowledgement of the situation.
Current AI models are limited in autonomous learning. This paper proposes a new architecture inspired by human and animal cognition, integrating observation-based learning (System A) and active behavior-based learning (System B), controlled by internal meta-control signals (System M). The framework aims to enable AI to adapt to dynamic, real-world environments across evolutionary and developmental timescales.
V-JEPA 2.1 is a self-supervised model that achieves state-of-the-art performance in dense visual understanding and world modeling for both images and videos. This is accomplished by integrating a dense predictive loss, deep self-supervision across encoder layers, multi-modal tokenizers, and effective scaling of model capacity and training data. The resulting representations are spatially structured, semantically coherent, and temporally consistent, demonstrating significant improvements across various benchmarks.
Yann LeCun humorously suggests that scientists earning more than professional athletes would be a positive development. This indicates a personal sentiment rather than a factual claim about current compensation or a policy proposal. The statement serves as an expression of an aspirational ideal for the recognition and reward of scientific contributions.
LeWorldModel (LeWM) introduces a streamlined Joint-Embedding Predictive Architecture (JEPA) that achieves stable end-to-end training from pixels by utilizing a simplified two-term loss function. By replacing complex stabilization methods with a Gaussian latent regularizer, it significantly reduces hyperparameter overhead and enables high-speed planning (up to 48x faster than foundation models) while maintaining physical grounding in its latent representations.
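The shape of a two-term objective like the one described above can be sketched as a latent prediction error plus a penalty that keeps latents close to a standard Gaussian. This is a minimal sketch under stated assumptions: the per-dimension moment-matching penalty below is a simple stand-in, not the regularizer used by LeWM, and all function names are hypothetical.

```python
def mean_squared_error(predicted, target):
    """Mean squared error over a batch of latent vectors."""
    total = sum((p - t) ** 2
                for pv, tv in zip(predicted, target)
                for p, t in zip(pv, tv))
    count = sum(len(pv) for pv in predicted)
    return total / count

def gaussian_moment_regularizer(latents):
    """Penalize per-dimension mean away from 0 and variance away from 1.

    Assumed stand-in for a Gaussian latent regularizer.
    """
    n, dims = len(latents), len(latents[0])
    penalty = 0.0
    for d in range(dims):
        column = [z[d] for z in latents]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / dims

def two_term_loss(predicted, target, latents, weight=1.0):
    """Prediction term plus weighted regularization term."""
    return (mean_squared_error(predicted, target)
            + weight * gaussian_moment_regularizer(latents))

spread = [[1.0, -1.0], [-1.0, 1.0]]      # zero mean, unit variance per dimension
collapsed = [[0.0, 0.0], [0.0, 0.0]]     # every input mapped to the same point
```

A collapsed encoder predicts itself perfectly, so the prediction term alone cannot detect it. The regularizer term is what makes `two_term_loss(collapsed, collapsed, collapsed)` nonzero while the spread-out batch scores zero.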
Current machine learning approaches for spatiotemporal physical systems primarily focus on next-frame prediction, which is computationally expensive and prone to compounding errors. This research proposes evaluating models on downstream scientific tasks, specifically the estimation of governing physical parameters, to better assess the physical relevance of learned representations. The study demonstrates that latent space learning methods, such as JEPAs, are more effective for these tasks than methods optimizing pixel-level prediction objectives, even outperforming some methods designed specifically for physical modeling.
Latent planning using world models benefits significantly from effective representation learning. While pre-trained visual encoders provide strong semantic features, they often include irrelevant information detrimental to planning. This work introduces "temporal straightening," a novel curvature regularization technique applied to latent trajectories. This method, inspired by human visual processing, aims to create locally straightened latent spaces where Euclidean distance more accurately reflects geodesic distance, thereby improving gradient-based planning stability and success rates in goal-reaching tasks.
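One common way to measure how straight a latent trajectory is, used here as an assumed stand-in for the paper's regularizer, is the angle between consecutive velocity vectors. The sketch below penalizes one minus their cosine similarity; the function name and the example trajectories are hypothetical.

```python
import math

def curvature_penalty(trajectory):
    """Mean (1 - cosine similarity) between consecutive latent velocity vectors."""
    # Velocity at step t is the difference between successive latents.
    velocities = [
        [b - a for a, b in zip(z0, z1)]
        for z0, z1 in zip(trajectory, trajectory[1:])
    ]
    penalties = []
    for v0, v1 in zip(velocities, velocities[1:]):
        n0 = math.sqrt(sum(x * x for x in v0))
        n1 = math.sqrt(sum(x * x for x in v1))
        if n0 == 0.0 or n1 == 0.0:
            continue  # direction is undefined for a zero-length step
        cosine = sum(x * y for x, y in zip(v0, v1)) / (n0 * n1)
        penalties.append(1.0 - cosine)
    return sum(penalties) / len(penalties) if penalties else 0.0

straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

The penalty is approximately zero for the straight trajectory and 1.0 for the one that turns 90 degrees at every step. On a straight trajectory, Euclidean distance between two latents equals the distance travelled along the path, which is the property the summary links to more stable gradient-based planning.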
Current AI systems, particularly large language models, are limited by their inability to understand the physical world, reason, plan, and possess persistent memory, leading to what Yann LeCun describes as "stupidity." LeCun advocates for the development of "world models" using self-supervised learning, enabling AI to learn abstract representations from sensory input, predict outcomes, and perform hierarchical planning. This approach is crucial for advancing AI capabilities beyond discrete symbolic reasoning to robust physical world interaction and robotic intelligence.
Yann LeCun has raised funding for his new AGI laboratory, AMI Labs, at a post-money valuation of over $4.5 billion. The lab will focus on developing world models, diverging from the current industry trend of large language models. This significant investment underscores a strong belief in LeCun's vision for advancing Artificial General Intelligence through alternative research paradigms.
Advanced Machine Intelligence (AMI Labs) has completed a €890 million ($1.03 billion) seed funding round, one of the largest ever, to develop a new generation of AI systems. The company's focus is on building universally intelligent systems incorporating world models, persistent memory, reasoning, planning, controllability, and safety. This substantial capital injection positions AMI Labs to aggressively pursue its foundational AI research and development across its global locations.
Transformer language models exhibit "massive activations" (extreme outliers in channels for a few tokens) and "attention sinks" (tokens attracting disproportionate attention). While often co-occurring, these phenomena serve distinct functions. Massive activations act globally as implicit model parameters, while attention sinks operate locally, biasing attention heads towards short-range dependencies. Their co-occurrence is an architectural artifact of pre-norm Transformer configurations.
The future of AI requires a unified, long-term vision for AI and hardware co-development, moving beyond fragmented approaches. This roadmap emphasizes scaling efficiency and achieving exponential gains in intelligence per joule, rather than solely focusing on compute consumption. It redefines scaling around energy efficiency, system-level integration, and cross-layer optimization to foster holistic and adaptive AI systems across diverse environments. The paper outlines a 10-year plan for addressing the challenges and opportunities in AI+HW co-design.
This paper introduces the Transfusion framework for multimodal pretraining, specifically designed to explore the design space for native multimodal models without prior language pretraining. It details a controlled experimental approach using next-token prediction for language and diffusion for vision, trained on diverse data including text, video, image-text pairs, and action-conditioned video. Key findings address optimal visual representations, data complementarity, world modeling capabilities, and efficient scaling through Mixture-of-Experts.
This interview with Yann LeCun traces his personal and professional journey through the field of machine learning, from early neural network research to the modern era of deep learning. LeCun details the historical ebb and flow of neural network popularity, emphasizing key technical advancements and offering a critical perspective on current methodologies. He advocates for a future centered on self-supervised learning and "world models" for more efficient and human-like AI.
This paper argues that the prevailing concept of Artificial General Intelligence (AGI) is a flawed and ill-defined goal for AI development. Instead, it proposes a new framework: Superhuman Adaptable Intelligence (SAI). SAI emphasizes specialization and superhuman performance in specific domains, aiming to exceed human capabilities and fill skill gaps. This shift in perspective provides a clearer, more actionable direction for future AI research and development.
Large Language Models (LLMs) traditionally adhere to scaling laws that dictate increasing data for improved performance. This work challenges these laws by introducing the Geodesic Hypothesis and a Semantic Tube Prediction (STP) task. STP, a JEPA-style regularizer, constrains hidden-state trajectories to a curved path, enhancing signal-to-noise ratio and diversity, ultimately leading to significant data efficiency gains.
Yann LeCun argues that current AI systems, particularly LLMs, are primarily advanced information retrieval systems, not truly intelligent entities, and criticizes the anthropomorphization of these systems. He emphasizes that real intelligence involves learning through observation and interaction to build mental models of the world, a capability largely absent in current AI. LeCun envisions AI as an amplifier of human intelligence, acting as a "staff" for individuals, and predicts gradual rather than abrupt advancement, noting that long-term technological shifts are often underestimated.
Self-supervised learning aims to maximize information in representations, but is limited by the curse of dimensionality. Radial-VCReg improves upon existing methods like VCReg by introducing a radial Gaussianization loss. This aligns feature norms with the Chi distribution, a characteristic of high-dimensional Gaussians, leading to more diverse and informative representations by reducing higher-order dependencies.
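The link between Gaussian features and the Chi distribution can be checked numerically: the norm of a d-dimensional standard Gaussian vector follows a Chi distribution with d degrees of freedom. The sketch below compares the first two moments of the feature norms with the Chi moments. This moment-matching penalty is an assumption chosen for brevity, not the radial Gaussianization loss from the paper, and the function names are hypothetical.

```python
import math
import random

def chi_mean(dim):
    """Mean of the Chi distribution with `dim` degrees of freedom."""
    return math.sqrt(2.0) * math.exp(
        math.lgamma((dim + 1) / 2) - math.lgamma(dim / 2))

def radial_moment_penalty(features):
    """Squared gap between feature-norm moments and Chi(dim) moments."""
    dim = len(features[0])
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    mean = sum(norms) / len(norms)
    var = sum((r - mean) ** 2 for r in norms) / len(norms)
    target_mean = chi_mean(dim)
    target_var = dim - target_mean ** 2  # E[r^2] = dim for a standard Gaussian
    return (mean - target_mean) ** 2 + (var - target_var) ** 2

rng = random.Random(0)  # fixed seed so the example is repeatable
gaussian = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(2000)]
# Same directions, but every feature rescaled to unit norm.
unit_norm = [[x / math.sqrt(sum(v * v for v in f)) for x in f] for f in gaussian]
```

The Gaussian batch scores close to zero, while the unit-norm batch is penalized heavily even though its directions are identical. This shows the kind of radial structure that a norm-based loss can see.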
C-JEPA extends masked joint embedding prediction to object-centric representations to better capture interaction-dependent dynamics in world models. By utilizing object-level masking, the architecture forces the inference of states from relational contexts, inducing a causal inductive bias that enhances counterfactual reasoning and drastically reduces the latent feature overhead for agent planning.
The stable-worldmodel (SWM) ecosystem addresses the reproducibility crisis in World Model research by providing standardized environments, tools, and baselines. It enables efficient data collection and supports research into robustness and continual learning through controllable environmental factors. SWM offers a unified platform for developing and evaluating World Models, mitigating issues of publication-specific implementations and fostering reusability.
EB-JEPA is an open-source library that implements Joint-Embedding Predictive Architectures (JEPAs) for learning representations and world models. JEPAs predict in representation space, avoiding the complexities of generative modeling while capturing semantic features. The library provides modular, single-GPU friendly implementations demonstrating scalability from image-level self-supervised learning to video and action-conditioned world models.
Rectified LpJEPA introduces a novel regularization technique, Rectified Distribution Matching Regularization (RDMReg), for Joint-Embedding Predictive Architectures (JEPA). This method addresses the limitation of existing JEPA approaches that favor dense representations by explicitly promoting sparsity. By aligning representations to a Rectified Generalized Gaussian (RGG) distribution, Rectified LpJEPA achieves controllable sparsity while maintaining maximum-entropy properties and competitive performance in image classification tasks.
World models face challenges in planning due to vast search spaces. The GRASP algorithm addresses this by using a differentiable world model for efficient, parallelized optimization. It treats states as "virtual states" with soft dynamics constraints and introduces stochasticity to avoid local optima, outperforming existing planning algorithms in success rate and convergence time on long-horizon tasks.
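The idea of treating states as free variables tied together by soft dynamics constraints can be sketched on a one-dimensional toy problem. This is an illustrative sketch only, not the GRASP algorithm: the dynamics `s[t+1] = s[t] + a[t]`, the quadratic penalty, the linearly decaying gradient noise, and every name and hyperparameter below are assumptions.

```python
import random

def plan_with_virtual_states(start, goal, horizon=5, steps=2000,
                             lr=0.05, noise=0.1, seed=0):
    """Jointly optimize virtual states and actions by noisy gradient descent.

    Loss = sum_t (s[t+1] - (s[t] + a[t]))**2 + (s[horizon] - goal)**2
    """
    rng = random.Random(seed)
    states = [start] * (horizon + 1)  # states[0] is fixed; the rest are free variables
    actions = [0.0] * horizon
    for k in range(steps):
        # Soft dynamics constraint: how far each virtual state is from the model's prediction.
        residuals = [states[t + 1] - (states[t] + actions[t]) for t in range(horizon)]
        grad_s = [0.0] * (horizon + 1)
        grad_a = [0.0] * horizon
        for t, r in enumerate(residuals):
            grad_s[t + 1] += 2 * r
            grad_s[t] -= 2 * r
            grad_a[t] = -2 * r
        grad_s[horizon] += 2 * (states[horizon] - goal)
        sigma = noise * (1 - k / steps)  # noise shrinks as optimization proceeds
        # All time steps are updated at once, with no sequential rollout.
        for t in range(1, horizon + 1):
            states[t] -= lr * (grad_s[t] + rng.gauss(0.0, sigma))
        for t in range(horizon):
            actions[t] -= lr * (grad_a[t] + rng.gauss(0.0, sigma))
    return states, actions

states, actions = plan_with_virtual_states(start=0.0, goal=4.0)
```

Because every time step has its own state variable, the gradient for each step depends only on its neighbors, so the updates can run in parallel across the horizon. Executing the returned `actions` from `start` under the toy dynamics lands close to `goal` once the residuals have been driven near zero.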
Joint Embedding Predictive Architectures (JEPA) struggle with representation collapse in self-supervised speech learning. GMM-Anchored JEPA addresses this by using a Gaussian Mixture Model (GMM) to generate frozen soft posteriors as auxiliary targets. This method, unlike previous iterative re-clustering approaches, applies a one-time clustering with soft assignments and a decaying supervision schedule, enhancing model stability and performance across various speech tasks.
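The two ingredients named above, frozen soft posteriors from a GMM and a supervision weight that decays over training, can be sketched as follows. The one-dimensional components, the linear decay schedule, the cross-entropy auxiliary term, and all names are assumptions for illustration; the summary does not specify them.

```python
import math

def gmm_soft_posteriors(x, means, stds, weights):
    """Posterior responsibility of each 1-D Gaussian component for a scalar x."""
    likelihoods = [
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for m, s, w in zip(means, stds, weights)
    ]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

def supervision_weight(step, total_steps, initial=1.0):
    """Linearly decaying weight for the auxiliary GMM target (assumed schedule)."""
    return initial * max(0.0, 1.0 - step / total_steps)

def combined_loss(jepa_loss, predicted_probs, frozen_posteriors, step, total_steps):
    """JEPA loss plus a decaying cross-entropy term against the frozen posteriors."""
    cross_entropy = -sum(q * math.log(p)
                         for q, p in zip(frozen_posteriors, predicted_probs))
    return jepa_loss + supervision_weight(step, total_steps) * cross_entropy

# Computed once and then frozen, mirroring the one-time clustering described above.
posteriors = gmm_soft_posteriors(0.0, means=[-1.0, 1.0], stds=[1.0, 1.0],
                                 weights=[0.5, 0.5])
```

Early in training the auxiliary term anchors the representation to the cluster structure. By the final step its weight reaches zero and only the JEPA loss remains.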
Representation Autoencoders (RAEs) demonstrate superior performance and stability compared to Variational Autoencoders (VAEs) in large-scale text-to-image (T2I) generation. RAEs achieve faster convergence and better generation quality, even with a simplified framework, making them a more robust foundation for T2I models. This success is partly attributed to their ability to operate within a shared representation space for both visual understanding and generation, opening new avenues for unified multimodal models.
This paper explores the development of latent action world models capable of operating on "in-the-wild" video data. Traditional world models often necessitate explicit action labels, which are impractical for diverse, real-world scenarios. The research demonstrates that continuous, constrained latent actions can effectively capture the complexity of real-world interactions, even in the presence of environmental noise and varying embodiments across videos. This advancement allows for the potential of learning universal interfaces for planning tasks.
Recent advancements in AI aim to develop agents capable of solving diverse physical tasks and generalizing to new environments. A promising approach involves training world models from state-action trajectories for planning. This work characterizes a family of such models as JEPA-WMs, which optimize planning within the learned representation space of the world model to abstract irrelevant details and enhance efficiency. The study investigates the impact of model architecture, training objectives, and planning algorithms on planning success, proposing a model that outperforms established baselines.
The provided content, a satirical post quoted by Yann LeCun, juxtaposes the perceived "weakness" of European social democracies (characterized by social benefits, personal freedoms, and stability) with the "strength" of authoritarian regimes (marked by control, fear, and suppression of dissent). It implicitly argues that the stability and freedoms of the former are desirable, while the latter, despite its supposed "strength," leads to oppression and a lack of genuine well-being. The satire highlights the benefits of a society that prioritizes citizen welfare and predictable safety over control and enforced conformity.
This paper proposes an enhancement to Joint-Embedding Predictive Architectures (JEPA) for improved action planning. It addresses the limitation of current JEPA models in supporting effective planning by shaping their representation space. This shaping is achieved by approximating the negative goal-conditioned value function with a distance metric between state embeddings, leading to better performance on control tasks.
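If the distance between embeddings approximates the negative value function, then a planner can score candidate next states by how close their embeddings are to the goal embedding. The sketch below shows this on a toy grid. The identity `embed` function, the grid dynamics in `step`, and the greedy one-step planner are hypothetical stand-ins, not the paper's encoder or planning method.

```python
import math

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # toy grid moves (assumption)

def embed(state):
    """Hypothetical stand-in encoder: identity on 2-D grid positions."""
    return list(state)

def value(state, goal):
    """Approximate goal-conditioned value as negative embedding distance."""
    return -math.dist(embed(state), embed(goal))

def step(state, action):
    """Toy dynamics: move by the action offset."""
    return (state[0] + action[0], state[1] + action[1])

def greedy_action(state, goal, actions, dynamics):
    """Pick the action whose predicted next state has the highest value."""
    return max(actions, key=lambda a: value(dynamics(state, a), goal))

state, goal = (0, 0), (3, 0)
path = [state]
while state != goal and len(path) < 10:
    state = step(state, greedy_action(state, goal, ACTIONS, step))
    path.append(state)
```

Greedy selection works in this toy case only because Euclidean distance already tracks the number of steps remaining. The point of shaping the representation space is to make the same shortcut valid for a learned encoder.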
Large Language Models (LLMs) currently possess parameter counts on par with the number of synapses found in a mouse brain. This comparison highlights the significant scale achieved by modern AI models, placing them within a biological order of magnitude relevant to neuroscientific considerations. This suggests a potential, albeit abstract, benchmark for complexity in AI development relative to biological systems.
The World Wide Web, a foundational technology for free discourse, originated in a European government research institution. It was developed at CERN by Sir Tim Berners-Lee, emphasizing its non-commercial and publicly funded genesis.
This post features a prominent AI researcher playfully comparing himself to a large language model. This self-referential humor highlights the increasing public awareness and common understanding of LLMs, even among experts in the field. It subtly suggests a possible future where AI models are commonplace enough for humorous, everyday comparisons.
Yann LeCun, a prominent figure in AI, expressed apprehension, indicating a potential concern regarding a specific, unspecified topic. His brief statement suggests a sentiment of worry or fear relevant to current discussions within the AI community, though the exact subject of his concern remains unelaborated in this specific post.
The provided content contains no technical information or substantive claims, consisting only of a short French phrase ('Moi ?' meaning 'Me?'). It is insufficient for technical synthesis.
Intelligence should be conceptualized as a multidimensional vector rather than a scalar value. This perspective suggests that intelligence is not a singular, general ability but a complex interplay of various specialized capacities. All species, including humans, exhibit specialized intelligence rather than truly general intelligence, with varying degrees of adaptability across species.
The content speculates on the existence of intelligent beings whose perception of reality, or "slice of the whole space," is fundamentally different from our own. These beings would manifest to us as indistinguishable from random thermal fluctuations, rendering them undetectable and incomprehensible through our current understanding of physics and observation.
The user, presumably a prominent AI researcher given the context of Y. LeCun's feed, highlights the extensive exposure to language and non-linguistic percepts as a significant factor in their developmental experience. This suggests a perspective where diverse and prolonged environmental interaction, beyond just linguistic data, is crucial for comprehensive understanding and AI model development.
The user, a prominent AI researcher, posted a single-emoji message on a social media platform. This presents a challenge for natural language processing models tasked with sentiment analysis or humor detection, as the meaning is highly contextual and subjective, requiring advanced understanding beyond lexical analysis.
SpidR-Adapt introduces a meta-learning approach for low-resource speech representation, enabling rapid adaptation to new languages with minimal unlabeled data. It utilizes a multi-task adaptive pre-training (MAdaPT) protocol and a first-order bi-level optimization (FOBLO) heuristic. This method aims to close the efficiency gap between human language acquisition and data-intensive self-supervised models.
DexWM is a novel world model designed to handle dexterous hand-object interactions, addressing the limitations of existing models that use coarse action spaces. It overcomes data scarcity by using finger keypoints from egocentric videos, enabling training on extensive human and non-dexterous robot data. A key innovation is the incorporation of a hand consistency loss, crucial for accurate dexterity modeling, leading to superior future-state prediction and zero-shot transfer capabilities compared to previous methods.