Chronological feed of everything captured from Yann LeCun.
tweet / @ylecun / Dec 25
This post features a prominent AI researcher playfully comparing himself to a large language model. This self-referential humor highlights the increasing public awareness and common understanding of LLMs, even among experts in the field. It subtly suggests a possible future where AI models are commonplace enough for humorous, everyday comparisons.
llm-humor, x-feed, social-media, yann-lecun
“Yann LeCun humorously compares himself to a Large Language Model (LLM).”
tweet / @ylecun / Dec 25
Yann LeCun, a prominent figure in AI, expressed apprehension about an unspecified topic. His brief statement suggests worry or fear relevant to current discussions within the AI community, though the exact subject of his concern is not elaborated in this post.
x-feed, social-media, yann-lecun, french-language
“Yann LeCun is expressing fear or concern.”
tweet / @ylecun / Dec 25
The provided content contains no technical information or substantive claims, consisting only of a short French phrase ('Moi ?' meaning 'Me?'). It is insufficient for technical synthesis.
social-media, personal-post, x-feed, yann-lecun
tweet / @ylecun / Dec 25
Intelligence should be conceptualized as a multidimensional vector rather than a scalar value. This perspective suggests that intelligence is not a singular, general ability but a complex interplay of various specialized capacities. All species, including humans, exhibit specialized intelligence rather than truly general intelligence, with varying degrees of adaptability across species.
intelligence-theory, ai-theory, cognitive-science, biological-intelligence
“Intelligence is a multidimensional vector.”
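The vector-versus-scalar point can be made concrete with a toy sketch (the capacity names and numbers below are invented for illustration): collapsing capability profiles to a single score hides the fact that each species dominates on different axes.

```python
import numpy as np

# Toy model of intelligence as a vector of specialized capacities rather
# than a single scalar. All names and values here are invented.
capacities = ["navigation", "social-reasoning", "tool-use", "echolocation"]
human = np.array([0.6, 0.9, 0.95, 0.05])
bat = np.array([0.8, 0.3, 0.05, 0.95])

# A scalar summary (e.g. the mean) hides the structure ...
human_scalar = human.mean()
bat_scalar = bat.mean()

# ... while the vector view shows neither profile dominates the other.
human_wins = [c for c, h, b in zip(capacities, human, bat) if h > b]
bat_wins = [c for c, h, b in zip(capacities, human, bat) if b > h]
```

The scalar comparison declares a winner; the per-axis comparison shows specialization on both sides, which is the claim in the post.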
tweet / @ylecun / Dec 25
The content speculates on the existence of intelligent beings whose perception of reality, or "slice of the whole space," is fundamentally different from our own. These beings would manifest to us as indistinguishable from random thermal fluctuations, rendering them undetectable and incomprehensible through our current understanding of physics and observation.
ai-safety, ethics, consciousness, philosophy-of-ai
“Other beings might perceive a different 'slice of the whole space' that is meaningful to them.”
tweet / @ylecun / Dec 25
Writing on his own feed, LeCun highlights extensive exposure to language and non-linguistic percepts as a significant factor in his developmental experience. This suggests a perspective in which diverse, prolonged environmental interaction, beyond linguistic data alone, is crucial for comprehensive understanding and for AI model development.
yann-lecun, x-feed, ai-pioneer, language-perception, non-linguistic-percepts
“The author has had significant exposure to language.”
tweet / @ylecun / Dec 25
The user, a prominent AI researcher, posted a single-emoji message on a social media platform. This presents a challenge for natural language processing models tasked with sentiment analysis or humor detection, as the meaning is highly contextual and subjective, requiring advanced understanding beyond lexical analysis.
humor, social-media, reaction
“A single laughing emoji was posted on a social media feed.”
paper / ylecun / Dec 24
SpidR-Adapt introduces a meta-learning approach for low-resource speech representation, enabling rapid adaptation to new languages with minimal unlabeled data. It utilizes a multi-task adaptive pre-training (MAdaPT) protocol and a first-order bi-level optimization (FOBLO) heuristic. This method aims to close the efficiency gap between human language acquisition and data-intensive self-supervised models.
speech-representation, few-shot-learning, meta-learning, low-resource-languages, self-supervised-learning, spoken-language-modeling, yann-lecun
“SpidR-Adapt addresses the data efficiency gap between human and machine speech acquisition.”
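A minimal sketch of the kind of first-order shortcut that bi-level heuristics like FOBLO take, here in the Reptile style: adapt on one task with a few inner gradient steps, then move the meta-parameters toward the adapted ones, skipping second-order terms. The toy linear-regression tasks, step sizes, and loop counts are all invented; the actual FOBLO procedure is specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_adapt(w, X, y, lr=0.1, steps=5):
    """A few inner-loop gradient steps on one task's squared loss."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

w_meta = np.zeros(3)
for _ in range(200):                   # outer loop over sampled "tasks"
    true_w = rng.normal(size=3)        # each task = a random linear map
    X = rng.normal(size=(20, 3))
    y = X @ true_w
    w_task = inner_adapt(w_meta.copy(), X, y)
    w_meta += 0.5 * (w_task - w_meta)  # first-order meta-update (no 2nd-order terms)
```

The meta-update direction `w_task - w_meta` stands in for the bi-level outer gradient, which is the efficiency trick first-order heuristics exploit.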
paper / ylecun / Dec 15
DexWM is a novel world model designed to handle dexterous hand-object interactions, addressing the limitations of existing models that use coarse action spaces. It overcomes data scarcity by using finger keypoints from egocentric videos, enabling training on extensive human and non-dexterous robot data. A key innovation is the incorporation of a hand consistency loss, crucial for accurate dexterity modeling, leading to superior future-state prediction and zero-shot transfer capabilities compared to previous methods.
robotics, world-models, dexterous-manipulation, computer-vision, machine-learning
“DexWM accurately models dexterous hand-object interactions despite the scarcity of finely annotated datasets.”
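A hedged sketch of what a hand-consistency-style loss can look like: alongside plain keypoint error, penalize distortion of inter-joint distances so the predicted hand stays structurally plausible. This is an illustration of the idea only, not DexWM's actual formulation; the structural weight and 21-keypoint layout are assumptions.

```python
import numpy as np

def pairwise_dists(kp):
    # kp: (J, 3) joint positions -> (J, J) inter-joint distance matrix
    diff = kp[:, None, :] - kp[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def hand_consistency_loss(pred, target, w_struct=1.0):
    # Absolute keypoint error plus a term that punishes changing the
    # distances between joints (i.e. deforming the hand's structure).
    point_err = np.mean(np.sum((pred - target) ** 2, axis=-1))
    struct_err = np.mean((pairwise_dists(pred) - pairwise_dists(target)) ** 2)
    return point_err + w_struct * struct_err

target = np.random.default_rng(1).normal(size=(21, 3))  # 21 hand keypoints
```

Note that a rigid translation of the whole hand is penalized only by the point term, while a distorted hand is penalized by both.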
youtube / ylecun / Dec 15 / failed
youtube / ylecun / Dec 12
Current Large Language Models (LLMs) achieve superhuman symbol manipulation but lack the grounded understanding of physical reality necessary for general intelligence. While LLMs rely on massive datasets (trillions of tokens) to predict discrete symbols, true intelligence requires architectures capable of learning abstract representations from high-dimensional, continuous sensory data—akin to how animals learn intuitive physics with extreme sample efficiency.
ai-systems, large-language-models, deep-learning, ai-ethics, ai-safety, future-of-ai, ml-research
“LLMs are fundamentally incapable of achieving human-level general intelligence (AGI) through current token-prediction paradigms.”
paper / ylecun / Dec 11
VL-JEPA is a new vision-language model leveraging a Joint Embedding Predictive Architecture. It predicts continuous embeddings of target texts, allowing it to focus on task-relevant semantics while abstracting away surface-level linguistic variability. This approach leads to stronger performance with 50% fewer trainable parameters compared to classical VLMs, while also supporting selective decoding and a versatile embedding space for various tasks.
jepa, vision-language-models, joint-embedding-predictive-architecture, computer-vision, deep-learning, multimodal-ai
“VL-JEPA predicts continuous embeddings of target texts instead of autoregressively generating tokens.”
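A hedged sketch of the training signal as described: rather than autoregressively generating tokens, predict a continuous embedding of the target text and regress onto it. The encoders here are stand-in random linear maps, and the (1 - cosine) objective is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_emb = 16, 12, 8

W_vis = rng.normal(size=(d_emb, d_img))  # stand-in vision encoder + predictor
W_txt = rng.normal(size=(d_emb, d_txt))  # stand-in target text encoder

def cosine_loss(pred, target):
    # Regress the predicted embedding onto the target text embedding.
    num = float(pred @ target)
    den = float(np.linalg.norm(pred) * np.linalg.norm(target))
    return 1.0 - num / den

image = rng.normal(size=d_img)
caption = rng.normal(size=d_txt)
loss = cosine_loss(W_vis @ image, W_txt @ caption)
```

The point of the setup is that the target is a single continuous vector, so surface-level paraphrases of the caption that encode to nearby vectors incur nearly the same loss.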
paper / ylecun / Dec 10
World models, when combined with model predictive control (MPC), enable generalization in robotic planning tasks. While gradient-based planning offers a computationally efficient alternative to traditional MPC, its performance has lagged. This work identifies a train-test gap where world models trained on next-state prediction are used for action sequence estimation at inference. The proposed data synthesis techniques significantly improve gradient-based planning for existing world models.
machine-learning, robotics-world-models, gradient-based-planning, model-predictive-control, train-test-gap, object-manipulation, navigation-tasks
“World models with model predictive control (MPC) facilitate generalization across diverse planning tasks by training on expert trajectories.”
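The inference-time use of the world model can be made concrete: gradients of a goal cost flow back through the model's rollout into the action sequence. A toy version with known linear dynamics follows; the matrices, horizon, and step size are invented, and a learned world model would replace the hand-written dynamics.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])  # toy "world model": s' = A s + B a
B = np.array([[0.0], [0.1]])
s0 = np.zeros(2)
goal = np.array([1.0, 0.0])
H = 20

actions = np.zeros((H, 1))
for _ in range(500):                     # planning = inner optimization
    states = [s0]                        # forward rollout through the model
    for a in actions:
        states.append(A @ states[-1] + B @ a)
    # backward pass: gradient of ||s_H - goal||^2 w.r.t. each action
    grad_s = 2 * (states[-1] - goal)
    grads = np.zeros_like(actions)
    for t in reversed(range(H)):
        grads[t] = B.T @ grad_s
        grad_s = A.T @ grad_s
    actions -= 0.5 * grads

s = s0                                   # final rollout with optimized actions
for a in actions:
    s = A @ s + B @ a
final = s
```

The train-test gap the paper targets is visible here: the model is consulted only for next-state prediction during training, but at inference its gradients are asked to shape a whole action sequence.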
paper / ylecun / Dec 8
A novel two-stage self-supervised framework combines JEPA and Density Adaptive Attention Mechanism (DAAM) to learn robust speech representations. It decouples semantic audio feature learning from waveform reconstruction, employing masked prediction in latent space. The model achieves efficient and reversible speech tokenization with a low frame rate, outperforming existing neural audio codecs.
jepa, speech-representation, neural-tokenizer, audio-processing, self-supervised-learning, density-adaptive-attention
“The proposed framework utilizes a two-stage self-supervised learning approach.”
paper / ylecun / Dec 4
Video generation models show promise for high-level robot planning, but direct imitation is hindered by noise and morphological distortions in generated videos. A two-stage pipeline is proposed to address this: first, video pixels are lifted into a 4D human representation and retargeted to a humanoid morphology. Second, a physics-aware reinforcement learning policy, GenMimic, is introduced to enable robots to mimic human actions from noisy, generated videos in a zero-shot manner.
robotics, video-generation, human-robot-interaction, reinforcement-learning, computer-vision, motion-tracking
“Generated videos can serve as high-level planners for contextual robot control.”
paper / ylecun / Nov 11
LeJEPA offers a novel, theoretically-grounded approach to Joint-Embedding Predictive Architectures (JEPAs) for self-supervised learning. By introducing Sketched Isotropic Gaussian Regularization (SIGReg), LeJEPA optimizes embedding distributions to minimize prediction risk. This framework eliminates traditional heuristics, boasting linear complexity, stability across diverse architectures and domains, and efficient distributed training, making it highly suitable for scalable AI research.
self-supervised-learning, joint-embedding-predictive-architectures, representation-learning, deep-learning-theory, computer-vision, machine-learning
“LeJEPA optimizes for an isotropic Gaussian distribution of embeddings to minimize downstream prediction risk.”
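A simplified sketch of the sliced, isotropic-Gaussian idea behind SIGReg: draw random 1-D projections of the embedding batch and penalize each projection for deviating from a standard Gaussian. Here the deviation is reduced to the first two moments, which is an assumption; the paper uses a proper statistical test on the projections.

```python
import numpy as np

def sigreg_moment_penalty(Z, num_slices=64, seed=0):
    # Z: (batch, dim) embeddings. Project onto random unit directions and
    # penalize each 1-D projection for not looking like N(0, 1).
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(num_slices, Z.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = Z @ dirs.T                          # (batch, num_slices)
    mean_pen = np.mean(proj.mean(axis=0) ** 2)
    var_pen = np.mean((proj.var(axis=0) - 1.0) ** 2)
    return mean_pen + var_pen

rng = np.random.default_rng(1)
z_iso = rng.normal(size=(4096, 16))        # ~ isotropic Gaussian embeddings
z_collapsed = np.full((4096, 16), 0.5)     # fully collapsed embeddings
pen_iso = sigreg_moment_penalty(z_iso)
pen_collapsed = sigreg_moment_penalty(z_collapsed)
```

Collapsed embeddings have zero variance along every slice and are penalized heavily, which is how the regularizer replaces the usual anti-collapse heuristics.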
paper / ylecun / Nov 6
The Cambrian-S research argues that true multimodal intelligence requires 'spatial supersensing'—moving beyond task-driven perception toward predictive world modeling. While data scaling provides marginal gains, the authors demonstrate that self-supervised predictive sensing (using prediction error for event segmentation) is necessary to solve long-horizon visual spatial recall and counting tasks.
computer-vision, multimodal-ai, spatial-cognition, predictive-world-modeling, ai-benchmarks, video-analysis
“Data scaling alone is insufficient to achieve spatial supersensing capabilities.”
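The "prediction error as event boundary" idea can be sketched with a last-value predictor over a synthetic signal; both are stand-ins, since Cambrian-S applies a learned predictive model to video features.

```python
import numpy as np

# A piecewise-constant "feature stream" with two abrupt event changes.
signal = np.concatenate([
    np.full(50, 0.0),    # event A
    np.full(50, 3.0),    # event B
    np.full(50, -2.0),   # event C
])

# Last-value predictor: predicted next value = current value, so its
# absolute error is just the step-to-step difference.
pred_error = np.abs(np.diff(signal))

# Mark an event boundary wherever the prediction error spikes.
boundaries = np.where(pred_error > 1.0)[0] + 1
```

Within an event the predictor is nearly perfect; the error spikes exactly at the transitions, which is the segmentation signal the paper exploits for long-horizon recall and counting.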
paper / ylecun / Oct 7
This paper establishes a novel connection between attention sinks and compression valleys in large language models (LLMs), attributing both phenomena to the formation of massive activations in the residual stream. A theoretical framework is provided, demonstrating how these massive activations lead to representational compression and entropy reduction. Experimental validation across various LLMs supports this unified view, leading to the "Mix-Compress-Refine" theory of information flow within Transformers, which posits distinct computational phases across layers.
attention-sinks, compression-valleys, llm-internals, transformer-architectures, representational-compression, activation-analysis
“Attention sinks and compression valleys in LLMs are directly linked to the formation of massive activations in the residual stream.”
paper / ylecun / Oct 7
Joint Embedding Predictive Architectures (JEPAs) utilize an anti-collapse term, typically considered for representation diversity. This research reveals that this term also implicitly estimates data density, a previously unrecognized capability. This density estimation is provable across datasets and architectures, enabling applications like data curation and outlier detection.
jepas, gaussian-embeddings, data-density, self-supervised-learning, representation-learning, outlier-detection, ai-research
“JEPAs' anti-collapse term provably estimates data density.”
tweet / @ylecun / Oct 4
Yann LeCun clarifies the historical timeline of the MNIST dataset, stating it was introduced five years after a previously referenced event or dataset. This correction is a direct response to a social media query regarding the dataset's debut.
mnist, yann-lecun, deep-learning-history, neural-networks
“The MNIST dataset was released five years after a point of reference mentioned in the conversation.”
tweet / @ylecun / Oct 3
A video shared on a social media platform was incorrectly dated by the uploader. The content, specifically a video, was attributed to a different, much later year than its actual creation date. This highlights potential inaccuracies in content metadata or user-provided context on social media platforms.
social-media, historical-reference, online-discussion, video-content, yann-lecun
“The video in question was created in 1989.”
tweet / @ylecun / Sep 24
Accelerating to relativistic speeds slows your own clock relative to the outside world: you experience less elapsed time while more time passes everywhere else. The physics is sound, but as a time-management strategy it backfires, since the local gain in subjective time is paid for with a global loss, making it an impractical solution for everyday scheduling problems.
humor, relativity, time-dilation, physics, sarcasm, social-media
“Accelerating to relativistic speeds causes an individual's perception of time to slow down.”
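For reference, the physics behind the joke is just the Lorentz factor: at speed v, gamma = 1 / sqrt(1 - v^2/c^2) is how much external time passes per unit of the traveler's proper time.

```python
import math

def lorentz_gamma(v_over_c):
    """Time-dilation factor for a traveler moving at fraction v_over_c of c."""
    return 1.0 / math.sqrt(1.0 - v_over_c ** 2)

gamma = lorentz_gamma(0.99)   # at 0.99c: one year aboard is about 7 years outside
```

At everyday speeds gamma is indistinguishable from 1, which is why this "time-management technique" only kicks in when it is least useful.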
tweet / @ylecun / Sep 24
Yann LeCun, a prominent AI researcher at Meta, succinctly rejected an unspecified claim. This interaction suggests that even high-profile figures in AI are directly engaging with and refuting misinformation or overblown assertions about the field.
social-media, opinion, yann-lecun
“Yann LeCun rejected a claim made by user @DreamStarter_1.”
tweet / @ylecun / Sep 24
Due to an inaccessible or deleted X post, no content could be retrieved for analysis, preventing the extraction of any claims or insights from the intended source.
x-feed, content-unavailable, error-handling, social-media-analysis, data-ingestion
“The specific X post at the provided URL is inaccessible.”
tweet / @ylecun / Sep 24
Yann LeCun asserts that the Joint Embedding Predictive Architecture (JEPA) remains the primary objective for AI development, contrasting it with the Code World Model (CWM), which he characterizes as a token-generative baseline. This suggests a strategic focus on architectures that learn representations by predicting missing information in a joint embedding space, rather than relying solely on sequential token generation.
jepa, cwm, yann-lecun, ai-models, deep-learning
“JEPA is the architectural goal for AI development.”
tweet / @ylecun / Sep 24
Yann LeCun asserts that current "coding" activities, likely referring to large language model capabilities, do not equate to Artificial General Intelligence (AGI). This distinction highlights a fundamental difference between pattern recognition and generation on one hand and genuine reasoning and understanding on the other. His statement implies that advanced coding abilities are not sufficient indicators of human-level intelligence, a claim with significant implications for the ongoing debate over AI progress and terminology. He frames current coding advances as a narrow application of AI rather than a step toward true artificial general intelligence.
artificial-intelligence, agi, llm-mechanisms, cognitive-science
“Current AI coding capabilities are not indicative of Artificial General Intelligence.”
tweet / @ylecun / Sep 24 / failed
tweet / @ylecun / Sep 24
A Code World Model operates by simulating the execution outcomes of instructions to iteratively plan and generate code. This approach shifts code generation from token prediction to a goal-oriented process based on imagined execution effects.
code-world-model, ai-research, deep-learning, yann-lecun, ai-programming
“Code World Models generate code by simulating the effects of executing instructions.”
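The "plan by imagining execution effects" loop can be sketched as generate, simulate, select. The candidate pool and goal below are invented, and the stand-in "world model" actually executes each candidate, whereas a real Code World Model predicts execution outcomes with a learned model instead of running the code.

```python
def simulate(program, x):
    """Stand-in world model: here we really execute the candidate program."""
    env = {"x": x}
    exec(program, env)
    return env["y"]

# Invented candidate programs and an invented goal behavior.
candidates = [
    "y = x + 1",
    "y = x * 2",
    "y = x * x",
]
goal = lambda x: x * 2
checks = [0, 1, 3, 7]

# Keep the first candidate whose simulated outcomes match the goal.
chosen = next(
    p for p in candidates
    if all(simulate(p, x) == goal(x) for x in checks)
)
```

The shift the tweet describes is that selection is driven by imagined execution effects rather than by token-level likelihood of the program text.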
tweet / @ylecun / Sep 19
This tweet from Yann LeCun, a prominent AI researcher, consists solely of emojis in response to another user. It indicates social interaction on a non-technical topic, highlighting the personal use of social media among public figures in the AI community rather than disseminating technical insights or research. The lack of textual content makes it impossible to extract any technical claims or insights from this specific interaction.
social-media, leverage-social, x-engagement
“Yann LeCun's tweet primarily consists of emojis, indicating a non-substantive social interaction.”
paper / ylecun / Sep 11
Current Large Language Model (LLM) training relies on input-space reconstruction, a method found to be suboptimal in computer vision. This research introduces LLM-JEPA, a Joint Embedding Predictive Architecture for LLMs, aiming to adapt the more effective embedding-space training used in vision to language models. Initial findings indicate that LLM-JEPA significantly outperforms standard LLM training and demonstrates robustness against overfitting.
llm-pretraining, jepa, embedding-space, language-models, ai-frameworks, vision-ai
“LLM pretraining, finetuning, and evaluation are primarily based on input-space reconstruction and generative capabilities.”
youtube / ylecun / Aug 14
Yann LeCun traces the arc of neural network research from its academic fringe origins in the 1980s to its current dominance, crediting deliberate rebranding as "deep learning" and a coordinated infiltration of industry speech recognition pipelines as pivotal turning points. He argues the real AI competition is not geopolitical (US vs. China vs. Europe) but structural — open-source vs. proprietary — and points to Meta's Llama as evidence that open research ecosystems outpace secretive ones. On AI safety, LeCun rejects the "uncontrollable superintelligence" narrative, framing alignment as a tractable engineering problem solvable through objective-driven architectures with enforced guardrails, not existential dread.
deep-learning, ai-history, open-source-ai, neural-networks, ai-safety, ai-research, llm-infrastructure
“Deep learning achieved mainstream adoption in speech recognition within 18 months after Geoff Hinton placed three students as interns at Microsoft, Google, and IBM to replace acoustic modeling components with deep learning systems — all three got better results.”
paper / ylecun / Jul 25
DINO-world proposes using DINOv2's frozen image encoder as the representational backbone for a video world model, training a future predictor in latent space on large-scale uncurated video data. This approach sidesteps the need for pixel-level prediction, instead operating in a semantically rich feature space that generalizes across diverse scene types — driving, indoor, and simulated environments. The model can be fine-tuned on action-observation trajectories to become action-conditioned, enabling latent-space trajectory simulation for planning. It outperforms prior models on video prediction benchmarks including segmentation and depth forecasting, and shows emergent understanding of intuitive physics.
world-models, video-prediction, self-supervised-learning, computer-vision, dino, latent-space, embodied-ai
“DINO-world outperforms previous models on video prediction benchmarks including segmentation and depth forecasting tasks.”
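The recipe, a frozen encoder plus a predictor trained only in latent space, can be sketched at toy scale. Linear stand-ins are used throughout and the "video" is a synthetic rotating process; DINO-world itself uses frozen DINOv2 features and a learned transformer predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_lat = 32, 8

W_enc = rng.normal(size=(d_lat, d_obs)) / np.sqrt(d_obs)  # frozen "encoder"
Q, _ = np.linalg.qr(rng.normal(size=(d_obs, d_obs)))      # toy frame dynamics

# Roll out a synthetic frame sequence, then encode every frame.
frames = [rng.normal(size=d_obs)]
for _ in range(499):
    frames.append(Q @ frames[-1] + 0.01 * rng.normal(size=d_obs))
Z = np.array([W_enc @ f for f in frames])                 # latent sequence

# Train only the latent future predictor (least squares stands in for SGD).
P, *_ = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)
pred_err = np.mean((Z[:-1] @ P - Z[1:]) ** 2)
naive_err = np.mean((Z[:-1] - Z[1:]) ** 2)                # copy-last baseline
```

No pixel is ever reconstructed: the only trained object is the map from current latent to next latent, which is the point of predicting in feature space.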
tweet / @ylecun / Jul 3
In a terse post, Yann LeCun asserts that Artificial Superintelligence (ASI) — not Artificial General Intelligence (AGI) — has always been his north star. The distinction signals a rejection of AGI as a meaningful or sufficient milestone, implying LeCun views human-level general intelligence as an intermediate or inadequate benchmark. The brevity and conviction of the statement ("Always have") suggests this is a long-held philosophical position rather than a reactive take.
agi, asi, ai-safety, yann-lecun, ai-discourse
“Yann LeCun's stated target for AI research has always been ASI (Artificial Superintelligence), not AGI (Artificial General Intelligence).”
tweet / @ylecun / Jul 2
The AI talent competition is increasingly being shaped not just by compensation, but by a cultural divide between open and closed research philosophies. Meta has emerged as the standard-bearer for open science and open source AI in the U.S. — via publications and Llama — while OpenAI and Anthropic remain notably closed. Yann LeCun's endorsement of this framing signals that the migration of top talent to Meta may reflect a values-driven realignment post-DeepSeek, with implications for innovation velocity, scientific transparency, and AI safety.
open-source-ai, ai-talent, meta-ai, llama, openai, ai-industry, open-science
“Meta is the leading American AI lab in open science (publications) and open source (via Llama), while OpenAI and Anthropic are not open.”
tweet / @ylecun / Jul 2 / failed
tweet / @ylecun / Jul 1
Yann LeCun critically assesses the notion that 'compressibility' alone is a sufficient or unbiased metric for understanding real-world sequences. He suggests that the observed ease with which models compress select real-world data might be an artifact of how these sequences are chosen and represented, rather than an inherent, universal property. LeCun implies a potential bias in the selection of data that appears compressible, which could distort our understanding of true data complexity.
machine-learning-theory, compression-theory, ai-philosophy, model-complexity, deep-learning
“The concept of 'compressibility' does not provide insight into the small fraction of sequences that are genuinely compressible.”
tweet / @ylecun / Jul 1
Yann LeCun, a prominent AI researcher, acknowledged a perceived level of observation or scrutiny related to his activities. The brief remark does not indicate who is observing or why; the nature and extent of this observation remain unspecified.
social-media-trends, online-safety, privacy-concerns, influencer-culture, community-moderation
“Yann LeCun is experiencing observation.”
tweet / @ylecun / Jul 1
Yann LeCun comments on the inherent comprehensibility of the world, echoing Einstein's sentiment. He suggests that this comprehensibility is facilitated by brains possessing an inductive bias, enabling them to understand their localized environment. This implies a fundamental relationship between cognitive architecture and the nature of reality's perceived order.
philosophy, cognition, ai-reasoning, interpretability, human-intelligence
“The comprehensibility of the world is a notable characteristic.”
tweet / @ylecun / Jul 1
Yann LeCun, a prominent AI researcher, expresses strong skepticism regarding the potential of Large Language Models (LLMs) to achieve Artificial General Intelligence (AGI) or Artificial Super Intelligence (ASI). He sarcastically suggests directing inquiries about LLMs leading to advanced AI to the Chief AI Officer of his purported employer, highlighting his dismissive stance on the topic. The interaction indicates a divergence of opinions within the AI community concerning the architectural pathways to achieving AGI.
llm-limitations, llm-safety, ai-policy, agi-capabilities
“Yann LeCun is skeptical that Large Language Models (LLMs) can lead to Artificial General Intelligence (AGI) or Artificial Super Intelligence (ASI).”
tweet / @ylecun / Jul 1
Yann LeCun has held the position of Chief AI Scientist at Meta since 2018. This multi-year tenure indicates stability in the company's AI leadership and suggests a sustained strategic focus on his contributions in that capacity.
ai-leadership, meta-ai, industry-news, yann-lecun
“Yann LeCun is the Chief AI Scientist at Meta.”
tweet / @ylecun / Jul 1
Yann LeCun, a prominent AI researcher at FAIR, distinguishes between Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI). He contends that AGI, defined as 'human-level AI,' is a poorly conceived notion due to the specialized nature of human intelligence. Conversely, LeCun asserts that ASI represents a valid and consistent long-term objective for AI development, a goal he and FAIR have consistently pursued.
artificial-intelligence, ai-research, fair-ai, yann-lecun, ai-aspirations, agi-vs-asi
“Artificial Superintelligence (ASI) is a meaningful and consistent long-term objective for AI development.”
paper / ylecun / Jun 26
PEVA (Predict Ego-centric Video from Actions) trains an autoregressive conditional diffusion transformer to simulate first-person visual consequences of human body actions, conditioned on relative 3D kinematic pose trajectories structured by joint hierarchy. The model is trained on Nymeria, a large-scale real-world egocentric video + body pose dataset, grounding predictions in physically realistic human motion. A hierarchical evaluation protocol stress-tests the model across increasingly complex embodied prediction and control tasks. This represents a direct push toward world models that understand how physical human actions causally shape perceived environments — a key milestone for embodied AI.
video-prediction, egocentric-vision, embodied-ai, diffusion-models, computer-vision, body-pose-estimation, world-models
“Conditioning video prediction on full-body 3D kinematic pose trajectories — structured by joint hierarchy — enables a model to learn how physical actions shape egocentric visual experience.”
paper / ylecun / Jun 11
V-JEPA 2 is a self-supervised learning framework that leverages massive internet-scale video data alongside limited interaction data to enable robust understanding, prediction, and planning capabilities in AI. The architecture demonstrates strong performance across diverse tasks, from motion understanding and human action anticipation to multimodal video question-answering. Crucially, its extension, V-JEPA 2-AC, facilitates zero-shot robotic planning, showcasing the potential for real-world physical world interaction without extensive task-specific training.
self-supervised-learning, video-models, robotics, world-models, action-prediction, large-language-models, computer-vision
“V-JEPA 2, a self-supervised video model, achieves high performance in motion understanding.”
blog / ylecun / Jun 11
V-JEPA 2 is a self-supervised joint-embedding-predictive architecture pre-trained on 1M+ hours of internet video to master physical world understanding and prediction. By post-training as an action-conditioned world model (V-JEPA 2-AC) with minimal robot interaction data, it enables zero-shot robotic planning and manipulation. The model further demonstrates high-tier video-language alignment capabilities when integrated with an 8B parameter LLM.
robotics, self-supervised-learning, video-models, world-models, robot-planning, computer-vision, ai-research
“V-JEPA 2 achieves 77.3 top-1 accuracy on Something-Something v2 and 39.7 recall-at-5 on Epic-Kitchens-100.”
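Zero-shot planning with an action-conditioned latent model can be sketched as sampling-based search: roll candidate action sequences through the model and pick the one whose predicted final latent lands closest to the goal's latent. The linear stand-in dynamics and random-shooting search below are assumptions for illustration; V-JEPA 2-AC's planner is specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

A = np.eye(d) * 0.95               # stand-in latent dynamics
B = rng.normal(size=(d, 2)) * 0.2  # stand-in action effect
z0 = np.zeros(d)                   # current observation's latent
z_goal = rng.normal(size=d)        # goal image's latent

def rollout(z, actions):
    # Imagine the latent trajectory under a candidate action sequence.
    for a in actions:
        z = A @ z + B @ a
    return z

H = 5
candidates = np.concatenate(
    [np.zeros((1, H, 2)), rng.normal(size=(255, H, 2))]  # include "do nothing"
)
costs = [float(np.linalg.norm(rollout(z0, acts) - z_goal))
         for acts in candidates]
best = candidates[int(np.argmin(costs))]
best_cost = min(costs)
```

No reward model or task-specific training appears anywhere: the goal is expressed purely as a target embedding, which is what makes the planning zero-shot.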
youtube / ylecun / May 30 / failed
paper / ylecun / May 26
OSVI-WM addresses a critical gap in one-shot visual imitation learning: generalizing to unseen tasks that are visually similar to training tasks but require semantically distinct responses. The framework uses a learned world model to predict latent state-action trajectories from a single expert video and the agent's current observation, then decodes these into physical waypoints for execution — bypassing the need for fine-tuning. Evaluated across two simulated benchmarks and three real-world robotic platforms, it consistently outperforms prior methods, with gains exceeding 30% in some cases.
imitation-learning, world-models, robotics, one-shot-learning, visual-learning, trajectory-generation, generalization
“OSVI-WM achieves over 30% improvement in success rate compared to prior one-shot visual imitation methods on certain benchmark tasks.”
paper / ylecun / May 21
Using an Information Bottleneck framework across 40+ LLMs, this paper by Shani et al. (incl. LeCun & Jurafsky) finds that while LLMs broadly match human category boundaries, they aggressively over-compress representations — achieving near-optimal information-theoretic efficiency at the expense of the contextual nuance humans deliberately preserve. Encoder models outperform larger decoder models in alignment with human conceptual structure, implying that language understanding and generation recruit fundamentally different representational mechanisms. Training dynamics follow a two-phase pattern: rapid concept formation, then architectural reorganization where semantic processing migrates from deep to mid-network layers as encodings become sparser. The core implication is that human-like understanding may require models that intentionally retain representational "inefficiencies."
llm-representations, information-bottleneck, cognitive-science, semantic-compression, human-ai-comparison, language-model-embeddings, conceptual-categories
“LLMs achieve more optimal information-theoretic compression than humans, but at the cost of semantic richness and fine-grained distinctions.”
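The compression-versus-nuance trade-off can be illustrated with a toy category system: fewer categories cost fewer bits to name but discard more within-category distinctions. The items and labels below are invented; the paper measures this trade-off with an Information Bottleneck objective on real human and LLM category data.

```python
import numpy as np

def code_cost_bits(labels):
    """Entropy of the category variable: average bits to name a category."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def lost_nuance(items, labels):
    """Mean within-category variance: the distinctions the code throws away."""
    return float(np.mean([items[labels == c].var()
                          for c in np.unique(labels)]))

items = np.array([1.0, 1.2, 1.4, 5.0, 5.3, 9.0, 9.1, 9.4])
fine = np.array([0, 0, 0, 1, 1, 2, 2, 2])     # 3 categories
coarse = np.array([0, 0, 0, 0, 0, 1, 1, 1])   # 2 categories
```

The coarse code is cheaper to transmit but lumps distant items together, which is the "over-compression at the expense of nuance" pattern the paper attributes to LLMs.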
youtube / ylecun / Apr 29
Yann LeCun, Turing Award laureate and Meta's Chief AI Scientist, discusses the evolution of deep learning, arguing that AI systems must move beyond current large language models (LLMs) to achieve true intelligence. He advocates for architectures capable of understanding the physical world, planning, and reasoning with abstract mental representations, a direction he calls "Advanced Machine Intelligence" (AMI) rather than Artificial General Intelligence (AGI). LeCun predicts that breakthroughs in physical-world understanding will render current LLMs obsolete within five years, and urges academia to focus on these next-generation systems precisely because industry investment is already concentrated on LLMs.
ai-pioneers, deep-learning-history, neurological-networks, llm-critique, agi-discussion, ai-future, academic-industry-gap
“Current AI systems, particularly LLMs, lack common sense and understanding of the physical world, making them less intelligent than animals.”
youtube / ylecun / Apr 23
Yann LeCun, Meta's Chief AI Scientist, argues that current AI, particularly large language models (LLMs), lack genuine intelligence due to limitations in reasoning, planning, and understanding the physical world. He advocates for open-source AI development to promote diversity and counter proprietary biases, and emphasizes self-supervised learning with novel architectures like Joint-Embedding Predictive Architecture (JEPA) as critical for future breakthroughs in AI.
ai-policy, deep-learning, llms, neural-networks, computer-vision, reinforcement-learning, ai-ethics
“Current LLMs are fundamentally limited in their reasoning and planning capabilities, and do not possess human-like intelligence.”
paper / ylecun / Apr 1
Visual Self-Supervised Learning (SSL) can achieve performance comparable to Contrastive Language-Image Pretraining (CLIP) in multimodal tasks like VQA, provided both are trained on the same data and scaled appropriately. This challenges the assumption that language supervision is essential for strong multimodal representation learning. By controlling for data and scaling model capacity, visual SSL models demonstrate superior scalability and can reach CLIP-level performance.
visual-ssl, representation-learning, clip, vqa, computer-vision, deep-learning, ai-models
“Visual SSL models can achieve CLIP-level performance on VQA and classic vision benchmarks.”