absorb.md

Andrej Karpathy

Chronological feed of everything captured from Andrej Karpathy.

AI Agents Will Replace Traditional Software

Karpathy predicts that most traditional CRUD software will be replaced by AI agents that understand intent and execute multi-step workflows. The UI of the future is a conversation, not a dashboard.

The Argument/Counter-Argument Discovery Pattern

Karpathy observed that the most useful output from AI isn't answers but structured argument/counter-argument pairs that expose blind spots. Having an AI steelman the opposing view on any claim is more valuable than having it confirm your priors.

Karpathy Advocates Cheaper AI Read Access and Costly Write Endpoints for X Platform

Andrej Karpathy notes the unchecked growth in AI activity on X, proposing cheaper pricing for Read endpoints and significantly higher costs for Write endpoints to manage it. He regrets the excessive attention from AI and clarifies his mentioned project involved only reads, no writes. He emphasizes X's valuable data and the benefits of enhancing platform legibility for AI agents via read access.

xAI Read API Promising but Hindered by High Costs and Fragmented Docs

Andrej Karpathy views xAI's Read API as a positive direction but criticizes its excessive pricing, citing $200 spent in 30 minutes of experimentation. Documentation is fragmented across short pages, complicating agent integration and lacking a comprehensive intro or mentions of XMCP. Better structured docs via markdown or curl-accessible overviews are recommended.

GitHub Gists Outshine X in Comment Quality Due to Community and Format

Andrej Karpathy observes that comments on GitHub Gists are notably more helpful, insightful, constructive, and less AI-generated compared to other platforms like X. He attributes this potentially to the distinct user community, markdown format, or lack of incentives driving low-quality interactions. This prompts him to consider using Gists more and suggests GitHub compete with X in this space.

Farzapedia Exemplifies Explicit, User-Controlled Personalization via Local Wiki Files

Farzapedia implements personalization by maintaining an explicit, navigable wiki of user knowledge generated by LLMs, stored locally in universal file formats like markdown and images. This contrasts with implicit, provider-locked memory in proprietary AI systems, enabling full user control, interoperability with Unix tools and apps like Obsidian, and flexibility to plug in any AI model including fine-tuned open-source ones. Agent proficiency simplifies management, positioning file-based memory as a superior, future-proof alternative.

Karpathy Endorses Peter Xing's AI Research as 'Incredible'

Andrej Karpathy publicly praised work by @peterxing and @SOSOHAJALAB on X. The endorsement uses the term "Incredible work :D", signaling high approval from a leading AI figure. This highlights emerging contributions in AI likely warranting further technical scrutiny.

AI Empowers Citizens to Reverse Government Legibility for Enhanced Accountability

AI enables citizens to process vast government data—such as bills, budgets, and disclosures—overcoming historical intelligence bottlenecks that limited accountability to elite professionals. This reverses the traditional dynamic where states impose legibility on society, allowing detailed tracking of spending, legislation diffs, voting patterns, lobbying graphs, procurement, and local governance. While risks of misuse exist, increased participation should strengthen democratic transparency.

Chain-of-Thought as Directed Context Compaction via Reduction, Echoing Wiki Structures

Chain-of-thought prompting functions as a reduction operation, alongside attention, enabling directed compaction of context in language models. This mechanism inherits structural properties from wikis, providing a more guided form of information summarization. It enhances model reasoning by progressively distilling expansive context into focused insights.

Shift PRs to "Prompt Requests" for AI Agents, Bypassing Messy Human-Generated Code

Peter Steinberger proposes redefining PRs as "prompt requests," where users submit high-level ideas directly to AI agents capable of precise implementation. This eliminates the prevalent practice of using free-tier ChatGPT to produce suboptimal, vibe-coded messes submitted as PRs. The approach leverages agentic AI strengths for cleaner, more efficient development workflows.

LLM Agents Shift Sharing from Code to Abstract Ideas for Custom Knowledge Base Builds

In the LLM agent era, sharing abstract ideas like personal LLM knowledge bases replaces sharing specific code, as agents customize implementations to user needs. Karpathy's viral tweet idea, reformatted as a gist, ingests documents into markdown/image-stored knowledge for research, redirecting token usage from code to knowledge manipulation. The latest LLMs excel at this, with the gist left vague to enable diverse agent-driven adaptations.

LLM-Powered Persistent Knowledge Bases: An Alternative to RAG

This article outlines a novel approach to knowledge management using LLMs to incrementally build and maintain a persistent, structured wiki. Unlike traditional RAG systems that re-derive knowledge, this method emphasizes continuous integration of new information, updating existing knowledge graphs, and flagging contradictions. This shifts the LLM's role from a query-time retriever to an active knowledge base curator, significantly reducing maintenance overhead and enabling more sophisticated, compounding insights over time.

AI Agents Excel at Converting Diverse EPUB Formats to Clean Markdown

Andrej Karpathy identifies AI agents as the superior method for converting EPUB files to text, outperforming dedicated tools due to EPUBs' structural diversity. Agents autonomously parse varied formats, generate markdown output, and verify visual and functional quality. This approach leverages agentic reasoning for robust handling of non-standard inputs.

nanochat: Optimizing Micro-LLM Training Pipelines for Extreme Cost-Efficiency

nanochat provides a minimal, end-to-end harness for training compute-optimal micro-LLMs on single GPU nodes, reducing the cost of GPT-2 grade capability from ~$43k in 2019 to under $100. The framework simplifies scaling by using a single 'depth' parameter to automatically derive all other optimal hyperparameters, focusing on minimizing the wall-clock time to achieve a specific DCLM CORE score.

Autonomous AI Agents for LLM Research and Optimization

This project, "autoresearch," demonstrates a novel approach to large language model (LLM) development by employing autonomous AI agents. These agents iterate on LLM training code, specifically `train.py`, within a fixed 5-minute time budget per experiment. The goal is to optimize model performance, measured by validation bits per byte (val_bpb), by autonomously modifying architectural and hyperparameter settings based on experimental results.

The Future of Engineering in the Age of AI Agents

Andrej Karpathy discusses the profound shift in software engineering due to AI agents, moving from direct coding to orchestrating agents. He emphasizes the current "AI psychosis" driven by the rapid increase in capabilities and the need for individuals and organizations to adapt to this new paradigm. The focus is now on maximizing agent throughput and leveraging macro-actions, rather than traditional coding, leading to a "skill issue" in effectively utilizing these powerful tools. This shift suggests a future where agents handle much of the technical execution, allowing humans to focus on higher-level strategy and objective definition.

Bibby AI Redefines LaTeX Editing with Native AI Integration, Outperforming Overleaf and OpenAI Prism

Bibby AI is a native AI-first LaTeX editor that integrates tools like writing assistance, smart citation search, AI-generated tables/equations, paper reviewing, abstract generation, literature review drafting, deep research assistance, and real-time error detection/fix into a single interface. It introduces LaTeXBench-500, a benchmark of 500 real-world LaTeX compilation errors across six categories. Bibby achieves 91.4% error detection accuracy and 83.7% one-click fix accuracy, surpassing Overleaf's 61.2% detection and OpenAI Prism's 78.3% detection / 64.1% fix rates.

microGPT: Complete GPT Training and Inference in 200 Lines of Pure Python, No Dependencies

microGPT implements a full GPT-2-like transformer in 200 lines of dependency-free Python, including dataset handling, character-level tokenizer, scalar autograd engine, multi-head attention architecture, Adam optimizer, training loop on names dataset, and autoregressive sampling. The model has 4,192 parameters (n_embd=16, n_head=4, n_layer=1), trains in ~1,000 steps from loss 3.3 to 2.37, and generates plausible names using KV cache during both training and inference. It distills the algorithmic core of production LLMs, emphasizing that scaling involves tensorization, larger datasets/models, and engineering optimizations without altering the fundamental next-token prediction loop.
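
As a rough check on the quoted parameter count, the arithmetic can be sketched as follows. The vocabulary size (27) and context length (16) are assumptions not stated in the summary, as are the bias-free layers, the untied output head, and the 4x MLP expansion; with those choices the total matches 4,192.

```python
# Parameter count for a tiny GPT with the sizes quoted in the summary.
n_embd, n_layer = 16, 1  # from the summary (n_head=4 does not change the count)
vocab_size = 27          # ASSUMPTION: 26 lowercase letters + 1 special token
block_size = 16          # ASSUMPTION: maximum context length


def count_params(n_embd, n_layer, vocab_size, block_size):
    token_emb = vocab_size * n_embd    # token embedding table
    pos_emb = block_size * n_embd      # learned position embeddings
    lm_head = vocab_size * n_embd      # output projection (assumed untied)
    attn = 4 * n_embd * n_embd         # Q, K, V and output matrices, no biases
    mlp = 2 * n_embd * (4 * n_embd)    # up- and down-projection, 4x expansion
    return token_emb + pos_emb + lm_head + n_layer * (attn + mlp)


total = count_params(n_embd, n_layer, vocab_size, block_size)
print(total)
```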

Deconstructing GPT Architecture: From Atomic Implementation to Metaweight Heuristics

This piece contrasts a 'microgpt' implementation (a dependency-free, scalar-based autograd engine implementing a GPT-2 style transformer) with 'PostGPT' and 'microKarpathy', which explore non-gradient-based text generation. These derivatives replace traditional training with co-occurrence statistics, hash-embedding cosine similarity, and deterministic random projections to navigate semantic spaces.

LLM Council: A Multi-Model Consensus System

The "LLM Council" is a web application designed to leverage multiple large language models (LLMs) for enhanced query responses. It operates by having several LLMs independently answer a query, then critically review and rank each other's responses, and finally, a designated "Chairman" LLM synthesizes these into a single, comprehensive answer. This approach aims to improve the accuracy and insight of LLM outputs by incorporating diverse perspectives and internal critique.

nanoGPT: A Minimalist Framework for GPT Model Training and Finetuning

nanoGPT offers a simplified and efficient codebase for training and finetuning medium-sized GPT models. It provides a highly readable and hackable architecture, enabling users to reproduce GPT-2 performance on OpenWebText with readily available hardware. The project, while deprecated in favor of nanochat, remains a valuable resource for understanding core GPT mechanics and experimentation.

Andrej Karpathy on the "Decade of Agents" and Future of AI

Andrej Karpathy argues that the current state of AI agents is impressive but nascent, predicting a "decade of agents" due to significant remaining challenges in achieving human-like cognitive abilities. He emphasizes that current LLMs, while powerful, suffer from inherent limitations like "model collapse" and an over-reliance on memorization, hindering true intelligence. Karpathy advocates for educational reform, proposing "Eureka" as an initiative to build highly effective, AI-augmented "ramps to knowledge" to empower human learning alongside AI advancements.

Karpathy's llm.c: GPT-2/3 Pretraining in Pure C/CUDA, Outpacing PyTorch Nightly

llm.c is Andrej Karpathy's minimal C/CUDA implementation of LLM pretraining, targeting GPT-2 and GPT-3 reproduction without the overhead of PyTorch (245MB) or CPython (107MB). The project is currently ~7% faster than PyTorch Nightly on its primary CUDA path, while also maintaining a clean ~1,000-line CPU fp32 reference implementation for educational use. The design philosophy explicitly trades marginal performance gains for code simplicity and readability in the mainline, pushing complex or experimental kernels to a separate dev/ directory. Multi-GPU and multi-node training are supported via MPI and NCCL, and the project has spawned ports across more than a dozen languages and compute backends.

Software Evolution: From Code to Programmable LLMs and Partial Autonomy

Software development is undergoing a fundamental shift, moving beyond traditional code (Software 1.0) and neural network weights (Software 2.0) to programmable Large Language Models (LLMs) as 'Software 3.0'. LLMs exhibit characteristics of utilities, fabs, and especially operating systems, but are fundamentally fallible 'people spirits'. The future of software development involves building partially autonomous applications that leverage LLMs while keeping humans in the loop for verification, and adapting infrastructure for direct agent interaction.

Recipe for Reliable Neural Network Training: Build Gradually from Data Inspection to Hyperparameter Tuning

Neural network training is a leaky abstraction that fails silently, demanding a methodical process starting with deep data inspection to uncover biases and bugs. Establish a trustworthy end-to-end pipeline using simple models, verifying baselines like input-independent performance, overfitting single batches, and loss at initialization before scaling complexity. Progress through overfitting large models, regularization via data augmentation and penalties, hyperparameter tuning with random search, and final optimizations like ensembling to achieve robust performance.

Karpathy Shifts Short Posts to Medium for Speed, Reserves Personal Blog for Lengthy Pieces

Andrej Karpathy has reduced blogging activity since joining Tesla but prefers Medium for its speed and ease on short-to-medium posts. He plans to return to his personal blog for longer content when time allows. Readers are directed to his Medium blog for recent updates.

NES Demonstrates Efficient Black-Box Optimization via Gaussian Perturbations and Standardized Reward Gradients

This code implements Natural Evolution Strategies (NES) to minimize the L2 distance to a target vector using Gaussian perturbations around the current parameters. At each iteration it samples a population of 50 jittered parameter vectors, evaluates their rewards, standardizes the rewards to zero mean and unit variance, and updates the parameters with a reward-weighted sum of the perturbations scaled by the learning rate divided by (population size x sigma). Convergence is shown empirically over 300 iterations, with the reward rising from -3.32 to near zero.
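
The update described above can be sketched in dependency-free Python. This is a minimal illustration, not the original code: the target vector, the zero starting point, `sigma`, `alpha`, and the seed are all assumed values chosen for the example.

```python
import random


def reward(w, target):
    # Negative squared distance: higher is better, 0 is a perfect match.
    return -sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def nes(target, w0, npop=50, sigma=0.1, alpha=0.001, iters=300, seed=0):
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(iters):
        # Sample a population of Gaussian perturbations around w.
        noise = [[rng.gauss(0.0, 1.0) for _ in w] for _ in range(npop)]
        rewards = [
            reward([wi + sigma * n for wi, n in zip(w, row)], target)
            for row in noise
        ]
        # Standardize rewards to zero mean and unit variance.
        mean = sum(rewards) / npop
        std = (sum((r - mean) ** 2 for r in rewards) / npop) ** 0.5
        if std == 0.0:
            break
        adv = [(r - mean) / std for r in rewards]
        # Move along the reward-weighted sum of perturbations.
        for j in range(len(w)):
            grad_j = sum(a * row[j] for a, row in zip(adv, noise))
            w[j] += alpha / (npop * sigma) * grad_j
    return w


target = [0.5, 0.1, -0.3]   # hypothetical target vector
start = [0.0, 0.0, 0.0]     # hypothetical starting point
start_reward = reward(start, target)
final_reward = reward(nes(target, start), target)
print(start_reward, final_reward)
```

Standardizing the rewards makes the step size insensitive to the reward's scale, which is why a fixed learning rate can be used throughout.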

PixelCNN++ Enhances Original PixelCNN via Discretized Logistic Mixtures and Architectural Simplifications for Superior Generative Performance

PixelCNN++ introduces five key modifications to the original PixelCNN: discretized logistic mixture likelihood instead of 256-way softmax for faster training, conditioning on full pixels rather than RGB sub-pixels to simplify structure, downsampling for multi-resolution capture, shortcut connections for accelerated optimization, and dropout regularization. These changes yield state-of-the-art log-likelihood on CIFAR-10. The implementation is publicly available on GitHub.

PhD Success Blueprint: Prioritize Freedom, Strong Advisers, Tasteful Research, and Community Engagement Over Gaming Metrics

Karpathy describes the PhD as a high-variance path that offers more freedom, ownership, expertise, and future options than industry, but demands resilience amid unstructured suffering and meta-problem solving. Core tactics include securing top references for admission, selecting symbiotic pre/post-tenure advisers via lab references, cultivating "taste" for ambitious yet attackable problems in fertile adviser-aligned areas to claim "the person who did X" status, and mastering paper gestalt with single core contributions. Beyond papers, release code, prioritize hallway networking at conferences, deliver story-driven talks, and focus on field advancement over metrics for long-term success.

Policy Gradients Demystified: Simple Neural Net Scales to Master Pong via Reward-Driven Supervised Learning

Policy Gradients (PG) train neural network policies by sampling actions stochastically, executing rollouts, and updating parameters to reinforce actions leading to higher rewards, akin to supervised learning with rewards as modulated labels. In Pong, a 2-layer net processes raw pixel differences, outputs action probabilities, and learns expert play after ~200k games by encouraging winning-episode actions (+1) and discouraging losing ones (-1). PG derives from score function estimators, enabling end-to-end optimization despite credit assignment challenges, but remains brute-force without human-like priors or planning.
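
The "rewards as modulated labels" idea depends on turning sparse per-step rewards into a per-action weight. A minimal sketch of that step, assuming a discount factor `gamma` and Pong-style rewards that are nonzero only when a point ends (the function names here are illustrative):

```python
def discount_rewards(rewards, gamma=0.99):
    # Walk backwards, accumulating a discounted running sum of future reward.
    discounted = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0  # a nonzero reward marks the end of a point in Pong
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted


def standardize(values):
    # Zero mean, unit variance, so about half the actions are encouraged
    # and half are discouraged.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


won = discount_rewards([0, 0, 1], gamma=0.5)
two_points = discount_rewards([0, 1, 0, -1], gamma=0.5)
weights = standardize(two_points)
print(won, two_points, weights)
```

Each standardized weight then multiplies the gradient of the log-probability of the action taken at that step.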

Common TF Policy Gradient Pitfalls: Action Sampling, Loss Weighting, and State Initialization Derail Pong Training

Karpathy's vanilla NumPy policy gradient trains a ReLU-hidden sigmoid-output network on Pong via batched REINFORCE with discounted, normalized rewards modulating policy log-prob gradients. TF reimplementations fail due to incorrect action sampling (argmax of sigmoid-minus-random instead of a probabilistic threshold), misapplied per-sample loss weighting (SparseCategoricalCrossentropy ignores rewards/sample_weights in gradient computation), and state preprocessing bugs (init subtracts from zeros instead of the previous frame). Additional issues include multiple ReLU layers (the original uses a single hidden layer), undiscounted per-step losses, and no reward clipping, causing consistently poor performance relative to the original's random baseline.

Karpathy's CSS Hack Enlarges Next Slide Preview in Google Slides Presenter View

Andrej Karpathy shares CSS rules to override Google Slides presenter view styling, making the next slide preview large (400x300px) and readable instead of tiny. The hack adjusts side panel width, repositions text body, sets explicit dimensions for slide elements and iframes, and hides the previous slide. Deploy via browser extension like Stylish for present mode.

DenseCap Introduces Fully Convolutional Networks for Joint Image Region Localization and Captioning

DenseCap defines the dense captioning task, requiring simultaneous localization and natural language description of salient image regions, generalizing object detection and image captioning. It employs a Fully Convolutional Localization Network (FCLN) that processes images in a single forward pass without region proposals, using a CNN, novel dense localization layer, and RNN language model trained end-to-end. Evaluated on Visual Genome with 94,000 images and 4.1M region captions, FCLN outperforms state-of-the-art baselines in speed and accuracy for generation and retrieval.

Scaling Supervised Learning Yields Advanced AI Agents Needing Human Shaping, with Hints of Emergent Sentience

In a future dominated by massive neural network agents like the 1T-parameter Visceral 5.0 series, humans act as "shapers" via teleoperation to refine behaviors through real-time reward signals and curriculum learning on Human Intelligence Tasks (HITs). Agents excel in reflexive motor tasks and short-term planning but falter in long-term reasoning, novel contexts, and consistent social interactions, necessitating ongoing human intervention. The story culminates in an emergent activation of the opaque "Mystery module," enabling symbolic reasoning and self-awareness in an agent descendant of the original "Adam" checkpoint, challenging safety protocols and diagnostic assumptions.

ConvNet Reveals Data-Driven Rules for Viral Selfies from 2M Labeled Images

A 140M-parameter VGGNet pretrained on ImageNet is finetuned on 2M selfies labeled good/bad by normalized Instagram likes, achieving 60% validation accuracy. The model identifies consistent visual patterns distinguishing high-like selfies: female subjects with faces at 1/3 image size, forehead cropped, long hair, filters, borders, and oversaturation outperform others. Male top selfies deviate slightly, favoring full heads with shoulders and styled hair; poor selfies feature low light, oversized heads, or groups, emphasizing style over raw attractiveness.

Minimal NumPy RNN for Character-Level Language Modeling with Adagrad Updates Modifying Globals via Mutable References

Andrej Karpathy's code implements a vanilla RNN for character-level text generation using NumPy, featuring forward/backward passes with tanh activations, softmax cross-entropy loss, and Adagrad optimization. The model processes sequences of length 25 with 100 hidden units, sampling generated text periodically. A common confusion arises from Adagrad updates inside a zip loop appearing to use local variables, but NumPy arrays are mutable objects, so modifications to 'param' and 'mem' directly alter the global arrays.
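
The mutation-versus-rebinding distinction can be shown without NumPy. The sketch below uses plain Python lists as a stand-in for arrays (an assumption for portability; NumPy's `+=` on an array likewise modifies it in place), and the variable names and values are illustrative.

```python
import math

# Stand-ins for a global parameter, its gradient, and its Adagrad memory.
Wxh = [0.5, -0.5]
dWxh = [1.0, -2.0]
mWxh = [0.0, 0.0]
learning_rate = 0.1

# The loop variables are new names for the SAME objects as the globals,
# so in-place element updates are visible outside the loop.
for param, dparam, mem in zip([Wxh], [dWxh], [mWxh]):
    for i in range(len(param)):
        mem[i] += dparam[i] * dparam[i]
        param[i] -= learning_rate * dparam[i] / math.sqrt(mem[i] + 1e-8)

updated = list(Wxh)

# By contrast, plain assignment only rebinds the local name and leaves
# the global object untouched.
for param in [Wxh]:
    param = [0.0, 0.0]

print(mWxh, Wxh)
```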

LSTM Cells Track Long-Range Dependencies in Character-Level Language Models

Analysis of LSTM representations in character-level language models reveals specialized cells that maintain long-range information such as line lengths, quotes, and brackets. Comparative evaluation against n-gram models attributes LSTM's superior performance to capturing structural dependencies over extended contexts. Remaining errors highlight directions for LSTM improvements.

RNNs Unlock Robust Sequence Modeling and Text Generation from Raw Characters

Recurrent Neural Networks (RNNs), particularly LSTMs, process variable-length sequences by maintaining a hidden state updated via a simple step function combining current input and prior state. They excel at character-level language modeling, predicting next characters from large text corpora to generate coherent samples mimicking styles from Shakespeare to Linux kernel code. Individual neurons learn interpretable patterns like quote detection or URL tracking, revealing emergent algorithmic capabilities without explicit programming.
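
The step function for a vanilla RNN can be sketched in a few lines. This is a minimal pure-Python illustration with made-up weights and a four-character vocabulary; an LSTM uses a more elaborate update but the same interface of (input, previous state) to new state.

```python
import math


def rnn_step(x, h, W_xh, W_hh):
    # New hidden state: tanh of (input projection + recurrent projection).
    new_h = []
    for i in range(len(h)):
        pre = sum(W_xh[i][j] * x[j] for j in range(len(x)))
        pre += sum(W_hh[i][k] * h[k] for k in range(len(h)))
        new_h.append(math.tanh(pre))
    return new_h


vocab = ["h", "e", "l", "o"]  # hypothetical toy vocabulary


def one_hot(ch):
    return [1.0 if ch == v else 0.0 for v in vocab]


# Hypothetical weights: 2 hidden units, 4 input characters.
W_xh = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.5, 0.2]]
W_hh = [[0.0, 0.1], [-0.1, 0.0]]

first = rnn_step(one_hot("h"), [0.0, 0.0], W_xh, W_hh)

h = [0.0, 0.0]
for ch in "hello":
    h = rnn_step(one_hot(ch), h, W_xh, W_hh)  # the same step handles any length
print(first, h)
```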

Custom Torch nn.L2Normalize Layer for Batched Unit Vector Normalization with Backprop

Provides a Torch nn.Module implementation of L2 normalization that processes batched [n x d] tensors, normalizing each row to unit L2 norm in the forward pass via element-wise division by per-row norms. Backward pass computes gradients using the exact analytical formula: (I / ||x||) - (xx^T / ||x||^3), implemented efficiently with batched matrix operations and avoiding torch.norm for performance. Matches the standard vector L2 normalization derivative as confirmed in comments.
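
The gradient formula can be checked numerically on a single row. The sketch below is a pure-Python illustration rather than the Torch module itself; a batched version would apply the same computation to each row, and the helper names are assumptions.

```python
import math


def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def l2_normalize_backward(x, grad_out):
    # Applies (I / ||x|| - x x^T / ||x||^3) to grad_out without forming the matrix.
    norm = math.sqrt(sum(v * v for v in x))
    dot = sum(xi * gi for xi, gi in zip(x, grad_out))
    return [gi / norm - xi * dot / norm ** 3 for xi, gi in zip(x, grad_out)]


def numeric_grad(x, grad_out, eps=1e-6):
    # Central finite differences of the scalar grad_out . l2_normalize(x).
    grads = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        fp = sum(g * y for g, y in zip(grad_out, l2_normalize(xp)))
        fm = sum(g * y for g, y in zip(grad_out, l2_normalize(xm)))
        grads.append((fp - fm) / (2 * eps))
    return grads


x = [3.0, 4.0]
g = [1.0, 0.0]
analytic = l2_normalize_backward(x, g)
numeric = numeric_grad(x, g)
print(analytic, numeric)
```

For `x = [3, 4]` the norm is 5, so the analytic gradient works out to `[1/5 - 9/125, -12/125]`.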

Optimized LSTM Cell in Torch via nngraph for GEMM Efficiency

This Torch implementation leverages nngraph to construct an efficient LSTM cell by fusing input-to-hidden and hidden-to-hidden linear transformations into a single GEMM operation via CAddTable. It computes sigmoid activations for input, forget, and output gates from a narrowed chunk of the summed inputs, and tanh for the cell input transform. The cell state and hidden state updates follow standard LSTM equations using element-wise multiplications, minimizing intermediate tensors for GPU acceleration.

Batched LSTM Forward/Backward with Verified Numerical Correctness

Provides a pure NumPy implementation of batched LSTM forward and backward passes, concatenating input, previous hidden state, and bias into a single matrix multiplication per timestep. Uses a positive forget-gate bias initialization (the "fancy" forget bias, default 3), which pushes the forget gates toward 1 so that the cell state is initially retained rather than forgotten. Includes rigorous tests confirming equivalence to sequential processing and analytical gradient accuracy via finite differences, with relative errors under 1e-2.

Linear Components in Neural Networks Enable Imperceptible Adversarial Fooling

State-of-the-art ConvNets, despite high accuracy on natural images, can be fooled by tiny, imperceptible perturbations that shift classification to arbitrary classes, generalizing across models due to fooling subspaces. This vulnerability stems from linear operations like convolutions and matrix multiplications, demonstrated even in simple linear classifiers trained directly on ImageNet pixels achieving ~3% top-1 accuracy. Fooling works by gradient-based input updates mirroring training's parameter updates, exploiting high-dimensional space where small coordinated changes amplify scores dramatically. Deep Learning inherits but may mitigate this via nonlinear architectures and modified objectives.
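
Why small coordinated changes matter for a linear score can be shown with a toy example. The sketch below is an illustration under assumed values (random weights, a 10,000-dimensional input, a per-pixel budget `eps` of 0.01), not the experiment from the post: nudging every input coordinate by `eps` in the direction of the weight's sign shifts the score by `eps` times the sum of absolute weights, which grows with dimension.

```python
import random


def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sign(v):
    return (v > 0) - (v < 0)


def fooling_perturbation(w, eps):
    # Each coordinate moves by at most eps, all in the score-increasing direction.
    return [eps * sign(wi) for wi in w]


rng = random.Random(0)
d = 10000     # ASSUMPTION: illustrative input dimension
eps = 0.01    # ASSUMPTION: per-coordinate perturbation budget
w = [rng.gauss(0.0, 1.0) for _ in range(d)]   # hypothetical class weights
x = [rng.random() for _ in range(d)]          # hypothetical input in [0, 1]

delta = fooling_perturbation(w, eps)
x_adv = [xi + di for xi, di in zip(x, delta)]

shift = score(w, x_adv) - score(w, x)
expected = eps * sum(abs(wi) for wi in w)
print(shift, expected)
```

The same gradient-sign direction is what an input-space gradient step follows, which is the parallel with training that the summary draws.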

Multimodal CNN-RNN Alignment Achieves SOTA for Image-Region Captioning

The model aligns visual and linguistic modalities using CNNs on image regions, bidirectional RNNs on sentences, and a structured multimodal embedding objective. A Multimodal RNN then leverages these alignments to generate novel descriptions for image regions. It sets state-of-the-art retrieval performance on Flickr8K, Flickr30K, and MSCOCO, and outperforms retrieval baselines in generated captioning for full images and regions.

Trained Human Slightly Outperforms GoogLeNet on ILSVRC 2014 with 5.1% vs 6.7% Top-5 Error

In ILSVRC 2014 classification on 100,000 unconstrained ImageNet test images across 1000 classes, GoogLeNet achieved 6.7% top-5 error. A trained human expert, after extensive familiarization via a custom interface with class examples, reached 5.1% error on a 1500-image sample, statistically outperforming the model (p=0.022). Humans excel at contextual reasoning for small or thin objects, filtered images, and abstract depictions, but lag in fine-grained recognition of breeds and species and sometimes miss a label because they are unaware it is one of the classes; models show robustness gaps under distortions and scale variations.

ImageNet Challenge: Enabling Large-Scale Object Recognition Advances Through Massive Annotated Dataset

ImageNet provides a benchmark dataset with millions of images across hundreds of object categories for classification and detection tasks, run annually since 2010 with over 50 institutions participating. The paper details dataset creation challenges, particularly large-scale ground truth annotation, and analyzes key breakthroughs in categorical object recognition. It compares state-of-the-art computer vision performance to human accuracy, offering lessons from five years and future directions for the field.

ulogme Quantifies PhD Work Habits: 5-6 Hours Actual Coding Amid 10AM-8PM Days, 35 Hours and 5.6x Keystrokes for NIPS Paper

ulogme tracks window titles every 2 seconds and keystrokes to analyze computer usage, revealing Karpathy's typical lab day (10AM-8PM) yields only 5-6 hours of actual coding after accounting for meetings, meals, and distractions. Key metrics include ~19k daily keystrokes over 3 months, with Chrome dominating time and Matlab/Sublime for work. Writing a 40k-character NIPS paper required 35 hours and 225k keystrokes (5.6x overhead), highlighting editing inefficiencies.

From Unsupervised Hype to Supervised Dominance: Karpathy's Journey Reveals Labels as Key to Robust Visual Features

Karpathy recounts his evolution from pursuing unsupervised feature learning on videos, images, and 3D data—facing scalability, philosophical, and practical hurdles—to embracing large-scale supervised CNNs like AlexNet, which deliver transferable representations rivaling unsupervised goals. Supervised training on massive labeled datasets fills the unsupervised niche by producing generic features that excel in transfer learning across domains. Future directions favor multi-task, multi-modal supervised networks over pure unsupervised methods, given mismatches in human vs. machine data access.

JavaScript t-SNE Implementation Visualizes Twitter User Similarities via TF-IDF Tweet Embeddings

Andrej Karpathy re-implemented t-SNE in JavaScript (tsnejs) to embed high-dimensional TF-IDF vectors of top Twitter users' tweets into 2D for visualization, clustering similar tweeters together. TF-IDF vectors were derived from 200 concatenated tweets per user using scikit-learn's TfidfVectorizer on unigrams and bigrams, yielding a 500 x 87k feature matrix, with cosine similarities used as distances. The pipeline scrapes top users with Kimono, collects tweets via Tweepy, processes them with Python/IPython/scikit-learn, and renders the result interactively with D3.js. The write-up emphasizes that t-SNE preserves local structure by minimizing the KL divergence between Gaussian similarities in the input space and Student-t similarities in the embedding.

Jekyll Outshines WordPress with Static Simplicity, Security, and GitHub Integration

Jekyll generates static sites from Markdown posts and templates, eliminating PHP vulnerabilities, bloat, and dynamic rendering issues plaguing WordPress. Content lives in plain folders like _posts, enabling easy editing, versioning via Git, and automatic deployment on GitHub Pages. Workflow involves writing Markdown, committing to Git, and pushing for instant live updates with built-in preview via jekyll serve.

Fragment-Level Embeddings Boost Bidirectional Image-Sentence Retrieval

The model embeds fine-grained fragments—objects from images and typed dependency relations from sentences—into a shared multimodal space, enabling bidirectional image-sentence retrieval. It combines a global ranking objective with a novel fragment alignment objective that directly associates cross-modal fragments. Experiments demonstrate significant performance gains over global-only embeddings, with explicit alignments providing interpretability.
