absorb.md

Andrej Karpathy

Chronological feed of everything captured from Andrej Karpathy.

Human Fat as the Primary Energy Battery, Discharged by Exhaling Carbon as CO2

Human energy metabolism relies on four hierarchical batteries—phosphocreatine, glycogen (~2,000 kcal), adipose fat (e.g., ~162,000 kcal for 40 lb), and lean mass—all converging on ATP synthesis from ADP via oxidative phosphorylation. Weight loss occurs primarily through exhalation of fat-derived CO2 (84% of the mass) and H2O, driven by a sustained caloric deficit in which energy output (BMR of ~1,800-2,000 kcal/day plus activity) exceeds caloric intake from macronutrients. Karpathy's experiment validated a ~500 kcal/day deficit yielding ~0.6 lb/week of loss (195 lb to 165 lb over a year), slower than the expected 1 lb/week due to factors like muscle gain, hydration, and measurement inaccuracies in BIA/DEXA.
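The deficit arithmetic in the summary can be checked in a few lines, using the common ~3,500 kcal-per-pound rule of thumb for body fat (an assumption for illustration, not a figure from the post):

```python
# Rough weight-loss arithmetic under a sustained caloric deficit.
KCAL_PER_LB_FAT = 3500          # rule-of-thumb energy density of body fat

def expected_loss_lb_per_week(deficit_kcal_per_day):
    """Expected fat loss per week for a given daily caloric deficit."""
    return deficit_kcal_per_day * 7 / KCAL_PER_LB_FAT

# A ~500 kcal/day deficit predicts ~1 lb/week...
predicted = expected_loss_lb_per_week(500)

# ...but the observed rate was ~30 lb over a year, i.e. ~0.6 lb/week; the gap
# is plausibly muscle gain, hydration, and measurement noise, as noted above.
observed = 30 / 52
```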

Recipe for Reliable Neural Network Training: Build Gradually from Data Inspection to Hyperparameter Tuning

Neural network training is a leaky abstraction that fails silently, demanding a methodical process starting with deep data inspection to uncover biases and bugs. Establish a trustworthy end-to-end pipeline using simple models, verifying baselines like input-independent performance, overfitting single batches, and loss at initialization before scaling complexity. Progress through overfitting large models, regularization via data augmentation and penalties, hyperparameter tuning with random search, and final optimizations like ensembling to achieve robust performance.
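Two of the baseline checks — verifying the loss at initialization and overfitting a single batch — can be sketched with a tiny NumPy softmax classifier (all sizes and names here are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, batch = 10, 20, 32

# Tiny linear softmax classifier with small random weights.
W = rng.normal(scale=0.01, size=(dim, num_classes))
x = rng.normal(size=(batch, dim))
y = rng.integers(num_classes, size=batch)

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def loss(W):
    p = softmax(x @ W)
    return -np.log(p[np.arange(batch), y]).mean()

# Check 1: with near-zero weights, loss at init should be ~ln(C) for C classes.
init_loss = loss(W)                       # expect roughly ln(10) ~ 2.30

# Check 2: a healthy pipeline can drive the loss on one small batch way down
# with plain gradient descent.
for _ in range(500):
    p = softmax(x @ W)
    p[np.arange(batch), y] -= 1           # dL/dlogits for softmax cross-entropy
    W -= 0.5 * (x.T @ p) / batch          # gradient step
final_loss = loss(W)
```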

Karpathy Shifts Short Posts to Medium for Speed, Reserves Personal Blog for Lengthy Pieces

Andrej Karpathy has reduced blogging activity since joining Tesla but prefers Medium for short-to-medium posts because of its speed and ease of use. He plans to return to his personal blog for longer content when time allows. Readers are directed to his Medium blog for recent updates.

NES Demonstrates Efficient Black-Box Optimization via Gaussian Perturbations and Standardized Reward Gradients

This code implements Natural Evolution Strategies (NES) to minimize the L2 distance to a target vector using a Gaussian perturbation around current parameters. It samples a population of 50 jittered parameter vectors, evaluates their rewards, standardizes rewards to zero mean/unit variance, and updates parameters via a weighted average of perturbations scaled by learning rate over population size and sigma. Convergence is demonstrated empirically over 300 iterations, with the reward improving from about -3.32 to near zero.
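The loop the summary describes fits in a few lines of NumPy; the target vector and hyperparameters below follow the values quoted above:

```python
import numpy as np

np.random.seed(0)
solution = np.array([0.5, 0.1, -0.3])        # the target vector

def reward(w):
    # Negative L2 distance to the target; maximized (at 0) when w == solution.
    return -np.sum((w - solution) ** 2)

npop, sigma, alpha = 50, 0.1, 0.001          # population, noise std, learning rate
w = np.random.randn(3)                        # random initial parameter vector

for i in range(300):
    N = np.random.randn(npop, 3)              # Gaussian perturbations
    R = np.array([reward(w + sigma * n) for n in N])
    A = (R - np.mean(R)) / np.std(R)          # standardize rewards
    # Weighted average of perturbations estimates the reward gradient.
    w += alpha / (npop * sigma) * N.T @ A
```

With this seed the initial reward is about -3.32, matching the figure above, and after 300 iterations `w` sits close to the target.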

PixelCNN++ Enhances Original PixelCNN via Discretized Logistic Mixtures and Architectural Simplifications for Superior Generative Performance

PixelCNN++ introduces five key modifications to the original PixelCNN: discretized logistic mixture likelihood instead of 256-way softmax for faster training, conditioning on full pixels rather than RGB sub-pixels to simplify structure, downsampling for multi-resolution capture, shortcut connections for accelerated optimization, and dropout regularization. These changes yield state-of-the-art log-likelihood on CIFAR-10. The implementation is publicly available on GitHub.

PhD Success Blueprint: Prioritize Freedom, Strong Advisers, Tasteful Research, and Community Engagement Over Gaming Metrics

Karpathy outlines the PhD as a high-variance path offering freedom, ownership, expertise, and future options compared to industry, but demands resilience amid unstructured suffering and meta-problem solving. Core tactics include securing top references for admission, selecting symbiotic pre/post-tenure advisers via lab references, cultivating "taste" for ambitious yet attackable problems in fertile adviser-aligned areas to claim "the person who did X" status, and mastering paper gestalt with single core contributions. Beyond papers, release code, prioritize hallway networking at conferences, deliver story-driven talks, and focus on field advancement over metrics for long-term success.

Policy Gradients Demystified: Simple Neural Net Scales to Master Pong via Reward-Driven Supervised Learning

Policy Gradients (PG) train neural network policies by sampling actions stochastically, executing rollouts, and updating parameters to reinforce actions leading to higher rewards, akin to supervised learning with rewards as modulated labels. In Pong, a 2-layer net processes raw pixel differences, outputs action probabilities, and learns expert play after ~200k games by encouraging winning-episode actions (+1) and discouraging losing ones (-1). PG derives from score function estimators, enabling end-to-end optimization despite credit assignment challenges, but remains brute-force without human-like priors or planning.
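A minimal sketch of the mechanics — sample actions from the policy, compute discounted returns, and scale log-probability gradients by them (sizes and the toy episode are illustrative, not Karpathy's Pong code):

```python
import numpy as np

rng = np.random.default_rng(0)

def discount_rewards(r, gamma=0.99):
    """Backward accumulation of discounted returns, as in REINFORCE."""
    out = np.zeros_like(r, dtype=float)
    running = 0.0
    for t in reversed(range(len(r))):
        running = running * gamma + r[t]
        out[t] = running
    return out

# Toy policy: single sigmoid unit over a feature vector.
D = 4
w = rng.normal(scale=0.1, size=D)

def policy(x):
    return 1.0 / (1.0 + np.exp(-x @ w))        # P(action = UP)

# One fake episode: states, sampled actions, and per-step rewards.
xs = rng.normal(size=(5, D))
ps = np.array([policy(x) for x in xs])
acts = (rng.random(5) < ps).astype(float)      # sample from the policy, not argmax
rews = np.array([0, 0, 0, 0, 1.0])             # reward arrives only at the end

# REINFORCE: grad log p(action) scaled by the discounted return
# (in the real code the returns are also standardized).
advantages = discount_rewards(rews)
grad = ((acts - ps) * advantages) @ xs         # d log p / dw for a sigmoid policy
w += 1e-2 * grad                               # ascend expected reward
```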

Common TF Policy Gradient Pitfalls: Action Sampling, Loss Weighting, and State Initialization Derail Pong Training

Karpathy's vanilla NumPy policy gradient trains a ReLU-hidden sigmoid-output network on Pong via batched REINFORCE with discounted, normalized rewards modulating policy log-prob gradients. TF reimplementations fail due to incorrect action sampling (argmax of sigmoid-minus-random instead of a probabilistic threshold), misapplied per-sample loss weighting (SparseCategoricalCrossentropy ignores rewards/sample_weights in gradient computation), and state preprocessing bugs (init subtracts from zeros instead of the previous frame). Additional issues include multi-layer ReLU (vs the original single layer), undiscounted per-step losses, and no reward clipping, causing consistently poor performance compared with the original implementation.
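The loss-weighting pitfall can be seen in miniature: the policy-gradient loss is a cross-entropy whose per-sample terms are scaled by the advantage, and dropping that scaling removes the reward from the gradient entirely (all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

p = rng.uniform(0.05, 0.95, size=6)        # P(UP) at six sampled steps
a = (rng.random(6) < p).astype(float)      # actions sampled from the policy, not argmax
adv = rng.normal(size=6)                   # discounted, normalized returns

# Correct per-sample loss: -advantage * log p(sampled action).
logp = a * np.log(p) + (1 - a) * np.log(1 - p)
pg_loss = -(adv * logp).mean()

# Gradient w.r.t. the logit z (where p = sigmoid(z)): the advantage
# scales each sample's contribution.
grad_z = -(adv * (a - p)) / len(p)

# The common TF mistake: an unweighted cross-entropy drops adv entirely,
# so the reward never enters the gradient.
grad_z_buggy = -(a - p) / len(p)
```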

Karpathy's CSS Hack Enlarges Next Slide Preview in Google Slides Presenter View

Andrej Karpathy shares CSS rules to override Google Slides presenter view styling, making the next slide preview large (400x300px) and readable instead of tiny. The hack adjusts side panel width, repositions text body, sets explicit dimensions for slide elements and iframes, and hides the previous slide. Deploy via browser extension like Stylish for present mode.

DenseCap Introduces Fully Convolutional Networks for Joint Image Region Localization and Captioning

DenseCap defines the dense captioning task, requiring simultaneous localization and natural language description of salient image regions, generalizing object detection and image captioning. It employs a Fully Convolutional Localization Network (FCLN) that processes images in a single forward pass without region proposals, using a CNN, novel dense localization layer, and RNN language model trained end-to-end. Evaluated on Visual Genome with 94,000 images and 4.1M region captions, FCLN outperforms state-of-the-art baselines in speed and accuracy for generation and retrieval.

Scaling Supervised Learning Yields Advanced AI Agents Needing Human Shaping, with Hints of Emergent Sentience

In a future dominated by massive neural network agents like the 1T-parameter Visceral 5.0 series, humans act as "shapers" via teleoperation to refine behaviors through real-time reward signals and curriculum learning on Human Intelligence Tasks (HITs). Agents excel in reflexive motor tasks and short-term planning but falter in long-term reasoning, novel contexts, and consistent social interactions, necessitating ongoing human intervention. The story culminates in an emergent activation of the opaque "Mystery module," enabling symbolic reasoning and self-awareness in an agent descendant of the original "Adam" checkpoint, challenging safety protocols and diagnostic assumptions.

ConvNet Reveals Data-Driven Rules for Viral Selfies from 2M Labeled Images

A 140M-parameter VGGNet pretrained on ImageNet is finetuned on 2M selfies labeled good/bad by normalized Instagram likes, achieving 60% validation accuracy. The model identifies consistent visual patterns distinguishing high-like selfies: female subjects with faces at 1/3 image size, forehead cropped, long hair, filters, borders, and oversaturation outperform others. Male top selfies deviate slightly, favoring full heads with shoulders and styled hair; poor selfies feature low light, oversized heads, or groups, emphasizing style over raw attractiveness.

Minimal NumPy RNN for Character-Level Language Modeling with Adagrad Updates Modifying Globals via Mutable References

Andrej Karpathy's code implements a vanilla RNN for character-level text generation using NumPy, featuring forward/backward passes with tanh activations, softmax cross-entropy loss, and Adagrad optimization. The model processes sequences of length 25 with 100 hidden units, sampling generated text periodically. A common confusion arises from Adagrad updates inside a zip loop appearing to use local variables, but NumPy arrays are mutable objects, so modifications to 'param' and 'mem' directly alter the global arrays.
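The mutability point can be demonstrated on a toy array, mirroring the shape of the Adagrad loop in the original code:

```python
import numpy as np

# `param` and `mem` below are references to the same arrays held in the
# enclosing lists, so in-place operations (`+=`) mutate the "global" arrays.
Wxh = np.ones((2, 2))                   # a parameter array
mWxh = np.zeros((2, 2))                 # its Adagrad memory
dWxh = np.full((2, 2), 0.5)             # its gradient
learning_rate = 0.1

for param, dparam, mem in zip([Wxh], [dWxh], [mWxh]):
    mem += dparam * dparam                                   # in place: updates mWxh
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8)   # in place: updates Wxh

# Wxh and mWxh have changed, even though the loop body only named param/mem.
```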

LSTM Cells Track Long-Range Dependencies in Character-Level Language Models

Analysis of LSTM representations in character-level language models reveals specialized cells that maintain long-range information such as line lengths, quotes, and brackets. Comparative evaluation against n-gram models attributes LSTM's superior performance to capturing structural dependencies over extended contexts. Remaining errors highlight directions for LSTM improvements.

RNNs Unlock Robust Sequence Modeling and Text Generation from Raw Characters

Recurrent Neural Networks (RNNs), particularly LSTMs, process variable-length sequences by maintaining a hidden state updated via a simple step function combining current input and prior state. They excel at character-level language modeling, predicting next characters from large text corpora to generate coherent samples mimicking styles from Shakespeare to Linux kernel code. Individual neurons learn interpretable patterns like quote detection or URL tracking, revealing emergent algorithmic capabilities without explicit programming.
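The step function the summary describes is a few lines; a sketch of a single character-level step in NumPy (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab = 8, 5                       # hidden units, vocabulary size

# Parameters of a vanilla RNN step: h' = tanh(Wxh @ x + Whh @ h)
Wxh = rng.normal(scale=0.1, size=(hidden, vocab))
Whh = rng.normal(scale=0.1, size=(hidden, hidden))
Why = rng.normal(scale=0.1, size=(vocab, hidden))

def step(h, char_index):
    """One RNN step: combine current input with prior state, emit next-char probs."""
    x = np.zeros(vocab)
    x[char_index] = 1                      # one-hot character encoding
    h = np.tanh(Wxh @ x + Whh @ h)         # new hidden state
    y = Why @ h                            # unnormalized next-character scores
    p = np.exp(y) / np.exp(y).sum()        # softmax over the vocabulary
    return h, p

h = np.zeros(hidden)
for c in [0, 3, 1]:                        # feed a short character sequence
    h, p = step(h, c)
```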

Custom Torch nn.L2Normalize Layer for Batched Unit Vector Normalization with Backprop

Provides a Torch nn.Module implementation of L2 normalization that processes batched [n x d] tensors, normalizing each row to unit L2 norm in the forward pass via element-wise division by per-row norms. Backward pass computes gradients using the exact analytical formula: (I / ||x||) - (xx^T / ||x||^3), implemented efficiently with batched matrix operations and avoiding torch.norm for performance. Matches the standard vector L2 normalization derivative as confirmed in comments.
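A NumPy rendering of the same math, with the analytical backward formula checked against finite differences (a sketch of the layer's computation, not the Torch code itself):

```python
import numpy as np

def l2norm_forward(X):
    """Row-wise L2 normalization of an [n x d] batch."""
    norms = np.sqrt((X * X).sum(axis=1, keepdims=True))
    return X / norms

def l2norm_backward(X, dOut):
    """Backprop via dX = (I/||x|| - x x^T / ||x||^3) @ dOut, batched per row."""
    norms = np.sqrt((X * X).sum(axis=1, keepdims=True))
    # (I/||x||) dOut  -  x * (x . dOut) / ||x||^3, without materializing x x^T
    return dOut / norms - X * (X * dOut).sum(axis=1, keepdims=True) / norms ** 3

# Finite-difference check on a small batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
dOut = rng.normal(size=(3, 4))
dX = l2norm_backward(X, dOut)

eps = 1e-6
num = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        num[i, j] = ((l2norm_forward(Xp) - l2norm_forward(Xm)) * dOut).sum() / (2 * eps)
```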

Optimized LSTM Cell in Torch via nngraph for GEMM Efficiency

This Torch implementation leverages nngraph to construct an efficient LSTM cell by fusing input-to-hidden and hidden-to-hidden linear transformations into a single GEMM operation via CAddTable. It computes sigmoid activations for input, forget, and output gates from a narrowed chunk of the summed inputs, and tanh for the cell input transform. The cell state and hidden state updates follow standard LSTM equations using element-wise multiplications, minimizing intermediate tensors for GPU acceleration.
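In NumPy, the fused structure looks like this — each stream contributes one GEMM producing all 4H gate pre-activations, which are summed once and then sliced (a sketch of the structure, not the Torch/nngraph code):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3                                  # input and hidden sizes (illustrative)

# One weight matrix per stream; each GEMM yields all 4*H gate pre-activations.
W_ih = rng.normal(scale=0.1, size=(4 * H, D))
W_hh = rng.normal(scale=0.1, size=(4 * H, H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev):
    z = W_ih @ x + W_hh @ h_prev             # two GEMMs summed once (the CAddTable)
    i = sigmoid(z[0*H:1*H])                  # input gate
    f = sigmoid(z[1*H:2*H])                  # forget gate
    o = sigmoid(z[2*H:3*H])                  # output gate
    g = np.tanh(z[3*H:4*H])                  # cell input transform
    c = f * c_prev + i * g                   # standard cell-state update
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(H), np.zeros(H)
h, c = lstm_cell(rng.normal(size=D), h, c)
```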

Batched LSTM Forward/Backward with Verified Numerical Correctness

Provides a pure NumPy implementation of batched LSTM forward and backward passes, concatenating input, previous hidden state, and bias into a single matrix multiplication per timestep. Uses fancy forget gate bias initialization (default 3) to encourage initial forgetting. Includes rigorous tests confirming equivalence to sequential processing and analytical gradient accuracy via finite differences, with relative errors under 1e-2.
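The concatenation trick and the forget-gate bias can be sketched as follows (gate ordering and sizes are illustrative; the real implementation also handles full sequences and the backward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, H = 2, 4, 3                          # batch, input, hidden sizes

# One weight matrix over [bias, x, h_prev]: each timestep is a single matmul.
WLSTM = rng.normal(scale=0.1, size=(1 + D + H, 4 * H))
WLSTM[0, H:2*H] = 3.0                      # forget-gate bias init (default 3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(X, h_prev, c_prev):
    Hin = np.hstack([np.ones((B, 1)), X, h_prev])    # [B x (1+D+H)]
    z = Hin @ WLSTM                                   # one GEMM for all four gates
    i = sigmoid(z[:, 0*H:1*H])
    f = sigmoid(z[:, 1*H:2*H])                        # biased toward 1 at init
    o = sigmoid(z[:, 2*H:3*H])
    g = np.tanh(z[:, 3*H:])
    c = f * c_prev + i * g
    return o * np.tanh(c), c, f

X0 = rng.normal(size=(B, D))
h, c, f = lstm_step(X0, np.zeros((B, H)), np.zeros((B, H)))
```

Returning `f` here is just for inspection: the bias of 3 starts the forget gate near 1, so early in training the cell state is carried forward rather than erased.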

Linear Components in Neural Networks Enable Imperceptible Adversarial Fooling

State-of-the-art ConvNets, despite high accuracy on natural images, can be fooled by tiny, imperceptible perturbations that shift classification to arbitrary classes, generalizing across models due to fooling subspaces. This vulnerability stems from linear operations like convolutions and matrix multiplications, demonstrated even in simple linear classifiers trained directly on ImageNet pixels achieving ~3% top-1 accuracy. Fooling works by gradient-based input updates mirroring training's parameter updates, exploiting high-dimensional space where small coordinated changes amplify scores dramatically. Deep Learning inherits but may mitigate this via nonlinear architectures and modified objectives.
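The gradient-on-input idea can be shown with a toy linear classifier (randomly weighted here, so the required perturbation is not tiny; with a real trained high-dimensional net, far smaller coordinated changes suffice). Because the score is linear in the input, its gradient with respect to x is simply a row of W, and stepping along it steers the prediction toward an arbitrary class:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 100, 10                              # input dim and class count (illustrative)

W = rng.normal(scale=0.1, size=(C, D))      # stand-in for a trained linear classifier
x = rng.normal(size=D)

def predict(v):
    return int(np.argmax(W @ v))

orig = predict(x)
target = (orig + 1) % C                     # pick an arbitrary wrong class

# Mirror training, but update the *input*: the gradient of
# (score_target - score_orig) w.r.t. x is simply W[target] - W[orig].
g = W[target] - W[orig]
g /= np.linalg.norm(g)

x_adv = x.copy()
for _ in range(200):
    if predict(x_adv) == target:
        break
    x_adv += 0.1 * g                        # small coordinated step along the gradient
```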

Multimodal CNN-RNN Alignment Achieves SOTA for Image-Region Captioning

The model aligns visual and linguistic modalities using CNNs on image regions, bidirectional RNNs on sentences, and a structured multimodal embedding objective. A Multimodal RNN then leverages these alignments to generate novel descriptions for image regions. It sets state-of-the-art retrieval performance on Flickr8K, Flickr30K, and MSCOCO, and outperforms retrieval baselines in generated captioning for full images and regions.

Trained Human Slightly Outperforms GoogLeNet on ILSVRC 2014 with 5.1% vs 6.7% Top-5 Error

In ILSVRC 2014 classification on 100,000 unconstrained ImageNet test images across 1000 classes, GoogLeNet achieved 6.7% top-5 error. A trained human expert, after extensive familiarization via a custom interface with class examples, reached 5.1% error on a 1500-image sample, statistically outperforming the model (p=0.022). Humans excel in contextual reasoning for small/thin objects, filters, and abstract depictions but lag in fine-grained recognition of breeds and species and suffer from unawareness of some classes; models show robustness gaps in distortions and scale variations.

ImageNet Challenge: Enabling Large-Scale Object Recognition Advances Through Massive Annotated Dataset

ImageNet provides a benchmark dataset with millions of images across hundreds of object categories for classification and detection tasks, run annually since 2010 with over 50 institutions participating. The paper details dataset creation challenges, particularly large-scale ground truth annotation, and analyzes key breakthroughs in categorical object recognition. It compares state-of-the-art computer vision performance to human accuracy, offering lessons from five years and future directions for the field.

ulogme Quantifies PhD Work Habits: 5-6 Hours of Actual Coding in 10AM-8PM Days, 35 Hours and 5.6x Keystroke Overhead for a NIPS Paper

ulogme tracks window titles every 2 seconds and keystrokes to analyze computer usage, revealing Karpathy's typical lab day (10AM-8PM) yields only 5-6 hours of actual coding after accounting for meetings, meals, and distractions. Key metrics include ~19k daily keystrokes over 3 months, with Chrome dominating overall time and Matlab/Sublime dominating work time. Writing a 40k-character NIPS paper required 35 hours and 225k keystrokes (5.6x overhead), highlighting editing inefficiencies.

From Unsupervised Hype to Supervised Dominance: Karpathy's Journey Reveals Labels as Key to Robust Visual Features

Karpathy recounts his evolution from pursuing unsupervised feature learning on videos, images, and 3D data—facing scalability, philosophical, and practical hurdles—to embracing large-scale supervised CNNs like AlexNet, which deliver transferable representations rivaling unsupervised goals. Supervised training on massive labeled datasets fills the unsupervised niche by producing generic features that excel in transfer learning across domains. Future directions favor multi-task, multi-modal supervised networks over pure unsupervised methods, given mismatches in human vs. machine data access.

JavaScript t-SNE Implementation Visualizes Twitter User Similarities via TF-IDF Tweet Embeddings

Andrej Karpathy re-implemented t-SNE in JavaScript (tsnejs) to embed high-dimensional TF-IDF vectors of top Twitter users' tweets into 2D for visualization, clustering similar tweeters together. TF-IDF vectors were derived from 200 concatenated tweets per user using scikit-learn's TfidfVectorizer on unigrams and bigrams, yielding a 500x87k feature matrix with cosine similarities as distances. The pipeline scrapes top users with Kimono, collects tweets via Tweepy, processes with Python/IPython/scikit-learn, and renders interactively with D3.js, emphasizing t-SNE's preservation of local structure via a KL divergence between Gaussian similarities in the input space and Student-t similarities in the embedding.
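The vectorization step can be sketched without scikit-learn; below is a toy TF-IDF (using TfidfVectorizer's default smoothed-IDF formula) and the cosine similarities fed to t-SNE. The three "documents" stand in for per-user concatenated tweets:

```python
import numpy as np

# Toy corpus standing in for per-user concatenated tweets (illustrative text).
docs = [
    "deep learning neural nets",
    "neural nets and deep nets",
    "stock market trading tips",
]

# Build a unigram vocabulary (the real pipeline also used bigrams).
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: j for j, w in enumerate(vocab)}

# Term-frequency matrix [n_docs x n_terms].
tf = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d.split():
        tf[i, idx[w]] += 1

# Smoothed inverse document frequency, as in scikit-learn's TfidfVectorizer.
df = (tf > 0).sum(axis=0)
idf = np.log((1 + len(docs)) / (1 + df)) + 1
X = tf * idf

# Cosine similarity between rows is the distance input for t-SNE.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = Xn @ Xn.T
```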

Jekyll Outshines WordPress with Static Simplicity, Security, and GitHub Integration

Jekyll generates static sites from Markdown posts and templates, eliminating PHP vulnerabilities, bloat, and dynamic rendering issues plaguing WordPress. Content lives in plain folders like _posts, enabling easy editing, versioning via Git, and automatic deployment on GitHub Pages. Workflow involves writing Markdown, committing to Git, and pushing for instant live updates with built-in preview via jekyll serve.

Fragment-Level Embeddings Boost Bidirectional Image-Sentence Retrieval

The model embeds fine-grained fragments—objects from images and typed dependency relations from sentences—into a shared multimodal space, enabling bidirectional image-sentence retrieval. It combines a global ranking objective with a novel fragment alignment objective that directly associates cross-modal fragments. Experiments demonstrate significant performance gains over global-only embeddings, with explicit alignments providing interpretability.

Karpathy Links 2014 Interview on ConvNetJS, Background, and Neural Net Trends

Andrej Karpathy shared a link to his ~2-month-old interview from Data Science Weekly. The interview covers ConvNetJS, his professional background, and perspectives on neural network trends, particularly in academia. It provides early insights into the field's direction as of early 2014.

Automated Minute-Granularity Scraping Reveals Hacker News Dynamics

Karpathy collected 47 days of Hacker News data by scraping the front and new stories pages every minute from August 22 to October 30, storing ~10MB per day as gzipped pickles. BeautifulSoup parsed unstructured HTML tables into JSON with fields like domain, title, points, rank, user, and comments, handling edge cases in a 100-line function. Analysis in an IPython Notebook visualizes upvote trajectories, top domains/users/topics, and optimal posting times, with the data and code made publicly available at the time.

Chrome Extensions Enable Rapid, Powerful Website Customization via DOM Manipulation

Chrome extensions require only a manifest file and JavaScript/HTML, deployable in minutes to inject code into any webpage's DOM. They support adding UI elements, persistent storage, and periodic execution, allowing removal of unwanted content, auto-interactions, and novel features like tweet rarity highlighting. Twitter modifications demonstrate hiding promoted tweets, auto-loading new content, and visually prioritizing infrequent posters using scraped user frequencies stored locally.

Human scene understanding demands vast commonsense knowledge far beyond current CV capabilities

Interpreting a simple image like Obama pranking a scale requires fusing 3D scene structure, physics, human psychology, social norms, and predictive reasoning about mental states—none of which emerge from pixels alone. State-of-the-art CV benchmarks like ImageNet and PASCAL VOC test narrow, disconnected tasks that ignore this iceberg of prior knowledge. Progress demands not just more image data and learning tricks, but structured experiential data, embodiment, and active inference architectures to approximate human-like comprehension.

Human Achieves 94% Accuracy on CIFAR-10, Exposing Dataset Challenges and 2011 SOTA Limits

Andrej Karpathy manually classified 400 CIFAR-10 test images to 94% accuracy, far surpassing the 2011 state-of-the-art of 80% from Coates et al.'s k-means centroids with whitening and SVM. The exercise reveals high intra-class variability in poses, scales, and partial views, plus dataset undersampling causing test images without close training matches. Random clusters yield 70% and random patches 74%, suggesting k-means mainly spreads activations near data manifolds; post-2015 deep learning pushed accuracy to 95%.