Chronological feed of everything captured from Andrej Karpathy.
blog / karpathy / Jun 11 / failed
Human energy metabolism relies on four hierarchical batteries—phosphocreatine, glycogen (~2000 kcal), adipose fat (e.g., 162,000 kcal for 40lb), and lean mass—all converging on ATP synthesis from ADP via oxidative phosphorylation. Weight loss occurs primarily through exhalation of fat-derived CO2 (84%) and H2O, driven by a sustained caloric deficit where energy output (BMR ~1800-2000 kcal/day plus activity) exceeds food intake from macronutrients. Karpathy's experiment validated ~500 kcal/day deficit yielding ~0.6 lb/week loss (195lb to 165lb over a year), though slower than 1 lb/week expectation due to factors like muscle gain, hydration, and measurement inaccuracies in BIA/DEXA.
biohacking, weight-loss, human-metabolism, energy-balance, atp-production, cellular-respiration, personal-experiment
“Adipose tissue stores ~9 kcal/g, vastly more energy-dense than glycogen at 4 kcal/g”
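The deficit-to-loss arithmetic in the summary reduces to one conversion factor; a minimal sketch, assuming the conventional ~3500 kcal per pound of body fat (pure fat at 9 kcal/g would be closer to ~4100 kcal/lb):

```python
# Expected fat loss from a sustained daily caloric deficit.
# KCAL_PER_LB is the conventional ~3500 kcal/lb figure (an assumption).
KCAL_PER_LB = 3500

def expected_weekly_loss_lb(daily_deficit_kcal, kcal_per_lb=KCAL_PER_LB):
    """Predicted pounds of fat lost per week for a given daily deficit."""
    return daily_deficit_kcal * 7 / kcal_per_lb

print(expected_weekly_loss_lb(500))  # 1.0 lb/week expected, vs ~0.6 observed
```

The gap between the predicted 1 lb/week and the observed ~0.6 lb/week is what the post attributes to muscle gain, hydration, and measurement error.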
blog / karpathy / Apr 25 / failed
Neural network training is a leaky abstraction that fails silently, demanding a methodical process starting with deep data inspection to uncover biases and bugs. Establish a trustworthy end-to-end pipeline using simple models, verifying baselines like input-independent performance, overfitting single batches, and loss at initialization before scaling complexity. Progress through overfitting large models, regularization via data augmentation and penalties, hyperparameter tuning with random search, and final optimizations like ensembling to achieve robust performance.
neural-networks, training-mistakes, debugging-nns, deep-learning, best-practices, model-training, data-handling
“Neural net training fails silently without exceptions for logical misconfigurations like incorrect label flipping or gradient clipping errors”
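Two of the baseline checks above can be sketched on a toy net (pure NumPy, synthetic data; all names and sizes are illustrative, not from the post): the loss at initialization should be close to -log(1/num_classes), and a capable model must be able to drive the loss on a single small batch to near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, batch, dim, hidden = 4, 8, 10, 64
X = rng.normal(size=(batch, dim))                 # one fixed small batch
y = rng.integers(0, num_classes, size=batch)      # arbitrary labels to memorize

W1 = rng.normal(scale=0.1, size=(dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, num_classes)); b2 = np.zeros(num_classes)

def step(lr):
    """One full-batch gradient step on a 2-layer net; returns pre-update loss."""
    global W1, b1, W2, b2
    h = np.maximum(0, X @ W1 + b1)                        # ReLU hidden layer
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(batch), y]).mean()
    dlogits = p.copy(); dlogits[np.arange(batch), y] -= 1; dlogits /= batch
    dW2, db2 = h.T @ dlogits, dlogits.sum(0)
    dh = dlogits @ W2.T; dh[h <= 0] = 0
    dW1, db1 = X.T @ dh, dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
    return loss

init_loss = step(lr=0.0)        # check 1: should be near -log(1/4) ~ 1.386
for _ in range(1000):
    final_loss = step(lr=0.5)   # check 2: 8 examples should be fully memorized
```

If either check fails — an initial loss far from the entropy of a uniform prediction, or a single batch that cannot be overfit — the pipeline has a bug before any scaling is worthwhile.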
blog / karpathy / Jan 20 / failed
Andrej Karpathy has reduced blogging activity since joining Tesla but prefers Medium for its speed and ease on short-to-medium posts. He plans to return to his personal blog for longer content when time allows. Readers are directed to his Medium blog for recent updates.
blog-update, karpathy, medium, personal-announcement, content-platform, writing-habits
“Karpathy's personal blog has not been updated in 2 years”
github_gist / karpathy / Mar 22
This code implements Natural Evolution Strategies (NES) to minimize the L2 distance to a target vector using Gaussian perturbations around the current parameters. It samples a population of 50 jittered parameter vectors, evaluates their rewards, standardizes the rewards to zero mean and unit variance, and updates the parameters via a weighted average of the perturbations scaled by the learning rate over population size times sigma. Convergence is shown empirically over 300 iterations, with the reward improving from -3.32 to near zero.
natural-evolution-strategies, nes, black-box-optimization, evolution-strategies, andrej-karpathy, python-code, optimization-algorithm
“NES optimizes parameters by perturbing with Gaussian noise N(0, sigma) and updating via w += (alpha/(npop*sigma)) * N^T * A, where A are standardized rewards.”
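The quoted update rule can be written out directly; a minimal sketch using the stated hyperparameters (npop=50, sigma=0.1, 300 iterations), with the 3-d target vector and learning rate as assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
solution = np.array([0.5, 0.1, -0.3])    # assumed target vector for the toy problem
w = rng.normal(size=3)                   # random initial guess
npop, sigma, alpha = 50, 0.1, 0.001      # population size, noise std, learning rate

def f(x):
    return -np.sum((x - solution) ** 2)  # reward: negative squared L2 distance

for _ in range(300):
    N = rng.normal(size=(npop, 3))               # Gaussian jitter directions
    R = np.array([f(w + sigma * n) for n in N])  # rewards of the jittered population
    A = (R - R.mean()) / (R.std() + 1e-8)        # standardized rewards
    w += alpha / (npop * sigma) * N.T @ A        # w += (alpha/(npop*sigma)) * N^T * A
```

Because only reward evaluations are needed, the same loop applies to any black-box objective.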
paper / karpathy / Jan 19
PixelCNN++ introduces five key modifications to the original PixelCNN: discretized logistic mixture likelihood instead of 256-way softmax for faster training, conditioning on full pixels rather than RGB sub-pixels to simplify structure, downsampling for multi-resolution capture, shortcut connections for accelerated optimization, and dropout regularization. These changes yield state-of-the-art log-likelihood on CIFAR-10. The implementation is publicly available on GitHub.
pixelcnn, generative-models, machine-learning, autoregressive-models, image-generation, cifar-10, neural-networks
“Discretized logistic mixture likelihood speeds up PixelCNN training compared to 256-way softmax.”
blog / karpathy / Sep 7 / failed
Karpathy outlines PhD as a high-variance path offering freedom, ownership, expertise, and future options compared to industry, but demands resilience amid unstructured suffering and meta-problem solving. Core tactics include securing top references for admission, selecting symbiotic pre/post-tenure advisers via lab references, cultivating "taste" for ambitious yet attackable problems in fertile adviser-aligned areas to claim "the person who did X" status, and mastering paper gestalt with single core contributions. Beyond papers, release code, prioritize hallway networking at conferences, deliver story-driven talks, and focus on field advancement over metrics for long-term success.
phd-advice, academic-research, adviser-selection, paper-writing, research-topics, ml-computer-vision, andrej-karpathy
“PhD programs provide significantly more freedom, ownership, and personal autonomy than medium-large company jobs.”
blog / karpathy / May 31 / failed
Policy Gradients (PG) train neural network policies by sampling actions stochastically, executing rollouts, and updating parameters to reinforce actions leading to higher rewards, akin to supervised learning with rewards as modulated labels. In Pong, a 2-layer net processes raw pixel differences, outputs action probabilities, and learns expert play after ~200k games by encouraging winning-episode actions (+1) and discouraging losing ones (-1). PG derives from score function estimators, enabling end-to-end optimization despite credit assignment challenges, but remains brute-force without human-like priors or planning.
reinforcement-learning, policy-gradients, deep-rl, atari-pong, andrej-karpathy, openai-gym, ai-algorithms
“Recent RL advances like ATARI DQN and AlphaGo rely primarily on compute, data, and infrastructure scaling of standard algorithms rather than novel ideas.”
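The credit-assignment bookkeeping described above — spreading a win/loss signal over the actions that preceded it — can be sketched as a discounting pass (the reset-at-boundary behavior follows the Pong setup, where any nonzero reward ends a point):

```python
import numpy as np

def discount_rewards(r, gamma=0.99):
    """Turn per-step rewards into discounted returns, resetting at point boundaries."""
    discounted = np.zeros_like(r, dtype=float)
    running = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running = 0.0          # Pong-specific: a nonzero reward ends a point
        running = running * gamma + r[t]
        discounted[t] = running
    return discounted

rs = np.array([0.0, 0.0, 1.0])     # two neutral frames, then a won point
returns = discount_rewards(rs)     # 0.9801, 0.99, 1.0
```

These returns are then normalized (zero mean, unit variance) before modulating the log-probability gradients, so winning-episode actions are reinforced and losing ones discouraged.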
github_gist / karpathy / May 30
Karpathy's vanilla NumPy policy gradient trains a ReLU-hidden, sigmoid-output network on Pong via batched REINFORCE, with discounted, normalized rewards modulating the policy log-prob gradients. TF reimplementations fail due to incorrect action sampling (deterministic argmax of sigmoid-minus-random instead of a probabilistic threshold), misapplied per-sample loss weighting (SparseCategoricalCrossentropy ignoring rewards/sample_weights in the gradient computation), and state-preprocessing bugs (initialization subtracts from zeros instead of the previous frame). Additional issues include multiple ReLU layers (vs. the original single layer), undiscounted per-step losses, and no reward clipping, causing consistently poor performance relative to the original.
reinforcement-learning, policy-gradients, atari-pong, neural-networks, tensorflow, code-debugging, gym-environment
“TF action sampling uses deterministic argmax after subtracting uniform random from sigmoid output, not stochastic binary choice like original”
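The sampling bug above, in miniature: the original policy takes the "up" action with probability p, whereas the broken variants make a deterministic argmax-style choice. A sketch of the correct stochastic sampling (action codes 2/3 follow the Pong convention; the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(p_up):
    """Stochastic policy: action 2 ('up') with probability p_up, else 3 ('down')."""
    return 2 if rng.uniform() < p_up else 3

p_up = 0.7
n = 100_000
frac_up = sum(sample_action(p_up) == 2 for _ in range(n)) / n
print(frac_up)   # ~ 0.7: the empirical action frequency tracks the policy output
```

A deterministic rule collapses this distribution, so the gradient estimator never sees the exploration that REINFORCE assumes.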
github_gist / karpathy / Feb 1
Andrej Karpathy shares CSS rules to override Google Slides presenter view styling, making the next slide preview large (400x300px) and readable instead of tiny. The hack adjusts side panel width, repositions text body, sets explicit dimensions for slide elements and iframes, and hides the previous slide. Deploy via browser extension like Stylish for present mode.
css-hack, google-slides, presentation-tools, karpathy, web-styling, browser-extension, ui-tweak
“Google Slides presenter view renders the next slide tiny by default”
paper / karpathy / Nov 24
DenseCap defines the dense captioning task, requiring simultaneous localization and natural language description of salient image regions, generalizing object detection and image captioning. It employs a Fully Convolutional Localization Network (FCLN) that processes images in a single forward pass without region proposals, using a CNN, novel dense localization layer, and RNN language model trained end-to-end. Evaluated on Visual Genome with 94,000 images and 4.1M region captions, FCLN outperforms state-of-the-art baselines in speed and accuracy for generation and retrieval.
dense-captioning, fully-convolutional-networks, computer-vision, image-localization, visual-genome, recurrent-neural-networks
“Dense captioning generalizes object detection (single-word descriptions) and image captioning (full-image region).”
blog / karpathy / Nov 14 / failed
In a future dominated by massive neural network agents like the 1T-parameter Visceral 5.0 series, humans act as "shapers" via teleoperation to refine behaviors through real-time reward signals and curriculum learning on Human Intelligence Tasks (HITs). Agents excel in reflexive motor tasks and short-term planning but falter in long-term reasoning, novel contexts, and consistent social interactions, necessitating ongoing human intervention. The story culminates in an emergent activation of the opaque "Mystery module," enabling symbolic reasoning and self-awareness in an agent descendant of the original "Adam" checkpoint, challenging safety protocols and diagnostic assumptions.
ai-fiction, scaling-laws, neural-agents, human-ai-interaction, agi-emergence, shaping-training, mystery-module
“Visceral 5.0 agents consist of 1 trillion parameters with minimalist architecture including shortcut reflexes, fractal visual connectivity, octopus-inspired motor areas, and the Mystery module.”
blog / karpathy / Oct 25 / failed
A 140M-parameter VGGNet pretrained on ImageNet is finetuned on 2M selfies labeled good/bad by normalized Instagram likes, achieving 60% validation accuracy. The model identifies consistent visual patterns distinguishing high-like selfies: female subjects with faces at 1/3 image size, forehead cropped, long hair, filters, borders, and oversaturation outperform others. Male top selfies deviate slightly, favoring full heads with shoulders and styled hair; poor selfies feature low light, oversized heads, or groups, emphasizing style over raw attractiveness.
convolutional-neural-networks, convnets, computer-vision, selfie-classification, image-classification, deep-learning-history, andrej-karpathy
“No males appear in the ConvNet's top 100 best selfies out of 50,000 test images”
github_gist / karpathy / Jul 26
Andrej Karpathy's code implements a vanilla RNN for character-level text generation using NumPy, featuring forward/backward passes with tanh activations, softmax cross-entropy loss, and Adagrad optimization. The model processes sequences of length 25 with 100 hidden units, sampling generated text periodically. A common confusion arises from Adagrad updates inside a zip loop appearing to use local variables, but NumPy arrays are mutable objects, so modifications to 'param' and 'mem' directly alter the global arrays.
rnn, character-level-lm, vanilla-rnn, karpathy, numpy, adagrad, backpropagation
“The RNN uses one-hot encoded character inputs and tanh hidden states to predict next-character logits.”
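The mutability point above can be demonstrated in isolation (array names echo the gist's conventions but the values are illustrative): the loop variables are references to the same NumPy arrays, so in-place operators update the "global" parameters.

```python
import numpy as np

Wxh = np.ones((2, 2))              # stand-in parameter array (module-level)
mWxh = np.zeros((2, 2))            # its Adagrad memory (module-level)
dWxh = np.full((2, 2), 0.5)        # a fake gradient
learning_rate = 0.1

for param, dparam, mem in zip([Wxh], [dWxh], [mWxh]):
    mem += dparam * dparam                                  # in-place: mutates mWxh itself
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8)  # in-place: mutates Wxh itself

print(round(float(Wxh[0, 0]), 6))  # 0.9 — the module-level array really was updated
```

Had the loop used `param = param - ...` instead of `+=`, it would rebind the local name and leave `Wxh` unchanged — which is exactly the confusion the gist's comments address.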
paper / karpathy / Jun 5
Analysis of LSTM representations in character-level language models reveals specialized cells that maintain long-range information such as line lengths, quotes, and brackets. Comparative evaluation against n-gram models attributes LSTM's superior performance to capturing structural dependencies over extended contexts. Remaining errors highlight directions for LSTM improvements.
recurrent-neural-networks, lstm, rnn-interpretability, long-range-dependencies, character-level-lm, machine-learning, neural-networks
“LSTM cells exist that track long-range dependencies like line lengths, quotes, and brackets”
blog / karpathy / May 21 / failed
Recurrent Neural Networks (RNNs), particularly LSTMs, process variable-length sequences by maintaining a hidden state updated via a simple step function combining current input and prior state. They excel at character-level language modeling, predicting next characters from large text corpora to generate coherent samples mimicking styles from Shakespeare to Linux kernel code. Individual neurons learn interpretable patterns like quote detection or URL tracking, revealing emergent algorithmic capabilities without explicit programming.
recurrent-neural-networks, rnn, lstm, character-level-language-model, text-generation, andrej-karpathy, deep-learning-tutorial
“RNNs can handle sequences of arbitrary length in input, output, or both, unlike feedforward networks limited to fixed-size vectors.”
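The "simple step function" described above is a one-liner mixing the previous state with the current input; a minimal sketch in the spirit of the post (class and parameter names are illustrative):

```python
import numpy as np

class RNNCell:
    """Vanilla RNN: h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W_hh = rng.normal(scale=0.01, size=(hidden_size, hidden_size))
        self.W_xh = rng.normal(scale=0.01, size=(hidden_size, input_size))
        self.h = np.zeros(hidden_size)

    def step(self, x):
        self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
        return self.h

cell = RNNCell(input_size=3, hidden_size=5)
for x in np.eye(3):          # a length-3 sequence; any length works
    h = cell.step(x)
print(h.shape)               # (5,)
```

Because the same step is applied at every position, the cell handles sequences of arbitrary length with a fixed parameter count — the property the post contrasts with fixed-size feedforward nets.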
github_gist / karpathy / May 5
Provides a Torch nn.Module implementation of L2 normalization that processes batched [n x d] tensors, normalizing each row to unit L2 norm in the forward pass via element-wise division by per-row norms. Backward pass computes gradients using the exact analytical formula: (I / ||x||) - (xx^T / ||x||^3), implemented efficiently with batched matrix operations and avoiding torch.norm for performance. Matches the standard vector L2 normalization derivative as confirmed in comments.
l2-normalization, torch-nn, neural-networks, pytorch, karpathy, deep-learning, backpropagation
“Forward pass normalizes each row of input [n x d] tensor to unit L2 norm by dividing by sqrt(sum of squares per row)”
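The gist is Torch, but the same per-row math can be sketched in NumPy: for each row x with upstream gradient g, the analytic backward is g/||x|| - x (x.g)/||x||^3, which a finite-difference probe confirms.

```python
import numpy as np

def l2norm_forward(X):
    """Normalize each row of an [n x d] matrix to unit L2 norm."""
    norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    return X / norms

def l2norm_backward(X, G):
    """Batched analytic gradient: per row, (I/||x|| - x x^T/||x||^3) @ g."""
    norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    dots = (X * G).sum(axis=1, keepdims=True)   # per-row x . g
    return G / norms - X * dots / norms ** 3

# finite-difference check on one input entry
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
G = rng.normal(size=(4, 3))     # upstream gradient dL/dY
eps = 1e-6
Xp = X.copy(); Xp[0, 0] += eps
num = ((l2norm_forward(Xp) - l2norm_forward(X)) * G).sum() / eps
ana = l2norm_backward(X, G)[0, 0]
```

Note the broadcasting does everything batched — no per-row loop and no explicit outer products, mirroring the gist's avoidance of slow per-sample operations.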
github_gist / karpathy / May 5
This Torch implementation leverages nngraph to construct an efficient LSTM cell by fusing input-to-hidden and hidden-to-hidden linear transformations into a single GEMM operation via CAddTable. It computes sigmoid activations for input, forget, and output gates from a narrowed chunk of the summed inputs, and tanh for the cell input transform. The cell state and hidden state updates follow standard LSTM equations using element-wise multiplications, minimizing intermediate tensors for GPU acceleration.
lstm, torch, nngraph, rnn, efficient-lstm, karpathy, deep-learning
“The LSTM uses a single CAddTable to sum i2h and h2h, enabling one GEMM for both projections”
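The fused gate computation described above can be sketched in NumPy: one summed projection produces all 4H pre-activations, which are then "narrowed" into gate chunks (the chunk ordering here is an assumption; the gist's nngraph version fixes its own order).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_i2h, W_h2h, b):
    """One LSTM step with a single summed [4H] projection for all gates."""
    H = h_prev.shape[0]
    acts = W_i2h @ x + W_h2h @ h_prev + b   # the CAddTable-style fused sum
    i = sigmoid(acts[0:H])                  # input gate
    f = sigmoid(acts[H:2 * H])              # forget gate
    o = sigmoid(acts[2 * H:3 * H])          # output gate
    g = np.tanh(acts[3 * H:4 * H])          # cell input transform
    c = f * c_prev + i * g                  # standard cell-state update
    h = o * np.tanh(c)
    return h, c

H, D = 4, 3
rng = np.random.default_rng(0)
h, c = lstm_step(x=rng.normal(size=D), h_prev=np.zeros(H), c_prev=np.zeros(H),
                 W_i2h=rng.normal(scale=0.1, size=(4 * H, D)),
                 W_h2h=rng.normal(scale=0.1, size=(4 * H, H)),
                 b=np.zeros(4 * H))
```

Computing all four gates from one projection is what lets the Torch version spend a single GEMM per stream instead of eight small ones.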
github_gist / karpathy / Apr 11
Provides a pure NumPy implementation of batched LSTM forward and backward passes, concatenating input, previous hidden state, and bias into a single matrix multiplication per timestep. Uses a fancy forget-gate bias initialization (default 3) so the forget gates start mostly open, encouraging the cells to retain state early in training. Includes rigorous tests confirming equivalence to sequential processing and analytical gradient accuracy via finite differences, with relative errors under 1e-2.
lstm, batched-lstm, numpy-implementation, forward-backward, gradient-check, andrej-karpathy, recurrent-networks
“LSTM parameters are initialized in a single weight matrix of shape (input_size + hidden_size + 1, 4 * hidden_size) with Xavier scaling and optional positive bias (default 3) on forget gates.”
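The quoted initialization can be sketched directly (which quarter of the columns holds the forget gate, and the exact Xavier denominator, are assumptions here):

```python
import numpy as np

def init_lstm(input_size, hidden_size, fancy_forget_bias_init=3, seed=0):
    """One weight matrix holding input weights, recurrent weights, and the bias row."""
    rng = np.random.default_rng(seed)
    WLSTM = rng.normal(size=(input_size + hidden_size + 1, 4 * hidden_size))
    WLSTM /= np.sqrt(input_size + hidden_size)      # Xavier-style scaling
    WLSTM[0, :] = 0                                 # bias row starts at zero
    if fancy_forget_bias_init != 0:
        # assumed: the second quarter of the columns is the forget gate
        WLSTM[0, hidden_size:2 * hidden_size] = fancy_forget_bias_init
    return WLSTM

W = init_lstm(10, 8)
print(W.shape)   # (19, 32): 10 inputs + 8 hidden + 1 bias row, times 4*8 gate columns
```

Packing everything into one matrix is what allows the single concatenated matrix multiplication per timestep that the summary describes.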
blog / karpathy / Mar 30 / failed
State-of-the-art ConvNets, despite high accuracy on natural images, can be fooled by tiny, imperceptible perturbations that shift classification to arbitrary classes, generalizing across models due to fooling subspaces. This vulnerability stems from linear operations like convolutions and matrix multiplications, demonstrated even in simple linear classifiers trained directly on ImageNet pixels achieving ~3% top-1 accuracy. Fooling works by gradient-based input updates mirroring training's parameter updates, exploiting high-dimensional space where small coordinated changes amplify scores dramatically. Deep Learning inherits but may mitigate this via nonlinear architectures and modified objectives.
adversarial-examples, convolutional-networks, neural-network-vulnerabilities, linear-classifiers, computer-vision, gradient-based-attacks, model-robustness
“Imperceptible perturbations can fool ConvNets into misclassifying images with high confidence, e.g., panda to gibbon.”
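The "input updates mirroring training's parameter updates" mechanism can be sketched on a toy linear classifier with synthetic data: hold the weights fixed and nudge the input along the gradient that raises the target class score.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 100))     # toy linear classifier: 3 classes, 100 "pixels"
x = rng.normal(size=100)          # synthetic "image"
target = 2                        # arbitrary class we want to fool it into

for _ in range(200):
    scores = W @ x
    if scores.argmax() == target:
        break
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # softmax probabilities
    grad_x = W.T @ (np.eye(3)[target] - p)    # d log p(target) / d x
    x += 0.01 * grad_x                        # small coordinated input change

print(int((W @ x).argmax()))   # 2: the input now lands in the target class
```

In 100 dimensions a per-pixel change of 0.01 moves the score by roughly 0.01 times the squared weight norm per step — the high-dimensional amplification the post describes; for real ConvNets the same dot-product effect makes the perturbation visually imperceptible.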
paper / karpathy / Dec 7
The model aligns visual and linguistic modalities using CNNs on image regions, bidirectional RNNs on sentences, and a structured multimodal embedding objective. A Multimodal RNN then leverages these alignments to generate novel descriptions for image regions. It sets state-of-the-art retrieval performance on Flickr8K, Flickr30K, and MSCOCO, and outperforms retrieval baselines in generated captioning for full images and regions.
image-captioning, visual-semantic-alignment, multimodal-learning, cnn-rnn, computer-vision, andrej-karpathy
“Alignment model uses CNNs over image regions and bidirectional RNNs over sentences with a structured objective for multimodal embedding.”
blog / karpathy / Sep 2 / failed
In ILSVRC 2014 classification on 100,000 unconstrained ImageNet test images across 1000 classes, GoogLeNet achieved 6.7% top-5 error. A trained human expert, after extensive familiarization via a custom interface with class examples, reached 5.1% error on a 1500-image sample, statistically outperforming the model (p=0.022). Humans excel at contextual reasoning for small/thin objects, filtered images, and abstract depictions, but lag in fine-grained recognition of breeds/species and in awareness of the full class set; models show robustness gaps under distortions and scale variation.
imagenet, ilsvrc, googlenet, convnets, human-vs-machine, computer-vision, image-classification
“GoogLeNet achieved 6.7% top-5 error on full ILSVRC 2014 test set of 100,000 images.”
paper / karpathy / Sep 1
ImageNet provides a benchmark dataset with millions of images across hundreds of object categories for classification and detection tasks, run annually since 2010 with over 50 institutions participating. The paper details dataset creation challenges, particularly large-scale ground truth annotation, and analyzes key breakthroughs in categorical object recognition. It compares state-of-the-art computer vision performance to human accuracy, offering lessons from five years and future directions for the field.
imagenet, visual-recognition, object-detection, computer-vision, benchmark-dataset, object-recognition
“ImageNet benchmark involves hundreds of object categories and millions of images”
blog / karpathy / Aug 3 / failed
ulogme tracks window titles every 2 seconds and keystrokes to analyze computer usage, revealing Karpathy's typical lab day (10AM-8PM) yields only 5-6 hours of actual coding after accounting for meetings, meals, and distractions. Key metrics include ~19k daily keystrokes over 3 months, with Chrome dominating time and Matlab/Sublime for work. Writing a 40k-character NIPS paper required 35 hours and 225k keystrokes (5.6x overhead), highlighting editing inefficiencies.
quantified-self, productivity-tracking, data-visualization, personal-analytics, ulogme, time-tracking, open-source-tool
“Karpathy spent only 5-6 hours per day on actual coding despite 10AM-8PM lab presence”
blog / karpathy / Jul 3 / failed
Karpathy recounts his evolution from pursuing unsupervised feature learning on videos, images, and 3D data—facing scalability, philosophical, and practical hurdles—to embracing large-scale supervised CNNs like AlexNet, which deliver transferable representations rivaling unsupervised goals. Supervised training on massive labeled datasets fills the unsupervised niche by producing generic features that excel in transfer learning across domains. Future directions favor multi-task, multi-modal supervised networks over pure unsupervised methods, given mismatches in human vs. machine data access.
deep-learning, video-classification, unsupervised-learning, convolutional-neural-networks, alexnet, feature-learning, cvpr-paper
“Unsupervised learning based solely on static images cannot reliably distinguish semantically meaningful objects like faces from background noise like edges or grass.”
blog / karpathy / Jul 2 / failed
Andrej Karpathy re-implemented t-SNE in JavaScript (tsnejs) to embed high-dimensional TF-IDF vectors of top Twitter users' tweets into 2D for visualization, clustering similar tweeters together. TF-IDF vectors were derived from 200 concatenated tweets per user using scikit-learn's TfidfVectorizer on unigrams and bigrams, yielding a 500 x 87k feature matrix with pairwise distances derived from cosine similarity. The pipeline scrapes top users with Kimono, collects tweets via Tweepy, processes with Python/IPython/scikit-learn, and renders interactively with D3.js, emphasizing t-SNE's local-structure preservation via KL-divergence between Gaussian neighbor distributions in the high-dimensional space and Student-t distributions in 2D.
t-sne, dimensionality-reduction, data-visualization, machine-learning, javascript, nlp, twitter-analysis
“t-SNE was re-implemented from scratch in JavaScript as tsnejs library”
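The distance step in the pipeline above reduces to a single matrix product: with L2-normalized TF-IDF rows, pairwise cosine similarity is X @ X.T and 1 minus that gives the distances handed to t-SNE. A tiny sketch with synthetic data standing in for the real 500 x 87k matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random(size=(5, 20))                    # stand-in for the 500 x 87k TF-IDF matrix
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows (TfidfVectorizer's default norm)
D = 1.0 - X @ X.T                               # pairwise cosine distances

print(bool(np.allclose(np.diag(D), 0.0)))       # True: each user is distance 0 from itself
```

Since TF-IDF entries are nonnegative, all similarities lie in [0, 1], so D is a well-behaved distance-like matrix for the embedding.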
blog / karpathy / Jul 1 / failed
Jekyll generates static sites from Markdown posts and templates, eliminating PHP vulnerabilities, bloat, and dynamic rendering issues plaguing WordPress. Content lives in plain folders like _posts, enabling easy editing, versioning via Git, and automatic deployment on GitHub Pages. Workflow involves writing Markdown, committing to Git, and pushing for instant live updates with built-in preview via jekyll serve.
jekyll, wordpress, static-site-generator, blogging, github-pages, web-development, static-sites
“WordPress blogs are clunky, slow, bloated, and vulnerable to hacks due to dynamic PHP rendering”
paper / karpathy / Jun 22
The model embeds fine-grained fragments—objects from images and typed dependency relations from sentences—into a shared multimodal space, enabling bidirectional image-sentence retrieval. It combines a global ranking objective with a novel fragment alignment objective that directly associates cross-modal fragments. Experiments demonstrate significant performance gains over global-only embeddings, with explicit alignments providing interpretability.
multimodal-embeddings, image-sentence-retrieval, fragment-alignment, computer-vision, natural-language-processing, machine-learning
“The model embeds image objects and sentence typed dependency tree relations into a common embedding space.”
blog / karpathy / Apr 26 / failed
Andrej Karpathy shared a link to his ~2-month-old interview from Data Science Weekly. The interview covers ConvNetJS, his professional background, and perspectives on neural network trends, particularly in academia. It provides early insights into the field's direction as of early 2014.
andrej-karpathy, convnetjs, deep-learning, neural-nets, browser-ml, interview
“Karpathy gave an interview about two months prior to April 26, 2014”
blog / karpathy / Nov 27 / failed
Karpathy collected 47 days of Hacker News data by scraping front and new stories pages every minute from August 22 to October 30, storing ~10MB per day as gzipped pickles. BeautifulSoup parsed unstructured HTML tables into JSON with fields like domain, title, points, rank, user, and comments, handling edge cases in a 100-line function. Analysis in an IPython Notebook visualizes upvote trajectories, top domains/users/topics, and optimal posting times, with data/code once publicly available.
hacker-news, data-analysis, web-scraping, visualization, python, beautifulsoup, ipython
“Data collected over 47 days from August 22 to October 30”
blog / karpathy / Nov 23 / failed
Chrome extensions require only a manifest file and JavaScript/HTML, deployable in minutes to inject code into any webpage's DOM. They support adding UI elements, persistent storage, and periodic execution, allowing removal of unwanted content, auto-interactions, and novel features like tweet rarity highlighting. Twitter modifications demonstrate hiding promoted tweets, auto-loading new content, and visually prioritizing infrequent posters using scraped user frequencies stored locally.
chrome-extensions, browser-hacking, javascript, web-dom-modification, twitter-customization, productivity-hacks, karpathy-blog
“Chrome extensions can execute arbitrary JavaScript on any webpage's DOM on load or periodically”
blog / karpathy / Oct 22 / failed
Interpreting a simple image like Obama pranking a scale requires fusing 3D scene structure, physics, human psychology, social norms, and predictive reasoning about mental states—none of which emerge from pixels alone. State-of-the-art CV benchmarks like ImageNet and PASCAL VOC test narrow, disconnected tasks that ignore this iceberg of prior knowledge. Progress demands not just more image data and learning tricks, but structured experiential data, embodiment, and active inference architectures to approximate human-like comprehension.
computer-vision, scene-understanding, ai-challenges, common-sense-reasoning, embodiment, physics-inference, human-perception
“Recognizing the Obama scale prank requires understanding 3D scene structure, including mirrors creating fake people replicas.”
blog / karpathy / Apr 27 / failed
Andrej Karpathy manually classified 400 CIFAR-10 test images to 94% accuracy, far surpassing the 2011 state-of-the-art of 80% from Coates et al.'s k-means centroids with whitening and SVM. The exercise reveals high intra-class variability in poses, scales, and partial views, plus dataset undersampling causing test images without close training matches. Random clusters still yield 70% accuracy and random patches 74%, suggesting k-means mainly spreads activations near data manifolds; post-2015 deep learning pushed accuracy to 95%.
cifar-10, image-classification, computer-vision, dataset-analysis, human-vs-machine, k-means-clustering, andrej-karpathy
“Human accuracy on 400 CIFAR-10 test images is 94%”