Chronological feed of everything captured from Geoffrey Hinton.
paper / geoffreyhinton / Feb 18
Capsule Networks integrated with three detection mechanisms achieve state-of-the-art performance against standard and defense-aware adversarial attacks. The defense deflects attacks by inducing perturbations that make adversarial inputs semantically resemble the target class. Human studies confirm that undetected attacks are classified by humans as the target class, rendering them perceptually non-adversarial.
adversarial-attacks capsule-networks defense-mechanisms machine-learning computer-vision arxiv-paper
“Capsule Networks with three detection mechanisms achieve state-of-the-art detection on standard and defense-aware attacks.”
paper / geoffreyhinton / Feb 13
SimCLR introduces a minimal contrastive learning framework for visual representations, eliminating specialized architectures and memory banks. Key enablers include data augmentation composition, a learnable nonlinear projection before the contrastive loss, and scaling with large batch sizes and extended training. On ImageNet, SimCLR's self-supervised ResNet-50 representations yield 76.5% top-1 accuracy with a linear head, equaling supervised baselines, and achieve 85.8% top-5 accuracy when fine-tuned on 1% labels, surpassing AlexNet with 100x fewer labels.
simclr contrastive-learning self-supervised-learning visual-representations data-augmentation computer-vision
“Composition of data augmentations critically defines effective predictive tasks in contrastive learning.”
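The contrastive objective behind SimCLR can be illustrated with a minimal NT-Xent sketch in plain Python (function names and toy 2-D embeddings are illustrative, not SimCLR's actual implementation; real training uses normalized projections over large batches):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over a batch: z1[i] and z2[i] are projections of two
    augmented views of example i; every other example acts as a negative."""
    z = z1 + z2
    n = len(z)
    loss = 0.0
    for i in range(n):
        j = (i + len(z1)) % n                      # index of the positive pair
        pos = math.exp(cosine(z[i], z[j]) / tau)
        denom = sum(math.exp(cosine(z[i], z[k]) / tau) for k in range(n) if k != i)
        loss += -math.log(pos / denom)
    return loss / n
```

Aligned view pairs yield a lower loss than mismatched ones, which is what drives the representations toward augmentation invariance.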
paper / geoffreyhinton / Feb 10
Subclass distillation trains a small student model to match the teacher's softmax probabilities over invented subclasses, improving generalization transfer beyond standard distillation. This method leverages dark knowledge in incorrect class probabilities, performing best with many classes as they reveal more about the teacher's learned function. For few-class datasets, the teacher dynamically creates subclasses during training; natural subclasses align with known hierarchies, and invented ones accelerate learning on clickthrough data.
knowledge-distillation subclass-distillation neural-networks model-compression machine-learning generalization arxiv-paper
“Training a small student to match teacher's probabilities on incorrect classes transfers most of the teacher's generalization ability, outperforming direct training on labeled data.”
paper / geoffreyhinton / Dec 6
NASA introduces neural indicator functions conditioned on pose to represent articulated deformable objects like human bodies, bypassing polygonal meshes and skinning. Occupancy queries are direct and avoid mesh watertightness issues. The approach supports efficient 3D tracking with potential for further extensions.
neural-articulated-shape-approximation nasa articulated-objects neural-indicator-functions 3d-tracking computer-vision computer-graphics
“NASA uses neural indicator functions conditioned on pose for representing articulated deformable objects”
paper / geoffreyhinton / Sep 12
CvxNet introduces a neural network architecture that learns a low-dimensional family of convex polytopes via auto-encoding, enabling learnable convex decomposition of solid objects. Convexes serve as hybrid explicit-implicit representations, ideal for training due to topology-agnostic half-space constraints and capable of generating polygonal meshes at inference for downstream use. Applications include automatic convex decomposition, image-to-3D reconstruction, and part-based shape retrieval.
convex-decomposition neural-architecture 3d-reconstruction computer-graphics machine-learning computer-vision shape-representation
“Any solid object can be decomposed into a collection of convex polytopes”
paper / geoffreyhinton / Jul 19
Lookahead is a new optimization algorithm that iteratively maintains two weight sets: slow weights updated infrequently and fast weights updated frequently by an inner optimizer like SGD or Adam. It selects search directions by evaluating sequences of fast weights ahead, improving learning stability and reducing inner optimizer variance with minimal overhead. Empirical results show significant gains on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank using default hyperparameters.
lookahead-optimizer sgd-improvement deep-learning optimization-algorithm neural-networks machine-learning geoffrey-hinton
“Lookahead iteratively updates two sets of weights: fast weights via an inner optimizer and slow weights via lookahead directions.”
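The slow/fast weight interplay described above can be sketched in a few lines of plain Python; this toy version uses plain SGD as the inner optimizer on a quadratic, and all names are illustrative:

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, alpha=0.5, k=5, outer_steps=20):
    """Minimal Lookahead sketch: the fast weights take k inner-optimizer steps,
    then the slow weights interpolate toward where the fast weights ended up."""
    slow = list(w0)
    for _ in range(outer_steps):
        fast = list(slow)                       # fast weights start from slow weights
        for _ in range(k):                      # inner optimizer (plain SGD here)
            fast = [w - lr * g for w, g in zip(fast, grad_fn(fast))]
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]   # lookahead step
    return slow

# toy quadratic: minimize sum(w_i^2); the gradient is 2 * w
w = lookahead_sgd(lambda ws: [2.0 * wi for wi in ws], [3.0, -2.0])
```

Because the slow weights only move a fraction `alpha` toward the fast weights, variance from the inner optimizer's trajectory is damped at small overhead.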
paper / geoffreyhinton / Jul 5
Class-conditional capsule reconstructions detect adversarial examples by measuring reconstruction error, with CapsNets outperforming CNNs across attacks. The Reconstructive Attack, optimizing for both misclassification and low reconstruction error, succeeds less often but evades detection. CapsNets' robustness correlates with visual similarity between source and target classes, indicating better alignment with human visual features.
adversarial-examples capsnets image-reconstruction neural-network-robustness adversarial-detection computer-vision machine-learning
“Class-conditional reconstruction detects adversarial or corrupted images.”
paper / geoffreyhinton / Jun 17
SCAE is a two-stage unsupervised capsule autoencoder that models objects as geometrically organized parts with viewpoint-invariant relationships. Stage 1 predicts part template presences and poses from images for direct reconstruction; Stage 2 refines these into object capsule parameters for part pose reconstruction. Amortized inference uses standard neural encoders, yielding SOTA unsupervised classification: 55% on SVHN and 98.7% on MNIST via capsule presences.
capsule-networks autoencoders unsupervised-learning computer-vision geometric-reasoning object-recognition
“SCAE explicitly models geometric relationships between object parts that are invariant to viewpoint changes”
paper / geoffreyhinton / Jun 6
Label smoothing, which mixes hard targets with a uniform distribution, improves neural network generalization, learning speed, and calibration, enhancing beam-search performance in tasks like image classification and translation. It clusters representations of same-class training examples tightly in the penultimate layer, preserving prediction accuracy but discarding information about inter-class resemblance. This loss of information explains why teacher networks trained with label smoothing distill knowledge to students less effectively.
label-smoothing neural-networks model-calibration knowledge-distillation generalization representation-learning
“Label smoothing improves model calibration, significantly enhancing beam-search performance”
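The target-mixing step itself is a one-liner; a minimal sketch (illustrative names, standard formulation with smoothing strength `eps`):

```python
def smooth_labels(target_index, num_classes, eps=0.1):
    """Mix the one-hot target with a uniform distribution: every class gets
    eps / K mass, and the true class keeps the remaining 1 - eps on top."""
    uniform = eps / num_classes
    return [uniform + (1.0 - eps if i == target_index else 0.0)
            for i in range(num_classes)]
```

The smoothed vector is still a valid distribution, but every incorrect class receives identical mass, which is exactly the inter-class information discarding the summary describes.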
paper / geoffreyhinton / May 31
Neural networks train more effectively when overparameterized, but standard training does not inherently promote prunability. Targeted dropout stochastically drops units or weights based on a self-reinforcing sparsity criterion before gradient computation, making the network robust to subsequent pruning. This simple method outperforms complex sparsifying regularizers while being easy to implement and tune.
neural-networks sparse-networks targeted-dropout network-pruning machine-learning sparsity-regularization
“Neural networks are easier to optimize when they have more weights than required for the input-output mapping”
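A hedged sketch of the targeted dropout idea for a single weight vector (parameter names `targ_rate` and `drop_prob` are illustrative; the paper applies this per layer, to units or weights, before each gradient step):

```python
import random

def targeted_dropout(weights, targ_rate=0.5, drop_prob=0.5, rng=random):
    """Rank weights by magnitude, mark the lowest-magnitude fraction
    `targ_rate` as pruning candidates, and drop each candidate
    independently with probability `drop_prob`."""
    n_target = int(len(weights) * targ_rate)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    candidates = set(order[:n_target])
    return [0.0 if (i in candidates and rng.random() < drop_prob) else w
            for i, w in enumerate(weights)]
```

Because only low-magnitude weights are ever dropped, the network learns not to rely on them, which is what makes post-hoc magnitude pruning safe.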
paper / geoffreyhinton / May 28
Cerberus is a multi-headed neural network that derenders a single 2D image into viewpoint-invariant 3D shapes and poses of free-floating deformable mesh parts. It trains by reconstructing the input image through a differentiable 3D renderer, with losses promoting invariance to viewpoint changes and articulated pose variations. This unsupervised approach outperforms prior methods for part segmentation on synthetic data and extracts natural parts from human figures.
3d-reconstruction derendering multi-headed-network computer-vision geometric-invariance neural-rendering part-extraction
“Cerberus extracts 3D shapes and camera relations of object parts from a single unlabeled image using a multi-headed neural derenderer.”
paper / geoffreyhinton / May 1
Similarity statistics that are invariant to invertible linear transformations, such as CCA, cannot measure meaningful similarities between neural representations whose dimensionality exceeds the number of data points. The authors introduce a similarity index based on representational similarity matrices, equivalent to centered kernel alignment (CKA), which avoids this limitation and reliably detects correspondences across networks trained from different initializations. CKA maintains a close connection to CCA while enabling robust comparisons between layer representations and models.
neural-networks representation-similarity canonical-correlation-analysis centered-kernel-alignment machine-learning
“CCA belongs to a family of statistics for measuring multivariate similarity that are invariant to invertible linear transformations.”
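Linear CKA reduces to a normalized inner product of Gram matrices; a minimal pure-Python sketch (helper names are illustrative, and real comparisons would use the debiased/minibatch estimators on large activation matrices):

```python
import math

def _center(X):
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[x - m for x, m in zip(row, means)] for row in X]

def _gram(X):
    # n x n Gram matrix X X^T
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]

def _frob_inner(A, B):
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def linear_cka(X, Y):
    """Linear CKA between representations X (n x p1) and Y (n x p2):
    a normalized Frobenius inner product of the centered Gram matrices."""
    K, L = _gram(_center(X)), _gram(_center(Y))
    return _frob_inner(K, L) / math.sqrt(_frob_inner(K, K) * _frob_inner(L, L))
```

Working in the n x n Gram space is what lets the index stay well-defined even when the feature dimension far exceeds the number of examples.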
paper / geoffreyhinton / Feb 5
The Soft Nearest Neighbor Loss quantifies entanglement of class manifolds by measuring how close same-class points are relative to different-class points in representation space. Maximizing this loss in hidden layers surprisingly improves discrimination in the final layer by encouraging class-independent similarity structures, leading to better generalization. It also enables uncertainty calibration and outlier detection, as out-of-distribution data exhibits fewer predicted-class neighbors in hidden layers than in-distribution data.
nearest-neighbor-loss representation-learning class-manifolds entanglement-metrics uncertainty-calibration neural-networks
“Maximizing the Soft Nearest Neighbor Loss in hidden layers improves generalization.”
paper / geoffreyhinton / Nov 16
Capsule models trained to classify and reconstruct images from class-conditional capsule parameters detect adversaries via high L2 reconstruction errors from the predicted class capsule. Adversarial images deviate from typical class members, yielding larger errors than benign ones, enabling threshold-based detection across datasets. The method extends to CNNs using last-layer reconstructions; a white-box attack fools it but requires resembling the target class.
adversarial-detection capsule-networks image-reconstruction adversarial-attacks machine-learning-security computer-vision
“Setting an L2 distance threshold between input and reconstruction from the winning capsule detects adversarial images effectively on three datasets”
paper / geoffreyhinton / Jul 12
Biologically motivated alternatives to backpropagation, such as target propagation (TP), feedback alignment (FA), and difference target propagation (DTP) variants, perform well on MNIST but significantly underperform BP on CIFAR-10 and ImageNet. This gap widens in locally connected architectures versus fully connected ones. Results establish baselines indicating potential need for new algorithms or architectures to achieve biological plausibility at scale.
biologically-plausible-learning backpropagation-alternatives target-propagation feedback-alignment deep-network-scaling neuroscience-inspired-ai
“TP and FA variants perform well on MNIST comparable to BP”
paper / geoffreyhinton / Apr 9
Online distillation enables two neural networks trained on disjoint data subsets to share knowledge by mimicking each other's stale predictions, using infrequent weight transmissions. This approach doubles training speed on massive datasets via extra parallelism, even after synchronous/asynchronous SGD yields no further gains. It also enhances prediction reproducibility cost-effectively. Experiments validate this on Criteo, ImageNet, and a 6e11-token language modeling dataset from Common Crawl.
online-distillation distributed-training neural-networks model-ensembling large-scale-training knowledge-distillation reproducible-predictions
“Online distillation fits very large datasets about twice as fast by enabling extra parallelism beyond SGD limits”
paper / geoffreyhinton / Nov 27
Deep neural networks excel in high-dimensional classification but lack interpretability due to distributed representations. The method uses a trained neural net to construct a soft decision tree that encodes the same knowledge via hierarchical decisions, improving explainability. These distilled trees generalize better than those trained directly on data.
neural-networks decision-trees knowledge-distillation explainable-ai geoffrey-hinton machine-learning
“Deep neural networks are highly effective for classification with high-dimensional inputs, complex input-output relationships, and large labeled datasets.”
paper / geoffreyhinton / Oct 26
Capsule networks represent entities with vector activity where length encodes existence probability and orientation encodes instantiation parameters. Lower-level capsules predict higher-level ones through transformation matrices, activating superiors only when predictions align via routing-by-agreement, which iteratively routes outputs based on scalar product matches. This discriminative multi-layer system matches state-of-the-art on MNIST and outperforms CNNs on highly overlapping digits.
capsule-networks dynamic-routing geoffrey-hinton computer-vision neural-networks arxiv-paper
“A capsule is a group of neurons whose activity vector represents instantiation parameters of an entity, with vector length as existence probability and orientation as parameters.”
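The length-as-probability encoding relies on the squashing nonlinearity from the dynamic-routing paper; a minimal sketch for one capsule vector:

```python
def squash(v):
    """Capsule squashing nonlinearity: preserves the vector's orientation and
    maps its length into [0, 1) so the length can act as an existence probability.
    Short vectors shrink toward zero; long vectors approach unit length."""
    sq = sum(x * x for x in v)
    norm = sq ** 0.5
    scale = sq / (1.0 + sq) / (norm + 1e-12)
    return [scale * x for x in v]
```

Routing-by-agreement then compares a lower capsule's squashed prediction with the higher capsule's output via scalar products, increasing the coupling when they agree.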
paper / geoffreyhinton / Mar 26
Modeling individual labelers and learning sample-specific averaging weights exploits expert-specific reliability and strengths, outperforming majority vote or distributional label models in multi-expert labeling scenarios. Applied to diabetic retinopathy diagnosis, this approach surpasses baselines from Welinder & Perona (2010) and Mnih & Hinton (2012). It leverages the full structure of sparse, overlapping expert annotations for more accurate ground truth estimation.
crowd-labeling expert-modeling label-aggregation machine-learning computer-vision diabetic-retinopathy arxiv-paper
“Modeling individual experts and learning averaging weights improves classification over standard approaches like majority vote or label distribution modeling”
paper / geoffreyhinton / Jan 23
Penalizing low-entropy output distributions regularizes neural networks in supervised learning, adapting a technique from RL exploration. A maximum entropy confidence penalty connects to label smoothing via KL divergence direction. Both methods boost state-of-the-art performance on image classification (MNIST, CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 En-De), and speech recognition (TIMIT, WSJ) without hyperparameter changes.
neural-networks regularization low-entropy-penalty label-smoothing confidence-penalty supervised-learning arxiv-paper
“Penalizing low entropy output distributions acts as a strong regularizer in supervised learning”
paper / geoffreyhinton / Jan 23
The Sparsely-Gated Mixture-of-Experts (MoE) layer scales neural network capacity by up to 1000x through conditional computation, activating only a sparse subset of thousands of feed-forward expert sub-networks per example via a trainable gating network. Applied convolutionally between stacked LSTM layers, MoE models reach 137 billion parameters and outperform state-of-the-art on language modeling and machine translation benchmarks at lower computational cost. This realizes theoretical conditional computation benefits on GPU clusters, overcoming prior algorithmic and performance hurdles.
mixture-of-experts sparse-gating neural-networks conditional-computation language-modeling machine-translation model-capacity
“MoE layer achieves greater than 1000x improvements in model capacity with only minor losses in computational efficiency”
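The conditional-computation mechanism can be sketched with a simplified top-k gate (the paper's gate adds trainable noise and load-balancing losses, omitted here; all names are illustrative):

```python
import math

def top_k_gating(gate_logits, k=2):
    """Sparse gating sketch: softmax over only the top-k expert logits;
    every other expert gets exactly zero weight and is never evaluated."""
    top = sorted(range(len(gate_logits)), key=lambda i: -gate_logits[i])[:k]
    exps = {i: math.exp(gate_logits[i]) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(gate_logits))]

def moe_forward(x, experts, gate_logits, k=2):
    # combine outputs of the active experts only, weighted by the gate
    gates = top_k_gating(gate_logits, k)
    return sum(g * expert(x) for g, expert in zip(gates, experts) if g > 0.0)
```

The exact zeros are what make the computation sparse: inactive experts are never run, so total parameters can grow far faster than per-example compute.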
paper / geoffreyhinton / Oct 20
Fast weights introduce a third type of variable in neural networks that evolve slower than neural activities but faster than standard weights, inspired by multi-timescale synaptic dynamics. They store temporary memories of the recent past, providing a neurally plausible mechanism for attention similar to that in sequence-to-sequence models. This approach eliminates the need to maintain explicit copies of neural activity patterns for attending to history.
fast-weights neural-networks attention-mechanisms temporary-memory sequence-models artificial-neurons
“Artificial neural networks have traditionally been restricted to only two types of variables: neural activities representing current or recent input, and weights learning input-output regularities.”
paper / geoffreyhinton / Jul 21
Layer normalization computes mean and variance from summed inputs to all neurons within a single layer and training case, avoiding batch normalization's mini-batch dependency. It applies adaptive bias and gain post-normalization, ensuring identical computations at training and test time. This enables straightforward RNN application by per-timestep normalization, stabilizing hidden states and substantially reducing training time over prior methods.
layer-normalization batch-normalization recurrent-networks neural-network-training normalization-techniques deep-learning
“Layer normalization uses mean and variance from all summed inputs to neurons in a layer on a single training case.”
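A minimal sketch of the per-case normalization for one layer's summed inputs (illustrative names; frameworks expose this as a layer with learned gain and bias):

```python
def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalize across the units of one layer for a single training case,
    then apply an adaptive per-unit gain and bias. No mini-batch statistics
    are involved, so train- and test-time computation are identical."""
    n = len(x)
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n
    gain = gain or [1.0] * n
    bias = bias or [0.0] * n
    return [g * (xi - mu) / (var + eps) ** 0.5 + b
            for xi, g, b in zip(x, gain, bias)]
```

In an RNN, the same function is simply applied to the summed inputs at every timestep, which is what keeps hidden-state dynamics stable.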
paper / geoffreyhinton / Mar 28
A recurrent neural network performs iterative probabilistic inference on structured image models by attending to one scene element at a time, with the model learning the optimal number of steps. This approach enables unsupervised multi-object identification, counting, localization, and classification in both 2D variational auto-encoders and 3D probabilistic renderers. The method matches supervised performance and enhances generalization via its iterative structure.
scene-understanding generative-models recurrent-neural-networks probabilistic-inference object-detection variational-autoencoders computer-vision
“The model performs inference using a recurrent neural network that processes scene elements sequentially via attention.”
paper / geoffreyhinton / Apr 3
Recurrent networks of rectified linear units (ReLUs), with the recurrent weight matrix initialized to the identity matrix or a scaled version of it, effectively mitigate vanishing and exploding gradients. This simple approach eliminates the need for complex optimizations or architectures like LSTM. On benchmarks including toy long-range temporal problems, large-scale language modeling, and speech recognition, ReLU RNNs perform comparably to LSTMs.
recurrent-neural-networks rnn-initialization rectified-linear-units relu long-term-dependencies neural-networks machine-learning
“Learning long-term dependencies in recurrent networks is difficult due to vanishing and exploding gradients”
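Why identity initialization helps is easy to see in a toy sketch: with W_hh = I and ReLU, a positive hidden state is copied forward indefinitely rather than shrinking or blowing up (names and the zero-input demo are illustrative):

```python
def irnn_step(h, x, W_hh, W_xh):
    """One IRNN step: ReLU(W_hh h + W_xh x)."""
    pre = [sum(wh * hj for wh, hj in zip(row_h, h)) +
           sum(wx * xj for wx, xj in zip(row_x, x))
           for row_h, row_x in zip(W_hh, W_xh)]
    return [max(0.0, p) for p in pre]

n = 3
W_hh = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity init
W_xh = [[0.0] for _ in range(n)]            # zero input weights for the demo
h = [0.5, 1.5, 2.0]
for _ in range(100):
    h = irnn_step(h, [0.0], W_hh, W_xh)
# with identity recurrence and ReLU, the hidden state survives 100 steps unchanged
```

A random or contractive W_hh would instead decay (or explode) the state over those 100 steps, which is the gradient pathology the initialization avoids at the start of training.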
paper / geoffreyhinton / Mar 9
Knowledge distillation trains a compact student model to mimic an ensemble of large neural networks by matching its softened output distributions, enabling efficient deployment without sacrificing much performance. This method builds on prior compression techniques, yielding surprising results on MNIST and significant improvements to a commercial speech recognition system's acoustic model. It also introduces hybrid ensembles pairing full models with parallel-trained specialist models for fine-grained class discrimination, unlike slower mixture-of-experts approaches.
knowledge-distillation neural-networks model-ensembles machine-learning arxiv-paper geoffrey-hinton
“Averaging predictions from multiple models trained on the same data improves performance of machine learning algorithms.”
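The softening that exposes the teacher's "dark knowledge" is just a temperature-scaled softmax; a minimal sketch:

```python
import math

def soften(logits, T=2.0):
    """Temperature-scaled softmax: a higher T flattens the distribution,
    exposing the teacher's relative probabilities for incorrect classes.
    T = 1 recovers the ordinary softmax."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The student is trained to match these softened teacher outputs (typically alongside the true labels), at the same temperature, so the near-zero probabilities on wrong classes become informative targets.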
paper / geoffreyhinton / Dec 23
Attention-enhanced sequence-to-sequence models deliver state-of-the-art syntactic constituency parsing on the standard dataset when trained on large synthetic corpora annotated by existing parsers. These models match standard parser performance using only small human-annotated datasets, highlighting their data efficiency over non-attention seq2seq baselines. The unoptimized CPU implementation processes over 100 sentences per second, enabling domain-agnostic, fast parsing.
syntactic-parsing sequence-to-sequence attention-mechanism natural-language-processing constituency-parsing data-efficiency machine-learning
“Attention-enhanced seq2seq model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset”
paper / geoffreyhinton / Sep 26
Deep Boltzmann Machines (DBMs) are adapted for document modeling via judicious parameter tying, overcoming training difficulties and enabling efficient pretraining and inference comparable to Restricted Boltzmann Machines. The model extracts latent semantic representations from large unstructured document collections. Experiments demonstrate higher log probability on unseen data than Replicated Softmax and better performance than LDA, Replicated Softmax, and DocNADE on retrieval and classification tasks.
deep-boltzmann-machines document-modeling latent-semantic-representations machine-learning information-retrieval arxiv-paper
“Judicious parameter tying allows efficient training of DBMs for documents, matching RBM efficiency.”
paper / geoffreyhinton / Mar 22
Deep recurrent neural networks combining LSTM architecture with multiple representation levels and end-to-end training via Connectionist Temporal Classification (CTC) outperform prior models on speech tasks. Trained with suitable regularization, these deep LSTM RNNs deliver a state-of-the-art test set phoneme error rate of 17.7% on the TIMIT benchmark, surpassing deep feedforward networks and previous RNN results. This advances sequence labeling for unaligned sequential data like speech.
speech-recognition recurrent-neural-networks lstm deep-learning rnn timit-benchmark neural-networks
“Deep LSTM RNNs achieve 17.7% test set error on TIMIT phoneme recognition”
paper / geoffreyhinton / Jan 10
High-dimensional datasets are modeled as products of many linear constraints, each frequently approximately satisfied (FAS) by the data. A data vector's probability is the product of the probabilities of its individual constraint violations, which are modeled with heavy-tailed distributions; the paper compares three methods for learning such models.
machine-learning geoffrey-hinton constraint-modeling frequently-approximately-satisfied high-dimensional-data uai-2001 arxiv-paper
“Some high-dimensional datasets can be modeled by assuming many different linear constraints, each Frequently Approximately Satisfied (FAS) by the data.”
paper / geoffreyhinton / Oct 19
The under-complete product of experts (UPoE) models high-dimensional densities using products of one-dimensional experts on data projections, avoiding the curse of dimensionality. UPoE is fully tractable as a parametric probabilistic model for projection pursuit, with maximum likelihood learning rules matching those of under-complete ICA. An efficient sequential learning algorithm is derived, linking it to projection pursuit density estimation and feature induction in additive random field models.
machine-learning density-estimation projection-pursuit product-of-experts independent-component-analysis geoffrey-hinton
“UPoE uses one-dimensional experts each modeling a single projection of the data”
paper / geoffreyhinton / Jul 3
Randomly omitting half of feature detectors during training of large feedforward neural networks on small datasets prevents complex co-adaptations, forcing each neuron to learn generally useful features across diverse internal contexts. This dropout technique significantly reduces overfitting. It yields major improvements on benchmarks and sets records in speech and object recognition tasks.
neural-networks dropout overfitting feature-detectors machine-learning computer-vision speech-recognition
“Large feedforward neural networks trained on small training sets typically overfit and perform poorly on held-out test data.”
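Applied to one layer's activations, the technique is a single masking step; this sketch uses the now-common "inverted" variant, which rescales at training time rather than at test time as in the original paper:

```python
import random

def dropout(activations, p=0.5, rng=random):
    """Zero each unit independently with probability p. The 'inverted'
    variant shown here rescales survivors by 1/(1-p) so expected
    activations are unchanged and no test-time rescaling is needed."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Because a unit cannot count on any particular co-activated partner surviving, co-adapted feature combinations are broken up, which is the regularization effect the paper describes.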
paper / geoffreyhinton / Jun 27
Deep Lambertian Networks integrate Deep Belief Nets with Lambertian reflectance to model images through latent variables for albedo, surface normals, and lighting. The model learns strong priors on albedo from 2D images, allowing illumination variations to be isolated by modulating only the lighting latent. Single-image estimation of albedo and normals becomes feasible by transferring knowledge from similar objects, supporting tasks like one-shot face recognition.
deep-belief-nets lambertian-reflectance illumination-invariance albedo-estimation surface-normals computer-vision machine-learning
“The model combines Deep Belief Nets with Lambertian reflectance assumptions”
paper / geoffreyhinton / Jun 18
Deep Mixtures of Factor Analysers (DMFAs) enable efficient multi-layer density modeling by greedily training one layer of latent variables at a time, using posterior samples from prior layers as input for the next. Unlike equivalent shallow MFAs formed by multiplying factor loading matrices, DMFAs improve learning and inference efficiency through structured sharing of lower-level matrices, reducing overfitting. Empirical results show DMFAs achieve superior density models compared to shallow MFAs and two RBM variants across diverse datasets.
deep-mixtures-of-factor-analysers dmfa layer-wise-learning density-models restricted-boltzmann-machines machine-learning
“DMFAs can be converted to an equivalent shallow MFA by multiplying factor loading matrices across levels”
paper / geoffreyhinton / May 9
Products of Hidden Markov Models (PoHMMs) are generative models for time-series data that were previously limited by expensive gradient-based learning and intractable log-likelihood computation. The paper introduces reliable partition function estimation using Annealed Importance Sampling (AIS) and demonstrates effective contrastive divergence learning on rainfall and paired dance data. Advances in undirected graphical model techniques and compute power position PoHMMs as viable for complex sequential modeling tasks.
hidden-markov-models pohmm generative-models contrastive-divergence annealed-importance-sampling time-series-modeling geoffrey-hinton
“Annealed Importance Sampling reliably estimates the partition function for PoHMMs.”
paper / geoffreyhinton / Feb 14
Conditional Restricted Boltzmann Machines (CRBMs) extend RBMs for structured output prediction but lack effective training methods beyond non-conditional cases. Standard Contrastive Divergence (CD) is unsuitable for CRBMs. The paper identifies two structured output problem types—low-variability (e.g., multi-label classification) and high-variability (e.g., image denoising)—and proposes tailored learning algorithms that empirically outperform CD on both.
conditional-rbm restricted-boltzmann-machines structured-output-prediction contrastive-divergence machine-learning generative-models arxiv-paper
“Standard Contrastive Divergence-based learning may not be suitable for training CRBMs”