absorb.md

Geoffrey Hinton

Chronological feed of everything captured from Geoffrey Hinton.

Capsule Networks Deflect Adversarial Attacks by Semantic Alignment with Human Perception

Capsule Networks integrated with three detection mechanisms achieve state-of-the-art performance against standard and defense-aware adversarial attacks. The defense deflects attacks by inducing perturbations that make adversarial inputs semantically resemble the target class. Human studies confirm that undetected attacks are classified by humans as the target class, rendering them perceptually non-adversarial.

SimCLR: Simplifying Contrastive Self-Supervised Learning to Match Supervised Performance on ImageNet

SimCLR introduces a minimal contrastive learning framework for visual representations, eliminating specialized architectures and memory banks. Key enablers include data augmentation composition, a learnable nonlinear projection before the contrastive loss, and scaling with large batch sizes and extended training. On ImageNet, SimCLR's self-supervised ResNet-50 representations yield 76.5% top-1 accuracy with a linear head, equaling supervised baselines, and achieve 85.8% top-5 accuracy when fine-tuned on 1% labels, surpassing AlexNet with 100x fewer labels.
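The core of SimCLR's contrastive objective is the normalized-temperature cross entropy (NT-Xent) loss over pairs of augmented views. A minimal NumPy sketch, assuming projections arrive in adjacent rows (the encoder, projection head, and augmentations are omitted):

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss over 2N projections, where z[2i] and z[2i+1] are the
    two augmented views of example i (a sketch of SimCLR's loss)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    n = z.shape[0]
    positives = np.arange(n) ^ 1                      # partner view index
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n), positives].mean()

rng = np.random.default_rng(0)
views = rng.normal(size=(8, 16))                      # 4 examples x 2 views
loss = nt_xent_loss(views)
```

Because every other view in the batch serves as a negative, the loss benefits directly from large batch sizes, consistent with the scaling observation above.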

Subclass Distillation Enhances Knowledge Transfer from Large Teachers to Small Students

Subclass distillation trains a small student model to match the teacher's softmax probabilities over invented subclasses, improving generalization transfer beyond standard distillation. This method leverages dark knowledge in incorrect class probabilities, performing best with many classes as they reveal more about the teacher's learned function. For few-class datasets, the teacher dynamically creates subclasses during training; natural subclasses align with known hierarchies, and invented ones accelerate learning on clickthrough data.
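The aggregation step can be sketched as follows: the student produces a softmax over invented subclasses, and class probabilities are recovered by summing each class's subclasses. The fixed number of subclasses per class is an illustrative assumption:

```python
import numpy as np

def subclass_softmax(logits):
    """Softmax over the invented subclasses (one logit vector per example)."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def class_probs(subclass_probs, subclasses_per_class):
    """Sum each class's subclass probabilities to get class probabilities
    (a sketch; a uniform subclass count per class is assumed)."""
    return subclass_probs.reshape(-1, subclasses_per_class).sum(axis=1)

# 2 classes x 3 invented subclasses each
p_sub = subclass_softmax(np.array([2.0, 0.5, 0.1, 1.0, 0.2, 0.0]))
p_cls = class_probs(p_sub, subclasses_per_class=3)
```

The student matches the teacher's full subclass distribution, which carries more dark knowledge than the aggregated class probabilities alone.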

Neural Articulated Shape Approximation Replaces Meshes with Pose-Conditioned Indicator Functions

NASA introduces neural indicator functions conditioned on pose to represent articulated deformable objects like human bodies, bypassing polygonal meshes and skinning. Occupancy queries are direct and avoid mesh watertightness issues. The approach supports efficient 3D tracking with potential for further extensions.

CvxNet: Auto-Encoding Low-Dimensional Families of Convex Polytopes for Shape Representation

CvxNet introduces a neural network architecture that learns a low-dimensional family of convex polytopes via auto-encoding, enabling learnable convex decomposition of solid objects. Convexes serve as hybrid explicit-implicit representations, ideal for training due to topology-agnostic half-space constraints and capable of generating polygonal meshes at inference for downstream use. Applications include automatic convex decomposition, image-to-3D reconstruction, and part-based shape retrieval.

Lookahead Optimizer Boosts SGD and Adam Performance via Forward-Looking Weight Updates

Lookahead is a new optimization algorithm that iteratively maintains two weight sets: slow weights updated infrequently and fast weights updated frequently by an inner optimizer like SGD or Adam. It selects search directions by evaluating sequences of fast weights ahead, improving learning stability and reducing inner optimizer variance with minimal overhead. Empirical results show significant gains on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank using default hyperparameters.
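The two-loop structure is simple enough to sketch directly. Here Lookahead wraps plain SGD on a toy quadratic; the function and hyperparameter names are illustrative, not the paper's code:

```python
import numpy as np

def lookahead_sgd(grad_fn, w0, inner_lr=0.1, alpha=0.5, k=5, steps=20):
    """Lookahead around SGD: run k fast updates, then move the slow
    weights a fraction alpha toward the resulting fast weights."""
    slow = w0.copy()
    for _ in range(steps):
        fast = slow.copy()
        for _ in range(k):                 # k inner-optimizer updates
            fast -= inner_lr * grad_fn(fast)
        slow += alpha * (fast - slow)      # slow weights interpolate forward
    return slow

# Minimize f(w) = ||w||^2 / 2, whose gradient is w.
w = lookahead_sgd(lambda w: w, np.array([4.0, -2.0]))
```

The interpolation step is what damps the inner optimizer's variance: the slow weights only commit to a fraction of each fast-weight excursion.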

CapsNets Outperform CNNs in Detecting Reconstructive Adversarial Attacks via Class-Conditional Reconstructions

Class-conditional capsule reconstructions detect adversarial examples by measuring reconstruction error, with CapsNets outperforming CNNs across attacks. The Reconstructive Attack, optimizing for both misclassification and low reconstruction error, succeeds less often but evades detection. CapsNets' robustness correlates with visual similarity between source and target classes, indicating better alignment with human visual features.

Stacked Capsule Autoencoders Enable Viewpoint-Robust Unsupervised Object Classification via Geometric Part Relationships

SCAE is a two-stage unsupervised capsule autoencoder that models objects as geometrically organized parts with viewpoint-invariant relationships. Stage 1 predicts part template presences and poses from images for direct reconstruction; Stage 2 refines these into object capsule parameters for part pose reconstruction. Amortized inference uses standard neural encoders, yielding SOTA unsupervised classification: 55% on SVHN and 98.7% on MNIST via capsule presences.

Label Smoothing Boosts Generalization and Calibration by Clustering Same-Class Representations, Hindering Distillation

Label smoothing, mixing hard targets with uniform distribution, improves neural network generalization, learning speed, and calibration, enhancing beam-search performance in tasks like image classification and translation. It clusters representations of same-class training examples tightly in the penultimate layer, preserving prediction accuracy but discarding inter-class resemblance information. This loss explains why teacher networks trained with label smoothing fail to effectively distill knowledge to students.
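The target mixing itself is a one-liner; a sketch with smoothing weight `eps`:

```python
import numpy as np

def smooth_labels(targets, num_classes, eps=0.1):
    """Mix one-hot targets with the uniform distribution: the correct
    class gets 1 - eps + eps/K and every other class gets eps/K."""
    one_hot = np.eye(num_classes)[targets]
    return (1.0 - eps) * one_hot + eps / num_classes

y = smooth_labels(np.array([2, 0]), num_classes=4, eps=0.1)
```

Training against these softened targets discourages the over-confident logit gaps that hard targets induce, which is the mechanism behind the calibration gains described above.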

Targeted Dropout Enables Robust Pruning of Overparameterized Neural Networks

Neural networks train more effectively when overparameterized, but standard training does not inherently promote prunability. Targeted dropout stochastically drops units or weights based on a self-reinforcing sparsity criterion before gradient computation, making the network robust to subsequent pruning. This simple method outperforms complex sparsifying regularizers while being easy to implement and tune.
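One masking step might look like this sketch (not the paper's exact code): the lowest-magnitude fraction of weights are the pruning candidates, and each candidate is independently zeroed before the gradient is computed:

```python
import numpy as np

def targeted_dropout(weights, targ_rate=0.5, drop_prob=0.5, rng=None):
    """One stochastic masking step: the targ_rate fraction of weights
    with smallest magnitude become candidates, and each candidate is
    dropped with probability drop_prob (illustrative parameter names)."""
    rng = rng or np.random.default_rng()
    flat = np.abs(weights).ravel()
    k = int(targ_rate * flat.size)
    threshold = np.sort(flat)[k - 1] if k > 0 else -np.inf
    candidates = np.abs(weights) <= threshold
    drop = candidates & (rng.random(weights.shape) < drop_prob)
    return np.where(drop, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
masked = targeted_dropout(w, rng=rng)
```

Because large-magnitude weights are never dropped, the network learns to concentrate importance in the surviving weights, which is what makes later magnitude pruning cheap.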

Cerberus Enables Unsupervised 3D Part Extraction from Single Images via Multi-Headed Neural Derendering

Cerberus is a multi-headed neural network that derenders a single 2D image into viewpoint-invariant 3D shapes and poses of free-floating deformable mesh parts. It trains by reconstructing the input image through a differentiable 3D renderer, with losses promoting invariance to viewpoint changes and articulated pose variations. This unsupervised approach outperforms prior methods for part segmentation on synthetic data and extracts natural parts from human figures.

CKA Overcomes CCA's Dimensionality Limits for Reliable Neural Representation Similarity

CCA and other statistics invariant to invertible linear transformation cannot measure meaningful similarities between neural representations whose dimensionality exceeds the number of data points. The authors introduce a similarity index based on representational similarity matrices, equivalent to centered kernel alignment (CKA), which avoids this limitation and reliably detects correspondences across networks trained from different initializations. CKA maintains a close connection to CCA while enabling robust comparisons between layer representations and models.
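Linear CKA is short enough to state in full; a NumPy sketch with examples in rows (the paper also covers kernel variants):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation
    matrices (examples in rows, features in columns)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') *
                   np.linalg.norm(Y.T @ Y, 'fro'))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))   # 100-dim features from only 20 examples
```

Note that the index is well defined even with more features than examples, and it is invariant to rotation and isotropic scaling of either representation, which is the property that makes cross-initialization comparisons reliable.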

Maximizing Class Entanglement in Hidden Layers Boosts Generalization and Outlier Detection

The Soft Nearest Neighbor Loss quantifies entanglement of class manifolds by measuring how close same-class points are relative to different-class points in representation space. Maximizing this loss in hidden layers surprisingly improves discrimination in the final layer by encouraging class-independent similarity structures, leading to better generalization. It also enables uncertainty calibration and outlier detection, as out-of-distribution data exhibits fewer predicted-class neighbors in hidden layers than in-distribution data.
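A sketch of the loss with Gaussian affinities: for each point, it is the negative log of the probability mass its same-class neighbors hold among all neighbors:

```python
import numpy as np

def soft_nearest_neighbor_loss(x, y, temperature=1.0):
    """Soft nearest neighbor loss: average over points of
    -log(affinity mass on same-class points / mass on all other points)."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    aff = np.exp(-d2 / temperature)
    np.fill_diagonal(aff, 0.0)                           # exclude self
    same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    num = (aff * same).sum(axis=1)
    den = aff.sum(axis=1)
    return -np.log(num / den).mean()

x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
loss = soft_nearest_neighbor_loss(x, np.array([0, 0, 1, 1]))
```

Low loss means classes are disentangled; the counterintuitive finding above is that *maximizing* it in hidden layers helps the final layer discriminate.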

Capsule Reconstruction Errors Effectively Detect Adversarial Images

Capsule models trained to classify and reconstruct images from class-conditional capsule parameters detect adversaries via high L2 reconstruction errors from the predicted class capsule. Adversarial images deviate from typical class members, yielding larger errors than benign ones, enabling threshold-based detection across datasets. The method extends to CNNs using last-layer reconstructions; a white-box attack fools it but requires resembling the target class.

Biologically Plausible Deep Learning Algorithms Fail to Scale on Complex Image Tasks

Biologically motivated alternatives to backpropagation, such as target propagation (TP), feedback alignment (FA), and difference target propagation (DTP) variants, perform well on MNIST but significantly underperform BP on CIFAR-10 and ImageNet. This gap widens in locally connected architectures versus fully connected ones. Results establish baselines indicating potential need for new algorithms or architectures to achieve biological plausibility at scale.

Online Distillation Accelerates Large-Scale NN Training Beyond SGD Parallelism Limits

Online distillation enables two neural networks trained on disjoint data subsets to share knowledge by mimicking each other's stale predictions, using infrequent weight transmissions. This approach doubles training speed on massive datasets via extra parallelism, even after synchronous/asynchronous SGD yields no further gains. It also enhances prediction reproducibility cost-effectively. Experiments validate this on Criteo, ImageNet, and a language modeling dataset of 6×10¹¹ tokens from Common Crawl.

Distilling Neural Networks into Interpretable Soft Decision Trees

Deep neural networks excel in high-dimensional classification but lack interpretability due to distributed representations. The method uses a trained neural net to construct a soft decision tree that encodes the same knowledge via hierarchical decisions, improving explainability. These distilled trees generalize better than those trained directly on data.

Capsule Networks Enable Superior Recognition of Overlapping Digits via Dynamic Routing

Capsule networks represent entities with activity vectors whose length encodes existence probability and whose orientation encodes instantiation parameters. Lower-level capsules predict higher-level ones through transformation matrices, activating superiors only when predictions align via routing-by-agreement, which iteratively routes outputs based on scalar product matches. This discriminative multi-layer system matches state-of-the-art on MNIST and outperforms CNNs on highly overlapping digits.
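The routing loop can be sketched directly from that description: coupling logits grow when a lower capsule's prediction agrees (by scalar product) with an upper capsule's squashed output. A NumPy sketch over precomputed prediction vectors:

```python
import numpy as np

def squash(s, axis=-1):
    """Squashing nonlinearity: keeps orientation, maps length into (0, 1)."""
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + 1e-9)

def route(u_hat, iterations=3):
    """Routing-by-agreement over predictions u_hat with shape
    (num_lower, num_upper, dim)."""
    b = np.zeros(u_hat.shape[:2])                 # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # couplings
        s = (c[..., None] * u_hat).sum(axis=0)    # weighted sum per upper cap
        v = squash(s)                             # (num_upper, dim)
        b += (u_hat * v[None]).sum(-1)            # scalar-product agreement
    return v

# 4 lower capsules, 2 upper capsules, 3-dim poses: all lower capsules
# agree on upper capsule 0 and cancel out on upper capsule 1.
u_hat = np.zeros((4, 2, 3))
u_hat[:, 0, :] = [1.0, 0.0, 0.0]
u_hat[:, 1, :] = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]]
v = route(u_hat)
```

The upper capsule that receives agreeing predictions ends up with a long output vector (high existence probability); the one receiving conflicting predictions stays short.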

Individual Expert Modeling with Learned Weights Boosts Crowdsourced Classification Accuracy

Modeling individual labelers and learning sample-specific averaging weights exploits expert-specific reliability and strengths, outperforming majority vote or distributional label models in multi-expert labeling scenarios. Applied to diabetic retinopathy diagnosis, this approach surpasses baselines from Welinder & Perona (2010) and Mnih & Hinton (2012). It leverages the full structure of sparse, overlapping expert annotations for more accurate ground truth estimation.

Penalizing Confident Outputs Regularizes Neural Nets Across Supervised Tasks

Penalizing low-entropy output distributions regularizes neural networks in supervised learning, adapting a technique from RL exploration. A maximum entropy confidence penalty connects to label smoothing via KL divergence direction. Both methods boost state-of-the-art performance on image classification (MNIST, CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 En-De), and speech recognition (TIMIT, WSJ) without hyperparameter changes.
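A per-example sketch of the regularized objective: the usual cross entropy minus a weighted entropy bonus, so low-entropy (over-confident) outputs pay a price:

```python
import numpy as np

def confidence_penalized_loss(logits, target, beta=0.1):
    """Cross entropy minus beta times the output entropy (a sketch of
    the maximum-entropy confidence penalty; beta is the penalty weight)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -(p * np.log(p)).sum()
    return -np.log(p[target]) - beta * entropy
```

The connection to label smoothing noted above comes from the KL direction: penalizing entropy corresponds to KL(model ‖ uniform), while label smoothing trains against KL(uniform-smoothed target ‖ model).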

Sparsely-Gated Mixture-of-Experts Enables 1000x Neural Network Capacity Gains with Minimal Compute Overhead

The Sparsely-Gated Mixture-of-Experts (MoE) layer scales neural network capacity by up to 1000x through conditional computation, activating only a sparse subset of thousands of feed-forward expert sub-networks per example via a trainable gating network. Applied convolutionally between stacked LSTM layers, MoE models reach 137 billion parameters and outperform state-of-the-art on language modeling and machine translation benchmarks at lower computational cost. This realizes theoretical conditional computation benefits on GPU clusters, overcoming prior algorithmic and performance hurdles.
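The sparsity comes from a top-k gate; a noise-free sketch (the paper's version adds tunable Gaussian noise and load-balancing losses, which are omitted here):

```python
import numpy as np

def top_k_gating(x, W_g, k=2):
    """Sparse gating sketch: keep the k largest gate logits, mask the
    rest to -inf, and softmax over the survivors so only k experts
    process the example."""
    logits = x @ W_g
    kth = np.sort(logits)[-k]
    masked = np.where(logits >= kth, logits, -np.inf)
    e = np.exp(masked - masked.max())
    return e / e.sum()

rng = np.random.default_rng(0)
gates = top_k_gating(rng.normal(size=16), rng.normal(size=(16, 8)), k=2)
```

Since all but k gate values are exactly zero, only k expert networks run per example, which is how capacity scales roughly 1000x while compute stays nearly flat.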

Fast Weights Enable Neural Attention to Recent Past Without Storing Activity Copies

Fast weights introduce a third type of variable in neural networks that evolve slower than neural activities but faster than standard weights, inspired by multi-timescale synaptic dynamics. They store temporary memories of the recent past, providing a neurally plausible mechanism for attention similar to that in sequence-to-sequence models. This approach eliminates the need to maintain explicit copies of neural activity patterns for attending to history.
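The storage rule is Hebbian: decay the fast matrix and add an outer product of the current hidden state. A sketch of the update A ← λA + η h hᵀ (λ and η are illustrative values):

```python
import numpy as np

def fast_weights_update(A, h, lam=0.9, eta=0.5):
    """Hebbian fast-weight update A <- lam*A + eta*h h^T: the matrix
    decays quickly but stores an outer-product trace of recent states."""
    return lam * A + eta * np.outer(h, h)

# Storing a pattern lets a later matrix-vector product retrieve it:
h1 = np.array([1.0, -1.0, 0.5])
A = fast_weights_update(np.zeros((3, 3)), h1)
recalled = A @ h1                 # points back along the stored pattern
```

Because retrieval is just a matrix-vector product inside the recurrence, the network attends to the recent past without keeping explicit copies of earlier activity vectors.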

Layer Normalization: Batch-Independent Alternative for Faster RNN Training

Layer normalization computes mean and variance from summed inputs to all neurons within a single layer and training case, avoiding batch normalization's mini-batch dependency. It applies adaptive bias and gain post-normalization, ensuring identical computations at training and test time. This enables straightforward RNN application by per-timestep normalization, stabilizing hidden states and substantially reducing training time over prior methods.
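The computation fits in a few lines; a sketch normalizing over the feature axis of each case:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize across the feature dimension of each case independently,
    then apply learned per-feature gain and bias (no batch statistics)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(5, 32))
out = layer_norm(x, gain=np.ones(32), bias=np.zeros(32))
```

Because statistics come from a single case, the result is identical whether the example is processed alone or in a batch, which is what makes per-timestep application in RNNs straightforward.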

Recurrent Attention Enables Unsupervised Object Decomposition in Generative Scene Models

A recurrent neural network performs iterative probabilistic inference on structured image models by attending to one scene element at a time, with the model learning the optimal number of steps. This approach enables unsupervised multi-object identification, counting, localization, and classification in both 2D variational auto-encoders and 3D probabilistic renderers. The method matches supervised performance and enhances generalization via its iterative structure.

Identity Matrix Initialization Enables ReLU RNNs to Match LSTM on Long-Dependency Tasks

Recurrent networks with rectified linear units (ReLUs) initialized using the identity matrix or its scaled version in the recurrent weight matrix effectively mitigate vanishing and exploding gradients. This simple approach eliminates the need for complex optimizations or architectures like LSTM. On benchmarks including toy long-range temporal problems, large language modeling, and speech recognition, ReLU RNNs perform comparably to LSTM.
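The effect of the identity initialization is easy to demonstrate: with W_hh = I and non-negative states, an untrained ReLU RNN simply preserves its hidden state over arbitrarily long horizons instead of exploding or vanishing. A sketch:

```python
import numpy as np

def irnn_step(h, x, W_hh, W_xh, b):
    """One step of a ReLU RNN; W_hh starts as the identity matrix."""
    return np.maximum(0.0, h @ W_hh + x @ W_xh + b)

dim = 4
W_hh = np.eye(dim)              # identity initialization of recurrent weights
W_xh = np.zeros((dim, dim))
b = np.zeros(dim)
h = np.ones(dim)
for _ in range(100):            # state survives 100 steps with zero input
    h = irnn_step(h, np.zeros(dim), W_hh, W_xh, b)
```

A randomly initialized W_hh would shrink or blow up the state exponentially over the same horizon, which is exactly the gradient pathology the identity start avoids.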

Knowledge Distillation Compresses Neural Ensembles into Deployable Single Models

Knowledge distillation trains a compact student model to mimic an ensemble of large neural networks by matching its softened output distributions, enabling efficient deployment without sacrificing much performance. This method builds on prior compression techniques, yielding surprising results on MNIST and significant improvements to a commercial speech recognition system's acoustic model. It also introduces hybrid ensembles pairing full models with parallel-trained specialist models for fine-grained class discrimination, unlike slower mixture-of-experts approaches.
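The matching objective can be sketched as cross entropy between temperature-softened distributions (practice usually also adds a weighted hard-label term and the T² gradient rescaling the paper discusses):

```python
import numpy as np

def softened_probs(logits, T):
    """Softmax at temperature T; higher T spreads probability onto the
    incorrect classes, exposing the teacher's dark knowledge."""
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross entropy between softened teacher and student distributions."""
    p_t = softened_probs(teacher_logits, T)
    log_p_s = np.log(softened_probs(student_logits, T))
    return -(p_t * log_p_s).sum(axis=-1).mean()

teacher = np.array([[6.0, 2.0, 1.0]])
```

The high temperature is the point: at T = 1 the teacher's incorrect-class probabilities are nearly invisible, while softening makes their relative magnitudes a usable training signal for the student.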

Attention-Enhanced Seq2Seq Models Achieve SOTA Parsing via Synthetic Data

Attention-enhanced sequence-to-sequence models deliver state-of-the-art syntactic constituency parsing on the standard dataset when trained on large synthetic corpora annotated by existing parsers. These models match standard parser performance using only small human-annotated datasets, highlighting their data efficiency over non-attention seq2seq baselines. The unoptimized CPU implementation processes over 100 sentences per second, enabling domain-agnostic, fast parsing.

Parameter-Tied Deep Boltzmann Machines Enable Efficient Document Modeling and Superior Latent Representations

Deep Boltzmann Machines (DBMs) are adapted for document modeling via judicious parameter tying, overcoming training difficulties and enabling efficient pretraining and inference comparable to Restricted Boltzmann Machines. The model extracts latent semantic representations from large unstructured document collections. Experiments demonstrate higher log probability on unseen data than Replicated Softmax and better performance than LDA, Replicated Softmax, and DocNADE on retrieval and classification tasks.

Deep LSTM RNNs Achieve Record 17.7% Error on TIMIT Phoneme Recognition

Deep recurrent neural networks combining LSTM architecture with multiple representation levels and end-to-end training via Connectionist Temporal Classification outperform prior models on speech tasks. Trained with suitable regularization, these deep LSTM RNNs deliver a state-of-the-art test set phoneme error rate of 17.7% on the TIMIT benchmark, surpassing deep feedforward networks and previous RNN results. This advances sequence labeling for unaligned sequential data like speech.

Frequently Approximately Satisfied Constraints Model High-Dimensional Data via Product of Violation Probabilities

High-dimensional datasets are modeled as products of many linear constraints, each frequently approximately satisfied (FAS) by the data. The probability of a data vector is the product of the probabilities of its violations of the individual constraints. Three learning methods are described, all of which rely on heavy-tailed distributions over constraint violations.

Under-Complete Product of Experts Enables Tractable Projection Pursuit Density Estimation

The under-complete product of experts (UPoE) models high-dimensional densities using products of one-dimensional experts on data projections, avoiding the curse of dimensionality. UPoE is fully tractable as a parametric probabilistic model for projection pursuit, with maximum likelihood learning rules matching those of under-complete ICA. An efficient sequential learning algorithm is derived, linking it to projection pursuit density estimation and feature induction in additive random field models.

Dropout Prevents Co-Adaptation and Reduces Overfitting in Neural Networks

Randomly omitting half of feature detectors during training of large feedforward neural networks on small datasets prevents complex co-adaptations, forcing each neuron to learn generally useful features across diverse internal contexts. This dropout technique significantly reduces overfitting. It yields major improvements on benchmarks and sets records in speech and object recognition tasks.
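A sketch of the masking step, written in the now-common "inverted" form that scales survivors at training time (the original paper instead halves the weights at test time):

```python
import numpy as np

def dropout(activations, p=0.5, rng=None, train=True):
    """Inverted dropout: zero each unit with probability p during training
    and scale survivors by 1/(1-p) so expectations match test time."""
    if not train:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)
```

Each forward pass sees a different random sub-network, so no unit can rely on specific co-adapted partners, which is the mechanism behind the overfitting reduction described above.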

Deep Lambertian Networks Enable Illumination-Invariant Recognition via Generative Albedo and Normal Estimation

Deep Lambertian Networks integrate Deep Belief Nets with Lambertian reflectance to model images through latent variables for albedo, surface normals, and lighting. The model learns strong priors on albedo from 2D images, allowing illumination variations to be isolated by modulating only the lighting latent. Single-image estimation of albedo and normals becomes feasible by transferring knowledge from similar objects, supporting tasks like one-shot face recognition.

Deep Mixtures of Factor Analyzers Outperform Shallow MFAs and RBMs via Greedy Layer-Wise Training

Deep Mixtures of Factor Analyzers (DMFAs) enable efficient multi-layer density modeling by greedily training one layer of latent variables at a time, using posterior samples from prior layers as input for the next. Unlike equivalent shallow MFAs formed by multiplying factor loading matrices, DMFAs improve learning and inference efficiency through structured sharing of lower-level matrices, reducing overfitting. Empirical results show DMFAs achieve superior density models compared to shallow MFAs and two RBM variants across diverse datasets.

Annealed Importance Sampling Revives Products of Hidden Markov Models for Complex Time-Series

Products of Hidden Markov Models (PoHMMs) are generative models for time-series data that were previously limited by expensive gradient-based learning and intractable log-likelihood computation. The paper introduces reliable partition function estimation using Annealed Importance Sampling (AIS) and demonstrates effective contrastive divergence learning on rainfall and paired dance data. Advances in undirected graphical model techniques and compute power position PoHMMs as viable for complex sequential modeling tasks.

Improved Algorithms Surpass Contrastive Divergence for Training CRBMs on Structured Outputs

Conditional Restricted Boltzmann Machines (CRBMs) extend RBMs for structured output prediction but lack effective training methods beyond non-conditional cases. Standard Contrastive Divergence (CD) is unsuitable for CRBMs. The paper identifies two structured output problem types—low-variability (e.g., multi-label classification) and high-variability (e.g., image denoising)—and proposes tailored learning algorithms that empirically outperform CD on both.