Influential arXiv Papers in Computer Vision
This wiki article surveys 12 influential arXiv papers (2015-2024) in computer vision. It synthesizes core ideas from capsule networks with routing-by-agreement and their self-supervised extension to 3D tasks, Pix2Seq's reframing of detection as autoregressive sequence generation, Boggart's imprecise but comprehensive indices for general-purpose video analytics, and Privid's (ρ,K,ε)-event-duration privacy. It also covers self-supervised techniques, including tracking entropy minimization (EM) for label-free accuracy estimation (5.75% MAE), an SSL cookbook, and the X-Sample contrastive loss that outperforms CLIP, as well as zero-shot recognition from text corpora, early CNNs for autonomous driving, and the Agriculture-Vision dataset. As of April 2026, cs.CV remains highly active, with CVPR 2026 papers including VOSR (arXiv:2604.03225), SD-FSMIS (arXiv:2604.03134), HIVE (hierarchical pre-training of vision encoders with LLMs), Curia-2 radiology foundation models, and Flow4R for 4D reconstruction, alongside advances in VLMs that build on these foundations.
Computer vision has been transformed by open-access preprints on arXiv, enabling rapid dissemination of ideas in object detection, video understanding, self-supervised methods, and domain applications. This article synthesizes 12 notable papers from 2015-2024 (with extensions to 2026), grouped thematically, with inline citations [1-12]. Recent 2026 activity includes CVPR papers on VOSR (arXiv:2604.03225), SD-FSMIS (arXiv:2604.03134), HIVE (hierarchical pre-training of vision encoders with LLMs), Curia-2 for radiology, Flow4R unifying 4D reconstruction and tracking, plus broader advances in VLMs, medical diffusion models, and steerable representations.
Capsule Networks
Capsule networks group neurons into capsules. Each capsule outputs an activity vector whose length encodes the probability that a specific type of entity exists and whose orientation encodes that entity's instantiation parameters [6]. Lower-level capsules predict the outputs of higher-level capsules via transformation matrices, and a higher-level capsule becomes active when multiple predictions agree. Routing-by-agreement implements this iteratively: each lower-level capsule sends more of its output to the higher-level capsules whose activity vectors have a large scalar product with its prediction [6]. The approach achieves state-of-the-art performance on MNIST and considerably outperforms convolutional nets at recognizing highly overlapping digits. An extension proposes self-supervised capsule networks for 3D point clouds, using permutation-equivariant attention to decompose objects into capsules and aggregating the attention masks into semantic keypoints [3]. These networks are trained on pairs of randomly rotated objects to enforce invariance/equivariance properties, which lets them learn a canonicalization operation without labels or aligned data; they outperform the prior state of the art on 3D reconstruction, canonicalization, and unsupervised classification (NeurIPS 2021). Separately, a capsule-based defense with three detection mechanisms achieves state-of-the-art detection performance against both standard and defense-aware adversarial attacks. It deflects attacks, forcing the adversary to produce inputs that semantically resemble the target class; human studies confirm this, with participants labeling such images as the target class [4].
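The routing-by-agreement loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `squash`, `softmax`, and `route`, the default of three iterations, and the nested-list layout `u_hat[i][j]` (the prediction that lower-level capsule `i` makes for higher-level capsule `j`) are all chosen for this example.

```python
import math
from typing import List

Vector = List[float]


def squash(s: Vector) -> Vector:
    """Shrink a vector so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(u_hat: List[List[Vector]], num_iters: int = 3):
    """Route predictions u_hat[i][j] from lower capsule i to higher capsule j.

    Returns the higher-level activity vectors and the coupling coefficients
    used in the last iteration.
    """
    n_low, n_high, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_high for _ in range(n_low)]  # routing logits, start uniform
    v: List[Vector] = []
    c: List[Vector] = []
    for _ in range(num_iters):
        # Each lower capsule distributes its output across higher capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_high):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_low))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement step: raise the logit where prediction and output align.
        for i in range(n_low):
            for j in range(n_high):
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c
```

When several lower-level capsules make the same prediction for one higher-level capsule, that capsule's output grows longer and the coupling coefficients shift toward it, which is the "agreement" effect the text describes.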
Object Detection and Recognition Innovations
Pix2Seq casts object detection as a language modeling task conditioned on pixel inputs: bounding boxes and class labels are expressed as sequences of discrete tokens, and a network is trained to generate them autoregressively [2]. Beyond data augmentations, it makes minimal assumptions about the task, dispensing with explicit priors such as anchors or non-maximum suppression (NMS), and it achieves competitive results on COCO. Early work demonstrated that CNNs can perform real-time lane and vehicle detection on large highway datasets, showing deep learning's potential for inexpensive, robust autonomous-driving perception, provided that large representative datasets are available [9]. Zero-shot recognition uses semantic embeddings learned from unsupervised text corpora, combined with outlier detection in the semantic space and two separate recognition models, to achieve state-of-the-art accuracy on seen classes and reasonable accuracy on unseen ones, without any manually defined semantic features for words or images [11].
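To make the idea of expressing detections as discrete tokens concrete, the sketch below quantizes box coordinates into bins and appends a class token per object. It is an illustration only: the vocabulary layout (coordinate bins first, then class tokens, then an end-of-sequence token), the default of 1000 bins and 80 classes, and the function names `quantize`, `dequantize`, `boxes_to_tokens`, and `tokens_to_boxes` are assumptions made for this example rather than details taken from the article.

```python
from typing import List, Sequence, Tuple

# (ymin, xmin, ymax, xmax, class_id), coordinates in pixels
Box = Tuple[float, float, float, float, int]


def quantize(coord: float, size: float, num_bins: int) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    frac = min(max(coord / size, 0.0), 1.0)
    return round(frac * (num_bins - 1))


def dequantize(token: int, size: float, num_bins: int) -> float:
    """Map a coordinate bin back to an approximate pixel coordinate."""
    return token / (num_bins - 1) * size


def boxes_to_tokens(boxes: Sequence[Box], height: float, width: float,
                    num_bins: int = 1000, num_classes: int = 80) -> List[int]:
    """Flatten boxes into one token sequence: five tokens per object, then EOS."""
    tokens: List[int] = []
    for ymin, xmin, ymax, xmax, class_id in boxes:
        tokens.extend([
            quantize(ymin, height, num_bins),
            quantize(xmin, width, num_bins),
            quantize(ymax, height, num_bins),
            quantize(xmax, width, num_bins),
            num_bins + class_id,  # class tokens sit after the coordinate bins
        ])
    tokens.append(num_bins + num_classes)  # hypothetical end-of-sequence token
    return tokens


def tokens_to_boxes(tokens: Sequence[int], height: float, width: float,
                    num_bins: int = 1000, num_classes: int = 80) -> List[Box]:
    """Invert boxes_to_tokens, stopping at the end-of-sequence token."""
    eos = num_bins + num_classes
    boxes: List[Box] = []
    i = 0
    while i + 5 <= len(tokens) and tokens[i] != eos:
        y0, x0, y1, x1, cls_tok = tokens[i:i + 5]
        boxes.append((
            dequantize(y0, height, num_bins),
            dequantize(x0, width, num_bins),
            dequantize(y1, height, num_bins),
            dequantize(x1, width, num_bins),
            cls_tok - num_bins,
        ))
        i += 5
    return boxes
```

With sequences in this form, a detector can be trained like a language model with a next-token objective, and decoding a generated sequence recovers boxes up to the quantization error.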
Video Analytics and Privacy
Boggart builds its indices before any queries are issued, using traditional computer vision algorithms whose outputs are imprecise but comprehensive enough to serve different CNNs and queries [1]. For each query, it quickly characterizes the index's imprecision and runs the CNN sparingly, propagating results so that accuracy drops stay bounded; the resulting speedups match or exceed those of prior model-specific approaches while supporting greater generality. In this way it simultaneously meets the goals of reliable accuracy, low latency, and minimal wasted work in general-purpose retrospective video analytics. Privid defines (ρ,K,ε)-event-duration privacy, which protects private information that is visible for less than a particular duration, without requiring perfect per-frame detection of that information [12]. It enforces this guarantee even when analysts supply their own untrusted DNNs, while achieving 79-99% of the accuracy of a non-private baseline across videos and queries.
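A duration-based privacy guarantee of this kind can be illustrated with a standard Laplace mechanism over per-chunk results. The sketch below is not Privid's implementation. It assumes that ρ bounds the duration of each appearance of an event, that K bounds the number of appearances, that the video is split into fixed-length chunks, and that each chunk contributes a value clipped to `[0, per_chunk_max]`; all function and parameter names are hypothetical.

```python
import math
import random
from typing import Sequence


def max_chunks_touched(rho: float, chunk_len: float) -> int:
    """An interval of duration rho overlaps at most ceil(rho / chunk_len) + 1 chunks."""
    return math.ceil(rho / chunk_len) + 1


def sensitivity(rho: float, k: int, chunk_len: float, per_chunk_max: float) -> float:
    """Largest change one event (k appearances of duration <= rho) can cause in the sum."""
    return k * max_chunks_touched(rho, chunk_len) * per_chunk_max


def clipped_sum(per_chunk_values: Sequence[float], per_chunk_max: float) -> float:
    """Clip every chunk's value to [0, per_chunk_max] so the sensitivity bound holds."""
    return sum(min(max(v, 0.0), per_chunk_max) for v in per_chunk_values)


def laplace_sample(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(per_chunk_values: Sequence[float], rho: float, k: int,
                chunk_len: float, per_chunk_max: float, epsilon: float,
                rng: random.Random) -> float:
    """Release the clipped sum with Laplace noise scaled to sensitivity / epsilon."""
    scale = sensitivity(rho, k, chunk_len, per_chunk_max) / epsilon
    return clipped_sum(per_chunk_values, per_chunk_max) + laplace_sample(scale, rng)
```

Because the noise scale depends only on the bounds (ρ, K, chunk length, per-chunk range) and not on what the analyst's model outputs inside a chunk, a mechanism of this shape does not need to trust or inspect the per-chunk computation.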
Self-Supervised and Contrastive Learning
A self-supervised learning "cookbook" frames SSL training as an art with a dizzying set of interdependent choices, from pretext tasks to hyperparameters, and lowers the barrier to entry by providing practical recipes and guidance on knob tuning [10]. Entropy minimization (EM) initially embeds test images close to training images, which boosts accuracy, but prolonged optimization pushes the embeddings away from the training ones and degrades accuracy; tracking these embedding shifts during EM optimization enables label-free accuracy estimation with a mean absolute error (MAE) of 5.75% across 23 challenging datasets [5]. The X-Sample Contrastive loss replaces binary positive/negative labels with graded similarities between samples, derived from class or caption graphs, and thereby explicitly encodes the cross-sample relations that standard contrastive losses ignore. It outperforms CLIP on ImageNet by 16.8% when trained on CC3M and by 0.6% when trained on CC12M, and it encourages better object-attribute separation, with gains on ImageNet9 [7].
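The difference between binary positives and graded cross-sample similarities can be shown with a soft-target cross-entropy. The sketch below is written in the spirit of the X-Sample idea and is not the paper's exact formulation: the row normalization of the graph weights, the temperature default, and the names `log_softmax` and `soft_target_contrastive_loss` are assumptions for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_target_contrastive_loss(similarities: Sequence[Sequence[float]],
                                 graph: Sequence[Sequence[float]],
                                 temperature: float = 0.1) -> float:
    """Cross-entropy between predicted and graph-derived sample similarities.

    similarities[i][j]: model similarity between sample i and sample j.
    graph[i][j]: non-negative relatedness weight between samples i and j.
    An identity graph recovers the usual one-positive contrastive loss.
    """
    total = 0.0
    for sim_row, graph_row in zip(similarities, graph):
        log_probs = log_softmax([s / temperature for s in sim_row])
        weight_sum = sum(graph_row)
        targets = [w / weight_sum for w in graph_row]  # normalize row to sum to 1
        total -= sum(t * lp for t, lp in zip(targets, log_probs))
    return total / len(similarities)
```

Passing an identity matrix as `graph` treats every other sample as an equally wrong negative, whereas off-diagonal weights (for example, two images of the same class or with similar captions) tell the loss that some "negatives" should stay close.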
Domain-Specific Challenges and Datasets
The first Agriculture-Vision Challenge engaged ~57 teams using a dataset of 21,061 multi-spectral farmland images for aerial semantic segmentation of agricultural patterns, with the submission server and leaderboard remaining open for ongoing research [8].
Recent Trends in 2026
Activity in cs.CV remains high across the CVPR 2026 main track and workshops. Areas of emphasis include VLMs (e.g., CoME-VL), medical imaging with diffusion models (SD-FSMIS adapts Stable Diffusion for few-shot segmentation), vision-only super-resolution (VOSR), hierarchical pre-training of vision encoders with LLMs (HIVE), radiology foundation models such as Curia-2 that build on SSL techniques, and Flow4R, which unifies 4D reconstruction and tracking with scene flow. These efforts build on the earlier foundations in self-supervised and contrastive learning (X-Sample), capsule-inspired equivariance, efficient video analytics, and privacy, while addressing scalability, robustness, and deployment in foundation models.
References
- [1] Capsule Networks Enable Superior Recognition of Overlapping Digits via Dynamic Routing (paper, 2017-10-26)
- [2] Deep Learning CNNs Enable Real-Time Lane and Vehicle Detection on Highway Driving Datasets (paper, 2015-04-07)
- [3] Cross-Modal Transfer from Text Corpora Enables Zero-Shot Image Recognition (paper, 2013-01-16)
- [4] https://arxiv.org/list/cs.CV/current (web)
- [5] https://arxiv.org/abs/2604.03225 (web)
- [6] https://arxiv.org/abs/2106.15315v2 (web)
- [7] https://cvpr.thecvf.com/virtual/2026/papers.html (web)
- [8] https://www.paperdigest.org/2026/03/most-influential-cvpr-papers-2026-03-version/ (web)
- [9] https://x.com/0xCVYH/status/2041134036286890402 (X / Twitter)