Chronological feed of everything captured from Ravi Netravali.
Snicket is a query-driven distributed tracing system designed to overcome the limitations of traditional head-based and tail-based sampling methods in microservice architectures. It addresses the issues of over-collection of unnecessary data and under-collection of critical data by focusing on collecting only the information relevant to a specific query. Snicket provides a domain-specific language that compiles queries into extensions running on sidecars, enabling efficient, application-agnostic data collection and computation close to the source.
Mowgli is an offline reinforcement learning (RL) approach that improves real-time video rate control by analyzing historical telemetry logs from heuristic algorithms like Google Congestion Control (GCC). This method avoids the user-disrupting exploration phase of online RL, achieving comparable performance improvements by identifying and generalizing high-reward adaptation behaviors from existing data. It offers a practical solution for enhancing video conferencing quality without negatively impacting user experience during training.
Debugging distributed systems in production is challenging due to non-deterministic bugs and the difficulty of correlating symptoms with root causes across multiple components. Lumos addresses this by providing an online debugging framework that utilizes dependency-guided instrumentation and static analysis to identify and expose application-level bug provenances. The system aims to provide sufficient evidence for root cause identification with low runtime overhead and minimal bug occurrences.
Legilimens is a continuous learning system designed for mobile edge devices with System-on-Chip (SoC) GPUs, addressing the unique resource constraints of such platforms. It leverages the insight that visually distinct scenes often share significant model embedding overlap, enabling lightweight specialization of a base model. The system reduces retraining costs and improves accuracy by optimizing data sample selection, base model updates without full retraining, and time-sharing compute resources for inference and retraining.
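The embedding-overlap insight above can be sketched as a similarity check against previously specialized scenes: if a new scene's embedding overlaps enough with a cached one, reuse that specialization instead of retraining. This is a minimal illustration under assumptions, not Legilimens's actual matching logic; `pick_specialization`, the 0.8 threshold, and the cached-scene dictionary are all invented for the sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pick_specialization(scene_emb, cached, threshold=0.8):
    """Reuse the cached specialization whose scene embedding overlaps most.

    cached: {scene_name: embedding}. Returns a cached scene name when its
    similarity clears the threshold, else None to trigger light retraining.
    """
    best, best_sim = None, threshold
    for name, emb in cached.items():
        sim = cosine(scene_emb, emb)
        if sim >= best_sim:
            best, best_sim = name, sim
    return best
```

A `None` return is the signal to specialize a new head from the base model; anything else reuses existing weights at zero retraining cost.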
FailFast is a novel speculative decoding framework that leverages Diffusion Large Language Models (dLLMs) to accelerate autoregressive LLMs. It addresses the efficiency-quality tradeoff of dLLMs by dynamically adjusting speculation length. This approach minimizes computation in difficult-to-speculate regions and aggressively extends draft lengths in easier regions, resulting in significant speedups without fine-tuning.
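A minimal sketch of the dynamic-speculation-length idea: grow the draft aggressively when the verifier accepts everything, and shrink toward the point of first rejection otherwise so little compute is wasted in hard regions. The doubling/shrinking rule and `next_draft_len` are illustrative assumptions, not FailFast's published controller.

```python
def next_draft_len(cur_len, accepted, min_len=1, max_len=64):
    """Adapt speculation length from the last round's acceptance count.

    accepted == cur_len means the whole draft was verified: the region is
    'easy', so double the draft. Early rejection means the region is hard,
    so shrink to just past the first rejection point.
    """
    if accepted == cur_len:
        return min(cur_len * 2, max_len)
    return max(accepted + 1, min_len)
```

In steady state this behaves like a multiplicative-increase controller gated by verification outcomes, mirroring the easy-region/hard-region asymmetry the summary describes.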
Aragog introduces a just-in-time model routing architecture for agentic workflows that decouples configuration selection into a one-time accuracy-preserving routing step and a lightweight per-stage scheduler. By adapting LLM assignments dynamically based on real-time system observations, it optimizes for fluctuating loads without sacrificing output quality. This approach significantly improves throughput and reduces latency compared to static binding strategies.
Aragog is a system designed to optimize the serving of agentic workflows by dynamically adjusting LLM configurations during execution. This approach addresses the limitations of static configurations, which become suboptimal due to fluctuating system loads. Aragog decouples configuration selection into pre-computation of accuracy-preserving configurations and a real-time, per-stage scheduler, significantly improving throughput and reducing latency.
LessIsMore is a novel, training-free sparse attention mechanism designed to improve the efficiency of large language models (LLMs) in reasoning tasks. It utilizes global attention patterns and unified cross-head token ranking to overcome limitations of traditional sparse attention methods, which often lead to accuracy degradation. This approach achieves significant speed-ups and token reduction without compromising accuracy.
LessIsMore is a training-free sparse attention method designed for large reasoning models, addressing the computational overhead of long token generation during test-time scaling. Rather than applying head-specific local token selection — the dominant paradigm in prior sparse attention work — it aggregates token selections across local attention heads combined with recent context to produce a unified cross-head token ranking used for future decoding layers. This global aggregation strategy avoids maintaining per-head token subsets, improving both generalization and efficiency. The result is 2× token reduction without accuracy degradation, a 1.1× decoding speed-up over full attention, and 1.13× end-to-end speed-up over existing sparse attention baselines.
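The unified cross-head ranking can be illustrated by pooling per-head attention scores into a single score per token, then taking a global top-k alongside an always-kept recency window. The array shapes, the sum-pooling choice, and `unified_topk` are assumptions for this sketch, not the paper's exact procedure.

```python
import numpy as np

def unified_topk(attn, recent, k):
    """Pool per-head attention into one cross-head token ranking.

    attn:   (num_heads, seq_len) attention scores from the latest query
    recent: number of most-recent tokens always kept (recency window)
    k:      total number of tokens to retain for future decoding
    """
    scores = attn.sum(axis=0)                 # aggregate across heads
    n = scores.shape[0]
    keep = set(range(n - recent, n))          # always keep recent context
    # fill remaining slots with the globally highest-scoring earlier tokens
    for idx in np.argsort(scores[: n - recent])[::-1]:
        if len(keep) >= k:
            break
        keep.add(int(idx))
    return sorted(keep)
```

Because every head shares the one ranking, there is no per-head token subset to maintain, which is the source of the efficiency gain claimed above.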
SpecReason accelerates Large Reasoning Model (LRM) inference by using a lightweight model to speculatively generate intermediate reasoning steps, reserving the base model for verification and correction. Unlike standard speculative decoding which requires exact token matches, SpecReason leverages the semantic flexibility of 'thinking tokens' to maintain or improve final answer accuracy. This approach significantly reduces latency while potentially increasing accuracy across complex reasoning benchmarks.
Legilimens is a novel continuous learning system designed for mobile edge devices with System-on-Chip (SoC) GPUs. It leverages the abundant unified memory on these devices to optimize video analytics. The core innovation lies in efficiently adapting models to new visual scenes by recognizing embedding overlap, significantly reducing retraining costs and improving accuracy compared to existing systems.
Guillotine is a proposed hypervisor architecture designed to sandbox high-risk AI models by addressing failures in traditional virtualization. It mandates a hardware-software co-design to eliminate side-channel and reflection-based vulnerabilities, complemented by extreme physical fail-safes—such as electromechanical disconnects—to ensure containment if digital layers are breached.
SpecReason is a novel system that accelerates Large Reasoning Model (LRM) inference by leveraging the inherent tolerance for approximation in LRM reasoning. It uses a lightweight model to speculatively handle intermediate reasoning steps, calling upon the more robust base model only for verification and correction. This approach capitalizes on the semantic flexibility of "thinking tokens" to sustain accuracy while significantly reducing computational overhead.
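The draft-then-verify loop both SpecReason entries describe might look like the skeleton below, with the lightweight and base models stubbed out as callables. The integer scoring threshold and the `<done>` sentinel are invented for illustration; the real system judges semantic utility rather than exact token matches.

```python
def spec_reason(problem, draft_step, verify, refine, max_steps=8, threshold=7):
    """Speculative reasoning loop (sketch).

    draft_step(problem, steps) -> candidate step from the lightweight model
    verify(problem, steps, c)  -> utility score assigned by the base model
    refine(problem, steps)     -> step regenerated by the base model on rejection
    """
    steps = []
    for _ in range(max_steps):
        candidate = draft_step(problem, steps)
        if verify(problem, steps, candidate) >= threshold:
            steps.append(candidate)               # cheap draft accepted
        else:
            steps.append(refine(problem, steps))  # fall back to base model
        if steps[-1].endswith("<done>"):
            break
    return steps
```

The key contrast with token-level speculative decoding is that `verify` scores a whole reasoning step for usefulness, so a semantically fine but differently worded draft still passes.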
Scallop reframes the Selective Forwarding Unit (SFU) — the core relay component of video conferencing infrastructure — as an SDN-style split-plane system: latency-sensitive media operations (drop, forward) are offloaded to a Tofino programmable ASIC data plane, while infrequent control logic (feedback signal analysis) remains in software. The key insight driving this design is that production SFU workloads are dominated by classic packet-processing primitives, making them a natural fit for programmable hardware. The Tofino-based implementation achieves full WebRTC compatibility while delivering 7–210x scaling improvement over a 32-core commodity server and 26x reduction in forwarding-induced latency.
Video conferencing infrastructure is under strain due to increasing traffic. This paper introduces Scallop, an SDN-inspired Selective Forwarding Unit (SFU) that offloads latency-sensitive media operations to a hardware-based data plane. This approach significantly improves scalability and reduces forwarding-induced latency compared to traditional software-based SFUs.
As AI inference pipelines grow in complexity and distribution, network latency is typically treated as an unavoidable cost. This paper reframes the network — specifically SmartNICs — as an active compute resource capable of offloading data processing tasks that are structurally similar to packet processing pipelines. The authors argue that the computational characteristics of SmartNICs are well-matched to resource-intensive preprocessing and pipeline tasks, presenting a research agenda to integrate network hardware as a first-class optimization layer in AI serving infrastructure.
As AI inference pipelines grow in complexity and distribution, network latency is typically treated as an unavoidable tax. This paper reframes the network — specifically SmartNICs — as an active compute resource capable of absorbing data processing workloads that align naturally with packet processing pipelines. The authors argue that offloading resource-intensive preprocessing and pipeline tasks to SmartNICs can reduce overhead on GPUs and CPUs, and propose a research agenda to systematically integrate network hardware into AI serving infrastructure.
METIS is a novel RAG system that addresses the trade-off between response delay and generation quality. It jointly schedules queries and dynamically adapts RAG configurations like the number of retrieved text chunks and synthesis methods. This approach allows METIS to balance quality optimization and response time reduction, outperforming prior RAG optimization schemes.
METIS is a novel RAG system designed to mitigate the inherent trade-off between response quality and generation latency. It achieves this by simultaneously scheduling RAG queries and dynamically adjusting key RAG configurations, such as the quantity of retrieved text chunks and the synthesis methods employed. This adaptive approach aims to optimize the balance between maximizing quality and minimizing response delay in RAG-based LLM applications.
Hybrid LLMs, while efficient for long contexts, struggle with traditional prefix caching due to their in-place state updates. This limitation results in low cache reuse and excessive memory consumption. Marconi addresses this by introducing novel admission and eviction policies that prioritize cache entries based on reuse likelihood and computational savings, significantly improving token hit rates and reducing time to first token.
Hybrid language models, which merge attention and recurrent layers, struggle with efficient prefix caching due to their in-place state updates. This limitation forces exact-match cache hits and generates numerous, minimally reused cache entries. Marconi introduces novel admission and eviction policies that consider reuse likelihood and compute savings to significantly improve prefix caching for these models.
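A toy version of utility-based admission and eviction: score each cached prefix by estimated reuse likelihood times the compute it would save, and evict a resident entry only for a more valuable candidate. The token-count capacity model and the `PrefixCache` API are assumptions for this sketch, not Marconi's implementation.

```python
class PrefixCache:
    """Utility-aware prefix cache: value = reuse likelihood x compute saved."""

    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.entries = {}  # prefix -> (num_tokens, reuse_prob, flops_saved)

    def _utility(self, prefix):
        _, reuse_prob, flops_saved = self.entries[prefix]
        return reuse_prob * flops_saved

    def admit(self, prefix, num_tokens, reuse_prob, flops_saved):
        candidate_utility = reuse_prob * flops_saved
        while sum(e[0] for e in self.entries.values()) + num_tokens > self.capacity:
            if not self.entries:
                return False  # candidate alone exceeds capacity
            victim = min(self.entries, key=self._utility)
            if self._utility(victim) >= candidate_utility:
                return False  # every resident entry is worth more: reject
            del self.entries[victim]
        self.entries[prefix] = (num_tokens, reuse_prob, flops_saved)
        return True
```

Unlike plain LRU, a rarely reused entry is rejected at admission time even when space exists to evict for it, which is what keeps low-value recurrent states from crowding the cache.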
Mowgli is a novel data-driven rate control system for real-time video that learns from existing production telemetry logs. This approach circumvents the performance degradation issues associated with traditional data-driven methods during training, making it practical for production environments. Mowgli leverages robust learning techniques to account for inherent uncertainties in log-based learning, leading to improved video quality and reduced freezing compared to widely deployed algorithms.
Mowgli learns an improved rate control policy solely from existing production telemetry logs generated by the incumbent GCC algorithm, avoiding the performance degradation of active training methods. It addresses log-based learning challenges—such as delayed or reordered decisions and lack of feedback for counterfactuals—through conservative reasoning on alternate actions and a noise-aware model formulation. In emulated and real-world networks, Mowgli boosts average video bitrates by 15-39% and cuts freeze rates by 60-100% versus GCC.
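The conservative reasoning over alternate actions can be caricatured as value maximization with a penalty for straying far from the logged (GCC) action, since the logs give no feedback for distant counterfactuals. The linear penalty and `conservative_action` are illustrative assumptions, not Mowgli's actual formulation.

```python
def conservative_action(q_values, logged_action, penalty=0.5):
    """Pick the action maximizing estimated value minus an out-of-log penalty.

    q_values:      {action: estimated reward} from the offline model
    logged_action: action the incumbent controller actually took
    penalty:       cost per unit deviation from logged behavior (assumed knob)
    """
    def score(a):
        return q_values[a] - penalty * abs(a - logged_action)
    return max(q_values, key=score)
```

With a high penalty the policy hugs the logged behavior; lowering it lets the model exploit high-reward deviations it has enough evidence for, which is the trade-off log-based learning must manage.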
This talk advocates for a paradigm shift in networking research, urging the community to integrate application-level insights into network design. By understanding application behaviors, workloads, and control mechanisms, researchers can develop more effective solutions to current networking challenges, moving beyond traditional, lower-level stack optimizations. The methodology emphasizes leveraging cross-disciplinary tools, such as program analysis and data-driven algorithms, to link application and network characteristics for improved performance.
Apparate introduces automated early exits (EEs) in ML models, allowing select inputs to produce results at intermediate layers and thus bypass full computation. It addresses EE challenges, such as variable overheads and accuracy risks, via runtime monitoring and adaptation driven by exit feedback. This yields 40.5-91.5% median latency reductions for CV/NLP classification and 22.6-77.9% for generative tasks, while preserving throughput and accuracy.
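The early-exit control flow reduces to: run layers in order, and return as soon as an attached exit head clears its confidence threshold. This sketch assumes an exit head on the final layer so every input yields a prediction; all function names are hypothetical, and Apparate's contribution is tuning the thresholds and exit placements at runtime, which is elided here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forward_with_exits(x, layers, exit_heads, thresholds):
    """Run layers in order, returning early when an exit head is confident.

    layers:      list of layer functions applied in sequence
    exit_heads:  {layer_index: head_fn} lightweight classifiers on activations
    thresholds:  {layer_index: min top-1 probability to take that exit}
    Assumes the final layer has an exit head, which always fires.
    """
    last = len(layers) - 1
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in exit_heads:
            probs = softmax(exit_heads[i](x))
            conf = max(probs)
            if conf >= thresholds.get(i, 0.0) or i == last:
                return probs.index(conf), i  # (predicted class, exit layer)
```

The returned exit layer is the feedback signal a runtime adapter would monitor to retune thresholds as input difficulty drifts.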
MadEye is a camera-server system that uses commodity PTZ cameras to continuously adapt rotation and zoom for optimal video analytics accuracy under given workloads and resource constraints. It employs a rapid search algorithm to explore the vast orientation space and a camera-side knowledge distillation method to select high-accuracy configurations without additional server resources. Experiments demonstrate 2.9-25.7% accuracy gains at fixed resource levels or 2-3.7x resource reductions for equivalent accuracy across diverse workloads.
Modern web archiving fails due to JavaScript-induced storage bloat and non-determinism, causing missing snapshots and poor page fidelity. Jawa, a JavaScript-aware crawler, exploits archive-specific traits—no backend servers and absent client authentication—to remove non-functional and unreachable code, slashing JavaScript storage by 84% (41% overall). It eliminates non-determinism from client characteristics (e.g., user agent, screen size) that affect control flow and URL fetches, reducing failed resource requests to near zero across 3000 pages, while preserving essential randomness in data flows.
MARVOLO is a binary mutator that applies semantics-preserving code transformations to augment malware and benign datasets, mimicking real-world developer changes. It propagates original labels to generated samples without reverse engineering, addressing data scarcity and copyright issues in cybersecurity. Optimizations maximize diverse sample density within resource budgets, yielding up to 5% accuracy gains on commercial datasets using only 15% of inputs.
Privid introduces ρ-event duration privacy (ρ-DP), a differential privacy variant protecting transient objects visible for under ρ seconds (e.g., 30-60s in surveillance footage) without needing to detect or obfuscate individuals. Analysts submit custom black-box neural networks to process video chunks into tables, followed by SQL aggregations, with noise added proportional to visibility duration to bound individual impact. This supports utility for aggregate queries like hourly pedestrian counts or mask compliance while preventing tracking, with graceful privacy degradation for longer-visible objects and open-source implementation.
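The duration-proportional noise idea can be sketched as a Laplace mechanism whose sensitivity is the number of chunk-level table rows a duration-bounded object can influence. The chunking model and `private_count` are assumptions for illustration, not Privid's exact mechanism.

```python
import math, random

def private_count(true_count, rho_seconds, chunk_seconds, epsilon, rng=None):
    """Release an aggregate count with duration-calibrated Laplace noise.

    An object visible for at most rho_seconds appears in at most
    ceil(rho_seconds / chunk_seconds) chunk-level rows, so that bound
    serves as the count query's sensitivity.
    """
    rng = rng or random
    sensitivity = -(-rho_seconds // chunk_seconds)   # ceil division
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                           # Laplace(0, scale) via inverse CDF
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise
```

Objects visible longer than ρ still get some protection, just with weaker guarantees, which matches the graceful degradation noted above.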
Dashlet addresses swipe-induced uncertainty in short video streaming by using insights from TikTok traces and user swipe studies to model swipe statistics without machine learning. It employs out-of-order video chunk pre-buffering, prioritizing chunks based on predicted swipe order and bitrate to maximize QoE under variable network and user behavior. Evaluations show Dashlet achieves 77-99% of oracle QoE, outperforms TikTok by 43.9-45.1x in QoE, and cuts unwatched video bytes by 30%.
Bamboo is a distributed system that leverages pipeline parallelism in large DNN training to insert redundant computations, where nodes execute layers from neighbors alongside their own, enabling fast recovery from frequent preemptions in cheap preemptible instances. This approach exploits natural pipeline bubbles to minimize overhead from redundancy. Evaluations across popular DNN models show 3.7x higher training throughput than traditional checkpointing and 2.4x cost reduction versus on-demand instances.
Canvas redesigns remote memory swapping to provide full isolation of swap partitions, caches, prefetchers, and RDMA bandwidth for co-running datacenter applications. This isolation prevents interference-induced slowdowns from shared data paths and enables app-specific optimizations: adaptive swap allocation, semantics-aware prefetching, and 2D RDMA scheduling. Evaluations show Canvas minimizes performance variation and reduces degradation across widely-deployed applications.
GEMEL introduces model merging to address memory constraints on edge GPUs for real-time video analytics by sharing layers and weights across similar vision models. It identifies accuracy-preserving merging configurations using per-model memory usage and inter-layer dependencies, then adjusts inference schedules to maximize benefits. Evaluations show up to 60.7% memory reduction and 8-39% accuracy gains over time/space sharing baselines.
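One structural ingredient of merging — finding identical layers that could share a single weight copy — can be sketched as a common-prefix scan over layer signatures. GEMEL's actual search is not limited to prefixes and also validates accuracy after merging; `mergeable_prefix` and the `(layer_type, shape)` signature format are simplifying assumptions.

```python
def mergeable_prefix(model_a, model_b):
    """Longest common prefix of layer signatures whose weights could be shared.

    Each model is a list of (layer_type, shape) signatures; identical
    signatures from the input onward are candidates for one shared copy,
    since early vision layers are the most architecturally similar.
    """
    shared = []
    for a, b in zip(model_a, model_b):
        if a != b:
            break
        shared.append(a)
    return shared
```

Every signature in the returned list is a layer loaded once instead of per-model, which is where the reported memory reduction comes from.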
Revelio uses deep neural networks to embed diverse inputs like natural language bug reports and system logs into a shared vector space, generating relevant debugging queries for distributed systems. It factorizes query generation into simpler tasks leveraging production observations to handle unseen faults. Evaluated on a testbed with fault-injected distributed apps and 800 Mechanical Turk reports, it ranks the optimal query in its top-3 predictions 96% of the time, validated by developer studies.
Privid proposes (ρ,K,ε)-event-duration privacy, a new differential privacy variant that safeguards private information visible for less than duration K in video analytics, bypassing the need for flawless detection in every frame. The system enforces this privacy using untrusted, analyst-supplied deep neural networks for common video queries. Evaluations across diverse videos and queries demonstrate Privid retains 79-99% of non-private system accuracy.
Boggart uses traditional computer vision algorithms to pre-build imprecise but comprehensive indices across videos, enabling support for arbitrary user-defined CNNs and queries in retrospective analytics platforms. For each query, it rapidly assesses index imprecision and applies minimal CNN inference with result propagation to bound accuracy loss while minimizing latency and wasted computation. Evaluations demonstrate speedups matching or exceeding prior model-specific systems without sacrificing generality, accuracy, or efficiency.
Dorylus is a distributed system for GNN training that separates computation into graph and tensor parallel tasks, enabling a deep bounded-asynchronous pipeline that overlaps operations and hides network latency using serverless Lambda threads. It scales to billion-edge graphs by leveraging thousands of Lambdas atop CPU servers, which provide superior performance-per-dollar compared to GPUs for large graphs. Dorylus delivers up to 2.75x better performance-per-dollar than CPU-only training, 1.22x faster and 4.83x cheaper than GPU servers, and up to 3.8x faster and 10.7x cheaper than sampling-based systems.
SAMU co-designs a Java runtime and NVMe-oF swap system for disaggregated datacenters, offloading memory-intensive GC tracing to weak-compute memory servers while keeping compaction on CPU servers. It uses a universal tower heap for address-stable object migration and region-based tracing prioritization favoring young regions. Evaluations on Spark/Flink workloads show 80% (50% cache) and 32% (25% cache) average slowdowns versus baselines, outperforming RDMA swaps and ramdisks via higher near-memory tracing throughput (8.8x that of a CPU server).
Khameleon addresses network bottlenecks in interactive data visualization and exploration (DVE) applications through continuous prefetching and response tuning, delivering immediate lower-quality responses that progressively improve. It employs server-side scheduling that optimizes bandwidth usage based on predicted user interactions and resource constraints using a fast greedy algorithm approximating the optimal solution in real-time. Evaluations with real user traces show it outperforms perfect-prediction prefetchers, reducing latencies by 2-3 orders of magnitude while maintaining 50-80% response quality across diverse networks and devices.
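The greedy scheduling step can be sketched as ranking candidate responses by predicted utility per byte — interaction probability times quality gain over size — and filling the bandwidth budget in that order. The tuple layout and `greedy_schedule` are assumptions for the sketch, not Khameleon's scheduler.

```python
def greedy_schedule(candidates, budget_bytes):
    """Fill the send buffer greedily by predicted utility per byte.

    candidates: list of (request_id, prob, quality_gain, size_bytes) tuples,
    where prob is the predicted likelihood the user issues that request.
    """
    ranked = sorted(candidates, key=lambda c: c[1] * c[2] / c[3], reverse=True)
    plan, used = [], 0
    for rid, prob, gain, size in ranked:
        if used + size <= budget_bytes:
            plan.append(rid)
            used += size
    return plan
```

Rerunning this every scheduling round is what lets the server keep lower-quality responses flowing immediately while progressively upgrading the likeliest ones.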
ABC introduces a router-assisted congestion control protocol using a single bit per packet (accelerate/break marks via ECN) to dynamically adjust sender rates from 0 to 2x current rate per RTT, addressing underutilization and bufferbloat in wireless networks. Unlike prior explicit schemes like XCP that compare enqueue rate to capacity and overshoot, ABC's novel control loop uses dequeue rate to predict future enqueue rate one RTT ahead, ensuring accurate tracking of rapidly varying link capacities. Evaluations on cellular and Wi-Fi traces show ABC achieves superior throughput-delay tradeoffs, outperforming Cubic, AQM schemes, Verus, and XCP.
ABC is an explicit congestion control protocol for wireless networks where routers mark packets as "accelerate" or "brake" to guide senders toward a target rate via minor cwnd adjustments. It requires no header changes or device modifications, is incrementally deployable, and interoperates with non-ABC routers and traffic. Evaluations on Wi-Fi and cellular traces show ABC delivers 30-40% higher throughput than Cubic+Codel at similar delays, 2.2x lower delays than BBR on Wi-Fi, and 50% higher throughput on cellular paths.
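The per-ack reaction the two ABC entries describe is simple enough to sketch directly: an "accelerate" mark means send two packets for this ack, a "brake" mark means send none, so a fraction f of accelerate marks scales the sending rate to 2f per RTT — covering the full 0x to 2x range. Function names here are illustrative.

```python
def on_ack(cwnd, mark, min_cwnd=1):
    """ABC sender reaction to the one-bit feedback on each acked packet.

    'accelerate' -> send two packets for this ack (cwnd + 1)
    'brake'      -> send none for this ack (cwnd - 1, floored at min_cwnd)
    """
    if mark == "accelerate":
        return cwnd + 1
    return max(cwnd - 1, min_cwnd)

def rate_multiplier(accel_fraction):
    """Average per-RTT rate change when a fraction f of acks say 'accelerate':
    f acks trigger 2 packets and (1 - f) trigger 0, so rate -> 2f x rate."""
    return 2 * accel_fraction
```

Marking exactly half the packets "accelerate" holds the rate steady (multiplier 1.0), giving the router fine-grained, per-packet control with a single reused ECN-style bit.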
A US survey of 876 respondents reveals 51% would participate in P2P content delivery for suitable financial incentives, exceeding expectations and addressing key adoption barriers. Gringotts introduces a secure system using novel Proof of Delivery to verify file transfers and cryptocurrency payments that resist Sybil and liar attacks. This enables content providers to integrate incentivized P2P delivery cost-effectively.