absorb.md

Ravi Netravali

Chronological feed of everything captured from Ravi Netravali.

Lumos Enables Low-Overhead Provenance-Guided Debugging for Production Distributed Systems

Lumos is an online debugging framework that automatically captures application-level bug provenances: computational histories that link runtime symptoms to their root causes in distributed systems. It uses static analysis to drive dependency-guided instrumentation, selecting only the relevant program state and enabling lightweight, on-demand recording with low runtime overhead. Developers can identify root causes from only a few bug occurrences using the collected evidence, overcoming a limitation of existing tools that require manual trace collection.

Snicket: A Query-Driven Distributed Tracing System for Microservices

Snicket is a query-driven distributed tracing system designed to overcome the limitations of traditional head-based and tail-based sampling methods in microservice architectures. It addresses the issues of over-collection of unnecessary data and under-collection of critical data by focusing on collecting only the information relevant to a specific query. Snicket utilizes a domain-specific language to compile query extensions that run on sidecars, allowing for efficient, application-agnostic data collection and computation close to the source.

Mowgli: Offline RL for Real-time Video Rate Control

Mowgli is an offline reinforcement learning (RL) approach that improves real-time video rate control by analyzing historical telemetry logs from heuristic algorithms like Google Congestion Control (GCC). This method avoids the user-disrupting exploration phase of online RL, achieving comparable performance improvements by identifying and generalizing high-reward adaptation behaviors from existing data. It offers a practical solution for enhancing video conferencing quality without negatively impacting user experience during training.

Lumos: Provenance-Guided Debugging for Distributed Systems

Debugging distributed systems in production is challenging due to non-deterministic bugs and the difficulty of correlating symptoms with root causes across multiple components. Lumos addresses this by providing an online debugging framework that uses dependency-guided instrumentation and static analysis to identify and expose application-level bug provenances. The system aims to provide sufficient evidence for root cause identification while incurring low runtime overhead and requiring only a few occurrences of the bug.

Legilimens: Continuous Learning for Mobile Edge Video Analytics

Legilimens is a continuous learning system designed for mobile edge devices with System-on-Chip (SoC) GPUs, addressing the unique resource constraints of such platforms. It leverages the insight that visually distinct scenes often share significant model embedding overlap, enabling lightweight specialization of a base model. The system reduces retraining costs and improves accuracy by optimizing data sample selection, updating the base model without full retraining, and time-sharing compute resources between inference and retraining.

FailFast: Optimizing Speculative Decoding with Diffusion LLMs for Enhanced LLM Acceleration

FailFast is a novel speculative decoding framework that leverages Diffusion Large Language Models (dLLMs) to accelerate autoregressive LLMs. It addresses the efficiency-quality tradeoff of dLLMs by dynamically adjusting speculation length. This approach minimizes computation in difficult-to-speculate regions and aggressively extends draft lengths in easier regions, resulting in significant speedups without fine-tuning.
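The idea of adjusting speculation length on the fly can be illustrated with a small controller. This is a minimal sketch under stated assumptions, not FailFast's actual policy: the doubling and halving rule, the bounds, and the name `next_draft_length` are all hypothetical, chosen only to show short drafts in hard regions and long drafts in easy ones.

```python
def next_draft_length(current: int, accepted: int,
                      min_len: int = 1, max_len: int = 64) -> int:
    """Pick the next draft length from the last verification outcome.

    `current` is how many tokens were drafted, `accepted` is how many of
    them the target model accepted. The grow/shrink rule is an
    illustrative assumption, not the rule used by FailFast.
    """
    if not 0 <= accepted <= current:
        raise ValueError("accepted must be between 0 and current")
    if accepted == current:
        # Entire draft accepted: treat the region as easy and draft further.
        return min(max_len, current * 2)
    # Draft rejected partway: treat the region as hard and fail fast
    # by spending less drafting compute on the next attempt.
    return max(min_len, current // 2)
```

A serving loop would call this after each verification step and pass the result to the drafter as its next token budget.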

Dynamic Just-in-Time Model Routing for Scalable Agentic Workflows

Aragog introduces a just-in-time model routing architecture for agentic workflows that decouples configuration selection into a one-time accuracy-preserving routing step and a lightweight per-stage scheduler. By adapting LLM assignments dynamically based on real-time system observations, it optimizes for fluctuating loads without sacrificing output quality. This approach significantly improves throughput and reduces latency compared to static binding strategies.

Aragog: Dynamic LLM Configuration for Agentic Workflows

Aragog is a system designed to optimize the serving of agentic workflows by dynamically adjusting LLM configurations during execution. This approach addresses the limitations of static configurations, which become suboptimal due to fluctuating system loads. Aragog decouples configuration selection into pre-computation of accuracy-preserving configurations and a real-time, per-stage scheduler, significantly improving throughput and reducing latency.
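The per-stage scheduling step can be sketched as follows. This is only an illustration under assumptions: the candidate list stands in for the pre-computed accuracy-preserving configurations, and the queue-depth-times-service-time estimate, along with the names `pick_model`, `queue_depth`, and `service_time`, are hypothetical rather than Aragog's actual cost model.

```python
from typing import Dict, List


def pick_model(stage_options: List[str],
               queue_depth: Dict[str, int],
               service_time: Dict[str, float]) -> str:
    """Choose, for one workflow stage, the candidate expected to finish first.

    `stage_options` holds models already judged accuracy-preserving for
    this stage (assumed to come from an earlier offline step). The
    completion-time estimate below is a deliberately simple assumption.
    """
    def estimated_completion(model: str) -> float:
        # Requests already queued plus this one, each costing service_time.
        return (queue_depth.get(model, 0) + 1) * service_time[model]

    return min(stage_options, key=estimated_completion)
```

Because the choice is made per stage at run time, the same workflow can land on different models as load shifts, which is the contrast with static binding drawn above.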

LessIsMore: Training-Free Sparse Attention for Efficient LLM Reasoning

LessIsMore is a novel, training-free sparse attention mechanism designed to improve the efficiency of large language models (LLMs) in reasoning tasks. It utilizes global attention patterns and unified cross-head token ranking to overcome limitations of traditional sparse attention methods, which often lead to accuracy degradation. This approach achieves significant speed-ups and token reduction without compromising accuracy.

Training-Free Sparse Attention via Cross-Head Token Aggregation Cuts Reasoning Latency Without Accuracy Loss

LessIsMore is a training-free sparse attention method designed for large reasoning models, addressing the computational overhead of long token generation during test-time scaling. Rather than applying head-specific local token selection — the dominant paradigm in prior sparse attention work — it aggregates token selections across local attention heads combined with recent context to produce a unified cross-head token ranking used for future decoding layers. This global aggregation strategy avoids maintaining per-head token subsets, improving both generalization and efficiency. The result is 2× token reduction without accuracy degradation, a 1.1× decoding speed-up over full attention, and 1.13× end-to-end speed-up over existing sparse attention baselines.
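The cross-head aggregation described above can be sketched in plain Python. This is a minimal illustration under assumptions: the voting rule (count how many heads pick a token, break ties by summed score), the fixed recent-token window, and the name `unified_token_selection` are hypothetical and not taken from the LessIsMore implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def unified_token_selection(head_scores: Sequence[Sequence[float]],
                            per_head_k: int,
                            budget: int,
                            recent_window: int) -> List[int]:
    """Merge per-head top-k picks into one token set shared by all heads.

    head_scores[h][i] is head h's attention score for token i.
    The ranking rule here is an illustrative assumption.
    """
    if recent_window > budget:
        raise ValueError("recent_window must not exceed budget")
    num_tokens = len(head_scores[0])
    # Always keep the most recent tokens.
    keep = list(range(max(0, num_tokens - recent_window), num_tokens))

    votes: Dict[int, int] = defaultdict(int)
    totals: Dict[int, float] = defaultdict(float)
    for scores in head_scores:
        top = sorted(range(num_tokens), key=lambda i: scores[i],
                     reverse=True)[:per_head_k]
        for i in top:
            votes[i] += 1
            totals[i] += scores[i]

    # One global ranking instead of a separate subset per head.
    ranked = sorted(votes, key=lambda i: (-votes[i], -totals[i], i))
    for i in ranked:
        if len(keep) >= budget:
            break
        if i not in keep:
            keep.append(i)
    return sorted(keep)
```

The returned indices form a single subset that every head would attend to, so no per-head token lists need to be stored.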

Accelerating LRM Inference via Semantic Speculative Reasoning

SpecReason accelerates Large Reasoning Model (LRM) inference by using a lightweight model to speculatively generate intermediate reasoning steps, reserving the base model for verification and correction. Unlike standard speculative decoding which requires exact token matches, SpecReason leverages the semantic flexibility of 'thinking tokens' to maintain or improve final answer accuracy. This approach significantly reduces latency while potentially increasing accuracy across complex reasoning benchmarks.
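The draft-then-verify loop over reasoning steps can be sketched as below. This is an illustration under assumptions, not SpecReason's code: the four callables stand in for the lightweight model, the base model's judgment, the base model's own generation, and a stopping check, and the numeric `threshold` acceptance rule is hypothetical.

```python
from typing import Callable, List


def speculative_reasoning(propose: Callable[[List[str]], str],
                          score: Callable[[List[str], str], float],
                          regenerate: Callable[[List[str]], str],
                          is_done: Callable[[List[str]], bool],
                          threshold: float = 0.7,
                          max_steps: int = 32) -> List[str]:
    """Build a reasoning chain one step at a time.

    propose:    lightweight model drafts the next step (assumed interface)
    score:      base model rates the drafted step (assumed interface)
    regenerate: base model writes the step itself when the draft is rejected
    """
    steps: List[str] = []
    while len(steps) < max_steps and not is_done(steps):
        draft = propose(steps)
        if score(steps, draft) >= threshold:
            # Accepted on semantic adequacy, not exact token match.
            steps.append(draft)
        else:
            steps.append(regenerate(steps))
    return steps
```

The contrast with token-level speculative decoding is the acceptance test: a whole step is kept if it is judged good enough, even when the base model would have worded it differently.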

Legilimens: Continuous Learning for On-Device Video Analytics on Mobile Edge SoCs

Legilimens is a novel continuous learning system designed for mobile edge devices with System-on-Chip (SoC) GPUs. It leverages the abundant unified memory on these devices to optimize video analytics. The core innovation lies in efficiently adapting models to new visual scenes by recognizing embedding overlap, significantly reducing retraining costs and improving accuracy compared to existing systems.

Guillotine: Hardware-Software Co-Design for Existential AI Containment

Guillotine is a proposed hypervisor architecture designed to sandbox high-risk AI models by addressing failures in traditional virtualization. It mandates a hardware-software co-design to eliminate side-channel and reflection-based vulnerabilities, complemented by extreme physical fail-safes—such as electromechanical disconnects—to ensure containment if digital layers are breached.

ABC: Simple Explicit Congestion Control Excels on Wireless Networks

ABC is an explicit congestion control protocol for wireless networks where routers mark packets as "accelerate" or "brake" to guide senders toward a target rate via minor cwnd adjustments. It requires no header changes or device modifications, is incrementally deployable, and interoperates with non-ABC routers and traffic. Evaluations on Wi-Fi and cellular traces show ABC delivers 30-40% higher throughput than Cubic+Codel at similar delays, 2.2× lower delays than BBR on Wi-Fi, and 50% higher throughput on cellular paths.
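The sender-side reaction to these marks can be sketched as follows. This is a minimal illustration under assumptions: the one-packet step per mark, the floor of one packet, and the names `Mark`, `on_ack`, and `replay` are hypothetical, and the router's logic for choosing marks is not modeled.

```python
from enum import Enum
from typing import Iterable


class Mark(Enum):
    ACCELERATE = "accelerate"
    BRAKE = "brake"


def on_ack(cwnd: int, mark: Mark, min_cwnd: int = 1) -> int:
    """Return the new congestion window after one ACK carrying a mark.

    The +1 / -1 step size is an assumption used for illustration.
    """
    if mark is Mark.ACCELERATE:
        return cwnd + 1
    return max(min_cwnd, cwnd - 1)


def replay(cwnd: int, marks: Iterable[Mark]) -> int:
    """Apply a sequence of marks to a starting window."""
    for mark in marks:
        cwnd = on_ack(cwnd, mark)
    return cwnd
```

Under this sketch, a router can steer the sender's window up or down by changing the fraction of packets it marks "accelerate", which is how per-packet feedback translates into a target rate.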

Monetary Incentives and Cryptocurrency-Enabled Proofs Unlock Peer Participation in P2P Video Delivery

A US survey of 876 respondents finds that 51% would participate in P2P content delivery given suitable financial incentives, a higher share than expected and a sign that incentives can address a key adoption barrier. Gringotts is a secure system that uses a novel Proof of Delivery to verify file transfers, paired with cryptocurrency payments, and that resists Sybil and liar attacks. This enables content providers to integrate incentivized P2P delivery cost-effectively.