Chronological feed of everything captured from Ravi Netravali.
Snicket is a query-driven distributed tracing system designed to overcome the limitations of traditional head-based and tail-based sampling methods in microservice architectures. It addresses the issues of over-collection of unnecessary data and under-collection of critical data by focusing on collecting only the information relevant to a specific query. Snicket provides a domain-specific language that compiles queries into extensions running on sidecars, enabling efficient, application-agnostic data collection and computation close to the source.
Mowgli is an offline reinforcement learning (RL) approach that improves real-time video rate control by analyzing historical telemetry logs from heuristic algorithms like Google Congestion Control (GCC). This method avoids the user-disrupting exploration phase of online RL, achieving comparable performance improvements by identifying and generalizing high-reward adaptation behaviors from existing data. It offers a practical solution for enhancing video conferencing quality without negatively impacting user experience during training.
Debugging distributed systems in production is challenging due to non-deterministic bugs and the difficulty of correlating symptoms with root causes across multiple components. Lumos addresses this by providing an online debugging framework that utilizes dependency-guided instrumentation and static analysis to identify and expose application-level bug provenances. The system aims to provide sufficient evidence for root cause identification with low runtime overhead and minimal bug occurrences.
Legilimens is a continuous learning system designed for mobile edge devices with System-on-Chip (SoC) GPUs, addressing the unique resource constraints of such platforms. It leverages the insight that visually distinct scenes often share significant model embedding overlap, enabling lightweight specialization of a base model. The system reduces retraining costs and improves accuracy by optimizing data sample selection, base model updates without full retraining, and time-sharing compute resources for inference and retraining.
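The embedding-overlap insight above can be sketched as a similarity check against previously specialized scenes: if a new scene's embedding overlaps enough with a cached one, reuse that specialization instead of retraining. This is a minimal illustration under assumptions, not Legilimens's actual matching logic; `pick_specialization`, the 0.8 threshold, and the cached-scene dictionary are all invented for the sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pick_specialization(scene_emb, cached, threshold=0.8):
    """Reuse the cached specialization whose scene embedding overlaps most.

    cached: {scene_name: embedding}. Returns a cached scene name when its
    similarity clears the threshold, else None to trigger light retraining.
    """
    best, best_sim = None, threshold
    for name, emb in cached.items():
        sim = cosine(scene_emb, emb)
        if sim >= best_sim:
            best, best_sim = name, sim
    return best
```

A `None` return is the signal to specialize a new head from the base model; anything else reuses existing weights at zero retraining cost.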
FailFast is a novel speculative decoding framework that leverages Diffusion Large Language Models (dLLMs) to accelerate autoregressive LLMs. It addresses the efficiency-quality tradeoff of dLLMs by dynamically adjusting speculation length. This approach minimizes computation in difficult-to-speculate regions and aggressively extends draft lengths in easier regions, resulting in significant speedups without fine-tuning.
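A minimal sketch of the dynamic-speculation-length idea: grow the draft aggressively when the verifier accepts everything, and shrink toward the point of first rejection otherwise so little compute is wasted in hard regions. The doubling/shrinking rule and `next_draft_len` are illustrative assumptions, not FailFast's published controller.

```python
def next_draft_len(cur_len, accepted, min_len=1, max_len=64):
    """Adapt speculation length from the last round's acceptance count.

    accepted == cur_len means the whole draft was verified: the region is
    'easy', so double the draft. Early rejection means the region is hard,
    so shrink to just past the first rejection point.
    """
    if accepted == cur_len:
        return min(cur_len * 2, max_len)
    return max(accepted + 1, min_len)
```

In steady state this behaves like a multiplicative-increase controller gated by verification outcomes, mirroring the easy-region/hard-region asymmetry the summary describes.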
Aragog introduces a just-in-time model routing architecture for agentic workflows that decouples configuration selection into a one-time accuracy-preserving routing step and a lightweight per-stage scheduler. By adapting LLM assignments dynamically based on real-time system observations, it optimizes for fluctuating loads without sacrificing output quality. This approach significantly improves throughput and reduces latency compared to static binding strategies.
Aragog is a system designed to optimize the serving of agentic workflows by dynamically adjusting LLM configurations during execution. This approach addresses the limitations of static configurations, which become suboptimal due to fluctuating system loads. Aragog decouples configuration selection into pre-computation of accuracy-preserving configurations and a real-time, per-stage scheduler, significantly improving throughput and reducing latency.
LessIsMore is a novel, training-free sparse attention mechanism designed to improve the efficiency of large language models (LLMs) in reasoning tasks. It utilizes global attention patterns and unified cross-head token ranking to overcome limitations of traditional sparse attention methods, which often lead to accuracy degradation. This approach achieves significant speed-ups and token reduction without compromising accuracy.
LessIsMore is a training-free sparse attention method designed for large reasoning models, addressing the computational overhead of long token generation during test-time scaling. Rather than applying head-specific local token selection — the dominant paradigm in prior sparse attention work — it aggregates token selections across local attention heads combined with recent context to produce a unified cross-head token ranking used for future decoding layers. This global aggregation strategy avoids maintaining per-head token subsets, improving both generalization and efficiency. The result is 2× token reduction without accuracy degradation, a 1.1× decoding speed-up over full attention, and 1.13× end-to-end speed-up over existing sparse attention baselines.
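The unified cross-head ranking can be illustrated by pooling per-head attention scores into a single score per token, then taking a global top-k alongside an always-kept recency window. The array shapes, the sum-pooling choice, and `unified_topk` are assumptions for this sketch, not the paper's exact procedure.

```python
import numpy as np

def unified_topk(attn, recent, k):
    """Pool per-head attention into one cross-head token ranking.

    attn:   (num_heads, seq_len) attention scores from the latest query
    recent: number of most-recent tokens always kept (recency window)
    k:      total number of tokens to retain for future decoding
    """
    scores = attn.sum(axis=0)                 # aggregate across heads
    n = scores.shape[0]
    keep = set(range(n - recent, n))          # always keep recent context
    # fill remaining slots with the globally highest-scoring earlier tokens
    for idx in np.argsort(scores[: n - recent])[::-1]:
        if len(keep) >= k:
            break
        keep.add(int(idx))
    return sorted(keep)
```

Because every head shares the one ranking, there is no per-head token subset to maintain, which is the source of the efficiency gain claimed above.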
SpecReason accelerates Large Reasoning Model (LRM) inference by using a lightweight model to speculatively generate intermediate reasoning steps, reserving the base model for verification and correction. Unlike standard speculative decoding which requires exact token matches, SpecReason leverages the semantic flexibility of 'thinking tokens' to maintain or improve final answer accuracy. This approach significantly reduces latency while potentially increasing accuracy across complex reasoning benchmarks.
Legilimens is a novel continuous learning system designed for mobile edge devices with System-on-Chip (SoC) GPUs. It leverages the abundant unified memory on these devices to optimize video analytics. The core innovation lies in efficiently adapting models to new visual scenes by recognizing embedding overlap, significantly reducing retraining costs and improving accuracy compared to existing systems.
Guillotine is a proposed hypervisor architecture designed to sandbox high-risk AI models by addressing failures in traditional virtualization. It mandates a hardware-software co-design to eliminate side-channel and reflection-based vulnerabilities, complemented by extreme physical fail-safes—such as electromechanical disconnects—to ensure containment if digital layers are breached.
SpecReason is a novel system that accelerates Large Reasoning Model (LRM) inference by leveraging the inherent tolerance for approximation in LRM reasoning. It uses a lightweight model to speculatively handle intermediate reasoning steps, calling upon the more robust base model only for verification and correction. This approach capitalizes on the semantic flexibility of "thinking tokens" to sustain accuracy while significantly reducing computational overhead.
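The draft-then-verify loop both SpecReason entries describe might look like the skeleton below, with the lightweight and base models stubbed out as callables. The integer scoring threshold and the `<done>` sentinel are invented for illustration; the real system judges semantic utility rather than exact token matches.

```python
def spec_reason(problem, draft_step, verify, refine, max_steps=8, threshold=7):
    """Speculative reasoning loop (sketch).

    draft_step(problem, steps) -> candidate step from the lightweight model
    verify(problem, steps, c)  -> utility score assigned by the base model
    refine(problem, steps)     -> step regenerated by the base model on rejection
    """
    steps = []
    for _ in range(max_steps):
        candidate = draft_step(problem, steps)
        if verify(problem, steps, candidate) >= threshold:
            steps.append(candidate)               # cheap draft accepted
        else:
            steps.append(refine(problem, steps))  # fall back to base model
        if steps[-1].endswith("<done>"):
            break
    return steps
```

The key contrast with token-level speculative decoding is that `verify` scores a whole reasoning step for usefulness, so a semantically fine but differently worded draft still passes.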
Scallop reframes the Selective Forwarding Unit (SFU) — the core relay component of video conferencing infrastructure — as an SDN-style split-plane system: latency-sensitive media operations (drop, forward) are offloaded to a Tofino programmable ASIC data plane, while infrequent control logic (feedback signal analysis) remains in software. The key insight driving this design is that production SFU workloads are dominated by classic packet-processing primitives, making them a natural fit for programmable hardware. The Tofino-based implementation achieves full WebRTC compatibility while delivering 7–210x scaling improvement over a 32-core commodity server and 26x reduction in forwarding-induced latency.
Video conferencing infrastructure is under strain due to increasing traffic. This paper introduces Scallop, an SDN-inspired Selective Forwarding Unit (SFU) that offloads latency-sensitive media operations to a hardware-based data plane. This approach significantly improves scalability and reduces forwarding-induced latency compared to traditional software-based SFUs.
As AI inference pipelines grow in complexity and distribution, network latency is typically treated as an unavoidable cost. This paper reframes the network — specifically SmartNICs — as an active compute resource capable of offloading data processing tasks that are structurally similar to packet processing pipelines. The authors argue that the computational characteristics of SmartNICs are well-matched to resource-intensive preprocessing and pipeline tasks, presenting a research agenda to integrate network hardware as a first-class optimization layer in AI serving infrastructure.
As AI inference pipelines grow in complexity and distribution, network latency is typically treated as an unavoidable tax. This paper reframes the network — specifically SmartNICs — as an active compute resource capable of absorbing data processing workloads that align naturally with packet processing pipelines. The authors argue that offloading resource-intensive preprocessing and pipeline tasks to SmartNICs can reduce overhead on GPUs and CPUs, and propose a research agenda to systematically integrate network hardware into AI serving infrastructure.
METIS is a novel RAG system that addresses the trade-off between response delay and generation quality. It jointly schedules queries and dynamically adapts RAG configurations like the number of retrieved text chunks and synthesis methods. This approach allows METIS to balance quality optimization and response time reduction, outperforming prior RAG optimization schemes.
METIS is a novel RAG system designed to mitigate the inherent trade-off between response quality and generation latency. It achieves this by simultaneously scheduling RAG queries and dynamically adjusting key RAG configurations, such as the quantity of retrieved text chunks and the synthesis methods employed. This adaptive approach aims to optimize the balance between maximizing quality and minimizing response delay in RAG-based LLM applications.
Hybrid LLMs, while efficient for long contexts, struggle with traditional prefix caching due to their in-place state updates. This limitation results in low cache reuse and excessive memory consumption. Marconi addresses this by introducing novel admission and eviction policies that prioritize cache entries based on reuse likelihood and computational savings, significantly improving token hit rates and reducing time to first token.
Hybrid language models, which merge attention and recurrent layers, struggle with efficient prefix caching due to their in-place state updates. This limitation forces exact-match cache hits and generates numerous, minimally reused cache entries. Marconi introduces novel admission and eviction policies that consider reuse likelihood and compute savings to significantly improve prefix caching for these models.
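A toy version of utility-based admission and eviction: score each cached prefix by estimated reuse likelihood times the compute it would save, and evict a resident entry only for a more valuable candidate. The token-count capacity model and the `PrefixCache` API are assumptions for this sketch, not Marconi's implementation.

```python
class PrefixCache:
    """Utility-aware prefix cache: value = reuse likelihood x compute saved."""

    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.entries = {}  # prefix -> (num_tokens, reuse_prob, flops_saved)

    def _utility(self, prefix):
        _, reuse_prob, flops_saved = self.entries[prefix]
        return reuse_prob * flops_saved

    def admit(self, prefix, num_tokens, reuse_prob, flops_saved):
        candidate_utility = reuse_prob * flops_saved
        while sum(e[0] for e in self.entries.values()) + num_tokens > self.capacity:
            if not self.entries:
                return False  # candidate alone exceeds capacity
            victim = min(self.entries, key=self._utility)
            if self._utility(victim) >= candidate_utility:
                return False  # every resident entry is worth more: reject
            del self.entries[victim]
        self.entries[prefix] = (num_tokens, reuse_prob, flops_saved)
        return True
```

Unlike plain LRU, a rarely reused entry is rejected at admission time even when space exists to evict for it, which is what keeps low-value recurrent states from crowding the cache.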
Mowgli is a novel data-driven rate control system for real-time video that learns from existing production telemetry logs. This approach circumvents the performance degradation issues associated with traditional data-driven methods during training, making it practical for production environments. Mowgli leverages robust learning techniques to account for inherent uncertainties in log-based learning, leading to improved video quality and reduced freezing compared to widely deployed algorithms.
Mowgli learns an improved rate control policy solely from existing production telemetry logs generated by the incumbent GCC algorithm, avoiding the performance degradation of active training methods. It addresses log-based learning challenges—such as delayed or reordered decisions and lack of feedback for counterfactuals—through conservative reasoning on alternate actions and a noise-aware model formulation. In emulated and real-world networks, Mowgli boosts average video bitrates by 15-39% and cuts freeze rates by 60-100% versus GCC.
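The conservative reasoning over alternate actions can be caricatured as value maximization with a penalty for straying far from the logged (GCC) action, since the logs give no feedback for distant counterfactuals. The linear penalty and `conservative_action` are illustrative assumptions, not Mowgli's actual formulation.

```python
def conservative_action(q_values, logged_action, penalty=0.5):
    """Pick the action maximizing estimated value minus an out-of-log penalty.

    q_values:      {action: estimated reward} from the offline model
    logged_action: action the incumbent controller actually took
    penalty:       cost per unit deviation from logged behavior (assumed knob)
    """
    def score(a):
        return q_values[a] - penalty * abs(a - logged_action)
    return max(q_values, key=score)
```

With a high penalty the policy hugs the logged behavior; lowering it lets the model exploit high-reward deviations it has enough evidence for, which is the trade-off log-based learning must manage.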
This talk advocates for a paradigm shift in networking research, urging the community to integrate application-level insights into network design. By understanding application behaviors, workloads, and control mechanisms, researchers can develop more effective solutions to current networking challenges, moving beyond traditional, lower-level stack optimizations. The methodology emphasizes leveraging cross-disciplinary tools, such as program analysis and data-driven algorithms, to link application and network characteristics for improved performance.
Apparate introduces automated early exits (EEs) in ML models, allowing select inputs to produce results at intermediate layers and thus bypass full computation. It addresses EE challenges, such as variable overheads and accuracy risks, via runtime monitoring and adaptation driven by exit feedback. This yields 40.5-91.5% median latency reductions for CV/NLP classification and 22.6-77.9% for generative tasks, while preserving throughput and accuracy.
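The early-exit control flow reduces to: run layers in order, and return as soon as an attached exit head clears its confidence threshold. This sketch assumes an exit head on the final layer so every input yields a prediction; all function names are hypothetical, and Apparate's contribution is tuning the thresholds and exit placements at runtime, which is elided here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forward_with_exits(x, layers, exit_heads, thresholds):
    """Run layers in order, returning early when an exit head is confident.

    layers:      list of layer functions applied in sequence
    exit_heads:  {layer_index: head_fn} lightweight classifiers on activations
    thresholds:  {layer_index: min top-1 probability to take that exit}
    Assumes the final layer has an exit head, which always fires.
    """
    last = len(layers) - 1
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in exit_heads:
            probs = softmax(exit_heads[i](x))
            conf = max(probs)
            if conf >= thresholds.get(i, 0.0) or i == last:
                return probs.index(conf), i  # (predicted class, exit layer)
```

The returned exit layer is the feedback signal a runtime adapter would monitor to retune thresholds as input difficulty drifts.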
MadEye is a camera-server system that uses commodity PTZ cameras to continuously adapt rotation and zoom for optimal video analytics accuracy under given workloads and resource constraints. It employs a rapid search algorithm to explore the vast orientation space and a camera-side knowledge distillation method to select high-accuracy configurations without additional server resources. Experiments demonstrate 2.9-25.7% accuracy gains at fixed resource levels or 2-3.7x resource reductions for equivalent accuracy across diverse workloads.
Modern web archiving fails due to JavaScript-induced storage bloat and non-determinism, causing missing snapshots and poor page fidelity. Jawa, a JavaScript-aware crawler, exploits archive-specific traits—no backend servers and absent client authentication—to remove non-functional and unreachable code, slashing JavaScript storage by 84% (41% overall). It eliminates non-determinism from client characteristics (e.g., user agent, screen size) that affect control flow and URL fetches, reducing failed resource requests to near zero across 3000 pages, while preserving essential randomness in data flows.
MARVOLO is a binary mutator that applies semantics-preserving code transformations to augment malware and benign datasets, mimicking real-world developer changes. It propagates original labels to generated samples without reverse engineering, addressing data scarcity and copyright issues in cybersecurity. Optimizations maximize diverse sample density within resource budgets, yielding up to 5% accuracy gains on commercial datasets using only 15% of inputs.
Privid introduces ρ-event duration privacy (ρ-DP), a differential privacy variant protecting transient objects visible for under ρ seconds (e.g., 30-60s in surveillance footage) without needing to detect or obfuscate individuals. Analysts submit custom black-box neural networks to process video chunks into tables, followed by SQL aggregations, with noise added proportional to visibility duration to bound individual impact. This supports utility for aggregate queries like hourly pedestrian counts or mask compliance while preventing tracking, with graceful privacy degradation for longer-visible objects and open-source implementation.
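The duration-proportional noise idea can be sketched as a Laplace mechanism whose sensitivity is the number of chunk-level table rows a duration-bounded object can influence. The chunking model and `private_count` are assumptions for illustration, not Privid's exact mechanism.

```python
import math, random

def private_count(true_count, rho_seconds, chunk_seconds, epsilon, rng=None):
    """Release an aggregate count with duration-calibrated Laplace noise.

    An object visible for at most rho_seconds appears in at most
    ceil(rho_seconds / chunk_seconds) chunk-level rows, so that bound
    serves as the count query's sensitivity.
    """
    rng = rng or random
    sensitivity = -(-rho_seconds // chunk_seconds)   # ceil division
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                           # Laplace(0, scale) via inverse CDF
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise
```

Objects visible longer than ρ still get some protection, just with weaker guarantees, which matches the graceful degradation noted above.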
Dashlet addresses swipe-induced uncertainty in short video streaming by using insights from TikTok traces and user swipe studies to model swipe statistics without machine learning. It employs out-of-order video chunk pre-buffering, prioritizing chunks based on predicted swipe order and bitrate to maximize QoE under variable network and user behavior. Evaluations show Dashlet achieves 77-99% of oracle QoE, outperforms TikTok by 43.9-45.1x in QoE, and cuts unwatched video bytes by 30%.
Bamboo is a distributed system that leverages pipeline parallelism in large DNN training to insert redundant computations, where nodes execute layers from neighbors alongside their own, enabling fast recovery from frequent preemptions in cheap preemptible instances. This approach exploits natural pipeline bubbles to minimize overhead from redundancy. Evaluations across popular DNN models show 3.7x higher training throughput than traditional checkpointing and 2.4x cost reduction versus on-demand instances.
Canvas redesigns remote memory swapping to provide full isolation of swap partitions, caches, prefetchers, and RDMA bandwidth for co-running datacenter applications. This isolation prevents interference-induced slowdowns from shared data paths and enables app-specific optimizations: adaptive swap allocation, semantics-aware prefetching, and 2D RDMA scheduling. Evaluations show Canvas minimizes performance variation and reduces degradation across widely-deployed applications.
GEMEL introduces model merging to address memory constraints on edge GPUs for real-time video analytics by sharing layers and weights across similar vision models. It identifies accuracy-preserving merging configurations using per-model memory usage and inter-layer dependencies, then adjusts inference schedules to maximize benefits. Evaluations show up to 60.7% memory reduction and 8-39% accuracy gains over time/space sharing baselines.
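One structural ingredient of merging — finding identical layers that could share a single weight copy — can be sketched as a common-prefix scan over layer signatures. GEMEL's actual search is not limited to prefixes and also validates accuracy after merging; `mergeable_prefix` and the `(layer_type, shape)` signature format are simplifying assumptions.

```python
def mergeable_prefix(model_a, model_b):
    """Longest common prefix of layer signatures whose weights could be shared.

    Each model is a list of (layer_type, shape) signatures; identical
    signatures from the input onward are candidates for one shared copy,
    since early vision layers are the most architecturally similar.
    """
    shared = []
    for a, b in zip(model_a, model_b):
        if a != b:
            break
        shared.append(a)
    return shared
```

Every signature in the returned list is a layer loaded once instead of per-model, which is where the reported memory reduction comes from.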
Revelio uses deep neural networks to embed diverse inputs like natural language bug reports and system logs into a shared vector space, generating relevant debugging queries for distributed systems. It factorizes query generation into simpler tasks leveraging production observations to handle unseen faults. Evaluated on a testbed with fault-injected distributed apps and 800 Mechanical Turk reports, it ranks the optimal query in its top-3 predictions 96% of the time, validated by developer studies.
Privid proposes (ρ,K,ε)-event-duration privacy, a new differential privacy variant that safeguards private information visible for less than duration K in video analytics, bypassing the need for flawless detection in every frame. The system enforces this privacy using untrusted, analyst-supplied deep neural networks for common video queries. Evaluations across diverse videos and queries demonstrate Privid retains 79-99% of non-private system accuracy.
Boggart uses traditional computer vision algorithms to pre-build imprecise but comprehensive indices across videos, enabling support for arbitrary user-defined CNNs and queries in retrospective analytics platforms. For each query, it rapidly assesses index imprecision and applies minimal CNN inference with result propagation to bound accuracy loss while minimizing latency and wasted computation. Evaluations demonstrate speedups matching or exceeding prior model-specific systems without sacrificing generality, accuracy, or efficiency.
Dorylus is a distributed system for GNN training that separates computation into graph and tensor parallel tasks, enabling a deep bounded-asynchronous pipeline that overlaps operations and hides network latency using serverless Lambda threads. It scales to billion-edge graphs by leveraging thousands of Lambdas atop CPU servers, which provide superior performance-per-dollar compared to GPUs for large graphs. Dorylus delivers up to 2.75x better performance-per-dollar than CPU-only training, 1.22x faster and 4.83x cheaper than GPU servers, and up to 3.8x faster and 10.7x cheaper than sampling-based systems.
SAMU co-designs a Java runtime and NVMe-oF swap system for disaggregated datacenters, offloading memory-intensive GC tracing to weak-compute memory servers while keeping compaction on CPU servers. It uses a universal tower heap for address-stable object migration and region-based tracing prioritization favoring young regions. Evaluations on Spark/Flink workloads show 80% (50% cache) and 32% (25% cache) average slowdowns versus baselines, outperforming RDMA swaps and ramdisks via higher near-memory tracing throughput (8.8x that of a CPU server).
Khameleon addresses network bottlenecks in interactive data visualization and exploration (DVE) applications through continuous prefetching and response tuning, delivering immediate lower-quality responses that progressively improve. It employs server-side scheduling that optimizes bandwidth usage based on predicted user interactions and resource constraints using a fast greedy algorithm approximating the optimal solution in real-time. Evaluations with real user traces show it outperforms perfect-prediction prefetchers, reducing latencies by 2-3 orders of magnitude while maintaining 50-80% response quality across diverse networks and devices.
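The greedy scheduling step can be sketched as ranking candidate responses by predicted utility per byte — interaction probability times quality gain over size — and filling the bandwidth budget in that order. The tuple layout and `greedy_schedule` are assumptions for the sketch, not Khameleon's scheduler.

```python
def greedy_schedule(candidates, budget_bytes):
    """Fill the send buffer greedily by predicted utility per byte.

    candidates: list of (request_id, prob, quality_gain, size_bytes) tuples,
    where prob is the predicted likelihood the user issues that request.
    """
    ranked = sorted(candidates, key=lambda c: c[1] * c[2] / c[3], reverse=True)
    plan, used = [], 0
    for rid, prob, gain, size in ranked:
        if used + size <= budget_bytes:
            plan.append(rid)
            used += size
    return plan
```

Rerunning this every scheduling round is what lets the server keep lower-quality responses flowing immediately while progressively upgrading the likeliest ones.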
ABC introduces a router-assisted congestion control protocol using a single bit per packet (accelerate/break marks via ECN) to dynamically adjust sender rates from 0 to 2x current rate per RTT, addressing underutilization and bufferbloat in wireless networks. Unlike prior explicit schemes like XCP that compare enqueue rate to capacity and overshoot, ABC's novel control loop uses dequeue rate to predict future enqueue rate one RTT ahead, ensuring accurate tracking of rapidly varying link capacities. Evaluations on cellular and Wi-Fi traces show ABC achieves superior throughput-delay tradeoffs, outperforming Cubic, AQM schemes, Verus, and XCP.
ABC is an explicit congestion control protocol for wireless networks where routers mark packets as "accelerate" or "brake" to guide senders toward a target rate via minor cwnd adjustments. It requires no header changes or device modifications, is incrementally deployable, and interoperates with non-ABC routers and traffic. Evaluations on Wi-Fi and cellular traces show ABC delivers 30-40% higher throughput than Cubic+Codel at similar delays, 2.2x lower delays than BBR on Wi-Fi, and 50% higher throughput on cellular paths.
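The per-ack reaction the two ABC entries describe is simple enough to sketch directly: an "accelerate" mark means send two packets for this ack, a "brake" mark means send none, so a fraction f of accelerate marks scales the sending rate to 2f per RTT — covering the full 0x to 2x range. Function names here are illustrative.

```python
def on_ack(cwnd, mark, min_cwnd=1):
    """ABC sender reaction to the one-bit feedback on each acked packet.

    'accelerate' -> send two packets for this ack (cwnd + 1)
    'brake'      -> send none for this ack (cwnd - 1, floored at min_cwnd)
    """
    if mark == "accelerate":
        return cwnd + 1
    return max(cwnd - 1, min_cwnd)

def rate_multiplier(accel_fraction):
    """Average per-RTT rate change when a fraction f of acks say 'accelerate':
    f acks trigger 2 packets and (1 - f) trigger 0, so rate -> 2f x rate."""
    return 2 * accel_fraction
```

Marking exactly half the packets "accelerate" holds the rate steady (multiplier 1.0), giving the router fine-grained, per-packet control with a single reused ECN-style bit.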
A US survey of 876 respondents reveals 51% would participate in P2P content delivery for suitable financial incentives, exceeding expectations and addressing key adoption barriers. Gringotts introduces a secure system using novel Proof of Delivery to verify file transfers and cryptocurrency payments that resist Sybil and liar attacks. This enables content providers to integrate incentivized P2P delivery cost-effectively.