About Ravi Netravali

Princeton CS associate professor. Systems for ML, LLM serving, video systems, networked systems. Runs the SNS (Systems and Networking Systems) Group. Has published many PLDI/OSDI/NSDI papers.

Compiled from 15 entries (7 videos, 35 papers) / updated Apr 10 / v3

Ravi Netravali is a Princeton CS associate professor leading the SNS Group, specializing in systems for ML, LLM serving, video systems, and networked systems with numerous PLDI/OSDI/NSDI papers. His thinking emphasizes application-aware, data-driven optimizations that bridge domain-specific insights with low-level systems design to overcome inefficiencies in distributed, resource-constrained environments. Recurring motifs include hardware-software co-design, speculative/passive learning techniques, and dynamic adaptation to variability in workloads, networks, and user behaviors.

Biography

Ravi Netravali is an associate professor of Computer Science at Princeton University, directing the Systems and Networking Systems (SNS) Group. His research spans systems for machine learning (ML), large language model (LLM) serving, video systems, and networked systems, with contributions to top venues like PLDI, OSDI, and NSDI. [bio]

LLM Systems and Inference Optimization

Netravali's work on LLM infrastructure tackles efficiency bottlenecks in serving, inference, and agentic workflows. FailFast uses diffusion LLMs for speculative decoding, dynamically adjusting speculation lengths to balance speed and quality.[5] Aragog and its dynamic just-in-time model routing decouple configuration selection for scalable agentic workflows, improving throughput under fluctuating loads.[6][7] LessIsMore introduces training-free sparse attention via cross-head token aggregation, speeding up reasoning by 1.13× end-to-end without accuracy loss.[8][9] SpecReason accelerates large reasoning models (LRMs) with semantic speculative reasoning on thinking tokens.[10][13] METIS optimizes retrieval-augmented generation (RAG) through adaptive scheduling and configuration.[18][19] Marconi enhances prefix caching for hybrid LLMs with reuse-aware policies.[20][21]
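The draft-then-verify idea behind speculative decoding, with a speculation length that adapts to how well the draft is doing, can be sketched with toy models. This is a minimal illustration under stated assumptions, not FailFast's or SpecReason's actual algorithm: `draft_model`, `target_model`, the doubling/halving rule, and the `k_min`/`k_max` bounds are all hypothetical, and a real system would verify the drafted tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token


def speculative_step(ctx: List[int], draft: Model, target: Model, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, then keep the longest prefix the target agrees with.

    Returns (tokens to append, number of drafted tokens that matched).
    """
    proposed, draft_ctx = [], list(ctx)
    for _ in range(k):
        tok = draft(draft_ctx)
        proposed.append(tok)
        draft_ctx.append(tok)

    accepted, verify_ctx, matched = [], list(ctx), 0
    for tok in proposed:
        expected = target(verify_ctx)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first mismatch
            break
        accepted.append(tok)
        verify_ctx.append(tok)
        matched += 1
    return accepted, matched


def generate(prefix: List[int], draft: Model, target: Model, n_tokens: int,
             k: int = 2, k_min: int = 1, k_max: int = 8) -> List[int]:
    """Greedy generation; k grows after full acceptance and shrinks after a miss."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        accepted, matched = speculative_step(out, draft, target, k)
        out.extend(accepted)
        k = min(k * 2, k_max) if matched == k else max(k // 2, k_min)
    return out[:len(prefix) + n_tokens]


# Toy models (hypothetical): the target counts modulo 10; the draft errs after a 4.
def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Because every accepted token is either confirmed or corrected by the target, `generate([0], draft_model, target_model, 12)` produces the same sequence as decoding with `target_model` alone; the draft only changes how many target checks are needed per step.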

Video Systems and Conferencing

A major focus is real-time video, from rate control to hardware acceleration. Mowgli applies offline RL to production telemetry for rate control, boosting bitrates by 15-39% over GCC without online exploration.[2][22][23] Scallop offloads SFU media operations to Tofino ASICs in an SDN-style design, scaling 210× and cutting latency by 26×.[14][15] Other work includes Dashlet, which accounts for swipe uncertainty to improve short-video QoE,[30] MadEye for PTZ camera optimization,[26] and Legilimens for continuous learning on edge SoCs.[4][11]

Networked Systems and Hardware Offload

Netravali reframes networks as active compute resources. SmartNICs are positioned for AI inference offloading due to packet-processing alignment.[16][17] ABC enables precise congestion control on wireless links with single-bit feedback.[40][41] Application-centric design integrates app insights into networking.[24] Scallop exemplifies SDN-inspired hardware for video SFUs.[14][15]
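To see how a single feedback bit per packet can still steer a sender precisely, consider the toy model below. It is a sketch loosely modeled on an accelerate/brake style of one-bit feedback and should not be read as ABC's exact specification: the function names, the `min_cwnd` floor, and the marking formula are assumptions made for illustration. The sender grows its window by one on an "accelerate" mark and shrinks it by one on a "brake" mark, so a window of w ACKs with a fraction f of accelerate marks ends at 2·f·w. A router that wants the rate to move from `current_rate` to `target_rate` can therefore mark a fraction f = target_rate / (2 · current_rate), capped at 1.

```python
from typing import Iterable


def accelerate_fraction(target_rate: float, current_rate: float) -> float:
    """Fraction of packets a (hypothetical) router marks 'accelerate'."""
    if current_rate <= 0:
        return 1.0  # nothing is flowing yet, so ask the sender to speed up
    return min(0.5 * target_rate / current_rate, 1.0)


def apply_feedback(cwnd: int, marks: Iterable[bool], min_cwnd: int = 1) -> int:
    """Update the congestion window from one-bit marks (True = accelerate)."""
    for accelerate in marks:
        cwnd = cwnd + 1 if accelerate else max(cwnd - 1, min_cwnd)
    return cwnd
```

For example, `apply_feedback(10, [True] * 10)` doubles the window to 20 within one round trip, while an even mix of marks, `apply_feedback(10, [True, False] * 5)`, leaves it at 10, matching `accelerate_fraction(r, r) == 0.5` for any positive rate `r`.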

Distributed Systems Debugging and Observability

Tools like Snicket provide query-driven tracing for microservices,[1] Lumos uses provenance-guided debugging,[3] and Revelio generates ML-assisted debugging queries.[34]

Edge and Resource-Constrained ML

These systems address memory and compute limits: GEMEL merges models for edge video analytics,[33] Apparate uses early exits for ML inference,[25] Bamboo fills pipeline bubbles for preemptible DNN training,[31] and Legilimens enables on-device continuous learning.[4][11]
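The early-exit idea mentioned for Apparate can be illustrated generically: run a model stage by stage and return as soon as an intermediate prediction is confident enough. The sketch below is not Apparate's design; the `Stage` and `ExitHead` types, the fixed `threshold`, and the toy stages and heads are hypothetical stand-ins for real network layers and exit classifiers.

```python
from typing import Callable, Sequence, Tuple

Stage = Callable[[float], float]                 # transforms the hidden state
ExitHead = Callable[[float], Tuple[int, float]]  # returns (label, confidence)


def early_exit_infer(x: float, stages: Sequence[Stage], heads: Sequence[ExitHead],
                     threshold: float = 0.9) -> Tuple[int, int]:
    """Return (label, depth at which inference stopped)."""
    if not stages or len(stages) != len(heads):
        raise ValueError("need one exit head per stage and at least one stage")
    h = x
    for depth, (stage, head) in enumerate(zip(stages, heads), start=1):
        h = stage(h)
        label, conf = head(h)
        # Stop early when confident; the last stage always answers.
        if conf >= threshold or depth == len(stages):
            return label, depth
    raise AssertionError("unreachable")


# Toy model (hypothetical): each stage shifts the state, each head thresholds it.
toy_stages = [lambda h: h + 1.0] * 3
toy_heads = [lambda h: (int(h > 0), min(abs(h) / 3.0, 1.0))] * 3
```

With these toy pieces, an easy input such as `early_exit_infer(2.5, toy_stages, toy_heads)` stops after the first stage, while a harder one such as `early_exit_infer(-0.5, toy_stages, toy_heads)` runs all three. The saving comes from skipping the later stages on easy inputs.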

Privacy, Security, and Other Systems

Privid introduces duration-based differential privacy for video analytics.[29][35] Guillotine proposes hardware-software co-design for AI containment.[12] MARVOLO augments ML malware detection,[28] Canvas isolates remote memory swaps,[32] Dorylus scales GNN training on CPUs,[37] and Gringotts incentivizes P2P video with crypto proofs.[42]

Web and Miscellaneous

Jawa, a JavaScript-aware crawler, cuts web archive storage by 41%.[27] Khameleon's prefetching for data visualization and exploration (DVE) apps achieves sub-100 ms latencies.[39] Boggart accelerates general video analytics with imprecise pre-filtering.[36] SAMU offloads GC tracing in disaggregated datacenters.[38]

Key themes

Dynamic Adaptation and Optimization

Systems that adjust configurations, routing, or computations in real time based on workload variability, resource constraints, or telemetry.

Aragog decouples routing for agentic workflows [6][7]
METIS adapts RAG configs for latency-quality [18][19]
Scallop SDN-style offload for video [14][15]

Speculative and Passive Learning Techniques

Avoiding costly online training via speculation, offline RL from logs, or passive telemetry to boost performance safely.

FailFast speculative decoding with dLLMs [5]
Mowgli offline RL for video rate control [2][22][23]
SpecReason semantic speculation for LRMs [10][13]

Hardware-Software Co-Design and Offloading

Leveraging SmartNICs, ASICs, and disaggregated memory to offload compute from general-purpose CPUs/GPUs.

SmartNICs for AI pipelines [16][17]
Scallop Tofino SFUs [14][15]
SAMU GC offload to memory servers [38]

Application-Centric Systems Design

Integrating app-level insights (e.g., video swipes, reasoning tokens) into low-level optimizations.

Application-centric networking talk [24]
Dashlet swipe pre-buffering [30]
LessIsMore cross-head attention [8][9]

Efficiency in Resource-Constrained Environments

Memory and compute optimizations for edge, distributed training, and serving under constraints.

Legilimens edge continual learning [4][11]
GEMEL model merging [33]
Bamboo preemptible training [31]

Observability and Debugging

Query-driven, provenance-guided tools for production distributed systems.

Snicket query-driven tracing [1]
Lumos provenance debugging [3]
Revelio ML query generation [34]

Video Analytics and Privacy

Optimizing accuracy and privacy in live and retrospective video processing.

MadEye PTZ optimization [26]
Privid duration-DP [29][35]
Boggart imprecise pre-filtering [36]

What Ravi Recommends

aragog-justintime-model-routing-for-scalable-serving-of-agentic-workflows ↗

paper · by Ravi Netravali · 2 mentions

mowgli

tool · by Neil Agarwal, Ray Pan, Ravi Netravali, Francis Yan

legilimens-performant-video-analytics-on-the-systemonchip-edge ↗

paper · by Ravi Netravali

specreason-fast-and-accurate-inferencetime-compute-via-speculative-reasoning ↗

paper · by Ravi Netravali

guillotine-hypervisors-for-isolating-malicious-ais ↗

paper · by Ravi Netravali

lessismore ↗

paper · by Ravi Netravali

less-is-more-trainingfree-sparse-attention-with-global-locality-for-efficient-reasoning ↗

paper · by Ravi Netravali

failfast ↗

repo

fail-fast-win-big-rethinking-the-drafting-strategy-in-speculative-decoding-via-diffusion-llms ↗

paper · by Ravi Netravali

remembrall-leaning-into-memory-for-accurate-video-analytics-on-systemonchip-gpus ↗

paper · by Ravi Netravali

wherefore-art-thou-provenanceguided-automatic-online-debugging-with-lumos ↗

paper · by Ravi Netravali

google-congestion-control

tool

webrtc ↗

tool

lumos

tool

snicket

tool · by Jessica Berg

Sources (15)

Every entry that fed the multi-agent compile above. Inline citation markers in the wiki text (like [1], [2]) are not yet individually linked to specific sources — this is the full set of sources the compile considered.

Lumos Enables Low-Overhead Provenance-Guided Debugging for Production Distributed Systems · paper · 2026-04-24
Snicket: A Query-Driven Distributed Tracing System for Microservices · youtube · 2026-04-06
Mowgli: Offline RL for Real-time Video Rate Control · youtube · 2026-04-06
Lumos: Provenance-Guided Debugging for Distributed Systems · paper · 2026-03-30
Legilimens: Continuous Learning for Mobile Edge Video Analytics · blog · 2026-01-01
FailFast: Optimizing Speculative Decoding with Diffusion LLMs for Enhanced LLM Acceleration · paper · 2025-12-23
Dynamic Just-in-Time Model Routing for Scalable Agentic Workflows · blog · 2025-11-30
Aragog: Dynamic LLM Configuration for Agentic Workflows · paper · 2025-11-26
LessIsMore: Training-Free Sparse Attention for Efficient LLM Reasoning · blog · 2025-08-31
Training-Free Sparse Attention via Cross-Head Token Aggregation Cuts Reasoning Latency Without Accuracy Loss · paper · 2025-08-09
Accelerating LRM Inference via Semantic Speculative Reasoning · blog · 2025-04-30
Legilimens: Continuous Learning for On-Device Video Analytics on Mobile Edge SoCs · paper · 2025-04-29
Guillotine: Hardware-Software Co-Design for Existential AI Containment · paper · 2025-04-22
ABC: Simple Explicit Congestion Control Excels on Wireless Networks · paper · 2019-05-09
Monetary Incentives and Cryptocurrency-Enabled Proofs Unlock Peer Participation in P2P Video Delivery · paper · 2018-08-02