
About Ravi Netravali
Princeton CS associate professor. Works on systems for ML, LLM serving, video systems, and networked systems. Runs the SNS (Systems and Networking Systems) Group. PhD advisor on many PLDI/OSDI/NSDI papers.
Ravi Netravali is a Princeton CS associate professor who leads the SNS Group. He specializes in systems for ML, LLM serving, video systems, and networked systems, and has numerous PLDI/OSDI/NSDI papers. His thinking emphasizes application-aware, data-driven optimizations that bridge domain-specific insights with low-level systems design to overcome inefficiencies in distributed, resource-constrained environments. Recurring motifs include hardware-software co-design, speculative and passive learning techniques, and dynamic adaptation to variability in workloads, networks, and user behaviors.
Biography
Ravi Netravali is an associate professor of Computer Science at Princeton University, directing the Systems and Networking Systems (SNS) Group. His research spans systems for machine learning (ML), large language model (LLM) serving, video systems, and networked systems, with contributions to top venues like PLDI, OSDI, and NSDI. [bio]
LLM Systems and Inference Optimization
Netravali's work on LLM infrastructure tackles efficiency bottlenecks in serving, inference, and agentic workflows. FailFast uses diffusion LLMs for speculative decoding, dynamically adjusting speculation lengths to balance speed and quality.[5] Aragog and its dynamic just-in-time model routing decouple configuration selection for scalable agentic workflows, improving throughput under fluctuating loads.[6][7] LessIsMore introduces training-free sparse attention via cross-head token aggregation, giving a 1.13× end-to-end reasoning speedup without accuracy loss.[8][9] SpecReason accelerates large reasoning models (LRMs) with semantic speculative reasoning on thinking tokens.[10][13] METIS optimizes retrieval-augmented generation (RAG) through adaptive scheduling and configuration.[18][19] Marconi enhances prefix caching for hybrid LLMs with reuse-aware policies.[20][21]
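To make the idea of adjusting speculation length concrete, here is a minimal sketch of greedy speculative decoding with an adaptive draft length. It is a generic illustration, not FailFast's algorithm: the `draft` and `target` callables, the doubling/halving rule, and the `k_*` parameters are all assumptions chosen for brevity.

```python
from typing import Callable, List

Token = int
# Hypothetical model interface: maps a token prefix to its greedy next token.
Model = Callable[[List[Token]], Token]


def speculative_generate(draft: Model, target: Model, prompt: List[Token],
                         max_new: int, k_init: int = 4,
                         k_min: int = 1, k_max: int = 16) -> List[Token]:
    """Greedy speculative decoding with an adaptive speculation length k."""
    out = list(prompt)
    k = k_init
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes up to k tokens.
        budget = min(k, max_new - (len(out) - len(prompt)))
        proposal: List[Token] = []
        for _ in range(budget):
            proposal.append(draft(out + proposal))

        # 2. The target model checks each proposed token in order.
        #    (A real system would verify all positions in one batched pass.)
        accepted = 0
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # Keep the agreed prefix, then the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
            accepted += 1
        else:
            out.extend(proposal)

        # 3. Assumed heuristic: speculate further after a fully accepted
        #    draft, and back off after any rejection.
        if accepted == len(proposal):
            k = min(k * 2, k_max)
        else:
            k = max(k // 2, k_min)
    return out
```

Because every emitted token is either confirmed or supplied by the target model, the output matches what the target alone would produce under greedy decoding; the speculation length only changes how much draft work is wasted on a rejection.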
Video Systems and Conferencing
A major focus is real-time video, from rate control to hardware acceleration. Mowgli applies offline RL to production telemetry for rate control, boosting bitrates 15-39% over Google Congestion Control (GCC) without online exploration.[2][22][23] Scallop offloads selective forwarding unit (SFU) media operations to Tofino ASICs in an SDN-style design, scaling 210× over commodity servers and cutting latency 26×.[14][15] Other innovations include Dashlet for handling swipe uncertainty in short video QoE,[30] MadEye for pan-tilt-zoom (PTZ) camera optimization,[26] and Legilimens for continuous learning on edge SoCs.[4][11]
Networked Systems and Hardware Offload
Netravali reframes networks as active compute resources. SmartNICs are positioned for AI inference offloading due to packet-processing alignment.[16][17] ABC enables precise congestion control on wireless links with single-bit feedback.[40][41] Application-centric design integrates app insights into networking.[24] Scallop exemplifies SDN-inspired hardware for video SFUs.[14][15]
Distributed Systems Debugging and Observability
Tools like Snicket provide query-driven tracing for microservices,[1] Lumos uses provenance-guided debugging,[3] and Revelio generates ML-assisted debugging queries.[34]
Edge and Resource-Constrained ML
These systems address memory and compute limits: GEMEL merges models for edge video analytics,[33] Apparate uses early exits for ML inference,[25] Bamboo fills pipeline bubbles for preemptible DNN training,[31] and Legilimens enables on-device continuous learning.[4][11]
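As a rough illustration of the early-exit idea mentioned above, the sketch below runs a stack of layers and stops as soon as an intermediate classifier is confident enough. It is a generic example, not Apparate's design: the `layers`/`heads` interface, the `softmax` head, and the fixed `threshold` are assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]  # hypothetical: transforms hidden state
Head = Callable[[Vector], Vector]   # hypothetical: hidden state -> class probs


def softmax(v: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x: Vector, layers: Sequence[Layer],
                       heads: Sequence[Head],
                       threshold: float = 0.9) -> Tuple[int, int]:
    """Return (predicted class, number of layers actually executed)."""
    if not layers or len(layers) != len(heads):
        raise ValueError("need one exit head per layer")
    h = x
    probs: Vector = []
    depth = 0
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        h = layer(h)
        probs = head(h)
        # Exit early once the intermediate prediction is confident enough.
        if max(probs) >= threshold:
            break
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, depth


# Toy usage: each layer doubles the activations, sharpening the prediction.
double = lambda h: [2 * v for v in h]
result = early_exit_predict([1.0, 0.0], [double] * 4, [softmax] * 4)
```

Easy inputs leave after a few layers while hard ones run the full stack, so the threshold trades accuracy against per-request latency.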
Privacy, Security, and Other Systems
Privid introduces duration-based differential privacy for video analytics.[29][35] Guillotine proposes hardware-software co-design for AI containment.[12] MARVOLO augments ML malware detection,[28] Canvas isolates remote memory swaps,[32] Dorylus scales GNN training on CPUs,[37] and Gringotts incentivizes P2P video delivery with cryptocurrency-enabled proofs.[42]
Web and Miscellaneous
Jawa, a JavaScript-aware crawler, cuts web archive storage by 41%.[27] Khameleon's prefetching for data visualization and exploration (DVE) apps achieves sub-100ms latencies.[39] Boggart accelerates general video analytics with imprecise pre-filtering.[36] SAMU offloads GC tracing in disaggregated datacenters.[38]
Dynamic Adaptation and Optimization
Systems that adjust configurations, routing, or computations in real-time based on workload variability, resource constraints, or telemetry.
Speculative and Passive Learning Techniques
Avoiding costly online training via speculation, offline RL from logs, or passive telemetry to boost performance safely.
Hardware-Software Co-Design and Offloading
Leveraging SmartNICs, ASICs, and disaggregated memory to offload compute from general-purpose CPUs/GPUs.
Application-Centric Systems Design
Integrating app-level insights (e.g., video swipes, reasoning tokens) into low-level optimizations.
Efficiency in Resource-Constrained Environments
Memory and compute optimizations for edge inference, distributed training, and serving under resource constraints.
Observability and Debugging
Query-driven, provenance-guided tools for production distributed systems.
Every source that fed the multi-agent compile above. Inline citation markers in the wiki text (like [1], [2]) are not yet individually linked to specific sources; this is the full set of sources the compile considered.
- Snicket: A Query-Driven Distributed Tracing System for Microservices · youtube · 2026-04-06
- Mowgli: Offline RL for Real-time Video Rate Control · youtube · 2026-04-06
- Lumos: Provenance-Guided Debugging for Distributed Systems · paper · 2026-03-30
- Legilimens: Continuous Learning for Mobile Edge Video Analytics · blog · 2026-01-01
- FailFast: Optimizing Speculative Decoding with Diffusion LLMs for Enhanced LLM Acceleration · paper · 2025-12-23
- Dynamic Just-in-Time Model Routing for Scalable Agentic Workflows · blog · 2025-11-30
- Aragog: Dynamic LLM Configuration for Agentic Workflows · paper · 2025-11-26
- LessIsMore: Training-Free Sparse Attention for Efficient LLM Reasoning · blog · 2025-08-31
- Training-Free Sparse Attention via Cross-Head Token Aggregation Cuts Reasoning Latency Without Accuracy Loss · paper · 2025-08-09
- Accelerating LRM Inference via Semantic Speculative Reasoning · blog · 2025-04-30
- Legilimens: Continuous Learning for On-Device Video Analytics on Mobile Edge SoCs · paper · 2025-04-29
- Guillotine: Hardware-Software Co-Design for Existential AI Containment · paper · 2025-04-22
- Speculative Reasoning for Faster LRM Inference · paper · 2025-04-10
- SDN-Inspired Hardware-Offloaded SFUs Cut Video Conferencing Latency 26x and Scale 210x Over Commodity Servers · blog · 2025-03-31
- Scallop: Hardware-Accelerated SFUs for Scalable Video Conferencing · paper · 2025-03-14
- SmartNICs as First-Class Compute in AI Inference Pipelines: A Case for Network-Side Offloading · blog · 2025-02-28
- SmartNICs as First-Class Compute in AI Inference Pipelines: A Case for Network-Side Offloading · paper · 2025-01-22
- METIS: Optimizing RAG for Latency and Quality · blog · 2024-12-31
- METIS: Optimizing RAG System Performance through Adaptive Configuration and Scheduling · paper · 2024-12-13
- Marconi: Optimized Prefix Caching for Hybrid LLMs · blog · 2024-11-30
- Marconi: Optimizing Prefix Caching for Hybrid LLMs · paper · 2024-11-28
- Mowgli: passively learned rate control for real-time video · blog · 2024-10-31
- Mowgli Enables Production-Ready Data-Driven Rate Control via Passive Telemetry Learning · paper · 2024-10-04
- Application-Centric Network Design: Bridging the Gap for Enhanced Performance · youtube · 2023-12-22
- Apparate Harnesses Early Exits to Decouple Latency from Throughput in ML Inference · paper · 2023-12-08
- MadEye Dynamically Optimizes PTZ Camera Orientations to Maximize Live Video Analytics Accuracy · paper · 2023-04-04
- JavaScript-Aware Crawler Jawa Cuts Web Archive Storage by 41% While Eliminating Fidelity Loss · youtube · 2022-09-22
- MARVOLO Enables Efficient Semantics-Preserving Augmentation for ML Malware Detection · paper · 2022-06-07
- Privid Enables Privacy-Preserving Video Analytics via Row Event Duration Privacy · youtube · 2022-05-19
- Dashlet Tames Swipe Uncertainty with Statistical Pre-Buffering for Superior Short Video QoE · paper · 2022-04-27
- Bamboo Enables Resilient Preemptible Training of Large DNNs by Filling Pipeline Bubbles with Redundant Computation · paper · 2022-04-26
- Canvas Isolates Swap Paths to Eliminate Multi-App Interference in Remote Memory Systems · paper · 2022-03-17
- Model Merging Overcomes Edge GPU Memory Limits for Concurrent Video Analytics · paper · 2022-01-19
- Revelio ML Assistant Generates Effective Debugging Queries for Distributed Systems · paper · 2021-06-28
- Privid Introduces Duration-Based Differential Privacy for Robust Video Analytics · paper · 2021-06-22
- Boggart Enables General-Purpose Video Analytics Acceleration with Indexed Imprecise Pre-Filtering · paper · 2021-06-21
- Dorylus Enables Scalable GNN Training on Distributed CPUs with Serverless Threads for Billion-Edge Graphs · paper · 2021-05-24
- SAMU: Offloading GC Tracing to Memory Servers Boosts Managed Workloads in Disaggregated Datacenters · youtube · 2020-11-18
- Khameleon Achieves Sub-100ms Latencies in Network-Bottlenecked DVE Apps via Progressive Prefetching and Greedy Scheduling · paper · 2020-07-15
- ABC: Single-Bit Feedback Enables Precise Congestion Control for Time-Varying Wireless Links · youtube · 2020-03-25
- ABC: Simple Explicit Congestion Control Excels on Wireless Networks · paper · 2019-05-09
- Monetary Incentives and Cryptocurrency-Enabled Proofs Unlock Peer Participation in P2P Video Delivery · paper · 2018-08-02