LLM Infrastructure
DeInfer: Closing the Parallel Inference Gap for Decomposed LLMs
LLM decomposition — splitting large models into smaller sub-components — improves downstream task performance but introduces significant parallel inference bottlenecks at scale that prior work has largely ignored. DeInfer is a dedicated inference system that addresses this gap through multiple co-de…
Modal Introduces Directory Snapshots for Granular Sandbox State Management
Modal's new Directory Snapshots enable fine-grained control over sandbox environments by allowing users to snapshot and restore specific directories. This decouples the lifecycle management of different state layers, such as system dependencies and application code, addressing limitations of full fi…
Optimizing LLM Serving via Dual-Pool Token-Budget Routing
This approach addresses KV-cache over-allocation in homogeneous vLLM fleets by partitioning instances into specialized high-throughput (short-context) and high-capacity (long-context) pools. Requests are routed using a tokenizer-less, online-learned bytes-to-token ratio to estimate total token budge…
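The routing idea in this summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `TokenBudgetRouter`, the `short_pool_limit` threshold, the initial ratio of 4 bytes per token, and the exponential-moving-average update are all assumptions made for the example. The summary only states that the ratio is learned online and used without a tokenizer.

```python
from dataclasses import dataclass


@dataclass
class TokenBudgetRouter:
    """Hypothetical two-pool router driven by an estimated token budget."""

    short_pool_limit: int = 8192  # assumed token cap for the short-context pool
    ratio: float = 4.0            # assumed starting bytes-per-token estimate
    alpha: float = 0.1            # assumed smoothing factor for online updates

    def estimate_tokens(self, prompt: str, max_new_tokens: int) -> int:
        # Estimate prompt tokens from byte length alone (no tokenizer call),
        # then add the requested generation length to get a total budget.
        prompt_bytes = len(prompt.encode("utf-8"))
        return int(prompt_bytes / self.ratio) + max_new_tokens

    def route(self, prompt: str, max_new_tokens: int) -> str:
        # Requests whose budget fits the short pool go there; the rest go long.
        budget = self.estimate_tokens(prompt, max_new_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, prompt: str, actual_prompt_tokens: int) -> None:
        # After serving, fold the true token count back into the ratio
        # using an exponential moving average.
        if actual_prompt_tokens <= 0:
            return
        observed = len(prompt.encode("utf-8")) / actual_prompt_tokens
        self.ratio = (1 - self.alpha) * self.ratio + self.alpha * observed
```

In use, a gateway would call `route` before dispatch and `observe` once the serving engine reports the real prompt token count, so the estimate tracks the live traffic mix.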
Modal Labs: Revolutionizing Serverless GPU Deployment for AI Inference
Modal Labs has engineered a novel platform to address the inefficiencies inherent in traditional GPU deployments for AI inference. Their solution tackles variable demand and resource allocation challenges by implementing a buffered instance management system, a lazy-loading file system, and GPU snap…
Memory Requirements for LLM Inference
Running large language models (LLMs) for inference, especially those with high parameter counts, typically necessitates significant GPU memory. While some quantized models can operate on consumer-grade hardware like a 256GB or 512GB Mac Studio, larger, unquantized models predominantly require high-e…
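As a rough back-of-the-envelope check on the claim above, weight memory scales as parameter count times bytes per parameter. The sketch below assumes a hypothetical 70-billion-parameter model chosen purely for illustration, and it counts weights only: KV cache, activations, and runtime overhead come on top.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    # bits -> bytes (divide by 8), bytes -> GB (divide by 1e9)
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical 70B-parameter model at two precisions.
fp16_gb = weight_memory_gb(70e9, 16)  # 16-bit weights
int4_gb = weight_memory_gb(70e9, 4)   # 4-bit quantized weights
print(f"16-bit: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

Under these assumptions the 16-bit weights need about 140 GB and the 4-bit weights about 35 GB, which is why quantization determines whether a model fits on a single large-memory workstation or needs several datacenter GPUs.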
LlamaIndex Introduces Extract v2 for Enhanced Document Data Extraction
LlamaIndex has launched Extract v2, a significant upgrade to its document extraction tool. This new version offers simplified operation through intuitive tiers, pre-saved extraction configurations for efficiency, and configurable document parsing for greater control and improved results. Extract v1 …
Cohere Introduces Model Vault for Secure, Scalable AI Model Deployment
Cohere's Model Vault is a new, fully managed platform designed for deploying their AI models securely and at scale. It offers dedicated, isolated Virtual Private Clouds (VPCs), eliminating noisy neighbor issues and rate limits. The platform ensures elastic inference with predictable performance and …
Karpathy's llm.c: GPT-2/3 Pretraining in Pure C/CUDA, Outpacing PyTorch Nightly
llm.c is Andrej Karpathy's minimal C/CUDA implementation of LLM pretraining, targeting GPT-2 and GPT-3 reproduction without the overhead of PyTorch (245MB) or CPython (107MB). The project is currently ~7% faster than PyTorch Nightly on its primary CUDA path, while also maintaining a clean ~1,000-lin…
LangGraph Adds Node Caching, Deferred Execution, and Agent Hooks to Tighten Agentic Workflow Control
LangGraph's latest release week delivers a set of primitives targeting efficiency and control in agentic workflows: node-level caching reduces redundant computation during development, deferred nodes enable clean map-reduce and multi-agent coordination patterns, and pre/post model hooks give develop…



