LLM Infrastructure
Optimizing LLM Serving via Dual-Pool Token-Budget Routing
This approach addresses KV-cache over-allocation in homogeneous vLLM fleets by partitioning instances into specialized high-throughput (short-context) and high-capacity (long-context) pools. Requests are routed using a tokenizer-less, online-learned bytes-to-token ratio to estimate total token budge…
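A minimal sketch of the routing idea described above, assuming an exponential-moving-average update for the learned bytes-to-token ratio; the names (`RatioEstimator`, `route`), the EMA scheme, and the 4096-token threshold are illustrative assumptions, not details of the actual system.

```python
class RatioEstimator:
    """Online estimate of bytes-per-token, so no tokenizer is needed at routing time."""

    def __init__(self, initial_ratio=4.0, alpha=0.1):
        self.ratio = initial_ratio  # ~4 bytes/token is a common prior for English text
        self.alpha = alpha          # EMA smoothing factor

    def update(self, byte_len, true_tokens):
        """After the backend reports the true token count, fold it into the estimate."""
        observed = byte_len / max(true_tokens, 1)
        self.ratio = (1 - self.alpha) * self.ratio + self.alpha * observed

    def estimate_tokens(self, byte_len):
        return int(byte_len / self.ratio)


def route(prompt_bytes, max_new_tokens, estimator, threshold=4096):
    """Estimate the total token budget (prompt + generation) and pick a pool."""
    budget = estimator.estimate_tokens(len(prompt_bytes)) + max_new_tokens
    return "short_context_pool" if budget <= threshold else "long_context_pool"
```

Because the estimator only needs a byte length, the router stays on the hot path without loading any model's tokenizer, and the ratio self-corrects as completed requests report their true token counts.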
Modal Labs: Revolutionizing Serverless GPU Deployment for AI Inference
Modal Labs has engineered a novel platform to address the inefficiencies inherent in traditional GPU deployments for AI inference. Their solution tackles variable demand and resource allocation challenges by implementing a buffered instance management system, a lazy-loading file system, and GPU snap…
Memory Requirements for LLM Inference
Running large language models (LLMs) for inference, especially those with high parameter counts, typically necessitates significant GPU memory. While some quantized models can operate on consumer-grade hardware like a 256GB or 512GB Mac Studio, larger, unquantized models predominantly require high-e…
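The back-of-envelope arithmetic behind these sizing claims is simple: weight memory is parameter count times bytes per parameter. A small helper makes the quantization trade-off concrete (weights only; KV cache, activations, and framework overhead come on top):

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Approximate memory for model weights alone, in decimal GB.

    Excludes the KV cache, activations, and framework overhead, which can
    add tens of percent on top depending on batch size and context length.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9
```

For example, a 70B-parameter model needs roughly 140 GB in fp16 but only about 35 GB at 4-bit quantization, which is why quantized variants of large models fit on high-memory consumer machines while unquantized ones require multi-GPU servers.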
LlamaIndex Introduces Extract V2 for Enhanced Document Data Extraction
LlamaIndex has launched Extract v2, a significant upgrade to its document extraction tool. This new version offers simplified operation through intuitive tiers, pre-saved extraction configurations for efficiency, and configurable document parsing for greater control and improved results. Extract v1 …
Cohere Introduces Model Vault for Secure, Scalable AI Model Deployment
Cohere's Model Vault is a new, fully managed platform designed for deploying their AI models securely and at scale. It offers dedicated, isolated Virtual Private Clouds (VPCs), eliminating noisy neighbor issues and rate limits. The platform ensures elastic inference with predictable performance and …
Karpathy's llm.c: GPT-2/3 Pretraining in Pure C/CUDA, Outpacing PyTorch Nightly
llm.c is Andrej Karpathy's minimal C/CUDA implementation of LLM pretraining, targeting GPT-2 and GPT-3 reproduction without the overhead of PyTorch (245MB) or CPython (107MB). The project is currently ~7% faster than PyTorch Nightly on its primary CUDA path, while also maintaining a clean ~1,000-lin…
LangGraph Adds Node Caching, Deferred Execution, and Agent Hooks to Tighten Agentic Workflow Control
LangGraph's latest release week delivers a set of primitives targeting efficiency and control in agentic workflows: node-level caching reduces redundant computation during development, deferred nodes enable clean map-reduce and multi-agent coordination patterns, and pre/post model hooks give develop…
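Node-level caching is, at its core, memoization keyed on the node's identity plus its input state, with a TTL. The sketch below is a generic illustration of that primitive, not LangGraph's actual API (its cache classes and policies differ; see its docs for the real interface):

```python
import json
import time


class NodeCache:
    """Generic node-result cache: keyed on (node name, serialized input state)."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self._store = {}

    def run(self, node_name, fn, state):
        # A deterministic serialization of the input state forms the cache key.
        key = (node_name, json.dumps(state, sort_keys=True))
        hit = self._store.get(key)
        if hit is not None and time.monotonic() - hit[1] < self.ttl:
            return hit[0]  # fresh cached result: skip recomputation
        result = fn(state)
        self._store[key] = (result, time.monotonic())
        return result
```

During iterative development this pattern pays off whenever a graph is re-run with mostly unchanged inputs: only nodes whose input state actually changed get re-executed.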
LangGraph's Functional API Brings Graph-Level Features to Standard Python Functions
LangGraph's new Functional API introduces two decorators — `@entrypoint` and `@task` — that expose core LangGraph capabilities (human-in-the-loop, persistence, streaming) without requiring developers to define an explicit graph structure. This lowers the adoption barrier for existing codebases by al…
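The shape of the pattern can be sketched with plain Python decorators and futures; this is a schematic analog of the `@entrypoint`/`@task` split, not LangGraph's implementation (which adds the persistence, streaming, and human-in-the-loop machinery on top):

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def task(fn):
    """Analog of @task: calling the function schedules it and returns a future."""
    def wrapper(*args, **kwargs):
        return _pool.submit(fn, *args, **kwargs)
    return wrapper


def entrypoint(fn):
    """Analog of @entrypoint: marks the workflow's top-level function."""
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


@task
def double(x):
    return x * 2


@entrypoint
def workflow(nums):
    futures = [double(n) for n in nums]   # tasks fan out concurrently
    return [f.result() for f in futures]  # join before returning
```

The appeal is exactly what the announcement claims: the orchestration stays ordinary Python control flow (loops, conditionals, function calls) rather than an explicitly declared graph of nodes and edges.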
LangGraph Platform Offers Flexible Agent Deployment and Management
LangGraph Platform, formerly LangGraph Cloud, provides a comprehensive solution for deploying and scaling LangGraph applications. It integrates LangGraph Server for robust agent infrastructure and LangGraph Studio for development and debugging. The platform offers diverse deployment options, includi…
llama2.c: Minimal C Implementation for Training and Inferencing Tiny Llama 2 Models on Narrow Domains
llama2.c provides a full-stack PyTorch training and pure C inference solution for Llama 2 architecture in under 700 lines, targeting small models (15M-110M params) trained on TinyStories that generate coherent stories at 110 tok/s on M1 Mac. It supports loading Meta's 7B Llama 2 models in fp32 (4 to…