LLM Infrastructure
Optimizing LLM Serving via Dual-Pool Token-Budget Routing
This approach addresses KV-cache over-allocation in homogeneous vLLM fleets by partitioning instances into specialized high-throughput (short-context) and high-capacity (long-context) pools. Requests are routed using a tokenizer-less, online-learned bytes-to-token ratio to estimate total token budge…
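A minimal sketch of the routing idea described above, assuming an exponential-moving-average update for the learned bytes-to-token ratio; the names (`RatioEstimator`, `route`), the EMA scheme, and the 4096-token threshold are illustrative assumptions, not details of the actual system.

```python
class RatioEstimator:
    """Online estimate of bytes-per-token, so no tokenizer is needed at routing time."""

    def __init__(self, initial_ratio=4.0, alpha=0.1):
        self.ratio = initial_ratio  # ~4 bytes/token is a common prior for English text
        self.alpha = alpha          # EMA smoothing factor

    def update(self, byte_len, true_tokens):
        """After the backend reports the true token count, fold it into the estimate."""
        observed = byte_len / max(true_tokens, 1)
        self.ratio = (1 - self.alpha) * self.ratio + self.alpha * observed

    def estimate_tokens(self, byte_len):
        return int(byte_len / self.ratio)


def route(prompt_bytes, max_new_tokens, estimator, threshold=4096):
    """Estimate the total token budget (prompt + generation) and pick a pool."""
    budget = estimator.estimate_tokens(len(prompt_bytes)) + max_new_tokens
    return "short_context_pool" if budget <= threshold else "long_context_pool"
```

Because the estimator only needs a byte length, the router stays on the hot path without loading any model's tokenizer, and the ratio self-corrects as completed requests report their true token counts.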
Modal Labs: Revolutionizing Serverless GPU Deployment for AI Inference
Modal Labs has engineered a novel platform to address the inefficiencies inherent in traditional GPU deployments for AI inference. Their solution tackles variable demand and resource allocation challenges by implementing a buffered instance management system, a lazy-loading file system, and GPU snap…
Memory Requirements for LLM Inference
Running large language models (LLMs) for inference, especially those with high parameter counts, typically necessitates significant GPU memory. While some quantized models can operate on consumer-grade hardware like a 256GB or 512GB Mac Studio, larger, unquantized models predominantly require high-e…
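The back-of-envelope arithmetic behind these sizing claims is simple: weight memory is parameter count times bytes per parameter. A small helper makes the quantization trade-off concrete (weights only; KV cache, activations, and framework overhead come on top):

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Approximate memory for model weights alone, in decimal GB.

    Excludes the KV cache, activations, and framework overhead, which can
    add tens of percent on top depending on batch size and context length.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9
```

For example, a 70B-parameter model needs roughly 140 GB in fp16 but only about 35 GB at 4-bit quantization, which is why quantized variants of large models fit on high-memory consumer machines while unquantized ones require multi-GPU servers.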
LlamaIndex Introduces Extract V2 for Enhanced Document Data Extraction
LlamaIndex has launched Extract v2, a significant upgrade to its document extraction tool. This new version offers simplified operation through intuitive tiers, pre-saved extraction configurations for efficiency, and configurable document parsing for greater control and improved results. Extract v1 …
Cohere Introduces Model Vault for Secure, Scalable AI Model Deployment
Cohere's Model Vault is a new, fully managed platform designed for deploying their AI models securely and at scale. It offers dedicated, isolated Virtual Private Clouds (VPCs), eliminating noisy neighbor issues and rate limits. The platform ensures elastic inference with predictable performance and …
Karpathy's llm.c: GPT-2/3 Pretraining in Pure C/CUDA, Outpacing PyTorch Nightly
llm.c is Andrej Karpathy's minimal C/CUDA implementation of LLM pretraining, targeting GPT-2 and GPT-3 reproduction without the overhead of PyTorch (245MB) or CPython (107MB). The project is currently ~7% faster than PyTorch Nightly on its primary CUDA path, while also maintaining a clean ~1,000-lin…
LangGraph Adds Node Caching, Deferred Execution, and Agent Hooks to Tighten Agentic Workflow Control
LangGraph's latest release week delivers a set of primitives targeting efficiency and control in agentic workflows: node-level caching reduces redundant computation during development, deferred nodes enable clean map-reduce and multi-agent coordination patterns, and pre/post model hooks give develop…
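Node-level caching is, at its core, memoization keyed on the node's identity plus its input state, with a TTL. The sketch below is a generic illustration of that primitive, not LangGraph's actual API (its cache classes and policies differ; see its docs for the real interface):

```python
import json
import time


class NodeCache:
    """Generic node-result cache: keyed on (node name, serialized input state)."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self._store = {}

    def run(self, node_name, fn, state):
        # A deterministic serialization of the input state forms the cache key.
        key = (node_name, json.dumps(state, sort_keys=True))
        hit = self._store.get(key)
        if hit is not None and time.monotonic() - hit[1] < self.ttl:
            return hit[0]  # fresh cached result: skip recomputation
        result = fn(state)
        self._store[key] = (result, time.monotonic())
        return result
```

During iterative development this pattern pays off whenever a graph is re-run with mostly unchanged inputs: only nodes whose input state actually changed get re-executed.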
LangGraph's Functional API Brings Graph-Level Features to Standard Python Functions
LangGraph's new Functional API introduces two decorators — `@entrypoint` and `@task` — that expose core LangGraph capabilities (human-in-the-loop, persistence, streaming) without requiring developers to define an explicit graph structure. This lowers the adoption barrier for existing codebases by al…
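The shape of the pattern can be sketched with plain Python decorators and futures; this is a schematic analog of the `@entrypoint`/`@task` split, not LangGraph's implementation (which adds the persistence, streaming, and human-in-the-loop machinery on top):

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def task(fn):
    """Analog of @task: calling the function schedules it and returns a future."""
    def wrapper(*args, **kwargs):
        return _pool.submit(fn, *args, **kwargs)
    return wrapper


def entrypoint(fn):
    """Analog of @entrypoint: marks the workflow's top-level function."""
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


@task
def double(x):
    return x * 2


@entrypoint
def workflow(nums):
    futures = [double(n) for n in nums]   # tasks fan out concurrently
    return [f.result() for f in futures]  # join before returning
```

The appeal is exactly what the announcement claims: the orchestration stays ordinary Python control flow (loops, conditionals, function calls) rather than an explicitly declared graph of nodes and edges.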
LangGraph Platform Offers Flexible Agent Deployment and Management
LangGraph Platform, formerly LangGraph Cloud, provides a comprehensive solution for deploying and scaling LangGraph applications. It integrates LangGraph Server for robust agent infrastructure and LangGraph Studio for development and debugging. The platform offers diverse deployment options, includi…
llama2.c: Minimal C Implementation for Training and Inferencing Tiny Llama 2 Models on Narrow Domains
llama2.c provides a full-stack PyTorch training and pure C inference solution for Llama 2 architecture in under 700 lines, targeting small models (15M-110M params) trained on TinyStories that generate coherent stories at 110 tok/s on M1 Mac. It supports loading Meta's 7B Llama 2 models in fp32 (4 to…