
Together AI

Chronological feed of everything captured from Together AI.

The Imperative for an AI-Native Cloud Infrastructure

The rise of AI-native companies necessitates a new cloud paradigm, as traditional cloud infrastructure optimized for web applications cannot meet the unique demands of AI workloads. These demands include rapid iteration, GPU-intensive processing, continuous integration of research advancements, and reliable scalability for exponential growth. An AI-native cloud must provide a vertically integrated stack, a fast path from research to production, massive scalability, and developer-centric tooling.

OpenClaw Integrates with Together AI for Enhanced Agentic AI Capabilities

OpenClaw, a Jarvis-like agent, now integrates seamlessly with Together AI's platform, enabling access to powerful open-source models like Kimi K2.5, MiniMax M2.5, and GLM 5. This integration allows OpenClaw to leverage Together AI's high-throughput, low-latency infrastructure, offering a cost-effective solution for complex agentic workflows. The combined system provides users with advanced AI capabilities for task automation, web browsing, and script execution, accessible via a unified OpenAI-compatible API.
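As a rough sketch, an agent like OpenClaw would talk to the platform through the OpenAI-compatible chat endpoint described above. The base URL and model ID below are assumptions to verify against Together AI's current docs; the payload shape follows the standard OpenAI chat format.

```python
# Minimal sketch of the request an agent would send to Together AI's
# OpenAI-compatible chat endpoint. Base URL and model ID are assumptions;
# check Together's model catalog for exact names.

TOGETHER_BASE_URL = "https://api.together.xyz/v1"  # assumed OpenAI-compatible base

def build_chat_request(user_message: str, model: str = "moonshotai/Kimi-K2.5") -> dict:
    """Build an OpenAI-compatible chat.completions payload for one agent turn."""
    return {
        "model": model,  # hypothetical model ID for Kimi K2.5
        "messages": [
            {"role": "system", "content": "You are a task-automation agent."},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_chat_request("List the open pull requests in repo X.")
# POST this to f"{TOGETHER_BASE_URL}/chat/completions" with an Authorization
# header, e.g. via the official `openai` client pointed at that base URL.
```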

LLMs Enhance Query Plan Optimization for Databases

Large Language Models (LLMs) can significantly improve database query performance by rectifying suboptimal query plans generated by traditional optimizers. By directly patching physical operator graphs, LLMs can achieve substantial speedups and memory reductions. This approach offers a novel path to query optimization: fixes can be developed against small-scale data and then deployed in production environments.

LLM-Driven Query Plan Patching for Database Optimization

Together Research introduces DBPlanBench, a framework that leverages LLMs to patch existing physical operator graphs in database optimizers rather than regenerating them. By correcting errors caused by missed semantic correlations in cost estimators, the approach achieves significant reductions in execution time and memory overhead on TPC-H and TPC-DS benchmarks.
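To make the patch-not-regenerate idea concrete, here is an illustrative sketch of applying an LLM-proposed patch to a plan serialized as nested JSON. The operator layout and the `{"path": ..., "value": ...}` patch format are hypothetical stand-ins, not DBPlanBench's actual schema.

```python
# Illustrative sketch: applying an LLM-proposed patch to a physical plan
# represented as nested JSON, rather than regenerating the whole plan.
import copy

def apply_patch(plan: dict, patch: list) -> dict:
    """Apply simple {'path': [...], 'value': ...} replace ops to a nested plan."""
    patched = copy.deepcopy(plan)
    for op in patch:
        node = patched
        for key in op["path"][:-1]:
            node = node[key]
        node[op["path"][-1]] = op["value"]
    return patched

# A hash join whose build side was mis-chosen by the cost estimator.
plan = {
    "op": "HashJoin",
    "build": {"op": "Scan", "table": "orders"},  # large table on build side
    "probe": {"op": "Scan", "table": "nation"},  # small table on probe side
}

# LLM-proposed fix: swap sides so the small table builds the hash table.
patch = [
    {"path": ["build"], "value": {"op": "Scan", "table": "nation"}},
    {"path": ["probe"], "value": {"op": "Scan", "table": "orders"}},
]

fixed = apply_patch(plan, patch)
```

Patching the existing graph keeps every operator the optimizer got right and touches only the nodes the LLM flags, which is what makes the correction cheap relative to replanning.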

Together AI Integrates Alibaba Cloud's Wan 2.7 for Enhanced Video Generation

Together AI has integrated Alibaba Cloud's Wan 2.7, providing AI-native developers with advanced video generation capabilities. This integration streamlines the video production workflow from initial generation through advanced editing and control, offering functionality such as high-resolution text-to-video output and scene continuation. The platform aims to deliver a production-ready environment with a 99.9% SLA and serverless inference, addressing a critical need for efficient, controllable video content creation.

Together AI Integrates Alibaba Cloud's Wan 2.7 for Enhanced Video Generation

Together AI has integrated Alibaba Cloud's Wan 2.7 model, providing AI-native developers with an advanced platform for video generation. This integration focuses on offering a streamlined workflow from initial video creation to advanced editing and control, leveraging features like text-to-video capabilities and forthcoming image-to-video functionalities. The platform emphasizes production readiness with a 99.9% SLA and serverless inference, aiming to empower teams with greater control over video content lifecycles.

Together AI Integrates Alibaba Cloud's Wan 2.7 Video Generation Model

Together AI has deployed Alibaba Cloud's Wan 2.7 model, enabling serverless text-to-video generation with outputs up to 1080p. The integration introduces advanced temporal controls including scene continuation and reference-driven steering, aimed at production-grade video workflows.
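A hypothetical sketch of what a text-to-video request with scene continuation might look like. The model ID, parameter names, and the `continue_from_video_id` field are all assumptions for illustration; consult the platform docs for the real request schema.

```python
# Hypothetical sketch of a serverless text-to-video request for Wan 2.7.
# Model ID and every field name here are assumptions, not the real API.
from typing import Optional

def build_video_request(prompt: str, *, resolution: str = "1080p",
                        continue_from: Optional[str] = None) -> dict:
    """Assemble a t2v request; `continue_from` sketches scene continuation."""
    req = {
        "model": "alibaba/wan-2.7",  # assumed model ID
        "prompt": prompt,
        "resolution": resolution,    # entry above notes outputs up to 1080p
    }
    if continue_from is not None:
        req["continue_from_video_id"] = continue_from  # hypothetical field
    return req

req = build_video_request("A drone shot over a glacier at sunrise")
req2 = build_video_request("Same glacier, the camera descends",
                           continue_from="vid_123")  # made-up video ID
```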

Deepgram Speech Models Integrated into Together AI for Real-time Voice Agents

Together AI now natively hosts Deepgram's STT (Speech-to-Text) and TTS (Text-to-Speech) models, enabling the deployment of real-time voice agents. This integration provides low-latency, production-ready solutions for conversational AI, including advanced transcription, end-of-turn detection, and structured speech generation, all within a dedicated infrastructure with a 99.9% SLA.

Together AI Integrates Deepgram for Real-time Voice AI

Together AI now natively hosts Deepgram's Speech-to-Text (STT) and Text-to-Speech (TTS) models, enabling real-time voice agents. This integration provides low-latency inference for production deployments by co-locating these models with Large Language Models (LLMs) on Together AI's dedicated infrastructure. Supported Deepgram models include Flux, Nova-3, Nova-3 Multilingual, and Aura-2.

Together AI Integrates Deepgram for Real-time Voice AI

Together AI now natively hosts Deepgram's speech-to-text (STT) and text-to-speech (TTS) models, enabling real-time voice AI agents with low-latency production deployments. This integration provides dedicated infrastructure with a 99.9% SLA and SOC 2 Type II compliance, co-locating Deepgram's voice models with large language models (LLMs) for efficient operation.

Together AI Launches Wan 2.7 for Enhanced Video Generation and Editing

Together AI has released Wan 2.7, a comprehensive suite of models for video generation, continuation, and editing. This platform aims to streamline video production workflows by integrating text-to-video, image-to-video, reference-to-video, and video editing capabilities into a single API. It offers enhanced creative control through features like audio support, frame-level conditioning, and flexible output options, addressing the challenges of fragmented video creation processes.

Correcting Database Optimizer Failures via LLM-Driven Semantic Plan Patching

DBPlanBench leverages LLMs as semantic cardinality estimators to optimize Apache DataFusion physical plans, overcoming the limitations of traditional cost-based heuristics. By utilizing a token-efficient serialization layer and an evolutionary JSON-patching mechanism, the system identifies structural inefficiencies in join ordering and pruning. The approach supports an 'optimize small, deploy large' workflow, where performance gains discovered on compact replicas transfer effectively to production-scale data.
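The "token-efficient serialization layer" can be pictured as flattening the plan tree into one compact line per operator so a large plan fits in an LLM context cheaply. The format below is illustrative, not DBPlanBench's actual serializer.

```python
# Sketch of a token-efficient plan serialization: render a physical plan
# tree as indented 'Op(key=val,...)' lines, one per operator.

def serialize_plan(node: dict, depth: int = 0) -> str:
    """Render a plan tree as indented compact lines."""
    attrs = ",".join(f"{k}={v}" for k, v in node.items()
                     if k not in ("op", "children"))
    line = "  " * depth + node["op"] + (f"({attrs})" if attrs else "")
    lines = [line]
    for child in node.get("children", []):
        lines.append(serialize_plan(child, depth + 1))
    return "\n".join(lines)

plan = {
    "op": "HashJoin", "on": "l_orderkey",
    "children": [
        {"op": "Scan", "table": "lineitem"},
        {"op": "Scan", "table": "orders"},
    ],
}
text = serialize_plan(plan)
```

A representation like this is far cheaper in tokens than raw JSON, which matters when the LLM must read the whole plan before proposing a patch.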

Deepgram Speech-to-Text and Voice Models Now Available on Together AI

Together AI now natively supports Deepgram's speech-to-text (STT) and text-to-speech (TTS) models, including Nova-3, Nova-3 Multilingual, Flux, and Aura-2. This integration allows for the deployment of real-time voice agents on a single platform, combining Deepgram's voice capabilities with Together AI's LLMs. The offering emphasizes low latency, improved conversational turn-taking, and robust performance for various enterprise use cases.
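The voice-agent loop these integrations enable is a three-stage pipeline: transcribe, respond, synthesize. The sketch below wires the stages as injectable callables; in production each stage would call the hosted models named above (e.g. Nova-3 for STT, Aura-2 for TTS) through Together AI's API, with stubs standing in here so the flow runs offline.

```python
# Sketch of the STT -> LLM -> TTS loop for a real-time voice agent.
# Stages are injectable so the wiring can be shown without network access.
from typing import Callable

def voice_turn(audio_in: bytes,
               stt: Callable[[bytes], str],
               llm: Callable[[str], str],
               tts: Callable[[str], bytes]) -> bytes:
    """One conversational turn: transcribe, respond, synthesize."""
    transcript = stt(audio_in)    # e.g. Deepgram Nova-3
    reply_text = llm(transcript)  # e.g. a co-located LLM
    return tts(reply_text)        # e.g. Deepgram Aura-2

# Stub stages exercising the flow end to end.
audio_out = voice_turn(
    b"\x00fake-audio",
    stt=lambda audio: "what is the weather",
    llm=lambda text: f"You asked: {text}",
    tts=lambda text: text.encode("utf-8"),
)
```

Co-locating all three stages on one platform is what keeps the per-turn latency low: no cross-provider network hop sits between transcription and the LLM, or between the LLM and synthesis.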

Together AI Kernel Optimization Initiatives

Together AI's kernel team is focused on enhancing LLM performance through specialized low-level optimizations. The team's efforts aim to push hardware utilization and throughput boundaries beyond standard implementations.

Kernel Optimization is Key to AI Performance and Efficiency

Together AI's Kernels team focuses on optimizing the software layer between AI models and hardware to unlock full GPU potential. Their work, stemming from FlashAttention, demonstrates significant speedups and cost reductions for AI-native applications. This approach integrates academic research with production needs, enabling rapid adaptation to new hardware and custom solutions for demanding workloads.

Aurora: Closing the Loop with Online RL for Adaptive Speculative Decoding

Aurora is an open-source RL-based framework that converts speculative decoding from a static setup into a continuous serve-to-train flywheel. By asynchronously updating the draft model using live inference traces and a custom Tree Attention mechanism, it eliminates distribution drift and reduces the cost of offline distillation pipelines. The system demonstrates that real-time online adaptation can surpass the performance of carefully pretrained static speculators.
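The decoding step Aurora adapts online can be sketched with toy greedy "models": a cheap draft proposes k tokens, the target verifies them, and generation keeps the longest agreeing prefix plus one corrected token. This is a generic greedy speculative-decoding sketch, not Aurora's actual sampling or Tree Attention machinery.

```python
# Minimal greedy speculative-decoding step: draft proposes k tokens,
# target verifies; keep agreeing prefix, then target's own token on mismatch.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft: Callable[[List[int]], int],
                     target: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """Return prefix extended by accepted draft tokens (+1 correction)."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted = list(prefix)
    for tok in proposed:
        if target(accepted) == tok:            # target agrees: accept
            accepted.append(tok)
        else:                                   # mismatch: correct and stop
            accepted.append(target(accepted))
            break
    return accepted

# Toy models: target emits last+1; draft drifts whenever the last token is 2.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 2 else 99
out = speculative_step([0], draft, target, k=4)
```

The fewer draft tokens the target rejects, the more tokens land per target forward pass, which is why keeping the draft aligned with live traffic (rather than a stale offline distillation) directly improves throughput.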

Divide and Conquer Strategy Improves LLM Long-Context Performance

A "Divide and Conquer" (D&C) framework allows smaller, cheaper language models to outperform larger, more expensive models like GPT-4o on long-context tasks. This approach breaks down complex tasks into smaller, manageable sub-tasks processed by "Worker" models, with a "Manager" model aggregating results. The framework addresses challenges like model noise, task noise, and aggregator noise through strategic prompting and task decomposition, offering significant advantages in cost, speed, and tunability.
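The Worker/Manager split is a map-reduce over the long context. The sketch below keeps both roles as injectable callables; in practice each would be an LLM call, and the naive fixed-width chunking shown here is exactly where the "task noise" the framework mitigates can creep in (e.g. a match split across a chunk boundary).

```python
# Sketch of the Divide-and-Conquer pattern: chunk the long context,
# map a Worker over each chunk, reduce the partials with a Manager.
from typing import Callable, List

def divide_and_conquer(document: str,
                       worker: Callable[[str], str],
                       manager: Callable[[List[str]], str],
                       chunk_size: int = 1000) -> str:
    """Chunk the document, run workers, aggregate with the manager."""
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    partials = [worker(c) for c in chunks]
    return manager(partials)

# Toy task: count occurrences of "error" across a long log.
log = ("ok " * 50 + "error ") * 3
answer = divide_and_conquer(
    log,
    worker=lambda chunk: str(chunk.count("error")),
    manager=lambda parts: str(sum(int(p) for p in parts)),
    chunk_size=40,
)
```

Because workers run independently, they can execute in parallel on a small model, which is the source of the cost and latency advantage over a single long-context call to a large model.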

Together AI Enhances Fine-Tuning with Advanced Capabilities for LLMs and VLMs

Together AI has expanded its fine-tuning service to address critical challenges in advanced multi-turn AI workflows. The update introduces specialized support for tool call fine-tuning with OpenAI-compatible schemas, reasoning fine-tuning for complex logic, and native vision-language model fine-tuning. Additionally, the platform now efficiently handles models up to 100B+ parameters, supporting datasets up to 100GB, and provides enhanced cost and time estimation for training jobs.
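A training example for tool-call fine-tuning in the OpenAI-compatible schema looks roughly like the following. Field names follow the standard OpenAI chat format; the weather tool itself is a made-up illustration, and the exact JSONL layout expected by the service should be checked against its docs.

```python
# One illustrative training example in the OpenAI-compatible tool-call
# schema: user turn, assistant tool call, tool result, final answer.
import json

sample = {
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "arguments": json.dumps({"city": "Paris", "unit": "celsius"}),
                },
            }],
        },
        {"role": "tool", "tool_call_id": "call_1", "content": "18°C, clear"},
        {"role": "assistant", "content": "It's 18°C and clear in Paris."},
    ]
}

# One line of the JSONL training file:
line = json.dumps(sample)
```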

Mamba-3: State Space Model Optimized for Inference Efficiency

Mamba-3 is a novel state space model (SSM) engineered for optimal inference efficiency, diverging from Mamba-2's training-centric design. It introduces a more expressive recurrence formula, complex-valued state tracking, and a Multi-Input, Multi-Output (MIMO) variant. These enhancements enable Mamba-3 to achieve superior prefill and decode latencies compared to Mamba-2, Gated DeltaNet, and even Transformer models like Llama-3.2-1B, particularly at larger sequence lengths and the 1.5B scale.
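To see why SSM decoding is cheap, here is the basic diagonal state-space recurrence underlying Mamba-style models: h_t = a·h_{t-1} + b·x_t, y_t = c·h_t. Mamba-3's actual additions (a more expressive recurrence, complex-valued states, the MIMO variant) are deliberately omitted; the point of the toy is only that the per-token state is a fixed-size vector, so decode cost does not grow with sequence length.

```python
# Toy diagonal SSM recurrence: constant-size state update per token,
# which is what gives SSMs O(1) decode cost per step.

def ssm_scan(x, a, b, c):
    """Per-channel h_t = a*h + b*x_t; output y_t = sum(c*h)."""
    h = [0.0] * len(a)
    ys = []
    for x_t in x:
        h = [a[i] * h[i] + b[i] * x_t for i in range(len(a))]
        ys.append(sum(c[i] * h[i] for i in range(len(a))))
    return ys

a = [0.5, 0.9]   # per-channel decay rates (diagonal transition)
b = [1.0, 1.0]   # input projection
c = [1.0, -1.0]  # readout
y = ssm_scan([1.0, 0.0, 0.0], a, b, c)  # impulse response of the system
```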

Together AI and NVIDIA Collaborate on Open, Agentic, and Production-Ready AI Systems

Together AI is deepening its partnership with NVIDIA, focusing on advancing open, agentic, and production-ready AI systems. This collaboration leverages NVIDIA's new platforms like Dynamo 1.0 and Nemotron 3 Super, integrated with Together AI's inference infrastructure, to provide developers with enhanced tools for building and deploying large-scale AI applications, including multi-agent workflows and real-time voice agents. The partnership's stated aim is to democratize and industrialize advanced AI capabilities through open-source contributions and optimized performance.

Nemotron 3 Super: A Hybrid MoE for Agentic AI on Together AI

Together AI now offers NVIDIA Nemotron 3 Super, a 120B-parameter (12B active) hybrid Transformer-Mamba Mixture-of-Experts model. This model is optimized for multi-agent orchestration and complex reasoning workloads, featuring a 1M-token context window and multi-token prediction for enhanced performance. Together AI provides managed infrastructure for deploying Nemotron 3 Super, alleviating GPU management overhead for developers.

FlashAttention-4: Maximizing Blackwell GPU Utilization Through Algorithmic and Kernel Co-design for Attention

FlashAttention-4 addresses the asymmetric hardware scaling in Blackwell GPUs, where tensor core throughput outpaces other resources. This new algorithm and kernel co-design optimizes attention operations by mitigating bottlenecks in softmax exponential computation (forward pass) and shared memory traffic (backward pass). It achieves up to 1605 TFLOPs/s on B200 with BF16, outperforming cuDNN and Triton.
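The forward-pass bottleneck named above is the softmax exponential. The core trick FlashAttention-style kernels use to keep softmax streaming through tiles is the online softmax: a running max and a running rescaled sum let a row be normalized in one pass. The sketch below shows that algorithm in scalar form; the actual kernel applies it per tile on-chip.

```python
# One-pass (online) softmax: track a running max m and a running sum s of
# exp(x - m), rescaling s whenever a new max appears. This is the identity
# that lets FlashAttention-style kernels avoid materializing the full row.
import math

def online_softmax(scores):
    """Numerically stable softmax over a list of floats in one pass."""
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(score - m)
    for x in scores:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum
        m = m_new
    return [math.exp(x - m) / s for x in scores]

probs = online_softmax([2.0, 1.0, 0.5])
```

Because each incoming tile only needs the pair (m, s) from the previous tiles, the kernel never revisits earlier scores, trading a small amount of exponential rescaling work for a large saving in memory traffic.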