
Together AI

Chronological feed of everything captured from Together AI.

The Imperative for an AI-Native Cloud Infrastructure

The rise of AI-native companies necessitates a new cloud paradigm, as traditional cloud infrastructure optimized for web applications cannot meet the unique demands of AI workloads. These demands include rapid iteration, GPU-intensive processing, continuous integration of research advancements, and reliable scalability for exponential growth. An AI-native cloud must provide a vertically integrated stack, a fast path from research to production, massive scalability, and developer-centric tooling.

OpenClaw Integrates with Together AI for Enhanced Agentic AI Capabilities

OpenClaw, a Jarvis-like agent, now integrates seamlessly with Together AI's platform, enabling access to powerful open-source models like Kimi K2.5, MiniMax M2.5, and GLM 5. This integration allows OpenClaw to leverage Together AI's high-throughput, low-latency infrastructure, offering a cost-effective solution for complex agentic workflows. The combined system provides users with advanced AI capabilities for task automation, web browsing, and script execution, accessible via a unified OpenAI-compatible API.
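As a rough sketch, an agent like OpenClaw would talk to the platform through the OpenAI-compatible chat endpoint described above. The base URL and model ID below are assumptions to verify against Together AI's current docs; the payload shape follows the standard OpenAI chat format.

```python
# Minimal sketch of the request an agent would send to Together AI's
# OpenAI-compatible chat endpoint. Base URL and model ID are assumptions;
# check Together's model catalog for exact names.

TOGETHER_BASE_URL = "https://api.together.xyz/v1"  # assumed OpenAI-compatible base

def build_chat_request(user_message: str, model: str = "moonshotai/Kimi-K2.5") -> dict:
    """Build an OpenAI-compatible chat.completions payload for one agent turn."""
    return {
        "model": model,  # hypothetical model ID for Kimi K2.5
        "messages": [
            {"role": "system", "content": "You are a task-automation agent."},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_chat_request("List the open pull requests in repo X.")
# POST this to f"{TOGETHER_BASE_URL}/chat/completions" with an Authorization
# header, e.g. via the official `openai` client pointed at that base URL.
```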

LLMs Enhance Query Plan Optimization for Databases

Large Language Models (LLMs) can significantly improve database query performance by rectifying suboptimal query plans generated by traditional optimizers. By directly patching physical operator graphs, LLMs can achieve substantial speedups and memory reductions. This approach offers a novel path to query optimization: fixes can be developed against small-scale data and then deployed in production environments.

LLM-Driven Query Plan Patching for Database Optimization

Together Research introduces DBPlanBench, a framework that leverages LLMs to patch existing physical operator graphs in database optimizers rather than regenerating them. By correcting errors caused by missed semantic correlations in cost estimators, the approach achieves significant reductions in execution time and memory overhead on TPC-H and TPC-DS benchmarks.
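To make the patch-not-regenerate idea concrete, here is an illustrative sketch of applying an LLM-proposed patch to a plan serialized as nested JSON. The operator layout and the `{"path": ..., "value": ...}` patch format are hypothetical stand-ins, not DBPlanBench's actual schema.

```python
# Illustrative sketch: applying an LLM-proposed patch to a physical plan
# represented as nested JSON, rather than regenerating the whole plan.
import copy

def apply_patch(plan: dict, patch: list) -> dict:
    """Apply simple {'path': [...], 'value': ...} replace ops to a nested plan."""
    patched = copy.deepcopy(plan)
    for op in patch:
        node = patched
        for key in op["path"][:-1]:
            node = node[key]
        node[op["path"][-1]] = op["value"]
    return patched

# A hash join whose build side was mis-chosen by the cost estimator.
plan = {
    "op": "HashJoin",
    "build": {"op": "Scan", "table": "orders"},  # large table on build side
    "probe": {"op": "Scan", "table": "nation"},  # small table on probe side
}

# LLM-proposed fix: swap sides so the small table builds the hash table.
patch = [
    {"path": ["build"], "value": {"op": "Scan", "table": "nation"}},
    {"path": ["probe"], "value": {"op": "Scan", "table": "orders"}},
]

fixed = apply_patch(plan, patch)
```

Patching the existing graph keeps every operator the optimizer got right and touches only the nodes the LLM flags, which is what makes the correction cheap relative to replanning.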

Together AI Integrates Alibaba Cloud's Wan 2.7 for Enhanced Video Generation

Together AI has integrated Alibaba Cloud's Wan 2.7, providing AI-native developers with advanced video generation capabilities. This integration streamlines the video production workflow from initial generation through advanced editing and control, offering functionality such as high-resolution text-to-video output and scene continuation. The platform aims to deliver a production-ready environment with a 99.9% SLA and serverless inference, addressing a critical need for efficient, controllable video content creation.

Together AI Integrates Alibaba Cloud's Wan 2.7 for Enhanced Video Generation

Together AI has integrated Alibaba Cloud's Wan 2.7 model, providing AI-native developers with an advanced platform for video generation. This integration focuses on offering a streamlined workflow from initial video creation to advanced editing and control, leveraging features like text-to-video capabilities and forthcoming image-to-video functionalities. The platform emphasizes production readiness with a 99.9% SLA and serverless inference, aiming to empower teams with greater control over video content lifecycles.

Together AI Integrates Alibaba Cloud's Wan 2.7 Video Generation Model

Together AI has deployed Alibaba Cloud's Wan 2.7 model, enabling serverless text-to-video generation with outputs up to 1080p. The integration introduces advanced temporal controls including scene continuation and reference-driven steering, aimed at production-grade video workflows.
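A hypothetical sketch of what a text-to-video request with scene continuation might look like. The model ID, parameter names, and the `continue_from_video_id` field are all assumptions for illustration; consult the platform docs for the real request schema.

```python
# Hypothetical sketch of a serverless text-to-video request for Wan 2.7.
# Model ID and every field name here are assumptions, not the real API.
from typing import Optional

def build_video_request(prompt: str, *, resolution: str = "1080p",
                        continue_from: Optional[str] = None) -> dict:
    """Assemble a t2v request; `continue_from` sketches scene continuation."""
    req = {
        "model": "alibaba/wan-2.7",  # assumed model ID
        "prompt": prompt,
        "resolution": resolution,    # entry above notes outputs up to 1080p
    }
    if continue_from is not None:
        req["continue_from_video_id"] = continue_from  # hypothetical field
    return req

req = build_video_request("A drone shot over a glacier at sunrise")
req2 = build_video_request("Same glacier, the camera descends",
                           continue_from="vid_123")  # made-up video ID
```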

Deepgram Speech Models Integrated into Together AI for Real-time Voice Agents

Together AI now natively hosts Deepgram's STT (Speech-to-Text) and TTS (Text-to-Speech) models, enabling the deployment of real-time voice agents. This integration provides low-latency, production-ready solutions for conversational AI, including advanced transcription, end-of-turn detection, and structured speech generation, all within a dedicated infrastructure with a 99.9% SLA.

Together AI Integrates Deepgram for Real-time Voice AI

Together AI now natively hosts Deepgram's Speech-to-Text (STT) and Text-to-Speech (TTS) models, enabling real-time voice agents. This integration provides low-latency inference for production deployments by co-locating these models with Large Language Models (LLMs) on Together AI's dedicated infrastructure. Supported Deepgram models include Flux, Nova-3, Nova-3 Multilingual, and Aura-2.

Together AI Integrates Deepgram for Real-time Voice AI

Together AI now natively hosts Deepgram's speech-to-text (STT) and text-to-speech (TTS) models, enabling real-time voice AI agents with low-latency production deployments. This integration provides dedicated infrastructure with a 99.9% SLA and SOC 2 Type II compliance, co-locating Deepgram's voice models with large language models (LLMs) for efficient operation.

Together AI Launches Wan 2.7 for Enhanced Video Generation and Editing

Together AI has released Wan 2.7, a comprehensive suite of models for video generation, continuation, and editing. This platform aims to streamline video production workflows by integrating text-to-video, image-to-video, reference-to-video, and video editing capabilities into a single API. It offers enhanced creative control through features like audio support, frame-level conditioning, and flexible output options, addressing the challenges of fragmented video creation processes.

Correcting Database Optimizer Failures via LLM-Driven Semantic Plan Patching

DBPlanBench leverages LLMs as semantic cardinality estimators to optimize Apache DataFusion physical plans, overcoming the limitations of traditional cost-based heuristics. By utilizing a token-efficient serialization layer and an evolutionary JSON-patching mechanism, the system identifies structural inefficiencies in join ordering and pruning. The approach supports an 'optimize small, deploy large' workflow, where performance gains discovered on compact replicas transfer effectively to production-scale data.
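The "token-efficient serialization layer" can be pictured as flattening the plan tree into one compact line per operator so a large plan fits in an LLM context cheaply. The format below is illustrative, not DBPlanBench's actual serializer.

```python
# Sketch of a token-efficient plan serialization: render a physical plan
# tree as indented 'Op(key=val,...)' lines, one per operator.

def serialize_plan(node: dict, depth: int = 0) -> str:
    """Render a plan tree as indented compact lines."""
    attrs = ",".join(f"{k}={v}" for k, v in node.items()
                     if k not in ("op", "children"))
    line = "  " * depth + node["op"] + (f"({attrs})" if attrs else "")
    lines = [line]
    for child in node.get("children", []):
        lines.append(serialize_plan(child, depth + 1))
    return "\n".join(lines)

plan = {
    "op": "HashJoin", "on": "l_orderkey",
    "children": [
        {"op": "Scan", "table": "lineitem"},
        {"op": "Scan", "table": "orders"},
    ],
}
text = serialize_plan(plan)
```

A representation like this is far cheaper in tokens than raw JSON, which matters when the LLM must read the whole plan before proposing a patch.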

Deepgram Speech-to-Text and Voice Models Now Available on Together AI

Together AI now natively supports Deepgram's speech-to-text (STT) and text-to-speech (TTS) models, including Nova-3, Nova-3 Multilingual, Flux, and Aura-2. This integration allows for the deployment of real-time voice agents on a single platform, combining Deepgram's voice capabilities with Together AI's LLMs. The offering emphasizes low latency, improved conversational turn-taking, and robust performance for various enterprise use cases.
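The voice-agent loop these integrations enable is a three-stage pipeline: transcribe, respond, synthesize. The sketch below wires the stages as injectable callables; in production each stage would call the hosted models named above (e.g. Nova-3 for STT, Aura-2 for TTS) through Together AI's API, with stubs standing in here so the flow runs offline.

```python
# Sketch of the STT -> LLM -> TTS loop for a real-time voice agent.
# Stages are injectable so the wiring can be shown without network access.
from typing import Callable

def voice_turn(audio_in: bytes,
               stt: Callable[[bytes], str],
               llm: Callable[[str], str],
               tts: Callable[[str], bytes]) -> bytes:
    """One conversational turn: transcribe, respond, synthesize."""
    transcript = stt(audio_in)    # e.g. Deepgram Nova-3
    reply_text = llm(transcript)  # e.g. a co-located LLM
    return tts(reply_text)        # e.g. Deepgram Aura-2

# Stub stages exercising the flow end to end.
audio_out = voice_turn(
    b"\x00fake-audio",
    stt=lambda audio: "what is the weather",
    llm=lambda text: f"You asked: {text}",
    tts=lambda text: text.encode("utf-8"),
)
```

Co-locating all three stages on one platform is what keeps the per-turn latency low: no cross-provider network hop sits between transcription and the LLM, or between the LLM and synthesis.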

Together AI Kernel Optimization Initiatives

Together AI's kernel team is focused on enhancing LLM performance through specialized low-level optimizations. The team's efforts aim to push hardware utilization and throughput boundaries beyond standard implementations.

Kernel Optimization is Key to AI Performance and Efficiency

Together AI's Kernels team focuses on optimizing the software layer between AI models and hardware to unlock full GPU potential. Their work, stemming from FlashAttention, demonstrates significant speedups and cost reductions for AI-native applications. This approach integrates academic research with production needs, enabling rapid adaptation to new hardware and custom solutions for demanding workloads.

Aurora: Closing the Loop with Online RL for Adaptive Speculative Decoding

Aurora is an open-source RL-based framework that converts speculative decoding from a static setup into a continuous serve-to-train flywheel. By asynchronously updating the draft model using live inference traces and a custom Tree Attention mechanism, it eliminates distribution drift and reduces the cost of offline distillation pipelines. The system demonstrates that real-time online adaptation can surpass the performance of carefully pretrained static speculators.
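The decoding step Aurora adapts online can be sketched with toy greedy "models": a cheap draft proposes k tokens, the target verifies them, and generation keeps the longest agreeing prefix plus one corrected token. This is a generic greedy speculative-decoding sketch, not Aurora's actual sampling or Tree Attention machinery.

```python
# Minimal greedy speculative-decoding step: draft proposes k tokens,
# target verifies; keep agreeing prefix, then target's own token on mismatch.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft: Callable[[List[int]], int],
                     target: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """Return prefix extended by accepted draft tokens (+1 correction)."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted = list(prefix)
    for tok in proposed:
        if target(accepted) == tok:            # target agrees: accept
            accepted.append(tok)
        else:                                   # mismatch: correct and stop
            accepted.append(target(accepted))
            break
    return accepted

# Toy models: target emits last+1; draft drifts whenever the last token is 2.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 2 else 99
out = speculative_step([0], draft, target, k=4)
```

The fewer draft tokens the target rejects, the more tokens land per target forward pass, which is why keeping the draft aligned with live traffic (rather than a stale offline distillation) directly improves throughput.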

Divide and Conquer Strategy Improves LLM Long-Context Performance

A "Divide and Conquer" (D&C) framework allows smaller, cheaper language models to outperform larger, more expensive models like GPT-4o on long-context tasks. This approach breaks down complex tasks into smaller, manageable sub-tasks processed by "Worker" models, with a "Manager" model aggregating results. The framework addresses challenges like model noise, task noise, and aggregator noise through strategic prompting and task decomposition, offering significant advantages in cost, speed, and tunability.
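The Worker/Manager split is a map-reduce over the long context. The sketch below keeps both roles as injectable callables; in practice each would be an LLM call, and the naive fixed-width chunking shown here is exactly where the "task noise" the framework mitigates can creep in (e.g. a match split across a chunk boundary).

```python
# Sketch of the Divide-and-Conquer pattern: chunk the long context,
# map a Worker over each chunk, reduce the partials with a Manager.
from typing import Callable, List

def divide_and_conquer(document: str,
                       worker: Callable[[str], str],
                       manager: Callable[[List[str]], str],
                       chunk_size: int = 1000) -> str:
    """Chunk the document, run workers, aggregate with the manager."""
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    partials = [worker(c) for c in chunks]
    return manager(partials)

# Toy task: count occurrences of "error" across a long log.
log = ("ok " * 50 + "error ") * 3
answer = divide_and_conquer(
    log,
    worker=lambda chunk: str(chunk.count("error")),
    manager=lambda parts: str(sum(int(p) for p in parts)),
    chunk_size=40,
)
```

Because workers run independently, they can execute in parallel on a small model, which is the source of the cost and latency advantage over a single long-context call to a large model.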

Together AI Enhances Fine-Tuning with Advanced Capabilities for LLMs and VLMs

Together AI has expanded its fine-tuning service to address critical challenges in advanced multi-turn AI workflows. The update introduces specialized support for tool call fine-tuning with OpenAI-compatible schemas, reasoning fine-tuning for complex logic, and native vision-language model fine-tuning. Additionally, the platform now efficiently handles models up to 100B+ parameters, supporting datasets up to 100GB, and provides enhanced cost and time estimation for training jobs.
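A training example for tool-call fine-tuning in the OpenAI-compatible schema looks roughly like the following. Field names follow the standard OpenAI chat format; the weather tool itself is a made-up illustration, and the exact JSONL layout expected by the service should be checked against its docs.

```python
# One illustrative training example in the OpenAI-compatible tool-call
# schema: user turn, assistant tool call, tool result, final answer.
import json

sample = {
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "arguments": json.dumps({"city": "Paris", "unit": "celsius"}),
                },
            }],
        },
        {"role": "tool", "tool_call_id": "call_1", "content": "18°C, clear"},
        {"role": "assistant", "content": "It's 18°C and clear in Paris."},
    ]
}

# One line of the JSONL training file:
line = json.dumps(sample)
```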

Mamba-3: State Space Model Optimized for Inference Efficiency

Mamba-3 is a novel state space model (SSM) engineered for optimal inference efficiency, diverging from Mamba-2's training-centric design. It introduces a more expressive recurrence formula, complex-valued state tracking, and a Multi-Input, Multi-Output (MIMO) variant. These enhancements enable Mamba-3 to achieve superior prefill and decode latencies compared to Mamba-2, Gated DeltaNet, and even Transformer models like Llama-3.2-1B, particularly at larger sequence lengths and the 1.5B scale.
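To see why SSM decoding is cheap, here is the basic diagonal state-space recurrence underlying Mamba-style models: h_t = a·h_{t-1} + b·x_t, y_t = c·h_t. Mamba-3's actual additions (a more expressive recurrence, complex-valued states, the MIMO variant) are deliberately omitted; the point of the toy is only that the per-token state is a fixed-size vector, so decode cost does not grow with sequence length.

```python
# Toy diagonal SSM recurrence: constant-size state update per token,
# which is what gives SSMs O(1) decode cost per step.

def ssm_scan(x, a, b, c):
    """Per-channel h_t = a*h + b*x_t; output y_t = sum(c*h)."""
    h = [0.0] * len(a)
    ys = []
    for x_t in x:
        h = [a[i] * h[i] + b[i] * x_t for i in range(len(a))]
        ys.append(sum(c[i] * h[i] for i in range(len(a))))
    return ys

a = [0.5, 0.9]   # per-channel decay rates (diagonal transition)
b = [1.0, 1.0]   # input projection
c = [1.0, -1.0]  # readout
y = ssm_scan([1.0, 0.0, 0.0], a, b, c)  # impulse response of the system
```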

Together AI and NVIDIA Collaborate on Open, Agentic, and Production-Ready AI Systems

Together AI is deepening its partnership with NVIDIA, focusing on advancing open, agentic, and production-ready AI systems. This collaboration leverages NVIDIA's new platforms like Dynamo 1.0 and Nemotron 3 Super, integrated with Together AI's inference infrastructure, to provide developers with enhanced tools for building and deploying large-scale AI applications, including multi-agent workflows and real-time voice agents. The partnership's stated aim is to democratize and industrialize advanced AI capabilities through open-source contributions and optimized performance.

Nemotron 3 Super: A Hybrid MoE for Agentic AI on Together AI

Together AI now offers NVIDIA Nemotron 3 Super, a 120B-parameter (12B active) hybrid Transformer-Mamba Mixture-of-Experts model. This model is optimized for multi-agent orchestration and complex reasoning workloads, featuring a 1M-token context window and multi-token prediction for enhanced performance. Together AI provides managed infrastructure for deploying Nemotron 3 Super, alleviating GPU management overhead for developers.

FlashAttention-4: Maximizing Blackwell GPU Utilization Through Algorithmic and Kernel Co-design for Attention

FlashAttention-4 addresses the asymmetric hardware scaling in Blackwell GPUs, where tensor core throughput outpaces other resources. This new algorithm and kernel co-design optimizes attention operations by mitigating bottlenecks in softmax exponential computation (forward pass) and shared memory traffic (backward pass). It achieves up to 1605 TFLOPs/s on B200 with BF16, outperforming cuDNN and Triton.
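The forward-pass bottleneck named above is the softmax exponential. The core trick FlashAttention-style kernels use to keep softmax streaming through tiles is the online softmax: a running max and a running rescaled sum let a row be normalized in one pass. The sketch below shows that algorithm in scalar form; the actual kernel applies it per tile on-chip.

```python
# One-pass (online) softmax: track a running max m and a running sum s of
# exp(x - m), rescaling s whenever a new max appears. This is the identity
# that lets FlashAttention-style kernels avoid materializing the full row.
import math

def online_softmax(scores):
    """Numerically stable softmax over a list of floats in one pass."""
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(score - m)
    for x in scores:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum
        m = m_new
    return [math.exp(x - m) / s for x in scores]

probs = online_softmax([2.0, 1.0, 0.5])
```

Because each incoming tile only needs the pair (m, s) from the previous tiles, the kernel never revisits earlier scores, trading a small amount of exponential rescaling work for a large saving in memory traffic.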