Together AI Deep Dive — Open-Source Model Inference

Opening
While everyone is debating which closed-source model is better, Together AI is doing something else entirely: making open-source models run faster and cheaper. $300 million in annualized revenue, a customer base that grew rapidly through 2025, and 200 MW of compute capacity coming online as Nvidia Blackwell clusters. I've used Together AI's API to run Llama and Mixtral models across several projects, and it's one of the largest independent players in the open-source model inference space.
The Problem They Solve
The core contradiction of open-source models: the model is free, but running it isn't.
What does it take to run a Llama 70B model? At minimum: two A100 80GB GPUs, an optimized inference engine (vLLM or similar), load balancing, monitoring, and auto-scaling. For most teams, the engineering cost of building this infrastructure far exceeds just using an API.
Together AI's positioning: become the "AWS" of the open-source model world — you pick the model, they run it, you pay per token.
Target customers:
- AI-native startups that build products with open-source models but don't want to manage infrastructure
- Technical teams that need fine-tuning capabilities
- Mid-size companies that are cost-sensitive but need production-grade reliability
- Teams doing large-scale batch processing (data labeling, content generation)
Product Matrix
Core Products
Serverless Inference: Together AI's flagship product. Connect via API to run mainstream open-source models like Llama, Mixtral, and DeepSeek, billed per token. Supports dozens of models with fast onboarding of new releases.
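In practice this is a plain HTTP API; Together's serverless endpoint follows the familiar OpenAI-style chat completions shape. A minimal sketch of building such a request (the endpoint URL and model slug here are assumptions for illustration, not guaranteed current):

```python
import json

# Assumed OpenAI-compatible chat completions endpoint (illustrative).
API_URL = "https://api.together.xyz/v1/chat/completions"

def build_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build the JSON body for a per-token-billed chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Hypothetical model slug; check the provider's model list for real names.
payload = build_request("meta-llama/Llama-3.3-70B-Instruct-Turbo",
                        "Classify the sentiment of this review.")
body = json.dumps(payload)  # send with any HTTP client plus an auth header
```

Because the request shape matches the OpenAI convention, existing client libraries usually work by just swapping the base URL and API key.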
Together Reasoning Clusters: Dedicated inference clusters for high-throughput, low-latency workloads. Decoding speeds up to 110 tokens/second, ideal for token-intensive use cases (agents, long-form generation).
GPU Cloud: Direct GPU rental. Supports Nvidia H100 and Blackwell series. For teams that need full control — running their own training jobs and deploying custom models.
Fine-tuning Service: Fine-tune open-source models on the Together platform, billed per training token. No need to manage your own GPU cluster.
Batch Inference: Asynchronous large-scale inference at a 50% discount, supporting up to 30 billion tokens per job.
Technical Differentiation
Together AI's technical edge isn't in the models themselves — it's in inference optimization:
- Custom inference engine that outperforms general-purpose frameworks like vLLM on throughput and latency for models like Llama
- Speculative decoding that delivers 2-3x speedups on certain models
- Currently deploying Nvidia GB200 NVL72 and HGX B200 clusters to stay on the latest hardware cycle
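The 2-3x speculative-decoding figure is consistent with the standard expected-tokens analysis from the speculative decoding literature (this is the textbook formula, not Together's internal numbers): with draft acceptance rate α and k drafted tokens per verification step, each target-model pass accepts (1 − α^(k+1)) / (1 − α) tokens on average.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens accepted per target-model forward pass in
    speculative decoding: (1 - alpha^(k+1)) / (1 - alpha), alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 drafted tokens per step, each
# target pass yields ~3.36 tokens, i.e. a 2-3x+ speedup is plausible.
speedup = expected_tokens_per_pass(0.8, 4)
```

The acceptance rate depends on how well the small draft model matches the large target model, which is why speedups vary by model pair.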
Business Model
Pricing Strategy
| Plan | Price | Target Customer |
|---|---|---|
| Serverless API (Llama 70B) | ~$0.90 per million tokens (input and output) | General use |
| Serverless API (Llama 405B) | ~$3.50 per million output tokens | High-quality inference |
| Batch API | 50% off standard pricing | Large-scale batch processing |
| Reasoning Clusters | Custom pricing | High-throughput enterprises |
| GPU Cloud (H100) | Per-hour billing | Training/custom deployment |
| Fine-tuning | Per training token | Model customization |
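To make the table concrete, a rough per-job cost estimate from the listed rates (rates are illustrative and change often; the flat 50% batch discount is taken from the table above):

```python
def job_cost_usd(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float,
                 batch: bool = False) -> float:
    """Cost of one job at per-million-token rates; the Batch API
    applies a flat 50% discount."""
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1e6
    return cost * 0.5 if batch else cost

# 1B input + 200M output tokens on Llama 70B at ~$0.90/M each:
realtime = job_cost_usd(1_000_000_000, 200_000_000, 0.90, 0.90)              # $1080
batched = job_cost_usd(1_000_000_000, 200_000_000, 0.90, 0.90, batch=True)   # $540
```

At billions of tokens per job, the batch discount is the difference between a four-figure and a three-figure bill, which is why it matters for labeling and content-generation workloads.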
Revenue Model
Dual revenue streams: API usage (30-40%) + GPU rental (60-70%).
API revenue is billed per token with higher margins, but pricing is under competitive pressure. GPU rental revenue is more stable but requires heavy upfront capital investment.
Funding & Valuation
| Round | Date | Amount | Valuation |
|---|---|---|---|
| Series A | Nov 2023 | $103M | — |
| Series A+ | Apr 2024 | $125M | $1.25B |
| Series B | Feb 2025 | $305M | $3.3B |
Total funding: $534 million. Series B led by General Catalyst and Prosperity7.
Customers & Market
Key Customers
Together AI's customer profile skews heavily toward AI-native companies and developers. While specific client names are rarely disclosed, the core user base — inferred from product positioning and revenue scale — includes:
- AI startups building products on open-source models
- Mid-to-large enterprises with large-scale data processing needs
- Research institutions and technical teams doing fine-tuning and model experimentation
Market Size
The AI inference market (Inference-as-a-Service) is projected at $50-80 billion in 2026. This market is growing rapidly — inference costs account for 60-80% of total AI application costs, and inference demand is scaling exponentially as AI adoption spreads. Together AI holds a significant position in the open-source model inference sub-market.
Competitive Landscape
| Dimension | Together AI | Fireworks AI | Groq | AWS/GCP/Azure |
|---|---|---|---|---|
| Core Positioning | Open-source inference + GPU | Fast inference | Fastest inference (LPU) | Full-stack cloud |
| Inference Speed | Fast | Very fast | Fastest | Moderate |
| Model Coverage | Broad (dozens of open-source models) | Broad | Limited | Broadest |
| GPU Rental | Yes | Yes | No (custom chips) | Yes |
| Fine-tuning | Yes | Yes | No | Yes |
| Annualized Revenue | $300M | $280M | Undisclosed | Far higher |
| Valuation | $3.3B | $4B | $6.9B | — |
What I've Actually Seen
The good: On a project that needed large-scale text classification with Llama 70B, Together AI's API experience was far better than self-hosted inference — no dealing with GPU scheduling, OOM errors, or model version management. The Batch API's 50% discount is genuinely attractive for large-scale jobs. New model onboarding is fast — DeepSeek R1 was available on Together shortly after release.
The complicated: Open-source model inference is a highly commoditized market. Together, Fireworks, DeepInfra, Anyscale, and others are all doing similar things, with differences mainly in price and speed — and competition on those two dimensions is brutal. Once customers start A/B testing providers, switching costs are low. The GPU rental business requires continuous large capital outlays, and deploying 200MW of compute means massive upfront investment.
The reality: Together AI's fate is largely tied to the trajectory of open-source models. If Llama 5 and the next generation of Mistral models keep closing the gap with closed-source alternatives, Together AI's business will continue to grow. But if closed-source models pull ahead (no current trend suggests this), the open-source inference market could shrink. Another risk: Nvidia and the major cloud providers could move directly into this market — AWS is already doing inference services through Bedrock.
My Take
- Yes, if: You're building products on open-source models but don't want to manage infrastructure; you're a data team needing large-scale batch inference; you're a developer who wants to quickly experiment with multiple open-source models; you're on a tight budget but need production-grade APIs
- Skip if: You only use GPT-5 or Claude (you don't need an open-source inference platform); you have your own GPU cluster with a capable ops team; you need the highest security tier with private deployment (not Together's primary use case)
In one line: Together AI holds a clear position in the open-source model inference market, but competition is intensifying fast — the ultimate winner may not be whoever has the fastest API, but whoever can build lasting advantages in scale and cost.
Discussion
When running open-source models, do you self-host inference or use a third-party API? In my experience, if your monthly spend is under $5,000, an API is almost certainly more cost-effective; above that, self-hosting starts to make economic sense. Where's your breakeven point?
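That breakeven intuition can be sketched numerically. Assuming ~$0.90/M tokens on the API versus two always-on rented A100s at a hypothetical $1.50/GPU-hour (raw rental only, ignoring the engineering time that pushes the practical threshold higher, toward the $5,000 figure above):

```python
def api_monthly_cost(tokens_per_month: int, rate_per_million: float = 0.90) -> float:
    """Pay-per-token API spend for a month's traffic."""
    return tokens_per_month * rate_per_million / 1e6

def selfhost_monthly_cost(gpus: int = 2, usd_per_gpu_hour: float = 1.50,
                          hours: float = 730) -> float:
    """Fixed cost: rented GPUs bill whether or not they are saturated."""
    return gpus * usd_per_gpu_hour * hours

# Two always-on GPUs at $1.50/hr cost 2 * 1.50 * 730 = $2190/month;
# at $0.90/M tokens, ~2.4B tokens/month of API usage costs about the same.
fixed = selfhost_monthly_cost()
api = api_monthly_cost(2_433_000_000)
```

The crossover is driven almost entirely by utilization: self-hosting is a fixed cost, so it only wins if you keep the GPUs busy.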