Together AI Deep Dive — Open-Source Model Inference

Opening
While everyone is debating which closed-source model is better, Together AI is doing something else entirely: making open-source models run faster and cheaper. $300 million in annualized revenue, a customer base that grew rapidly through 2025, and 200 MW of compute capacity coming online as Nvidia Blackwell clusters. I've used Together AI's API to run Llama and Mixtral models across several projects, and it's one of the largest independent players in the open-source model inference space.
The Problem They Solve
The core contradiction of open-source models: the model is free, but running it isn't.
What does it take to run a Llama 70B model? At minimum: two A100 80GB GPUs, an optimized inference engine (vLLM or similar), load balancing, monitoring, and auto-scaling. For most teams, the engineering cost of building this infrastructure far exceeds just using an API.
Together AI's positioning: become the "AWS" of the open-source model world — you pick the model, they run it, you pay per token.
Target customers:
- AI-native startups that build products with open-source models but don't want to manage infrastructure
- Technical teams that need fine-tuning capabilities
- Mid-size companies that are cost-sensitive but need production-grade reliability
- Teams doing large-scale batch processing (data labeling, content generation)
Product Matrix
Core Products
Serverless Inference: Together AI's flagship product. Connect via API to run mainstream open-source models like Llama, Mixtral, and DeepSeek, billed per token. Supports dozens of models with fast onboarding of new releases.
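In practice this is a plain HTTP API; Together's serverless endpoint follows the familiar OpenAI-style chat completions shape. A minimal sketch of building such a request (the endpoint URL and model slug here are assumptions for illustration, not guaranteed current):

```python
import json

# Assumed OpenAI-compatible chat completions endpoint (illustrative).
API_URL = "https://api.together.xyz/v1/chat/completions"

def build_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build the JSON body for a per-token-billed chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Hypothetical model slug; check the provider's model list for real names.
payload = build_request("meta-llama/Llama-3.3-70B-Instruct-Turbo",
                        "Classify the sentiment of this review.")
body = json.dumps(payload)  # send with any HTTP client plus an auth header
```

Because the request shape matches the OpenAI convention, existing client libraries usually work by just swapping the base URL and API key.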
Together Reasoning Clusters: Dedicated inference clusters for high-throughput, low-latency workloads. Decoding speeds up to 110 tokens/second, ideal for token-intensive use cases (agents, long-form generation).
GPU Cloud: Direct GPU rental. Supports Nvidia H100 and Blackwell series. For teams that need full control — running their own training jobs and deploying custom models.
Fine-tuning Service: Fine-tune open-source models on the Together platform, billed per training token. No need to manage your own GPU cluster.
Batch Inference: Asynchronous large-scale inference at a 50% discount, supporting up to 30 billion tokens per job.
Technical Differentiation
Together AI's technical edge isn't in the models themselves — it's in inference optimization:
- Custom inference engine that outperforms general-purpose frameworks like vLLM on throughput and latency for models like Llama
- Speculative decoding that delivers 2-3x speedups on certain models
- Currently deploying Nvidia GB200 NVL72 and HGX B200 clusters to stay on the latest hardware cycle
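The 2-3x speculative-decoding figure is consistent with the standard expected-tokens analysis from the speculative decoding literature (this is the textbook formula, not Together's internal numbers): with draft acceptance rate α and k drafted tokens per verification step, each target-model pass accepts (1 − α^(k+1)) / (1 − α) tokens on average.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens accepted per target-model forward pass in
    speculative decoding: (1 - alpha^(k+1)) / (1 - alpha), alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 drafted tokens per step, each
# target pass yields ~3.36 tokens, i.e. a 2-3x+ speedup is plausible.
speedup = expected_tokens_per_pass(0.8, 4)
```

The acceptance rate depends on how well the small draft model matches the large target model, which is why speedups vary by model pair.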
Business Model
Pricing Strategy
| Plan | Price | Target Customer |
|---|---|---|
| Serverless API (Llama 70B) | ~$0.90 per million tokens (input and output) | General use |
| Serverless API (Llama 405B) | ~$3.50 per million output tokens | High-quality inference |
| Batch API | 50% off standard pricing | Large-scale batch processing |
| Reasoning Clusters | Custom pricing | High-throughput enterprises |
| GPU Cloud (H100) | Per-hour billing | Training/custom deployment |
| Fine-tuning | Per training token | Model customization |
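To make the table concrete, a rough per-job cost estimate from the listed rates (rates are illustrative and change often; the flat 50% batch discount is taken from the table above):

```python
def job_cost_usd(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float,
                 batch: bool = False) -> float:
    """Cost of one job at per-million-token rates; the Batch API
    applies a flat 50% discount."""
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1e6
    return cost * 0.5 if batch else cost

# 1B input + 200M output tokens on Llama 70B at ~$0.90/M each:
realtime = job_cost_usd(1_000_000_000, 200_000_000, 0.90, 0.90)              # $1080
batched = job_cost_usd(1_000_000_000, 200_000_000, 0.90, 0.90, batch=True)   # $540
```

At billions of tokens per job, the batch discount is the difference between a four-figure and a three-figure bill, which is why it matters for labeling and content-generation workloads.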
Revenue Model
Dual revenue streams: API usage (30-40%) + GPU rental (60-70%).
API revenue is billed per token with higher margins, but pricing is under competitive pressure. GPU rental revenue is more stable but requires heavy upfront capital investment.
Funding & Valuation
| Round | Date | Amount | Valuation |
|---|---|---|---|
| Series A | Nov 2023 | $103M | — |
| Series A+ | Apr 2024 | $125M | $1.25B |
| Series B | Feb 2025 | $305M | $3.3B |
Total funding: $534 million. Series B led by General Catalyst and Prosperity7.
Customers & Market
Key Customers
Together AI's customer profile skews heavily toward AI-native companies and developers. While specific client names are rarely disclosed, the core user base — inferred from product positioning and revenue scale — includes:
- AI startups building products on open-source models
- Mid-to-large enterprises with large-scale data processing needs
- Research institutions and technical teams doing fine-tuning and model experimentation
Market Size
The AI inference market (Inference-as-a-Service) is projected at $50-80 billion in 2026. This market is growing rapidly — inference costs account for 60-80% of total AI application costs, and inference demand is scaling exponentially as AI adoption spreads. Together AI holds a significant position in the open-source model inference sub-market.
Competitive Landscape
| Dimension | Together AI | Fireworks AI | Groq | AWS/GCP/Azure |
|---|---|---|---|---|
| Core Positioning | Open-source inference + GPU | Fast inference | Fastest inference (LPU) | Full-stack cloud |
| Inference Speed | Fast | Very fast | Fastest | Moderate |
| Model Coverage | Broad (dozens of open-source models) | Broad | Limited | Broadest |
| GPU Rental | Yes | Yes | No (custom chips) | Yes |
| Fine-tuning | Yes | Yes | No | Yes |
| Annualized Revenue | $300M | $280M | Undisclosed | Far higher |
| Valuation | $3.3B | $4B | $6.9B | — |
What I've Actually Seen
The good: On a project that needed large-scale text classification with Llama 70B, Together AI's API experience was far better than self-hosted inference — no dealing with GPU scheduling, OOM errors, or model version management. The Batch API's 50% discount is genuinely attractive for large-scale jobs. New model onboarding is fast — DeepSeek R1 was available on Together shortly after release.
The complicated: Open-source model inference is a highly commoditized market. Together, Fireworks, DeepInfra, Anyscale, and others are all doing similar things, with differences mainly in price and speed — and competition on those two dimensions is brutal. Once customers start A/B testing providers, switching costs are low. The GPU rental business requires continuous large capital outlays, and deploying 200MW of compute means massive upfront investment.
The reality: Together AI's fate is largely tied to the trajectory of open-source models. If Llama 5 and the next generation of Mistral models keep closing the gap with closed-source alternatives, Together AI's business will continue to grow. But if closed-source models pull ahead (no current trend suggests this), the open-source inference market could shrink. Another risk: Nvidia and the major cloud providers could move directly into this market — AWS is already doing inference services through Bedrock.
My Take
- Yes, if: You're building products on open-source models but don't want to manage infrastructure; you're a data team needing large-scale batch inference; you're a developer who wants to quickly experiment with multiple open-source models; you're on a tight budget but need production-grade APIs
- Skip if: You only use GPT-5 or Claude (you don't need an open-source inference platform); you have your own GPU cluster with a capable ops team; you need the highest security tier with private deployment (not Together's primary use case)
In one line: Together AI holds a clear position in the open-source model inference market, but competition is intensifying fast — the ultimate winner may not be whoever has the fastest API, but whoever can build lasting advantages in scale and cost.
Discussion
When running open-source models, do you self-host inference or use a third-party API? In my experience, if your monthly spend is under $5,000, an API is almost certainly more cost-effective; above that, self-hosting starts to make economic sense. Where's your breakeven point?
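That breakeven intuition can be sketched numerically. Assuming ~$0.90/M tokens on the API versus two always-on rented A100s at a hypothetical $1.50/GPU-hour (raw rental only, ignoring the engineering time that pushes the practical threshold higher, toward the $5,000 figure above):

```python
def api_monthly_cost(tokens_per_month: int, rate_per_million: float = 0.90) -> float:
    """Pay-per-token API spend for a month's traffic."""
    return tokens_per_month * rate_per_million / 1e6

def selfhost_monthly_cost(gpus: int = 2, usd_per_gpu_hour: float = 1.50,
                          hours: float = 730) -> float:
    """Fixed cost: rented GPUs bill whether or not they are saturated."""
    return gpus * usd_per_gpu_hour * hours

# Two always-on GPUs at $1.50/hr cost 2 * 1.50 * 730 = $2190/month;
# at $0.90/M tokens, ~2.4B tokens/month of API usage costs about the same.
fixed = selfhost_monthly_cost()
api = api_monthly_cost(2_433_000_000)
```

The crossover is driven almost entirely by utilization: self-hosting is a fixed cost, so it only wins if you keep the GPUs busy.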