Solo Unicorn

Together AI Deep Dive — Open-Source Model Inference

Tags: Company Deep Dive · Together AI · Open-Source Models · AI Inference · GPU Cloud

Opening

While everyone is debating which closed-source model is better, Together AI is doing something else entirely: making open-source models run faster and cheaper. The numbers: $300 million in annualized revenue, a rapidly growing customer base through 2025, and 200MW of compute capacity coming online with Nvidia Blackwell clusters. I've used Together AI's API to run Llama and Mixtral models across several projects — it's one of the largest independent players in open-source model inference.

The Problem They Solve

The core contradiction of open-source models: the model is free, but running it isn't.

What does it take to run a Llama 70B model? At minimum: two A100 80GB GPUs, an optimized inference engine (vLLM or similar), load balancing, monitoring, and auto-scaling. For most teams, the engineering cost of building this infrastructure far exceeds just using an API.

Together AI's positioning: become the "AWS" of the open-source model world — you pick the model, they run it, you pay per token.
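The "pick a model, pay per token" positioning is concrete in practice: the API follows the OpenAI chat-completions convention, so a call is a few lines of Python. The endpoint URL and model ID below are my best understanding of Together's public API and are illustrative — check the current docs and model list before relying on them.

```python
# Minimal sketch of calling Together AI's OpenAI-compatible chat endpoint.
# Endpoint and model ID are assumptions based on public docs; verify both.
import json
import urllib.request

API_URL = "https://api.together.xyz/v1/chat/completions"

def build_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def run(api_key: str, payload: dict) -> dict:
    """POST the payload with bearer-token auth and return the parsed JSON."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    payload = build_request(
        "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model ID
        "Classify the sentiment: 'great product'",
    )
    # result = run("YOUR_API_KEY", payload)  # requires a real key
```

Because the request shape matches OpenAI's, switching a workload between providers is mostly a base-URL and model-ID change — which is exactly why switching costs in this market are low.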

Target customers:

  • AI-native startups that build on open-source models but don't want to manage infrastructure
  • Technical teams that need fine-tuning capabilities
  • Mid-size companies that are cost-sensitive but need production-grade reliability
  • Teams doing large-scale batch processing (data labeling, content generation)

Product Matrix

Core Products

Serverless Inference: Together AI's flagship product. Connect via API to run mainstream open-source models like Llama, Mixtral, and DeepSeek, billed per token. Supports dozens of models with fast onboarding of new releases.

Together Reasoning Clusters: Dedicated inference clusters for high-throughput, low-latency workloads. Decoding speeds up to 110 tokens/second, ideal for token-intensive use cases (agents, long-form generation).

GPU Cloud: Direct GPU rental. Supports Nvidia H100 and Blackwell series. For teams that need full control — running their own training jobs and deploying custom models.

Fine-tuning Service: Fine-tune open-source models on the Together platform, billed per training token. No need to manage your own GPU cluster.

Batch Inference: Asynchronous large-scale inference at a 50% discount, supporting up to 30 billion tokens per job.

Technical Differentiation

Together AI's technical edge isn't in the models themselves — it's in inference optimization:

  • Custom inference engine that outperforms general-purpose frameworks like vLLM on throughput and latency for models like Llama
  • Speculative decoding that delivers 2-3x speedups on certain models
  • Currently deploying Nvidia GB200 NVL72 and HGX B200 clusters to stay on the latest hardware cycle

Business Model

Pricing Strategy

| Plan | Price | Target Customer |
| --- | --- | --- |
| Serverless API (Llama 70B) | ~$0.90 input / $0.90 output per M tokens | General use |
| Serverless API (Llama 405B) | ~$3.50 per M output tokens | High-quality inference |
| Batch API | 50% off standard pricing | Large-scale batch processing |
| Reasoning Clusters | Custom pricing | High-throughput enterprises |
| GPU Cloud (H100) | Per-hour billing | Training/custom deployment |
| Fine-tuning | Per training token | Model customization |
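The batch discount compounds quickly at scale. A back-of-envelope calculation using the list prices above (per-million-token rates are approximate, and the 50% discount applies to the serverless rate):

```python
# Rough job-cost arithmetic from the pricing table. Rates are approximate
# list prices per million tokens; the batch discount is the published 50%.
BATCH_DISCOUNT = 0.5

def job_cost(tokens: int, price_per_m: float, batch: bool = False) -> float:
    """Cost in dollars for a job of `tokens` tokens at `price_per_m` $/M."""
    rate = price_per_m * (BATCH_DISCOUNT if batch else 1.0)
    return tokens / 1_000_000 * rate

# A 30B-token job (the stated per-job cap) at ~$0.90/M Llama 70B rates:
standard = job_cost(30_000_000_000, 0.90)              # ≈ $27,000
batched = job_cost(30_000_000_000, 0.90, batch=True)   # ≈ $13,500
```

For asynchronous workloads like data labeling, that halving is the difference between a five-figure and a low-five-figure bill per job.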

Revenue Model

Dual revenue streams: API usage (30-40%) + GPU rental (60-70%).

API revenue is billed per token with higher margins, but pricing is under competitive pressure. GPU rental revenue is more stable but requires heavy upfront capital investment.

Funding & Valuation

| Round | Date | Amount | Valuation |
| --- | --- | --- | --- |
| Series A | Nov 2023 | $103M | — |
| Series A+ | Apr 2024 | $125M | $1.25B |
| Series B | Feb 2025 | $305M | $3.3B |

Total funding: $534 million. Series B led by General Catalyst and Prosperity7.

Customers & Market

Key Customers

Together AI's customer profile skews heavily toward AI-native companies and developers. While specific client names are rarely disclosed, the core user base — inferred from product positioning and revenue scale — includes:

  • AI startups building products on open-source models
  • Mid-to-large enterprises with large-scale data processing needs
  • Research institutions and technical teams doing fine-tuning and model experimentation

Market Size

The AI inference market (Inference-as-a-Service) is projected at $50-80 billion in 2026. This market is growing rapidly — inference costs account for 60-80% of total AI application costs, and inference demand is scaling exponentially as AI adoption spreads. Together AI holds a significant position in the open-source model inference sub-market.

Competitive Landscape

| Dimension | Together AI | Fireworks AI | Groq | AWS/GCP/Azure |
| --- | --- | --- | --- | --- |
| Core Positioning | Open-source inference + GPU | Fast inference | Fastest inference (LPU) | Full-stack cloud |
| Inference Speed | Fast | Very fast | Fastest | Moderate |
| Model Coverage | Broad (dozens of open-source models) | Broad | Limited | Broadest |
| GPU Rental | Yes | Yes | No (custom chips) | Yes |
| Fine-tuning | Yes | Yes | No | Yes |
| Annualized Revenue | $300M | $280M | Undisclosed | Far higher |
| Valuation | $3.3B | $4B | $6.9B | N/A (public companies) |

What I've Actually Seen

The good: On a project that needed large-scale text classification with Llama 70B, Together AI's API experience was far better than self-hosted inference — no dealing with GPU scheduling, OOM errors, or model version management. The Batch API's 50% discount is genuinely attractive for large-scale jobs. New model onboarding is fast — DeepSeek R1 was available on Together shortly after release.

The complicated: Open-source model inference is a highly commoditized market. Together, Fireworks, DeepInfra, Anyscale, and others are all doing similar things, with differences mainly in price and speed — and competition on those two dimensions is brutal. Once customers start A/B testing providers, switching costs are low. The GPU rental business requires continuous large capital outlays, and deploying 200MW of compute means massive upfront investment.

The reality: Together AI's fate is largely tied to the trajectory of open-source models. If Llama 5 and the next generation of Mistral models keep closing the gap with closed-source alternatives, Together AI's business will continue to grow. But if closed-source models pull ahead (no current trend suggests this), the open-source inference market could shrink. Another risk: Nvidia and the major cloud providers could move directly into this market — AWS is already doing inference services through Bedrock.

My Take

  • Yes, if: You're building products on open-source models but don't want to manage infrastructure; you're a data team needing large-scale batch inference; you're a developer who wants to quickly experiment with multiple open-source models; you're on a tight budget but need production-grade APIs
  • Skip if: You only use GPT-5 or Claude (you don't need an open-source inference platform); you have your own GPU cluster with a capable ops team; you need the highest security tier with private deployment (not Together's primary use case)

In one line: Together AI holds a clear position in the open-source model inference market, but competition is intensifying fast — the ultimate winner may not be whoever has the fastest API, but whoever can build lasting advantages in scale and cost.

Discussion

When running open-source models, do you self-host inference or use a third-party API? In my experience, if your monthly spend is under $5,000, an API is almost certainly more cost-effective; above that, self-hosting starts to make economic sense. Where's your breakeven point?
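My $5,000/month rule of thumb can be sanity-checked with a simple model. All the inputs below are illustrative assumptions — ~$0.90/M tokens via API, an H100 rented at roughly $2.50/hr, and a flat monthly ops overhead — and you would substitute your own measured throughput, utilization, and staffing costs:

```python
# Illustrative API-vs-self-host breakeven model. Every constant here is an
# assumption for the sketch, not a quoted price: ~$0.90/M API tokens,
# ~$2.50/hr per GPU, and a flat $2,000/month ops overhead.
def api_monthly_cost(tokens_per_month: float, price_per_m: float = 0.90) -> float:
    """Dollars per month at a per-million-token API rate."""
    return tokens_per_month / 1_000_000 * price_per_m

def selfhost_monthly_cost(gpus: int, hourly_rate: float = 2.50,
                          ops_overhead: float = 2000.0) -> float:
    """GPU rental (24/7 for a 30-day month) plus assumed ops overhead."""
    return gpus * hourly_rate * 24 * 30 + ops_overhead

# Example: 5B tokens/month on the API vs. two always-on GPUs:
api = api_monthly_cost(5_000_000_000)   # ≈ $4,500
own = selfhost_monthly_cost(gpus=2)     # ≈ $5,600
```

Under these assumptions the API still wins at ~$4,500/month of usage, consistent with the $5,000 rule of thumb — but the crossover moves a lot with GPU pricing, utilization, and how honestly you account for the ops engineer's time.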