Open-weights models in 2025: Llama, Mistral, and the ecosystem
How the open-weights LLM ecosystem has matured, where Llama 3 and Mistral models fit in the stack, and what to expect from self-hosted deployments.
The open-weights model landscape has shifted decisively since 2023. Llama 3 70B now scores within 5–10% of GPT-4-class models on most standard evals, and the ecosystem of inference runtimes, fine-tuning frameworks, and deployment tooling has kept pace with research.
Llama 3 70B and 8B
Meta's Llama 3 70B is the default starting point for self-hosted deployments that need GPT-3.5-to-GPT-4 capability. It runs at reasonable throughput on two 80 GB A100s in fp16, or on a single H100 with 8-bit quantization. The 8B variant fits on consumer-grade hardware (RTX 4090, Apple M3 Max) with 4-bit quantization and is the basis for most edge/on-device experiments.
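A quick sanity check on those hardware claims: weight memory is roughly parameter count times bits per weight, with KV cache and activations adding more on top. A minimal back-of-the-envelope sketch (the helper name is ours, not a library function):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate VRAM for model weights alone; KV cache and
    activations add several more GB depending on batch and context."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3 70B in fp16: ~140 GB of weights -> two 80 GB A100s.
print(weight_memory_gb(70, 16))  # 140.0
# Llama 3 8B at 4-bit: ~4 GB of weights -> fits a 24 GB RTX 4090
# with ample headroom for KV cache.
print(weight_memory_gb(8, 4))    # 4.0
```

The same arithmetic explains the single-H100 case: 70B at 8-bit is ~70 GB of weights, which just fits in 80 GB with room for a modest KV cache.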
The Llama 3 Community License permits commercial use for most organisations (a 700 million monthly-active-user threshold applies; see Meta's license for details).
Mistral models
Mistral Small (Apache-2.0) punches above its weight for code and instruction-following tasks. Codestral (256k context, Mistral license) remains one of the best open-weights options specifically for code completion and fill-in-the-middle.
Running costs vs API costs
At scale, self-hosting breaks even against API providers at roughly 1M–5M tokens/day depending on hardware amortization. Below that, API providers are almost always cheaper once engineering time is factored in. Above it, the economics strongly favor self-hosting.
What's missing
Multimodal support lags behind frontier closed models. Tool calling reliability on sub-70B open models is noticeably lower than Claude Sonnet or GPT-4o. For production agentic workflows, most teams still default to closed APIs and use open-weights models for cost reduction on high-volume non-agentic tasks.