Input vs Output Token Pricing in LLMs

Why output tokens cost more than input tokens, and how to estimate monthly LLM API costs realistically.

Almost every commercial LLM API charges separately for input tokens (the prompt you send) and output tokens (the response you get back). Output tokens typically cost 3× to 5× more than input tokens. Understanding this asymmetry is the difference between a cost projection that survives contact with production and one that doesn't.

Why output costs more

Input tokens are processed in a single, highly parallel forward pass (the "prefill") that can be heavily batched and pipelined across many concurrent requests. Output tokens are generated autoregressively, one token per forward pass, each conditioned on everything before it, so the work is inherently sequential and cannot be batched as effectively. Each output token therefore costs the provider more to produce than each input token, and pricing reflects that asymmetry.
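To make the asymmetry concrete, here is a toy Python sketch, not a real inference engine: model_step stands in for one transformer forward pass, and the pass counter shows that prefill covers the whole prompt in a single pass while decoding needs a fresh pass for every generated token.

```python
# Toy illustration of prefill vs. decode. `model_step` is a stand-in for a
# transformer forward pass; real engines parallelize prefill across all
# prompt tokens but must call the model once per token during decode.

forward_passes = 0

def model_step(context: list[str]) -> str:
    """Pretend forward pass: counts invocations, returns a fake next token."""
    global forward_passes
    forward_passes += 1
    return f"tok{len(context)}"

def generate(prompt: list[str], max_new_tokens: int) -> list[str]:
    context = list(prompt)
    next_tok = model_step(context)        # prefill: one pass over the whole prompt
    generated = [next_tok]
    for _ in range(max_new_tokens - 1):   # decode: one pass per additional token,
        context.append(next_tok)          # strictly sequential
        next_tok = model_step(context)
        generated.append(next_tok)
    return generated

out = generate(["Why", "do", "output", "tokens", "cost", "more", "?"], 8)
print(f"{len(out)} output tokens took {forward_passes} forward passes "
      f"(1 prefill + {forward_passes - 1} decode)")
```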

Estimating real-world cost

A useful first-pass formula, with prices quoted per million tokens (worked through in code after the examples below):

cost ≈ requests × ((avg_input_tokens × in_price + avg_output_tokens × out_price) / 1,000,000)

  • Chat assistant: input 1500, output 300 → output still accounts for 30-50% of cost despite being a fifth of the tokens.
  • RAG with 10 retrieved chunks: input 8000, output 500 → input dominates at 60-80% of cost.
  • Code generation: input 1000, output 1500 → output dominates at >70% of cost.
  • Summarization: input 50000, output 800 → input dominates at >85% of cost.
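
A minimal sketch applying the formula to the four workloads above. The $3/$15 per-million price pair (a 5× output premium) is illustrative, not any specific provider's rates:

```python
# Cost estimator implementing the formula above. Prices are placeholders
# in USD per 1,000,000 tokens; plug in your provider's actual rates.

def monthly_cost(requests: int, avg_in: int, avg_out: int,
                 in_price: float, out_price: float) -> float:
    """Estimated monthly spend; prices are per 1M tokens."""
    return requests * (avg_in * in_price + avg_out * out_price) / 1_000_000

IN_PRICE, OUT_PRICE = 3.00, 15.00   # assumed 5x output premium

workloads = {                        # (avg_input_tokens, avg_output_tokens)
    "chat assistant":  (1_500, 300),
    "RAG (10 chunks)": (8_000, 500),
    "code generation": (1_000, 1_500),
    "summarization":   (50_000, 800),
}

for name, (avg_in, avg_out) in workloads.items():
    total = monthly_cost(1_000_000, avg_in, avg_out, IN_PRICE, OUT_PRICE)
    out_share = avg_out * OUT_PRICE / (avg_in * IN_PRICE + avg_out * OUT_PRICE)
    print(f"{name:16s} ${total:>9,.0f}/mo at 1M requests, "
          f"output share {out_share:.0%}")
```

At a 5× premium, the chat workload splits exactly 50/50 between input and output spend, which is why a projection built on input tokens alone can miss half the bill.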

Hidden cost levers

Three further price components matter when running at scale:

  • Prompt caching (Anthropic, OpenAI): reused system prompts can drop input cost by 50-90%.
  • Tiered pricing for long context (Google, Anthropic): some providers charge a separate, higher rate once the input crosses a threshold such as 200K tokens; see each model detail page.
  • Reasoning tokens: thinking models (o-series, Claude Extended Thinking, Gemini Thinking, DeepSeek R1) bill the internal reasoning tokens — sometimes hidden, sometimes visible — at output rates.
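
As a back-of-envelope sketch of the caching lever: the blended input price falls linearly with your cache hit rate. The 0.1× cache-read multiplier below matches Anthropic's published ratio; other providers discount differently, and the sketch ignores any one-time cache-write surcharge:

```python
# Blended input price under prompt caching. Assumes cache reads cost
# 0.1x the standard input rate (Anthropic's ratio); verify against your
# provider's pricing page, and note cache *writes* may carry a surcharge.

def effective_input_price(base_price: float, cache_hit_rate: float,
                          cache_read_multiplier: float = 0.1) -> float:
    """Per-1M-token input price, blended over cached and uncached tokens."""
    return base_price * ((1 - cache_hit_rate)
                         + cache_hit_rate * cache_read_multiplier)

base = 3.00  # illustrative $ per 1M input tokens
for hit_rate in (0.0, 0.5, 0.9):
    price = effective_input_price(base, hit_rate)
    print(f"cache hit rate {hit_rate:.0%}: ${price:.2f}/1M input tokens")
```

At a 90% hit rate this yields an 81% reduction, consistent with the 50-90% range quoted above.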

Frequently asked questions

Why are output tokens more expensive than input tokens?

Generation requires a full autoregressive forward pass for every output token, which costs far more per token than reading the input in one parallel prefill pass. The 3–5× output premium most providers charge reflects that.

How do I estimate my actual input/output ratio?

Run a sample of representative prompts through the model and average the results. Production apps cluster around 5:1–20:1 input:output for chat, and 1:1–3:1 for code generation.
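
A minimal sketch of that measurement, assuming you log the per-request token counts that most APIs return in their usage metadata (field names vary by provider; the sample data here is invented):

```python
# Estimate your input:output ratio from logged per-request token usage.
# The (input_tokens, output_tokens) pairs below are made-up sample data;
# in practice, pull them from the usage block of each API response.

from statistics import mean

usage_log = [(1480, 310), (1620, 260), (1390, 350), (1555, 295)]

avg_in = mean(i for i, _ in usage_log)
avg_out = mean(o for _, o in usage_log)
print(f"avg input {avg_in:.0f}, avg output {avg_out:.0f}, "
      f"ratio {avg_in / avg_out:.1f}:1 input:output")
```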

Does prompt caching reduce input cost?

Yes. Anthropic's cache-read rate is roughly a tenth of its standard input price, and OpenAI discounts cached input tokens as well. If you reuse a long system prompt across many requests, cached reads can make up most of your input bill, and the savings compound with volume.

Data is sourced from models.dev and normalized for comparison. Prices and capabilities may change. Always verify critical production decisions with the provider's official documentation.