能力 · 2026-06-29
支援影像輸入的 AI 模型
對比可接受影像與文字一同輸入的 AI 模型 —— 多模態理解場景的核心選型。
這是什麼?
- 視覺語言模型在文字之外(或替代純文字)接受影像輸入。
- 多數也接受文字並以文字回覆 —— 屬於多模態 LLM,而非影像生成模型。
為什麼重要
- 典型場景:文件理解(掃描件、PDF、截圖)、從截圖做 UI/程式碼審查、商品圖問答、無障礙(alt 文字)、醫學/衛星影像等。
- 計費通常按影像張數疊加底層 token 成本 —— 請查看各服務商的 offering 表。
436 個模型支援此能力
顯示前 60 項,共 436 項。 用 完整目錄 進一步篩選。
Frequently asked questions
How many AI models support 圖像輸入?
436 canonical models in our database currently support 圖像輸入. The list is regenerated on every data refresh, so it always reflects the latest releases tracked in our catalogue.
What is the cheapest model with 圖像輸入?
PaddleOCR-VL from novita-ai is currently the lowest-priced option, at $0.020 per 1M input tokens and $0.020 per 1M output tokens. The full table above is sorted price-ascending.
Which model with 圖像輸入 has the largest context window?
Llama 4 Scout 17B Instruct (US) (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside 圖像輸入.
Which models are available on the most providers?
Production-readiness usually correlates with how many independent providers host the same weights. The top three by provider count are: Kimi K2.6 (49), Kimi K2.5 (48), Claude Sonnet 4.6 (31).
How is 圖像輸入 different from a regular LLM?
Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.
How often is this list updated?
Daily. Our data pipeline syncs once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.
Explore more
Top models with this capability
- PaddleOCR-VL$0.02 in / $0.02 out
- Llama-3.2-11B-Vision-Instruct$0.05 in / $0.05 out
- Gemma 3 4B IT$0.04 in / $0.08 out
- Google Gemma 3 27B Instruct$0.03 in / $0.11 out
- Model Router$0.14 in / $0.00 out
Other capabilities
Best-of lists you might also want
Pricing comparisons
最近更新:
Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.
Pricing and capabilities are refreshed daily and reconciled against each provider's official documentation. Always verify critical production decisions with the provider directly.