能力 · 2026-06-29
支持图像输入的 AI 模型
对比可接受图像与文本一同输入的 AI 模型 —— 多模态理解场景的核心选型。
这是什么?
- 视觉语言模型在文本之外(或替代纯文本)接受图像输入。
- 多数也接受文本并以文本回复 —— 属于多模态 LLM,而非图像生成模型。
为什么重要
- 典型场景:文档理解(扫描件、PDF、截图)、从截图做 UI/代码审查、商品图问答、无障碍(alt 文本)、医学/卫星影像等。
- 计费通常按图像张数叠加底层 token 成本 —— 请查看各服务商的 offering 表。
436 个模型支持此能力
显示前 60 / 共 436 项。 用 完整目录 进一步筛选。
Frequently asked questions
How many AI models support 图像输入?
436 canonical models in our database currently support 图像输入. The list is regenerated on every data refresh, so it always reflects the latest releases tracked in our catalogue.
What is the cheapest model with 图像输入?
PaddleOCR-VL from novita-ai is currently the lowest-priced option, at $0.020 per 1M input tokens and $0.020 per 1M output tokens. The full table above is sorted price-ascending.
Which model with 图像输入 has the largest context window?
Llama 4 Scout 17B Instruct (US) (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside 图像输入.
Which models are available on the most providers?
Production-readiness usually correlates with how many independent providers host the same weights. The top three by provider count are: Kimi K2.6 (49), Kimi K2.5 (48), Claude Sonnet 4.6 (31).
How is 图像输入 different from a regular LLM?
Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.
How often is this list updated?
Daily. Our data pipeline syncs once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.
Explore more
Top models with this capability
- PaddleOCR-VL$0.02 in / $0.02 out
- Llama-3.2-11B-Vision-Instruct$0.05 in / $0.05 out
- Gemma 3 4B IT$0.04 in / $0.08 out
- Google Gemma 3 27B Instruct$0.03 in / $0.11 out
- Model Router$0.14 in / $0.00 out
Other capabilities
Best-of lists you might also want
Pricing comparisons
最近更新:
Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.
Pricing and capabilities are refreshed daily and reconciled against each provider's official documentation. Always verify critical production decisions with the provider directly.