AI Model Intelligence

Feature · 2026-05-12

AI Models That Support Image Input

A comparison of AI models that accept image input alongside text.

What is this?

  • Vision-language models accept image input in addition to (or instead of) text.
  • Most are multimodal LLMs with text input and output, not image-generation models.

Why it matters

  • Use cases: document understanding (scans, PDFs, screenshots), UI/code screenshot review, product-photo Q&A, accessibility (alt text), medical and satellite imagery, and more.
  • Pricing often combines a per-image charge with token billing; check each provider's offering table.
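To make the billing model concrete, here is a minimal cost-estimation sketch. It assumes the common scheme in which each image is converted into a provider-specific number of input tokens billed at the input rate; the `image_tokens` figure below is hypothetical, so check each provider's documentation for the actual conversion.

```python
def estimate_cost_usd(text_tokens: int, image_tokens: int,
                      output_tokens: int,
                      input_price_per_1m: float,
                      output_price_per_1m: float) -> float:
    """Estimate request cost when images are billed as input tokens."""
    input_total = text_tokens + image_tokens
    return (input_total * input_price_per_1m +
            output_tokens * output_price_per_1m) / 1_000_000

# Example: Gemma 3 27B rates from the table ($0.027 in / $0.109 out),
# 500 text tokens, an image assumed to cost ~1,000 tokens, 300 output tokens.
cost = estimate_cost_usd(500, 1000, 300, 0.027, 0.109)
print(f"${cost:.6f}")  # -> $0.000073
```

At these rates a single vision request costs a small fraction of a cent; the per-image token count, not the text prompt, usually dominates the input side.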

382 models support this feature

Model | Vendor | Input / 1M | Output / 1M | Context | Providers
dots.ocr | chutes | $0.010 | $0.011 | 131K | 1
Gemma 3 4B | Google | $0.010 | $0.027 | 33K | 7
PaddleOCR-VL | novita-ai | $0.020 | $0.020 | 16K | 1
Llama-3.2-11B-Vision-Instruct | Meta | $0.049 | $0.049 | 128K | 8
Gemma 3 12B | Google | $0.030 | $0.100 | 33K | 10
Gemma 3 27B | Google | $0.027 | $0.109 | 131K | 14
Model Router | azure-cognitive-services | $0.140 | Unknown | 128K | 1
Model Router | azure | $0.140 | Unknown | 128K | 1
Gemini 1.5 Flash-8B | Google | $0.037 | $0.150 | 1M | 1
Qwen/Qwen3.5-9B | Alibaba (Qwen) | $0.050 | $0.150 | 262K | 6
Qwen/Qwen3-VL-30B-A3B-Thinking | Alibaba (Qwen) | $0.100 | $0.100 | 262K | 6
Qwen/Qwen3-VL-30B-A3B-Instruct | Alibaba (Qwen) | $0.100 | $0.100 | 262K | 6
Qwen/Qwen3-VL-8B-Instruct | Alibaba (Qwen) | $0.100 | $0.100 | 262K | 5
Reka Edge | kilo | $0.100 | $0.100 | 16K | 1
Ministral 3B | llmgateway | $0.100 | $0.100 | 131K | 1
GLM-4.6V-Flash | Z.AI / Zhipu | $0.020 | $0.210 | 128K | 3
Mistral Small 3.2 24B Instruct | Mistral | $0.060 | $0.180 | 96K | 3
Qwen/Qwen2.5-VL-32B-Instruct | Alibaba (Qwen) | $0.050 | $0.220 | 131K | 6
Pixtral 12B | Mistral | $0.150 | $0.150 | 128K | 2
Amazon: Nova Lite 1.0 | kilo | $0.060 | $0.240 | 300K | 1
Nova Lite | amazon-bedrock | $0.060 | $0.240 | 300K | 1
Nova Lite | vercel | $0.060 | $0.240 | 300K | 1
Ministral 8B | llmgateway | $0.150 | $0.150 | 262K | 1
Llama 3.2 11B Vision Instruct | Meta | $0.160 | $0.160 | 128K | 1
Qwen-Omni Turbo | Alibaba (Qwen) | $0.070 | $0.270 | 33K | 3
Llama Guard 4 12B | Meta | $0.180 | $0.180 | 131K | 3
Arcee AI: Spotlight | kilo | $0.180 | $0.180 | 131K | 1
Seed 1.6 Flash (250715) | llmgateway | $0.070 | $0.300 | 256K | 1
Gemini 2.0 Flash Lite | Google | $0.075 | $0.300 | 1.05M | 8
ByteDance Seed: Seed 1.6 Flash | kilo | $0.075 | $0.300 | 262K | 1
Gemini 1.5 Flash | Google | $0.075 | $0.300 | 1M | 1
Llama 4 Scout 17B 16E Instruct | Meta | $0.080 | $0.300 | 128K | 12
Llama-4-Scout-17B-16E-Instruct-FP8 | Meta | $0.080 | $0.300 | 128K | 5
Gemma 4 26B | Google | $0.100 | $0.300 | 256K | 8
Mistral Small 3.1 | Mistral | $0.100 | $0.300 | 128K | 3
Phi-4-multimodal | Microsoft | $0.080 | $0.320 | 128K | 2
Mistral Small 3.2 | Mistral | $0.100 | $0.300 | 128K | 2
Pixtral 12B 2409 | scaleway | $0.200 | $0.200 | 128K | 1
Molmo 2 8B | nano-gpt | $0.200 | $0.200 | 37K | 1
Ministral 14B | llmgateway | $0.200 | $0.200 | 262K | 1
Mistral Small 3.2 24B Instruct (2506) | Mistral | $0.100 | $0.310 | 128K | 5
Meta Llama Guard 4 12B | Meta | $0.210 | $0.210 | 131K | 1
GPT-5 Nano | OpenAI | $0.050 | $0.400 | 400K | 17
Kilo Auto Small | kilo | $0.050 | $0.400 | 400K | 1
Gemini 2.5 Flash Lite | Google | $0.100 | $0.400 | 1.05M | 13
GPT-4.1 nano | OpenAI | $0.100 | $0.400 | 1.05M | 12
Gemini 2.5 Flash Lite Preview 09-25 | Google | $0.100 | $0.400 | 1.05M | 9
Gemini 2.0 Flash | Google | $0.100 | $0.400 | 1.05M | 6
Gemini 2.5 Flash Lite Preview 06-17 | Google | $0.100 | $0.400 | 1.05M | 4
Gemini 2.0 Flash | Google | $0.100 | $0.400 | 1.05M | 3
Qwen/Qwen3-Omni-30B-A3B-Thinking | Alibaba (Qwen) | $0.100 | $0.400 | 66K | 3
Qwen/Qwen3-Omni-30B-A3B-Instruct | Alibaba (Qwen) | $0.100 | $0.400 | 66K | 3
Qwen3.5 Flash | Alibaba (Qwen) | $0.100 | $0.400 | 1M | 3
Qwen2.5-Omni 7B | Alibaba (Qwen) | $0.100 | $0.400 | 33K | 2
Gemini Flash-Lite Latest | Google | $0.100 | $0.400 | 1.05M | 2
ByteDance Seed: Seed-2.0-Mini | kilo | $0.100 | $0.400 | 262K | 1
Gemma 4 31B | Google | $0.130 | $0.380 | 256K | 11
Qwen/Qwen3-VL-32B-Instruct | Alibaba (Qwen) | $0.104 | $0.416 | 262K | 3
Grok 4.1 Fast (Reasoning) | xAI | $0.180 | $0.450 | 128K | 10
Grok 4 Fast (Reasoning) | xAI | $0.180 | $0.450 | 2M | 9

Showing the top 60 of 382 models. Use the model list to narrow the results further.

Frequently asked questions

How many AI models support image input?

382 canonical models in our database currently support image input. The list is regenerated on every data refresh, so it always reflects the latest model releases from models.dev.

What is the cheapest model with image input?

dots.ocr from chutes is currently the lowest-priced option, at $0.010 per 1M input tokens and $0.011 per 1M output tokens. The full table above is sorted price-ascending.
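The price-ascending order appears consistent with sorting on the combined input-plus-output rate; whether the site blends the two rates exactly this way is an assumption. A minimal sketch over a few rows from the table:

```python
# Sort vision models by a blended per-1M-token price (input + output).
# Blending by simple sum is an assumption, not the site's documented method.
models = [
    ("Gemma 3 4B", 0.010, 0.027),
    ("dots.ocr", 0.010, 0.011),
    ("PaddleOCR-VL", 0.020, 0.020),
    ("Llama-3.2-11B-Vision-Instruct", 0.049, 0.049),
]

cheapest_first = sorted(models, key=lambda m: m[1] + m[2])
print(cheapest_first[0][0])  # -> dots.ocr
```

Note that two models can share an input price (here, dots.ocr and Gemma 3 4B at $0.010) yet rank differently once output pricing is factored in.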

Which model with image input has the largest context window?

Llama 4 Scout 17B Instruct (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside image input.

Which models are available on the most providers?

Production-readiness usually correlates with how many independent providers host the same weights. The top three by provider count are: Kimi K2.5 (45), Kimi K2.6 (31), Qwen3.5 397B-A17B (22).

How are image-input models different from a regular LLM?

Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.
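In practice, "accept image input alongside text" means the request interleaves text and image parts in a single user message. The sketch below builds such a payload following the OpenAI-style "content parts" convention that many providers support; the exact field names vary by provider, so treat this as an illustrative shape rather than a universal schema.

```python
import base64
import json

def build_vision_message(prompt: str, image_bytes: bytes,
                         mime: str = "image/png") -> dict:
    """Build one user message containing a text part and an inline image part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Hypothetical usage: the bytes would normally come from reading an image file.
msg = build_vision_message("What does this screenshot show?", b"\x89PNG...")
print(json.dumps(msg)[:80])
```

The model's reply comes back as ordinary text in the assistant message, which is why these models pair naturally with the document-understanding and screenshot-review use cases above.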

How often is this list updated?

Daily. Our data pipeline pulls models.dev once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.
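The filtering step of that refresh can be sketched as follows. The catalog layout assumed here (provider → models → per-model `modalities` lists) is based on models.dev's published API and should be verified against the live data before relying on it.

```python
# Given a models.dev-style catalog, keep only models whose input
# modalities include "image". The JSON schema is an assumption.
def vision_models(catalog: dict) -> list[str]:
    names = []
    for provider, pdata in catalog.items():
        for model_id, mdata in pdata.get("models", {}).items():
            if "image" in mdata.get("modalities", {}).get("input", []):
                names.append(f"{provider}/{model_id}")
    return sorted(names)

# Tiny hand-made sample in the assumed shape:
sample = {
    "google": {"models": {
        "gemma-3-4b": {"modalities": {"input": ["text", "image"],
                                      "output": ["text"]}},
        "text-only": {"modalities": {"input": ["text"],
                                     "output": ["text"]}},
    }},
}
print(vision_models(sample))  # -> ['google/gemma-3-4b']
```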

Last updated:

Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.

Data is sourced from models.dev and normalized for comparison. Prices and capabilities may change. Always verify critical production decisions with the provider's official documentation.