Funktion · 2026-06-29

KI-Modelle mit Vision-Eingabe

Modelle, die Bilder zusammen mit Text akzeptieren — multimodales Verständnis.

Was ist das?

Vision-Sprachmodelle akzeptieren Bildinputs neben oder anstelle von Text.
Die meisten antworten mit Text — es sind multimodale LLMs, keine Bildgeneratoren.

Warum wichtig

Use Cases: Dokumentenverständnis (Scan, PDF, Screenshot), UI-/Code-Review aus Screenshots, Produktfotos, Barrierefreiheit (Alt-Text), Medizin/Satellitenbilder.
Die Abrechnung enthält oft pro Bild zusätzliche Tokenkosten — siehe Offering-Tabellen der Anbieter.

436 Modelle mit dieser Funktion

Modell	Anbieter	Eingabe / 1M	Ausgabe / 1M	Kontext	Hoster
PaddleOCR-VL	novita-ai	$0.020	$0.020	16K	1
Llama-3.2-11B-Vision-Instruct	Meta	$0.049	$0.049	128K	9
Gemma 3 4B IT	Google	$0.040	$0.080	128K	4
Google Gemma 3 27B Instruct	Google	$0.030	$0.110	203K	10
Model Router	azure	$0.140	Unknown	128K	1
Model Router	azure-cognitive-services	$0.140	Unknown	128K	1
Google Gemma 3 12B	Google	$0.050	$0.100	131K	7
Qwen3.5 9B	Alibaba (Qwen)	$0.040	$0.150	262K	14
Ministral 3 3B 2512	Mistral	$0.100	$0.100	131K	3
Ministral 3B	llmgateway	$0.100	$0.100	131K	1
Reka Edge	kilo	$0.100	$0.100	16K	1
Reka Edge	openrouter	$0.100	$0.100	16K	1
GLM-4.6V-Flash	Z.AI / Zhipu	$0.020	$0.210	128K	3
Mistral Small 3.2 24B	Mistral	$0.060	$0.180	128K	3
Qwen2.5 VL 32B Instruct	Alibaba (Qwen)	$0.050	$0.220	131K	3
Ministral 3 8B 2512	Mistral	$0.150	$0.150	262K	3
Pixtral 12B	Mistral	$0.150	$0.150	128K	2
Nova Lite	vercel	$0.060	$0.240	300K	1
Ministral 8B	llmgateway	$0.150	$0.150	262K	1
Amazon: Nova Lite 1.0	kilo	$0.060	$0.240	300K	1
Nova Lite	amazon-bedrock	$0.060	$0.240	300K	1
Nova Lite 1.0	openrouter	$0.060	$0.240	300K	1
Llama 3.2 11B Vision Instruct	Meta	$0.160	$0.160	128K	1
Qwen-Omni Turbo	Alibaba (Qwen)	$0.070	$0.270	33K	3
Llama Guard 4 12B	Meta	$0.180	$0.180	164K	3
Arcee AI: Spotlight	kilo	$0.180	$0.180	131K	1
Seed 1.6 Flash (250715)	llmgateway	$0.070	$0.300	256K	1
Gemini 2.0 Flash-Lite	Google	$0.075	$0.300	1.05M	4
ByteDance Seed: Seed 1.6 Flash	kilo	$0.075	$0.300	262K	1
Seed 1.6 Flash	openrouter	$0.075	$0.300	262K	1
Llama 4 Scout	Meta	$0.080	$0.300	328K	5
Gemma 4 26B A4B IT	Google	$0.060	$0.330	262K	16
Gemma 4 31B IT	Google	$0.100	$0.300	262K	26
Llama 4 Scout 17B 16E Instruct	Meta	$0.100	$0.300	128K	11
Mistral Small 3.1	Mistral	$0.100	$0.300	128K	4
Ministral 3 14B 2512	Mistral	$0.200	$0.200	262K	3
Mistral Small 3.2	Mistral	$0.100	$0.300	128K	2
Phi-4-multimodal	Microsoft	$0.080	$0.320	128K	2
Ministral 14B	llmgateway	$0.200	$0.200	262K	1
Pixtral 12B 2409	scaleway	$0.200	$0.200	128K	1
Meta Llama Guard 4 12B	Meta	$0.210	$0.210	131K	1
MiMo V2.5	opencode-go	$0.140	$0.280	1M	1
MiMo-V2.5	llmgateway	$0.140	$0.280	1M	1
Gemini 2.5 Flash Lite Preview 09-2025	Google	$0.090	$0.360	1.05M	6
Qwen3.5 Flash	Alibaba (Qwen)	$0.090	$0.360	1M	4
GPT-5 Nano	OpenAI	$0.050	$0.400	400K	22
Kilo Auto Small	kilo	$0.050	$0.400	400K	1
Coding Xiaomi MiMo-V2.5	aihubmix	$0.080	$0.400	1.05M	1
Gemini 2.5 Flash-Lite	Google	$0.100	$0.400	1.05M	17
GPT-4.1 nano	OpenAI	$0.100	$0.400	1.05M	16
Gemini Flash-Lite Latest	Google	$0.100	$0.400	1.05M	5
Gemini 2.0 Flash	Google	$0.100	$0.400	1.05M	3
Qwen2.5-Omni 7B	Alibaba (Qwen)	$0.100	$0.400	33K	2
ByteDance Seed: Seed-2.0-Mini	kilo	$0.100	$0.400	262K	1
Seed-2.0-Mini	openrouter	$0.100	$0.400	262K	1
Nemotron 3 Nano Omni	NVIDIA	$0.130	$0.380	256K	3
Qwen3 VL 32B Instruct	Alibaba (Qwen)	$0.104	$0.416	131K	4
Qwen3 VL 8B Instruct	Alibaba (Qwen)	$0.080	$0.500	131K	6
Grok 4.1 Fast (Non-Reasoning)	xAI	$0.180	$0.450	128K	12
Grok 4.1 Fast (Reasoning)	xAI	$0.180	$0.450	128K	9

Top 60 von 436 angezeigt. Im vollständigen Verzeichnis weiter filtern.

Frequently asked questions

How many AI models support Bild-Eingabe?

436 canonical models in our database currently support Bild-Eingabe. The list is regenerated on every data refresh, so it always reflects the latest releases tracked in our catalogue.

What is the cheapest model with Bild-Eingabe?

PaddleOCR-VL from novita-ai is currently the lowest-priced option, at $0.020 per 1M input tokens and $0.020 per 1M output tokens. The full table above is sorted price-ascending.

Which model with Bild-Eingabe has the largest context window?

Llama 4 Scout 17B Instruct (US) (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside Bild-Eingabe.

Which models are available on the most providers?

Production-readiness usually correlates with how many independent providers host the same weights. The top three by provider count are: Kimi K2.6 (49), Kimi K2.5 (48), Claude Sonnet 4.6 (31).

How is Bild-Eingabe different from a regular LLM?

Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.

How often is this list updated?

Daily. Our data pipeline syncs once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.

Top models with this capability

PaddleOCR-VL$0.02 in / $0.02 out
Llama-3.2-11B-Vision-Instruct$0.05 in / $0.05 out
Gemma 3 4B IT$0.04 in / $0.08 out
Google Gemma 3 27B Instruct$0.03 in / $0.11 out
Model Router$0.14 in / $0.00 out

Other capabilities

Best-of lists you might also want

Pricing comparisons

Vendors in this list

Zuletzt aktualisiert: 2026-06-29

Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.

Pricing and capabilities are refreshed daily and reconciled against each provider's official documentation. Always verify critical production decisions with the provider directly.