KI‑Modell‑Intelligenz

Funktion · 2026-06-29

KI-Modelle mit Vision-Eingabe

Modelle, die Bilder zusammen mit Text akzeptieren — multimodales Verständnis.

Was ist das?

  • Vision-Sprachmodelle akzeptieren Bildinputs neben oder anstelle von Text.
  • Die meisten antworten mit Text — es sind multimodale LLMs, keine Bildgeneratoren.

Warum wichtig

  • Use Cases: Dokumentenverständnis (Scan, PDF, Screenshot), UI-/Code-Review aus Screenshots, Produktfotos, Barrierefreiheit (Alt-Text), Medizin/Satellitenbilder.
  • Die Abrechnung enthält oft pro Bild zusätzliche Tokenkosten — siehe Offering-Tabellen der Anbieter.

436 Modelle mit dieser Funktion

ModellAnbieterEingabe / 1MAusgabe / 1MKontextHoster
PaddleOCR-VLnovita-ai$0.020$0.02016K1
Llama-3.2-11B-Vision-InstructMeta$0.049$0.049128K9
Gemma 3 4B ITGoogle$0.040$0.080128K4
Google Gemma 3 27B InstructGoogle$0.030$0.110203K10
Model Routerazure$0.140Unknown128K1
Model Routerazure-cognitive-services$0.140Unknown128K1
Google Gemma 3 12BGoogle$0.050$0.100131K7
Qwen3.5 9BAlibaba (Qwen)$0.040$0.150262K14
Ministral 3 3B 2512Mistral$0.100$0.100131K3
Ministral 3Bllmgateway$0.100$0.100131K1
Reka Edgekilo$0.100$0.10016K1
Reka Edgeopenrouter$0.100$0.10016K1
GLM-4.6V-FlashZ.AI / Zhipu$0.020$0.210128K3
Mistral Small 3.2 24BMistral$0.060$0.180128K3
Qwen2.5 VL 32B InstructAlibaba (Qwen)$0.050$0.220131K3
Ministral 3 8B 2512Mistral$0.150$0.150262K3
Pixtral 12BMistral$0.150$0.150128K2
Nova Litevercel$0.060$0.240300K1
Ministral 8Bllmgateway$0.150$0.150262K1
Amazon: Nova Lite 1.0kilo$0.060$0.240300K1
Nova Liteamazon-bedrock$0.060$0.240300K1
Nova Lite 1.0openrouter$0.060$0.240300K1
Llama 3.2 11B Vision InstructMeta$0.160$0.160128K1
Qwen-Omni TurboAlibaba (Qwen)$0.070$0.27033K3
Llama Guard 4 12BMeta$0.180$0.180164K3
Arcee AI: Spotlightkilo$0.180$0.180131K1
Seed 1.6 Flash (250715)llmgateway$0.070$0.300256K1
Gemini 2.0 Flash-LiteGoogle$0.075$0.3001.05M4
ByteDance Seed: Seed 1.6 Flashkilo$0.075$0.300262K1
Seed 1.6 Flashopenrouter$0.075$0.300262K1
Llama 4 ScoutMeta$0.080$0.300328K5
Gemma 4 26B A4B ITGoogle$0.060$0.330262K16
Gemma 4 31B ITGoogle$0.100$0.300262K26
Llama 4 Scout 17B 16E InstructMeta$0.100$0.300128K11
Mistral Small 3.1Mistral$0.100$0.300128K4
Ministral 3 14B 2512Mistral$0.200$0.200262K3
Mistral Small 3.2Mistral$0.100$0.300128K2
Phi-4-multimodalMicrosoft$0.080$0.320128K2
Ministral 14Bllmgateway$0.200$0.200262K1
Pixtral 12B 2409scaleway$0.200$0.200128K1
Meta Llama Guard 4 12BMeta$0.210$0.210131K1
MiMo V2.5opencode-go$0.140$0.2801M1
MiMo-V2.5llmgateway$0.140$0.2801M1
Gemini 2.5 Flash Lite Preview 09-2025Google$0.090$0.3601.05M6
Qwen3.5 FlashAlibaba (Qwen)$0.090$0.3601M4
GPT-5 NanoOpenAI$0.050$0.400400K22
Kilo Auto Smallkilo$0.050$0.400400K1
Coding Xiaomi MiMo-V2.5aihubmix$0.080$0.4001.05M1
Gemini 2.5 Flash-LiteGoogle$0.100$0.4001.05M17
GPT-4.1 nanoOpenAI$0.100$0.4001.05M16
Gemini Flash-Lite LatestGoogle$0.100$0.4001.05M5
Gemini 2.0 FlashGoogle$0.100$0.4001.05M3
Qwen2.5-Omni 7BAlibaba (Qwen)$0.100$0.40033K2
ByteDance Seed: Seed-2.0-Minikilo$0.100$0.400262K1
Seed-2.0-Miniopenrouter$0.100$0.400262K1
Nemotron 3 Nano OmniNVIDIA$0.130$0.380256K3
Qwen3 VL 32B InstructAlibaba (Qwen)$0.104$0.416131K4
Qwen3 VL 8B InstructAlibaba (Qwen)$0.080$0.500131K6
Grok 4.1 Fast (Non-Reasoning)xAI$0.180$0.450128K12
Grok 4.1 Fast (Reasoning)xAI$0.180$0.450128K9

Top 60 von 436 angezeigt. Im vollständigen Verzeichnis weiter filtern.

Frequently asked questions

How many AI models support Bild-Eingabe?

436 canonical models in our database currently support Bild-Eingabe. The list is regenerated on every data refresh, so it always reflects the latest releases tracked in our catalogue.

What is the cheapest model with Bild-Eingabe?

PaddleOCR-VL from novita-ai is currently the lowest-priced option, at $0.020 per 1M input tokens and $0.020 per 1M output tokens. The full table above is sorted price-ascending.

Which model with Bild-Eingabe has the largest context window?

Llama 4 Scout 17B Instruct (US) (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside Bild-Eingabe.

Which models are available on the most providers?

Production-readiness usually correlates with how many independent providers host the same weights. The top three by provider count are: Kimi K2.6 (49), Kimi K2.5 (48), Claude Sonnet 4.6 (31).

How is Bild-Eingabe different from a regular LLM?

Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.

How often is this list updated?

Daily. Our data pipeline syncs once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.

Zuletzt aktualisiert:

Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.

Pricing and capabilities are refreshed daily and reconciled against each provider's official documentation. Always verify critical production decisions with the provider directly.