KI‑Modell‑Intelligenz

Funktion · 2026-05-12

KI-Modelle mit Vision-Eingabe

Modelle, die Bilder zusammen mit Text akzeptieren — multimodales Verständnis.

Was ist das?

  • Vision-Sprachmodelle akzeptieren Bildinputs neben oder anstelle von Text.
  • Die meisten antworten mit Text — es sind multimodale LLMs, keine Bildgeneratoren.

Warum wichtig

  • Use Cases: Dokumentenverständnis (Scan, PDF, Screenshot), UI-/Code-Review aus Screenshots, Produktfotos, Barrierefreiheit (Alt-Text), Medizin/Satellitenbilder.
  • Die Abrechnung enthält oft pro Bild zusätzliche Tokenkosten — siehe Offering-Tabellen der Anbieter.

382 Modelle mit dieser Funktion

ModellAnbieterEingabe / 1MAusgabe / 1MKontextHoster
dots.ocrchutes$0.010$0.011131K1
Gemma 3 4BGoogle$0.010$0.02733K7
PaddleOCR-VLnovita-ai$0.020$0.02016K1
Llama-3.2-11B-Vision-InstructMeta$0.049$0.049128K8
Gemma 3 12BGoogle$0.030$0.10033K10
Gemma 3 27BGoogle$0.027$0.109131K14
Model Routerazure-cognitive-services$0.140Unknown128K1
Model Routerazure$0.140Unknown128K1
Gemini 1.5 Flash-8BGoogle$0.037$0.1501M1
Qwen/Qwen3.5-9BAlibaba (Qwen)$0.050$0.150262K6
Qwen/Qwen3-VL-30B-A3B-ThinkingAlibaba (Qwen)$0.100$0.100262K6
Qwen/Qwen3-VL-30B-A3B-InstructAlibaba (Qwen)$0.100$0.100262K6
Qwen/Qwen3-VL-8B-InstructAlibaba (Qwen)$0.100$0.100262K5
Reka Edgekilo$0.100$0.10016K1
Ministral 3Bllmgateway$0.100$0.100131K1
GLM-4.6V-FlashZ.AI / Zhipu$0.020$0.210128K3
Mistral Small 3.2 24B InstructMistral$0.060$0.18096K3
Qwen/Qwen2.5-VL-32B-InstructAlibaba (Qwen)$0.050$0.220131K6
Pixtral 12BMistral$0.150$0.150128K2
Amazon: Nova Lite 1.0kilo$0.060$0.240300K1
Nova Liteamazon-bedrock$0.060$0.240300K1
Nova Litevercel$0.060$0.240300K1
Ministral 8Bllmgateway$0.150$0.150262K1
Llama 3.2 11B Vision InstructMeta$0.160$0.160128K1
Qwen-Omni TurboAlibaba (Qwen)$0.070$0.27033K3
Llama Guard 4 12BMeta$0.180$0.180131K3
Arcee AI: Spotlightkilo$0.180$0.180131K1
Seed 1.6 Flash (250715)llmgateway$0.070$0.300256K1
Gemini 2.0 Flash LiteGoogle$0.075$0.3001.05M8
ByteDance Seed: Seed 1.6 Flashkilo$0.075$0.300262K1
Gemini 1.5 FlashGoogle$0.075$0.3001M1
Llama 4 Scout 17B 16E InstructMeta$0.080$0.300128K12
Llama-4-Scout-17B-16E-Instruct-FP8Meta$0.080$0.300128K5
Gemma 4 26BGoogle$0.100$0.300256K8
Mistral Small 3.1Mistral$0.100$0.300128K3
Phi-4-multimodalMicrosoft$0.080$0.320128K2
Mistral Small 3.2Mistral$0.100$0.300128K2
Pixtral 12B 2409scaleway$0.200$0.200128K1
Molmo 2 8Bnano-gpt$0.200$0.20037K1
Ministral 14Bllmgateway$0.200$0.200262K1
Mistral Small 3.2 24B Instruct (2506)Mistral$0.100$0.310128K5
Meta Llama Guard 4 12BMeta$0.210$0.210131K1
GPT-5 NanoOpenAI$0.050$0.400400K17
Kilo Auto Smallkilo$0.050$0.400400K1
Gemini 2.5 Flash LiteGoogle$0.100$0.4001.05M13
GPT-4.1 nanoOpenAI$0.100$0.4001.05M12
Gemini 2.5 Flash Lite Preview 09-25Google$0.100$0.4001.05M9
Gemini 2.0 FlashGoogle$0.100$0.4001.05M6
Gemini 2.5 Flash Lite Preview 06-17Google$0.100$0.4001.05M4
Gemini 2.0 FlashGoogle$0.100$0.4001.05M3
Qwen/Qwen3-Omni-30B-A3B-ThinkingAlibaba (Qwen)$0.100$0.40066K3
Qwen/Qwen3-Omni-30B-A3B-InstructAlibaba (Qwen)$0.100$0.40066K3
Qwen3.5 FlashAlibaba (Qwen)$0.100$0.4001M3
Qwen2.5-Omni 7BAlibaba (Qwen)$0.100$0.40033K2
Gemini Flash-Lite LatestGoogle$0.100$0.4001.05M2
ByteDance Seed: Seed-2.0-Minikilo$0.100$0.400262K1
Gemma 4 31BGoogle$0.130$0.380256K11
Qwen/Qwen3-VL-32B-InstructAlibaba (Qwen)$0.104$0.416262K3
Grok 4.1 Fast (Reasoning)xAI$0.180$0.450128K10
Grok 4 Fast (Reasoning)xAI$0.180$0.4502M9

Top 60 von 382 angezeigt. Im vollständigen Verzeichnis weiter filtern.

Frequently asked questions

How many AI models support Bild-Eingabe?

382 canonical models in our database currently support Bild-Eingabe. The list is regenerated on every data refresh, so it always reflects the latest model releases from models.dev.

What is the cheapest model with Bild-Eingabe?

dots.ocr from chutes is currently the lowest-priced option, at $0.010 per 1M input tokens and $0.011 per 1M output tokens. The full table above is sorted price-ascending.

Which model with Bild-Eingabe has the largest context window?

Llama 4 Scout 17B Instruct (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside Bild-Eingabe.

Which models are available on the most providers?

Production-readiness usually correlates with how many independent providers host the same weights. The top three by provider count are: Kimi K2.5 (45), Kimi K2.6 (31), Qwen3.5 397B-A17B (22).

How is Bild-Eingabe different from a regular LLM?

Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.

How often is this list updated?

Daily. Our data pipeline pulls models.dev once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.

Zuletzt aktualisiert:

Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.

Data is sourced from models.dev and normalized for comparison. Prices and capabilities may change. Always verify critical production decisions with the provider's official documentation.