Funktion · 2026-05-12
KI-Modelle mit Vision-Eingabe
Modelle, die Bilder zusammen mit Text akzeptieren — multimodales Verständnis.
Was ist das?
- Vision-Sprachmodelle akzeptieren Bildinputs neben oder anstelle von Text.
- Die meisten antworten mit Text — es sind multimodale LLMs, keine Bildgeneratoren.
Warum wichtig
- Use Cases: Dokumentenverständnis (Scan, PDF, Screenshot), UI-/Code-Review aus Screenshots, Produktfotos, Barrierefreiheit (Alt-Text), Medizin/Satellitenbilder.
- Die Abrechnung enthält oft pro Bild zusätzliche Tokenkosten — siehe Offering-Tabellen der Anbieter.
382 Modelle mit dieser Funktion
Top 60 von 382 angezeigt. Im vollständigen Verzeichnis weiter filtern.
Frequently asked questions
How many AI models support Bild-Eingabe?
382 canonical models in our database currently support Bild-Eingabe. The list is regenerated on every data refresh, so it always reflects the latest model releases from models.dev.
What is the cheapest model with Bild-Eingabe?
dots.ocr from chutes is currently the lowest-priced option, at $0.010 per 1M input tokens and $0.011 per 1M output tokens. The full table above is sorted price-ascending.
Which model with Bild-Eingabe has the largest context window?
Llama 4 Scout 17B Instruct (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside Bild-Eingabe.
Which models are available on the most providers?
Production-readiness usually correlates with how many independent providers host the same weights. The top three by provider count are: Kimi K2.5 (45), Kimi K2.6 (31), Qwen3.5 397B-A17B (22).
How is Bild-Eingabe different from a regular LLM?
Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.
How often is this list updated?
Daily. Our data pipeline pulls models.dev once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.
Explore more
Top models with this capability
- dots.ocr$0.01 in / $0.01 out
- Gemma 3 4B$0.01 in / $0.03 out
- PaddleOCR-VL$0.02 in / $0.02 out
- Llama-3.2-11B-Vision-Instruct$0.05 in / $0.05 out
- Gemma 3 12B$0.03 in / $0.10 out
Other capabilities
Best-of lists you might also want
Pricing comparisons
Zuletzt aktualisiert:
Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.
Data is sourced from models.dev and normalized for comparison. Prices and capabilities may change. Always verify critical production decisions with the provider's official documentation.