AI 모델 인텔리전스

기능 · 2026-05-12

AI Models with Vision Input

AI models that can take images alongside text input.

이게 뭔가요?

  • Vision-language models accept image input alongside (or instead of) text.
  • Most also accept text and reply with text — they are multimodal LLMs, not image generators.

왜 중요한가

  • Use cases: document understanding (scans, PDFs, screenshots), UI/code review from screenshots, product photography Q&A, accessibility (alt text), medical/satellite imagery.
  • Pricing usually charges per image plus the underlying token cost — check each provider's offering table.

이 기능을 지원하는 모델 382개

모델벤더입력 / 1M출력 / 1M컨텍스트제공자
dots.ocrchutes$0.010$0.011131K1
Gemma 3 4BGoogle$0.010$0.02733K7
PaddleOCR-VLnovita-ai$0.020$0.02016K1
Llama-3.2-11B-Vision-InstructMeta$0.049$0.049128K8
Gemma 3 12BGoogle$0.030$0.10033K10
Gemma 3 27BGoogle$0.027$0.109131K14
Model Routerazure-cognitive-services$0.140Unknown128K1
Model Routerazure$0.140Unknown128K1
Gemini 1.5 Flash-8BGoogle$0.037$0.1501M1
Qwen/Qwen3.5-9BAlibaba (Qwen)$0.050$0.150262K6
Qwen/Qwen3-VL-30B-A3B-ThinkingAlibaba (Qwen)$0.100$0.100262K6
Qwen/Qwen3-VL-30B-A3B-InstructAlibaba (Qwen)$0.100$0.100262K6
Qwen/Qwen3-VL-8B-InstructAlibaba (Qwen)$0.100$0.100262K5
Reka Edgekilo$0.100$0.10016K1
Ministral 3Bllmgateway$0.100$0.100131K1
GLM-4.6V-FlashZ.AI / Zhipu$0.020$0.210128K3
Mistral Small 3.2 24B InstructMistral$0.060$0.18096K3
Qwen/Qwen2.5-VL-32B-InstructAlibaba (Qwen)$0.050$0.220131K6
Pixtral 12BMistral$0.150$0.150128K2
Amazon: Nova Lite 1.0kilo$0.060$0.240300K1
Nova Liteamazon-bedrock$0.060$0.240300K1
Nova Litevercel$0.060$0.240300K1
Ministral 8Bllmgateway$0.150$0.150262K1
Llama 3.2 11B Vision InstructMeta$0.160$0.160128K1
Qwen-Omni TurboAlibaba (Qwen)$0.070$0.27033K3
Llama Guard 4 12BMeta$0.180$0.180131K3
Arcee AI: Spotlightkilo$0.180$0.180131K1
Seed 1.6 Flash (250715)llmgateway$0.070$0.300256K1
Gemini 2.0 Flash LiteGoogle$0.075$0.3001.05M8
ByteDance Seed: Seed 1.6 Flashkilo$0.075$0.300262K1
Gemini 1.5 FlashGoogle$0.075$0.3001M1
Llama 4 Scout 17B 16E InstructMeta$0.080$0.300128K12
Llama-4-Scout-17B-16E-Instruct-FP8Meta$0.080$0.300128K5
Gemma 4 26BGoogle$0.100$0.300256K8
Mistral Small 3.1Mistral$0.100$0.300128K3
Phi-4-multimodalMicrosoft$0.080$0.320128K2
Mistral Small 3.2Mistral$0.100$0.300128K2
Pixtral 12B 2409scaleway$0.200$0.200128K1
Molmo 2 8Bnano-gpt$0.200$0.20037K1
Ministral 14Bllmgateway$0.200$0.200262K1
Mistral Small 3.2 24B Instruct (2506)Mistral$0.100$0.310128K5
Meta Llama Guard 4 12BMeta$0.210$0.210131K1
GPT-5 NanoOpenAI$0.050$0.400400K17
Kilo Auto Smallkilo$0.050$0.400400K1
Gemini 2.5 Flash LiteGoogle$0.100$0.4001.05M13
GPT-4.1 nanoOpenAI$0.100$0.4001.05M12
Gemini 2.5 Flash Lite Preview 09-25Google$0.100$0.4001.05M9
Gemini 2.0 FlashGoogle$0.100$0.4001.05M6
Gemini 2.5 Flash Lite Preview 06-17Google$0.100$0.4001.05M4
Gemini 2.0 FlashGoogle$0.100$0.4001.05M3
Qwen/Qwen3-Omni-30B-A3B-ThinkingAlibaba (Qwen)$0.100$0.40066K3
Qwen/Qwen3-Omni-30B-A3B-InstructAlibaba (Qwen)$0.100$0.40066K3
Qwen3.5 FlashAlibaba (Qwen)$0.100$0.4001M3
Qwen2.5-Omni 7BAlibaba (Qwen)$0.100$0.40033K2
Gemini Flash-Lite LatestGoogle$0.100$0.4001.05M2
ByteDance Seed: Seed-2.0-Minikilo$0.100$0.400262K1
Gemma 4 31BGoogle$0.130$0.380256K11
Qwen/Qwen3-VL-32B-InstructAlibaba (Qwen)$0.104$0.416262K3
Grok 4.1 Fast (Reasoning)xAI$0.180$0.450128K10
Grok 4 Fast (Reasoning)xAI$0.180$0.4502M9

전체 382개 중 상위 60개 표시. 추가 필터링은 전체 목록을 이용하세요.

Frequently asked questions

How many AI models support 이미지 입력?

382 canonical models in our database currently support 이미지 입력. The list is regenerated on every data refresh, so it always reflects the latest model releases from models.dev.

What is the cheapest model with 이미지 입력?

dots.ocr from chutes is currently the lowest-priced option, at $0.010 per 1M input tokens and $0.011 per 1M output tokens. The full table above is sorted price-ascending.

Which model with 이미지 입력 has the largest context window?

Llama 4 Scout 17B Instruct (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside 이미지 입력.

Which models are available on the most providers?

Production-readiness usually correlates with how many independent providers host the same weights. The top three by provider count are: Kimi K2.5 (45), Kimi K2.6 (31), Qwen3.5 397B-A17B (22).

How is 이미지 입력 different from a regular LLM?

Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.

How often is this list updated?

Daily. Our data pipeline pulls models.dev once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.

마지막 업데이트:

Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.

Data is sourced from models.dev and normalized for comparison. Prices and capabilities may change. Always verify critical production decisions with the provider's official documentation.