Capacidade · 2026-06-29

Modelos de IA com entrada visual

Modelos que aceitam imagens junto com texto — compreensão multimodal.

O que é?

Modelos visão-linguagem aceitam imagens além de (ou no lugar de) texto.
A maioria responde com texto — são LLMs multimodais, não geradores de imagem.

Por que importa

Casos de uso: compreensão de documentos (scans, PDF, capturas de tela), revisão de UI/código a partir de screenshots, Q&A de fotos de produto, acessibilidade (alt text), imagens médicas/satelitais.
A cobrança geralmente inclui um custo por imagem além do custo por token — consulte as tabelas de offering de cada provedor.

436 modelos com esta capacidade

Modelo	Fornecedor	Entrada / 1M	Saída / 1M	Contexto	Provedores
PaddleOCR-VL	novita-ai	$0.020	$0.020	16K	1
Llama-3.2-11B-Vision-Instruct	Meta	$0.049	$0.049	128K	9
Gemma 3 4B IT	Google	$0.040	$0.080	128K	4
Google Gemma 3 27B Instruct	Google	$0.030	$0.110	203K	10
Model Router	azure	$0.140	Unknown	128K	1
Model Router	azure-cognitive-services	$0.140	Unknown	128K	1
Google Gemma 3 12B	Google	$0.050	$0.100	131K	7
Qwen3.5 9B	Alibaba (Qwen)	$0.040	$0.150	262K	14
Ministral 3 3B 2512	Mistral	$0.100	$0.100	131K	3
Ministral 3B	llmgateway	$0.100	$0.100	131K	1
Reka Edge	kilo	$0.100	$0.100	16K	1
Reka Edge	openrouter	$0.100	$0.100	16K	1
GLM-4.6V-Flash	Z.AI / Zhipu	$0.020	$0.210	128K	3
Mistral Small 3.2 24B	Mistral	$0.060	$0.180	128K	3
Qwen2.5 VL 32B Instruct	Alibaba (Qwen)	$0.050	$0.220	131K	3
Ministral 3 8B 2512	Mistral	$0.150	$0.150	262K	3
Pixtral 12B	Mistral	$0.150	$0.150	128K	2
Nova Lite	vercel	$0.060	$0.240	300K	1
Ministral 8B	llmgateway	$0.150	$0.150	262K	1
Amazon: Nova Lite 1.0	kilo	$0.060	$0.240	300K	1
Nova Lite	amazon-bedrock	$0.060	$0.240	300K	1
Nova Lite 1.0	openrouter	$0.060	$0.240	300K	1
Llama 3.2 11B Vision Instruct	Meta	$0.160	$0.160	128K	1
Qwen-Omni Turbo	Alibaba (Qwen)	$0.070	$0.270	33K	3
Llama Guard 4 12B	Meta	$0.180	$0.180	164K	3
Arcee AI: Spotlight	kilo	$0.180	$0.180	131K	1
Seed 1.6 Flash (250715)	llmgateway	$0.070	$0.300	256K	1
Gemini 2.0 Flash-Lite	Google	$0.075	$0.300	1.05M	4
ByteDance Seed: Seed 1.6 Flash	kilo	$0.075	$0.300	262K	1
Seed 1.6 Flash	openrouter	$0.075	$0.300	262K	1
Llama 4 Scout	Meta	$0.080	$0.300	328K	5
Gemma 4 26B A4B IT	Google	$0.060	$0.330	262K	16
Gemma 4 31B IT	Google	$0.100	$0.300	262K	26
Llama 4 Scout 17B 16E Instruct	Meta	$0.100	$0.300	128K	11
Mistral Small 3.1	Mistral	$0.100	$0.300	128K	4
Ministral 3 14B 2512	Mistral	$0.200	$0.200	262K	3
Mistral Small 3.2	Mistral	$0.100	$0.300	128K	2
Phi-4-multimodal	Microsoft	$0.080	$0.320	128K	2
Ministral 14B	llmgateway	$0.200	$0.200	262K	1
Pixtral 12B 2409	scaleway	$0.200	$0.200	128K	1
Meta Llama Guard 4 12B	Meta	$0.210	$0.210	131K	1
MiMo V2.5	opencode-go	$0.140	$0.280	1M	1
MiMo-V2.5	llmgateway	$0.140	$0.280	1M	1
Gemini 2.5 Flash Lite Preview 09-2025	Google	$0.090	$0.360	1.05M	6
Qwen3.5 Flash	Alibaba (Qwen)	$0.090	$0.360	1M	4
GPT-5 Nano	OpenAI	$0.050	$0.400	400K	22
Kilo Auto Small	kilo	$0.050	$0.400	400K	1
Coding Xiaomi MiMo-V2.5	aihubmix	$0.080	$0.400	1.05M	1
Gemini 2.5 Flash-Lite	Google	$0.100	$0.400	1.05M	17
GPT-4.1 nano	OpenAI	$0.100	$0.400	1.05M	16
Gemini Flash-Lite Latest	Google	$0.100	$0.400	1.05M	5
Gemini 2.0 Flash	Google	$0.100	$0.400	1.05M	3
Qwen2.5-Omni 7B	Alibaba (Qwen)	$0.100	$0.400	33K	2
ByteDance Seed: Seed-2.0-Mini	kilo	$0.100	$0.400	262K	1
Seed-2.0-Mini	openrouter	$0.100	$0.400	262K	1
Nemotron 3 Nano Omni	NVIDIA	$0.130	$0.380	256K	3
Qwen3 VL 32B Instruct	Alibaba (Qwen)	$0.104	$0.416	131K	4
Qwen3 VL 8B Instruct	Alibaba (Qwen)	$0.080	$0.500	131K	6
Grok 4.1 Fast (Non-Reasoning)	xAI	$0.180	$0.450	128K	12
Grok 4.1 Fast (Reasoning)	xAI	$0.180	$0.450	128K	9

Mostrando os 60 primeiros de 436. Use o diretório completo para filtrar mais.

Frequently asked questions

How many AI models support entrada de imagem?

436 canonical models in our database currently support entrada de imagem. The list is regenerated on every data refresh, so it always reflects the latest releases tracked in our catalogue.

What is the cheapest model with entrada de imagem?

PaddleOCR-VL from novita-ai is currently the lowest-priced option, at $0.020 per 1M input tokens and $0.020 per 1M output tokens. The full table above is sorted price-ascending.

Which model with entrada de imagem has the largest context window?

Llama 4 Scout 17B Instruct (US) (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside entrada de imagem.

Which models are available on the most providers?

Production-readiness usually correlates with how many independent providers host the same weights. The top three by provider count are: Kimi K2.6 (49), Kimi K2.5 (48), Claude Sonnet 4.6 (31).

How is entrada de imagem different from a regular LLM?

Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.

How often is this list updated?

Daily. Our data pipeline syncs once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.

Top models with this capability

PaddleOCR-VL$0.02 in / $0.02 out
Llama-3.2-11B-Vision-Instruct$0.05 in / $0.05 out
Gemma 3 4B IT$0.04 in / $0.08 out
Google Gemma 3 27B Instruct$0.03 in / $0.11 out
Model Router$0.14 in / $0.00 out

Other capabilities

Best-of lists you might also want

Pricing comparisons

Vendors in this list

Última atualização: 2026-06-29

Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.

Pricing and capabilities are refreshed daily and reconciled against each provider's official documentation. Always verify critical production decisions with the provider directly.