AI Model Intelligence

Capability · 2026-06-29

AI models that support vision input

Models that accept images alongside text, for multimodal understanding.

What is this?

  • Vision-language models accept images in addition to (or instead of) text.
  • Most of them reply in text: they are multimodal LLMs, not image generators.

Why it matters

  • Use cases: document understanding (scans, PDFs, screenshots), UI/code review from screenshots, Q&A about product images, accessibility (alt text), medical/satellite imagery.
  • Billing usually includes a per-image cost in addition to the token cost; see each provider's pricing tables.
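
As a rough illustration of how a per-image charge and per-token charges might combine, here is a minimal Python sketch. The `Pricing` class, the `estimate_cost` helper, and the $0.001 per-image fee are hypothetical and not taken from any provider; the token prices are those listed for Gemma 3 4B IT in the table on this page. Actual billing rules differ by provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens
    per_image: float = 0.0     # USD per image (hypothetical flat fee)


def estimate_cost(pricing: Pricing, input_tokens: int,
                  output_tokens: int, images: int = 0) -> float:
    """Estimate one request's cost: token charges plus a flat per-image fee."""
    token_cost = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000
    return token_cost + images * pricing.per_image


# Token prices from the Gemma 3 4B IT row; the per-image fee is an assumption.
example = Pricing(input_per_million=0.040, output_per_million=0.080,
                  per_image=0.001)
cost = estimate_cost(example, input_tokens=1_000_000,
                     output_tokens=500_000, images=10)
print(f"${cost:.3f}")
```

Some providers instead convert each image into a number of input tokens, in which case the per-image term would fold into `input_tokens`.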

436 models support this capability

Model | Publisher | Input / 1M | Output / 1M | Context | Providers
PaddleOCR-VL | novita-ai | $0.020 | $0.020 | 16K | 1
Llama-3.2-11B-Vision-Instruct | Meta | $0.049 | $0.049 | 128K | 9
Gemma 3 4B IT | Google | $0.040 | $0.080 | 128K | 4
Google Gemma 3 27B Instruct | Google | $0.030 | $0.110 | 203K | 10
Model Router | azure | $0.140 | Unknown | 128K | 1
Model Router | azure-cognitive-services | $0.140 | Unknown | 128K | 1
Google Gemma 3 12B | Google | $0.050 | $0.100 | 131K | 7
Qwen3.5 9B | Alibaba (Qwen) | $0.040 | $0.150 | 262K | 14
Ministral 3 3B 2512 | Mistral | $0.100 | $0.100 | 131K | 3
Ministral 3B | llmgateway | $0.100 | $0.100 | 131K | 1
Reka Edge | kilo | $0.100 | $0.100 | 16K | 1
Reka Edge | openrouter | $0.100 | $0.100 | 16K | 1
GLM-4.6V-Flash | Z.AI / Zhipu | $0.020 | $0.210 | 128K | 3
Mistral Small 3.2 24B | Mistral | $0.060 | $0.180 | 128K | 3
Qwen2.5 VL 32B Instruct | Alibaba (Qwen) | $0.050 | $0.220 | 131K | 3
Ministral 3 8B 2512 | Mistral | $0.150 | $0.150 | 262K | 3
Pixtral 12B | Mistral | $0.150 | $0.150 | 128K | 2
Nova Lite | vercel | $0.060 | $0.240 | 300K | 1
Ministral 8B | llmgateway | $0.150 | $0.150 | 262K | 1
Amazon: Nova Lite 1.0 | kilo | $0.060 | $0.240 | 300K | 1
Nova Lite | amazon-bedrock | $0.060 | $0.240 | 300K | 1
Nova Lite 1.0 | openrouter | $0.060 | $0.240 | 300K | 1
Llama 3.2 11B Vision Instruct | Meta | $0.160 | $0.160 | 128K | 1
Qwen-Omni Turbo | Alibaba (Qwen) | $0.070 | $0.270 | 33K | 3
Llama Guard 4 12B | Meta | $0.180 | $0.180 | 164K | 3
Arcee AI: Spotlight | kilo | $0.180 | $0.180 | 131K | 1
Seed 1.6 Flash (250715) | llmgateway | $0.070 | $0.300 | 256K | 1
Gemini 2.0 Flash-Lite | Google | $0.075 | $0.300 | 1.05M | 4
ByteDance Seed: Seed 1.6 Flash | kilo | $0.075 | $0.300 | 262K | 1
Seed 1.6 Flash | openrouter | $0.075 | $0.300 | 262K | 1
Llama 4 Scout | Meta | $0.080 | $0.300 | 328K | 5
Gemma 4 26B A4B IT | Google | $0.060 | $0.330 | 262K | 16
Gemma 4 31B IT | Google | $0.100 | $0.300 | 262K | 26
Llama 4 Scout 17B 16E Instruct | Meta | $0.100 | $0.300 | 128K | 11
Mistral Small 3.1 | Mistral | $0.100 | $0.300 | 128K | 4
Ministral 3 14B 2512 | Mistral | $0.200 | $0.200 | 262K | 3
Mistral Small 3.2 | Mistral | $0.100 | $0.300 | 128K | 2
Phi-4-multimodal | Microsoft | $0.080 | $0.320 | 128K | 2
Ministral 14B | llmgateway | $0.200 | $0.200 | 262K | 1
Pixtral 12B 2409 | scaleway | $0.200 | $0.200 | 128K | 1
Meta Llama Guard 4 12B | Meta | $0.210 | $0.210 | 131K | 1
MiMo V2.5 | opencode-go | $0.140 | $0.280 | 1M | 1
MiMo-V2.5 | llmgateway | $0.140 | $0.280 | 1M | 1
Gemini 2.5 Flash Lite Preview 09-2025 | Google | $0.090 | $0.360 | 1.05M | 6
Qwen3.5 Flash | Alibaba (Qwen) | $0.090 | $0.360 | 1M | 4
GPT-5 Nano | OpenAI | $0.050 | $0.400 | 400K | 22
Kilo Auto Small | kilo | $0.050 | $0.400 | 400K | 1
Coding Xiaomi MiMo-V2.5 | aihubmix | $0.080 | $0.400 | 1.05M | 1
Gemini 2.5 Flash-Lite | Google | $0.100 | $0.400 | 1.05M | 17
GPT-4.1 nano | OpenAI | $0.100 | $0.400 | 1.05M | 16
Gemini Flash-Lite Latest | Google | $0.100 | $0.400 | 1.05M | 5
Gemini 2.0 Flash | Google | $0.100 | $0.400 | 1.05M | 3
Qwen2.5-Omni 7B | Alibaba (Qwen) | $0.100 | $0.400 | 33K | 2
ByteDance Seed: Seed-2.0-Mini | kilo | $0.100 | $0.400 | 262K | 1
Seed-2.0-Mini | openrouter | $0.100 | $0.400 | 262K | 1
Nemotron 3 Nano Omni | NVIDIA | $0.130 | $0.380 | 256K | 3
Qwen3 VL 32B Instruct | Alibaba (Qwen) | $0.104 | $0.416 | 131K | 4
Qwen3 VL 8B Instruct | Alibaba (Qwen) | $0.080 | $0.500 | 131K | 6
Grok 4.1 Fast (Non-Reasoning) | xAI | $0.180 | $0.450 | 128K | 12
Grok 4.1 Fast (Reasoning) | xAI | $0.180 | $0.450 | 128K | 9

Showing the first 60 of 436. Use the full index for further filtering.

Frequently asked questions

How many AI models support image input?

436 canonical models in our database currently support image input. The list is regenerated on every data refresh, so it reflects the latest releases tracked in our catalogue.

What is the cheapest model with image input?

PaddleOCR-VL from novita-ai is currently the lowest-priced option, at $0.020 per 1M input tokens and $0.020 per 1M output tokens. The table above is sorted by ascending price; the row order is consistent with ranking by the sum of the input and output prices.
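
The ordering of the table can be approximated with a short Python sketch. It assumes rows are ranked by input price plus output price and that an Unknown output price counts as zero; both are inferred from the visible rows rather than documented behaviour, and `combined_price` is a hypothetical helper. The sample rows are copied from the table.

```python
from typing import Optional

# (model, publisher, input USD per 1M, output USD per 1M or None if Unknown)
Row = tuple[str, str, float, Optional[float]]

rows: list[Row] = [
    ("Google Gemma 3 12B", "Google", 0.050, 0.100),
    ("Model Router", "azure", 0.140, None),
    ("PaddleOCR-VL", "novita-ai", 0.020, 0.020),
    ("Gemma 3 4B IT", "Google", 0.040, 0.080),
]


def combined_price(row: Row) -> float:
    """Sort key: input price plus output price (Unknown treated as 0.0)."""
    _, _, input_price, output_price = row
    return input_price + (output_price or 0.0)


ranked = sorted(rows, key=combined_price)
cheapest = ranked[0]
print(cheapest[0], "from", cheapest[1])
```

Treating Unknown as zero keeps such rows in the list, but it can make them look cheaper than they are; filtering them out first is a reasonable alternative.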

Which model with image input has the largest context window?

Llama 4 Scout 17B Instruct (US) (Meta) leads on context at 3.50M tokens. This may matter if you also need long-document understanding alongside image input.
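
To compare context windows programmatically, the abbreviated values in the Context column (for example 128K or 1.05M) first need converting to token counts. The sketch below assumes decimal units (K = 1,000 and M = 1,000,000), which this page does not state; `parse_context` is a hypothetical helper, and the sample values come from the table.

```python
import re

# Assumed decimal multipliers; the page does not say whether K means 1,000 or 1,024.
_UNITS = {"K": 1_000, "M": 1_000_000}


def parse_context(cell: str) -> int:
    """Convert a context cell such as '128K' or '1.05M' to a token count."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KM])", cell.strip())
    if match is None:
        raise ValueError(f"unrecognised context value: {cell!r}")
    number, unit = match.groups()
    return round(float(number) * _UNITS[unit])


windows = {
    "Gemini 2.0 Flash": "1.05M",
    "GPT-5 Nano": "400K",
    "PaddleOCR-VL": "16K",
}
largest = max(windows, key=lambda name: parse_context(windows[name]))
print(largest, parse_context(windows[largest]))
```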

Which models are available on the most providers?

The number of independent providers serving a model is often a rough signal of production readiness. The top three by provider count are: Kimi K2.6 (49), Kimi K2.5 (48), Claude Sonnet 4.6 (31).
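
A ranking like this can be reproduced from provider counts with a few lines of Python. The sketch below is illustrative only: the first three counts are the ones quoted above, the other two come from the Providers column of the table, and the dictionary layout is an assumption rather than the catalogue's actual data format.

```python
import heapq

# Provider counts as shown on this page; the structure is hypothetical.
provider_counts = {
    "GPT-5 Nano": 22,
    "Kimi K2.5": 48,
    "Gemma 4 31B IT": 26,
    "Claude Sonnet 4.6": 31,
    "Kimi K2.6": 49,
}

# nlargest avoids sorting the whole catalogue when only a few entries are needed.
top_three = heapq.nlargest(3, provider_counts.items(), key=lambda item: item[1])
for name, count in top_three:
    print(f"{name} ({count})")
```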

How is a model with image input different from a regular LLM?

Vision-language models accept image input alongside text. They are multimodal LLMs, not image generators — most reply in text after looking at the image.

How often is this list updated?

Daily. Our data pipeline syncs once a day, regenerates the canonical model list, and rebuilds these pages so newly released models appear within 24 hours.

Last updated:

Prices in USD per 1M tokens. Unknown means the provider does not publish per-token pricing.
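
When reading price cells in code, the Unknown marker needs handling separately from numeric values. A minimal sketch, assuming cells look like `$0.140` or `Unknown` as in the table; `parse_price` is a hypothetical helper, not part of any published API.

```python
from typing import Optional


def parse_price(cell: str) -> Optional[float]:
    """Return the USD price per 1M tokens, or None when the cell is 'Unknown'."""
    text = cell.strip()
    if text.lower() == "unknown":
        return None  # provider does not publish per-token pricing
    return float(text.lstrip("$"))


print(parse_price("$0.140"), parse_price("Unknown"))
```

Returning `None` rather than `0.0` keeps unpublished prices distinguishable from genuinely free ones in later calculations.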

Pricing and capabilities are refreshed daily and reconciled against each provider's official documentation. Always verify critical production decisions with the provider directly.