🏆 Top Video Understanding Models on Ollama
🥇 llava – 13.9M pulls – 👁️ Best vision pioneer with video support
The OG multimodal model on Ollama. LLaVA (Large Language and Vision Assistant) combines a vision encoder with Vicuna for general-purpose visual understanding. Updated to version 1.6, it processes individual frames from videos for analysis. Available in 7B, 13B, and 34B sizes. While not explicitly designed for video, you can feed it video frames sequentially for frame-by-frame analysis. Massive community support with 98 tags.
💾 VRAM: 7B ~5.5GB, 13B ~10GB, 34B ~24GB (Q4 quantized)
⚙️ Hardware: 7B runs on consumer GPUs with 8GB+ VRAM (RTX 3070/4060). 13B needs 12-16GB VRAM. 34B requires latest high-end GPUs with 24GB+ VRAM (RTX 4090/5090). 7B also works on modern CPUs via Ollama (slow but functional).
🥈 minicpm-v – 5.1M pulls – 🧠 Best native video understanding
OpenBMB's MiniCPM-V 2.6 is one of the few models on Ollama with native video understanding support. Built on SigLip-400M + Qwen2-7B (8B total), it can process multi-image sequences and video frames directly. It achieves state-of-the-art results, surpassing GPT-4V and Gemini 1.5 Pro on single image benchmarks. Its efficient token encoding (only 640 tokens per 1.8M pixel image) makes it great for processing multiple video frames. Requires Ollama 0.3.10+.
💾 VRAM: ~6GB (Q4 quantized)
⚙️ Hardware: Runs on consumer GPUs with 8-12GB VRAM (RTX 3070/4070). For analyzing long video sequences with multiple frames, a latest high-end GPU with 24GB+ VRAM is recommended. Also works on modern CPUs at reduced speed.
🥉 llama3.2-vision – 4.4M pulls – 🖼️ Best for detailed image/frame reasoning
Meta's Llama 3.2 Vision is a powerhouse for visual reasoning. Available in 11B and 90B sizes. Feed it video frames and ask complex questions – it excels at reading text in images, identifying objects, and understanding charts. For video analysis, use it frame-by-frame or with keyframe extraction. The 11B variant runs on consumer GPUs, making it the most accessible high-quality vision model for video frame analysis.
💾 VRAM: 11B ~8GB, 90B ~55GB (Q4 quantized)
⚙️ Hardware: 11B runs on consumer GPUs with 8-12GB VRAM (RTX 3080/4070). 90B needs a multi-GPU setup – latest high-end GPUs (dual RTX 4090/5090 or A100). 11B also works on modern CPUs via Ollama (~5-10s per frame).
4️⃣ llava-llama3 – 2.2M pulls – ⚡ Best balance of quality and speed
Fine-tuned from Llama 3 Instruct, this LLaVA variant (8B) offers better benchmark scores than the original LLaVA while being lightweight enough for most setups. Great for video frame analysis with strong instruction-following and visual reasoning. The perfect middle ground if you want quality without requiring high-end hardware.
💾 VRAM: ~6GB (Q4 quantized)
⚙️ Hardware: Runs comfortably on consumer GPUs with 8GB+ VRAM (RTX 3070/4060 and up). Also works on modern CPUs for frame-by-frame analysis at slower speeds.
5️⃣ llama4 – 1.6M pulls – 🛠️ Best multimodal agent for video tasks
Meta's latest Llama 4 collection brings a multimodal MoE (Mixture of Experts) architecture in 16x17B and 128x17B sizes. Strong vision understanding with tool-use capabilities – it can describe video frames and take actions based on what it sees. Ideal for building automated video analysis workflows where the model not only watches but acts on the content.
💾 VRAM: 16x17B ~28GB (active params), 128x17B ~100GB+
⚙️ Hardware: 16x17B needs latest high-end GPUs with 32GB+ VRAM (RTX 5090, A6000). 128x17B requires multi-GPU clusters. Not suitable for consumer-grade setups.
6️⃣ moondream – 1.1M pulls – 🌙 Best ultra-compact vision model
moondream2 is a tiny 1.8B vision language model designed for edge devices. At just ~1.5GB, it can run on anything â laptops, Raspberry Pi, even phones via Termux. It supports visual QA, captioning, and basic video frame analysis. Perfect for quick experiments, prototyping, or running on low-power devices.
💾 VRAM: ~1.5GB (Q4 quantized)
⚙️ Hardware: Runs on virtually any modern CPU or GPU with 2GB+ VRAM. The most accessible model on this list – ideal for laptops and older machines.
7️⃣ granite3.2-vision – 899.4K pulls – 📄 Best for document & video text extraction
IBM's Granite 3.2 Vision (2B) is compact and specialized for visual document understanding – extracting text from tables, charts, infographics, and diagrams. For video analysis, it excels at reading text that appears in videos: subtitles, signs, whiteboards, or presentation slides. Includes tool-use support.
💾 VRAM: ~1.8GB (Q4 quantized)
⚙️ Hardware: Runs on any modern CPU or entry-level GPU (4GB+ VRAM). Perfect for laptops and low-resource environments.
8️⃣ nemotron3 – 105.2K pulls – 🎥 Best unified video + audio understanding
NVIDIA's Nemotron 3 Nano Omni (33B) is unique – it unifies video, audio, image, and text understanding in a single model. It can watch a video and listen to its audio track simultaneously, making it ideal for video summarization, transcription, and multimodal Q&A. If you need a model that understands both the visual and audio content of a video, this is the one.
💾 VRAM: ~20GB (Q4 quantized)
⚙️ Hardware: Requires a GPU with 24GB VRAM (RTX 4090 class). For processing full video+audio streams, a latest high-end GPU is recommended for acceptable inference speeds.
9️⃣ ahmadwaqar/smolvlm2-256m-video – 660 pulls – 🪶 Smallest dedicated video model
An ultra-compact 256M-parameter vision-language model specifically designed for video and image understanding. Supports visual QA, captioning, OCR, and video analysis with only ~1.38GB VRAM. Built on SigLIP + SmolLM2. Apache 2.0 license. Despite its tiny size, it's purpose-built for video – frame analysis, clip captioning, and basic video Q&A.
💾 VRAM: ~1.38GB (Q8 quantized)
⚙️ Hardware: Runs on virtually any modern device – CPU, laptop GPU, Raspberry Pi, or phone. No dedicated GPU needed.
🔟 openbmb/minicpm-v4.5 – 16.5K pulls – 📱 Best for mobile video understanding
MiniCPM-V 4.5 (8B) is a GPT-4o level MLLM optimized for high-FPS video understanding on your phone. Built specifically for mobile deployment, it handles single image, multi-image, and video understanding efficiently. If you want to run video understanding on a mobile device, this is your best bet.
💾 VRAM: ~6GB (Q4 quantized)
⚙️ Hardware: Designed for modern CPUs and mobile-class hardware. Runs on consumer GPUs with 8GB+ VRAM. Optimized for edge deployment with efficient token usage.
📊 Hardware Requirements at a Glance
🖥️ Modern CPUs (no GPU needed)
Models that work on CPU: moondream 1.8B, granite3.2-vision 2B, smolvlm2-256m-video, smolvlm2-500m-video, llava 7B (slow), minicpm-v 8B (slow), llava-llama3 8B (slow), llama3.2-vision 11B (slow).
Expect 5-20 seconds per frame for vision models. Best for occasional analysis or prototyping.
🎮 Consumer GPUs (8-16GB VRAM)
Examples: RTX 3070/3080/4060/4070/4080
Can run: llava 7B/13B, minicpm-v 8B, llama3.2-vision 11B, llava-llama3 8B, moondream 1.8B, granite3.2-vision 2B, smolvlm2-256m/500m/2.2B, nemotron3 (24GB+), openbmb/minicpm-v4.5
🚀 Latest High-End GPUs (24GB+ VRAM)
Examples: RTX 4090/5090, A6000, A100/H100
Can run: llava 34B, llama3.2-vision 90B (dual GPU), llama4 16x17B (32GB+), nemotron3 33B, all smaller models at higher precision
🖥️ How to Install & Run Video Models on Every Platform
🪟 Windows
Download the Ollama installer from ollama.com/download and run it. Open PowerShell or Command Prompt and run:
ollama run minicpm-v – best for video understanding on Windows
ollama run llava – classic choice for frame analysis
ollama run moondream – lightweight option for any Windows machine
For frame extraction, use FFmpeg (ffmpeg -i video.mp4 -vf fps=1 frames/frame_%04d.png) then feed frames to the model.
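That extraction step is easy to script. A minimal Python sketch (the video path and fps value are placeholders; it assumes ffmpeg is on your PATH):

```python
import subprocess
from pathlib import Path

def ffmpeg_frame_command(video: str, out_dir: str, fps: int = 1) -> list:
    """Build the FFmpeg command that samples `fps` frames per second as PNGs."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps}",
        str(Path(out_dir) / "frame_%04d.png"),
    ]

def extract_frames(video: str, out_dir: str = "frames", fps: int = 1) -> None:
    """Create the output directory and run FFmpeg (requires ffmpeg on PATH)."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(ffmpeg_frame_command(video, out_dir, fps), check=True)
```

Raise the fps value for fast-moving content, or lower it (e.g. fps=1/5 via a string like "fps=0.2") to sample fewer frames from long videos.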
🍎 macOS
Install via Homebrew: brew install ollama or download from ollama.com. macOS with Apple Silicon (M1-M4) has excellent support with Metal GPU acceleration.
ollama run minicpm-v – native Apple Silicon support, excellent performance
ollama run llama3.2-vision:11b – great for detailed video frame analysis
ollama run moondream – instant startup, runs on any Mac
Apple Silicon Macs with 16GB+ unified memory can run 7B-11B models comfortably.
🐧 Linux
One-liner install: curl -fsSL https://ollama.com/install.sh | sh
ollama run minicpm-v – best video understanding model
ollama run llava-llama3 – best quality/speed balance
ollama run nemotron3 – for video+audio understanding (24GB+ GPU)
For advanced workflows, pair with FFmpeg for frame extraction and OpenCV for real-time video processing. Use the Ollama REST API at http://localhost:11434 to integrate with Python scripts.
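A stdlib-only sketch of that REST integration (the endpoint and model name match the defaults mentioned above; describe_frame is a hypothetical helper and needs a running Ollama server):

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Package one frame as a base64-encoded image for /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one JSON object instead of a token stream
    }

def describe_frame(path: str, model: str = "minicpm-v") -> str:
    """Send a single frame to a local Ollama server and return its description."""
    with open(path, "rb") as f:
        payload = build_payload(model, "Describe this video frame in detail", f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Loop describe_frame over the PNGs produced by FFmpeg to caption a whole clip.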
🤖 Android
Ollama on Android is possible via Termux, but video understanding models require significant resources.
ollama run moondream – only 1.5GB, runs on high-end Android phones (8GB+ RAM)
ollama run granite3.2-vision – 1.8GB, good for text-in-video extraction
For heavier models, connect to a remote Ollama server from your phone using the Ollama REST API. Apps like Termux + Python let you send video frames to your home server.
🐳 Docker
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
Then pull and run models inside the container:
docker exec -it <container> ollama pull minicpm-v
docker exec -it <container> ollama run minicpm-v
For a browser interface, add Open WebUI: docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
Docker setup works on Linux (native GPU passthrough), Windows (via WSL2 + NVIDIA container toolkit), and macOS (CPU-only, no GPU passthrough).
🎯 How to Analyze Videos with Ollama
Since Ollama video models understand frames rather than full video streams, here's the workflow:
Method 1: Extract frames with FFmpeg, then analyze
ffmpeg -i your_video.mp4 -vf fps=1 frames/out%04d.png
ollama run minicpm-v "Describe what's happening in this image: ./frames/out0001.png"
(The Ollama CLI detects image file paths inside the prompt; there is no --image flag.)
Method 2: Use Ollama API for programmatic analysis
Send multiple frames via curl or Python to get a comprehensive video summary:
curl http://localhost:11434/api/generate -d '{
"model": "minicpm-v",
"prompt": "Describe this video frame in detail",
"images": ["<base64_encoded_frame>"]
}'
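Since the images field is a list, one request can also carry several frames at once. A hedged sketch (multi_frame_payload is a hypothetical helper; whether the frames are reasoned over jointly depends on the model, e.g. minicpm-v supports multi-image input):

```python
import base64
from pathlib import Path

def multi_frame_payload(model: str, prompt: str, frame_dir: str, limit: int = 8) -> dict:
    """Attach up to `limit` PNG frames, in filename order, to one request body."""
    frames = sorted(Path(frame_dir).glob("*.png"))[:limit]
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(p.read_bytes()).decode("ascii") for p in frames],
        "stream": False,
    }
```

POST the returned dict as JSON to http://localhost:11434/api/generate, exactly as in the curl example above.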
Method 3: Real-time with Nemotron3 (video + audio)
ollama run nemotron3 – this model can handle both visual frames and audio transcription from videos, making it ideal for full video content understanding.
🎬 Video Generation: Available Models & Why They're Not on Ollama
There ARE powerful open-source models that can generate videos from text, and some even have GGUF quantizations. Here's a quick look:
FramePack (by Lvmin Zhang / lllyasviel)
Creator of ControlNet and Fooocus. FramePack is a HunyuanVideo-based diffusion transformer for image-to-video generation (~3.3B params in the transformer). Available on HuggingFace (36K+ downloads) as a Diffusers pipeline. Not on Ollama because it uses a diffusion denoising pipeline with VAE encoding/decoding and noise scheduling that Ollama's inference engine doesn't support. Use it via ComfyUI or standalone Diffusers scripts.
Wan2.2 (by Alibaba / Wan-AI)
State-of-the-art open video generation model. Wan2.2 has a 14B param transformer with both text-to-video (T2V) and image-to-video (I2V) variants. Interestingly, Wan2.2 does have GGUF quantizations on HuggingFace via QuantStack (Q2_K through Q8_0, 5GB-14GB file sizes). Even so, it's not on Ollama because:
• The GGUF files only cover the transformer block – you still need the VAE, scheduler, and text encoder components
• Ollama's runtime (llama.cpp) has no diffusion sampling loop – it can't run the multi-step denoising process
• Video gen requires frame packing and temporal coherence logic that's outside Ollama's scope
CogVideo (by Zhipu AI / THU)
A series of text-to-video models (CogVideoX series). Available on HuggingFace in Diffusers format. Not on Ollama – same diffusion architecture limitation as FramePack and Wan. Use via ComfyUI or standalone Diffusers.
Can Ollama Generate Videos? What's Available & What's Not 🎬
When you ask can Ollama generate videos, here's the real answer: Ollama hosts video understanding models, but NOT video generation models. Let me explain why.
There ARE powerful open-source video generation models out there – FramePack (by Lvmin Zhang, creator of ControlNet/Fooocus), Wan2.2 (by Alibaba), and CogVideo (by Zhipu AI). In fact, Wan2.2 even has GGUF quantizations available on HuggingFace (14B parameters, Q4_K_M quantized). So why aren't they on Ollama?
🔧 The technical reason: Ollama runs on llama.cpp, which is designed for decoder-only autoregressive language models – models that predict the next token one at a time. Video generation models are diffusion transformers – they work by denoising random noise into a video over multiple steps. Even when exported as GGUF, these models need:
• A noise scheduler to control the denoising process
• A VAE decoder to convert latent vectors into actual video frames
• Frame packing logic to stitch frames together coherently
• Text conditioning integration for text-to-video prompts
Ollama's inference engine simply doesn't have this pipeline infrastructure. It's like trying to run Photoshop inside a terminal – the tool is built for a fundamentally different job. 🛠️
So what CAN Ollama do for video? It hosts video UNDERSTANDING models – models that analyze, describe, caption, and answer questions about video content. You feed them video frames and they tell you what's happening. All locally, all private. Let's dive in. 👇
✅ How to Actually Generate Videos Locally
For local video generation, use these tools instead of Ollama:
• ComfyUI + Wan2.2 or CogVideo – best node-based GUI for local video gen
• Krita AI Diffusion – plugin for Krita with video support
• HuggingFace Diffusers – Python library: pip install diffusers transformers
• Pinokio – one-click installer for ComfyUI, Stable Diffusion, and video models
• Cloud services – Replicate, RunPod, Fal.ai for on-demand GPU
As of 2025, Ollama has been adding more model architectures (it recently added FLUX for image gen), so future versions may add diffusion pipeline support as open-weight video models become more standardized. Keep an eye on the Ollama changelog for updates! 👀
🔄 Alternatives to Ollama for Video
• LM Studio – similar local inference with GUI, supports vision models
• ComfyUI – node-based workflow for both video understanding and generation
• LocalAI – alternative local inference server with vision support
• OpenAI API – GPT-4o with strong frame-based video understanding (cloud, pay-per-use)
• Anthropic Claude – strong video frame analysis via API (cloud)