🏆 Top Video Understanding Models on Ollama
🥇 llava — 13.9M pulls — 👁️ Best vision pioneer with video support
The OG multimodal model on Ollama. LLaVA (Large Language and Vision Assistant) combines a vision encoder with Vicuna for general-purpose visual understanding. Now at version 1.6, it is available in 7B, 13B, and 34B sizes. LLaVA takes still images as input, not video files, so it was never explicitly designed for video. To analyze a clip, you extract frames yourself and feed them to the model one at a time for frame-by-frame analysis.
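Here is a minimal sketch of that frame-by-frame workflow, not an official recipe. It assumes Ollama is running locally on its default port with `llava` already pulled, and that the frames have already been exported as JPEGs into a `frames/` folder (for example with `ffmpeg -i clip.mp4 -vf fps=1 frames/frame_%04d.jpg`). The helper names (`build_payload`, `describe_frame`, `describe_video_frames`), the folder name, and the prompt are illustrative choices, not part of Ollama.

```python
import base64
import json
import urllib.request
from pathlib import Path

# Ollama's default local endpoint; change this if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(jpeg_bytes: bytes, prompt: str, model: str = "llava") -> dict:
    """Build the JSON body for one frame. Images are sent base64-encoded."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(jpeg_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON object instead of a token stream
    }


def describe_frame(jpeg_bytes: bytes, prompt: str, model: str = "llava") -> str:
    """Send a single frame to the model and return its text answer."""
    body = json.dumps(build_payload(jpeg_bytes, prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def describe_video_frames(
    frames_dir: str, prompt: str = "Describe what is happening in this frame."
) -> list[tuple[str, str]]:
    """Describe every .jpg in frames_dir, in filename (and so time) order."""
    results = []
    for path in sorted(Path(frames_dir).glob("*.jpg")):
        results.append((path.name, describe_frame(path.read_bytes(), prompt)))
    return results


if __name__ == "__main__":
    # "frames" is a hypothetical folder of pre-extracted frames.
    for name, description in describe_video_frames("frames"):
        print(f"{name}: {description}")
```

Each frame is sent as its own request, so the model has no memory of earlier frames. If you need a summary of the whole clip, one option is to collect the per-frame descriptions and pass them to a text model in a follow-up prompt.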