Molmo 2: Ai2’s New Open-Source Video Intelligence System
The Allen Institute for Artificial Intelligence (Ai2) has unveiled Molmo 2, a family of open-source AI vision models capable of watching, tracking, analyzing, and answering questions about videos with remarkable precision. Unlike many AI systems that focus on generating content, Molmo 2 specializes in understanding video: identifying what’s happening, exactly where it’s occurring, and when events take place within the footage.
During a recent demonstration at Ai2’s Seattle headquarters, researchers showcased Molmo 2’s capabilities across various short video clips. When shown a soccer match, the system identified the defensive mistake that led to a goal, analyzing the sequence and pinpointing a failure to clear the ball effectively. In a baseball clip, it correctly identified the teams (Angels and Mariners), recognized which player scored (#55), and explained how it determined the home team by reading uniforms and stadium branding. The system’s intelligence extended to practical applications as well: given a cooking video, Molmo 2 returned a structured recipe with ingredients and step-by-step instructions, even including timing information extracted from on-screen text. Perhaps most impressive was its tracking functionality, demonstrated by following four penguins as they moved around the frame, maintaining consistent identification for each bird even when they overlapped or moved in and out of view. In a racing video, when asked to “track the car that passes the #13 car in the end,” Molmo 2 watched the entire clip first to understand the query before going back to identify and track the correct vehicle throughout the footage.
Ai2’s release of Molmo 2 caps a significant year for the Seattle-based nonprofit organization founded in 2014 by the late Microsoft co-founder Paul Allen. The institute secured $152 million in funding from the NSF and Nvidia, partnered on an AI cancer research initiative with Seattle’s Fred Hutch, and released Olmo 3, a text model competing with those from Meta and DeepSeek. According to CEO Ali Farhadi, Ai2 has seen more than 21 million downloads of its models this year and nearly 3 billion queries across its systems. Unlike commercial tech giants, Ai2’s mission as a nonprofit isn’t focused on market competition but rather on advancing AI technology and making those advancements freely available. The institute has methodically released open models for text (OLMo), images (the original Molmo), and now video, working toward what Farhadi describes as a unified model that can reason across all modalities.
What sets Ai2’s approach apart is its commitment to complete openness. While Google’s Gemini, OpenAI’s GPT-4o, and Meta’s Perception LM can also process video, Molmo 2 distinguishes itself by making its model weights, training code, and training data all publicly available. This goes beyond “open weight” models that release only the final product without sharing the development process, and it represents a fundamental difference from the closed systems offered by major tech companies. The philosophy isn’t merely academic: it allows developers to trace a model’s behavior back to its training data, customize it for specific applications, and avoid vendor lock-in. Ai2 also emphasizes efficiency in its approach. While Meta’s Perception LM was trained on 72.5 million videos, Molmo 2 used approximately 9 million, relying instead on high-quality human annotations. According to Ai2, the result is a smaller, more efficient model that outperforms the institute’s own much larger model from last year and approaches the capabilities of commercial systems from Google and OpenAI, while remaining simple enough to run on a single machine.
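To give a sense of what “weights you can actually run” looks like in practice, here is a minimal sketch of loading an open Molmo checkpoint from Hugging Face on a single machine. It follows the `transformers` pattern Ai2 published for the original Molmo; the repo id below is a placeholder, and Molmo 2’s actual checkpoint names and processing API may differ.

```python
# Minimal sketch: loading an open-weights Molmo checkpoint on a single machine.
# The repo id is a placeholder, and the call pattern follows Ai2's published
# example for the original Molmo; Molmo 2's actual names/API may differ.
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image

MODEL_ID = "allenai/molmo-2-placeholder"  # hypothetical; check Ai2's Hugging Face page

processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Single-image query, mirroring the original Molmo example.
inputs = processor.process(
    images=[Image.open("frame.jpg")],  # hypothetical local image
    text="What is happening in this scene?",
)
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
answer = processor.tokenizer.decode(
    output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)
```

If the pattern holds for Molmo 2, swapping sampled video frames in for the single image would be the natural extension; and because the training data is published alongside the weights, an unexpected answer here can, at least in principle, be traced back to the examples the model saw.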
Farhadi takes an unconventional view of the competitive landscape in AI, suggesting that real open source transforms competition into collaboration: “There is no need to compete. Everything is out there. You don’t need to reverse engineer. You don’t need to rebuild it. Just grab it, build on top of it, do the next thing. And we love it when people do that.” This approach has already shown its influence: when the original Molmo introduced its pointing capability last year (allowing the model to identify specific objects in an image), competing models quickly adopted the feature. Ranjay Krishna, who leads Ai2’s computer vision team and is also a University of Washington assistant professor, noted, “We know they adopted our data because they perform exactly as well as we do.”
Despite its impressive capabilities, Molmo 2 does have limitations. Its tracking functionality currently tops out at about 10 items; it struggles with crowded scenes like busy highways or large groups of people. “This is a very, very new capability, and it’s one that’s so experimental that we’re starting out very small,” Krishna explained. “There’s no technological limit to this, it just requires more data, more examples of really crowded scenes.” Long-form video analysis also remains challenging, with the model performing best on short clips; in the public playground launching alongside Molmo 2, uploaded videos are limited to 15 seconds. Unlike some commercial systems, Molmo 2 doesn’t process live video streams but instead analyzes recordings after the fact. Krishna indicated that the team is exploring streaming capabilities for applications like robotics, where real-time response would be necessary, though this work is still in early stages.

Starting today, Molmo 2 is available on both Hugging Face and Ai2’s playground, opening new possibilities for developers and researchers to explore and build upon this open-source video intelligence system.
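For developers pulling the model from Hugging Face, those offline-only, short-clip constraints suggest a simple first step: sample a handful of frames from a recorded clip and query those. The sketch below does the sampling with OpenCV; whether Molmo 2 ingests a frame list this way, rather than native video input, is an assumption rather than a documented interface.

```python
# Sketch: sampling evenly spaced frames from a short recorded clip (<= 15 s)
# for offline analysis. The OpenCV sampling is standard; feeding the frames
# to Molmo 2 as an image list is an assumption, not a confirmed interface.
import cv2
from PIL import Image

def sample_frames(path: str, num_frames: int = 8) -> list[Image.Image]:
    """Return `num_frames` evenly spaced RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        # Seek to an evenly spaced position, then decode one frame.
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; convert to the RGB order PIL expects.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("penguins.mp4")  # hypothetical clip
# These frames could then be handed to the processor loaded earlier, e.g.
# processor.process(images=frames, text="Track each penguin in the clip.")
```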