StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

Meta AI, New York University

Abstract

Multimodal large language models (MLLMs) have made significant progress in vision-language reasoning, but their ability to efficiently handle long videos remains limited. Despite recent advances in long-context MLLMs, storing and attending to the key-value (KV) cache for long visual contexts incurs substantial memory and computational overhead. Existing visual compression methods require either encoding the entire visual context before compression or access to the questions in advance, which is impractical for long video understanding and multi-turn conversational settings. In this work, we propose StreamMem, a query-agnostic KV cache memory mechanism for streaming video understanding. Specifically, StreamMem encodes new video frames in a streaming manner, compressing the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory to enable efficient question answering (QA) in memory-constrained, long-video scenarios. Evaluations on three long video understanding benchmarks and two streaming video question answering benchmarks show that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.

Query-Agnostic Streaming Video Processing

StreamMem addresses the challenge of streaming video processing under a memory budget by introducing a query-agnostic KV compression strategy. It maintains a bounded memory footprint by continuously compressing the KV cache after each incoming video clip, thus preventing out-of-memory (OOM) errors and avoiding costly memory offloading regardless of video length. To achieve effective and efficient memory retention, StreamMem leverages a novel saliency metric based on cross-attention scores between visual tokens and chat template tokens, allowing it to select and preserve informative visual content in a query-agnostic manner. In addition, it incorporates an input frame compression module to reduce frame-level redundancy prior to MLLM encoding, and a frame-wise KV merging mechanism that constructs prototype representations for each observed frame. Together, these components produce a diverse yet compact KV cache that supports accurate and memory-efficient streaming question answering.
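To make the mechanism concrete, the sketch below illustrates the bounded-memory streaming loop under simplifying assumptions: a single attention head, flat (token, dim) KV tensors, and a hypothetical encode_clip function standing in for a forward pass of the MLLM. It is not the released implementation, only a minimal illustration of scoring cached keys against generic proxy queries (e.g., chat template tokens) and keeping the top-scoring entries within a fixed budget.

```python
import torch


def proxy_attention_scores(keys, proxy_queries):
    """Score each cached key by its attention from generic proxy queries,
    averaged over the proxy tokens. keys: (N, d), proxy_queries: (P, d)."""
    logits = proxy_queries @ keys.T / keys.shape[-1] ** 0.5
    attn = torch.softmax(logits, dim=-1)   # (P, N)
    return attn.mean(dim=0)                # (N,)


def compress_kv(keys, values, proxy_queries, budget):
    """Keep the `budget` most salient key-value pairs, preserving token order."""
    scores = proxy_attention_scores(keys, proxy_queries)
    k = min(budget, keys.shape[0])
    keep = torch.topk(scores, k=k).indices.sort().values
    return keys[keep], values[keep]


def stream(clips, encode_clip, proxy_queries, budget):
    """Streaming loop: encode each clip, append its KV to memory, then
    re-compress so the memory never exceeds the fixed budget."""
    dim = proxy_queries.shape[-1]
    mem_k, mem_v = torch.empty(0, dim), torch.empty(0, dim)
    for clip in clips:
        # encode_clip is a stand-in for the MLLM forward pass that
        # cross-attends to the current memory while encoding new frames.
        new_k, new_v = encode_clip(clip, mem_k, mem_v)
        mem_k = torch.cat([mem_k, new_k])
        mem_v = torch.cat([mem_v, new_v])
        mem_k, mem_v = compress_kv(mem_k, mem_v, proxy_queries, budget)
    return mem_k, mem_v
```

Because the memory is re-compressed after every clip, peak KV size is bounded by the budget plus one clip's worth of tokens, independent of total video length.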

Method

An overview of the full pipeline is illustrated in the figure below. (a) The overall workflow of StreamMem for streaming video understanding. Incoming frames are first filtered to reduce redundancy, then passed through the vision encoder and integrated with the existing KV memory via cross-attention. The resulting KV cache is compressed to maintain a fixed memory budget, enabling continual processing of future frames or downstream question answering. (b) Detailed illustration of the KV compression module. Key-value pairs in the memory and in the new frames are pruned according to the attention scores between the keys and the proxy queries. In addition, the key-value pairs of each new frame are aggregated into a single frame-level prototype via weighted merging (shown as darker squares). This combination of pruning and merging yields compact yet expressive memory representations for long video sequences.
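The prune-and-merge step in panel (b) can be sketched as follows. This is a simplified, single-head illustration; in particular, forming the frame prototype from the pruned tokens (rather than from all of the frame's tokens) is an assumption on our part, and prune_and_merge and keep_per_frame are hypothetical names rather than the paper's API.

```python
import torch


def prune_and_merge(frame_keys, frame_values, proxy_queries, keep_per_frame):
    """Prune a frame's KV down to `keep_per_frame` tokens and fold the
    remaining (pruned) tokens into one attention-weighted frame prototype."""
    logits = proxy_queries @ frame_keys.T / frame_keys.shape[-1] ** 0.5
    scores = torch.softmax(logits, dim=-1).mean(dim=0)      # (tokens,)

    order = scores.argsort(descending=True)
    kept, dropped = order[:keep_per_frame], order[keep_per_frame:]
    if dropped.numel() == 0:
        return frame_keys, frame_values

    # Attention-weighted average of the dropped tokens -> one prototype KV pair.
    w = scores[dropped] / scores[dropped].sum().clamp_min(1e-8)
    proto_k = (w[:, None] * frame_keys[dropped]).sum(dim=0, keepdim=True)
    proto_v = (w[:, None] * frame_values[dropped]).sum(dim=0, keepdim=True)

    kept = kept.sort().values  # preserve original token order for kept entries
    keys = torch.cat([frame_keys[kept], proto_k])
    values = torch.cat([frame_values[kept], proto_v])
    return keys, values
```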

Attention Visualization

One key insight of the proposed method is to leverage the system's chat template tokens as a proxy to approximate a generic query. Below is a visualization of visual tokens attended to by different text queries. Red indicates higher attention scores. Despite minor variations, different text queries attend to largely overlapping regions of the input images. The "Generic Question" is "What is happening in the video?", while the "Specific Question" is "What occurs just before reading the magazines?" Attention scores are averaged across all layers and heads, and then interpolated from 14×14 to 384×384 to match the image resolution. The MLLM used is LLaVA-OneVision, and the video clip is sourced from the RVS-Ego benchmark (which uses videos from the Ego4D dataset).
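The heatmaps above follow a standard recipe that can be reproduced roughly as in the sketch below, assuming a HuggingFace-style forward pass with output_attentions=True and that the frame occupies a contiguous block of 14 × 14 visual tokens; attention_heatmap, query_idx, and vis_start are illustrative names, not part of the released code.

```python
import torch
import torch.nn.functional as F


def attention_heatmap(attentions, query_idx, vis_start, grid=14, out_size=384):
    """attentions: tuple of per-layer (batch, heads, seq, seq) tensors.
    query_idx: 1-D tensor of positions of the question's text tokens.
    vis_start: index of the frame's first visual token in the sequence."""
    # Average attention over layers and heads.
    attn = torch.stack([a[0] for a in attentions]).mean(dim=(0, 1))  # (seq, seq)
    # Attention from the query tokens to the frame's visual tokens.
    scores = attn[query_idx][:, vis_start:vis_start + grid * grid].mean(dim=0)
    # Reshape to the 14x14 patch grid and upsample to the image resolution.
    heat = scores.reshape(1, 1, grid, grid)
    heat = F.interpolate(heat, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)[0, 0]
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
```

The normalized map can then be overlaid on the 384×384 frame to produce visualizations like those shown, with red corresponding to higher attention.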

Results

We evaluate StreamMem across three offline and two streaming long video understanding benchmarks (EgoSchema, MLVU, and VideoMME; RVS-Ego and RVS-Movie) using three open-source pre-trained MLLMs (LLaVA-OneVision, Qwen2-VL, and Qwen2.5-VL). Results show that StreamMem consistently retains strong question-answering performance while keeping the KV cache compact across videos of varying lengths and question types, and it remains competitive with methods that rely on significantly larger memory budgets.

Conclusion

Enabling continuous video stream processing under a bounded memory constraint is essential for deploying multimodal large language models (MLLMs) in real-world, embodied scenarios. Yet most prior work in long video-language understanding has focused on static or offline settings, assuming known queries, finite video lengths, and full access to the visual context in advance. These assumptions limit applicability in streaming or open-world environments. In this work, we present StreamMem, a training-free and query-agnostic KV cache compression framework tailored for streaming video understanding. By using attention scores between visual tokens and chat template tokens as a proxy for query relevance, StreamMem effectively retains salient visual information without requiring access to future queries. When applied to open-source MLLMs, StreamMem achieves state-of-the-art performance across a diverse set of both offline and streaming long video benchmarks. Beyond demonstrating competitive empirical results, we conduct an in-depth analysis of various components of our framework, including input frame filtering, KV merging strategies, and positional embedding techniques, shedding light on the design considerations for constructing a memory-bounded visual processing pipeline. These insights lay a foundation for future research on scaling MLLMs to continuously process real-world visual streams.