Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis (2024)

Junho Kim*, Hyunjun Kim*, Hosu Lee, Yong Man Ro†
Integrated Vision and Language Lab, KAIST, South Korea
{arkimjh,kimhj709,leehosu01,ymro}@kaist.ac.kr
https://ivy-lvlm.github.io/SALOVA
*Equal contribution. †Corresponding author.

Abstract

Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through a targeted retrieval process. We address two main challenges to achieve this: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating a dynamic routing mechanism and a spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates enhanced capability in processing complex long-form videos, showing a significant ability to maintain contextual integrity across extended sequences.

1 Introduction

Recent advancements in Large Language Models (LLMs)[43, 44, 22] have brought us one step closer to achieving Artificial General Intelligence (AGI). As a next step, the current trend is shifting toward modular systems that integrate multiple modalities, leveraging the exceptional generalization and reasoning capabilities of LLMs to evolve into Large Multi-modal Models (LMMs). Accordingly, users can interact with the models across various modalities beyond text without restriction, expanding the scope of machine understanding and enhancing user engagement. In particular, considering the widespread adoption of long-form videos across various web platforms, the importance of understanding long, untrimmed video has become increasingly prominent in the multi-modal domain.

Following the pioneering works[38, 37, 14] that utilize visual instruction tuning to augment LLMs with visual perception, remarkable strides[17, 65, 11] have been made in aligning cross-modal consistency—especially between the vision and language domains. Although more recent models[69, 29] integrate various vision modalities all at once, current approaches still face significant challenges in understanding untrimmed and long-form video content. The main challenge is attributed to the limited context length of LMMs, an inherent structural limitation that restricts the models to a finite number of tokens in the input sequence. For example, the LLaVA series[30, 29] requires 144 visual tokens per frame when processing video, which amounts to a maximum of only ~56 frames with an 8K context length, still far too few to handle long sequence data.

Accordingly, current video-LMMs[35, 12, 39] rely on (i) sparse frame sampling to represent entire videos[72, 26], (ii) dense compression of visual tokens into a smaller size to manage the excessive number of frames[41, 34], and (iii) adaptive pooling strategies[59, 60] based on the SlowFast approach[19], all aimed at fitting long video sequences within the limited context window of LMMs. Several studies focusing on long video understanding have presented memory-augmented generation[23, 53] that uses an additional buffer to embed long-term information, or have extended the context using RoPE-based frequency extension during training[70]. Despite such endeavors, when handling massive numbers of video frames, previous works still confront restricted context sizes and significant memory overhead, which leads to substantial visual information loss. Because critical events may be overlooked by the models, this hinders their ability to fully capture context changes in lengthy videos, resulting in inaccurate and irrelevant responses to user queries.

Starting from the intuitive insight outlined below, in this paper we propose a retrieval-driven approach for long video understanding with LLMs. Analogous to recent Retrieval-Augmented Generation (RAG) systems[28] (widely adopted in LLMs), which retrieve relevant information from external factual knowledge, humans naturally employ similar strategies when seeking specific information, efficiently locating and consulting the necessary materials to answer targeted questions—e.g., imagine taking an open-book exam or searching for a certain recipe in a cookbook. Given a long and untrimmed video, mirroring this targeted retrieval process, we introduce a novel framework, Segment-Augmented LOng Video Assistant (SALOVA), to effectively handle long-sequence visual inputs by retrieving the relevant video segments.

To construct our video segment retrieval framework, the central challenge hinges on establishing two main components: (i) densely captioned video data, consisting of video-caption pairs whose descriptions progress throughout each video, used to train the model to accurately identify relevant video segments; and (ii) a dynamic routing mechanism that selects the video segments pertinent to a query and connects them to the LLM. To address these, our approach is outlined as follows.

Data. (§3)

Several video-text paired datasets[6, 4, 62, 56, 10, 8] have recently been released, but they are inadequate for handling long and untrimmed video data: only partial video moments are described, with limited word counts, as compared in Fig.1(a). To address the lack of detailed descriptions within the videos and the short durations of both videos and texts, we introduce the SceneWalk dataset, a new high-quality video dataset with thorough captioning for each video. It includes dense and detailed descriptions for every video segment across the entire scene context. The SceneWalk dataset, sourced from 87.8K long and untrimmed YouTube videos (avg. 486 seconds each), features frequent scene transitions across a total of 11.8K hours of video and 1.3M segmented video clips. Each video segment is paired with a detailed description (avg. 137.5 words), generated by combining pre-trained models[73, 54] with manual human curation.

Architecture. (§4)

Utilizing the constructed video dataset, SALOVA learns to identify relevant video segments for the given queries within each video source and then auto-regressively predicts the next token. To do so, we present two architectural designs that seamlessly incorporate the retrieved segments in end-to-end training: the Spatio-Temporal Connector and the Segment Retrieval Router. By focusing on the relevant segments, our framework can perform deeper reasoning without being constrained by context length limitations. Additionally, we present the FocusFast approach, which intensively analyzes the selected segments for detailed comprehension (focus pathway) while quickly accessing overall contextual information through routing tokens obtained from the entire set of video segments (fast pathway). This strategy ensures that SALOVA maintains comprehensive video understanding while prioritizing details where they are most needed, effectively enhancing long and untrimmed video interpretation.

Through extensive experiments and analyses, we corroborate the competitive performance of SALOVA against existing video-LMMs in understanding complex long-form videos. Our results also show significant reductions in the loss of crucial visual information and a lower risk of omitting important events, demonstrating the effectiveness of our proposed method across various video benchmarks.

Our contributions are three-fold:

  • We introduce the SceneWalk dataset, a high-quality and densely-captioned video dataset with detailed segment-level descriptions from 87.8K long and untrimmed video sources. The proposed dataset provides rich context and scene continuity, enabling effective training for long-form video understanding.

  • We propose Segment-Augmented LOng Video Assistant (SALOVA), a novel video-LMM framework designed to enhance long video comprehension by targeting relevant video segments in lengthy videos, optimizing the model’s focus on essential segment targets for the given queries.

  • Through extensive evaluation, we validate that SALOVA improves overall long video understanding by effectively integrating relevant video segments, thus optimizing its handling of long and untrimmed video content.

Figure 1: (a) Video-Text Dataset Comparison; (b) SceneWalk Statistics; (c) Overall Pipeline for Data Collection.

2 Related Work

2.1 Large Multi-modal Models

After the emergence of LLMs[5, 55], which can actively interact with users through back-and-forth conversations, various research efforts[2, 31, 25] have taken the next leap by integrating different modalities into LLMs, utilizing their core reasoning and zero-shot capabilities. Building on open-sourced models[55, 13], seminal works[38, 65, 14] have bridged the image and text modalities via visual instruction tuning and presented multi-modal assistant models with visual perception and QA capabilities. Since then, numerous studies have been introduced to (i) enhance vision understanding with advanced architectures[17] or higher resolutions[37, 30], (ii) implement more sophisticated alignment layers[7, 42] between modalities, and (iii) train the models with more high-quality data and larger model parameters.

Recent focus has shifted towards more unified modality processing following the release of omnivorous models[46]. Some recent omni-versions of LMMs[64, 29] can handle combinatorial subsets of various modality sources, such as images, videos, audio, speech, and depth. However, current LMMs for video[35, 41, 32] still fail to capture the details necessary to effectively process video information due to their sparse frame sampling strategy. While such an approach is seemingly adequate for relatively short videos, it may fail to capture comprehensive spatio-temporal information, potentially compromising the accuracy of model responses to user queries. In this paper, SALOVA first retrieves relevant video segments and then concentrates on more granular video cues. Such targeted focus allows the model to effectively analyze complex content within the videos, significantly improving its ability to provide context-aware and accurate responses.

2.2 Long Video Understanding

In parallel, video-specialized models[23, 53, 12] integrated with LLMs have been widely explored to enhance video understanding and reasoning, and we introduce them in detail here. The most challenging aspect of current video-LMMs lies in handling long video sequences, mainly due to the limited context length of the LLMs. This limitation compels the models to sparsely sample only a limited number of video frames (e.g., typically 8 or 16 frames), potentially missing important spatial and temporal information. To address this, several studies have focused on compressing visual tokens into a more manageable size, proposing aggregation[41, 34] or pooling methods[35, 59] with advanced vision encoder structures[67, 73]. In addition, memory-augmented methods[23, 53] first store long-term information in a memory bank and then respond to specific queries by loading memory features from the stored buffer.

Among more recent approaches, Li et al.[70] have directly extended the LLMs' context length by exploiting RoPE-based frequency interpolation, and Xue et al.[61] have introduced sequence parallelism that can be implemented on multiple GPUs by modifying backend systems. However, we argue that such approaches inherently cannot escape the fixed context length and incur intensive memory demands when processing longer videos. Instead, by focusing on the relevant segments within the entire video, SALOVA can operate within the limited context length, enabling targeted processing of key moments without excessive memory consumption and thereby enhancing performance on longer video sequences.

3 SceneWalk Dataset

In this section, we elaborate on how we collected the SceneWalk dataset. The overall pipeline for building the dataset and summarized statistics are illustrated in Fig.1. While several video SFT datasets[41, 32] are widely used during the instruction tuning stage, they often fail to capture comprehensive details within the scenes. This stems from the nature of instruction-type questions, which provide only partial information, and the brief lengths of both videos and texts in QAs. In contrast, the SceneWalk dataset offers densely captioned video-text pairs that cover long videos in full detail, as shown in Fig.1(a). For further data statistics, please see Appendix A.

3.1 Data Gathering and Processing

Video Source & Filtering.

For the long and untrimmed video sources, we primarily focus on three key aspects to build a densely captioned video dataset: (i) extensive video length with diverse source categories, (ii) high-quality video content, and (iii) frequent scene transitions within each video. Accordingly, our data collection is mainly sourced from YouTube, ensuring rich dynamic content that better reflects the real-world complexities experienced by global users. Because our main goal for video gathering is complex scene understanding, we exclude low-quality and user-uploaded aesthetic videos (e.g., WebVid, Pixabay, Pexels, etc.) that are rather beneficial for video generation tasks, despite their ease of collection. We collected YouTube URLs from[27] and downloaded the whole videos in untrimmed form. Among the 32 coarse and diverse video categories provided by the YouTube API, we selectively curated 10 categories, excluding categories such as News & Politics, Classics, and Documentary due to their static nature, which provides only sparse temporal information. We further supplemented the dataset with additional Movie & Drama videos sourced from[53, 21], totaling 87,867 video sources with 11.87K hours of video (avg. 486.5 seconds).

Segmenting Video into Clips.

Next, for the collected long and untrimmed video sources, we cut the lengthy videos into small segments so that the entire video can be densely captioned in the next phase. Instead of adopting the bottom-up approach used in the ShareGPT4Video dataset[8], which segments videos into fixed time intervals (2 seconds) in advance and then merges adjacent frames based on their CLIP similarity[48], we directly employ PySceneDetect (we use AdaptiveDetector with the default setup; https://github.com/Breakthrough/PySceneDetect) to segment the videos, dynamically adjusting the threshold based on raw-level video information to reliably detect scene changes. In total, 1.29M video segments with an average length of 33.11 seconds are extracted from the original video sources.
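To make the segmentation step concrete, the following is a minimal sketch of clip extraction with PySceneDetect's AdaptiveDetector under default settings; the video path and printed output are illustrative only, not part of the released pipeline.

```python
# Minimal sketch: detect scene boundaries in one downloaded video with PySceneDetect.
from scenedetect import detect, AdaptiveDetector

def segment_video(video_path: str):
    """Return a list of (start_sec, end_sec) tuples for the detected scenes."""
    scene_list = detect(video_path, AdaptiveDetector())  # adaptive content-based detection
    return [(start.get_seconds(), end.get_seconds()) for start, end in scene_list]

if __name__ == "__main__":
    for i, (s, e) in enumerate(segment_video("example_youtube_video.mp4")):  # hypothetical file
        print(f"segment {i:04d}: {s:8.2f}s -> {e:8.2f}s ({e - s:.2f}s)")
```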

3.2 Captioning and Scoring

Dense Segment Captioning.

After obtaining the massive set of video segments, our next goal is to caption each segment with visual details and narrative context to capture scene-specific explanations, which enrich scene-level interpretation. To achieve this, we utilize pre-trained LMMs to generate detailed descriptions for the partial video segments. As the captioner, we empirically found that VILA-1.5 (13B)[36] shows more competent descriptive quality than other open-sourced models, and we use it to generate dense captions for each video segment with randomly sampled instructions for detailed description. As a result, we acquire 1.29M detailed descriptions corresponding to the video segments, each with an average length of 137.5 words. Please see instruction details and qualitative examples of generated captions in Appendix A.
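As an illustration of this captioning loop, the sketch below samples an instruction at random for each segment and calls a captioner; `vila_caption` is a stand-in for the actual VILA-1.5 (13B) inference call, and the instruction templates are examples rather than the exact prompts used (see Appendix A).

```python
# Illustrative sketch of the dense segment-captioning loop (captioner call is a placeholder).
import random

INSTRUCTIONS = [  # example templates; the exact prompts are given in Appendix A
    "Describe this video clip in detail, covering objects, actions, and setting.",
    "Provide a thorough narrative description of everything happening in this clip.",
    "Explain the visual content of this segment, including scene context and motion.",
]

def caption_segments(segment_paths, vila_caption):
    """segment_paths: list of clip files; vila_caption(path, prompt) -> str (assumed API)."""
    captions = []
    for path in segment_paths:
        prompt = random.choice(INSTRUCTIONS)   # randomly sampled instruction per segment
        captions.append(vila_caption(path, prompt))
    return captions
```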

Scoring Video-Text Correspondence.

Lastly, we score the correspondence between the video segments and their paired dense descriptions, which is later used as explicit supervision to robustly train our retrieval framework. What we must not overlook here is that the paired video-text relationship is not solely a one-to-one correspondence but is more akin to generalized bipartite matching: within a long and untrimmed video source, each video segment can be connected to other descriptions with additional edges. Therefore, for the $N_v$ video segments and their paired descriptions, we construct an $N_v \times N_v$ video-to-text (V2T) correspondence matrix. To measure each correspondence, we employ LanguageBind[73] due to its competitive alignment capabilities across various modalities. In addition, we build another $N_v \times N_v$ matrix that provides a doubly robust measure of the correspondence scores among adjacent descriptions (T2T) by comparing similarity within the textual context using the SBERT model[54].
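A hedged sketch of this scoring step is given below: the LanguageBind video and text embeddings are assumed to be precomputed per segment, and the SBERT checkpoint name is an assumption for illustration.

```python
# Sketch: build the N_v x N_v V2T and T2T correspondence matrices from per-segment embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

def correspondence_matrices(lb_video_emb, lb_text_emb, captions):
    """lb_video_emb, lb_text_emb: (N_v, D) LanguageBind embeddings of segments and captions
    (assumed precomputed); captions: list of N_v dense descriptions.
    Returns (v2t, t2t), each an (N_v, N_v) cosine-similarity matrix."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    v2t = l2norm(lb_video_emb) @ l2norm(lb_text_emb).T        # video-to-text scores

    sbert = SentenceTransformer("all-mpnet-base-v2")          # assumed SBERT checkpoint
    s = sbert.encode(captions, normalize_embeddings=True)
    t2t = s @ s.T                                             # text-to-text scores
    return v2t, t2t
```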

Figure 2: Overview of the SALOVA architecture: (a) overall pipeline and integration with the LLM, (b) Spatio-Temporal Connector, and (c) Segment Retrieval Router.

4 Segment-Augmented LOng Video Assistant

Network Overview.

For a given set of $N_v$ video segments sampled at 1 FPS, $v = \{v_i\}^{N_v}$, where each segment $v_i \in \mathbb{R}^{T_i \times H \times W \times C}$ has a varying length (summing to the total duration $T$ of the long and untrimmed video), SALOVA consists of four main architectural components, as illustrated in Fig.2:

  • Vision Encoder: We use CLIP-ViT-L-336px[48] to extract visual features, followed by 2×2 average pooling, resulting in 144 visual tokens for each frame.

  • Spatio-Temporal Connector: To handle spatio-temporal features of varying lengths from the vision encoder, we employ the Perceiver Resampler[2], which consists of a 2-layer Transformer followed by a 2-layer MLP with GELU activation as the projector. The resampler embeds each video segment's features into fixed-size latent features that are connected to the LLM.

  • Segment Retrieval Router: For the given textual queries, a retrieval structure (2-layer Transformer) gathers representative information (i.e., routing tokens) from each video segment and then routes the query-relevant video features into the LLMs. Note that the router architecture is trained in an end-to-end manner.

  • Large Language Model: We adopt three open-sourced LLMs of varying parameter sizes, LLaMA-3.2 (3B)[18], Phi-3.5 (3.8B)[1], and Qwen-2.5 (7B)[63], all of which are instruction-tuned models that possess QA assistant capabilities.
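How these components fit together can be summarized by the minimal sketch below; the module classes are placeholders for the components above, the shapes follow the descriptions in this section, and the interface names are illustrative assumptions rather than the actual implementation.

```python
# High-level sketch of the SALOVA forward pipeline (placeholder modules, assumed interfaces).
import torch
from torch import nn

class SALOVASketch(nn.Module):
    def __init__(self, vision_encoder, st_connector, sr_router, llm, top_k=5):
        super().__init__()
        self.vision_encoder = vision_encoder   # CLIP-ViT-L-336px + 2x2 average pooling
        self.st_connector = st_connector       # Perceiver Resampler + 2-layer MLP projector
        self.sr_router = sr_router             # 2-layer Transformer router
        self.llm = llm                         # e.g., LLaMA-3.2 / Phi-3.5 / Qwen-2.5 backbone
        self.top_k = top_k                     # number of retrieved segments (5 in Stage 2)

    def forward(self, segments, query_tokens, query_text_emb):
        # 1) Per-segment visual features -> fixed-size latents plus one routing token each.
        latents, routing = [], []
        for seg in segments:                           # seg: (T_i, H, W, C), frames at 1 FPS
            feats = self.vision_encoder(seg)           # (T_i * 144, d) flattened patch tokens
            z, r = self.st_connector(feats)            # z: (L, d_llm) latents, r: (D,) routing token
            latents.append(z)
            routing.append(r)
        routing = torch.stack(routing)                 # (N_v, D)

        # 2) Score all segments against the query and keep the top-K (focus pathway).
        scores = self.sr_router(routing, query_text_emb)          # (N_v,) relevance scores
        top_idx = scores.topk(min(self.top_k, len(segments))).indices
        focus = torch.cat([latents[i] for i in top_idx.sort().values.tolist()], dim=0)

        # 3) Fast pathway: routing tokens summarize the whole video; both pathways go to the LLM.
        return self.llm(visual_focus=focus, visual_fast=routing, text=query_tokens)
```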

4.1 Long Video Processing and Pipeline

4.1.1 Spatio-Temporal Connector

The first component of our model, the Spatio-Temporal Connector, efficiently handles long and variable-length input video segments by extracting each segment's visual semantics into a fixed-size latent vector. As illustrated in Fig.2(b), we first sample video frames at 1 FPS from each video segment; visual features are then acquired with 2×2 pooling (thus 144 tokens per frame). After that, the visual features are flattened and fed into the ST-Connector with additional positional and temporal encoding. When a long video is processed, the number of unfolded patch tokens becomes extremely large, leading to exhaustive computation. To address this, we employ a dynamic token drop technique to reduce the computational load.

Dynamic Token Drop.

To effectively manage long video sequences, token dropping has been utilized in video generation tasks[16, 40]. Expanding on this approach, in our framework the dropout rate is dynamically adjusted based on the length $T_i$ of the input visual feature $f_v$ of shape $T_i \times H_p W_p \times d$, which allows more efficient processing of longer sequences by reducing computational demands while still preserving dense visual semantics in shorter videos. Additionally, to retain spatio-temporal information from the dropped patches, we add positional embeddings separately along the spatial and temporal axes. This enables more refined extraction of spatio-temporal visual semantics even after reducing the number of tokens.
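The sketch below illustrates one way such length-dependent dropping could work; the linear rate schedule and the reference length are assumptions, chosen only to be consistent with the stage-specific maximum drop rates reported in Sec. 5.1 (e.g., up to 0.7 in Stage 1.5).

```python
# Hedged sketch of dynamic token drop: longer segments get a higher (random) drop rate.
import torch

def dynamic_token_drop(f_v: torch.Tensor, t_i: int, t_ref: int = 600,
                       max_drop: float = 0.7, training: bool = True) -> torch.Tensor:
    """f_v: (T_i * Hp * Wp, d) flattened patch tokens of one segment, with positional and
    temporal embeddings already added; t_i: segment length in frames.
    The reference length t_ref and the linear schedule are illustrative assumptions."""
    if not training:
        return f_v                                      # no token drop at inference
    rate = min(max_drop, max_drop * t_i / t_ref)        # drop rate grows with segment length
    n = f_v.size(0)
    keep = max(1, int(round(n * (1.0 - rate))))
    idx = torch.randperm(n, device=f_v.device)[:keep]   # random subset of tokens to keep
    return f_v[idx.sort().values]                       # preserve the original token order
```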

4.1.2 Segment Retrieval Router

Next, the key to conveying pertinent video information to the LLM is retrieving the video segments relevant to the query sentence. To densely cue the similarities between video and sentence information, we introduce a routing framework, the Segment Retrieval Router, which consists of a 2-layer Transformer as illustrated in Fig.2(c). After obtaining the routing tokens $R = \{R_i\}^{N_v} \in \mathbb{R}^{N_v \times D}$ from the entire set of video segments, we aggregate them and feed them into the SR-Router as queries. For the given sentence, we employ the same text encoder used for the vision encoder and project it into the shared embedding space to obtain sentence features $S \in \mathbb{R}^{N_t \times D}$, where $N_t$ denotes the textual length.

Using the cross-attention mechanism (queries: $R$; keys/values: $S$), we can estimate similarity scores between the video segments and the given sentence query (i.e., V-T similarity). These scores enable the SR-Router to prioritize and select the most relevant video segments that align with the sentence query.
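A minimal sketch of this router is shown below, treating the routing tokens as the target sequence and the sentence features as the cross-attention memory; the per-segment scoring head is our own assumption for reading out one relevance score per segment, since the exact read-out is not specified here.

```python
# Hedged sketch of the Segment Retrieval Router: 2-layer Transformer with cross-attention
# (queries: routing tokens R; keys/values: sentence features S), plus an assumed score head.
import torch
from torch import nn

class SRRouterSketch(nn.Module):
    def __init__(self, d_model: int, n_layers: int = 2, n_heads: int = 1):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        # A decoder layer provides self-attention over routing tokens and
        # cross-attention onto the sentence features (the "memory").
        self.blocks = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.score_head = nn.Linear(d_model, 1)   # assumed read-out of one V-T score per segment

    def forward(self, routing_tokens: torch.Tensor, sentence_feats: torch.Tensor) -> torch.Tensor:
        """routing_tokens: (B, N_v, D); sentence_feats: (B, N_t, D) -> scores: (B, N_v)."""
        attended = self.blocks(tgt=routing_tokens, memory=sentence_feats)
        return self.score_head(attended).squeeze(-1)
```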

Retrieval Objective.

To seamlessly train the SR-Router together with the main SALOVA pipeline in an end-to-end manner, we design a similarity loss function $\mathcal{L}_{\text{sim}}$ that minimizes the distance between the high-dimensional embeddings of the video segments and the sentence queries. Here, we use the correspondence scores (described in Sec.3.2), after one-hot encoding, as the retrieval supervision signal $y_i$. We incorporate a simple margin-based loss, commonly used in contrastive learning settings[33], which enables the model to learn off-diagonal relaxation in the correspondence matrices between video segments and sentences. As mentioned earlier, the relationship between paired videos and sentences is closer to generalized bipartite matching than to one-to-one matching, so relaxed learning helps accommodate the inherent complexity in aligning correspondences. In conclusion, with a binary cross-entropy loss and a score margin loss, we formulate the similarity loss as follows:

$$
\mathcal{L}_{\text{sim}} = \underbrace{\mathcal{L}_{\text{bce}}(y_i, s_i)_{i=1}^{N_v}}_{\text{point-wise CE}} + \underbrace{\frac{1}{N_s}\sum_{j}\max\bigl(0,\ \delta - (s^{p}_{j} - s^{n}_{j})\bigr)}_{\text{score margin loss}}, \qquad (1)
$$

where $s^{p}_{j}$ and $s^{n}_{j}$ indicate randomly sampled scores from positive and negative pairs, respectively, and $\delta$ denotes the margin parameter (set to 0.2). Note that the similarity loss is trained in conjunction with the auto-regressive loss $\mathcal{L}_{\text{ar}}$ from the subsequent LLM in an end-to-end manner.
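For concreteness, a hedged sketch of Eq. (1) in PyTorch is given below; treating the point-wise term as binary cross-entropy over per-segment scores and sampling positive/negative pairs uniformly are our assumptions about details not fully specified here.

```python
# Sketch of the similarity loss: point-wise BCE plus a margin term over sampled score pairs.
import torch
import torch.nn.functional as F

def similarity_loss(scores: torch.Tensor, targets: torch.Tensor,
                    delta: float = 0.2, n_pairs: int = 16) -> torch.Tensor:
    """scores: (N_v,) predicted V-T similarity logits for one video;
    targets: (N_v,) one-hot (or soft) correspondence supervision y_i."""
    bce = F.binary_cross_entropy_with_logits(scores, targets.float())   # point-wise CE term

    pos_idx = (targets > 0.5).nonzero(as_tuple=True)[0]
    neg_idx = (targets <= 0.5).nonzero(as_tuple=True)[0]
    if len(pos_idx) == 0 or len(neg_idx) == 0:
        return bce                                                      # margin term undefined
    p = scores[pos_idx[torch.randint(len(pos_idx), (n_pairs,))]]        # sampled positive scores
    n = scores[neg_idx[torch.randint(len(neg_idx), (n_pairs,))]]        # sampled negative scores
    margin = F.relu(delta - (p - n)).mean()                             # score margin loss
    return bce + margin
```

In training, this term would simply be added to the LLM's auto-regressive loss, matching the joint end-to-end objective described above.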

4.1.3 FocusFast Pathways: Integration to LLMs

Using the routing tokens, we can calculate the similarity of each video segment to the given query. Leveraging these similarities, SALOVA efficiently retrieves the specific video features that exhibit the highest relevance scores to the textual query, and the indexed video features are then directly integrated into the LLM architecture. Here, extending the SlowFast pathways concept[19], we present the FocusFast mechanism to effectively manage the processing pathways for the retrieved video segments: (i) the Focus pathway concatenates the top-K most pertinent features to construct a comprehensive video representation, capturing local details across the retrieved segments and enabling detailed interactions with textual queries to better handle complex video information; (ii) the Fast pathway attends to the broader-level context by employing segment-wide routing tokens as a condensed global representation. It effectively captures dynamic spatio-temporal changes throughout the video stream, allowing SALOVA to understand the overall video content and maintain scene-level continuity awareness.

Once the most pertinent features are retrieved, they are delivered to the LM backbone for final processing as in Fig.2(a), integrating video-specific details into the model's responses. By effectively handling long and untrimmed videos with the proposed retrieval and routing mechanism, SALOVA maintains the flow of salient information without processing overhead for less related data, thus generating more context-aware responses.

4.2 Training Strategies

The current training strategies for LMMs predominantly consist of two steps: (i) cross-modal alignment and (ii) visual instruction tuning. Recently, Li et al.[29] have emphasized the importance of high-quality knowledge learning between the two training stages (thus, stage 1.5), pointing out that models cannot sufficiently learn the necessary knowledge during alignment with low-quality web-scale image-text data. Similar to their approach of using rephrased descriptions for additional knowledge learning[29], we employ the newly collected SceneWalk dataset as a parametric knowledge injection step, which enables SALOVA to learn detailed spatial and temporal representations from long-sequence video data before instruction tuning.

Accordingly, our training recipe and data configuration are divided into three steps as follows (please see the training details in Appendix B):

Stage 1: Cross-modality Alignment.

For the initial modality alignment step, we utilize 790K image/video-text pairs: (i) 558K image-text pairs from the CC3M dataset[52], filtered by LLaVA[38], and (ii) video-text pairs sampled from the WebVid 2.5M subset[4]. We freeze the vision encoder and LLM during training and mainly focus on optimizing the connector and router to map visual information into the textual space.

Stage 1.5: Long Video Knowledge Injection.

As an intermediate training step, we use the SceneWalk dataset to train SALOVA, unfreezing all trainable parameters except the vision encoder. During training, we input the long and untrimmed video instances and follow the processing pipeline shown in Fig.2. By training the model with densely captioned video-description pairs, it acquires high-quality parametric knowledge of both spatial and temporal information. In addition, through the aforementioned retrieval process, the model learns to target the video segments most relevant to the video description.

Stage 2: Video Instruction Tuning.

To equip SALOVA with QA capabilities, we use extensive video instruction-tuning data as the final training step. The instruction data are mainly sourced from four datasets: LLaVA-Video-178K[71], NeXT-QA[58], ActivityNetQA[66], and PerceptionTest[47], comprising a total of 1.4M video-instruction QA samples, including caption entries, open-ended QA, and multiple-choice QA. Note that we train all network parameters during this stage and auto-regressively update the instruction-following assistant's responses via next-token prediction.

5 Experiments

5.1 Experimental Details

Implementation.

For the vision encoder and the text encoder of the SR-Router, we utilize the CLIP-ViT-L model[48] with a resolution of 336. We employ a 2-layer Transformer with 2 heads for the ST-Connector, which has a latent dimension of 256. The token drop mechanism is dynamically applied according to video length, with varying maximum drop rates per training stage: Stage 1 uses no token drop, Stage 1.5 up to 0.7, and Stage 2 up to 0.4. For the SR-Router, we use a 2-layer Transformer with a single head, and the top-K number is set to 5 during Stage 2. Following[37, 30], the projector consists of a 2-layer MLP with GELU. Our LLM backbones are (i) 3B: Llama-3.2-3B[18], (ii) 3.8B: Phi-3.5-mini[1], and (iii) 7B: Qwen2.5-7B[63].
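For reference, the reported settings can be summarized in a small configuration sketch; the field names are illustrative (not from the actual codebase), while the values follow this subsection.

```python
# Hedged configuration summary of the implementation settings reported above.
from dataclasses import dataclass, field

@dataclass
class SALOVAConfig:
    vision_encoder: str = "CLIP-ViT-L-336px"
    image_resolution: int = 336
    st_connector_layers: int = 2              # Perceiver Resampler depth
    st_connector_heads: int = 2
    st_connector_latent_dim: int = 256
    projector: str = "2-layer MLP + GELU"
    sr_router_layers: int = 2
    sr_router_heads: int = 1
    top_k_segments: int = 5                   # used during Stage 2
    max_token_drop: dict = field(default_factory=lambda: {
        "stage1": 0.0, "stage1_5": 0.7, "stage2": 0.4})
    llm_backbone: str = "Llama-3.2-3B"        # or "Phi-3.5-mini" / "Qwen2.5-7B"
```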

Training Details.

For each training stage, we train SALOVA for 1 epoch on a single node of 8 A100 GPUs. The total training time is roughly 5 days for the 3B/3.8B models and 7 days for the 7B model. We employ FlashAttention-2[15], gradient checkpointing[9], and ZeRO-2[49] to minimize the memory footprint of the model components (i.e., gradients, activations, and optimizer states). Additionally, we fine-tune the trainable parameters at each step without employing LoRA[24]. The extended training configuration is detailed in Appendix C.

Evaluation Benchmarks.

We evaluate our model on two types of video analysis benchmarks, categorized by video length: long video understanding and general video understanding. For the long video benchmarks, we primarily utilize Video-MME[20] and LongVideoBench[57], both of which include videos up to two hours in duration. For general video analysis, we employ benchmarks such as ActivityNetQA[66], VideoChatGPT[41], and MVBench[32]. Note that the same pipeline is used to obtain video segments for each benchmark, and all benchmarks are sampled at 1 FPS without token drop during inference. As comparison baselines, considering academic budget constraints, we evaluate against models of similar parameter size.

Table 1: Results on Video-MME (Short / Medium / Long / Overall, w/o subtitles) and LongVideoBench (LVBench, val accuracy).

Model | #param | Short | Medium | Long | Overall | LVBench Acc. (val)
Proprietary LMMs
GPT-4V[45] | n/a | 70.5 | 55.8 | 53.5 | 59.9 | -
GPT-4o[46] | n/a | 80.0 | 70.3 | 65.3 | 71.9 | 66.7
Gemini 1.5 Pro[50] | n/a | 81.7 | 74.3 | 67.4 | 75.0 | 64.0
Open-sourced LMMs
ST-LLM[39] | 7B | 45.7 | 36.8 | 31.3 | 37.9 | -
VideoChat2[32] | 7B | 48.3 | 37.0 | 33.2 | 39.5 | 39.3
ShareGPT4Video[8] | 8B | 48.3 | 36.3 | 35.0 | 39.9 | 39.7
Video-LLaVA[35] | 7B | 45.3 | 38.0 | 36.2 | 39.9 | 39.1
Chat-UniVi-V1.5[26] | 7B | 45.7 | 40.3 | 35.8 | 40.6 | -
Qwen-VL-Chat[3] | 7B | 46.9 | 38.7 | 37.8 | 41.1 | -
ShareGemini[51] | 7B | 49.1 | 41.3 | 39.1 | 43.2 | -
SliME[72] | 8B | 53.3 | 42.7 | 39.8 | 45.3 | -
PLLaVA[59] | 7B | - | - | - | - | 40.2
VideoLLaMA2[12] | 8B | 56.0 | 45.4 | 42.1 | 47.9 | -
Ours
SALOVA-Llama | 3B | 48.3 | 46.3 | 41.1 | 45.3 | 41.4
SALOVA-Phi | 3.8B | 47.1 | 48.8 | 44.1 | 46.7 | 41.6
SALOVA-Qwen | 7B | 52.3 | 50.9 | 46.8 | 50.0 | 43.5

5.2 Experimental Results

Results on Long Video Understanding.

Video-MME[20] evaluates LMMs with a focus on video analysis across a variety of video types and durations. We primarily compare results in the setting without subtitles, relying solely on video frames; this rigorously assesses the LMMs' visual comprehension capabilities based purely on visual content. LongVideoBench[57] is designed to assess LMMs' understanding of long-duration videos up to two hours. It includes a diverse collection of videos, challenging the models' ability to process and interpret extensive visual and contextual information across a variety of themes. As shown in Tab.1, our model shows competent video understanding performance across all video length distributions in Video-MME and on the lengthy video instances in LongVideoBench. Notably, SALOVA achieves significant performance in the medium (average 562.7 seconds) and long (average 2385.8 seconds) categories of Video-MME, even with smaller backbone LM parameter sizes than the baseline models.

Table 2: Results on general video understanding benchmarks: ActivityNetQA (test accuracy / score), VideoChatGPT (test score), and MVBench (test accuracy).

Model | #param | ActivityNetQA (acc / score) | VideoChatGPT (score) | MVBench (acc)
Proprietary LMMs
GPT-4V[45] | n/a | 57.0 / - | 4.06 | 43.5
GPT-4o[46] | n/a | 61.9 / - | - | -
Gemini 1.5 Pro[50] | n/a | 57.5 / - | - | -
Open-sourced LMMs
VideoLLaMA[68] | 7B | 12.4 / 1.1 | 2.16 | 34.1
VideoChatGPT[41] | 7B | 35.2 / 2.7 | 2.42 | 32.7
MovieChat[53] | 7B | 45.7 / - | 2.67 | -
Chat-UniVi[26] | 7B | 46.1 / 3.2 | 2.99 | -
LLaMA-VID[34] | 7B | 47.4 / 3.3 | 2.89 | 41.3
VideoChat2[32] | 7B | 49.1 / 3.3 | 2.98 | 51.1
VideoLLaMA2[12] | 8B | 50.2 / 3.3 | 3.13 | 54.6
Ours
SALOVA-Llama | 3B | 52.6 / 3.4 | 3.08 | 51.7
SALOVA-Phi | 3.8B | 51.1 / 3.5 | 2.83 | 46.4
SALOVA-Qwen | 7B | 55.6 / 3.6 | 3.09 | 52.6

Such performance gains on long video instances are attributed to our model's dynamic capability to retrieve and process only the relevant video segments, enabling it to handle lengthy video content efficiently without being constrained by the limited context length. In particular, the routing mechanism in SALOVA strategically prioritizes video segments that are likely to contain crucial visual and contextual cues relevant to the query. This selective routing reduces the computational load and minimizes the information loss that commonly occurs when current video-LMMs try to process extensive video data in its entirety.

Results on General Video Understanding.

Using benchmarks such as ActivityNetQA[66], VideoChatGPT[41], and MVBench[32], SALOVA was evaluated across various video types to assess its general video understanding capabilities. As shown in Tab.2, SALOVA demonstrated competent performance, comparable to existing video-LMMs, especially in dynamic and shorter video sequences. On ActivityNetQA, the model effectively utilized its segment retrieval strategy to provide focused and contextually appropriate responses, which helped maintain accuracy. This approach was similarly effective in the multi-modal settings of VideoChatGPT and MVBench, where SALOVA showed consistent performance in handling dialogues and visual cues. These outcomes highlight SALOVA’s capability to process general video content efficiently through its dynamic routing mechanism, offering a reliable solution that balances computational resources with output quality.

Table 3: Ablation studies on Video-MME (Short: ≤2 min, Mid: 4-15 min, Long: 30-60 min).

Ablation | Short | Mid | Long | Overall
Video frame sampling (w/o SR-Router)
8 frm | 48.3 | 42.0 | 37.2 | 42.5
16 frm | 50.0 | 42.8 | 38.0 | 43.6
1 fps | 48.3 | 46.3 | 41.1 | 45.3
Training stages 1 / 1.5 / 2 (long video knowledge injection)
✓ ✗ ✓ | 45.6 | 43.7 | 40.2 | 43.6
✓ ✓ ✓ | 48.3 | 46.3 | 41.1 | 45.3
FocusFast (local-global video representation)
w/o FocusFast | 36.4 | 38.6 | 35.6 | 36.9
w/ FocusFast | 48.3 | 46.3 | 41.1 | 45.3

5.3 Additional Analyses on SALOVA

Ablation Study.

We conduct ablation studies on three components as follows: (i) different video frame sampling strategies, (ii) intermediate training stage for long video knowledge injection, and (iii) the FocusFast mechanism to understand branched local-global representation in videos.

As shown in Table 3, we first observe that using more frames significantly enhances performance, particularly in understanding long-form videos. This aligns with our key insights on managing long videos, suggesting that a higher frame count can provide more spatio-temporal information and improve the model’s response without losing contextual information within the video. Additionally, we compare with a baseline trained with stage 1-2 (skipping stage 1.5). Here, we highlight the effectiveness of the SceneWalk dataset as an intermediate training step to enhance parametric knowledge for the long video analysis by allowing the model to learn from high-quality and densely captioned scene-level information, which is crucial for adapting to various lengths and contexts. Lastly, we conduct an analysis on the FocusFast method and demonstrate its efficacy in analyzing not only local details from relevant video segments but also in understanding the global video context through the simultaneous use of routing tokens, thereby facilitating a more comprehensive understanding of video content.

Analysis of Retrieving Segments.

By retrieving relevant video segments for the given queries, SALOVA can effectively target salient information in the long video and retain long context information. To further demonstrate the model’s targeting capabilities beyond numerical performance in long video analysis, we explore our model’s application in the Visual Needle-In-A-Haystack (V-NIAH) task[70], which extends the Needle-in-a-Haystack (NIAH) evaluation for LLMs to a vision-level benchmark. This task is particularly challenging as it requires models to not only detect but also precisely retrieve the sparse yet crucial visual cues scattered across lengthy videos.

As shown in Fig.3, we compare our model to a baseline trained on sparsely sampled frames (16 frm, without SR-Router). Our framework effectively identifies and extracts relevant video segments from densely packed content, even when handling long context lengths. These results highlight SALOVA’s robustness in managing complex, long-form videos, maintaining contextual continuity and relevance by strategically focusing on critical segments in response to user queries.

Figure 3: Visual Needle-In-A-Haystack (V-NIAH) results. (a) SALOVA-Llama-3B (16 frm sample); (b) SALOVA-Llama-3B (1 fps sample).

6 Discussion and Conclusion

Discussion.

Despite SALOVA's competence in handling extended video sequences, it is important to recognize scenarios where its more complex architecture may not be necessary. Specifically, for shorter videos where sparse sampling suffices to capture the essential spatio-temporal information, simpler models could potentially be more efficient than SALOVA without requiring its extensive processing capabilities. This suggests a future avenue for a hybrid approach built on our framework that dynamically adjusts the complexity of the retrieval and processing mechanisms based on video length and content density.

Conclusion.

In this paper, we introduce SALOVA, a novel framework designed to enhance the comprehension of long and untrimmed videos by leveraging a retrieval-driven approach together with a new densely captioned dataset, SceneWalk. SALOVA strategically targets and processes only the relevant video segments, effectively addressing the structural limitations of current video-LMMs with its Spatio-Temporal Connector and Segment Retrieval Router. Through extensive evaluation on various benchmarks, SALOVA exhibits robust performance in interpreting complex video content, enhancing efficiency, and improving the understanding of extended videos.

References

  • Abdin etal. [2024]Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, AmmarAhmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, etal.Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219, 2024.
  • Alayrac etal. [2022]Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, etal.Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • Bai etal. [2023]Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou.Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 1(2):3, 2023.
  • Bain etal. [2021]Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman.Frozen in time: A joint video and image encoder for end-to-end retrieval.In Proceedings of the IEEE/CVF international conference on computer vision, pages 1728–1738, 2021.
  • Brown etal. [2020]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, etal.Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020.
  • CabaHeilbron etal. [2015]Fabian CabaHeilbron, Victor Escorcia, Bernard Ghanem, and Juan CarlosNiebles.Activitynet: A large-scale video benchmark for human activity understanding.In Proceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015.
  • Cha etal. [2023]Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh.Honeybee: Locality-enhanced projector for multimodal llm.arXiv preprint arXiv:2312.06742, 2023.
  • Chen etal. [2024a]Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, etal.Sharegpt4video: Improving video understanding and generation with better captions.arXiv preprint arXiv:2406.04325, 2024a.
  • Chen etal. [2016]Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin.Training deep nets with sublinear memory cost.arXiv preprint arXiv:1604.06174, 2016.
  • Chen etal. [2024b]Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, ByungEun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, etal.Panda-70m: Captioning 70m videos with multiple cross-modality teachers.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024b.
  • Chen etal. [2023]Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, etal.Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.arXiv preprint arXiv:2312.14238, 2023.
  • Cheng etal. [2024]Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, etal.Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024.
  • Chiang etal. [2023]Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, JosephE. Gonzalez, Ion Stoica, and EricP. Xing.Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
  • Dai etal. [2023]Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi.InstructBLIP: Towards general-purpose vision-language models with instruction tuning.In Advances in Neural Information Processing Systems, 2023.
  • Dao [2023]Tri Dao.Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023.
  • Dehghani etal. [2024]Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, IbrahimM Alabdulmohsin, etal.Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36, 2024.
  • Dong etal. [2024]Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, etal.Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420, 2024.
  • Dubey etal. [2024]Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, etal.The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024.
  • Feichtenhofer etal. [2019]Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He.Slowfast networks for video recognition.In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.
  • Fu etal. [2024]Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, etal.Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024.
  • Ghermi etal. [2024]Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, and Ivan Laptev.Short film dataset (sfd): A benchmark for story-level video understanding.arXiv preprint arXiv:2406.10221, 2024.
  • Google [2023]Google.Gemini, 2023.
  • He etal. [2024]Bo He, Hengduo Li, YoungKyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim.Ma-lmm: Memory-augmented large multimodal model for long-term video understanding.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504–13514, 2024.
  • Hu etal. [2021]EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021.
  • Huang etal. [2024]Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, OwaisKhan Mohammed, Barun Patra, etal.Language is not all you need: Aligning perception with language models.Advances in Neural Information Processing Systems, 36, 2024.
  • Jin etal. [2024]Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan.Chat-univi: Unified visual representation empowers large language models with image and video understanding.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024.
  • Ju etal. [2024]Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan.Miradata: A large-scale video dataset with long durations and structured captions.arXiv preprint arXiv:2407.06358, 2024.
  • Lewis etal. [2020]Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, etal.Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Li etal. [2024a]Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li.Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a.
  • Li etal. [2024b]Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li.Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models.arXiv preprint arXiv:2407.07895, 2024b.
  • Li etal. [2023]Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.In International Conference on Machine Learning. PMLR, 2023.
  • Li etal. [2024c]Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, etal.Mvbench: A comprehensive multi-modal video understanding benchmark.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024c.
  • Li etal. [2024d]Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, and Yongdong Zhang.Momentdiff: Generative video moment retrieval from random to real.Advances in neural information processing systems, 36, 2024d.
  • Li etal. [2025]Yanwei Li, Chengyao Wang, and Jiaya Jia.Llama-vid: An image is worth 2 tokens in large language models.In European Conference on Computer Vision, pages 323–340. Springer, 2025.
  • Lin etal. [2023]Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan.Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023.
  • Lin etal. [2024]Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han.Vila: On pre-training for visual language models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024.
  • Liu etal. [2023a]Haotian Liu, Chunyuan Li, Yuheng Li, and YongJae Lee.Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023a.
  • Liu etal. [2023b]Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee.Visual instruction tuning.In Advances in Neural Information Processing Systems, 2023b.
  • Liu etal. [2025]Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li.St-llm: Large language models are effective temporal learners.In European Conference on Computer Vision, pages 1–18. Springer, 2025.
  • Liu etal. [2024]Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, etal.Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024.
  • Maaz etal. [2023]Muhammad Maaz, Hanoona Rasheed, Salman Khan, and FahadShahbaz Khan.Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023.
  • McKinzie etal. [2024]Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, etal.Mm1: Methods, analysis & insights from multimodal llm pre-training.arXiv preprint arXiv:2403.09611, 2024.
  • OpenAI [2023a]OpenAI.ChatGPT.https://openai.com/blog/chatgpt/, 2023a.
  • OpenAI [2023b]OpenAI.Gpt-4 technical report, 2023b.
  • OpenAI [2023c]OpenAI.GPT-4V(ision) System Card, 2023c.
  • OpenAI [2024]OpenAI.Hello gpt-4o, 2024.
  • Patraucean etal. [2024]Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, etal.Perception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36, 2024.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Rajbhandari et al. [2020] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
  • Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  • Share [2024] Share. ShareGemini: Scaling up video caption data for multimodal large language models, 2024.
  • Sharma et al. [2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • Song et al. [2024] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. MovieChat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.
  • Thakur et al. [2021] Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 296–310, Online, 2021. Association for Computational Linguistics.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Wang et al. [2023] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
  • Wu et al. [2024] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024.
  • Xiao et al. [2021] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9777–9786, 2021.
  • Xu et al. [2024a] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See-Kiong Ng, and Jiashi Feng. PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024a.
  • Xu et al. [2024b] Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. SlowFast-LLaVA: A strong training-free baseline for video large language models. arXiv preprint arXiv:2407.15841, 2024b.
  • Xue et al. [2024] Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. LongVILA: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024.
  • Xue et al. [2022] Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022.
  • Yang et al. [2024] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
  • Ye et al. [2024] Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024.
  • Ye et al. [2023] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  • Yu et al. [2019] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019.
  • Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  • Zhang et al. [2023] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
  • Zhang et al. [2024a] Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024a.
  • Zhang et al. [2024b] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024b.
  • Zhang et al. [2024c] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024c.
  • Zhang et al. [2024d] Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, and Rong Jin. Beyond LLaVA-HD: Diving into high-resolution large multimodal models. arXiv preprint arXiv:2406.08487, 2024d.
  • [73] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, WANG HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment. In The Twelfth International Conference on Learning Representations, 2024.


Supplementary Material

Appendix A Details of SceneWalk Dataset

A.1 Detailed Data Statistics

We provide a comprehensive analysis of the proposed SceneWalk dataset, focusing on detailed data statistics, including video duration, categorical distribution, and segment-level descriptions. These statistics underscore the versatility and diversity of the dataset and its suitability for training our video-LLM.

Dataset Composition.

The SceneWalk dataset comprises 87,867 long-form video sources spanning a total of 11.87K hours (average video duration: 486.5 seconds). The videos are collected from a curated selection of 10 diverse categories, as shown in Fig. 1, sourced primarily from YouTube with additional contributions from movie and drama datasets [53, 21]. This composition covers a wide range of real-world scenarios and avoids over-representing static content categories.

Video Duration Distribution.

The collected videos are split into three duration ranges to analyze temporal diversity: (i) 0–240 seconds (short), about 24.4% of the videos; (ii) 240–600 seconds (long), the largest share at approximately 46.1%; and (iii) 600–2,280 seconds (extreme-long), around 29.5% of the dataset. The duration distribution for each video category is illustrated in Fig. 1(b) (outer circle), and more detailed duration distributions are provided in Fig. 4.

A.2 Pipeline for Dense Captioning

Splitting into Video Segments.

To divide the untrimmed, long video sources into 1.29M video segments, we directly utilize PySceneDetect with the AdaptiveDetector at its default adaptive threshold (3.0), which compares the content change between adjacent frames against a rolling average of recent frame changes. This helps mitigate false detections in situations such as fast camera motion.
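As a reference, below is a minimal sketch of this segmentation step using the PySceneDetect Python API; the file path is a placeholder, and only the detector choice and threshold follow the setting described above.

```python
from scenedetect import open_video, SceneManager
from scenedetect.detectors import AdaptiveDetector

def split_into_segments(video_path: str):
    """Detect scene boundaries with the AdaptiveDetector (default threshold 3.0)."""
    video = open_video(video_path)
    scene_manager = SceneManager()
    scene_manager.add_detector(AdaptiveDetector(adaptive_threshold=3.0))
    scene_manager.detect_scenes(video)
    # Each entry is a (start, end) pair of FrameTimecode objects.
    return scene_manager.get_scene_list()

if __name__ == "__main__":
    for start, end in split_into_segments("example_long_video.mp4"):  # hypothetical file
        print(f"segment: {start.get_timecode()} -> {end.get_timecode()}")
```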

Instructions for Dense Segment Captioning.

To generate detailed descriptions for each video segment obtained from the above process, we mainly use a pre-trained LMM (VILA-1.5-13B [36]). Tab. 4 below lists the instructions used to generate these captions; for each segment, we randomly select one instruction from the list and use it as the query for the model.

Captioning and Scoring.

Each segment is densely captioned, producing highly detailed textual descriptions that average 137.5 words per segment (see the densely captioned video examples in Fig. 6 and Fig. 7). To ensure alignment quality, a generalized bipartite matching framework is employed: (i) Video-to-Text (V2T) correspondence: a similarity matrix evaluates the alignment between video segments and their paired captions using LanguageBind [73], and (ii) Text-to-Text (T2T) context similarity: the textual coherence among adjacent captions is assessed using SBERT [54], enhancing overall alignment robustness.
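For illustration, a minimal sketch of the T2T part of this scoring is shown below, assuming the sentence-transformers library with a generic SBERT-style checkpoint (the exact encoder used in the paper is not specified here); the captions are placeholders, and the LanguageBind-based V2T scoring is omitted.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder captions standing in for adjacent segment descriptions.
captions = [
    "A man walks onto a dimly lit stage holding a microphone.",
    "The same man addresses the audience as spotlights turn on.",
    "The camera cuts to the audience applauding.",
]

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # generic SBERT-style encoder (assumption)
embeddings = sbert.encode(captions, convert_to_tensor=True)

# Text-to-Text context similarity matrix S_T2T (cosine similarities between captions).
S_t2t = util.cos_sim(embeddings, embeddings)
print(S_t2t)
```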

Supervision from Correspondence Scores.

To derive the supervision signal $y_i$ for training, we leverage the correspondence scores $S_{\text{V2T}}$ (Video-to-Text) and $S_{\text{T2T}}$ (Text-to-Text), as discussed in Sec. 4. For each correspondence score matrix, we apply thresholding to extract meaningful relationships. Specifically, we define thresholds $\tau_{\text{V2T}}$ and $\tau_{\text{T2T}}$ for the two matrices ($\tau_{\text{V2T}}{=}0.18$ and $\tau_{\text{T2T}}{=}0.8$, respectively), and elements with scores exceeding these thresholds are treated as positive correspondences. These positive correspondences are then one-hot encoded to form binary matrices $Y_{\text{V2T}}$ and $Y_{\text{T2T}}$, where each element indicates whether a specific correspondence is valid. Finally, we compute the union of these binary matrices to produce the final supervision signal, i.e., $y_i = Y_{\text{V2T}} \cup Y_{\text{T2T}}$. The union operation ensures that any correspondence deemed valid by either of the two criteria contributes to the final supervision. This approach captures both the multi-modal alignment (Video-to-Text) and intra-modal coherence (Text-to-Text), providing a robust supervision signal for the retrieval task.

The resulting $y_i$ is then incorporated into the similarity loss $\mathcal{L}_{\text{sim}}$ described in Eq. 1, ensuring that the model effectively learns the nuanced relationships between video segments and their corresponding textual descriptions. By combining $S_{\text{V2T}}$ and $S_{\text{T2T}}$ in this manner, we account for the complexity of generalized bipartite matching and enhance the model's ability to align correspondences across and within modalities.
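The following is a minimal NumPy sketch of how such a supervision signal could be assembled from the two score matrices; the array shapes and function name are illustrative, while the thresholds and the union rule follow the description above.

```python
import numpy as np

def build_supervision(S_v2t: np.ndarray, S_t2t: np.ndarray,
                      tau_v2t: float = 0.18, tau_t2t: float = 0.8) -> np.ndarray:
    """Binary supervision from V2T and T2T correspondence scores.

    S_v2t, S_t2t: correspondence score matrices of identical shape.
    Entries above the respective threshold become positive correspondences,
    and the union of the two binary matrices is the final supervision signal.
    """
    Y_v2t = S_v2t > tau_v2t          # positive Video-to-Text correspondences
    Y_t2t = S_t2t > tau_t2t          # positive Text-to-Text correspondences
    return np.logical_or(Y_v2t, Y_t2t).astype(np.float32)  # y = Y_V2T ∪ Y_T2T

# Toy example with random scores for illustration.
rng = np.random.default_rng(0)
y = build_supervision(rng.random((4, 4)) * 0.3, rng.random((4, 4)))
print(y)
```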


A.3 Word Cloud Analysis

The Word Cloud visualization in Fig. 5 highlights the richness and diversity of visual cues captured within the SceneWalk dataset. Prominent keywords such as man, woman, person, and group reflect the dataset's strong emphasis on human-centric descriptions that capture the presence, actions, and interactions of individuals within a scene. These terms indicate that the dataset treats people as central subjects, providing rich context about their appearance, activities, and relationships with the surrounding environment.

Furthermore, the inclusion of descriptive spatial and contextual terms (e.g., stage, floor, light, and tree) illustrates how the dataset also captures environmental details alongside subject interactions. This level of granularity ensures that the visual-textual mappings are comprehensive, enabling the dataset to serve as a robust resource for training models that require an in-depth understanding of scene composition and narrative continuity.

By focusing on such fine-grained visual details, the SceneWalk dataset provides comprehensive scene descriptions, encapsulating nuanced visual content that is critical for multi-modal tasks. The highlighted terms reflect not only the dataset's diversity but also its deliberate emphasis on actionable visual semantics, making it particularly valuable for the intermediate training step proposed in Sec. 4.2, as it enables models to effectively learn and represent long-video knowledge, including scene comprehension and nuanced understanding.

Appendix B Training Details of SALOVA

Training Config.

In this section, we elaborate on the training process of SALOVA. All SALOVA variants are trained with unified settings, although per-device batch sizes differ slightly due to hardware limitations; to equalize the global batch size across variants, we use gradient accumulation, which keeps the training schedule consistent for each variant. The detailed training configuration for each stage is given in Tab. 5, which balances available GPU memory against batch size and ensures efficient training dynamics under limited hardware resources.

Tab. 5. Training configuration of SALOVA for each stage.

| config | Stage 1 | Stage 1.5 | Stage 2 |
| --- | --- | --- | --- |
| input modality | image, video | video | video |
| input frame | 1 FPS | 1 FPS | 1 FPS |
| input resolution | 336 × 336 | 336 × 336 | 336 × 336 |
| optimizer | AdamW (β1, β2 = 0.9, 0.999) | AdamW (β1, β2 = 0.9, 0.999) | AdamW (β1, β2 = 0.9, 0.999) |
| lr schedule | cosine decay | cosine decay | cosine decay |
| training precision | BFloat16 | BFloat16 | BFloat16 |
| DeepSpeed train | ZeRO-2 | ZeRO-2 | ZeRO-2 |
| warmup epochs | 0.03 | 0.03 | 0.03 |
| trainable params | connectors | full | full |
| lr (vision, text) | – | 2e-6 | 2e-6 |
| lr (LLM, others) | 1e-3 | 2e-5 | 2e-5 |
| global batch size | 256 | 8 | 64 |
| total epochs | 1 | 1 | 1 |
| max token drop | 0.0 | 0.7 | 0.4 |
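To make the batch-size equalization mentioned above concrete, the following small helper illustrates how the number of gradient-accumulation steps could be derived; the per-device batch size and GPU count in the example are placeholders, not values from the paper.

```python
def grad_accum_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Accumulation steps so that per_device_batch * num_gpus * steps == global_batch."""
    effective = per_device_batch * num_gpus
    assert global_batch % effective == 0, "global batch must be divisible"
    return global_batch // effective

# Example: a global batch of 256 with 8 GPUs and a per-device batch of 4 (placeholders).
print(grad_accum_steps(256, per_device_batch=4, num_gpus=8))  # -> 8
```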

Appendix C Architecture Details of SALOVA

Network Config.

Here, we describe our network configuration in detail. The first module, the Spatio-Temporal Connector, employs a smaller variant of the Perceiver Resampler [2] architecture: a 2-layer, 2-head Transformer followed by a 2-layer MLP with GELU activation as the projector. We set the number of latent features of the connector to 256 and the hidden size to 1024. The second module, the Segment Retrieval Router (SR-Router), consists of a 2-layer, single-head Transformer with a model dimension (d_model) of 1024 and PReLU as the activation function.
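To make the router configuration concrete, here is a minimal PyTorch sketch that matches only the stated hyper-parameters (2 layers, 1 head, d_model of 1024, PReLU activation); the relevance head, feed-forward width, and overall wiring are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SRRouterSketch(nn.Module):
    """Sketch of a 2-layer, single-head Transformer router with d_model=1024 and PReLU."""

    def __init__(self, d_model: int = 1024, nhead: int = 1, num_layers: int = 2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=4 * d_model,   # assumed expansion ratio
            activation=nn.PReLU(),         # PReLU activation as stated
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.score_head = nn.Linear(d_model, 1)  # per-segment relevance score (assumed)

    def forward(self, segment_tokens: torch.Tensor) -> torch.Tensor:
        # segment_tokens: (batch, num_segments, d_model)
        hidden = self.encoder(segment_tokens)
        return self.score_head(hidden).squeeze(-1)  # (batch, num_segments)

# Toy usage with placeholder shapes.
router = SRRouterSketch()
scores = router(torch.randn(2, 16, 1024))
print(scores.shape)  # torch.Size([2, 16])
```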

Appendix D Additional Experiments

Results of LongVideoBench.

Due to the page limit of the main manuscript, we report both the validation and test set results on LongVideoBench [57] in this section. As shown in Tab. 6, the results follow a tendency analogous to the Video-MME benchmark, with a noticeable performance gain beyond the short-duration range (15 s and longer). We highlight again that this trend is mainly due to the retrieval capability of SALOVA, which excels at associating visual content with contextual information even as video length increases.

Tab. 6. Results on LongVideoBench across video duration groups, together with overall test and validation set scores.

| Model | Size | 8–15s | 15–60s | 180–600s | 900–3600s | test set | val set |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary LMMs | | | | | | | |
| GPT-4o [46] | – | 71.6 | 76.8 | 66.7 | 61.6 | 66.7 | 66.7 |
| Gemini 1.5 Pro [50] | – | 70.2 | 75.3 | 65.0 | 59.1 | 64.4 | 64.0 |
| GPT-4-Turbo [44] | – | 66.4 | 71.1 | 61.7 | 54.5 | 60.7 | 59.1 |
| Open-sourced LMMs | | | | | | | |
| VideoChat2 [32] | 7B | 38.1 | 40.5 | 33.5 | 33.6 | 35.1 | 36.0 |
| VideoLLaVA [35] | 8B | 43.1 | 44.6 | 36.4 | 34.4 | 37.6 | 39.1 |
| PLLaVA [59] | 7B | 45.3 | 47.3 | 38.5 | 35.2 | 39.2 | 40.2 |
| LLaVA-1.5 [37] | 7B | 45.0 | 47.4 | 40.1 | 37.0 | 40.4 | 40.3 |
| ShareGPT4Video [8] | 7B | 46.9 | 50.1 | 40.0 | 38.7 | 41.8 | 39.7 |
| Ours | | | | | | | |
| SALOVA-Llama | 3B | 46.3 | 46.7 | 41.9 | 39.8 | 42.2 | 41.4 |
| SALOVA-Phi | 3.8B | 45.3 | 48.3 | 42.6 | 40.6 | 42.9 | 41.6 |
| SALOVA-Qwen | 7B | 46.0 | 50.7 | 44.4 | 42.1 | 44.5 | 43.5 |

Tab. 7. Ablation on the number of retrieved video segments (top-k) on Video-MME.

| Top-k | Short (≤2m) | Mid (4–15m) | Long (30–60m) | Overall |
| --- | --- | --- | --- | --- |
| 1 | 48.1 | 44.4 | 39.1 | 43.9 |
| 5 | 48.1 | 45.0 | 39.2 | 44.1 |
| 9 | 48.3 | 46.3 | 41.1 | 45.3 |
| 13 | 48.1 | 44.7 | 39.7 | 44.1 |

Ablation Study for Retrieval Number.

In addition, we analyze the number of video segments used at inference on the Video-MME benchmark. In our architectural design, the number of retrieved video segments can be set dynamically based on the relevance estimates from the SR-Router, which forwards partial yet pertinent spatio-temporal information from the video to the LMM. We compare how varying this number affects performance; note that the maximum number of video segments in Video-MME is 13. As shown in Tab. 7, increasing the number of retrieved segments tends to enhance performance, but the gains saturate once the retrieval number reaches 9. This saturation suggests that excessive input information becomes more disruptive than helpful when reasoning about partial scenes in the video.
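Below is a minimal sketch of this top-k routing step, assuming the SR-Router yields one relevance score per segment; the tensor shapes and function name are illustrative.

```python
import torch

def select_top_k_segments(segment_features: torch.Tensor,
                          router_scores: torch.Tensor,
                          k: int = 9) -> torch.Tensor:
    """Keep only the k most relevant segments according to the router.

    segment_features: (num_segments, num_tokens, hidden_dim)
    router_scores:    (num_segments,) relevance score per segment
    Returns the selected segments in their original temporal order.
    """
    k = min(k, router_scores.numel())
    top = torch.topk(router_scores, k).indices
    top, _ = torch.sort(top)               # preserve temporal order for the LMM
    return segment_features[top]

# Toy example: 13 segments of 256 tokens with hidden size 1024 (placeholder sizes).
feats = torch.randn(13, 256, 1024)
scores = torch.randn(13)
print(select_top_k_segments(feats, scores, k=9).shape)  # torch.Size([9, 256, 1024])
```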

Qualitative Results.

We provide qualitative results across varying video lengths to demonstrate the effectiveness of SALOVA on short, medium, and long videos, as shown in Fig. 8. For example, for the short video in Fig. 8, our model accurately retrieves the scene of the Moon colliding with the Earth, depicting an astronomical disaster, based on the given question. Similarly, for the medium-length video, SALOVA correctly identifies the scene in which a male judge selects the card corresponding to the question. Even for videos longer than 40 minutes, SALOVA accurately locates scenes related to the correct answer, such as people eating BBQ after the video explores the history of the food's origin, based solely on the query and the video content.

These consistent qualitative results across all video lengths indicate that successfully retrieving video segments pertinent to the input query contributes significantly to the model's effectiveness. As demonstrated in our analysis, SALOVA handles widely varying amounts of video data, which supports robust scene understanding and reasoning.

