Elysium: Exploring Object-level Perception in Videos via MLLM

Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, and Can Huang

Bytedance Inc

ECCV 2024


Demo

Referring Single Object Tracking (RSOT): We use the prompt "Please find {expression} in the initial frame and provide the detailed coordinates in each frame." for each video.

Single Object Tracking (SOT): We use the prompt "This is a video showing an object with coordinates {coordinates} in Frame 1. Provide the detailed coordinates of the object in each frame." for each video.

Demo examples (RSOT referring expressions): Shoes; The Cap on a Dog's Head; The Person in Red; The Snow Field; A Running Dog Played in the Snow Field; Boy Back to Camera; A Dancing Kangaroo; Dog.

Demo examples (SOT initial-frame coordinates): Airplane [35,48,60,55]; Dog [34,40,51,67].
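
Below is a minimal Python sketch of how these two prompt templates can be filled in for a given video. The helper names are illustrative assumptions and are not part of the released code; only the prompt wording comes from the demo above.

```python
# Minimal sketch of building the demo prompts; helper names are assumptions.

def build_rsot_prompt(expression: str) -> str:
    """Referring Single Object Tracking: locate an object described by text."""
    return (f"Please find {expression} in the initial frame and provide "
            f"the detailed coordinates in each frame.")

def build_sot_prompt(coordinates: list[int]) -> str:
    """Single Object Tracking: track an object given its Frame 1 box."""
    coords = "[" + ",".join(str(c) for c in coordinates) + "]"
    return (f"This is a video showing an object with coordinates {coords} "
            f"in Frame 1. Provide the detailed coordinates of the object "
            f"in each frame.")

# Example usage with an expression and a box from the demo examples above.
print(build_rsot_prompt("the person in red"))
print(build_sot_prompt([35, 48, 60, 55]))  # airplane box from the SOT demo
```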

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application to video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. First, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Second, processing a large number of frames within the context window of Large Language Models (LLMs) imposes a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset supporting three tasks: Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we train MLLMs and propose a token-compression model, T-Selector, to tackle the second challenge. Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that performs object-level tasks in videos without requiring any additional plug-in or expert models.
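
As a rough illustration of how a single tracked object can serve all three tasks, here is a hypothetical annotation record. The field names and schema below are assumptions for illustration only; the actual ElysiumTrack-1M format is not described on this page.

```python
# Illustrative sketch only: every field name below is an assumption, since the
# actual ElysiumTrack-1M schema is not shown on this page.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrackAnnotation:
    video_id: str
    expression: str                          # referring description of the object
    boxes: List[List[int]] = field(default_factory=list)  # one [x1, y1, x2, y2] box per frame

ann = TrackAnnotation(
    video_id="demo_airplane",
    expression="the airplane",
    boxes=[[35, 48, 60, 55], [36, 47, 61, 56]],
)

# The same record can be turned into training samples for all three tasks:
#  - SOT:       condition on ann.boxes[0], predict the boxes in later frames.
#  - RSOT:      condition on ann.expression, predict the box in every frame.
#  - Video-REG: condition on a box, generate ann.expression.
```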

@inproceedings{elysium,
    author    = {Wang, Han and Wang, Yanjie and Ye, Yongjie and Nie, Yuxiang and Huang, Can},
    title     = {Elysium: Exploring Object-level Perception in Videos via MLLM},
    booktitle = {ECCV},
    year      = {2024}
}