Demo
Referring Single Object Tracking (RSOT): For each video, we use the prompt "Please find {expression} in the initial frame and provide the detailed coordinates in each frame."
Single Object Tracking (SOT): For each video, we use the prompt "This is a video showing an object with coordinates {coordinates} in Frame 1. Provide the detailed coordinates of the object in each frame."
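The two prompt templates above can be filled in programmatically. The following is a minimal sketch, not code from the Elysium repository: the function names `build_rsot_prompt`, `build_sot_prompt`, and `format_box` are hypothetical, and the `[x1,y1,x2,y2]` box string is an assumption, since the text above does not specify how `{coordinates}` is serialized.

```python
# Prompt templates copied verbatim from the Demo section.
RSOT_TEMPLATE = (
    "Please find {expression} in the initial frame and provide "
    "the detailed coordinates in each frame."
)
SOT_TEMPLATE = (
    "This is a video showing an object with coordinates {coordinates} "
    "in Frame 1. Provide the detailed coordinates of the object in each frame."
)


def build_rsot_prompt(expression: str) -> str:
    """Fill the RSOT template with a referring expression (hypothetical helper)."""
    return RSOT_TEMPLATE.format(expression=expression.strip())


def format_box(box) -> str:
    """Serialize a box as "[x1,y1,x2,y2]".

    ASSUMPTION: this string layout is illustrative only; the coordinate
    format the model actually expects is not stated in this document.
    """
    x1, y1, x2, y2 = box
    return f"[{x1},{y1},{x2},{y2}]"


def build_sot_prompt(coordinates: str) -> str:
    """Fill the SOT template with an already-serialized first-frame box."""
    return SOT_TEMPLATE.format(coordinates=coordinates)


if __name__ == "__main__":
    print(build_rsot_prompt("the person in red"))
    print(build_sot_prompt(format_box((10, 20, 50, 80))))
```

Keeping the templates as module-level constants makes it easy to check that the text sent to the model matches the documented prompts character for character; only the box serialization would need to change to match the real coordinate convention.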
Demo clips: Shoes; The Cap on a Dog's Head; The Person in Red; The Snow Field; A Running Dog Played in the Snow Field; Boy Back to Camera; A Dancing Kangaroo; Dog; Coords Airplane; Coords Dog.
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application to video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. First, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Second, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset that supports three tasks: Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we train MLLMs and propose a token-compression model, T-Selector, to tackle the second challenge. Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that attempts to conduct object-level tasks in videos without requiring any additional plug-in or expert models.
@article{elysium,
  Author = {Han, Wang and Yanjie, Wang and Yongjie, Ye and Yuxiang, Nie and Huang, Can},
  Title = {Elysium: Exploring Object-level Perception in Videos via MLLM},
  Conference = {ECCV},
  Year = {2024}
}