Eprint first made available on ORBi (E-prints, working papers and research blog)
TrackMAE: Video Representation Learning via Track Mask and Predict
Vandeghen, Renaud; Thoker, Fida Mohammad; Ghanem, Bernard et al.
2026
 

Files


Full Text
Vandeghen2026TrackMAE-preprint.pdf
Author preprint (4.59 MB)





Details



Keywords :
Video understanding; Artificial intelligence; Masking; Video modeling; Self-supervised training
Abstract :
Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but it models motion information only implicitly, which limits how well temporal dynamics are encoded in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction space by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning (SSL) baselines, indicating that it learns more discriminative and generalizable representations.
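The motion-aware masking idea in the abstract can be illustrated with a minimal sketch. This record does not describe how TrackMAE actually uses trajectories to build the mask, so everything below is an assumption made for illustration: the function names, the `(N, T, 2)` track layout, scoring each patch by the path length of the tracks that start in it, and sampling masked patches with probability proportional to that score. It is one plausible reading, not the authors' implementation.

```python
import numpy as np

def patch_motion_scores(tracks, frame_hw, patch):
    """Sum per-track path length into the patch where each track starts.

    tracks: float array (N, T, 2) of (x, y) pixel positions per frame
            (hypothetical layout, e.g. the output of a point tracker).
    """
    h, w = frame_hw
    gh, gw = h // patch, w // patch
    # Path length of each track: sum of frame-to-frame displacement norms.
    steps = np.diff(tracks, axis=1)                      # (N, T-1, 2)
    length = np.linalg.norm(steps, axis=-1).sum(axis=1)  # (N,)
    # Patch index of each track's starting point, clipped to the grid.
    col = np.clip((tracks[:, 0, 0] // patch).astype(int), 0, gw - 1)
    row = np.clip((tracks[:, 0, 1] // patch).astype(int), 0, gh - 1)
    scores = np.zeros((gh, gw))
    np.add.at(scores, (row, col), length)  # accumulates repeated indices
    return scores

def motion_aware_tube_mask(scores, num_time_tokens, mask_ratio, rng, eps=1e-3):
    """Sample one spatial mask biased toward moving patches, repeat over time."""
    flat = scores.ravel() + eps  # eps keeps static patches selectable
    probs = flat / flat.sum()
    num_masked = int(round(mask_ratio * flat.size))
    masked = rng.choice(flat.size, size=num_masked, replace=False, p=probs)
    spatial = np.zeros(flat.size, dtype=bool)
    spatial[masked] = True
    # Tube masking: the same spatial pattern at every temporal token index.
    tube = np.broadcast_to(spatial.reshape(scores.shape),
                           (num_time_tokens, *scores.shape))
    return tube.copy()

# Toy input: four tracks on a 32x128 frame, only the first one moves.
rng = np.random.default_rng(0)
tracks = np.zeros((4, 8, 2))
tracks[:, :, 0] = np.array([8.0, 40.0, 72.0, 104.0])[:, None]  # x positions
tracks[:, :, 1] = 8.0                                          # y positions
tracks[0, :, 0] += np.arange(8) * 2.0                          # 2 px per frame
scores = patch_motion_scores(tracks, frame_hw=(32, 128), patch=16)
mask = motion_aware_tube_mask(scores, num_time_tokens=4, mask_ratio=0.5, rng=rng)
```

Here `mask` has shape `(4, 2, 8)` and masks half of the 16 patches in every temporal slice. Whether moving regions should be masked more often or less often is a design choice that this sketch does not settle; inverting the scores before sampling gives the opposite bias.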
Research Center/Unit :
Montefiore Institute - Montefiore Institute of Electrical Engineering and Computer Science - ULiège
TELIM
VIULAB
Disciplines :
Electrical & electronics engineering
Author, co-author :
Vandeghen, Renaud; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Thoker, Fida Mohammad
Ghanem, Bernard
Van Droogenbroeck, Marc; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Télécommunications
Language :
English
Publication date :
12 January 2026
Number of pages :
17
Name of the research project :
Lucia
Funders :
SPW - Public Service of Wallonia
Funding number :
1910247
Funding text :
The present research benefited from computational resources made available on Lucia, the Tier-1 supercomputer of the Walloon Region, infrastructure funded by the Walloon Region under the grant agreement n°1910247. The research reported in this publication was supported by funding from King Abdullah University of Science and Technology (KAUST) - Center of Excellence for Generative AI, under award number 5940. For computing time, this research used Ibex managed by the Supercomputing Core Laboratory at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia. We acknowledge EuroHPC JU for awarding the project ID EHPC-DEV-2025D10-008 access to Leonardo on Leonardo Booster hosted by CINECA, Italy and access to MareNostrum5 on MN5 ACC hosted by BSC, Spain.
Available on ORBi :
since 19 January 2026

