Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network that trains a shared-weight encoder on two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative to the Siamese pre-training introduced by SiamMAE. Our method differs by exclusively considering pairs of crops taken from the same image but cropped differently, rather than the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performance and drastically reducing pre-training time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods learn objects not from motion but thanks to the Siamese architecture. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE.
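
The arithmetic behind "only two visible patches" can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes a 224x224 input split into 16x16 patches (196 patches in total), a crop area range of 30% to 60% of the image, and that one crop is left intact while the other is masked; the function names `random_crop_box` and `visible_patch_indices` are hypothetical.

```python
import random


def random_crop_box(width, height, scale=(0.3, 0.6), rng=random):
    """Return (left, top, w, h) for a random crop of the image.

    The crop keeps the image's aspect ratio and covers a random fraction
    of its area; the `scale` range is an illustrative assumption.
    """
    area_frac = rng.uniform(*scale)
    w = max(1, int(width * area_frac ** 0.5))
    h = max(1, int(height * area_frac ** 0.5))
    left = rng.randint(0, width - w)
    top = rng.randint(0, height - h)
    return left, top, w, h


def visible_patch_indices(num_patches, mask_ratio, rng=random):
    """Randomly choose which patches stay visible after masking."""
    num_keep = int(num_patches * (1 - mask_ratio))
    return sorted(rng.sample(range(num_patches), num_keep))


# Assumed patch grid: 224x224 input with 16x16 patches.
NUM_PATCHES = (224 // 16) ** 2

rng = random.Random(0)
# Two differently cropped views of the same (hypothetical) 640x480 image.
view_a = random_crop_box(640, 480, rng=rng)
view_b = random_crop_box(640, 480, rng=rng)
# The second view is masked at 98.5%; the first is assumed to stay intact.
visible_b = visible_patch_indices(NUM_PATCHES, 0.985, rng=rng)
print(view_a, view_b, visible_b)
```

Under these assumptions, a 98.5% ratio leaves 2 of the 196 patches visible, whereas the 95% ratio quoted for SiamMAE would leave 9.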

Research Center/Unit :

TELIM
Montefiore Institute - Montefiore Institute of Electrical Engineering and Computer Science - ULiège
VIULab

Disciplines :

Computer science

Author, co-author :

Eymaël, Alexandre ^✱; University of Liège, Belgium

Vandeghen, Renaud ^✱; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science

Cioppa, Anthony ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science ; KAUST, Saudi Arabia

Giancola, Silvio; KAUST, Saudi Arabia

Ghanem, Bernard; KAUST, Saudi Arabia

Van Droogenbroeck, Marc ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Télécommunications

^✱ These authors have contributed equally to this work.

Language :

English

Title :

Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

Publication date :

31 October 2024

Event name :

European Conference on Computer Vision (ECCV)

Event organizer :

ECVA

Event place :

Milan, Italy

Event date :

September 29 to October 4, 2024

Event number :

Audience :

International

Main work title :

European Conference on Computer Vision

Publisher :

Springer

Collection name :

Lecture Notes in Computer Science, volume 15081

Pages :

348–366

Peer review/Selection committee :

Peer reviewed

Tags :

CÉCI : Consortium des Équipements de Calcul Intensif
Tier-1 supercomputer

European Projects :

H2020 - 951732 - EUROCC - National Competence Centres in the framework of EuroHPC

Name of the research project :

Lucia

Funders :

F.R.S.-FNRS - Fonds de la Recherche Scientifique
SPW - Public Service of Wallonia
European Union

Funding number :

1910247

Funding text :

A. Cioppa is funded by the F.R.S.-FNRS. The research reported in this publication was supported by funding from KAUST Center of Excellence on GenAI, under award number 5940, and the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence. The present research benefited from computational resources made available on Lucia, the Tier-1 supercomputer of the Walloon Region, infrastructure funded by the Walloon Region under the grant agreement no 1910247. We acknowledge EuroCC Belgium for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium.

Available on ORBi :

since 27 March 2024

