Analytical tool; Data acquisition system; Growing demand; Localization and tracking; Object localization; Object Tracking; Real-world; Realistic rendering; Virtualizations; World coordinates; Automotive Engineering; Mechanical Engineering; Computer Science Applications
Abstract :
[en] There is a growing demand for more realistic virtualization of real-world scenes, driven by improvements in data acquisition systems and by the availability of analytical tools and computational power to process that data. Despite the growing number of studies that tackle specific steps in this process, there is still a lack of solutions that leverage recent advances in computer vision to facilitate the digital reconstruction of a real-world environment. Such an approach has clear advantages, e.g., it can operate on simple video data and does not require an expensive data acquisition system such as a LiDAR sensor. However, it comes with several challenges, including the need to individualize and categorize objects in an image, estimate their position with respect to the camera, and deal with artifacts. To tackle these challenges, we conceived an innovative workflow for extracting relevant elements from a video scene and positioning them in real-world coordinates, in our case for video captured from the front of a moving train. Starting from a semantic segmentation task, our method leverages the object segmentation model Segment Anything to assign a unique identifier to each instance of each class; these instances are subsequently tracked along the video sequence using a video segmentation algorithm. Each unique object instance can then be positioned in real-world coordinates by combining the output with its depth, calibrated to metric units using a scene reconstruction from Structure from Motion (SfM). The full approach is validated by comparing the estimated positions of buildings and traffic signs with their real positions from an open-source database, achieving sub-meter accuracy for both. This novel approach provides a comprehensive framework for positioning any object from a sequence of video frames and can be applied to a wide range of domains beyond the one tackled here.
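The positioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the depth scale is fitted or how an object is reduced to a single point, so the median-ratio scale, the mask-centroid pixel, the pinhole camera model, and all function and parameter names (`fit_depth_scale`, `backproject`, `locate_object`, `R_wc`, `t_wc`) are assumptions made for illustration. The sparse SfM depths are assumed to be in metres already.

```python
import numpy as np


def fit_depth_scale(predicted_depth, sfm_depth):
    """Scale factor mapping predicted (relative) depth to metric depth.

    Assumption: a single global scale, taken as the median ratio between
    sparse SfM depths and predicted depths sampled at the same pixels.
    """
    predicted_depth = np.asarray(predicted_depth, dtype=float)
    sfm_depth = np.asarray(sfm_depth, dtype=float)
    valid = (predicted_depth > 0) & (sfm_depth > 0)
    return float(np.median(sfm_depth[valid] / predicted_depth[valid]))


def backproject(u, v, depth, K):
    """Pinhole back-projection of pixel (u, v) at a given depth
    into camera coordinates (x right, y down, z forward)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])


def locate_object(mask, depth_map, K, R_wc, t_wc, scale):
    """Estimate one world-frame position for a segmented object instance.

    mask      : (H, W) instance mask for one tracked object
    depth_map : (H, W) predicted depth for the same frame
    K         : (3, 3) camera intrinsics
    R_wc, t_wc: camera-to-world rotation (3, 3) and translation (3,)
    scale     : factor returned by fit_depth_scale
    """
    mask = np.asarray(mask, dtype=bool)
    rows, cols = np.nonzero(mask)
    u, v = cols.mean(), rows.mean()             # mask centroid in pixels
    depth = scale * np.median(depth_map[mask])  # robust metric depth
    p_cam = backproject(u, v, depth, K)
    return R_wc @ p_cam + t_wc                  # camera frame -> world frame
```

In use, `fit_depth_scale` would be called once per frame (or per sequence) on the pixels where SfM points project, and `locate_object` once per tracked instance and frame. Aggregating the per-frame estimates of the same instance, for example with a median, is a further assumption and is not shown.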
Disciplines :
Computer science
Author, co-author :
Le Roux, Francois; EluciDATA Lab of Sirris, Brussels, Belgium
Cabral, Henrique; EluciDATA Lab of Sirris, Brussels, Belgium
Yarroudh, Anass ; Université de Liège - ULiège > Département de géographie > Geospatial Data Science and City Information Modelling (GeoScITY) ; GIM Wallonie, Gembloux, Belgium
Nlemba, Laurent; GIM Wallonie, Gembloux, Belgium
Campling, Matthias; KU Leuven, Department of Computer Science, Heverlee, Belgium
Tsiporkova, Elena; EluciDATA Lab of Sirris, Brussels, Belgium
Language :
English
Title :
Object Localization and Tracking Pipeline for the Realistic Rendering of Railway Environments
Publication date :
20 March 2025
Event name :
2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)
The authors would like to thank Steven Smolders and Clement Massart from GIM Wallonie (Rue Camille Hubert, 13C, 5032 Gembloux, BE) for their contribution to this work. This research was conducted as part of the TrackGen project and supported by SPW and Logistics in Wallonia.