Given an input video sequence whose frames depict the same scene at different times, the background estimation problem consists of generating a model of the scene background, free of the foreground elements occluding it. In this thesis, we are interested in a unimodal variant of this problem, in which the resulting background model is a single image. To date, background estimation, which is often confused with background subtraction, has been only marginally explored. The simplest method, the temporal median filter, computes the median pixel value at each pixel position. While it produces excellent results for basic scenes, it relies on the strong assumption that the background is observed more than half of the time at each pixel position. As this assumption is rarely met in complex video sequences, such as those containing a large number of foreground elements or subject to background motion and/or intermittent motion, the temporal median filter usually fails to generate a clean background image for realistic scenes.
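As a concrete illustration, the temporal median filter can be sketched in a few lines (a minimal NumPy sketch; the function name and toy data are ours, not an implementation from the thesis):

```python
import numpy as np

def temporal_median_filter(frames):
    """Estimate the background as the per-pixel temporal median.

    frames: array of shape (T, H, W) or (T, H, W, C) holding T frames.
    Returns a single background image of shape (H, W[, C]).
    """
    stack = np.asarray(frames)
    # The median over the time axis recovers the background at a pixel
    # only if the background is observed in more than half of the frames.
    return np.median(stack, axis=0)

# Toy example: a static background of value 10, with a foreground value
# of 200 occluding one pixel in a minority of the frames.
frames = np.full((5, 2, 2), 10.0)
frames[0, 0, 0] = 200.0
frames[1, 0, 0] = 200.0  # occluded in only 2 of 5 frames
background = temporal_median_filter(frames)
```

In this toy example, the occluding value appears in only 2 of the 5 frames, so the median recovers the background value; had it appeared in 3 or more frames, the filter would fail, which is precisely the limitation discussed above.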
In this thesis, we propose LaBGen, a new background estimation method that builds upon the temporal median filter while improving its robustness. It is based on the idea that, if we had information indicating which pixels of a given frame are in motion, we could filter out the foreground pixel values considered during the median computations, and thereby relax the need to observe the background more than half of the time. After describing and justifying the design of LaBGen, we test the relevance, for our task, of the motion detection performed by several popular background subtraction algorithms. It turns out that the simple frame difference algorithm enables LaBGen to achieve its best performance. For this reason, we integrate this algorithm into LaBGen-P, another of our methods, which improves upon LaBGen by avoiding some artifacts that LaBGen occasionally introduces into the generated background images. In addition to outperforming many sophisticated state-of-the-art methods with a much lower run time, LaBGen and LaBGen-P were ranked first in the international IEEE Scene Background Modeling Contest organized in 2016.
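The underlying idea can be sketched as follows (an illustrative simplification only, with a frame-difference detector and NumPy masked medians; the function name, threshold, and fallback policy are ours, not the actual LaBGen algorithm):

```python
import numpy as np

def motion_filtered_median(frames, threshold=30.0):
    """Background estimate from the median of the pixel values observed
    while a pixel is NOT in motion (frame difference as the detector).

    frames: float array of shape (T, H, W).
    """
    stack = np.asarray(frames, dtype=float)
    # Frame difference: a pixel is deemed static in frame t if it barely
    # changed with respect to frame t-1 (first frame treated as static).
    diff = np.abs(np.diff(stack, axis=0))
    static = np.concatenate(
        [np.ones((1,) + stack.shape[1:], dtype=bool), diff < threshold],
        axis=0,
    )
    # Median over the static observations only; the mask discards the
    # values flagged as foreground.
    masked = np.ma.masked_array(stack, mask=~static)
    estimate = np.ma.median(masked, axis=0)
    # Fall back to the plain temporal median where no static value exists.
    fallback = np.median(stack, axis=0)
    return np.where(np.ma.getmaskarray(estimate), fallback,
                    np.ma.getdata(estimate))

# Toy example: one pixel is occluded by a moving foreground (changing
# values) in 3 of 5 frames, so the plain median would fail there.
frames = np.full((5, 2, 2), 10.0)
frames[1:4, 0, 0] = [200.0, 150.0, 220.0]
bg = motion_filtered_median(frames)
```

Here the plain temporal median at the occluded pixel would return 150, whereas discarding the values in motion recovers the background value of 10.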
Thereafter, we study the relationship between the performance of motion detection and the performance of our methods. Although we do not find an obvious correlation between the two, based on previous experimental evidence we hypothesize that a temporally memoryless motion detection is the most relevant for LaBGen. Unlike a temporally aware motion detection, which exploits the temporal history, a temporally memoryless approach detects motion between two frames without relying on additional past frames. Based on this hypothesis, we design LaBGen-OF, a variant of LaBGen that leverages temporally memoryless optical flow algorithms (i.e. algorithms that determine the displacement of each pixel from one frame to the next). A subsequent performance study highlights that LaBGen-OF consistently outperforms LaBGen equipped with various temporally aware motion detection algorithms. Even better, LaBGen-OF is ranked 2nd out of 30 methods on the popular SBMnet background estimation dataset, and takes the lead in 2 of its 8 categories.
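To illustrate what "temporally memoryless" means in practice, a toy displacement estimator between two frames only might look like this (a didactic block-matching sketch of our own; the optical flow algorithms used in the thesis are far more elaborate):

```python
import numpy as np

def block_flow(prev, curr, block=4, radius=2):
    """Tiny exhaustive block-matching flow between two frames only:
    a temporally memoryless motion estimator, since no frame older
    than the previous one is used. Returns one integer (dy, dx)
    displacement per block."""
    H, W = prev.shape
    flow = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(H // block):
        for bx in range(W // block):
            y, x = by * block, bx * block
            patch = curr[y:y + block, x:x + block]
            best_cost, best_d = None, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    sy, sx = y + dy, x + dx
                    if sy < 0 or sx < 0 or sy + block > H or sx + block > W:
                        continue
                    # Sum of absolute differences against the candidate
                    # patch in the previous frame.
                    cost = np.abs(prev[sy:sy + block, sx:sx + block]
                                  - patch).sum()
                    if best_cost is None or cost < best_cost:
                        best_cost, best_d = cost, (dy, dx)
            flow[by, bx] = best_d
    return flow

# A frame shifted one pixel to the right: interior blocks should match
# one pixel to the left in the previous frame.
prev = np.arange(64.0).reshape(8, 8)
curr = np.roll(prev, 1, axis=1)
flow = block_flow(prev, curr)
```

A pixel (or block) with a non-zero displacement is then considered in motion, which is exactly the information LaBGen-OF consumes.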
These promising results lead us to push the temporally memoryless approach even further. For this purpose, we propose two intra-frame motion detection algorithms of our own that leverage semantic segmentation (i.e. a segmentation indicating which object is depicted in each pixel) to determine the possibility of observing motion from spatial information only. Afterwards, we integrate these algorithms into a new variant of LaBGen-P, called LaBGen-P-Semantic, and determine their relevance to our task. In addition to validating the use of intra-frame motion detection algorithms, a performance evaluation shows that LaBGen-P-Semantic performs better than LaBGen and LaBGen-P, and takes the lead in 3 other SBMnet categories.
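The principle of such an intra-frame detector can be sketched as follows (hypothetical class ids and names of our own; a real system would obtain the label map from a trained semantic segmentation network):

```python
import numpy as np

# Hypothetical class ids; in practice they come from the label space of
# the semantic segmentation network being used.
ROAD, BUILDING, PERSON, CAR = 0, 1, 2, 3
MOVABLE = [PERSON, CAR]

def semantic_motion_possibility(labels):
    """Intra-frame motion mask from semantics alone: a pixel can
    possibly be in motion if its class is a movable one. Purely
    spatial, no temporal information is used."""
    return np.isin(labels, MOVABLE)

# A single 2x2 label map: people and cars are flagged as potential
# foreground, road and buildings are not.
labels = np.array([[ROAD, PERSON],
                   [BUILDING, CAR]])
mask = semantic_motion_possibility(labels)
```

Note that this decision is taken from one frame in isolation, which is what makes the approach intra-frame rather than inter-frame.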
Finally, an additional contribution of this thesis lies in the subfield of performance evaluation. Indeed, as most evaluation methodologies and datasets are used blindly, without ever being questioned, we describe and analyze them in depth in order to determine whether such trust is justified. It turns out that some evaluation tools are mathematically inaccurate, and/or redundant, and/or poorly correlated with what human visual perception considers an acceptable background image. In addition, the public implementations of some of these tools return erroneous results. We thus revisit the performance evaluation paradigms used in background estimation, review the problems, and provide possible solutions. Furthermore, as no methodology has been proposed to date for assessing online background estimation methods (i.e. methods generating a background image after each frame of the input video sequence), we provide insights into how to evaluate them. Our proposal is based on paradigms borrowed from the video quality assessment field, and a proof of concept shows that it is able to discriminate between the performances of two online methods that traditional evaluation tools consider identical.
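The gist of evaluating online methods over time can be illustrated as follows (a toy sketch; the metric and the data are ours, not the exact protocol proposed in the thesis): averaging a per-frame error over the whole sequence can separate two online methods whose final background images are identical.

```python
import numpy as np

def online_score(estimates, ground_truth):
    """Score an online method by averaging a per-frame error (here MSE,
    our choice for illustration) over the whole sequence, rather than
    judging only the final background image."""
    per_frame = [float(np.mean((e - ground_truth) ** 2)) for e in estimates]
    return float(np.mean(per_frame))

# Two hypothetical online methods that end with the same final estimate:
gt = np.full((2, 2), 10.0)
fast = [np.full((2, 2), 10.0)] * 3      # converges immediately
slow = [np.full((2, 2), 50.0),
        np.full((2, 2), 20.0),
        np.full((2, 2), 10.0)]          # converges only at the end
```

A traditional evaluation that looks only at the final image gives both methods a perfect score, whereas the sequence-averaged score penalizes the slowly converging one.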