Abstract:
Monitoring wildlife and livestock in protected areas is essential to meeting conservation goals for natural ecosystems. In large open areas, this is often done by direct counting by observers aboard manned aircraft flying at low altitude. However, several biases are associated with this method, resulting in low accuracy when counting large groups. Unmanned Aerial Vehicles (UAVs) have grown significantly in use in recent years and appear relatively well suited to photographing animals. While UAVs allow for more accurate herd counts than traditional methods, identification and counting are usually done indirectly, through a time-consuming manual photo-interpretation process. For several years, machine learning and deep learning techniques have been under development, and they now show encouraging results for automatic animal detection. Some of them use Convolutional Neural Networks (CNNs) through anchor-based object detectors. These algorithms automatically extract relevant features from images, produce thousands of anchors all over the image, and eventually decide which ones actually contain an object. Counting and classification are then achieved by summing and classifying all the selected bounding boxes. While this approach works well for isolated mammals or sparse herds, it reaches its limits when individuals are close together, generating too many false positives and thus overestimating counts in dense herds. This raises the question: are anchor-based algorithms the most suitable for counting large mammals in aerial imagery? In an attempt to answer this, we built a simple one-stage point-based object detector on a dataset acquired over various African landscapes, which contains six large mammal species: buffalo (Syncerus caffer), elephant (Loxodonta africana), kob (Kobus kob), topi (Damaliscus lunatus jimela), warthog (Phacochoerus africanus) and waterbuck (Kobus ellipsiprymnus).
An adapted version of the CNN DLA-34 was trained on points only (the centers of the original bounding boxes), splatted onto a Focal Inverse Distance Transform (FIDT) map that was regressed pixel-wise using the focal loss. During inference, local maxima were extracted from the predicted map to obtain the animals’ locations. The performance of the binary model was then compared to that of the state-of-the-art model, Libra-RCNN. Although our model detected 5% fewer animals than the baseline, its precision nearly doubled, from 37% to 70%, reducing the number of false positives by one third without using any hard negative mining method. The results also showed a clear increase in precision in areas where individuals are close together, suggesting that a point-based approach is better suited to detecting animals in herds than anchor-based ones. Future work will apply this approach to other animal datasets with different acquisition conditions (e.g. oblique viewing angle, coarser resolution, denser herds) to evaluate its range of use.
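
To illustrate the point-based pipeline described above, the following minimal sketch builds an FIDT map from point annotations and recovers animal locations by extracting local maxima. It is not the authors' implementation. It assumes the commonly used FIDT form 1 / (d^(alpha*d + beta) + C), where d is the distance to the nearest annotated point, with alpha = 0.02, beta = 0.75 and C = 1. The function names `fidt_map` and `local_maxima`, the 0.5 threshold and the 3x3 neighbourhood are hypothetical choices made for illustration, and plain Python on a toy grid stands in for the CNN's predicted map.

```python
import math


def fidt_map(points, height, width, alpha=0.02, beta=0.75, c=1.0):
    """Build a FIDT map from (row, col) points (assumed formulation).

    Each pixel gets 1 / (d ** (alpha * d + beta) + c), where d is the
    Euclidean distance to the nearest point. Annotated pixels get 1.0.
    """
    fmap = [[0.0] * width for _ in range(height)]
    if not points:
        return fmap  # no animals: empty map
    for y in range(height):
        for x in range(width):
            d = min(math.hypot(y - py, x - px) for py, px in points)
            fmap[y][x] = 1.0 / (d ** (alpha * d + beta) + c)
    return fmap


def local_maxima(fmap, threshold=0.5, radius=1):
    """Return (row, col) pixels that are strict local maxima above threshold.

    The threshold and neighbourhood radius are illustrative assumptions,
    not values taken from the paper.
    """
    height, width = len(fmap), len(fmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = fmap[y][x]
            if value < threshold:
                continue
            neighbours = [
                fmap[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
                if (j, i) != (y, x)
            ]
            if all(value > n for n in neighbours):
                peaks.append((y, x))
    return peaks


# Toy example: two well-separated animals on a 12 x 16 grid.
annotations = [(2, 2), (7, 10)]
target = fidt_map(annotations, height=12, width=16)
detections = local_maxima(target)
count = len(detections)
```

In this sketch the herd count is simply the number of extracted peaks, so no bounding boxes or anchors are involved. Applied to a ground-truth map, the peaks coincide with the annotated points; on a predicted map, the threshold would control the trade-off between missed animals and false positives.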