Abstract:
[en] While 3D cinema is becoming increasingly established, little effort has focused on the general problem of producing a 3D sound scene spatially coherent with the visual content of a stereoscopic-3D (s-3D) movie. The perceptual relevance of such spatial audiovisual coherence is of significant interest.
In this thesis, we investigate the possibility of adding spatially accurate sound rendering to regular s-3D cinema. Our goal is to provide a perceptually matched sound source at the position of every object producing sound in the visual scene. We examine and contribute to the understanding of the usefulness and the feasibility of this combination.
By usefulness, we mean that the technology should positively contribute to the experience, and in particular to the storytelling. Experiments that test this usefulness require an appropriate s-3D movie and its corresponding 3D audio soundtrack. We first present the procedure followed to obtain this joint 3D video and audio content from an existing animated s-3D movie, the problems encountered, and some of the solutions employed. Second, as s-3D cinema aims at providing the spectator with a strong impression of being part of the movie (sense of presence), we investigate the impact of the spatial rendering quality of the soundtrack on the reported sense of presence. The short 3D audiovisual content is presented with three different soundtracks. These soundtracks differ in their spatial rendering quality, from stereo (low spatial coherence) to Wave Field Synthesis (WFS, high spatial coherence). The original stereo version serves as a reference. Results show that the sound condition does not affect the sense of presence of every participant. Participants can instead be classified according to three levels of presence sensitivity, and the sound condition affects only the highest level (12 out of 33 participants). Within this group, the spatially coherent soundtrack yields a lower reported sense of presence than the other custom soundtrack. The analysis of the participants' heart rate variability (HRV) shows that the frequency-domain parameters correlate with the reported presence scores.
By feasibility, we mean that a large portion of the spectators in the audience should benefit from this new technology. In this thesis, we explain why the combination of accurate sound positioning and stereoscopic-3D images can lead to an incongruence between the sound and the image for multiple spectators. Then, we adapt to s-3D viewing a method, originally proposed in the literature for 2D images, that reduces this error. Finally, a subjective experiment is carried out to evaluate the effectiveness of the method. In this experiment, an angular error between an s-3D video and a spatially accurate sound reproduced through WFS is simulated. The psychometric curve is measured with the method of constant stimuli, and the threshold for bimodal integration is estimated. The impact of background noise is also investigated by comparing the case without any background noise with the case with an SNR of 4 dBA. Estimates of the thresholds and the slopes, as well as their confidence intervals, are obtained for each level of background noise. When background noise is present, the point of subjective equality (PSE) is higher (19.4° instead of 18.3°) and the slope is steeper (-0.077 instead of -0.062 per degree). Because the confidence intervals overlap, however, the two noise levels cannot be statistically differentiated. The implications for sound reproduction in a cinema theater are discussed.
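As an illustration of how a PSE and a slope can be estimated from constant-stimuli data, the following minimal sketch fits a descending logistic psychometric function by maximum likelihood. The abstract does not state which function or fitting procedure was used, so the logistic form, the "coherent" response labelling, and all names (`fit_psychometric`, `n_coherent`, and so on) are assumptions made for this example only. The demonstration data are synthetic, generated from the no-noise PSE and slope reported above; they are not the experimental data.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_likelihood(a, b, xs, ks, ns):
    """Binomial log-likelihood (up to a constant) for p = 1 / (1 + exp(-(a + b*x)))."""
    total = 0.0
    for x, k, n in zip(xs, ks, ns):
        z = a + b * x
        # log p = -softplus(-z) and log(1 - p) = -softplus(z)
        total += -k * softplus(-z) - (n - k) * softplus(z)
    return total


def fit_psychometric(angles, n_coherent, n_trials, max_iter=100, tol=1e-12):
    """Fit a logistic curve to constant-stimuli data (assumed model, see lead-in).

    angles     : tested angular errors in degrees
    n_coherent : number of "coherent" responses per angle (hypothetical labelling)
    n_trials   : number of presentations per angle
    Returns (pse, slope): the angle where p = 0.5 and dp/dx at that angle.
    """
    mean_x = sum(angles) / len(angles)
    xs = [x - mean_x for x in angles]  # centring improves conditioning
    a = b = 0.0
    ll = log_likelihood(a, b, xs, n_coherent, n_trials)
    for _ in range(max_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, k, n in zip(xs, n_coherent, n_trials):
            p = math.exp(-softplus(-(a + b * x)))
            r = k - n * p            # residual in counts
            w = n * p * (1.0 - p)    # binomial variance weight
            g_a += r
            g_b += r * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det == 0.0:
            raise ValueError("degenerate data: cannot estimate two parameters")
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        # Newton step with step halving so the likelihood never decreases
        step = 1.0
        while step > 1e-8:
            new_ll = log_likelihood(a + step * da, b + step * db,
                                    xs, n_coherent, n_trials)
            if new_ll >= ll:
                break
            step /= 2.0
        a += step * da
        b += step * db
        ll = new_ll
        if abs(step * da) + abs(step * db) < tol:
            break
    pse = mean_x - a / b   # p = 0.5 where a + b * (x - mean_x) = 0
    slope = b / 4.0        # derivative of the logistic at p = 0.5
    return pse, slope


if __name__ == "__main__":
    # Synthetic expected counts generated from the reported no-noise values.
    angles = [0, 5, 10, 15, 20, 25, 30, 35, 40]
    true_pse, true_slope, n = 18.3, -0.062, 20
    counts = [n / (1.0 + math.exp(-4.0 * true_slope * (x - true_pse)))
              for x in angles]
    print(fit_psychometric(angles, counts, [n] * len(angles)))
```

With this parameterisation the slope at the PSE equals one quarter of the logistic rate parameter, so a value such as -0.062 per degree is read as the rate at which the proportion of "coherent" responses falls at the 50% point. Confidence intervals like those mentioned above would need an additional step, for example bootstrapping, which is not shown here.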