The cocktail party effect (the involuntary capture of our attention by a word spoken in a conversation other than the one we are engaged in) challenges current computational models, which struggle to account for both top-down and bottom-up phenomena in noisy, multi-source environments. This study investigated how different neural network architectures mimic this phenomenon, particularly when faced with semantic distraction.
We implemented two architectures: the ASAM model from Xu et al. (2018), which augments a BiLSTM network with an attention module, and a unidirectional LSTM variant. The task was to separate a target speaker’s narrative from a distractor’s speech. During training, the models were familiarized with a specific set of words by presenting them at varying frequencies; during testing, these “familiar” words were embedded in the distractor’s speech. This allowed us to assess whether the models were more susceptible to semantic distraction from familiar words than from novel ones. Model performance was evaluated using source separation metrics and analysed with a MANOVA.
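To make the setup concrete, the following is a minimal sketch of a mask-based BiLSTM separator with an attention module, written in PyTorch. The layer sizes, the additive attention formulation, and the MSE training objective are illustrative assumptions for this sketch, not the ASAM configuration published by Xu et al. (2018).

```python
# Minimal sketch (not the authors' released code) of a mask-based
# BiLSTM separator with an attention module, in PyTorch. Hyperparameters
# and the attention formulation are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionBiLSTMSeparator(nn.Module):
    def __init__(self, n_freq=257, hidden=300):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Additive (Bahdanau-style) attention over the BiLSTM states.
        self.attn_score = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.mask_head = nn.Linear(2 * hidden, n_freq)

    def forward(self, mix_mag):                        # (batch, frames, n_freq)
        h, _ = self.blstm(mix_mag)                     # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn_score(h), dim=1)   # per-frame attention weights
        context = (w * h).sum(dim=1, keepdim=True)     # utterance-level summary
        h = h + context                                # broadcast context to each frame
        mask = torch.sigmoid(self.mask_head(h))        # time-frequency mask in [0, 1]
        return mask * mix_mag                          # estimated target magnitude

# Usage: train with an MSE loss between the masked mixture spectrogram
# and the clean target speaker's magnitude spectrogram.
model = AttentionBiLSTMSeparator()
mix = torch.rand(4, 200, 257)      # batch of mixture magnitude spectrograms
target = torch.rand(4, 200, 257)   # corresponding clean target spectrograms
loss = nn.functional.mse_loss(model(mix), target)
loss.backward()
```

Separators of this kind are commonly scored on the reconstructed signals with metrics such as SDR or SI-SDR; the abstract does not specify which source separation metrics were used here.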
Our results showed that both architectures exhibited a cocktail party effect (performance was significantly degraded by the presence of a distractor), but this degradation was not modulated by familiarity with the distractor’s words, suggesting that the interference was driven primarily by energetic rather than semantic masking.
This study highlights that current computational models rely more on learned acoustic patterns than on genuine semantic processing. It calls for more critical investigation into how semantic influences can be meaningfully incorporated into, and tested in, models of the cocktail party effect.