Dual Attention for Audio-Visual Speech Enhancement with Facial Cues


Feixiang Wang (ICT, UCAS)*, Shuang Yang (ICT, CAS), Shiguang Shan (Institute of Computing Technology, Chinese Academy of Sciences), Xilin Chen (Institute of Computing Technology, Chinese Academy of Sciences)
The 34th British Machine Vision Conference

Abstract

In this work, we focus on exploiting facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE). The facial region not only covers the lip region but also reflects additional speech-related attributes, which is beneficial for AVSE. However, beyond these speech-related attributes, the face also carries static and dynamic speech-unrelated attributes that cause speech-unrelated appearance changes during speaking. To address these challenges, we propose a dual-attention cooperative framework that fully captures speech-related information from facial cues and dynamically integrates it with the audio signal for AVSE. Specifically, to capture and enhance visual speech information beyond the lip region, we propose a spatial-attention-based visual branch that introduces global facial context for robust visual feature extraction. Secondly, we introduce a dynamic visual feature fusion strategy that incorporates a temporal self-attention module, enabling the model to robustly handle facial variations during speaking. Thirdly, since acoustic noise is rarely stationary, the speech quality of the contaminated audio signal varies over time; we therefore introduce a dynamic fusion strategy for the audio feature as well. By integrating the cooperative dual attention in both the visual branch and the audio-visual fusion strategy, our model effectively extracts beneficial speech information from both audio and visual cues for AVSE. We perform a thorough analysis and comparison on different datasets under several settings, including the normal case and the hard case where visual information is unreliable or even absent. The results consistently show that our model outperforms existing methods on multiple metrics.
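To make the two attention mechanisms named in the abstract concrete, below is a minimal sketch of (a) spatial attention over facial feature maps and (b) a temporal self-attention module that re-weights each modality frame by frame before audio-visual fusion. This is an illustrative PyTorch-style approximation, not the authors' implementation: all class names, dimensions, the sigmoid gating, and the single-layer design are assumptions made for clarity.

```python
# Hypothetical sketch of the two attention mechanisms described in the abstract.
# Not the paper's code; shapes and module choices are illustrative assumptions.
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Weights each spatial location of a facial feature map (B, C, H, W)
    so that speech-relevant face regions dominate the pooled visual feature."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        attn = torch.softmax(self.score(feat).view(b, 1, h * w), dim=-1)
        # Attention-weighted pooling over the face -> (B, C)
        return torch.bmm(feat.view(b, c, h * w), attn.transpose(1, 2)).squeeze(-1)


class TemporalFusion(nn.Module):
    """Self-attention along the time axis, used here to estimate a per-frame
    reliability weight for one modality before fusion."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        ctx, _ = self.attn(seq, seq, seq)      # temporal context, (B, T, D)
        w = torch.sigmoid(self.gate(ctx))      # per-frame weight in [0, 1]
        return seq * w


class DualAttentionFusion(nn.Module):
    """Dynamically re-weights visual and audio sequences, then fuses them."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.visual_fusion = TemporalFusion(dim)
        self.audio_fusion = TemporalFusion(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: time-aligned (B, T, D) feature sequences
        v = self.visual_fusion(visual)
        a = self.audio_fusion(audio)
        return self.proj(torch.cat([v, a], dim=-1))


if __name__ == "__main__":
    sa = SpatialAttention(channels=256)
    face_maps = torch.randn(2, 256, 7, 7)      # per-frame facial feature maps
    print(sa(face_maps).shape)                 # torch.Size([2, 256])

    fusion = DualAttentionFusion(dim=256)
    v = torch.randn(2, 50, 256)                # 2 clips, 50 visual frames
    a = torch.randn(2, 50, 256)                # matching frame-rate audio features
    print(fusion(v, a).shape)                  # torch.Size([2, 50, 256])
```

The per-frame gating is one plausible way to realize the "dynamic fusion" idea: when a frame's facial appearance is unreliable (or the audio is heavily contaminated), its weight can shrink so the other modality dominates at that time step.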

Citation

@inproceedings{Wang_2023_BMVC,
author    = {Feixiang Wang and Shuang Yang and Shiguang Shan and Xilin Chen},
title     = {Dual Attention for Audio-Visual Speech Enhancement with Facial Cues},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year      = {2023},
url       = {https://papers.bmvc2023.org/0144.pdf}
}

