UniLip: Learning Visual-Textual Mapping with Uni-Modal Data for Lip Reading


Bingquan Xia* (Institute of Computing Technology, Chinese Academy of Sciences), Shuang Yang (Institute of Computing Technology, Chinese Academy of Sciences), Shiguang Shan (Institute of Computing Technology, Chinese Academy of Sciences), Xilin Chen (Institute of Computing Technology, Chinese Academy of Sciences)
The 34th British Machine Vision Conference

Abstract

We propose UniLip, a novel approach that exploits uni-modal texts and uni-modal talking-face videos for lip reading. With only uni-modal data, we achieve fully unsupervised lip reading for the first time. We reformulate lip reading with uni-modal data into two sub-tasks: learning linguistic priors from uni-modal texts, and learning to map uni-modal videos to texts under the constraint of these priors. We cast the two sub-tasks as language modeling and conditional generation, respectively, and introduce a multi-grained adversarial learning strategy that embeds them into a unified framework. Specifically, we construct a discriminator that learns linguistic priors from uni-modal texts, and these priors are then used to supervise the generation of text distributions conditioned on the input videos. Uni-modal texts typically contain both diverse source-specific biases and consistent linguistic features. To guide text generation precisely, we aim to encode the general linguistic priors while alleviating the biases of the text sources. Since linguistic features often relate to local language patterns such as word spelling and grammatical correctness, we introduce a novel multi-grained discrimination strategy based on local n-gram sub-utterances. On the other hand, with only uni-modal data, learning visual speech cues is difficult due to the lack of strong and explicit supervision. We therefore first leverage self-supervised models to extract base visual features and then adapt them to our task through a necessary multi-grained feature fusion module. With only uni-modal data, our best model achieves unsupervised Word Error Rates of 51.2% and 57.3% on LRS3 and LRS2, respectively; the result on LRS3 is comparable with mainstream supervised models trained on the same dataset. With both uni-modal and labeled data, UniLip can also work alongside traditional supervised frameworks: in our experiments, it improves supervised Seq2Seq methods by a relative 4.2% and 1.4% on LRS3 and LRS2, respectively.
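
The minimal PyTorch sketch below illustrates the core adversarial setup described in the abstract: a conditional generator maps (self-supervised) visual features to per-frame character distributions, and a discriminator learns linguistic priors by scoring local n-gram sub-utterances drawn from real uni-modal text versus the generated distributions. All module names, dimensions, losses, and the single n-gram scale are illustrative assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 32       # assumed character vocabulary size
FEAT_DIM = 768   # assumed dimension of self-supervised visual features
NGRAM = 5        # assumed n-gram window for "local" discrimination

class VideoToTextGenerator(nn.Module):
    """Maps a sequence of visual features to a sequence of character distributions."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(FEAT_DIM, 256, batch_first=True, bidirectional=True)
        self.head = nn.Linear(512, VOCAB)

    def forward(self, video_feats):                 # (B, T, FEAT_DIM)
        h, _ = self.encoder(video_feats)
        return F.softmax(self.head(h), dim=-1)      # (B, T, VOCAB) soft text distributions

class NGramDiscriminator(nn.Module):
    """Scores local n-gram sub-utterances, encouraging word- and grammar-level realism."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(VOCAB, 128, kernel_size=NGRAM),   # one score per n-gram window
            nn.ReLU(),
            nn.Conv1d(128, 1, kernel_size=1),
        )

    def forward(self, text_dist):                   # (B, T, VOCAB), soft or one-hot
        return self.net(text_dist.transpose(1, 2)).squeeze(1)   # (B, T - NGRAM + 1)

def adversarial_step(gen, disc, video_feats, real_text_ids, g_opt, d_opt):
    """One simplified alternating update: discriminator on real vs. generated n-grams,
    then generator against the discriminator's learned linguistic prior."""
    real_onehot = F.one_hot(real_text_ids, VOCAB).float()   # unpaired uni-modal text

    # --- discriminator update ---
    fake = gen(video_feats).detach()
    real_scores, fake_scores = disc(real_onehot), disc(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_scores, torch.ones_like(real_scores))
              + F.binary_cross_entropy_with_logits(fake_scores, torch.zeros_like(fake_scores)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- generator update ---
    gen_scores = disc(gen(video_feats))
    g_loss = F.binary_cross_entropy_with_logits(gen_scores, torch.ones_like(gen_scores))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# Example usage with random stand-ins for unpaired videos and texts:
gen, disc = VideoToTextGenerator(), NGramDiscriminator()
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
video_feats = torch.randn(2, 40, FEAT_DIM)          # features from a self-supervised visual model
real_text_ids = torch.randint(0, VOCAB, (2, 30))    # character indices from uni-modal text
print(adversarial_step(gen, disc, video_feats, real_text_ids, g_opt, d_opt))

Because the generator outputs continuous distributions, gradients flow through the discriminator without discrete sampling. This sketch uses a single n-gram scale; the paper's multi-grained strategy discriminates local sub-utterances at multiple granularities, and the video and text batches are unpaired throughout.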

Citation

@inproceedings{Xia_2023_BMVC,
author    = {Bingquan Xia and Shuang Yang and Shiguang Shan and Xilin Chen},
title     = {UniLip: Learning Visual-Textual Mapping with Uni-Modal Data for Lip Reading},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year      = {2023},
url       = {https://papers.bmvc2023.org/0190.pdf}
}


