How Can Contrastive Pre-training Benefit Audio-Visual Segmentation? A Study from Supervised and Zero-shot Perspectives


Jiarui Yu (University of Science and Technology of China)*, Haoran Li (University of Science and Technology of China), Yanbin Hao (University of Science and Technology of China), Wu Jinmeng (Wuhan Institute of Technology), Tong Xu (University of Science and Technology of China), Shuo Wang (University of Science and Technology of China), Xiangnan He (University of Science and Technology of China)
The 34th British Machine Vision Conference

Abstract

In a similar spirit to the successful contrastive language-image pre-training (CLIP), audio-aware contrastive pre-training has shown a strong ability to align instances in audio retrieval and audio-guided image generation. In this paper, we aim to extend its capabilities to the pixel level to achieve audio-visual segmentation (AVS). Specifically, we explore the following question: how can the instance-level alignment knowledge gained from contrastive pre-training benefit pixel-level audio-visual segmentation? We approach this question from two perspectives in AVS: a supervised setting and a zero-shot setting. In the supervised setting, we enhance the instance-level AudioCLIP model by incorporating a pixel-wise multi-modal fusion module. This leads to a simple yet effective model, AC-FPN, that produces pixel-level predictions for sounding objects and is trained following the standard AVS procedure. In the zero-shot setting, we further investigate the feasibility of prompting the Segment-Anything-Model (SAM) for AVS by proposing three prompt formulation strategies based on instance-level contrastive pre-training models. Experimental results on both subtasks demonstrate the potential of leveraging instance-level contrastive pre-training to advance audio-visual segmentation to the pixel level.

Citation

@inproceedings{Yu_2023_BMVC,
author    = {Jiarui Yu and Haoran Li and Yanbin Hao and Wu Jinmeng and Tong Xu and Shuo Wang and Xiangnan He},
title     = {How Can Contrastive Pre-training Benefit Audio-Visual Segmentation? A Study from Supervised and Zero-shot Perspectives},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year      = {2023},
url       = {https://papers.bmvc2023.org/0367.pdf}
}


Copyright © 2023 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
