Zero-shot Composed Text-Image Retrieval

Yikun Liu (Beijing University of Posts and Telecommunications), Jiangchao Yao (Cooperative Medianet Innovation Center, Shanghai Jiao Tong University), Ya Zhang (Cooperative Medianet Innovation Center, Shanghai Jiao Tong University), Yan-Feng Wang (Cooperative Medianet Innovation Center, Shanghai Jiao Tong University), Weidi Xie (Shanghai Jiao Tong University)
The 34th British Machine Vision Conference


In this paper, we consider the problem of composed image retrieval (CIR), with the goal of developing models that can understand and combine multi-modal information, e.g., text and images, to accurately retrieve images that match the query, thereby extending the user’s expressive ability. We make the following contributions: (i) we propose a scalable pipeline to automatically construct datasets for training CIR models, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of the core components in TransAgg; (iv) when evaluated on publicly available benchmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets and then directly performing inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models.
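To make the retrieval setting concrete, the following is a minimal sketch of a composed query: a reference-image embedding and a modification-text embedding are mixed with softmax-normalised weights, and gallery images are ranked by cosine similarity. It is not the TransAgg architecture. The function names, the two-way weighting, the `gate_logits` parameter, and the toy three-dimensional embeddings are all assumptions made for illustration; in practice the embeddings would come from a pretrained vision-language encoder and the weights would be predicted by the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(logits):
    """Numerically stable softmax over a short list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_query(image_emb, text_emb, gate_logits):
    """Mix the two modality embeddings with weights derived from gate_logits.

    gate_logits is a hypothetical stand-in for weights that a trained model
    would predict per query; here the caller supplies them directly.
    """
    w_img, w_txt = softmax(gate_logits)
    mixed = [w_img * i + w_txt * t for i, t in zip(image_emb, text_emb)]
    return l2_normalize(mixed)


def rank_candidates(query, candidates):
    """Return candidate names sorted by cosine similarity to the query."""
    scored = []
    for name, emb in candidates.items():
        unit = l2_normalize(emb)
        scored.append((sum(q * c for q, c in zip(query, unit)), name))
    return [name for _, name in sorted(scored, reverse=True)]


# Toy embeddings (assumed values, not taken from any dataset).
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
gallery = {
    "target": [1.0, 1.0, 0.0],          # matches both image and text cues
    "reference_only": [1.0, 0.0, 0.0],  # matches the reference image only
    "unrelated": [0.0, 0.0, 1.0],
}

query = compose_query(reference_image, modification_text, gate_logits=[0.0, 0.0])
ranking = rank_candidates(query, gallery)
print(ranking)
```

With equal gate logits the query sits halfway between the two modalities, so the gallery item that reflects both the reference image and the text modification ranks first. In the zero-shot setting described above, such a ranking would be computed on a downstream benchmark without any training on that benchmark.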



author    = {Yikun Liu and Jiangchao Yao and Ya Zhang and Yan-Feng Wang and Weidi Xie},
title     = {Zero-shot Composed Text-Image Retrieval},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year      = {2023},
url       = {}

