SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers

Guoqiang Jin (SenseTime Research),* Fan Yang (Institute of Automation, Chinese Academy of Sciences), Mingshan Sun (SenseTime Research), Ruyi Zhao (Tongji University), Yakun Liu (SenseTime Research), Wei Li (SenseTime Research), Tianpeng Bao (SenseTime Research), Liwei Wu (SenseTime Research), Xingyu ZENG (SenseTime Group Limited), Rui Zhao (SenseTime Group Limited)
The 34th British Machine Vision Conference


Self-supervised pre-training and transformer-based architectures have significantly improved object detection performance. However, most current self-supervised object detection methods are built on convolution-based architectures. We argue that the sequence characteristics of transformers should be taken into account when designing a transformer-based self-supervised method for object detection. To this end, we propose SeqCo-DETR, a novel Sequence Consistency-based self-supervised method for object DEtection with TRansformers. SeqCo-DETR defines a simple yet effective pretext task: it minimizes the discrepancy between the output sequences of the transformer when different views of an image are given as input, using bipartite matching to find the most relevant sequence pairs, i.e., those that predict the same object. Furthermore, we introduce a complementary mask strategy that works together with the sequence consistency strategy to extract more representative contextual information about objects for the detection task. Our method achieves state-of-the-art results on MS COCO (45.8 AP) and PASCAL VOC (64.1 AP), demonstrating its effectiveness.
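To make the matching step concrete, the following is a minimal sketch of a matching-based sequence consistency loss. It assumes that each transformer output sequence is a list of per-query embedding vectors, that the matching cost is plain Euclidean distance between embeddings, and that the loss is the mean distance over matched pairs; the actual SeqCo-DETR cost and loss may differ. All function names are hypothetical. The brute-force search over permutations is exact but only practical for a handful of queries; a realistic implementation would use the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two query embeddings (assumed cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cost_matrix(seq_a: Sequence[Vector], seq_b: Sequence[Vector]) -> List[List[float]]:
    """Pairwise cost between every query of view A and every query of view B."""
    return [[l2_distance(a, b) for b in seq_b] for a in seq_a]


def bipartite_match(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Exact minimum-cost one-to-one matching by brute force (small N only).

    Returns (index_in_view_a, index_in_view_b) pairs.
    """
    n = len(cost)
    best_perm = min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return [(i, best_perm[i]) for i in range(n)]


def sequence_consistency_loss(seq_a: Sequence[Vector], seq_b: Sequence[Vector]) -> float:
    """Hypothetical loss: mean distance between matched output queries."""
    if len(seq_a) != len(seq_b):
        raise ValueError("both views must produce the same number of queries")
    cost = cost_matrix(seq_a, seq_b)
    pairs = bipartite_match(cost)
    return sum(cost[i][j] for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    # Two "views" whose queries describe the same objects in a different order.
    view_a = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    view_b = [[5.0, 5.0], [0.0, 0.0], [1.0, 1.0]]
    print(bipartite_match(cost_matrix(view_a, view_b)))
    print(sequence_consistency_loss(view_a, view_b))
```

Because the two views may emit predictions for the same object at different query positions, comparing sequences index by index would penalize a mere reordering; matching first makes the loss invariant to query order, which is the intuition behind using bipartite matching here. In the example, the queries are a permutation of each other, so the matched loss is zero.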



