Train ViT on Small Dataset With Translation Perceptibility

CHEN HUAN (Institute of Computing Technology),* WENTAO WEI (Southeast University), Ping Yao (Institute of Computing Technology, Chinese Academy of Sciences)
The 34th British Machine Vision Conference


The Vision Transformer (ViT) has become a popular vision model in recent years, often replacing traditional Convolutional Neural Network (CNN) models. However, ViT models tend to require more training data because they lack some properties inherent in the CNN architecture. To address this problem, researchers have proposed various methods to improve ViT performance on small datasets. In this paper, we propose a self-supervised auxiliary task that guides ViT models to learn translation perceptibility. This enables the models to acquire inductive bias more efficiently from small datasets, without pre-training on large datasets or modifying the network architecture. We demonstrate the effectiveness of the approach, as well as of its scale perceptibility, on multiple small datasets. Applying it in conjunction with current state-of-the-art methods further improves performance.
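The abstract does not specify how the auxiliary task is constructed, so the following is only a generic sketch of one way a translation-prediction pretext sample could be generated. It is not the authors' method. The circular shift, the `max_shift` range, and the helper names `make_translation_sample` and `shift_to_class` are all assumptions introduced for illustration.

```python
import numpy as np

def make_translation_sample(image, max_shift, rng):
    """Hypothetical helper: circularly shift an H x W x C image by a random
    (dy, dx) offset and return the shifted image together with the offset.
    The offset can serve as a self-supervised label that needs no annotation."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    dy, dx = int(dy), int(dx)
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
    return shifted, (dy, dx)

def shift_to_class(dy, dx, max_shift):
    """Hypothetical helper: encode a 2-D offset as one class index, so an
    auxiliary head could be trained with a standard classification loss."""
    side = 2 * max_shift + 1
    return (dy + max_shift) * side + (dx + max_shift)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # stand-in for a small-dataset image
shifted, (dy, dx) = make_translation_sample(image, max_shift=4, rng=rng)
label = shift_to_class(dy, dx, max_shift=4)
```

In a setup like this, the auxiliary loss on `label` would be added to the usual classification loss during training, leaving the network architecture unchanged at inference time. Whether the paper uses offsets, classes, or a different formulation entirely is not stated in the abstract.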



author    = {CHEN HUAN and WENTAO WEI and Ping Yao},
title     = {Train ViT on Small Dataset With Translation Perceptibility},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year      = {2023},
url       = {}

Copyright © 2023 The British Machine Vision Association and Society for Pattern Recognition
