Vision Transformers are Inherently Saliency Learners

The 34th British Machine Vision Conference


Training convolutional neural network (CNN) auto-encoders has been the de facto approach for visual attention modelling. Recently, Vision Transformer (ViT) models have achieved excellent performance on various computer vision tasks. In this context, the self-attention mechanism plays a crucial role by enabling early aggregation of global information, and ViT residual connections strongly propagate features from lower to higher layers. This raises two important questions: are Vision Transformers inherently learning saliency maps? Are the self-attention maps focusing on the salient regions of the input image? Analyzing the self-attention maps of pretrained ViTs on saliency prediction datasets, we find that smoothing the internal attention maps with a small number of convolutional filters yields reasonable saliency maps with acceptable metric scores. We explore how this phenomenon arises, finding that self-attention promotes early aggregation of global information; then, in higher layers, it associates highly attended features, compares their dependencies, and makes analogies over the recurring patterns. This suggests that ViTs first perform feature search, followed by conjunction search that combines multiple features sharing higher mutual information. We study the analogies between the self-attention maps and human-generated saliency maps, and conclude with a discussion of the relationship to theories of human visual attention such as feature integration theory.
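The idea of smoothing an attention map into a saliency-like map can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the filters or which attention maps are used, so the sketch assumes a single CLS-to-patch attention row, a fixed 3x3 binomial kernel standing in for the convolutional filters, and min-max normalization. All function names (`cls_attention_grid`, `smooth`, `normalize`, `toy_saliency`) are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cls_attention_grid(query_cls, keys, grid_h, grid_w):
    """Scaled dot-product attention of one CLS query over the patch keys,
    reshaped to the (grid_h x grid_w) patch layout."""
    assert len(keys) == grid_h * grid_w
    scale = math.sqrt(len(query_cls))
    scores = [sum(q * k for q, k in zip(query_cls, key)) / scale for key in keys]
    weights = softmax(scores)
    return [weights[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]

def smooth(grid, kernel):
    """'Same'-size 2D filtering with zero padding (odd-sized kernel assumed).
    Written as cross-correlation; identical to convolution for the
    symmetric kernel used here."""
    h, w = len(grid), len(grid[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * grid[rr][cc]
            out[r][c] = acc
    return out

def normalize(grid):
    """Min-max rescale to [0, 1], the usual range for saliency maps."""
    lo = min(min(row) for row in grid)
    hi = max(max(row) for row in grid)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in grid]

# Assumption: a fixed binomial kernel replaces the learned filters.
KERNEL = [[1 / 16, 2 / 16, 1 / 16],
          [2 / 16, 4 / 16, 2 / 16],
          [1 / 16, 2 / 16, 1 / 16]]

def toy_saliency():
    """Toy 3x3 patch grid in which only the centre patch matches the query."""
    query = [1.0, 0.0]
    keys = [[0.0, 0.0] for _ in range(9)]
    keys[4] = [3.0, 0.0]
    attn = cls_attention_grid(query, keys, 3, 3)
    return normalize(smooth(attn, KERNEL))
```

In this toy setting the smoothed map peaks at the centre patch and decays towards the corners. With a real pretrained ViT, the attention rows would come from the model's own attention layers and the filters would be learned against ground-truth saliency maps rather than fixed.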


author    = {YASSER ABDELAZIZ DAHOU DJILALI and Kevin McGuinness and Noel O Connor},
title     = {Vision Transformers are Inherently Saliency Learners},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year      = {2023},
url       = {}

Copyright © 2023 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
