Zero-Shot Video Captioning by Evolving Pseudo-tokens

Yoad Tewel* (Tel Aviv University), Yoav Shalev (Tel Aviv University), Roy Nadler (Tel Aviv University), Idan Schwartz (Technion), Lior Wolf (Tel Aviv University, Israel)
The 34th British Machine Vision Conference


We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model to generate sentences and the CLIP model to maintain a high average matching score between the generated text and the video frames. Existing zero-shot captioning methods use token-level optimization that drives the generation of each token to be related to the image. However, maintaining language fluency with a set of frames can be challenging since (i) a single token has to describe a set of non-homogeneous frames, and (ii) the generation may commit to a single direction, restricting the flexibility of the process. In our approach, we use pseudo-tokens that are updated after each complete sentence is generated, gradually improving the specificity and comprehensiveness of the sentence while letting the user control the level of specificity. The optimization takes the whole sentence into account and does not require beam search. Our experiments show that the generated captions are fluent and display a broad range of real-world knowledge for both videos and images. Moreover, while current supervised video captioning methods generate captions that often follow a short and generic pattern based on the datasets they were trained on, our approach generates diverse and descriptive captions that are much more appealing to humans. Our code is attached as supplementary material.
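
As a rough illustration of the "average matching score" mentioned in the abstract, the sketch below computes the mean cosine similarity between one sentence embedding and a set of frame embeddings. It is a minimal sketch, not the authors' implementation: the function names and the toy 3-dimensional vectors are assumptions made for illustration, and in the actual method the embeddings would come from CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_matching_score(text_embedding, frame_embeddings):
    """Mean cosine similarity between one sentence and every frame."""
    scores = [cosine(text_embedding, frame) for frame in frame_embeddings]
    return sum(scores) / len(scores)


# Hypothetical toy embeddings standing in for CLIP outputs (assumption).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two dissimilar frames
caption_a = [1.0, 1.0, 0.0]  # partially related to both frames
caption_b = [1.0, 0.0, 0.0]  # matches only the first frame

score_a = average_matching_score(caption_a, frames)
score_b = average_matching_score(caption_b, frames)
print(score_a > score_b)
```

In this toy setup the caption that relates to both frames receives the higher average score, which mirrors the abstract's point that a description of a video has to cover a set of non-homogeneous frames rather than a single image.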



author    = {Yoad Tewel and Yoav Shalev and Roy Nadler and Idan Schwartz and Lior Wolf},
title     = {Zero-Shot Video Captioning by Evolving Pseudo-tokens},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year      = {2023},
url       = {}

Copyright © 2023 The British Machine Vision Association and Society for Pattern Recognition