Video Infilling with Rich Motion Prior


Xinyu Hou (Nanyang Technological University),* Liming Jiang (Nanyang Technological University), Rui Shao (Harbin Institute of Technology (Shenzhen)), Chen Change Loy (Nanyang Technological University)
The 34th British Machine Vision Conference

Abstract

Video infilling is the task of generating visually smooth and plausible intermediate frames between given context frames. The infilling interval is usually large, so the intermediate content undergoes significant and non-uniform changes in motion. To handle this challenging task, the model must learn robust motion dynamics in order to synthesize rich and plausible motion trajectories between the given contexts. In this work, we demonstrate the possibility of learning a rich motion prior for video infilling via masked motion modeling. Our key insight is that the strong ability of masked autoencoders to capture long-range dependencies can help us model, and therefore generate, rich and realistic in-between motions. Unlike previous multi-scale optical flow-based video interpolation methods, our framework is simple yet effective in longer-interval and larger-motion cases. In particular, we use the optical flow tokens learned by a pre-trained discrete tokenizer as the reconstruction target in masked motion modeling. With a random masking ratio over 0.5 during training, reasonable intermediate optical flows can be predicted by iterative decoding during inference. To demonstrate pixel-level infilling results, a dedicated bi-directional fusion of the warping results is applied. Through experiments on a human action dataset, we demonstrate the effectiveness of our approach in predicting valid and diverse motions between given contexts. Quantitative results on pixel-level evaluation metrics show that our approach can outperform previous state-of-the-art methods even with the naïve fusion results.
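
The masking and decoding steps mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes flow tokens are plain integer ids, that the masking ratio is drawn uniformly from [0.5, 1], and that iterative decoding commits the most confident predictions first over a fixed number of rounds (a common scheme for masked token models; the abstract does not specify the schedule). The names `MASK`, `random_mask`, `iterative_decode`, and `toy_predict` are hypothetical, and `toy_predict` merely stands in for a trained model.

```python
import math
import random

MASK = -1  # hypothetical id marking a masked optical flow token


def random_mask(tokens, rng, min_ratio=0.5):
    """Mask a random subset of tokens; the ratio is drawn from [min_ratio, 1]."""
    ratio = rng.uniform(min_ratio, 1.0)  # assumed sampling scheme
    n_mask = min(len(tokens), max(1, math.ceil(ratio * len(tokens))))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_idx else t for i, t in enumerate(tokens)]


def iterative_decode(tokens, predict_fn, steps=4):
    """Fill MASK positions over several rounds, most confident predictions first."""
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # predict_fn returns {position: (token_id, confidence)} for masked positions
        preds = predict_fn(tokens, masked)
        # Commit an equal share of the remaining masked positions each round,
        # so the final round fills everything that is still masked.
        n_keep = math.ceil(len(masked) / (steps - step))
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_keep]:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens, masked):
    """Stand-in for a trained model: always predicts id 7, favouring early positions."""
    return {i: (7, 1.0 / (1 + i)) for i in masked}


rng = random.Random(0)
masked = random_mask(list(range(8)), rng)
filled = iterative_decode(masked, toy_predict, steps=3)
```

In this sketch the context tokens that were never masked pass through unchanged, and only the masked positions are replaced by predictions.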

Citation

@inproceedings{Hou_2023_BMVC,
author    = {Xinyu Hou and Liming Jiang and Rui Shao and Chen Change Loy},
title     = {Video Infilling with Rich Motion Prior},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year      = {2023},
url       = {https://papers.bmvc2023.org/0103.pdf}
}


Copyright © 2023 The British Machine Vision Association and Society for Pattern Recognition
