Video-and-Language (VidL) models and their cognitive relevance

Anne Zonneveld, Albert Gatt, Iacer Calixto

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

In this paper, we present a narrative review of multimodal video-and-language (VidL) models. We introduce the current landscape of VidL models and benchmarks, and draw inspiration from neuroscience and cognitive science to propose avenues for future research in VidL models in particular and artificial intelligence (AI) in general. We argue that iterative feedback loops between AI, neuroscience, and cognitive science are essential to spur progress across these disciplines. We motivate why we focus specifically on VidL models and their benchmarks as a promising class of models for advancing AI, and categorise current VidL efforts along multiple 'cognitive relevance axioms'. Finally, we provide suggestions on how to effectively incorporate this interdisciplinary viewpoint into research on VidL models in particular and AI in general. In doing so, we hope to raise awareness of the potential of VidL models to narrow the gap between neuroscience, cognitive science, and AI.
Original language: English
Title of host publication: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops
Pages: 325-338
Number of pages: 8
Publication status: Published - 2023
