Colloquium PR[AI]RIE

Combining modalities: two experiments on multimodal NLP

16/05/2023
14:00

Zoom: https://u-paris.zoom.us/j/82231267433?pwd=SHl6YkpIM3ZFck5oNTN4UWR1dkRldz09

Speaker: Benoît Sagot, Inria

Bio

Benoît Sagot is a Research Director at Inria, where he headed the ALPAGE team (2014-2016) and has headed the ALMAnaCH team since 2017. He is also a co-founder of two Inria start-ups, Verbatim Analysis (founded 2009) and opensquare (founded 2016).

Abstract

The spread of neural networks across all subfields of Artificial Intelligence (AI), including Natural Language Processing (NLP), speech processing and computer vision, has drastically changed how we can tackle multimodal tasks and has enabled new unified approaches. In this talk, after a brief review of this ongoing paradigm shift, I will describe two recent works involving multimodality in relation to machine translation (MT), one involving speech and the other images.

I will first present a new approach to zero-shot cross-modal transfer between speech and text for translation tasks. It relies on a modular architecture in which multilingual speech and text are encoded in a joint fixed-size representation space. Despite this bottleneck, and although no cross-modal labelled translation data is used during training, we achieve competitive results on all translation tasks.

I will then present a novel approach to image-enhanced MT, also known as multimodal MT (MMT). Recent work has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations but also by the lack of specific evaluation and training data. I will describe our approach to the task as well as CoMMuTE, our new contrastive multimodal translation evaluation dataset. Our method obtains results competitive with strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks, and outperforms baselines and state-of-the-art MMT systems by a large margin on our contrastive test set. I will conclude with a few thoughts on the future of multimodal NLP in the context of a new generation of conversational agents.
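For readers less familiar with this setup, the following minimal sketch (in PyTorch) illustrates the joint fixed-size representation idea mentioned above; all class names, dimensions and pooling choices here are invented for the example and are not those of the actual system.

import torch
import torch.nn as nn

# Toy illustration (not the actual architecture): modality-specific encoders
# map speech or text into a single joint fixed-size vector space.
EMB_DIM = 512  # the fixed-size "bottleneck" shared by all modalities

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM)

    def forward(self, token_ids):
        # Mean-pool token embeddings into one fixed-size sentence vector.
        return self.embed(token_ids).mean(dim=1)

class SpeechEncoder(nn.Module):
    def __init__(self, n_mel=80):
        super().__init__()
        self.proj = nn.Linear(n_mel, EMB_DIM)

    def forward(self, mel_frames):
        # Mean-pool projected acoustic frames into the same vector space.
        return self.proj(mel_frames).mean(dim=1)

# Both encoders target the same space, so a decoder trained only on text
# embeddings can, in principle, consume speech embeddings at test time --
# the zero-shot cross-modal transfer the abstract refers to.
text_vec = TextEncoder()(torch.randint(0, 1000, (2, 7)))  # (batch, dim)
speech_vec = SpeechEncoder()(torch.randn(2, 120, 80))     # (batch, dim)
assert text_vec.shape == speech_vec.shape == (2, EMB_DIM)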

The work on cross-modal MT between text and speech was carried out in collaboration with Paul-Ambroise Duquenne (Meta and Inria) and Holger Schwenk (Meta). The work on MMT was carried out in collaboration with Matthieu Futeral-Peter (Inria; PRAIRIE), Rachel Bawden (Inria; PRAIRIE), Ivan Laptev (Inria and ENS; PRAIRIE) and Cordelia Schmid (Inria and ENS; PRAIRIE). Both works were also carried out under the umbrella of my PRAIRIE chair.