Why are we still translating sentences?
Speaker: Matt Post, Microsoft
Matt Post is a research scientist in the Microsoft Translator group, where he has been since 2021. He holds a courtesy appointment in the Department of Computer Science at Johns Hopkins University, where, prior to joining Microsoft, he spent roughly ten years as a research scientist at the Human Language Technology Center of Excellence (HLTCOE) and with the Center for Language and Speech Processing (CLSP). His interests center on machine translation, but he also enjoys working on practical applied problems across many areas of NLP. He has contributed to many open-source projects, including Joshua, Sockeye, Fairseq, and sacrebleu. He helped organize the WMT manual evaluation for many years, has served on the NAACL executive board, and is the director of the ACL Anthology.
The technologies and architectures underlying machine translation have changed a number of times over the decades, but apart from occasional research projects, the basic unit of translation has always been, and remains, the sentence. This paradigm persists despite the many clear advantages of translating at the document level, and its persistence grows more glaring as much of NLP moves to large language models, which are natively document-based. This talk will survey research in document translation, highlighting difficulties in training, modeling, and evaluation. We then propose simple, workable solutions in each of these areas that may help the field escape its sentence-level rut.
Joint work with Marcin Junczys-Dowmunt.