Natural language processing
Natural Language Processing (NLP) lies at the intersection of linguistics, computer science and artificial intelligence. The general public knows it through applications such as machine translation and fake news detection, to name just two, and more recently through generative language models (LMs) such as ChatGPT.
A core part of our NLP activity is dedicated to the design and training of models for various tasks, with the aim of improving both training/inference efficiency and performance. Model efficiency matters from both an ecological and a user-oriented perspective, given the high computational cost of training and deploying large LMs and their growing democratisation. One of the most important aspects of improving model performance (and one of our priorities) is creating data: (i) task-annotated data, (ii) evaluation data and (iii) monolingual data (e.g. the OSCAR corpus project) for training LMs. We cover a wide range of languages, dialects and idiolects, including non-standard language (e.g. user-generated content) and low-resource scenarios, which require dedicated techniques. These include training models to generalise better, creating additional resources and performing cross-lingual transfer. We also address how to bring domain knowledge from linguistics into large LMs to improve them, e.g. by using knowledge-based features to fine-tune models and re-rank their outputs. Also important is the interpretability of NLP models: understanding what properties they learn, how they work and how they could be improved.
Within PRAIRIE, NLP interacts with various other fields, owing to the usefulness of text processing for information extraction, content analysis and accessing data trends. These fields include health (e.g. the analysis of electronic health records, medical interviews and reported symptoms on social media), politics and sociology (e.g. the analysis of press articles and the study of information spread), literary analysis and production (e.g. poetry generation), conversational and discourse analysis and generation (e.g. the detection and generation of speaker personality in dialogues, and of indirect language such as hedges) and linguistics (the analysis of language structure, literary corpus analysis, diachronic change, etc.). Finally, an emerging topic is the interaction of text with other modalities, including speech, embodied behaviours (such as gestures and facial movements) and images.