Towards better language representation in Natural Language Processing: A multilingual dataset for text-level Grammatical Error Correction

  • Arianna Masciolini
  • Andrew Caines
  • Orphée De Clercq
  • Joni Kruijsbergen
  • Murathan Kurfali
  • Ricardo Muñoz Sánchez
  • Elena Volodina
  • Robert Östling
  • Kais Allkivi
  • Špela Arhar Holdt
  • Ilze Auzina
  • Roberts Dargis
  • Elena Drakonaki
  • Jennifer Carmen Frey
  • Isidora Glišič
  • Pinelopi Kikilintza
  • Lionel Nicolas
  • Mariana Romanyshyn
  • Alexandr Rosen
  • Alla Rozovskaya
  • Kristjan Suluste
  • Oleksiy Syvokon
  • Alexandros Tantos
  • Despoina Ourania Touriki
  • Konstantinos Tsiotskas
  • Eleni Tsourilla
  • Vassilis Varsamopoulos
  • Katrin Wisniewski
  • Aleš Žagar
  • Torsten Zesch

Research output: Contribution to journal, Review article, peer-review

Abstract

This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC distinguishes itself from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in resources used to train models for Natural Language Processing tasks which, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Aside from multilingualism, the novelty of the MultiGEC dataset is that it consists of full texts, typically learner essays, rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after the task's competitive phase, serving as a resource for training new error correction systems and performing cross-lingual GEC studies.

Original language: English
Journal: International Journal of Learner Corpus Research
Volume: 11
Issue number: 2
Publication status: Published - 15 May 2025
Externally published: Yes

Bibliographical note

Publisher Copyright: © John Benjamins Publishing Company.

Other keywords

  • Matthew effect
  • MultiGEC shared task
  • grammatical error correction
  • learner corpora
  • multilingual corpora
