TY - JOUR
T1 - Towards better language representation in Natural Language Processing: A multilingual dataset for text-level Grammatical Error Correction
AU - Masciolini, Arianna
AU - Caines, Andrew
AU - De Clercq, Orphée
AU - Kruijsbergen, Joni
AU - Kurfali, Murathan
AU - Muñoz Sánchez, Ricardo
AU - Volodina, Elena
AU - Östling, Robert
AU - Allkivi, Kais
AU - Holdt, Špela Arhar
AU - Auzina, Ilze
AU - Dargis, Roberts
AU - Drakonaki, Elena
AU - Frey, Jennifer Carmen
AU - Glišič, Isidora
AU - Kikilintza, Pinelopi
AU - Nicolas, Lionel
AU - Romanyshyn, Mariana
AU - Rosen, Alexandr
AU - Rozovskaya, Alla
AU - Suluste, Kristjan
AU - Syvokon, Oleksiy
AU - Tantos, Alexandros
AU - Touriki, Despoina Ourania
AU - Tsiotskas, Konstantinos
AU - Tsourilla, Eleni
AU - Varsamopoulos, Vassilis
AU - Wisniewski, Katrin
AU - Žagar, Aleš
AU - Zesch, Torsten
N1 - Publisher Copyright: © John Benjamins Publishing Company.
PY - 2025/5/15
Y1 - 2025/5/15
N2 - This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC distinguishes itself from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in resources used to train models for Natural Language Processing tasks that, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Aside from multilingualism, the novelty of the MultiGEC dataset is that it consists of full texts (typically learner essays) rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after its competitive phase, serving as a resource to train new error correction systems and perform cross-lingual GEC studies.
AB - This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC distinguishes itself from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in resources used to train models for Natural Language Processing tasks that, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Aside from multilingualism, the novelty of the MultiGEC dataset is that it consists of full texts (typically learner essays) rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after its competitive phase, serving as a resource to train new error correction systems and perform cross-lingual GEC studies.
KW - Matthew effect
KW - MultiGEC shared task
KW - grammatical error correction
KW - learner corpora
KW - multilingual corpora
UR - https://www.scopus.com/pages/publications/105003035015
U2 - 10.1075/ijlcr.24033.mas
DO - 10.1075/ijlcr.24033.mas
M3 - Review article
SN - 2215-1478
VL - 11
JO - International Journal of Learner Corpus Research
JF - International Journal of Learner Corpus Research
IS - 2
ER -