A Sentence Alignment Approach to Document Alignment and Multi-faceted Filtering for Curating Parallel Sentence Pairs from Web-crawled Data

Research output: Chapter in book/report/conference proceedings, Conference contribution, peer-reviewed

Abstract

This paper describes the AST submission to the WMT23 Shared Task on Parallel Data Curation. We experiment with two approaches for curating data from the provided web-scraped texts. We use sentence alignment to identify document alignments in the data and extract parallel sentence pairs from the aligned documents. The remaining sentences, which were not aligned in that step, are paired based on cosine similarity before we apply a range of filters. For filtering, we use language detection, fluency classification, word alignments, cosine distance as calculated by multilingual sentence embedding models, and Bicleaner AI. Our best model outperforms the baseline by 1.9 BLEU points on average over the four provided evaluation sets.
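
The cosine-similarity pairing step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the greedy best-match strategy, the 0.8 threshold, and the toy three-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a multilingual sentence embedding model, which this record does not name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)

def pair_by_cosine(src_vecs, tgt_vecs, threshold=0.8):
    """Hypothetical greedy pairing: for each source vector, pick the most
    similar target vector and keep the pair only if it reaches the threshold.
    The threshold value is an assumption, not taken from the paper."""
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_sim = -1, -1.0
        for j, v in enumerate(tgt_vecs):
            sim = cosine_similarity(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j >= 0 and best_sim >= threshold:
            pairs.append((i, best_j, best_sim))
    return pairs

# Toy stand-ins for sentence embeddings (assumed data, for illustration only).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.5, 0.5, 0.5]]
pairs = pair_by_cosine(src, tgt)
```

Here the first two source vectors find a close match, while the third falls below the threshold and is left unpaired. Pairs that survive this step would then go through the filters listed in the abstract.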

Original language: English
Title of host publication: Proceedings of the 8th Conference on Machine Translation, WMT 2023
Publisher: Association for Computational Linguistics
Pages: 364-372
Number of pages: 9
ISBN (electronic): 9798891760417
Publication status: Published - 2023
Event: 8th Conference on Machine Translation, WMT 2023 - Singapore, Singapore
Duration: 6 Dec 2023 to 7 Dec 2023

Publication series

Name: Conference on Machine Translation - Proceedings

Conference

Conference: 8th Conference on Machine Translation, WMT 2023
Country/Territory: Singapore
City: Singapore
Period: 6/12/23 to 7/12/23

Note

Publisher Copyright: © 2023 Association for Computational Linguistics.
