
A Sentence Alignment Approach to Document Alignment and Multi-faceted Filtering for Curating Parallel Sentence Pairs from Web-crawled Data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper describes the AST submission to the WMT23 Shared Task on Parallel Data Curation. We experiment with two approaches for curating data from the provided web-scraped texts. First, we use sentence alignment to identify document alignments in the data and extract parallel sentence pairs from the aligned documents. Second, all sentences not aligned in that step are paired based on cosine similarity, after which we apply a range of filters. For filtering, we use language detection, fluency classification, word alignments, cosine distance as calculated by multilingual sentence embedding models, and Bicleaner AI. Our best model outperforms the baseline by 1.9 BLEU points on average over the four provided evaluation sets.
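
The cosine-similarity pairing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which embedding model, threshold, or matching strategy the authors used, so everything below is an assumption made for illustration: the function names `cosine` and `pair_by_cosine`, the greedy best-match strategy, the threshold of 0.8, and the toy two-dimensional vectors standing in for multilingual sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_by_cosine(src_vecs, tgt_vecs, threshold=0.8):
    """Pair each source sentence with its most similar target sentence.

    Hypothetical greedy strategy: a pair is kept only if its similarity
    reaches `threshold` (an assumed value, not taken from the paper).
    Returns a list of (source_index, target_index, similarity) tuples.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_sim = -1, -1.0
        for j, v in enumerate(tgt_vecs):
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j >= 0 and best_sim >= threshold:
            pairs.append((i, best_j, best_sim))
    return pairs


# Toy vectors standing in for sentence embeddings of unaligned sentences.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 2.0], [3.0, 0.0]]
pairs = pair_by_cosine(src, tgt)
```

In this toy input the third source vector is equally distant from both targets and falls below the assumed threshold, so it is left unpaired. In a pipeline like the one described, candidate pairs that survive this step would then pass through the further filters listed in the abstract.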

Original language: English
Title of host publication: Proceedings of the 8th Conference on Machine Translation, WMT 2023
Publisher: Association for Computational Linguistics
Pages: 364-372
Number of pages: 9
ISBN (Electronic): 9798891760417
Publication status: Published - 2023
Event: 8th Conference on Machine Translation, WMT 2023 - Singapore, Singapore
Duration: 6 Dec 2023 to 7 Dec 2023

Publication series

Name: Conference on Machine Translation - Proceedings

Conference

Conference: 8th Conference on Machine Translation, WMT 2023
Country/Territory: Singapore
City: Singapore
Period: 6/12/23 to 7/12/23

Bibliographical note

Publisher Copyright: © 2023 Association for Computational Linguistics.
