A filter for syntactically incomparable parallel sentences

Martin Kroon, Sjef Barbiers, Jan Odijk, Stéphanie Van Der Pas

Research output: Contribution to journal > Article > Academic > peer-review


Abstract

Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve "free" translations. In this paper we explore four possible filters: the Damerau-Levenshtein distance between POS tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable across language pairs, while the combination filter achieves the best results.
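To make the two simplest filters concrete, the sketch below computes a Damerau-Levenshtein distance over POS-tag sequences and a sentence-length ratio. It is an illustration only, not the authors' implementation: the function names, the use of the restricted (optimal string alignment) variant of the distance, the shorter-over-longer definition of the ratio, and the example tag sequences are all assumptions, since the abstract does not specify them.

```python
from typing import Sequence


def damerau_levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent items. Items here are whole POS tags, not characters.
    Assumption: the paper may use a different variant or normalization.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent tags.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def length_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Shorter length divided by longer length, a value in [0, 1].

    Assumption: the paper may define the ratio differently.
    """
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


# Hypothetical POS-tag sequences for one aligned sentence pair.
source_tags = ["DET", "NOUN", "VERB", "ADJ"]
target_tags = ["DET", "NOUN", "ADJ", "VERB"]

pos_distance = damerau_levenshtein(source_tags, target_tags)
ratio = length_ratio(source_tags, target_tags)
```

A filter built on these scores would discard pairs whose distance is too high or whose ratio is too low. The cut-off values would have to be tuned per language pair and are not given in the abstract.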

Original language: English
Pages (from-to): 147-161
Number of pages: 15
Journal: Linguistics in the Netherlands
Volume: 36
Issue number: 1
Publication status: Published - 5 Nov 2019
Externally published: Yes

Keywords

  • Dependency parses
  • Filter
  • Parallel corpus
  • Syntactic comparability
