A filter for syntactically incomparable parallel sentences

Martin Kroon; Sjef Barbiers; Jan Odijk; Stéphanie Van Der Pas

doi:https://doi.org/10.1075/avt.00029.kro

Research output: Contribution to journal › Article › Academic › peer-review

1 Citation (Scopus)

Abstract

Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve "free" translations. In this paper we explore four possible filters: The Damerau-Levenshtein distance between POS-Tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable throughout language pairs, while the combination filter achieves the best results.
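The two simplest filters named in the abstract can be illustrated with a minimal sketch. It assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, applied to sequences of POS tags rather than characters, and a shorter-over-longer definition of the sentence-length ratio. The function names, the normalization by the longer sequence, and the example tags are illustrative assumptions; the abstract does not state which variant, normalization, or thresholds the authors used.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and swaps of two
    adjacent items. Works on any sequences, here lists of POS tags.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. VERB ADV versus ADV VERB.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def normalized_pos_distance(src_tags: Sequence[str], tgt_tags: Sequence[str]) -> float:
    """Distance divided by the longer sequence length (assumed normalization)."""
    longest = max(len(src_tags), len(tgt_tags))
    return osa_distance(src_tags, tgt_tags) / longest if longest else 0.0


def length_ratio(src_tokens: Sequence[str], tgt_tokens: Sequence[str]) -> float:
    """Shorter length over longer length, so the value lies in [0, 1]."""
    shorter, longer = sorted((len(src_tokens), len(tgt_tokens)))
    return shorter / longer if longer else 1.0


# Hypothetical tag sequences for a sentence pair that differs in word order only.
src = ["DET", "NOUN", "VERB", "ADV"]
tgt = ["DET", "NOUN", "ADV", "VERB"]
print(osa_distance(src, tgt))             # one adjacent swap
print(normalized_pos_distance(src, tgt))
print(length_ratio(src, tgt))
```

A filter built on these scores would compare each one against a cut-off and discard pairs that fall on the wrong side of it. Any such cut-off would have to be chosen separately, since this record gives none.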

Original language	English
Pages (from-to)	147-161
Number of pages	15
Journal	Linguistics in the Netherlands
Volume	36
Issue number	1
DOIs	https://doi.org/10.1075/avt.00029.kro
Publication status	Published - 5 Nov 2019
Externally published	Yes

Keywords

Dependency parses
Filter
Parallel corpus
Syntactic comparability

Access to Document

https://doi.org/10.1075/avt.00029.kro

Cite this

@article{4a52eedaa04d4a89ad39bffee07c7ebf,
title = "A filter for syntactically incomparable parallel sentences",
abstract = "Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve {"}free{"} translations. In this paper we explore four possible filters: The Damerau-Levenshtein distance between POS-Tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable throughout language pairs, while the combination filter achieves the best results.",
keywords = "Dependency parses, Filter, Parallel corpus, Syntactic comparability",
author = "Martin Kroon and Sjef Barbiers and Jan Odijk and {Van Der Pas}, St{\'e}phanie",
year = "2019",
month = nov,
day = "5",
doi = "10.1075/avt.00029.kro",
language = "English",
volume = "36",
pages = "147--161",
journal = "Linguistics in the Netherlands",
issn = "0929-7332",
publisher = "John Benjamins Publishing Company",
number = "1",
}

TY  - JOUR
T1  - A filter for syntactically incomparable parallel sentences
AU  - Kroon, Martin
AU  - Barbiers, Sjef
AU  - Odijk, Jan
AU  - Van Der Pas, Stéphanie
PY  - 2019/11/5
Y1  - 2019/11/5
N2  - Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve "free" translations. In this paper we explore four possible filters: The Damerau-Levenshtein distance between POS-Tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable throughout language pairs, while the combination filter achieves the best results.
AB  - Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve "free" translations. In this paper we explore four possible filters: The Damerau-Levenshtein distance between POS-Tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable throughout language pairs, while the combination filter achieves the best results.
KW  - Dependency parses
KW  - Filter
KW  - Parallel corpus
KW  - Syntactic comparability
UR  - http://www.scopus.com/inward/record.url?scp=85074930637&partnerID=8YFLogxK
U2  - https://doi.org/10.1075/avt.00029.kro
DO  - https://doi.org/10.1075/avt.00029.kro
M3  - Article
SN  - 0929-7332
VL  - 36
SP  - 147
EP  - 161
JO  - Linguistics in the Netherlands
JF  - Linguistics in the Netherlands
IS  - 1
ER  - 
