A mixture model for the analysis of data derived from record linkage

M. H. P. Hof; A. H. Zwinderman

doi:10.1002/sim.6315

Research output: Contribution to journal › Article › Academic › peer-review

8 Citations (Scopus)

Abstract

Combining information from two data sources depends on finding records that belong to the same individual (matches). Sometimes, unique identifiers per individual are not available, and we have to rely on partially identifying variables that are registered in both data sources. A risk of relying on these variables is that some records from both datasets are wrongly linked to each other, which introduces bias in further regression analyses. In this paper, we propose a mixture model where we treat the indicator of whether records belong to the same individual as missing. Each pair of records from both datasets contributes independently to a pairwise pseudo-likelihood, which we maximize with an expectation-maximization algorithm. Each part of the pseudo-likelihood is parameterized by the appropriate (parametric) density function. Moreover, some structures of the data allow for simplifying assumptions, which make the pseudo-likelihood considerably easier to parameterize. Because the optimization requires a product over all combinations of records from both datasets, we suggest a procedure that summarizes information from highly unlikely matches. With simulations, we showed that the new approach produces accurate estimates in different linkage scenarios. Moreover, the estimator remained accurate in scenarios where previously proposed analysis approaches give biased results. We applied the method to estimate the association between the pregnancy durations of the first- and second-born children of the same mother, using a register without a mother identifier.
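To make the idea of a latent match indicator maximized by expectation-maximization concrete, here is a minimal sketch in Python. It is not the authors' model: it is the simpler, classic latent-class mixture over binary field agreements (in the style of Fellegi and Sunter), and it omits the regression outcome densities and the summarizing of unlikely matches described in the abstract. The function names, the input format (one tuple of 0/1 agreement indicators per record pair), the starting values, and the simulated data are all assumptions made for illustration.

```python
import random


def em_match_mixture(pairs, n_iter=100, p=0.1, m_init=0.9, u_init=0.1):
    """EM for a two-component mixture over record pairs (illustrative only).

    pairs: list of equal-length tuples of 0/1 agreement indicators, one
           tuple per candidate record pair (hypothetical input format).
    Returns (p, m, u, weights): the estimated proportion of matches, the
    per-field agreement probabilities among matches (m) and non-matches (u),
    and each pair's posterior match probability from the final E-step.
    """
    n = len(pairs)
    k = len(pairs[0])
    m = [m_init] * k
    u = [u_init] * k
    eps = 1e-6  # keeps every probability strictly inside (0, 1)

    def clamp(x):
        return min(max(x, eps), 1.0 - eps)

    weights = [p] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match, assuming
        # fields agree independently given the (unobserved) match status.
        weights = []
        for g in pairs:
            like_m, like_u = p, 1.0 - p
            for agree, mk, uk in zip(g, m, u):
                like_m *= mk if agree else 1.0 - mk
                like_u *= uk if agree else 1.0 - uk
            weights.append(like_m / (like_m + like_u))
        # M-step: posterior-weighted proportions update p, m and u.
        total = sum(weights)
        p = clamp(total / n)
        for j in range(k):
            agree_m = sum(w * g[j] for w, g in zip(weights, pairs))
            agree_u = sum((1.0 - w) * g[j] for w, g in zip(weights, pairs))
            m[j] = clamp(agree_m / max(total, eps))
            u[j] = clamp(agree_u / max(n - total, eps))
    return p, m, u, weights


def simulate_pairs(n, p, m, u, seed=0):
    """Draw n synthetic agreement vectors from the same mixture (toy data)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        pairs.append(tuple(int(rng.random() < q) for q in probs))
    return pairs


if __name__ == "__main__":
    data = simulate_pairs(2000, 0.2, [0.95, 0.9, 0.9, 0.85], [0.1, 0.2, 0.1, 0.15])
    p_hat, m_hat, u_hat, _ = em_match_mixture(data)
    print(round(p_hat, 3), [round(x, 2) for x in m_hat], [round(x, 2) for x in u_hat])
```

The loop visits every candidate pair in each iteration, which illustrates why a product over all combinations of records becomes expensive for large registers.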

Original language	English
Pages (from-to)	74-92
Journal	Statistics in Medicine
Volume	34
Issue number	1
DOIs	https://doi.org/10.1002/sim.6315
Publication status	Published - 2015

Cite this

@article{f02085a8cb09481ab0c8013ec21e338d,

title = "A mixture model for the analysis of data derived from record linkage",

abstract = "Combining information from two data sources depends on finding records that belong to the same individual (matches). Sometimes, unique identifiers per individual are not available, and we have to rely on partially identifying variables that are registered in both data sources. A risk of relying on these variables is that some records from both datasets are wrongly linked to each other, which introduces bias in further regression analyses. In this paper, we propose a mixture model where we treat the indicator of whether records belong to the same individual as missing. Each pair of records from both datasets contributes independently to a pairwise pseudo-likelihood, which we maximize with an expectation-maximization algorithm. Each part of the pseudo-likelihood is parameterized by the appropriate (parametric) density function. Moreover, some structures of the data allow for simplifying assumptions, which make the pseudo-likelihood considerably easier to parameterize. Because the optimization requires a product over all combinations of records from both datasets, we suggest a procedure that summarizes information from highly unlikely matches. With simulations, we showed that the new approach produces accurate estimates in different linkage scenarios. Moreover, the estimator remained accurate in scenarios where previously proposed analysis approaches give biased results. We applied the method to estimate the association between the pregnancy durations of the first- and second-born children of the same mother, using a register without a mother identifier.",

author = "Hof, {M. H. P.} and Zwinderman, {A. H.}",

year = "2015",

doi = "10.1002/sim.6315",

language = "English",

volume = "34",

pages = "74--92",

journal = "Statistics in Medicine",

issn = "0277-6715",

publisher = "John Wiley and Sons Ltd",

number = "1",

}

TY - JOUR

T1 - A mixture model for the analysis of data derived from record linkage

AU - Hof, M. H. P.

AU - Zwinderman, A. H.

PY - 2015

Y1 - 2015

AB - Combining information from two data sources depends on finding records that belong to the same individual (matches). Sometimes, unique identifiers per individual are not available, and we have to rely on partially identifying variables that are registered in both data sources. A risk of relying on these variables is that some records from both datasets are wrongly linked to each other, which introduces bias in further regression analyses. In this paper, we propose a mixture model where we treat the indicator of whether records belong to the same individual as missing. Each pair of records from both datasets contributes independently to a pairwise pseudo-likelihood, which we maximize with an expectation-maximization algorithm. Each part of the pseudo-likelihood is parameterized by the appropriate (parametric) density function. Moreover, some structures of the data allow for simplifying assumptions, which make the pseudo-likelihood considerably easier to parameterize. Because the optimization requires a product over all combinations of records from both datasets, we suggest a procedure that summarizes information from highly unlikely matches. With simulations, we showed that the new approach produces accurate estimates in different linkage scenarios. Moreover, the estimator remained accurate in scenarios where previously proposed analysis approaches give biased results. We applied the method to estimate the association between the pregnancy durations of the first- and second-born children of the same mother, using a register without a mother identifier.

U2 - https://doi.org/10.1002/sim.6315

DO - 10.1002/sim.6315

M3 - Article

C2 - 25274539

SN - 0277-6715

VL - 34

SP - 74

EP - 92

JO - Statistics in Medicine

JF - Statistics in Medicine

IS - 1

ER -