Ignoring dependency between linking variables and its impact on the outcome of probabilistic record linkage studies

Miranda Tromp, Nora Meray, Anita C. J. Ravelli, Johannes B. Reitsma, Gouke J. Bonsel

Research output: Contribution to journal › Article › Academic › peer-review

16 Citations (Scopus)

Abstract

Objectives: This study examined how ignoring dependency among linkage variables (the naive approach), rather than incorporating it (the nonnaive approach), affects the outcome of a probabilistic record linkage study.

Design and Measurements: We used the outcomes of a previously developed probabilistic linkage procedure for different registries in perinatal care, which assumed independence among linkage variables. We estimated the impact of ignoring dependency by re-estimating the linkage weights after constructing a variable that combines the outcomes of the comparison of 2 correlated linking variables. The results of the original naive strategy and the new nonnaive strategy were systematically compared for 3 scenarios: the empirical dataset using 9 variables, the empirical dataset using 5 variables, and a simulated dataset using 5 variables.

Results: The linking weight for agreement on 2 correlated variables among nonmatches was estimated to be considerably higher in the naive strategy than in the nonnaive strategy (16.87 vs. 13.55). Ignoring dependency therefore overestimates the amount of identifying information when both correlated variables agree. The number of pairs classified differently by the two approaches was modest when many different linking variables were available but grew substantially with fewer variables. The simulation study confirmed the results of the empirical study and suggests that ignoring dependency can substantially increase the number of misclassifications under less favorable linking conditions.

Conclusion: Dependency often exists between linking variables and has the potential to bias the outcome of a linkage study. The nonnaive approach is a straightforward method for creating linking weights that accommodate dependency. The impact on the number of misclassifications depends on the quality and number of linking variables relative to the number of correlated linking variables.
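The contrast between the two strategies can be illustrated with a minimal sketch. It assumes the usual log2 likelihood-ratio form of a linkage weight, log2(m/u), where m and u are the probabilities of a comparison outcome among matches and nonmatches. The joint probabilities below are hypothetical values chosen only to show positive dependence between two linking variables; they are not taken from the study, and the function and variable names are illustrative rather than the authors' implementation.

```python
from math import log2


def weight(m: float, u: float) -> float:
    """Log2 likelihood ratio for one comparison outcome (assumed weight form)."""
    return log2(m / u)


# Hypothetical joint outcome probabilities for two correlated linking
# variables A and B. Keys are (A agrees, B agrees). Not from the study.
m_joint = {  # among true matches
    (True, True): 0.90, (True, False): 0.04,
    (False, True): 0.04, (False, False): 0.02,
}
u_joint = {  # among nonmatches; joint agreement exceeds the product of marginals
    (True, True): 0.004, (True, False): 0.016,
    (False, True): 0.006, (False, False): 0.974,
}


def marginal(joint: dict, index: int) -> float:
    """Probability that the variable at `index` agrees, ignoring the other one."""
    return sum(p for outcome, p in joint.items() if outcome[index])


# Naive strategy: treat A and B as independent and add their separate weights.
naive_weight = (
    weight(marginal(m_joint, 0), marginal(u_joint, 0))
    + weight(marginal(m_joint, 1), marginal(u_joint, 1))
)

# Nonnaive strategy: one combined variable whose outcome is the joint
# agreement pattern, weighted from the joint probabilities directly.
nonnaive_weight = weight(m_joint[(True, True)], u_joint[(True, True)])

print(f"naive weight for joint agreement:    {naive_weight:.2f}")
print(f"nonnaive weight for joint agreement: {nonnaive_weight:.2f}")
```

In this toy setting the naive sum exceeds the combined-variable weight because joint agreement among nonmatches is more common than independence would predict, which mirrors the direction of the difference reported in the Results.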
Original language: English
Pages (from-to): 654-660
Journal: Journal of the American Medical Informatics Association
Volume: 15
Issue number: 5
Publication status: Published - 2008
