Pooling individual participant data from randomized controlled trials: Exploring potential loss of information

Lennard L. van Wanrooij; Marieke P. Hoevenaar-Blom; Nicola Coley; Tiia Ngandu; Yannick Meiller; Juliette Guillemont; Anna Rosenberg; Cathrien R. L. Beishuizen; Eric P. Moll van Charante; Hilkka Soininen; Carol Brayne; Sandrine Andrieu; Miia Kivipelto; Edo Richard

doi:https://doi.org/10.1371/journal.pone.0232970

Research output: Contribution to journal › Article › Academic › peer-review

2 Citations (Scopus)

Abstract

Background: Pooling individual participant data to enable pooled analyses is often complicated by diversity in variables across available datasets. Therefore, recoding original variables is often necessary to build a pooled dataset. We aimed to quantify how much information is lost in this process and to what extent this jeopardizes the validity of analysis results.

Methods: Data were derived from a platform that was developed to pool data from three randomized controlled trials on the effect of treatment of cardiovascular risk factors on cognitive decline or dementia. We quantified loss of information using the R-squared of linear regression models with pooled variables as a function of their original variable(s). When the R-squared was below 0.8, we additionally explored the potential impact of loss of information for future analyses. For this second step, we checked whether the beta coefficient of the predictor differed by more than 10% depending on whether the original or the recoded variable was added as a confounder in a linear regression model. In a simulation we randomly sampled numbers, recoded those ≤ 1000 to 0 and those > 1000 to 1, and varied the range of the continuous variable, the ratio of recoded zeroes to recoded ones, or both, and again extracted the R-squared from linear models to quantify information loss.

Results: The R-squared was below 0.8 for 8 out of 91 recoded variables. In 4 cases this had a substantial impact on the regression models, particularly when a continuous variable was recoded into a discrete variable. Our simulation showed that the least information is lost when the ratio of recoded zeroes to ones is 1:1.

Conclusions: Large, pooled datasets provide great opportunities, justifying the efforts for data harmonization. Still, caution is warranted when using recoded variables whose variance is only partly explained by their original variables, as this may jeopardize the validity of study results.
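The simulation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the uniform sampling on each side of the cutoff, the default range of 0 to 2000, the sample size, and the function names `r_squared_after_dichotomizing` and `simulate` are all assumptions made for illustration. Only the cutoff of 1000, the 0/1 recoding, and the use of R-squared as the measure of retained information come from the abstract.

```python
import numpy as np


def r_squared_after_dichotomizing(x, cutoff=1000.0):
    """R-squared of a linear regression of the recoded 0/1 variable on the original x."""
    recoded = (x > cutoff).astype(float)  # <= cutoff -> 0, > cutoff -> 1
    if recoded.min() == recoded.max():
        return 0.0  # constant recoded variable: no variance to explain
    # With a single predictor, R-squared equals the squared Pearson correlation.
    r = np.corrcoef(x, recoded)[0, 1]
    return r ** 2


def simulate(share_of_ones, n=100_000, low=0.0, high=2000.0, cutoff=1000.0, seed=0):
    """Sample values so that about `share_of_ones` of them exceed the cutoff.

    Assumption (not from the source): values are uniform on [low, cutoff)
    below the cutoff and uniform on [cutoff, high) above it.
    """
    rng = np.random.default_rng(seed)
    n_ones = int(round(n * share_of_ones))
    below = rng.uniform(low, cutoff, size=n - n_ones)
    above = rng.uniform(cutoff, high, size=n_ones)
    x = np.concatenate([below, above])
    return r_squared_after_dichotomizing(x, cutoff)


if __name__ == "__main__":
    # Vary the ratio of recoded zeroes to ones and compare retained information.
    for share in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"share of ones = {share:.1f}  R-squared = {simulate(share):.3f}")
```

Under these assumptions the R-squared peaks at about 0.75 for a 1:1 split and falls as the split becomes more unbalanced, which is consistent with the direction of the reported simulation result. Changing `low` and `high` varies the range of the continuous variable, the other factor the abstract mentions.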

Original language	English
Article number	e0232970
Journal	PLOS ONE
Volume	15
Issue number	5
DOIs	https://doi.org/10.1371/journal.pone.0232970
Publication status	Published - 1 May 2020

Cite this

van Wanrooij, L. L., Hoevenaar-Blom, M. P., Coley, N., Ngandu, T., Meiller, Y., Guillemont, J., Rosenberg, A., Beishuizen, C. R. L., Moll van Charante, E. P., Soininen, H., Brayne, C., Andrieu, S., Kivipelto, M., & Richard, E. (2020). Pooling individual participant data from randomized controlled trials: Exploring potential loss of information. PLOS ONE, 15(5), Article e0232970. https://doi.org/10.1371/journal.pone.0232970

@article{35606851c31b4076b8d4965312c2f47b,

title = "Pooling individual participant data from randomized controlled trials: Exploring potential loss of information",

abstract = "Background Pooling individual participant data to enable pooled analyses is often complicated by diversity in variables across available datasets. Therefore, recoding original variables is often necessary to build a pooled dataset. We aimed to quantify how much information is lost in this process and to what extent this jeopardizes validity of analyses results. Methods Data were derived from a platform that was developed to pool data from three randomized controlled trials on the effect of treatment of cardiovascular risk factors on cognitive decline or dementia. We quantified loss of information using the R-squared of linear regression models with pooled variables as a function of their original variable(s). In case the R-squared was below 0.8, we additionally explored the potential impact of loss of information for future analyses. We did this second step by comparing whether the Beta coefficient of the predictor differed more than 10% when adding original or recoded variables as a confounder in a linear regression model. In a simulation we randomly sampled numbers, recoded those <= 1000 to 0 and those > 1000 to 1 and varied the range of the continuous variable, the ratio of recoded zeroes to recoded ones, or both, and again extracted the R-squared from linear models to quantify information loss. Results The R-squared was below 0.8 for 8 out of 91 recoded variables. In 4 cases this had a substantial impact on the regression models, particularly when a continuous variable was recoded into a discrete variable. Our simulation showed that the least information is lost when the ratio of recoded zeroes to ones is 1:1. Conclusions Large, pooled datasets provide great opportunities, justifying the efforts for data harmonization. Still, caution is warranted when using recoded variables which variance is explained limitedly by their original variables as this may jeopardize the validity of study results.",

author = "{van Wanrooij}, {Lennard L.} and Hoevenaar-Blom, {Marieke P.} and Nicola Coley and Tiia Ngandu and Yannick Meiller and Juliette Guillemont and Anna Rosenberg and Beishuizen, {Cathrien R. L.} and {Moll van Charante}, {Eric P.} and Hilkka Soininen and Carol Brayne and Sandrine Andrieu and Miia Kivipelto and Edo Richard",

year = "2020",

month = may,

day = "1",

doi = "10.1371/journal.pone.0232970",

language = "English",

volume = "15",

journal = "PLOS ONE",

issn = "1932-6203",

publisher = "Public Library of Science",

number = "5",

}

van Wanrooij, LL, Hoevenaar-Blom, MP, Coley, N, Ngandu, T, Meiller, Y, Guillemont, J, Rosenberg, A, Beishuizen, CRL, Moll van Charante, EP, Soininen, H, Brayne, C, Andrieu, S, Kivipelto, M & Richard, E 2020, 'Pooling individual participant data from randomized controlled trials: Exploring potential loss of information', PLOS ONE, vol. 15, no. 5, e0232970. https://doi.org/10.1371/journal.pone.0232970

TY - JOUR

T1 - Pooling individual participant data from randomized controlled trials: Exploring potential loss of information

AU - van Wanrooij, Lennard L.

AU - Hoevenaar-Blom, Marieke P.

AU - Coley, Nicola

AU - Ngandu, Tiia

AU - Meiller, Yannick

AU - Guillemont, Juliette

AU - Rosenberg, Anna

AU - Beishuizen, Cathrien R. L.

AU - Moll van Charante, Eric P.

AU - Soininen, Hilkka

AU - Brayne, Carol

AU - Andrieu, Sandrine

AU - Kivipelto, Miia

AU - Richard, Edo

PY - 2020/5/1

Y1 - 2020/5/1

N2 - Background Pooling individual participant data to enable pooled analyses is often complicated by diversity in variables across available datasets. Therefore, recoding original variables is often necessary to build a pooled dataset. We aimed to quantify how much information is lost in this process and to what extent this jeopardizes validity of analyses results. Methods Data were derived from a platform that was developed to pool data from three randomized controlled trials on the effect of treatment of cardiovascular risk factors on cognitive decline or dementia. We quantified loss of information using the R-squared of linear regression models with pooled variables as a function of their original variable(s). In case the R-squared was below 0.8, we additionally explored the potential impact of loss of information for future analyses. We did this second step by comparing whether the Beta coefficient of the predictor differed more than 10% when adding original or recoded variables as a confounder in a linear regression model. In a simulation we randomly sampled numbers, recoded those ≤ 1000 to 0 and those > 1000 to 1 and varied the range of the continuous variable, the ratio of recoded zeroes to recoded ones, or both, and again extracted the R-squared from linear models to quantify information loss. Results The R-squared was below 0.8 for 8 out of 91 recoded variables. In 4 cases this had a substantial impact on the regression models, particularly when a continuous variable was recoded into a discrete variable. Our simulation showed that the least information is lost when the ratio of recoded zeroes to ones is 1:1. Conclusions Large, pooled datasets provide great opportunities, justifying the efforts for data harmonization. Still, caution is warranted when using recoded variables which variance is explained limitedly by their original variables as this may jeopardize the validity of study results.

UR - http://www.scopus.com/inward/record.url?scp=85084533400&partnerID=8YFLogxK

U2 - https://doi.org/10.1371/journal.pone.0232970

DO - https://doi.org/10.1371/journal.pone.0232970

M3 - Article

C2 - 32396543

SN - 1932-6203

VL - 15

JO - PLOS ONE

JF - PLOS ONE

IS - 5

M1 - e0232970

ER -
