Investigating data-driven biological subtypes of psychiatric disorders using specification-curve analysis

Lian Beijers; Hanna M. van Loo; Jan-Willem Romeijn; Femke Lamers; Robert A. Schoevers; Klaas J. Wardenaar

doi:https://doi.org/10.1017/S0033291720002846

Research output: Contribution to journal › Article › Academic › peer-review

1 Citation (Scopus)

Abstract

Background: Cluster analyses have become popular tools for data-driven classification in biological psychiatric research. However, these analyses are known to be sensitive to the chosen methods and/or modelling options, which may hamper generalizability and replicability of findings. To gain more insight into this problem, we used Specification-Curve Analysis (SCA) to investigate the influence of methodological variation on biomarker-based cluster-analysis results.

Methods: Proteomics data (31 biomarkers) were used from patients (n = 688) and healthy controls (n = 426) in the Netherlands Study of Depression and Anxiety. In SCAs, consistency of results was evaluated across 1200 k-means and hierarchical clustering analyses, each with a unique combination of the clustering algorithm, fit-index, and distance metric. Next, SCAs were run in simulated datasets with varying cluster numbers and noise/outlier levels to evaluate the effect of data properties on SCA outcomes.

Results: The real data SCA showed no robust patterns of biological clustering in either the MDD or a combined MDD/healthy dataset. The simulation results showed that the correct number of clusters could be identified quite consistently across the 1200 model specifications, but that correct cluster identification became harder when the number of clusters and noise levels increased.

Conclusion: SCA can provide useful insights into the presence of clusters in biomarker data. However, SCA is likely to show inconsistent results in real-world biomarker datasets that are complex and contain considerable levels of noise. Here, the number and nature of the observed clusters may depend strongly on the chosen model-specification, precluding conclusions about the existence of biological clusters among psychiatric patients.

Original language	English
Pages (from-to)	1089-1100
Number of pages	12
Journal	Psychological Medicine
Volume	52
Early online date	2020
DOIs	https://doi.org/10.1017/S0033291720002846
Publication status	Published - 2022

Keywords

biochemistry
cluster analysis
complexity
heterogeneity
psychiatry
specification-curve analysis
subtyping

Access to Document

https://doi.org/10.1017/S0033291720002846

Cite this

@article{27c40df23943459baed71c281e862ccb,

title = "Investigating data-driven biological subtypes of psychiatric disorders using specification-curve analysis",

abstract = "Background: Cluster analyses have become popular tools for data-driven classification in biological psychiatric research. However, these analyses are known to be sensitive to the chosen methods and/or modelling options, which may hamper generalizability and replicability of findings. To gain more insight into this problem, we used Specification-Curve Analysis (SCA) to investigate the influence of methodological variation on biomarker-based cluster-analysis results. Methods: Proteomics data (31 biomarkers) were used from patients (n = 688) and healthy controls (n = 426) in the Netherlands Study of Depression and Anxiety. In SCAs, consistency of results was evaluated across 1200 k-means and hierarchical clustering analyses, each with a unique combination of the clustering algorithm, fit-index, and distance metric. Next, SCAs were run in simulated datasets with varying cluster numbers and noise/outlier levels to evaluate the effect of data properties on SCA outcomes. Results: The real data SCA showed no robust patterns of biological clustering in either the MDD or a combined MDD/healthy dataset. The simulation results showed that the correct number of clusters could be identified quite consistently across the 1200 model specifications, but that correct cluster identification became harder when the number of clusters and noise levels increased. Conclusion: SCA can provide useful insights into the presence of clusters in biomarker data. However, SCA is likely to show inconsistent results in real-world biomarker datasets that are complex and contain considerable levels of noise. Here, the number and nature of the observed clusters may depend strongly on the chosen model-specification, precluding conclusions about the existence of biological clusters among psychiatric patients.",

keywords = "biochemistry, cluster analysis, complexity, heterogeneity, psychiatry, specification-curve analysis, subtyping",

author = "Lian Beijers and {van Loo}, {Hanna M.} and Jan-Willem Romeijn and Femke Lamers and Schoevers, {Robert A.} and Wardenaar, {Klaas J.}",

year = "2022",

doi = "10.1017/S0033291720002846",

language = "English",

volume = "52",

pages = "1089--1100",

journal = "Psychological Medicine",

issn = "0033-2917",

publisher = "Cambridge University Press",

}

TY - JOUR

T1 - Investigating data-driven biological subtypes of psychiatric disorders using specification-curve analysis

AU - Beijers, Lian

AU - van Loo, Hanna M.

AU - Romeijn, Jan-Willem

AU - Lamers, Femke

AU - Schoevers, Robert A.

AU - Wardenaar, Klaas J.

PY - 2022

Y1 - 2022

AB - Background: Cluster analyses have become popular tools for data-driven classification in biological psychiatric research. However, these analyses are known to be sensitive to the chosen methods and/or modelling options, which may hamper generalizability and replicability of findings. To gain more insight into this problem, we used Specification-Curve Analysis (SCA) to investigate the influence of methodological variation on biomarker-based cluster-analysis results. Methods: Proteomics data (31 biomarkers) were used from patients (n = 688) and healthy controls (n = 426) in the Netherlands Study of Depression and Anxiety. In SCAs, consistency of results was evaluated across 1200 k-means and hierarchical clustering analyses, each with a unique combination of the clustering algorithm, fit-index, and distance metric. Next, SCAs were run in simulated datasets with varying cluster numbers and noise/outlier levels to evaluate the effect of data properties on SCA outcomes. Results: The real data SCA showed no robust patterns of biological clustering in either the MDD or a combined MDD/healthy dataset. The simulation results showed that the correct number of clusters could be identified quite consistently across the 1200 model specifications, but that correct cluster identification became harder when the number of clusters and noise levels increased. Conclusion: SCA can provide useful insights into the presence of clusters in biomarker data. However, SCA is likely to show inconsistent results in real-world biomarker datasets that are complex and contain considerable levels of noise. Here, the number and nature of the observed clusters may depend strongly on the chosen model-specification, precluding conclusions about the existence of biological clusters among psychiatric patients.

KW - biochemistry

KW - cluster analysis

KW - complexity

KW - heterogeneity

KW - psychiatry

KW - specification-curve analysis

KW - subtyping

UR - http://www.scopus.com/inward/record.url?scp=85090307991&partnerID=8YFLogxK

U2 - https://doi.org/10.1017/S0033291720002846

DO - 10.1017/S0033291720002846

M3 - Article

C2 - 32779563

SN - 0033-2917

VL - 52

SP - 1089

EP - 1100

JO - Psychological Medicine

JF - Psychological Medicine

ER -
