TY - JOUR
T1 - Training sample selection: Impact on screening automation in diagnostic test accuracy reviews
AU - van Altena, Allard J.
AU - Spijker, René
AU - Leeflang, Mariska M. G.
AU - Olabarriaga, Sílvia Delgado
N1 - Funding Information: This work was carried out on the High Performance Computing Cloud resources of the Dutch national e-infrastructure with the support of the SURF Foundation. We would like to thank A. H. Zwinderman and P. D. Moerland for their support in designing the methodology, and B. D. Yang and M. Borgers for proofreading the manuscript. Publisher Copyright: © 2021 The Authors. Research Synthesis Methods published by John Wiley & Sons Ltd.
PY - 2021/11
Y1 - 2021/11
AB - When performing a systematic review, researchers screen the articles retrieved after a broad search strategy one by one, which is time-consuming. Computerised support of this screening process has been applied with varying success. This is partly due to the dependency on large amounts of data to develop models that predict inclusion. In this paper, we present an approach to choose which data to use in model training and compare it with established approaches. We used a dataset of 50 Cochrane diagnostic test accuracy reviews, and each was used as a target review. From the remaining 49 reviews, we selected those that most closely resembled the target review's clinical topic using the cosine similarity metric. Included and excluded studies from these selected reviews were then used to develop our prediction models. The performance of models trained on the selected reviews was compared against models trained on studies from all available reviews. The prediction models performed best with a larger number of reviews in the training set and on target reviews that had a research subject similar to other reviews in the dataset. Our approach using cosine similarity may reduce computational costs for model training and the duration of the screening process.
KW - computerised support
KW - cosine similarity
KW - machine learning
KW - screening automation
KW - training sample selection
UR - http://www.scopus.com/inward/record.url?scp=85113381765&partnerID=8YFLogxK
U2 - https://doi.org/10.1002/jrsm.1518
DO - https://doi.org/10.1002/jrsm.1518
M3 - Article
C2 - 34390193
SN - 1759-2887
VL - 12
SP - 831
EP - 841
JO - Research Synthesis Methods
JF - Research Synthesis Methods
IS - 6
ER -