TY - JOUR
T1 - SuRF: A new method for sparse variable selection, with application in microbiome data analysis
AU - Liu, Lihui
AU - Gu, Hong
AU - van Limbergen, Johan
AU - Kenney, Toby
N1 - Funding Information: Natural Sciences and Engineering Research Council of Canada, RGPIN/4945-2014, RGPIN-2017-05108. Publisher Copyright: © 2020 John Wiley & Sons Ltd
PY - 2021/2/20
Y1 - 2021/2/20
AB - In this article, we present a new variable selection method for regression and classification purposes, particularly for microbiome analysis. Our method, called subsampling ranking forward selection (SuRF), is based on LASSO penalized regression, subsampling and forward-selection methods. SuRF offers major advantages over existing variable selection methods in terms of both sparsity of selected models and model inference. We provide an R package that can implement our method for generalized linear models. We apply our method to classification problems from microbiome data, using a novel agglomeration approach to deal with the special tree-like correlation structure of the variables. Existing methods arbitrarily choose a taxonomic level a priori before performing the analysis, whereas by combining SuRF with these aggregated variables, we are able to identify the key biomarkers at the appropriate taxonomic level, as suggested by the data. We present simulations in multiple sparse settings to demonstrate that our approach performs better than several other popularly used existing approaches in recovering the true variables. We apply SuRF to two microbiome datasets: one about prediction of pouchitis and another for identifying samples from two healthy individuals. We find that SuRF can provide a better or comparable prediction with other methods while controlling the false positive rate of variable selection.
KW - LASSO
KW - SuRF
KW - forward selection
KW - generalized linear models
KW - identifying biomarkers
KW - microbiome
KW - stability selection
KW - variable selection
UR - http://www.scopus.com/inward/record.url?scp=85096720285&partnerID=8YFLogxK
U2 - https://doi.org/10.1002/sim.8809
DO - https://doi.org/10.1002/sim.8809
M3 - Article
C2 - 33219557
SN - 0277-6715
VL - 40
SP - 897
EP - 919
JO - Statistics in Medicine
JF - Statistics in Medicine
IS - 4
ER -