Regularized linear and gradient boosted ensemble methods to predict athletes' gender based on a survey of masters athletes
Walsh, J, Heazlewood, TI & Climstein, M 2019, 'Regularized linear and gradient boosted ensemble methods to predict athletes' gender based on a survey of masters athletes', Model Assisted Statistics and Applications, vol. 14, no. 1, pp. 47-64.
Published version available from
The aim of this research was to investigate statistical learning techniques for predicting gender from psychological constructs measuring motivations to participate in masters sports. Motivations of Marathoners Scales (MOMS) sports psychological data for 3,928 masters athletes (2,010 males) from the World Masters Games (the largest sporting event in the world by participant numbers) were investigated using the regularized linear modelling methodologies ridge regression, the lasso and the elastic net. Comparison was made with previously published research utilizing logistic regression, discriminant function analysis, radial basis functions, multilayer perceptrons and a selection of boosted decision tree based models. It was hypothesized that the regularized linear models would perform better than other models except the boosted decision trees, and that ensembles of the regularized linear models and gradient boosted machines would result in improved accuracy over any prior models in the literature. Implementing modern regression methods with regularization improved the accuracy of gender classification compared to non-regularized linear models, but not compared to boosted trees such as a gradient boosted machine (GBM). Models that were solely or partially based on L2 regularization (including a penalty term to reduce the sum of squares of the parameters) performed better than those that relied solely or primarily on L1 regularization (including a penalty term to reduce the sum of the absolute values of the parameters) or subset selection. This finding had implications for analysis of MOMS data in general with respect to using the 56 questions in MOMS as opposed to the underlying nine constructs for analysis in order to compensate for multicollinearity. Ensemble methods stacking ridge regression and GBMs with out-of-sample prediction further improved accuracy, giving higher accuracy scores (0.7236) than obtained in any preceding literature.
This demonstrates the potential benefits from such an ensemble approach in terms of developing models with improved accuracy, as well as increasing the likelihood of developing practical applications from predictions using MOMS psychometric data.
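The L1 and L2 penalties described in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not use MOMS data. It assumes a glmnet-style elastic net penalty, lam * (alpha * sum|w| + (1 - alpha) / 2 * sum w^2), fitted to a logistic model by proximal gradient descent on synthetic data. All function names, parameter values and the data generator are hypothetical, and the stacked ensemble with GBMs is not reproduced here.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(v, t):
    # Proximal operator of t * |v|; this step sets lasso coefficients to exactly zero.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def elastic_net_penalty(w, lam, alpha):
    # Assumed parameterization: alpha=1 gives the lasso (L1), alpha=0 gives ridge (L2).
    l1 = sum(abs(c) for c in w)
    l2 = sum(c * c for c in w)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


def fit_logistic(X, y, lam, alpha, lr=0.5, n_iter=500):
    # Proximal gradient descent on mean log-loss plus the elastic net penalty.
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        err = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
               for row, yi in zip(X, y)]
        b -= lr * sum(err) / n  # the intercept is not penalized
        for j in range(p):
            grad = sum(e * row[j] for e, row in zip(err, X)) / n
            grad += lam * (1.0 - alpha) * w[j]  # gradient of the smooth L2 part
            w[j] = soft_threshold(w[j] - lr * grad, lr * lam * alpha)  # L1 part
    return w, b


def accuracy(X, y, w, b):
    # Share of rows whose predicted class (probability >= 0.5) matches the label.
    hits = sum((sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) >= 0.5) == (yi == 1)
               for row, yi in zip(X, y))
    return hits / len(y)


def make_data(n=400, seed=0):
    # Synthetic stand-in data: x2 nearly duplicates x1 (multicollinearity), x3 is noise.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x1 = rng.uniform(-1, 1)
        x2 = x1 + rng.uniform(-0.05, 0.05)
        x3 = rng.uniform(-1, 1)
        X.append([x1, x2, x3])
        y.append(1 if x1 + rng.gauss(0, 0.3) > 0 else 0)
    return X, y


X, y = make_data()
ridge_w, ridge_b = fit_logistic(X, y, lam=0.01, alpha=0.0)
lasso_w, lasso_b = fit_logistic(X, y, lam=0.01, alpha=1.0)
enet_w, enet_b = fit_logistic(X, y, lam=0.01, alpha=0.5)

for name, w, b in [("ridge", ridge_w, ridge_b),
                   ("lasso", lasso_w, lasso_b),
                   ("elastic net", enet_w, enet_b)]:
    print(name, [round(c, 3) for c in w], round(accuracy(X, y, w, b), 3))
```

In this toy setting the ridge fit spreads weight across the two near-duplicate predictors, whereas a sufficiently large L1 penalty drives every coefficient to exactly zero. The accuracies printed are in-sample values on synthetic data and say nothing about the results reported for MOMS.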