Overfitting and Parsimony
Overfitting a regression model means stuffing it with so many variables that many of them carry little weight in predicting the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014). To avoid overfitting, parsimony is important in big data analytics. Parsimony means describing a dependent variable with as few independent variables as possible (Field, 2013; Huck, 2013; Smith, 2015). One way to think of this is to apply the “Keep It Simple, Sweetie” concept to the regression model. The concept of parsimony can be traced to Occam’s Razor, which states that “plurality ought never be posited without necessity” (Duignan, 2015). Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models.
Overfitting in the General Linear Model (GLM)
For multivariate regressions, a correlation matrix can be computed on all the variables to help identify a parsimonious model, such that the software tries to maximize the correlation with the dependent variable while minimizing the number of variables (Field, 2013). Smith (2015) stated that the proportion of variation explained in the dependent variable should remain high, while the correlation between the separate independent variables should be as low as possible. If the correlation coefficient between two independent variables is high (0.8 or higher), then there is a chance that one of them is extraneous (Smith, 2015). Another technique to achieve parsimony is the backward stepwise method, which runs a regression model with all variables and then removes those that do not contribute significantly to the model. Alternatively, in the forward stepwise method, the model starts with one variable and adds variables until it has maximized the correlation and the variance explained (Field, 2013; Huck, 2013).
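The two screening ideas above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration on simulated data, not the procedure any particular statistics package uses. The function names (`flag_collinear_pairs`, `backward_eliminate`), the variable names, and the |t| >= 2 cutoff standing in for "contributes significantly" are all assumptions made for this example; only the 0.8 correlation threshold comes from the text.

```python
import numpy as np

def flag_collinear_pairs(X, names, threshold=0.8):
    """List predictor pairs whose absolute correlation meets the threshold."""
    corr = np.corrcoef(X, rowvar=False)  # each column of X is one variable
    return [(names[i], names[j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if abs(corr[i, j]) >= threshold]

def t_statistics(X, y):
    """OLS t-statistics for each predictor column (intercept excluded)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return (beta / np.sqrt(np.diag(cov)))[1:]

def backward_eliminate(X, y, names, t_min=2.0):
    """Drop the weakest predictor until every remaining |t| >= t_min.

    t_min=2.0 is an assumed rough cutoff, not a value from the sources.
    """
    keep = list(range(len(names)))
    while keep:
        t = np.abs(t_statistics(X[:, keep], y))
        weakest = int(np.argmin(t))
        if t[weakest] >= t_min:
            break
        keep.pop(weakest)
    return [names[i] for i in keep]

# Simulated data: x1 and x2 drive y, dup nearly copies x1, noise is unrelated.
rng = np.random.default_rng(42)
n = 300
x1, x2, noise = rng.normal(size=(3, n))
dup = x1 + 0.1 * rng.normal(size=n)
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Step 1: screen the predictors for pairs correlated at 0.8 or higher.
flagged = flag_collinear_pairs(np.column_stack([x1, x2, dup, noise]),
                               ["x1", "x2", "dup", "noise"])
print("Highly correlated pairs:", flagged)

# Step 2: the analyst drops dup, then runs backward elimination on the rest.
reduced_names = ["x1", "x2", "noise"]
kept = backward_eliminate(np.column_stack([x1, x2, noise]), y, reduced_names)
print("Predictors kept:", kept)
```

Deciding which member of a flagged pair to drop is left to the analyst here on purpose, since the correlation matrix alone cannot say which variable is the meaningful one.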
Unfortunately, overfitting can still occur when using a backward stepwise method, forward stepwise method, or correlation matrix in multivariate linear models. That is because computers remove, add, or consider variables systematically and mathematically, not based on human knowledge (Field, 2013; Huck, 2013). Thus, it is still important to have a human evaluate the computational output for logic, consistency, and reliability. However, while the focus is on reducing overfitting, underfitting should also be avoided. Underfitting a regression model happens when the model leaves out key independent variables that can help predict the dependent variable (Field, 2013).
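One reason purely mechanical selection can overfit is that the ordinary R² of a least-squares model never goes down when a variable is added, even if that variable is pure noise. The sketch below illustrates this on simulated data and contrasts it with adjusted R², which penalizes the number of predictors. The data, the choice of 15 junk predictors, and the helper names are assumptions made for the illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Simulated data: only x truly predicts y; the junk columns are random noise.
rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
junk = rng.normal(size=(n, 15))

r2_small = r_squared(x[:, None], y)
r2_big = r_squared(np.column_stack([x, junk]), y)

print(f"1 predictor:   R2={r2_small:.3f}, "
      f"adjusted={adjusted_r_squared(r2_small, n, 1):.3f}")
print(f"16 predictors: R2={r2_big:.3f}, "
      f"adjusted={adjusted_r_squared(r2_big, n, 16):.3f}")
```

The larger model always matches or beats the smaller one on raw R², which is why a human check on whether the added variables make sense remains necessary.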
Hierarchical regression methods
When researchers build a multivariate regression model in stages, they tend to add known independent variables first and newer independent variables afterwards to avoid overfitting, a technique called hierarchical regression (Austin, Goel, & van Walraven, 2001; Field, 2013; Huck, 2013). The new and unknown independent variables can be entered through a stepwise algorithm as described above, or another step can be created in which suspected new variables that may contribute strongly to the prediction of the dependent variable are added next (Field, 2013). Hierarchical regression methods allow the researcher to analyze the differing hierarchical levels by examining not only the correlations between the levels but also the intercepts and slopes, helping drive valid statistical inferences (Austin et al., 2001).
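As a rough illustration of entering variables in stages, the sketch below fits a first step with two "known" predictors, then a second step that adds one candidate predictor, and reports the change in R² together with the standard F statistic for that change. The simulated data, the variable names, and the helper functions are assumptions for this example. Turning the F statistic into a p-value would need an F distribution (for instance from SciPy), which is left out here.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def f_change(r2_old, r2_new, n, p_new, k_added):
    """F statistic for the R-squared gain from adding k_added predictors.

    p_new is the total number of predictors in the larger model.
    """
    return ((r2_new - r2_old) / k_added) / ((1 - r2_new) / (n - p_new - 1))

# Simulated data: two established predictors and one new candidate.
rng = np.random.default_rng(7)
n = 120
known = rng.normal(size=(n, 2))
candidate = rng.normal(size=n)
y = (1.5 * known[:, 0] - 1.0 * known[:, 1]
     + 0.8 * candidate + rng.normal(size=n))

# Step 1: known predictors only. Step 2: known predictors plus the candidate.
r2_step1 = r_squared(known, y)
r2_step2 = r_squared(np.column_stack([known, candidate]), y)
delta_r2 = r2_step2 - r2_step1
f_stat = f_change(r2_step1, r2_step2, n, p_new=3, k_added=1)

print(f"Step 1 R2={r2_step1:.3f}, Step 2 R2={r2_step2:.3f}")
print(f"R2 change={delta_r2:.3f}, F change={f_stat:.1f}")
```

A small R² change with a low F statistic would suggest the new variable is not earning its place in the model.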
Vandekerckhove et al. (2014) listed three methods for model selection, each of which balances goodness of fit against parsimony:
- Akaike’s Information Criterion (AIC) considers how much the observed data influence the belief in one model over another, but is unreliable with huge amounts of data
- Bayesian Information Criterion (BIC) considers how much the observed data influence the belief in one model over another and can handle huge amounts of data, but is known to underfit
- Minimum Description Length (MDL) considers how much a model can compress the observed data by identifying regularities within the data values
Vandekerckhove et al. (2014) also stated that the model with the lowest AIC and/or BIC score would be the best to choose.
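To make the "lowest score wins" rule concrete, here is a minimal sketch that computes AIC and BIC for two least-squares models, assuming normally distributed errors. It uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters, n is the sample size, and L is the maximized likelihood. The simulated data, the model labels, and the function names are assumptions for the example. MDL is not shown.

```python
import numpy as np

def fit_gaussian_ols(X, y):
    """Return (maximized log-likelihood, parameter count) for an OLS fit."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    n = len(y)
    sigma2 = resid @ resid / n  # maximum-likelihood variance estimate
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = A.shape[1] + 1  # coefficients (with intercept) plus the variance
    return loglik, k

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    return k * np.log(n) - 2 * loglik

# Simulated data: only x truly predicts y; the junk columns are random noise.
rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
junk = rng.normal(size=(n, 5))
y = 2.0 * x + rng.normal(size=n)

candidates = {
    "x only": x[:, None],
    "x + 5 junk": np.column_stack([x, junk]),
}
scores = {}
for label, X in candidates.items():
    loglik, k = fit_gaussian_ols(X, y)
    scores[label] = (aic(loglik, k), bic(loglik, k, n))
    print(f"{label}: AIC={scores[label][0]:.1f}, BIC={scores[label][1]:.1f}")
```

Both scores reward fit through the log-likelihood and charge for each extra parameter. BIC's charge grows with the sample size, which is consistent with its tendency to favor smaller models.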
In conclusion, under parsimony, if adding another variable does not improve the regression model, then it should not be added, to avoid overfitting (Field, 2013). Multivariate linear models have issues with overfitting because computers conduct their analysis systematically and mathematically and lack the human knowledge needed to keep removing unneeded variables from the equation. Hierarchical regression, together with model selection criteria such as AIC and BIC, can help minimize overfitting by weighing goodness of fit against parsimony (Vandekerckhove et al., 2014).
References
- Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health, 92(2), 150.
- Duignan, B. (2015). Occam’s razor. Encyclopaedia Britannica. Retrieved from https://www.britannica.com/topic/Occams-razor
- Field, A. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). [VitalSource Bookshelf Online].
- Huck, S. W. (2013). Reading statistics and research (6th ed.). Pearson Learning Solutions. [VitalSource Bookshelf Online].
- Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
- Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.