Adv Quant: More on Logistic Regression

In assessing the predictive power of categorical predictors of a binary outcome, should logistic regression be used? In other words, how is the logistic function used to predict categorical outcomes?


Logistic regression is another flavor of multivariable regression, in which one or more continuous or categorical independent variables are used to predict a dichotomous (binary/categorical) dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).  Zheng and Agresti (2000) define predictive power as a measure that helps compare competing regressions by analyzing the importance of the independent variables.  For linear regression and multiple linear regression, the correlation coefficient and the coefficient of determination are adequate measures of predictive power (Field, 2013; Zheng & Agresti, 2000). Collecting more data can yield stronger predictive power (Field, 2013).  Predictive power is also what helps sell the relationships between variables to management (Ahlemeyer-Stubbe & Coleman, 2014).

For logistic regression, the predictive power of the independent variables can be evaluated through the odds ratio for each independent variable (Huck, 2011). Field (2013) and Schumacker (2014) explained that when a logistic regression is calculated, the categorical/binary dependent variable is transformed into the natural log of the odds, ln(odds), and the regression is then performed on this newly scaled variable (the scaling seen in equation 1):

ln(odds) = ln[P(Y) / (1 − P(Y))]                                                 (1)

Since the probability of one categorical outcome varies between 0 and 0.999…, the odds value can vary between 0 and 999.999… (Schumacker, 2014). If the odds ratio is greater than 1, then as the independent variable value increases, the odds of the dependent variable occurring (Y = n) also increase, and vice versa (Field, 2013). Thus, the odds ratio measures the constant strength of association between the independent and dependent variables (Huck, 2011; Smith, 2015).  It is because of this ln(odds) transformation that logistic regression should be used for binary outcomes.
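As a minimal sketch of this interpretation, exponentiating a fitted log-odds coefficient yields the odds ratio for that predictor; the coefficient value below is invented purely for illustration, not taken from the cited texts:

```python
import math

# Hypothetical fitted logistic-regression coefficient for one predictor
# (illustrative value only): the change in log-odds per one-unit
# increase in that predictor.
b = 0.405

# exp(b) converts the log-odds coefficient into an odds ratio.
odds_ratio = math.exp(b)

# odds_ratio > 1: the odds of Y occurring rise as the predictor rises;
# odds_ratio < 1: the odds fall; odds_ratio == 1: no association.
print(round(odds_ratio, 2))
```

Here each one-unit increase in the predictor multiplies the odds of the outcome by roughly 1.5, which is the "constant strength of association" described above.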

Field (2013) and Schumacker (2014) further explained that, given this ln(odds) transformation of the variables, categorical outcomes are predicted from the regression formula (2),

ln[P(Y) / (1 − P(Y))] = a + b1x1 + b2x2 + … + bnxn                                            (2)

which is best expressed as the probability of the categorical outcome one is trying to calculate:

P(Y) = 1 / (1 + e^−(a + b1x1 + b2x2 + … + bnxn))                                (3)

The probability equation (3) can be rearranged in multiple ways through typical algebraic manipulation.  Thus, the probability of the dependent variable Y is bounded between 0 and 100%, and the odds ratio is used to discuss the strength of these relationships.
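The probability calculation in equation (3) can be sketched directly in Python; the intercept and coefficient values below are hypothetical, chosen only to show the mechanics:

```python
import math

def predicted_probability(intercept, coefs, xs):
    """P(Y) = 1 / (1 + e^-(a + b1*x1 + ... + bn*xn)), per equation (3)."""
    z = intercept + sum(b * x for b, x in zip(coefs, xs))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted values for a single-predictor model.
p = predicted_probability(intercept=-1.0, coefs=[0.8], xs=[2.0])
print(round(p, 3))
```

Whatever the linear combination evaluates to, the logistic function maps it back into the 0-to-1 range, which is why the predicted probability is always a valid probability.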

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry (1st ed.). [VitalSource Bookshelf Online].
  • Field, A. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). [VitalSource Bookshelf Online].
  • Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational research: An introduction (8th ed.). [VitalSource Bookshelf Online].
  • Huck, S. W. (2011). Reading statistics and research (6th ed.). [VitalSource Bookshelf Online].
  • Schumacker, R. E. (2014). Learning statistics using R (1st ed.). [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Zheng, B., & Agresti, A. (2000). Summarizing the predictive power of a generalized linear model. Retrieved from http://www.stat.ufl.edu/~aa/articles/zheng_agresti.pdf

Quant: Regression and Correlations

Top management of a large company has told you that they really would like to be able to determine what the impact of years of service at their company has on workers’ productivity levels, and they would like to be able to predict potential productivity based upon years of service. The company has data on all of its employees and has been using a valid productivity measure that assesses each employee’s productivity. You have told management that there is a possible way to do that.

Through regression analysis, it should be possible to predict potential productivity based upon years of service, provided two conditions hold: (1) the productivity assessment tool is valid and reliable (Creswell, 2014), and (2) the sample is large enough to support statistical inference about the population from the collected data (Huck, 2011). Assuming these two conditions are met, a regression analysis of the data can produce a prediction formula. Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011). There are multiple types of regression, all of which are tests of prediction: linear, multiple, log-linear, quadratic, cubic, etc. (Huck, 2011; Schumacker, 2014).  Linear regression is the most well-known because it uses basic algebra, a straight line, and the Pearson correlation coefficient to state the regression's predictive strength (Huck, 2011; Schumacker, 2014).  The linear regression formula is y = a + bx + e, where y is the dependent variable (in this case the productivity measure), x is the independent variable (years of service), a (the intercept) and b (the regression weight) are constants to be estimated through the regression analysis, and e is the regression prediction error (Field, 2013; Schumacker, 2014).  The sum of the errors should equal zero (Schumacker, 2014).
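The least-squares estimates of a and b in y = a + bx + e can be sketched in a few lines of Python; the years-of-service and productivity values below are invented solely for illustration:

```python
# Minimal ordinary-least-squares fit of y = a + b*x, using invented
# (years of service, productivity score) pairs purely for illustration.
years = [1, 2, 3, 4, 5]
productivity = [52, 55, 61, 64, 68]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(productivity) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2);  a = y_bar - b*x_bar
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, productivity))
den = sum((x - mean_x) ** 2 for x in years)
b = num / den
a = mean_y - b * mean_x

# Residuals (prediction errors) sum to zero, as Schumacker notes.
residuals = [y - (a + b * x) for x, y in zip(years, productivity)]

def predict(x):
    """Predicted productivity for a given number of years of service."""
    return a + b * x

print(round(a, 2), round(b, 2))  # intercept and regression weight
```

With the fitted intercept and weight in hand, management's question is answered by plugging any years-of-service value into `predict`, subject to the validity and sample-size caveats above.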

Linear regression models describe the relationship between one dependent and one independent variable, each measured at the ratio or interval level (Schumacker, 2014).  However, other regression models can be tested to find the best fit for the data.  Although these are different regression tests, the goal of each model is to describe the current relationship between the dependent variable and the independent variable(s) and to support prediction.  Multiple regression is used when there are multiple independent variables (Huck, 2011; Schumacker, 2014). Log-linear regression uses categorical or continuous independent variables (Schumacker, 2014). Quadratic and cubic regressions use quadratic and cubic formulas to help predict trends that are quadratic or cubic in nature, respectively (Field, 2013).  When modeling potential productivity based upon years of service, the regression with the strongest correlation should be used, as that regression formula best explains the variance between the variables.   However, even when a regression formula predicts some or most of the variance between the variables, it never implies causation (Field, 2013).

Correlations help define the strength of the regression formula in describing the relationships between the variables, and can vary in value from −1 to +1.  The closer the correlation coefficient is to −1 or +1, the better the regression formula predicts the variance between the variables; the closer it is to zero, the weaker the relationship between the variables (Field, 2013; Huck, 2011; Schumacker, 2014).  A negative correlation could show that as years of service increase, measured productivity decreases, which could be caused by apathy or some other factor that has yet to be measured.  A positive correlation could show that as years of service increase, measured productivity also increases, which could likewise be influenced by factors not directly related to years of service.  Thus, correlation does not imply causation, but squaring the correlation value (r²) gives the percentage of the variance between the variables explained by the regression formula (Field, 2013).
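A short Python sketch of the Pearson correlation coefficient and r², using invented years-of-service and productivity data for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data for illustration: productivity rising with service years.
years = [1, 2, 3, 4, 5]
productivity = [52, 55, 61, 64, 68]

r = pearson_r(years, productivity)      # strength and direction, -1 to +1
r_squared = r ** 2                      # share of variance explained
```

A positive r here would say only that productivity and years of service move together; r² quantifies how much of the variance the regression accounts for, without any claim of causation.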

References

  • Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California: SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2011). Reading statistics and research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Schumacker, R. E. (2014). Learning statistics using R. California: SAGE Publications, Inc. VitalBook file.