## Adv Quant: Logistic Regression in R

Introduction

The German credit data contains attributes and outcomes on 1,000 loan applications. The data are available at this Web site, where datasets are provided for the machine learning community.

Results

Figure 1: Image shows the first six entries in the German credit data.

Figure 2: Matrix scatter plot, showing the 2×2 relationships between all the variables within the German credit data.

Figure 3: A summary of the credit data with the variables of interest.

Figure 4: Shows the entries in the designer matrix which will be used for logistical analysis.

Figure 5: Summarized logistic regression information based on the training data.

Figure 6: The coeficients’ confidence interval at the 95% level using log-likelihood vlaues, with values to the right including the standard errors values.

Figure 7: Wald Test statistic to test the significance level of the entire ranked variable.

Figure 8: The Odds Ratio for each independent variable along with the 95% confidence interval for those odds ratio.

Figure 9: Part of the summarized test data set for the logistics regression model.

Figure 10: The ROC curve, which illustrates the false positive rate versus the true positive rate of the prediction model.

Discussion

The results from Figure 1 means that the data needs to be formatted before any analysis could be conducted on the data.  Hence, the following lines of code were needed to redefine the variables in the German data set.   Given the data output (Figure 1), the matrix scatter plot (Figure 2) show that duration, amount, and age are continuous variables, while the other five variables are factor variables, which have categorized scatterplots.  Even though installment and default show box plot data in the summary (Figure 3), the data wasn’t factored like history, purpose, or rent, thus it won’t show a count.  From the count data (Figure 3), the ~30% of the purpose of loans are for cars, where as 28% is for TVs.  In this German credit data, about 82% of those asking for credit do not rent and about 53% of borrowers have an ok credit history with 29.3% having a horrible credit history.  The mean average default rate is 30%.

Variables (Figure 5) that have statistical significance at the 0.10 include duration, amount, installment, age, history (per category), rent, and some of the purposes categories.  Though it is preferred to see a large difference in the null deviance and residual deviance, it is still a difference.  The 95% confidence interval for all the logistic regression equation don’t show much spread from their central tendencies (Figure 6).  Thus, the logistic regression model is (from figure 5):

The odds ratio measures the constant strength of association between the independent and dependent variables (Huck, 2011; Smith, 2015).  This is similar to the correlation coefficient (r) and coefficient of determination (r2) values for linear regression.  According to UCLA: Statistical Consulting Group, (2007), if the P value is less than 0.05, then the overall effect of the ranked term is statistically significant (Figure 7), which in this case the three main terms are.  The odds ratio data (Figure 8) is promising, as values closer to one is desirable for this prediction model (Field, 2013). If the value of the odds ratio is greater than 1, it will show that as the independent variable value increases, so do the odds of the dependent variable (Y = n) occurs increases and vice versa (Fields, 2013).

Moving into the testing phase of the logistics regression model, the 100 value data set needs to be extracted, and the results on whether or not there will be a default or not on the loan are predicted. Comparing the training and the test data sets, the maximum values between the both are not the same for durations and amount of the loan.  All other variables and statistical distributions are similar to each other between the training and the test data.  Thus, the random sampling algorithm in R was effective.

The area underneath the ROC curve (Figure 10), is 0.6994048, which is closer to 0.50 than it is to one, thus this regression does better than pure chance, but it is far from perfect (Alice, 2015).

In conclusion, the regression formula has a 0.699 prediction accuracy, and the purpose, history, and rent ranked categorical variables were statistically significant as a whole.  Therefore, the logistic regression on these eight variables shows more promise in prediction accuracy than pure chance, on who will and will not default on their loan.

Code

#

## The German credit data contains attributes and outcomes on 1,000 loan applications.

##    •   You need to use random selection for 900 cases to train the program, and then the other 100 cases will be used for testing.

##    •   Use duration, amount, installment, and age in this analysis, along with loan history, purpose, and rent.

### ———————————————————————————————————-

#

#

## Reading the data from source and displaying the top six entries.

#

#

## Defining the variables (Taddy, n.d.)

#

default = credits\$V21 – 1 # set default true when = 2

duration = credits\$V2

amount = credits\$V5

installment = credits\$V8

age = credits\$V13

history = factor(credits\$V3, levels = c(“A30”, “A31”, “A32”, “A33”, “A34”))

purpose = factor(credits\$V4, levels = c(“A40″,”A41″,”A42″,”A43″,”A44″,”A45″,”A46″,”A48″,”A49″,”A410”))

rent = factor(credits\$V15==”A151″) # renting status only

# rent = factor(credits\$V15 , levels = c(“A151″,”A152″,”153”)) # full property status

#

## Re-leveling the variables (Taddy, n.d.)

#

levels(history) = c(“great”, “good”, “ok”, “poor”, “horrible”)

levels(purpose) = c(“newcar”, “usedcar”, “furniture/equip”, “radio/TV”, “apps”, “repairs”, “edu”, “retraining”, “biz”, “other”)

# levels(rent) = c(“rent”, “own”, “free”) # full property status

#

## Create a new matrix called “cred” with the 8 defined variables (Taddy, n.d.)

#

credits\$default = default

credits\$duration= duration

credits\$amount  = amount

credits\$installment = installment

credits\$age     = age

credits\$history = history

credits\$purpose = purpose

credits\$rent    = rent

cred = credits[,c(“default”,”duration”,”amount”,”installment”,”age”,”history”,”purpose”,”rent”)]

#

##  Plotting & reading to make sure the data was transfered correctly into this dataset and present summary stats (Taddy, n.d.)

#

plot(cred)

cred[1:3,]

summary(cred[,])

#

## Create a design matrix, such that factor variables are turned into indicator variables

#

Xcred = model.matrix(default~., data=cred)[,-1]

Xcred[1:3,]

#

## Creating training and prediction datasets: Select 900 rows for esitmation and 100 for testing

#

set.seed(1)

train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing

xtrain = Xcred[train,]

xnew = Xcred[-train,]

ytrain = cred\$default[train]

ynew = cred\$default[-train]

#

## logistic regresion

#

datas=data.frame(default=ytrain,xtrain)

creditglm=glm(default~., family=binomial, data=datas)

summary(creditglm)

#

## Confidence Intervals (UCLA: Statistical Consulting Group, 2007)

#

confint(creditglm)

confint.default(creditglm)

#

## Overall effect of the rank using the wald.test function from the aod library (UCLA: Statistical Consulting Group, 2007)

#

install.packages(“aod”)

library(aod)

wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 6:9) # for all ranked terms for history

wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 10:18) # for all ranked terms for purpose

wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 19) # for the ranked term for rent

#

## Odds Ratio for model analysis (UCLA: Statistical Consulting Group, 2007)

#

exp(coef(creditglm))

exp(cbind(OR=coef(creditglm), confint(creditglm))) # odds ration next to the 95% confidence interval for odds ratios

#

## Predicting default from the test data (Alice, 2015; UCLA: Statistical Consulting Group., 2007)

#

newdatas=data.frame(default=ynew,xnew)

newestdata=newdatas[,2:19] #removing the variable default from the data matrix

summary(newdatas)

#

## Plotting the true positive rate against the false positive rate (ROC Curve) (Alice, 2015)

#

install.packages(“ROCR”)

library(ROCR)

prf = performance(pr, measure=”tpr”, x.measure=”fpr”)

plot(prf)

## Area under the ROC curve (Alice, 2015)

auc= performance(pr, measure = “auc”)

auc= auc@y.values[[1]]

auc # The closer this value is to 1 the better, much better than to 0.5

References

## Adv Quant: More on Logistic Regression

Logistic regression is another flavor of multi-variable regression, where one or more independent variables are continuous or categorical which are used to predict a dichotomous/ binary/ categorical dependent variable (Ahlemeyer-Stubbe, & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).  Zheng and Agresti (2000) defines predictive power as a measure that helps compare competing regressions via analyzing the importance of the independent variables.  For linear regression and multiple linear regression, the correlation coefficient and coefficient of determination are adequate for predictive power (Field, 2014; Zheng & Agresti, 2000). The more data that is collected could yield a stronger predictive power (Field, 2014).  Predictive power is used to sell the relationships between variables to management (Ahlemeyer-Stubbe, & Coleman, 2014).

For logistic regression, the predictive power of the independent variables can be evaluated by the concept of the odds ratio for each independent variable (Huck, 2011). Field (2013) and Schumacker (2014) explained that when the logistic regression is calculated, the categorical/binary variables are transformed into ln(odds ratio) and a regression is then performed on this newly scaled variable (scale factor seen in equation 1):

(1)

Since the probability of one categorical variable varies between 0à0.999…, the odds ratio value can vary between 0 à 999.999… (Schumacker, 2014). If the value of the odds ratio is greater than 1, it will show that as the independent variable value increases, so do the odds of the dependent variable (Y = n) occurs increases and vice versa (Field, 2013). Thus, the odds ratio measures the constant strength of association between the independent and dependent variables (Huck, 2011; Smith, 2015).  Due to this ln(odds ratio) transformation, logistic regression should be used for binary outcomes.

Field (2013) and Schumacker (2014) further explained that given that this ln(odds ratio) transformation needs to be made on the variables; the way to predict categorical outcomes from the regression formula (2),

(2)

is best to explained the probability of the categorical outcome value one is trying to calculate:

(3)

The probability equation (3) can be expressed in multiple ways, through typical algebraic manipulations.  Thus, the probability/likelihood of the dependent variable Y is defined between 0-100% and the odds ratio is used to discuss the strength of these relationships.

References

• Ahlemeyer-Stubbe, Andrea, Shirley Coleman. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
• Gall, M. D., Gall, J. P., Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
• Field, Andy. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
• Huck, Schuyler W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
• Schumacker, Randall E. (2014). Learning Statistics Using R, 1st Edition. [VitalSource Bookshelf Online].
• Zheng, B. and Agresti, A. (2000) Summarizing the predictive power of a generalized linear model.  Retrieved from http://www.stat.ufl.edu/~aa/articles/zheng_agresti.pdf

## Adv Quant: Logistic Vs Linear Regression

To generalize the results of the research the insights gained from a sample of data needs to use the correct mathematical procedures for using probabilities and information, statistical inference (Gall et al., 2006; Smith, 2015).  Gall et al. (2006), stated that statistical inference is what dictates the order of procedures, for instance, a hypothesis and a null hypothesis must be defined before a statistical significance level, which also has to be defined before calculating a z or t statistic value. Essentially, a statistical inference allows for quantitative researchers to make inferences about a population.  A population, where researchers must remember where that data was generated and collected from during quantitative research process.  The orders of procedures are important to apply statistical inferences to regressions, if not the prediction formula will not be generalizable.

Logistic regression is another flavor of multi-variable regression, where one or more independent variables are continuous or categorical which are used to predict a dichotomous/ binary/ categorical dependent variable (Ahlemeyer-Stubbe, & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).  Logistic regression is an alternative to linear regression, which assumes all variables are continuous (Ahlemeyer-Stubbe, & Coleman, 2014). Both the multi-variable linear regression and logistic regression formula are (Field, 2013; Schumacker, 2014):

Y = a + b11 + b2X2 + …                                                       (1)

The main difference between these two regressions is that the variables in the equation (1) represent different types of dependent (Y) and independent variables (Xi).  These different types of variables may have to undergo a transformation before the regression analysis begins (Field, 2013; Schumacker 2014).  Due to the difference in the types of variables between logistic and linear regression the assumptions on when to use either regression are also different (Table 1).

Table 1: Discusses and summarizes the types of assumptions and variables used in both logistic and regular regression, created from Ahlemeyer-Stubbe & Coleman (2014), Field (2013), Gall et al. (2006), Huck (2011) and Schumacker, (2014).

 Assumptions of Logistic Regression Assumptions for Linear Regression ·         Multicollinearity should be minimized between the independent variables ·         There is no need for linearity between the dependent and independent variables ·         Normality only on the continuous independent variables ·         No need for homogeneity of variance within the categorical variables ·         Error terms a not normally distributed ·         Independent variables don’t have to be continuous ·         There are no missing data (no null values) ·         Variance that is not zero ·         Multicollinearity should be minimized between the multiple independent variables ·         Linearity exists between all variables ·         Additivity (for multi-variable linear regression) ·         Errors in the dependent variable and its predicted values are independent and uncorrelated ·         All variables are continuous ·         Normality on all variables ·         Normality on the error values ·         Homogeneity of variance ·         Homoscedasticity- variance between residuals are constant ·         Variance that is not zero Variable Types of Logistic Regression Variable Types of Linear Regression ·         2 or more Independent variables ·         Independent variables: continuous, dichotomous, binary, or categorical ·         Dependent variable: dichotomous, binary ·         1 or more Independent variables ·         Independent variables: continuous ·         Dependent variables: continuous

References

• Ahlemeyer-Stubbe, Andrea, Shirley Coleman. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
• Gall, M. D., Gall, J. P., Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
• Field, Andy. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
• Huck, Schuyler W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
• Schumacker, Randall E. (2014). Learning Statistics Using R, 1st Edition. [VitalSource Bookshelf Online].

## Adv Quant: Overfitting & Parsimony

Overfitting and Parsimony

Overfitting a regression model is stuffing it with so many variables that have little contributional weight to help predict the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014).  Thus, to avoid the over-fitting problem, the use of parsimony is important in big data analytics.  Parsimony is describing a dependent variable with the fewest independent variables as possible (Field, 2013; Huck, 2013; Smith, 2015).  The best way to describe this is to use the “Keep It Simple Sweaty,” concept on the regression model.  The concept of parsimony could be attributed to Occam’s Razor, which states “plurality out never be posited without necessity” (Duignan, 2015).  Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models.

Overfitting in General Least Squares Model (GLM)

For multivariate regressions, a correlation matrix could be conducted on all the variables, to help with identifying parsimony, such that the software will try to maximize the correlation while minimizing the number of variables (Field, 2013).  Smith (2015) stated that the proportion of variation should remain high between the variables and that the correlation between the separate independent variables should be as low as possible. If the correlation coefficient between the independent variables is high (0.8 or higher), then there is a chance that there are extraneous variables (Smith, 2015).   Another technique to achieve parsimony is called the backward stepwise method, which is to run a regression model with all variables, and remove those variables that don’t contribute to the models significantly, or the model could start with one variable and add variables until it has maximized correlation and variance in a forward stepwise method (Field, 2013; Huck, 2015).

Unfortunately, there is still a problem of overfitting when conducting a backward stepwise method, forward stepwise method, or correlation matrix in multivariate linear models.  That is because, computers tend to remove, add or consider variables systematically and mathematically, not based on human knowledge (Field, 2013; Huck, 2015). Thus, it is still important to have a human to evaluate the computational output for logic, consistency, and reliability.  However, if the focus is to reduce overfitting, it should be noted that underfitting should also be avoided.  Underfitting a regression model happens when the model leaves out key independent variables that can help predict the dependent variable from the model (Field, 2013).

Hierarchical regression methods

When the researcher builds a multivariate regression model, they build it in stages, as they tend to add known independent variables first, and add newer independent variables to avoid overfitting in a technique called hierarchical regression (Austin, Goel & van Walraven, 2001; Field, 2013; Huck 2013).  The new and unknown independent variables could be entered in through a stepwise algorithm as abovementioned, or another step could be created where suspected new variables that may have a high contribution to the predictability of the dependent variables are added next (Field, 2013).  Hierarchical regression methods allow the researcher to analyze the differing hierarchical levels by examining not only the correlations between the levels but also the intercepts and slopes, helping drive valid statistical inferences (Austin et al., 2001).

Vandekerckhove et al. (2014) listed these three hierarchical methods for model selection; where each method is balancing between the goodness of fit and parsimony:

• Akaike’s Information Criterion (AIC) considers how much-observed data influences the belief of one model over the other, but is unreliable with huge amounts of data
• Bayesian Information Criterion (BIC) considers how much-observed data influences the belief of one model over the other and can handle huge amounts of data, but is known to underfit
• Minimum Description Length (MDL) considers how much a model can compress the observed data, through identifying regularity within the data values

Vandekerckhove et al, (2014), also stated that the model with the lowest AIC and/or BIC score would be the best to choose.

In conclusion, under parsimony, if adding another variable does not improve the regression formula, then should not be added into the assessment to avoid overfitting (Field, 2013). General Least Squares Models have issues in overfitting because computers systematically and mathematically conduct their analysis and lack the human knowledge to keep removing unneeded variables from the equation.  Hierarchical regression methods can help minimize overfitting through indirect calculation of a parsimony value (Vandekerckhove et al., 2014).

References

• Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health92(2), 150.
• Field, Andy. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
• Huck, S. W. (2013) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. [VitalSource Bookshelf Online].
• Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.

## Adv Quant: Polynomial Regression in R

Introduction

For this local polynomial regression, the “oldfaithful.csv” will be used from the open-source data. The eruption times (in minutes) and the waiting time to the next eruption (in minutes) of 272 eruptions are provided for the Old Faithful geyser.

Results

Figure 1: Density Histograms for eruptions times and eruption waiting times.

Figure 2: Smoothed density histrogram from local polynomial regresion.

Figure 3: Intercomparisson of linear regression (blue), lowess regresion(red), and polynomial regression (green) on the eruption data.

Figure 4: Residual plots for both linear and polynomial regression.

Discussion

The histogram plots (Figure 1) illustrate that both variables, eruption times and eruptions waiting time are both bimodal distributions.  Thus, a linear regression (Figure 3), would not capture the relationship between these two variables.  A polynomial smoothed version of the bimodal curve (Figure 2) show that for low values of the geysers magnitude, there is a low wait time for the next occurrence and vice versa.  The smoothed density curve shows the estimate values of the geyser’s variable distribution better than the bar histogram

LOCFIT (locally fitted regression) and LOWESS (locally weighted scatterplot smoothing regression) are assessed alongside the typical LM (linear regression).  LOCFIT is based on LOWESS, which allows the end user to specify the smoothing parameter and neighborhood size, but LOCFIT affords the end user more control over other the smoothing parameters (Futschik & Crompton, 2004).  Both LOCFIT and LOWESS are methods for regression that uses the nearest-neighbor-based model (Field, 2013; Futschik & Crompton, 2004; Loader, 2013; Smith, 2015).  This analysis will look at all three.

The goal is to see if there is a relationship between the waiting time to the next eruption to the magnitude of the eruption (per eruption time).  Through the linear regression algorithm, the linear model is eruptions = 0.075628 (waiting) – 1.874016.  The Pearson’s correlation coefficient is 0.9008112. Thus 81.14% of the variation could be explained by a linear regression model.  The lowess regression appears not to capture the distribution of data at smaller eruption times, but it is better than the linear regression model since its correlation is 0.9809684, and can explain 0.9622990 of the variation between the variables.

Finally, to evaluate the effectiveness of the linear model and the polynomial model, residuals must be assessed (Figure 4). Both of the residual plots don’t show any discernable pattern. However, the residuals are closer to zero in the polynomial regression, suggesting that it does a better job at explaining the variance between the eruption magnitude and the next eruption wait time.  In conclusion, the best regression for this data set appears to be the polynomial regression.

Code

#

## Use R to analyze the faithful dataset.

## This is a version of the eruption data from the “Old Faithful” geyser in Yellowstone National Park, Wyoming.

##  •     X (primary key)

##  •     eruptions (eruption time [mins])

##  •     waiting (wait time for this eruptions [mins])

#

# Produce density histograms of eruption times and of waiting times.

hist(fateful\$eruptions, freq=F, xlab = “eruptions time [mins]”,  main = “Histogram of the eruptions time”)

hist(fateful\$waiting, freq=F, xlab = “eruptions waiting time [mins]”,  main = “Histogram of the eruptions waiting time”)

# Produce a smoothed density histogram from local polynomial regression.

install.packages(“locfit”)

library(locfit)

plot(locfit(~lp(fateful\$eruptions),data=fateful), xlab = “eruptions time [mins]”,  main = “Histogram of the eruptions time”)

plot(locfit(~lp(fateful\$waiting),data=fateful), xlab = “eruptions waiting time [mins]”,  main = “Histogram of the eruptions waiting time”)

# Compare local polynomial regression to regular regression.

lowessRegression = lowess(fateful\$waiting, faithful\$eruptions, f=2/3)

polynomialRegression = locfit(fateful\$eruptions~lp(fateful\$waiting))

linearRegression = lm(fateful\$eruptions~fateful\$waiting)

# Graphing the data

plot(fateful\$waiting, fateful\$eruptions, main = “Eruption Times”, xlab=”eruption time [min]”, ylab = “Waiting time to next eruption [min]”)

lines(lowessRegression, col=”red”)

abline(linearRegression, col=”blue”)

lines(polynomialRegression, col=”green”)

# summary on the regressions

summary(linearRegression)

# correlations on the regressions

cor(fateful\$eruptions,fateful\$waiting)

cor(lowessRegression\$x, lowessRegression\$y)

# Plotting residuals

plot(residuals(linearRegression), main = “residuals for the linear regression”, ylab = “residuals”)

plot(residuals(polynomialRegression), main = “residuals for the polynomial regression”, ylab=”residuals”)

References

## Adv Quant: Locally Weighted Scatterplot Smothing (LOWESS) in R

Locally weighted scatterplot smoothing (LOWESS) method for multiple regression models in a k-nearest-neighbor-based model is a regression model with 1+ independent variables, which uses a non-parametric method which creates a smoothed surface/curve (Field, 2013; Smith, 2015).  LOWESS aims not to introduce a parametric model to the data, because doing so, would require much more resources (Cleveland, 1979).  Non-parametric tests have fewer assumptions than parametric tests, such as there are no assumptions on the sampled variable’s population distribution (Field, 2013; Huck, 2013; Schumacker, 2014; Smith, 2015).

Assumptions in the parametric analysis, which are based on the normal distribution, include (1) additivity and linearity; (2) normality; (3) homoscedasticity/homogeneity of variance; and (3) independence (Field, 2013; Huck, 2013). However, the assumption of independence still exists in the non-parametric analysis (Huck, 2013).  Smith (2015) states that these non-parametric analyses are less powerful than parametric analysis.  However, Field (2013) disagrees and says that they are powerful, but admits that there is a loss of information about the magnitude between the observed values.  Huck (2013), states that when using non-parametric analysis correctly, they have the similar power/weight to them as parametric analysis on data that meet the parametric assumptions. Thus, to conduct non-parametric analysis, data values are ranked and arranged, thus higher valued data have higher valued ranks and vice versa (Field, 2013; Huck, 2013; Smith, 2015). Cleveland (1979), describes that only a fraction of the data (local neighbors) are considered at a time, to minimize the weighing function.  Thus, a LOWESS regression is carried out on the ranked data, which help eliminates the effects of outliers, irons out skewed distributions (Field, 2013; Smith, 2015).

+ LOWESS doesn’t depend on an underlying population distribution (Field, 2013; Huck, 2013; Schumacker, 2014; Smith, 2015)

+ Looking at the data’s local neighboring data creates a smoothing function, which visually enhances pattern (Cleveland, 1979)

– The LOWESS technique is not a substitute for parametric regression analysis (Huck, 2013).  Thus, to use non-parametric tests, one must reject the null hypothesis: the data follows a defined distribution; with its corresponding alternative hypothesis: the data does not follow a defined distribution (Field, 2013; Huck, 2013).

– LOWESS is computationally heavy, especially depending on the weights chosen (Cleveland, 1979).

– Though the regression formula is easily and visually represented/smoothed, but the regression formula may not be as cleanly written (Cleveland, 1979).

Multiple Regression Analysis

From the R dataset archived website (http://vincentarelbundock.github.io/Rdatasets/), the NOxEmissions.csv file was downloaded, which is the Nox Air Pollution Data and it has 5 variables: primary key, Julian Calendar Day (julday), hourly mean of NOx concentrations in the air in parts per billion (LNOx), hourly sum of NOx car emissions in parts per billion (LNOxEm), and square root of the wind speed in meters per second (sqrtWS).

From this dataset, it is hypothesized, that the wind speed combined with the sum of NOx from car emissions could contribute to the mean Nox concentrations in the atmosphere.  Thus, given that there are multiple independent variables for one dependent variable, then multiple regression analysis is best suited (Field, 2013; Huck, 2013; Schumacker, 2014; Smith, 2015).

Figure 1: Histogram of each of the variables in the data set.

Figure 2: Simple Linear Regression between each of the independent variables to the dependent variables.  For the image on the right the regression formula is LNOx = -0.86443(sqrtWS) + 5.55885, with a correlation of -0.4300 and for the image on the left the regression formula is LNOx = 0.590066 (LNOxEm) + 0.048645, with a correlation of 0.6399.

Figure 3: The summation output of the Linear Multiple Regression, where the regression formula is LNOx= -1.018198 (sqrtWS) + 0.641412 (LNOxEm) + 1.061950, which explains 66.3% of the variation between these variables.

Figure 4: Normal Quantile-Quantile plot, for the multiple linear regression as described by Figure 3.

The histograms (Figure 1) are not convincing that this could be tested with a normal multiple linear regression analysis, but from the Normal quantile-quantile plot (Figure 4), shows normalcy in the data, justifying the results (Figure 3).  For furthering the understanding of the multiple linear regression, the simple linear regression per independent variable (Figure 2), shows that neither independent variable alone explain the variance between the variables as well as with the multiple regression analysis.

Figure 5: Multiple LOWESS regression plot with varying smoothing span.

Even though there is normalcy in the data, a LOWESS was still plotted on the data, just to illustrate how the differences between smoothing factors can influence the result.  The smoothing factor describes how small the neighborhood is on the k-nearest neighbor (Cleveland, 1979).  The smaller the smoothing factor, the smaller the neighborhood, and the blue line (f=2/3) is the default value in R (R, n.d.e,).  The larger the smoothing factor, the bigger the neighborhood, over simplifying the result.

Code

hist(NOxData\$LNOx, freq=F, xlab = “hourly mean of NOx concentrations [ppb]”,  main = “Histogram of the hourly mean of NOx concentrations”)

hist(NOxData\$LNOxEm, freq=F, xlab = “hourly sum of NOx car emissions [ppb]”,  main = “Histogram of the hourly sum of NOx car emissions”)

hist(NOxData\$sqrtWS, freq=F, xlab = “square root of winds [m/s]”, main = “Histogram of the square root of winds”)

# Single Linear Regressions on LNOxEm

## LNOx

plot(NOxData\$LNOxEm, NOxData\$LNOx)

abline(lm(NOxData\$LNOx~NOxData\$LNOxEm), col=”red”)

summary(lm(NOxData\$LNOx~NOxData\$LNOxEm))

cor(NOxData\$LNOx,NOxData\$LNOxEm)

## sqrtWS

plot(NOxData\$sqrtWS, NOxData\$LNOx)

abline(lm(NOxData\$LNOx~NOxData\$sqrtWS), col=”red”)

summary(lm(NOxData\$LNOx~NOxData\$sqrtWS))

cor(NOxData\$LNOx,NOxData\$sqrtWS)

# Multiple Linear Regression on both LNOxEM and sqrtWS variables on LNOx

RegressionModel = lm(NOxData\$LNOx~ NOxData\$LNOxEm + NOxData\$sqrtWS)

summary(RegressionModel)

plot(RegressionModel)

# Pearson’s Correlation between independent variables

cor(NOxData\$LNOxEm, NOxData\$sqrtWS)

# 95% Confidence Intervals on the regression model

confint(RegressionModel, conf.level=0.95)

# LOWESS MODEL

LowessModel = lowess(NOxData\$LNOx~ NOxData\$LNOxEm + NOxData\$sqrtWS, f=2/3)

LowessModel2 = lowess(NOxData\$LNOx~ NOxData\$LNOxEm + NOxData\$sqrtWS, f=0.01)

LowessModel3 = lowess(NOxData\$LNOx~ NOxData\$LNOxEm + NOxData\$sqrtWS, f=1)

plot(LowessModel,type=”l”,col=”blue”, main=”LOWESS Regression: green is f=1, blue is f=2/3, & red is f=0.01″)

lines(LowessModel2, col=”red”)

lines(LowessModel3, col=”green”)

References

## Adv Quant: General Linear Regression Model in R

Introduction

A goal for this post is to convert the dataset to a dataframe for analysis and performing a regression on the state.x77 dataset.

Results

Figure 1: Scatter plot matrix of the dataframe state.x77.  The red box illustrates the relationship that is personally identified for further analysis.

Figure 2: Scatter plot of murder rates versus illiteracy rates across the united states, with the linear regression function of illiteracy = 0.11607 * Murder + 0.31362; with a correlation of 0.729752.

Discussion

This post analyzes the dataset state.x77 under the MASS R library, was converted into a data frame (see code section), and an analysis of the data was conducted.  To identify which variable relationship would be interesting to conduct a regression on this dataset, all the relationships within the data frame were plotted in a matrix (Figure 1).  The relationship that personally seemed interesting was the relationship between illiteracy and murder.  Thus, moving forward with these variables a simple linear regression was conducted on that data.  It was determined that there is a positive correlation on this data of 0.729752, and the relationship between the data is defined by

illiteracy = 0.11607 * Murder + 0.31362                                        (1)

From this equation that describes the relationship (Figure 2) between these variables, can explain, 53.25% of the variance between these variables. Both the intercept value and the regression weight are statistically significant at the 0.01 level, meaning that there is less than a 1% chance that this relationship could be developed from pure random chance (R output between Figure 1 & 2).  In conclusion, this data is stating that states with lower illiteracy rates will have the least amount of murder rates in their state, and vice versa.

Code

#

## Converting a dataset to a dataframe for analysis.

#

library(MASS)             # Activate the MASS library

library(nutshell)         # Activate the nutshell library to access the plot function

data()                    # Lists all data and datasets within the Mass Library

data(state)               # Data in question is located in state

head(state.x77)           # Print out the top five entries of state.x77

df= data.frame(state.x77) # Convert the state.x77 data into a dataframe

#

## Regression formulation

#

plot(df)                                           # Scatter plot matrix, of all relationships between the variables in the df

stateRegression = lm(Illiteracy~Murder, data= df)  # Selecting this relationship for further analysis

summary(stateRegression)                           # Plotting a summary of the regression data

# Plotting a scatterplot from a dataframe below

plot(df\$Murder, df\$Illiteracy, type=”p”, main=”Illiteracy rates vs Murder rates”, xlab=”Murder”, ylab=”Illiteracy”)           # Plotting a scatterplot from a dataframe

abline(lm(Illiteracy~Murder, data= df), col=”red”) # Plotting a red regression line

cor(df\$Murder, df\$Illiteracy)

References

## Adv Quant: General Least Squares Model

Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011). There are multiple types of regression all of them are tests of prediction (Huck, 2011; Schumacker, 2014).  The least squares (linear) regression is the most well-known because it uses basic algebra, a straight line, and the correlation coefficient to aid in stating the regression’s prediction strength (Huck, 2011; Schumacker, 2014).  The linear regression model is:

y = (a + bx) + e                                                                   (1)

Where y is the dependent variable, x is the independent variable, a (the intercept) and b (the regression weight, also known as the slope) are a constants that are to be defined through the regression analysis, and e is the regression prediction error (Field, 2013; Schumacker, 2014).  The sum of the squared errors should be minimized per the least squares criterion, and that is reflected in the b term in equation 1 (Schumacker, 2014).

Correlation coefficients help define the strength of the regression formula in defining the relationships between the variables, and can vary in value from -1 to +1.  The closer the correlation coefficient is to -1 or +1; it informs the researcher that the regression formula is a good predictor of the variance between the variables.  The closer the correlation coefficient is to zero, indicates that there is hardly any relationship between the variable (Field, 2013; Huck, 2011; Schumacker, 2014).  Correlations never imply causation, but they can help determine the percentage of the variances between the variables by the regression formula result when the correlation value is squared (r2) (Field, 2013).

Assumptions for the General Least Square Model (GLM) modeling for regression and correlations

The General Least Squares Model (GLM) is the line of best fit, for linear regressions modeling along with its corresponding correlations (Smith, 2015).  There are five assumptions to a linear regression model: additivity, linearity, independent errors, homoscedasticity, and normally distributed errors.  Variables should be linearly related the independent variables(s), and the combined effects of multiple independent variables should be additive. A residual is the difference between the predicted value from the observed value: (1) no two residuals should be correlated, which can be numerically tested by using the Durbin-Watson test; (2) the variance of these residuals should be constant for each independent variable; and (3) the residuals should be random and normally distributed with a mean of 0 (Field, 2013; Schumacker, 2014).

Covering the issues with transforming variables to make them linear

When viewing the data through scatter plots, if the linearity and additivity assumptions could not be met, then transformations to the variables could be made to make the relationship linear. The above is an iterative trial and error process.  Transformation must occur to every point of the data set to correct for the linearity and addititvity issues since it changes the difference between the variables due to the change of units in the variables (Field, 2013).

Table 1: Types of data transformations and their uses (adapted from Field (2013) Table 5.1).

 Data Transformation Can Correct for Log [independent variable(s)] Positive skew, positive kurtosis, unequal variances, lack of linearity Square root [independent variable(s)] Positive skew, positive kurtosis, unequal variances, lack of linearity Reciprocal [independent variable(s)] Positive skew, positive kurtosis, unequal variances Reverse score [independent variable(s)]: subtracting the highest value in the variable for each data set Negative skew

Describe the R procedures for linear regression

lm( ) is a function for running linear regression, glm( ) is a function for running logistic regression (should not be confused for GLM), and loglm( ) is a function for running log-linear regression in R (Schumacker, 2014; Smith, 2015). The summary( ) function is used to output the results of the linear regression. Dependent variables are represented with a tilde “~” and independent variables are represented with a “+” (Schumacker, 2014). Thus, the R procedures for linear regression are (Marin, 2013):

> cor (x, y) # correlation coefficient

> myRegression = lm (y ~ x, data = dataSet ) # conduct a linear regression on x and y

> summary(myRegression) # produces the outputs of the lm( ) function calculations

> attributes(myRegression) # lists the attributes of the lm( ) function

> myRegression\$coefficients # gives you the slope and intercept coefficients

> plot (x, y, main=“Title to graph”) # scatter plot

> abline(myRegression) # regression line

> confint(myRegression, level= 0.99) # 99% level of confidence intervals for the regression coefficients

> anova(myRegression) # anova analysis on the regression analysis

References

• Field, A. (2013) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
• Huck, S. W. (2011) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
• Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

## Adv Quant: Birth Rate Dataset in R

Introduction

Built in the R library is the Births dataset with 400,000 records and 13 variables.  The following is an analysis of this dataset.

Results

Figure 1. The first five data point entries in the births2006.smpl data set.

Figure 2. The frequency of births in 2006 per day of the week.

Figure 3. Histogram of 2006 births frequencies graphed by day of the week and separated by method of delivery.

Figure 4. A trellis histogram plot of 2006 birth weight per birth number.

Figure 5. A trellis histogram plot of 2006 birth weight per birth delivery method.

Figure 6. A boxplot of 2006 birth weight per Apgar score.

Figure 7. A boxplot of 2006 birth weight per day of week.

Figure 8. A histogram of 2006 average birth weight per multiple births separated by gender.

Discussion

Given the open-sourced nature of the R software, many libraries are being built and shared with the greater community, and the Comprehensive R Archive Network (CRAN), has a ton of these programs as part of R Packages (Schumacker, 2014).  Thus, as part of the nutshell library, there exists a data set of 2006 births called “births2006.smpl”.  To view the first few entries the head() command can be used (R, n.d.g.).  The printout from the head() command (Figure 1) shows all 13 variables of the dataset along with the first five entries in the births2006.smpl dataset.

The number of birth seems to be approximately uniform (but not precisely) during the work week, assuming Sunday is 1 and Saturday is 7.  However, Tuesday-Thursday has the highest births in the week with the weekends having the least amount of births in the week.

Breaking down the method of deliveries in 2006 per day of the week, it can be seen that Vaginal birth in all seven days of the week outnumbers C-section deliveries in 2006 (Figure 3).  Also on Tuesday-Thursday there are more vaginal births compared to those during the weekend, and in C-section deliveries, there are most deliveries occur between Tuesday-Friday, and the least amount occurs during the weekends.

Breaking down the number of births frequencies per birth weight (Figure 4), it can be seen that the normal distribution of birth weight in grams shifts to the left as the number of multiple births increases.  This seems to suggest that babies born as a set of twins, triplets, etc. have lower birth rates on average and per distribution.  Birth weight is almost normally distributed for the single child birth but begins to lose normality as the number of births increases.

Further analysis of birth weights in 2006, per delivery method, shows that for whether or not the delivery method is known or not and its type of delivery method doesn’t play too much of a huge role in the determination of the child’s birth weight (Figure 5).  Statistical tests and effect size analysis could be conducted to verify and enhance the discussion and this assertion that is made through the graphical representation in Figure 5.

Apgar test is tested on the child after one and five minutes of birth looking at the skin color, heart rate, reflexes, muscle tone, and respiration rate of the child, where 10 is the highest but rarely obtain score (Hirsch, 2014).  Thus, observing the Apgar score variable (1-10) on birth weight in grams those with higher Apgar scores had on average higher median birth weights.  Typically, as Apgar score increases the tighter the distribution becomes, and the more outliers begin to appear (disregarding the results from Apgar score of 1).  These results from the boxplots tend to confirm Hirsch (2014) assertion that higher Apgar scores are harder to obtain.

Looking at the boxplot analysis of birth weight per day of the week (Figure 7) shows that the median, Q1, Q3, max, and min are normally distributed and unchanging per day of the week.  Outliers, the heavier babies, tend to occur without respect of the day of the week, and also appears to have little to no effect on the distribution of birth weight per day of the week.

Finally, looking at a mean birth weight per gender and per multiple births, shows a similar distribution of males and females (Figure 8). The main noticeable difference is the male Quintuplet or higher number of births on average weigh more than the corresponding female Quintuplet or higher number of births.  This chart also confirms the conclusions made (from Figure 4) where as the number of births increases the average weight of the children decrease.

In conclusion, the day of the week doesn’t predict birth weights, but probably birth frequency. In general, babies are heavier if they are single births and if they achieve Apgar score of 10.  Birth weights are not predictable through delivery method.  All of these conclusions are made on the visual representation of the dataset births2006.smpl.  What would increase the validity of these statements would be to conduct statistical significance tests and the effect size, to add further weight to what could be derived from through these images.

Code

#
## Use R to analyze the Birth dataset.
## The Birth dataset is in the Nutshell library.
##  • SEX and APGAR5 (SEX and Apgar score)
##  • DPLURAL (single or multiple birth)
##  • WTGAIN (weight gain of mother)
##  • ESTGEST (estimated gestation in weeks)
##  • DOB_MM, DOB_WK (month and day of week of birth)
##  • BWT (birth weight)
##  • DMETH_REC (method of delivery)
#
install.packages(“nutshell”)
library(nutshell)
data(births2006.smpl)

# First, list the data for the first 5 births.

# Next, show a bar chart of the frequencies of births according to the day of the week of the birth.
births.dayofweek = table(births2006.smpl\$DOB_WK) #Goal of this variable is to speed up the calculations
barplot(births.dayofweek, ylab=”frequency”, xlab=”Day of week”, col = “darkred”, main= “Number of births in 2006 per day of the week”)

# Obtain frequencies for two-way classifications of birth according to the day of the week and the method of delivery.
births.methodsVdaysofweek = table(births2006.smpl\$DOB_WK,births2006.smpl\$DMETH_REC)
barplot(births.methodsVdaysofweek[,-2], col=heat.colors(length(rownames(births.methodsVdaysofweek))), width=2, beside=TRUE, main = “bar plot of births per method per day of the week”)
legend (“topleft”, fill=heat.colors(length(rownames(births.methodsVdaysofweek))),legend=rownames(births.methodsVdaysofweek))

# Use lattice (trellis) graphs (R package lattice) to condition density histograms on the values of a third variable.
library(lattice)

# The variable for multiple births and the method of delivery are conditioning variables.
# Separate the histogram of birth weight according to these variable.
histogram(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),col=”black”, xlab = “birth weight”, main = “trellis plot of birth weight vs birth number”)

histogram(~DBWT|DMETH_REC,data=births2006.smpl,layout=c(1,3),col=”black”, xlab = “birth weight”, main = “trellis plot of birth weight vs birth method”)

# Do a box plot of birth weight against Apgar score and box plots of birth weight by day of week of delivery.
boxplot(DBWT~APGAR5,data=births2006.smpl,ylab=”birth weight”,xlab=”AGPAR5″, main=”Boxplot of birthweight per Apgar score”)

boxplot(DBWT~DOB_WK,data=births2006.smpl,ylab=”birth weight”,xlab=”Day of Week”, main=”Boxplot of birthweight per day of week”)

# Calculate the average birth weight as a function of multiple births for males and females separately.
# Use the “tapply” function, and for missing values use the “option nz.rm=TRUE.”
listed = list(births2006.smpl\$DPLURAL,births2006.smpl\$SEX)
tapplication=tapply(births2006.smpl\$DBWT,listed,mean,na.rm=TRUE)
barplot(tapplication,ylab=”birth weight”, beside=TRUE, legend=TRUE,xlab=”gender”, main = “bar plot of average birthweight per multiple births by gender”)

References

• R (n.d.b.). Apply a function over a ragged array. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/tapply.html
• R (n.d.f.). Produce box-and-wisker plot(s) of a given (grouped) values.  Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/boxplot.html
• Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

## Adv Quant: Statistical Significance and Machine Learning

Statistical significance on large samples sizes can be affected by small differences and can show up as significant, while in smaller samples large differences may be deemed statistically insignificant (Field, 2014).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011). Statistical analysis is highly deductive (Creswell, 2014), and supervised learning is highly inductive (Connolly & Begg, 2014).  Also, statistical analysis tries to identify trends in a given sample size by assuming normality, linearity or constant variance; whereas in machine learning it aims to find a pattern in a large sample of data and it is expected that these statistical analysis assumptions are not met and therefore require a higher random sampling set (Ahlemeyer-Stubbe, & Coleman, 2014).

Machine learning tries to emulate the way humans learn. When humans learn, they create a model based off of observations to help describe key features of a situation and help them predict an outcome, and thus machine learning does predictive modeling of large data sets in a similar fashion (Connolly & Begg, 2014).  The biggest selling point of supervised machine learning is that the machine can build models that identify key patterns in the data when humans can no longer compute the volume, velocity, and variety of the data (Ahlemeyer-Stubbe, & Coleman, 2014). There are many applications that use machine learning: marketing, investments, fraud detection, manufacturing, telecommunication, etc. (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Figure 1 illustrates how supervised learning can classify data or predict their values through a two-phase process.  The two-phase process consists of (1) training where the model is built by ingesting huge amounts of historical data; and (2) testing where the new model is tested on new current data that helps establish its accuracy, reliability, and validity (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). The model that is created by machines through this learning is quickly adaptable to new data (Minelli, Chambers, & Dhiraj, 2013).  These models themselves are a set of rules or formulas, and that depends on which analytical algorithm is used (Ahlemeyer-Stubbe & Coleman, 2014).  Given that the supervised machine learning is trained with known responses (or outputs) to make its future predictions, it is vital to have a clear purpose defined before running the algorithm.  The model is only as good as the data that goes in it.

Figure 1:  Simplified process diagram on supervised machine learning.

Thus, for classification the machine is learning a function to map data into one or many different defining characteristics, and it could consist of decision trees and neural network induction techniques (Connolly & Begg, 2014; Fayyad et al., 1996).  Fayyad et al. (1996) mentioned that it is impossible to classify data cleanly into one camp versus another. For value prediction, regression is used to map a function to the data that when followed gives an estimate on where the next value would be (Connolly & Begg, 2014; Fayyad et al. 1996).  However, in these regression formulas, it is good to remember that correlation between the data/variables does not imply causation.

Random sampling is core to statistics and the concept of statistical inference (Smith, 2015; Field, 2011), but it also serves a purpose in supervised learning (Ahlemeyer-Stubbe & Coleman, 2014).  Random sampling of data, is selecting a proportion of the data from a population, where each data point has an equal opportunity of being selected (Smith, 2015; Huck, 2013). The larger the sample, on average tends to represent the population fairly well (Field, 2014; Huck, 2013). Given nature big data, high volume, velocity, and variety, it is assumed that there is plenty of data to draw upon and run a supervised machine learning algorithm.  However, too much data that is fed into the machine learning algorithm can increase the process and analysis time.  Also, the bigger the random sampling size used for the learning, the more time it would take to process and analyze the data.

There are also unsupervised learning algorithms, where it also needs training and testing, but unlike supervised learning, it doesn’t need to validate its model on some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014, Conolly & Begg, 2014).   Therefore, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014).  Cluster analysis is an example of unsupervised learning, where the model seeks to find a finite set of the cluster that can help describe the data into subsets of similarities (Ahlemeyer-Stubbe & Coleman, 2014, Fayyad et al., 1996). Finally, in supervised learning the results could be checked through estimation error; however it is not so easy with unsupervised learning because of a lack of a target but requires retesting to see if the patterns are similar or repeatable (Ahlemeyer-Stubbe & Coleman, 2014).

References

• Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
• Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
• Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining, 17(3), 37–54.
• Field, A. (2011) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
• Huck, S. W. (2013) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
• Minelli, M., Chambers, M., Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses, 1st Edition. [VitalSource Bookshelf Online].