Adv Quant: Logistic Regression in R

Introduction

The German credit data contains attributes and outcomes on 1,000 loan applications. The data are available from the UCI Machine Learning Repository (the source URL is listed in the Code section), which provides datasets for the machine learning community.

Results

IP3F1.PNG

Figure 1: Image shows the first six entries in the German credit data.

IP3F2.png

Figure 2: Matrix scatter plot showing the pairwise relationships between all the variables in the German credit data.

IP3F3.png

Figure 3: A summary of the credit data with the variables of interest.

IP3F3.png

Figure 4: The entries in the design matrix, which will be used for the logistic regression analysis.

IP3F4

Figure 5: Summarized logistic regression information based on the training data.

IP3F6.1.png IP3F6.2.png

Figure 6: The coefficients' 95% confidence intervals based on the log-likelihood values, with the intervals on the right based on the standard errors.

IP3F7.png

Figure 7: Wald test statistics for the overall significance of each ranked (categorical) variable.

IP3F8.png

Figure 8: The odds ratio for each independent variable, along with the 95% confidence interval for each odds ratio.

IP3F9.png

Figure 9: Part of the summarized test data set for the logistic regression model.

IP3F10.png

Figure 10: The ROC curve, which illustrates the false positive rate versus the true positive rate of the prediction model.

Discussion

The results in Figure 1 show that the data need to be formatted before any analysis can be conducted. Hence, the variable definitions in the Code section were needed to redefine the variables in the German data set. Given the data output (Figure 1), the matrix scatter plot (Figure 2) shows that duration, amount, and age are continuous variables, while the other five variables take a small number of discrete values, which produces banded scatterplots. Although installment and default are summarized with quartiles and a mean (Figure 3), they were not converted to factors as history, purpose, and rent were, so the summary shows no counts for them. From the count data (Figure 3), about 30% of the loans are for cars, whereas 28% are for radios/TVs. In this German credit data, about 82% of those asking for credit do not rent, about 53% of borrowers have an ok credit history, and 29.3% have a horrible credit history. The mean default rate is 30%.
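The recoding and the percentages discussed above can be illustrated with a minimal sketch. The six rows below are made-up stand-ins for the raw columns V3 and V21 (they are not taken from the real file), so the printed shares will not match Figure 3; only the steps mirror the analysis.

```r
# Toy stand-in for two columns of the raw file (hypothetical values, not the real data)
raw <- data.frame(
  V3  = c("A32", "A34", "A32", "A30", "A33", "A32"),
  V21 = c(1, 2, 1, 1, 2, 1)
)

# Recode the outcome so that 1 means default and 0 means no default
default <- raw$V21 - 1

# Turn the coded history column into a labeled factor
history <- factor(raw$V3, levels = c("A30", "A31", "A32", "A33", "A34"))
levels(history) <- c("great", "good", "ok", "poor", "horrible")

# Share of cases in each history category, and the mean default rate
history_share <- prop.table(table(history))
default_rate  <- mean(default)

print(round(history_share, 2))
print(default_rate)
```

Because default is coded 0/1, its mean is the default rate, which is how a figure such as the 30% quoted above can be read directly from a numeric summary.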

Variables (Figure 5) that are statistically significant at the 0.10 level include duration, amount, installment, age, history (each category), rent, and some of the purpose categories. Although a large difference between the null deviance and the residual deviance is preferred, the model does reduce the deviance. The 95% confidence intervals for the logistic regression coefficients do not show much spread around their estimates (Figure 6). Thus, the logistic regression model is (from Figure 5):

IP3F11.PNG
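For readers who cannot view the image, the general form of such a model can be sketched symbolically. The coefficient symbols below are placeholders introduced here for illustration; the estimated values are the ones reported in Figure 5 and are not reproduced. Here p is the probability of default, and history_j and purpose_k are assumed to be indicator variables for the non-reference levels of each factor.

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0
  + \beta_1\,\text{duration}
  + \beta_2\,\text{amount}
  + \beta_3\,\text{installment}
  + \beta_4\,\text{age}
  + \sum_{j} \gamma_j\,\text{history}_j
  + \sum_{k} \delta_k\,\text{purpose}_k
  + \theta\,\text{rent},
\qquad
p = \frac{1}{1 + e^{-\eta}}
```

In the second expression, \eta stands for the whole right-hand side of the first, so the fitted linear predictor is converted to a predicted probability of default.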

The odds ratio measures the strength of association between an independent variable and the dependent variable (Huck, 2011; Smith, 2015). It plays a role similar to that of the correlation coefficient (r) and the coefficient of determination (r²) in linear regression. According to the UCLA: Statistical Consulting Group (2007), if the p-value is less than 0.05, then the overall effect of the ranked term is statistically significant (Figure 7), which is the case for all three ranked terms here. The odds ratios (Figure 8) are read against a value of one, at which a predictor leaves the odds of the outcome unchanged (Field, 2013). If the odds ratio is greater than 1, the odds of the dependent variable outcome (Y = n) occurring increase as the independent variable increases, and if it is less than 1 they decrease (Field, 2013).
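The link between a logistic regression coefficient and its odds ratio can be shown with a short sketch. The coefficient values below are hypothetical and chosen only for illustration; they are not the estimates from Figure 5 or Figure 8.

```r
# Hypothetical log-odds coefficients (illustrative values only, not the fitted model)
b <- c(duration = 0.03, age = -0.02, rentTRUE = 0.50)

# Exponentiating a logistic regression coefficient gives its odds ratio
odds_ratio <- exp(b)

# Percentage change in the odds of the outcome for a one-unit increase in each predictor
pct_change <- 100 * (odds_ratio - 1)

print(round(odds_ratio, 3))
print(round(pct_change, 1))
```

A positive coefficient always maps to an odds ratio above one and a negative coefficient to an odds ratio below one, which is why the direction of an effect can be read from either scale.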

Moving into the testing phase of the logistic regression model, the 100-case test data set is extracted and used to predict whether or not each loan will default. Comparing the training and test data sets (Figure 9), the maximum values of duration and loan amount are not the same in the two. All other variables and statistical distributions are similar between the training and the test data. Thus, the random sampling algorithm in R was effective.
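A rough way to check a random split of this kind is to compare quantiles of a predictor across the two subsets. The sketch below uses a simulated variable as a stand-in for loan amount (an assumption for illustration, not the real data) and the same 900/100 split size.

```r
set.seed(1)

# Simulated stand-in for a continuous predictor such as loan amount (not the real data)
n <- 1000
amount <- round(rexp(n, rate = 1 / 3000))

# Draw 900 row indices for training; the remaining 100 rows form the test set
train <- sample(seq_len(n), 900)
amount_train <- amount[train]
amount_test  <- amount[-train]

# Compare the two distributions side by side (min, quartiles, max)
split_quantiles <- sapply(list(train = amount_train, test = amount_test), quantile)
print(split_quantiles)
```

With a skewed variable, the quartiles of the two subsets tend to agree while the maxima often differ, simply because a 100-row sample is unlikely to contain the most extreme values.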

The area under the ROC curve (Figure 10) is 0.6994048, which is closer to 0.50 than to one; thus this regression does better than pure chance, but it is far from perfect (Alice, 2015).
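To make the meaning of the area under the curve concrete, it can be computed without any package through the rank (Mann-Whitney) formula. This is a sketch of an equivalent calculation, not the ROCR-based method used in the Code section; the function name auc_by_rank and the four example scores are made up for illustration.

```r
# AUC via the rank (Mann-Whitney) formula: the probability that a randomly chosen
# positive case receives a higher score than a randomly chosen negative case
auc_by_rank <- function(score, label) {
  r  <- rank(score)              # average ranks handle tied scores
  n1 <- sum(label == 1)          # number of positives (defaults)
  n0 <- sum(label == 0)          # number of negatives (non-defaults)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny hypothetical example: three of the four positive/negative pairs are ordered correctly
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)
print(auc_by_rank(score, label))  # 0.75
```

Under this reading, a value of 0.50 means the scores order a defaulter above a non-defaulter only half the time, and a value of 1 means they always do.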

In conclusion, the regression model has an area under the ROC curve of 0.699, and the ranked categorical variables purpose, history, and rent were each statistically significant as a whole. Therefore, the logistic regression of default on the seven predictors shows more promise than pure chance in predicting who will and will not default on their loan.

Code

#

## The German credit data contains attributes and outcomes on 1,000 loan applications.

##    •   You need to use random selection for 900 cases to train the program, and then the other 100 cases will be used for testing.

##    •   Use duration, amount, installment, and age in this analysis, along with loan history, purpose, and rent.

### ———————————————————————————————————-

## Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data

## Metadata file: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc

#

#

## Reading the data from source and displaying the top six entries.

#

credits = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", header = FALSE, sep = " ")

head(credits)

#

## Defining the variables (Taddy, n.d.)

#

default = credits$V21 - 1 # recode the outcome: 1 = default (raw value 2), 0 = no default (raw value 1)

duration = credits$V2

amount = credits$V5

installment = credits$V8

age = credits$V13

history = factor(credits$V3, levels = c("A30", "A31", "A32", "A33", "A34"))

purpose = factor(credits$V4, levels = c("A40", "A41", "A42", "A43", "A44", "A45", "A46", "A48", "A49", "A410"))

rent = factor(credits$V15 == "A151") # renting status only

# rent = factor(credits$V15, levels = c("A151", "A152", "A153")) # full property status

#

## Re-leveling the variables (Taddy, n.d.)

#

levels(history) = c("great", "good", "ok", "poor", "horrible")

levels(purpose) = c("newcar", "usedcar", "furniture/equip", "radio/TV", "apps", "repairs", "edu", "retraining", "biz", "other")

# levels(rent) = c("rent", "own", "free") # full property status

#

## Create a new data frame called "cred" with the 8 defined variables (Taddy, n.d.)

#

credits$default = default

credits$duration= duration

credits$amount  = amount

credits$installment = installment

credits$age     = age

credits$history = history

credits$purpose = purpose

credits$rent    = rent

cred = credits[, c("default", "duration", "amount", "installment", "age", "history", "purpose", "rent")]

#

## Plotting and reading to make sure the data were transferred correctly into this dataset, and presenting summary stats (Taddy, n.d.)

#

plot(cred)

cred[1:3,]

summary(cred[,])

#

## Create a design matrix, such that factor variables are turned into indicator variables

#

Xcred = model.matrix(default~., data=cred)[,-1]

Xcred[1:3,]

#

## Creating training and prediction datasets: select 900 rows for estimation and 100 for testing

#

set.seed(1)

train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing

xtrain = Xcred[train,]

xnew = Xcred[-train,]

ytrain = cred$default[train]

ynew = cred$default[-train]

#

## Logistic regression

#

datas=data.frame(default=ytrain,xtrain)

creditglm=glm(default~., family=binomial, data=datas)

summary(creditglm)

#

## Confidence Intervals (UCLA: Statistical Consulting Group, 2007)

#

confint(creditglm)

confint.default(creditglm)

#

## Overall effect of the rank using the wald.test function from the aod library (UCLA: Statistical Consulting Group, 2007)

#

install.packages("aod") # only needed the first time

library(aod)

wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 6:9) # for all ranked terms for history

wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 10:18) # for all ranked terms for purpose

wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 19) # for the ranked term for rent

#

## Odds Ratio for model analysis (UCLA: Statistical Consulting Group, 2007)

#

exp(coef(creditglm))

exp(cbind(OR=coef(creditglm), confint(creditglm))) # odds ratios next to their 95% confidence intervals

#

## Predicting default from the test data (Alice, 2015; UCLA: Statistical Consulting Group, 2007)

#

newdatas=data.frame(default=ynew,xnew)

newestdata=newdatas[,2:19] #removing the variable default from the data matrix

newestdata$defaultPrediction = predict(creditglm, newdata=newestdata, type = "response")

summary(newdatas)

#

## Plotting the true positive rate against the false positive rate (ROC Curve) (Alice, 2015)

#

install.packages("ROCR") # only needed the first time

library(ROCR)

pr  = prediction(newestdata$defaultPrediction, newdatas$default)

prf = performance(pr, measure="tpr", x.measure="fpr")

plot(prf)

## Area under the ROC curve (Alice, 2015)

auc= performance(pr, measure = "auc")

auc= auc@y.values[[1]]

auc # The closer this value is to 1 the better; a value of 0.5 is no better than chance

 

 

References