Adv Quant: Logistic Regression in R


Introduction

The German credit data set contains attributes and outcomes for 1,000 loan applications. The data are available from the UCI Machine Learning Repository, which hosts data sets for the machine learning community; the source URL is listed in the Code section.

Results

IP3F1.PNG

Figure 1: Image shows the first six entries in the German credit data.

IP3F2.png

Figure 2: Matrix scatter plot showing the pairwise relationships between all the variables in the German credit data.

IP3F3.png

Figure 3: A summary of the credit data with the variables of interest.

IP3F3.png

Figure 4: The entries in the design matrix that will be used for the logistic regression analysis.

IP3F4

Figure 5: Summarized logistic regression information based on the training data.

IP3F6.1.png

IP3F6.2.png

Figure 6: The 95% confidence intervals for the coefficients, based on the profiled log-likelihood values (left) and on the standard errors (right).

IP3F7.png

Figure 7: Wald test statistics for the overall significance of each ranked (categorical) variable.

IP3F8.png

Figure 8: The odds ratio for each independent variable, along with the 95% confidence interval for each odds ratio.

IP3F9.png

Figure 9: Part of the summarized test data set for the logistic regression model.

IP3F10.png

Figure 10: The ROC curve, which illustrates the false positive rate versus the true positive rate of the prediction model.

Discussion

Figure 1 shows that the data need to be reformatted before any analysis can be conducted, so the variables in the German credit data set were redefined (see the Code section). After redefinition, the matrix scatter plot (Figure 2) shows that duration, amount, and age are continuous variables, while the other five variables take a small number of discrete values and therefore produce banded scatterplots. Installment and default show quartile summaries rather than counts in Figure 3 because, unlike history, purpose, and rent, they were not converted to factors. From the count data (Figure 3), about 30% of loans are for cars, whereas 28% are for radios/TVs. About 82% of those asking for credit do not rent, about 53% of borrowers have an ok credit history, and 29.3% have a horrible credit history. The mean default rate is 30%.

Variables (Figure 5) that are statistically significant at the 0.10 level include duration, amount, installment, age, history (per category), rent, and some of the purpose categories. Although a larger difference between the null deviance and the residual deviance would be preferable, the model does reduce the deviance. The 95% confidence intervals for the logistic regression coefficients do not show much spread around their point estimates (Figure 6). Thus, the logistic regression model is (from Figure 5):

IP3F11.PNG
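Returning to the deviance comparison above, one common way to judge whether the drop from null to residual deviance is meaningful is a likelihood-ratio (chi-square) test on the difference. The sketch below is illustrative only: it assumes R's built-in `mtcars` data and a hypothetical one-predictor model rather than the credit model, because `creditglm` is only created in the Code section.

```r
# Illustrative model on built-in data (an assumption; not the credit data)
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Drop in deviance and the degrees of freedom spent to achieve it
dev_drop <- fit$null.deviance - fit$deviance
df_drop  <- fit$df.null - fit$df.residual

# Likelihood-ratio test: a small p-value suggests the predictors
# improve on the intercept-only (null) model
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
c(deviance_drop = dev_drop, df = df_drop, p_value = p_value)
```

Applying the same three calculations to `creditglm` would test the credit model as a whole.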

The odds ratio measures the strength of association between the independent and dependent variables (Huck, 2011; Smith, 2015). It plays a role similar to the correlation coefficient (r) and the coefficient of determination (r²) in linear regression. According to UCLA: Statistical Consulting Group (2007), if the p-value is less than 0.05, then the overall effect of the ranked term is statistically significant, which is the case for all three ranked terms here (Figure 7). The odds ratio data (Figure 8) are promising, as values closer to one are desirable for this prediction model (Field, 2013). If the odds ratio is greater than 1, the odds of the dependent variable outcome occurring increase as the independent variable increases; if it is less than 1, the odds decrease (Field, 2013).
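To make the reading of odds ratios concrete, here is a minimal sketch that exponentiates logistic regression coefficients and labels their direction. It assumes the built-in `mtcars` data and a hypothetical two-predictor model purely for illustration, and it uses Wald intervals from `confint.default()` rather than the profiled intervals shown in Figure 8.

```r
# Illustrative model on built-in data (an assumption; not the credit data)
fit <- glm(am ~ wt + hp, family = binomial, data = mtcars)

# Exponentiate the coefficients and their Wald confidence limits
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
round(or_table, 3)

# OR > 1: the odds of the outcome rise as the predictor increases
# OR < 1: the odds of the outcome fall as the predictor increases
direction <- ifelse(or_table[, "OR"] > 1, "odds increase", "odds decrease")
direction[-1]  # drop the intercept
```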

Moving into the testing phase of the logistic regression model, the 100-case test set is extracted and used to predict whether or not each loan will default. Comparing the training and test data sets, the maximum values of duration and loan amount differ between the two. All other variables and statistical distributions are similar between the training and test data. Thus, the random sampling algorithm in R was effective.

The area under the ROC curve (Figure 10) is 0.6994048, which is closer to 0.50 than to one. Thus, this regression does better than pure chance, but it is far from perfect (Alice, 2015).
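The area under the ROC curve can also be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The sketch below computes it from ranks in base R; the function name `auc_rank` and the four example scores are hypothetical, and this is an alternative to the ROCR calls used in the Code section, not a replacement for them.

```r
# AUC via the rank-sum identity (equivalent to the Mann-Whitney U statistic)
auc_rank <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  r     <- rank(scores)          # average ranks are used for tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true outcomes
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)
auc_rank(scores, labels)  # 0.75: three of the four positive/negative pairs are ordered correctly
```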

In conclusion, the regression model has an area under the ROC curve of 0.699, and the purpose, history, and rent ranked categorical variables were each statistically significant as a whole. Therefore, the logistic regression on these seven predictors shows more promise than pure chance in predicting who will and will not default on their loan.

Code

#
## The German credit data contains attributes and outcomes on 1,000 loan applications.
##    •   You need to use random selection for 900 cases to train the program, and then the other 100 cases will be used for testing.
##    •   Use duration, amount, installment, and age in this analysis, along with loan history, purpose, and rent.
### ----------------------------------------------------------------
## Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data
## Metadata file: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc
#

#
## Reading the data from source and displaying the top six entries.
#
credits = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", header = FALSE, sep = " ")
head(credits)

#
## Defining the variables (Taddy, n.d.)
#
default = credits$V21 - 1 # recode 1/2 to 0/1 so that default is 1 when V21 == 2
duration = credits$V2
amount = credits$V5
installment = credits$V8
age = credits$V13
history = factor(credits$V3, levels = c("A30", "A31", "A32", "A33", "A34"))
purpose = factor(credits$V4, levels = c("A40","A41","A42","A43","A44","A45","A46","A48","A49","A410"))
rent = factor(credits$V15 == "A151") # renting status only
# rent = factor(credits$V15, levels = c("A151","A152","A153")) # full property status

#
## Re-leveling the variables (Taddy, n.d.)
#
levels(history) = c("great", "good", "ok", "poor", "horrible")
levels(purpose) = c("newcar", "usedcar", "furniture/equip", "radio/TV", "apps", "repairs", "edu", "retraining", "biz", "other")
# levels(rent) = c("rent", "own", "free") # full property status

#
## Create a new data frame called "cred" with the 8 defined variables (Taddy, n.d.)
#
credits$default     = default
credits$duration    = duration
credits$amount      = amount
credits$installment = installment
credits$age         = age
credits$history     = history
credits$purpose     = purpose
credits$rent        = rent
cred = credits[, c("default","duration","amount","installment","age","history","purpose","rent")]

#
## Plotting & reading to make sure the data was transferred correctly into this dataset and present summary stats (Taddy, n.d.)
#
plot(cred)
cred[1:3,]
summary(cred)

#
## Create a design matrix, such that factor variables are turned into indicator variables
#
Xcred = model.matrix(default~., data=cred)[,-1] # drop the intercept column
Xcred[1:3,]

#
## Creating training and prediction datasets: Select 900 rows for estimation and 100 for testing
#
set.seed(1)
train = sample(1:1000, 900)

## Defining which x and y values in the design matrix will be for training and for testing
xtrain = Xcred[train,]
xnew   = Xcred[-train,]
ytrain = cred$default[train]
ynew   = cred$default[-train]

#
## Logistic regression
#
datas = data.frame(default=ytrain, xtrain)
creditglm = glm(default~., family=binomial, data=datas)
summary(creditglm)

#
## Confidence Intervals (UCLA: Statistical Consulting Group, 2007)
#
confint(creditglm)         # intervals based on the profiled log-likelihood
confint.default(creditglm) # intervals based on the standard errors

#
## Overall effect of the rank using the wald.test function from the aod library (UCLA: Statistical Consulting Group, 2007)
#
if (!requireNamespace("aod", quietly = TRUE)) install.packages("aod") # install only if missing
library(aod)
wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 6:9)   # for all ranked terms for history
wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 10:18) # for all ranked terms for purpose
wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 19)    # for the ranked term for rent

#
## Odds Ratio for model analysis (UCLA: Statistical Consulting Group, 2007)
#
exp(coef(creditglm))
exp(cbind(OR=coef(creditglm), confint(creditglm))) # odds ratios next to the 95% confidence interval for odds ratios

#
## Predicting default from the test data (Alice, 2015; UCLA: Statistical Consulting Group, 2007)
#
newdatas = data.frame(default=ynew, xnew)
newestdata = newdatas[,2:19] # removing the variable default from the data matrix
newestdata$defaultPrediction = predict(creditglm, newdata=newestdata, type = "response")
summary(newdatas)

#
## Plotting the true positive rate against the false positive rate (ROC Curve) (Alice, 2015)
#
if (!requireNamespace("ROCR", quietly = TRUE)) install.packages("ROCR") # install only if missing
library(ROCR)
pr  = prediction(newestdata$defaultPrediction, newdatas$default)
prf = performance(pr, measure="tpr", x.measure="fpr")
plot(prf)

## Area under the ROC curve (Alice, 2015)
auc = performance(pr, measure = "auc")
auc = auc@y.values[[1]]
auc # values closer to 1 are better; 0.5 is no better than chance

References

Quant: In-depth Analysis in SPSS

Abstract

This short analysis examines how marital happiness relates to a couple's combined income. It was found that marital happiness levels depend on a couple's combined income, but the happiest couples were happy regardless of how much money they had. This quantitative analysis of the sample data has shown that when happiness levels are low, there is a higher chance of lower combined income.

Introduction

Mulligan (1973) was one of the first to state that arguments about money are one of the top reasons for divorce between couples. Financial arguments can stem from: goals and savings; record keeping; delaying tactics; apparel cost-cutting strategies; controlling expenditures; financial statements; do-it-yourself techniques; and cost-cutting techniques (Lawrence, Thomasson, Wozniak, & Prawitz, 1993). Lawrence et al. (1993) assert that financial arguments are common within families. However, when does money stop being an issue? Does an increase in combined family income affect marital happiness levels? This analysis attempts to answer these questions.

Methods

Crosstabulation was conducted to get a descriptive exploration of the data. Boxplots helped show the spread and distribution of combined income per marital happiness level. In this analysis, two alternative hypotheses are tested:

  • There is a difference between the mean values of combined income per marital happiness level.
  • There is a dependence between combined income and marital happiness level.

To test these hypotheses, a one-way analysis of variance and a two-way chi-square test were conducted, respectively.
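For readers who work in R rather than SPSS, the two tests named above can be sketched roughly as follows. The data frame `survey`, its columns, and every value in it are made up for illustration; they are not the GSS data, and the toy sample is far too small for a trustworthy chi-square test.

```r
# Hypothetical toy data: 12 couples across three happiness levels
survey <- data.frame(
  happiness = factor(rep(c("very happy", "pretty happy", "not too happy"), each = 4)),
  income    = c(52000, 38000, 75000, 21000,
                41000, 36000, 30000, 45000,
                18000, 22000, 15000, 27000)
)

# Hypothesis 1: do mean incomes differ across happiness levels? (one-way ANOVA)
anova_fit <- aov(income ~ happiness, data = survey)
summary(anova_fit)

# Hypothesis 2: are income band and happiness level dependent? (chi-square test)
survey$income_band <- cut(survey$income, breaks = c(0, 30000, Inf),
                          labels = c("lower", "higher"))
xtab <- table(survey$happiness, survey$income_band)
chi  <- suppressWarnings(chisq.test(xtab))  # warning suppressed: toy counts are tiny
chi$parameter  # degrees of freedom: (3 - 1) * (2 - 1) = 2
chi$p.value
```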

Results

Table 1: Case processing summary for analyzing happiness level versus family income.

u6db1f7

Table 2: Crosstabulation for analyzing happiness level versus family income (<$21,250).

u6db1f3

Table 3: Crosstabulation for analyzing happiness level versus family income (>$21,250).

u6db1f4

Table 4: Chi-square test for analyzing happiness level versus family income.

u6db1f5

Table 5: Analysis of Variance for analyzing happiness level versus family income.

u6db1f6

u6db1f1.png

Figure 1: Boxplot diagram per happiness level of a marriage versus the family incomes.

u6db1f2.png

Figure 2: Line diagram per happiness level of a marriage versus the mean of the family incomes.

Discussions and Conclusions

There are 1419 participants, but only 38.5% responded to both the marital happiness and family income questions (Table 1). One likely contributor to this high nonresponse rate is that some participants were not married, which makes the marital happiness question not applicable to them. Thus, it is suggested that future versions of this survey instrument include an N/A classification, to see whether the response rate improves. Given that there are still 547 responses, there is useful information to be gained from analyzing these data.

As a family unit gains more income, its happiness level increases (Tables 2-3). As the dollar value increases, the percentage within "family income; ranges recoded to midpoints" for the very happy category rises from about the 50% level to about the 75% level. The unhappiest couples appear to have combined incomes of $7,500-9,000 and of $27,500-45,000. For marriages that are pretty happy, the share is fairly stable at 30-40% of respondents at incomes of $13,750 or more.

The mean family income by happiness level (Figure 2) shows that, on average, happier couples make more money together. A closer examination using boxplots (Figure 1), however, shows that the happiest couples seem to be happy regardless of how much money they make, as the tails of the boxplot extend far from the median. One interesting feature is that the spread of combined family income shrinks as happiness decreases (Figure 1). This could suggest that although money is not a major factor for couples that are happy, unhappiness could be driven by lower combined incomes.

The chi-square test shows a statistically significant relationship between combined family income and marital happiness, allowing us to reject null hypothesis #2, which states that these two variables are independent of each other (Table 4). The analysis of variance, however, does not allow us to reject null hypothesis #1, which states that mean combined income is the same across the marital happiness levels (Table 5).

There could be many reasons for these results, so future work could analyze other variables that may help explain marital happiness. A multivariate analysis may be necessary, with marital happiness as the dependent variable and combined income as one of many independent variables.

SPSS Code

GET
  FILE='C:\Users\mkher\Desktop\SAV files\gss.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.

CROSSTABS
  /TABLES=hapmar BY incomdol
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ CORR
  /CELLS=COUNT ROW COLUMN
  /COUNT ROUND CELL.

ONEWAY rincome BY hapmar
  /MISSING ANALYSIS.

* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=hapmar incomdol MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: hapmar=col(source(s), name("hapmar"), unit.category())
  DATA: incomdol=col(source(s), name("incomdol"))
  DATA: id=col(source(s), name("$CASENUM"), unit.category())
  GUIDE: axis(dim(1), label("HAPPINESS OF MARRIAGE"))
  GUIDE: axis(dim(2), label("Family income; ranges recoded to midpoints"))
  SCALE: cat(dim(1), include("1", "2", "3"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: schema(position(bin.quantile.letter(hapmar*incomdol)), label(id))
END GPL.

* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=hapmar MEAN(incomdol)[name="MEAN_incomdol"]
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: hapmar=col(source(s), name("hapmar"), unit.category())
  DATA: MEAN_incomdol=col(source(s), name("MEAN_incomdol"))
  GUIDE: axis(dim(1), label("HAPPINESS OF MARRIAGE"))
  GUIDE: axis(dim(2), label("Mean Family income; ranges recoded to midpoints"))
  SCALE: cat(dim(1), include("1", "2", "3"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: line(position(hapmar*MEAN_incomdol), missing.wings())
END GPL.

References

Quant: Chi-Square Test in SPSS

Introduction

The aim of this analysis is to determine the strength of association between the variables agecat and degree, as well as the major contributing cells, through a chi-square analysis. Standardized residuals are used to help determine the cell contributions.

Hypothesis

  • Null: There is no association between agecat and degree.
  • Alternative: There is an association between agecat and degree.

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to look at the relationships between the following variables: agecat (Age category) and degree (Respondent’s highest degree).

To conduct a chi-square analysis, navigate through Analyze > Descriptive Statistics > Crosstabs.

The variable degree was placed in the “Row(s)” box and agecat was placed in the “Column(s)” box. Select the “Statistics” button, select “Chi-square”, and under the “Nominal” section select “Lambda”. Select the “Cells” button and select “Standardized” under the “Residuals” section. The procedures for this analysis are provided in video tutorial form by Miller (n.d.). The resulting output is shown in the following four tables.

Results

Table 1: Case processing summary.

| Cases | Valid N | Valid Percent | Missing N | Missing Percent | Total N | Total Percent |
|---|---|---|---|---|---|---|
| Degree * Age category | 1411 | 99.4% | 8 | 0.6% | 1419 | 100.0% |

From the total sample of 1419 participants, 8 cases are reported as missing, leaving 99.4% valid cases (Table 1). Examining the crosstabulation (Table 2), for respondents with less than a high school education the standardized residuals for the 30-39, 40-49, and 50-59 age groups are below -1.96, and the residual for the 60-89 age group is far above +1.96. Thus, the observed and expected frequencies in these cells differ significantly. Finally, for respondents with a junior college degree or more, the standardized residual for the 60-89 age group is below -1.96, so these two frequencies also differ significantly. For all these cells, SPSS identified that the observed frequencies are far from the expected frequencies (Miller, n.d.). A significant negative standardized residual indicates that the expected count over-predicts the number of people of that age group with that respective diploma (or lack thereof). A significant positive standardized residual indicates that the expected count under-predicts the number of people of that age group without a diploma.

Table 2: Degree by Age category crosstabulation.

| Degree | Statistic | 18-29 | 30-39 | 40-49 | 50-59 | 60-89 | Total |
|---|---|---|---|---|---|---|---|
| Less than high school | Count | 42 | 33 | 36 | 20 | 112 | 243 |
| | Standardized Residual | -.1 | -2.8 | -2.3 | -2.7 | 7.1 | |
| High school | Count | 138 | 162 | 154 | 113 | 158 | 725 |
| | Standardized Residual | .9 | .2 | -.2 | .4 | -1.2 | |
| Junior college or more | Count | 68 | 115 | 114 | 78 | 68 | 443 |
| | Standardized Residual | -1.1 | 1.8 | 1.9 | 1.4 | -3.7 | |
| Total | Count | 248 | 310 | 304 | 211 | 338 | 1411 |

Deriving the degrees of freedom from Table 2, df = (5-1)*(3-1) = 8. None of the expected counts is less than five, since the minimum expected count is 36.3 (Table 3), which is desirable. The chi-square value is 96.364 and is significant at the 0.05 level. Thus, the null hypothesis is rejected, and there is a statistically significant association between a person’s age category and diploma level. This test does not tell us anything about the direction of the relationship.
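The chi-square statistic, degrees of freedom, minimum expected count, and standardized residuals reported here can be cross-checked from the Table 2 counts alone. The following is a minimal sketch in base R, offered as a check rather than as the procedure SPSS runs internally; the object names are illustrative.

```r
# Observed counts from Table 2 (rows: degree, columns: age category)
observed <- matrix(
  c( 42,  33,  36,  20, 112,
    138, 162, 154, 113, 158,
     68, 115, 114,  78,  68),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    degree = c("Less than high school", "High school", "Junior college or more"),
    agecat = c("18-29", "30-39", "40-49", "50-59", "60-89")
  )
)

chi_test <- chisq.test(observed)
chi_test$statistic       # Pearson chi-square, about 96.4
chi_test$parameter       # df = (3 - 1) * (5 - 1) = 8
min(chi_test$expected)   # smallest expected count, about 36.3

# Pearson residuals, (observed - expected) / sqrt(expected),
# correspond to what SPSS labels "Standardized Residual"
round(chi_test$residuals, 1)
abs(chi_test$residuals) > 1.96   # TRUE marks cells that depart notably from independence
```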

Table 3: Chi-Square Tests

| Test | Value | df | Asymptotic Significance (2-sided) |
|---|---|---|---|
| Pearson Chi-Square | 96.364 (a) | 8 | .000 |
| Likelihood Ratio | 90.580 | 8 | .000 |
| Linear-by-Linear Association | 23.082 | 1 | .000 |
| N of Valid Cases | 1411 | | |

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 36.34.

Table 4: Directional Measures

| Nominal by Nominal measure | Variable | Value | Asymptotic Standard Error (a) | Approximate T (b) | Approximate Significance |
|---|---|---|---|---|---|
| Lambda | Symmetric | .029 | .013 | 2.278 | .023 |
| Lambda | Degree Dependent | .000 | .000 | . (c) | . (c) |
| Lambda | Age category Dependent | .048 | .020 | 2.278 | .023 |
| Goodman and Kruskal tau | Degree Dependent | .024 | .005 | | .000 (d) |
| Goodman and Kruskal tau | Age category Dependent | .019 | .004 | | .000 (d) |

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Cannot be computed because the asymptotic standard error equals zero.
d. Based on chi-square approximation.

Although there is a statistically significant association between a person’s age category and diploma level, the chi-square test does not show how strongly these variables are related to each other. The symmetric lambda value is 0.029, meaning that knowing one variable reduces the error in predicting the other by only about 2.9% (Table 4). Thus, the relationship is very weak and of little practical importance.
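Lambda is a proportional-reduction-in-error measure, so it can be computed by hand from the Table 2 counts. The sketch below does this in base R; the helper name `lambda_col_dependent` is hypothetical, and the formula used is the standard Goodman and Kruskal lambda.

```r
# Observed counts from Table 2 (rows: degree, columns: age category)
observed <- matrix(
  c( 42,  33,  36,  20, 112,
    138, 162, 154, 113, 158,
     68, 115, 114,  78,  68),
  nrow = 3, byrow = TRUE
)

# Lambda with the column variable as dependent: how much do prediction errors
# fall when the row category is known, compared with always guessing the modal column?
lambda_col_dependent <- function(tab) {
  n <- sum(tab)
  errors_without <- n - max(colSums(tab))
  errors_with    <- n - sum(apply(tab, 1, max))
  (errors_without - errors_with) / errors_without
}

lambda_age    <- lambda_col_dependent(observed)     # age category dependent
lambda_degree <- lambda_col_dependent(t(observed))  # degree dependent

# Symmetric lambda pools the error reductions from both directions
n <- sum(observed)
lambda_sym <- ((sum(apply(observed, 1, max)) - max(colSums(observed))) +
               (sum(apply(observed, 2, max)) - max(rowSums(observed)))) /
              (2 * n - max(colSums(observed)) - max(rowSums(observed)))

round(c(symmetric = lambda_sym, degree = lambda_degree, age = lambda_age), 3)
```

The rounded results agree with the lambda rows of Table 4 (.029, .000, and .048).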

Conclusions

There is a statistically significant association between a person’s age category and diploma level. According to the crosstabulation, the expected counts significantly over-predict the number of people with less than a high school diploma in the 30-59 age groups, as well as those with a junior college degree or more in the 60-89 age group. These differences in the standardized residuals helped drive a large and statistically significant chi-square value. With a lambda of 0.029, however, knowing one variable reduces the error in predicting the other by only about 2.9%, so the association is very weak.

SPSS Code

CROSSTABS
  /TABLES=ndegree BY agecat
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ CC LAMBDA
  /CELLS=COUNT SRESID
  /COUNT ROUND CELL.

References:

Quant: Parametric and Non-Parametric Stats

There are numerous times when the information collected from a real organization will not conform to the requirements of a parametric analysis. That is, a practitioner would not be able to analyze the data with a t-test or F-test (ANOVA). Presume that a young professional came to you and said he or she had read about tests—such as the Chi-Square, the Mann-Whitney U test, the Wilcoxon Signed-Rank test, and Kruskal-Wallis one-way analysis of variance—and wanted to know when you would use each and why each would be used instead of the t-tests and ANOVA.

Parametric statistics are inferential and based on random sampling from a well-defined population, with the sample data used to make strict inferences about the population’s parameters. Tests such as t-tests, chi-square, and F-tests (ANOVA) can therefore be used (Huck, 2011; Schumacker, 2014). Nonparametric statistics, or “assumption-free tests”, are used for tests that work with ranked data, such as the Mann-Whitney U-test, Wilcoxon signed-rank test, Kruskal-Wallis H-test, and chi-square (Field, 2013; Huck, 2011).

First, there is a need to define the types of data.  Continuous data is interval/ratio data, and categorical data is nominal/ordinal data.  Modified from Schumacker (2014) with data added from Huck (2011):

| Statistic | Dependent Variable | Independent Variable |
|---|---|---|
| Analysis of Variance (ANOVA): One way | Continuous | Categorical |
| t-Test: Single sample | Continuous | |
| t-Test: Independent groups | Continuous | Categorical |
| t-Test: Dependent (paired groups) | Continuous | Categorical |
| Chi-square | Categorical | Categorical |
| Mann-Whitney U-test | Ordinal | Ordinal |
| Wilcoxon | Ordinal | Ordinal |
| Kruskal-Wallis H-test | Ordinal | Ordinal |

ANOVAs (or F-tests) are used to analyze the differences among three or more group means by studying the variation between the groups, and they test the null hypothesis that the group means are equal (Huck, 2011). Student's t-tests, or t-tests, test the null hypothesis that the mean of a population equals some specified number and are used when the sample size is relatively small compared to the population size (Field, 2013; Huck, 2011; Schumacker, 2014). The test assumes a normal distribution (Huck, 2011). With large sample sizes, t-values are essentially the same as z-values, and the same can happen with chi-square, because the t and chi-square distributions both depend on sample size (Schumacker, 2014). In other words, at large sample sizes the t-distribution and chi-square distribution begin to look like a normal curve. Chi-square is related to the variance of a sample, and chi-square tests are used to test the null hypothesis that the sample mean is part of a normal distribution (Schumacker, 2014). Chi-square tests are so versatile that they can be used as both parametric and non-parametric tests (Field, 2013; Huck, 2011; Schumacker, 2014).
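The convergence of t-values toward z-values can be seen directly by comparing critical values. This is a small illustrative sketch in base R; the particular degrees of freedom are arbitrary choices.

```r
# Two-sided 5% critical values of the t-distribution for increasing degrees of freedom
df_values  <- c(5, 30, 100, 1000)
t_critical <- qt(0.975, df = df_values)
z_critical <- qnorm(0.975)   # about 1.96

# The gap between the t and z critical values shrinks as the sample grows
data.frame(df         = df_values,
           t_critical = round(t_critical, 3),
           gap_to_z   = round(t_critical - z_critical, 3))
```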

The Mann-Whitney U-test and the Wilcoxon rank-sum test are equivalent: both are non-parametric counterparts of the independent-samples t-test, and the two samples do not even have to be the same size (Field, 2013).

The nonparametric Mann-Whitney U-test can be substituted for a t-test when a normal distribution cannot be assumed; it was designed for two independent samples that do not have repeated measures (Field, 2013; Huck, 2011). This makes it a good substitute for the independent-groups t-test (Field, 2013). A benefit of choosing the Mann-Whitney U-test is that it is unlikely to produce a Type II error, or false negative (Huck, 2011). The null hypothesis is that the two independent samples come from the same population (Field, 2013; Huck, 2011).
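In R, the Mann-Whitney U-test is available through `wilcox.test()` when two unpaired samples are supplied. The values below are hypothetical and chosen only to show the call with groups of unequal size.

```r
# Hypothetical scores for two independent groups of unequal size
group_a <- c(1.1, 2.3, 3.5, 4.2)
group_b <- c(5.1, 6.4, 7.7, 8.0, 9.3)

# With two unpaired samples, wilcox.test() runs the Mann-Whitney U (rank-sum) test
mw <- wilcox.test(group_a, group_b)
mw$statistic  # W counts the pairs in which a group_a value exceeds a group_b value
mw$p.value
```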

The nonparametric Wilcoxon signed-rank test is best for distributions that are skewed, where neither homogeneity of variance nor a normal distribution can be assumed (Field, 2013; Huck, 2011). The Wilcoxon signed-rank test compares two related (correlated) samples from the same population (Huck, 2011). Each pair of observations is chosen randomly and independently, with no repetition between pairs (Huck, 2011). This makes it a good substitute for the dependent t-test (Field, 2013; Huck, 2011). The null hypothesis is that the central tendency of the paired differences is 0 (Huck, 2011).
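The signed-rank test uses the same R function with `paired = TRUE`. The before/after measurements below are hypothetical and serve only to show the call.

```r
# Hypothetical before/after measurements on the same six subjects
before <- c(12.1, 15.3, 11.8, 14.6, 13.2, 16.9)
after  <- c(13.4, 17.1, 12.3, 16.8, 14.8, 19.6)

# paired = TRUE switches wilcox.test() to the signed-rank test on the differences
sr <- wilcox.test(before, after, paired = TRUE)
sr$statistic  # V is the sum of the ranks of the positive (before - after) differences
sr$p.value
```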

The nonparametric Kruskal-Wallis H-test can be used to compare two or more independent samples to see whether they come from the same distribution; it is comparable to a one-way analysis of variance (ANOVA) and focuses on central tendencies (Huck, 2011). It can be viewed as an extension of the Mann-Whitney U-test to more than two groups (Huck, 2011). The null hypothesis is that the medians in all groups are equal (Huck, 2011).
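A Kruskal-Wallis H-test can be run in base R with `kruskal.test()`. The three groups below are hypothetical and deliberately well separated so the result is easy to follow.

```r
# Hypothetical scores for three independent groups
scores <- c(1.2, 2.4, 3.1,   4.5, 5.8, 6.0,   7.3, 8.9, 9.4)
group  <- factor(rep(c("low", "medium", "high"), each = 3))

# kruskal.test() compares the groups on ranks rather than raw values
kw <- kruskal.test(scores ~ group)
kw$statistic  # H statistic
kw$parameter  # degrees of freedom = number of groups - 1
kw$p.value
```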

References

  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2011). Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Schumacker, R. E. (2014). Learning Statistics Using R. California: SAGE Publications, Inc. VitalBook file.