Adv Quant: K-means classification in R

The explanatory variables in the logistic regression are both the type of loan and the borrowing amount.

Figure 1: The summary output of the logistic regression based on the type of loan and the borrowing amount.

The logistic equation shows statistical significance at the 0.01 level when the variables amount, and when the type of loan is used for a used car and a radio/television (Figure 1).  Thus, the regression equation comes out to be:

default = -0.9321 + 0.0001330(amount) – 1.56(Purpose is for used car) – 0.6499(purpose is for radio/television)

Figure 2: The comparative output of the logistic regression prediction versus actual results.

When comparing the predictions to the actual values (Figure 2), the mean and minimum scores between both of them are similar.  However, all other values are not. When the prediction values are rounded to the nearest whole number the actual prediction rate is 73%.

K-means classification, on the 3 continuous variables: duration, amount, and installment.

In K-means classification the data is clustered by the mean Euclidean distance between their differences (Ahlemeyer-Stubbe & Coleman, 2014).  In this exercise, there are two clusters. Thus, the cluster size is 825 no defaults, 175 defaults, where the within-cluster sum of squares for between/total is 69.78%.  The matrix of cluster centers is shown below (Figure 3).

Figure 3: K means center values, per variable

Cross-validation with k = 5 for the nearest neighbor.

K-nearest neighbor (K =5) is when a data point is clustered into a group, by having 5 of the nearest neighbors vote on that data point, and it is particularly useful if the data is a binary or categorical (Berson, Smith, & Thearling, 1999).  In this exercise, the percentage of correct classifications from the trained and predicted classification is 69%.  However, logistic regression in this scenario was able to produce a much higher prediction rate of 73%, this for this exercise and this data set, logistic regression was quite useful in predicting the default rate than the k-nearest neighbor algorithm at k=5.

Code

#

## The German credit data contains attributes and outcomes on 1,000 loan applications.

#

## Reading the data from source and displaying the top five entries.

#

##

### ———————————————————————————————————-

## The two outcomes are success (defaulting on the loan) and failure (not defaulting).

## The explanatory variables in the logistic regression are both the type of loan and the borrowing amount.

### ———————————————————————————————————-

##

#

## Defining and re-leveling the variables (Taddy, n.d.)

default = credits\$V21 – 1 # set default true when = 2

amount = credits\$V5

purpose = factor(credits\$V4, levels = c(“A40″,”A41″,”A42″,”A43″,”A44″,”A45″,”A46″,”A48″,”A49″,”A410”))

levels(purpose) = c(“newcar”, “usedcar”, “furniture/equip”, “radio/TV”, “apps”, “repairs”, “edu”, “retraining”, “biz”, “other”)

## Create a new matrix called “cred” with the 8 defined variables (Taddy, n.d.)

credits\$default = default

credits\$amount  = amount

credits\$purpose = purpose

cred = credits[,c(“default”,”amount”,”purpose”)]

summary(cred[,])

## Create a design matrix, such that factor variables are turned into indicator variables

Xcred = model.matrix(default~., data=cred)[,-1]

Xcred[1:5,]

## Creating training and prediction datasets: Select 900 rows for esitmation and 100 for testing

set.seed(1)

train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing

xtrain = Xcred[train,]

xtest = Xcred[-train,]

ytrain = cred\$default[train]

ytest = cred\$default[-train]

## logistic regresion

datas=data.frame(default=ytrain,xtrain)

creditglm=glm(default~., family=binomial, data=datas)

summary(creditglm)

percentOfCorrect=100*(sum(ytest==round(testingdata\$defaultPrediction))/100)

percentOfCorrect

## Predicting default from the test data (Alice, 2015; UCLA: Statistical Consulting Group., 2007)

testdata=data.frame(default=ytest,xtest)

testdata[1:5,]

testingdata=testdata[,2:11] #removing the variable default from the data matrix

testingdata\$defaultPrediction = predict(creditglm, newdata=testdata, type = “response”)

results = data.frame(ytest,testingdata\$defaultPrediction)

summary(results)

#

##

### ———————————————————————————————————-

##  K-means classification, on the 3 continuous variables: duration, amount, and installment.

### ———————————————————————————————————-

##

#

install.packages(“class”)

library(class)

## Defining and re-leveling the variables (Taddy, n.d.)

default = credits\$V21 – 1 # set default true when = 2

duration = credits\$V2

amount = credits\$V5

installment = credits\$V8

## Create a new matrix called “cred” with the 8 defined variables (Taddy, n.d.)

credits\$default = default

credits\$amount  = amount

credits\$installment = installment

credits\$duration = duration

creds = credits[,c(“duration”,”amount”,”installment”,”default”)]

summary(creds[,])

## K means classification (R, n.b.a)

kmeansclass= cbind(creds\$default,creds\$duration,creds\$amount,creds\$installment)

kmeansresult= kmeans(kmeansclass,2)

kmeansresult\$cluster

kmeansresult\$size

kmeansresult\$centers

kmeansresult\$betweenss/kmeansresult\$totss

#

##

### ———————————————————————————————————-

##  Cross-validation with k = 5 for the nearest neighbor.

### ———————————————————————————————————-

##

#

## Create a design matrix, such that factor variables are turned into indicator variables

Xcreds = model.matrix(default~., data=creds)[,-1]

Xcreds[1:5,]

## Creating training and prediction datasets: Select 900 rows for esitmation and 100 for testing

set.seed(1)

train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing

xtrain = Xcreds[train,]

xtest = Xcreds[-train,]

ytrain = creds\$default[train]

ytest = creds\$default[-train]

## K-nearest neighbor clustering (R, n.d.b.)

nearestFive=knn(train = xtrain[,2,drop=F],test=xtest[,2,drop=F],cl=ytrain,k=5)

knnresults=cbind(ytest+1,nearestFive) # The addition of 1 is done on ytest because when cbind is applied to nearestFive it adds 1 to each value.

percentOfCorrect=100*(sum(ytest==nearestFive)/100)

References