Adv Quant: Compelling Topics

Compelling topics summary/definitions

  • Supervised machine learning algorithms: is a model that needs training and testing data set. However it does need to validate its model on some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014, Conolly & Begg, 2014).
  • Unsupervised machine learning algorithms: is a model that needs training and testing data set, but unlike supervised learning, it doesn’t need to validate its model on some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014, Conolly & Begg, 2014). Therefore, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014).
  • General Least Squares Model (GLM): is the line of best fit, for linear regressions modeling along with its corresponding correlations (Smith, 2015). There are five assumptions to a linear regression model: additivity, linearity, independent errors, homoscedasticity, and normally distributed errors.
  • Overfitting: is stuffing a regression model with so many variables that have little contributional weight to help predict the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014). Thus, to avoid the over-fitting problem, the use of parsimony is important in big data analytics.
  • Parsimony: is describing a dependent variable with the fewest independent variables as possible (Field, 2013; Huck, 2013; Smith, 2015). The concept of parsimony could be attributed to Occam’s Razor, which states “plurality out never be posited without necessity” (Duignan, 2015).  Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models.
  • Hierarchical Regression: When the researcher builds a multivariate regression model, they build it in stages, as they tend to add known independent variables first, and add newer independent variables in order to avoid overfitting in a technique called hierarchical regression (Austin, Goel & van Walraven, 2001; Field, 2013; Huck 2013).
  • Logistic Regression: multi-variable regression, where one or more independent variables are continuous or categorical which are used to predict a dichotomous/ binary/ categorical dependent variable (Ahlemeyer-Stubbe, & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).
  • Nearest Neighbor Methods: K-nearest neighbor (i.e. K =5) is when a data point is clustered into a group, by having 5 of the nearest neighbors vote on that data point, and it is particularly useful if the data is a binary or categorical (Berson, Smith, & Thearling, 1999).
  • Classification Trees: aid in data abstraction and finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Conolly & Begg, 2014) and aid the decision-making process by mapping out all the paths, solutions, or options available to the decision maker to decide upon.
  • Bayesian Analysis: can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).
  • Discriminate Analysis: how should data be best separated into several groups based on several independent variables that create the largest separation of the prediction (Ahlemeyer-Stubbe, & Coleman, 2014; Field, 2013).
  • Ensemble Models: can perform better than a single classifier, since they are created as a combination of classifiers that have a weight attached to them to properly classify new data points (Bauer & Kohavi, 1999; Dietterich, 2000), through techniques like Bagging and Boosting. Boosting procedures help reduce both bias and variance of the different methods, and bagging procedures reduce just the variance of the different methods (Bauer & Kohavi, 1999; Liaw & Wiener, 2002).

 

References

  • Ahlemeyer-Stubbe, Andrea, Shirley Coleman. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health92(2), 150.
  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine learning,36(1-2), 105-139.
  • Berson, A. Smith, S. & Thearling K. (1999). Building Data Mining Applications for CRM. McGraw-Hill. Retrieved from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.
  • Duignan, B. (2015). Occam’s razor. Encyclopaedia Britannica. Retrieved from https://www.britannica.com/topic/Occams-razor
  • Field, Andy. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
  • Gall, M. D., Gall, J. P., Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
  • Hubbard, D. W. (2010). How to measure anything: Finding the values of “intangibles” in business. (2nd e.d.) New Jersey, John Wiley & Sons, Inc.
  • Huck, Schuyler W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
  • Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Spiegelhalter, D. & Rice, K. (2009) Bayesian statistics. Retrieved from http://www.scholarpedia.org/article/Bayesian_statistics
  • Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.
  • Yudkowsky, E.S. (2003). An intuitive explanation of Bayesian reasoning. Retrieved from http://yudkowsky.net/rational/bayes
Advertisements

Adv Quant: K-means classification in R

The explanatory variables in the logistic regression are both the type of loan and the borrowing amount.

4dbf1.PNG

Figure 1: The summary output of the logistic regression based on the type of loan and the borrowing amount.

The logistic equation shows statistical significance at the 0.01 level when the variables amount, and when the type of loan is used for a used car and a radio/television (Figure 1).  Thus, the regression equation comes out to be:

default = -0.9321 + 0.0001330(amount) – 1.56(Purpose is for used car) – 0.6499(purpose is for radio/television)

4dbf2.PNG

Figure 2: The comparative output of the logistic regression prediction versus actual results.

When comparing the predictions to the actual values (Figure 2), the mean and minimum scores between both of them are similar.  However, all other values are not. When the prediction values are rounded to the nearest whole number the actual prediction rate is 73%.

K-means classification, on the 3 continuous variables: duration, amount, and installment.

In K-means classification the data is clustered by the mean Euclidean distance between their differences (Ahlemeyer-Stubbe & Coleman, 2014).  In this exercise, there are two clusters. Thus, the cluster size is 825 no defaults, 175 defaults, where the within-cluster sum of squares for between/total is 69.78%.  The matrix of cluster centers is shown below (Figure 3).

4dbf3

Figure 3: K means center values, per variable

Cross-validation with k = 5 for the nearest neighbor.

K-nearest neighbor (K =5) is when a data point is clustered into a group, by having 5 of the nearest neighbors vote on that data point, and it is particularly useful if the data is a binary or categorical (Berson, Smith, & Thearling, 1999).  In this exercise, the percentage of correct classifications from the trained and predicted classification is 69%.  However, logistic regression in this scenario was able to produce a much higher prediction rate of 73%, this for this exercise and this data set, logistic regression was quite useful in predicting the default rate than the k-nearest neighbor algorithm at k=5.

Code

#

## The German credit data contains attributes and outcomes on 1,000 loan applications.

## Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data

## Metadata file: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc

#

## Reading the data from source and displaying the top five entries.

credits=read.csv(“https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data”, header = F, sep = ” “)

head(credits)

#

##

### ———————————————————————————————————-

## The two outcomes are success (defaulting on the loan) and failure (not defaulting).

## The explanatory variables in the logistic regression are both the type of loan and the borrowing amount.

### ———————————————————————————————————-

##

#

## Defining and re-leveling the variables (Taddy, n.d.)

default = credits$V21 – 1 # set default true when = 2

amount = credits$V5

purpose = factor(credits$V4, levels = c(“A40″,”A41″,”A42″,”A43″,”A44″,”A45″,”A46″,”A48″,”A49″,”A410”))

levels(purpose) = c(“newcar”, “usedcar”, “furniture/equip”, “radio/TV”, “apps”, “repairs”, “edu”, “retraining”, “biz”, “other”)

## Create a new matrix called “cred” with the 8 defined variables (Taddy, n.d.)

credits$default = default

credits$amount  = amount

credits$purpose = purpose

cred = credits[,c(“default”,”amount”,”purpose”)]

head(cred[,])

summary(cred[,])

## Create a design matrix, such that factor variables are turned into indicator variables

Xcred = model.matrix(default~., data=cred)[,-1]

Xcred[1:5,]

## Creating training and prediction datasets: Select 900 rows for esitmation and 100 for testing

set.seed(1)

train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing

xtrain = Xcred[train,]

xtest = Xcred[-train,]

ytrain = cred$default[train]

ytest = cred$default[-train]

## logistic regresion

datas=data.frame(default=ytrain,xtrain)

creditglm=glm(default~., family=binomial, data=datas)

summary(creditglm)

percentOfCorrect=100*(sum(ytest==round(testingdata$defaultPrediction))/100)

percentOfCorrect

## Predicting default from the test data (Alice, 2015; UCLA: Statistical Consulting Group., 2007)

testdata=data.frame(default=ytest,xtest)

testdata[1:5,]

testingdata=testdata[,2:11] #removing the variable default from the data matrix

testingdata$defaultPrediction = predict(creditglm, newdata=testdata, type = “response”)

results = data.frame(ytest,testingdata$defaultPrediction)

summary(results)

head(results,10)

#

##

### ———————————————————————————————————-

##  K-means classification, on the 3 continuous variables: duration, amount, and installment.

### ———————————————————————————————————-

##

#

install.packages(“class”)

library(class)

## Defining and re-leveling the variables (Taddy, n.d.)

default = credits$V21 – 1 # set default true when = 2

duration = credits$V2

amount = credits$V5

installment = credits$V8

## Create a new matrix called “cred” with the 8 defined variables (Taddy, n.d.)

credits$default = default

credits$amount  = amount

credits$installment = installment

credits$duration = duration

creds = credits[,c(“duration”,”amount”,”installment”,”default”)]

head(creds[,])

summary(creds[,])

## K means classification (R, n.b.a)

kmeansclass= cbind(creds$default,creds$duration,creds$amount,creds$installment)

kmeansresult= kmeans(kmeansclass,2)

kmeansresult$cluster

kmeansresult$size

kmeansresult$centers

kmeansresult$betweenss/kmeansresult$totss

#

##

### ———————————————————————————————————-

##  Cross-validation with k = 5 for the nearest neighbor. 

### ———————————————————————————————————-

##

#

## Create a design matrix, such that factor variables are turned into indicator variables

Xcreds = model.matrix(default~., data=creds)[,-1]

Xcreds[1:5,]

## Creating training and prediction datasets: Select 900 rows for esitmation and 100 for testing

set.seed(1)

train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing

xtrain = Xcreds[train,]

xtest = Xcreds[-train,]

ytrain = creds$default[train]

ytest = creds$default[-train]

## K-nearest neighbor clustering (R, n.d.b.)

nearestFive=knn(train = xtrain[,2,drop=F],test=xtest[,2,drop=F],cl=ytrain,k=5)

knnresults=cbind(ytest+1,nearestFive) # The addition of 1 is done on ytest because when cbind is applied to nearestFive it adds 1 to each value.

percentOfCorrect=100*(sum(ytest==nearestFive)/100)

References