Adv Quant: Ensemble Classifiers and RandomForests

A discussion on creating ensembles from different methods such as logistic regression, nearest neighbor methods, classification trees, Bayesian, or discriminate analysis and a discussion on the use of RandomForest to do analysis.

Advertisements

Ensembles classifiers can perform better than a single classifier since they are created as a combination of classifiers that have a weight attached to them to properly classify new data points (Bauer & Kohavi, 1999; Dietterich, 2000).  The ensemble classifier can include methods such as:

  • Logistic Regression: multi-variable regression, where one or more independent variables are continuous or categorical which are used to predict a dichotomous/ binary/ categorical dependent variable (Ahlemeyer-Stubbe, & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).
  • Nearest Neighbor Methods: K-nearest neighbor (i.e. K =5) is when a data point is clustered into a group, by having 5 of the nearest neighbors vote on that data point, and it is particularly useful if the data is a binary or categorical (Berson, Smith, & Thearling, 1999).
  • Classification Trees: aid in data abstraction and finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Conolly & Begg, 2014) and aid the decision-making process by mapping out all the paths, solutions, or options available for the decision maker to decide upon.
  • Bayesian Analysis: can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).
  • Discriminate Analysis: how should data be best separated into several groups based on several independent variables that create the largest separation of the prediction (Ahlemeyer-Stubbe, & Coleman, 2014; Field, 2013).

As mentioned above, the ensemble classifier can create weights for each classifier to help improve the accuracy of the total “ensemble classifier result,” through boosting and bagging procedures.  Boosting procedures help reduce both bias and variance of the different methods, and bagging procedures reduce just the variance of the different methods (Bauer & Kohavi, 1999; Liaw & Wiener, 2002).

  • Boosting: helps boost weak classifying algorithms done serially in systems, to force a reduction in the expected error (Bauer & Kohavi, 1999). The reason why this algorithm is done serially is that the classifier done previously had voted on the variables previously, and that vote is taken into account in this next classifier prediction (Liaw & Wiener, 2002)
  • Bagging (Bootstrap aggregating): assigns values to classifiers which are created from different uniform samples from the training data set with replacement, which is computed in parallel because they don’t depend on other classifiers’ votes to run the next classification prediction (Bauer & Kohavi, 1999; Liaw & Wiener, 2002). This is also known as an averaging method or a random forest (Ahlemeyer-Stubbe & Coleman, 2014).

Random Forest

According to Ahlemeyer-Stubbe and Coleman (2014), random forests are multiple decision trees conducted from selecting multiple random samples from the same data set (either through resampled or disjoint sampling), and the variables that appear more frequently in the forest adds more confidence that this variable has a real influence on the dependent variable.  Liaw and Wiener (2002) affirmed this by stating not only does a variable that frequently appears among many trees in the forest add more confidence in its influence, but also can help determine its proximity to the root node.  Random forests add a new level of randomness to bagging algorithms and is robust against over fitting which is a problem with some decision trees algorithms (Ahlemeyer-Stubbe & Coleman, 2014; Liaw & Wiener, 2002).

The use of random forests is most helpful when relationships between the variables are weak or if there is very little data available (Ahlemeyer-Stubbe and Coleman, 2014).  Also, it is worth considering that the numbers of trees needed to achieve great performance increases as the number of variables under consideration increases (Liaw & Wiener, 2002). To learn how to run random forests algorithms in the statistical programming language R, Liaw and Wiener (2002) shared some of their coding syntax as well as observations on how to effectively meet the objectives.

References:

  • Ahlemeyer-Stubbe, Andrea, Shirley Coleman. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine learning,36(1-2), 105-139.
  • Berson, A. Smith, S. & Thearling K. (1999). Building Data Mining Applications for CRM. McGraw-Hill. Retrieved from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.
  • Field, Andy. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].