Data Tools: Artificial Intelligence and Data Analytics

The three main ingredients for artificial intelligence are computers, data, and fast algorithms. Big data technologies can turn continuous streams of data into business intelligence and decision-making capabilities. So how does artificial intelligence influence data analytics?


Machine learning, a branch of Artificial Intelligence (AI), adds an intelligence layer to big data, handling ever-larger datasets and deriving patterns from them that even a team of data scientists would find challenging (Maycotte, 2014; Power, 2015). AI derives its insights not from how the machines are programmed but from how the machines perceive the data and take actions based on that perception, essentially conducting self-learning (Maycotte, 2014).  Understanding how a machine perceives a big dataset is a hard task, which also makes it hard to interpret the resulting final models (Power, 2015).  AI is even revolutionizing how we understand what intelligence is (Spaulding, 2013).

So what is intelligence?

At first, doing arithmetic was thought of as a sign of biological intelligence, until the invention of the digital computer. The marker of biological intelligence then shifted to logical reasoning, deduction, and inference, and eventually to fuzzy logic, grounded learning, and reasoning under uncertainty, which is now matched through Bayesian networks and current data analytics (Spaulding, 2013). As humans keep moving the dial on what biological intelligence is toward ever more complex structures, whatever requires high-frequency and voluminous data can be matched by AI (Goldbloom, 2016).  Therefore, as our definition of intelligence expands, so will the need to capture intelligence artificially, driving change in how big datasets are analyzed.

AI's influence on the future of data analytics modeling, results, and interpretation

This concept should help revolutionize how data scientists and statisticians think about which hypotheses to test, which variables are relevant, how the resulting outputs fit into an appropriate conceptual model, and why the patterns hidden in the data help generate the decision outcomes forecasted by AI (Power, 2015). Making sense of these models requires subject matter experts from multiple fields and multiple levels of the employment hierarchy analyzing the model outputs, because it is through diversity and inclusion of thought that we will understand an AI's analytical insights.

Also, owning data is different from understanding data (Lapowsky, 2014). AI can make use of data hidden in “dark wells” and silos, data the end-user never knew existed, which allows a data scientist to gain a better understanding of their datasets (Lapowsky, 2014; Power, 2015).

AI's role in generating datasets and using data analytics for self-improvement

Data scientists currently collect, preprocess, process, and analyze big volumes of data regularly to provide decision makers with insights from the data so they can make data-driven decisions (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).  From these data-driven decisions, data scientists then measure the outcomes to prove the effectiveness of their insights (Maycotte, 2014).  Analyzing the results of those data-driven decisions allows machine learning algorithms to learn from their own decisions and actions and create better ways of searching for key patterns in bigger and future datasets. This is AI's ability to conduct self-learning on the results of data analytics, through the use of data analytics (Maycotte, 2014). Meetoo (2016) stated that if there is enough data to create accurate rules, there is enough to create insights, because machine learning can run millions of simulations against itself to generate huge volumes of data from which to learn.
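As a rough illustration of this feedback loop, the sketch below (my own, not from the cited authors) refits a model on an accumulating log of decisions and their measured outcomes; the column names and the synthetic data are hypothetical.

```python
# Minimal sketch, assuming scikit-learn: a model is periodically refit on the
# measured outcomes of earlier data-driven decisions. Data and columns are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def retrain_on_outcomes(history, feature_cols, outcome_col="outcome"):
    """Refit a model on the accumulated history of decisions and their measured outcomes."""
    X_train, X_test, y_train, y_test = train_test_split(
        history[feature_cols], history[outcome_col], test_size=0.2, random_state=0)
    model = GradientBoostingClassifier().fit(X_train, y_train)
    print("held-out accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
    return model

# Synthetic stand-in for a growing log of decisions and their measured outcomes.
rng = np.random.default_rng(0)
history = pd.DataFrame({"signal_a": rng.normal(size=500),
                        "signal_b": rng.normal(size=500)})
history["outcome"] = (history["signal_a"] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Each analysis cycle, newly measured outcomes are appended and the model is refit,
# so the algorithm "learns from its decisions" as described above.
model = retrain_on_outcomes(history, ["signal_a", "signal_b"])
```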

AI and the Data Analytics Process

AI is a result of the massive amounts of data being collected, the culmination of ideas from the most brilliant computer scientists of our time, and an IT infrastructure that did not exist a few years ago (Power, 2015).  Given that the data analytics process includes collecting data, preprocessing data, processing data, and analyzing the results, any infrastructure improvement made for AI can influence any part of that process (Fayyad et al., 1996; Power, 2015).  For example, as AI technology learns how to turn raw data into information, the need for many of the current preprocessing techniques for data cleaning could disappear (Minelli, Chambers, & Dhiraj, 2013). Therefore, as AI advances, newer IT infrastructures will be dreamt up and built, data analytics and its processes will leverage that new infrastructure, and the way big datasets are analyzed will change yet again.
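For reference, much of today's manual preprocessing looks like the short pandas sketch below; the tiny dataset and column names are made up purely for illustration.

```python
# Minimal sketch of the collect -> preprocess -> process -> analyze flow.
# The records and column names are hypothetical stand-ins for collected raw data.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, None, 3],
    "amount": ["10.5", "10.5", "7", "3.2", "bad"],
})

clean = (raw.drop_duplicates()                       # preprocess: remove duplicate records
            .dropna(subset=["customer_id"])          # drop rows missing the key field
            .assign(amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"))
            .fillna({"amount": 0.0}))                # coerce types and impute missing amounts

summary = clean.groupby("customer_id")["amount"].sum()   # process
print(summary)                                            # analyze the results
```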


Data Tools: Artificial Intelligence and Decision Making

Decision making is related to reasoning: you choose between different alternatives for what you want to do, and the intuitive notion of human free will in choosing between options shapes that reasoning. So should artificial intelligence be used as a supporting tool for decision makers or as their replacement?

“Machines can excel at frequent high-volume tasks. Humans can tackle novel situations.” – Anthony Goldbloom

Jobs today will look drastically different 30 years from now (Goldbloom, 2016; McAfee, 2013).  Artificial intelligence (AI) works on Sundays, takes no holidays, and excels at high-frequency, voluminous tasks, so it has the potential to replace many of the current jobs of 2016 (Goldbloom, 2016; Meetoo, 2016).  AI has been doing things machines have never done before: understanding, speaking, hearing, seeing, answering, writing, and analyzing (McAfee, 2013). AI can also make use of data hidden in “dark wells” and silos, data the end-user never knew existed (Power, 2015). Eventually, AI and machine learning will be commonly used as tools to augment or replace decision makers.  Goldbloom (2016) gave the example that a teacher may read 10,000 essays, or an ophthalmologist may see 50,000 eyes, over a 40-year career, whereas a machine can read millions of essays or see millions of eyes in minutes.

Machine learning is one of the most powerful branches of AI: machines learn from data, similar to how humans learn, to create predictions of the future (Cringely, 2013; Cyranoski, 2015; Goldbloom, 2016; Power, 2015). It would take many scientists, analyzing a big dataset in its entirety without any loss of memory, to gain the same insights and to fully understand how the connections were made in the AI system (Cringely, 2013; Goldbloom, 2016). This is no easy task, because the eerily accurate rules AI creates out of thousands of variables can lack substantive human meaning, making it hard to interpret the results and make an informed data-driven decision (Power, 2015).

AI has already been used to solve problems in industry and academia, which has given data scientists knowledge of its current limitations and of whether it can augment or replace key decision makers (Cyranoski, 2015; Goldbloom, 2016). Machine learning and AI do well at analyzing patterns in frequent and voluminous data at speeds faster than humans, but they fail to recognize patterns in infrequent and small amounts of data (Goldbloom, 2016).  Therefore, for small datasets artificial intelligence will not be able to replace decision makers, but for big datasets it could.

Thus, the fundamental question decision makers need to ask is how much of the decision reduces to frequent, high-volume tasks and how much of it reduces to novel situations (Goldbloom, 2016).  If the ratio is skewed toward high-volume tasks, then AI could be a candidate to replace decision makers; if the ratio is evenly split, then AI could augment and assist decision makers; and if the ratio is skewed toward novel situations, then AI would not help decision makers much.  These novel situations are equivalent to our tough challenges today (McAfee, 2013).

Finally, Meetoo (2016) warned that it does not matter how intelligent or strategic a job is: if there is enough data on that job to create accurate rules, it can be automated too, because machine learning can run millions of simulations against itself to generate huge volumes of data from which to learn.  This is no different from humans doing self-study and continuous practice to become subject matter experts in their fields. But a background in STEAM (Science, Technology, Engineering, Arts, and Math) will best equip people for a future world with AI, because it is from knowing how to combine these fields that the novel, infrequent, and unique challenges will arise that humans can solve and machine learning cannot (Goldbloom, 2016; McAfee, 2013; Meetoo, 2016).


Data Tools: Artificial Intelligence

Analyzing large data sets requires developing and applying complex algorithms. As data sets become larger, it becomes more difficult for any skilled individual to make sense of it all.

Big data Analytics and Artificial Intelligence

Artificial Intelligence (AI) is an embedded technology, built on top of the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). Previously, AI could not come into existence, because the computational power available today did not yet exist (Cringely, 2013).  AI can make use of data hidden in “dark wells” and silos, data the end-user never knew existed (Power, 2015).  The goal of AI is to use huge amounts of data to draw out, through machine learning, a set of rules that can effectively replace experts in a certain field (Cringely, 2013; Power, 2015). Cringely (2013) stated that in some situations big data can eliminate the need for theory, and that AI can aid in analyzing big data where theory is either lacking or impossible to define.

AI can provide tremendous value, since it builds thousands of models and correlations automatically in one week, work that used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015).  One thing that slowed the progression of AI in the past was the reliance on human-readable computer languages like XML or SQL, which are not intuitive for computers to read (Cringely, 2013).  Fortunately, AI can easily use structured data and can now also use unstructured data, thanks to everyone who tags that unstructured data either in comments or on the data points themselves, speeding up computation (Cringely, 2013; Power, 2015).  Dewey (2013) hypothesized that not only will AI be able to analyze big data at speeds faster than any human can, but that an AI system can also begin to improve its own search algorithms, a phenomenon called an intelligence explosion.  An intelligence explosion occurs when an AI system begins to analyze and improve itself in an iterative process, to the point where there is exponential growth in improvement (Dewey, 2013).
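The flavor of this automated model building can be hinted at with a small sketch; the following is an illustrative scikit-learn example (not the systems the cited authors describe) that fits and scores many candidate models without a human specifying each one.

```python
# Minimal sketch, assuming scikit-learn and a synthetic dataset: many candidate
# models are fit and scored automatically via cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
candidates = {
    "logistic": (LogisticRegression(max_iter=5000), {"C": [0.01, 0.1, 1, 10]}),
    "tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8, None]}),
    "forest": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 200]}),
}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5).fit(X, y)   # many fits per model family
    print(name, search.best_params_, round(search.best_score_, 3))
```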

Unfortunately, the rules AI creates out of 50,000 variables can lack substantive human meaning, the “why” behind them, making it hard to interpret the results (Power, 2015).  It would take many scientists analyzing the same big data in its entirety to fully understand how the connections were made in the AI system, which is no longer feasible (Cringely, 2013).  It is as if data scientists are trying to read the mind of the AI system, when they cannot even read a human's mind. However, the results of AI are becoming accurate: AI has identified cats in photographs after 72 hours of machine learning and after a cat was tagged in only a few photographs (Cringely, 2013). AI can be applied to almost any field of study, such as finance, social science, the natural sciences, and engineering, or can even play against champions on the Jeopardy game show (Cringely, 2013; Cyranoski, 2015; Dewey, 2013; Power, 2015).

Example of artificial intelligence use in big data analysis: Genomics

The goal of using AI on genomic data is to analyze physiological traits and lifestyle choices to provide a dedicated, personalized health plan to treat and eventually prevent disease (Cyranoski, 2015; Power, 2015).  This is done by feeding the AI systems huge amounts of genomic data, which is considered big data by today's standards (Cyranoski, 2015). Systems like IBM's Watson (an AI system) could provide treatment options based on the results of analyzing thousands or even millions of genomic records (Power, 2015).  This works by analyzing all this data and allowing machine learning techniques to devise algorithms based on the input data (Cringely, 2013; Cyranoski, 2015; Power, 2015).  As of 2015, there are about 100,000 individual genomes in the system, and even this huge amount of data is still not enough to provide the personalized health plan currently being envisioned based on a person's genomic data (Cyranoski, 2015).  Eventually, millions of individuals will need to be added to the AI system, and not just their genomic data, but also proteomics, metabolomics, lipidomics, and more.


Data tools: Analysis of big data involving text mining

The demand for big data talent is growing, and there is a shortage of data analytics talent in the United States. Because data analytics is used by many different industries and is an interdisciplinary field, learning and teaching it requires careful planning. This post discusses how big data analytics was implemented in a given case study.

Definitions

Big data – any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).

Text mining – a process that involves discovering implicit knowledge from unstructured textual data (Gera & Goel, 2015; Hashimi & Hafez, 2015; Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2015).

Case study: Basole, Seuss, and Rouse (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics.

The goal of this study was to use text mining techniques on 472 quality, peer-reviewed articles spanning 30 years of knowledge (1977-2008).  The selection criteria were articles that focused on the adoption of IT innovation; focused on the enterprise, organization, or firm; used rigorous research methods; and were published in leading journals.  The reason for all this analysis was to prove the usefulness of text analytics for literature reviews.  In 2016, most literature reviews contain only recent literature from the last five years, and in certain fields it may not be sufficient to focus on just the last five years.  Extending the literature search beyond this 5-year period requires a great deal of attention and manual labor, which makes an already time-consuming endeavor even more so. So, the authors' question was whether it is possible to use text mining to conduct a more thorough review of the body of knowledge, one that extends beyond the typical five years on any subject matter.  They argue that this tedious task could benefit from automation.  However, the result should be thought of as a first pass through the literature; thinking of it as a first pass allows for the generation of new research questions and ideas, which drives further analysis.

In the end, the study concluded that cost and complexity were two of the most frequent determinants of IT innovation adoption from the perspective of an IT department.  Other determinants for IT departments were the complexity, capability, and relative advantage of the innovation.  However, when going up one level of abstraction to the enterprise/organizational level, the perceived benefits and usefulness were the main determinants of IT innovation adoption, and ease of use of the technology was a big deal for the organization.  When comparing IT innovation with cost, there was a negative correlation between the two, while IT innovation had a positive correlation with organization size and top management support.

How was big data analytics learned, taught, and used in the case study?

The research approach for this study was: (1) document identification and extraction, (2) document classification and coding, (3) document analysis and knowledge discovery (key terms, co-occurrence), and (4) research gap identification.

Analysis of the data consisted of classifying the articles into four time periods (bins): 1977-1979, 1980-1989, 1990-1999, and 2000-2008, and applying a classification scheme based on existing research-method taxonomies (case study, content analysis, field experiment, field study, frameworks and conceptual models, interview, laboratory experiment, literature analysis, mathematical model, qualitative research, secondary data, speculation/commentary, and survey).  The data were also classified by functional discipline (information systems and computer science, decision science, management and organization sciences, economics, and innovation) and finally by IT innovation type (software, hardware, networking infrastructure, and the tool's IT term catalog). This study used a tool called Northernlight (http://georgiatech.northernlight.com/).

The hope of this study was to use the bag-of-words technique and word proximity to other words (or their equivalents) to help extract meaning from a large set of text-based documents.  The bag-of-words technique counts and identifies key terms and phrases, which help uncover themes.  The simplest way of thinking about the bag-of-words technique is as word frequencies in a document.
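A minimal bag-of-words sketch, assuming scikit-learn and three made-up example "documents" (not drawn from Basole et al., 2013), might look like this:

```python
# Minimal sketch: term counting with a bag-of-words representation.
# The three short documents are invented purely for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "IT innovation adoption depends on cost and complexity",
    "Perceived usefulness drives enterprise adoption of IT innovation",
    "Cost, capability, and relative advantage shape IT adoption",
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)                    # term frequencies per document
totals = np.asarray(counts.sum(axis=0)).ravel()            # corpus-wide word frequencies
for term, freq in sorted(zip(vectorizer.get_feature_names_out(), totals),
                         key=lambda pair: -pair[1]):
    print(term, freq)                                      # most frequent terms suggest themes
```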

However, understanding the meaning behind the themes requires studying the context in which the words are located and relating them to other themes, also called co-occurrence of terms.  The best way of extracting this meaning is to measure the strength/distance between themes.  Finally, the researchers in this study can set minimums and maximums that enhance the meaning-extraction algorithm to garner insights into IT innovation while reducing the overall noise in the final results. The researchers set the following rules for co-occurrences between themes (a minimal co-occurrence sketch follows the list):

  • There are approximately 40 words per sentence
  • There are approximately 150 words per paragraph
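A minimal sketch of window-based co-occurrence counting, assuming a fixed word window that mirrors the 40-word sentence rule above; the theme list and sample text are hypothetical:

```python
# Minimal sketch: count how often pairs of themes appear within a fixed word
# window of each other. Window size, themes, and the sample text are illustrative.
from itertools import combinations
from collections import Counter

def co_occurrences(text, themes, window=40):
    """Count theme pairs whose occurrences fall within `window` words of each other."""
    words = text.lower().split()
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in themes}
    counts = Counter()
    for a, b in combinations(themes, 2):
        hits = sum(1 for i in positions[a] for j in positions[b] if abs(i - j) <= window)
        if hits:
            counts[(a, b)] = hits
    return counts

sample = "adoption of innovation is shaped by cost and complexity while cost drives adoption"
print(co_occurrences(sample, themes=["cost", "complexity", "adoption", "innovation"]))
```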

How could this implementation of big data have been improved upon?

Goldbloom (2016) stated that big data techniques (machine learning) work best on big data that requires classification and that they break down when the task is too small and specialized, making such tasks prime for human-only analysis.  This study only looked at 472 articles; is this big enough for such analysis, or should the analysis go back through more years than just the 30 covered (Basole et al., 2013)?  What is considered big data in 2013 (the time of this study) may not be big data in 2023 (Fox & Do, 2013).

Mei and Zhai (2005) observed how terms and term frequencies evolved over time and graphed them by year, rather than binning the data into four groups as in Basole et al. (2013).  This case study could likewise have shown how cost and complexity in IT innovation changed over time.  Graphing the results in the manner of Mei and Zhai (2005) and Yoon and Song (2014) would also allow an analysis of whether each IT innovation theme is in an introduction, growth, maturity, or decline phase.
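As a rough illustration of that alternative presentation, the sketch below plots theme frequencies by year; the yearly counts are invented for illustration and are not results from any of the cited studies.

```python
# Minimal sketch: plot theme frequency by publication year instead of binning
# into four periods. All counts below are made-up illustrative values.
import matplotlib.pyplot as plt

years = list(range(1990, 2009))
cost_freq = [3, 4, 4, 6, 7, 9, 8, 10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 20, 21]
complexity_freq = [2, 2, 3, 3, 5, 4, 6, 7, 7, 9, 8, 10, 11, 10, 12, 13, 12, 14, 15]

plt.plot(years, cost_freq, label="cost")
plt.plot(years, complexity_freq, label="complexity")
plt.xlabel("Publication year")
plt.ylabel("Term frequency")
plt.legend()
plt.title("Evolution of IT-innovation themes over time (illustrative data)")
plt.show()
```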

References

  • Basole, R. C., Seuss, C. D., & Rouse, W. B. (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics. Decision Support Systems, 54, 1044-1054. Retrieved from http://www.sciencedirect.com.ctu.idm.oclc.org/science/article/pii/S0167923612002849
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Goldbloom, A. (2016). The jobs we’ll lose to machines –and the ones we won’t. TED. Retrieved from http://www.ted.com/talks/anthony_goldbloom_the_jobs_we_ll_lose_to_machines_and_the_ones_we_won_t
  • Hashimi, H., & Hafez, A. (2015). Selection criteria for text mining approaches. Computers in Human Behavior, 51, 729–733. http://doi.org/10.1016/j.chb.2014.10.062
  • Mei, Q., & Zhai, C. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 198–207. http://doi.org/10.1145/1081870.1081895
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2015). Text-mining of news-headlines for FOREX market prediction: a multi-layer dimension reduction algorithm with semantics and sentiment. Expert Systems with Applications, 42(1), 306-324.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Yoon, B., & Song, B. (2014). A systematic approach of partner selection for open innovation. Industrial Management & Data Systems, 114(7), 1068.

Adv Quant: Compelling Topics

A discussion of the most compelling topics learned in the subject of Advanced Quantitative Analysis.

Compelling topics summary/definitions

  • Supervised machine learning algorithms: models that need training and testing data sets and that validate their predictions against some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014).
  • Unsupervised machine learning algorithms: models that also need training and testing data sets but, unlike supervised learning, do not validate against a predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Therefore, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014).
  • General Linear Model (GLM): the line of best fit for linear regression modeling, along with its corresponding correlations (Smith, 2015). There are five assumptions for a linear regression model: additivity, linearity, independent errors, homoscedasticity, and normally distributed errors.
  • Overfitting: stuffing a regression model with so many variables that contribute little toward predicting the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014). To avoid the overfitting problem, parsimony is important in big data analytics.
  • Parsimony: describing a dependent variable with as few independent variables as possible (Field, 2013; Huck, 2013; Smith, 2015). The concept of parsimony could be attributed to Occam’s razor, which states that “plurality ought never be posited without necessity” (Duignan, 2015).  Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models.
  • Hierarchical Regression: when building a multivariate regression model, the researcher builds it in stages, adding known independent variables first and newer independent variables later in order to avoid overfitting (Austin, Goel, & van Walraven, 2001; Field, 2013; Huck, 2013).
  • Logistic Regression: a multi-variable regression in which one or more continuous or categorical independent variables are used to predict a dichotomous/binary/categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).
  • Nearest Neighbor Methods: k-nearest neighbor (e.g., k = 5) assigns a data point to a group by having its 5 nearest neighbors vote on that data point; it is particularly useful if the data are binary or categorical (Berson, Smith, & Thearling, 1999).
  • Classification Trees: aid in data abstraction and in finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014), and aid the decision-making process by mapping out all the paths, solutions, or options available to the decision maker.
  • Bayesian Analysis: can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).
  • Discriminant Analysis: determines how data should best be separated into several groups, based on the independent variables that create the largest separation in the prediction (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013).
  • Ensemble Models: can perform better than a single classifier, since they combine multiple classifiers with weights attached to them to properly classify new data points (Bauer & Kohavi, 1999; Dietterich, 2000), through techniques like bagging and boosting. Boosting procedures help reduce both the bias and the variance of the different methods, while bagging procedures reduce only the variance (Bauer & Kohavi, 1999; Liaw & Wiener, 2002). A brief sketch comparing several of these classifiers follows this list.
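The sketch below is an illustrative comparison (my own, on a toy scikit-learn dataset) of several of the classifiers summarized above: logistic regression, k-nearest neighbors, a classification tree, and a bagged ensemble, scored by cross-validated accuracy.

```python
# Minimal sketch, assuming scikit-learn: compare several of the classifiers
# defined above by 5-fold cross-validated accuracy on a built-in toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "k-nearest neighbors (k=5)": KNeighborsClassifier(n_neighbors=5),
    "classification tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "bagged trees (ensemble)": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```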

 

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health, 92(2), 150.
  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2), 105-139.
  • Berson, A. Smith, S. & Thearling K. (1999). Building Data Mining Applications for CRM. McGraw-Hill. Retrieved from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.
  • Duignan, B. (2015). Occam’s razor. Encyclopaedia Britannica. Retrieved from https://www.britannica.com/topic/Occams-razor
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
  • Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
  • Hubbard, D. W. (2010). How to measure anything: Finding the value of “intangibles” in business (2nd ed.). New Jersey: John Wiley & Sons, Inc.
  • Huck, S. W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
  • Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Spiegelhalter, D. & Rice, K. (2009) Bayesian statistics. Retrieved from http://www.scholarpedia.org/article/Bayesian_statistics
  • Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.
  • Yudkowsky, E.S. (2003). An intuitive explanation of Bayesian reasoning. Retrieved from http://yudkowsky.net/rational/bayes

Adv Quant: Statistical Significance and Machine Learning

Data mining and analytics are used to test hypotheses and detect trends in very large datasets. In statistics, significance is determined to some extent by the sample size. How can supervised learning be used on such large data sets to overcome the problem that, with statistical analysis, everything becomes significant?

With large sample sizes, even small differences can show up as statistically significant, while in smaller samples even large differences may be deemed statistically insignificant (Field, 2014).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011). Statistical analysis is highly deductive (Creswell, 2014), whereas supervised learning is highly inductive (Connolly & Begg, 2014).  Also, statistical analysis tries to identify trends in a given sample by assuming normality, linearity, or constant variance, whereas machine learning aims to find patterns in a large sample of data where these statistical assumptions are expected not to hold, and therefore it requires a larger random sample (Ahlemeyer-Stubbe & Coleman, 2014).
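A small simulated example (illustrative only) shows how sample size drives significance: the same tiny mean difference is far from significant at n = 100 but highly significant at n = 100,000.

```python
# Minimal sketch: the same 0.02-unit mean difference yields very different
# p-values at n = 100 versus n = 100,000 (simulated data for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for n in (100, 100_000):
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.02, scale=1.0, size=n)   # a practically trivial difference
    t, p = stats.ttest_ind(a, b)
    print(f"n = {n:>7}: p-value = {p:.4f}")
```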

Machine learning tries to emulate the way humans learn. When humans learn, they create a model based on observations that describes the key features of a situation and helps them predict an outcome; machine learning does predictive modeling of large data sets in a similar fashion (Connolly & Begg, 2014).  The biggest selling point of supervised machine learning is that the machine can build models that identify key patterns in data when humans can no longer cope with its volume, velocity, and variety (Ahlemeyer-Stubbe & Coleman, 2014). Many applications use machine learning: marketing, investments, fraud detection, manufacturing, telecommunications, etc. (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Figure 1 illustrates how supervised learning can classify data or predict values through a two-phase process.  The two-phase process consists of (1) training, where the model is built by ingesting huge amounts of historical data; and (2) testing, where the new model is run on new, current data to establish its accuracy, reliability, and validity (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). The models that machines create through this learning adapt quickly to new data (Minelli, Chambers, & Dhiraj, 2013).  The models themselves are sets of rules or formulas, depending on which analytical algorithm is used (Ahlemeyer-Stubbe & Coleman, 2014).  Given that supervised machine learning is trained with known responses (or outputs) to make its future predictions, it is vital to have a clear purpose defined before running the algorithm; the model is only as good as the data that goes into it.


Figure 1:  Simplified process diagram on supervised machine learning.
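A minimal sketch of this two-phase process, assuming scikit-learn and a built-in toy dataset, might look like the following:

```python
# Minimal sketch of the two-phase process in Figure 1: train on historical data,
# then test on held-out data to estimate accuracy (toy dataset for illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Phase 1 (training): the model is built from "historical" (training) data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Phase 2 (testing): the model is evaluated on new, unseen data.
print("test accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```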

Thus, for classification the machine learns a function that maps data into one or more defining categories, which can be done with decision trees and neural network induction techniques (Connolly & Begg, 2014; Fayyad et al., 1996).  Fayyad et al. (1996) noted that it is often impossible to classify data cleanly into one camp versus another. For value prediction, regression is used to fit a function to the data that gives an estimate of where the next value would be (Connolly & Begg, 2014; Fayyad et al., 1996).  However, with these regression formulas it is good to remember that correlation between the data/variables does not imply causation.

Random sampling is core to statistics and the concept of statistical inference (Smith, 2015; Field, 2011), but it also serves a purpose in supervised learning (Ahlemeyer-Stubbe & Coleman, 2014).  Random sampling means selecting a proportion of the data from a population in which each data point has an equal chance of being selected (Smith, 2015; Huck, 2013). The larger the sample, the better it tends, on average, to represent the population (Field, 2014; Huck, 2013). Given the nature of big data, with its high volume, velocity, and variety, it is assumed that there is plenty of data to draw upon when running a supervised machine learning algorithm.  However, feeding too much data into the machine learning algorithm increases processing and analysis time: the bigger the random sample used for learning, the longer it takes to process and analyze the data.
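A minimal sketch of simple random sampling, using simulated data, shows each point having an equal chance of selection and larger samples tracking the population mean more closely:

```python
# Minimal sketch: simple random sampling from a simulated "big" population,
# comparing sample means to the population mean at several sampling fractions.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10.0, size=1_000_000)   # simulated population

for frac in (0.001, 0.01, 0.1):
    sample = rng.choice(population, size=int(frac * population.size), replace=False)
    print(f"sample fraction {frac:>5}: mean = {sample.mean():.3f} "
          f"(population mean = {population.mean():.3f})")
```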

There are also unsupervised learning algorithms, which likewise require training and testing but, unlike supervised learning, do not validate the model against some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014).   Therefore, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014).  Cluster analysis is an example of unsupervised learning, where the model seeks a finite set of clusters that describe the data as subsets of similar points (Ahlemeyer-Stubbe & Coleman, 2014; Fayyad et al., 1996). Finally, in supervised learning the results can be checked through estimation error; this is not so easy in unsupervised learning because there is no target, so it requires retesting to see whether the patterns are similar or repeatable (Ahlemeyer-Stubbe & Coleman, 2014).
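A minimal sketch of unsupervised learning, assuming scikit-learn's k-means on synthetic data, groups unlabeled points into clusters with no predetermined output to validate against:

```python
# Minimal sketch: k-means clustering finds natural groupings in unlabeled data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)   # true labels are ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
print("cluster centers:\n", kmeans.cluster_centers_.round(2))
```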

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining, 17(3), 37–54.
  • Field, A. (2011) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2013) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Minelli, M., Chambers, M., Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses, 1st Edition. [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html