Adv Quant: Statistical Significance and Machine Learning

Data mining and analytics are used to test hypotheses and detect trends from very large datasets. In statistics, the significance is determined to some extent by the sample size. How can supervised learning be used in such large data sets to overcome the problem where everything is significant with statistical analysis?

Advertisements

Statistical significance on large samples sizes can be affected by small differences and can show up as significant, while in smaller samples large differences may be deemed statistically insignificant (Field, 2014).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011). Statistical analysis is highly deductive (Creswell, 2014), and supervised learning is highly inductive (Connolly & Begg, 2014).  Also, statistical analysis tries to identify trends in a given sample size by assuming normality, linearity or constant variance; whereas in machine learning it aims to find a pattern in a large sample of data and it is expected that these statistical analysis assumptions are not met and therefore require a higher random sampling set (Ahlemeyer-Stubbe, & Coleman, 2014).

Machine learning tries to emulate the way humans learn. When humans learn, they create a model based off of observations to help describe key features of a situation and help them predict an outcome, and thus machine learning does predictive modeling of large data sets in a similar fashion (Connolly & Begg, 2014).  The biggest selling point of supervised machine learning is that the machine can build models that identify key patterns in the data when humans can no longer compute the volume, velocity, and variety of the data (Ahlemeyer-Stubbe, & Coleman, 2014). There are many applications that use machine learning: marketing, investments, fraud detection, manufacturing, telecommunication, etc. (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Figure 1 illustrates how supervised learning can classify data or predict their values through a two-phase process.  The two-phase process consists of (1) training where the model is built by ingesting huge amounts of historical data; and (2) testing where the new model is tested on new current data that helps establish its accuracy, reliability, and validity (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). The model that is created by machines through this learning is quickly adaptable to new data (Minelli, Chambers, & Dhiraj, 2013).  These models themselves are a set of rules or formulas, and that depends on which analytical algorithm is used (Ahlemeyer-Stubbe & Coleman, 2014).  Given that the supervised machine learning is trained with known responses (or outputs) to make its future predictions, it is vital to have a clear purpose defined before running the algorithm.  The model is only as good as the data that goes in it.

U1db2F1.PNG

Figure 1:  Simplified process diagram on supervised machine learning.

Thus, for classification the machine is learning a function to map data into one or many different defining characteristics, and it could consist of decision trees and neural network induction techniques (Connolly & Begg, 2014; Fayyad et al., 1996).  Fayyad et al. (1996) mentioned that it is impossible to classify data cleanly into one camp versus another. For value prediction, regression is used to map a function to the data that when followed gives an estimate on where the next value would be (Connolly & Begg, 2014; Fayyad et al. 1996).  However, in these regression formulas, it is good to remember that correlation between the data/variables does not imply causation.

Random sampling is core to statistics and the concept of statistical inference (Smith, 2015; Field, 2011), but it also serves a purpose in supervised learning (Ahlemeyer-Stubbe & Coleman, 2014).  Random sampling of data, is selecting a proportion of the data from a population, where each data point has an equal opportunity of being selected (Smith, 2015; Huck, 2013). The larger the sample, on average tends to represent the population fairly well (Field, 2014; Huck, 2013). Given nature big data, high volume, velocity, and variety, it is assumed that there is plenty of data to draw upon and run a supervised machine learning algorithm.  However, too much data that is fed into the machine learning algorithm can increase the process and analysis time.  Also, the bigger the random sampling size used for the learning, the more time it would take to process and analyze the data.

There are also unsupervised learning algorithms, where it also needs training and testing, but unlike supervised learning, it doesn’t need to validate its model on some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014, Conolly & Begg, 2014).   Therefore, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014).  Cluster analysis is an example of unsupervised learning, where the model seeks to find a finite set of the cluster that can help describe the data into subsets of similarities (Ahlemeyer-Stubbe & Coleman, 2014, Fayyad et al., 1996). Finally, in supervised learning the results could be checked through estimation error; however it is not so easy with unsupervised learning because of a lack of a target but requires retesting to see if the patterns are similar or repeatable (Ahlemeyer-Stubbe & Coleman, 2014).

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining, 17(3), 37–54.
  • Field, A. (2011) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2013) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Minelli, M., Chambers, M., Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses, 1st Edition. [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html

Quant: Compelling topics

A discussion on what were the most compelling topics learned in the subject of Quantitative Analysis.

Most Compelling Topics

Field (2013) states that both quantitative and qualitative methods are complimentary at best, none competing approaches to solving the world’s problems. Although these methods are quite different from each other. Simply put, quantitative methods are utilized when the research contains variables that are numerical, and qualitative methods are utilized when the research contains variables that are based on language (Field, 2013).  Thus, central to quantitative research and methods is to understand the numerical, ordinal, or categorical dataset and what the data represents. This can be done through either descriptive statistics, where the researcher uses statistics to help describe a data set, or it can be done through inferential statistics, where conclusions can be drawn about the data set (Miller, n.d.).

Field (2013) and Schumacker (2014), defined central tendency as an all-encompassing term to help describe the “center of a frequency distribution” through the commonly used measures mean, median, and mode.  Outliers, missing values, and multiplication of a constant, and adding a constant are factors that affect the central tendency (Schumacker, 2014).  Besides just looking at one central tendency measure, researchers can also analyze the mean and median together to understand how skewed the data is and in which direction.  Heavily skewed distributions would heavily increase the distance between these two values, and if the mean less than the median the distribution is skewed negatively (Field, 2013).  To understand the distribution, better other measures like variance and standard deviations could be used.

Variance and standard deviations are considered as measures of dispersion, where the variance is considered as measures of average dispersion (Field, 2013; Schumacker, 2014).  Variance is a numerical value that describes how the observed data values are spread across the data distribution and how they differ from the mean on average (Huck, 2011; Field, 2013; Schumacker, 2014).  The smaller the variance indicates that the observed data values are close to the mean and vice versa (Field, 2013).

Rarely is every member of the population studied, and instead a sample from that population is randomly taken to represent that population for analysis in quantitative research (Gall, Gall, & Borg 2006). At the end of the day, the insights gained from this type of research should be impersonal, objective, and generalizable.  To generalize the results of the research the insights gained from a sample of data needs to use the correct mathematical procedures for using probabilities and information, statistical inference (Gall et al., 2006).  Gall et al. (2006), stated that statistical inference is what dictates the order of procedures, for instance, a hypothesis and a null hypothesis must be defined before a statistical significance level, which also has to be defined before calculating a z or t statistic value.  Essentially, a statistical inference allows for quantitative researchers to make inferences about a population.  A population, where researchers must remember where that data was generated and collected from during quantitative research process.

Most flaws in research methodology exist because the validity and reliability weren’t established (Gall et al., 2006). Thus, it is important to ensure a valid and reliable assessment instrument.  So, in using any existing survey as an assessment instrument, one should report the instrument’s: development, items, scales, reports on reliability, and reports on validity through past uses (Creswell, 2014; Joyner, 2012).  Permission must be secured for using any instrument and placed in the appendix (Joyner, 2012).  The validity of the assessment instrument is key to drawing meaningful and useful statistical inferences (Creswell, 2014).

Through sampling of a population and using a valid and reliable survey instrument for assessment, attitudes and opinions about a population could be correctly inferred from the sample (Creswell, 2014).  Sometimes, a survey instrument doesn’t fit those in the target group. Thus it would not produce valid nor reliable inferences for the targeted population. One must select a targeted population and determine the size of that stratified population (Creswell, 2014).

Parametric statistics, are inferential and based on random sampling from a distinct population, and that the sample data is making strict inferences about the population’s parameters, thus tests like t-tests, chi-square, f-tests (ANOVA) can be used (Huck, 2011; Schumacker, 2014).  Nonparametric statistics, “assumption-free tests”, is used for tests that are using ranked data like Mann-Whitney U-test, Wilcoxon Signed-Rank test, Kruskal-Wallis H-test, and chi-square (Field, 2013; Huck, 2011).

First, there is a need to define the types of data.  Continuous data is interval/ratio data, and categorical data is nominal/ordinal data.  Modified from Schumacker (2014) with data added from Huck (2011):

Statistic Dependent Variable Independent Variable
Analysis of Variance (ANOVA)
     One way Continuous Categorical
t-Tests
     Single Sample Continuous
     Independent groups Continuous Categorical
     Dependent (paired groups) Continuous Categorical
Chi-square Categorical Categorical
Mann-Whitney U-test Ordinal Ordinal
Wilcoxon Ordinal Ordinal
Kruskal-Wallis H-test Ordinal Ordinal

So, meaningful results get reported and their statistical significance, confidence intervals and effect sizes (Creswell, 2014). If the results from a statistical test have a low probability of occurring by chance (5% or 1% or less) then the statistical test is considered significant (Creswell, 2014; Field, 2014; Huck, 2011Statistical significance test can have the same effect yet result in different values (Field, 2014).  Statistical significance on large samples sizes can be affected by small differences and can show up as significant, while in smaller samples large differences may be deemed insignificant (Field, 2014).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011).  Huck (2011) stated two main factors that could influence whether or not a result is statistically significant is the quality of the research question and research design.

Huck (2011) suggested that after statistical significance is calculated and the research can either reject or fail to reject a null hypothesis, effect size analysis should be conducted.  The effect size allows researchers to measure objectively the magnitude or practical significance of the research findings through looking at the differential impact of the variables (Huck, 2011; Field, 2014).  Field (2014), defines one way of measuring the effect size is through Cohen’s d: d = (Avg(x1) – Avg(x2))/(standard deviation).  If d = 0.2 there is a small effect, d = 0.5 there is a moderate effect, and d = 0.8 or more there is a large effect (Field, 2014; Huck, 2011). Thus, this could be the reason why a statistical test could yield a statistically significant value, but further analysis with effect size could show that those statistically significant results do not explain much of what is happening in the total relationship.

In regression analysis, it should be possible to predict the dependent variable based on the independent variables, depending on two factors: (1) that the productivity assessment tool is valid and reliable (Creswell, 2014) and (2) we have a large enough sample size to conduct our analysis and be able to draw statistical inference of the population based on the sample data which has been collected (Huck, 2011). Assuming these two conditions are met, then regression analysis could be made on the data to create a prediction formula. Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011).

When modeling predict the dependent variable based upon the independent variable the regression model with the strongest correlation will be used as it is that regression formula that explains the variance between the variables the best.   However, just because the regression formula can predict some or most of the variance between the variables, it will never imply causation (Field, 2013).  Correlations help define the strength of the regression formula in defining the relationships between the variables, and can vary in value from -1 to +1.  The closer the correlation coefficient is to -1 or +1; it informs the researcher that the regression formula is a good predictor of the variance between the variables.  The closer the correlation coefficient is to zero, indicates that there is hardly any relationship between the variable (Field, 2013; Huck, 2011; Schumacker, 2014).  It should never be forgotten that correlation doesn’t imply causation, but can help determine the percentage of the variances between the variables by the regression formula result, when the correlation value is squared (r2) (Field, 2013).

 

References:

  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Gall, M. D., Gall, J., & Borg W. (2006). Educational research: An introduction (8th ed.). Pearson Learning Solutions. VitalBook file.
  • Huck, S. W. (2011) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Joyner, R. L. (2012) Writing the Winning Thesis or Dissertation: A Step-by-Step Guide (3rd ed.). Corwin. VitalBook file.
  • Miller, R. (n.d.). Week 1: Central tendency [Video file]. Retrieved from http://breeze.careeredonline.com/p9fynztexn6/?launcher=false&fcsContent=true&pbMode=normal
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

Quant: Statistical Significance

Presume that you have analyzed a relationship between 2 management styles and found they are significantly related. A statistician has looked at your output and said that the results really do not explain much of what is happening in the total relationship.

In quantitative research methodologies, meaningful results get reported and their statistical significance, confidence intervals and effect sizes (Creswell, 2014). If the results from a statistical test have a low probability of occurring by chance (5% or 1% or less) then the statistical test is considered significant (Creswell, 2014; Field, 2014; Huck, 2011).  Low statistical significance values usually try to protect against type I errors (Huck, 2011). Statistical significance test can have the same effect yet result in different values (Field, 2014).  Statistical significance on large samples sizes can be affected by small differences and can show up as significant, while in smaller samples large differences may be deemed insignificant (Field, 2014).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011).  Huck (2011) stated two main factors that could influence whether or not a result is statistically significant is the quality of the research question and research design.  This is why Creswell (2014) also stated confidence intervals and effect size. Confidence intervals explain a range of values that describe the uncertainty of the overall observation and effect size defines the strength of the conclusions made on the observations (Creswell, 2014).  Huck (2011) suggested that after statistical significance is calculated and the research can either reject or fail to reject a null hypothesis, effect size analysis should be conducted.  The effect size allows researchers to measure objectively the magnitude or practical significance of the research findings through looking at the differential impact of the variables (Huck, 2011; Field, 2014).  Field (2014), defines one way of measuring the effect size is through Cohen’s d: d = (Avg(x1) – Avg(x2))/(standard deviation).  There are multiple ways to pick a standard deviation for the denominator of the effect size equation: control group standard deviation, group standard deviation, population standard deviation or pooling the groups of standard deviations that are assuming there is independence between the groups (Field, 2014).   If d = 0.2 there is a small effect, d = 0.5 there is a moderate effect, and d = 0.8 or more there is a large effect (Field, 2014; Huck, 2011). Thus, this could be the reason why the statistical test yielded a statistically significant value, but further analysis with effect size could show that those statistically significant results do not explain much of what is happening in the total relationship.

Resources

  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Field, A. (2011) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2013) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.