Adv Quant: Birth Rate Dataset in R

The births2006.smpl dataset, which ships with an R library, contains roughly 400,000 records and 13 variables. The following is an analysis of this dataset.

Introduction

The births2006.smpl dataset, which ships with an R library, contains roughly 400,000 records and 13 variables. The following is an analysis of this dataset.

Results

Figure 1. The first five data point entries in the births2006.smpl data set.

Figure 2. The frequency of births in 2006 per day of the week.

Figure 3. Histogram of 2006 births frequencies graphed by day of the week and separated by method of delivery.

Figure 4. A trellis histogram plot of 2006 birth weight per birth number.

Figure 5. A trellis histogram plot of 2006 birth weight per birth delivery method.

Figure 6. A boxplot of 2006 birth weight per Apgar score.

Figure 7. A boxplot of 2006 birth weight per day of week.

Figure 8. A histogram of 2006 average birth weight per multiple births separated by gender.

Discussion

Given the open-source nature of R, many libraries are built and shared with the greater community, and the Comprehensive R Archive Network (CRAN) hosts a large collection of these programs as R Packages (Schumacker, 2014).  The nutshell library includes a data set of 2006 births called “births2006.smpl”.  To view the first few entries, the head() command can be used (R, n.d.g.).  The printout from the head() command (Figure 1) shows all 13 variables of the dataset along with the first five entries in the births2006.smpl dataset.

The number of births appears to be approximately (though not precisely) uniform across the work week, assuming Sunday is coded as 1 and Saturday as 7 (Figure 2).  Tuesday through Thursday have the highest number of births, while the weekends have the fewest.

Breaking down the 2006 deliveries by method and day of the week shows that vaginal births outnumber C-section deliveries on all seven days (Figure 3).  Vaginal births are more frequent on Tuesday through Thursday than on the weekend, and C-section deliveries occur most often between Tuesday and Friday and least often on the weekends.

Breaking down the frequency of births by birth weight (Figure 4) shows that the distribution of birth weight in grams shifts to the left as the number of multiple births increases.  This suggests that babies born as a set of twins, triplets, etc. have lower birth weights on average.  Birth weight is approximately normally distributed for single births but begins to lose normality as the number of births increases.

Further analysis of the 2006 birth weights by delivery method shows that the delivery method, whether known or not, does not appear to play a large role in determining the child’s birth weight (Figure 5).  Statistical tests and effect size analysis could be conducted to verify and strengthen this assertion, which is made here only from the graphical representation in Figure 5.
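
As a minimal sketch (not part of the original analysis), and assuming the nutshell package and births2006.smpl data are loaded as in the Code section below, such a check could take the form of an analysis of variance of birth weight on delivery method, followed by a rough eta-squared effect size:

# Hedged sketch: ANOVA of birth weight by delivery method plus eta-squared
fit = aov(DBWT ~ DMETH_REC, data = births2006.smpl)
summary(fit)                        # statistical significance of delivery method
ss = summary(fit)[[1]][["Sum Sq"]]
ss[1] / sum(ss)                     # eta-squared: proportion of variance explained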

The Apgar test is administered to the child at one and five minutes after birth and assesses the skin color, heart rate, reflexes, muscle tone, and respiration rate of the child, where 10 is the highest but rarely obtained score (Hirsch, 2014).  Plotting birth weight in grams against the Apgar score (1-10) shows that children with higher Apgar scores had, on average, higher median birth weights (Figure 6).  As the Apgar score increases, the distribution generally becomes tighter and more outliers begin to appear (disregarding the results for an Apgar score of 1).  These boxplots tend to confirm Hirsch’s (2014) assertion that higher Apgar scores are harder to obtain.

The boxplot of birth weight per day of the week (Figure 7) shows that the median, Q1, Q3, maximum, and minimum are essentially unchanged from day to day.  Outliers, the heavier babies, occur regardless of the day of the week and appear to have little to no effect on the distribution of birth weight per day of the week.

Finally, looking at mean birth weight per gender and per multiple-birth category shows a similar distribution for males and females (Figure 8). The main noticeable difference is that male quintuplet or higher-order births weigh more on average than the corresponding female births.  This chart also confirms the conclusion drawn from Figure 4: as the number of births increases, the average weight of the children decreases.

In conclusion, the day of the week does not predict birth weight, though it is associated with birth frequency. In general, babies are heavier if they are single births and if they achieve an Apgar score of 10.  Birth weight is not predictable from the delivery method.  All of these conclusions are based on visual representations of the births2006.smpl dataset.  Conducting statistical significance tests and computing effect sizes would increase the validity of these statements and add further weight to what can be derived from these images.

Code

#
## Use R to analyze the Birth dataset. 
## The Birth dataset is in the Nutshell library. 
##  • SEX and APGAR5 (SEX and Apgar score) 
##  • DPLURAL (single or multiple birth) 
##  • WTGAIN (weight gain of mother) 
##  • ESTGEST (estimated gestation in weeks) 
##  • DOB_MM, DOB_WK (month and day of week of birth) 
##  • BWT (birth weight) 
##  • DMETH_REC (method of delivery)
#
install.packages("nutshell")
library(nutshell)
data(births2006.smpl)

# First, list the data for the first 5 births. 
head(births2006.smpl)

# Next, show a bar chart of the frequencies of births according to the day of the week of the birth.
births.dayofweek = table(births2006.smpl$DOB_WK) #Goal of this variable is to speed up the calculations
barplot(births.dayofweek, ylab="frequency", xlab="Day of week", col="darkred", main="Number of births in 2006 per day of the week")

# Obtain frequencies for two-way classifications of birth according to the day of the week and the method of delivery.
births.methodsVdaysofweek = table(births2006.smpl$DOB_WK,births2006.smpl$DMETH_REC) 
head(births.methodsVdaysofweek,7)
barplot(births.methodsVdaysofweek[,-2], col=heat.colors(length(rownames(births.methodsVdaysofweek))), width=2, beside=TRUE, main="bar plot of births per method per day of the week")
legend("topleft", fill=heat.colors(length(rownames(births.methodsVdaysofweek))), legend=rownames(births.methodsVdaysofweek))

# Use lattice (trellis) graphs (R package lattice) to condition density histograms on the values of a third variable. 
library(lattice)

# The variable for multiple births and the method of delivery are conditioning variables. 
# Separate the histogram of birth weight according to these variable.
histogram(~DBWT|DPLURAL, data=births2006.smpl, layout=c(1,5), col="black", xlab="birth weight", main="trellis plot of birth weight vs birth number")

histogram(~DBWT|DMETH_REC, data=births2006.smpl, layout=c(1,3), col="black", xlab="birth weight", main="trellis plot of birth weight vs birth method")

# Do a box plot of birth weight against Apgar score and box plots of birth weight by day of week of delivery. 
boxplot(DBWT~APGAR5, data=births2006.smpl, ylab="birth weight", xlab="APGAR5", main="Boxplot of birthweight per Apgar score")

boxplot(DBWT~DOB_WK, data=births2006.smpl, ylab="birth weight", xlab="Day of Week", main="Boxplot of birthweight per day of week")

# Calculate the average birth weight as a function of multiple births for males and females separately. 
# Use the "tapply" function, and for missing values use the option "na.rm=TRUE".
listed = list(births2006.smpl$DPLURAL,births2006.smpl$SEX)
tapplication=tapply(births2006.smpl$DBWT,listed,mean,na.rm=TRUE)
barplot(tapplication, ylab="birth weight", beside=TRUE, legend=TRUE, xlab="gender", main="bar plot of average birthweight per multiple births by gender")

References

  • CRAN (n.d.). Using lattice’s histogram(). Retrieved from https://cran.r-project.org/web/packages/tigerstats/vignettes/histogram.html
  • Hirsch, L. (2014). About the Apgar score. Retrieved from http://kidshealth.org/en/parents/apgar.html#
  • R (n.d.a.). Add legends to plots. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/legend.html
  • R (n.d.b.). Apply a function over a ragged array. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/tapply.html
  • R (n.d.c.). Bar plots. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html
  • R (n.d.d.). Cross tabulation and table creation. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
  • R (n.d.e.). List-Generic and dotted pairs. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.html
  • R (n.d.f.). Produce box-and-wisker plot(s) of a given (grouped) values.  Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/boxplot.html
  • R (n.d.g.). Return the first or last part of an object. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/utils/html/head.html
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

Adv Quant: Statistical Significance and Machine Learning

Data mining and analytics are used to test hypotheses and detect trends from very large datasets. In statistics, the significance is determined to some extent by the sample size. How can supervised learning be used in such large data sets to overcome the problem where everything is significant with statistical analysis?

Statistical significance is partly a function of sample size: with very large samples, even small differences can show up as statistically significant, while in smaller samples large differences may be deemed statistically insignificant (Field, 2014).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011). Statistical analysis is highly deductive (Creswell, 2014), while supervised learning is highly inductive (Connolly & Begg, 2014).  Statistical analysis also tries to identify trends in a given sample by assuming normality, linearity, or constant variance, whereas machine learning aims to find patterns in a large sample of data, expects that these statistical assumptions will not be met, and therefore requires a larger random sample (Ahlemeyer-Stubbe & Coleman, 2014).
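
The sample-size point can be illustrated with a small, self-contained R sketch (an illustrative simulation only, not from the cited texts): the same small mean difference fails to reach significance in a small sample but becomes highly significant in a very large one.

# Illustrative simulation: identical effect, different sample sizes
set.seed(42)
small.a = rnorm(30, mean = 0.00);  small.b = rnorm(30, mean = 0.05)
large.a = rnorm(1e6, mean = 0.00); large.b = rnorm(1e6, mean = 0.05)
t.test(small.a, small.b)$p.value   # typically well above 0.05
t.test(large.a, large.b)$p.value   # typically far below 0.05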

Machine learning tries to emulate the way humans learn. When humans learn, they create a model based on observations that describes the key features of a situation and helps them predict an outcome; machine learning does predictive modeling of large data sets in a similar fashion (Connolly & Begg, 2014).  The biggest selling point of supervised machine learning is that the machine can build models that identify key patterns in the data when humans can no longer cope with the volume, velocity, and variety of the data (Ahlemeyer-Stubbe & Coleman, 2014). Many applications use machine learning: marketing, investments, fraud detection, manufacturing, telecommunication, etc. (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Figure 1 illustrates how supervised learning can classify data or predict values through a two-phase process: (1) training, where the model is built by ingesting large amounts of historical data; and (2) testing, where the new model is applied to new, current data to help establish its accuracy, reliability, and validity (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). The model created by the machine through this learning is quickly adaptable to new data (Minelli, Chambers, & Dhiraj, 2013).  The models themselves are sets of rules or formulas, depending on which analytical algorithm is used (Ahlemeyer-Stubbe & Coleman, 2014).  Given that supervised machine learning is trained with known responses (or outputs) to make its future predictions, it is vital to have a clear purpose defined before running the algorithm.  The model is only as good as the data that goes into it.

Figure 1:  Simplified process diagram on supervised machine learning.

Thus, for classification the machine learns a function that maps data into one or more defining classes, using techniques such as decision tree and neural network induction (Connolly & Begg, 2014; Fayyad et al., 1996).  Fayyad et al. (1996) mentioned that it is impossible to classify data cleanly into one camp versus another. For value prediction, regression is used to fit a function to the data that yields an estimate of where the next value should fall (Connolly & Begg, 2014; Fayyad et al., 1996).  However, with these regression formulas it is good to remember that correlation between the data/variables does not imply causation.
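
A minimal R sketch of both tasks is given below, using the built-in iris data; the choice of the rpart decision-tree package and the 70/30 train/test split are illustrative assumptions, not choices made by the cited authors.

# Phase 1: train on historical data; Phase 2: test on held-out data
library(rpart)
set.seed(1)
train.rows = sample(nrow(iris), 0.7 * nrow(iris))
train = iris[train.rows, ]
test  = iris[-train.rows, ]
model = rpart(Species ~ ., data = train)                           # classification via a decision tree
mean(predict(model, test, type = "class") == test$Species)         # accuracy on new data
reg = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = train)  # value prediction via regression
sqrt(mean((predict(reg, test) - test$Sepal.Length)^2))             # test-set RMSE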

Random sampling is core to statistics and the concept of statistical inference (Smith, 2015; Field, 2011), but it also serves a purpose in supervised learning (Ahlemeyer-Stubbe & Coleman, 2014).  Random sampling selects a proportion of the data from a population such that each data point has an equal opportunity of being selected (Smith, 2015; Huck, 2013). Larger samples tend, on average, to represent the population fairly well (Field, 2014; Huck, 2013). Given the nature of big data (high volume, velocity, and variety), it is assumed that there is plenty of data to draw upon for a supervised machine learning algorithm.  However, the larger the random sample fed into the algorithm, the longer it takes to process and analyze the data.
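
A short hedged sketch of simple random sampling in R (the variable names and sizes are illustrative only) shows how a manageable training sample can be drawn so that every record has an equal chance of selection:

# Draw a simple random sample of rows before training
set.seed(7)
big.data = data.frame(x = rnorm(1e6), y = rnorm(1e6))   # stand-in for a large dataset
sample.rows = sample(nrow(big.data), size = 10000)      # each row equally likely
training.sample = big.data[sample.rows, ]
nrow(training.sample)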

There are also unsupervised learning algorithms, which likewise require training and testing, but unlike supervised learning they do not validate the model against a predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014).   Instead, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014).  Cluster analysis is an example of unsupervised learning, where the model seeks a finite set of clusters that describe the data as subsets of similar items (Ahlemeyer-Stubbe & Coleman, 2014; Fayyad et al., 1996). Finally, in supervised learning the results can be checked through the estimation error; this is not as straightforward in unsupervised learning because there is no target, so retesting is required to see whether the patterns are similar or repeatable (Ahlemeyer-Stubbe & Coleman, 2014).
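
As a minimal illustration of unsupervised learning (the data and the choice of three clusters are illustrative, not drawn from the cited texts), k-means cluster analysis in R finds groupings without any predefined output variable:

# K-means clustering: no target column, only input features
features = iris[, 1:4]
clusters = kmeans(features, centers = 3, nstart = 25)
table(clusters$cluster)            # sizes of the discovered clusters
# Re-running on a fresh sample and comparing the patterns is one informal way
# to check whether the structure is repeatable, as noted above.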

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining, 17(3), 37–54.
  • Field, A. (2011) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2013) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Minelli, M., Chambers, M., Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses, 1st Edition. [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html

Adv Quant: Statistical Features of R

A comparison of the statistical features of R to its programming features, and an explanation of how they are useful in analyzing big datasets.
• Describe how the analytics of R are suited for Big Data.

Ward and Barker (2013) traced the definition of Volume, Velocity, and Variety back to Gartner.  A now widely accepted definition of big data is any set of data that has high velocity, volume, and variety (Davenport & Dyche, 2013; Fox & Do, 2013; Kaur & Rani, 2015; Mao, Xu, Wu, Li, Li, & Lu, 2015; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014; Richards & King, 2014; Sagiroglu & Sinanc, 2013; Zikopoulos & Eaton, 2012). Davenport et al. (2012) stated that IT companies define big data as “more insightful data analysis,” but that, if used properly, companies can gain a competitive edge from it.  Data scientists from companies like Google, Facebook, and LinkedIn use R for their finance and data analytics (Revolution Analytics, n.d.). According to Minelli, Chambers, and Dhiraj (2013), R has 2 million end-users and is used in industries like health, finance, etc.

Why is R so popular, and why does it have so many users?  It could be because R is free, open-source software that works on multiple platforms (Unix, Windows, Mac) and has an extensive statistical library supporting everything from basic statistical data analysis to multivariate analysis, scaling up to big data analytics (Hothorn, 2016; Leisch & Gruen, 2016; Schumacker, 2014, 2016; Templ, 2016; Theussl & Borchers, 2016; Wild, 2015).  Given the open-source nature of R, many libraries are built and shared with the greater community, and the Comprehensive R Archive Network (CRAN) hosts a large collection of these programs as R Packages (Schumacker, 2014).  Other advantages of R are customizable statistical analysis, control over the analytical process, and extensive documentation and references (Schumacker, 2016).  R Packages allow for everyday data analytics, visually aesthetic data visualizations, and faster results than legacy statistical software, all under the end-user’s control and drawing upon the talents of leading data scientists (Revolution Analytics, n.d.).  R’s programming features include a whole suite of data types (scalars, vectors, matrices, arrays, and data frames), as well as importing and exporting data to and from multiple other commercially available statistical/data software packages (SPSS, SAS, Excel, etc.) (Schumacker, 2014, 2016).  The features of R related to big data analytics, statistics, and programming are listed in Table 1 (below).  The R Packages listed below, together with the importing and exporting features to other big data statistical software, illustrate how useful R is for analyzing big datasets of various types (Schumacker, 2014, 2016).
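
A brief hedged sketch of the import/export features mentioned above (the SPSS file name is a placeholder): R can write and read plain-text data and, via the foreign package, exchange files with packages such as SPSS.

# Export a data frame to CSV and read it back in
write.csv(mtcars, "mtcars_copy.csv", row.names = FALSE)
my.data = read.csv("mtcars_copy.csv")
str(my.data)                                   # a data frame, one of R's core data types
# An SPSS .sav file could be imported in a similar spirit, e.g.:
# library(foreign); read.spss("survey_data.sav", to.data.frame = TRUE)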

Finally, R is the most dominant analytics tool for big data analytics (Minelli et al., 2013).  Because big data analytics sits at the border of computer science, data mining, and statistics, it is natural to see multiple R Packages and libraries listed within CRAN that are freely available to use.  Within the field of big data analytics, some (but not all) of the common sets of techniques with R Packages are machine learning, cluster analysis, finite mixture models, and natural language processing. Given the extensive libraries available through R Packages and the extensive documentation, R is well suited for big data.

Table 1: Big Data Analytics, Statistical, and Programming features of R

  • R programming features (Schumacker, 2014): input, process, output, R Packages
  • Variables in R (Schumacker, 2014): number, character, logical
  • Data types in R (Schumacker, 2014): scalars, arrays, vectors, matrices, lists, data frames
  • Flow control (Schumacker, 2014): loops (for, if, while, else, …), Boolean operators (and, not, or)
  • Visualizations (Schumacker, 2014): pie charts, bar charts, histograms, stem-and-leaf plots, scatter plots, box-whisker plots, surface plots, contour plots, geographic maps, colors, plus others from the many R Packages
  • Statistical analysis (Schumacker, 2014): central tendency, dispersion, correlation tests, linear regression, multiple regression, logistic regression, log-linear regression, analysis of variance, probability, confidence intervals, plus others from the many R Packages
  • Distributions: population, sampling, and statistical (Schumacker, 2014): binomial, uniform, exponential, normal, hypothesis testing, chi-square, z-test, t-test, F-test, plus others from the many R Packages
  • Multivariate statistical analysis (Schumacker, 2016): MANOVA, MANCOVA, factor analysis, principal components analysis, structural equation modeling, multidimensional scaling, discriminant analysis, canonical correlation, multiple group multivariate statistical analysis, plus others from the many R Packages
  • Big data analytics, cluster analysis (Leisch & Gruen, 2016): hierarchical clustering, partitioning clustering, model-based clustering, K-means clustering, fuzzy clustering, cluster-wise regression, principal component analysis, self-organizing maps, density-based clustering
  • Big data analytics, machine learning (Hothorn, 2016; Templ, 2016): neural networks, recursive partitioning, random forests, regularized and shrinkage methods, boosting, support vector machines, association rules, fuzzy rule-based systems, model selection and validation, tree methods, expectation-maximization, nearest neighbor
  • Big data analytics, natural language processing (Wild, 2015): frameworks, lexical databases, keyword extraction, string manipulation, stemming, semantics, pragmatics
  • Big data analytics, optimization and mathematical programming (Theussl & Borchers, 2016): optimization infrastructure packages, general purpose continuous solvers, least-squares problems, semidefinite and convex solvers, global and stochastic optimization, mathematical programming solvers

References

  • Davenport, T. H., Barth, P., & Bean, R. (2012). How big data is different. MIT Sloan Management Review, 54(1), 43.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Hothorn, T. (2016). CRAN task view: Machine learning & statistical learning. Retrieved from https://cran.r-project.org/web/views/MachineLearning.html
  • Kaur, K., & Rani, R. (2015). Managing Data in Healthcare Information Systems: Many Models, One Solution. Big Data Management, 52–59.
  • Leisch, F. & Gruen, B. (2016). CRAN task view: Cluster analysis & finite mixture models. Retrieved from https://cran.r-project.org/web/views/Cluster.html
  • Mao, R., Xu, H., Wu, W., Li, J., Li, Y., & Lu, M. (2015). Overcoming the Challenge of Variety: Big Data Abstraction, the Next Evolution of Data Management for AAL Communication Systems. Ambient Assisted Living Communications, 42–47.
  • Minelli, M., Chambers M., & Dhiraj A. (2013) Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Revolution Analytics (n.d.). What is R? Retrieved from http://www.revolutionanalytics.com/what-r
  • Richards, N. M., & King, J. H. (2014). Big Data Ethics. Wake Forest Law Review, 49, 393–432.
  • Sagiroglu, S., & Sinanc, D. (2013). Big Data: A Review. Collaboration Technologies and Systems (CTS), 42–47.
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.
  • Schumacker, R. E. (2016) Using R with multivariate statistics. California, SAGE Publications, Inc.
  • Templ, M. (2016). CRAN task view: Official statistics & survey methodology. Retrieved from https://cran.r-project.org/web/views/OfficialStatistics.html
  • Theussl, S. & Borchers, H. W. (2016). CRAN task view: Optimization and mathematical programming. Retrieved from https://cran.r-project.org/web/views/Optimization.html
  • Ward, J. S., & Barker, A. (2013). Undefined by data: a survey of big data definitions. arXiv preprint arXiv:1309.5821.
  • Wild, F. (2015). CRAN task view: Natural language processing. Retrieved from https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
  • Zikopoulos, P., & Eaton, C. (2012). Understanding Big Data: Analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media.

Quant: Compelling topics

A discussion of the most compelling topics learned in the subject of Quantitative Analysis.

Most Compelling Topics

Field (2013) states that quantitative and qualitative methods are complementary rather than competing approaches to solving the world’s problems, although the two methods are quite different from each other. Simply put, quantitative methods are utilized when the research contains variables that are numerical, and qualitative methods are utilized when the research contains variables that are based on language (Field, 2013).  Thus, central to quantitative research and methods is understanding the numerical, ordinal, or categorical dataset and what the data represent. This can be done through descriptive statistics, where the researcher uses statistics to describe a data set, or through inferential statistics, where conclusions can be drawn about the data set (Miller, n.d.).

Field (2013) and Schumacker (2014) defined central tendency as an all-encompassing term for describing the “center of a frequency distribution” through the commonly used measures of mean, median, and mode.  Outliers, missing values, multiplying by a constant, and adding a constant are factors that affect the central tendency (Schumacker, 2014).  Besides looking at a single central tendency measure, researchers can also analyze the mean and median together to understand how skewed the data are and in which direction.  Heavy skew increases the distance between these two values, and if the mean is less than the median the distribution is negatively skewed (Field, 2013).  To understand the distribution better, other measures like variance and standard deviation can be used.

Variance and standard deviation are measures of dispersion, with the variance describing the average dispersion (Field, 2013; Schumacker, 2014).  Variance is a numerical value that describes how the observed data values are spread across the data distribution and how, on average, they differ from the mean (Huck, 2011; Field, 2013; Schumacker, 2014).  A smaller variance indicates that the observed data values lie close to the mean, and vice versa (Field, 2013).
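
A small R sketch with made-up numbers illustrates these measures, along with the effect of an outlier and of adding a constant:

# Central tendency and dispersion on illustrative data
x = c(2, 4, 4, 5, 7, 9, 30)        # the 30 acts as an outlier
mean(x); median(x)                 # the outlier pulls the mean above the median
mean(x + 10); median(x + 10)       # adding a constant shifts both measures by 10
var(x); sd(x)                      # dispersion: spread of values around the mean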

Rarely is every member of a population studied; instead, a sample from that population is randomly taken to represent the population for analysis in quantitative research (Gall, Gall, & Borg, 2006). At the end of the day, the insights gained from this type of research should be impersonal, objective, and generalizable.  To generalize the results of the research, the insights gained from a sample of data must use the correct mathematical procedures for handling probabilities and information, that is, statistical inference (Gall et al., 2006).  Gall et al. (2006) stated that statistical inference dictates the order of procedures; for instance, a hypothesis and a null hypothesis must be defined before a statistical significance level, which in turn must be defined before calculating a z or t statistic.  Essentially, statistical inference allows quantitative researchers to make inferences about a population, and researchers must remember where that data was generated and collected from during the quantitative research process.

Most flaws in research methodology exist because validity and reliability weren’t established (Gall et al., 2006). Thus, it is important to ensure a valid and reliable assessment instrument.  When using any existing survey as an assessment instrument, one should report the instrument’s development, items, scales, reports on reliability, and reports on validity from past uses (Creswell, 2014; Joyner, 2012).  Permission must be secured for using any instrument and placed in the appendix (Joyner, 2012).  The validity of the assessment instrument is key to drawing meaningful and useful statistical inferences (Creswell, 2014).

Through sampling of a population and use of a valid and reliable survey instrument for assessment, attitudes and opinions about a population can be correctly inferred from the sample (Creswell, 2014).  Sometimes a survey instrument doesn’t fit those in the target group and thus would not produce valid or reliable inferences for the targeted population. One must select a targeted population and determine the size of that stratified population (Creswell, 2014).

Parametric statistics are inferential and based on random sampling from a distinct population, with the sample data used to make strict inferences about the population’s parameters; thus tests like t-tests, chi-square, and F-tests (ANOVA) can be used (Huck, 2011; Schumacker, 2014).  Nonparametric statistics, the “assumption-free tests,” are used for ranked data, with tests such as the Mann-Whitney U-test, Wilcoxon signed-rank test, Kruskal-Wallis H-test, and chi-square (Field, 2013; Huck, 2011).

First, there is a need to define the types of data: continuous data is interval/ratio data, and categorical data is nominal/ordinal data.  The following summary is modified from Schumacker (2014) with data added from Huck (2011); a brief R sketch of several of these tests is given after the table:

Statistic (dependent variable; independent variable)
  • Analysis of Variance (ANOVA), one-way (continuous; categorical)
  • t-Test, single sample (continuous; none)
  • t-Test, independent groups (continuous; categorical)
  • t-Test, dependent/paired groups (continuous; categorical)
  • Chi-square (categorical; categorical)
  • Mann-Whitney U-test (ordinal; ordinal)
  • Wilcoxon signed-rank test (ordinal; ordinal)
  • Kruskal-Wallis H-test (ordinal; ordinal)
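
The sketch below (with made-up data) shows how several of the tests in the table can be run in R; the grouping variable and the cutoff used for the chi-square example are illustrative only.

# Parametric and nonparametric tests on simulated data
set.seed(3)
group = factor(rep(c("A", "B"), each = 25))
score = c(rnorm(25, mean = 10), rnorm(25, mean = 12))
t.test(score ~ group)                      # independent-groups t-test (parametric)
summary(aov(score ~ group))                # one-way ANOVA (parametric)
wilcox.test(score ~ group)                 # Mann-Whitney U / Wilcoxon rank-sum (nonparametric)
kruskal.test(score ~ group)                # Kruskal-Wallis H-test (nonparametric)
chisq.test(table(group, score > 11))       # chi-square test of independence (categorical)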

Meaningful results should be reported along with their statistical significance, confidence intervals, and effect sizes (Creswell, 2014). If the results of a statistical test have a low probability of occurring by chance (5%, 1%, or less), then the test is considered statistically significant (Creswell, 2014; Field, 2014; Huck, 2011).  Statistical significance tests can have the same effect yet result in different values (Field, 2014).  With large sample sizes, even small differences can show up as statistically significant, while in smaller samples large differences may be deemed insignificant (Field, 2014).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011).  Huck (2011) stated that two main factors that can influence whether or not a result is statistically significant are the quality of the research question and the research design.

Huck (2011) suggested that after statistical significance is calculated and the researcher can either reject or fail to reject the null hypothesis, an effect size analysis should be conducted.  The effect size allows researchers to measure objectively the magnitude, or practical significance, of the research findings by looking at the differential impact of the variables (Huck, 2011; Field, 2014).  Field (2014) defines one way of measuring the effect size through Cohen’s d: d = (mean(x1) – mean(x2)) / (standard deviation).  If d = 0.2 there is a small effect, if d = 0.5 a moderate effect, and if d = 0.8 or more a large effect (Field, 2014; Huck, 2011). Thus, a statistical test could yield a statistically significant value, but further analysis with effect size could show that those statistically significant results do not explain much of what is happening in the total relationship.
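
A worked R sketch of Cohen’s d with made-up data follows; using the pooled standard deviation as the denominator is one common choice and is an assumption here, not a prescription from Field (2014).

# Cohen's d = (mean(x1) - mean(x2)) / standard deviation (pooled here)
set.seed(5)
x1 = rnorm(40, mean = 105, sd = 15)
x2 = rnorm(40, mean = 100, sd = 15)
pooled.sd = sqrt(((length(x1) - 1) * var(x1) + (length(x2) - 1) * var(x2)) /
                 (length(x1) + length(x2) - 2))
(mean(x1) - mean(x2)) / pooled.sd   # compare against the 0.2 / 0.5 / 0.8 benchmarks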

In regression analysis, it should be possible to predict the dependent variable based on the independent variables, depending on two factors: (1) the assessment tool is valid and reliable (Creswell, 2014), and (2) the sample size is large enough to conduct the analysis and to draw statistical inferences about the population from the collected sample data (Huck, 2011). Assuming these two conditions are met, regression analysis can be carried out on the data to create a prediction formula. Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011).

When modeling to predict the dependent variable from the independent variable, the regression model with the strongest correlation will be used, as it is that regression formula that best explains the variance between the variables.   However, even when the regression formula can predict some or most of the variance between the variables, it never implies causation (Field, 2013).  Correlations describe the strength of the relationship captured by the regression formula and can vary in value from -1 to +1.  The closer the correlation coefficient is to -1 or +1, the better the regression formula is as a predictor of the variance between the variables; the closer the correlation coefficient is to zero, the weaker the relationship between the variables (Field, 2013; Huck, 2011; Schumacker, 2014).  It should never be forgotten that correlation doesn’t imply causation, but squaring the correlation value (r2) gives the percentage of the variance between the variables explained by the regression formula (Field, 2013).
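
A minimal R sketch using the built-in mtcars data (an illustrative dataset, not one from the discussion) ties these pieces together: the correlation, its square, and the matching regression formula.

# Correlation, r-squared, and simple linear regression
r = cor(mtcars$wt, mtcars$mpg)      # strength and direction of the relationship
r^2                                 # proportion of variance shared by the two variables
fit = lm(mpg ~ wt, data = mtcars)   # regression formula for prediction
summary(fit)$r.squared              # equals r^2 for simple linear regression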

 

References:

  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Gall, M. D., Gall, J., & Borg W. (2006). Educational research: An introduction (8th ed.). Pearson Learning Solutions. VitalBook file.
  • Huck, S. W. (2011) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Joyner, R. L. (2012) Writing the Winning Thesis or Dissertation: A Step-by-Step Guide (3rd ed.). Corwin. VitalBook file.
  • Miller, R. (n.d.). Week 1: Central tendency [Video file]. Retrieved from http://breeze.careeredonline.com/p9fynztexn6/?launcher=false&fcsContent=true&pbMode=normal
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

Quant: In-depth Analysis in SPSS

This short analysis attempts to understand marital happiness levels as a function of combined income. It was found that marital happiness levels depend on a couple’s combined income, but the happiest couples were happy regardless of how much money they had. This quantitative analysis of the sample data has shown that when happiness levels are low, there is a higher chance of lower levels of combined income.

Abstract

This short analysis attempts to understand marital happiness levels as a function of combined income.  It was found that marital happiness levels depend on a couple’s combined income, but the happiest couples were happy regardless of how much money they had.  This quantitative analysis of the sample data has shown that when happiness levels are low, there is a higher chance of lower levels of combined income.

Introduction

Mulligan (1973) was one of the first to state that arguments about money are among the top reasons for divorce between couples.  Factors behind financial arguments can stem from goals and savings; record keeping; delaying tactics; apparel cost-cutting strategies; controlling expenditures; financial statements; do-it-yourself techniques; and cost-cutting techniques (Lawrence, Thomasson, Wozniak, & Prawitz, 1993). Lawrence et al. (1993) assert that financial arguments are common between families.  However, when does money no longer become an issue?  Does an increase in combined family income affect marital happiness levels?  This analysis attempts to answer these questions.

Methods

Crosstabulation was conducted as a descriptive exploration of the data.  Boxplots helped show the spread and distribution of combined income per marital happiness level.  In this analysis of the data, the following two alternative hypotheses will be tested:

  • There is a difference between the mean values of combined income per marital happiness levels.
  • There is a dependence between the combined income and marital happiness level.

To test these hypotheses, a one-way analysis of variance and a two-way chi-square test were conducted, respectively.

Results

Table 1: Case processing summary for analyzing happiness level versus family income.

Table 2: Crosstabulation for analyzing happiness level versus family income (<$21,250).

Table 3: Crosstabulation for analyzing happiness level versus family income (>$21,250).

Table 4: Chi-square test for analyzing happiness level versus family income.

Table 5: Analysis of Variance for analyzing happiness level versus family income.

Figure 1: Boxplot diagram per happiness level of a marriage versus the family incomes.

Figure 2: Line diagram per happiness level of a marriage versus the mean of the family incomes.

Discussions and Conclusions

There are 1,419 participants, and only 38.5% responded to both the happiness of marriage and family income questions (Table 1).  What may have contributed to this large nonresponse rate is that some respondents may not have been married, making the happiness of marriage question not applicable to them.  Thus, it is suggested that in the future there should be an N/A option in this survey instrument, to see if a higher response rate can be achieved.  Given that there are still 547 responses, there is other information to be gained from analyzing this data.

As a family unit gains more income, its happiness level increases (Tables 2-3).  This can be seen as the dollar value increases: the percentage within family income (ranges recoded to midpoints) for the “very happy” category rises from about the 50% level to the 75% level.    The unhappiest couples appear at combined incomes of about $7,500-9,000 and $27,500-45,000.  For marriages that are “pretty happy,” the percentage is fairly stable at 30-40% of respondents at $13,750 or more.

The mean values of family income per happiness level (Figure 2) show that, on average, happier couples make more money together; however, closer examination using boxplots (Figure 1) shows that the happiest couples seem to be happy regardless of how much money they make, as the tails of the boxplot extend far from the median.  One interesting feature is that the spread of combined family income shrinks as happiness decreases (Figure 1).  This could suggest that although money is not a major factor for couples that are happy, unhappiness could be driven by lower combined incomes.

The two-tailed chi-squared test shows statistical significance between family combined income and marital happiness, allowing us to reject null hypothesis #2, which stated that these two variables are independent of each other (Table 4).  The analysis of variance, however, doesn’t allow for a rejection of null hypothesis #1, which states that the mean combined income is the same across the marital happiness levels (Table 5).

There could be many explanations for these results; thus, future work could include analyzing other variables that might help define other factors for marital happiness.  A multivariate analysis may be necessary to see the impact on marital happiness as the dependent variable, with combined income as one of many independent variables.

SPSS Code

GET
  FILE='C:\Users\mkher\Desktop\SAV files\gss.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.

CROSSTABS
  /TABLES=hapmar BY incomdol
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ CORR
  /CELLS=COUNT ROW COLUMN
  /COUNT ROUND CELL.

ONEWAY rincome BY hapmar
  /MISSING ANALYSIS.

* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=hapmar incomdol MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: hapmar=col(source(s), name("hapmar"), unit.category())
  DATA: incomdol=col(source(s), name("incomdol"))
  DATA: id=col(source(s), name("$CASENUM"), unit.category())
  GUIDE: axis(dim(1), label("HAPPINESS OF MARRIAGE"))
  GUIDE: axis(dim(2), label("Family income; ranges recoded to midpoints"))
  SCALE: cat(dim(1), include("1", "2", "3"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: schema(position(bin.quantile.letter(hapmar*incomdol)), label(id))
END GPL.

* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=hapmar MEAN(incomdol)[name="MEAN_incomdol"]
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: hapmar=col(source(s), name("hapmar"), unit.category())
  DATA: MEAN_incomdol=col(source(s), name("MEAN_incomdol"))
  GUIDE: axis(dim(1), label("HAPPINESS OF MARRIAGE"))
  GUIDE: axis(dim(2), label("Mean Family income; ranges recoded to midpoints"))
  SCALE: cat(dim(1), include("1", "2", "3"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: line(position(hapmar*MEAN_incomdol), missing.wings())
END GPL.

References

Appreciative Inquiry

A popular model for organizational change is appreciative inquiry. It has been criticized by academics as lacking rigor in its assessment approaches.

Creswell (2014) stated that inquiry procedures come in three flavors: quantitative, qualitative, and mixed.  Action research, a style of participatory research, falls mostly under the inquiry procedures of quantitative methodologies (Creswell, 2014), and it is from the action research methodology that Appreciative Inquiry began (Holmber & Reed, 2010).  Appreciative Inquiry usually asks what went well, or what was done well, rather than what went wrong (Hammond, 2006).  Appreciative Inquiry works on the psychological concept of “framing,” since words not only have definitions but can also carry emotional connotations (Hammond, 2006).  Words with these emotional connotations can influence people; therefore, people’s perceptions and preferences change based on how a question or statement is framed (Prentice, 2007). According to Appreciative Inquiry, it is easier to sell an idea when the focus is on the positive aspect of that idea.  It is usually better to say that a bag of dried plantain chips is 95% fat-free rather than 5% fat (Hammond, 2006; Prentice, 2007). How statements or questions are worded can change how people view and frame the issue.  But the goal of Appreciative Inquiry is not just a matter of framing; it is also finding out what has been done that works well and doing more of that (Hammond, 2006).

Even though Appreciative Inquiry is a popular model for organizational change, its lack of rigor in its assessment approaches can upset those that are more quantitative in nature, especially since quantitative methodologists use quantitative measures to deductively reach a conclusion (Creswell, 2014).  However, Appreciative Inquiry gives researchers a different perspective than what they are accustomed to, and more researchers are becoming inspired by this model (Holmber & Reed, 2010).  For instance, if questions in an assessment instrument replaced words with negative connotations, such as “dysfunction,” “co-dependent,” “stress,” “addiction,” and “depressed,” with words that have more neutral or positive connotations, it would reframe the issue (Brookshear & Brylow, 2014). Using words with negative connotations can trigger a person’s loss aversion.  Loss aversion suggests that people perceive pain at least two times more strongly than pleasure and always aim to mitigate the pain (Prentice, 2007).  Thus, the wording of assessment instruments and of the results they generate should be carefully considered (Brookshear & Brylow, 2014).  This is especially the case given that the goal of quantitative research is to remain impersonal and objective (Creswell, 2014).

Another idea from Appreciative Inquiry that quantitative methodologists could use is the practice of examining an idea from a different angle.  For instance, conducting a Fermi decomposition, in which a researcher breaks a big question into a smaller, more tangible, and quantitatively measurable set of pieces, is one way of viewing an idea from a different angle (Hubbard, 2010).   Also, asking “Why?” iteratively five times, as suggested in the lean six sigma DMAIC (define, measure, analyze, implement, and control) process, is another way to understand an idea better (iSixSigma, n.d.).  Thus, looking at ideas from different angles can help find the root cause of an issue or an opportunity for improvement.

References

  • Brookshear, G. & Brylow, D. (2014). Computer Science: An Overview (12th ed.). Pearson Learning Solution. VitalBook file.
  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Hammond, S. (2006). The Thin Book of Appreciative Inquiry. Thin Book Publishing.
  • Holmber, L. & Reed, J. (2010). AI research notes. AI Practitioners, 12(4), 55-57. Retrieved from https://www.academia.edu/362704/A_quantitative_approach_to_AI-research
  • Hubbard, D. W. (2010). How to measure anything: Finding the values of “intangibles” in business (2nd ed.). New Jersey, John Wiley & Sons, Inc.
  • iSixSigma (n.d.).  Determine the root cause: 5 whys. Retrieved from https://www.isixsigma.com/tools-templates/cause-effect/determine-root-cause-5-whys/
  • Prentice, R. A. (2007). Ethical decision making: More needed than good intentions. Financial Analysis Journal, 63(6), 17–30.

Differences between Quantitative and Qualitative Intros and Lit Reviews

The differences in the content and structure of a qualitative introduction and literature review as compared to a quantitative introduction and literature review.

Simply put, quantitative methods are utilized when the research contains variables that are numerical, and qualitative methods are utilized when the research contains variables that are based on language (Field, 2013).  Thus, each method’s goals and procedures are quite different, and this difference drives how a research paper’s introduction and literature review are written.

Introductions in a research paper allow the researcher to announce the problem and why it is important enough to be explored through a study.  Given that qualitative research may not have any known variables or theories, its introductions tend to vary tremendously (Creswell, 2014; Edmondson & McManus, 2007).  Creswell (2014) suggested that qualitative introductions can begin with a quote from one of the participants, state the researcher’s personal story from a first-person or third-person viewpoint, or be written in an inductive style.  There is less variation in quantitative introductions because the best way to introduce the problem is to introduce the variables, from an impersonal viewpoint (Creswell, 2014).  Gaining further understanding of these variables’ influence on a particular outcome is what drives the study in the first place.

The purpose of the literature review is for the researcher to share the results of other studies tangential to theirs, to show how their study relates to the bigger picture and what gaps in the knowledge they are trying to fill (Creswell, 2014).  Edmondson and McManus (2007) stated that when a field of research is nascent, the study becomes exploratory and qualitative in nature.  Given this exploratory nature, qualitative researchers write their literature reviews in an exploratory and inductive manner (Creswell, 2014).  Edmondson and McManus (2007) also stated that when the research area is mature, with plenty of related and existing studies on the topic, a more quantitative approach is appropriate.  Given the large body of knowledge to draw from in quantitative methods, researchers tend to present a substantial amount of literature at the beginning and structure it in a deductive fashion (Creswell, 2014).  Framing the literature review in a deductive manner allows the researcher, at the end of the literature review, to state their research question(s) and hypotheses clearly and measurably (Creswell, 2014; Miller, n.d.).

To conclude, understanding which methodological approach best fits a research study can help drive how the introduction and literature review sections are crafted and written.

References

  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Edmondson, A. C., & McManus, S. E. (2007). Methodological fit in management field research. Academy of Management Review, 32(4), 1155–1179. http://doi.org/10.5465/AMR.2007.26586086
  • Field, A. (2013) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.