Adv Quant: General Least Squares Model

Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011). There are multiple types of regression, and all of them are tests of prediction (Huck, 2011; Schumacker, 2014).  The least squares (linear) regression is the most well-known because it uses basic algebra, a straight line, and the correlation coefficient to aid in stating the regression’s prediction strength (Huck, 2011; Schumacker, 2014).  The linear regression model is:

y = (a + bx) + e                                                                   (1)

Where y is the dependent variable, x is the independent variable, a (the intercept) and b (the regression weight, also known as the slope) are constants that are estimated through the regression analysis, and e is the regression prediction error (Field, 2013; Schumacker, 2014).  Under the least squares criterion, a and b are chosen so that the sum of the squared errors is minimized (Schumacker, 2014).

Correlation coefficients help define the strength of the relationship that the regression formula describes, and can vary in value from -1 to +1.  The closer the correlation coefficient is to -1 or +1, the better the regression formula predicts one variable from the other.  The closer the correlation coefficient is to zero, the weaker the linear relationship between the variables (Field, 2013; Huck, 2011; Schumacker, 2014).  Correlations never imply causation, but squaring the correlation value (r²) gives the proportion of the variance in one variable that is accounted for by the other through the regression formula (Field, 2013).
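To make equation 1 and the role of the correlation coefficient concrete, here is a minimal sketch in R. The five data points are made up for illustration only; the slope and intercept are computed directly from the standard least squares formulas and then compared with what lm( ) returns.

```r
# Small made-up data set (values are illustrative only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares estimates of the slope (b) and the intercept (a)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# Prediction errors (e) and the quantity that least squares minimizes
e   <- y - (a + b * x)
sse <- sum(e^2)

# Correlation coefficient and the proportion of variance explained
r         <- cor(x, y)
r_squared <- r^2

# The hand-computed values should agree with lm()
fit <- lm(y ~ x)
print(c(a = a, b = b, sse = sse, r = r, r_squared = r_squared))
print(coef(fit))
```

With a single predictor, the squared correlation equals the R-squared value reported by summary(fit), which is why r² is read as the share of variance explained.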

Assumptions of the General Least Squares Model (GLM) for regression and correlations

The General Least Squares Model (GLM) is the line of best fit for linear regression modeling, along with its corresponding correlations (Smith, 2015).  There are five assumptions to a linear regression model: additivity, linearity, independent errors, homoscedasticity, and normally distributed errors.  The dependent variable should be linearly related to the independent variable(s), and the combined effects of multiple independent variables should be additive. A residual is the difference between the observed value and the predicted value: (1) no two residuals should be correlated, which can be numerically tested by using the Durbin-Watson test; (2) the variance of these residuals should be constant at each level of the independent variable(s); and (3) the residuals should be random and normally distributed with a mean of 0 (Field, 2013; Schumacker, 2014).
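The residual assumptions above can be checked numerically. The following is a rough sketch, assuming the built-in cars data set as a stand-in for real data. The Durbin-Watson statistic is computed by hand from its definition instead of through an add-on package, and the split-half variance comparison is only a crude, illustrative check of homoscedasticity, not a formal test.

```r
# Fit a simple regression on the built-in cars data (used only as an example)
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# (1) Independent errors: Durbin-Watson statistic computed from its definition.
#     It ranges from 0 to 4; values near 2 suggest uncorrelated residuals.
#     Note that it depends on the order of the rows in the data.
dw <- sum(diff(res)^2) / sum(res^2)

# (2) Homoscedasticity: compare the residual variance for low and high
#     fitted values; a ratio far from 1 hints at unequal variances.
lower_half     <- res[fitted(fit) <= median(fitted(fit))]
upper_half     <- res[fitted(fit) >  median(fitted(fit))]
variance_ratio <- var(upper_half) / var(lower_half)

# (3) Normally distributed errors with a mean of 0
mean_res <- mean(res)
sw       <- shapiro.test(res)   # Shapiro-Wilk test of normality

print(c(durbin_watson = dw, variance_ratio = variance_ratio,
        mean_residual = mean_res, shapiro_p = sw$p.value))
```

A residuals-versus-fitted scatter plot, for example plot(fitted(fit), res), is the usual visual companion to these numbers.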

Covering the issues with transforming variables to make them linear

When viewing the data through scatter plots, if the linearity and additivity assumptions are not met, then the variables can be transformed to make the relationship linear. This is an iterative trial and error process.  A transformation must be applied to every point of the data set, since it changes the units of the variable and therefore the differences between its values (Field, 2013).

Table 1: Types of data transformations and their uses (adapted from Field (2013) Table 5.1).

| Data Transformation | Can Correct for |
|---|---|
| Log [independent variable(s)] | Positive skew, positive kurtosis, unequal variances, lack of linearity |
| Square root [independent variable(s)] | Positive skew, positive kurtosis, unequal variances, lack of linearity |
| Reciprocal [independent variable(s)] | Positive skew, positive kurtosis, unequal variances |
| Reverse score [independent variable(s)]: subtracting each score from the highest value in the variable | Negative skew |
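The transformations in Table 1 are one-liners in R. The sketch below applies them to two made-up vectors (one positively skewed, one negatively skewed) and uses a small helper, skew( ), written here only for illustration, to show how the skew changes.

```r
# Made-up positively skewed scores (illustrative values only)
x <- c(1, 2, 2, 3, 3, 4, 5, 8, 13, 40)

log_x   <- log(x)    # requires x > 0
sqrt_x  <- sqrt(x)   # requires x >= 0
recip_x <- 1 / x     # requires x != 0; note that it reverses the ordering

# Made-up negatively skewed scores, reverse scored by subtracting
# each score from the highest value in the variable
neg      <- c(1, 28, 33, 36, 37, 38, 38, 39, 39, 40)
reversed <- max(neg) - neg

# Simple sample skewness helper (hypothetical, for this example only)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

print(c(raw = skew(x), log = skew(log_x), sqrt = skew(sqrt_x)))
print(c(raw = skew(neg), reversed = skew(reversed)))
```

After reverse scoring, the negative skew becomes a positive skew, so one of the other transformations can then be applied; the interpretation of the variable is reversed as well.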

Describe the R procedures for linear regression

lm( ) is a function for running linear regression, glm( ) is a function for fitting generalized linear models such as logistic regression (not to be confused with the GLM above), and loglm( ), from the MASS package, is a function for running log-linear regression in R (Schumacker, 2014; Smith, 2015). The summary( ) function is used to output the results of the linear regression. In a model formula, the dependent variable goes to the left of the tilde “~” and the independent variables go to the right, separated by “+” (Schumacker, 2014). Thus, the R procedures for linear regression are (Marin, 2013):

> cor (x, y) # correlation coefficient
> myRegression = lm (y ~ x, data = dataSet ) # conduct a linear regression on x and y
> summary(myRegression) # produces the outputs of the lm( ) function calculations
> attributes(myRegression) # lists the attributes of the lm( ) function
> myRegression$coefficients # gives you the slope and intercept coefficients
> plot (x, y, main="Title to graph") # scatter plot
> abline(myRegression) # regression line
> confint(myRegression, level= 0.99) # 99% level of confidence intervals for the regression coefficients
> anova(myRegression) # anova analysis on the regression analysis
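As a runnable illustration of the procedure above, here is a short sketch that uses the built-in cars data set (stopping distance versus speed) in place of the generic x, y, and dataSet. The choice of data set and the two speeds used for prediction are assumptions made only for this example.

```r
# Built-in example data: speed (mph) and stopping distance (ft)
data(cars)

# Correlation between the two variables
r <- cor(cars$speed, cars$dist)

# Dependent variable on the left of ~, independent variable on the right
fit <- lm(dist ~ speed, data = cars)

print(summary(fit))                 # coefficients, R-squared, F test
print(confint(fit, level = 0.99))   # 99% confidence intervals
print(anova(fit))                   # ANOVA table for the regression

# Use the fitted line to predict stopping distance at two new speeds
new_speeds <- data.frame(speed = c(10, 20))
predicted  <- predict(fit, newdata = new_speeds)
print(predicted)
```

For a single-predictor model like this one, r^2 matches the "Multiple R-squared" value in the summary output.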

References

• Field, A. (2013) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
• Huck, S. W. (2011) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
• Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

Adv Quant: Birth Rate Dataset in R

Introduction

The nutshell package for R includes the births2006.smpl dataset, a sample of more than 400,000 birth records from 2006 with 13 variables.  The following is an analysis of this dataset.

Results

Figure 1. The first five data point entries in the births2006.smpl data set.
Figure 2. The frequency of births in 2006 per day of the week.
Figure 3. Histogram of 2006 births frequencies graphed by day of the week and separated by method of delivery.
Figure 4. A trellis histogram plot of 2006 birth weight per birth number.
Figure 5. A trellis histogram plot of 2006 birth weight per birth delivery method.
Figure 6. A boxplot of 2006 birth weight per Apgar score.
Figure 7. A boxplot of 2006 birth weight per day of week.
Figure 8. A histogram of 2006 average birth weight per multiple births separated by gender.

Discussion

Given the open-sourced nature of the R software, many libraries are being built and shared with the greater community, and the Comprehensive R Archive Network (CRAN), has a ton of these programs as part of R Packages (Schumacker, 2014).  Thus, as part of the nutshell library, there exists a data set of 2006 births called “births2006.smpl”.  To view the first few entries the head() command can be used (R, n.d.g.).  The printout from the head() command (Figure 1) shows all 13 variables of the dataset along with the first five entries in the births2006.smpl dataset.

The number of births seems to be approximately (but not precisely) uniform during the work week, assuming Sunday is 1 and Saturday is 7 (Figure 2).  However, Tuesday through Thursday have the most births in the week, and the weekend days have the fewest.

Breaking down the method of delivery in 2006 per day of the week, it can be seen that vaginal births outnumber C-section deliveries on all seven days of the week (Figure 3).  Also, there are more vaginal births on Tuesday through Thursday than during the weekend, and most C-section deliveries occur between Tuesday and Friday, with the fewest occurring during the weekend.

Breaking down the frequency of births per birth weight (Figure 4), it can be seen that the distribution of birth weight in grams shifts to the left as the number of multiple births increases.  This seems to suggest that babies born as a set of twins, triplets, etc. have lower birth weights on average and across the distribution.  Birth weight is almost normally distributed for single births but begins to lose normality as the number of births increases.

Further analysis of birth weights in 2006 per delivery method shows that neither whether the delivery method is known nor the type of delivery method plays much of a role in determining the child’s birth weight (Figure 5).  Statistical tests and effect size analysis could be conducted to verify this assertion, which is based only on the graphical representation in Figure 5.

The Apgar test is given to the child one and five minutes after birth and looks at the skin color, heart rate, reflexes, muscle tone, and respiration rate of the child, where 10 is the highest but rarely obtained score (Hirsch, 2014).  Thus, plotting birth weight in grams against the Apgar score variable (1-10) shows that those with higher Apgar scores had, on average, higher median birth weights (Figure 6).  Typically, as the Apgar score increases, the distribution becomes tighter and more outliers begin to appear (disregarding the results from an Apgar score of 1).  These results from the boxplots tend to confirm Hirsch’s (2014) assertion that higher Apgar scores are harder to obtain.

The boxplot analysis of birth weight per day of the week (Figure 7) shows that the median, Q1, Q3, max, and min are nearly unchanged from one day of the week to the next.  Outliers, the heavier babies, tend to occur without respect to the day of the week, and also appear to have little to no effect on the distribution of birth weight per day of the week.

Finally, looking at the mean birth weight per gender and per multiple births shows a similar distribution for males and females (Figure 8). The main noticeable difference is that males in quintuplet or higher births weigh more on average than the corresponding females.  This chart also confirms the conclusion made from Figure 4: as the number of births increases, the average weight of the children decreases.

In conclusion, the day of the week doesn’t predict birth weights, but probably predicts birth frequency. In general, babies are heavier if they are single births and if they achieve an Apgar score of 10.  Birth weights are not predictable through delivery method.  All of these conclusions are based on the visual representation of the dataset births2006.smpl.  Conducting statistical significance tests and effect size analysis would increase the validity of these statements and add further weight to what can be derived from these images.
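As an illustration of the kind of significance test and effect size mentioned above, here is a minimal sketch of a one-way ANOVA with eta squared. The data are simulated stand-ins, not the births2006.smpl records, and the group means and standard deviations are invented purely so the example runs on its own.

```r
set.seed(42)

# Simulated stand-in data: hypothetical birth weights (grams) by delivery method
sim <- data.frame(
  method = factor(rep(c("Vaginal", "C-section"), each = 200)),
  weight = c(rnorm(200, mean = 3300, sd = 550),
             rnorm(200, mean = 3280, sd = 600))
)

# One-way ANOVA: does mean weight differ between delivery methods?
fit <- aov(weight ~ method, data = sim)
tab <- summary(fit)[[1]]
print(tab)

# Effect size: eta squared = between-group sum of squares / total sum of squares
ss_between  <- tab[["Sum Sq"]][1]
ss_total    <- sum(tab[["Sum Sq"]])
eta_squared <- ss_between / ss_total
p_value     <- tab[["Pr(>F)"]][1]
print(c(eta_squared = eta_squared, p_value = p_value))
```

On the real data, the same pattern would be aov(DBWT ~ DMETH_REC, data = births2006.smpl). With several hundred thousand records even a tiny difference can be statistically significant, which is why reporting the effect size alongside the p-value matters.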

Code

#
## Use R to analyze the Birth dataset.
## The Birth dataset is in the Nutshell library.
##  • SEX and APGAR5 (SEX and Apgar score)
##  • DPLURAL (single or multiple birth)
##  • WTGAIN (weight gain of mother)
##  • ESTGEST (estimated gestation in weeks)
##  • DOB_MM, DOB_WK (month and day of week of birth)
##  • DBWT (birth weight)
##  • DMETH_REC (method of delivery)
#
install.packages("nutshell")
library(nutshell)
data(births2006.smpl)

# First, list the data for the first 5 births.
head(births2006.smpl, n = 5)

# Next, show a bar chart of the frequencies of births according to the day of the week of the birth.
births.dayofweek = table(births2006.smpl$DOB_WK) # Tabulate once so the counts are not recomputed for the plot
barplot(births.dayofweek, ylab="frequency", xlab="Day of week", col = "darkred", main= "Number of births in 2006 per day of the week")

# Obtain frequencies for two-way classifications of birth according to the day of the week and the method of delivery.
births.methodsVdaysofweek = table(births2006.smpl$DOB_WK,births2006.smpl$DMETH_REC)
# [,-2] drops the second delivery-method column before plotting
barplot(births.methodsVdaysofweek[,-2], col=heat.colors(length(rownames(births.methodsVdaysofweek))), width=2, beside=TRUE, main = "bar plot of births per method per day of the week")
legend ("topleft", fill=heat.colors(length(rownames(births.methodsVdaysofweek))),legend=rownames(births.methodsVdaysofweek))

# Use lattice (trellis) graphs (R package lattice) to condition density histograms on the values of a third variable.
library(lattice)

# The variable for multiple births and the method of delivery are conditioning variables.
# Separate the histogram of birth weight according to these variables.
histogram(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),col="black", xlab = "birth weight", main = "trellis plot of birth weight vs birth number")

histogram(~DBWT|DMETH_REC,data=births2006.smpl,layout=c(1,3),col="black", xlab = "birth weight", main = "trellis plot of birth weight vs birth method")

# Do a box plot of birth weight against Apgar score and box plots of birth weight by day of week of delivery.
boxplot(DBWT~APGAR5,data=births2006.smpl,ylab="birth weight",xlab="APGAR5", main="Boxplot of birthweight per Apgar score")

boxplot(DBWT~DOB_WK,data=births2006.smpl,ylab="birth weight",xlab="Day of Week", main="Boxplot of birthweight per day of week")

# Calculate the average birth weight as a function of multiple births for males and females separately.
# Use the "tapply" function, and for missing values use the option "na.rm=TRUE".
listed = list(births2006.smpl$DPLURAL,births2006.smpl$SEX)
tapplication=tapply(births2006.smpl$DBWT,listed,mean,na.rm=TRUE)
barplot(tapplication,ylab="birth weight", beside=TRUE, legend.text=TRUE,xlab="gender", main = "bar plot of average birthweight per multiple births by gender")
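
To show what the tapply call with na.rm=TRUE produces, here is a small self-contained sketch. The toy data frame and its column names are hypothetical and only mimic the shape of the birth data (a plurality group, a sex, and a weight with one missing value).

```r
# Toy data that mimics the structure of the birth data (values are made up)
toy <- data.frame(
  plural = c("Single", "Single", "Twin", "Twin", "Single", "Twin"),
  sex    = c("F", "M", "F", "M", "F", "M"),
  weight = c(3400, 3500, 2400, NA, 3300, 2500)
)

# Mean weight for every combination of plural and sex.
# na.rm = TRUE is passed on to mean() so the missing weight is ignored.
means <- tapply(toy$weight, list(toy$plural, toy$sex), mean, na.rm = TRUE)
print(means)
```

The result is a matrix with one row per plurality group and one column per sex, which is the layout that barplot( ) with beside=TRUE expects.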

References

• R (n.d.b.). Apply a function over a ragged array. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/tapply.html
• R (n.d.f.). Produce box-and-whisker plot(s) of a given (grouped) values.  Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/boxplot.html
• Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

Adv Quant: Statistical Features of R

Ward and Barker (2013) traced the definition of Volume, Velocity, and Variety back to Gartner.  Now, a widely accepted definition for big data is any set of data that has high velocity, volume, and variety (Davenport & Dyche, 2013; Fox & Do, 2013; Kaur & Rani, 2015; Mao, Xu, Wu, Li, Li, & Lu, 2015; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014; Richards & King, 2014; Sagiroglu & Sinanc, 2013; Zikopoulos & Eaton, 2012). Davenport et al. (2012) stated that IT companies define big data as “more insightful data analysis”, but if used properly companies can gain a competitive edge.  Data scientists from companies like Google, Facebook, and LinkedIn use R for their finance and data analytics (Revolution Analytics, n.d.). According to Minelli, Chambers, and Dhiraj (2013), R has 2 million end-users and is used in industries like health, finance, etc.

Why is R so popular, and why does it have that many users?  It could be that R is free, open-source software that works on multiple platforms (Unix, Windows, Mac) and has an extensive statistical library that supports everything from basic statistical data analysis to multivariate analysis, scaling up to big data analytics (Hothorn, 2016; Leisch & Gruen, 2016; Schumacker, 2014 & 2016; Templ, 2016; Theussl & Borchers, 2016; Wild, 2015).  Given the open-source nature of R, many libraries are being built and shared with the greater community, and the Comprehensive R Archive Network (CRAN) has a ton of these programs as part of R Packages (Schumacker, 2014).  Other advantages of R are the customizable statistical analysis, control over the analytical processes, and extensive documentation and references (Schumacker, 2016).  R Packages allow for everyday data analytics, visually aesthetic data visualizations, and faster results than legacy statistical software, all under the end-user’s control and drawing upon the talents of leading data scientists (Revolution Analytics, n.d.).  R programming features include dealing with a whole suite of data types (scalars, vectors, matrices, arrays, and data frames), as well as importing and exporting data to and from multiple other commercially available statistical/data software packages (SPSS, SAS, Excel, etc.) (Schumacker, 2014 & 2016).  All the features of R related to big data analytics, statistics, and programming are listed in Table 1 (below).  The R Packages listed below, together with the importing and exporting features to other big data statistical software, illustrate how useful R is for analyzing big datasets of various types (Schumacker, 2014, 2016).

Finally, R is the most dominant analytics tool for Big Data Analytics (Minelli et al., 2013).  Since big data analytics is at the border of computing science, data mining, and statistics, it is natural to see multiple R Packages and libraries listed within CRAN that are freely available to use.  Within the field of big data analytics, some (but not all) of the common sets of techniques that have R Packages are machine learning, cluster analysis, finite mixture models, and natural language processing. Given the extensive libraries through R Packages and extensive documentation, R is well suited for Big Data.

Table 1: Big Data Analytics, Statistical, and Programmable features of R

| Feature (source) | Examples |
|---|---|
| R Programming Features (Schumacker, 2014) | Input, Process, Output, R Packages |
| Variables in R (Schumacker, 2014) | number, character, logical |
| Data Types in R (Schumacker, 2014) | scalars, arrays, vectors, matrices, list, data frames |
| Flow control: Loops (Schumacker, 2014) | Loops (for, if, while, else, …); Boolean Operators (and, not, or) |
| Visualizations (Schumacker, 2014) | pie charts, bar charts, histogram, stem-and-leaf plots, scatter plots, box-whiskers plot, surface plots, contour plots, geographic maps, colors, plus others from the many R Packages |
| Statistical Analysis (Schumacker, 2014) | Central tendency, dispersion, correlation test, linear regression, multiple regression, logistic regression, log-linear regression, analysis of variance, probability, confidence intervals, plus others from the many R Packages |
| Distributions: population, sampling, and statistical (Schumacker, 2014) | Binomial, Uniform, Exponential, Normal, Hypothesis testing, chi-square, z-test, t-test, f-test, plus others from the many R Packages |
| Multivariate Statistical Analysis (Schumacker, 2016) | MANOVA, MANCOVA, factor analysis, principal components analysis, structural equation modeling, multidimensional scaling, discriminant analysis, canonical correlation, multiple group multivariate statistical analysis, plus others from the many R Packages |
| Big Data Analytics: Cluster Analysis (Leisch & Gruen, 2016) | hierarchical clustering, partitioning clustering, model-based clustering, K-means clustering, fuzzy clustering, cluster-wise regression, principal component analysis, self-organizing maps, density based clustering |
| Big Data Analytics: Machine Learning (Hothorn, 2016; Templ, 2016) | neural networks, recursive partitioning, random forests, regularized and shrinkage methods, boosting, support vector machines, association rules, fuzzy rules based systems, model selection and validation, tree methods, expectation-maximization, nearest neighbor |
| Big Data Analytics: Natural Language Processing (Wild, 2015) | Frameworks, lexical databases, keyword extraction, string manipulation, stemming, semantic, pragmatics |
| Big Data Analytics: Optimization and Mathematical Programming (Theussl & Borchers, 2016) | optimization infrastructure packages, general purpose continuous solvers, least-squares problems, semidefinite and convex solvers, global and stochastic optimization, mathematical programming solvers |

References

• Davenport, T. H., Barth, P., & Bean, R. (2012). How big data is different. MIT Sloan Management Review, 54(1), 43.
• Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
• Kaur, K., & Rani, R. (2015). Managing Data in Healthcare Information Systems: Many Models, One Solution. Big Data Management, 52–59.
• Leisch, F. & Gruen, B. (2016). CRAN task view: Cluster analysis & finite mixture models. Retrieved from https://cran.r-project.org/web/views/Cluster.html
• Mao, R., Xu, H., Wu, W., Li, J., Li, Y., & Lu, M. (2015). Overcoming the Challenge of Variety: Big Data Abstraction, the Next Evolution of Data Management for AAL Communication Systems. Ambient Assisted Living Communications, 42–47.
• Minelli, M., Chambers M., & Dhiraj A. (2013) Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
• Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
• Richards, N. M., & King, J. H. (2014). Big Data Ethics. Wake Forest Law Review, 49, 393–432.
• Sagiroglu, S., & Sinanc, D. (2013). Big Data : A Review. Collaboration Technologies and Systems (CTS), 42–47.
• Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.
• Schumacker, R. E. (2016) Using R with multivariate statistics. California, SAGE Publications, Inc.
• Theussl, S. & Borchers, H. W. (2016). CRAN task view: Optimization and mathematical programming. Retrieved from https://cran.r-project.org/web/views/Optimization.html
• Ward, J. S., & Barker, A. (2013). Undefined by data: a survey of big data definitions. arXiv preprint arXiv:1309.5821.
• Zikopoulos, P., & Eaton, C. (2012). Understanding Big Data: Analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media.

Big Data Analytics: R

R is a powerful statistical tool that can aid in data mining.  Thus, it has huge relevance in the big data arena.  Focusing on my project, I have found that R has a text mining package, tm.

Patal and Donga (2015) and Fayyad, Piatetsky-Shapiro, & Smyth (1996) say that the main techniques in Data Mining are: anomaly detection (outlier/change/deviation detection), association rule learning (relationships between the variables), clustering (grouping data that are similar to one another), classification (applying a known structure to new data), regressions (finding a function to describe the data), and summarization (visualizations, reports, dashboards). According to Ghosh, Roy, & Bandyopadhyay (2012), the main types of Text Mining techniques are: text categorization (assigning text/documents to pre-defined categories), text clustering (grouping similar text/documents together), concept mining (discovering concept/logic based ideas), information retrieval (finding the relevant documents per the query), and information extraction (identifying key phrases and relationships within the text). Meanwhile, Agrawal and Batra (2013) add summarization (compressed representation of the input), assessing document similarity (similarities between different documents), and document retrieval (identifying and grabbing the most relevant documents) to the list of text mining techniques.

We use “library(tm)” to aid in transforming text, stemming words, building a term-document matrix, etc., mostly for preprocessing the data (RStudio pubs, n.d.). Based on RStudio pubs (n.d.), some text preprocessing steps and code are as follows:

• To remove punctuation:

docs <- tm_map(docs, removePunctuation)

• To remove special characters:

for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
}

• To remove numbers:

docs <- tm_map(docs, removeNumbers)

• Convert to lowercase:

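docs <- tm_map(docs, content_transformer(tolower)) # tm 0.6 and later need content_transformer() around base functions such as tolower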

• Removing “stopwords”/common words

docs <- tm_map(docs, removeWords, stopwords("english"))

• Removing particular words

docs <- tm_map(docs, removeWords, c("department", "email"))

• Combining words that should stay together

for (j in seq(docs)) {
  docs[[j]] <- gsub("qualitative research", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative studies", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative analysis", "QDA", docs[[j]])
  docs[[j]] <- gsub("research methods", "research_methods", docs[[j]])
}

• Removing common word endings

library(SnowballC)
docs <- tm_map(docs, stemDocument)
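
To show what these preprocessing steps do to actual text without needing the tm package installed, here is a rough base R sketch of the same ideas. The two sample sentences and the short list of words to drop are made up for illustration, and plain gsub( ) calls stand in for the tm_map( ) transformations (stemming is left out because it needs SnowballC).

```r
# Two made-up documents (illustrative text only)
docs <- c("Email the Department @ 9: qualitative research / methods!",
          "Running 3 qualitative studies | analysing results.")

clean <- gsub("[/@|]", " ", docs)         # special characters become spaces
clean <- gsub("[[:punct:]]", "", clean)   # remove punctuation
clean <- gsub("[[:digit:]]+", "", clean)  # remove numbers
clean <- tolower(clean)                   # convert to lowercase

# Remove a few particular words (a hypothetical, very short list)
drop_words <- c("the", "department", "email")
pattern    <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
clean      <- gsub(pattern, "", clean, perl = TRUE)

# Combine phrases that should stay together
clean <- gsub("qualitative (research|studies|analysis)", "QDA", clean)

# Collapse the leftover runs of whitespace
clean <- trimws(gsub("\\s+", " ", clean))
print(clean)

# Simple word frequencies across both documents
freq <- table(unlist(strsplit(clean, " ")))
print(sort(freq, decreasing = TRUE))
```

The tm versions operate on a corpus object instead of a character vector, but the order of the steps matters in the same way: for example, the phrase replacement here has to match lowercase text because it runs after tolower( ).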

Text mining algorithms could consist of but are not limited to (Zhao, 2013):

• Summarization:
  ◦ Word clouds: use “library(wordcloud)”
  ◦ Word frequencies
• Regressions:
  ◦ Term correlations: use the findAssocs function from “library(tm)”
  ◦ Plot word frequencies: use “library(ggplot2)”
• Classification models:
  ◦ Decision tree: use “library(party)” or “library(rpart)”
• Association models:
  ◦ Apriori: use “library(arules)”
• Clustering models:
  ◦ K-means clustering: use “library(fpc)”
  ◦ K-medoids clustering: use “library(fpc)”
  ◦ Hierarchical clustering: use “library(cluster)”
  ◦ Density-based clustering: use “library(fpc)”
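
As a small illustration of the clustering entries in the list above, the sketch below groups four documents by their term counts. The document-term matrix is invented for the example, and it uses kmeans( ) and hclust( ) from the stats package that ships with R as stand-ins, so nothing beyond a base installation is assumed.

```r
# Hypothetical document-term counts (rows = documents, columns = terms)
dtm <- matrix(c(5, 4, 0, 0,
                4, 5, 1, 0,
                0, 1, 5, 4,
                0, 0, 4, 5),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("data", "mining", "birth", "weight")))

# K-means clustering into two groups (several random starts for stability)
set.seed(1)
km <- kmeans(dtm, centers = 2, nstart = 10)
print(km$cluster)

# Hierarchical clustering on Euclidean distances, cut into two groups
hc     <- hclust(dist(dtm), method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

Both methods should place doc1 with doc2 and doc3 with doc4, since those pairs share the same dominant terms. The cluster labels themselves (1 or 2) are arbitrary.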

As we can see, there are current libraries, functions, etc. to help with data preprocessing, data mining, and data visualization when it comes to text mining with R and RStudio.

Resources:

Big Data Analytics: Installing R

I didn’t have any problems with the installation thanks to a video produced by Dr. Webb (2014).  It is a bigger package than I thought it would be, so it can take a few minutes to download, depending on your download speed and internet connection. The steps are:

(1)    For proper installation of R, you need to have administrative access on your computer.

(2)    Watch this video to get step-by-step instructions and an online tutorial for installing R and its graphical Integrated Development Environment (IDE).

1. Note: The 32-bit and 64-bit versions of R can be found at http://cran.r-project.org/
2. Note: The free RStudio “Desktop” graphical IDE can be found at http://www.rstudio.com/

(3)    Once installed, use the manual for this application at this site: http://cran.r-project.org/doc/manuals/R-intro.html

Once I installed the software and the graphical IDE, I continued to follow along with the video to use the prepopulated cars data under the “datasets” package, and I got the same result as shown in the video.  I also would like to note that Dr. Webb (2014) had checked the Packages: “datasets,” “graphics,” “grDevices,” “methods,” and “stats” in the video, which can be hard to see depending on your video streaming resolution.
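
A quick way to confirm that the installation works is to run a few lines in the R console. This is only a sketch of such a check, not the exact steps from the video: it prints the installed version, lists the attached packages, and loads the cars data from the “datasets” package.

```r
# Which version of R is installed?
print(R.version.string)

# Which packages are currently attached? The default ones should appear here.
attached <- search()
print(attached)

# Load the prepopulated cars data and look at it
data(cars)
print(head(cars))      # first six rows: speed and dist
print(summary(cars))   # basic summary statistics for both columns
```

If these commands print a version string and a small table of speeds and stopping distances, then R and its default packages are installed and working.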

Resources: