Adv Quant: Birth Rate Dataset in R

The nutshell package for R includes births2006.smpl, a dataset with roughly 400,000 records and 13 variables. The following is an analysis of this dataset.





Figure 1. The first five data point entries in the births2006.smpl data set.


Figure 2. The frequency of births in 2006 per day of the week.


Figure 3. Bar chart of 2006 birth frequencies by day of the week, separated by method of delivery.


Figure 4. A trellis histogram plot of 2006 birth weight per birth number.


Figure 5. A trellis histogram plot of 2006 birth weight per birth delivery method.


Figure 6. A boxplot of 2006 birth weight per Apgar score.


Figure 7. A boxplot of 2006 birth weight per day of week.


Figure 8. A bar plot of 2006 average birth weight per multiple births, separated by gender.


Because R is open source, many libraries are built and shared with the wider community, and the Comprehensive R Archive Network (CRAN) hosts a large number of them as R packages (Schumacker, 2014). One of these, the nutshell package, contains a dataset of 2006 births called “births2006.smpl”. The head() command displays the first few entries of a dataset (R, n.d.g.). Its printout (Figure 1) shows all 13 variables along with the first five entries in births2006.smpl.

The number of births is approximately, but not exactly, uniform across the work week (Figure 2), where day 1 is assumed to be Sunday and day 7 Saturday. Tuesday through Thursday have the most births, and the weekend days have the fewest.
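The visual impression from Figure 2 could be checked with a chi-square goodness-of-fit test against a uniform distribution. The following is only a minimal sketch: it uses simulated day-of-week codes with invented probabilities rather than the actual births2006.smpl values, and the names day_probs, dob_wk, and cohens_w are illustrative.

```r
# Simulated stand-in for births2006.smpl$DOB_WK (1 = Sunday ... 7 = Saturday).
# The probabilities below are invented for illustration only.
set.seed(42)
day_probs <- c(0.10, 0.15, 0.16, 0.16, 0.16, 0.15, 0.12)
dob_wk <- sample(1:7, size = 10000, replace = TRUE, prob = day_probs)

# Frequency table, the same kind of object that feeds the bar chart.
day_counts <- table(factor(dob_wk, levels = 1:7))

# Goodness-of-fit test: H0 is that births are equally likely on every day.
gof <- chisq.test(day_counts, p = rep(1 / 7, 7))
print(gof)

# Cohen's w as an effect size for the departure from uniformity: sqrt(chi2 / n).
cohens_w <- sqrt(unname(gof$statistic) / sum(day_counts))
print(cohens_w)
```

On the real data, day_counts would simply be table(births2006.smpl$DOB_WK).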

Breaking down the 2006 deliveries by method and day of the week shows that vaginal births outnumber C-section deliveries on all seven days (Figure 3). Vaginal births are more frequent on Tuesday through Thursday than on the weekend. C-section deliveries are most frequent from Tuesday through Friday and least frequent on the weekend.

Breaking down birth frequencies by birth weight (Figure 4) shows that the distribution of birth weight in grams shifts to the left as the number of babies per birth increases. This suggests that babies born as twins, triplets, etc. have lower birth weights, both on average and across the distribution. Birth weight is almost normally distributed for single births but departs from normality as the number of babies per birth increases.

Further analysis of 2006 birth weights by delivery method shows that the method of delivery, and whether it is known at all, plays little role in the child’s birth weight (Figure 5). Statistical tests and an effect size analysis could be conducted to verify this assertion, which rests only on the graphical representation in Figure 5.
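One way such a check might look is a one-way ANOVA with eta squared as the effect size. This is a hedged sketch on simulated data, not a result from births2006.smpl: the data frame births_sim and its values are invented, and the simulated weights are deliberately drawn from the same distribution for every delivery method.

```r
# Simulated stand-in for DBWT (grams) and DMETH_REC; the values are invented.
set.seed(1)
n <- 3000
births_sim <- data.frame(
  DMETH_REC = factor(sample(c("C-section", "Unknown", "Vaginal"), n, replace = TRUE)),
  DBWT = rnorm(n, mean = 3300, sd = 550)
)

# One-way ANOVA: does mean birth weight differ by delivery method?
fit <- aov(DBWT ~ DMETH_REC, data = births_sim)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Eta squared = between-group sum of squares / total sum of squares.
ss <- anova_table[["Sum Sq"]]
eta_squared <- ss[1] / sum(ss)
print(eta_squared)
```

An eta squared close to zero would support the reading that delivery method explains almost none of the variation in birth weight.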

The Apgar test is administered one and five minutes after birth and looks at the child’s skin color, heart rate, reflexes, muscle tone, and respiration rate; 10 is the highest score but is rarely obtained (Hirsch, 2014). Plotting birth weight in grams against the Apgar score variable (1-10) shows that babies with higher Apgar scores had, on average, higher median birth weights (Figure 6). Typically, as the Apgar score increases, the distribution becomes tighter and more outliers appear (disregarding the results for an Apgar score of 1). These boxplot results tend to support Hirsch’s (2014) assertion that higher Apgar scores are harder to obtain.

The boxplots of birth weight per day of the week (Figure 7) show that the median, Q1, Q3, maximum, and minimum are essentially unchanged from day to day. Outliers, the heavier babies, occur regardless of the day of the week and appear to have little to no effect on the distribution of birth weight per day.

Finally, the mean birth weight per gender and per multiple-birth category shows a similar pattern for males and females (Figure 8). The main noticeable difference is that males from quintuplet or higher births weigh more on average than the corresponding females. This chart also confirms the conclusion drawn from Figure 4: as the number of babies per birth increases, the average weight of the children decreases.

In conclusion, the day of the week doesn’t predict birth weight, but it probably predicts birth frequency. In general, babies are heavier if they are single births and if they achieve an Apgar score of 10. Birth weight is not predictable from delivery method. All of these conclusions rest on visual representations of the births2006.smpl dataset. Conducting statistical significance tests and reporting effect sizes would increase the validity of these statements and add weight to what can be derived from these images.


## Use R to analyze the Birth dataset. 
## The Birth dataset is in the nutshell package.
##  • SEX and APGAR5 (SEX and Apgar score) 
##  • DPLURAL (single or multiple birth) 
##  • WTGAIN (weight gain of mother) 
##  • ESTGEST (estimated gestation in weeks) 
##  • DOB_MM, DOB_WK (month and day of week of birth) 
##  • DBWT (birth weight)
##  • DMETH_REC (method of delivery)

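# First, list the data for the first 5 births.
library(nutshell)   # provides births2006.smpl
library(lattice)    # provides histogram() for the trellis plots below
data(births2006.smpl)
head(births2006.smpl, 5)
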
# Next, show a bar chart of the frequencies of births according to the day of the week of the birth.
births.dayofweek = table(births2006.smpl$DOB_WK) # Store the table once so it is not recomputed
barplot(births.dayofweek, ylab="frequency", xlab="Day of week", col = "darkred", main= "Number of births in 2006 per day of the week")

# Obtain frequencies for two-way classifications of birth according to the day of the week and the method of delivery.
births.methodsVdaysofweek = table(births2006.smpl$DOB_WK,births2006.smpl$DMETH_REC)
# Column 2 (delivery method "Unknown") is dropped from the plot.
barplot(births.methodsVdaysofweek[,-2], col=heat.colors(length(rownames(births.methodsVdaysofweek))), width=2, beside=TRUE, main = "bar plot of births per method per day of the week")
legend("topleft", fill=heat.colors(length(rownames(births.methodsVdaysofweek))),legend=rownames(births.methodsVdaysofweek))

# Use lattice (trellis) graphs (R package lattice) to condition density histograms on the values of a third variable. 

# The variable for multiple births and the method of delivery are conditioning variables. 
# Separate the histogram of birth weight according to these variables.
histogram(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),col="black", xlab = "birth weight", main = "trellis plot of birth weight vs birth number")

histogram(~DBWT|DMETH_REC,data=births2006.smpl,layout=c(1,3),col="black", xlab = "birth weight", main = "trellis plot of birth weight vs birth method")

# Do a box plot of birth weight against Apgar score and box plots of birth weight by day of week of delivery. 
boxplot(DBWT~APGAR5,data=births2006.smpl,ylab="birth weight",xlab="APGAR5", main="Boxplot of birthweight per Apgar score")

boxplot(DBWT~DOB_WK,data=births2006.smpl,ylab="birth weight",xlab="Day of Week", main="Boxplot of birthweight per day of week")

# Calculate the average birth weight as a function of multiple births for males and females separately. 
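# Use the "tapply" function, and for missing values use the option na.rm=TRUE.
listed = list(births2006.smpl$DPLURAL,births2006.smpl$SEX)
# Mean birth weight for every combination of plurality (rows) and sex (columns).
tapplication = tapply(births2006.smpl$DBWT, listed, mean, na.rm=TRUE)
barplot(tapplication,ylab="birth weight", beside=TRUE, legend=TRUE,xlab="gender", main = "bar plot of average birthweight per multiple births by gender")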
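
To make the shape of the tapply() result easier to see, here is a small sketch on a toy data frame. The data frame toy and its values are invented for illustration (including one missing weight) and only mimic the DPLURAL, SEX, and DBWT columns.

```r
# Toy stand-in for births2006.smpl; the values are invented, including one NA.
toy <- data.frame(
  DPLURAL = c("1 Single", "1 Single", "2 Twin", "2 Twin", "1 Single", "2 Twin"),
  SEX     = c("F", "M", "F", "M", "F", "M"),
  DBWT    = c(3300, 3500, 2400, NA, 3100, 2600)
)

# Grouping by two factors returns a matrix: rows = DPLURAL, columns = SEX.
# na.rm = TRUE is passed through to mean() so missing weights are skipped.
avg_weight <- tapply(toy$DBWT, list(toy$DPLURAL, toy$SEX), mean, na.rm = TRUE)
print(avg_weight)
```

Because the result is a matrix, barplot(..., beside=TRUE) draws one group of bars per column (sex) with one bar per row (plurality).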


  • CRAN (n.d.). Using lattice’s histogram(). Retrieved from
  • Hirsch, L. (2014). About the Apgar score. Retrieved from
  • R (n.d.a.). Add legends to plots. Retrieved from
  • R (n.d.b.). Apply a function over a ragged array. Retrieved from
  • R (n.d.c.). Bar plots. Retrieved from
  • R (n.d.d.). Cross tabulation and table creation. Retrieved from
  • R (n.d.e.). List-Generic and dotted pairs. Retrieved from
  • R (n.d.f.). Produce box-and-whisker plot(s) of a given (grouped) values. Retrieved from
  • R (n.d.g.). Return the first or last part of an object. Retrieved from
  • Schumacker, R. E. (2014). Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

Adv Quant: Statistical Significance and Machine Learning

Data mining and analytics are used to test hypotheses and detect trends from very large datasets. In statistics, the significance is determined to some extent by the sample size. How can supervised learning be used in such large data sets to overcome the problem where everything is significant with statistical analysis?

With large sample sizes, even small differences can show up as statistically significant, while in smaller samples large differences may be deemed statistically insignificant (Field, 2011). Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2013). Statistical analysis is highly deductive (Creswell, 2014), whereas supervised learning is highly inductive (Connolly & Begg, 2014). Also, statistical analysis tries to identify trends in a given sample by assuming normality, linearity, or constant variance, whereas machine learning aims to find patterns in a large sample of data. In machine learning these assumptions are not expected to be met, and it therefore requires a larger random sample (Ahlemeyer-Stubbe & Coleman, 2014).
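The effect of sample size on significance can be illustrated with a short simulation. This is a sketch under stated assumptions: both groups are simulated, and the true difference between them (0.02 standard deviations) is an invented, deliberately trivial value.

```r
# Two simulated groups whose true means differ by a trivial 0.02 standard deviations.
set.seed(123)
n <- 1e6
group_a <- rnorm(n, mean = 0.00, sd = 1)
group_b <- rnorm(n, mean = 0.02, sd = 1)

# With a million observations per group, the t-test flags the difference.
big_test <- t.test(group_a, group_b)
print(big_test$p.value)

# Cohen's d (difference in means over the pooled standard deviation) stays tiny.
pooled_sd <- sqrt((var(group_a) + var(group_b)) / 2)
cohens_d <- (mean(group_b) - mean(group_a)) / pooled_sd
print(cohens_d)

# The same comparison on 100 observations per group is usually not significant.
small_test <- t.test(group_a[1:100], group_b[1:100])
print(small_test$p.value)
```

The p-value answers whether a difference exists; the effect size answers whether it is large enough to matter, which is why both are worth reporting on very large datasets.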

Machine learning tries to emulate the way humans learn. When humans learn, they build a model from observations that describes the key features of a situation and helps them predict an outcome; machine learning performs predictive modeling of large data sets in a similar fashion (Connolly & Begg, 2014). The biggest selling point of supervised machine learning is that the machine can build models that identify key patterns in the data when humans can no longer cope with the volume, velocity, and variety of the data (Ahlemeyer-Stubbe & Coleman, 2014). Many applications use machine learning: marketing, investments, fraud detection, manufacturing, telecommunications, etc. (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Figure 1 illustrates how supervised learning can classify data or predict values through a two-phase process. The two phases are (1) training, where the model is built by ingesting large amounts of historical data; and (2) testing, where the new model is run on new, current data to help establish its accuracy, reliability, and validity (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). The model created through this learning adapts quickly to new data (Minelli, Chambers, & Dhiraj, 2013). The models themselves are sets of rules or formulas, depending on which analytical algorithm is used (Ahlemeyer-Stubbe & Coleman, 2014). Because supervised machine learning is trained with known responses (or outputs) to make its future predictions, it is vital to define a clear purpose before running the algorithm. The model is only as good as the data that goes into it.
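The two-phase process can be sketched in a few lines of base R. This is an illustrative example only: the data are simulated with invented coefficients, logistic regression via glm() is just one possible algorithm, and the 70/30 split and 0.5 cutoff are assumptions rather than anything prescribed by the sources cited above.

```r
# Simulated labelled data: the outcome y depends on x1 and x2 (invented coefficients).
set.seed(2024)
n <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1.5 * x1 - 1.0 * x2))
labelled <- data.frame(x1, x2, y)

# Phase 1 (training): fit the model on a random 70% of the records.
train_rows <- sample(seq_len(n), size = round(0.7 * n))
train <- labelled[train_rows, ]
test <- labelled[-train_rows, ]
model <- glm(y ~ x1 + x2, data = train, family = binomial)

# Phase 2 (testing): predict on the held-out 30% and compare with the known labels.
predicted_prob <- predict(model, newdata = test, type = "response")
predicted_class <- as.integer(predicted_prob > 0.5)
accuracy <- mean(predicted_class == test$y)
print(accuracy)
```

Holding the test records out of training is what lets the accuracy figure say something about new data rather than about data the model has already seen.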


Figure 1:  Simplified process diagram on supervised machine learning.

Thus, for classification the machine learns a function that maps data into one of several defined classes, using techniques such as decision trees and neural network induction (Connolly & Begg, 2014; Fayyad et al., 1996). Fayyad et al. (1996) mentioned that it is impossible to classify data cleanly into one camp versus another. For value prediction, regression fits a function to the data that, when followed, gives an estimate of where the next value will fall (Connolly & Begg, 2014; Fayyad et al., 1996). However, with these regression formulas, it is good to remember that correlation between variables does not imply causation.
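For the value-prediction case, a minimal sketch with lm() shows what "mapping a function to the data" looks like in practice. The data are simulated, and the intercept, slope, and noise level are invented for illustration.

```r
# Simulated data with a known linear trend plus noise (coefficients are invented).
set.seed(7)
x <- runif(200, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(200, sd = 1)

# Regression learns a function mapping x to an estimate of y.
fit <- lm(y ~ x)
print(coef(fit))

# Estimate the response for new, unseen x values.
new_points <- data.frame(x = c(2.5, 7.5))
estimates <- predict(fit, newdata = new_points)
print(estimates)
```

The fitted slope describes how y moves with x in this sample; on its own it says nothing about whether x causes y.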

Random sampling is core to statistics and the concept of statistical inference (Smith, 2015; Field, 2011), but it also serves a purpose in supervised learning (Ahlemeyer-Stubbe & Coleman, 2014). Random sampling selects a portion of the data from a population such that each data point has an equal chance of being selected (Smith, 2015; Huck, 2013). On average, the larger the sample, the better it represents the population (Field, 2011; Huck, 2013). Given the nature of big data (high volume, velocity, and variety), it is assumed that there is plenty of data to draw upon for a supervised machine learning algorithm. However, the larger the random sample fed into the learning algorithm, the more time it takes to process and analyze the data.
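Simple random sampling and its relationship to sample size can be sketched with sample(). The "population" here is simulated, and its distribution and the three sample sizes are invented for illustration.

```r
# Simulated "population" of one million values (the distribution is invented).
set.seed(99)
population <- rexp(1e6, rate = 1 / 50)
pop_mean <- mean(population)

# Draw simple random samples of increasing size; every record is equally likely.
sample_sizes <- c(100, 10000, 500000)
sample_means <- sapply(sample_sizes, function(k) mean(sample(population, size = k)))

# Absolute error of each sample mean relative to the population mean.
errors <- abs(sample_means - pop_mean)
print(data.frame(size = sample_sizes, error = errors))
```

Larger samples tend to land closer to the population mean, but each additional record also adds processing cost, which is the trade-off described above.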

There are also unsupervised learning algorithms, which also need training and testing but, unlike supervised learning, do not validate the model against a predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Instead, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014). Cluster analysis is an example of unsupervised learning, where the model seeks a finite set of clusters that describe the data as subsets of similar records (Ahlemeyer-Stubbe & Coleman, 2014; Fayyad et al., 1996). Finally, in supervised learning the results can be checked through estimation error. This is not so easy with unsupervised learning because it lacks a target; instead, it requires retesting to see whether the patterns are similar or repeatable (Ahlemeyer-Stubbe & Coleman, 2014).
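Cluster analysis and the idea of retesting for repeatable patterns can be sketched with kmeans() from base R. The observations are simulated around two invented centres, and k-means is only one of many clustering algorithms that could be used here.

```r
# Simulated unlabelled data drawn around two invented centres.
set.seed(11)
obs <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2)
)

# k-means looks for k groups of similar records without any target variable.
first_run <- kmeans(obs, centers = 2, nstart = 10)

# Retest with a different random start to see whether the pattern repeats.
set.seed(12)
second_run <- kmeans(obs, centers = 2, nstart = 10)

# Cross-tabulate the two assignments; a repeatable pattern puts every record
# in one cell per row (cluster labels themselves may be swapped between runs).
agreement <- table(first_run$cluster, second_run$cluster)
print(agreement)
```

With no target variable to compare against, agreement between repeated runs is one practical stand-in for the estimation error used in supervised learning.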


  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining, 17(3), 37–54.
  • Field, A. (2011). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2013). Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses, 1st Edition. [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from