Adv Quant: Birth Rate Dataset in R

Built in the R library is the Births dataset with 400,000 records and 13 variables. The following is an analysis of this dataset.

Introduction

Built in the R library is the Births dataset with 400,000 records and 13 variables.  The following is an analysis of this dataset.

Results

IP1F1

Figure 1. The first five data point entries in the births2006.smpl data set.

IP1F2

Figure 2. The frequency of births in 2006 per day of the week.

IP1F3.png

Figure 3. Histogram of 2006 births frequencies graphed by day of the week and separated by method of delivery.

IP1F4.png

Figure 4. A trellis histogram plot of 2006 birth weight per birth number.

IP1F5

Figure 5. A trellis histogram plot of 2006 birth weight per birth delivery method.

IP1F6.png

Figure 6. A boxplot of 2006 birth weight per Apgar score.

IP1F7

Figure 7. A boxplot of 2006 birth weight per day of week.

IP1F8

Figure 8. A histogram of 2006 average birth weight per multiple births separated by gender.

Discussion

Given the open-sourced nature of the R software, many libraries are being built and shared with the greater community, and the Comprehensive R Archive Network (CRAN), has a ton of these programs as part of R Packages (Schumacker, 2014).  Thus, as part of the nutshell library, there exists a data set of 2006 births called “births2006.smpl”.  To view the first few entries the head() command can be used (R, n.d.g.).  The printout from the head() command (Figure 1) shows all 13 variables of the dataset along with the first five entries in the births2006.smpl dataset.

The number of birth seems to be approximately uniform (but not precisely) during the work week, assuming Sunday is 1 and Saturday is 7.  However, Tuesday-Thursday has the highest births in the week with the weekends having the least amount of births in the week.

Breaking down the method of deliveries in 2006 per day of the week, it can be seen that Vaginal birth in all seven days of the week outnumbers C-section deliveries in 2006 (Figure 3).  Also on Tuesday-Thursday there are more vaginal births compared to those during the weekend, and in C-section deliveries, there are most deliveries occur between Tuesday-Friday, and the least amount occurs during the weekends.

Breaking down the number of births frequencies per birth weight (Figure 4), it can be seen that the normal distribution of birth weight in grams shifts to the left as the number of multiple births increases.  This seems to suggest that babies born as a set of twins, triplets, etc. have lower birth rates on average and per distribution.  Birth weight is almost normally distributed for the single child birth but begins to lose normality as the number of births increases.

Further analysis of birth weights in 2006, per delivery method, shows that for whether or not the delivery method is known or not and its type of delivery method doesn’t play too much of a huge role in the determination of the child’s birth weight (Figure 5).  Statistical tests and effect size analysis could be conducted to verify and enhance the discussion and this assertion that is made through the graphical representation in Figure 5.

Apgar test is tested on the child after one and five minutes of birth looking at the skin color, heart rate, reflexes, muscle tone, and respiration rate of the child, where 10 is the highest but rarely obtain score (Hirsch, 2014).  Thus, observing the Apgar score variable (1-10) on birth weight in grams those with higher Apgar scores had on average higher median birth weights.  Typically, as Apgar score increases the tighter the distribution becomes, and the more outliers begin to appear (disregarding the results from Apgar score of 1).  These results from the boxplots tend to confirm Hirsch (2014) assertion that higher Apgar scores are harder to obtain.

Looking at the boxplot analysis of birth weight per day of the week (Figure 7) shows that the median, Q1, Q3, max, and min are normally distributed and unchanging per day of the week.  Outliers, the heavier babies, tend to occur without respect of the day of the week, and also appears to have little to no effect on the distribution of birth weight per day of the week.

Finally, looking at a mean birth weight per gender and per multiple births, shows a similar distribution of males and females (Figure 8). The main noticeable difference is the male Quintuplet or higher number of births on average weigh more than the corresponding female Quintuplet or higher number of births.  This chart also confirms the conclusions made (from Figure 4) where as the number of births increases the average weight of the children decrease.

In conclusion, the day of the week doesn’t predict birth weights, but probably birth frequency. In general, babies are heavier if they are single births and if they achieve Apgar score of 10.  Birth weights are not predictable through delivery method.  All of these conclusions are made on the visual representation of the dataset births2006.smpl.  What would increase the validity of these statements would be to conduct statistical significance tests and the effect size, to add further weight to what could be derived from through these images.

Code

#
## Use R to analyze the Birth dataset. 
## The Birth dataset is in the Nutshell library. 
##  • SEX and APGAR5 (SEX and Apgar score) 
##  • DPLURAL (single or multiple birth) 
##  • WTGAIN (weight gain of mother) 
##  • ESTGEST (estimated gestation in weeks) 
##  • DOB_MM, DOB_WK (month and day of week of birth) 
##  • BWT (birth weight) 
##  • DMETH_REC (method of delivery)
#
install.packages(“nutshell”)
library(nutshell)
data(births2006.smpl)

# First, list the data for the first 5 births. 
head(births2006.smpl)

# Next, show a bar chart of the frequencies of births according to the day of the week of the birth.
births.dayofweek = table(births2006.smpl$DOB_WK) #Goal of this variable is to speed up the calculations
barplot(births.dayofweek, ylab=”frequency”, xlab=”Day of week”, col = “darkred”, main= “Number of births in 2006 per day of the week”)

# Obtain frequencies for two-way classifications of birth according to the day of the week and the method of delivery.
births.methodsVdaysofweek = table(births2006.smpl$DOB_WK,births2006.smpl$DMETH_REC) 
head(births.methodsVdaysofweek,7)
barplot(births.methodsVdaysofweek[,-2], col=heat.colors(length(rownames(births.methodsVdaysofweek))), width=2, beside=TRUE, main = “bar plot of births per method per day of the week”)
legend (“topleft”, fill=heat.colors(length(rownames(births.methodsVdaysofweek))),legend=rownames(births.methodsVdaysofweek))

# Use lattice (trellis) graphs (R package lattice) to condition density histograms on the values of a third variable. 
library(lattice)

# The variable for multiple births and the method of delivery are conditioning variables. 
# Separate the histogram of birth weight according to these variable.
histogram(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),col=”black”, xlab = “birth weight”, main = “trellis plot of birth weight vs birth number”)

histogram(~DBWT|DMETH_REC,data=births2006.smpl,layout=c(1,3),col=”black”, xlab = “birth weight”, main = “trellis plot of birth weight vs birth method”)

# Do a box plot of birth weight against Apgar score and box plots of birth weight by day of week of delivery. 
boxplot(DBWT~APGAR5,data=births2006.smpl,ylab=”birth weight”,xlab=”AGPAR5″, main=”Boxplot of birthweight per Apgar score”)

boxplot(DBWT~DOB_WK,data=births2006.smpl,ylab=”birth weight”,xlab=”Day of Week”, main=”Boxplot of birthweight per day of week”)

# Calculate the average birth weight as a function of multiple births for males and females separately. 
# Use the “tapply” function, and for missing values use the “option nz.rm=TRUE.” 
listed = list(births2006.smpl$DPLURAL,births2006.smpl$SEX)
tapplication=tapply(births2006.smpl$DBWT,listed,mean,na.rm=TRUE)
barplot(tapplication,ylab=”birth weight”, beside=TRUE, legend=TRUE,xlab=”gender”, main = “bar plot of average birthweight per multiple births by gender”)

References

  • CRAN (n.d.). Using lattice’s historgram (). Retrieved from https://cran.r-project.org/web/packages/tigerstats/vignettes/histogram.html
  • Hirsch, L. (2014). About the Apgar score. Retrieved from http://kidshealth.org/en/parents/apgar.html#
  • R (n.d.a.). Add legends to plots. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/legend.html
  • R (n.d.b.). Apply a function over a ragged array. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/tapply.html
  • R (n.d.c.). Bar plots. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html
  • R (n.d.d.). Cross tabulation and table creation. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
  • R (n.d.e.). List-Generic and dotted pairs. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.html
  • R (n.d.f.). Produce box-and-wisker plot(s) of a given (grouped) values.  Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/boxplot.html
  • R (n.d.g.). Return the first or last part of an object. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/utils/html/head.html
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

Quant: Exploring Data with SPSS

The aim of this analysis is to run a distribution analysis on diastolic blood pressure (DBP58), examining the following for individuals who have had no history of cardiovascular heart disease and individuals with a history of cardiovascular heart disease (CHD). The variable that looks at individual history is CHD.

Introduction

The aim of this analysis is to run a distribution analysis on diastolic blood pressure (DBP58), examining the following for individuals who have had no history of cardiovascular heart disease and individuals with a history of cardiovascular heart disease (CHD). The variable that looks at individual history is CHD.

From the SPSS outputs the following questions will be addressed:

  • What can be determined from the measures of skewness and kurtosis about a normal curve? What are the mean and median?
  • Does one seem better than the other to represent the scores?
  • What differences can be seen in the pattern of responses of those with history versus those with no history?
  • What information can be determined from the box plots?

Methodology

For this project, the electric.sav file is loaded into SPSS (Electric, n.d.).  The goal is to look at the relationships between the following variables: DBP58 (Average Diastolic Blood Pressure) and CHD (Incidence of Coronary Heart Disease). To conduct a descriptive analysis, navigate through Analyze > Descriptive Analytics > Explore.  The variable DBP58 was placed in the “Dependent List” box, and CHD was placed on the “Factor List” box.  Then on the Explore dialog box, “Statistics” button was clicked, and in this dialog box “Descriptives” at the 95% “Confidence interval for the mean” is selected along with outliers and percentiles.  Then going back to the on the Explore dialog box, “Plots” button was clicked, and in this dialog box under the “Boxplot” section only “Factor levels together” was selected, under the “Descriptive” section, both options were selected, and the “Spread vs. Level with Levene Test” section, “None” was selected.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.). The following output was observed in the next four tables and five figures.

Results

Table 1: Case Processing Summary.

Incidence of Coronary Heart Disease Cases
Valid Missing Total
N Percent N Percent N Percent
Average Diast Blood Pressure 58 none 119 99.2% 1 0.8% 120 100.0%
chd 120 100.0% 0 0.0% 120 100.0%

According to Table 1, 99.2% or greater of the data is valid and not missing for when there is a history of Coronary Heart Disease (CHD) and when there isn’t. There is one missing data point in the case with no history of CHD. This data set contains 120 participants.

Table 2: Descriptive Statistics on the Incidents of Coronary Heart Disease and the Average Diastolic Blood Pressure.

Incidence of Coronary Heart Disease Statistic Std. Error
Average Diast Blood Pressure 58 none Mean 87.66 1.005
95% Confidence Interval for Mean Lower Bound 85.66
Upper Bound 89.65
5% Trimmed Mean 87.31
Median 87.00
Variance 120.312
Std. Deviation 10.969
Minimum 65
Maximum 125
Range 60
Interquartile Range 15
Skewness .566 .222
Kurtosis .671 .440
chd Mean 89.92 1.350
95% Confidence Interval for Mean Lower Bound 87.24
Upper Bound 92.59
5% Trimmed Mean 88.89
Median 87.00
Variance 218.732
Std. Deviation 14.790
Minimum 65
Maximum 160
Range 95
Interquartile Range 18
Skewness 1.406 .221
Kurtosis 3.620 .438

According to Table 2, there is a difference in the mean by +2 points and +0.345 in standard error in Diastolic Blood Pressure with CHD compared to when there isn’t.  The median for both cases of CHD or not are 87, with the mean for patients with CHD 89.92 (slightly skewed) and that can be seen with a skewness of 1.406 and a kurtosis of 3.620.  For the cases without a CHD, the mean blood pressure is 87.66 (showing little to now skewness in the data), as evident by the skewness of 0.566 and kurtosis of 0.671.  Upon further inspection of Figures 1 & 2, the skewness or lack thereof seems to appear to be the result of some outliers. The box plot in Figure 3 confirms these outliers.  The kurtosis values of 0.671 and 3.620 indicate they are Leptokurtic, which means they have higher peaks in their distribution and deviate from a normal distribution.

u2db3f1

Figure 1: Histogram on the Incidents of Coronary Heart Disease = none and the Average Diastolic Blood Pressure.

u2db3f2.png

Figure 2: Histogram on the Incidents of Coronary Heart Disease = chd and the Average Diastolic Blood Pressure.

u2db3f3.png

Figure 3: Box plots on the Incidents of Coronary Heart Disease and the Average Diastolic Blood Pressure.

Comparing the two histograms in Figures 1 & 2, there is a negative skewness to the data when there is CHD compared to when there isn’t.  The spread between the two histograms increases by about 3.7 points (the standard deviation from the mean) when there is CHD.  This shows that blood pressure in the sample population can vary greatly if there is CHD, whereas blood pressure is a bit more stable in the sample population that doesn’t have CHD.  Looking at the range of these the average diastolic blood pressure, if there is a CHD, then it increases, which is supported by the greater standard deviation number, and can be seen in Figure 3.  In the case with no CHD the interquartile range (which represents the middle 50% of the participants) is smaller than the participants with CHD. Participant 120 was excluded from the interquartile range due to its extreme nature.

Table 3: Percentiles on the Incidents of Coronary Heart Disease and the Average Diastolic Blood Pressure.

Incidence of Coronary Heart Disease Percentiles
5 10 25 50 75 90 95
Weighted Average (Definition 1) Average Diast Blood Pressure 58 none 71.00 75.00 80.00 87.00 95.00 102.00 105.00
chd 70.05 75.00 80.00 87.00 98.00 109.90 117.95
Tukey’s Hinges Average Diast Blood Pressure 58 none 80.00 87.00 94.50
chd 80.00 87.00 98.00

In Table 3, the percentiles on the incidents of CHD on the average diastolic blood pressure is mapped out.  95 % of all cases exist below 105 (117.95) diastolic blood pressure for no history of CHD (for the history of CHD).  These percentiles show that in the case where there is no CHD, the diastolic blood pressure values are centered more towards the median value of 87, which is supported by the above-mentioned Tables and Figures.

Table 4: Extreme Values on the Incidents of Coronary Heart Disease and the Average Diastolic Blood Pressure.

Incidence of Coronary Heart Disease Case Number Value
Average Diast Blood Pressure 58 none Highest 1 163 125
2 232 119
3 144 115
4 126 110
5 131 109
Lowest 1 157 65
2 156 65
3 175 68
4 153 68
5 237 69
chd Highest 1 120 160
2 56 133
3 42 125
4 26 121
5 111 120
Lowest 1 73 65
2 34 68
3 101 70
4 33 70
5 7 70a
a. Only a partial list of cases with the value 70 are shown in the table of lower extremes.

Examining the extreme values through Table 4, the top 5 and lowest 5 cases are considered.  In the case were there is no CHD, the lowest diastolic blood pressure value can be seen as 65 which is the same as those with CHD.  However, in the highest diastolic blood pressure value, there is a 35 point greater difference for the highest case with CHD on the highest case without CHD.

  •  Frequency    Stem &  Leaf
  •       .00        6 .
  •      5.00        6 .  55889
  •      4.00        7 .  1144
  •     18.00        7 .  555677777777888899
  •     21.00        8 .  000000000001122223344
  •     21.00        8 .  555556666777777888999
  •     20.00        9 .  00000111111222233334
  •     14.00        9 .  55666777888899
  •      8.00       10 .  00012233
  •      4.00       10 .  5559
  •      1.00       11 .  0
  •      1.00       11 .  5
  •      2.00 Extremes    (>=119)
  •  Stem width:   10
  •  Each leaf:        1 case(s)

Figure 4: Stem and leaf plot on the Incidents of Coronary Heart Disease = none and the Average Diastolic Blood Pressure.

  •  Frequency    Stem &  Leaf
  •       .00        6 .
  •      2.00        6 .  58
  •      9.00        7 .  000012233
  •     14.00        7 .  55555677788899
  •     23.00        8 .  00000000000111233333344
  •     24.00        8 .  555556667777777788999999
  •     11.00        9 .  00001122223
  •     13.00        9 .  6677788888999
  •      5.00       10 .  02333
  •      7.00       10 .  5557789
  •      4.00       11 .  0003
  •      3.00       11 .  578
  •      2.00       12 .  01
  •      1.00       12 .  5
  •      2.00 Extremes    (>=133)
  •  Stem width:   10
  •  Each leaf:        1 case(s)

Figure 5: Stem and leaf plot on the Incidents of Coronary Heart Disease = chd and the Average Diastolic Blood Pressure.

Figures 4 and 5 show more detail than the histogram information by stating the actual frequency to the left of the Stem values as well as stating what is considered to be extreme values.  In the case of CHD, a diastolic blood pressure greater than 133 is considered an outlier and when there is no CHD the extreme values are considered to be a diastolic blood pressure of 119 or more.

Conclusions

There is a difference between the distributions of those participants that have a history of Coronary Heart Disease (CHD) and those that don’t on their average diastolic blood pressure.  This is represented through the range, skewness, and distribution between both groups.  Both groups have similar medians, and lowest values, but vary greatly in the mean, standard deviation and highest values of diastolic blood pressure.

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

EXAMINE VARIABLES=dbp58 BY chd

  /PLOT BOXPLOT STEMLEAF HISTOGRAM

  /COMPARE GROUPS

  /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE

  /STATISTICS DESCRIPTIVES EXTREME

  /CINTERVAL 95

  /MISSING LISTWISE

  /NOTOTAL.

References: