Adv Quant: Birth Rate Dataset in R

Introduction

The nutshell package for R includes births2006.smpl, a dataset of 2006 births with more than 400,000 records and 13 variables. The following is an analysis of this dataset.

Results

Figure 1. The first five entries in the births2006.smpl data set.

Figure 2. The frequency of births in 2006 per day of the week.

Figure 3. Bar plot of 2006 birth frequencies by day of the week, separated by method of delivery.

Figure 4. A trellis histogram plot of 2006 birth weight per birth number.

Figure 5. A trellis histogram plot of 2006 birth weight per birth delivery method.

Figure 6. A boxplot of 2006 birth weight per Apgar score.

Figure 7. A boxplot of 2006 birth weight per day of week.

Figure 8. A bar plot of 2006 average birth weight per multiple births, separated by gender.

Discussion

Given the open-source nature of R, many libraries are built and shared with the wider community, and the Comprehensive R Archive Network (CRAN) hosts many of them as R packages (Schumacker, 2014). One of these, the nutshell package, contains a data set of 2006 births called “births2006.smpl”. The head() command shows the first few entries of a data set (R, n.d.g.). Its printout (Figure 1) shows all 13 variables along with the first five entries in births2006.smpl.

The number of births per day is approximately (but not exactly) uniform across the work week, assuming Sunday is 1 and Saturday is 7 (Figure 2). Tuesday through Thursday have the most births, and the weekend days have the fewest.

Breaking down deliveries by method and day of the week shows that vaginal births outnumber C-section deliveries on all seven days (Figure 3). More vaginal births occur on Tuesday through Thursday than on the weekend. Most C-section deliveries occur between Tuesday and Friday, and the fewest occur on the weekend.

Breaking down birth weight by the number of births (Figure 4) shows that the distribution of birth weight in grams shifts to the left as the number of multiple births increases. This suggests that babies born as twins, triplets, etc. have lower birth weights, both on average and across the distribution. Birth weight is almost normally distributed for single births but begins to lose normality as the number of births increases.

Further analysis of 2006 birth weights by delivery method shows that the delivery method, whether known or unknown, does not appear to play a large role in determining a child’s birth weight (Figure 5). Statistical tests and effect size analysis could be conducted to verify this assertion, which rests only on the graphical representation in Figure 5.

The Apgar test is given to the child one and five minutes after birth and looks at skin color, heart rate, reflexes, muscle tone, and respiration rate, where 10 is the highest but rarely obtained score (Hirsch, 2014). Plotting birth weight in grams against the Apgar score variable (1-10) shows that babies with higher Apgar scores had, on average, higher median birth weights (Figure 6). Typically, as the Apgar score increases, the distribution becomes tighter and more outliers appear (disregarding the results for an Apgar score of 1). These boxplot results tend to confirm Hirsch’s (2014) assertion that higher Apgar scores are harder to obtain.

The boxplots of birth weight per day of the week (Figure 7) show that the median, Q1, Q3, maximum, and minimum are essentially unchanged from day to day. Outliers, the heavier babies, occur regardless of the day of the week and appear to have little to no effect on the distribution of birth weight per day.

Finally, the mean birth weight per gender and per multiple births shows a similar pattern for males and females (Figure 8). The main noticeable difference is that males in quintuplet or higher-order births weigh more on average than the corresponding females. This chart also supports the conclusion drawn from Figure 4: as the number of births increases, the average weight of the children decreases.

In conclusion, the day of the week does not appear to predict birth weight, but it probably predicts birth frequency. In general, babies are heavier if they are single births and if they achieve an Apgar score of 10. Birth weight does not appear to be predictable from the delivery method. All of these conclusions rest on visual representations of the births2006.smpl dataset. Statistical significance tests and effect size calculations would increase the validity of these statements and add weight to what can be derived from these images.
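The significance test and effect size suggested in the conclusion could be sketched as follows. This is only an illustration under stated assumptions: the data frame is simulated (the group means and standard deviations are made up) so that the block runs without the nutshell package, and eta squared is used as one possible effect size measure. With the real data, births2006.smpl would take the place of the simulated data frame.

```r
# Hypothetical stand-in for births2006.smpl: simulated birth weights (grams)
# for two delivery methods. The means and SDs are invented for illustration.
set.seed(42)
births <- data.frame(
  DMETH_REC = factor(rep(c("C-section", "Vaginal"), each = 500)),
  DBWT = c(rnorm(500, mean = 3250, sd = 600),
           rnorm(500, mean = 3300, sd = 550))
)

# One-way ANOVA: does mean birth weight differ by delivery method?
fit <- aov(DBWT ~ DMETH_REC, data = births)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Eta squared: between-group sum of squares divided by total sum of squares.
eta_sq <- anova_table[["Sum Sq"]][1] / sum(anova_table[["Sum Sq"]])
print(eta_sq)
```

A small eta squared alongside a significant F test would indicate a difference that is detectable but of little practical importance.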

Code

#
## Use R to analyze the Birth dataset.
## The Birth dataset is in the nutshell library.
##  • SEX and APGAR5 (SEX and Apgar score)
##  • DPLURAL (single or multiple birth)
##  • WTGAIN (weight gain of mother)
##  • ESTGEST (estimated gestation in weeks)
##  • DOB_MM, DOB_WK (month and day of week of birth)
##  • DBWT (birth weight)
##  • DMETH_REC (method of delivery)
#
install.packages("nutshell")
library(nutshell)
data(births2006.smpl)

# First, list the data for the first 5 births (head() defaults to 6 rows).
head(births2006.smpl, 5)

# Next, show a bar chart of the frequencies of births according to the day of the week of the birth.
births.dayofweek = table(births2006.smpl$DOB_WK) # Tabulate once and reuse the result in the plot
barplot(births.dayofweek, ylab="frequency", xlab="Day of week", col = "darkred", main= "Number of births in 2006 per day of the week")

# Obtain frequencies for two-way classifications of birth according to the day of the week and the method of delivery.
births.methodsVdaysofweek = table(births2006.smpl$DOB_WK,births2006.smpl$DMETH_REC)
head(births.methodsVdaysofweek,7)
# Drop the second column (unknown delivery method) and plot one bar per day within each method.
barplot(births.methodsVdaysofweek[,-2], col=heat.colors(length(rownames(births.methodsVdaysofweek))), width=2, beside=TRUE, main = "bar plot of births per method per day of the week")
legend("topleft", fill=heat.colors(length(rownames(births.methodsVdaysofweek))),legend=rownames(births.methodsVdaysofweek))

# Use lattice (trellis) graphs (R package lattice) to condition density histograms on the values of a third variable.
library(lattice)

# The variable for multiple births and the method of delivery are conditioning variables.
# Separate the histogram of birth weight according to these variables.
histogram(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),col="black", xlab = "birth weight", main = "trellis plot of birth weight vs birth number")

histogram(~DBWT|DMETH_REC,data=births2006.smpl,layout=c(1,3),col="black", xlab = "birth weight", main = "trellis plot of birth weight vs birth method")

# Do a box plot of birth weight against Apgar score and box plots of birth weight by day of week of delivery.
boxplot(DBWT~APGAR5,data=births2006.smpl,ylab="birth weight",xlab="APGAR5", main="Boxplot of birthweight per Apgar score")

boxplot(DBWT~DOB_WK,data=births2006.smpl,ylab="birth weight",xlab="Day of Week", main="Boxplot of birthweight per day of week")

# Calculate the average birth weight as a function of multiple births for males and females separately.
# Use the "tapply" function, and for missing values use the option na.rm=TRUE.
listed = list(births2006.smpl$DPLURAL,births2006.smpl$SEX)
tapplication=tapply(births2006.smpl$DBWT,listed,mean,na.rm=TRUE)
barplot(tapplication,ylab="birth weight", beside=TRUE, legend=TRUE,xlab="gender", main = "bar plot of average birthweight per multiple births by gender")

References

  • CRAN (n.d.). Using lattice’s histogram(). Retrieved from https://cran.r-project.org/web/packages/tigerstats/vignettes/histogram.html
  • Hirsch, L. (2014). About the Apgar score. Retrieved from http://kidshealth.org/en/parents/apgar.html#
  • R (n.d.a.). Add legends to plots. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/legend.html
  • R (n.d.b.). Apply a function over a ragged array. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/tapply.html
  • R (n.d.c.). Bar plots. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html
  • R (n.d.d.). Cross tabulation and table creation. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
  • R (n.d.e.). List-Generic and dotted pairs. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.html
  • R (n.d.f.). Produce box-and-whisker plot(s) of the given (grouped) values. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/boxplot.html
  • R (n.d.g.). Return the first or last part of an object. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/utils/html/head.html
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

Quant: Compelling topics

A discussion on what were the most compelling topics learned in the subject of Quantitative Analysis.

Most Compelling Topics

Field (2013) states that quantitative and qualitative methods are complementary, not competing, approaches to solving the world’s problems, although the methods are quite different from each other. Simply put, quantitative methods are used when the research variables are numerical, and qualitative methods are used when the research variables are based on language (Field, 2013). Thus, central to quantitative research is understanding the numerical, ordinal, or categorical dataset and what the data represent. This can be done through descriptive statistics, where the researcher uses statistics to describe a data set, or through inferential statistics, where conclusions are drawn from the data set (Miller, n.d.).

Field (2013) and Schumacker (2014) define central tendency as an all-encompassing term for describing the “center of a frequency distribution” through the commonly used measures of mean, median, and mode. Outliers, missing values, multiplying by a constant, and adding a constant are factors that affect central tendency (Schumacker, 2014). Besides looking at a single measure, researchers can compare the mean and median to understand how skewed the data are and in which direction. Heavily skewed distributions increase the distance between these two values, and if the mean is less than the median, the distribution is negatively skewed (Field, 2013). To understand the distribution better, other measures such as the variance and standard deviation can be used.
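As a small illustration of comparing the mean and median, the sketch below uses a made-up vector of scores with one very low value. The data and variable names are hypothetical and serve only to show the direction of skew and the effect of adding or multiplying by a constant.

```r
# Hypothetical scores with a long left tail (one very low value).
scores <- c(2, 9, 10, 10, 11, 11, 12, 12, 12, 13)

mean_score <- mean(scores)      # pulled toward the low outlier
median_score <- median(scores)  # resistant to the outlier

# A mean below the median suggests negative (left) skew.
skew_direction <- if (mean_score < median_score) {
  "negative"
} else if (mean_score > median_score) {
  "positive"
} else {
  "none"
}
print(c(mean = mean_score, median = median_score))
print(skew_direction)

# Adding a constant shifts the mean; multiplying by a constant scales it.
print(mean(scores + 5))
print(mean(scores * 2))
```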

Variance and standard deviation are measures of dispersion, where the variance is a measure of average dispersion (Field, 2013; Schumacker, 2014). Variance is a numerical value that describes how the observed data values are spread across the distribution and how far they differ from the mean on average (Huck, 2011; Field, 2013; Schumacker, 2014). A smaller variance indicates that the observed data values are close to the mean, and a larger variance indicates the opposite (Field, 2013).
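The idea that a smaller variance means values sit closer to the mean can be checked with two made-up samples that share the same mean. The vectors below are hypothetical. Note that R’s var() computes the sample variance, dividing by n - 1.

```r
# Two hypothetical samples with the same mean (10) but different spread.
tight <- c(9, 10, 10, 11)
wide  <- c(2, 8, 12, 18)

# Sample variance by hand: sum of squared deviations divided by (n - 1).
manual_var <- sum((wide - mean(wide))^2) / (length(wide) - 1)

print(c(var_tight = var(tight), var_wide = var(wide)))
print(c(sd_tight = sd(tight), sd_wide = sd(wide)))
print(manual_var)
```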

Rarely is every member of a population studied. Instead, a sample is randomly taken to represent that population for analysis in quantitative research (Gall, Gall, & Borg, 2006). Ultimately, the insights gained from this type of research should be impersonal, objective, and generalizable. To generalize the results, the insights gained from a sample must rest on the correct mathematical procedures for using probabilities and information, that is, statistical inference (Gall et al., 2006). Gall et al. (2006) stated that statistical inference dictates the order of procedures: for instance, a hypothesis and a null hypothesis must be defined before a statistical significance level, which in turn must be defined before calculating a z or t statistic. Essentially, statistical inference allows quantitative researchers to make inferences about a population. Researchers must also remember where the data on that population were generated and collected during the quantitative research process.

Most flaws in research methodology exist because validity and reliability were not established (Gall et al., 2006). Thus, it is important to ensure a valid and reliable assessment instrument. When using an existing survey as an assessment instrument, one should report the instrument’s development, items, scales, and evidence of reliability and validity from past uses (Creswell, 2014; Joyner, 2012). Permission to use any instrument must be secured and placed in the appendix (Joyner, 2012). The validity of the assessment instrument is key to drawing meaningful and useful statistical inferences (Creswell, 2014).

By sampling a population and using a valid and reliable survey instrument, attitudes and opinions about a population can be correctly inferred from the sample (Creswell, 2014). Sometimes a survey instrument does not fit those in the target group, and it would then produce neither valid nor reliable inferences for the targeted population. One must select a targeted population and determine the size of that stratified population (Creswell, 2014).

Parametric statistics are inferential and based on random sampling from a distinct population, with the sample data used to make strict inferences about the population’s parameters; tests such as t-tests and F-tests (ANOVA) fall in this group (Huck, 2011; Schumacker, 2014). Nonparametric statistics, or “assumption-free tests”, are used for ranked or categorical data and include the Mann-Whitney U-test, Wilcoxon signed-rank test, Kruskal-Wallis H-test, and chi-square test (Field, 2013; Huck, 2011).

First, there is a need to define the types of data.  Continuous data is interval/ratio data, and categorical data is nominal/ordinal data.  Modified from Schumacker (2014) with data added from Huck (2011):

Statistic                          Dependent Variable   Independent Variable
Analysis of Variance (ANOVA)
     One way                       Continuous           Categorical
t-Tests
     Single Sample                 Continuous
     Independent groups            Continuous           Categorical
     Dependent (paired groups)     Continuous           Categorical
Chi-square                         Categorical          Categorical
Mann-Whitney U-test                Ordinal              Ordinal
Wilcoxon                           Ordinal              Ordinal
Kruskal-Wallis H-test              Ordinal              Ordinal
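For readers working in R, each test in the table has a counterpart in base R’s stats package. The sketch below maps the rows to function calls. All data are simulated and every variable name is hypothetical, so the output values carry no meaning beyond showing the calls.

```r
set.seed(1)
# Simulated data: a continuous outcome with two-level and three-level grouping factors.
group2 <- factor(rep(c("a", "b"), each = 20))
group3 <- factor(rep(c("a", "b", "c"), each = 20))
y2 <- rnorm(40, mean = ifelse(group2 == "a", 50, 55), sd = 5)
y3 <- rnorm(60, mean = 50, sd = 5)
before <- rnorm(20, mean = 50, sd = 5)
after <- before + rnorm(20, mean = 2, sd = 3)

anova_1w <- summary(aov(y3 ~ group3))               # one-way ANOVA
t_single <- t.test(y2, mu = 50)                     # single-sample t-test
t_indep  <- t.test(y2 ~ group2)                     # independent groups t-test
t_paired <- t.test(before, after, paired = TRUE)    # dependent (paired) t-test

counts <- matrix(c(30, 20, 25, 25), nrow = 2)       # 2 x 2 table of counts
chi <- chisq.test(counts)                           # chi-square test

mw <- wilcox.test(y2 ~ group2)                      # Mann-Whitney U-test
ws <- wilcox.test(before, after, paired = TRUE)     # Wilcoxon signed-rank test
kw <- kruskal.test(y3 ~ group3)                     # Kruskal-Wallis H-test

print(t_indep$p.value)
print(kw$p.value)
```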

Meaningful results are reported along with their statistical significance, confidence intervals, and effect sizes (Creswell, 2014). If the results from a statistical test have a low probability of occurring by chance (5% or 1% or less), then the test is considered significant (Creswell, 2014; Field, 2013; Huck, 2011). Statistical significance tests can produce different values for the same effect (Field, 2013). In large samples, small differences can show up as statistically significant, while in smaller samples large differences may be deemed insignificant (Field, 2013). Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011). Huck (2011) stated that two main factors that could influence whether a result is statistically significant are the quality of the research question and the research design.

Huck (2011) suggested that after statistical significance is calculated and the researcher has either rejected or failed to reject a null hypothesis, effect size analysis should be conducted. The effect size allows researchers to measure objectively the magnitude or practical significance of the research findings by looking at the differential impact of the variables (Huck, 2011; Field, 2013). Field (2013) defines one measure of effect size, Cohen’s d: d = (Avg(x1) – Avg(x2))/(standard deviation). If d = 0.2 there is a small effect, if d = 0.5 a moderate effect, and if d = 0.8 or more a large effect (Field, 2013; Huck, 2011). This is why a statistical test can yield a statistically significant value while further analysis with effect size shows that those results do not explain much of what is happening in the total relationship.
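Cohen’s d and its benchmarks can be sketched in a few lines of R. Two assumptions are made here that the text does not spell out: the “standard deviation” in the formula is taken to be the pooled standard deviation of the two groups, and values below 0.2 are labeled “negligible”. The function names and all data are hypothetical.

```r
# Cohen's d for two independent groups.
# Assumption: the denominator is the pooled standard deviation.
cohens_d <- function(x1, x2) {
  n1 <- length(x1)
  n2 <- length(x2)
  pooled_sd <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / pooled_sd
}

# Label the magnitude using the 0.2 / 0.5 / 0.8 benchmarks.
effect_label <- function(d) {
  d <- abs(d)
  if (d >= 0.8) "large" else if (d >= 0.5) "moderate" else if (d >= 0.2) "small" else "negligible"
}

# Hypothetical scores for two small groups.
group1 <- c(12, 14, 16, 18, 20)
group2 <- c(10, 12, 14, 16, 18)
d_small_sample <- cohens_d(group1, group2)
print(d_small_sample)
print(effect_label(d_small_sample))

# Simulated large samples with a tiny true difference: the t-test is
# statistically significant, yet the effect size stays negligible.
set.seed(1)
big1 <- rnorm(100000, mean = 0.05, sd = 1)
big2 <- rnorm(100000, mean = 0.00, sd = 1)
p_big <- t.test(big1, big2)$p.value
d_big <- cohens_d(big1, big2)
print(c(p_value = p_big, d = d_big))
```

The second half illustrates the point about large samples: significance alone says little about the magnitude of a difference.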

In regression analysis, it should be possible to predict the dependent variable from the independent variables, subject to two conditions: (1) the assessment tool is valid and reliable (Creswell, 2014), and (2) the sample size is large enough to conduct the analysis and to draw statistical inferences about the population from the sample data collected (Huck, 2011). Assuming these two conditions are met, regression analysis can be run on the data to create a prediction formula. Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011).

When modeling to predict the dependent variable from the independent variable, the regression model with the strongest correlation is used, because that regression formula best explains the variance between the variables. However, even if the regression formula can predict some or most of the variance between the variables, it never implies causation (Field, 2013). Correlations help define the strength of the regression formula in describing the relationships between the variables and can vary from -1 to +1. The closer the correlation coefficient is to -1 or +1, the better the regression formula predicts the variance between the variables. The closer the correlation coefficient is to zero, the weaker the relationship between the variables (Field, 2013; Huck, 2011; Schumacker, 2014). It should never be forgotten that correlation does not imply causation, but the squared correlation value (r^2) gives the proportion of the variance between the variables that is explained by the regression formula (Field, 2013).
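The link between the correlation coefficient and the variance explained by a simple regression can be checked numerically. The sketch below uses simulated data (the slope, intercept, and noise level are invented) and confirms that, for a regression with a single predictor, the squared correlation equals the R-squared reported by lm().

```r
# Simulated data: a linear relationship plus random noise (values are made up).
set.seed(7)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 10)

r <- cor(x, y)                       # Pearson correlation, between -1 and +1
fit <- lm(y ~ x)                     # simple linear regression
r_squared <- summary(fit)$r.squared  # proportion of variance explained

print(c(r = r, r_squared_from_r = r^2, r_squared_from_lm = r_squared))
```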

 

References:

  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Gall, M. D., Gall, J., & Borg W. (2006). Educational research: An introduction (8th ed.). Pearson Learning Solutions. VitalBook file.
  • Huck, S. W. (2011) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Joyner, R. L. (2012) Writing the Winning Thesis or Dissertation: A Step-by-Step Guide (3rd ed.). Corwin. VitalBook file.
  • Miller, R. (n.d.). Week 1: Central tendency [Video file]. Retrieved from http://breeze.careeredonline.com/p9fynztexn6/?launcher=false&fcsContent=true&pbMode=normal
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.

Quant: In-depth Analysis in SPSS

Abstract

This short analysis attempts to understand the relationship between marital happiness level and combined income. It was found that marital happiness levels depend on a couple’s combined income, but the happiest couples were happy regardless of how much money they had. This quantitative analysis of the sample data has shown that when happiness levels are low, there is a higher chance of lower levels of combined income.

Introduction

Mulligan (1973) was one of the first to state that arguments about money were among the top reasons for divorce. Financial arguments can stem from: goals and savings; record keeping; delaying tactics; apparel cost-cutting strategies; controlling expenditures; financial statements; do-it-yourself techniques; and cost-cutting techniques (Lawrence, Thomasson, Wozniak, & Prawitz, 1993). Lawrence et al. (1993) assert that financial arguments are common within families. However, when does money stop being an issue? Does an increase in combined family income affect marital happiness levels? This analysis attempts to answer these questions.

Methods

Crosstabulation was conducted to get a descriptive exploration of the data. Boxplots helped show the spread and distribution of combined income per marital happiness level. In this analysis, two alternative hypotheses will be tested:

  • There is a difference between the mean values of combined income per marital happiness level.
  • There is a dependence between combined income and marital happiness level.

To test these hypotheses, a one-way analysis of variance and a two-way chi-square test were conducted, respectively.

Results

Table 1: Case processing summary for analyzing happiness level versus family income.

Table 2: Crosstabulation for analyzing happiness level versus family income (<$21,250).

Table 3: Crosstabulation for analyzing happiness level versus family income (>$21,250).

Table 4: Chi-square test for analyzing happiness level versus family income.

Table 5: Analysis of Variance for analyzing happiness level versus family income.

Figure 1: Boxplot diagram per happiness level of a marriage versus the family incomes.

Figure 2: Line diagram per happiness level of a marriage versus the mean of the family incomes.

Discussions and Conclusions

There are 1419 participants, and only 38.5% responded to both the happiness of marriage and family income questions (Table 1). One contributor to this high nonresponse rate may be participants who were not married, for whom the happiness of marriage question did not apply. Thus, a future version of this survey instrument should include an N/A classification to see whether the response rate improves. With 547 responses still available, there is information to be gained from analyzing this data.

As a family unit gains more income, its happiness level increases (Tables 2-3). As the dollar value increases, the percentage within family income (ranges recoded to midpoints) for the very happy category increases as well, from the 50% to the 75% level. The unhappiest couples seem to earn combined incomes of $7,500-9,000 and $27,500-45,000. For marriages that are pretty happy, the share is fairly stable at 30-40% of respondents at $13,750 or more.

The mean values of family income by happiness level (Figure 2) show that, on average, happier couples make more money together. A closer examination using boxplots (Figure 1) shows that the happiest couples seem to be happy regardless of how much money they make, as the whiskers of the boxplot extend far from the median. One interesting feature is that the spread of combined family income shrinks as happiness decreases (Figure 1). This could suggest that although money is not a major factor for happy couples, unhappiness could be driven by lower combined incomes.

The chi-square test shows a statistically significant association between combined family income and marital happiness, allowing us to reject null hypothesis #2, which stated that these two variables were independent of each other (Table 4). The analysis of variance, however, does not allow a rejection of null hypothesis #1, which states that the mean incomes are equal across the marital happiness groups (Table 5).

There could be many reasons for this result, so future work could include analyzing other variables that help explain marital happiness. A multivariate analysis may be necessary, with marital happiness as the dependent variable and combined income as one of many independent variables.

SPSS Code

GET
  FILE='C:\Users\mkher\Desktop\SAV files\gss.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.

* Crosstabulation of marital happiness by family income, with chi-square test.
CROSSTABS
  /TABLES=hapmar BY incomdol
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ CORR
  /CELLS=COUNT ROW COLUMN
  /COUNT ROUND CELL.

* One-way analysis of variance.
ONEWAY rincome BY hapmar
  /MISSING ANALYSIS.

* Chart Builder: boxplot of family income per marital happiness level.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=hapmar incomdol MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: hapmar=col(source(s), name("hapmar"), unit.category())
  DATA: incomdol=col(source(s), name("incomdol"))
  DATA: id=col(source(s), name("$CASENUM"), unit.category())
  GUIDE: axis(dim(1), label("HAPPINESS OF MARRIAGE"))
  GUIDE: axis(dim(2), label("Family income; ranges recoded to midpoints"))
  SCALE: cat(dim(1), include("1", "2", "3"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: schema(position(bin.quantile.letter(hapmar*incomdol)), label(id))
END GPL.

* Chart Builder: line chart of mean family income per marital happiness level.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=hapmar MEAN(incomdol)[name="MEAN_incomdol"]
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: hapmar=col(source(s), name("hapmar"), unit.category())
  DATA: MEAN_incomdol=col(source(s), name("MEAN_incomdol"))
  GUIDE: axis(dim(1), label("HAPPINESS OF MARRIAGE"))
  GUIDE: axis(dim(2), label("Mean Family income; ranges recoded to midpoints"))
  SCALE: cat(dim(1), include("1", "2", "3"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: line(position(hapmar*MEAN_incomdol), missing.wings())
END GPL.

References

Quant: Chi-Square Test in SPSS

Introduction

The aim of this analysis is to determine the strength of the association between the variables agecat and degree, as well as the major contributing cells, through a chi-square analysis. Standardized residuals should aid in determining the cell contributions.

Hypothesis

  • Null: There is no association between agecat and degree
  • Alternative: There is a real association between agecat and degree

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to look at the relationships between the following variables: agecat (Age category) and degree (Respondent’s highest degree).

To conduct a chi-square analysis, navigate through Analyze > Descriptive Statistics > Crosstabs.

The variable degree was placed in the “Row(s)” box, and agecat was placed in the “Column(s)” box. Select the “Statistics” button, then select “Chi-square” and, under the “Nominal” section, “Lambda”. Select the “Cells” button and select “Standardized” under the “Residuals” section. The procedures for this analysis are provided in video tutorial form by Miller (n.d.). The resulting output is shown in the next four tables.

Results

Table 1: Case processing summary.

                          Cases
                          Valid              Missing            Total
                          N       Percent    N       Percent    N       Percent
Degree * Age category     1411    99.4%      8       0.6%       1419    100.0%

Of the total sample of 1419 participants, 8 cases are reported as missing, leaving 99.4% valid cases (Table 1). Examining the crosstabulation (Table 2), in the “Less than high school” row the standardized residuals for the 30-39, 40-49, and 50-59 age groups are below -1.96, while the residual for the 60-89 age group is far above +1.96. Thus, the observed and expected frequencies in these cells differ significantly. Finally, in the “Junior college or more” row, the standardized residual for the 60-89 age group is less than -1.96, so these two frequencies also differ significantly. For all of these cells, SPSS identified that the observed frequencies are far from the expected frequencies (Miller, n.d.). A significant negative standardized residual indicates that the SPSS model is over-predicting the number of people in that age group with that respective diploma (or lack thereof). A significant positive standardized residual indicates that the model is under-predicting, here the number of people aged 60-89 without a high school diploma.

Table 2: Degree by Age category crosstabulation.

                                                   Age category
Degree                                             18-29   30-39   40-49   50-59   60-89   Total
Less than high school     Count                    42      33      36      20      112     243
                          Standardized Residual    -.1     -2.8    -2.3    -2.7    7.1
High school               Count                    138     162     154     113     158     725
                          Standardized Residual    .9      .2      -.2     .4      -1.2
Junior college or more    Count                    68      115     114     78      68      443
                          Standardized Residual    -1.1    1.8     1.9     1.4     -3.7
Total                     Count                    248     310     304     211     338     1411

Deriving the degrees of freedom from Table 2, df = (5-1)*(3-1) = 8. None of the expected counts is less than five, because the minimum expected count is 36.3 (Table 3), which is desirable. The chi-square value is 96.364 and is significant at the 0.05 level. Thus, the null hypothesis is rejected, and there is a statistically significant association between a person’s age category and diploma level. This test does not tell us anything about the directionality of the relationship.

Table 3: Chi-Square Tests

                                Value      df   Asymptotic Significance (2-sided)
Pearson Chi-Square              96.364a    8    .000
Likelihood Ratio                90.580     8    .000
Linear-by-Linear Association    23.082     1    .000
N of Valid Cases                1411
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 36.34.
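The chi-square statistic, degrees of freedom, minimum expected count, and standardized residuals can be cross-checked outside SPSS from the counts in Table 2. The sketch below uses base R’s chisq.test(). One point to note: the quantity SPSS labels “Standardized Residual”, (observed - expected) / sqrt(expected), corresponds to the residuals component in R (the Pearson residuals), not to stdres.

```r
# Observed counts copied from Table 2 (Degree by Age category).
counts <- matrix(
  c( 42,  33,  36,  20, 112,
    138, 162, 154, 113, 158,
     68, 115, 114,  78,  68),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Degree = c("Less than high school", "High school", "Junior college or more"),
    Age = c("18-29", "30-39", "40-49", "50-59", "60-89")
  )
)

res <- chisq.test(counts)

print(res$statistic)            # Pearson chi-square
print(res$parameter)            # degrees of freedom: (3 - 1) * (5 - 1)
print(min(res$expected))        # minimum expected count
print(round(res$residuals, 1))  # (observed - expected) / sqrt(expected)

# Cells whose residual lies beyond +/- 1.96.
flagged <- abs(res$residuals) > 1.96
print(flagged)
```

Five cells are flagged: four in the “Less than high school” row and the 60-89 cell in the “Junior college or more” row.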

Table 4: Directional Measures

                                                      Value   Asymptotic        Approximate Tb   Approximate
                                                              Standard Errora                    Significance
Nominal by Nominal
  Lambda                    Symmetric                 .029    .013              2.278            .023
                            Degree Dependent          .000    .000              .c               .c
                            Age category Dependent    .048    .020              2.278            .023
  Goodman and Kruskal tau   Degree Dependent          .024    .005                               .000d
                            Age category Dependent    .019    .004                               .000d
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Cannot be computed because the asymptotic standard error equals zero.
d. Based on chi-square approximation

Although there is a statistically significant association between a person’s age category and diploma level, the chi-square test does not show how strongly these variables are related. The symmetric lambda value (computed without assuming the null hypothesis) is 0.029, meaning that knowing one variable reduces the error in predicting the other by only 2.9%. Thus, the relationship has a very weak effect (Table 4), and the association, though statistically significant, has little practical strength.
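The lambda values in Table 4 can be reproduced by hand from the Table 2 counts. The sketch below assumes the standard Goodman and Kruskal definition of lambda as a proportional reduction in prediction error, and the helper name lambda_col_dependent is hypothetical.

```r
# Observed counts copied from Table 2 (rows: Degree, columns: Age category).
counts <- matrix(
  c( 42,  33,  36,  20, 112,
    138, 162, 154, 113, 158,
     68, 115, 114,  78,  68),
  nrow = 3, byrow = TRUE
)

# Lambda with the column variable as dependent: how much does knowing the
# row category reduce errors when guessing the column category?
lambda_col_dependent <- function(tab) {
  best_guess_overall <- max(colSums(tab))      # guess the modal column for everyone
  best_guess_by_row <- sum(apply(tab, 1, max)) # guess the modal column within each row
  (best_guess_by_row - best_guess_overall) / (sum(tab) - best_guess_overall)
}

lambda_age <- lambda_col_dependent(counts)       # Age category dependent
lambda_degree <- lambda_col_dependent(t(counts)) # Degree dependent

# Symmetric lambda pools the two prediction directions.
lambda_sym <- (sum(apply(counts, 1, max)) + sum(apply(counts, 2, max)) -
                 max(colSums(counts)) - max(rowSums(counts))) /
  (2 * sum(counts) - max(colSums(counts)) - max(rowSums(counts)))

print(round(c(symmetric = lambda_sym, degree = lambda_degree, age = lambda_age), 3))
```

Lambda for degree is zero because “High school” is the most common degree in every age category, so knowing the age category never changes the best guess.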

Conclusions

There is a statistically significant association between a person’s age category and diploma level. According to the crosstabulation, the SPSS model significantly over-predicts the number of people without a high school diploma in the 30-59 age groups, as well as those with a college degree in the 60-89 age group. These differences in the standardized residuals helped drive a large and statistically significant chi-square value. A lambda of 0.029, however, shows that only 2.9% of the prediction error is accounted for, so the association is very weak.

SPSS Code

CROSSTABS
  /TABLES=ndegree BY agecat
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ CC LAMBDA
  /CELLS=COUNT SRESID
  /COUNT ROUND CELL.

References:

Quant: Linear Regression in SPSS

The aim of this analysis is to examine how a father’s education level (dependent variable) relates to the mother’s education level (independent variable). The variable names are “paeduc” and “maeduc.” Thus, the hope is to determine the linear regression equation for predicting the father’s education level from the mother’s education.

Introduction

From the SPSS outputs the following questions will be addressed:

  • How much of the total variance have you accounted for with the equation?
  • Based upon your equation, what level of education would you predict for the father when the mother has 16 years of education?

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to look at the relationship between the following variables: paeduc (HIGHEST YEAR SCHOOL COMPLETED, FATHER) and maeduc (HIGHEST YEAR SCHOOL COMPLETED, MOTHER). To conduct a linear regression analysis, navigate through Analyze > Regression > Linear.  The variable paeduc was placed in the “Dependent” box, and maeduc was placed in the “Independent(s)” box.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.).  The output is shown in the four tables that follow.

The relationship between paeduc and maeduc is plotted in a scatterplot by using the chart builder.  The chart builder code is shown in the code section, and the resulting image is shown in the results section.

Results

Table 1: Variables Entered/Removed

| Model | Variables Entered | Variables Removed | Method |
|---|---|---|---|
| 1 | HIGHEST YEAR SCHOOL COMPLETED, MOTHER (b) | . | Enter |

a. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. All requested variables entered.

Table 1 reports that, for the linear regression analysis, the dependent variable is the highest year of school completed by the father and the independent variable is the highest year of school completed by the mother.  No variables were removed.

Table 2: Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .639 (a) | .408 | .407 | 3.162 |

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, MOTHER
b. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER

For a linear regression predicting the father’s highest year of school completed from the mother’s highest year of school completed, the correlation is positive with a value of 0.639, so only 0.408 of the variance is explained (Table 2) and 0.592 of the variance is unexplained.  The linear regression formula or line of best fit (Table 4) is: y = 0.76 x + (2.572 years) + e.  The line of best fit expresses in equation form the mathematical relationship between two variables, in this case the father’s and mother’s highest education levels.  Thus, if the mother has completed her bachelor’s degree (16th year), then this equation would yield y = 2.572 years + 0.76 (16 years) + e = 14.732 years + e.  The e is the error in this prediction formula, and it exists because the r value is not exactly -1.0 or +1.0.  The ANOVA table (Table 3) shows that the relationship between these two variables is statistically significant at the 0.05 level.
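The arithmetic in this paragraph can be reproduced in a few lines. This is only a sketch that plugs the rounded coefficients reported by SPSS into the regression equation; the function name `predict_father_education` is illustrative and not something SPSS provides.

```python
# Rounded values copied from the SPSS output (Tables 2 and 4).
intercept = 2.572   # constant, in years
slope = 0.760       # change in father's years per additional year for the mother
r = 0.639           # correlation between maeduc and paeduc

def predict_father_education(mother_years):
    """Point prediction from the line of best fit (the error term e is omitted)."""
    return intercept + slope * mother_years

predicted = predict_father_education(16)
explained = r ** 2            # share of variance explained
unexplained = 1 - explained   # share of variance left unexplained

print(round(predicted, 3))    # 14.732
print(round(explained, 3))    # 0.408
print(round(unexplained, 3))  # 0.592
```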

Table 3: ANOVA Table

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 6231.521 | 1 | 6231.521 | 623.457 | .000 (b) |
| | Residual | 9045.579 | 905 | 9.995 | | |
| | Total | 15277.100 | 906 | | | |

a. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, MOTHER
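The ANOVA table ties back to the model summary: the F statistic, R Square, and the standard error of the estimate can all be recovered from the sums of squares and degrees of freedom. The sketch below uses only the values printed in the ANOVA table; the variable names are illustrative.

```python
import math

# Values copied from the ANOVA table.
ss_regression = 6231.521
ss_residual = 9045.579
df_regression = 1
df_residual = 905

ss_total = ss_regression + ss_residual

ms_regression = ss_regression / df_regression   # mean square, regression
ms_residual = ss_residual / df_residual         # mean square, residual

f_statistic = ms_regression / ms_residual       # compare with F in the ANOVA table
r_squared = ss_regression / ss_total            # compare with R Square in the model summary
std_error_estimate = math.sqrt(ms_residual)     # compare with Std. Error of the Estimate

print("F =", round(f_statistic, 2))
print("R squared =", round(r_squared, 3))
print("Std. error of the estimate =", std_error_estimate)
```

Small differences in the last decimal place are expected because the inputs are already rounded in the SPSS output.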

Table 4: Coefficients

| Model | Term | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 2.572 | .367 | | 7.009 | .000 |
| | HIGHEST YEAR SCHOOL COMPLETED, MOTHER | .760 | .030 | .639 | 24.969 | .000 |

a. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER

Figure 1 below is a scatter plot of the highest year of school completed by the mother vs. the father, along with the linear regression line (Table 4) and box plots of each respective distribution.  There are more outliers in the father’s education level than in the mother’s education level, and the father’s education level is more concentrated about its median.

u4db1f1.png

Figure 1: Highest year of school completed by the mother vs the father scatter plot with regression line and box plot images of each respective distribution.

Conclusion

There is a statistically significant relationship between the father’s and mother’s highest year of education completed.  The line of best fit shows a moderately positive correlation and is defined as y = 0.76 x + (2.572 years) + e, which explains only 40.8% of the variance, while 59.2% of the variance is unexplained.

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT paeduc
  /METHOD=ENTER maeduc
  /CASEWISE PLOT(ZRESID) OUTLIERS(3).

STATS REGRESS PLOT YVARS=paeduc XVARS=maeduc
  /OPTIONS CATEGORICAL=BARS GROUP=1 BOXPLOTS INDENT=15 YSCALE=75
  /FITLINES LINEAR APPLYTO=TOTAL.

References:

Quant: Statistical Significance

Presume that you have analyzed a relationship between 2 management styles and found they are significantly related. A statistician has looked at your output and said that the results really do not explain much of what is happening in the total relationship.

In quantitative research methodologies, meaningful results are reported along with their statistical significance, confidence intervals, and effect sizes (Creswell, 2014). If the results from a statistical test have a low probability of occurring by chance (5% or 1% or less), then the statistical test is considered significant (Creswell, 2014; Field, 2013; Huck, 2011).  Low significance thresholds protect against type I errors (Huck, 2011). Statistical significance tests can have the same effect yet result in different values (Field, 2013).  With large sample sizes, small differences can show up as statistically significant, while in smaller samples large differences may be deemed insignificant (Field, 2013).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011).  Huck (2011) stated that two main factors that could influence whether or not a result is statistically significant are the quality of the research question and the research design.  This is why Creswell (2014) also called for reporting confidence intervals and effect sizes. Confidence intervals give a range of values that describes the uncertainty of the overall observation, and effect size defines the strength of the conclusions made on the observations (Creswell, 2014).  Huck (2011) suggested that after statistical significance is calculated and the researcher can either reject or fail to reject a null hypothesis, effect size analysis should be conducted.  The effect size allows researchers to measure objectively the magnitude or practical significance of the research findings by looking at the differential impact of the variables (Huck, 2011; Field, 2013).  Field (2013) defines one way of measuring the effect size, Cohen’s d: d = (Avg(x1) – Avg(x2))/(standard deviation).

There are multiple ways to pick a standard deviation for the denominator of the effect size equation: the control group standard deviation, a single group’s standard deviation, the population standard deviation, or a pooled standard deviation across the groups, which assumes the groups are independent (Field, 2013).   If d = 0.2 there is a small effect, if d = 0.5 there is a moderate effect, and if d = 0.8 or more there is a large effect (Field, 2013; Huck, 2011). Thus, this could be the reason why the statistical test yielded a statistically significant value, while further analysis with effect size could show that those statistically significant results do not explain much of what is happening in the total relationship.
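As a minimal sketch of the effect size calculation, the code below computes Cohen’s d with a pooled standard deviation and labels it using the thresholds above. The two score lists are hypothetical data invented for illustration, the function names are illustrative, and the "negligible" label for values below 0.2 is an assumption rather than something taken from the cited texts.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled standard deviation of two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

def describe_effect(d):
    """Label the magnitude of d with the conventional cut-offs."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "negligible"   # assumed label for values below the smallest cut-off

# Hypothetical scores under two management styles.
style_a = [12, 14, 16, 18, 20]
style_b = [11, 13, 15, 17, 19]

d = cohens_d(style_a, style_b)
print(round(d, 3), describe_effect(d))
```

With a large enough sample, a difference of this size could be statistically significant while still counting as only a small effect.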

Resources

  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2011) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.

Quant: Regression and Correlations

Top management of a large company has told you that they really would like to be able to determine what impact years of service at their company has on workers’ productivity levels, and they would like to be able to predict potential productivity based upon years of service. The company has data on all of its employees and has been using a valid productivity measure that assesses each employee’s productivity. You have told management that there is a possible way to do that.

Through a regression analysis, it should be possible to predict the potential productivity based upon years of service, depending on two factors: (1) that the productivity assessment tool is valid and reliable (Creswell, 2014) and (2) that we have a large enough sample size to conduct our analysis and draw statistical inferences about the population from the sample data that has been collected (Huck, 2011). Assuming these two conditions are met, a regression analysis could be run on the data to create a prediction formula. Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011). There are multiple types of regression, all of which are tests of prediction: Linear, Multiple, Log-Linear, Quadratic, Cubic, etc. (Huck, 2011; Schumacker, 2014).  The linear regression is the most well-known because it uses basic algebra, a straight line, and the Pearson correlation coefficient to aid in stating the regression’s prediction strength (Huck, 2011; Schumacker, 2014).  The linear regression formula is: y = a + bx + e, where y is the dependent variable (in this case the productivity measure), x is the independent variable (years of service), a (the intercept) and b (the regression weight) are constants that are to be defined through the regression analysis, and e is the regression prediction error (Field, 2013; Schumacker, 2014).  The sum of the errors should be equal to zero (Schumacker, 2014).
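The formula y = a + bx + e can be illustrated with a small least-squares fit. The sketch below uses made-up years-of-service and productivity values (hypothetical data, not company records) and an illustrative `fit_line` helper; it estimates a and b and then checks that the prediction errors sum to zero.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates of the intercept a and slope b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx            # regression weight (slope)
    a = mean_y - b * mean_x    # intercept
    return a, b

# Hypothetical data for illustration only.
years_of_service = [1, 2, 3, 4, 5]
productivity = [2, 4, 5, 4, 5]

a, b = fit_line(years_of_service, productivity)

# e = observed value minus predicted value, one per employee
errors = [yi - (a + b * xi) for xi, yi in zip(years_of_service, productivity)]

print("a =", round(a, 3), "b =", round(b, 3))
print("sum of errors =", round(sum(errors), 10))
```

The errors cancel out because the least-squares line always passes through the point of means, which is why a sum of zero says nothing about how large the individual errors are.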

Linear regression models try to describe the relationship between one dependent and one independent variable, which are measured at the ratio or interval level (Schumacker, 2014).  However, other regression models can be tested to find the best regression fit over the data.  Even though these are different regression tests, the goal of each regression model is to describe the current relationship between the dependent variable and the independent variable(s) and to predict.  Multiple regression is used when there are multiple independent variables (Huck, 2011; Schumacker, 2014). Log-linear regression is used with categorical or continuous independent variables (Schumacker, 2014). Quadratic and cubic regressions use a quadratic or cubic formula to help predict trends that are quadratic or cubic in nature, respectively (Field, 2013).  When modeling potential productivity based upon years of service, the regression with the strongest correlation will be used, as that regression formula explains the variance between the variables the best.   However, just because the regression formula can predict some or most of the variance between the variables, it will never imply causation (Field, 2013).

Correlations help define the strength of the regression formula in describing the relationships between the variables, and can vary in value from -1 to +1.  The closer the correlation coefficient is to -1 or +1, the better the regression formula predicts the variance between the variables.  The closer the correlation coefficient is to zero, the weaker the relationship between the variables (Field, 2013; Huck, 2011; Schumacker, 2014).  A negative correlation could show that as the years of service increase, the measured productivity decreases, which could be caused by apathy or some other factor that has yet to be measured.  A positive correlation could show that as the years of service increase, the measured productivity also increases, which could also be influenced by other factors that are not directly related to the years of service.  Thus, correlation doesn’t imply causation, but when the correlation value is squared (r²), it gives the percentage of the variance between the variables that is explained by the regression formula (Field, 2013).
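A short sketch of the Pearson correlation shows how the sign and the squared value behave. The data are hypothetical, and the `pearson_r` helper is written out by hand for illustration rather than taken from a statistics package.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data for illustration only.
years_of_service = [1, 2, 3, 4, 5]
rising_productivity = [2, 4, 5, 4, 5]
falling_productivity = list(reversed(rising_productivity))

r_positive = pearson_r(years_of_service, rising_productivity)
r_negative = pearson_r(years_of_service, falling_productivity)

print("r (rising)  =", round(r_positive, 3), " r squared =", round(r_positive ** 2, 3))
print("r (falling) =", round(r_negative, 3), " r squared =", round(r_negative ** 2, 3))
```

The two correlations differ only in sign, so both account for the same share of the variance; the sign describes the direction of the relationship, not its strength.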

References

  • Creswell, J. W. (2014) Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California, SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2011) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.