Quant: Chi-Square Test in SPSS

Introduction

This analysis uses a chi-square test to determine whether the variables agecat and degree are associated, how strong that association is, and which cells contribute most to it. Standardized residuals are used to identify the major contributing cells.

Hypothesis

  • Null: There is no association between agecat and degree
  • Alternative: There is an association between agecat and degree

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to look at the relationships between the following variables: agecat (Age category) and degree (Respondent’s highest degree).

To conduct a chi-square analysis, navigate through Analyze > Descriptive Statistics > Crosstabs.

The variable degree was placed in the “Row(s)” box and agecat was placed in the “Column(s)” box.  Click the “Statistics” button, select “Chi-square”, and under the “Nominal” section select “Lambda”. Click the “Cells” button and select “Standardized” under the “Residuals” section. The procedures for this analysis are provided in video tutorial form by Miller (n.d.).  The output is shown in the next four tables.

Results

Table 1: Case processing summary.

                         Cases
                         Valid             Missing           Total
                         N      Percent    N      Percent    N      Percent
Degree * Age category    1411   99.4%      8      0.6%       1419   100.0%

Of the 1419 cases in the sample, 8 are missing, leaving 1411 valid cases (99.4%; Table 1).  In the crosstabulation (Table 2), a standardized residual below -1.96 or above +1.96 marks a cell whose observed count differs significantly from the count expected if age category and degree were independent (Miller, n.d.).  In the “Less than high school” row, the residuals for the 30-39, 40-49, and 50-59 age groups are -2.8, -2.3, and -2.7, while the residual for the 60-89 age group is +7.1.  In the “Junior college or more” row, the residual for the 60-89 age group is -3.7.  A significant negative residual means the expected count over-predicts the number of people in that age group with that degree level: fewer were observed than independence would imply.  A significant positive residual means the expected count under-predicts: more people were observed in that cell than independence would imply, as with respondents aged 60-89 who have less than a high school education.

Table 2: Degree by Age category crosstabulation.

Degree                                             Age category                              Total
                                                   18-29   30-39   40-49   50-59   60-89
Less than high school    Count                     42      33      36      20      112       243
                         Standardized Residual     -.1     -2.8    -2.3    -2.7    7.1
High school              Count                     138     162     154     113     158       725
                         Standardized Residual     .9      .2      -.2     .4      -1.2
Junior college or more   Count                     68      115     114     78      68        443
                         Standardized Residual     -1.1    1.8     1.9     1.4     -3.7
Total                    Count                     248     310     304     211     338       1411

With five age categories and three degree levels in Table 2, the degrees of freedom are df = (5-1)*(3-1) = 8.  None of the expected counts is less than five; the minimum expected count is 36.34 (Table 3), so the expected-count assumption of the chi-square test is met.  The chi-square value is 96.364 and is significant at the 0.05 level (p < .001). Thus, the null hypothesis is rejected, and there is a statistically significant association between a person’s age category and degree level.  This test says nothing about the strength or direction of the relationship.

Table 3: Chi-Square Tests

Value df Asymptotic Significance (2-sided)
Pearson Chi-Square 96.364a 8 .000
Likelihood Ratio 90.580 8 .000
Linear-by-Linear Association 23.082 1 .000
N of Valid Cases 1411
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 36.34.
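
The Pearson chi-square value, the minimum expected count, and the standardized residuals can be re-derived by hand from the counts in Table 2. The following is a minimal Python sketch of that arithmetic, not the procedure SPSS runs internally; the function name chi_square_table is hypothetical and chosen only for illustration.

```python
# Observed counts copied from Table 2 (rows: degree level, columns: age category).
observed = [
    [42, 33, 36, 20, 112],      # Less than high school
    [138, 162, 154, 113, 158],  # High school
    [68, 115, 114, 78, 68],     # Junior college or more
]


def chi_square_table(table):
    """Return (chi2, df, expected counts, standardized residuals) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Standardized residual: (observed - expected) / sqrt(expected).
    residuals = [
        [(o - e) / e ** 0.5 for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    # Pearson chi-square is the sum of the squared standardized residuals.
    chi2 = sum(res ** 2 for row in residuals for res in row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected, residuals


chi2, df, expected, residuals = chi_square_table(observed)
min_expected = min(min(row) for row in expected)

print("chi-square:", round(chi2, 3), "df:", df)
print("minimum expected count:", round(min_expected, 2))
for row in residuals:
    print([round(res, 1) for res in row])
```

Because the chi-square value is the sum of the squared standardized residuals, the cells with the largest residuals in Table 2 are also the cells that contribute most to the test statistic.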

Table 4: Directional Measures

                                                                        Value   Asymptotic Standard Error (a)   Approximate T (b)   Approximate Significance
Nominal by Nominal   Lambda                    Symmetric                .029    .013                            2.278               .023
                                               Degree Dependent         .000    .000                            . (c)               . (c)
                                               Age category Dependent   .048    .020                            2.278               .023
                     Goodman and Kruskal tau   Degree Dependent         .024    .005                                                .000 (d)
                                               Age category Dependent   .019    .004                                                .000 (d)
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Cannot be computed because the asymptotic standard error equals zero.
d. Based on chi-square approximation

Although there is a statistically significant association between a person’s age category and degree level, the chi-square test does not show how strongly the variables are related. Lambda measures the proportional reduction in prediction error: the symmetric lambda of 0.029 means that knowing one variable reduces the errors made in predicting the other by about 2.9% (Table 4). With degree as the dependent variable, lambda is 0.000, so knowing a person’s age category does not improve the prediction of degree level at all. Thus, despite its statistical significance, the association is very weak.
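
The lambda values in Table 4 can be checked against the counts in Table 2. The sketch below uses the standard Goodman and Kruskal definition of lambda as a proportional reduction in error; the function name goodman_kruskal_lambda is hypothetical, and this illustrates the arithmetic rather than the SPSS implementation.

```python
# Observed counts copied from Table 2 (rows: degree level, columns: age category).
observed = [
    [42, 33, 36, 20, 112],      # Less than high school
    [138, 162, 154, 113, 158],  # High school
    [68, 115, 114, 78, 68],     # Junior college or more
]


def goodman_kruskal_lambda(table):
    """Return (lambda with rows dependent, lambda with columns dependent, symmetric lambda)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Correct guesses when the modal category is predicted within each column / each row.
    sum_col_max = sum(max(col) for col in zip(*table))
    sum_row_max = sum(max(row) for row in table)
    # Errors avoided by knowing the other variable (numerator) and
    # errors made without knowing it (denominator).
    num_rows = sum_col_max - max(row_totals)
    den_rows = n - max(row_totals)
    num_cols = sum_row_max - max(col_totals)
    den_cols = n - max(col_totals)
    return (
        num_rows / den_rows,
        num_cols / den_cols,
        (num_rows + num_cols) / (den_rows + den_cols),
    )


lambda_degree, lambda_agecat, lambda_symmetric = goodman_kruskal_lambda(observed)
print(round(lambda_degree, 3), round(lambda_agecat, 3), round(lambda_symmetric, 3))
```

Lambda for degree is zero because “High school” is the most common degree level in every age category, so knowing the age category never changes the best guess.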

Conclusions

There is a statistically significant association between a person’s age category and degree level.  According to the crosstabulation, the counts expected under independence significantly over-predict the number of people with less than a high school education in the 30-59 age groups and the number with a junior college degree or more in the 60-89 age group, and they significantly under-predict the number of people aged 60-89 with less than a high school education.  These large standardized residuals drove the large and statistically significant chi-square value. However, with a symmetric lambda of 0.029, knowing one variable reduces the error in predicting the other by only 2.9%, so the association is very weak.

SPSS Code

CROSSTABS
  /TABLES=ndegree BY agecat
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ CC LAMBDA
  /CELLS=COUNT SRESID
  /COUNT ROUND CELL.

References:

Quant: Linear Regression in SPSS

Introduction

The aim of this analysis is to examine how well a father’s education level (dependent variable) can be predicted from the mother’s education level (independent variable). The variable names are “paeduc” and “maeduc.” The goal is to determine the linear regression equation for predicting the father’s education level from the mother’s education.

From the SPSS outputs the following questions will be addressed:

  • How much of the total variance have you accounted for with the equation?
  • Based upon your equation, what level of education would you predict for the father when the mother has 16 years of education?

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to look at the relationship between the following variables: paeduc (HIGHEST YEAR SCHOOL COMPLETED, FATHER) and maeduc (HIGHEST YEAR SCHOOL COMPLETED, MOTHER). To conduct a linear regression analysis, navigate through Analyze > Regression > Linear.  The variable paeduc was placed in the “Dependent” box, and maeduc was placed in the “Independent(s)” box.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.).  The output is shown in the next four tables.

The relationship between paeduc and maeduc is plotted in a scatterplot using the chart builder.  The chart builder code is shown in the code section, and the resulting image is shown in the results section.

Results

Table 1: Variables Entered/Removed

Model Variables Entered Variables Removed Method
1 HIGHEST YEAR SCHOOL COMPLETED, MOTHERb . Enter
a. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. All requested variables entered.

Table 1 reports that, for the linear regression analysis, the dependent variable is the highest year of school completed by the father and the independent variable is the highest year of school completed by the mother.  No variables were removed.

Table 2: Model Summary

Model R R Square Adjusted R Square Std. Error of the Estimate
1 .639a .408 .407 3.162
a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, MOTHER
b. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER

For a linear regression predicting the father’s highest year of school completed from the mother’s highest year of school completed, the correlation is positive with a value of 0.639, so the model explains 0.408 of the variance (Table 2) and leaves 0.592 unexplained.  The linear regression formula or line of best fit (Table 4) is: y = 0.76 x + (2.572 years) + e.  The line of best fit expresses in equation form the mathematical relationship between two variables, in this case the father’s and mother’s highest education level.  Thus, if the mother has completed her bachelor’s degree (16th year), this equation yields y = 2.572 years + 0.76 (16 years) + e = 14.732 years + e.  The e is the error in this prediction formula, and it exists because the correlation is not exactly -1.0 or +1.0 (equivalently, r2 is less than 1.0).  The ANOVA table (Table 3) shows that the relationship between these two variables is statistically significant at the 0.05 level (F = 623.457, p < .001).

Table 3: ANOVA Table

Model Sum of Squares df Mean Square F Sig.
1 Regression 6231.521 1 6231.521 623.457 .000b
Residual 9045.579 905 9.995
Total 15277.100 906
a. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, MOTHER

Table 4: Coefficients

Model                                         Unstandardized Coefficients   Standardized Coefficients   t        Sig.
                                              B        Std. Error           Beta
1   (Constant)                                2.572    .367                                             7.009    .000
    HIGHEST YEAR SCHOOL COMPLETED, MOTHER     .760     .030                 .639                        24.969   .000
a. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER
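
The prediction for a mother with 16 years of education and the share of variance explained can both be re-derived from the values reported in Tables 3 and 4. The following Python sketch shows that arithmetic; the constant and function names are hypothetical and used only for illustration.

```python
# Values copied from Table 3 (ANOVA) and Table 4 (Coefficients).
INTERCEPT = 2.572          # (Constant), in years
SLOPE = 0.760              # coefficient for maeduc
SS_REGRESSION = 6231.521   # regression sum of squares
SS_TOTAL = 15277.100       # total sum of squares


def predict_paeduc(maeduc_years):
    """Predicted father's years of school for a given mother's years of school."""
    return INTERCEPT + SLOPE * maeduc_years


# R squared is the share of the total variation explained by the regression.
r_squared = SS_REGRESSION / SS_TOTAL
unexplained = 1 - r_squared
predicted = predict_paeduc(16)

print("predicted paeduc for maeduc = 16:", round(predicted, 3))
print("explained:", round(r_squared, 3), "unexplained:", round(unexplained, 3))
```

The prediction is a point estimate only; the standard error of the estimate in Table 2 (3.162 years) indicates how far individual fathers typically fall from the line.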

Figure 1 is a scatter plot of the highest year of school completed by the mother versus the father, along with the linear regression line (Table 4) and box plots of each distribution.  The father’s education level has more outliers than the mother’s, and its values are more concentrated about the median.

Figure 1: Highest year of school completed by the mother vs the father scatter plot with regression line and box plot images of each respective distribution.

Conclusion

There is a statistically significant relationship between the father’s and mother’s highest year of education completed.  The correlation is moderately positive (r = 0.639), and the line of best fit is y = 0.76 x + (2.572 years) + e, which explains 40.8% of the variance and leaves 59.2% unexplained.

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT paeduc
  /METHOD=ENTER maeduc
  /CASEWISE PLOT(ZRESID) OUTLIERS(3).

STATS REGRESS PLOT YVARS=paeduc XVARS=maeduc
  /OPTIONS CATEGORICAL=BARS GROUP=1 BOXPLOTS INDENT=15 YSCALE=75
  /FITLINES LINEAR APPLYTO=TOTAL.

References:

Quant: ANOVA and Multiple Comparisons in SPSS

Introduction

The aim of this analysis is to examine the relationship between respondents’ income level (rincdol), the dependent variable, and their reported level of happiness (happy), the independent variable.  The independent variable has three levels.

From the SPSS outputs the goal is to:

  • Use the one-way ANOVA to reach an overall conclusion, then use the Bonferroni correction as a post-hoc analysis to determine how specific levels of happiness relate to income.

Hypothesis

  • Null: Mean rincdol does not differ across the levels of happy
  • Alternative: Mean rincdol differs for at least one level of happy
  • Null2: Mean rincdol does not differ between a given pair of happy levels
  • Alternative2: Mean rincdol differs between a given pair of happy levels

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to look at the relationships between the following variables: rincdol (Respondent’s income; ranges recoded to midpoints) and happy (General Happiness). To conduct a parametric analysis, navigate to Analyze > Compare Means > One-Way ANOVA.  The variable rincdol was placed in the “Dependent List” box, and happy was placed under “Factor” box.  Select “Post Hoc” and under the “Equal Variances Assumed” select “Bonferroni”.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.). The following output was observed in the next two tables.

The relationship between rincdol and happy is plotted using the chart builder.  The chart builder code is shown in the code section, and the resulting image is shown in the results section.

Results

Table 1: ANOVA

Respondent’s income; ranges recoded to midpoints
Sum of Squares df Mean Square F Sig.
Between Groups 11009722680.000 2 5504861341.000 9.889 .000
Within Groups 499905585000.000 898 556687733.900
Total 510915307700.000 900

The ANOVA in Table 1 is statistically significant (F = 9.889, p < .001), so the first null hypothesis is rejected at the 0.05 level. Thus, mean income differs significantly across the levels of general happiness.  However, the overall test does not show which levels differ from one another; the post-hoc comparisons in Table 2 address that.
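
The F value in Table 1 is the ratio of the between-groups mean square to the within-groups mean square, and the Bonferroni correction spreads the 0.05 significance level over every pairwise comparison. The following Python sketch re-derives both from the reported sums of squares; it only illustrates the arithmetic, and the variable names are hypothetical.

```python
# Sums of squares and degrees of freedom copied from Table 1.
ss_between, df_between = 11009722680.0, 2
ss_within, df_within = 499905585000.0, 898

# Each mean square is a sum of squares divided by its degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

# With k groups there are k * (k - 1) / 2 pairwise comparisons.
k = df_between + 1
n_comparisons = k * (k - 1) // 2

# Bonferroni: divide the overall alpha by the number of comparisons.
alpha_per_comparison = 0.05 / n_comparisons

print("F:", round(f_ratio, 3))
print("pairwise comparisons:", n_comparisons)
print("alpha per comparison:", round(alpha_per_comparison, 4))
```

Testing each pair against the smaller per-comparison level keeps the chance of at least one false positive across the three comparisons at or below 0.05.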

Table 2: Multiple Comparisons

Dependent Variable:   Respondent’s income; ranges recoded to midpoints
Bonferroni
(I) GENERAL HAPPINESS   (J) GENERAL HAPPINESS   Mean Difference (I-J)   Std. Error   Sig.   95% Confidence Interval
                                                                                            Lower Bound   Upper Bound
VERY HAPPY              PRETTY HAPPY            4093.678                1744.832     .058   -91.26        8278.61
                        NOT TOO HAPPY           12808.643*              2912.527     .000   5823.02       19794.26
PRETTY HAPPY            VERY HAPPY              -4093.678               1744.832     .058   -8278.61      91.26
                        NOT TOO HAPPY           8714.965*               2740.045     .005   2143.04       15286.89
NOT TOO HAPPY           VERY HAPPY              -12808.643*             2912.527     .000   -19794.26     -5823.02
                        PRETTY HAPPY            -8714.965*              2740.045     .005   -15286.89     -2143.04
*. The mean difference is significant at the 0.05 level.

According to Table 2, the “Very Happy” and “Pretty Happy” pairing fails to reject Null2 at the 0.05 level (p = .058). The other two pairings, “Very Happy” versus “Not Too Happy” (p < .001) and “Pretty Happy” versus “Not Too Happy” (p = .005), reject Null2 at the 0.05 level.  Thus, mean income differs significantly for two of the three pairs.

Figure 1: Graphed means of General Happiness versus incomes.

General happiness and income are positively related (Figure 1): people reporting a lower level of general happiness have, on average, lower recorded incomes, and vice versa.  No direction or causality can be inferred from this analysis.  It cannot be concluded that higher income causes general happiness, or that happy people make more money because of their positive attitude towards life.

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

ONEWAY rincdol BY happy
  /MISSING ANALYSIS
  /POSTHOC=BONFERRONI ALPHA(0.05).

* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=happy MEAN(rincdol)[name="MEAN_rincdol"]
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: happy=col(source(s), name("happy"), unit.category())
  DATA: MEAN_rincdol=col(source(s), name("MEAN_rincdol"))
  GUIDE: axis(dim(1), label("GENERAL HAPPINESS"))
  GUIDE: axis(dim(2), label("Mean Respondent's income; ranges recoded to midpoints"))
  SCALE: cat(dim(1), include("1", "2", "3"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: line(position(happy*MEAN_rincdol), missing.wings())
END GPL.

References:

Quant: Group Statistics in SPSS

Introduction

The aim of this analysis is to determine whether people who are alive ten years after a coronary differ significantly in diastolic blood pressure, taken when that event occurred, from those who have died. The variable “DBP58” is used as the dependent variable and “Vital10” as the independent variable.

From the SPSS outputs the goal is to:

  • Analyze these conditions to determine if there is a significant difference between the DBP levels of those (vital10) who are alive 10 years later compared to those who died within 10 years.

Hypothesis

  • Null: Mean DBP58 does not differ between the two Vital10 groups
  • Alternative: Mean DBP58 differs between the two Vital10 groups

Methodology

For this project, the electric.sav file is loaded into SPSS (Electric, n.d.).  The goal is to look at the relationship between the following variables: DBP58 (Average Diastolic Blood Pressure) and Vital10 (Status at Ten Years). To conduct a parametric analysis, navigate to Analyze > Compare Means > Independent-Samples T Test.  The variable DBP58 was placed in the “Test Variable(s)” box, and Vital10 was placed in the “Grouping Variable” box.  Then select the “Define Groups” button and enter 0 for “Group 1” and 1 for “Group 2”.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.). The output is shown in the next two tables.

Results

Table 1: Group Statistics

Status at Ten Years N Mean Std. Deviation Std. Error Mean
Average Diast Blood Pressure 58 Alive 178 87.56 11.446 .858
Dead 61 92.38 16.477 2.110

According to Table 1, the mean diastolic blood pressure of those who had died within ten years was about 4.8 points higher (92.38 versus 87.56), and their standard deviation was larger (16.477 versus 11.446).  Thus, those who are alive ten years later show less variation in their diastolic blood pressure.

Table 2: Independent Samples Test

Average Diast Blood Pressure 58
                              Levene’s Test for Equality of Variances   t-test for Equality of Means
                              F       Sig.                              t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% Confidence Interval of the Difference
                                                                                                                                                      Lower     Upper
Equal variances assumed       8.815   .003                              -2.515   237      .013              -4.815            1.915                   -8.587    -1.043
Equal variances not assumed                                             -2.114   80.735   .038              -4.815            2.277                   -9.347    -.284

Levene’s test (F = 8.815, p = .003) shows that the variances of the two groups are not equal at the 0.05 level, so the “Equal variances not assumed” row of Table 2 is used.  On that row the significance value is 0.038 (t = -2.114, df = 80.735), so the null hypothesis is rejected at the 0.05 level.  Thus, there is a statistically significant difference between the mean diastolic blood pressure of those who are alive and those who have passed away.
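
The “Equal variances not assumed” row can be approximately re-derived from the group statistics in Table 1 using the unequal-variance (Welch) form of the t-test. The following is a minimal Python sketch under that assumption; the function name welch_t is hypothetical, and small differences from the SPSS output come from the rounded means in Table 1.

```python
import math

# Group statistics copied from Table 1.
alive = {"n": 178, "mean": 87.56, "sd": 11.446}
dead = {"n": 61, "mean": 92.38, "sd": 16.477}


def welch_t(group1, group2):
    """Return (t, df, standard error) for an unequal-variance t-test."""
    # Squared standard error of each group mean.
    v1 = group1["sd"] ** 2 / group1["n"]
    v2 = group2["sd"] ** 2 / group2["n"]
    se = math.sqrt(v1 + v2)
    t = (group1["mean"] - group2["mean"]) / se
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (
        v1 ** 2 / (group1["n"] - 1) + v2 ** 2 / (group2["n"] - 1)
    )
    return t, df, se


t_stat, df, se = welch_t(alive, dead)
print("t:", round(t_stat, 2), "df:", round(df, 1), "SE:", round(se, 3))
```

The fractional degrees of freedom are a result of the approximation: the more the two group variances and sample sizes differ, the further df falls below n1 + n2 - 2 = 237.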

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

T-TEST GROUPS=vital10(0 1)
  /MISSING=ANALYSIS
  /VARIABLES=dbp58
  /CRITERIA=CI(.95).

References:

Quant: Paired Sample Statistics in SPSS

Introduction

The aim of this analysis is to compare productivity under two organizational structures. The data are artificial estimates of productivity, with column 1 representing traditional vertical management and column 2 representing autonomous work teams (ATW). The background is that a company of 100 factory workers had been operating under traditional vertical management and decided to move to ATW. The same employees were involved in both systems, having first worked under vertical management and then being converted to ATW.

From the SPSS outputs the goal is to:

  • Analyze the productivity levels of the 2 management approaches, and decide which is superior.

Hypothesis

  • Null: Mean productivity does not differ between prodpre and prodpost
  • Alternative: Mean productivity differs between prodpre and prodpost

Methodology

For this project, the atw.sav file is loaded into SPSS (ATW, n.d.).  The goal is to look at the relationships between the following variables: prodpre (productivity level preceding the new process) and prodpost (productivity level following the new process). To conduct a parametric analysis, navigate to Analyze > Compare Means > Paired-Samples T Test.  The variable prodpre was placed in the “Paired Variables” box under “Pair” 1 and “Variable 1”, and prodpost was placed under “Pair” 1 and “Variable 2”.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.). The following output was observed in the next three tables.

Results

Table 1: Paired Sample Statistics

Mean N Std. Deviation Std. Error Mean
Pair 1 productivity level preceding the new process 76.43 100 16.820 1.682
productivity level following the new process 84.24 100 9.797 .980

Descriptively, mean productivity increased by about 8 points (from 76.43 to 84.24), and the standard deviation decreased by about 7 points (from 16.820 to 9.797).  The productivity estimates under traditional vertical management are therefore lower and more widely spread than those under autonomous work teams.  In short, workers received higher productivity estimates with less variation under autonomous work teams.

Table 2: Paired Samples Correlation

N Correlation Sig.
Pair 1 productivity level preceding the new process & productivity level following the new process 100 .040 .695

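Based on Table 2, the correlation between the productivity estimates under traditional vertical management and under autonomous work teams is very weak and not statistically significant (r = 0.040, p = .695).  A worker’s productivity under one structure therefore says little about that worker’s productivity under the other.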

Table 3: Paired Samples Test

Paired Differences t df Sig. (2-tailed)
Mean Std. Deviation Std. Error Mean 95% Confidence Interval of the Difference
Lower Upper
Pair 1 productivity level preceding the new process – productivity level following the new process -7.817 19.126 1.913 -11.612 -4.022 -4.087 99 .000

Based on the results of the two-tailed paired-samples t-test (Table 3), the null hypothesis can be rejected (t = -4.087, df = 99, p < .001).  There is a significant difference between prodpre and prodpost at the 0.05 level.  The data from 100 workers show that productivity estimates were on average 7.817 points higher under autonomous work teams than under traditional vertical management, which makes autonomous work teams the superior approach on this measure.
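
The t value and confidence interval in Table 3 follow directly from the mean and standard deviation of the paired differences. The Python sketch below re-derives them; the critical value 1.984 is an assumption taken from a standard t table for 99 degrees of freedom, not a value reported in the SPSS output.

```python
import math

# Paired-difference summary copied from Table 3 (prodpre minus prodpost).
mean_diff = -7.817
sd_diff = 19.126
n = 100

# Standard error of the mean difference and the paired t statistic.
se_diff = sd_diff / math.sqrt(n)
t_stat = mean_diff / se_diff
df = n - 1

# Assumption: two-tailed 5% critical t value for df = 99, from a t table.
T_CRIT = 1.984
ci_lower = mean_diff - T_CRIT * se_diff
ci_upper = mean_diff + T_CRIT * se_diff

print("t:", round(t_stat, 3), "df:", df)
print("95% CI:", round(ci_lower, 2), "to", round(ci_upper, 2))
```

The interval lies entirely below zero, which is another way of seeing that productivity before the change was significantly lower than productivity after it.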

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

T-TEST PAIRS=prodpre WITH prodpost (PAIRED)
  /CRITERIA=CI(.9500)
  /MISSING=ANALYSIS.

References:

Quant: Exploring Data with SPSS

Introduction

The aim of this analysis is to examine the distribution of diastolic blood pressure (DBP58) for individuals with no history of coronary heart disease (CHD) and for individuals with a history of CHD. The variable that records individual history is CHD.

From the SPSS outputs the following questions will be addressed:

  • What can be determined from the measures of skewness and kurtosis about a normal curve? What are the mean and median?
  • Does one seem better than the other to represent the scores?
  • What differences can be seen in the pattern of responses of those with history versus those with no history?
  • What information can be determined from the box plots?

Methodology

For this project, the electric.sav file is loaded into SPSS (Electric, n.d.).  The goal is to look at the relationship between the following variables: DBP58 (Average Diastolic Blood Pressure) and CHD (Incidence of Coronary Heart Disease). To conduct a descriptive analysis, navigate through Analyze > Descriptive Statistics > Explore.  The variable DBP58 was placed in the “Dependent List” box, and CHD was placed in the “Factor List” box.  In the Explore dialog box, the “Statistics” button was clicked, and “Descriptives” with a 95% “Confidence interval for the mean” was selected along with outliers and percentiles.  Back in the Explore dialog box, the “Plots” button was clicked; under the “Boxplot” section only “Factor levels together” was selected, under the “Descriptive” section both options were selected, and under the “Spread vs. Level with Levene Test” section “None” was selected.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.). The output is shown in the next four tables and five figures.

Results

Table 1: Case Processing Summary.

Incidence of Coronary Heart Disease Cases
Valid Missing Total
N Percent N Percent N Percent
Average Diast Blood Pressure 58 none 119 99.2% 1 0.8% 120 100.0%
chd 120 100.0% 0 0.0% 120 100.0%

According to Table 1, at least 99.2% of the data is valid in both groups, those with a history of Coronary Heart Disease (CHD) and those without. There is one missing data point in the group with no history of CHD. The data set contains 120 participants in each group.

Table 2: Descriptive Statistics on the Incidents of Coronary Heart Disease and the Average Diastolic Blood Pressure.

Incidence of Coronary Heart Disease Statistic Std. Error
Average Diast Blood Pressure 58 none Mean 87.66 1.005
95% Confidence Interval for Mean Lower Bound 85.66
Upper Bound 89.65
5% Trimmed Mean 87.31
Median 87.00
Variance 120.312
Std. Deviation 10.969
Minimum 65
Maximum 125
Range 60
Interquartile Range 15
Skewness .566 .222
Kurtosis .671 .440
chd Mean 89.92 1.350
95% Confidence Interval for Mean Lower Bound 87.24
Upper Bound 92.59
5% Trimmed Mean 88.89
Median 87.00
Variance 218.732
Std. Deviation 14.790
Minimum 65
Maximum 160
Range 95
Interquartile Range 18
Skewness 1.406 .221
Kurtosis 3.620 .438

According to Table 2, mean diastolic blood pressure is about 2.3 points higher, and its standard error 0.345 higher, with CHD than without.  The median is 87 in both groups. For patients with CHD the mean is 89.92, pulled above the median by a positive skew (skewness of 1.406, kurtosis of 3.620).  For the cases without CHD, the mean blood pressure is 87.66, close to the median, reflecting only mild skew (skewness of 0.566, kurtosis of 0.671).  Because the mean is pulled upward by the skew, the median is the more robust summary of the scores, particularly for the CHD group.  Inspection of Figures 1 & 2 suggests that the skewness is largely the result of a few outliers, and the box plot in Figure 3 confirms these outliers.  The positive kurtosis values of 0.671 and 3.620 indicate that both distributions are leptokurtic, meaning they are more peaked and heavier-tailed than a normal distribution.
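
One common rule of thumb, which the SPSS output itself does not report, is to divide each skewness and kurtosis statistic by its standard error and treat ratios beyond roughly ±1.96 as a noticeable departure from a normal curve. The Python sketch below applies that rule to the values in Table 2; the threshold and the dictionary layout are assumptions made for illustration.

```python
# (statistic, standard error) pairs copied from Table 2.
shape_stats = {
    "none": {"skewness": (0.566, 0.222), "kurtosis": (0.671, 0.440)},
    "chd": {"skewness": (1.406, 0.221), "kurtosis": (3.620, 0.438)},
}

# Assumption: rule-of-thumb cutoff for a ratio that departs from normality.
THRESHOLD = 1.96

ratios = {}
for group, stats in shape_stats.items():
    ratios[group] = {}
    for name, (statistic, std_error) in stats.items():
        ratio = statistic / std_error
        ratios[group][name] = ratio
        flag = "departs from normal" if abs(ratio) > THRESHOLD else "within range"
        print(group, name, round(ratio, 2), flag)
```

Under this rule the CHD group departs from normality on both measures, while the group without CHD departs mainly in skewness; with samples of about 120 per group the rule is only a rough guide.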

Figure 1: Histogram on the Incidents of Coronary Heart Disease = none and the Average Diastolic Blood Pressure.

Figure 2: Histogram on the Incidents of Coronary Heart Disease = chd and the Average Diastolic Blood Pressure.

Figure 3: Box plots on the Incidents of Coronary Heart Disease and the Average Diastolic Blood Pressure.

Comparing the two histograms in Figures 1 & 2, the data are more positively skewed when there is CHD than when there is not.  The spread increases by about 3.8 points (the difference in standard deviations, 14.790 versus 10.969) when there is CHD.  This shows that blood pressure in the sample varies more when there is CHD, whereas blood pressure is somewhat more stable in the sample without CHD.  The range of the average diastolic blood pressure is also larger with CHD (95 versus 60), which is consistent with the greater standard deviation and can be seen in Figure 3.  The interquartile range (which represents the middle 50% of the participants) is smaller for participants without CHD (15) than for participants with CHD (18). Case 120, with a value of 160, lies far outside the rest of the CHD distribution and is flagged as an extreme value.

Table 3: Percentiles on the Incidents of Coronary Heart Disease and the Average Diastolic Blood Pressure.

Incidence of Coronary Heart Disease Percentiles
5 10 25 50 75 90 95
Weighted Average (Definition 1) Average Diast Blood Pressure 58 none 71.00 75.00 80.00 87.00 95.00 102.00 105.00
chd 70.05 75.00 80.00 87.00 98.00 109.90 117.95
Tukey’s Hinges Average Diast Blood Pressure 58 none 80.00 87.00 94.50
chd 80.00 87.00 98.00

Table 3 lists the percentiles of average diastolic blood pressure by incidence of CHD.  95% of cases fall below a diastolic blood pressure of 105 for those with no history of CHD and below 117.95 for those with a history of CHD.  The two groups share the same 10th, 25th, and 50th percentiles (75, 80, and 87), but the upper percentiles are higher with CHD.  Thus, without CHD the diastolic blood pressure values are centered more closely on the median value of 87, which is supported by the tables and figures above.

Table 4: Extreme Values on the Incidents of Coronary Heart Disease and the Average Diastolic Blood Pressure.

Incidence of Coronary Heart Disease Case Number Value
Average Diast Blood Pressure 58 none Highest 1 163 125
2 232 119
3 144 115
4 126 110
5 131 109
Lowest 1 157 65
2 156 65
3 175 68
4 153 68
5 237 69
chd Highest 1 120 160
2 56 133
3 42 125
4 26 121
5 111 120
Lowest 1 73 65
2 34 68
3 101 70
4 33 70
5 7 70a
a. Only a partial list of cases with the value 70 are shown in the table of lower extremes.

Table 4 lists the five highest and five lowest values in each group.  The lowest diastolic blood pressure among those with no CHD is 65, the same as among those with CHD.  However, the highest value with CHD (160) is 35 points greater than the highest value without CHD (125).

 Frequency    Stem &  Leaf
      .00        6 .
     5.00        6 .  55889
     4.00        7 .  1144
    18.00        7 .  555677777777888899
    21.00        8 .  000000000001122223344
    21.00        8 .  555556666777777888999
    20.00        9 .  00000111111222233334
    14.00        9 .  55666777888899
     8.00       10 .  00012233
     4.00       10 .  5559
     1.00       11 .  0
     1.00       11 .  5
     2.00 Extremes    (>=119)
 Stem width:   10
 Each leaf:        1 case(s)

Figure 4: Stem and leaf plot on the Incidents of Coronary Heart Disease = none and the Average Diastolic Blood Pressure.

 Frequency    Stem &  Leaf
      .00        6 .
     2.00        6 .  58
     9.00        7 .  000012233
    14.00        7 .  55555677788899
    23.00        8 .  00000000000111233333344
    24.00        8 .  555556667777777788999999
    11.00        9 .  00001122223
    13.00        9 .  6677788888999
     5.00       10 .  02333
     7.00       10 .  5557789
     4.00       11 .  0003
     3.00       11 .  578
     2.00       12 .  01
     1.00       12 .  5
     2.00 Extremes    (>=133)
 Stem width:   10
 Each leaf:        1 case(s)

Figure 5: Stem and leaf plot of the Average Diastolic Blood Pressure for Incidence of Coronary Heart Disease = chd.

Figures 4 and 5 show more detail than the histograms by stating the actual frequency to the left of each stem value and by flagging which values are considered extreme.  For participants with CHD, a diastolic blood pressure of 133 or more is considered extreme, and for those without CHD, a diastolic blood pressure of 119 or more is considered extreme.

Conclusions

There is a difference in the distribution of average diastolic blood pressure between participants who have a history of Coronary Heart Disease (CHD) and those who do not.  This is represented through the range, skewness, and distribution of both groups.  Both groups have similar medians and lowest values, but vary greatly in the mean, standard deviation, and highest values of diastolic blood pressure.

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

EXAMINE VARIABLES=dbp58 BY chd
  /PLOT BOXPLOT STEMLEAF HISTOGRAM
  /COMPARE GROUPS
  /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
  /STATISTICS DESCRIPTIVES EXTREME
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.

References:

Quant: Crosstabs in SPSS

Introduction

The aim of this analysis is to examine whether respondents would continue or stop working if they were rich, and how that choice relates to their highest degree earned, gender, and job satisfaction.

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to look at the relationships between the following variables: richwork (whether the respondent would continue or stop working if rich), sex (demographics of gender), satjob (satisfaction level with the job), and degree (education degree level).  The variable richwork is the dependent variable and the other three variables are considered independent variables for this analysis.  To conduct a crosstabs analysis, navigate through Analyze > Descriptive Statistics > Crosstabs.  The variable richwork was placed in the “Row(s)” box, and the other three variables were placed in the “Column(s)” box.  Then, in the crosstabs dialog box, the “Cells” button was clicked, “Observed” was selected under the “Counts” section, and all three boxes were selected under the “Percentages” section.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.).  The resulting output is shown in the next four tables.

Results

Table 1: Cases Processing Summary.

Cases
                                                                   Valid           Missing         Total
                                                                   N     Percent   N     Percent   N      Percent
IF RICH, CONTINUE OR STOP WORKING * Respondent’s highest degree    625   44.0%     794   56.0%     1419   100.0%
IF RICH, CONTINUE OR STOP WORKING * Respondent’s sex               628   44.3%     791   55.7%     1419   100.0%
IF RICH, CONTINUE OR STOP WORKING * JOB OR HOUSEWORK               624   44.0%     795   56.0%     1419   100.0%

According to Table 1, about 44% (~625) of all cases are valid in all three scenarios and about 56% (~793) had missing data, from a total of 1419 respondents.
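
The valid and missing percentages in Table 1 follow directly from the counts. As a quick check, here is a minimal Python sketch that recomputes them; it is not part of the SPSS workflow, and the dictionary keys and the helper name pct are illustrative assumptions.

```python
# Counts copied from Table 1. The names below are illustrative only;
# they are not produced by SPSS.
TOTAL = 1419
valid_counts = {"degree": 625, "sex": 628, "satjob": 624}


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# For each cross tabulation, derive the missing count and both percentages.
summary = {
    name: {
        "valid_pct": pct(n, TOTAL),
        "missing": TOTAL - n,
        "missing_pct": pct(TOTAL - n, TOTAL),
    }
    for name, n in valid_counts.items()
}

for name, row in summary.items():
    print(name, row)
```

The recomputed values should line up with the Percent columns of Table 1, which is a simple way to confirm the table was transcribed correctly.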

Table 2: If rich do people continue or stop working with respondent’s highest degree cross tabulation.

IF RICH, CONTINUE OR STOP WORKING * Respondent’s highest degree

                                              Less than HS   High school   Junior college   Bachelor   Graduate   Total
CONTINUE WORKING
  Count                                       52             210           39               84         36         421
  % within IF RICH, CONTINUE OR STOP WORKING  12.4%          49.9%         9.3%             20.0%      8.6%       100.0%
  % within Respondent’s highest degree        69.3%          64.6%         81.3%            67.2%      69.2%      67.4%
  % of Total                                  8.3%           33.6%         6.2%             13.4%      5.8%       67.4%
STOP WORKING
  Count                                       23             115           9                41         16         204
  % within IF RICH, CONTINUE OR STOP WORKING  11.3%          56.4%         4.4%             20.1%      7.8%       100.0%
  % within Respondent’s highest degree        30.7%          35.4%         18.8%            32.8%      30.8%      32.6%
  % of Total                                  3.7%           18.4%         1.4%             6.6%       2.6%       32.6%
Total
  Count                                       75             325           48               125        52         625
  % within IF RICH, CONTINUE OR STOP WORKING  12.0%          52.0%         7.7%             20.0%      8.3%       100.0%
  % within Respondent’s highest degree        100.0%         100.0%        100.0%           100.0%     100.0%     100.0%
  % of Total                                  12.0%          52.0%         7.7%             20.0%      8.3%       100.0%

According to Table 2, 67.4% of respondents would continue working and 32.6% would stop working if they were rich.  In our data about 12% have less than a high school diploma, 52% have a high school diploma, 7.7% have gone to junior college, 20% have a bachelor’s degree, and 8.3% have a graduate degree.  Breaking the decision down by the respondent’s highest degree earned, 35.4% of respondents who have only a high school diploma would choose to stop working if they were rich, the highest rate of any degree group.  They also make up 56.4% of all respondents who would stop working, making them the biggest demographic to leave in this “what if” scenario.  Finally, 81.3% of those with a junior college degree would continue working if they were rich, making them the group most likely to stay in this “what if” scenario.  In the remaining four degree groups, approximately 65-69% of respondents would continue working if they were rich.
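
Table 2 reports both row percentages (“% within IF RICH, CONTINUE OR STOP WORKING”) and column percentages (“% within Respondent’s highest degree”), and the two are easy to mix up. The Python sketch below recomputes both for the “stop working” row from the counts in Table 2. It is only an illustration; the variable names are assumptions and nothing here comes from SPSS itself.

```python
# Observed counts from Table 2 (rows: decision, columns: highest degree).
degrees = ["Less than HS", "High school", "Junior college", "Bachelor", "Graduate"]
counts = {
    "continue": [52, 210, 39, 84, 36],
    "stop": [23, 115, 9, 41, 16],
}

# Column totals: number of respondents in each degree category.
col_totals = [c + s for c, s in zip(counts["continue"], counts["stop"])]

# Column percentage: share of each degree category that would stop working.
stop_pct = {
    d: round(100 * s / t, 1)
    for d, s, t in zip(degrees, counts["stop"], col_totals)
}

# Row percentage: share of all "stop working" respondents in each category.
stop_total = sum(counts["stop"])
stop_share = {d: round(100 * s / stop_total, 1) for d, s in zip(degrees, counts["stop"])}

# Degree category with the highest rate of stopping work.
highest_stop = max(stop_pct, key=stop_pct.get)

print(stop_pct)
print(stop_share)
print(highest_stop)
```

The column percentage answers “of the people with this degree, how many would stop?”, while the row percentage answers “of the people who would stop, how many hold this degree?”. The second is strongly influenced by how large each degree group is.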

Table 3: If rich do people continue or stop working with respondent’s gender cross tabulation.

IF RICH, CONTINUE OR STOP WORKING * Respondent’s sex

                                              Male     Female   Total
CONTINUE WORKING
  Count                                       214      209      423
  % within IF RICH, CONTINUE OR STOP WORKING  50.6%    49.4%    100.0%
  % within Respondent’s sex                   69.3%    65.5%    67.4%
  % of Total                                  34.1%    33.3%    67.4%
STOP WORKING
  Count                                       95       110      205
  % within IF RICH, CONTINUE OR STOP WORKING  46.3%    53.7%    100.0%
  % within Respondent’s sex                   30.7%    34.5%    32.6%
  % of Total                                  15.1%    17.5%    32.6%
Total
  Count                                       309      319      628
  % within IF RICH, CONTINUE OR STOP WORKING  49.2%    50.8%    100.0%
  % within Respondent’s sex                   100.0%   100.0%   100.0%
  % of Total                                  49.2%    50.8%    100.0%

In our sample data set about 49.2% were male and 50.8% were female, according to Table 3.  Looking at whether people would continue or stop working by the respondent’s gender, 34.5% of women and 30.7% of men would choose to stop working if they were rich.  Gender does not seem to be a strong indicator of whether a respondent is more likely to continue or stop working if they were rich in this “what if” scenario.
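
One way to put a number on how weak the gender relationship looks is a Pearson chi-square statistic computed from the four counts in Table 3. This test was not run in the SPSS analysis above, so the Python sketch below is only a rough cross-check under stated assumptions: no continuity correction is applied, and the 3.841 critical value is the standard table value for one degree of freedom at the 0.05 level.

```python
# Observed counts from Table 3 (rows: continue/stop, columns: male/female).
observed = [
    [214, 209],
    [95, 110],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected

# Assumption: standard critical value for df = 1 at alpha = 0.05.
CRITICAL_05_DF1 = 3.841

print(round(chi_square, 2), chi_square > CRITICAL_05_DF1)
```

A statistic below the critical value is consistent with the reading that gender is not a strong indicator here. A full test in SPSS would also report the exact significance level.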

Table 4: If rich would people continue or stop working with respondent’s job satisfaction cross tabulation.

IF RICH, CONTINUE OR STOP WORKING * JOB OR HOUSEWORK

                                              VERY SATISFIED   MOD. SATISFIED   A LITTLE DISSAT   VERY DISSATISFIED   Total
CONTINUE WORKING
  Count                                       199              172              36                14                  421
  % within IF RICH, CONTINUE OR STOP WORKING  47.3%            40.9%            8.6%              3.3%                100.0%
  % within JOB OR HOUSEWORK                   71.8%            64.9%            60.0%             63.6%               67.5%
  % of Total                                  31.9%            27.6%            5.8%              2.2%                67.5%
STOP WORKING
  Count                                       78               93               24                8                   203
  % within IF RICH, CONTINUE OR STOP WORKING  38.4%            45.8%            11.8%             3.9%                100.0%
  % within JOB OR HOUSEWORK                   28.2%            35.1%            40.0%             36.4%               32.5%
  % of Total                                  12.5%            14.9%            3.8%              1.3%                32.5%
Total
  Count                                       277              265              60                22                  624
  % within IF RICH, CONTINUE OR STOP WORKING  44.4%            42.5%            9.6%              3.5%                100.0%
  % within JOB OR HOUSEWORK                   100.0%           100.0%           100.0%            100.0%              100.0%
  % of Total                                  44.4%            42.5%            9.6%              3.5%                100.0%

Finally, in Table 4, about 44.4% of respondents are very satisfied at work, 42.5% are moderately satisfied, 9.6% are a little dissatisfied, and 3.5% are very dissatisfied.  Looking at whether people would continue or stop working by the respondent’s job satisfaction level, 40.0% of respondents who are a little dissatisfied would choose to stop working if they were rich, making them the biggest demographic to leave in this “what if” scenario.  In fact, respondents who were anything but very satisfied with their job were approximately 7-12 percentage points more likely to want to stop working if they were rich.  Conversely, 71.8% of those who are very satisfied with their jobs would continue working if they were rich, making them the biggest demographic to stay in this “what if” scenario.
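
The gap between the very satisfied group and the other satisfaction levels can be recomputed from the counts in Table 4. The Python sketch below expresses each gap in percentage points relative to the very satisfied group; the variable names are illustrative assumptions rather than SPSS output.

```python
# Observed counts from Table 4, ordered by job satisfaction level.
levels = ["Very satisfied", "Mod. satisfied", "A little dissat", "Very dissatisfied"]
continue_counts = [199, 172, 36, 14]
stop_counts = [78, 93, 24, 8]

# Percentage within each satisfaction level that would stop working.
stop_rate = {
    lvl: 100 * s / (c + s)
    for lvl, c, s in zip(levels, continue_counts, stop_counts)
}

# Gap, in percentage points, relative to the very satisfied group.
baseline = stop_rate["Very satisfied"]
point_gap = {
    lvl: round(rate - baseline, 1)
    for lvl, rate in stop_rate.items()
    if lvl != "Very satisfied"
}

print({lvl: round(rate, 1) for lvl, rate in stop_rate.items()})
print(point_gap)
```

Reporting the gap in percentage points keeps it distinct from a relative change, which would be a different and larger number.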

Conclusions

Overall, this analysis suggests that highest degree earned and job satisfaction may be contributing factors in a respondent’s decision to continue or stop working if they were rich in this “what if” scenario.  However, gender may not play an important role in answering this question.

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

CROSSTABS
  /TABLES=richwork BY degree sex satjob
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT ROW COLUMN TOTAL
  /COUNT ROUND CELL.

References: