undergraduate statistics for sociology students: October 2008

Tuesday, October 28, 2008

Partial Correlation

A partial correlation is the same as a Pearson's bivariate correlation, except that you add a control variable.

The control variable must be continuous, and the independent and dependent variables must both be continuous.

You perform and interpret the hypothesis test the same as for a Pearson's bivariate correlation. The hypotheses are:

Ho: There is no relationship between IV and DV, controlling for CV. r = 0

H1: There is a relationship between IV and DV, controlling for CV. r ≠ 0

*Fill in what IV, DV, and CV are in the above hypotheses.

Example 1:

Dependent Variable = income; measured in dollars

Independent Variable = tv hours; measured in daily hours watched

Control Variable (CV) = education; measured in years

Hypotheses:

Null: There is no relationship between the number of TV hours watched daily and income, adjusting for the number of years of education. r = 0

Research: There is a relationship between the number of TV hours watched daily and income, adjusting for the number of years of education. r ≠ 0

When we control for the effect of education on the relationship between number of TV hours watched daily and income, we find the following by doing a partial correlation with the GSS2000 data:

r = -.11, p = .000

p is less than alpha. Reject null.

There is a weak, negative relationship between the number of TV hours watched daily and income when we control for the effect of education. As the number of daily TV viewing hours goes up, income goes down for people of any level of education (r = -.11, p = .000). Or, as income increases the number of daily TV viewing hours goes down.

r2 = .11 * .11 = .0121

When controlling for the effect of education, the number of TV hours watched daily explains 1% of the variation in income. Or, when controlling for the effect of education, income explains 1% of the variation in the number of TV hours watched daily.
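As a rough illustration of what "controlling for" means here, the sketch below computes a partial correlation from the three pairwise Pearson correlations using the standard first-order formula. The data are made up for eight hypothetical respondents (they are not GSS values), and the function names `pearson_r` and `partial_r` are just labels chosen for this example. SPSS does the same job for you.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(iv, dv, cv):
    """Correlation between iv and dv with the control variable cv held constant."""
    r_xy = pearson_r(iv, dv)
    r_xz = pearson_r(iv, cv)
    r_yz = pearson_r(dv, cv)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data for eight respondents (NOT from the GSS).
tv_hours = [1, 2, 2, 3, 4, 4, 5, 6]            # daily hours watched
income = [62, 55, 48, 50, 35, 41, 30, 28]      # thousands of dollars
education = [16, 16, 14, 13, 12, 12, 11, 10]   # years

r_bivariate = pearson_r(tv_hours, income)
r_partial = partial_r(tv_hours, income, education)

print("bivariate r:", round(r_bivariate, 2))
print("partial r, controlling for education:", round(r_partial, 2))
print("percent of variation explained:", round(r_partial ** 2 * 100, 1))
```

Comparing the bivariate r with the partial r shows how much of the original relationship was really due to education.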

DO THIS AGAIN WITH GSS2006 DATA

Example 2: Take Home Exercise

I think the number of children that people have is determined, in part, by the amount of education they have, controlling for current income.

number of children (childs)
education (educ)
income (rincom98)

GET STATS from GSS2006

Example 3: Take Home Exercise

I think that age influences how well students do in college statistics courses, controlling for the number of previous math or statistics courses taken.

age = age in years
performance in statistics = average score in course on a scale of 0 to 100
previous math or statistics courses = total number of math or statistics courses previously taken for college credit

r = .18, p = .36

ANCOVA

Stands for Analysis of Covariance

This is a multivariate means test.

It is just like the ANOVA (aka multiple group means test) you learned in the last section. But it enables you to add a control variable.

Used when:

DV = continuous

IV = categorical with 2 or more categories (nominal or ordinal)

CV = continuous

You write the same hypotheses as with ANOVA, do the test the same way, and interpret the results the same way.

Null: There is no relationship between the IV and the DV, controlling for the CV. The means are equal. Mean 1 = mean 2 = mean 3 ..... F = 0.

Research: There is a relationship between the IV and the DV, controlling for the CV. The means are not equal. Mean 1 ≠ mean 2 ≠ mean 3 .... F ≠ 0.

*Write in the names of the variables. And write out a mean for each category in the IV.
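To see what a mean "controlling for the CV" looks like, here is a minimal sketch of how ANCOVA-style adjusted means can be computed: each group's DV mean is shifted to where it would be if the group sat at the overall CV mean, using the pooled within-group slope of the DV on the CV. The (hours worked, SEI) pairs are invented for illustration, and `adjusted_means` is a hypothetical helper name. The sketch does not compute the F test itself, which SPSS reports.

```python
def mean(values):
    return sum(values) / len(values)

def adjusted_means(groups):
    """groups maps a category name to a list of (cv, dv) pairs.

    Returns each group's DV mean adjusted to the overall CV mean,
    using the pooled within-group slope of the DV on the CV.
    """
    all_cv = [cv for pairs in groups.values() for cv, _ in pairs]
    grand_cv_mean = mean(all_cv)

    # Pooled within-group slope: change in the DV per one-unit change in the CV.
    numerator = 0.0
    denominator = 0.0
    for pairs in groups.values():
        cv_mean = mean([cv for cv, _ in pairs])
        dv_mean = mean([dv for _, dv in pairs])
        numerator += sum((cv - cv_mean) * (dv - dv_mean) for cv, dv in pairs)
        denominator += sum((cv - cv_mean) ** 2 for cv, _ in pairs)
    slope = numerator / denominator

    result = {}
    for name, pairs in groups.items():
        cv_mean = mean([cv for cv, _ in pairs])
        dv_mean = mean([dv for _, dv in pairs])
        result[name] = dv_mean - slope * (cv_mean - grand_cv_mean)
    return result

# Hypothetical (hours worked per week, SEI) pairs, NOT from the GSS.
data = {
    "men": [(50, 56), (45, 52), (40, 47), (45, 53)],
    "women": [(35, 44), (40, 50), (30, 40), (35, 46)],
}

for group, value in adjusted_means(data).items():
    print(group, "adjusted mean SEI:", round(value, 2))
```

In this made-up data the men work more hours on average, so their adjusted mean ends up lower than their raw mean and the women's ends up higher.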

Example 1:

Dependent Variable = SEI; measured on a scale of 0-100.

Independent Variable = sex; measured as men and women

Control Variable (CV) = number of hours usually worked weekly (hrs2); measured as # of hours

Hypotheses:

Null: There is no relationship between sex and SEI, controlling for the number of hours worked per week. Mean SEI for women = mean SEI for men.

Research: There is a relationship between sex and SEI, controlling for the number of hours worked per week. Mean SEI for women ≠ mean SEI for men.

Using the GSS2000, we find:

Mean SEI men, controlling for the number of hours worked = 50.37, s = 20.05

Mean SEI women, controlling for the number of hours worked = 47.36, s = 16.35

F = 1.48, p = .24

We would accept the null hypothesis and conclude that there is no difference in the mean SEI for men and for women, controlling for the influence of the number of hours worked. Men's and women's SEI, on average, is somewhere between 47 and 50 (about halfway on the scale of 0-100), holding the number of hours worked per week constant.

DO THIS AGAIN WITH GSS2006 DATA

Example 2:

DV = Education

IV = Race

CV = Age

Ho: There is no relationship between race and education, controlling for age. Mean Education of Whites = Mean Education of Blacks = Mean Education of "Others"

H1: There is a relationship between race and education, controlling for age. Mean Education of Whites ≠ Mean Education of Blacks ≠ Mean Education of "Others"

Mean Education of Whites, Holding Age Constant = 13.43
Mean Education of Blacks, Holding Age Constant = 12.33
Mean Education of "Others", Holding Age Constant = 13.42

F=46.00, p = .000

Reject Null.

There is a relationship between race and education, controlling for age. The mean education of African Americans (12.33 years) is lower than that of Whites and people of other races (about 13.4 years for both groups).

DO THIS AGAIN WITH GSS2006 DATA

Example 3: Take Home Exercise

Does social class influence the number of hours worked controlling for number of children?

Social class (class) = lower class, working class, middle class, and upper class

Number of hours worked last week (hrs1) is measured in hours and ranges from 3 to 89.

Number of children (childs) is measured as a count. It ranges from 0 to 8 kids.

Average number of hours worked, holding number of children constant
Lower class: 36.18
Working class: 42.01
Middle class: 42.24
Upper class: 40.47

F = 2.96, p = .02

Example 4: Take Home Exercise

Does wearing a condom (yes/no) influence the number of children (measured in numbers) that people have, controlling for income (in dollars)?

Results: F = 22.76, p = .000

Mean # of children among those who usually wear a condom, controlling for income: 1.11 children on average

Mean # of children among those who usually do not wear a condom, controlling for income: 1.72 on average

Monday, October 27, 2008

Elaboration

Elaboration is a crosstab adding a control variable (CV).

The control variable must be categorical, with at least two categories, and ideally no more than five categories. The independent and dependent variables must be categorical, with at least two categories and ideally no more than five categories.

With elaboration you get a separate crosstab for every category of your control variable, including a chi-square and p value for each crosstab. This allows you to isolate the effect of the IV on the DV for every value of the CV.

You perform and interpret the hypothesis test the same as for a bivariate crosstab, but you must do it for each of the values of the control variable. The hypotheses are:

Ho: There is no relationship between IV and DV, controlling for a second independent variable (called a CV). Chi-square = 0

H1: There is a relationship between IV and DV, controlling for a second independent variable (called a CV). Chi-square ≠ 0

We will not be doing elaboration by hand because it will take too long! Although you could. It is not difficult. Just do a bivariate crosstab for each value of the CV.
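For anyone curious what "a bivariate crosstab for each value of the CV" looks like in practice, here is a small sketch. The counts are invented (they are not GSS results), and the helper names `chi_square` and `p_value_2x2` are just labels for this example. The p value shortcut shown only works for 2x2 tables (df = 1); SPSS handles larger tables.

```python
import math

def chi_square(table):
    """Chi-square statistic for a crosstab given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_2x2(stat):
    """p value for a chi-square statistic with df = 1 (2x2 tables only)."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts, NOT GSS results.
# Rows = race (white, black); columns = marijuana (should be legal, should not).
crosstabs_by_ideology = {
    "liberal": [[60, 40], [30, 20]],
    "moderate": [[45, 55], [20, 30]],
    "conservative": [[30, 70], [25, 25]],
}

# One separate crosstab, chi-square, and p value per category of the CV.
for ideology, table in crosstabs_by_ideology.items():
    stat = chi_square(table)
    print(ideology, "chi-square =", round(stat, 2), "p =", round(p_value_2x2(stat), 3))
```

You would then compare each p value to alpha and interpret the relationship separately for liberals, moderates, and conservatives.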

Example 1: Does race influence people's attitudes about legalizing marijuana when you control for political ideology?

Dependent Variable = attitude about marijuana legalization; measured as should be legal or should not be legal

Independent Variable = race; measured as white or black

Control Variable (CV) = political ideology measured as liberal, moderate, conservative

All variables are categorical, so analysis = elaboration.

Hypotheses:

Null: There is no relationship between race and attitudes about legalizing marijuana, controlling for political ideology. Chi-square = 0.

Research: There is a relationship between race and attitudes about legalizing marijuana, controlling for political ideology. Chi-square does not equal 0.

Do on the Computer, Using SPSS and the GSS 2006

Example 2: Does race influence people's attitudes about gun permits when controlling for educational degree?

DV = whether people oppose or favor requiring gun owners to have a permit (gunlaw)

IV = race

CV = educational degree (degree) measured as less than HS, HS, some college, college degree, graduate degree

All variables are categorical. So analysis = elaboration

Hypotheses:

Null: There is not a relationship between race and attitudes about gun permits, controlling for educational degree. Chi-square = 0.

Research: There is a relationship between race and attitudes about gun permits, controlling for educational degree. Chi-square does not equal 0.

Do on the Computer, Using SPSS and GSS 2006

Example 3: Take Home Assignment

Does gender influence whether people think preschool age children are hurt when their mother works outside of the home, controlling for race?

Do on the computer. Using SPSS and GSS 2006

Multivariate Analyses Chart

I am emailing you a multivariate analysis chart. This indicates what multivariate analysis you should use given the level of measurement of your independent, dependent, and additional variables. That is the new twist with multivariate analyses: You will have additional variables. The simplest case is just one additional variable, a third variable, called a control variable. We will start there. But usually you have many more additional variables, all of which are called independent variables. This enables us to approximate what happens in real life.

Tuesday, October 07, 2008

Sample Exams, Round Two

I am going to email you two sample exams to guide you in studying for your second exam on Friday, October 24th. Your exam will look just like these in form. I likely will not give you a question that requires calculating chi-square by hand, because of the time restrictions.

Bivariate Correlations

Bivariate Correlations (Pearson's r)

Good news! You will not calculate this by hand.

Used when both the DV and the IV are continuous. (Robust to minor violations in distributional assumptions.)

A correlation indicates what the linear relationship is between two variables. It indicates how the two variables covary.

A positive correlation means that as one variable goes up in value, the other variable goes up too. Or as one variable goes down in value, the other variable goes down too. A positive correlation means the two variables vary in the same direction (either they both go up when one changes, or they both go down when one changes).

See scatterplot on board (x= education, y = income)

A negative correlation means that as one variable goes up in value, the other variable goes down. Or as one variable goes down in value, the other variable goes up. A negative correlation means the two variables vary in opposite directions.

See scatterplot on board. (x= age, y = crime)

Correlations (denoted with the symbol "r") range from -1 to +1. A -1 means there is a perfect negative linear relationship between the two variables. A +1 means there is a perfect positive linear relationship between the two variables. A 0 correlation means that there is no linear relationship between the two variables. The closer r is to -1 or +1, the stronger the linear relationship.

See scatterplot on board. (x = age, y = shoe size)
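You will not calculate r by hand in this class, but a short sketch of the calculation may make the sign and the range easier to picture. The numbers below are invented to mimic the scatterplots described above (they are not real data), and `pearson_r` is just a name chosen for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariation of x and y divided by their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, made up for illustration only.
education = [10, 12, 12, 14, 16, 18]   # years
income = [22, 30, 27, 41, 48, 60]      # thousands of dollars
age = [18, 22, 30, 40, 55, 70]         # years
offenses = [9, 8, 6, 4, 2, 1]          # count

print("education and income:", round(pearson_r(education, income), 2))  # same direction
print("age and offenses:", round(pearson_r(age, offenses), 2))          # opposite directions
print("y is exactly 2 times x:", round(pearson_r([1, 2, 3], [2, 4, 6]), 2))
```

The first pair varies in the same direction, the second in opposite directions, and the last pair is a perfect straight line.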

A correlation of +1 or -1 in the social sciences is rare. Usually it only occurs if you are using an IV and a DV that are essentially the same thing. For example, IV= age, DV= cohort.

Very strong correlations are rare in the social sciences (social behavior is complicated).

The size of the correlation depends on:

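1. Sample size: In small samples, r bounces around a lot from one sample to the next. Bigger n's give more stable r's and make a given r more likely to be statistically significant.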

2. Distribution of your variables: Pearson's r may not work well if your variables are not normally distributed.

3. Unit of analysis: Large units of analysis (ex. countries, businesses, even households) lead to higher r's, because there is usually less variation in any x and y among large units of analysis.

Example: IV = health care expenditures ($'s), DV = health status (0-10)

Unit of analysis = people; r = .20

Unit of analysis = countries; r = .75

With small units of analysis (such as the individual respondents in the GSS), we usually do not find correlations higher than .3 or so.

Explained Variation (r2)

You can also calculate the amount of variation that the IV explains of the DV by squaring the correlation.

r2 = r * r

r2 * 100 = The percent of variation in the DV that the IV explains.

Example: IV = health care expenditures ($'s), DV = health status (0-10)

Unit of analysis = people; r = .20, r2 = .04. Convert .04 to percentage (.04 * 100; or move decimal two places to the right). Interpretation: Health care expenditures explain 4% of the variation in people's health status.

Unit of analysis = countries; r = .75, r2 = .56. Convert .56 to percentage (.56 * 100; or move decimal two places to the right). Interpretation: Health care expenditures explain 56% of the variation in a country's health status.

See board for Venn diagram.

Limitations of Correlations

1. Correlation does not prove causation. In the social sciences, many independent variables could also be dependent variables. For example, what influences what -- does education determine income, or does income determine education, or both? If you specify that x = education and y = income and find a significant positive correlation, you should not say that education causes income to increase (unless you are able to establish time ordering, which you usually can't with survey data). The correlation does not indicate causation, only covariance. The correlation between x = income and y = education would be the same as that between x = education and y = income.

2. Correlations only show whether a linear relationship occurs between two variables. Many relationships in the social sciences are not linear. For example, look at the possible relationships between age and the consumption of pornography. See board for scatterplot.

Significance Tests

We use a t test to determine whether a correlation is different from 0. There are 3 research hypotheses we can test. You must choose one:

* There is a relationship between the IV and the DV. r ≠ 0. (two tailed test)
* There is a positive relationship between the IV and the DV. As the IV goes up, the DV goes up too (or as the IV goes down, the DV goes down too). r > 0. (one tailed test, right hand side).
* There is a negative relationship between the IV and the DV. As the IV goes up, the DV goes down (or as the IV goes down, the DV goes up). r < 0. (one tailed test, left hand side.)

Then draw your diagram using the alpha that you set ahead of time. Be sure to draw the number of tails that correspond to your research hypothesis, and to split alpha in half for a two tailed test.

If p is lower than alpha, reject. If p is higher than alpha, accept. Then give your interpretation. If you accept the null, you say that there is no correlation between the IV and the DV. If you reject the null, you say that there is a correlation between the IV and the DV, and then tell us what that relationship is. Is it a positive or negative correlation? As the IV increases, what happens in the DV? How much variation in the DV does the IV explain?

If the r is significant, then the r2 is too, in a bivariate analysis.

Example Correlation Hypothesis Tests and Interpretations

Example 1.

I think that education influences the number of children that people have.

IV = education level (0 to 20)

DV = number of children (0 to 8+)

Null Hypothesis: There is no linear relationship between education and the number of children that people have. r = 0.

Research Hypothesis: There is a linear relationship between education and the number of children that people have. r ≠ 0.

Alpha = .05. Two tailed test. Draw diagram.

From SPSS (using the GSS), we learn that r = -.21, p = .000

r2 = .0441

Reject the null. There is a weak negative relationship between education and the number of children that people have. As education increases, the number of children that people have tends to decrease slightly. Education explains 4.41% of the variation in the number of children.
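SPSS reports the p value directly, but the t statistic behind it can be sketched with the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2. The r below is the one from Example 1; the sample size n = 1000 is an assumption made only for this illustration, because these notes do not report the n that SPSS used.

```python
import math

def t_for_r(r, n):
    """t statistic for testing whether a correlation differs from 0 (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = -0.21   # from Example 1
n = 1000    # ASSUMED sample size, not given in the notes

t_calc = t_for_r(r, n)
percent_explained = r ** 2 * 100

print("t calc:", round(t_calc, 2), "with df =", n - 2)
print("percent of variation explained:", round(percent_explained, 2))
```

With a sample this large, even a weak correlation gives a t far beyond the two tailed critical value of about 1.96, which is why weak correlations in the GSS are so often significant.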

Example 2.

I think that age influences how many siblings that people have. Specifically, I think that older people tend to have more siblings than younger people.

age (18 to 89)

number of siblings (0 to 24)

From SPSS (using the GSS), we learn that r = .14, p = .000

Example 3.

I think that the number of hours that people work per week influences how many times they have sex.

hours worked (3 to 89)

sex frequency (0 to 6)

From SPSS (using the GSS), we learn that r = .06, p = .027

Take Home Exercises (i.e., HOMEWORK)

Example 4
I think people with higher income (measured in dollars) watch less television (measured in hours) than people with lower incomes.

Alpha = .05.

r = -.19, p = .000

Example 5.

Does the number of siblings that people have affect the number of children that they go on to have?

Alpha = .05

r = .23, p =.000

Example 6.

I think that people with higher income have higher education.

alpha = .05

r = .35, p = .000

Matched Group Means Test

Compare sample means from two groups of matched cases.

Matched cases = pre and post test data on the same people, or data on cases that are not independent of each other.

Example of matched cases:

* pre and post weight from a sample of people in an exercise program (pre = before starting program, post = after entering program)
* pre and post anxiety of a sample of stats students (pre = first day of class, post = last day of class)
* physician and patient satisfaction with their interaction (not independent)
* husbands' and wives' assessments of the hours they both spend doing housework

For a matched means test, DV = continuous, IV = categorical

The IV in a matched group means test is the matching variable(s) or group variable. There are only two matched groups in this formula.

df = n - 1

t calc = (mean difference - null value)/ (std dev of difference/ sq rt of n)

CI = mean difference +/- two tailed t crit * (std dev of difference / sq rt of n)

See board for t calc and confidence interval formula.

Must compute the difference in scores for each case before computing the mean difference.

Example 1.

I think physicians and patients will have different levels of satisfaction with their interaction with each other.

Ho: There is no difference in satisfaction scores between patients and physicians.

H1: There is a difference in satisfaction scores between patients and physicians.

Alpha = .05

See board for mathematical hypotheses and diagram.
Satisfaction Scores 1-5, higher score means more satisfied (Note: this isn't really continuous, but some disciplines cheat like this a lot)

Ph Pt Difference
5 4 1
4 4 0
4 2 2
4 4 0
5 4 1
3 3 0
4 1 3
4 2 2
3 3 0
4 1 3

n =10, mean difference = 1.2, standard deviation of the difference = 1.23

df = 10-1 = 9, t crit = 2.26

see board for t-calculation

t calc = 3.08

Reject null. There is a difference in satisfaction scores between patients and physicians. Physicians are more satisfied with the physician - patient interaction than are patients.

Confidence Interval = 1.2 +/- (2.26*.39) = .32 to 2.08

We are 95% confident that the difference in satisfaction between physicians and patients is somewhere between .32 and 2.08 (on a 5 point scale).

Reject null, because 0 is not in the interval.
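The arithmetic in this example can be checked with a few lines of code. This is only a sketch that follows the t calc and confidence interval formulas given above, using the ten physician and patient scores from the table and the t crit of 2.26 from these notes. Carrying full precision gives a t calc of about 3.09 rather than 3.08, because the hand calculation rounds the standard error to .39.

```python
import math
import statistics

# Satisfaction scores from the table above.
physician = [5, 4, 4, 4, 5, 3, 4, 4, 3, 4]
patient = [4, 4, 2, 4, 4, 3, 1, 2, 3, 1]

# Compute the difference for each matched case first.
differences = [ph - pt for ph, pt in zip(physician, patient)]

n = len(differences)
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)   # sample standard deviation (divides by n - 1)
std_error = sd_diff / math.sqrt(n)

null_value = 0
t_calc = (mean_diff - null_value) / std_error

t_crit = 2.26   # two tailed, alpha = .05, df = 9 (value given in these notes)
ci_low = mean_diff - t_crit * std_error
ci_high = mean_diff + t_crit * std_error

print("mean difference:", round(mean_diff, 2))
print("std dev of difference:", round(sd_diff, 2))
print("t calc:", round(t_calc, 2))
print("95% CI:", round(ci_low, 2), "to", round(ci_high, 2))
```

The same steps work for the in class exercises: swap in the two columns of scores and the t crit that matches your alpha and df.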

Example 2. In Class Exercise

Below is the data on what husbands and wives reported as the number of times the husband washed the dishes in the last week. We think that husbands overestimate the number of times that they washed the dishes when compared to what wives report.

Alpha = .10
H W Difference
7 0 7
2 2 0
4 0 4
3 2 1
10 4 6
8 4 4
14 14 0

Mean Difference 3.14
Std Dev of Difference 2.85

Example 3. In Class Exercise

The anxiety level among students who take statistics generally declines by the end of the semester.

Alpha = .05

Second Exam, Friday October 24th

Our next exam is scheduled for Friday, October 24th. This exam will cover all bivariate analyses. It will include four questions. I will post sample exams shortly.

Wednesday, October 01, 2008

Multiple Group Means Test

Multiple Group Means Test: ANOVA

Stands for "analysis of variance"

ANOVA is a means test, just like the means tests you learned in the last section. But it enables you to compare more than 2 groups.

Used when:

DV = continuous

IV = categorical with more than 2 categories, usually 3-6 categories. You can do an ANOVA with an IV that has more than 6 categories, but it is cumbersome to interpret the results.

ANOVA uses an F test to compare the means of the groups. An F distribution is very similar to a chi-square distribution. An F test in ANOVA can only tell you if there is a relationship between two variables -- it can't tell you what that relationship is. Mathematically, this means it can only tell you if at least one of the group means is different from another one. It can't tell you which mean is different.

The hypotheses we test with an F-test in ANOVA are:

Null: There is no relationship between the IV and the DV (write in the names of the variables ...). The means are equal. Mean 1 = mean 2 = mean 3 .... (write out however many groups there are; i.e., write a mean for each of the categories of the IV). F = 0.

Research: There is a relationship between the IV and the DV (write in the names of the variables ...). The means are not equal. Mean 1 ≠ mean 2 ≠ mean 3 .... (write out however many groups there are; i.e., write a mean for each of the categories of the IV). F ≠ 0.

Then draw your diagram using the alpha that you set ahead of time. For an F-test, there will always only be one tail (the right tail). So you will never divide alpha in half.

Then do the ANOVA on the computer. Look at the p value associated with the F that you get. If it is lower than alpha, reject. If it is higher than alpha, accept. And then give your interpretation.

If you reject, you say there is a relationship and then look at the means to determine which mean is higher or lower than the others.

If you accept, you say there is no relationship, and state what the mean is about for all of the groups.
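You will run ANOVA on the computer, but a small sketch of where F comes from may help: it is the variation between the group means divided by the variation within the groups. The hours below are invented for three hypothetical groups (they are not GSS data), and `one_way_f` is just a name picked for this example.

```python
def one_way_f(groups):
    """F statistic for a one-way ANOVA; groups is a list of lists of DV scores."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    k = len(groups)        # number of categories in the IV
    n = len(all_scores)    # total number of cases

    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((score - group_mean) ** 2 for score in group)

    ms_between = ss_between / (k - 1)   # df between = k - 1
    ms_within = ss_within / (n - k)     # df within = n - k
    return ms_between / ms_within

# Hypothetical hours worked per week for three groups (NOT GSS data).
hours = [
    [30, 35, 40],
    [40, 45, 50],
    [41, 44, 47],
]

print("F =", round(one_way_f(hours), 2))
for label, group in zip(["group 1", "group 2", "group 3"], hours):
    print(label, "mean =", sum(group) / len(group))
```

The F only says whether the means differ somewhere; printing the group means is how you see which group is higher or lower.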

Example 1

Does social class influence the number of hours worked?

IV = Social Class (4 categories; lower class, working class, middle class, upper class)

DV = Hours Worked per week

Null Hypothesis: There is no relationship between social class and number of hours worked. The average number of hours worked for each social class is equal. Mean 1 = mean 2 = mean 3 = mean 4. F = 0.

Research Hypothesis: There is a relationship between social class and number of hours worked. The average number of hours worked for each social class is not equal. Mean 1 ≠ mean 2 ≠ mean 3 ≠ mean 4. F ≠ 0.

Alpha = .05. One tailed test (that is all an f-test can do). Draw diagram.

From SPSS, we learn that F = 3.85, p = .009 in 2004

Reject the null. There is a relationship between social class and number of hours worked. People in the lower class work fewer hours (about 36 hours a week) than people in the working, middle and upper classes. People in the middle and working classes work the most (an average of 42 hours a week).

Check this for 2006

Example 2.

Does race influence socio-economic status?

IV = race (3 categories, 1 = white, 2 = black, 3 = other)

DV = socio-economic index (range of 0-100)

Null Hypothesis: There is no relationship between race and SEI. The average SEI for each race is equal. Mean 1 = mean 2 = mean 3. F = 0.

Research Hypothesis: There is a relationship between the race and SEI. The average SEI for each race is not equal. Mean 1 ≠ mean 2 ≠ mean 3. F ≠ 0.

Alpha = .05. One tailed test (this is all f tests can do). Draw diagram.

From SPSS we learn that F = 20.29, p = .000, for GSS 2004.

Reject the null. There is a statistical relationship between race and SEI. The average SEI for black respondents is about 43, which is lower than the average SEI for white respondents (mean = 50) and respondents of other races (mean = 50).

Check this for 2006

Example 3. In Class Exercise or Take Home

Does educational degree attainment influence the number of hours of TV that people watch per day?

Alpha = .05

Do this for GSS 2006

Two Group Means Test

This test allows you to compare two means from a sample.

DV = continuous, IV = categorical*

*Can use a continuous IV but must recode it into categories. Example, could use income, but need to recode into a few income categories.

The IV is the grouping variable. There are only 2 groups in a two group means test, so you must choose the two categories of the IV that you want to compare.

df = n1 + n2 - 2

See board for t calc formula.

See board for confidence interval formula.

Example 1.

Medical sociologists argue that children of two parent homes get sick less often than children of one parent homes. To test this theory, you collect a national sample of families, including data on the average number of days children missed school per year and whether the family has one parent or two parents.

Ho: Children of two parent homes get sick as often or more often than children of one parent homes.

H1: Children of two parent homes get sick less often than children of one parent homes.

See board for mathematical hypotheses and diagram. This formula assumes that the two groups have unequal variances (different standard deviations). There is a different formula if the two groups have equal variances (which is an unusual occurrence in the social sciences).

alpha = .05

1. sub-sample of two parent homes: n = 300, mean = 5, std dev = 2

2. sub-sample of one parent homes: n = 250, mean = 7, std dev = 1.5

df = 300 + 250 - 2 = 548, t crit = 1.64

See board for t calculation. t calc = -13.33

Reject null. Children of two parent homes get sick less often than children of one parent homes.

**If on computer:

t calc = -13.33, p = .000

Reject null. p is less than alpha. Ditto interpretation above.

*************************

Confidence Interval = (5-7) +/- (1.96*.15) = -2.29 to -1.71

We are 95% confident that children of two parent homes miss between 1.71 and 2.29 fewer days per year from school than children of one parent homes.

Reject null because null value (0) is not in the interval.
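The numbers in this example can be reproduced with a short sketch. It assumes the unequal variances version of the standard error mentioned above, sqrt(s1^2/n1 + s2^2/n2), which matches the .15 used in the confidence interval. The function name `two_group_t` is a label chosen for this example. Full precision gives a t calc of about -13.38 rather than -13.33, because the hand calculation rounds the standard error to .15.

```python
import math

def two_group_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and standard error for two independent groups, unequal variances."""
    std_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t_calc = (mean1 - mean2) / std_error
    return t_calc, std_error

# Summary statistics from Example 1 (days of school missed per year).
# Group 1 = two parent homes, group 2 = one parent homes.
t_calc, std_error = two_group_t(5, 2, 300, 7, 1.5, 250)

crit = 1.96   # two tailed critical value for a 95% interval with large df
difference = 5 - 7
ci_low = difference - crit * std_error
ci_high = difference + crit * std_error

print("standard error:", round(std_error, 2))
print("t calc:", round(t_calc, 2))
print("95% CI:", round(ci_low, 2), "to", round(ci_high, 2))
```

The same function can be reused for the other examples in this section by plugging in each group's mean, standard deviation, and n.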

Example 2.

Nationally, fee-for-service (FFS) health insurance plans have different annual costs than preferred provider organizations (PPO). You collect data from a sample of Boone residents, some of whom have FFS health insurance plans and some of whom have PPOs, to find out how the two plans differ.

Ho: The total health care costs are the same for FFS and PPOs.

H1: The total health care costs are different for FFS and PPOs.

alpha = .01

1. sub-sample of FFS patients: n = 1000, mean = 4800, s = 1000

2. sub-sample of PPO patients: n = 1500, mean = 4750, s = 500

df = 1000 + 1500 -2, t crit = 2.58

See board for t calculation and diagram.

t calc = 1.46

Accept null. The total health care costs are the same for FFS (mean = 4800) as for PPOs (mean = 4750).

**If on the computer: t = 1.46, p = .14 (two tailed)

Accept null. p is more than alpha. Ditto interpretation above.

*************************

Confidence Interval = (4800-4750) +/- (2.58*34.16) = -38.13 to 138.13

We are 99% confident that the FFS plan costs somewhere between $38.13 less per year and $138.13 more per year than the PPO plan.

Accept null, because 0 is in the interval.

Example 3. Do In Class

DV = education in years, IV = sex, alpha = .05

Men mean years of school = 13.46, s = 3.05, n =1227

Women mean years of school = 13.11, s = 2.72, n = 1581

Do Hyp Test and CI

Discuss statistical significance vs. substantive significance

Example 4. Take-Home

Does race influence socio-economic status?

Race is measured as White or Black.

Socio-economic status is measured as an index of education, income, and occupational prestige. It ranges from 0 to 100.

Whites: Mean SES = 50.14, s = 19.38, n = 2118

Blacks: Mean SES = 43.45, s = 18.43, n = 398

Alpha = .01

Do Hyp Test and CI

Discuss statistical significance vs. substantive significance

Example 5. Take-Home

Does perceived health status influence how many female sexual partners that people have?

Perceived health status is measured as excellent or good

Number of sexual partners ranges from 0 to 989.

Alpha = .05

Excellent Health: Mean # of female sexual partners = 9.05, s = 47.92, n = 529

Good Health: Mean # of female sexual partners = 7.30, s = 24.76, n = 812

Do Hyp Test and CI

Discuss statistical significance vs. substantive significance
