Exercise 1 Feedback: Odds ratios and degrees of freedom

OOPS! When going over the material today I found two corrections. You will see them below in bright purple!

READINGS

GUIDE 1: ISSUES IN MODELING
GUIDE 2: TERMINOLOGY
GUIDE 3: THE LOWLY 2 X 2 TABLE
GUIDE 4: BASICS ON FITTING MODELS
GUIDE 5: SOME REVIEW, EXTENSIONS, LOGITS
GUIDE 6: LOGLINEAR & LOGIT MODELS
GUIDE 7: LOG-ODDS AND MEASURES OF FIT
GUIDE 8: LOGITS, LAMBDAS & OTHER GENERAL THOUGHTS

OVERVIEW

EDF 6937-01 SPRING 2009
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 1 FEEDBACK
20 POINTS TOTAL
TERMINOLOGY: ODDS RATIOS AND DEGREES OF FREEDOM
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

IN GENERAL

TESTING THE MODELS

SPECIFIC ANSWERS

Please read over this material very carefully. Please remember that I do not answer questions about PERSONAL papers in class (including break). We can speak after class or in an appointment. Thanks!

The purpose of this exercise was to provide preliminary experience with three way cross-tabulation tables, loglinear terminology, and interpreting odds ratios at various levels. If you are comfortable with odds ratios (and their logarithms) you will be much more likely to correctly interpret analyses from binomial or multinomial logistic regression or ordinal regression. The beta coefficients in these analyses raise and lower the log odds (i.e., the logits) on your dependent variable. As you can see, these effects are additive and linear only in the logits; they are neither additive nor linear in the odds or the percentages.
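To make the point about betas and log odds concrete, here is a minimal sketch. The coefficient of 0.5 and the three starting probabilities are hypothetical values chosen only for illustration; they do not come from the exercise data. The same shift in the logit produces a different change in the percentage depending on where you start.

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Convert a logit (log odds) back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

BETA = 0.5  # hypothetical coefficient, for illustration only

def shift_probability(p, beta=BETA):
    """Probability that results from adding beta to the logit of p."""
    return inv_logit(logit(p) + beta)

for p in (0.10, 0.50, 0.90):
    new_p = shift_probability(p)
    change = 100 * (new_p - p)
    print(f"start {p:.2f} -> {new_p:.3f} (change of {change:.1f} percentage points)")
```

The beta is identical in all three rows; only the percentage change differs, which is why a logistic coefficient cannot be read as a percentage.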

The odds ratio, like the percentage, also allows you to compare two groups who have different base ns.

You WILL NOT need to compute odds ratios on any other exercise! (Although they might be appropriate in your analytic paper depending on what you do.)

We are starting out "easy" because each of our three variables has only two categories or values. The data (n = 3391) are aggregated from two separate NSF Surveys of Public Understanding of Science and Technology. This table uses the following three variables:

GENDER: 1 = Male 2 = Female For purposes of this assignment, consider the value "male" to be a "success" or "high".

EARTHDUM or PLANET: 1= Correct answer, the earth goes around the sun 0 = any other answer. For purposes of this assignment, consider the value "earth goes around the sun" to be a "success".*

YEAR: 2001 or 2006. For purposes of this assignment, consider the value "2006" to be a "success".

* To my sorrow, I switched the responses on the EARTHDUM or PLANET question. However, as in class, we will proceed as though I had labelled them the right way as they are shown below.

Our 2 X 2 X 2 table looks as follows:

                                   2001                        2006
PLANET QUESTION                    MALE    FEMALE    TOTAL     MALE    FEMALE    TOTAL
EARTH GOES AROUND SUN (CORRECT)     104       282      386      146       305      451
OTHER                               649       538     1187      652       715     1367
TOTAL                               753       820     1573      798      1020     1818

Percentage distributions are as follows (by year and gender):

                                   2001                            2006
PLANET QUESTION                    MALE     FEMALE    TOTAL (n)    MALE     FEMALE    TOTAL (n)
EARTH GOES AROUND SUN (CORRECT)    13.8%    34.4%     386          18.3%    29.9%     451
OTHER                              86.2%    65.6%     1187         81.7%    70.1%     1367
TOTAL                              100.0%   100.0%    1573         100.0%   100.0%    1818
(n)                                (753)    (820)                  (798)    (1020)

In three way, and sometimes even four way, cross-tabulation tables, calculating the percentages on your dependent variable (assuming you have one) within category combinations of your independent variables can be quite useful. If you do have a three way (or higher) interaction effect, you will get a better idea of the form that it takes. The percentages in this table, for example, quickly tell us that there was about a 20 percentage point sex difference on the PLANET question in 2001 (34.4% versus 13.8%), which was larger than the nearly 12 point sex difference in 2006 (29.9% versus 18.3%). Thus, there was a smaller sex difference in 2006 than in 2001. Is this difference of percentage differences among year, gender and planet statistically significant? MAYBE. So much depends on sample size. With nearly 3400 cases, this three way effect may be statistically different from zero, but the same percentages might not reach significance in, say, a sample of 1,000. I discuss the runs on this three way table below.
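The percentage comparison described here can be reproduced with a few lines of Python. This is only a sketch using the counts from the table; the names `counts`, `pct_correct` and `gap` are my own labels.

```python
# Observed counts: counts[year][gender] = (correct, other)
counts = {
    2001: {"male": (104, 649), "female": (282, 538)},
    2006: {"male": (146, 652), "female": (305, 715)},
}

def pct_correct(correct, other):
    """Percentage giving the correct answer within one year-by-gender group."""
    return 100.0 * correct / (correct + other)

gap = {}
for year, by_gender in counts.items():
    male = pct_correct(*by_gender["male"])
    female = pct_correct(*by_gender["female"])
    gap[year] = female - male  # sex difference in percentage points
    print(f"{year}: male {male:.1f}%, female {female:.1f}%, gap {gap[year]:.1f} points")

# Difference of the differences: how much the sex gap shrank between years
print(f"change in gap: {gap[2001] - gap[2006]:.1f} points")
```

The last line is the "difference of percentage differences"; whether it is beyond sampling error is a separate question that the percentages alone cannot answer.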

Everyone did very well. I do want to point out a couple of common mistakes.

COMMON NOVICE MISTAKE: Switching the numerator and the denominator in the middle of an analysis. This can lead to some very strange odds ratios! If "MALE" = "2006" = "CORRECT ANSWER ON PLANET" = "success" = "high" you must stick with these values as the numerators throughout the entire analysis. You can't switch gears in the middle and decide "2001" will be the numerator. (Although you might do so in a separate subsequent analysis.) (If you consistently switched the numerator on any of these variables, your final third order log odds probably was the mirror image of mine, e.g., -0.54.)

SUGGESTION: In class and depending on my calculator, I rounded to two decimal places on the odds ratio. However, if possible I recommend taking decimal places out to THREE places (not 2) if your calculator will do so. This can make the difference for a second order odds. When converted to logarithms, that can become a considerable difference.
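As a rough illustration of this rounding suggestion, the sketch below recomputes the third order odds from question 9 three ways: rounding each first order odds to two places, to three places, and not rounding at all. The function name `third_order_odds` and the decision to round only the first order odds are assumptions made for the illustration.

```python
import math

# (correct, other) counts for each gender-by-year group, from the table above
cells = {
    ("male", 2006): (146, 652),
    ("male", 2001): (104, 649),
    ("female", 2006): (305, 715),
    ("female", 2001): (282, 538),
}

def third_order_odds(places=None):
    """Third order odds, optionally rounding each first order odds first."""
    odds = {}
    for key, (correct, other) in cells.items():
        value = correct / other
        odds[key] = value if places is None else round(value, places)
    male = odds[("male", 2006)] / odds[("male", 2001)]        # second order, men
    female = odds[("female", 2006)] / odds[("female", 2001)]  # second order, women
    return male / female

for places in (2, 3, None):
    ratio = third_order_odds(places)
    label = "unrounded" if places is None else f"{places} places"
    print(f"{label}: odds = {ratio:.3f}, ln = {math.log(ratio):.3f}")
```

Three places lands much closer to the unrounded value than two places does, and the gap carries over to the logarithm.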

VERY COMMON NOVICE MISTAKE: Interpreting odds-ratios or logged odds as percentages.

Logged odds can't be interpreted that way at all. At best the logged odds are raised or lowered by other (criterion) variables.

Percentages use the total for the subgroup as a base. For example, 24.5% of 2001 respondents (386/1573 X 100) and 31.9% of female respondents (587/1840 X 100) got the PLANET question "correct".

However, the conditional odds by gender and year on the science question are:

                                   Men                        Women
PLANET QUESTION                    2001     2006     ALL      2001     2006     ALL
Earth goes around sun (odds)       0.160    0.224    0.192    0.524    0.427    0.468

The conditional odds on the "PLANET" variable for men are larger in 2006 than in 2001, whereas the reverse is true for women. Over both years, the conditional odds for women (0.468) are more than twice as great as those for men (0.192): 0.468/0.192 = 2.44.

BUT notice that the percentages giving the correct answer do NOT stand in that same ratio: 31.9% of women versus 16.1% of men (250/1551 X 100), a ratio of about 1.98. A ratio of odds is not a ratio of percentages.
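The contrast between odds and percentages can be checked directly. The sketch below pools the two years and compares women to men both ways; the helper names `odds` and `percent` are my own, and the female-to-male direction is chosen only to match the comparison in the text.

```python
# (correct, other) counts by gender, pooled over 2001 and 2006
pooled = {
    "male": (104 + 146, 649 + 652),    # 250 correct, 1301 other
    "female": (282 + 305, 538 + 715),  # 587 correct, 1253 other
}

def odds(correct, other):
    """Odds of a correct answer: correct relative to other."""
    return correct / other

def percent(correct, other):
    """Percentage correct: correct relative to the whole subgroup."""
    return 100.0 * correct / (correct + other)

odds_ratio = odds(*pooled["female"]) / odds(*pooled["male"])
percent_ratio = percent(*pooled["female"]) / percent(*pooled["male"])

print(f"ratio of odds (women to men):        {odds_ratio:.2f}")
print(f"ratio of percentages (women to men): {percent_ratio:.2f}")
```

The two ratios use different bases (the "other" count versus the subgroup total), so they should not be expected to agree.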

How do we interpret the third order odds which is 1.718 (see far below for question 9)?

The effects of gender on answering this science question correctly or not were less pronounced in 2006 than in 2001. OR:

The 2nd order conditional odds on this science question (again see far below) for men were about 1.7 times as great as they were for women (1.400 versus 0.815).

It is very important to get used to these interpretations. It is a very common mistake to MISinterpret logistic regression coefficients as percentages too, when, instead, they aren't just odds-ratios but logits or logged odds-ratios.

This all assumes that the third order interaction {ABC} term was statistically significant, i.e., not zero. Was it?

TESTING MORE FORMALLY

Can we drop any of the terms? I tested out the model to see:

The loglinear equation (logarithmic and additive) for the saturated model using the type of equations from Guide 4 where:

P = Planet question
G = Gender and
Y = Year

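$$G_{ijk} = \theta + \lambda_i^{P} + \lambda_j^{G} + \lambda_k^{Y} + \lambda_{ij}^{PG} + \lambda_{ik}^{PY} + \lambda_{jk}^{GY} + \lambda_{ijk}^{PGY}$$

$$G_{ijk} = 5.850 - 0.602^{P} - 0.204^{G} + 0.088^{Y} - 0.228^{PG} + 0.016^{PY} - 0.002^{GY} + 0.068^{PGY}$$

(In the second line the superscripts label the effect each estimate belongs to; they are not exponents.)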

These are loglinear general cell frequency effects.
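For readers who want to see where estimates like these come from, here is a minimal sketch. It ASSUMES sum-to-zero (effect) coding with the "success" category coded +1 and no constant added to the cells; under that assumption each saturated-model parameter in a 2 x 2 x 2 table is just a signed average of the logged cell counts. The function name `effect` and the +1/-1 keys are my own, and a given program's output may differ slightly depending on its settings.

```python
import math

# Observed counts keyed by (planet, gender, year), with +1 = the "success" level:
# planet +1 = correct, gender +1 = male, year +1 = 2006
observed = {
    (+1, +1, -1): 104, (+1, -1, -1): 282,
    (-1, +1, -1): 649, (-1, -1, -1): 538,
    (+1, +1, +1): 146, (+1, -1, +1): 305,
    (-1, +1, +1): 652, (-1, -1, +1): 715,
}

FACTORS = "PGY"  # position of each factor inside the cell key

def effect(term):
    """Effect-coded parameter for a term such as "P", "PG" or "PGY".

    Each parameter is the average over the 8 cells of ln(count) multiplied
    by the product of the +1/-1 codes of the factors named in the term.
    The empty term "" gives the constant.
    """
    total = 0.0
    for cell, count in observed.items():
        sign = 1
        for factor in term:
            sign *= cell[FACTORS.index(factor)]
        total += sign * math.log(count)
    return total / len(observed)

for term in ("", "P", "G", "Y", "PG", "PY", "GY", "PGY"):
    print(f"{term or 'constant':8s} {effect(term):+.3f}")
```

Under these assumptions the values agree with the estimates in the equation to within rounding, and 8 times the PGY term equals the natural log of the third order odds (about 0.54).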

A model that eliminates the P*G*Y three way interaction effect has 1 degree of freedom, a G² of 10.046, a p-level of 0.0015 and a corresponding normal (z) score of 3.15. This model has 1 df because there is now 1 df freed up by the (2-1)(2-1)(2-1) =1 third order interaction effect. Interestingly, however, the z score for GENDER * YEAR is only -0.09 (the gender composition was approximately the same both years) and the PLANET * YEAR z score was only 0.75 (about the same percentage got the correct PLANET answer both years). However, the GENDER*PLANET z score was -10.61 indicating an overall sex difference on the PLANET question.

Put together, these results indicate the sex difference varied by whether the year was 2001 or 2006, i.e., a three-way PLANET * GENDER * YEAR interaction effect.
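The test of the model without the three way term can be sketched with iterative proportional fitting. This is a bare-bones illustration, not the SPSS procedure: the function name `fit_no_three_way` and the fixed number of sweeps are my own choices, and the result may differ from program output in the last decimals.

```python
import math
from itertools import product

# Observed counts indexed [planet][gender][year]:
# planet 0 = correct, 1 = other; gender 0 = male, 1 = female; year 0 = 2001, 1 = 2006
obs = [[[104, 146], [282, 305]],
       [[649, 652], [538, 715]]]

def fit_no_three_way(obs, sweeps=200):
    """Fit the model with all two-way terms but no three-way term."""
    fit = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
    r = range(2)
    for _ in range(sweeps):
        # Match the planet-by-gender margin (summing over year)
        for p, g in product(r, r):
            scale = sum(obs[p][g]) / sum(fit[p][g])
            for y in r:
                fit[p][g][y] *= scale
        # Match the planet-by-year margin (summing over gender)
        for p, y in product(r, r):
            scale = sum(obs[p][g][y] for g in r) / sum(fit[p][g][y] for g in r)
            for g in r:
                fit[p][g][y] *= scale
        # Match the gender-by-year margin (summing over planet)
        for g, y in product(r, r):
            scale = sum(obs[p][g][y] for p in r) / sum(fit[p][g][y] for p in r)
            for p in r:
                fit[p][g][y] *= scale
    return fit

fit = fit_no_three_way(obs)

# Likelihood-ratio chi-square: G2 = 2 * sum(observed * ln(observed / fitted))
g2 = 2.0 * sum(
    obs[p][g][y] * math.log(obs[p][g][y] / fit[p][g][y])
    for p, g, y in product(range(2), repeat=3)
)

# Upper-tail probability of a chi-square variable with 1 degree of freedom
p_value = math.erfc(math.sqrt(g2 / 2.0))
print(f"G2 = {g2:.3f} on 1 df, p = {p_value:.4f}")
```

A small p-level here means the model without the three way term does NOT fit, so the term has to stay in.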

The SPSS model selection program (which estimates hierarchical models ONLY) concluded that all parameters needed to be estimated for the model to fit (this was the saturated model). Thus, the final generating class on that hierarchical model was (PLANET * GENDER * YEAR).

The SPSS model selection program uses the .05 Type I error criterion as a cutoff.

Typically we are fussier than that. The convention among most loglinear analysts is to say that a model "fits well" only when its p-level exceeds 0.20. By either criterion, the three-way or third order interaction model is what is needed to fit the data.

SPECIFIC ANSWERS TO SPECIFIC QUESTIONS

USE THE TABLE ABOVE TO ANSWER THE FOLLOWING QUESTIONS.
PLEASE BE SURE TO ANSWER ALL DESIGNATED PARTS OF A QUESTION.

1. (1 point) FOR THE TOTAL SAMPLE, what is the odds on GENDER--MALE:FEMALE?

1551/1840 = 0.843

2. (1 point) FOR THE TOTAL SAMPLE, what is the odds on YEAR, 2006:2001?

1818/1573 = 1.156

3. (1 point) FOR THE TOTAL SAMPLE, what is the odds on PLANET--CORRECT:OTHER?

837/2554 = 0.328
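The three total-sample odds in questions 1 to 3 all come from collapsing the three way table over the other two variables. A minimal sketch, with `table` and `marginal_odds` as my own names:

```python
# Counts keyed by (planet, gender, year)
table = {
    ("correct", "male", 2001): 104, ("correct", "female", 2001): 282,
    ("other", "male", 2001): 649, ("other", "female", 2001): 538,
    ("correct", "male", 2006): 146, ("correct", "female", 2006): 305,
    ("other", "male", 2006): 652, ("other", "female", 2006): 715,
}

def marginal_odds(position, success, failure):
    """Odds of success to failure on one variable, summing over the other two."""
    top = sum(n for key, n in table.items() if key[position] == success)
    bottom = sum(n for key, n in table.items() if key[position] == failure)
    return top / bottom

gender_odds = marginal_odds(1, "male", "female")     # 1551 / 1840
year_odds = marginal_odds(2, 2006, 2001)             # 1818 / 1573
planet_odds = marginal_odds(0, "correct", "other")   # 837 / 2554

print(f"GENDER {gender_odds:.3f}  YEAR {year_odds:.3f}  PLANET {planet_odds:.3f}")
```

Note that the "success" category is always passed as the numerator, which is the consistency the feedback above asks for.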

4. (1 point) What is the FIRST ORDER CONDITIONAL ODDS for each category of GENDER for "PLANET"--CORRECT:OTHER?

Males = 250/1301 = 0.192

Females = 587/1253 = 0.468

The second order odds for GENDER * PLANET is 0.192/0.468 = 0.410

5. (1 point) What is the FIRST ORDER CONDITIONAL ODDS for each category of YEAR for "PLANET"--CORRECT:OTHER?

2006 = 451/1367 = 0.330 (notice that 2006 forms the numerator because it was defined as a "success")

2001 = 386/1187 = 0.325

The second order odds for YEAR * PLANET is 0.330/0.325 = 1.015 (ln = 0.01)
The second order odds so close to 1, and its logarithm so close to 0, correspond to the very small z score and coefficient in the computer run, telling us there was essentially no change in the distribution on the planet question (NOT considering gender) by year.

6. (1 point) FOR THE TOTAL SAMPLE, what is the second order odds ratio for YEAR and PLANET?
(1 point) Do you think there is an association between year and answers to the Planet science question? Why or why not?

See Q5. 1.015 (ln = 0.01). PROBABLY NOT. The odds is very close to 1 and the logit is very close to 0.

7. (1 point) FOR THE TOTAL SAMPLE, what is the second order odds ratio for YEAR and GENDER?
(1 point) Do you think there is an association between gender and year? Why or why not?

First order conditionals:

2006 = 798/1020 = 0.782

2001 = 753/820 = 0.918

Second order odds = 0.782/0.918 = 0.852 ln= -0.16

MAYBE. There were relatively slightly more males in 2001 than in 2006, but the difference by year may not be large enough to matter (see the z score above plus the computer runs).
So much depends on sample size, because we are asking whether the odds departs from 1 and the logit from 0 BEYOND SAMPLING ERROR. Sampling errors are smaller with larger samples, thus smaller odds or logits are more likely to be statistically significant with larger samples.
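To see how sample size enters, here is a hedged sketch using the usual large-sample standard error for a logged odds ratio from a single 2 x 2 table (the square root of the sum of the reciprocal cell counts). The helper name `log_odds_ratio_z` and the quarter-size sample are hypothetical. This is a MARGINAL calculation that ignores PLANET, so its z need not match the GENDER * YEAR z score from the saturated three way model reported earlier.

```python
import math

def log_odds_ratio_z(a, b, c, d):
    """ln of the odds ratio (a/b)/(c/d), its large-sample standard error, and z."""
    log_or = math.log((a / b) / (c / d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se, log_or / se

# Marginal GENDER by YEAR counts: males, females in 2006, then males, females in 2001
full = (798, 1020, 753, 820)
log_or, se, z = log_odds_ratio_z(*full)

# The same proportions in a hypothetical sample one quarter the size
quarter = tuple(n / 4 for n in full)
_, se_small, z_small = log_odds_ratio_z(*quarter)

print(f"full sample:    ln(odds ratio) = {log_or:.3f}, se = {se:.3f}, z = {z:.2f}")
print(f"quarter sample: ln(odds ratio) = {log_or:.3f}, se = {se_small:.3f}, z = {z_small:.2f}")
```

The logged odds ratio is identical in both lines; only the standard error, and therefore z, changes with the number of cases.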

8. (1 point) FOR THE TOTAL SAMPLE, what is the second order odds ratio for GENDER and PLANET?
(1 point) Do you think there is an association between gender and answers to the "earth around the sun" science question? Why or why not?

First order conditionals (see question 4):

Males = 250/1301 = 0.192

Females = 587/1253 = 0.468

The second order odds for GENDER * PLANET is 0.192/0.468 = 0.410

The second order odds suggests an association between gender and answers to the EARTHDUM question because the logit is so much smaller than zero.
Again, so much depends on sample size, because we are asking whether the odds departs from 1 and the logit from 0 BEYOND SAMPLING ERROR.

9. (1 point) FOR THE TOTAL SAMPLE, what is the third order odds ratio for GENDER, YEAR, and PLANET?
(2 points) How do you interpret this odds ratio IN WORDS? (HINT: Try calculating some percentages and see if that gives you clues about interpreting the results)?
(1 point) Do you think there is a statistical interaction among gender, year and the science question?
(1 point) Why or why not?

Here goes:

For males in 2006 (right to wrong on PLANET): 146/652 = 0.224
For males in 2001: 104/649 = 0.160

Second order conditional is 0.224/0.160 = 1.400

For females in 2006 (right to other on PLANET): 305/715 = 0.427
For females in 2001: 282/538 = 0.524

Second order conditional is 0.427/0.524 = 0.815

STOP HERE A MOMENT!

We can see here that the second order conditional for males is 1.400, larger than the corresponding 0.815 for females. Even if I had not calculated the percentage getting the science question correct for the gender-year combinations, this should suggest to you that the effect of year on the science question differs for men and women: the odds of a correct answer rose for men (1.400 is above 1) and fell for women (0.815 is below 1).

Now, to calculate the third order odds: 1.400/0.815 = 1.718 ln = 0.54

The odds greater than one and the positive logit suggest there may be a three-factor interaction. As you know from the loglinear analysis run (above), all the terms, including the three factor interaction must be present for this three way table to "fit." So, "officially" there is a three-factor interaction effect since the Chi-square probability is 0.0015 (but it's modest in size; see the comparison of male and female percentages above).

See above (near the table area) for an interpretation in words (in a few different ways).

This is an interesting example because the pattern of effects is non-hierarchical: there is a three-way interaction but no two-way interaction for planet by year or for gender by year.

10. (1 point) How many TOTAL degrees of freedom are there in this 2 X 2 X 2 table?

In a 2 x 2 x 2 table, we start with an 8 cell table and 8 degrees of freedom.

11. (1 point) If we fix the total case base (i.e., constrain the model case base to equal the observed case base), how many degrees of freedom ARE LEFT?

Fixing n means losing 1 degree of freedom, so this equiprobable model has 8 - 1 or 7 df.

12. Suppose you have a FULLY SATURATED model for GENDER X YEAR X EARTHSUN.

(1 point) Which parameters have you fixed in this case?

(I know this one was not obvious, but in the future it's good for practice to BE SURE TO LIST THEM ALL.)

The case base n (uses 1 df).
The three marginals for GENDER, YEAR and EARTHSUN. Each has (2-1) or 1 df, using 3 df in total.

The three two-way associations, (GENDER * YEAR), (GENDER * EARTHSUN) and (YEAR * EARTHSUN). Each has (2-1) X (2-1) or 1 df, using 3 df in total.

The third order interaction (GENDER * YEAR * EARTHSUN) with (2-1) x (2 -1) x (2-1) = 1 df.

(1 point) What is the degrees of freedom associated with this fully saturated model?

8 - 8 = 0

It's a good idea to get used to writing these parameters out because (A) you need them to write the equations and (B) you need to know what to specify if you use a non-hierarchical model.
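The bookkeeping for questions 10 to 12 can be written out mechanically. The sketch below lists every term of the saturated model with the df it uses; the function name `saturated_terms` is my own, and `math.prod` requires Python 3.8 or later.

```python
from itertools import combinations
from math import prod

# Number of categories for each variable
levels = {"GENDER": 2, "YEAR": 2, "EARTHSUN": 2}

def saturated_terms(levels):
    """List every term of the saturated model with the df it uses."""
    terms = [("n (case base)", 1)]
    names = list(levels)
    for order in range(1, len(names) + 1):
        for combo in combinations(names, order):
            # A term uses (categories - 1) df for each variable it involves
            df = prod(levels[name] - 1 for name in combo)
            terms.append((" * ".join(combo), df))
    return terms

terms = saturated_terms(levels)
total_cells = prod(levels.values())
used = sum(df for _, df in terms)
residual_df = total_cells - used

for name, df in terms:
    print(f"{name:28s} {df} df")
print(f"cells = {total_cells}, used = {used}, left over = {residual_df}")
```

Changing the numbers in `levels` (say, a three category variable) shows how the same rule extends beyond the 2 X 2 X 2 case.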


Susan Carol Losh
February 9 2009