THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 1 FEEDBACK
20 POINTS TOTAL
TERMINOLOGY: ODDS RATIOS AND DEGREES OF FREEDOM
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
Please read over this material very carefully. Please remember that I do not answer questions about PERSONAL papers in class (including break). We can speak after class or in an appointment. Thanks!
The purpose of this exercise was to provide
preliminary experience with three way cross-tabulation tables, loglinear
terminology, and interpreting odds ratios at various levels. If you are
comfortable with odds ratios (and their logarithms) you will be much more
likely to correctly interpret analysis from binomial or multinomial logistic
regression or ordinal regression. The beta coefficients in these analyses
raise and lower the log odds (i.e., the logits) on your dependent variable.
As you can see, odds and odds ratios are multiplicative rather than additive; it is their logarithms (the logits) that combine additively.
The odds ratio, like the percentage, also allows you to compare two groups who have different base ns.
You WILL NOT need to compute odds ratios on any other exercise! (Although they might be appropriate in your analytic paper depending on what you do.)
We are starting out "easy" because each of our three variables has only two categories or values. The data (n = 3391) are aggregated from two separate NSF Surveys of Public Understanding of Science and Technology. This table uses the following three variables:
GENDER: 1 = Male, 2 = Female. For purposes of this assignment, consider the value "male" to be a "success" or "high".
EARTHDUM or PLANET: 1 = Correct answer, the earth goes around the sun; 0 = any other answer. For purposes of this assignment, consider the value "earth goes around the sun" to be a "success".*
YEAR: 2001 or 2006. For purposes of this assignment, consider the value "2006" to be a "success".
* To my sorrow, I switched the responses on the EARTHDUM or PLANET question. However, as in class, we will proceed as though I had labelled them the right way as they are shown below.
Our 2 X 2 X 2 table looks as follows:

| PLANET QUESTION | 2001 Male | 2001 Female | 2001 Total | 2006 Male | 2006 Female | 2006 Total |
|---|---|---|---|---|---|---|
| EARTH GOES AROUND SUN (CORRECT) | 104 | 282 | 386 | 146 | 305 | 451 |
| OTHER | 649 | 538 | 1187 | 652 | 715 | 1367 |
| TOTAL | 753 | 820 | 1573 | 798 | 1020 | 1818 |
Percentage distributions are as follows (by year and gender):

| PLANET QUESTION | 2001 Male | 2001 Female | 2001 Total (n) | 2006 Male | 2006 Female | 2006 Total (n) |
|---|---|---|---|---|---|---|
| EARTH GOES AROUND SUN (CORRECT) | 13.8% | 34.4% | 386 | 18.3% | 29.9% | 451 |
| OTHER | 86.2 | 65.6 | 1187 | 81.7 | 70.1 | 1367 |
| TOTAL | 100.0% (753) | 100.0% (820) | 1573 | 100.0% (798) | 100.0% (1020) | 1818 |
In three way, and sometimes even four way cross tabulation tables, calculating the percentages on your dependent variable (assuming you have one) within category combinations of your independent variables can be quite useful. If you do have a three way (or higher) interaction effect, you will get a better idea of the form that it takes. The percentages in this table, for example, quickly tell us that there was about a 20 percentage point sex difference on the PLANET question in 2001 (34.4% versus 13.8%), which was larger than the nearly 12 percentage point sex difference on the PLANET question in 2006 (29.9% versus 18.3%). Thus, there was a smaller sex difference in 2006 than in 2001. Is this a statistically significant difference of percentage differences among year, gender and planet? MAYBE. So much depends on sample size. With nearly 3400 cases, this three way effect may be statistically different from zero, but the same pattern might not reach significance in, say, a sample of 1,000. I discuss the runs on this three way table below.
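As a quick check on the percentages discussed above, here is a minimal Python sketch that recomputes the percentage correct within each year-by-gender combination and the sex difference in each year. The cell counts come from the table; the variable and function names are my own, chosen only for illustration.

```python
# Cell counts from the 2 x 2 x 2 table, keyed by (year, gender).
# Each value is (correct, other) on the PLANET question.
counts = {
    (2001, "male"): (104, 649),
    (2001, "female"): (282, 538),
    (2006, "male"): (146, 652),
    (2006, "female"): (305, 715),
}


def percent_correct(correct, other):
    """Percentage correct, using the subgroup total as the base."""
    return 100.0 * correct / (correct + other)


pct = {group: percent_correct(*cells) for group, cells in counts.items()}

# Sex difference (female minus male) in percentage points within each year.
sex_gap = {
    year: pct[(year, "female")] - pct[(year, "male")] for year in (2001, 2006)
}

for (year, gender), value in pct.items():
    print(f"{year} {gender:<6} {value:5.1f}% correct")
for year, gap in sex_gap.items():
    print(f"{year} sex difference: {gap:4.1f} percentage points")
```

The two gaps are differences of percentages, so they are best read as percentage points rather than as percent changes.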
Everyone did very well. I do want to point out a couple of common mistakes.
COMMON NOVICE MISTAKE: Switching the numerator and the denominator in the middle of an analysis. This can lead to some very strange odds ratios! If "MALE" = "2006" = "CORRECT ANSWER ON PLANET" = "success" = "high" you must stick with these values as the numerators throughout the entire analysis. You can't switch gears in the middle and decide "2001" will be the numerator. (Although you might do so in a separate subsequent analysis.) (If you consistently switched the numerator on any of these variables, your final third order log odds probably was the mirror image of mine, e.g., -0.54.)
SUGGESTION: In class and depending on my calculator, I rounded to two decimal places on the odds ratio. However, if possible I recommend taking decimal places out to THREE places (not 2) if your calculator will do so. This can make the difference for a second order odds. When converted to logarithms, that can become a considerable difference.
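To see how much the rounding matters, the following Python sketch rounds each conditional odds to two or three decimal places before forming the third order odds, and compares the logged result with the full precision value. It is only an illustration using the cell counts from the table; the function name and the `places` argument are hypothetical.

```python
import math

# Cell counts from the table: (correct, other) on PLANET.
cells = {
    ("male", 2006): (146, 652),
    ("male", 2001): (104, 649),
    ("female", 2006): (305, 715),
    ("female", 2001): (282, 538),
}


def third_order_log_odds(places=None):
    """Log of the third order odds.

    Each conditional odds is rounded to `places` decimals first;
    places=None keeps full precision.
    """
    odds = {}
    for key, (correct, other) in cells.items():
        value = correct / other
        odds[key] = value if places is None else round(value, places)
    men = odds[("male", 2006)] / odds[("male", 2001)]
    women = odds[("female", 2006)] / odds[("female", 2001)]
    return math.log(men / women)


for places in (2, 3, None):
    label = "full precision" if places is None else f"{places} decimals"
    print(f"{label:>14}: ln = {third_order_log_odds(places):.3f}")
```

With two decimals the logged third order odds comes out near 0.51, while three decimals and full precision both give about 0.54.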
VERY COMMON NOVICE MISTAKE: Interpreting odds-ratios or logged odds as percentages.
Logged odds can't be interpreted that way at all. At best the logged odds are raised or lowered by other (criterion) variables.
Percentages use the total for the subgroup as a base. For example, 24.5% of 2001 respondents (386/1573 X 100) and 31.9% of female respondents (587/1840 X 100) got the PLANET question "correct".
However, the conditional odds by gender and year on the science question are:

| | Men 2001 | Men 2006 | Men Total | Women 2001 | Women 2006 | Women Total |
|---|---|---|---|---|---|---|
| Earth goes around sun | 0.160 | 0.224 | 0.192 | 0.524 | 0.427 | 0.468 |
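The conditional odds above can be reproduced with a few lines of Python. This is a sketch under the assumption that each entry is simply "correct" divided by "other" within the gender-by-year cell, with the "Total" entries collapsing over year; the names are illustrative only.

```python
# (gender, year) -> (correct, other) on the PLANET question.
counts = {
    ("men", 2001): (104, 649),
    ("men", 2006): (146, 652),
    ("women", 2001): (282, 538),
    ("women", 2006): (305, 715),
}

# Conditional odds within each gender-by-year cell.
cond_odds = {key: correct / other for key, (correct, other) in counts.items()}

# Conditional odds for each gender, collapsing over year.
total_odds = {}
for gender in ("men", "women"):
    correct = sum(counts[(gender, year)][0] for year in (2001, 2006))
    other = sum(counts[(gender, year)][1] for year in (2001, 2006))
    total_odds[gender] = correct / other

for gender in ("men", "women"):
    print(
        f"{gender:<6} 2001: {cond_odds[(gender, 2001)]:.3f}  "
        f"2006: {cond_odds[(gender, 2006)]:.3f}  "
        f"total: {total_odds[gender]:.3f}"
    )
```

Note that odds use the "other" count as the base, whereas percentages use the whole subgroup, which is why the two sets of numbers cannot be read the same way.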
The conditional odds on the "PLANET" variable for men are larger in 2006 than in 2001, whereas the reverse is true for women. Overall, the conditional odds for women are more than twice as great as they are for men.
BUT notice the percentages for female versus male respondents are NOWHERE near twice the percentage giving the correct answer.
How do we interpret the third order odds which is 1.718 (see far below for question 9)?
The effects of gender on answering this science question correctly versus not were less pronounced in 2006 than in 2001. OR:
The 2nd order conditional odds on this science question (again see far below) for men were almost twice as great as they were for women.
It is very important to get used to these interpretations. It is a very common mistake to MISinterpret logistic regression coefficients as percentages too, when they are in fact logits, that is, logged odds ratios.
This all assumes that the third order interaction {ABC} term was statistically significant, i.e., not zero. Was it?
Can we drop any of the terms? I tested out the model to see:
The loglinear equation (logarithmic and additive) for the saturated model using the type of equations from Guide 4 where:
P = Planet question
G = Gender and
Y = Year
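$$G_{ijk} = \theta + \lambda_i^{P} + \lambda_j^{G} + \lambda_k^{Y} + \lambda_{ij}^{PG} + \lambda_{ik}^{PY} + \lambda_{jk}^{GY} + \lambda_{ijk}^{PGY}$$
or
$$G_{ijk} = 5.850 - 0.602P - 0.204G + 0.088Y - 0.228PG + 0.016PY - 0.002GY + 0.068PGY$$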
These are loglinear general cell frequency effects.
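For a saturated model on a 2 X 2 X 2 table, these effects can be recovered directly from the logged cell counts. The Python sketch below assumes effect (deviation) coding with +1 for each "success" category (correct, male, 2006) and -1 otherwise, so that each effect is the average of the signed log counts. That coding is my assumption about how the coefficients were scaled, and the names are illustrative.

```python
import math

# Observed counts keyed by effect codes (P, G, Y).
# P: +1 = correct, G: +1 = male, Y: +1 = 2006; -1 = the other category.
counts = {
    (+1, +1, -1): 104, (+1, -1, -1): 282,
    (-1, +1, -1): 649, (-1, -1, -1): 538,
    (+1, +1, +1): 146, (+1, -1, +1): 305,
    (-1, +1, +1): 652, (-1, -1, +1): 715,
}

# Positions of the variables that enter each term (0 = P, 1 = G, 2 = Y).
terms = {
    "constant": (),
    "P": (0,), "G": (1,), "Y": (2,),
    "PG": (0, 1), "PY": (0, 2), "GY": (1, 2),
    "PGY": (0, 1, 2),
}


def effect(term):
    """Average of the log counts, each signed by the product of its codes."""
    total = 0.0
    for cell, n in counts.items():
        sign = math.prod(cell[i] for i in term)  # empty product is 1
        total += sign * math.log(n)
    return total / len(counts)


params = {name: effect(term) for name, term in terms.items()}

for name, value in params.items():
    print(f"{name:<8} {value:7.3f}")
```

Under this coding the values agree with the equation above to within about 0.001; small differences in the third decimal may reflect software details, such as a small constant added to each cell before estimation. Notice also that the PGY effect is the log of the third order odds divided by 8.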
A model that eliminates the P*G*Y three way interaction effect has 1 degree of freedom, a G2 of 10.046, a p-level of 0.0015 and a corresponding normal (z) score of 3.15. This model has 1 df because there is now 1 df freed up by the (2-1)(2-1)(2-1) =1 third order interaction effect. Interestingly, however, the z score for GENDER * YEAR is only -0.09 (the gender composition was approximately the same both years) and the PLANET * YEAR z score was only 0.75 (about the same percentage got the correct PLANET answer both years). However, the GENDER*PLANET z score was -10.61 indicating an overall sex difference on the PLANET question.
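The z scores quoted above can be approximated by hand. The sketch below assumes the same effect coding (+1 for correct, male and 2006) and uses a common large-sample approximation in which every effect in the saturated 2 X 2 X 2 model has standard error sqrt(sum of 1/n over the 8 cells) / 8. This is an approximation I am supplying for illustration, not necessarily the exact routine the program ran.

```python
import math

# Observed counts keyed by effect codes (P, G, Y); +1 marks the "success".
counts = {
    (+1, +1, -1): 104, (+1, -1, -1): 282,
    (-1, +1, -1): 649, (-1, -1, -1): 538,
    (+1, +1, +1): 146, (+1, -1, +1): 305,
    (-1, +1, +1): 652, (-1, -1, +1): 715,
}

interactions = {"PG": (0, 1), "PY": (0, 2), "GY": (1, 2), "PGY": (0, 1, 2)}


def effect(term):
    """Effect-coded loglinear parameter: mean of the signed log counts."""
    signed = (math.prod(cell[i] for i in term) * math.log(n)
              for cell, n in counts.items())
    return sum(signed) / len(counts)


# Large-sample standard error shared by every effect in the saturated model.
se = math.sqrt(sum(1.0 / n for n in counts.values())) / len(counts)

z = {name: effect(term) / se for name, term in interactions.items()}

for name, value in z.items():
    print(f"{name:<4} z = {value:7.2f}")
```

The results land close to the reported values: roughly 3.14 for PGY, -10.6 for PG, 0.75 for PY and -0.1 for GY.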
Put together, these results indicate the sex difference varied by whether the year was 2001 or 2006, i.e., a three-way PLANET * GENDER * YEAR interaction effect.
The SPSS model selection program (which estimates hierarchical models ONLY) concluded that all parameters needed to be estimated for the model to fit (this was the saturated model). Thus, the final generating class on that hierarchical model was (PLANET * GENDER * YEAR).
The SPSS model selection program uses the .05 Type I error criterion as a cutoff.
Typically we are fussier than that. The
convention for most loglinear analysts is to treat a model with p >
0.20 as one that "fits well". By either criterion, the three-way or
third order interaction model is what is needed to fit the data.
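To make the fit test concrete, here is a minimal Python sketch that fits the model without the three-way term by iterative proportional fitting (IPF) and computes the likelihood ratio chi-square G2 with its 1 df p-value. IPF is a standard way to fit hierarchical loglinear models, but it is my choice for illustration and not necessarily what SPSS does internally; all names are hypothetical.

```python
import math

# Observed counts keyed by (planet, gender, year).
obs = {
    ("correct", "male", 2001): 104, ("correct", "female", 2001): 282,
    ("other", "male", 2001): 649, ("other", "female", 2001): 538,
    ("correct", "male", 2006): 146, ("correct", "female", 2006): 305,
    ("other", "male", 2006): 652, ("other", "female", 2006): 715,
}


def margin(table, axes):
    """Sum a table over every variable whose position is not in `axes`."""
    out = {}
    for cell, n in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + n
    return out


def fit_no_three_way(observed, sweeps=100):
    """IPF for the model with all two-way terms but no three-way term."""
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(sweeps):
        for axes in ((0, 1), (0, 2), (1, 2)):
            target = margin(observed, axes)
            current = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                fitted[cell] *= target[key] / current[key]
    return fitted


fitted = fit_no_three_way(obs)
g2 = 2.0 * sum(n * math.log(n / fitted[cell]) for cell, n in obs.items())
p_value = math.erfc(math.sqrt(g2 / 2.0))  # upper tail, chi-square with 1 df

print(f"G2 = {g2:.3f}, p = {p_value:.4f}")
print("fits at p > 0.05:", p_value > 0.05)
print("fits at p > 0.20:", p_value > 0.20)
```

If the sketch is right, G2 should come out close to the value reported above, and the model without the three-way term fails both the .05 and the 0.20 criteria.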
USE THE TABLE ABOVE TO ANSWER THE FOLLOWING
QUESTIONS.
PLEASE BE SURE TO ANSWER ALL DESIGNATED
PARTS OF A QUESTION.
1. (1 point) FOR THE TOTAL SAMPLE, what is the odds on GENDER--MALE:FEMALE?
1551/1840 = 0.843
2. (1 point) FOR THE TOTAL SAMPLE, what is the odds on YEAR, 2006:2001?
1818/1573 = 1.156
3. (1 point) FOR THE TOTAL SAMPLE, what is the odds on PLANET--CORRECT:OTHER?
837/2554 = 0.328
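The three marginal odds in questions 1 to 3 can be checked with a short Python sketch that collapses the full table over the other two variables. The counts are from the table; the helper name is illustrative.

```python
# Observed counts keyed by (planet, gender, year).
counts = {
    ("correct", "male", 2001): 104, ("correct", "female", 2001): 282,
    ("other", "male", 2001): 649, ("other", "female", 2001): 538,
    ("correct", "male", 2006): 146, ("correct", "female", 2006): 305,
    ("other", "male", 2006): 652, ("other", "female", 2006): 715,
}


def marginal_odds(position, success):
    """Odds of `success` versus everything else on one variable."""
    total = sum(counts.values())
    hits = sum(n for cell, n in counts.items() if cell[position] == success)
    return hits / (total - hits)


odds_gender = marginal_odds(1, "male")      # MALE:FEMALE
odds_year = marginal_odds(2, 2006)          # 2006:2001
odds_planet = marginal_odds(0, "correct")   # CORRECT:OTHER

print(f"GENDER {odds_gender:.3f}  YEAR {odds_year:.3f}  PLANET {odds_planet:.3f}")
```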
4. (1 point) What is the FIRST ORDER CONDITIONAL ODDS for each category of GENDER for "PLANET"--CORRECT:OTHER?
Males = 250/1301 = 0.192
Females = 587/1253 = 0.468
The second order odds for GENDER * PLANET is 0.192/0.468 = 0.410
5. (1 point) What is the FIRST ORDER CONDITIONAL ODDS for each category of YEAR for "PLANET"--CORRECT:OTHER?
2006 = 451/1367 = 0.330 (notice that 2006 forms the numerator because it was defined as a "success")
2001 = 386/1187 = 0.325
The second order odds for YEAR *
PLANET is 0.330/0.325 = 1.015 (ln = 0.01)
The second order odds so close to 1, and its logarithm
so close to 0, correspond to the very small z score and
coefficient in the computer run, telling us there was essentially no change
in the distribution on the planet question (NOT considering gender) by
year.
6. (1 point) FOR THE TOTAL SAMPLE,
what is the second order odds ratio for YEAR and PLANET?
(1 point) Do you think there is an association
between year and answers to the Planet science question? Why or why not?
See Q5. 1.015 (ln = 0.01). PROBABLY NOT. The odds is very close to 1 and the logit is very close to 0.
7. (1 point) FOR THE TOTAL SAMPLE, what
is the second order odds ratio for YEAR and GENDER?
(1 point) Do you think there is an association
between gender and year? Why or why not?
First order conditionals:
2006 = 798/1020 = 0.782
2001 = 753/820 = 0.918
Second order odds = 0.782/0.918 = 0.852
ln= -0.16
8. (1 point) FOR THE TOTAL SAMPLE, what
is the second order odds ratio for GENDER and PLANET?
(1 point) Do you think there is an association
between gender and answers to the "earth around the sun" science question?
Why or why not?
First order conditionals (see question 4):
Males = 250/1301 = 0.192
Females = 587/1253 = 0.468
The second order odds for GENDER
* PLANET is 0.192/0.468 = 0.410
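The second order odds in questions 5, 7 and 8 all follow the same recipe: collapse the table over the third variable, take the first order conditional odds within each category of one variable, and divide the "success" category by the other. A Python sketch of that recipe follows; the dictionaries and function names are my own labels, not anything from the SPSS run.

```python
import math

# Observed counts keyed by (planet, gender, year).
counts = {
    ("correct", "male", 2001): 104, ("correct", "female", 2001): 282,
    ("other", "male", 2001): 649, ("other", "female", 2001): 538,
    ("correct", "male", 2006): 146, ("correct", "female", 2006): 305,
    ("other", "male", 2006): 652, ("other", "female", 2006): 715,
}

INDEX = {"planet": 0, "gender": 1, "year": 2}
SUCCESS = {"planet": "correct", "gender": "male", "year": 2006}
OTHER = {"planet": "other", "gender": "female", "year": 2001}


def conditional_odds(target, given, given_value):
    """Odds of success on `target` among cases where `given` == given_value."""
    hit = miss = 0
    for cell, n in counts.items():
        if cell[INDEX[given]] != given_value:
            continue
        if cell[INDEX[target]] == SUCCESS[target]:
            hit += n
        else:
            miss += n
    return hit / miss


def second_order_odds(target, given):
    """Conditional odds in the success category of `given` over the other."""
    return (conditional_odds(target, given, SUCCESS[given])
            / conditional_odds(target, given, OTHER[given]))


year_planet = second_order_odds("planet", "year")
year_gender = second_order_odds("gender", "year")
gender_planet = second_order_odds("planet", "gender")

for label, value in [("YEAR * PLANET", year_planet),
                     ("YEAR * GENDER", year_gender),
                     ("GENDER * PLANET", gender_planet)]:
    print(f"{label:<16} odds ratio = {value:.3f}  ln = {math.log(value):.2f}")
```

An odds ratio near 1 (log near 0) points to little or no association, while a value far from 1 in either direction points to a stronger one.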
9. (1 point) FOR THE TOTAL SAMPLE, what
is the third order odds ratio for GENDER, YEAR, and PLANET?
(2 points) How do you interpret this odds
ratio IN WORDS? (HINT: Try calculating some percentages and see if that
gives you clues about interpreting the results)?
(1 point) Do you think there is a statistical
interaction among gender, year and the science question?
(1 point) Why or why not?
Here goes:
For males in 2006 (right to wrong on
PLANET): 146/652 = 0.224
For males in 2001: 104/649 =
0.160
Second order conditional is 0.224/0.160 = 1.400
For females in 2006 (right to other
on PLANET): 305/715 = 0.427
For females in 2001: 282/538 = 0.524
Second order conditional is 0.427/0.524 = 0.815
STOP HERE A MOMENT!
We can see here that the second order conditional for males is 1.400, larger than the corresponding 0.815 for females. Even if I had not calculated the percentage getting the science question correct for the gender-year combinations, this should suggest to you that the effect of year on the science question will be stronger for men than women.
Now, to calculate the third order odds: 1.400/0.815 = 1.718 ln = 0.54
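The whole chain for question 9 can be sketched in Python as follows. It uses the cell counts from the table and carries full precision throughout, so the third order odds comes out at about 1.717 rather than the 1.718 obtained from three-decimal intermediate values. The last lines illustrate what happens if the year numerator is switched to 2001. All names are illustrative.

```python
import math

# (gender, year) -> (correct, other) on the PLANET question.
cells = {
    ("male", 2006): (146, 652),
    ("male", 2001): (104, 649),
    ("female", 2006): (305, 715),
    ("female", 2001): (282, 538),
}


def conditional_odds(gender, year):
    """Odds of correct versus other within one gender-by-year cell."""
    correct, other = cells[(gender, year)]
    return correct / other


def second_order(gender, numerator_year=2006, denominator_year=2001):
    """Ratio of conditional odds across years, within one gender."""
    return (conditional_odds(gender, numerator_year)
            / conditional_odds(gender, denominator_year))


men = second_order("male")
women = second_order("female")
third_order = men / women
log_third = math.log(third_order)

# Switching the year numerator to 2001 gives the mirror image.
flipped = second_order("male", 2001, 2006) / second_order("female", 2001, 2006)
log_flipped = math.log(flipped)

print(f"men {men:.3f}  women {women:.3f}")
print(f"third order odds {third_order:.3f}  ln = {log_third:.2f}")
print(f"numerator switched: ln = {log_flipped:.2f}")
```

The mirror image log odds is the reason to keep the same categories in the numerator throughout an analysis.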
The odds greater than one and the positive logit suggest there may be a three-factor interaction. As you know from the loglinear analysis run (above), all the terms, including the three factor interaction must be present for this three way table to "fit." So, "officially" there is a three-factor interaction effect since the Chi-square probability is 0.0015 (but it's modest in size; see the comparison of male and female percentages above).
See above (near the table area) for an interpretation in words (in a few different ways).
This is an interesting example because it is non-hierarchical: there is a three-way interaction but no two-way interactions for planet by year or gender by year.
10. (1 point) How many TOTAL degrees of freedom are there in this 2 X 2 X 2 table?
In a 2 x 2 x 2 table, we start with an 8 cell table and 8 degrees of freedom.
11. (1 point) If we fix the total case base (i.e., constrain the model case base to equal the observed case base), how many degrees of freedom ARE LEFT?
Fixing n means losing 1 degree of freedom, so this equiprobable model has 8 - 1 or 7 df.
12. Suppose you have a FULLY SATURATED model for GENDER X YEAR X EARTHSUN.
(1 point) Which parameters have you fixed in this case?
(I know this one was not obvious, but in the future it's good for practice to BE SURE TO LIST THEM ALL.)
The case base n (uses 1 df).
The three marginals for GENDER, YEAR
and EARTHSUN (2-1 or 1 df each; uses 3 df total).
The three two-way associations, (GENDER * YEAR)(GENDER * EARTHSUN) and (YEAR * EARTHSUN). Each has (2-1) X (2-1) df or 1 df each, uses total 3 df.
The third order interaction (GENDER * YEAR * EARTHSUN) with (2-1) x (2 -1) x (2-1) = 1 df.
(1 point) What is the degrees of freedom associated with this fully saturated model?
8 - 8 = 0
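The degrees of freedom bookkeeping in questions 10 to 12 can be written out as a small Python sketch: each term uses the product of (categories - 1) over the variables it involves, plus 1 df for the case base n. The variable names follow the question; everything else is illustrative.

```python
from itertools import combinations
from math import prod

# Number of categories for each variable in the table.
levels = {"GENDER": 2, "YEAR": 2, "EARTHSUN": 2}

total_cells = prod(levels.values())  # 2 x 2 x 2 = 8 cells

# df used by each fixed parameter in the fully saturated model.
params = {"n": 1}
for size in range(1, len(levels) + 1):
    for term in combinations(levels, size):
        params[" * ".join(term)] = prod(levels[v] - 1 for v in term)

used = sum(params.values())
df_saturated = total_cells - used

for name, df in params.items():
    print(f"{name:<28} {df} df")
print(f"cells = {total_cells}, used = {used}, df left = {df_saturated}")
```

Changing the entries in `levels` shows how quickly the count of parameters grows once a variable has more than two categories.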
It's a good idea
to get used to writing these parameters out because (A) you need them to
write the equations and (B) you need to know what to specify if you use
a non-hierarchical model.
Susan Carol Losh
February 9 2009