THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 7: LOG-ODDS, MEASURES OF FIT, EXTENSIONS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
DOES IT FIT?
At this point we have covered all the basics and are at a "jumping off" spot for higher order applications. Here are terms and tools you should be familiar with:
General Cell Frequency Models (GCF)
Odds ratios
Logits (logged odds ratios)
Logit Models
Picking GCF vs. Logit models
Causal diagrams
ANOVA vs. regression presentation
"Main Effects"
Associations
Interaction Effects
Model testing
Determining "Best Models"
"Fixing" effects
Partitioning Chi-Square (G2) and dfs
Calculating degrees of freedom
Lambda coefficients
Converting lambdas to betas
Interpreting beta coefficients
Writing loglinear equations
Writing logit equations
Beginning contrast effect choices
So...what's next? Different combinations, contrasts, and so on that create logistic regression models, ordinal regression, and more. Actually, you've already seen probit models in Agresti (he calls them logit models; for shame for the confusion).
The probit takes the form:

ln[p / (1 - p)]

where p is the probability of a success.
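As a minimal sketch (not part of the guide itself), the log-odds transformation and its inverse can be written in a few lines of Python:

```python
import math

def log_odds(p):
    """Convert a probability p (0 < p < 1) to log-odds: ln[p / (1 - p)]."""
    return math.log(p / (1 - p))

def probability(logit):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# A 50-50 probability corresponds to log-odds of exactly zero.
print(log_odds(0.5))      # 0.0
print(probability(0.0))   # 0.5
```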
The conclusions from a probit model are
comparable to those from a logit model.
They will probably be comparable
to those from an LPM (linear probability model) but remember
the LPM has a truncated range and its significance testing is more "generous"
to the researcher than other models because LPMs tend to underestimate
standard errors.
Also remember that the contrasts you choose can affect significance tests, because you are asking different questions (although the general conclusions with a dichotomous dependent variable should be comparable). Contrasting the logits of one category with a second is not the same as contrasting that category with a dummy omitted term, just as results from "contrast coding" in regression differ from the results of "dummy variable" coding in regression.
The biggest recurring problem in using
loglinear and logistic models is interpreting the results.
The temptation is always present to backslide
into probabilistic and relative risk interpretations of the findings.
Even Alan Agresti, the biggest USA expert,
sometimes succumbs to temptation.
If you want a more probabilistic interpretation:
You can do PROBIT models, which are basically ln[p / (1 - p)] estimates approximated by the probit solution in your logit comparisons. There's an SPSS probit regression program too, so you don't need to do any mental gymnastics with the other programs.
You can also do cumulative probability models, which are estimated by the SPSS ordinal regression package. (In the multinomial regression package you can specify the form of the contrasts you want--at least for some of the contrasts.)
But if you have at least one designated dependent variable, and are doing some sort of logit contrast, then you interpret your results either as "log-odds" (the direct results you receive on the logit output or on the binary or multinomial logistic regression output) or as odds ratios (you will need to exponentiate the betas to turn them back into odds ratios, but that doesn't take very long with an inexpensive calculator).
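Turning a beta (log-odds) back into an odds ratio is a one-line exponentiation; a minimal Python sketch (the coefficient value here is just an illustration):

```python
import math

beta = -0.354                # an illustrative log-odds coefficient
odds_ratio = math.exp(beta)  # exponentiate to recover the odds ratio
print(round(odds_ratio, 3))  # 0.702
```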
Important SPSS program defaults to remember: The SPSS logit program and the multinomial regression program will, by default, start with the first category according to the rank order of the numbers that correspond to each category (e.g., 0, 1, 2, 3...). If you have a 0-1 dummy variable, the first parameter you will see in the output is for the value "0". The multinomial program is better about labeling the output.
HOW DO YOU USE THESE BETAS TO DESCRIBE YOUR RESULTS?
Let's examine the results from Exercise 4 in an earlier year, which used a simple initial causal model: gender has direct and indirect causal effects on degree attainment (HS or less; Some College; BA; Grad School), the indirect effects operating through electing a college preparatory course (in this example, high school chemistry). High school chemistry has direct causal effects on degree level.
Depending on tastes and criteria, either your final model included a direct and an indirect (through hschem) effect for gender on degree level, or it dropped the direct effect of gender on degree level. There were no three-way (or higher, of course because there are only 3 variables) interaction effects in this model. Fortunately. For illustration, I will stay with the more complex model that includes the gender direct effects on both "hschem" and degree level.
The logit equation for high school chemistry (C) was:
C = 0.376 C - 0.354 G*C
The full logit equation for those with a high school degree or less was:
DL1 = 1.129 DL - 0.230 G*DL + 2.254 C*DL
The first value for gender (G) is
1 = male.
The first value for high school chemistry
is 0 = no chemistry.
The first value for degree level (DL)
is 1 = a high school degree or less.
Keep these values in mind so that you
interpret the DIRECTION of the results correctly.
The easiest approach is just to discuss the raising or lowering of the log-odds.
"Being male raised the odds of taking
high school chemistry."
Why? Because the G*C coefficient in the first equation is NEGATIVE. Scores of "1" on gender (the lower score, where female = 2) go with scores of "1" on chemistry (the higher score, where no chemistry = 0).
If these make you uneasy, use the SPSS
recode provision to make gender a dummy variable with 0 = female and 1
= male. The coefficient will turn positive and remain (in this example)
at .354.
In the second equation, being male lowered the odds of having a high school rather than an advanced degree. (Sound too awkward? Stating that being female raised the odds of having a high school rather than an advanced degree is equivalent.)
Avoiding high school chemistry (value of zero) raised the odds of stopping with a high school degree compared with obtaining an advanced degree.
Next in difficulty is discussing how much each variable raised or lowered the log-odds.
C = 0.376 C - 0.354 G*C
It's fine to say that being male compared to female lowered the log-odds of avoiding chemistry by about 1/3 or by .354.
DL1 = 1.129 DL - 0.230 G*DL + 2.254 C*DL
The log-odds of stopping with high school as opposed to going all the way to graduate school are lowered by about 1/4 or .23 for males compared with females. Avoiding high school chemistry more than doubled the log-odds of stopping with a high school degree compared with obtaining an advanced degree.
The most complex is to exponentiate and discuss the raising or lowering of the odds.
C = 0.376 C - 0.354 G*C
For the high school chemistry equation, the odds ratio of avoiding to taking chemistry (0:1) for male:female becomes approximately 0.702 with exponentiation. Remember that negative log-odds become FRACTIONAL effects when exponentiated.
Males are about 70 percent as likely as
females to avoid high school chemistry as compared with taking it.
Don't like that explanation? Flip it! Calculate (1/0.702) = 1.424. Compared with women, men are about 42 percent more likely to take chemistry as to avoid it. SEE THE BOX BELOW!
DL1 = 1.129 DL - 0.230 G*DL + 2.254 C*DL
Exponentiated, the gender effect becomes 0.795 and the chemistry effect becomes 9.525.
Compared with women, men are about 80 percent as likely to stop with a high school degree as to go on to graduate school.
Compared with adults who took high school
chemistry, adults who avoided it were over NINE times as likely to stop
with a high school degree as to go on to graduate school. And we know from
all these coefficients and other output that taking chemistry has much
more impact on the degree level you attain than gender.
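These exponentiations can be checked with a few lines of Python; the coefficient values are the ones quoted in the two logit equations above:

```python
import math

# Coefficients from the worked example in this guide
gender_chem = -0.354    # G*C in the chemistry equation
gender_degree = -0.230  # G*DL in the degree-level equation
chem_degree = 2.254     # C*DL in the degree-level equation

print(round(math.exp(gender_chem), 3))    # 0.702
print(round(math.exp(gender_degree), 3))  # 0.795
# The guide's 9.525 reflects rounding in the published coefficient:
print(round(math.exp(chem_degree), 3))    # 9.526
print(round(1 / math.exp(gender_chem), 2))  # 1.42, the "flipped" ratio
```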
If you are nervous, begin by discussing the raising and lowering of log-odds on your dependent variable and then gradually try the more precise numeric statements.
There are a lot of ways to assess the fit of a particular loglinear, logit, or logistic regression model.
The G2 likelihood ratio statistic is a favorite as are other large sample approximations to Chi-Square. Agresti, for example, discusses several approximations to Chi-Square in his Loglinear inference chapter. Keep in mind that fitting the corresponding loglinear model is in the background, even if your analysis is about multinomial logistic regression or logits.
However, the sample size has an important impact on the use of the G2 likelihood ratio statistic. As Agresti points out, we run the risk of not capturing the complexities in the data with a small case base, because it is unlikely that third or higher order interaction effects will be statistically significant.
The reverse is STRONGLY true with samples of several thousand--or greater. What will stick in my memory forever was the Indonesian student who came to me for advice because she was trying to simplify her results and she was unable to drop any terms in her model without producing a G2 in the MILLIONS. It turned out that her case base was the entire population of Djakarta, which at the time was about six million people.
The model G2 is typically the difference between the model that you believe provides the best fit to the observed results and the saturated model with its G2 of 0. It is sometimes called the deviance statistic because it assesses the overall deviance between the expected and observed frequencies in the multidimensional data table. The SPSS multinomial logistic regression package calls the G2 the deviance statistic.
With larger case bases (all other things equal), the deviance in each cell (or "the residual") between the observed and expected frequency takes on more importance, and the standard errors for each lambda parameter become smaller and smaller. Just as you tend to get a very large F-ratio and a small probability level in multiple regression with large samples--even if your R2 is quite small--so you may find that only the saturated model will "fit" if your case base in the n-dimensional table is quite large. And this leads to a model that is overly complex or "over-fitted".
As a result, with large samples, most analysts believe that parameters influenced by the very large or small n or case base are not terribly helpful in deciding upon a final model. Unfortunately, this includes the G2 itself, the standard errors for the parameters, and the probability levels that accompany them.
Thus we see the efforts to try to develop various R2 analogues for categorical data tables. These provide more information about EFFECT SIZES in the model (remember that Tallahassee Democrat test: If you wouldn't call the Tallahassee Democrat to publish it, it's probably not a very important effect.)
You should be aware that relatively few analysts believe that only tests which provide inference statistics (e.g., the G2) are valid for selecting a final categorical model.
To me (and most others) that makes about as much sense as only looking
at the tests of statistical significance for the overall multiple regression
model and the probability levels for each B, while ignoring the values
of R2, the multiple-partial R, the size of the beta weights
(which in numeric regression are NOT generally influenced by sample size)
and other descriptive statistics. But you need to know these folks are
out there and they can get pretty noisy about their beliefs.
So...enter a variety of measures of fit.
One of the simplest is literally to use an R2 measure, and this is one Agresti discusses. It is not exactly the R2 measure that you are used to in regression. That latter measure correlates the observed numeric score for a single case with the corresponding predicted score for that case using the regression equation to create the predicted score. High R2s mean that observed and predicted scores correlate closely.
For the loglinear situation, the R2 is between the predicted and the observed cell counts. So the number of observations to be correlated is equal to the total number of cells in the table. However, if the model is at all sensitive to trends in the data (as it should be to produce low deviance measures), then, of course the correlation between the observed and expected cells will tend to be quite high.
In the analysis discussed earlier in Guide 7, the G2 that corresponds to the equiprobable, grand mean only model appears in the MODEL SELECTION output under the first K-way table, "Tests that K-way and higher order effects are 0," in the line K = 1. For the Guide 7 example studying 4 degree levels by gender by high school chemistry, that was a G2 of 2558.077.
Most equiprobable models do not correspond well to observed results, so the G2 is generally quite high. This quantity functions in a manner akin to the "total sum of squares" in regression or analysis of variance.
Then, take the G2 for the model you believe has the best fit for your observed results. For example, in the exercise under discussion in this Guide, suppose the analyst believed that the best model for deglev4, gender and hschem was the hierarchical model incorporating all the two-way associations (and thus, of course, all marginals). The G2 for this model in the earlier 1999-2001 NSF data was 1.863 (this model had 3 df because it omitted the three-way interaction in the 4 X 2 X 2 table).
The R2 analogue is formed by subtracting the G2 for the "best model" from the grand mean only model for the numerator. The denominator is the G2 for the grand mean only model. In the example I discuss here, which incorporates all two way partials, this was:
(2558.077 - 1.863) / 2558.077 = 0.999
R2 analogues of at least 0.90 and preferably at least 0.95 are considered to indicate a good fit to the data in these kinds of analyses.
We interpret an R2 analogue
of 0.99 as saying the model in question (e.g., in
this case, one that eliminated all three way and higher effects) explains
the variation in the cells of the multi-dimensional table about 99 percent
as well as the saturated model does. The saturated model (chi square
of 0) is the implicit comparison here.
Some have argued that the grand mean model almost never fits the data, and thus can work for the denominator but not the numerator. These critics assert that the appropriate first term in the numerator should be the marginals only model. This is the model that fixes the univariate frequencies but sets all two-way partial associations, three-way interaction effects and higher order terms to zero. You can find this G2 in the first K-way table under K = 2 (i.e., two-way associations and all higher order terms are set to zero).
In the three variable model I discuss here of deglev4 by hschem by gender, the G2 for the marginals only model = 568.293. This model has 10 df because we start with 16 cells in the table (4 X 2 X 2), subtract 1 for fixing n, subtract 3 (i.e., 4 - 1) for fixing the degree level marginal, subtract 1 for fixing the gender (2 - 1) marginal, and a final df for fixing the high school chemistry (2 - 1) marginal, or 16 - 1 - 3 - 1 - 1 = 10 df for the marginals only model.
The adjusted R2 analogue for the all two-way partial associations model (where gender has direct and indirect effects on degree level) then becomes:
(568.293 - 1.863) / 2558.077 = 0.22
The idea is that if the marginals are roughly equiprobable (e.g., a 50-50 split on gender) and most of the grand mean only effects are due to two-way associations and higher order terms, then the R2 analogue will be large. On the other hand, if most of the large G2 on the grand mean only model is due to non-equiprobable marginal splits, then the R2 analogue will be relatively small, as it is in this example.
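Both R2 analogue calculations can be reproduced directly; a short Python sketch using the G2 values reported in this guide:

```python
# G2 values from the deglev4 by gender by hschem example in this guide
g2_grand_mean = 2558.077  # equiprobable (grand mean only) model
g2_marginals = 568.293    # marginals only model (the K = 2 line)
g2_best = 1.863           # model with all two-way associations

# R2 analogue: proportion of the grand-mean-only G2 removed by the best model
r2_analogue = (g2_grand_mean - g2_best) / g2_grand_mean
print(round(r2_analogue, 3))  # 0.999

# Adjusted R2 analogue: the numerator starts from the marginals only model
adj_r2_analogue = (g2_marginals - g2_best) / g2_grand_mean
print(round(adj_r2_analogue, 2))  # 0.22
```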
You will see a variation on the adjusted R2 analogue in the multinomial regression package, where you are shown the G2 for the intercept only and final models. The intercept only model (sometimes called the null model) fixes the marginals and the associations among the independent variables, but includes no effects of the independent variables on the dependent variable.
In the final model, the program calculates
the likelihood ratio statistic for your final model that does incorporate
(at least some of) the effects of the independent variables on the dependent
variable.
(A formula for the concentration measure appeared here as an image; in it, p is the probability of occurrence.)
Thus, roughly speaking, the concentration measure takes as its base the G2 omitting the marginals on the independent variables and the associations among the independent variables. That base is compared against the G2 from the model that incorporates not only the marginals and associations from and among the independent variables, but also the marginal on the dependent variable and the model associations between the dependent variable and the independent variables (including any interaction effects). For the example model I am using in this guide, that measure was 0.071--that is what incorporating the predictors of the dependent variable added to the model.
The entropy measure (you see it
in the logit model output) is a variation on the R2 analogue
and produces results that are usually comparable to the concentration measure.
"D" = (L2O - L2M) / L2O

where L2O is the chi-square for the null model that only includes the intercept term (basically, the equiprobable model) and L2M is the model chi-square for your best fitting model.
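The "D" formula is a one-line computation; a minimal sketch with hypothetical chi-square values:

```python
def entropy_d(l2_null, l2_model):
    """Entropy measure D: proportional reduction in the null (intercept only) model chi-square."""
    return (l2_null - l2_model) / l2_null

# Hypothetical values: null model L2 = 500, best fitting model L2 = 50
print(round(entropy_d(500.0, 50.0), 2))  # 0.9
```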
The Classification Table can give you some idea whether you are doing much better than chance at predicting the dependent variable.
The baseline cells for the Classification Table are the univariate category frequencies on the dependent variable without considering any influence for the independent variables.
The predicted cells are the values of the dependent variable that are predicted from the combined values of the independent variables. The cells on the diagonal of the table are the correctly predicted cells. If all the predicted cells (except for the modal category on the observed dependent variable) have zero counts in them, unfortunately, your model is not operating much better (if at all) than an intercept only model that considered only the marginal categories of the dependent variable would.
Overall, you can do (the program calculates)
a summary percentage telling you what percent of the cells were accurately
classified by your best model compared with the observed univariate marginals
for an intercept only model. This is basically summing the diagonal cells
and dividing that sum by the total casebase.
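That percent-correct summary amounts to summing the diagonal of the classification table and dividing by the total case base; a sketch using a hypothetical 3 x 3 table (rows = observed categories, columns = predicted categories):

```python
def percent_correct(table):
    """Percent of cases on the diagonal (correctly classified) of a square classification table."""
    diagonal = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return 100.0 * diagonal / total

# Hypothetical classification table
table = [[40, 5, 5],
         [10, 30, 10],
         [5, 5, 40]]
print(round(percent_correct(table), 1))  # 73.3
```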
In this section I have a variety of tips and cautions that don't easily fit elsewhere, or which serve as reminders for easy to forget variations.
On backwards versus forwards elimination for loglinear, logit and logistic regression models.
Backwards elimination typically starts
with the most complex model and eliminates terms. (The four-way interaction
in my Guide 7 initial example.)
Forward elimination starts with the simplest
model (e.g., intercept only) and adds terms.
Which is for you? That depends on the complexity of your data. SPSS MODEL SELECTION starts with backwards elimination because a relatively complex model is typically the more likely one.
On ordinal regression in SPSS. This section is for your reference if you decide to use the SPSS ordinal regression program.
The ordinal regression SPSS package allows you to use a dependent ordinal variable with a mix of categorical and numeric predictors. Because the dependent variable categories are NOT numbers, we need ways to get around this in a prediction equation. One type of ordinal regression allows you to estimate the cumulative probabilities that a case will fall in a particular ordered category. For example, if our dependent variable were degree level, we could ask: what's the probability (in a logit solution, the odds) that a person will have at least a high school degree, or at least a BA degree? This is apparently the type of regression in the SPSS program. The shorthand name for this procedure is "PLUM".
Numeric predictors (logit analysis or ordinal regression): One of your decisions in constructing an ordinal regression model, of course, is to select your predictors for the location component of the model. Covariates can be interval or ratio; the assumption is that they are numeric...but I still wouldn't use too many categories. The program is still constructing a table and if you have many values in your covariates you will receive warnings about empty cells. The program will even begin to collapse some of these into cells by itself so it can do estimates. So if YOU want to be in charge, condense the categories yourself and check the multivariate table for zero cells.
Adding a bit to each cell via delta (.5 is the usual) will also "smooth" out the empty cells.
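Adding delta can be sketched as follows (the observed table here is hypothetical): the constant goes into every cell, so empty cells no longer derail estimation:

```python
def add_delta(table, delta=0.5):
    """Add a small constant to every cell of a frequency table to smooth out empty cells."""
    return [[count + delta for count in row] for row in table]

observed = [[12, 0],
            [7, 3]]  # one empty cell
print(add_delta(observed))  # [[12.5, 0.5], [7.5, 3.5]]
```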
You need to select a link function. (Also see Agresti on link functions.) This is a transformation of the cumulative probabilities that allows you to estimate your model (see above). Five link functions are available in the ordinal regression procedure; I recommend the logit link function, which is comparable to what we have recently been studying. Because, remember, you will need to describe what is happening in your data when you are all done.
Agresti also talks more about them in the "big Agresti" (third edition).
The scale component is optional. Much of the time, you don't need a scale component. The "location only" model will provide a good summary of the data. SPSS says "In the interests of keeping things simple, it's usually best to start with a location-only model, and add a scale component only if there is evidence that the location-only model is inadequate for your data. Following this philosophy, you will begin with a location-only model."
"The scale component is an optional modification to the basic model to account for differences in variability for different values of the predictor variables. For example, if men have more variability than women in their account status values, using a scale component to account for this may improve your model. The model with a scale component follows the form shown in this equation"
When SPSS suggests keeping things simple, I nearly always believe them.
Basically the scale component is a correction for what we call "heteroscedasticity" in OLS regression. Heteroscedasticity is when the variability on your dependent variable is different depending on the values of your independent variable--or combinations of independent variables. For example, there is usually a larger standard deviation on weight for tall people than for short people. Because you typically have far fewer values and cruder measurement on your ordinal dependent variable, this is less likely to happen in ordinal regression than in Ordinary Least Squares regression.
Be careful about including variables in these programs (especially the multinomial logistic regression program) if you don't plan to use them in a particular analysis. In the multinomial program in particular, unused independent variables that are read in will still be considered in constructing the n-dimensional table, even if you don't specify a relationship between that variable and the dependent variable, leading to misleading parameters, inference statistics, and degrees of freedom. You may be surprised to see a variable that you placed into the multinomial regression directions, but did not put in the model design, pop up when you study the table of observed and expected frequencies.
Remember! If you have an overall causal model and want to test the entire model, including indirect effects, you will need to use the loglinear model (GENERAL or the equivalent in another set of programs such as SAS) to do so. If you simply want the G2, degrees of freedom, and probability level for the final model, the MODEL SELECTION program will work fine for model testing.
As the number of variables grows, the number of possible models grows too. The "aim of the game" is the simplest model with the smallest G2 and the largest degrees of freedom. But with a great many variables, it is possible to have comparable model statistics but quite different models.
How to choose? Here's where theory and a possible causal model can help you!
By the way, in writing up results, it
is very typical to put an initial causal model with many hypothesized paths
and then a "final" model that shows only the statistically significant
connections.
READINGS
This page created with Netscape Composer
Susan Carol Losh
April 2, 2017