THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 7: LOG-ODDS, MEASURES OF FIT, EXTENSIONS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
DOES IT FIT?
At this point we have covered all the basics and are at a "jumping off" spot for higher order applications. Here are terms and tools you should be familiar with:
General Cell Frequency Models (GCF)
Odds ratios
Logits (logged odds ratios)
Logit Models
Picking GCF vs. Logit models
Causal diagrams
ANOVA vs. regression presentation
"Main Effects"
Associations
Interaction Effects
Model testing
Determining "Best Models"
"Fixing" effects
Partitioning Chi-Square (G2) and dfs
Calculating degrees of freedom
Lambda coefficients
Converting lambdas to betas
Interpreting beta coefficients
Writing loglinear equations
Writing logit equations
Beginning contrast effect choices
So...what's next? Different combinations, contrasts, and so on that create logistic regression models, ordinal regression, and more. Actually, you've already seen probit models in Agresti (he calls them logit models; for shame for the confusion).
The probit takes the form:

ln[p / (1 - p)]

where p is the probability of a success.
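As a minimal sketch (not part of the guide itself), the log-odds transformation and its inverse can be written in a few lines of Python:

```python
import math

def log_odds(p):
    """Convert a probability p (0 < p < 1) to log-odds: ln[p / (1 - p)]."""
    return math.log(p / (1 - p))

def probability(logit):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# A 50-50 probability corresponds to log-odds of exactly zero.
print(log_odds(0.5))      # 0.0
print(probability(0.0))   # 0.5
```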
The conclusions from a probit model are
comparable to those from a logit model.
They will probably be comparable
to those from an LPM (linear probability model) but remember
the LPM has a truncated range and its significance testing is more "generous"
to the researcher than other models because LPMs tend to underestimate
standard errors.
Also remember that the contrasts you choose can affect significance tests, because you are asking different questions (although the general conclusions with a dichotomous dependent variable should be comparable). Contrasting the logits of one category with a second is not the same as contrasting that category with a dummy omitted term, just as results from "contrast coding" in regression differ from the results of "dummy variable" coding in regression.
The biggest recurring problem in using
loglinear and logistic models is interpreting the results.
The temptation is always present to backslide
into probabilistic and relative risk interpretations of the findings.
Even Alan Agresti, the biggest USA expert,
sometimes succumbs to temptation.
If you want a more probabilistic interpretation:
You can do PROBIT models, which are basically ln[p / (1 - p)] estimates approximated by the probit solution in your logit comparisons. There's an SPSS probit regression program too, so you don't need to do any mental gymnastics with the other programs.
You can also do cumulative probability models, which are estimated by the SPSS ordinal regression package. (In the multinomial regression package you can specify the form of the contrasts you want--at least for some of the contrasts.)
But if you have at least one designated dependent variable, and are doing some sort of logit contrast, then you interpret your results either as "log-odds" (the direct results you receive on the logit output or on the binary or multinomial logistic regression output) or as odds ratios (you will need to exponentiate the betas to turn them back into odds ratios, but that doesn't take very long with an inexpensive calculator).
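Turning a beta (log-odds) back into an odds ratio is a one-line exponentiation; a minimal Python sketch (the coefficient value here is just an illustration):

```python
import math

beta = -0.354                # an illustrative log-odds coefficient
odds_ratio = math.exp(beta)  # exponentiate to recover the odds ratio
print(round(odds_ratio, 3))  # 0.702
```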
Important SPSS program defaults to remember: The SPSS logit program and the multinomial regression program will, by default, start with the first category according to the rank order of the numbers that correspond to each category (e.g., 0, 1, 2, 3...). If you have a 0-1 dummy variable, the first parameter you will see in the output is for the value "0". The multinomial program is better about labeling the output.
HOW DO YOU USE THESE BETAS TO DESCRIBE YOUR RESULTS?
Let's examine the results from Exercise 4 in an earlier year, which used a simple initial causal model: gender has direct and indirect causal effects on degree attainment (HS or less; Some College; BA; Grad School), the indirect effects operating through electing a college preparatory course (in this example, high school chemistry). High school chemistry has direct causal effects on degree level.
Depending on tastes and criteria, either your final model included a direct and an indirect (through hschem) effect for gender on degree level, or it dropped the direct effect of gender on degree level. There were no three-way (or higher, of course because there are only 3 variables) interaction effects in this model. Fortunately. For illustration, I will stay with the more complex model that includes the gender direct effects on both "hschem" and degree level.
The logit equation for high school chemistry (C) was:
C = 0.376 C - 0.354 G*C
The full logit equation for those with a high school degree or less was:
DL1 = 1.129 DL - 0.230 G*DL + 2.254 C*DL
The first value for gender (G) is
1 = male.
The first value for high school chemistry
is 0 = no chemistry.
The first value for degree level (DL)
is 1 = a high school degree or less.
Keep these values in mind so that you
interpret the DIRECTION of the results correctly.
The easiest approach is just to discuss the raising or lowering of the log-odds.
"Being male raised the odds of taking
high school chemistry."
Why? Because the G*C coefficient in the first equation is NEGATIVE. Scores of "1" on gender (the lower score, where female = 2) go with scores of "1" on chemistry (the higher score, where no chemistry = 0).
If these make you uneasy, use the SPSS
recode provision to make gender a dummy variable with 0 = female and 1
= male. The coefficient will turn positive and remain (in this example)
at .354.
In the second equation, being male lowered the odds of having a high school rather than an advanced degree. (Sound too awkward? Stating that being female raised the odds of having a high school rather than an advanced degree is equivalent.)
Avoiding high school chemistry (value of zero) raised the odds of stopping with a high school degree compared with obtaining an advanced degree.
Next in difficulty is discussing how much each variable raised or lowered the log-odds.
C = 0.376 C - 0.354 G*C
It's fine to say that being male compared to female lowered the log-odds of avoiding chemistry by about 1/3 or by .354.
DL1 = 1.129 DL - 0.230 G*DL + 2.254 C*DL
The log-odds of stopping with high school as opposed to going all the way to graduate school are lowered by about 1/4 or .23 for males compared with females. Avoiding high school chemistry more than doubled the log-odds of stopping with a high school degree compared with obtaining an advanced degree.
The most complex is to exponentiate and discuss the raising or lowering of the odds.
C = 0.376 C - 0.354 G*C
For the high school chemistry equation, the odds ratio of avoiding to taking chemistry (0:1) for male:female becomes approximately 0.702 with exponentiation. Remember that negative log-odds become FRACTIONAL effects when exponentiated.
Males are about 70 percent as likely as
females to avoid high school chemistry as compared with taking it.
Don't like that explanation? Flip it! Calculate (1/0.702) = 1.424. Compared with women, men are about 42 percent more likely to take chemistry as to avoid it. SEE THE BOX BELOW!
DL1 = 1.129 DL - 0.230 G*DL + 2.254 C*DL
Exponentiated, the gender effect becomes 0.795 and the chemistry effect becomes 9.525.
Compared with women, men are about 80 percent as likely to stop with a high school degree as to go on to graduate school.
Compared with adults who took high school
chemistry, adults who avoided it were over NINE times as likely to stop
with a high school degree as to go on to graduate school. And we know from
all these coefficients and other output that taking chemistry has much
more impact on the degree level you attain than gender.
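These exponentiations can be checked with a few lines of Python; the coefficient values are the ones quoted in the two logit equations above:

```python
import math

# Coefficients from the worked example in this guide
gender_chem = -0.354    # G*C in the chemistry equation
gender_degree = -0.230  # G*DL in the degree-level equation
chem_degree = 2.254     # C*DL in the degree-level equation

print(round(math.exp(gender_chem), 3))    # 0.702
print(round(math.exp(gender_degree), 3))  # 0.795
# The guide's 9.525 reflects rounding in the published coefficient:
print(round(math.exp(chem_degree), 3))    # 9.526
print(round(1 / math.exp(gender_chem), 2))  # 1.42, the "flipped" ratio
```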
If you are nervous, begin by discussing the raising and lowering of log-odds on your dependent variable and then gradually try the more precise numeric statements.
There are a lot of ways to assess the fit of a particular loglinear, logit, or logistic regression model.
The G2 likelihood ratio statistic is a favorite as are other large sample approximations to Chi-Square. Agresti, for example, discusses several approximations to Chi-Square in his Loglinear inference chapter. Keep in mind that fitting the corresponding loglinear model is in the background, even if your analysis is about multinomial logistic regression or logits.
However, the sample size has an important impact on the use of the G2 likelihood ratio statistic. As Agresti points out, we run the risk of not capturing the complexities in the data with a small case base, because it is unlikely that third or higher order interaction effects will be statistically significant.
The reverse is STRONGLY true with samples of several thousand--or greater. What will stick in my memory forever was the Indonesian student who came to me for advice because she was trying to simplify her results and she was unable to drop any terms in her model without producing a G2 in the MILLIONS. It turned out that her case base was the entire population of Djakarta, which at the time was about six million people.
The model G2 is typically the difference between the model that you believe provides the best fit to the observed results and the saturated model with its G2 of 0. It is sometimes called the deviance statistic because it assesses the overall deviance between the expected and observed frequencies in the multidimensional data table. The SPSS multinomial logistic regression package calls the G2 the deviance statistic.
With larger case bases (all other things equal), the deviance in each cell (or "the residual") between the observed and expected frequency takes on more importance, and the standard errors for each lambda parameter become smaller and smaller. Just as you tend to get a very large F-ratio and a small probability level in multiple regression with large samples--even if your R2 is quite small--so you may find that only the saturated model will "fit" if your case base in the n-dimensional table is quite large. And this leads to a model that is overly complex or "over-fitted".
As a result, with large samples, most analysts believe that parameters influenced by the very large or small n or case base are not terribly helpful in deciding upon a final model. Unfortunately, this includes the G2 itself, the standard errors for the parameters, and the probability levels that accompany them.
Thus we see the efforts to try to develop various R2 analogues for categorical data tables. These provide more information about EFFECT SIZES in the model (remember that Tallahassee Democrat test: If you wouldn't call the Tallahassee Democrat to publish it, it's probably not a very important effect.)
You should be aware that relatively few analysts believe that only tests which provide inference statistics (e.g., the G2) are valid for selecting a final categorical model.
To me (and most others) that makes about as much sense as only looking
at the tests of statistical significance for the overall multiple regression
model and the probability levels for each B, while ignoring the values
of R2, the multiple-partial R, the size of the beta weights
(which in numeric regression are NOT generally influenced by sample size)
and other descriptive statistics. But you need to know these folks are
out there and they can get pretty noisy about their beliefs.
So...enter a variety of measures of fit.
One of the simplest is literally to use an R2 measure, and this is one Agresti discusses. It is not exactly the R2 measure that you are used to in regression. That latter measure correlates the observed numeric score for a single case with the corresponding predicted score for that case using the regression equation to create the predicted score. High R2s mean that observed and predicted scores correlate closely.
For the loglinear situation, the R2 is between the predicted and the observed cell counts. So the number of observations to be correlated is equal to the total number of cells in the table. However, if the model is at all sensitive to trends in the data (as it should be to produce low deviance measures), then, of course the correlation between the observed and expected cells will tend to be quite high.
In the analysis discussed earlier in Guide 7, the G2 that corresponds to the equiprobable, grand mean only model appears in the MODEL SELECTION output under the first K-way table, "Tests that K-way and higher order effects are 0," in the line K = 1. For the Guide 7 example studying 4 degree levels by gender by high school chemistry, that was a G2 of 2558.077.
Most equiprobable models do not correspond well to observed results, so the G2 is generally quite high. This quantity functions in a manner akin to the "total sum of squares" in regression or analysis of variance.
Then, take the G2 for the model you believe has the best fit for your observed results. For example, in the exercise under discussion in this Guide, suppose the analyst believed that the best model for deglev4, gender and hschem was the hierarchical model incorporating all the two-way associations (and thus, of course, all marginals). The G2 for this model in the earlier 1999-2001 NSF data was 1.863 (this model had 3 df because it omitted the three-way interaction in the 4 X 2 X 2 table).
The R2 analogue is formed by subtracting the G2 for the "best model" from the grand mean only model for the numerator. The denominator is the G2 for the grand mean only model. In the example I discuss here, which incorporates all two way partials, this was:
(2558.077 - 1.863) / 2558.077 = 0.999
R2 analogues of at least 0.90 and preferably at least 0.95 are considered to indicate a good fit to the data in these kinds of analyses.
We interpret an R2 analogue
of 0.99 as saying the model in question (e.g., in
this case, one that eliminated all three way and higher effects) explains
the variation in the cells of the multi-dimensional table about 99 percent
as well as the saturated model does. The saturated model (chi square
of 0) is the implicit comparison here.
Some have argued that the grand mean model almost never fits the data, and thus can work for the denominator but not the numerator. These critics assert that the appropriate first term in the numerator should be the marginals only model. This is the model that fixes the univariate frequencies but sets all two-way partial associations, three-way interaction effects and higher order terms to zero. You can find this G2 in the first K-way table under K = 2 (i.e., two-way associations and all higher order terms are set to zero).
In the three variable model I discuss here of deglev4 by hschem by gender, the G2 for the marginals only model = 568.293. This model has 10 df because we start with 16 cells in the table (4 X 2 X 2), subtract 1 for fixing n, subtract 3 (i.e., 4 - 1) for fixing the degree level marginal, subtract 1 for fixing the gender (2 - 1) marginal, and a final df for fixing the high school chemistry (2 - 1) marginal, or 16 - 1 - 3 - 1 - 1 = 10 df for the marginals only model.
The adjusted R2 analogue for the all two-way partial associations model (where gender has direct and indirect effects on degree level) then becomes:
(568.293 - 1.863) / 2558.077 = 0.22
The idea is that if the marginals are roughly equiprobable (e.g., a 50-50 split on gender) and most of the grand mean only effects are due to two-way associations and higher order terms, then the R2 analogue will be large. On the other hand, if most of the large G2 on the grand mean only model is due to non-equiprobable marginal splits, then the R2 analogue will be relatively small, as it is in this example.
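Both R2 analogue calculations can be reproduced directly; a short Python sketch using the G2 values reported in this guide:

```python
# G2 values from the deglev4 by gender by hschem example in this guide
g2_grand_mean = 2558.077  # equiprobable (grand mean only) model
g2_marginals = 568.293    # marginals only model (the K = 2 line)
g2_best = 1.863           # model with all two-way associations

# R2 analogue: proportion of the grand-mean-only G2 removed by the best model
r2_analogue = (g2_grand_mean - g2_best) / g2_grand_mean
print(round(r2_analogue, 3))  # 0.999

# Adjusted R2 analogue: the numerator starts from the marginals only model
adj_r2_analogue = (g2_marginals - g2_best) / g2_grand_mean
print(round(adj_r2_analogue, 2))  # 0.22
```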
You will see a variation on the adjusted R2 analogue in the multinomial regression package, where you are shown the G2 for the intercept only and final models. The intercept only model (sometimes called the null model) fixes the marginals and the associations among the independent variables, but includes no effects of the independent variables on the dependent variable.
In the final model, the program calculates
the likelihood ratio statistic for your final model that does incorporate
(at least some of) the effects of the independent variables on the dependent
variable.
(A formula for the concentration measure appeared here as an image; in it, p is the probability of occurrence.)
Thus, roughly speaking, the concentration measure takes as its base the G2 omitting the marginals on the independent variables and the associations among the independent variables. That base is compared against the G2 from the model that incorporates not only the marginals and associations from and among the independent variables, but also the marginal on the dependent variable and the model associations between the dependent variable and the independent variables (including any interaction effects). For the example model I am using in this guide, that measure was 0.071--that is what incorporating the predictors of the dependent variable added to the model.
The entropy measure (you see it
in the logit model output) is a variation on the R2 analogue
and produces results that are usually comparable to the concentration measure.
"D" = (L2O - L2M) / L2O

where L2O is the chi-square for the null model that only includes the intercept term (basically, the equiprobable model) and L2M is the model chi-square for your best fitting model.
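The "D" formula is a one-line computation; a minimal sketch with hypothetical chi-square values:

```python
def entropy_d(l2_null, l2_model):
    """Entropy measure D: proportional reduction in the null (intercept only) model chi-square."""
    return (l2_null - l2_model) / l2_null

# Hypothetical values: null model L2 = 500, best fitting model L2 = 50
print(round(entropy_d(500.0, 50.0), 2))  # 0.9
```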
The Classification Table can give you some idea whether you are doing much better than chance at predicting the dependent variable.
The baseline cells for the Classification Table are the univariate category frequencies on the dependent variable without considering any influence for the independent variables.
The predicted cells are the values of the dependent variable that are predicted from the combined values of the independent variables. The cells on the diagonal of the table are the correctly predicted cells. If all the predicted cells (except for the modal category on the observed dependent variable) have zero counts in them, unfortunately, your model is not operating much better (if at all) than an intercept only model that considered only the marginal categories of the dependent variable would.
Overall, you can do (the program calculates)
a summary percentage telling you what percent of the cells were accurately
classified by your best model compared with the observed univariate marginals
for an intercept only model. This is basically summing the diagonal cells
and dividing that sum by the total casebase.
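That percent-correct summary amounts to summing the diagonal of the classification table and dividing by the total case base; a sketch using a hypothetical 3 x 3 table (rows = observed categories, columns = predicted categories):

```python
def percent_correct(table):
    """Percent of cases on the diagonal (correctly classified) of a square classification table."""
    diagonal = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return 100.0 * diagonal / total

# Hypothetical classification table
table = [[40, 5, 5],
         [10, 30, 10],
         [5, 5, 40]]
print(round(percent_correct(table), 1))  # 73.3
```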
In this section I have a variety of tips and cautions that don't easily fit elsewhere, or which serve as reminders for easy to forget variations.
On backwards versus forwards elimination for loglinear, logit and logistic regression models.
Backwards elimination typically starts
with the most complex model and eliminates terms. (The four-way interaction
in my Guide 7 initial example.)
Forward elimination starts with the simplest
model (e.g., intercept only) and adds terms.
Which is for you? That depends on the complexity of your data. SPSS MODEL SELECTION starts with backwards elimination because a relatively complex model is typically the more likely one.
On ordinal regression in SPSS. This section is for your reference if you decide to use the SPSS ordinal regression program.
The ordinal regression SPSS package allows you to use a dependent ordinal variable with a mix of categorical and numeric predictors. Because the dependent variable categories are NOT numbers, we need ways to get around this in a prediction equation. One type of ordinal regression allows you to estimate the cumulative probabilities that a case will fall in a particular ordered category. For example, if our dependent variable were degree level, we could ask: what's the probability (in a logit solution, the odds) that a person will have at least a high school degree, or at least a BA degree? This is apparently the type of regression in the SPSS program. The shorthand name for this procedure is "PLUM".
Numeric predictors (logit analysis or ordinal regression): One of your decisions in constructing an ordinal regression model, of course, is to select your predictors for the location component of the model. Covariates can be interval or ratio; the assumption is that they are numeric...but I still wouldn't use too many categories. The program is still constructing a table and if you have many values in your covariates you will receive warnings about empty cells. The program will even begin to collapse some of these into cells by itself so it can do estimates. So if YOU want to be in charge, condense the categories yourself and check the multivariate table for zero cells.
Adding a bit to each cell via delta (.5 is the usual) will also "smooth" out the empty cells.
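Adding delta can be sketched as follows (the observed table here is hypothetical): the constant goes into every cell, so empty cells no longer derail estimation:

```python
def add_delta(table, delta=0.5):
    """Add a small constant to every cell of a frequency table to smooth out empty cells."""
    return [[count + delta for count in row] for row in table]

observed = [[12, 0],
            [7, 3]]  # one empty cell
print(add_delta(observed))  # [[12.5, 0.5], [7.5, 3.5]]
```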
You need to select a link function. (Also see Agresti on link functions.) This is a transformation of the cumulative probabilities that allows you to estimate your model (see above). Five link functions are available in the ordinal regression procedure; I recommend the logit link function, which is comparable to what we have recently been studying. Because, remember, you will need to describe what is happening in your data when you are all done.
Agresti also talks more about them in the "big Agresti" (third edition).
The scale component is optional. Much of the time, you don't need a scale component. The "location only" model will provide a good summary of the data. SPSS says "In the interests of keeping things simple, it's usually best to start with a location-only model, and add a scale component only if there is evidence that the location-only model is inadequate for your data. Following this philosophy, you will begin with a location-only model."
"The scale component is an optional modification to the basic model to account for differences in variability for different values of the predictor variables. For example, if men have more variability than women in their account status values, using a scale component to account for this may improve your model. The model with a scale component follows the form shown in this equation"
When SPSS suggests keeping things simple, I nearly always believe them.
Basically the scale component is a correction for what we call "heteroscedasticity" in OLS regression. Heteroscedasticity is when the variability on your dependent variable is different depending on the values of your independent variable--or combinations of independent variables. For example, there is usually a larger standard deviation on weight for tall people than for short people. Because you typically have far fewer values and cruder measurement on your ordinal dependent variable, this is less likely to happen in ordinal regression than in Ordinary Least Squares regression.
Be careful about including variables in these programs (especially the multinomial logistic regression program) if you don't plan to use them in a particular analysis. In the multinomial program in particular, unused independent variables that are read in will still be considered in constructing the n-dimensional table, even if you don't specify a relationship between that variable and the dependent variable, leading to misleading parameters, inference statistics, and degrees of freedom. You may be surprised to see a variable that you placed into the multinomial regression directions, but did not put in the model design, pop up when you study the table of observed and expected frequencies.
Remember! If you have an overall causal model and want to test the entire model, including indirect effects, you will need to use the loglinear model (GENERAL or the equivalent in another set of programs such as SAS) to do so. If you simply want the G2, degrees of freedom, and probability level for the final model, the MODEL SELECTION program will work fine for model testing.
As the number of variables grows, the number of possible models grows too. The "aim of the game" is the simplest model with the smallest G2 and the largest degrees of freedom. But with a great many variables, it is possible to have comparable model statistics but quite different models.
How to choose? Here's where theory and a possible causal model can help you!
By the way, in writing up results, it
is very typical to put an initial causal model with many hypothesized paths
and then a "final" model that shows only the statistically significant
connections.
READINGS
This page created with Netscape Composer
Susan Carol Losh
April 2, 2017