UPDATED FOR SPSS 23
 
 
EDF 6937-04       SPRING 2017
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 8: ON LOGITS, LAMBDA, and OTHER "GENERAL" THOUGHTS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
UPDATES
GENERAL THOUGHTS
MULTINOMIAL EQUATIONS
ON CAUSALITY
INTERPRETATION

 

KEY TAKEAWAYS
  • I recommend using the multinomial logistic regression package, even for binomial logistic regression (which it will run with no problems)
    • Be careful about the order in which you enter terms
  • It's important to know what the positive and negative beta coefficients mean
    • Betas are logged measures, symmetric around 0
    • Because of that symmetry, betas are better to use than exponentiated measures
    • The "Exp(B)" values are multiplicative odds-ratios (which vary only between 0 and 1 when B is negative)
    • Pay attention to how the program codes the reference category!
  • The logic behind model testing is basically the same for loglinear and logistic regression
    • Eliminate the effect of interest and see what happens to the Chi-Square for the model
  • Be patient with yourself: you're learning a new language!
  • Review multinomial regression equations HERE
  • Causal diagrams look different for mediated, moderated, and spurious relationships
  • Remember to tell your reader why your results are important
 


 This section incorporates information from "trial runs," explanations, and class projects. Please read through carefully.

If you need a refresher course in why we are going to all this trouble to begin with, see Guide 1.

Meanwhile, please look further down this page to the MULTINOMIAL EQUATIONS section and study the two examples carefully.

The causal model in the example below is that gender influences degree level which in turn influences watching science television (these data are from the 2001 NSF Surveys because the SCITV questions were not asked after that).

Now, you could do this analysis all at one time in the GENERAL or the MODEL SELECTION SPSS program (and if I were doing this for a conference or publication, I would). But, for the sake of this example, we will do TWO logistic regression equations.

In the first example equation, we are predicting four levels of highest degree (DEGLEV4) from gender using multinomial regression. There are three (k - 1) equations, one for each of (by default) the first three degree levels, with graduate work as the suppressed reference category.

In the second equation, we use a binomial regression run through the multinomial regression program (more on this shortly) to predict watching science television (1 = yes and 0 = otherwise) from degree level and gender, because SCITV has only two categories. However, we will still have three equations, because one of our independent variables, DEGLEV4, has four categories and there are k - 1 = 3 equations. The coefficients for the constant (the SCITV marginal) and for gender will be the same in all three equations. However, there will be three, possibly different, coefficients for degree level.
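If you want to see the shape of this two-equation plan outside SPSS, here is a minimal Python sketch using the statsmodels package with made-up stand-in data (the NSF survey file is not reproduced here; variable names and codes follow the text above). Note that statsmodels takes the LOWEST category as the reference by default, the opposite of the SPSS multinomial default.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 500
    # Hypothetical stand-in data, coded as described in the text.
    df = pd.DataFrame({
        "gender":  rng.integers(1, 3, n),   # 1 = male, 2 = female
        "deglev4": rng.integers(1, 5, n),   # 1 = high school ... 4 = graduate work
        "scitv":   rng.integers(0, 2, n),   # 1 = watches science TV, 0 = otherwise
    })

    # Equation set 1: multinomial regression of degree level on gender.
    m1 = smf.mnlogit("deglev4 ~ C(gender)", data=df).fit(disp=False)
    print(m1.params.shape)   # coefficients for k - 1 = 3 equations

    # Equation set 2: binomial regression of science TV viewing on degree
    # level and gender, run as a logit model.
    m2 = smf.logit("scitv ~ C(deglev4) + C(gender)", data=df).fit(disp=False)
    print(m2.params)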


To obtain statistics for the multinomial logistic regression programs, historically SPSS first creates a multi-way table comparable to what you see in the GENERAL or MODEL SELECTION programs and calculates the lambdas. It then takes the logits using your chosen dependent variable and turns the lambdas into betas.
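Here is a small Python sketch of that bookkeeping for a hypothetical 2 x 2 table: compute the saturated model's effect-coded lambdas from the logged counts, then recover the logit for the dependent variable by subtraction. The grand mean and the independent-variable lambdas cancel, which is exactly why those terms never show up in the logit output.

    import numpy as np

    # Hypothetical counts: rows = gender (male, female),
    # columns = the dependent variable (yes, no).
    counts = np.array([[60.0, 40.0],
                       [30.0, 70.0]])
    logF = np.log(counts)

    mu = logF.mean()                   # grand mean of the logged counts
    lam_row = logF.mean(axis=1) - mu   # gender lambdas (sum to zero)
    lam_col = logF.mean(axis=0) - mu   # dependent-variable lambdas
    lam_int = logF - mu - lam_row[:, None] - lam_col[None, :]   # interaction

    # The logit (log odds of "yes") within each gender keeps only the
    # terms that contain the dependent variable:
    for i, g in enumerate(["male", "female"]):
        logit = (lam_col[0] - lam_col[1]) + (lam_int[i, 0] - lam_int[i, 1])
        print(g, round(logit, 3))   # matches log(60/40) and log(30/70)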

If there's any problem in the GENERAL program, unfortunately it appears this will translate into the logistic regression program estimates.

In the GENERAL program, you must enter terms from the simple to the more complex. This means when you create a custom model, you first enter the marginal terms, then the desired two-way association terms, then three-way terms, and so forth. Although your G²s will be fine regardless of order, very strange things can happen to your lambda/beta coefficients if you enter terms into the model in any other order.

Apparently, in the current SPSS version you must use the same simple-to-complex rule in entering your desired terms as you do for the latest version of the GENERAL program (however, terms that contain only the independent variables, needed for fit, will not be seen in the multinomial output).

The current SPSS binomial logistic regression program: I advise NOT using this program. Unlike the multinomial regression package, which produces betas comparable to GENERAL--logit, it is unclear whether the binomial program's coefficients correspond at all. Furthermore, there is no provision for including statistical moderator (interaction) terms between predictors and the dependent variable--unless you create those variables yourself in the binomial program, which can create other problems. The multinomial regression program provides the opportunity to create moderators in the model option (including a saturated model, if that's what you want to do!). Since you can do binomial logistic regression through the SPSS multinomial program, that's what I recommend you do when you want a binomial regression.
 


As you have learned, the concept of an odds-ratio, including a logged odds-ratio, is not very intuitive.
Furthermore, recall that the LOGGED odds-ratio is symmetric around zero and is negative for fractional odds-ratios. If you have a very small fractional odds-ratio, you will have a very large negative logit.

In contrast, the odds-ratio for a fraction is sandwiched between zero and one. You will get a much better picture of how your independent variables influence your dependent variable if you use the logit (logged odds) equation rather than exponentiating the betas, because the positive and negative betas will be symmetric in the logits.
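A quick numeric illustration of the symmetry point, in Python:

    import math

    # An odds-ratio of 4 and its reciprocal 1/4 are equally strong effects in
    # opposite directions, but only the logged versions look that way:
    for odds_ratio in (4.0, 0.25):
        print(odds_ratio, round(math.log(odds_ratio), 3))   # 1.386 and -1.386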

In binary logistic regression, try coding your dependent variable as 1 (high or a success) or 0 (low or everything else).
This is how the MODEL SELECTION and GENERAL programs code binary variables. (It's also the opposite of how the binary logistic regression program codes your independent variables.)

In the loglinear model, the programs give you the coefficients for the first value of a binary variable and suppress the second value.

Typically at the beginning of the logistic regression program output, it will tell you how it coded the variables. Hold on to that output and don't delete it when you turn your output in with your paper! Study that information carefully as you interpret your results.

Otherwise you will have difficulty interpreting your results and so will your readers.


Don't the model tests for loglinear and logistic regression work differently? They don't, really, although the results will LOOK different. If the G² increases sizably over the G² that included the terms of interest (e.g., all three-way interactions), we say the model "doesn't fit," and those terms must be included or "put back" to make the expected and observed counts in the multi-way table coincide within sampling error. Returning these terms to the model lowers the G² statistic, hence "small Chi-squares are good." They mean the model is a fairly close fit to the observed data.

The saturated model and the MODEL SELECTION program will give you a good idea of which terms are necessary to have a model that fits, i.e., produces a small G².

The saturated MODEL SELECTION program is a real data drudger. It will tell you everything about how your variables fit together, and that's why I recommend running it first.

And, by the way, if you don't include particular terms, the logistic regression programs, unlike MODEL SELECTION or GENERAL, won't tell you that you need them (another reason to run the saturated model first in MODEL SELECTION to be sure that you included everything important.) That's because of how the logistic regression programs proceed.

The logistic regression programs will first fit the following terms in an underlying loglinear model. You will NOT see these terms in logistic regression because they only address the independent variables and these terms drop out through subtraction as you move to a logit or logistic regression model (review Guide 6 on the algebraic transition from loglinear lambdas to logistic regression betas HERE). These terms are:

  1. All marginals for the INDEPENDENT variables
  2. All two-way associations between INDEPENDENT variables only
  3. All three-way and higher interactions among INDEPENDENT variables only
Next, the program OMITS the terms that you specified between the independent variables and the DEPENDENT variable. (In my first example below, that would be the two-way association between gender and degree level.) It recalculates the multi-way table of expected cell counts and compares this with the observed table that incorporates the terms that you specified.

Over the course of a rather messy calculation formula, you will have a new G² for your logistic regression model. This G² reflects the chi-square difference between the model omitting the terms you specified and the model that incorporates those terms. Similarly, the degrees of freedom for this model will reflect the difference between the model omitting the terms and the model that incorporates them. The df typically corresponds to the number of terms that you specified in your logistic regression model (including extra terms for a polytomous variable, i.e., incorporating the k - 1 values of your independent variables). If this new G² is statistically significant, it means the model does NOT fit without the terms you specified, and those terms must be returned to make the model "fit." This is considered the overall chi-square for your model.

Want to reproduce this G²? Go into the MODEL SELECTION program and run it with the terms included. Then leave those terms out and run the model through the MODEL SELECTION program again. The increase in G² should be roughly the same as the G² you found in the logistic regression package. Just be sure to include in the MODEL SELECTION runs the terms that only address the independent variables, to make the program specifications roughly the same as the logistic regression runs (see above).
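If you want to check the arithmetic itself, the G² formula is simple. Here is a Python sketch with hypothetical observed and expected counts (not NSF data):

    import numpy as np

    def g_squared(observed, expected):
        # Likelihood-ratio chi-square: G2 = 2 * sum[ obs * ln(obs / exp) ]
        obs = np.asarray(observed, dtype=float)
        exp = np.asarray(expected, dtype=float)
        return 2.0 * np.sum(obs * np.log(obs / exp))

    observed      = np.array([60, 40, 30, 70])   # hypothetical cell counts
    exp_with_term = np.array([60, 40, 30, 70])   # model keeping the term: fits exactly
    exp_without   = np.array([45, 55, 45, 55])   # model omitting the term

    # The increase is the overall model chi-square reported by the
    # logistic regression program:
    print(g_squared(observed, exp_without) - g_squared(observed, exp_with_term))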
 

GENERAL THOUGHTS AND LOGISTICS (no, NOT logits)

To date, everyone is at the A level overall on the exercise total.

Do you sometimes feel at sea? Not sure what to call those pesky coefficients? The why is simple. This is new material. Math is a language. You would not expect to take a semester of Spanish, Korean, Chinese, or English and speak like a native. You would feel a bit hesitant, perhaps worry a little about embarrassing yourself with a grammatical error. The individuals you speak with, on the other hand, are so delighted that you are trying to speak their language that they easily forgive you a slip of the tongue or two.

I found a few "sticky wickets" and enumerate them below.

When your output gives you lambdas, you must convert them to betas (subtraction) and then exponentiate back up to get an odds-ratio. When your output gives you betas, all you need to do is exponentiate (the right beta) back up to get the odds, because betas are logged odds-ratios. (The multinomial regression package gives you both.) *The same is not always true for SEM, by the way, especially if you have mixed models or nonrecursive models.
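As a worked sketch of that conversion, with a made-up lambda for a dichotomous variable (effect coding, so the second lambda is minus the first):

    import math

    lam = 0.35                 # hypothetical lambda for the first category
    beta = lam - (-lam)        # subtraction gives the logit coefficient: 0.70
    odds_ratio = math.exp(beta)
    print(beta, round(odds_ratio, 3))   # 0.7 2.014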

MULTINOMIAL EQUATIONS

On multiple categories for a variable. If a variable in your model has k categories (k of at least 2), you will need k - 1 equations. The kth equation is obtained by subtraction, with the requirement that the sum of the lambda (or beta) coefficients across rows, columns, and combinations of rows and columns must equal 0. Any equation involving DEGLEV4, for example, means a set of three equations to describe what is happening.

For example, to predict four levels of education (DEGLEV4) from gender, you need separate equations for the high school, two-year college, and BA levels. Here graduate school is the reference category. Recall: these data are for 2001 from the NSF Surveys of Public Understanding of Science and Technology.

logit(DL1) = 2.408 [constant: DL] + (-.411) [G*DL]
logit(DL2) = 0.821 [constant: DL] + (-.338) [G*DL]
logit(DL3) = 0.685 [constant: DL] + (-.206) [G*DL]

NOTE: The constant terms tell us that, as we go up the educational ladder, the marginal for degree level diminishes, and the negative gender coefficients tell us that females (the second gender category) are less likely to go on to obtain college degrees (the second, third, and fourth degree categories). From the twentieth century on, women have been more likely than men to graduate from high school, and in very recent years women have earned the majority of AA and BA degrees, but surveys of the adult US population obviously include individuals much older than those from recent generations.
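Assuming the sum-to-zero convention described above applies to these coefficients, the suppressed graduate-work coefficient can be recovered by subtraction, and each beta exponentiates to its multiplicative odds-ratio:

    import math

    g_by_dl = [-0.411, -0.338, -0.206]    # G*DL coefficients for DL1-DL3, from above
    print(round(-sum(g_by_dl), 3))        # suppressed DL4 coefficient: 0.955

    for b in g_by_dl:
        print(b, round(math.exp(b), 3))   # odds-ratios: 0.663, 0.713, 0.814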

Here's an equation for watching science television in 2001; again, three equations are needed to predict watching science TV (not available in the later data) from degree level (DL) and gender (G). The three-way interaction was not needed in this model, so the results use the modest all-two-ways model. Watching science TV is coded 1 = yes and 0 = no when I use the logistic regression programs. Gender is 1 for males and 2 for females, and degree level is coded 1 through 4.

Although the dependent variable (watch science TV) is dichotomous, I still need three equations, because my independent variable, degree level, has four categories.

logit(TV) = -1.133 [constant: TV] + .434 [TV*DL1] + (-.151) [TV*G]
logit(TV) = -1.133 [constant: TV] + .204 [TV*DL2] + (-.151) [TV*G]
logit(TV) = -1.133 [constant: TV] + (-.228) [TV*DL3] + (-.151) [TV*G]

The constant (-1.133) remains the same across the three equations because there is only one (k = 2 - 1) expressed value in the dependent variable, as does the coefficient for gender, which also has only two values (hence one coefficient).

Most people don't watch science TV but relatively more males and those with more education are more likely to watch.
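To put any of these logits on the probability scale, apply the inverse-logit transform. For instance, using the constant alone as a rough baseline (setting the degree and gender terms aside):

    import math

    def inv_logit(L):
        # Convert a logit (logged odds) back to a probability.
        return math.exp(L) / (1.0 + math.exp(L))

    print(round(inv_logit(-1.133), 3))           # about 0.244: most people don't watch
    print(round(inv_logit(-1.133 + 0.434), 3))   # about 0.332: the DL1 term raises it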

EXAMINING CAUSALITY IN THESE MODELS

The pattern of results and the smaller sex differences that occur on watching science television when we take differences in degree level into consideration are consistent with what we expect in a mediated relationship. The original independent-dependent variable relationship (gender -> TV) becomes smaller when a mediating variable is introduced into the analysis. This suggests that any causal effect that gender has on science television occurs in part because it is causally indirect through education. Something about gender (perhaps socialization variables) leads to differential degree levels. In turn, degree level has a greater direct effect on science TV than gender--i.e., it is a more causally proximate variable.

We know, on the other hand, that this is not a spurious relationship. A spurious relationship occurs when the introduced control variable is the "real" independent variable, causing both the original independent variable and the dependent variable, so that the original relationship attenuates or disappears once the control is introduced.

How do we know this isn't a spurious relationship? Because causally, such a spurious relationship would be diagrammed as you see below:
 

PROPOSED SPURIOUS RELATIONSHIP CAUSAL ORDER

                 /----------> GENDER
                /
DEGREE LEVEL
                \
                 \----------> WATCH SCIENCE TV

Unless you are willing to grant that your highest achieved level of education makes you male or female as well as contributing to watching science TV, this cannot be a spurious relationship.

Depending on tastes and criteria, either the final model includes both a direct and an indirect (through DEGLEV4) causal effect of gender on science TV, or the analyst drops the direct effect of gender on science TV. We "lucked out" in that there is no three-way or four-way interaction effect.

INTERPRETATION

Whether it is an exercise or a conference paper or an article, you must tell your reader what the results mean.

What did your final model look like? What terms did it include?
You can use loglinear abbreviations to describe your final model and equations to show the terms.

What's the causal status of your final model? Did you have direct causal effects, indirect causal effects with mediators, interaction effects with moderators? A causal diagram helps your reader understand the original causal model and the final results you obtained.

Put your results into words. For example: in a particular dataset, who was more likely to own a home computer, men or women (or was there no sex difference)? People with advanced degrees or with high school degrees? Did gender continue to affect owning a computer once education or time was controlled? (If not, educational level mediated the gender-homepc correlation, see diagram above for why this would not be a spurious relationship...)

Tell us why (again) we should care about the results.

Does this give us insight about who might sue McDonald's? About why multiple cluster logistic regression will add to our analytic tools? About who is more likely to smoke (so we can target anti-smoking ads, perhaps)? Who goes to alternative schools? How to tackle math anxiety? Who is more likely to check a patriotism question? Whatever you began with, the interpretation and discussion part of a paper is the place to return to your research problem and tell your reader what happened and what your suggestions are for future research.
 
 
Susan Carol Losh
April 24 2017