UPDATED FOR SPSS 23
 
 
EDF 6937-04       SPRING 2017
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 8: ON LOGITS, LAMBDA, and OTHER "GENERAL" THOUGHTS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
UPDATES
GENERAL THOUGHTS
MULTINOMIAL EQUATIONS
ON CAUSALITY
INTERPRETATION

 

KEY TAKEAWAYS
  • I recommend using the multinomial logistic regression package, even for binomial logistic regression (which it will run with no problems)
    • Be careful about the order in which you enter terms
  • It's important to know what the positive and negative beta coefficients mean
    • Betas are logged measures, symmetric around 0
    • Because of that symmetry, betas are better to use than exponentiated measures
    • The "Exp(B)" values are multiplicative odds-ratios (which vary only between 0 and 1 when B is negative)
    • Pay attention to how the program codes the reference category!
  • The logic behind model testing is basically the same for loglinear and logistic regression
    • Eliminate the effect of interest and see what happens to the Chi-Square for the model
  • Be patient with yourself: you're learning a new language!
  • Review multinomial regression equations HERE
  • Causal diagrams look different for mediated, moderated, and spurious relationships
  • Remember to tell your reader why your results are important
 


 This section incorporates information from "trial runs," explanations, and class projects. Please read through carefully.

If you need a refresher course in why we are going to all this trouble to begin with, see Guide 1.

Meanwhile, please look further down this page to the MULTINOMIAL EQUATIONS section and study the two examples carefully.

The causal model in the example below is that gender influences degree level which in turn influences watching science television (these data are from the 2001 NSF Surveys because the SCITV questions were not asked after that).

Now, you could do this analysis all at one time in the GENERAL or the MODEL SELECTION SPSS program (and if I were doing this for a conference or publication, I would). But, for the sake of this example, we will do TWO logistic regression equations.

In the first example equation, we are predicting four levels of highest degree (DEGLEV4) from gender using multinomial regression. There are three (k - 1) equations, one for each of (by default) the first three degree levels, with graduate work as the suppressed reference category.

In the second equation, we use a binomial regression run through the multinomial regression program (more on this shortly) to predict watching science television (1 = yes and 0 = otherwise) from degree level and gender, because SCITV has only two categories. However, we will still have three equations, because one of our independent variables, DEGLEV4, has four categories and there are k - 1 = 3 equations. The coefficients for the constant (the SCITV marginal) and for gender will be the same in all three equations. However, there will be three, possibly different, coefficients for degree level.
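If you want to see the shape of this two-equation plan outside SPSS, here is a minimal Python sketch using the statsmodels package with made-up stand-in data (the NSF survey file is not reproduced here; variable names and codes follow the text above). Note that statsmodels takes the LOWEST category as the reference by default, the opposite of the SPSS multinomial default.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 500
    # Hypothetical stand-in data, coded as described in the text.
    df = pd.DataFrame({
        "gender":  rng.integers(1, 3, n),   # 1 = male, 2 = female
        "deglev4": rng.integers(1, 5, n),   # 1 = high school ... 4 = graduate work
        "scitv":   rng.integers(0, 2, n),   # 1 = watches science TV, 0 = otherwise
    })

    # Equation set 1: multinomial regression of degree level on gender.
    m1 = smf.mnlogit("deglev4 ~ C(gender)", data=df).fit(disp=False)
    print(m1.params.shape)   # coefficients for k - 1 = 3 equations

    # Equation set 2: binomial regression of science TV viewing on degree
    # level and gender, run as a logit model.
    m2 = smf.logit("scitv ~ C(deglev4) + C(gender)", data=df).fit(disp=False)
    print(m2.params)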


To obtain statistics for the multinomial logistic regression programs, historically SPSS first creates a multi-way table comparable to what you see in the GENERAL or MODEL SELECTION programs and calculates the lambdas. It then takes the logits using your chosen dependent variable and turns the lambdas into betas.
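Here is a small Python sketch of that bookkeeping for a hypothetical 2 x 2 table: compute the saturated model's effect-coded lambdas from the logged counts, then recover the logit for the dependent variable by subtraction. The grand mean and the independent-variable lambdas cancel, which is exactly why those terms never show up in the logit output.

    import numpy as np

    # Hypothetical counts: rows = gender (male, female),
    # columns = the dependent variable (yes, no).
    counts = np.array([[60.0, 40.0],
                       [30.0, 70.0]])
    logF = np.log(counts)

    mu = logF.mean()                   # grand mean of the logged counts
    lam_row = logF.mean(axis=1) - mu   # gender lambdas (sum to zero)
    lam_col = logF.mean(axis=0) - mu   # dependent-variable lambdas
    lam_int = logF - mu - lam_row[:, None] - lam_col[None, :]   # interaction

    # The logit (log odds of "yes") within each gender keeps only the
    # terms that contain the dependent variable:
    for i, g in enumerate(["male", "female"]):
        logit = (lam_col[0] - lam_col[1]) + (lam_int[i, 0] - lam_int[i, 1])
        print(g, round(logit, 3))   # matches log(60/40) and log(30/70)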

If there's any problem in the GENERAL program, unfortunately it appears this will translate into the logistic regression program estimates.

In the GENERAL program, you must enter terms from the simple to the more complex. This means when you create a custom model, you first enter the marginal terms, then the desired two-way association terms, then three-way terms, and so forth. Although your G²s will be fine regardless of order, very strange things can happen to your lambda/beta coefficients if you enter terms into the model in any other order.

Apparently, in the current SPSS version you must use the same simple-to-complex rule in entering your desired terms as you do for the latest version of the GENERAL program (however, terms that contain only the independent variables, needed for fit, will not be seen in the multinomial output).

The current SPSS binomial logistic regression program: I advise NOT using this program. Unlike the multinomial regression package, which produces betas comparable to GENERAL--logit, it is unclear whether the binomial program's coefficients correspond at all. Furthermore, there is no provision for including statistical moderator (interaction) terms between predictors and the dependent variable--unless you create those variables yourself in the binomial program, which can create other problems. The multinomial regression program provides the opportunity to create moderators in the model option (including a saturated model, if that's what you want to do!). Since you can do binomial logistic regression through the SPSS multinomial program, that's what I recommend you do when you want a binomial regression.
 


As you have learned, the concept of an odds-ratio, including a logged odds-ratio, is not very intuitive.
Furthermore, recall that the LOGGED odds-ratio is symmetric around zero and is negative for fractional odds-ratios. If you have a very small fractional odds-ratio, you will have a very large negative logit.

In contrast, the odds-ratio for a fraction is sandwiched between zero and one. You will get a much better picture of how your independent variables influence your dependent variable if you use the logit (logged odds) equation rather than exponentiating the betas, because the positive and negative betas will be symmetric in the logits.
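A quick numeric illustration of the symmetry point, in Python:

    import math

    # An odds-ratio of 4 and its reciprocal 1/4 are equally strong effects in
    # opposite directions, but only the logged versions look that way:
    for odds_ratio in (4.0, 0.25):
        print(odds_ratio, round(math.log(odds_ratio), 3))   # 1.386 and -1.386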

In binary logistic regression, try coding your dependent variable as 1 (high or a success) or 0 (low or everything else).
This is how the MODEL SELECTION and GENERAL programs code binary variables. (It's also the opposite of how the binary logistic regression program codes your independent variables.)

In the loglinear model, the programs give you the coefficients for the first value of a binary variable and suppress the second value.

Typically at the beginning of the logistic regression program output, it will tell you how it coded the variables. Hold on to that output and don't delete it when you turn your output in with your paper! Study that information carefully as you interpret your results.

Otherwise you will have difficulty interpreting your results and so will your readers.


Don't the model tests for loglinear and logistic regression work differently? They don't, really, although the results will LOOK different. If the G² increases sizably over the G² that included the terms of interest (e.g., all three-way interactions), we say the model "doesn't fit," and those terms must be included or "put back" to make the expected and observed counts in the multi-way table coincide within sampling error. Returning these terms to the model lowers the G² statistic, hence "small Chi-squares are good." They mean the model is a fairly close fit to the observed data.

The saturated model and the MODEL SELECTION program will give you a good idea of which terms are necessary to have a model that fits, i.e., produces a small G².

The saturated MODEL SELECTION program is a real data drudger. It will tell you everything about how your variables fit together, and that's why I recommend running it first.

And, by the way, if you don't include particular terms, the logistic regression programs, unlike MODEL SELECTION or GENERAL, won't tell you that you need them (another reason to run the saturated model first in MODEL SELECTION to be sure that you included everything important.) That's because of how the logistic regression programs proceed.

The logistic regression programs will first fit the following terms in an underlying loglinear model. You will NOT see these terms in logistic regression because they only address the independent variables and these terms drop out through subtraction as you move to a logit or logistic regression model (review Guide 6 on the algebraic transition from loglinear lambdas to logistic regression betas HERE). These terms are:

  1. All marginals for the INDEPENDENT variables
  2. All two-way associations between INDEPENDENT variables only
  3. All three-way and higher interactions among INDEPENDENT variables only
Next, the program OMITS the terms that you specified between the independent variables and the DEPENDENT variable. (In my first example below, that would be the two-way association between gender and degree level.) It recalculates the multi-way table of expected cell counts and compares this with the observed table that incorporates the terms that you specified.

Over the course of a rather messy calculation formula, you will have a new G² for your logistic regression model. This G² reflects the chi-square difference between the model omitting the terms you specified and the model that incorporates those terms. Similarly, the degrees of freedom for this model will reflect the difference between the model omitting the terms and the model that incorporates them. The df typically corresponds to the number of terms that you specified in your logistic regression model (including extra terms for a polytomous variable, i.e., incorporating the k - 1 values of your independent variables). If this new G² is statistically significant, it means the model does NOT fit without the terms you specified, and those terms must be returned to make the model "fit." This is considered the overall chi-square for your model.

Want to reproduce this G²? Go into the MODEL SELECTION program and run it with the terms included. Then leave those terms out and run the model through the MODEL SELECTION program again. The increase in G² should be roughly the same as the G² you found in the logistic regression package. Just be sure to include in the MODEL SELECTION runs the terms that only address the independent variables, to make the program specifications roughly the same as the logistic regression runs (see above).
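If you want to check the arithmetic itself, the G² formula is simple. Here is a Python sketch with hypothetical observed and expected counts (not NSF data):

    import numpy as np

    def g_squared(observed, expected):
        # Likelihood-ratio chi-square: G2 = 2 * sum[ obs * ln(obs / exp) ]
        obs = np.asarray(observed, dtype=float)
        exp = np.asarray(expected, dtype=float)
        return 2.0 * np.sum(obs * np.log(obs / exp))

    observed      = np.array([60, 40, 30, 70])   # hypothetical cell counts
    exp_with_term = np.array([60, 40, 30, 70])   # model keeping the term: fits exactly
    exp_without   = np.array([45, 55, 45, 55])   # model omitting the term

    # The increase is the overall model chi-square reported by the
    # logistic regression program:
    print(g_squared(observed, exp_without) - g_squared(observed, exp_with_term))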
 

GENERAL THOUGHTS AND LOGISTICS (no, NOT logits)

To date, everyone is at the A level overall on the exercise total.

Do you sometimes feel at sea? Not sure what to call those pesky coefficients? The why is simple. This is new material. Math is a language. You would not expect to take a semester of Spanish, Korean, Chinese, or English and speak like a native. You would feel a bit hesitant, perhaps worry a little about embarrassing yourself with a grammatical error. The individuals you speak with, on the other hand, are so delighted that you are trying to speak their language that they easily forgive you a slip of the tongue or two.

I found a few "sticky wickets" and enumerate them below.

When your output gives you lambdas, you must convert them to betas (subtraction) and then exponentiate back up to get an odds-ratio. When your output gives you betas, all you need to do is exponentiate (the right beta) back up to get the odds, because betas are logged odds-ratios. (The multinomial regression package gives you both.) *The same is not always true for SEM, by the way, especially if you have mixed models or nonrecursive models.
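As a worked sketch of that conversion, with a made-up lambda for a dichotomous variable (effect coding, so the second lambda is minus the first):

    import math

    lam = 0.35                 # hypothetical lambda for the first category
    beta = lam - (-lam)        # subtraction gives the logit coefficient: 0.70
    odds_ratio = math.exp(beta)
    print(beta, round(odds_ratio, 3))   # 0.7 2.014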

MULTINOMIAL EQUATIONS

On multiple categories for a variable. If a variable in your model has k categories (k of at least 2), you will need k - 1 equations. The kth equation is obtained by subtraction, with the requirement that the sum of the lambda (or beta) coefficients across rows, columns, and combinations of rows and columns must equal 0. Any equation involving DEGLEV4, for example, means a set of three equations to describe what is happening.

For example, to predict four levels of education (DEGLEV4) from gender, you need separate equations for the high school, two-year college, and BA levels. Here graduate school is the reference category. Recall: these data are for 2001 from the NSF Surveys of Public Understanding of Science and Technology.

logit(DL1) = 2.408 [constant: DL] + (-.411) [G*DL]
logit(DL2) = 0.821 [constant: DL] + (-.338) [G*DL]
logit(DL3) = 0.685 [constant: DL] + (-.206) [G*DL]

NOTE: The constant terms tell us that, as we go up the educational ladder, the marginal for degree level diminishes, and the negative gender coefficients tell us that females (the second gender category) are less likely to go on to obtain college degrees (the second, third, and fourth degree categories). From the twentieth century on, women have been more likely than men to graduate from high school, and in very recent years women have earned the majority of AA and BA degrees, but surveys of the adult US population obviously include individuals much older than those from recent generations.
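Assuming the sum-to-zero convention described above applies to these coefficients, the suppressed graduate-work coefficient can be recovered by subtraction, and each beta exponentiates to its multiplicative odds-ratio:

    import math

    g_by_dl = [-0.411, -0.338, -0.206]    # G*DL coefficients for DL1-DL3, from above
    print(round(-sum(g_by_dl), 3))        # suppressed DL4 coefficient: 0.955

    for b in g_by_dl:
        print(b, round(math.exp(b), 3))   # odds-ratios: 0.663, 0.713, 0.814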

Here's an equation for watching science television in 2001; again, three equations are needed to predict watching science TV (not available in the later data) from degree level (DL) and gender (G). The three-way interaction was not needed in this model, so the results use the modest all-two-ways model. Watching science TV is coded 1 = yes and 0 = no when I use the logistic regression programs. Gender is 1 for males and 2 for females, and degree level is coded 1 through 4.

Although the dependent variable (watch science TV) is dichotomous, I still need three equations, because my independent variable, degree level, has four categories.

logit(TV) = -1.133 [constant: TV] + .434 [TV*DL1] + (-.151) [TV*G]
logit(TV) = -1.133 [constant: TV] + .204 [TV*DL2] + (-.151) [TV*G]
logit(TV) = -1.133 [constant: TV] + (-.228) [TV*DL3] + (-.151) [TV*G]

The constant (-1.133) remains the same across the three equations because there is only one (k = 2 - 1) expressed value in the dependent variable, as does the coefficient for gender, which also has only two values (hence one coefficient).

Most people don't watch science TV but relatively more males and those with more education are more likely to watch.
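To put any of these logits on the probability scale, apply the inverse-logit transform. For instance, using the constant alone as a rough baseline (setting the degree and gender terms aside):

    import math

    def inv_logit(L):
        # Convert a logit (logged odds) back to a probability.
        return math.exp(L) / (1.0 + math.exp(L))

    print(round(inv_logit(-1.133), 3))           # about 0.244: most people don't watch
    print(round(inv_logit(-1.133 + 0.434), 3))   # about 0.332: the DL1 term raises it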

EXAMINING CAUSALITY IN THESE MODELS

The pattern of results and the smaller sex differences that occur on watching science television when we take differences in degree level into consideration are consistent with what we expect in a mediated relationship. The original independent-dependent variable relationship (gender -> TV) becomes smaller when a mediating variable is introduced into the analysis. This suggests that any causal effect that gender has on science television occurs in part because it is causally indirect through education. Something about gender (perhaps socialization variables) leads to differential degree levels. In turn, degree level has a greater direct effect on science TV than gender--i.e., it is a more causally proximate variable.

We know, on the other hand, that this is not a spurious relationship. A spurious relationship occurs when the introduced control variable is the "real" independent variable, causing both the original independent variable and the dependent variable, so that the original relationship attenuates or disappears once the control is introduced.

How do we know this isn't a spurious relationship? Because causally, such a spurious relationship would be diagrammed as you see below:
 

PROPOSED SPURIOUS RELATIONSHIP CAUSAL ORDER

                 /----------> GENDER
                /
DEGREE LEVEL
                \
                 \----------> WATCH SCIENCE TV

Unless you are willing to grant that your highest achieved level of education makes you male or female as well as contributing to watching science TV, this cannot be a spurious relationship.

Depending on tastes and criteria, either the final model includes both a direct and an indirect (through DEGLEV4) causal effect of gender on science TV, or the analyst drops the direct effect of gender on science TV. We "lucked out" in that there is no three-way or four-way interaction effect.

INTERPRETATION

Whether it is an exercise or a conference paper or an article, you must tell your reader what the results mean.

What did your final model look like? What terms did it include?
You can use loglinear abbreviations to describe your final model and equations to show the terms.

What's the causal status of your final model? Did you have direct causal effects, indirect causal effects with mediators, interaction effects with moderators? A causal diagram helps your reader understand the original causal model and the final results you obtained.

Put your results into words. For example: in a particular dataset, who was more likely to own a home computer, men or women (or was there no sex difference)? People with advanced degrees or with high school degrees? Did gender continue to affect owning a computer once education or time was controlled? (If not, educational level mediated the gender-homepc correlation, see diagram above for why this would not be a spurious relationship...)

Tell us why (again) we should care about the results.

Does this give us insight about who might sue McDonald's? About why multiple cluster logistic regression will add to our analytic tools? About who is more likely to smoke (so we can target anti-smoking ads, perhaps)? Who goes to alternative schools? How to tackle math anxiety? Who is more likely to check a patriotism question? Whatever you began with, the interpretation and discussion part of a paper is the place to return to your research problem and tell your reader what happened and what your suggestions are for future research.
 
 
Susan Carol Losh
April 24 2017