THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 8: ON LOGITS, LAMBDAS, AND OTHER "GENERAL" THOUGHTS, EXERCISES, PAPERS, ETC.
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
MULTINOMIAL EQUATIONS
KEY TAKEAWAYS
This section incorporates information from "trial runs," explanations, and class projects. Please read through carefully.
If you need a refresher course in why we are going to all this trouble to begin with, see Guide 1.
The causal model in the example below is that gender influences degree level which in turn influences watching science television (these data are from the 2001 NSF Surveys because the SCITV questions were not asked after that).
Now, you could do this analysis all at one time in the GENERAL or the MODEL SELECTION SPSS program (and if I were doing this for a conference or publication, I would). But, for the sake of this example, we will do TWO logistic regression equations.
In the first example equation, we predict four levels of highest degree (DEGLEV4) from gender using multinomial regression. There are k - 1 = 3 equations, one for each of (by default) the first three degree levels, with graduate work as the suppressed reference category.
In the second equation, we use a binomial regression run through the multinomial regression program (more on this shortly) to predict watching science television (1 = yes and 0 = otherwise) from degree level and gender, because SCITV has only two categories. However, we will still have three equations, because one of our independent variables, DEGLEV4, has four categories, and so there are k - 1 = 3 equations. The coefficients for the constant (the SCITV marginal) and for gender will be the same in all three equations. However, there will be three, possibly different, coefficients for degree level.
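If you want to sanity-check these two equations outside SPSS, here is a minimal sketch in Python with statsmodels. The file name and column names (gender, deglev4, scitv) are my own assumptions, and statsmodels uses dummy coding with its own reference categories rather than the effect-coded lambdas of the GENERAL program, so the printed estimates will not match the SPSS coefficients term for term even though the fitted models are equivalent.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical extract of the 2001 NSF survey data; column names are assumed:
    # gender (1 = male, 2 = female), deglev4 (1-4), scitv (1 = yes, 0 = otherwise).
    df = pd.read_csv("nsf2001.csv")

    # Equation 1: multinomial regression of four-category degree level on gender.
    # This yields k - 1 = 3 logit equations (statsmodels takes the first outcome
    # category as its base, unlike the SPSS default of the last category).
    eq1 = smf.mnlogit("deglev4 ~ C(gender)", data=df).fit()
    print(eq1.summary())

    # Equation 2: binomial logit of watching science TV on degree level and gender.
    # One dependent-variable contrast, but k - 1 = 3 coefficients for DEGLEV4.
    eq2 = smf.logit("scitv ~ C(deglev4) + C(gender)", data=df).fit()
    print(eq2.summary())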
If there is any problem in the GENERAL program, unfortunately it appears to carry over into the logistic regression program estimates.
In the GENERAL program, you must enter terms from the simple to the more complex. This means when you create a custom model, you first enter the marginal terms, then desired two-way association terms, then three-way terms and so forth. Although your G2s will be fine regardless of order, very strange things can happen to your lambda/beta coefficients if you enter terms into the model in any other order.
Apparently in the current SPSS multinomial regression program you must use the same simple-to-complex rule when entering your desired terms as you do in the latest version of the GENERAL program (however, terms that contain only the independent variables, which are needed for fit, will not be seen in the multinomial output).
The current SPSS binomial logistic regression program: I advise NOT using this program. Unlike the multinomial regression package, which produces betas comparable to GENERAL--logit, it is unclear whether the binomial program's estimates correspond to anything in GENERAL. Furthermore, there is no provision for including statistical moderator (interaction) terms between predictors and the dependent variable--unless you create those variables yourself in the binomial program, which can create other problems. The multinomial regression program provides the opportunity to create moderators in the model option (including a saturated model, if that's what you want to do!). Since you can do binomial logistic regression through the SPSS multinomial program, that is what I recommend you do when you want a binomial regression.
An odds ratio above one can range anywhere from one to infinity; in contrast, an odds ratio below one is sandwiched between zero and one. You will get a much better picture of how your independent variables influence your dependent variable if you use the logit (logged odds) equation rather than exponentiating the betas, because the positive and negative betas will be symmetric in the logits.
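A tiny numeric illustration of that symmetry (the coefficient value here is made up):

    import math

    beta = 0.693                  # a hypothetical logit (logged-odds) coefficient
    print(math.exp(beta))         # odds ratio of about 2.0 -- can run off toward infinity
    print(math.exp(-beta))        # odds ratio of about 0.5 -- squeezed between zero and one
    # On the logit scale, +0.693 and -0.693 are exact mirror images; once
    # exponentiated, 2.0 and 0.5 no longer look like equal-sized effects.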
In the loglinear model, the programs give you the coefficients for the first value of a binary variable and suppress the second value. Keep careful track of which value is expressed and which is suppressed; otherwise you will have difficulty interpreting your results, and so will your readers.
The saturated model and the MODEL SELECTION program will give you a good idea of which terms are necessary to have a model that fits, i.e., produces a small G2.
The saturated MODEL SELECTION program is a real data dredger: it will tell you everything about how your variables fit together, and that's why I recommend running it first.
The logistic regression programs will first fit the following terms in an underlying loglinear model. You will NOT see these terms in the logistic regression output because they address only the independent variables, and such terms drop out through subtraction as you move to a logit or logistic regression model (review Guide 6 on the algebraic transition from loglinear lambdas to logistic regression betas). These terms are the ones that involve only the independent variables: the GENDER marginal, the DEGLEV4 marginal, and the GENDER*DEGLEV4 association.
Over the course of a rather messy calculation, you will end up with a new G2 for your logistic regression model. This G2 reflects the chi-square difference between the model omitting the terms you specified and the model that incorporates those terms. Similarly, the degrees of freedom for this test reflect the difference in df between the model omitting the terms and the model that incorporates them. The df typically corresponds to the number of terms that you specified in your logistic regression model (including the extra terms for a polytomous variable, i.e., k - 1 coefficients for each polytomous independent variable). If this new G2 is statistically significant, it means the model does NOT fit without the terms you specified, and those terms must be returned to make the model "fit". This is considered the overall chi-square for your model.
Want to reproduce this G2? Go
into the MODEL SELECTION program and run it with the terms included. Then
leave those terms out and run the model through the MODEL SELECTION program
again. The increase in the G2 should be roughly the same as
the G2 you found in the logistic regression package. Just be
sure to include in the MODEL SELECTION program run the terms that only
address the independent variables to make the program specifications roughly
the same as the logistic regression runs (see
above).
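Here is the same difference logic as a small Python sketch. All the G2 and degrees-of-freedom values below are hypothetical placeholders, not results from the 2001 NSF data; they simply show how the overall chi-square and its df come from subtracting one MODEL SELECTION run from the other.

    from scipy.stats import chi2

    # Hypothetical G2 values from two MODEL SELECTION runs (see above).
    g2_without_terms = 58.4   # model omitting the logit terms you specified
    g2_with_terms = 3.9       # model including those terms
    df_without_terms = 7
    df_with_terms = 3

    delta_g2 = g2_without_terms - g2_with_terms   # overall chi-square for the logistic model
    delta_df = df_without_terms - df_with_terms   # here 4: 1 for gender + (k - 1) = 3 for DEGLEV4
    p_value = chi2.sf(delta_g2, delta_df)

    # A significant delta_g2 means the model does NOT fit without those terms.
    print(delta_g2, delta_df, p_value)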
To date, everyone is at the A level overall on the exercise total.
Do you sometimes feel at sea? Not sure what to call those pesky coefficients? The why is simple: this is new material. Math is a language. You would not expect to take a semester of Spanish, Korean, Chinese, or English and speak like a native. You would feel a bit hesitant, perhaps worry a little about embarrassing yourself with a grammatical error. The individuals you speak with, on the other hand, are so delighted that you are trying to speak their language that they easily forgive you a slip of the tongue or two.
I found a few "sticky wickets" and enumerate them below.
On multiple categories for a variable. If you have a variable in your model with k categories (where k is at least 2), you will need k - 1 equations. The kth equation is obtained by subtraction, with the requirement that the sum of the lambda (or beta) coefficients across rows, columns, and combinations of rows and columns must equal zero. For any model that includes DEGLEV4, for example, that means a set of three equations to describe what is happening:
Logit(DL1) = 2.408 (DL) + (-0.411) (G*DL)
Logit(DL2) = 0.821 (DL) + (-0.338) (G*DL)
Logit(DL3) = 0.685 (DL) + (-0.206) (G*DL)
NOTE: The constant terms tell us that as we go up the educational ladder, the marginal for degree level diminishes, and the gender terms tell us that females (the second gender category) are less likely to go on to obtain college degrees (the second, third, and fourth degree categories). From the twentieth century on (and still today), women have been more likely to graduate from high school. In very recent years women have earned the majority of AA and BA degrees, but surveys of the adult US population obviously include individuals much older than those from recent generations.
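Because the coefficients must sum to zero across the four degree levels (the subtraction rule stated above), the suppressed fourth (graduate work) equation can be recovered from the three printed ones. A quick check in Python:

    # Constants and gender (G*DL) terms from the three printed equations.
    constants = [2.408, 0.821, 0.685]
    gender_terms = [-0.411, -0.338, -0.206]

    # The fourth (graduate work) coefficients are whatever makes each set sum to zero.
    dl4_constant = -sum(constants)       # -3.914
    dl4_gender = -sum(gender_terms)      #  0.955
    print(dl4_constant, dl4_gender)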
Here's an equation for watching science television in 2001; again, three equations are needed to predict watching science TV (not available in the later data) from degree level (DL) and gender (G). The three-way interaction was not needed in this model, so the results use the modest all-two-ways model. Watching science TV is coded 1 = yes and 0 = no when I use the logistic regression programs. Gender is 1 for males and 2 for females, and degree level is coded 1 through 4.
Although the dependent variable (watch science TV) is dichotomous, I still need three equations, because my independent variable, degree level, has four categories:
Logit(TV) = -1.133 (TV) + 0.434 (TV*DL1) + (-0.151) (TV*G)
Logit(TV) = -1.133 (TV) + 0.204 (TV*DL2) + (-0.151) (TV*G)
Logit(TV) = -1.133 (TV) + (-0.228) (TV*DL3) + (-0.151) (TV*G)
The constant (-1.133) remains the same across the three equations because there is only one (k - 1 = 1) expressed value in the dependent variable, as does the coefficient for gender, which also has only two values (hence one coefficient).
Most people don't watch science TV but relatively more males and those with more education are more likely to watch.
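As a rough back-of-the-envelope check on the first part of that statement, the constant can be converted back into a proportion. This treats -1.133 as the marginal logit for watching science TV, as described above, so the figure is only approximate:

    import math

    constant = -1.133            # the logit constant from the equations above
    odds = math.exp(constant)    # roughly 0.32 to 1 against watching
    p = odds / (1 + odds)        # roughly 0.24
    print(odds, p)               # consistent with "most people don't watch"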
The pattern of results, and the smaller sex difference in watching science television once we take differences in degree level into consideration, are consistent with what we expect in a mediated relationship. The original independent-dependent variable relationship (gender --> TV) becomes smaller when a mediating variable is introduced into the analysis. This suggests that any causal effect gender has on science television occurs in part because it is causally indirect through education. Something about gender (perhaps socialization variables) leads to differential degree levels. In turn, degree level has a greater direct effect on science TV than gender does--i.e., it is a more causally proximate variable.
We know, on the other hand, that this is not a spurious relationship. A spurious relationship occurs when the introduced control variable serves as the "real independent" variable and the relationship between the original independent variable and the dependent variable attenuates or becomes smaller.
How do we know this isn't a spurious relationship?
Because causally, such a spurious relationship would be diagrammed as you
see below:
PROPOSED SPURIOUS RELATIONSHIP CAUSAL ORDER
DEGREE LEVEL ----------> GENDER
DEGREE LEVEL ----------> WATCH SCIENCE TV
Unless you are willing to grant that your highest achieved level of education makes you male or female as well as contributing to watching science TV, this cannot be a spurious relationship.
Depending on tastes and criteria, a final model either includes both a direct and an indirect (through DEGLEV4) causal effect of gender on science TV, or drops the direct effect of gender on science TV entirely. We "lucked out" in that there is no three-way or four-way interaction effect.
Whether it is an exercise or a conference paper or an article, you must tell your reader what the results mean.
What did your final model look like? What
terms did it include?
You can use loglinear abbreviations
to describe your final model and equations to show the terms.
What's the causal status of your final model? Did you have direct causal effects, indirect causal effects with mediators, interaction effects with moderators? A causal diagram helps your reader understand the original causal model and the final results you obtained.
Put your results into words. For example: in a particular dataset, who was more likely to own a home computer, men or women (or was there no sex difference)? People with advanced degrees or with high school degrees? Did gender continue to affect owning a computer once education or time was controlled? (If not, educational level mediated the gender-homepc correlation, see diagram above for why this would not be a spurious relationship...)
Tell us why (again) we should care about the results.
Does this give us insight about who might
sue McDonald's? About why multiple cluster logistic regression will add
to our analytic tools? About who is more likely to smoke (so we can target
anti-smoking ads, perhaps)? Who goes to alternative schools? How to tackle
math anxiety? Who is more likely to check a patriotism question? Whatever
you began with, the interpretation and discussion part of a paper is the
place to return to your research problem and tell your reader what happened
and what your suggestions are for future research.
Susan Carol Losh
April 24 2017