THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 6: LOGLINEAR AND LOGIT MODELS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
ON READINGS | LOGIT MODELS | LOGIT EQUATIONS | βS & ODDS-RATIOS | KEY TAKEAWAYS
LEARNING TO LIVE WITH AGRESTI. OLD NEWS:
Through tables, equations, and graphs, Agresti shows us that probit, logit, logistic regression and loglinear models will lead us to draw similar conclusions from the same set of results. These conclusions may differ from those of a Linear Probability Model (which Agresti also discusses and which we examined earlier this semester). To date we have worked a great deal with loglinear models and somewhat with logit models and odds-ratios. It is a tiny jump to logistic regression models--so what is a PROBIT model? Watch for more on that one later this semester.
Agresti discusses a lot of model testing. He introduces the Wald test, in addition to the G² and X² we already know. The Wald is one of the major statistics given in most logistic regression programs, so it is nice to see it discussed in context. He goes over the partitioning of Chi-square and how we use this in deleting effects to test different partial associations and interactions in the data.
This chapter gives you some of the math and logic behind the practice.
There are several ways to analyze categorical
dependent variables and still preserve the systematic approach that mimics
Analysis of Variance, multiple regression, or structural equation modeling
approaches for numeric dependent variables. Keep in mind that the General
Cell Frequency (GCF) loglinear model, the logged odds-ratio or logit model,
and logistic regression (regardless of the number of categories in your
dependent variable) are all essentially "cousins" or in the same statistical
family.
So, in large part, which type of model you choose depends on what you wish to highlight and the relationships that you believe exist among your variables.
Are you interested in a structural equation
model type analysis?
A single stage multiple regression analogue?
An analysis of variance or analysis of
covariance analogue?
Each of these will always begin with the
GCF loglinear model and analyze different pieces of it.
Are you interested in a structural equation model analogue (SEMA)? In an SEMA, you are interested in the relationships among variables. This includes the relationships among independent variables, between independent variables and mediator variables, and how both influence the dependent variable.
If you are interested in an SEMA, then a loglinear analysis is the form that will most help you do so. In other relatives of the loglinear model, relationships among independent variables or in mediators are hidden in the background.
The "causes" for an independent variable lie outside the system of variables you are analyzing. The independent variables function only as causes, not as effects, among the variables that you have under study in your model. (This doesn't mean that these same variables could not be dependent or response variables in a different analysis with a different set of variables.)
Mediator variables serve simultaneously as causes and effects in the model that you have under study. Mediators are caused by independent variables, and, in turn, serve as causes of dependent or response variables. In fact, mediators may be the most proximate cause of your dependent variable.
Finally, dependent variables in a given analysis nearly always function solely as effects and not as causes. (I am omitting here the nonrecursive or simultaneous causation models.)
Using the GCF model means that you can not only assess the relationships among the independent variables, but that you can trace causal paths from independent variable to mediator to dependent variable.
In part, you make these inferences by assessing the partial relationships among variables.
For example, suppose (not surprisingly) that you believe an individual's social status is influenced indirectly by one's education working through one's occupation (a variation on Blau and Duncan's 1967 classic, The American Occupational Structure):
EDUCATIONAL LEVEL → OCCUPATIONAL TYPE → SOCIAL STATUS LEVEL
If the "zero order" correlation between education and social status attenuates to zero or nearly zero in the partial correlation CONTROLLING for occupational type, you would say that educational level has an indirect causal effect on social status that is mediated by occupational type. Occupational type would have a direct causal effect on social status level.
It is possible for an independent variable
to have both direct and indirect causal effects on later variables in a
causal chain. (If an independent variable has neither one, you must ask
yourself what it is doing in your model at all.)
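The attenuation logic above can be sketched numerically. Below is a minimal illustration in Python; the correlation values are hypothetical (not from any dataset in this guide), chosen so that occupation fully mediates the education-status link:

```python
from math import sqrt

# Hypothetical zero-order correlations (illustrative values only):
r_eo = 0.6   # education <-> occupation
r_os = 0.7   # occupation <-> status
r_es = 0.42  # education <-> status, set equal to r_eo * r_os (full mediation)

# First-order partial correlation: education-status CONTROLLING occupation.
r_es_o = (r_es - r_eo * r_os) / sqrt((1 - r_eo**2) * (1 - r_os**2))

print(round(r_es_o, 3))  # 0.0 -> education's effect on status is fully mediated
```

Because the zero-order education-status correlation equals the product of the two paths through occupation, the partial vanishes, which is exactly the pattern described above.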
And hope you don't have many three-way or higher interaction effects!
In basic OLS regression we assess how a mix of categorical and numeric variables influence a single numeric dependent variable.
Analysis of variance and analysis of covariance are simply variations on the same theme (especially ANCOVA). We have a numeric dependent variable and we closely examine what happens to mean scores on the dependent variable depending on categories of the independent variable(s) and (in ANCOVA) how these categorical factors influence the dependent variable mean adjusting for numeric covariates. This usually results in adjusting the mean up or down.
As Agresti points out, we can do all this through logit models or logistic regression. Because we work with the basic tabular setup in logit analysis, superficially logit analysis looks more analogous to analysis of variance than logistic regression does. Instead of predicting the dependent variable mean score, our "dependent variable" is a logged odds-ratio of modeled cell counts on the dependent variable in a multi-dimensional table.
Logistic regression appears more akin to analysis of covariance, or even multiple regression. Once again, your dependent variable is a logged odds-ratio on your dependent variable. With a polytomous (many-categoried) dependent variable, your logistic regressions will depend on the form of your dependent variable and on just how you created your odds-ratios.
In either the dichotomous or multinomial logistic regression case, you will actually do one overall test of the model, which uses (guess what?) the overall G² from the corresponding loglinear model. Your independent variables can be a mix of categorical or numeric predictors. But--don't be fooled! There's a multidimensional table hiding behind those regression-like results. If you have too many categories in your independent "numeric covariate" variable and too few cases, your results can turn out to be unreliable and downright strange. And the computer program is "fixing" your marginals and the associations between the independent variables and counting them in your degrees of freedom. They are just hidden in the background because the assumption is that you are only interested in the partial relationships between each independent variable and the dependent log-odds.
When you do logit analysis, you will
take the odds-ratio on categories of your chosen dependent variable.
You must choose
a dependent variable.
You must decide
which category will be your "numerator" or "success."
Pay special attention to SPSS program features. Logistic regression and loglinear analysis do not always default to the same numerator category of your variables.
For both logit analysis and logistic regression (dichotomous or polytomous) you choose ONE dependent variable.
Your interest is in how several independent
variables predict a single dependent variable.
Program-wise, logistic regression packages have handled combinations of numeric and categorical independent variables somewhat more easily than logit packages (although both--e.g., the "general" procedure for loglinear analysis--can do so).
Again, both logit and logistic regression
models start with the GCF loglinear model. That's what's used for estimation
and testing. However, the univariate marginals for the independent
variables and the associations and interaction effects solely among
the independent variables drop out through subtraction (in the
logged equation) or division (in the exponentiated equation). Skim the
quick review below to see how.
Starting with our EXPONENTIATED and multiplicative saturated general cell frequency model for three variables, the predicted or modeled odds-ratio is the equivalent of dividing the multiplicative equation for category 1 of the chosen dependent variable by the multiplicative equation for category 2 of the dependent variable, as shown below. Our dependent variable is variable "C". The terms that change in moving from a loglinear model to a logit model are those that involve variable C.
Gij1 / Gij2 = [exp(μ) exp(λi^A) exp(λj^B) exp(λ1^C) exp(λij^AB) exp(λi1^AC) exp(λj1^BC) exp(λij1^ABC)] / [exp(μ) exp(λi^A) exp(λj^B) exp(λ2^C) exp(λij^AB) exp(λi2^AC) exp(λj2^BC) exp(λij2^ABC)]
When we take the logarithms of the numerator
and denominator of the exponentiated odds-ratio in the saturated three
variable model, we convert from a multiplicative system to an additive
system for the LOG-ODDS and the equation now looks as follows:
ln (Gij1) - ln (Gij2) = (μ + λi^A + λj^B + λ1^C + λij^AB + λi1^AC + λj1^BC + λij1^ABC) - (μ + λi^A + λj^B + λ2^C + λij^AB + λi2^AC + λj2^BC + λij2^ABC)
By collecting terms, we can rewrite the
log-odds equations as:
ln (Gij1 / Gij2) = (μ - μ) + (λi^A - λi^A) + (λj^B - λj^B) + (λij^AB - λij^AB) + (λ1^C - λ2^C) + (λi1^AC - λi2^AC) + (λj1^BC - λj2^BC) + (λij1^ABC - λij2^ABC)
and the log-odds equation simplifies to:
ln (Gij1 / Gij2) = (λ1^C - λ2^C) + (λi1^AC - λi2^AC) + (λj1^BC - λj2^BC) + (λij1^ABC - λij2^ABC)
Take each step at a comfortable pace for you until you have reassured yourself that you KNOW how you got from point "A" to point "Z".
All the lambda terms that contain ONLY the independent variables drop out of the model by subtraction (of course, they drop out of the exponentiated model by division too if you look closely).
To simplify we call βk = (λ1^C - λ2^C)
and βik = (λi1^AC - λi2^AC), and similarly βjk and βijk,
so the saturated logit equation now becomes:
ln (Gij1 / Gij2) = βk + βik + βjk + βijk
And the βs are the relatively familiar coefficients from logistic regression.
SCAN THE MATH BUT LEARN THE CONCLUSION!
Recall the constraint (this is going to be a familiar refrain) that lambda coefficients must sum to zero, for the total for a univariate distribution, across the rows and the columns for a bivariate distribution, in three dimensional space for a three variable distribution, etc.
In the special case ONLY of a dichotomous dependent variable, then, each lambda for category 2 of variable A is simply the reverse or negative of each corresponding lambda for category 1 of variable A.
For example, in one loglinear table using the 1999 and 2001 NSF Science data, the λ for the first of two degree levels, having less than a BA degree, = 0.832. Most of the sample did not have a BA degree. Thus, the λ for the second degree level, a BA degree or more, was simply calculated as -0.832.
Going back to our logit equation for three variables AND A DICHOTOMOUS DEPENDENT VARIABLE:
ln (Gij1 / Gij2) = (λ1^C - λ2^C) + (λi1^AC - λi2^AC) + (λj1^BC - λj2^BC) + (λij1^ABC - λij2^ABC)
This means that
-λ2^C = -(-λ1^C) = λ1^C
so (λ1^C - λ2^C) = λ1^C + λ1^C = 2 λ1^C
and βk = 2 λ1^C
In the case of the logit betas, for example βik, i.e.,
βik = (λi1^AC - λi2^AC) = (λi1^AC - [-λi1^AC]) = 2 (λi1^AC)
and the logit equation (for three variables with dichotomous dependent variable C) written below as:
ln (Gij1 / Gij2) = βk + βik + βjk + βijk
is the same as:
ln (Gij1 / Gij2) = 2 (λ1^C) + 2 (λi1^AC) + 2 (λj1^BC) + 2 (λij1^ABC)
In the special case of all
dichotomous variables, therefore, the logit or log-odds equation
reduces to twice the lambda coefficients for the first value of the dependent
variable.
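This "twice the lambda" conclusion is easy to verify numerically. A minimal sketch in Python, assuming a simple one-way GCF model of the dichotomous degree variable whose counts (4928 versus 1529) appear later in this guide:

```python
import math

# Two-category degree variable: 4928 below a BA, 1529 with a BA or more.
n1, n2 = 4928, 1529

# For a one-way GCF model of a dichotomy, the lambdas are the deviations of
# the logged counts from their mean, so they sum to zero and lam2 = -lam1.
mu = (math.log(n1) + math.log(n2)) / 2
lam1 = math.log(n1) - mu
lam2 = math.log(n2) - mu

# The log-odds equals lam1 - lam2, which is twice lam1.
logit = math.log(n1 / n2)
print(round(logit, 4), round(2 * lam1, 4))  # both approximately 1.170
```

The same cancellation that produced the 2λ terms in the equation above is what makes the two printed values identical.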
We are always taking the log-odds on the
DEPENDENT variable.
The logit model refers to the log-odds
equation for the dependent variable.
Furthermore, the lambdas for the univariate independent variables and the pairwise correlations, three factor interactions and so forth among the INDEPENDENT variables drop out of the logit (or logistic regression) equation entirely through subtraction (or through division in the original multiplicative equation).
The only terms remaining in the logit equation (analogous to the logistic regression model) are the "constant" term for the dependent variable and the associations or interaction effects between each independent variable (and combinations) and the dependent variable.
Therefore, you will have a beta coefficient for each category of the independent variable with the dependent variable (e.g., if you have four levels of education and two levels of a science question, you will have four betas, one for each value of education).
HOWEVER, given the constraint
that the lambdas must sum to zero, you will only receive output from most
computer programs for the independent variable parameters that can be estimated
"independently". If your independent variable has K categories, then there
will be K - 1 logits that will be estimated for you. The Kth logit
will be estimated by subtraction and the constraint that the sum of the
logits in this case = 0. In my example, you would receive logits
for the first three categories of educational level by the science question.
When you work with the loglinear GCF equation, you are predicting the count or frequency in a particular cell of a crosstabulation table. To get the cell-count predicted under your model, you would need to exponentiate each lambda term using natural exponents (e). Then multiply all the terms together.
Ouch! Who wants to do that?
It's a lot easier to note the logs that
raise or lower the cell count.
Lambdas that are statistically significant
(i.e., are different from zero) are terms that must be retained in order
to describe the cells in the table accurately.
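The exponentiate-and-multiply step can be sketched in a few lines of Python. The lambda values below are hypothetical, purely illustrative numbers for one cell of a saturated two-variable GCF model:

```python
import math

# Hypothetical lambda terms for one cell (illustrative values only):
mu = 5.0       # grand mean of the logged counts
lam_A = 0.30   # row (variable A) effect
lam_B = -0.10  # column (variable B) effect
lam_AB = 0.05  # A-by-B association effect

# The GCF model is additive in the logs ...
log_count = mu + lam_A + lam_B + lam_AB

# ... and multiplicative once each term is exponentiated with natural e:
count = math.exp(mu) * math.exp(lam_A) * math.exp(lam_B) * math.exp(lam_AB)

print(round(count, 2), round(math.exp(log_count), 2))  # the same cell count
```

Either route gives the same predicted cell frequency, which is why it is so much easier to work with the additive logged form.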
When you work with the logit equation, the beta coefficients signify raising or lowering the logged odds-ratio on the dependent variable.
There is a temptation to call these probabilities (Agresti succumbs to this one from time to time)--but they really aren't. They are logged odds-ratios.
So you can say something like (for example, if a beta = 2): education raises the log-odds of answering the science question correctly, as opposed to not, by 2.
Or, more simply still, something like "having an advanced degree raises the log-odds of answering the science question correctly by 2" (equivalently, it multiplies the odds by e² ≈ 7.4).
Obviously none of these statements is intuitively
self-evident!
But they make possible the quantitative
analysis of categorical data in a multivariate scheme.
PROBIT MODELS: A VERY BRIEF MENTION
In order to turn the numerator and denominator into probabilities (which some analysts find more comfortable), you need to divide each of them by the column (or row) total. A related, but slightly different, model transforms each probability through the inverse of the cumulative standard normal distribution; this is the probit model. Agresti shows you some analysis there too.
Probit models usually give substantive
conclusions that are very similar to those from logit models, especially
in large samples, although the actual numeric results may differ.
James Davis of NORC, the University of Chicago, Harvard and Dartmouth University (the man got around) has been very fond of saying that anything can be dichotomized. For example, all the states in the United States could be divided into New Hampshire (where Dartmouth is located) and all the rest.
But dichotomies give us limited information, and they often do not do justice to the richness or complexity in our variables. For example, you cannot examine monotonicity or nonlinearity when the dependent variable has only two categories. Thus, a multinomial type of model with a dependent variable that has more than two categories may be called for.
It's no more complicated for a computer statistical program to calculate the λs or βs for polytomous (many-categoried) variables than it is to calculate them for dichotomous variables. Once again, we fix particular univariate marginals, bivariate tables, three-factor interactions (and so on) and fit a particular model. Once again, the λs or the logit βs sum to zero in the marginals or across the rows and columns of a table.
With polytotomous variables, the computer part may simply be an extension, but for the analyst, things are a bit trickier.
Degree Level Value | Frequency | Percent | Estimated λ |
1 = High School or Less | . | . | 1.250 |
2 = Vocational/AA Degree | . | . | -0.321 |
3 = BA Degree | . | . | -0.153 |
4 = Advanced College Degree | . | . | . |
Total | . | . | . |
(The frequency and percent entries were not recoverable from the original table.)
To find the λ for the Advanced Degree category, we must add the program-calculated λs for the first three categories, or:
1.250 - 0.321 - 0.153 = 0.776
Since the λs for the distribution of a single variable must sum to zero, the λ for Advanced Degree must equal -0.776.
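The subtraction step above is mechanical enough to put in two lines of Python, using the three program-calculated λs just reported:

```python
# Program-estimated lambdas for the first three degree-level categories:
lams = [1.250, -0.321, -0.153]

# The sum-to-zero constraint pins down the final lambda by subtraction:
lam_advanced = -sum(lams)

print(round(lam_advanced, 3))  # -0.776
```

This is exactly what your software does when it reports only K - 1 parameters for a K-category variable.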
For example, using the two degree levels variable and making "less than a BA degree" the success or the numerator category, we have 4928 individuals with less than a BA and 1529 persons with a BA degree or more. The odds-ratio is 4928/1529, or 3.223, and the logit or log-odds is 1.170. Not only is calculating the odds-ratio a more complex decision with the four-category degree level variable, but you also have many more choices.
If you have k categories, you can construct k - 1 odds ratios (the kth odds-ratio is linearly dependent on the first k - 1 that you calculated).
Here are some common possible choices:
The "ordinal solution" | With an ordinal dependent variable, take the odds on adjacent categories, e.g., BA degree to Advanced Degree. |
The "referent category" solution | Contrast each value of the odds with the same referent category, for example, Advanced Degree. Comparable in some respects to dummy variable coding in multiple linear regression. |
The "all other cases" solution | Contrast each value of the odds with all the other remaining cases. |
With ordinal data, the ordinal solution preserves the rank-ordered nature of the categories. For the four-category degree level variable above, we would start with the lowest educational category and take the odds of each category relative to the adjacent category above it.
Of course, you can use adjacent categories for any type of data, but the comparisons won't mean much with nominal data since the order of the categories is arbitrary. We can see that the largest change is going to some kind of college at all. The odds of high school (at most) to a two year degree are over twice as high as the other two comparisons. Once the individual has actually begun matriculation, it may be easier to continue than it is to start college in the first place.
Continuing with this example, then, you might decide that attending post-secondary college or training at all is the key comparison. In this referent category solution, you could make a high school diploma or less the referent category and take odds-ratios with respect to it. In this case, because it is the referent, you would use the frequency for a high school degree or less as the consistent denominator.
Thus, an individual is only about one-fifth as likely to have a two-year college degree as a high school diploma, one-quarter as likely to have a BA degree, and one-eighth as likely to have an advanced degree.
Alternatively, you could make the high school diploma or less category consistently the numerator and then take the odds-ratios; this yields the inverse set of odds. The key, as you can see, is consistency within the same analysis.
Not surprisingly, an adult would be almost five times as likely to have a high school diploma as a two-year degree, about four times as likely to have a high school diploma as a BA, and nearly eight times as likely to have a high school degree as an advanced college degree.
It is more typical to see the referent category used as the denominator. In our example, this would make all the odds-ratios fractions.
The referent category solution is akin to using dummy variable coding in regression.
In the "other cases" comparisons, the numerator in the odds ratio is the category of interest, and the denominator is all the other cases in the dependent variable. With K categories, you will only be able to independently calculate K - 1 of these odds because the Kth odds is determined by subtraction and is "linearly dependent" on the first K - 1 parameters.
In our four-category degree level variable, since the higher three categories all deal with post-secondary education, I will use those three to calculate my odds-ratios.
In each case, I calculated the denominator by subtracting the numerator from the total, or 6457 cases.
An individual was about one-sixth as likely to have a two-year degree as not, about one-fifth as likely to have a baccalaureate as not, and only about one-eleventh as likely to have a graduate college degree as not.
This solution is more akin to a probit model solution.
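All three odds constructions can be computed side by side. The sketch below uses made-up round counts for a four-category degree variable (NOT the NSF frequencies, which are not reproduced in this guide):

```python
# Hypothetical counts: HS or less, Voc/AA, BA, Advanced (illustrative only).
counts = [400, 80, 100, 50]
total = sum(counts)

# 1. "Ordinal" solution: odds on adjacent categories.
adjacent = [counts[i] / counts[i + 1] for i in range(len(counts) - 1)]

# 2. "Referent category" solution: each higher category against HS or less.
referent = [c / counts[0] for c in counts[1:]]

# 3. "All other cases" solution: each category against all remaining cases.
all_other = [c / (total - c) for c in counts[1:]]

# Each scheme yields K - 1 = 3 independently calculable odds.
print([round(o, 3) for o in adjacent])
print([round(o, 3) for o in referent])
print([round(o, 3) for o in all_other])
```

Notice that every scheme produces exactly K - 1 odds; which scheme you pick depends on which group comparisons matter for your question.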
The decision is YOURS! What do you want to find out? Which groups do you want to compare? The multinomial dependent variable gives you a lot more choices and flexibility than a dichotomous variable does.
|
OVERVIEW
This page created with Netscape
Composer
Susan Carol Losh
March 23 2017