PROJECT PRÉCIS IS DUE THROUGH THE DISCUSSION BOARD IF YOU HAVEN'T TURNED IT IN YET. Check the site for the parameters I need.
 
READINGS

GUIDE 1: ISSUES IN MODELING
GUIDE 2: TERMINOLOGY
GUIDE 3: THE LOWLY 2 X 2 TABLE
GUIDE 4: BASICS ON FITTING MODELS
GUIDE 5: SOME REVIEW, EXTENSIONS, LOGITS
GUIDE 6: LOGLINEAR & LOGIT MODELS
GUIDE 7: LOG-ODDS AND MEASURES OF FIT
GUIDE 8: LOGITS,LAMBDAS & OTHER GENERAL THOUGHTS
OVERVIEW


 
 
EDF 6937-05       SPRING 2017
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 6: LOGLINEAR AND LOGIT MODELS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
 
SOME NOTES ON READINGS
GCF VERSUS LOGIT MODELS
LOGLINEAR TO DICHOTOMOUS LOGIT EQUATIONS
POLYTOMOUS DEPENDENTS & ODDS-RATIOS

 

KEY TAKEAWAYS
  • Through tables, equations, and graphs, Agresti reminds us that probit, logit, logistic regression and loglinear models produce similar substantive conclusions on the same set of analytic results.
    • These conclusions may differ from using a Linear Probability Model (which we examined earlier this semester). 
  • Which type of model you choose depends on what you wish to highlight and the relationships that you believe exist among your variables.
    •  In a Structural Equation Model Analogue, you are interested in the relationships among variables
      • Try a loglinear model.
    • Do you have a single "dependent variable"? 
      • Try a logit or logistic regression model. 
      • Instead of predicting the dependent variable mean score, our "dependent variable" is a logged odds-ratio of modelled cell counts on the dependent variable in a multi-dimensional table.
      • If you have two causal stages, (1) run the entire causal model through a loglinear program, such as SPSS's General loglinear procedure, then (2) use logistic regression to estimate the Beta parameters (first for the mediator, then for the final dependent variable).
  • Pay special attention to SPSS program features. 
    • Logistic regression and loglinear analysis do not always default to the same numerator category of your variables.
    • Sometimes the lowest category value is the "success"; sometimes the higher category value is the "success."
    • So check it out!
  • Switching from a loglinear to a logit or logistic regression model:
  • In the special case of all dichotomous variables, the logit or log-odds equation reduces to twice the lambda coefficients for the first or "success" value of the dependent variable.
  • When you work with the logit equation, the beta coefficients signify raising or lowering the logged odds-ratio on the dependent variable.
  • It's somewhat more complex to calculate lambda (and beta) coefficients in multinomial logit or logistic regression:
    • A polytomous dependent variable
    • See the example below
    • See the possible contrasts you can do with a simple (three value) dependent variable here
    • We'll review in class!

 
 
 
 A NOTE ON EXERCISES. Have your output and exercise answers with you as we go over them in class. Don't hesitate to change an answer if you feel it's more appropriate. On the other hand, if you disagree, stick to your guns--BUT be sure to explain and justify your answer! We are at the stage where there may be some disagreement on models (but be sure to examine all the information available); the same thing can happen with structural equation models too.

 
PLEASE NOTE CORRECTION(?) TO GILBERT, PAGE 135:

On page 135, Gilbert states that the value of b0, the constant term in OLS multiple regression, "does not vary with the values of the independent variables". This is true when you estimate the score for a particular case (which may have been what Nigel Gilbert really meant, but the statement is ambiguous). On the other hand, the generic formula for estimating b0 (the sample regression constant term) is:
 

b0 = ȳ - b_{y1}x̄_1 - b_{y2}x̄_2 - b_{y3}x̄_3 - ... - b_{yk}x̄_k

Thus, the estimated value of b0 in the regression equation depends on the mean values of the independent variables. In a similar fashion, the constant in a loglinear model will change as the model changes, and the values of the terms in the equation will change with the addition of a constant term. For example, a variation on the equiprobable model will give a different constant term than a model which incorporates several interactions of the independent variables with the dependent variable.
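A quick numerical check of this formula, as a minimal sketch in Python with made-up data (the variable names are mine, not Gilbert's):

```python
import numpy as np

# Made-up data: two independent variables and one dependent variable.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=50)

# Ordinary least squares with an intercept column.
Xd = np.column_stack([np.ones(len(y)), X])
b = np.linalg.lstsq(Xd, y, rcond=None)[0]          # [b0, b_y1, b_y2]

# b0 equals the mean of y minus the slope-weighted means of the Xs,
# so the constant depends on the means of the independent variables.
b0_check = y.mean() - b[1] * X[:, 0].mean() - b[2] * X[:, 1].mean()
print(np.isclose(b[0], b0_check))                  # True
```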

ALSO NOTE ON P. 135 (this is correct but useful to observe:) THE EQUATION AT THE VERY TOP OF THE PAGE:

As Gilbert moves from the loglinear to the logit equation, notice the "2" at the far left of the right-hand side of the equation. He can do this in calculating the logit equation BECAUSE there are only two values to the dependent variable. The λ terms (Gilbert's model terms) must sum to zero in the marginals and n-way tables (a good example to study is Table 10.2 on page 136; notice how the terms in the equation sum to zero across the rows and down the columns). Thus, in the special instance of a dichotomous dependent variable, one can simply take the corresponding loglinear equation and double the λ terms. As soon as we move to polytomous dependent variables, the constraint that the λ terms sum to zero will still be present, but the terms for the individual categories will no longer be mirror images. More on the dichotomous dependent variable in GENERAL CELL FREQUENCY MODELS VERSUS LOGITS below.
 
 

 

Alan Agresti is arguably one of the best statisticians in the country and probably the top current American statistician on categorical variable modeling (he is even a Gator, no less, and the former head of the Department of Statistics at the University of Florida). However, there is no denying that he can be quite technical if the reader is not a mathematical statistician. At this point, you do have a lot of the terminology and vocabulary under your belt. I think this is a good time to go back over the Agresti material and look ahead to some of the new chapters. You'll find the going smoother now.
 
     
    LEARNING TO LIVE WITH AGRESTI. OLD NEWS:
     
  • Model testing and the G2 statistic
  • Partitioning the G2  statistic to ascertain if an effect can be dropped.
  • Moving between a loglinear GCF model and a logit model
 
Read ahead in Agresti, pp. 71-97. I really like this chapter, so I commend it to you. First, it has a lot of examples.

Through tables, equations, and graphs, Agresti shows us that probit, logit, logistic regression, and loglinear models will lead us to draw similar conclusions on the same set of results. These conclusions may differ from using a Linear Probability Model (which Agresti also discusses and which we examined earlier this semester). To date we have worked a lot with loglinear models and some with logit models and odds-ratios. It is a tiny jump to logistic regression models--so what is a PROBIT model? Watch for more on that one later this semester.

Agresti discusses a lot of model testing. He introduces the Wald test, in addition to the G2 and X2 we already know. The Wald is one of the major statistics given in most logistic regression programs, so it is nice to see it discussed in context. He goes over the partitioning of Chi-square and how we use this in deleting effects to test different partial associations and interactions in the data.

This chapter gives you some of the math and logic behind the practice.



GENERAL CELL FREQUENCY MODELS VERSUS LOGITS

There are several ways to analyze categorical dependent variables and still preserve the systematic approach that mimics Analysis of Variance, multiple regression, or structural equation modeling approaches for numeric dependent variables. Keep in mind that the General Cell Frequency (GCF) loglinear model, the logged odds-ratio or logit model, and logistic regression (regardless of the number of categories in your dependent variable) are all essentially "cousins" or in the same statistical family.
 
 

 
Any testing that you do on a GCF model, a logit model, or logistic regression uses the underlying GCF model to calculate the Chi-square approximation and the degrees of freedom. This means that the relationships and any interaction effects among the independent variables are "fitted" and contribute to the Chi-square value, whether you see the relationships among the independent variables (their λs or τs) presented in your computer output (as you do in the GCF model) or whether these remain invisible in your computer output, as they do in logit modeling or logistic regression. The computer statistical program is fixing the univariate marginals, the pairwise correlations among the independent variables, the interaction effects, and so forth. Also hidden in the background is the saturated model for your variables. Logit and logistic regression models use the saturated model G2, with its zero df and its value of zero, as the comparison point. What you see in the program is the fitted model G2 with degrees of freedom equal to the number of dropped parameters in the corresponding GCF model.
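As a concrete illustration, a GCF loglinear model can be fitted as a Poisson regression on the cell counts; the deviance it reports is the G2 against the (hidden) saturated model. A minimal sketch using statsmodels, with invented counts for a 2 x 2 x 2 table:

```python
from itertools import product

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per cell of a hypothetical 2 x 2 x 2 table; the counts are invented.
df = pd.DataFrame(list(product([0, 1], repeat=3)), columns=["A", "B", "C"])
df["count"] = [120, 85, 60, 95, 40, 70, 30, 90]

# GCF model fixing all three pairwise associations {AB, AC, BC}
# but dropping the three-way ABC term.
fit = smf.glm("count ~ C(A)*C(B) + C(A)*C(C) + C(B)*C(C)",
              data=df, family=sm.families.Poisson()).fit()

# The deviance is G2; its df equals the one dropped parameter (the ABC term).
print(fit.deviance, fit.df_resid)
```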

So, in large part, which type of model you choose depends on what you wish to highlight and the relationships that you believe exist among your variables.

Are you interested in a structural equation model type analysis?
A single stage multiple regression analogue?
An analysis of variance or analysis of covariance analogue?
Each of these will always begin with the GCF loglinear model and analyze different pieces of it.
 
 
STRUCTURAL EQUATION MODEL ANALOGUES

Are you interested in a structural equation model analogue (SEMA)? In an SEMA, you are interested in the relationships among variables. This includes the relationships among independent variables, between independent variables and mediator variables, and how both influence the dependent variable.

If you are interested in an SEMA, then a loglinear analysis is the form that will most help you do so. In other relatives of the loglinear model, relationships among independent variables or in mediators are hidden in the background.

The "causes" for an independent variable lie outside the system of variables you are analyzing. The independent variables function only as causes and not as effects in the variables that you have under study in your model. (This doesn't mean that these same variables could not be dependent or response variables in a different analysis with a different set of variables.)

Mediator variables serve simultaneously as causes and effects in the model that you have under study. Mediators are caused by independent variables, and, in turn, serve as causes of dependent or response variables. In fact, mediators may be the most proximate cause of your dependent variable.

Finally, dependent variables in a given analysis nearly always function solely as effects and not as causes. (I am omitting here the nonrecursive or simultaneous causation models.)

Using the GCF model means that you can not only assess the relationships among the independent variables, but that you can trace causal paths from independent variable to mediator to dependent variable.

In part, you make these inferences by assessing the partial relationships among variables.

For example, suppose (not surprisingly) that you believe an individual's social status is influenced indirectly by one's education working through one's occupation (a variation on Blau and Duncan's 1967 classic, The American Occupational Structure):

EDUCATIONAL LEVEL → OCCUPATIONAL TYPE → SOCIAL STATUS LEVEL

If the "zero order" correlation between education and social status attentuates in the partial correlation to zero or nearly zero between education and social status CONTROLLING occupational type, you would say that educational level has an indirect causal effect on social status, that is mediated by occupational type. Occupational type would have a direct causal effect on social status level.

It is possible for an independent variable to have both direct and indirect causal effects on later variables in a causal chain. (If an independent variable has neither one, you must ask yourself what it is doing in your model at all.)
 

 
If these causal chains are making you uncomfortable, go back and review "On Proof" in Guide 1. At this stage, you need an almost 'intuitive' feel for proposing causal relations in non-experimental data to set up a causal chain at all. I hope that everyone is reasonably comfortable with the causal ordering "gender" → "degree level" → "science question". Certainly neither an answer to a science question nor one's education causes one's gender, and it's more likely one's completed education affects science knowledge as an adult than the other way around.

And remember! We can disprove a causal assertion but in the language of science we don't prove a causal assertion the way we would, for example, in law, debates, or journalism. To draw a causal arrow from gender to education is to simply propose a relationship that can be disproven from the data. And, even if a relationship is found, that doesn't mean we necessarily understand the underlying mechanisms or processes that mediate the relationship. One part of "doing science" is to extend and elaborate causal chains.
 

AND hope you don't have many three-way or higher interaction effects!
 
 

REGRESSION AND ANOVA AND ANCOVA ANALOGUES

In basic OLS regression we assess how a mix of categorical and numeric variables influence a single numeric dependent variable.

Analysis of variance and analysis of covariance are simply variations on the same theme (especially ANCOVA). We have a numeric dependent variable and we closely examine what happens to mean scores on the dependent variable depending on categories of the independent variable(s) and (in ANCOVA) how these categorical factors influence the dependent variable mean adjusting for numeric covariates. This usually results in adjusting the mean up or down.

As Agresti points out, we can do all this through logit models or logistic regression. Because we work with the basic tabular setup in logit analysis, superficially logit analysis looks more analogous to analysis of variance than logistic regression does. Instead of predicting the dependent variable mean score, our "dependent variable" is a logged odds-ratio of modelled cell counts on the dependent variable in a multi-dimensional table.

Logistic regression appears more akin to analysis of covariance, or even multiple regression. Once again, your "dependent variable" is a logged odds-ratio. With a polytomous (many-categoried) dependent variable, your logistic regressions will depend on the form of your dependent variable and just how you created your odds-ratios.

In either the dichotomous or multinomial logistic regression case, you will actually do one overall test of the model, which uses (guess what?) the overall G2 from the corresponding loglinear model. Your independent variables can be a mix of categorical or numeric predictors. But--don't be fooled! There's a multidimensional table hiding behind those regression-like results. If you have too many categories in your independent "numeric covariate" variable and too few cases, your results can turn out to be unreliable and downright strange. And the computer program is "fixing" your marginals and the associations between the independent variables and counting them into your degrees of freedom. They are just hidden in the background because the assumption is that you are only interested in the partial relationships between each independent variable and the dependent log-odds.
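Here is a hedged sketch of the logistic regression analogue (simulated data and my own variable names, not the NSF data): behind the fitted betas sits the degree-by-answer table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated individual-level data: four degree levels predicting a
# correct (1) vs. incorrect (0) answer to a science question.
rng = np.random.default_rng(1)
degree = rng.integers(1, 5, size=800)
p_correct = np.array([0.40, 0.55, 0.65, 0.75])[degree - 1]
df = pd.DataFrame({
    "degree": degree,
    "correct": (rng.random(800) < p_correct).astype(int),
})

# Treating degree as categorical: K - 1 = 3 betas plus a constant,
# each a log-odds contrast against the reference category (degree = 1).
fit = smf.logit("correct ~ C(degree)", data=df).fit()
print(fit.params)
```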
 


FROM LOGLINEAR TO DICHOTOMOUS LOGIT EQUATIONS

When you do logit analysis, you will take the odds-ratio on categories of your chosen dependent variable.
You must choose a dependent variable.
You must decide which category will be your "numerator" or "success."

Pay special attention to SPSS program features. Logistic regression and loglinear analysis do not always default to the same numerator category of your variables.

For both logit analysis and logistic regression (dichotomous or polytomous) you choose ONE dependent variable.
Your interest is in how several independent variables predict a single dependent variable.

Program-wise, logistic regression packages have handled combinations of numeric and categorical independent variables somewhat more easily than logit packages (although both can do so--e.g., SPSS's General loglinear procedure).

Again, both logit and logistic regression models start with the GCF loglinear model. That's what's used for estimation and testing. However, the univariate marginals for the independent variables and the associations and interaction effects solely among the independent variables drop out through subtraction (in the logged equation) or division (in the exponentiated equation). Skim the quick review below to see how.
 
QUICK REVIEW

Starting with our EXPONENTIATED and multiplicative saturated general cell frequency model for three variables, the predicted or modeled odds-ratio is the equivalent of dividing the multiplicative equation for category 1 of the chosen dependent variable by the multiplicative equation for category 2 of the dependent variable, as shown below. Our dependent variable is variable "C". The terms that will be affected by moving from a loglinear model to a logit model are those that include the dependent variable C.
 
 

F_{ij1} = η τ_i^A τ_j^B τ_1^C τ_{ij}^{AB} τ_{i1}^{AC} τ_{j1}^{BC} τ_{ij1}^{ABC}

F_{ij2} = η τ_i^A τ_j^B τ_2^C τ_{ij}^{AB} τ_{i2}^{AC} τ_{j2}^{BC} τ_{ij2}^{ABC}

When we take the logarithms of the numerator and denominator of the exponentiated odds-ratio in the saturated three-variable model, we convert from a multiplicative system to an additive system for the LOG-ODDS. Writing G_{ijk} = ln F_{ijk} for the logged modeled cell counts, the equation now looks as follows:
 
 

G_{ij1} - G_{ij2} = (μ + λ_i^A + λ_j^B + λ_1^C + λ_{ij}^{AB} + λ_{i1}^{AC} + λ_{j1}^{BC} + λ_{ij1}^{ABC}) -
  (μ + λ_i^A + λ_j^B + λ_2^C + λ_{ij}^{AB} + λ_{i2}^{AC} + λ_{j2}^{BC} + λ_{ij2}^{ABC})

By collecting terms, we can rewrite the log-odds equations as:
 
 

G_{ij1} - G_{ij2} = (μ - μ) + (λ_i^A - λ_i^A) + (λ_j^B - λ_j^B) + (λ_{ij}^{AB} - λ_{ij}^{AB}) +
  (λ_1^C - λ_2^C) + (λ_{i1}^{AC} - λ_{i2}^{AC}) + (λ_{j1}^{BC} - λ_{j2}^{BC}) + (λ_{ij1}^{ABC} - λ_{ij2}^{ABC})

and the log-odds equation simplifies to:

G_{ij1} - G_{ij2} = (λ_1^C - λ_2^C) + (λ_{i1}^{AC} - λ_{i2}^{AC}) + (λ_{j1}^{BC} - λ_{j2}^{BC}) + (λ_{ij1}^{ABC} - λ_{ij2}^{ABC})

Take each step at a comfortable pace for you until you have reassured yourself that you KNOW how you got from point "A" to point "Z".

All the lambda terms that contain ONLY the independent variables drop out of the model by subtraction (of course, they drop out of the exponentiated model by division too if you look closely).

To simplify, we call β_k = (λ_1^C - λ_2^C)

and β_{ik} = (λ_{i1}^{AC} - λ_{i2}^{AC}), and similarly β_{jk} and β_{ijk},

so the saturated logit equation now becomes:

G_{ij1} - G_{ij2} = β_k + β_{ik} + β_{jk} + β_{ijk}

And the βs are the relatively familiar coefficients from logistic regression.

THE SPECIAL CASE OF A DICHOTOMOUS DEPENDENT VARIABLE
SCAN THE MATH BUT LEARN THE CONCLUSION!

Recall the constraint (this is going to be a familiar refrain) that lambda coefficients must sum to zero: over the total for a univariate distribution, across the rows and the columns for a bivariate distribution, in three-dimensional space for a three-variable distribution, etc.

In the special case ONLY of a dichotomous dependent variable, then, each lambda for category 2 of the dependent variable is simply the reverse, or negative, of the corresponding lambda for category 1.

For example, in one loglinear table using the 1999 and 2001 NSF Science data, the λ parameter for the first of two degree levels, having less than a BA degree, = 0.832. Most of the sample did not have a BA degree. Thus, the λ for the second degree level, a BA degree or more, was simply calculated as -0.832.

Going back to our logit equation for three variables AND A DICHOTOMOUS DEPENDENT VARIABLE:

G_{ij1} - G_{ij2} = (λ_1^C - λ_2^C) + (λ_{i1}^{AC} - λ_{i2}^{AC}) + (λ_{j1}^{BC} - λ_{j2}^{BC}) + (λ_{ij1}^{ABC} - λ_{ij2}^{ABC})

This means that, because the λs sum to zero over the two categories of C,

λ_2^C = -λ_1^C

so (λ_1^C - λ_2^C) = λ_1^C - (-λ_1^C) = 2λ_1^C

and β_k = 2λ_1^C

In the case of the logit betas, for example β_{ik}, i.e., (λ_{i1}^{AC} - λ_{i2}^{AC}):
β_{ik} = (λ_{i1}^{AC} - [-λ_{i1}^{AC}])
β_{ik} = 2(λ_{i1}^{AC})

and the logit equation (for three variables with dichotomous dependent variable C) written below as:

G_{ij1} - G_{ij2} = β_k + β_{ik} + β_{jk} + β_{ijk}

is the same as:

G_{ij1} - G_{ij2} = 2(λ_1^C) + 2(λ_{i1}^{AC}) + 2(λ_{j1}^{BC}) + 2(λ_{ij1}^{ABC})

In the special case of all dichotomous variables, therefore, the logit or log-odds equation reduces to twice the lambda coefficients for the first value of the dependent variable.
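You can verify the doubled-lambda conclusion numerically. A minimal sketch with made-up counts (under the saturated model the fitted counts equal the observed ones, so the effect-coded lambdas come straight from the logged table):

```python
import numpy as np

# Hypothetical 2 x 2 table: rows = independent variable A,
# columns = dependent variable C (column 0 is the "success").
F = np.array([[250.0, 90.0],
              [130.0, 180.0]])

G = np.log(F)                                       # logged cell counts
mu = G.mean()                                       # grand mean
lam_A = G.mean(axis=1) - mu                         # row effects, sum to zero
lam_C = G.mean(axis=0) - mu                         # column effects, sum to zero
lam_AC = G - mu - lam_A[:, None] - lam_C[None, :]   # interaction effects

# The logit on C for each row i ...
logit = np.log(F[:, 0] / F[:, 1])
# ... equals twice the "success" lambdas, exactly as derived above.
print(np.allclose(logit, 2 * lam_C[0] + 2 * lam_AC[:, 0]))   # True
```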
 


 
HOW ABOUT POLYTOMOUS INDEPENDENT VARIABLES & A DICHOTOMOUS DEPENDENT?

We are always taking the log-odds on the DEPENDENT variable.
The logit model refers to the log-odds equation for the dependent variable.

Furthermore, the lambdas for the univariate independent variables and the pairwise correlations, three factor interactions and so forth among the INDEPENDENT variables drop out of the logit (or logistic regression) equation entirely through subtraction (or through division in the original multiplicative equation).

The only terms remaining in the logit equation (analogous to the logistic regression model) are the "constant" term for the dependent variable and the associations or interaction effects between each independent variable (and combinations) and the dependent variable.

Therefore, you will have a beta coefficient for each category of the independent variable with the dependent variable (e.g., if you have four levels of education and two levels of a science question, you will have four betas, one for each value of education).

HOWEVER, given the constraint that the lambdas must sum to zero, you will only receive output from most computer programs for the independent variable parameters that can be estimated "independently". If your independent variable has K categories, then K - 1 logits will be estimated for you. The Kth logit is obtained by subtraction, using the constraint that the sum of the logits = 0. In my example, you would receive logits for the first three categories of educational level by the science question.
 
 
And I interpret these coefficients HOW??

When you work with the loglinear GCF equation, you are predicting the count or frequency in a particular cell of a crosstabulation table. To get the cell-count predicted under your model, you would need to exponentiate each lambda term using natural exponents (e). Then multiply all the terms together.

Ouch! Who wants to do that?
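The computer, happily, doesn't mind. A minimal sketch with invented effect-coded parameter values for one cell of a two-way table:

```python
import numpy as np

# Invented effect-coded parameters for cell (i, j).
mu, lam_a, lam_b, lam_ab = 5.2, 0.6, -0.4, 0.1

# Additive in the logs ...
F_hat = np.exp(mu + lam_a + lam_b + lam_ab)
# ... equivalently, multiplicative in the exponentiated (tau) terms.
F_hat_alt = np.exp(mu) * np.exp(lam_a) * np.exp(lam_b) * np.exp(lam_ab)
print(F_hat, F_hat_alt)    # identical predicted cell counts
```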

It's a lot easier to note which lambdas raise or lower the (logged) cell count.
Lambdas that are statistically significant (i.e., are different from zero) are terms that must be retained in order to describe the cells in the table accurately.

When you work with the logit equation, the beta coefficients signify raising or lowering the logged odds-ratio on the dependent variable.

There is a temptation to call these probabilities (Agresti succumbs to this one from time to time)--but they really aren't. They are logged odds-ratios.

So you can say something like (for example, if a beta = 2): education raises the log-odds of answering the science question correctly, as opposed to not, by 2.

Or, more simply still, something like "having an advanced degree raises the log-odds of answering the science question correctly by 2".

Obviously none of these statements is intuitively self-evident!
But they make possible the quantitative analysis of categorical data in a multivariate scheme.


PROBIT MODELS: A VERY BRIEF MENTION

In order to turn the numerator and denominator into probabilities (which some analysts find more comfortable), you need to divide each of them by the column (or row) total. A related, but slightly different, set of parameters results in the probit model, which transforms the probabilities through the inverse of the standard normal distribution rather than through the log-odds. Agresti shows you some analysis there too.

Probit models usually give substantive conclusions that are very similar to those from logit models, especially in large samples, although the actual numeric results may differ.
 


POLYTOMOUS DEPENDENTS & ODDS-RATIOS

James Davis of NORC, the University of Chicago, Harvard and Dartmouth College (the man got around) has been very fond of saying that anything can be dichotomized. For example, all the states in the United States could be divided into New Hampshire (where Dartmouth is located) and all the rest.

But dichotomies give us limited information, and they often do not do justice to the richness or complexity in our variables. For example, you cannot examine monotonicity or nonlinearity when the dependent variable has only two categories. Thus, a multinomial type of model with a dependent variable that has more than two categories may be called for.

It's no more complicated for a computer statistical program to calculate the λs or βs for polytomous (many-categoried) variables than it is to calculate them for dichotomous variables. Once again, we fix particular univariate marginals, bivariate tables, three-factor interactions (and so on) and fit a particular model. Once again, the λs or the logit βs sum to zero in the marginals or across the rows and columns of a table.

With polytotomous variables, the computer part may simply be an extension, but for the analyst, things are a bit trickier.

Contrast the simplicity of calculating the omitted category when we deal with two degree levels with the situation when we have four degree levels. For the 1999, 2001 and 2006 NSF Science data, the frequencies and percentages are shown for four degree levels:
 
 
Degree Level Value             Frequency    Percent    Estimated λ
1 = High School or Less            4083       63.2%        1.250
2 = Vocational/AA Degree            845       13.1        -0.321
3 = BA Degree                       996       15.4        -0.153
4 = Advanced College Degree         533        8.3          ???
Total                              6457      100.0%

To find the λ for the Advanced Degree category, we must add the program-calculated λs for the first three categories:

1.250 - 0.321 - 0.153 = 0.776

Since the λs for the distribution of a single variable must sum to zero, the λ for Advanced Degree must equal -0.776.
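In code, the subtraction step is one line (using the guide's reported values):

```python
reported = [1.250, -0.321, -0.153]   # program-supplied lambdas, first 3 categories
lam_advanced = -sum(reported)        # sum-to-zero constraint gives -0.776
print(round(lam_advanced, 3))
```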

For example, using the two degree levels variable and making "less than a BA degree" the success or numerator category, we have 4928 individuals with less than a BA and 1529 persons with a BA degree or more. The odds-ratio is 4928/1529, or 3.223, and the logit or log-odds is 1.170.

Not only is calculating the odds-ratio a more complex decision with the four-category degree level variable, but you have many more choices.

If you have k categories, you can construct k - 1 odds ratios (the kth odds-ratio is linearly dependent on the first k - 1 that you calculated).

Here are some common possible choices:

The "ordinal solution" With an ordinal dependent variable, take the odds on adjacent categories, e.g., BA degree to Advanced Degree
The "referent category" solution Contrast each value of the odds with the same referent category, for example, Advanced Degree

Comparable in some respects to dummy variable coding in multple linear regression..

The "all other cases" solution Contrast each value of the odds with all the other remaining cases.

With ordinal data, the ordinal solution preserves the rank-ordered nature of the categories. For the four-category degree level solution above, we would start with the lowest educational category and calculate odds as follows:
 

4083/845 = 4.832
845/996 = 0.848
996/533 = 1.869

Of course, you can use adjacent categories for any type of data, but the comparisons won't mean much with nominal data since the order of the categories is arbitrary. We can see that the largest change is going to some kind of college at all. The odds of high school (at most) to a two year degree are over twice as high as the other two comparisons. Once the individual has actually begun matriculation, it may be easier to continue than it is to start college in the first place.

Continuing with this example, then, you might decide that attending post-secondary college or training at all is the key comparison. In this referent category solution, you could make a high school diploma or less the referent category and take odds-ratios with respect to it. In this case, because it is the referent, you could have the frequency for a high school degree or less as the consistent denominator, as follows:
 

845/4083 = 0.207
996/4083 = 0.244
533/4083 = 0.131

Thus, an individual is only about one-fifth as likely to have a two-year college degree as a high school diploma, one-quarter as likely to have a BA degree, and one-eighth as likely to have an advanced degree.

Alternatively, you could make the high school diploma or less category consistently the numerator and then take odds-ratios. The key, as you can see, is consistency within the same analysis. Below is the inverse set of odds:
 

4083/845= 4.832
4083/996 = 4.099
4083/533 = 7.660

Not surprisingly, an adult would be almost five times as likely to have a high school diploma as a two-year degree, about four times as likely to have a high school diploma as a BA, and nearly eight times as likely to have a high school degree as an advanced college degree.

It is more typical to see the referent category used as the denominator. In our example, this would make all the odds-ratios fractions.

The referent category solution is akin to using dummy variable coding in regression.

In the "other cases" comparisons, the numerator in the odds ratio is the category of interest, and the denominator is all the other cases in the dependent variable. With K categories, you will only be able to independently calculate K - 1 of these odds because the Kth odds is determined by subtraction and is "linearly dependent" on the first K - 1 parameters.

In our four category degree level variable, since the higher three categories all deal with post-secondary education, I will use those to calculate my odds-ratios:
 

845/5612= 0.151
996/5461 = 0.182
533/5924 = 0.090

In each case, I calculated the denominator by subtracting the numerator from the total, or 6457 cases.

An individual was about one-sixth as likely to have a two-year degree as not, about one-fifth as likely to have a baccalaureate as not, and only about one-eleventh as likely to have a graduate college degree as not.

This solution is more akin to a probit model solution.
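All three schemes are easy to compute from the frequency column of the degree-level table above; a minimal sketch:

```python
import numpy as np

counts = np.array([4083, 845, 996, 533])    # degree-level frequencies from the table
total = counts.sum()                        # 6457

adjacent  = counts[:-1] / counts[1:]        # ordinal solution: 4.832, 0.848, 1.869
referent  = counts[1:] / counts[0]          # vs. "HS or less": 0.207, 0.244, 0.131
all_other = counts / (total - counts)       # each category vs. all the rest
print(adjacent, referent, all_other)
```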

  • The decision is YOURS! What do you want to find out? Which groups do you want to compare? The multinomial dependent variable gives you a lot more choices and flexibility than a dichotomous variable does.

OVERVIEW
READINGS

     

    Susan Carol Losh
    March 23 2017