The Paper Precis is due March 21 (The Tuesday after classes resume).
You'll need a dataset! So let's start talking data.
 

 
PLEASE NOTE CORRECTION TO GILBERT, PAGE 125, TABLE 9.7. By definition, lambda parameters within a particular variable (or variable combination) are constrained to sum to 0. If you examine Table 9.7, you will see this is true for the marginal effects for who does the dishes (-.39, .21, .18). However, it was not true for the gender-employment combinations. If you sum each of the columns in Table 9.6, you will see that there are more cases in column 3 (man works, woman doesn't) and column 4 (other) than in the first two columns. This tells me that the lambda signs for the first two columns should be negative (fewer cases than expected for an equiprobable split) and those for the last two columns positive (more cases than expected by chance). Therefore, the lambdas for section (b) should read, in order: -0.14, -0.25, +0.11 and +0.28. Notice that all the gender-work combinations now sum to 0 once this correction is made.

 
READINGS GUIDE 1: ISSUES IN MODELING
GUIDE 2: TERMINOLOGY
GUIDE 3: THE LOWLY 2 X 2 TABLE
GUIDE 4: BASICS ON FITTING MODELS
GUIDE 5: SOME REVIEW, EXTENSIONS, LOGITS
GUIDE 6: LOGLINEAR & LOGIT MODELS
GUIDE 7: LOG-ODDS AND MEASURES OF FIT
GUIDE 8: LOGITS, LAMBDAS & OTHER GENERAL THOUGHTS
OVERVIEW

 
EDF 6937-05       SPRING 2017
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 5: SOME REVIEW, EXTENSIONS AND LOGITS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 

KEY TAKEAWAYS:
  • Below I work through a 3 variable model using the 2014 General Social Survey data: 
    • Planet question * gender * degree level
  • We can use the Model Selection SPSS program to quickly test this saturated model.
  • We use the parameter estimates (lambdas), the z-scores, the G2s from dropping parameters and their probability levels as guides.
  • I compare the G2s and df from adjacent or "nested" models. If the difference in the G2s is small, I use the simpler "leaner" model.
    • I want the simplest possible model (the fewest parameters)
    • That "fits" (small G2)
  • Ideally with an alpha or "p-level" between 0.20 and 0.50 or 0.60.
    • Alpha level too low? Underfitted, return parameters to the model
    • Alpha level too large? Overfitted, try dropping more parameters for a simpler model.
  • Can some effects be dropped?
    • I recommend rerunning the loglinear "General" program to reestimate the parameters that remain in your model.
    • Be sure to include all the lower-order effects in the General program if you chose a hierarchical model!
    • The General program is NOT hierarchical, so you must specify the lower-order terms yourself in a hierarchical model.
  • Moving from a loglinear to a logit model.
    • Choose a dependent variable
    • Decide which category value in the dependent variable will be a "success" or the numerator.
    • Most programs use the first or smallest value as "the success" so code accordingly (especially if that is NOT what you want.)
  • Follow the general conversion from loglinear to logit model in the algebra below
 

 
TESTING A 3-WAY MODEL: MORE DETAIL | TESTING MODELS (REPRISE) | USEFUL OUTPUT HINTS | MODEL EQUATIONS | FROM GCF TO LOGIT MODELS

A BIT MORE DETAIL ON TESTING A MODEL 

In Guide 4, we saw that loglinear (and logit and logistic regression) models are all tested using the likelihood ratio G2 or L2 statistic. The log-based G2 is preferred in these analyses because, unlike the Pearson X2, it is additive and therefore can be partitioned for adjacent or "nested" models. This means that we can test models incrementally.
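To see what G2 actually computes, here is a minimal sketch in Python (assuming numpy is available; the counts are back-calculated, approximately, from the some-college panel of the table below, so treat them as illustrative):

```python
import numpy as np

# Approximate 2 x 2 observed counts (planet question x gender,
# some-college-or-less panel), back-calculated from the percentage table.
observed = np.array([[370.0, 236.0],
                     [108.0, 122.0]])

# Expected counts under independence: (row total x column total) / n
expected = (observed.sum(axis=1, keepdims=True)
            @ observed.sum(axis=0, keepdims=True)) / observed.sum()

# Likelihood ratio statistic: G2 = 2 * sum[ O * ln(O / E) ]
g2 = 2.0 * np.sum(observed * np.log(observed / expected))

# Pearson statistic for comparison: X2 = sum[ (O - E)^2 / E ]
x2 = np.sum((observed - expected) ** 2 / expected)

print(g2, x2)  # similar sizes, but only G2 partitions exactly across nested models
```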

Let's examine our familiar three variable model, (PlanetQuestion*Gender*DegreeLevel), in more detail. This analysis uses the 2014 NSF Surveys of Public Understanding of Science and Technology through the General Social Survey, and has a total of 5267 cases. The percentage table looks like this:

2014

                        SOME COLLEGE OR LESS             BA OR MORE
                        MALE      FEMALE     n           MALE      FEMALE     n
EARTH AROUND SUN        77.5%     65.9%      601         94.6%     94.3%      325
EVERYTHING ELSE         22.5      34.1       228          5.4       5.7        19
TOTAL                   100.0%    100.0%     829         100.0%    100.0%     344
n                       478       358                    203       141
                        tau-b = 0.13                     tau-b = 0.01

Source: NSF Surveys of Public Understanding of Science and Technology, through the NORC General Social Survey, 2014. Available n = 1173.

We can first try dropping the third order interaction term. This term corresponds to the interaction effect, i.e., that the relationship between gender and the science question differs by educational level. Estimating this model with the Model Selection program resulted in a G2 of 1.115 with 1 degree of freedom and a p-level of about 0.29. Almost any loglinear analyst would agree that this third order interaction term can be dropped.
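As a cross-check on the SPSS output, the same model can be fit as a Poisson regression on the cell counts, which is exactly what a loglinear model is. A sketch in Python (assuming pandas and statsmodels; cell counts back-calculated approximately from the percentage table above, so the deviance will only roughly match the reported G2 of 1.115):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Approximate cell counts for PlanetQuestion x Gender x DegreeLevel,
# back-calculated from the percentage table above (illustrative only).
cells = pd.DataFrame({
    "planet": ["earth"] * 4 + ["other"] * 4,
    "gender": ["male", "female", "male", "female"] * 2,
    "degree": ["lt_ba", "lt_ba", "ba", "ba"] * 2,
    "count":  [370, 236, 192, 133, 108, 122, 11, 8],
})

# (C(a) + C(b) + C(c))**2 expands to all main effects plus all two-way
# interactions -- i.e., the model with only the three-way term dropped.
fit = smf.glm("count ~ (C(planet) + C(gender) + C(degree))**2",
              data=cells, family=sm.families.Poisson()).fit()

print(fit.deviance, fit.df_resid)  # the residual deviance is the model G2
```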

Can we make this model simpler and continue to accurately estimate the observed values in the three way table (within sampling error, of course)?

The model (PlanetQuestion*Gender)(PlanetQuestion*DegreeLevel) is said to be nested within the model incorporating all two way effects: (PlanetQuestion*Gender)(PlanetQuestion*DegreeLevel)(Gender*DegreeLevel). This is because the model with all three two-way associations:

MODEL 1:  (PlanetQuestion*Gender)(PlanetQuestion*DegreeLevel) (Gender*DegreeLevel)

contains every single parameter that is also in the simpler model:

MODEL 2: (PlanetQuestion*Gender)(PlanetQuestion*DegreeLevel)

except that Model 1 also contains the (Gender*DegreeLevel) parameter. Both models also contain the univariate marginals (Gender)(DegreeLevel)(PlanetQuestion).

The G2 for the model (PlanetQuestion*Gender)(PlanetQuestion*DegreeLevel) was 1.156 with two degrees of freedom, p > .50. Why 2 df? Because this second model drops the three-way interaction with 1 df and also drops the two-way association for (Gender*DegreeLevel) which has 1 df because each of the variables only has two categories.

Because Model 2 is nested within Model 1, we can subtract the degrees of freedom and the G2s as follows:
 
 

MODEL         MODEL G2                  MODEL DF
2             1.156                     2
1             1.115                     1
Difference    1.156 - 1.115 = 0.041     2 - 1 = 1

A G2 of 0.041 with 1 df has a p-level > .50. Because Models 1 and 2 do not significantly differ in explaining the three way table (PlanetQuestion*Gender*DegreeLevel), we prefer Model 2 because it contains fewer terms and parameters. Simpler models are typically preferred.
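The partition and its p-level, in code (a sketch assuming scipy):

```python
from scipy.stats import chi2

g2_model2, df_model2 = 1.156, 2   # (PQ*G)(PQ*DL)
g2_model1, df_model1 = 1.115, 1   # all three two-way associations

diff_g2 = g2_model2 - g2_model1   # 0.041
diff_df = df_model2 - df_model1   # 1

# p-level for the difference: well above .50, so the extra
# (Gender*DegreeLevel) term buys us essentially nothing.
print(chi2.sf(diff_g2, diff_df))  # about 0.84
```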

Or, if you like, "parsimony counts."
 
 

 
ANALYTIC ADVANCE: Notice that the effects of gender and degree level on the planetary question are partial or net effects. That is, for the effect of each independent variable on the dependent variable (the planetary question), we have controlled for all the other independent variables in the loglinear equation. Using earlier "nonparametric methods," this would have been next to impossible to do and test statistically.

ANALYTIC ADVANCE:  Instead of "eyeballing" the interaction effect and trying to decide if it was needed to adequately describe the results, we were able to systematically test this interaction effect and eliminate it with confidence. We could also see that a model that eliminated the interaction effect PLUS any correlation between gender and degree level described our results just as well as a model that only eliminated the interaction effect.

 

What can we say at the end of all this? Gender and degree level jointly predict answers to the planetary science question, with both males (controlling for degree level) and college graduates (controlling for gender) more likely to answer the planetary question correctly.

Oops! How do I know gender is also a predictor? Its two-way association with the planet question has a lambda parameter of only 0.080 with a z-score of 1.300, compared with the planet-by-degree association with a lambda of -0.460 and a z-score of -7.485.

In Model 3 I omitted the gender*planet question term and used the VERY simple model of {degree*planet} plus the univariate distribution of {gender}.

This model has 3 degrees of freedom AND a G2 of 14.069, clearly much larger than that of the earlier Model 2, (PlanetQuestion*Gender)(PlanetQuestion*DegreeLevel) (see the G2 for Model 2 above!).

This huge jump in the G2 means I must include the (PlanetQuestion*Gender) parameter in the final model!
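Partitioning makes the jump explicit: 14.069 - 1.156 = 12.913 with 3 - 2 = 1 df, p < .001. A difference G2 that large decisively rejects the simpler model.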
 


MODEL TESTING: GENERAL GOALS

So, what do we do when we "test a loglinear model"?

We ELIMINATE the effect that we are interested in, for example, the three way interaction among gender, degree level, and very basic science knowledge.

Then we see what happens to the G2 or L2 likelihood ratio statistic.

We know that the saturated model "fits perfectly": the expected cell frequencies it generates are identical to the observed cell frequencies.

The saturated model has a G2 of 0 with 0 df.

A hierarchical model that drops any terms will have a G2 greater than 0 because there will be some deviation, no matter how small, between the expected and observed cell frequencies or marginals. However, if the G2 is within sampling error of zero, the dropped terms are descriptive terms (e.g., the three way interaction effect) that are unnecessary. For example, there may be no interaction among gender, education, and the planetary science question.

The more terms that are dropped from the loglinear model, typically the larger the G2 becomes and the larger the degrees of freedom (because each dropped term adds back degrees of freedom). That is why, in nested models, we subtract the more complex model (smaller G2 or L2, fewer degrees of freedom) from the simpler model (larger G2, more df) when we partition the likelihood ratio statistic.

OUR GENERAL AIM: To have the simplest possible model (the fewest fixed parameters) that describes the data well, i.e., within sampling error. That means we want to be able to drop as many fixed parameters as we possibly can and still have a small G2 for the corresponding degrees of freedom.

It is easier to describe a model with no interaction effects in words than it is to describe a model that does contain interaction effects.

So, after we have specified a simpler model, we examine its G2 and its corresponding df. A small G2 relative to its associated degrees of freedom means that the revised, simpler model fits the observed table within sampling error. That's why "small chi-squares are good". A small G2 means that we can describe our results in relatively simple terms.

On the other hand, after we have dropped some loglinear terms and specified a simpler model, a large L2 or G2 means that our model is a "poor fit" to the observed data. Perhaps we dropped a correlation between two variables when in reality there is a correlation between them (e.g., the planet question and gender in my Model 3 example). In that case, the G2 for the corresponding partial association will be very large and we can reject the null hypothesis of no association. This means the lambda that corresponds to the partial correlation must be returned to the loglinear model equation.

Replacing this association makes the model more complicated, but it will also more accurately describe the true state of events.

That's why "big chi-squares are bad"--a large chi-square means that you dropped interaction effects or two variable correlations that really do exist in the data. You know that because the model G2  that corresponded to those dropped effects immediately "jumped up" and became quite large. These terms must be returned to the loglinear model in order for it to "fit", i.e., have a small chi-square.

Note the analogies here to an N-way analysis of variance. An interaction effect that describes how two factors influence the mean of a dependent variable within categories of a third predictor may be "interesting." However, several of these within the same data analysis become complex and cumbersome to report. And an N-way ANOVA that discovers that three predictors interact within categories of a fourth predictor to describe the dependent variable mean becomes cumbersome to envision, indeed.


A NOTE ON FIT

As I have noted in class, most analysts are looking for a probability level of about 0.20 that corresponds to the model G2. If the probability level is under 0.20, and especially if it is p < .05 for the Type I error, the model is said to be "underfitted." That means you have left out parameters (e.g., ones that describe a two variable correlation) that must be included in the loglinear model to describe the observed cell frequencies well.

On the other hand, a model that has a large alpha probability level (e.g., p > 0.80) is often called "overfitted." This means that you are including some interaction effects or two-way associations that you may be able to drop, resulting in a simpler model. The only way to find out is to examine the partial associations and other statistics from your analysis to see if it is possible to simplify the model description.
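A rough translation of this rule of thumb into code (a sketch assuming scipy; the 0.20-0.60 band follows the guideline above and is a heuristic, not a formal test):

```python
from scipy.stats import chi2

def fit_diagnosis(g2, df, low=0.20, high=0.60):
    """Heuristic under/over-fitting check for a model G2 and its df."""
    p = chi2.sf(g2, df)
    if p < low:
        return p, "underfitted: return some parameters to the model"
    if p > high:
        return p, "possibly overfitted: try dropping more parameters"
    return p, "acceptable fit"

print(fit_diagnosis(1.156, 2))  # Model 2 above: p ~ 0.56, in the acceptable band
```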



USEFUL HINTS IN YOUR OUTPUT

Finding the "best possible" model when you are only analyzing three variables is not too difficult. Do you have an interaction effect or not? Do all three variables have non-negligible pairwise correlations with each other? You may or may not have univariate marginal effects. Most of the time, we are a lot more interested in correlations or interaction effects than we are in marginal parameters.

As soon as we have at least four variables, the world becomes more complicated. It is possible, in the most complex of worlds, to have one four-way interaction effect, four three-way interactions, six pairwise correlations and four marginal effects. And that's just with four variables!

The "model selection" hierarchical loglinear ("HILOG") program is very useful in guiding model selection as long as you examine hierarchical models. It also offers clues as to whether a non-hierarchical model would be a better choice to describe your results.

The HILOG program will very quickly test any hierarchical model you construct and will calculate the observed and expected frequencies that accompany that model. For the saturated model only, HILOG will also calculate the loglinear lambda parameters, their standard errors, and the standard scores for each parameter except the theta or "grand mean" effect. Useful hints for selecting a good model can be found in the following:

TESTS THAT K-WAY AND HIGHER EFFECTS ARE ZERO. This part of the program does backwards elimination. It IS hierarchical. So, for example, in a four variable model, the k = 2 line eliminates all pairwise correlations, all three variable interaction effects and the four variable interaction. This can provide a very quick way to simplify a loglinear model. For example, if the k = 4 line has a small G2 and a large probability level, there probably is NOT a four-way interaction effect. If the k = 3 line has a small G2 and a large probability level, there are no three-way or four-way interaction effects in your analysis.

On the other hand, in this same example, when you go to the k = 2 line, suppose you find a large G2 and a small probability level. This indicates that at least one of the pairwise correlation coefficients is not zero. Which one is it? Check some of the other tips below to find out.
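You can mimic these k-way tests outside SPSS: the deviance of the hierarchical model containing all terms up to order k-1 equals the G2 for "k-way and higher effects are zero." A sketch, assuming statsmodels and a data frame df of cell counts for four hypothetical factors a, b, c, d:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Deviance of the order-(k-1) model = G2 that k-way and higher effects
# are zero (the saturated, order-4 model has G2 = 0 by definition).
main = "C(a) + C(b) + C(c) + C(d)"
for k in (2, 3, 4):
    fml = f"count ~ {main}" if k == 2 else f"count ~ ({main})**{k - 1}"
    fit_k = smf.glm(fml, data=df, family=sm.families.Poisson()).fit()
    print(f"k = {k}: G2 = {fit_k.deviance:.3f}, df = {fit_k.df_resid}")
```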

TESTS OF PARTIAL ASSOCIATION. This part of the program is not hierarchical, but it is still very helpful. The highest order interaction (there is only one) is tested in the first K-way table. After that, the tests of partial association eliminate each pairwise correlation or three way (or higher) interaction effect one at a time. The G2 test and its associated probability level are testing the null hypothesis that the effect is zero in the population.

Large G2s indicate that an effect has been eliminated that is necessary to be included in the loglinear equation in order for the model to fit the data well. Small G2s indicate that the tested correlation or interaction does not exist in your observed results. That parameter can be dropped and the model will still fit the data. Because this set of tests is not hierarchical, you may be able to see whether a non-hierarchical model might work for your particular set of results.

Z-SCORES. These are the Z-scores that accompany the lambda parameters for the saturated model. They are obtained by dividing the lambda parameter by its own standard error. Because all of the Z-scores are present in the HILOG saturated model output, you may get an idea of which parameters can be dropped and still have a model that fits the data well.
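In the Poisson-GLM sketches above, the same z-scores fall out directly. Note that statsmodels' default dummy coding differs from HILOG's effect coding, so individual values will not match SPSS's lambdas, though the logic is identical:

```python
# Continuing from the three-variable `fit` object defined earlier:
# each z-score is simply lambda / (its standard error).
z = fit.params / fit.bse
print(fit.tvalues)  # identical to z: for a GLM these "t-values" are z-scores
print(fit.pvalues)  # two-tailed p-levels for each parameter
```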

Conversely, be alert to Z-scores with absolute values of |1.50| and higher. Drop these effects one at a time to see if the G2 becomes so large that the model doesn't fit. Lambda parameters with Z-values of |1.81| or more almost certainly must be retained for the final model to fit well.

TEST A VARIETY OF MODELS. HILOG runs so quickly you can specify a variety of different (hierarchical) models. Once you find the one that you think is the most parsimonious but fits well, you can run the General loglinear program to get the final loglinear equation. You can use the General SPSS program to check on non-hierarchical models too. The General program is more cumbersome but will provide parameters for any model, hierarchical or not.
 

THE LOGLINEAR MODEL EQUATION AND CORRESPONDING EFFECTS 

REVIEW IT:  Recall that the loglinear model we have been working with, the one with the theta and lambda parameters, is often called the General Cell Frequency (GCF) model. In the GCF or "General" model, we are trying to predict or model a logged cell frequency. Examining the saturated model for a loglinear analysis with three variables, we have:

$G_{ijk} = \theta + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$

Lambda coefficients that are statistically significant in GCF models raise or lower the predicted logged cell frequencies ($G_{ijk}$).
Negative lambdas mean a lower logged frequency in a cell than would occur under a model that predicted no effect.
Positive lambdas mean a higher logged frequency in a cell than would occur under a model that predicted no effect.

Each lambda parameter corresponds to a marginal effect, a pairwise (partial) correlation, a three variable interaction, and so on. How do we know what a particular lambda does to a cell frequency? In order to know that:

(1) We must isolate which particular effect we mean (e.g., a science question by gender association),
(2) Understand what the coding in each of our variables means, and
(3) Exponentiate back up by e and convert back into a multiplicative model (see the short sketch below).
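For step (3), a one-line illustration using the planet-by-degree lambda of -0.460 reported earlier (assuming numpy):

```python
import numpy as np

lam = -0.460              # planet-by-degree lambda from the saturated model
multiplier = np.exp(lam)  # ~ 0.63
# In the multiplicative model, this variable combination multiplies the
# expected cell frequency by about 0.63 -- roughly 37% fewer cases than
# the no-effect baseline.
print(multiplier)
```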
 
 
     
    The HILOG program treats the first value in a variable's coding sequence as the "high" value or "a success." It makes the frequency corresponding to that value the numerator or "high value" in any logits that are created. Therefore you should:
     
  • (A) run univariate frequencies on each of your variables at the very beginning of your analysis to assess the coding and 
  • (B) ensure that you have the coding order in a way that makes sense for your analysis (e.g., if you want to predict the "correct score" on the planetary question and that is coded "1" and "other" is coded "2" make sure the first value of educational attainment is also the "high" category.)

  • I recommend that you run bivariate associations with the dependent variable through HILOG first, including the lambda estimates, to ascertain whether your variables are coded the way you want them to be.
     

Recall that the original equation is multiplicative:

$F_{ijk} = e^{\theta} \cdot e^{\lambda_i^A} \cdot e^{\lambda_j^B} \cdot e^{\lambda_k^C} \cdot e^{\lambda_{ij}^{AB}} \cdot e^{\lambda_{ik}^{AC}} \cdot e^{\lambda_{jk}^{BC}} \cdot e^{\lambda_{ijk}^{ABC}}$

We converted the original multiplicative equation for each cell frequency into an additive equation by taking logarithms because additive equations tend to be easier to work with than multiplicative ones are.

As we simplify the model by dropping interaction effects and two variable correlations, we drop terms from the loglinear equation.

Let's continue with the (PlanetQuestion*Gender*DegreeLevel) three variable model.

We'll use PQ, G and DL to indicate the three variables. Recall that, testing with the G2 in the Model Selection program, the model (PlanetQuestion*Gender)(PlanetQuestion*DegreeLevel) was a good fit to the data. In other words, both gender and degree level predicted whether someone responded that the earth goes around the sun--or gave some other answer. The GCF loglinear equation that corresponds to this model is:

$G_{ijk} = \theta + \lambda_i^{PQ} + \lambda_j^{G} + \lambda_k^{DL} + \lambda_{ij}^{PQ \cdot G} + \lambda_{ik}^{PQ \cdot DL}$

The theta parameter ($\theta$) is the sum of the logged expected frequencies in each of the cells, divided by the number of cells in the total table.
 

FAQ: WHAT DO I DO WITH THE NUMBERS IF AN EFFECT IS ZERO? 

What happens to the numeric lambdas, compared with the saturated model loglinear equation, if some of the effects are 0? For example, in the model above, setting the (PQ*G*DL) and (G*DL) effects to zero produced a model that fit well and could not be further simplified. This dropped two terms from the $G_{ijk}$ equation and added back two degrees of freedom.

What happens numerically to the remaining lambdas? Do their numeric values remain identical to the values estimated in the saturated loglinear equation, or do they become different and thus should be re-estimated? There are two schools of thought on this issue:

SCHOOL 1 says that if the dropped lambdas are in fact zero within sampling error, the values of the remaining lambdas stay the same as the original estimates for the saturated model.

SCHOOL 2 says that even if the dropped lambdas are zero within sampling error, depending on the sample size, the standard error of each lambda could be quite large. Therefore, estimating the saturated model includes the values of these lambdas (which are statistically zero) and introduces error into the numeric lambdas that remain in the equation. This school of thought says you will need to re-estimate the new model omitting the numerically zero lambda terms. You can use the HILOG program to test your best (hierarchical) model, then turn to the General loglinear program to estimate the lambdas in the final model. You can also test a non-hierarchical model using the General program.
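School 2's re-estimation is what you would do in the General program. In the Python sketch from earlier, it amounts to refitting only the retained terms; patsy's Sum contrasts request sum-to-zero (effect) coding, mirroring SPSS's lambdas (the `cells` data frame is the approximate one defined earlier):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Re-estimate ONLY the terms kept in the final model:
# (PlanetQuestion*Gender)(PlanetQuestion*DegreeLevel) plus all marginals.
# C(..., Sum) requests sum-to-zero (effect) coding, like SPSS lambdas.
final = smf.glm(
    "count ~ C(planet, Sum) * C(gender, Sum) + C(planet, Sum) * C(degree, Sum)",
    data=cells, family=sm.families.Poisson(),
).fit()

print(final.params)  # freshly estimated lambdas, School-2 style
```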



START WITH LOGLINEAR AND CONVERT TO A LOGIT MODEL--IF YOU LIKE...

The loglinear model is THE basic model. Estimating the cell frequencies initially can lead later on to logit models or logistic regression. As I mentioned earlier, I like the loglinear model the best because it is the most flexible. For example, if there are not too many third order or higher interaction terms, it can provide a SEM-like analogy for categorical variables, something that can't be done all at once with logit models or logistic regression.

Further, you are always doing model tests using the loglinear model--whether you know it or not. In either the logit transformation or in logistic regression, the program fits the correlations (and possible interactions) among the independent variables when it calculates the model G2. It's just that, as you shall shortly see, these terms are hidden from your view. The program uses them to fit the model whether or not you ever see the parameters in an equation.

In the loglinear model, you have not designated a dependent variable. That's one reason why the order of entering your variables in a program such as HILOG doesn't matter. You will have the same numeric results regardless of the order, just as your third order interaction term will be the same numerically if you calculate all the way up to the third order odds ratio in a three variable table.

However, depending on your interests:

You may have designated a specific dependent variable and you only want to predict that.

You may not care about the marginal splits on any of the variables EXCEPT the dependent variable.

You are not very interested in the correlations among the independent variables, although you do want to estimate the final partial effects of each independent variable on the dependent (response) variable, controlling for all the other variables in the equation, in a manner similar to single stage multiple regression.

You are not interested in constructing a SEM-type causal model for your variables.

You like and are comfortable with using odds ratios.

If the points immediately above describe you, you may wish to do a logit analysis rather than a general cell frequency analysis.
 


A logit is the logarithm of an odds-ratio, or, abbreviated, the "log-odds".
The logit allows us to use the additive and linear properties of a loglinear model while focusing on the odds-ratios taken on a chosen dependent variable.

Recall that when you take the odds-ratio, you are doing so with respect to one category of a variable relative to a second category (or merged categories) of that same variable.

You must stay within the same variable when you take the odds.

You must decide which category will form the numerator of the odds ("a success") and consistently use that category as the numerator for the remainder of the analysis.

When you do logit analysis, you will take the odds-ratio on categories of the dependent variable.

Starting with our EXPONENTIATED and multiplicative saturated general cell frequency model, this is the equivalent of dividing the multiplicative equation for category 1 of the dependent variable by the multiplicative equation for category 2 of the dependent variable, as shown below. Our dependent variable is variable "C". The terms that will be affected by switching to the logit model are those that contain the dependent variable C.
 
 

$F_{ij1} = e^{\theta} \cdot e^{\lambda_i^A} \cdot e^{\lambda_j^B} \cdot e^{\lambda_1^C} \cdot e^{\lambda_{ij}^{AB}} \cdot e^{\lambda_{i1}^{AC}} \cdot e^{\lambda_{j1}^{BC}} \cdot e^{\lambda_{ij1}^{ABC}}$

$F_{ij2} = e^{\theta} \cdot e^{\lambda_i^A} \cdot e^{\lambda_j^B} \cdot e^{\lambda_2^C} \cdot e^{\lambda_{ij}^{AB}} \cdot e^{\lambda_{i2}^{AC}} \cdot e^{\lambda_{j2}^{BC}} \cdot e^{\lambda_{ij2}^{ABC}}$

When we take the logarithms of the numerator and denominator of the exponentiated odds-ratio in the saturated three variable model, we convert from a multiplicative system to an additive system for the LOG-ODDS and the equation now looks as follows:
 
 

$\ln(F_{ij1} / F_{ij2}) = G_{ij1} - G_{ij2} = (\theta + \lambda_i^A + \lambda_j^B + \lambda_1^C + \lambda_{ij}^{AB} + \lambda_{i1}^{AC} + \lambda_{j1}^{BC} + \lambda_{ij1}^{ABC})$
$\qquad - (\theta + \lambda_i^A + \lambda_j^B + \lambda_2^C + \lambda_{ij}^{AB} + \lambda_{i2}^{AC} + \lambda_{j2}^{BC} + \lambda_{ij2}^{ABC})$

By collecting terms, we can rewrite the log-odds equations as:
 
 

$G_{ij1} - G_{ij2} = (\theta - \theta) + (\lambda_i^A - \lambda_i^A) + (\lambda_j^B - \lambda_j^B) + (\lambda_{ij}^{AB} - \lambda_{ij}^{AB})$
$\qquad + (\lambda_1^C - \lambda_2^C) + (\lambda_{i1}^{AC} - \lambda_{i2}^{AC}) + (\lambda_{j1}^{BC} - \lambda_{j2}^{BC}) + (\lambda_{ij1}^{ABC} - \lambda_{ij2}^{ABC})$

During the subtraction process, the terms that contain ONLY the independent variables cancel each other out (you can see this in the multiplicative model above, too) and the saturated log-odds equation for three variables simplifies to:

$G_{ij1} - G_{ij2} = (\lambda_1^C - \lambda_2^C) + (\lambda_{i1}^{AC} - \lambda_{i2}^{AC}) + (\lambda_{j1}^{BC} - \lambda_{j2}^{BC}) + (\lambda_{ij1}^{ABC} - \lambda_{ij2}^{ABC})$

All the lambda terms that contain ONLY the independent variables drop out of the model by subtraction. The lambda terms containing the DEPENDENT variable ("C" in this case) remain.

To simplify, we write $\beta = (\lambda_1^C - \lambda_2^C)$

and $\beta_i^A = (\lambda_{i1}^{AC} - \lambda_{i2}^{AC})$, and similarly for $\beta_j^B$ and $\beta_{ij}^{AB}$,

so the saturated logit equation in the three total variable case now becomes:

$G_{ij1} - G_{ij2} = \beta + \beta_i^A + \beta_j^B + \beta_{ij}^{AB}$

And the $\beta$s are the relatively familiar beta coefficients from logistic regression.
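As a concrete check on this algebra, here is a sketch (Python, statsmodels; the same back-calculated counts as earlier) that fits the logit form directly with a binomial GLM. As noted above, the logit fit implicitly conditions on the associations among the independent variables, and each beta corresponds to a difference of lambdas from the matching loglinear model:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per (gender, degree) combination; "earth" counts are successes.
wide = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "degree": ["lt_ba", "lt_ba", "ba", "ba"],
    "earth":  [370, 236, 192, 133],
    "other":  [108, 122, 11, 8],
})

# Binomial GLM on (successes, failures) -- the logit model. Sum coding
# keeps the parameterization comparable to effect-coded lambdas.
logit = smf.glm("earth + other ~ C(gender, Sum) + C(degree, Sum)",
                data=wide, family=sm.families.Binomial()).fit()

print(logit.params)  # the betas: each a difference of two loglinear lambdas
```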

Getting that "glassy-eyed stare" feeling? Not to worry. This is the first time around for this new vocabulary and you should be a "pro" by the second or third time around. We will see more (with an example) in Guide 6.
 
 
OVERVIEW
READINGS

Susan Carol Losh
February 20 2017