THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 3: THE SPSS GENERAL LOGLINEAR PROGRAM
20 points
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
You will use the SPSS Loglinear Model Selection and General programs to:
(A) run the saturated model for the four-way table of (1) compuse* by (2) degrecod by (3) gender by (4) recyear, then
(B) run and test the model you believe has the best fit.
These are variables from the 2010 and 2014 General Social Survey.
*Do you use a computer? 1 = yes, 2 = no.
You will write out the loglinear equations for your best-fitting model, then describe your model in words suitable (say) for a conference presentation.
|
|
I RECOMMEND THE FOLLOWING TO MAKE YOUR VARIABLE CHOICES EASIER TO FIND. You really DO want to know what the numeric values are for the categories in each variable because this could influence any recoding decisions that you make as well as providing variable ranges and interpreting results. Changing the Data General defaults means that you often can find your variables for statistical maneuvers much faster. In the SPSS menu at the top of the screen, go to the “Options” choice under the “Edit” menu. |
Keep your model selection program output in front of you as you use the GENERAL program. It's easier to use the copious output in the MODEL SELECTION package to identify your best model than to tediously try them out one at a time in the GENERAL program.
You will need the GSS 2010 2014 data file for Exercise 3 (GSS1014.sav).
You'll find the datafile under Course
Documents in the Data and Output folder.
When you click on the SPSS GSS1014.sav
file it will pull up SPSS on your [or the LRC] computer.
(I wanted to go further back in time but the question wasn't asked prior to 2010...)
|
Your GENERAL computer program output will look different from the MODEL SELECTION output--although the substantive conclusions about your final model should remain about the same. However, the values of the estimators will differ:
First, you are now including the estimate of the constant term in your equation with the GENERAL program.
Second, the MODEL SELECTION program uses "deviation estimates" or contrast coding (1, 0 and -1). This solution is typical of a "grand mean" approach in Analysis of Variance. The default for dichotomous variables in this case is 1, -1 (NOT 1, 0). However, the GENERAL program uses "indicator estimates" or dummy coding (0, 1). This is more comparable to dummy variable regression. It also means the parameter estimates will look different in each program.
Third, check the variable codings very carefully to see which is the referent category and which is the omitted category. Make sure the variables are coded the way you wanted them to be.
Fourth, both programs add a small amount (typically 0.5) to each cell for the saturated model. This will change the expected frequencies at least slightly.
Fifth, the two programs use slightly different algorithms to calculate the coefficients and fit the model (IPF* for MODEL SELECTION and Newton-Raphson for GENERAL). This is partly because the GENERAL program estimates both hierarchical and non-hierarchical models.
*IPF = Iterative proportional fitting algorithm
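To make the IPF footnote concrete, here is a minimal sketch in Python (not SPSS, and not the code either program actually runs). It assumes the four-variable table from this exercise collapsed over gender, so the counts are the year-by-education totals printed in that table. The function names `margin` and `ipf`, and the choice of the "all 2-way terms, no 3-way term" model, are illustrative assumptions only.

```python
import math
from itertools import product

# Cells are (year, degree, use): year 0 = 2010, 1 = 2014;
# degree 0 = junior college or less, 1 = BA or more;
# use 0 = uses a computer, 1 = everyone else.
cells = list(product(range(2), repeat=3))
counts = dict(zip(cells, [807, 241, 364, 19, 940, 240, 471, 14]))

def margin(table, keep):
    """Sum a table over every variable whose position is not listed in keep."""
    totals = {}
    for cell, value in table.items():
        key = tuple(cell[i] for i in keep)
        totals[key] = totals.get(key, 0.0) + value
    return totals

def ipf(observed, margins_to_fit, tol=1e-8, max_iter=200):
    """Rescale fitted counts to each observed margin in turn until they stop changing."""
    fitted = {cell: 1.0 for cell in observed}
    for iteration in range(1, max_iter + 1):
        largest_change = 0.0
        for keep in margins_to_fit:
            target = margin(observed, keep)
            current = margin(fitted, keep)
            for cell in fitted:
                key = tuple(cell[i] for i in keep)
                new_value = fitted[cell] * target[key] / current[key]
                largest_change = max(largest_change, abs(new_value - fitted[cell]))
                fitted[cell] = new_value
        if largest_change < tol:
            break
    return fitted, iteration

# Fit year*degree, year*use and degree*use; the 3-way term is left out.
two_way_margins = [(0, 1), (0, 2), (1, 2)]
fitted, iterations = ipf(counts, two_way_margins)

# Likelihood-ratio chi-square comparing observed and fitted counts.
g2 = 2 * sum(counts[c] * math.log(counts[c] / fitted[c]) for c in cells)
print(f"iterations used: {iterations}, G2 = {g2:.3f} on 1 df")
```

Because IPF only ever matches margins, it can fit hierarchical models but not models that skip a lower order term, which is consistent with the note above about why GENERAL uses a different algorithm.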
Because of the constraints that lambda and beta coefficients sum to zero within a particular variable, or across variable combinations (e.g., Gender-Compuse), parameter estimator coefficients are not independent of each other. This will be especially true when you use the multinomial distribution which assumes that n is fixed to be the sample casebase (although with a saturated model, don't expect differences between multinomial and Poisson). MODEL SELECTION simply omitted the extraneous estimator coefficients entirely, trusting that you would know how to calculate them (particularly easy with contrast coding and dichotomous variables--these tend to be mirror images). GENERAL lists the parameters, but places zeros on the parameters that are not estimated independently of one another.
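The difference between the two codings can be seen on a small saturated 2 x 2 table. The sketch below is plain Python rather than SPSS output. It uses the 2010 education-by-computer-use totals from the table in this exercise, and it assumes that the indicator coding treats the LAST category of each variable as the zeroed reference. The function names and the optional `delta` argument (for the 0.5 cell adjustment mentioned above, left at 0 here) are illustrative.

```python
import math

# 2010 totals: rows = junior college or less, BA or more;
# columns = uses a computer, everyone else.
table = [[807, 241], [364, 19]]

def log_counts(table, delta=0.0):
    """Natural logs of the cell counts, after adding delta to every cell."""
    return [[math.log(cell + delta) for cell in row] for row in table]

def deviation_estimates(table, delta=0.0):
    """Sum-to-zero (1, -1) coding: effects are deviations from the mean log count."""
    g = log_counts(table, delta)
    grand = (g[0][0] + g[0][1] + g[1][0] + g[1][1]) / 4
    row = (g[0][0] + g[0][1]) / 2 - grand
    col = (g[0][0] + g[1][0]) / 2 - grand
    inter = g[0][0] - grand - row - col
    return {"constant": grand, "degree": row, "use": col, "degree*use": inter}

def indicator_estimates(table, delta=0.0):
    """Dummy (0, 1) coding with the last category of each variable as reference."""
    g = log_counts(table, delta)
    return {
        "constant": g[1][1],
        "degree": g[0][1] - g[1][1],
        "use": g[1][0] - g[1][1],
        "degree*use": g[0][0] - g[0][1] - g[1][0] + g[1][1],
    }

dev = deviation_estimates(table)
ind = indicator_estimates(table)
for name in dev:
    print(f"{name:12s} deviation = {dev[name]: .4f}   indicator = {ind[name]: .4f}")
```

Both sets of estimates reproduce the same log count for the first cell when their four terms are added up. In this 2 x 2 case the indicator interaction is the log odds ratio, and the deviation interaction is one quarter of it, which is why the numbers differ between programs while the substantive conclusion does not. Under deviation coding the omitted parameters are mirror images (same size, opposite sign); under indicator coding they are the zeros that GENERAL prints.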
However, you can still use the Z-scores to guide you about which interaction terms or pairwise correlations you can drop or must keep.
When you create your GENERAL program, be sure to add the lower order terms to your program model that are hierarchically nested within the higher order interactions. The GENERAL program uses a hierarchical algorithm to calculate the G2 statistic and degrees of freedom. However, the parameter estimates change dramatically (and perhaps nonsensically) depending on which terms you include or omit while model building. The parameter estimates do not behave in a hierarchical fashion in the General program.
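As a rough aid for listing the lower order terms, the following Python sketch expands a set of highest-order terms into every term nested within them and counts the degrees of freedom. It assumes all four variables are dichotomous (so each term costs one parameter). The example model passed in at the bottom is purely hypothetical and is not a recommended answer, and the function names are made up for illustration.

```python
from itertools import combinations

FACTORS = ["recyear", "degrecod", "gender", "compuse"]  # all dichotomous

def hierarchical_terms(highest_order_terms):
    """Return every term implied by the listed terms, lowest order first."""
    implied = set()
    for term in highest_order_terms:
        variables = term.split("*")
        for size in range(1, len(variables) + 1):
            for subset in combinations(variables, size):
                # Store in FACTORS order so a*b and b*a count as one term.
                implied.add(tuple(sorted(subset, key=FACTORS.index)))
    return sorted(implied, key=lambda t: (len(t), [FACTORS.index(v) for v in t]))

def degrees_of_freedom(terms):
    """Cells minus fitted parameters; the extra 1 is the constant."""
    n_cells = 2 ** len(FACTORS)
    return n_cells - 1 - len(terms)

# Hypothetical model, used only to show the mechanics.
terms = hierarchical_terms(["gender*recyear*compuse", "degrecod*compuse"])
for term in terms:
    print("*".join(term))
print("df =", degrees_of_freedom(terms))
```

Every term printed by the sketch would have to be entered by hand in a Custom model in GENERAL, whereas MODEL SELECTION would need only the two highest-order terms.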
***If you decide to use a saturated model, the General SPSS program WILL include all hierarchical parameters--all parameters period. But it won't do that with the customized models. In those cases you must include and specify the lower order terms yourself.***
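For reference, the saturated four-way model can be written out as below. This is a notational sketch only: it borrows the descriptive letters suggested in Question 6 (YR = Recyear, DEG = Degrecod, G = Gender, PC = Compuse), and it assumes that m with subscripts i, j, k, l stands for the expected count in the cell at those levels of the four variables.

```latex
\begin{aligned}
\ln m_{ijkl} ={}& \lambda
 + \lambda^{YR}_{i} + \lambda^{DEG}_{j} + \lambda^{G}_{k} + \lambda^{PC}_{l} \\
&+ \lambda^{YR \cdot DEG}_{ij} + \lambda^{YR \cdot G}_{ik} + \lambda^{YR \cdot PC}_{il}
 + \lambda^{DEG \cdot G}_{jk} + \lambda^{DEG \cdot PC}_{jl} + \lambda^{G \cdot PC}_{kl} \\
&+ \lambda^{YR \cdot DEG \cdot G}_{ijk} + \lambda^{YR \cdot DEG \cdot PC}_{ijl}
 + \lambda^{YR \cdot G \cdot PC}_{ikl} + \lambda^{DEG \cdot G \cdot PC}_{jkl} \\
&+ \lambda^{YR \cdot DEG \cdot G \cdot PC}_{ijkl}
\end{aligned}
```

With four dichotomous variables this is 1 constant, 4 single-variable terms, 6 two-way terms, 4 three-way terms and 1 four-way term: 16 parameters for 16 cells. A custom model drops some of these terms, and each dropped term adds a degree of freedom.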
It is, of course, a different case if you have definite grounds to omit certain lower order terms. Perhaps you have an experimental design or a disproportionate sampling design that will automatically make some lower order terms or pairwise correlations to be zero. Since these effects really ARE zero, however, doing so shouldn't change your estimates of the other parameters.
However, notice that nuances such as sampling design differ from partial effects or correlations that become zero when you control for other variables! Remember that it is often necessary to include "main effects" when you have higher order interactions, whether you are doing an Analysis of Variance or a Loglinear Analysis. Failing to do so often credits the interaction effects with more influence than they actually have. There's also a good chance your Chi-square in the General program will become VERY large.
If a partial effect (e.g., the gender marginal) becomes zero when other variables are controlled, but was originally nonzero before controls, you should probably include this term when you create your loglinear equation.
But do remember about the possibility of
changing the probability level to include or omit a lambda or beta parameter
because you can change entry criteria with the Model selection or
Logistic Regression packages. You can also set the confidence interval
to 0.80 instead of 0.95 (recommended!). This will generate a narrower confidence
interval than the default 0.95 confidence interval. If the narrower 0.80
interval does not contain zero, you should probably retain that parameter
in the model.
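The effect of switching from a 0.95 to a 0.80 interval can be sketched in a few lines of Python. The estimate and standard error below are made-up numbers for illustration, not GSS results. The sketch assumes the usual large-sample interval of estimate plus or minus z times the standard error, and the function name `wald_interval` is hypothetical.

```python
from statistics import NormalDist

def wald_interval(estimate, std_error, level):
    """Two-sided large-sample interval: estimate +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical parameter estimate and standard error (illustration only).
estimate, std_error = 0.30, 0.20

for level in (0.95, 0.80):
    low, high = wald_interval(estimate, std_error, level)
    contains_zero = low <= 0 <= high
    print(f"{level:.2f} interval: ({low:.3f}, {high:.3f})  contains zero: {contains_zero}")
```

With these made-up numbers the 0.95 interval includes zero while the narrower 0.80 interval does not, so the more lenient criterion would retain the parameter.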
|
This exercise uses results from the General Social Survey: 2010 and 2014, a total of 4582 cases. What happened to the other cases? We lose them largely because those respondents were not asked about computer use.
Study the four-variable table below. You may want to do some crosstabulation runs and/or calculate some percentages to see how pairwise correlations and/or interaction effects may operate, because you will need to summarize your results in words again at the end of Exercise 3.
2010
EDUCATION LEVEL | JUNIOR COLLEGE OR LESS | | | AT LEAST A BA DEGREE | | |
GENDER | MALE | FEMALE | TOTAL n | MALE | FEMALE | TOTAL n |
USE COMPUTER | 77.3% | 76.8% | 807 | 93.0% | 96.7% | 364 |
EVERYONE ELSE | 22.7 | 23.2 | 241 | 7.0 | 3.3 | 19 |
TOTAL | 100.0% (n = 454) | 100.0% (n = 594) | 1048 | 100.0% (n = 172) | 100.0% (n = 211) | 383 |
2014
EDUCATION LEVEL | JUNIOR COLLEGE OR LESS | | | AT LEAST A BA DEGREE | | |
GENDER | MALE | FEMALE | TOTAL n | MALE | FEMALE | TOTAL n |
USE COMPUTER | 76.8% | 81.8% | 940 | 98.3% | 96.0% | 471 |
EVERYONE ELSE | 23.2 | 18.2 | 240 | 1.7 | 4.0 | 14 |
TOTAL | 100.0% (n = 514) | 100.0% (n = 666) | 1180 | 100.0% (n = 235) | 100.0% (n = 250) | 485 |
Source: General Social Survey, 2010 and 2014; available n for all sample participants
in the core module = 4582.
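One way to start exploring the table is to compute percentages and odds ratios from the printed totals. The Python sketch below uses only the year-by-education totals (users and everyone else, summed over gender) that appear in the table. The helper names are illustrative, and the odds ratio compares BA-or-more respondents with junior-college-or-less respondents on computer use.

```python
# (uses a computer, everyone else) totals by year and education, from the table.
totals = {
    (2010, "junior college or less"): (807, 241),
    (2010, "BA or more"): (364, 19),
    (2014, "junior college or less"): (940, 240),
    (2014, "BA or more"): (471, 14),
}

def percent_using(users, others):
    """Percent of the group that uses a computer."""
    return 100 * users / (users + others)

def odds_ratio(year):
    """Odds of computer use for BA or more, relative to junior college or less."""
    ba_users, ba_others = totals[(year, "BA or more")]
    jc_users, jc_others = totals[(year, "junior college or less")]
    return (ba_users / ba_others) / (jc_users / jc_others)

for (year, education), (users, others) in totals.items():
    print(f"{year} {education:24s} {percent_using(users, others):5.1f}% use a computer")

for year in (2010, 2014):
    print(f"{year} education-by-use odds ratio: {odds_ratio(year):.2f}")
```

If the odds ratio looks different in the two years, that is a descriptive hint about a possible year-by-education-by-use interaction. Whether such a term is needed is a question for the model tests, not for the raw percentages.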
|
Click on the GSS1014.sav file in
the Course Documents folder. SPSS will begin to load (NOTE: remember this
can take a bit: be patient!).
|
In the MODEL SELECTION program, run the following (add in order) under the saturated model:
Recyear, Degrecod, Gender and Compuse
Recyear, Gender and Compuse have the values
1 and 2.
Degrecod has the values 0 (low) and 1
(high).
Request the parameter estimates
and the partial association table.
Change the probability level on the opening
Model Selection menu to 0.20 if you like.
Print the output and examine the models.
Still in MODEL SELECTION, try out and test
your best model(s).
Under "Model" if needed change the
option from Saturated to Custom.
Since MODEL SELECTION is a hierarchical
program, you only need to include the higher order terms for this program.
Test your best model and note the G2,
df and significance level.
If you like, try out a few other models
too. (Maybe your first one wasn't "the best" after all...)
However remember this is a hierarchical
program so you can't omit lower order terms contained in higher order terms.
Under the SPSS Analyze menu, go to the Loglinear section and click on General...
In this order, enter the variables:
Recyear Degrecod Gender and Compuse
into the Factor(s): box.
Leave the distribution of cell counts on Poisson (for all dichotomous variables and a fully saturated model, your conclusions will essentially be the same as if you had used the Multinomial distribution).
Click on the Model... box.
Leave the radio button choice on Saturated for this first run experience, then click the Continue box.
Click on the Options... box.
Click to put a check mark on the Estimates
box.
GENERAL allows you to obtain the estimated
parameters for any model, saturated or not.
If you like: Change the Confidence Interval:
box to 80 (%)
Click on the Continue box.
Then click OK.
|
You will have A LOT of output.
Depending on what you are interested in,
you may wish to delete some of it (not for this assignment, however) when
you work with loglinear models in data analysis.
First, you receive a lot of information about your program run. Always check the design model in the GENERAL program (the saturated model should be OK but it's a good habit to cultivate with any of these programs) to make sure that you included the terms in the model that you intended to include. You will find the model at the very beginning of the program output after the "Factors" list.
Make sure the casebase, the variables, and the way the variables are coded correspond to what you believe you entered into the analysis.
For example, make sure the number of categories (levels) for each variable corresponds to your knowledge of that variable. Too many categories could mean that you inadvertently included cases that had missing data codes or even "wild punches" as substantively real values. If so, you would want to use SPSS provisions to make sure these category values are correctly coded as missing the next time you analyze the data. (Did you run the frequencies on each of your four variables first?)
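The same check can be sketched outside SPSS. The Python below assumes the valid codes listed in this exercise (1 and 2 for Recyear, Gender and Compuse; 0 and 1 for Degrecod). The three records are made up for illustration and are not GSS cases; the value 9 simply stands in for a missing-data code or wild punch. Reading the actual .sav file is outside the scope of this sketch.

```python
from collections import Counter

# Substantive codes expected for each variable, as listed in this exercise.
VALID_CODES = {
    "recyear": {1, 2},
    "degrecod": {0, 1},
    "gender": {1, 2},
    "compuse": {1, 2},
}

def unexpected_codes(records):
    """Return {variable: Counter of codes that fall outside the valid set}."""
    problems = {}
    for variable, valid in VALID_CODES.items():
        stray = Counter(r[variable] for r in records if r[variable] not in valid)
        if stray:
            problems[variable] = stray
    return problems

# Made-up records for illustration only; 9 plays the role of a stray code.
records = [
    {"recyear": 1, "degrecod": 0, "gender": 2, "compuse": 1},
    {"recyear": 2, "degrecod": 1, "gender": 1, "compuse": 9},
    {"recyear": 2, "degrecod": 0, "gender": 2, "compuse": 2},
]

print(unexpected_codes(records))
```

A variable that shows up in the result would appear in the loglinear output with too many levels unless the stray code is declared missing first.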
For your saturated model, the Chi-square should be zero with zero degrees of freedom. The "statistical significance" level is considered undefined in this case.
Next, you receive the observed and model cell counts for each of the 16 cells in the table.
In a saturated model, the observed and expected cell counts will be identical. However, once you drop terms, these will diverge. In models that are not saturated, you will want to examine the cell count residuals, and especially the adjusted cell count residuals, for clues about where the model may not fit. Larger adjusted residuals (over 1.50) should alert you to possible terms that need to be placed back in the model so that the model accurately reflects your observed data table.
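To show what an adjusted residual measures, here is a small Python sketch for the simplest non-saturated case: the independence model in a two-way table. It is not the four-way model from this exercise and not SPSS output. It uses the 2010 education-by-computer-use totals from the table and the standard adjusted residual for independence, (observed - expected) divided by the square root of expected x (1 - row proportion) x (1 - column proportion). The 1.50 cutoff is the one suggested above.

```python
import math

# 2010 totals: rows = junior college or less, BA or more;
# columns = uses a computer, everyone else.
table = [[807, 241], [364, 19]]

def adjusted_residuals(table):
    """Adjusted residuals for the independence model in a two-way table."""
    total = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            variance = (expected
                        * (1 - row_totals[i] / total)
                        * (1 - col_totals[j] / total))
            out.append((observed - expected) / math.sqrt(variance))
        result.append(out)
    return result

CUTOFF = 1.50  # threshold suggested in the text
for i, row in enumerate(adjusted_residuals(table)):
    for j, value in enumerate(row):
        flag = "  <-- check this cell" if abs(value) > CUTOFF else ""
        print(f"cell ({i}, {j}): adjusted residual = {value: .2f}{flag}")
```

In a 2 x 2 table all four adjusted residuals have the same absolute size, so they flag the missing education-by-use association rather than one particular cell. In larger tables the pattern of large residuals is what points to the term that needs to go back into the model.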
You will next examine A LOT of parameter estimates. It's not as bad as it looks, however, because redundant parameters, that could be calculated through a combination of marginal totals and independent parameters, have been omitted in this program by setting them to zero.
Finally, you will see a gigantic table(s)
that presents the covariances and correlations among the parameter estimates.
This usually is not as interesting in terms of the substance of your results
and we won't really examine these in this assignment.
|
This time, you will use the General program to build, test, and obtain the parameter estimates for your best-fitting model. Use your saturated model MODEL SELECTION and GENERAL runs to decide which terms to incorporate in your best-fitting model.
Once again, you will use the GSS1014.SAV file.
In this order, enter the variables:
Recyear Degrecod Gender and Compuse
into the Factor(s): box.
Leave the distribution of cell counts on Poisson.
Click on the Model box.
HERE'S WHERE THE DIRECTIONS WILL CHANGE!
Click on Custom to move the radio button choice.
I suggest you start with the "Main effects" (single variable marginals) and work forward from there in constructing your model.
So, put the Build Term(s) box on "Main
effects".
Click on the variables in the Factors
& Covariates: box to highlight them where you want to preserve
the marginal distributions (typically, that is all of them unless your
experimental or sampling design dictates otherwise).
Notice that you can click and highlight
more than one variable at a time. (Hold down the "Ctrl" key on your keyboard
to do this.)
So, for example, if you want to have all the single-variable effects in your model, you can click on all of them in the Factors & Covariates: box to highlight all the variables at once, then click the arrow button to pull all of the variables over into the Terms in Model: box.
In the Build Term(s) box decide
what you will do next with the 2-way variable associations or correlation
coefficients. If you want ALL the pairwise correlations, it's easiest
just to highlight all the variables in the Factors & Covariates:
box, select "All 2-way" in the Build Term(s) box and click the
button. The program will then place all possible two way associations in
the Terms in Model: box. By extension, you can also build in all
3-way interactions, 4-way interactions, and so forth. Of course, this would
simply rebuild the saturated model if you included all the possible terms
(in this case, up to the 4-way interaction).
However, your best model may not be the saturated model.
To include ONLY the 2-way interactions you want, change the box to read "Interaction." Highlight the variable pairs in the Factors & Covariates: box two at a time, then click on the arrow to put the correlations in the Terms in Model: box. For example, if you wanted to include the gender*compuse correlation, highlight both gender and compuse, then click the button. The 2-way association will appear in the Terms in Model: box. Continue entering the two-way terms until you have entered all the ones that you want to include in your best-fitting model.
To include ONLY the 3-way interactions you want, keep the box on "Interaction" and highlight the variable triplets in the Factors & Covariates: box three at a time, then click on the arrow to put the triplets in the Terms in Model: box. For example, if I wanted to include the 3-way interaction gender*recyear*compuse, I would highlight gender, recyear and compuse, then click the button. The program will now paste the gender*recyear*compuse term into the Terms in Model: box.
Continue selecting and pasting the terms you want into the Terms in Model: box until you are satisfied you have included all the terms in the model that you want. Remember to include lower order terms if appropriate. Then click the Continue box.
NOTE: You may find it easier to use the expressions "All 2-way" "All 3-way" (etc), highlight all the variables and let the program build the terms for you. Then just use the back arrow to omit the terms that you don't want to use in the model.
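The "build everything, then remove what you don't want" approach in the NOTE can be mimicked with a short Python sketch. It only generates term labels in the a*b form used in this handout; it does not run SPSS. The term removed at the end is a hypothetical choice made purely to show the mechanics, not a suggestion about the best model.

```python
from itertools import combinations

factors = ["recyear", "degrecod", "gender", "compuse"]

def all_k_way(factors, k):
    """Every k-way term that can be built from the factors, e.g. 'gender*compuse'."""
    return ["*".join(combo) for combo in combinations(factors, k)]

# Start with all main effects and all 2-way terms.
terms = all_k_way(factors, 1) + all_k_way(factors, 2)

# Hypothetical removal, for illustration only.
unwanted = {"recyear*gender"}
terms = [term for term in terms if term not in unwanted]

print("\n".join(terms))
```

With four variables there are 4 main effects, 6 two-way terms, 4 three-way terms and 1 four-way term, so it is worth counting the entries in the Terms in Model: box to confirm that nothing was added or dropped by accident.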
Click on the Options... box.
Click to put a check mark on the Estimates
box.
Notice that you can now obtain the adjusted
residuals and their corresponding normal scores.
Since you may not have a saturated model, keep these check marks (although they add to an already lengthy output) because if your new model doesn't fit, the normal residuals in particular will give you clues to what your results should look like. Also you will need them for the Exercise 3 questions.
Click on the Continue box.
NOW click OK.
|
Once again, you have a lot of output. Double check the variable categories, the n, and other data information. Double check the model to ensure that you included ALL the terms that you wanted.
Check to be sure that the function converged in your model. The program initially allows 20 iterations of the function. Under Options...you can raise that number if you need to. I have almost never seen that to be necessary. However, if the estimates did not converge for your particular model, you will have strange and unreliable output. So quickly glance at the convergence box to make sure before you continue. If the number of iterations is less than or equal to 20, you are fine.
Check the Goodness-of-Fit Tests box for the Chi-square, degrees of freedom and probability level for your best-fitting model. You will use these results to justify why you chose the model that you did.
Chi-square, degrees of freedom, and probability levels can never be negative. If any of these quantities is negative, something is wrong!
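If you want to double-check a probability level by hand, the upper-tail chi-square probability for whole-number degrees of freedom can be computed with the Python standard library. The sketch below is not how SPSS computes it; it uses the closed form for 1 or 2 degrees of freedom and the standard recurrence that adds one term for every 2 additional degrees of freedom. The function name is illustrative.

```python
import math

def chi_square_p_value(statistic, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if statistic < 0 or df < 1:
        raise ValueError("statistic must be non-negative and df at least 1")
    half = statistic / 2
    if df % 2:
        # Odd df: start from the 1-df tail probability.
        p, nu = math.erfc(math.sqrt(half)), 1
    else:
        # Even df: start from the 2-df tail probability.
        p, nu = math.exp(-half), 2
    while nu < df:
        # Moving from nu to nu + 2 degrees of freedom adds one term.
        p += half ** (nu / 2) * math.exp(-half) / math.gamma(nu / 2 + 1)
        nu += 2
    return p

# Familiar 0.05 critical values as a sanity check.
for statistic, df in [(3.841, 1), (5.991, 2), (11.070, 5)]:
    print(f"G2 = {statistic}, df = {df}, p = {chi_square_p_value(statistic, df):.4f}")
```

The function refuses a negative statistic and a model with zero degrees of freedom, which matches the points above: negative values signal an error, and the significance level of a saturated model is undefined. For a best-fitting model you are generally hoping for a LARGE p-value, since that means the model's expected counts are close to the observed table.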
The Adjusted Residuals which compare the expected and observed frequencies for each cell in the table are also helpful in selecting a final model.
The program gives you the parameter estimates for your model. Notice that only the terms you included when you created your custom model box are shown for the parameter estimates.
Once again, the program generates the correlations and covariances among all the parameters in the model you created.
|
NOTE: The little red balls are scattered throughout to help you remember to answer all parts of each question.
1. Your SPSS FREQUENCIES output and your MODEL SELECTION AND GENERAL output (2 points)
Although your output does not have a large weight, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these back against your output.
PLUS YOUR ANSWERS TO QUESTIONS 2 - 9 BELOW:
2. (3 points) Based
on the saturated model MODEL SELECTION output, which marginal, 2-way,
3-way, or 4-way terms look like they can be dropped or must be retained
in the model?
Why?
3. (2 points) Briefly use loglinear terminology to describe your best-fitting model.
After you have written the abbreviation for your best-fitting model, please be sure to include all the lower order terms that are needed so that the model is a hierarchical one.
4. (2 points) How
many degrees of freedom are in this best-fitting
model?
What
was the G2 and the associated p-level for
the model you selected?
5. (3 points) Do
you consider your final model "overfitted," "underfitted" or "just right"?
Briefly
defend your choice.
6. (3 points) Using your GENERAL results, write out the loglinear equation WITH NUMBERS that corresponds to the model you believe has the best fit.
Be sure to label the variables in your equation. You can assign them the letters A, B,C and D as long as you provide the variable names that accompany each of the letters. You can also assign the variables descriptive letters, e.g., G, YR, DEG or PC.
7. (2 points) What
was the largest residual in your best fitting model (i.e., which cell in
the table did this residual correspond to?)
What
was the size of its associated standardized or normal residual?
Given
the size of the normal residual, was this further evidence about the fit
of your model?
8. (1 point) Using your results from your best fitting model, draw a brief causal diagram sketch of how you think the variables Recyear, Gender, Degrecod and using a computer all work together. (Note: there may be some assumptions about your best-fitting model buried here. See if you can catch them!)
9. (2 points) IN WORDS, briefly describe the results as implied by your best fitting model. This means discussing the associations and possible interactions among the variables, not presenting numeric loglinear results or symbols.
Imagine
that you are describing the results in a non-technical fashion to a colleague
at a conference who is not familiar with categorical data analysis.
|
Susan Carol Losh
March 27 2017