Exercise 3: The Loglinear General Program

THIS EXERCISE IS DUE BY CLASS TUESDAY APRIL 4.

READINGS

GUIDE 1: ISSUES IN MODELING
GUIDE 2: TERMINOLOGY
GUIDE 3: THE LOWLY 2 X 2 TABLE
GUIDE 4: BASICS ON FITTING MODELS
GUIDE 5: SOME REVIEW, EXTENSIONS, LOGITS
GUIDE 6: LOGLINEAR & LOGIT MODELS
GUIDE 7: LOG-ODDS AND MEASURES OF FIT
GUIDE 8: LOGITS, LAMBDAS & OTHER GENERAL THOUGHTS

OVERVIEW

EDF 6937-05 SPRING 2017
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 3: THE SPSS GENERAL LOGLINEAR PROGRAM
20 points
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

PROGRAM NUANCES	REVIEW THE TABLE	PROGRAM 1 THE SATURATED MODEL	GENERAL--YOUR "BEST MODEL"	ASSIGNMENT QUESTIONS

You will use the Loglinear Model Selection and General SPSS Programs to:

(A) run the saturated model for the four-way table of (1) compuse* by (2) degrecod by (3) gender by (4) recyear, then
(B) run and test the model you believe has the best fit.

These are variables from the 2010 and 2014 General Social Survey.

*Do you use a computer? 1 = yes 2 = no.

You will write out the loglinear equations for your best-fitting model, then

You will describe your model in words suitable (say) for a conference presentation.

IMPORTANT NOTE #1: Be sure to use the 2010 and 2014 SPSS file; it is the only database that has 2010 and 2014 in "RECYEAR"!
Before starting, make sure the data weight is on (it will say "Weight on" in the lower right corner of the database window and the file is weighted online).
IMPORTANT NOTE #2: Please include both your MODEL SELECTION AND your GENERAL output when you turn in Exercise 3.
If you can't make it to class, please turn everything in through the Discussion Board; use the title "Your name exercise 3".

IMPORTANT NOTE #3: LOOK AT EVERYTHING WHEN YOU DETERMINE A FINAL MODEL! The MODEL SELECTION output, the GENERAL output, both sets of Z scores, k-way tables, chi-squares partitioned with the associated df, etc.!
Make sure that you TEST your final model! (You might want to test a couple of other models for comparison purposes too.)

REMEMBER: MAKE LIFE EASIER IN SPSS
I RECOMMEND THE FOLLOWING TO MAKE YOUR VARIABLE CHOICES EASIER TO FIND. You really DO want to know what the numeric values are for the categories in each variable because this could influence any recoding decisions that you make as well as providing variable ranges and interpreting results. Changing the Data General defaults means that you often can find your variables for statistical maneuvers much faster.

In the SPSS menu at the top of the screen, go to the “Options” choice under the “Edit” menu.
Under “Data—General” choose the following:
Variable Lists: (1) Display NAMES (2) ALPHABETICAL
Under “Data—Output Labels” choose the following:
Outline Labeling: (1) Variables—NAMES AND LABELS (2) Variable values—VALUES AND LABELS.
I have checked with our LRC personnel and they have no problem with your doing this (they think these options make it easier too!)

Keep your model selection program output in front of you as you use the GENERAL program. It's easier to use the copious output in the MODEL SELECTION package to identify your best model than to tediously try them out one at a time in the GENERAL program.

You will need the GSS 2010 2014 data file for Exercise 3 (GSS1014.sav).
You'll find the datafile under Course Documents in the Data and Output folder.
When you click on the SPSS GSS1014.sav file it will pull up SPSS on your [or the LRC] computer.

(I wanted to go further back in time but the question wasn't asked prior to 2010...)

PROGRAM NUANCES AND DIFFERENCES FROM THE MODEL SELECTION PROGRAM

Your GENERAL computer program output will look different from the MODEL SELECTION output--although the substantive conclusions about your final model should remain about the same. However, the values of the estimators will differ:

First, you are now including the estimate of the constant term in your equation with the GENERAL program.

Second, the MODEL SELECTION program uses "deviation estimates" or contrast coding (1, 0 and -1). This solution is typical of a "grand mean" approach in Analysis of Variance. The default for dichotomous variables in this case is 1, -1 (NOT 1, 0). The GENERAL program, however, uses "indicator estimates" or dummy coding (0, 1), which is more comparable to dummy variable regression. As a result, the parameter estimates for the same model will look different in each program.

Third, check the variable codings very carefully to see which is the referent category and which is the omitted category. Make sure the variables are coded the way you wanted them to be.

Fourth, both programs add a small amount (typically 0.5) to each cell for the saturated model. This will change the expected frequencies at least slightly.

Fifth, the two programs use slightly different algorithms to compute the coefficients and fit the model (IPF* for MODEL SELECTION and Newton-Raphson for GENERAL). This is partly because the GENERAL program estimates both hierarchical and non-hierarchical models.

*IPF = Iterative proportional fitting algorithm
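To make the IPF idea concrete, here is a minimal Python sketch of the textbook algorithm. It is not SPSS's implementation. It uses only the 2010 half of the table in the REVIEW THE TABLE section, with cell counts rebuilt from the rounded percentages (so they are approximate), and it fits a hypothetical model chosen purely for illustration: all three two-way associations and no three-way term. The function and variable names are made up for this example.

```python
import math
from itertools import product

# observed[e][g][c], 2010 only, approximate counts rebuilt from rounded percentages:
# e: 0 = junior college or less, 1 = at least a BA
# g: 0 = male, 1 = female
# c: 0 = uses a computer, 1 = everyone else
observed = [[[351, 103], [456, 138]],
            [[160, 12], [204, 7]]]

CELLS = list(product(range(2), repeat=3))

def margin(table, axes):
    """Collapse a 2x2x2 table onto the given axes, e.g. (0, 2) = education by compuse."""
    totals = {}
    for cell in CELLS:
        e, g, c = cell
        key = tuple(cell[a] for a in axes)
        totals[key] = totals.get(key, 0.0) + table[e][g][c]
    return totals

def ipf(obs, margins_to_fit, tol=1e-8, max_iter=500):
    """Rescale a table of ones until it reproduces each requested observed margin."""
    fitted = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
    targets = {axes: margin(obs, axes) for axes in margins_to_fit}
    for iteration in range(1, max_iter + 1):
        largest_change = 0.0
        for axes in margins_to_fit:
            current = margin(fitted, axes)
            for cell in CELLS:
                e, g, c = cell
                key = tuple(cell[a] for a in axes)
                updated = fitted[e][g][c] * targets[axes][key] / current[key]
                largest_change = max(largest_change, abs(updated - fitted[e][g][c]))
                fitted[e][g][c] = updated
        if largest_change < tol:
            break
    return fitted, iteration

# Hypothetical model: every two-way margin is preserved, no three-way interaction.
two_way_margins = [(0, 1), (0, 2), (1, 2)]
fitted, n_iter = ipf(observed, two_way_margins)

# Likelihood-ratio chi-square comparing observed and fitted counts (1 df for this model).
g2 = 2 * sum(observed[e][g][c] * math.log(observed[e][g][c] / fitted[e][g][c])
             for e, g, c in CELLS)
print(f"iterations: {n_iter}, G2 = {g2:.3f} on 1 df")
```

Notice that IPF never computes a lambda directly: it only adjusts expected counts until the chosen margins match, which is why it suits hierarchical models.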

Because the lambda and beta coefficients are constrained to sum to zero within a particular variable, or across variable combinations (e.g., Gender-Compuse), the parameter estimates are not independent of each other. This will be especially true when you use the multinomial distribution, which assumes that n is fixed at the sample casebase (although with a saturated model, don't expect differences between multinomial and Poisson). MODEL SELECTION simply omitted the redundant coefficients entirely, trusting that you would know how to calculate them (particularly easy with contrast coding and dichotomous variables, where they tend to be mirror images). GENERAL lists all the parameters, but places zeros on those that are not estimated independently of one another.
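The difference between the two coding schemes is easiest to see on a single 2 x 2 table. The Python sketch below is an illustration only (it is not SPSS output, and the function names are made up). It uses approximate gender by compuse counts for 2010 respondents with a junior college degree or less, rebuilt from the rounded percentages in the REVIEW THE TABLE section, and it assumes a saturated model with no 0.5 added to the cells.

```python
import math

# Approximate counts rebuilt from the rounded 2010 percentages (junior college or less).
# Rows: male, female. Columns: uses a computer, everyone else.
table = [[351, 103],
         [456, 138]]

def deviation_interaction(t):
    """Interaction lambda under deviation (1, -1) coding for a saturated 2 x 2 model.

    The four interaction terms are mirror images: +lambda, -lambda, -lambda, +lambda.
    """
    logs = [[math.log(count) for count in row] for row in t]
    return (logs[0][0] - logs[0][1] - logs[1][0] + logs[1][1]) / 4

def indicator_interaction(t):
    """Interaction parameter under indicator (0, 1) coding: the log odds ratio itself.

    The parameters involving the reference category are fixed at zero.
    """
    return math.log(t[0][0] * t[1][1] / (t[0][1] * t[1][0]))

dev = deviation_interaction(table)
ind = indicator_interaction(table)
print(f"deviation coding interaction: {dev:.4f}")
print(f"indicator coding interaction: {ind:.4f}")
```

For a saturated 2 x 2 table the indicator-coded interaction is exactly four times the deviation-coded one, so the two programs can describe the same association with very different-looking numbers.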

However, you can still use the Z-scores to guide you about which interaction terms or pairwise correlations you can drop or must keep.

When you create your GENERAL program, be sure to add the lower order terms to your program model that are hierarchically nested within the higher order interactions. The GENERAL program uses a hierarchical algorithm to calculate the G² statistic and degrees of freedom. However, the parameter estimates change dramatically (and perhaps nonsensically) depending on which terms you include or omit while model building. The parameter estimates do not behave in a hierarchical fashion in the General program.

***If you decide to use a saturated model, the General SPSS program WILL include all hierarchical parameters--all parameters period. But it won't do that with the customized models. In those cases you must include and specify the lower order terms yourself.***
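If it helps to check your custom model before you build it, the short Python sketch below lists every lower order term nested within a set of higher order terms. The helper name and the example model are hypothetical and are not a suggested answer; the variable names are the ones used in this exercise.

```python
from itertools import combinations

def hierarchical_terms(generators):
    """Return every term implied by the highest-order terms of a hierarchical model."""
    terms = set()
    for generator in generators:
        factors = sorted(generator.split("*"))
        for size in range(1, len(factors) + 1):
            for combo in combinations(factors, size):
                terms.add("*".join(combo))
    # Sort by order (main effects first), then alphabetically.
    return sorted(terms, key=lambda term: (term.count("*"), term))

# Hypothetical model: one 3-way interaction plus one extra 2-way association.
for term in hierarchical_terms(["gender*recyear*compuse", "degrecod*compuse"]):
    print(term)
```

Every term printed is one you would need to place in the Terms in Model: box yourself for a custom GENERAL run.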

It is, of course, a different case if you have definite grounds to omit certain lower order terms. Perhaps you have an experimental design or a disproportionate sampling design that will automatically make some lower order terms or pairwise correlations zero. Since these effects really ARE zero, however, omitting them shouldn't change your estimates of the other parameters.

However, notice that nuances such as sampling design differ from partial effects or correlations that become zero when you control for other variables! Remember that it is often necessary to include "main effects" when you have higher order interactions, whether you are doing an Analysis of Variance or a Loglinear Analysis. Failing to do so often credits the interaction effects with more influence than they actually have. There's a good chance your Chi-square in the General program will become VERY large, too.

If a partial effect (e.g., the gender marginal) becomes zero when other variables are controlled, but was originally nonzero before controls, you should probably include this term when you create your loglinear equation.

Because the GENERAL program does not automatically select good (or otherwise) models, you can't set the entry criterion probability level.

But do remember that you can change the probability level used to include or omit a lambda or beta parameter, because you can change the entry criteria in the Model Selection or Logistic Regression packages. In GENERAL you can also set the confidence interval to 0.80 instead of 0.95 (recommended!). This will generate a narrower confidence interval than the default 0.95 confidence interval. If the narrower 0.80 interval does not contain zero, you should probably retain that parameter in the model.
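The effect of the confidence level can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GENERAL output: it takes a single 2 x 2 slice (gender by compuse for 2014 respondents with at least a BA, with counts rebuilt from the rounded percentages in the REVIEW THE TABLE section) and uses the usual large-sample standard error of a log odds ratio. The estimates for the full four-way model will differ.

```python
import math
from statistics import NormalDist

def wald_interval(estimate, std_error, level):
    """Two-sided normal-theory confidence interval at the given level (e.g. 0.80)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

# Approximate counts: male users, male others, female users, female others.
a, b, c, d = 231, 4, 240, 10

estimate = math.log(a * d / (b * c))             # log odds ratio
std_error = math.sqrt(1/a + 1/b + 1/c + 1/d)     # large-sample standard error

ci80 = wald_interval(estimate, std_error, 0.80)
ci95 = wald_interval(estimate, std_error, 0.95)
print(f"estimate = {estimate:.3f}, SE = {std_error:.3f}")
print(f"80% interval: {ci80[0]:.3f} to {ci80[1]:.3f}")
print(f"95% interval: {ci95[0]:.3f} to {ci95[1]:.3f}")
```

With these approximate counts the 80% interval excludes zero while the 95% interval does not, which is exactly the situation where the choice of level changes whether you keep a term.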

REVIEW THE TABLE

This exercise uses results from the General Social Survey: 2010 and 2014, a total of 4582 cases. What happened to the other cases? We lose them largely because those respondents were not asked about computer use.

Study the four-variable table below. You may want to do some crosstabulation runs and/or calculate some percentages to see how pairwise correlations and/or interaction effects may operate, because you will need to summarize your results in words again at the end of Exercise 3.

2010

EDUCATION LEVEL:	JUNIOR COLLEGE OR LESS			AT LEAST A BA DEGREE
GENDER:	MALE	FEMALE	ROW n	MALE	FEMALE	ROW n
USE COMPUTER	77.3%	76.8%	807	93.0%	96.7%	364
EVERYONE ELSE	22.7	23.2	241	7.0	3.3	19
TOTAL (n)	100.0% (454)	100.0% (594)	1048	100.0% (172)	100.0% (211)	383

2014

EDUCATION LEVEL:	JUNIOR COLLEGE OR LESS			AT LEAST A BA DEGREE
GENDER:	MALE	FEMALE	ROW n	MALE	FEMALE	ROW n
USE COMPUTER	76.8%	81.8%	940	98.3%	96.0%	471
EVERYONE ELSE	23.2	18.2	240	1.7	4.0	14
TOTAL (n)	100.0% (514)	100.0% (666)	1180	100.0% (235)	100.0% (250)	485

Source: General Social Survey, 2010 and 2014; available n for all sample participants in the core module = 4582.
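As one way to study the table before you run anything, the Python sketch below turns the "USE COMPUTER" percentages and column n's back into approximate cell counts and prints the odds of computer use for each group. The counts are reconstructions from rounded percentages, so they may differ slightly from what SPSS reports, and the dictionary layout is just a convenience for this example.

```python
# (year, education, gender) -> (percent who use a computer, column n), copied from the table
table = {
    (2010, "JC or less", "male"): (77.3, 454),
    (2010, "JC or less", "female"): (76.8, 594),
    (2010, "BA or more", "male"): (93.0, 172),
    (2010, "BA or more", "female"): (96.7, 211),
    (2014, "JC or less", "male"): (76.8, 514),
    (2014, "JC or less", "female"): (81.8, 666),
    (2014, "BA or more", "male"): (98.3, 235),
    (2014, "BA or more", "female"): (96.0, 250),
}

# Rebuild (users, everyone else) for each of the eight columns.
counts = {}
for group, (pct_users, n) in table.items():
    users = round(pct_users / 100 * n)
    counts[group] = (users, n - users)

total_n = sum(n for _, n in table.values())

for group, (users, others) in counts.items():
    print(group, users, others, f"odds of use = {users / others:.2f}")
print("cases in the table:", total_n)
```

The rebuilt counts reproduce the row totals printed in the table (for example, 807 computer users among 2010 respondents with a junior college degree or less), and the 16 cells together hold 3096 cases.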

THE SATURATED MODEL: YOUR SPSS MODEL SELECTION AND GENERAL PROGRAMS RUNS

Click on the GSS1014.sav file in the Course Documents folder. SPSS will begin to load (NOTE: remember this can take a bit: be patient!).

Run frequencies on the four variables: compuse; gender; degrecod (recoded degree); and recyear (2010 or 2014). Be sure to note how variables are coded and that any missing values are excluded.

In the MODEL SELECTION program, run the following (add in order) under the saturated model:

Recyear, Degrecod, Gender and Compuse

Recyear, Gender and Compuse have the values 1 and 2.
Degrecod has the values 0 (low) and 1 (high).
Request the parameter estimates and the partial association table.
Change the probability level on the opening Model Selection menu to 0.20 if you like.
Print the output and examine the models.

Still in MODEL SELECTION, try out and test your best model(s).
Under "Model" if needed change the option from Saturated to Custom.
Since MODEL SELECTION is a hierarchical program, you only need to include the higher order terms for this program.
Test your best model and note the G², df and significance level.
If you like, try out a few other models too. (Maybe your first one wasn't "the best" after all...)
However, remember this is a hierarchical program, so you can't omit lower order terms contained in higher order terms.

THE GENERAL PROGRAM

Under the SPSS Analyze menu, go to the Loglinear section and click on General...

In this order, enter the variables:

Recyear Degrecod Gender and Compuse

into the Factor(s): box.

Leave the distribution of cell counts on Poisson (for all dichotomous variables and a fully saturated model, your conclusions will essentially be the same as if you had used the Multinomial distribution).

Click on the Model... box.

Leave the radio button choice on Saturated for this first run experience, then click the Continue box.

Click on the Options... box.
Click to put a check mark on the Estimates box.
GENERAL allows you to obtain the estimated parameters for any model, saturated or not.

If you like: Change the Confidence Interval: box to 80 (%)
Click on the Continue box.
Then click OK.

OUTPUT: THE "GENERAL" PROGRAM SATURATED RUN--GET THE PARAMETER ESTIMATES FOR YOUR BEST MODEL

You will have A LOT of output.
Depending on what you are interested in, you may wish to delete some of it (not for this assignment, however) when you work with loglinear models in data analysis.

First, you receive a lot of information about your program run. Always check the design model in the GENERAL program (the saturated model should be OK but it's a good habit to cultivate with any of these programs) to make sure that you included the terms in the model that you intended to include. You will find the model at the very beginning of the program output after the "Factors" list.

Make sure the casebase, the variables, and the way the variables are coded correspond to what you believe you entered into the analysis.

For example, make sure the number of categories (levels) for each variable corresponds to your knowledge of that variable. Too many categories could mean that you inadvertently included cases that had missing data codes or even "wild punches" as substantively real values. If so, you would want to use SPSS provisions to make sure these category values are correctly coded as missing the next time you analyze the data. (Did you run the frequencies on each of your four variables first?)

For your saturated model, the Chi-square should be zero with zero degrees of freedom. The "statistical significance" level is considered undefined in this case.
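The zero degrees of freedom follow from simple counting: the number of cells minus the number of independently estimated parameters (including the constant). The Python sketch below does that count for this exercise's four dichotomous variables. The helper names are made up for illustration, and the non-saturated models shown are just examples, not recommendations.

```python
from itertools import combinations
from math import prod

# Number of categories for each variable in this exercise.
levels = {"recyear": 2, "degrecod": 2, "gender": 2, "compuse": 2}

def all_terms(factors, max_order):
    """Every main effect and interaction up to max_order (like 'All 2-way', 'All 3-way')."""
    return ["*".join(combo)
            for order in range(1, max_order + 1)
            for combo in combinations(factors, order)]

def model_df(levels, terms):
    """Degrees of freedom = cells in the table minus independent parameters."""
    n_cells = prod(levels.values())
    # Each term contributes (categories - 1) multiplied across its variables; +1 for the constant.
    n_params = 1 + sum(prod(levels[f] - 1 for f in term.split("*")) for term in terms)
    return n_cells - n_params

factors = list(levels)
for max_order, label in [(1, "main effects only"), (2, "all 2-way"),
                         (3, "all 3-way"), (4, "saturated")]:
    print(f"{label}: df = {model_df(levels, all_terms(factors, max_order))}")
```

The same function works for any custom model, provided you pass it the full list of terms (lower order terms included).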

Next, you receive the observed and model cell counts for each of the 16 cells in the table.

In a saturated model, the observed and expected cell counts will be identical. However, once you drop terms, these will diverge. In models that are not saturated, you will want to examine the cell count residuals, and especially the adjusted cell count residuals, for clues about where the model may not fit. Larger adjusted residuals (over 1.50) should alert you to possible terms that need to be placed back in the model so that the model accurately reflects your observed data table.
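To show what an adjusted residual measures, here is a small Python sketch for the simplest possible case: an independence model on one 2 x 2 slice (gender by compuse for 2010 respondents with at least a BA, using counts rebuilt from the rounded percentages). The closed-form variance used here applies only to two-way independence; for your four-way models GENERAL computes the adjustment from the fitted model, so treat this strictly as an illustration.

```python
import math

# Approximate counts rebuilt from the rounded 2010 percentages (at least a BA).
observed = [[160, 12],   # male: uses a computer, everyone else
            [204, 7]]    # female: uses a computer, everyone else

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

adjusted = []
for i, row in enumerate(observed):
    adjusted.append([])
    for j, obs in enumerate(row):
        # Expected count if gender and computer use were independent.
        expected = row_totals[i] * col_totals[j] / n
        # Estimated variance of (observed - expected) under independence.
        variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
        adjusted[i].append((obs - expected) / math.sqrt(variance))

for i, label in enumerate(["male", "female"]):
    print(label, [round(value, 2) for value in adjusted[i]])
```

In a 2 x 2 table all four adjusted residuals share the same absolute value; here it is a little over 1.50, the kind of signal that a dropped term (in this case the gender by compuse association) may deserve a second look.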

You will next examine A LOT of parameter estimates. It's not as bad as it looks, however, because redundant parameters, that could be calculated through a combination of marginal totals and independent parameters, have been omitted in this program by setting them to zero.

Finally, you will see one or more gigantic tables that present the covariances and correlations among the parameter estimates. These usually are not as interesting in terms of the substance of your results and we won't really examine them in this assignment.

YOUR BEST MODEL: USING THE SPSS GENERAL PROGRAM

This time, you will use the General program to build, test, and obtain the parameter estimates for your best-fitting model. Use your saturated model MODEL SELECTION and GENERAL runs to decide which terms to incorporate in your best-fitting model.

Once again, you will use the GSS1014.SAV file.

In this order, enter the variables:

Recyear Degrecod Gender and Compuse

into the Factor(s): box.

Leave the distribution of cell counts on Poisson.

Click on the Model box.

HERE'S WHERE THE DIRECTIONS WILL CHANGE!

Click on Custom to move the radio button choice.

I suggest you start with the "Main effects" (single variable marginals) and work forward from there in constructing your model.

So, put the Build Term(s) box on "Main effects".
Click on the variables in the Factors & Covariates: box to highlight them where you want to preserve the marginal distributions (typically, that is all of them unless your experimental or sampling design dictates otherwise).
Notice that you can click and highlight more than one variable at a time. (Hold down the "Ctrl" key on your keyboard to do this.)

So, for example, if you want to have all the single-variable effects in your model, you can click on all of them in the Factors & Covariates: box to highlight all the variables at once, then click the button to pull all of the variables over into the Terms in Model: box.

In the Build Term(s) box decide what you will do next with the 2-way variable associations or correlation coefficients. If you want ALL the pairwise correlations, it's easiest just to highlight all the variables in the Factors & Covariates: box, select "All 2-way" in the Build Term(s) box and click the button. The program will then place all possible two way associations in the Terms in Model: box. By extension, you can also build in all 3-way interactions, 4-way interactions, and so forth. Of course, this would simply rebuild the saturated model if you included all the possible terms (in this case, up to the 4-way interaction).

However, your best model may not be the saturated model.

To include ONLY the 2-way interactions you want, change the box to read "Interaction." Highlight the variable pairs in the Factors & Covariates: box two at a time, then click on the arrow to put the correlations in the Terms in Model: box. For example, if you wanted to include the gender*compuse correlation, highlight both gender and compuse, then click the button. The 2-way association will appear in the Terms in Model: box. Continue entering the two-way terms until you have entered all the ones that you want to include in your best-fitting model.

To include ONLY the 3-way interactions you want, keep the box on "Interaction" and highlight the variable triplets in the Factors & Covariates: box three at a time, then click on the arrow to put the triplets in the Terms in Model: box. For example, if I wanted to include the 3-way interaction gender*recyear*compuse, I would highlight gender, recyear and compuse, then click the button. The program will now paste the gender*recyear*compuse term into the Terms in Model: box.

Continue selecting and pasting the terms you want into the Terms in Model: box until you are satisfied you have included all the terms in the model that you want. Remember to include lower order terms if appropriate. Then click the Continue box.

NOTE: You may find it easier to use the expressions "All 2-way" "All 3-way" (etc), highlight all the variables and let the program build the terms for you. Then just use the back arrow to omit the terms that you don't want to use in the model.

Click on the Options... box.
Click to put a check mark on the Estimates box.
Notice that you can now obtain the adjusted residuals and their corresponding normal scores.

Since you may not have a saturated model, keep these check marks (although they add to an already lengthy output) because if your new model doesn't fit, the normal residuals in particular will give you clues to what your results should look like. Also you will need them for the Exercise 3 questions.

Click on the Continue box.

NOW click OK.

OUTPUT: THE SPSS GENERAL PROGRAM RUN--"BEST MODEL"

Once again, you have a lot of output. Double check the variable categories, the n, and other data information. Double check the model to ensure that you included ALL the terms that you wanted.

Check to be sure that the function converged in your model. The program initially allows 20 iterations of the function. Under Options... you can raise that number if you need to. I have almost never seen that to be necessary. However, if the estimates did not converge for your particular model, you will have strange and unreliable output. So quickly glance at the convergence box to make sure before you continue. If the number of iterations is less than or equal to 20, you are fine.

Check the Goodness-of-Fit Tests box for the Chi-square, degrees of freedom and probability level for your best-fitting model. You will use these results to justify why you chose the model that you did.

Chi-square, degrees of freedom, and probability levels are never negative. If any of these quantities are negative, something is wrong!
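For reference, the likelihood-ratio chi-square (G²) and its probability level can be computed by hand from observed and expected counts. The Python sketch below does so for a deliberately simple case, an independence model on one 2 x 2 slice (gender by compuse, 2010, at least a BA, counts rebuilt from rounded percentages), which has 1 degree of freedom. The p-value helper is valid only for 1 df, and none of this is SPSS output.

```python
import math

def g_squared(observed, expected):
    """Likelihood-ratio chi-square: 2 * sum of observed * ln(observed / expected)."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def p_value_1df(chi_square):
    """Upper-tail probability for a chi-square statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))

# Cells in order: male users, male others, female users, female others.
observed = [160, 12, 204, 7]
n = sum(observed)
rows = [observed[0] + observed[1], observed[2] + observed[3]]
cols = [observed[0] + observed[2], observed[1] + observed[3]]

# Expected counts if gender and computer use were independent.
expected = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]

g2 = g_squared(observed, expected)
p = p_value_1df(g2)
print(f"G2 = {g2:.3f}, df = 1, p = {p:.3f}")
```

A small G² with a large p means the expected counts sit close to the observed ones, which is what you want from a model you intend to keep.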

The Adjusted Residuals which compare the expected and observed frequencies for each cell in the table are also helpful in selecting a final model.

The program gives you the parameter estimates for your model. Notice that only the terms you included when you created your custom model box are shown for the parameter estimates.

Once again, the program generates the correlations and covariances among all the parameters in the model you created.

ASSIGNMENT QUESTIONS

NOTE: The little red balls are scattered throughout to help you remember to answer all parts of each question.

1. Your SPSS FREQUENCIES output and your MODEL SELECTION AND GENERAL output (2 points)

Although your output does not have a large weight, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these back against your output.

PLUS YOUR ANSWERS TO QUESTIONS 2 - 9 BELOW:

2. (3 points) Based on the saturated model MODEL SELECTION output, which marginal, 2-way, 3-way, or 4-way terms look like they can be dropped or must be retained in the model?
Why?

3. (2 points) Briefly use loglinear terminology to describe your best-fitting model.

After you have written the abbreviation for your best-fitting model, please be sure to include all the lower order terms that are needed so that the model is a hierarchical one.

4. (2 points) How many degrees of freedom are in this best-fitting model?
What was the G² and the associated p-level for the model you selected?

5. (3 points) Do you consider your final model "overfitted," "underfitted," or "just right"?
Briefly defend your choice.

6. (3 points) Using your GENERAL results, write out the loglinear equation WITH NUMBERS that corresponds to the model you believe has the best fit.

Be sure to label the variables in your equation. You can assign them the letters A, B,C and D as long as you provide the variable names that accompany each of the letters. You can also assign the variables descriptive letters, e.g., G, YR, DEG or PC.

7. (2 points) What was the largest residual in your best fitting model (i.e., which cell in the table did this residual correspond to?)
What was the size of its associated standardized or normal residual?
Given the size of the normal residual, was this further evidence about the fit of your model?

8. (1 point) Using your results from your best fitting model, draw a brief causal diagram sketch of how you think the variables Recyear, Gender, Degrecod and using a computer all work together. (Note: there may be some assumptions about your best-fitting model buried here. See if you can catch them!)

9. (2 points) IN WORDS, briefly describe the results as implied by your best fitting model. This means discussing the associations and possible interactions among the variables, not presenting numeric loglinear results or symbols.

Imagine that you are describing the results in a non-technical fashion to a colleague at a conference who is not familiar with categorical data analysis.

Susan Carol Losh
March 27 2017