READINGS
GUIDE 1: ISSUES IN MODELING
GUIDE 2: TERMINOLOGY
GUIDE 3: THE LOWLY 2 X 2 TABLE
GUIDE 4: BASICS ON FITTING MODELS
GUIDE 5: SOME REVIEW, EXTENSIONS, LOGITS
GUIDE 6: LOGLINEAR & LOGIT MODELS
GUIDE 7: LOG-ODDS AND MEASURES OF FIT
GUIDE 8: LOGITS, LAMBDAS & OTHER GENERAL THOUGHTS
OVERVIEW

 
 
EDF 6937-01       SPRING 2017
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 2
THE MODEL SELECTION (HILOG) LOGLINEAR SPSS PROGRAM
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
 
THIS EXERCISE IS DUE TUESDAY MARCH 7 AT CLASS. WE WILL DISCUSS IT BEFORE YOU TURN IN YOUR EXERCISE.
BE SURE TO INCLUDE (HARD COPY OF) YOUR COMPUTER OUTPUT!

If you can't attend class March 7 please see that I receive your exercise by FOUR P.M. Tuesday March 7.

Please remember NO EMAIL ATTACHMENTS!

Check out our DISCUSSION BOARD link in Blackboard (it will take .doc or .pdf files). You can scan your output into a pdf file!
 


 

PLEASE NOTE: Be sure to examine the program variables for your model very carefully. I have created some new recodes for this (and possible future) assignments. The Model Selection program treats the first category in arithmetic sequence of each variable as "high" and the second as "low". So, be sure to use: DEGRECOD, RECYEAR, GENDER and DADGENE for this assignment. You will find the data in Blackboard under Course Documents, Data and Output. Use the 2006to2014B file!

Questions? Send me an email at: slosh@fsu.edu

And, REMEMBER! (Repeat aloud as needed:)
 
 
Big Chi-squares are BAD. The model doesn't fit!
Little Chi-squares are GOOD. The model fits!

 
YOUR VARIABLES
SPECIFICATIONS & PROGRAMS
THE SPSS HILOG PROGRAM
EXAMINING YOUR OUTPUT
ASSIGNMENT QUESTIONS

PLEASE BE CERTAIN TO USE A "FULLSIZE" SPSS VERSION (e.g., 23). Studentware type SPSS programs often will not work with a database this size.

PURPOSE, VARIABLES AND THE FOUR WAY PERCENTAGE TABLE

The purpose of this exercise is to provide preliminary experience with testing and evaluating basic hierarchical loglinear models. You will analyze four variables using the 2598 valid WEIGHTED cases from the 2006 and 2014 General Social Survey datafile available in Blackboard. You will use the "Model Selection" program in SPSS. The datafile is in the Data and Output folder under Course Documents. In the GSS-NSF 2006 to 2014 folder, please use the GSS2006TO2014B.SAV (12.332 MB) file.

When you click on the GSS2006TO2014B.SAV file line on a computer that has SPSS installed (e.g., in the LRC), SPSS will start within a minute or two and load the datafile into the SPSS spreadsheet.

Be sure the datafile says "Weight on" in the lower right hand corner.

The four variables in this model are:

GENDER: 1 = Male  2 = Female

RECYEAR (recoded year):  1= 2006  2 = 2014

DEGRECOD: (recoded  degree) 0 = the respondent has a junior college degree or less   1 = the respondent has at least a BA degree

DADGENE: (The father's gene decides the sex of the baby--the earlier "BOYORGRL" variable) coded 1 = true (correct answer) or 2 = false (incorrect)
(note there is no "i" in BOYORGRL.)

This 2 X 2 X 2 X 2 percentage table for your viewing information looks as follows:

2006

DEGREE LEVEL                          JUNIOR COLLEGE OR LESS      BACCALAUREATE OR MORE
GENDER                                MALE     FEMALE   N         MALE     FEMALE   N
FATHER'S GENE DETERMINES BABY'S SEX   69.8%    81.3%    822       69.0%    85.2%    368
DOES NOT DETERMINE BABY'S SEX         30.2     18.7     250       31.0     14.8     105
TOTAL                                 100.0%   100.0%             100.0%   100.0%
(N)                                   430      642      1072      216      257      473

2014

DEGREE LEVEL                          JUNIOR COLLEGE OR LESS      BACCALAUREATE OR MORE
GENDER                                MALE     FEMALE   N         MALE     FEMALE   N
FATHER'S GENE DETERMINES BABY'S SEX   54.8%    71.3%    482       68.4%    82.9%    232
DOES NOT DETERMINE BABY'S SEX         45.2     28.7     268       31.6     17.1     71
TOTAL                                 100.0%   100.0%             100.0%   100.0%
(N)                                   321      429      750       133      170      303

Source: General Social Surveys 2006 and 2014,  Director, General Social Survey (NORC).
 
 

 
In the past, women more often answered this question correctly. Are the sex differences in 2006 and 2014 meaningful? Does year make a difference? Does degree level? Are any apparent differences really just sampling error? Analysis will tell!

Take a few minutes to study these percentage results. The results from this percentage table can be useful in answering question 10 below.
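As a quick arithmetic check on the percentage table, the column Ns should add up to the 2598 valid weighted cases, and the female-minus-male gap in correct answers can be read off each panel. Here is a minimal Python sketch that uses only the numbers printed in the table; the dictionary layout and the variable names are illustrative assumptions, not part of the SPSS output.

```python
# Percent answering "true" (correct) and column N, copied from the table.
# Keys: (year, degree level, gender)
table = {
    (2006, "JC or less", "male"):   (69.8, 430),
    (2006, "JC or less", "female"): (81.3, 642),
    (2006, "BA or more", "male"):   (69.0, 216),
    (2006, "BA or more", "female"): (85.2, 257),
    (2014, "JC or less", "male"):   (54.8, 321),
    (2014, "JC or less", "female"): (71.3, 429),
    (2014, "BA or more", "male"):   (68.4, 133),
    (2014, "BA or more", "female"): (82.9, 170),
}

# The eight column Ns should sum to the total number of valid weighted cases.
total_n = sum(n for _, n in table.values())

# Female-minus-male gap in percent correct, within each year and degree level.
gaps = {}
for year in (2006, 2014):
    for degree in ("JC or less", "BA or more"):
        female_pct = table[(year, degree, "female")][0]
        male_pct = table[(year, degree, "male")][0]
        gaps[(year, degree)] = round(female_pct - male_pct, 1)

print("Total weighted N:", total_n)
for key, gap in gaps.items():
    print(key, "female minus male:", gap, "percentage points")
```

These gaps are descriptive only. Whether they differ reliably by year or degree level is what the loglinear tests are for.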


 

 

ASSIGNMENT GENERAL SPECIFICATIONS AND THE SPSS MODEL SELECTION PROGRAM

The 2006 and 2014 samples of the GSS were face to face area probability samples of the lower 48 United States with random selection within households.

This exercise uses the "Model Selection" program in SPSS. This "HILOG" program (short for HIerarchical LOGlinear) estimates ONLY hierarchical loglinear models. With the Model Selection program, you specify only the highest order terms and the program fills in all the lower order terms they imply. If you use the saturated model, the program will calculate all possible parameters.
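To see what "filling in the rest" means, here is a minimal Python sketch that expands a set of highest order terms into every lower order term a hierarchical model must also contain. The function name and the one-letter variable labels (G = GENDER, E = DEGRECOD, Y = RECYEAR, D = DADGENE) are assumptions made for illustration; SPSS does this expansion internally.

```python
from itertools import combinations

def expand_hierarchical(generating_class):
    """Return every term implied by the highest-order terms of a hierarchical model."""
    terms = set()
    for top_term in generating_class:
        letters = sorted(top_term)
        # Every non-empty subset of a retained term must also be in the model.
        for order in range(1, len(letters) + 1):
            for combo in combinations(letters, order):
                terms.add("".join(combo))
    # Sort by order (main effects first), then alphabetically.
    return sorted(terms, key=lambda t: (len(t), t))

# Hypothetical letters: G = GENDER, E = DEGRECOD, Y = RECYEAR, D = DADGENE
print(expand_hierarchical(["GED", "Y"]))
print(len(expand_hierarchical(["GEYD"])))  # number of terms in the saturated model
```

For example, keeping the three-way term for G, E and D automatically keeps its three two-way associations and three main effects. The saturated four-variable model contains 15 terms (not counting the grand mean).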

The HILOG program provides the numeric lambda parameters for the general cell frequencies equation ONLY for the fully saturated model. It also does not estimate the grand mean (equiprobable model) term. Whichever of the loglinear, ordinal, or logistic regression packages you use, now or later, you really need to read the output VERY carefully to ensure that you estimated the model that you thought you did, and to understand the output terms.
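To make the lambda parameters concrete, here is a minimal Python sketch for a 2 X 2 table. It assumes effect (deviation) coding, in which the first category of each variable is scored +1 and the second -1, so the two categories' parameters sum to zero. The counts are hypothetical, and SPSS may apply adjustments to the cell counts that this sketch omits, so treat it as an illustration of the logic and not as a replica of the HILOG output.

```python
import math

# Hypothetical 2 x 2 counts (rows = variable A, columns = variable B).
counts = [[40.0, 10.0],
          [20.0, 30.0]]

logs = [[math.log(f) for f in row] for row in counts]

theta = sum(sum(row) for row in logs) / 4.0        # grand mean of the logged counts
lam_a1 = sum(logs[0]) / 2.0 - theta                # A, category 1
lam_b1 = (logs[0][0] + logs[1][0]) / 2.0 - theta   # B, category 1
lam_ab11 = logs[0][0] - theta - lam_a1 - lam_b1    # A x B, cell (1, 1)

def sign(category_index):
    """Effect coding: the first category counts as +1, the second as -1."""
    return 1 if category_index == 0 else -1

def fitted(i, j):
    """Rebuild a cell frequency from the four saturated-model parameters."""
    log_f = (theta
             + sign(i) * lam_a1
             + sign(j) * lam_b1
             + sign(i) * sign(j) * lam_ab11)
    return math.exp(log_f)

for i in range(2):
    for j in range(2):
        print(i + 1, j + 1, round(fitted(i, j), 3))
```

Because the model is saturated, the four parameters reproduce the four observed counts exactly. Only one lambda per term needs to be reported, since the value for the other category is the same number with the opposite sign.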

Unlike the "General" loglinear program, the "Model Selection" program does not provide parameter estimates for the unsaturated model you are using. However, this program can be very helpful in suggesting and testing a hopefully final model.

In this exercise you will evaluate a four variable model. You will interpret statistics from the saturated model. Then you will examine the various backwards elimination models to estimate what appears to be the best model to use with the data. Sometimes there will be some divergence between what the computer program tells you is the best fitting model under the default conditions and some of our class material. Be sure to go with the in-class material in this case! The program uses a mechanical "data dredger" procedure to make its decision, and we hope that human judgement will take more of the nuances into account. For example, we use a 0.20 probability cutoff, while the program uses 0.05 as its default. Under that default, some models that may "underfit" when reproducing the observed table could "slide through" and meet the SPSS criteria for a good model because they have probability levels over 0.05 but less than 0.20. (NOTE: On the front of the Model Selection menu, you can change the "probability for removal" from 0.05 to 0.20 to make life a bit easier!)
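The difference between the two cutoffs can be sketched in a few lines of Python. The chi-square tail probability below is computed with the standard library only. The function names and the G2 and degrees of freedom values are hypothetical, chosen to land between the two cutoffs, and are not results from the GSS data.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df (stdlib only)."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    z = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= z / k
            total += term
        return math.exp(-z) * total
    # Odd df: start from the df = 1 case and add the half-integer terms.
    total = math.erfc(math.sqrt(z))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-z) * z ** (k - 0.5) / math.gamma(k + 0.5)
    return total

def verdict(g2, df, cutoff):
    """A model is treated as fitting when its p-level exceeds the cutoff."""
    return "fits" if chi2_sf(g2, df) > cutoff else "does not fit"

# Hypothetical G2 and df, chosen only to illustrate the gap between cutoffs.
g2, df = 6.0, 4
print("p-level:", round(chi2_sf(g2, df), 4))
print("SPSS default (0.05):", verdict(g2, df, 0.05))
print("In-class rule (0.20):", verdict(g2, df, 0.20))
```

With these made-up values the p-level is about 0.199, so the model would pass the SPSS default yet fail the in-class criterion. That is exactly the kind of model that can "slide through".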

You will examine the Chi-squared statistics, the degrees of freedom, probability levels and the Z-values, the standardized scores that correspond to the numeric lambda estimates for the saturated model. All of these results should give you some clues about the best model for the data. In the process you will evaluate some different models for the data in the questions at the end of this page.

You will also write out the numeric general cell frequency equation using the lambdas for the saturated loglinear model that describes these four variables. Then, you will interpret the findings in this 4-way table in words. The percentage tables above should help you with the verbal descriptions.

You do NOT need to include the "grand mean" or θ parameter in this exercise.
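For orientation, the symbolic form of the saturated four-variable model can be sketched as follows. This assumes the additive (logged) form of the general cell frequency equation, with θ standing for the grand mean, and the letter assignments G = GENDER, E = DEGRECOD, Y = RECYEAR and D = DADGENE, which are only one possible labeling. Check the notation against the class guides before using it.

```latex
\ln F_{ijkl} = \theta
  + \lambda^{G}_{i} + \lambda^{E}_{j} + \lambda^{Y}_{k} + \lambda^{D}_{l}
  + \lambda^{GE}_{ij} + \lambda^{GY}_{ik} + \lambda^{GD}_{il}
  + \lambda^{EY}_{jk} + \lambda^{ED}_{jl} + \lambda^{YD}_{kl}
  + \lambda^{GEY}_{ijk} + \lambda^{GED}_{ijl} + \lambda^{GYD}_{ikl} + \lambda^{EYD}_{jkl}
  + \lambda^{GEYD}_{ijkl}
```

The equation has four main effects, six two-way associations, four three-way interactions and one four-way interaction, for 15 lambda terms in all.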
 


RUNNING YOUR SPSS HIERARCHICAL LOGLINEAR PROGRAM

Open the SPSS program.
You can do so by clicking on the GSS2006TO2014B.SAV file under Course Documents-->Data and Output folder in Blackboard.
(use the "B" datafile)

SPSS may take a minute or so to load all the data.

Under the Analyze menu at the top of the page, select Descriptive Statistics, then select Frequencies
Select the variables: GENDER, RECYEAR, DEGRECOD and DADGENE.
Click OK to run the frequencies for these four variables.
Make sure that all the missing data are, in fact, coded as missing (including any system missing "sysmis" values) and that the percentages shown are ONLY for valid cases.
Depending on the variable, up to 7048 valid cases are possible for the years 2006 and 2014.

Now, print or save the Frequencies tables for your output to turn in to me.

Under the Analyze menu at the top of the page, select Loglinear, then select Model Selection...

For factors, select RECYEAR (range 1-2), DEGRECOD (range 0-1), GENDER (range 1-2), and DADGENE (range 1-2). You must click on the Define Range box each time you enter a variable, and enter the minimum and maximum values for each variable.

Click on Model.
You will start with the saturated model already chosen for you, so click Continue.

Click on the Options... button.
Left click to select Parameter estimates.
Left click to select Association table
Click Continue.

Remember you can change the probability for removal on the front menu to 0.20.
Now click OK to run the program. 



EXAMINING YOUR OUTPUT

Well, running the computer program is one thing, but being able to view ALL of your output is something else again. If you use SPSS 23 or later, all your output should be present and easily visible.

Be sure to print your total loglinear "Model Selection" program output to turn in (OR POST as a pdf) to me with your assignment.

The Model Selection program gives you a variety of output.

"Tests that K-way and higher order effects are zero" is a BACKWARDS elimination algorithm. (Model Selection does not offer a forward selection algorithm.) This is a set of HIERARCHICAL TESTS.

The "Tests that K-way effects are zero" is NOT a forward algorithm. It is actually a non-hierarchical set of tests, e.g., it eliminates all the marginal effects but preserves the two way associations and the interaction terms. This is kind of a strange option and one which may not be very useful to you (unless you have, perhaps, equal cell size experiments or other non-hierarchical designs).
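The degrees of freedom you should expect in the two "K-way" tables can be worked out by counting effects. Here is a minimal Python sketch, assuming four dichotomous variables so that every effect contributes exactly one parameter; the function names are illustrative.

```python
from math import comb

N_VARS = 4  # four dichotomies -> 2**4 = 16 cells

def df_kway(k, n=N_VARS):
    """df for 'K-way effects are zero': the number of effects of exactly order k."""
    return comb(n, k)

def df_kway_and_higher(k, n=N_VARS):
    """df for 'K-way and higher order effects are zero': effects of order k or above."""
    return sum(comb(n, order) for order in range(k, n + 1))

for k in range(1, N_VARS + 1):
    print(f"K = {k}: K-way df = {df_kway(k)}, "
          f"K-way and higher df = {df_kway_and_higher(k)}")
```

For example, dropping only the four-way term leaves 1 degree of freedom, while dropping the four-way term and all four three-way terms leaves 5. The same counting is one way to show how you obtained the degrees of freedom for a model.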

With the "Tests of PARTIAL associations", HILOG eliminates one term at a time and assesses the result. This section is not hierarchical either, but both this and some aspects of the "K-way effects" non-hierarchical sections are very helpful in deciding which, if any, effects can be eliminated from your model.

The "Estimates for Parameters" section is also very helpful. Not only will it give you the loglinear parameters you need to write the numeric equation for the saturated model (ONLY) (see the example in Guides 3/4), but the standardized Z-values can also assist you in selecting the most parsimonious model that can still explain the data. Pay attention to Z-values with an absolute value of 1.50 or larger. These parameters generally cannot be dropped.
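The Z-value screen can be sketched in Python as follows. Each Z is the parameter estimate divided by its standard error. The estimates, standard errors and term labels below are hypothetical and are NOT results from the GSS data; only the 1.50 threshold comes from the text.

```python
# Hypothetical (estimate, standard error) pairs; NOT results from the GSS data.
estimates = {
    "G*D": (0.180, 0.045),
    "Y*D": (0.095, 0.050),
    "G*E*Y*D": (0.010, 0.048),
}

THRESHOLD = 1.50  # in-class rule of thumb for |Z|

def screen(params, threshold=THRESHOLD):
    """Split parameters into 'keep' and 'candidate to drop' by the size of |Z|."""
    keep, maybe_drop = [], []
    for name, (lam, se) in params.items():
        z = lam / se  # standardized score for this lambda
        target = keep if abs(z) >= threshold else maybe_drop
        target.append((name, round(z, 2)))
    return keep, maybe_drop

keep, maybe_drop = screen(estimates)
print("Keep:", keep)
print("Candidates to drop:", maybe_drop)
```

Remember that in a hierarchical model a lower order term cannot be dropped while a higher order term that contains it is retained, so a small Z-value makes a term only a candidate for removal.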
 


ASSIGNMENT QUESTIONS

1. Your SPSS FREQUENCIES output and your MODEL SELECTION output

(2 points) Although your output does not have a large weight for the exercise, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these back against your output to give you more credit for understanding the material.

PLUS YOUR ANSWERS TO QUESTIONS 2 - 10 BELOW:

2. (2 points) What is the G2 (the likelihood ratio chi square), degrees of freedom and p-level for the hierarchical model that incorporates all the three-way interaction effects?

3. (2 points) Do your results suggest that the four variable interaction term can be deleted from a well-fitting model?
BRIEFLY, give the rationale behind your decision of why or why not.

4. (2 points) Based on the results, does it appear that any of the three-way interaction terms must be retained in order for the model to fit well?
If so, which interaction term(s) must be retained? (List all that apply.)
BRIEFLY give the rationale behind your decision.

5. (2 points) Based on the results, does it appear that any of the two-way association terms can be dropped from the model and yet the model will still fit the data well?
If so, which two way association(s) could be dropped? (List all that apply.)
BRIEFLY give the rationale behind your decision.

6. (2 points) Using the Z-score values associated with each parameter, which parameters look like they may be dropped and yet the model will still fit well? (List all that apply.)

7. (4 points) Use either text or class terminology to describe the model that you believe has the best fit.
Choose among the models in your output only.

How many degrees of freedom are in this model?
SHOW how you obtained the degrees of freedom.
What was the G2 for this model?
What was the p-level for the model you selected?
Briefly describe the rationale for your choice of this model and the results that support it.
PLEASE USE THE IN-CLASS CRITERIA (p > .20) FOR USING P-LEVELS TO SUPPORT A FINAL MODEL.

8. (2 points) Using your results, write out the loglinear equation WITH NUMBERS for the saturated model.
(NOTE: "Model Selection" does not give the grand mean or θ effect. For this assignment, it is OK either to put the Greek letter theta as a placeholder or simply to omit it.)

Use the lambdas from the "parameter estimates" section of  your output.

Be sure to label the variables in your equation. You can assign them the letters A, B, C and D as long as you provide the variable names that accompany each of the letters. You can also assign descriptive letters to the variables, e.g., G, E, Y and D (for gender, education, recoded year, and dadgene).

9. (1 point) Using the symbols (i.e., θ and λ), write out the loglinear equation for the model that you believe has the best fit.

(Recall that the HILOG program only generates the loglinear equation for the saturated model, although it will generate the degrees of freedom and G2 for a wide variety of hierarchical loglinear models. Therefore you can't use the parameter numbers if you have dropped any terms at all, since these will change somewhat with each new model. However, you CAN use the correct lambdas with superscripts and/or subscripts.)

10. (1 point) IN WORDS, briefly describe the results as implied by your best fitting model. This means talking about the associations and possible interactions among the variables in words, not presenting numeric loglinear results or symbols. Imagine that you are describing the results in a non-technical fashion to a colleague, a student in a class that you are teaching, or at a conference.
 
 

Susan Carol Losh
February 27 2017