THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 2
THE MODEL SELECTION (HILOG) LOGLINEAR SPSS PROGRAM
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
PLEASE NOTE: Examine the program variables for your model very carefully. I have created some new recodes for this (and possible future) assignments. The Model Selection program treats the first category of each variable, in arithmetic sequence, as "high" and the second as "low". So be sure to use DEGRECOD, RECYEAR, GENDER and DADGENE for this assignment. You will find the data in Blackboard under Course Documents, Data and Output. Use the 2006to2014B file!
Questions? Send me an email at: slosh@fsu.edu
And, REMEMBER! (Repeat aloud as needed:)
PLEASE BE CERTAIN TO USE A "FULLSIZE" SPSS VERSION (e.g., 23). Studentware type SPSS programs often will not work with a database this size.
The purpose of this exercise is to provide preliminary experience with testing and evaluating basic hierarchical loglinear models. You will analyze four variables using the 2598 valid WEIGHTED cases from the 2006 and 2014 General Social Survey datafile available in Blackboard. You will use the "Model Selection" program in SPSS. The datafile is in the Data and Output folder under Course Documents. In the GSS-NSF 2006 to 2014 folder, please use the GSS2006TO2014B.SAV (12.332 MB) file.
When you click on the GSS2006TO2014B.SAV file on a computer that has SPSS installed (e.g., in the LRC), SPSS will start within a minute or two and load the datafile into the SPSS spreadsheet.
Be sure the datafile says "Weight on" in the lower right hand corner.
The four variables in this model are:
GENDER: 1 = Male 2 = Female
RECYEAR (recoded year): 1= 2006 2 = 2014
DEGRECOD: (recoded degree) 0 = the respondent has a junior college degree or less 1 = the respondent has at least a BA degree
DADGENE: (The father's gene decides the sex of the baby; this is the earlier "BOYORGRL" variable) 1 = true (correct answer) 2 = false (incorrect answer)
(Note that there is no "i" in BOYORGRL.)
For your information, the 2 X 2 X 2 X 2 percentage table looks as follows:
2006
DEGREE LEVEL | JUNIOR COLLEGE OR LESS | BACCALAUREATE OR MORE |
GENDER | MALE | FEMALE | ROW N | MALE | FEMALE | ROW N |
FATHER'S GENE DETERMINES BABY'S SEX | 69.8% | 81.3% | 822 | 69.0% | 85.2% | 368 |
DOES NOT DETERMINE BABY'S SEX | 30.2% | 18.7% | 250 | 31.0% | 14.8% | 105 |
TOTAL (N) | 100.0% (430) | 100.0% (642) | 1072 | 100.0% (216) | 100.0% (257) | 473 |

2014
DEGREE LEVEL | JUNIOR COLLEGE OR LESS | BACCALAUREATE OR MORE |
GENDER | MALE | FEMALE | ROW N | MALE | FEMALE | ROW N |
FATHER'S GENE DETERMINES BABY'S SEX | 54.8% | 71.3% | 482 | 68.4% | 82.9% | 232 |
DOES NOT DETERMINE BABY'S SEX | 45.2% | 28.7% | 268 | 31.6% | 17.1% | 71 |
TOTAL (N) | 100.0% (321) | 100.0% (429) | 750 | 100.0% (133) | 100.0% (170) | 303 |
Source: General Social Surveys 2006 and 2014, Director, General Social Survey (NORC).
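If you want to check your SPSS cell frequencies against the percentage table, the column percentages and column Ns above can be turned back into approximate cell counts. The following is only a sketch in Python (not part of the SPSS assignment): the dictionary layout and the function name are my own illustration, and the counts are rounded, so they may differ slightly from the weighted SPSS frequencies.

```python
# Percent answering "true" and the column N, copied from the table above.
# Keys are (year, degree level, gender); values are (percent true, column N).
table = {
    (2006, "JC or less", "Male"): (69.8, 430),
    (2006, "JC or less", "Female"): (81.3, 642),
    (2006, "BA or more", "Male"): (69.0, 216),
    (2006, "BA or more", "Female"): (85.2, 257),
    (2014, "JC or less", "Male"): (54.8, 321),
    (2014, "JC or less", "Female"): (71.3, 429),
    (2014, "BA or more", "Male"): (68.4, 133),
    (2014, "BA or more", "Female"): (82.9, 170),
}


def approximate_counts(table):
    """Return {(year, degree, gender, answer): count}, rounded to whole cases."""
    counts = {}
    for (year, degree, gender), (pct_true, n) in table.items():
        true_n = round(n * pct_true / 100)
        counts[(year, degree, gender, "true")] = true_n
        counts[(year, degree, gender, "false")] = n - true_n
    return counts


counts = approximate_counts(table)
total = sum(counts.values())
print(total)  # sum of the eight column Ns
for cell, count in sorted(counts.items()):
    print(cell, count)
```

The total should agree with the 2598 valid weighted cases, and the sixteen counts are the cells of the 2 X 2 X 2 X 2 table that the loglinear model tries to reproduce.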
The 2006 and 2014 GSS samples were face-to-face area probability samples of the lower 48 United States, with random selection within households.
This exercise uses the "Model Selection" program in SPSS. This "HILOG" program (short for HIerarchical LOGlinear) estimates ONLY hierarchical loglinear models. With the Model Selection program, you specify only the higher order terms and the program fills in the rest. If you use the saturated model, the program will calculate all possible parameters.
The HILOG program provides the numeric lambda parameters of the general cell frequency equation only for the fully saturated model. It also does not estimate the grand mean (equiprobable model) term. Whichever of the loglinear, ordinal, or logistic regression packages you use, now or later, you really need to read the output VERY carefully to ensure that you estimated the model that you thought you did, and to understand the output terms.
Unlike the "General" loglinear program, the "Model Selection" program does not print the parameter estimates for an unsaturated model that you select. However, this program can be very helpful in suggesting and testing what you hope will be a final model.
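To see what "filling in the rest" means for a hierarchical model, here is a minimal Python sketch (not SPSS, and not part of the assignment). It assumes the letters G = GENDER, E = DEGRECOD, Y = RECYEAR and D = DADGENE; the function name and the example model are hypothetical.

```python
from itertools import combinations


def expand_hierarchical(generating_class):
    """Return every term implied by the highest-order terms of a hierarchical model."""
    terms = set()
    for term in generating_class:
        letters = sorted(term)  # letters are alphabetized, so "GED" is stored as "DEG"
        for size in range(1, len(letters) + 1):
            for subset in combinations(letters, size):
                terms.add("".join(subset))
    # List one-way terms first, then two-way, and so on.
    return sorted(terms, key=lambda t: (len(t), t))


# A made-up model with one three-way term and one two-way term.
print(expand_hierarchical(["GED", "YD"]))

# The saturated model: the four-way term implies every lower order term.
print(len(expand_hierarchical(["GEYD"])))
```

Specifying only the four-way term yields all 15 effects (4 one-way, 6 two-way, 4 three-way, 1 four-way), which is why the saturated model "calculates all possible parameters".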
In this exercise you will evaluate a four variable model. You will interpret statistics from the saturated model. Then you will examine the various backwards elimination models to estimate what appears to be the best model to use with the data. Sometimes what the computer program tells you is the best fitting model under its default conditions will diverge from our class material. Be sure to go with the in-class material in this case! The program uses a mechanical "data dredger" rule to make its decision, and we hope that human judgement will take more of the nuances into account. For example, we use a 0.20 probability cutoff and the program uses a 0.05 cutoff by default. (NOTE: On the front of the Model Selection menu, you can change the "probability for removal" from 0.05 to 0.20 to make life a bit easier!) Under the default, some models that "underfit" when reproducing the observed table could "slide through" and meet the SPSS criteria for a good model because they have probability levels over 0.05 but less than 0.20.
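The difference between the two cutoffs can be illustrated with a small Python sketch (again, not SPSS). The chi-square upper-tail function below uses only the standard library; the G2 and degrees of freedom in the example are made-up values, not results from the course data, and the function names are my own.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic x with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a = 1.0
        q = math.exp(-half)               # closed form for df = 2
    else:
        a = 0.5
        q = math.erfc(math.sqrt(half))    # closed form for df = 1
    # Step up two df at a time: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1))
        a += 1.0
    return q


def fits(g2, df, cutoff):
    """True when the model's p-level exceeds the cutoff (the model is judged to fit)."""
    return chi2_sf(g2, df) > cutoff


g2_example, df_example = 2.0, 1  # hypothetical values for illustration only
print(round(chi2_sf(g2_example, df_example), 4))
print(fits(g2_example, df_example, 0.05))  # SPSS default cutoff
print(fits(g2_example, df_example, 0.20))  # in-class cutoff
```

With these made-up numbers the p-level is about 0.157, so the model passes the 0.05 default but fails the in-class 0.20 criterion. This is exactly the kind of model that can "slide through".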
You will examine the Chi-squared statistics, the degrees of freedom, the probability levels and the Z-values (the standardized scores that correspond to the numeric lambda estimates for the saturated model). All of these results should give you some clues about the best model for the data. In the process you will evaluate some different models for the data in the questions at the end of this page.
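As a reminder of where the likelihood ratio chi-square comes from, here is a minimal Python sketch of G2 = 2 * sum(observed * ln(observed / expected)). It is an illustration only: the 2 X 2 counts are approximate values for the 2006 "junior college or less" respondents (gender by DADGENE), rounded from the percentage table, and the independence model is used simply because its expected counts are easy to compute by hand.

```python
import math


def g2(observed, expected):
    """Likelihood ratio chi-square: 2 * sum(O * ln(O / E)) over non-empty cells."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def independence_expected(table):
    """Expected counts for a two-way table (a list of rows) under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: male, female. Columns: true, false. Approximate (rounded) counts.
table = [[300, 130], [522, 120]]
flat_obs = [o for row in table for o in row]
flat_exp = [e for row in independence_expected(table) for e in row]

print(round(g2(flat_obs, flat_exp), 2))  # lack of fit of the independence model
print(g2(flat_obs, flat_obs))            # a model that reproduces the table exactly
```

A model that reproduces the observed table exactly, as the saturated model does, has G2 = 0. The larger the G2 relative to its degrees of freedom, the worse the model fits.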
You will also write out the numeric general cell frequency equation using the lambdas for the saturated loglinear model that describes these four variables. Then, you will interpret the findings in this 4-way table in words. The percentage tables above should help you with the verbal descriptions.
You do NOT need to include the "grand mean" (theta) parameter in this exercise.
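For orientation, the general form of the saturated four variable equation can be written as follows. This is a sketch in one common notation, assuming the letters G (gender), E (education), Y (recoded year) and D (dadgene) with subscripts i, j, k and l for their categories; follow the notation in the class guides if it differs.

```latex
\ln F_{ijkl} = \theta
  + \lambda^{G}_{i} + \lambda^{E}_{j} + \lambda^{Y}_{k} + \lambda^{D}_{l}
  + \lambda^{GE}_{ij} + \lambda^{GY}_{ik} + \lambda^{GD}_{il}
  + \lambda^{EY}_{jk} + \lambda^{ED}_{jl} + \lambda^{YD}_{kl}
  + \lambda^{GEY}_{ijk} + \lambda^{GED}_{ijl} + \lambda^{GYD}_{ikl} + \lambda^{EYD}_{jkl}
  + \lambda^{GEYD}_{ijkl}
```

The equation has 4 one-way, 6 two-way, 4 three-way and 1 four-way lambda terms, plus the grand mean theta that HILOG does not report.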
Open the SPSS program.
You can do so by clicking on the GSS2006TO2014B.SAV file in the Course Documents-->Data and Output folder in Blackboard (use the "B" datafile).
SPSS may take a minute or so to load all the data.
Under the Analyze menu at the top of the page, select Descriptive Statistics, then select Frequencies.
Select the variables GENDER, RECYEAR, DEGRECOD and DADGENE.
Click OK to run the frequencies for these four variables.
Make sure that all the missing data are, in fact, coded as missing (including any system missing "sysmis" values) and that the percentages shown are ONLY for valid cases.
Depending on the variable, up to 7048 valid cases are possible for the years 2006 and 2014.
Now, print or save the Frequencies tables for your output to turn in to me.
Under the Analyze menu at the top of the page, select Loglinear, then select Model Selection...
For factors, select RECYEAR (range 1-2), DEGRECOD (range 0-1), GENDER (range 1-2), and DADGENE (range 1-2). You must click on the Define Range box each time you enter a variable, and enter the minimum and maximum values for each variable.
Click on Model.
You will start with the saturated model already chosen for you, so click Continue.
Click on the Options... button.
Left click to select Parameter estimates.
Left click to select Association table.
Click Continue.
Remember you can change the probability for removal on the front menu to 0.20.
Now click OK to run the program.
Well, running the computer program is one thing, but being able to view ALL of your output is something else again. If you use SPSS 23 or later, all your output should be present and easily visible.
Be sure to print your total loglinear "Model Selection" program output to turn in (OR POST as a pdf) to me with your assignment.
The Model Selection Program gives you a diversity of output.
"Tests that K-way and higher order effects are zero" is a BACKWARDS elimination algorithm. (Model Selection does not offer a forward selection algorithm.) This is a set of HIERARCHICAL TESTS.
The "Tests that K-way effects are zero" is NOT a forward algorithm. It is actually a non-hierarchical set of tests, e.g., it eliminates all the marginal effects but preserves the two way associations and the interaction terms. This is kind of a strange option and one which may not be very useful to you (unless you have, perhaps, equal cell size experiments or other non-hierarchical designs).
With the "Tests of PARTIAL associations", HILOG eliminates one term at a time and assesses the result. This section is not hierarchical either, but both this and some aspects of the "K-way effects" non-hierarchical sections are very helpful in deciding which, if any, effects can be eliminated from your model.
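The degrees of freedom in the "K-way and higher order effects" tests follow from counting terms. Because each of the four variables has two categories, every term carries one degree of freedom, so the df for a test is just the number of terms being set to zero. A short Python sketch (not SPSS; the function names are my own illustration):

```python
from math import comb


def kway_term_counts(n_vars):
    """Number of k-way terms among n_vars dichotomous variables, for k = 1..n_vars."""
    return {k: comb(n_vars, k) for k in range(1, n_vars + 1)}


def df_k_and_higher(n_vars):
    """df for testing that all k-way and higher order effects are zero."""
    counts = kway_term_counts(n_vars)
    return {k: sum(counts[j] for j in range(k, n_vars + 1)) for k in counts}


print(kway_term_counts(4))  # {1: 4, 2: 6, 3: 4, 4: 1}
print(df_k_and_higher(4))   # {1: 15, 2: 11, 3: 5, 4: 1}
```

For example, a model containing all the three-way terms drops only the single four-way term, so it has 1 degree of freedom. Note that this simple count applies only when every variable is a dichotomy.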
The "Estimates for Parameters" section is also very helpful. Not only will it give you the loglinear parameters you need to write the numeric equation for the saturated model (ONLY) (see the example in Guides 3/4), but the standardized Z-values can assist you in selecting the most parsimonious model that can still explain the data. Pay attention to Z-values with an absolute value of 1.50 or larger. These parameters generally cannot be dropped.
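The Z-value screen can be expressed as a simple rule. The Python sketch below is only an illustration: the parameter labels and Z-values are invented, not taken from any output, and the function name is hypothetical.

```python
def flag_parameters(z_values, threshold=1.50):
    """Split parameters into those to keep (|Z| >= threshold) and drop candidates."""
    keep = sorted(name for name, z in z_values.items() if abs(z) >= threshold)
    candidates = sorted(name for name, z in z_values.items() if abs(z) < threshold)
    return keep, candidates


# Made-up Z-values for illustration only.
hypothetical_z = {"G*D": -6.10, "E*D": 2.45, "Y*E*D": 1.62, "G*E*Y*D": 0.31}

keep, candidates = flag_parameters(hypothetical_z)
print("Generally cannot be dropped:", keep)
print("Candidates for dropping:", candidates)
```

Remember that the model is hierarchical: a lower order term with a small Z-value still has to stay in the model if a higher order term that contains it is retained.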
1. Your SPSS FREQUENCIES output and your MODEL SELECTION output
(2 points) Although your output does not have a large weight for the exercise, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these back against your output to give you more credit for understanding the material.
PLUS YOUR ANSWERS TO QUESTIONS 2 - 10 BELOW:
2. (2 points) What is the G2 (the likelihood ratio chi square), degrees of freedom and p-level for the hierarchical model that incorporates all the three-way interaction effects?
3. (2 points) Do your results suggest that the four variable interaction term can be deleted from a well-fitting model?
BRIEFLY, give the rationale behind your decision of why or why not.
4. (2 points) Based on the results, does it appear that any of the three-way interaction terms must be retained in order for the model to fit well?
If so, which interaction term(s) must be retained? (List all that apply.)
BRIEFLY give the rationale behind your decision.
5. (2 points) Based on the results, does it appear that any of the two-way association terms can be dropped from the model and yet the model will still fit the data well?
If so, which two-way association(s) could be dropped? (List all that apply.)
BRIEFLY give the rationale behind your decision.
6. (2 points) Using the Z-score values associated with each parameter, which parameters look like they may be dropped and yet the model will still fit well? (List all that apply.)
7. (4 points) Use either text or class terminology to describe the model that you believe has the best fit.
Choose among the models in your output only.
How many degrees of freedom are in this model?
SHOW how you obtained the degrees of freedom.
What was the G2 for this model?
What was the p-level for the model you selected?
Briefly describe the rationale for your choice of this model and the results that support it.
PLEASE USE THE IN-CLASS CRITERIA (p > .20) FOR USING P-LEVELS TO SUPPORT A FINAL MODEL.
8. (2 points) Using your results, write out the loglinear equation WITH NUMBERS for the saturated model.
(NOTE: "Model Selection" does not give the grand mean effect. For this assignment, it is OK either to put the Greek letter theta as a placeholder or simply to omit it.)
Use the lambdas from the "parameter estimates" section of your output.
Be sure to label the variables in your equation. You can assign them the letters A, B, C and D as long as you provide the variable name that accompanies each letter. You can also assign descriptive letters, e.g., G, E, Y and D (for gender, education, recoded year, and dadgene).
9. (1 point) Using the symbols (i.e., lambda and theta), write out the loglinear equation for the model that corresponds to the model you believe has the best fit.
(Recall that the HILOG program only generates the loglinear equation for the saturated model, although it will generate the degrees of freedom and G2 for a wide variety of hierarchical loglinear models. Therefore you can't use the parameter numbers if you have dropped any terms at all, since these will change somewhat with each new model. However, you CAN use the correct lambdas with superscripts and/or subscripts.)
10. (1 point) IN WORDS, briefly describe the results as implied by your best fitting model. This means talking about the associations and possible interactions among the variables in words, not presenting numeric loglinear results or symbols. Imagine that you are describing the results in a non-technical fashion to a colleague, a student in a class that you are teaching, or at a conference.
Susan Carol Losh
February 27 2017