THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 1: ISSUES IN MODELING
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
KEY TAKEAWAYS
Many variables of interest in education, the behavioral and social sciences, business, and other fields are distinctly non-numeric and often non-ordered as well. Thus, the phenomena we wish to explain may be categorized as a nominal or ordinal dependent variable. As you know, most multivariate statistical analysis methods (including structural equation modeling) assume at least "quasi" numeric dependent variables.
What is the student who wishes to investigate a nominal or ordinal dependent variable* to do? If the response or dependent variable has ordered categories, analysts will often treat that variable as numeric, and then use statistical methods such as multiple regression or structural equation models. However, these methods can create problems. Because the dependent variable isn't numeric, let alone normally distributed, serious violations of linear model assumptions may occur, and these can produce misleading interpretations of the data. For example, some of these violations produce underestimates of standard errors. Pragmatically for you, this means that some results that are not statistically significant (i.e., not reliably different from zero) can appear to be statistically significant, leading to inaccurate interpretations on your part.
*Please see the review of types of variables at the bottom of this Web site.
If the dependent or response variable is clearly not numeric and the categories are not ordered, techniques such as regression generally don't make any sense at all. Another possibility is to use logistic regression. Logistic regression is very useful; however, most students have only learned it using dichotomous dependent variables (binary logistic regression). Furthermore, the uninitiated persist in interpreting logit coefficients as additive and linear like OLS regression coefficients when (in exponentiated form) they are multiplicative and nonlinear.
Part of what we will do in this course is to learn to interpret logistic regression coefficients more accurately. We will also extend this technique to dependent non-numeric variables with several categories, not just two.
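As a preview, here is a minimal sketch in Python using statsmodels (the data are simulated and purely hypothetical, not from any real study): the raw logit coefficient is additive on the log-odds scale, while its exponentiated form is a multiplicative odds ratio.

```python
# A minimal sketch of logit interpretation; the data are simulated,
# not taken from any real study.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
years_ed = rng.integers(8, 21, size=500)         # hypothetical predictor
logit_p = -6 + 0.35 * years_ed                   # true log-odds, for simulation
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))  # binary outcome

X = sm.add_constant(years_ed.astype(float))
fit = sm.Logit(y, X).fit(disp=False)

b = fit.params[1]
print(f"logit coefficient: {b:.3f}")   # additive on the LOG-odds scale
print(f"odds ratio: {np.exp(b):.3f}")  # multiplicative: each extra year
                                       # multiplies the odds by this factor
```

The point of the two print statements: the first number adds to the log-odds, while its exponentiated form multiplies the odds, which is why reading logit coefficients like OLS slopes misleads.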
Yet a third possibility is to use cross-tabulation tables and "control variables" with various "nonparametric" statistical measures. This is the long-venerated tradition of "physical control" (as opposed to "statistical control"), which presents its own set of problems.
Over the past 45-plus years (see the last chapter on history in the Agresti text!) several techniques have been developed to address analytic issues with categorical response or "dependent" variables. Although the statistical theory is older than that, these iterative methods were not practical until the development of high-speed computers and began to spread in the early 1970s. This seminar focuses on these techniques. There are many models for categorical response variables. We will examine several of them this semester. I don't expect you to remember all this information right now. But mentally "file away" these different techniques because we will revisit them later.
For example, let variable A be the individual's educational level; B, occupational type (such as professional, manager, clerical, etc) and C, the individual's overall income level.
In your causal model, education is a direct cause of occupational type. In turn, type of occupation is a direct cause of income.
No, these aren't experimental data, and for obvious reasons cannot be, since we generally don't randomly hand out educational degrees and types of occupations. For some "ground rules" on establishing causality in non-experimental data, see below in this guide.
I also recommend:
Barbara Schneider, Martin Carnoy, Jeremy Kilpatrick, William H. Schmidt & Richard J. Shavelson (2007). Estimating Causal Effects Using Experimental and Observational Designs. American Educational Research Association. You can download the ENTIRE BOOK in PDF from the AERA website.
Suppose your initial causal model looks like this:
EDUCATIONAL LEVEL → OCCUPATIONAL TYPE → INDIVIDUAL INCOME
There isn't a direct effects arrow from Education to Income shown. Can we drop the direct causal arrow from Educational Level to Individual Income? Loglinear models allow us to test this.
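As a rough sketch of how such a test works (a Poisson loglinear model fit in Python with statsmodels; the cell counts below are entirely made up for illustration), we compare a model that includes the Education-Income association with one that omits it, using a likelihood-ratio chi-square:

```python
# A sketch of a loglinear likelihood-ratio test on a hypothetical 2x2x2 table.
from itertools import product

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# Made-up cell counts for Education (A), Occupation (B), Income (C)
df = pd.DataFrame(list(product(["low", "high"], repeat=3)),
                  columns=["educ", "occ", "income"])
df["n"] = [45, 20, 25, 10, 15, 30, 20, 60]

# Full model: all two-way associations ([AB][BC][AC])
full = smf.glm("n ~ educ*occ + occ*income + educ:income",
               data=df, family=sm.families.Poisson()).fit()
# Reduced model: drops the direct Education-Income term ([AB][BC])
reduced = smf.glm("n ~ educ*occ + occ*income",
                  data=df, family=sm.families.Poisson()).fit()

g2 = reduced.deviance - full.deviance   # likelihood-ratio chi-square
ddf = reduced.df_resid - full.df_resid
print(f"G2 = {g2:.2f}, df = {ddf}, p = {stats.chi2.sf(g2, ddf):.3f}")
```

A non-significant G2 would suggest that the direct Education-Income arrow can be dropped without losing fit to the data.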
One possibility that occurs to analysts when the dependent variable is dichotomous is to use the Linear Probability Model (LPM). The Linear Probability Model is familiar to many students, even if they didn't know the "proper name" for it. In the LPM we use variations on multiple regression (e.g., Weighted Least Squares) to predict a dichotomous dummy variable coded 1 or 0. In the LPM, we have a straightforward, linear, additive model in which the B coefficients are interpreted as raising or lowering the probability of a score of 1. Further, we have straightforward concepts such as the Explained and Unexplained Sums of Squares that are familiar to you from OLS regression (ordinary least squares) and Analysis of Variance (ANOVA). So why go to the trouble to learn an entirely new set of techniques, complete with a new vocabulary?
Unfortunately, the LPM is riddled with problems. The dependent variable is hardly continuous (a basic regression assumption), and the model produces heteroscedasticity, in which the variance of the dependent variable depends on the scores of the independent variable(s). One consequence, again, is that you may think certain B coefficients are statistically significant when in fact they are not. The variance of the dependent variable is also truncated: for a 0/1 variable the variance is at most .25 (reached when the proportion of 1s is .50). And you may get impossible predicted values for the dependent variable, larger than 1 or less than 0. So this is not an easy way out!
If you do decide to use the LPM, the least you can do is use Weighted Least Squares instead of OLS. A common choice weights each case by the inverse of its estimated error variance from a first-stage OLS equation, 1/[ŷ(1-ŷ)], where ŷ is the OLS predicted value. WLS partially addresses the heteroscedasticity problem by correcting the standard error estimates.
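Here is a minimal sketch of that two-stage procedure, assuming simulated data and statsmodels; the weight for each case comes from its first-stage OLS predicted probability:

```python
# A sketch of the WLS correction for the Linear Probability Model;
# the data are simulated for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=400)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))   # dichotomous 0/1 outcome

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                    # first stage: plain OLS

# Clip predicted probabilities away from 0 and 1 so the weights stay defined
p = np.clip(ols.fittedvalues, 0.01, 0.99)
wls = sm.WLS(y, X, weights=1 / (p * (1 - p))).fit()  # second stage

print("OLS standard errors:", ols.bse)      # distorted by heteroscedasticity
print("WLS standard errors:", wls.bse)
```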
So here is our dilemma: we have many variables of interest to us that are definitely non-numeric. We want to be able to explain and predict them. If we try numeric solutions to analyze these dependent variables, we either risk total nonsense (in the case of multinomial dependent variables, e.g., ethnic group) or misleading results. Some of the common methods taught in earlier statistics classes (two-variable "non-parametric tests"; three-way cross-tabulations and comparisons of partial correlations in each subgroup; binary logistic regression) are either too restrictive or do not allow tests of statistical significance. Other techniques (e.g., the LPM) have many unsatisfactory ramifications.
On the other hand, what we have in this course is a set of techniques that were especially developed for non-numeric dependent variables.
Much of the research process centers on identifying the true causal or "independent" variables. And much of various forms of modeling, including structural equation models and loglinear models, centers on the causal order of variables. Statistical analysis canNOT "prove" causal order, but it can test how well the empirical data "fit" a causal model that is defined a priori.
Concepts of causality are critical: they tell us what is possible, what can be changed and what is difficult, if not impossible, to change. For example, if you are convinced that biological factors cannot be overcome, you probably will not believe that visually impaired children can compensate for their disability. Causality tells us what are the “prime movers” of the phenomena that we observe.
According to science rules, definitive proof via empirical testing does not exist. Science uses the term "proof" (or, rather, "disproof") differently from the way attorneys or journalists do. For example, a correlation could have many causes, only some of which can be identified. Later work can show earlier causes to be spurious, that is, both cause and effect depend on some prior causal (often extraneous) variable (see the charts on ice cream consumption and fire engines).
Before we can discuss causality, even when only two variables (one independent, one dependent) are involved, we must complete two prior steps. First, we must establish that the relationship is REAL, and of at least moderate strength, so we know it is not trivial. Second, we ask about the causal ordering of the variables. To evaluate the causal status of a bivariate relationship, we need to introduce at least one additional variable.
However, even with two variables, we still need to assess whether we can designate an independent variable (cause) and a dependent or response variable (effect). If we can designate an independent variable, the correlation is asymmetric. If we cannot designate an independent variable, it is symmetric.
This is relatively easy with experiments. In creating an intervention, you also created the causal variable. For example, if you assigned cigarette smokers to a nicotine patch group versus a placebo patch group, active nicotine patch (yes or no) becomes the independent variable.
However, in many cases, we cannot manipulate or intervene with variables, such as sex, age or ethnicity.
In other cases, we are dealing with naturally observed variables, which is often the case with surveys, ethnographies, content analysis, and many other methods. So we need some guidelines to establish plausible causal order among variables in non-experimental studies.
Statistics, by the way, may be used to DISCONFIRM a postulated causal order, but NEVER, NEVER, NEVER to establish causal order. That's part of the "rules of the game."
CAUSALITY IN NON-EXPERIMENTAL DATA: AN EXAMPLE
Cancerous Human Lung
This dissection of human lung tissue shows light-colored cancerous tissue in the center of the photograph. While normal lung tissue is light pink in color, the tissue surrounding the cancer is black and airless, the result of a tarlike residue left by cigarette smoke. Lung cancer accounts for the largest percentage of cancer deaths in the United States, and cigarette smoking is directly responsible for the majority of these cases. "Cancerous Human Lung," Microsoft(R) Encarta(R) 96 Encyclopedia. (c) 1993-1995 Microsoft Corporation. All rights reserved.
Most people--and most scientists--accept that smoking cigarettes causes lung cancer, although the evidence (for humans) is strictly correlational rather than experimental. There are many topics where it is neither possible--nor desirable--to use the experimental method. To accept such correlational evidence, it will help to examine the rules below. (SCL)
Many scientists believe that the ONLY way to establish causality is through randomized experiments. However, a moment's reflection will convince you that this cannot be so. Most people now accept that smoking cigarettes causes lung cancer (see the Encarta selection above)--yet no society has ever randomly assigned half its population to smoke cigarettes and the other half not (although such experiments have been done with laboratory rats). This causal conclusion about smoking and lung cancer in human beings is based on correlational evidence, i.e., observing the systematic covariation of two (or more) variables in a research study, which is exactly what we do when we examine the association between two variables. Cigarette smoking and lung cancer are both "naturalistic" variables, i.e., we must accept the data as nature gave them to us.
There is no doubt that the results from careful, well-controlled experiments are typically easier to interpret in causal terms than results from other methods. However, as you can see, causal inferences are often drawn from correlational studies as well. Non-experimental methods must use a variety of ways to establish causality and ultimately must rely on statistical control, rather than experimental control.
If one variable causes a second variable, they should correlate; thus causation implies correlation.
However, two variables can be associated without having a causal relationship, for example because a third variable is the true cause of both the "original" independent and dependent variables. For instance, there is a statistical correlation over months of the year between ice cream consumption and the number of assaults.
Does this mean ice cream manufacturers are responsible for crime? No! The correlation occurs statistically because the hot temperatures of summer cause both ice cream consumption and assaults to increase. Thus, correlation does NOT imply causation. Other factors besides cause and effect can create the illusion of an observed correlation.
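A small simulation (entirely synthetic numbers, invented for illustration) makes the point concrete: temperature drives both variables, and the raw correlation all but vanishes once we partial temperature out:

```python
# A synthetic illustration of a spurious correlation via a common cause.
import numpy as np

rng = np.random.default_rng(2)
temp = rng.normal(75, 15, size=1000)                   # the real cause
ice_cream = 2.0 * temp + rng.normal(0, 20, size=1000)  # driven by temperature
assaults = 0.5 * temp + rng.normal(0, 10, size=1000)   # also driven by temperature

print(np.corrcoef(ice_cream, assaults)[0, 1])  # sizable "raw" correlation

# Partial correlation: residualize both variables on temperature
resid_ic = ice_cream - np.polyval(np.polyfit(temp, ice_cream, 1), temp)
resid_as = assaults - np.polyval(np.polyfit(temp, assaults, 1), temp)
print(np.corrcoef(resid_ic, resid_as)[0, 1])   # near zero: no direct link
```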
If one variable causes a second, the cause is the independent variable (also called an explanatory variable or predictor). The effect is called the dependent variable (sometimes it is called the criterion, target, or response variable).
If you can designate a distinct cause and effect, the relationship is called asymmetric.
Two variables may be associated but we may be unable to designate cause and effect. These are symmetric relationships.
Since we know that we cannot use experimental treatments on naturalistic variables to determine cause and effect, yet we know that scientists do draw causal conclusions in nonexperimental studies, here is a set of helpful rules for tentatively establishing causality in correlational data.
GUIDE (1) TIME ORDER. The independent variable came first in time, prior to the second variable.
EXAMPLE: Gender and race are fixed at birth.
GUIDE (2) EASE OF CHANGE. Not only did the independent variable come first, but it is harder to change. The dependent variable is easier to change.
EXAMPLE: One's gender is definitely harder to change than scores on an assessment test or years of school. One's chronological age is not usually changed by attitudes, values, education, or much of anything else.
GUIDE (3) "MAJORITY RULE." The independent variable is the cause for most people.
EXAMPLES:
Although some people become so fed up with their jobs that they return to school to train for a better job, most people complete their education prior to obtaining a regular year-round, full-time job.
Most people marry prior to having children (although some people have their children first, then marry as a result).
GUIDE (4) NECESSARY OR SUFFICIENT. If one variable is a necessary or sufficient condition for the other variable to occur, or a prerequisite for the second variable, then the first variable is the cause or the independent variable.
EXAMPLES:
A certain type of college degree is often required for certain jobs. (Necessary)
At most universities, publications are a prerequisite for being awarded tenure. (Necessary but not sufficient.)
If you can come up with the money, you almost certainly can purchase a meal. (Necessary and usually sufficient.)
GUIDE (5) GENERAL TO SPECIFIC. If two variables are on the same overall topic and one variable is quite general and the other is more specific, the general variable is usually the cause.
EXAMPLE: Overall ethnic intolerance influences attitudes toward Hispanics.
GUIDE (6) THE "GIGGLE" OR "SANITY" FACTOR. If reversing the causal order of the two variables seems illogical and makes you laugh, reverse the causal order back.
EXAMPLE: We don't believe choosing a specific college major or engaging in a particular sport determines one's gender.
These rules become important not only because you need to establish causal order to do most multivariate analyses, but because they will help you decide which parameters to keep and which to drop in more advanced analyses. Unfortunately, some causal assertions that turn out to be causally problematic may initially seem quite reasonable to the consumer. Consider these common but mistaken claims about causality:
1. Only experiments can be used to make causal statements.
ANSWER: Why is this silly? Use the guidelines above to investigate causality in non-experimental data. Ninety-six percent of Americans believe that smoking causes lung cancer but the data on cigarette smoking and cancer are NOT experimental (with the exception of a few poor rats.)
2. Well, yes, but the smoking data are based on epidemiological studies with thousands of cases.
ANSWER: Why is this silly? The number of cases has NOTHING TO DO with causality. Having a large database means your estimates are relatively stable, they have low sampling error. But low sampling variability has nothing to do with causality. Examine the six guidelines above.
3. Nominal variables can serve as causal variables but numeric variables (interval or ratio) cannot be independent variables.
ANSWER: Why is this silly? The level of measurement has NOTHING TO DO with causality. You may be able to do more arithmetically complex statistics with numeric data but that gives it no special causal status. To believe a statement such as number 3 means that you believe that gender (nominal) can have causal status but variables such as age or years of education (ratio) cannot have causal status. A moment's reflection will show you how silly that is. For example, years of education is one of the most powerful predictors of how people live their lives. Do you honestly believe that education has no effect on the occupation you enter, the salary you earn, the health practices you use, how you vote, or the television programs you watch?
Use one of the guidelines above to determine causal status, not the measurement level of your variables.
Some students who "survived" any of my Introductory Statistics or Methods classes may remember this material and how fanatical I am about students knowing the kind of data they are working with. Astoundingly (since the type of data you have is a major determinant of the kind of statistical analysis you can do), most introductory textbooks relegate "levels of data" to at most a page or two. The section below is intended as a review of this material as well as an amplification for those who have not had an in-depth treatment of the topic before.
A REVIEW OF TYPES OF VARIABLES
Variables can be described as discrete or continuous.
CONTINUOUS VARIABLES can take on any of an infinity of values. If you use very fine measurements, for example "mils," it is possible to describe income in U.S. dollars in almost infinitely fine variations. Similarly, age can even be divided into "nanoseconds." Obviously, this only makes sense if the data take on NUMERIC values; continuous variables are numeric.
DISCRETE VARIABLES take on only a limited number of values, and cannot be infinitely subdivided into finer and finer measures in the way that continuous variables can. Very often, discrete variables take on integer values, such as "1" "2" or "90". The number of books in your library can be counted but not subdivided past the integer level.
Nominal, ordinal, and interval-ratio variables are very different types of category systems. These form a cumulative and hierarchical set of data properties: nominal properties also hold for ordinal and interval data, and ordinal properties also hold for interval data. The reverse does NOT hold. Interval and ratio data are numeric, so arithmetic operations can be used. Nominal and ordinal data are categorical, not numeric; it is nonsense to use arithmetic operations on nominal or ordinal data.
Each variable you analyze is comprised of categories and forms a "category system." When in doubt about the kind of data you have, review the category values (not the empirical distribution of categories) to see the kinds of properties that category system contains.
NOMINAL VARIABLES
With nominal variables, you can tell whether two cases or instances fall into the same category or into different categories. Thus, you can sort all cases into mutually exclusive, exhaustive categories. That's it!
Examples of nominal variables include:
Your Zodiac sign
Gender
Ethnicity and
Religious affiliation (or denomination)
None of these variables have categories that can be ordered into more or less, or higher or lower.
Nominal variables are also sometimes called categorical variables or qualitative variables. The categories are not only not numbers, they do not have any inherent order.
Try these examples:
Who is more? Koreans or Turks? More WHAT? Country of origin is NOT a number or even ordered.
Who is "better"? Women or Men? Better at
WHAT?
If you suspect that ranking the categories (NOTE: NOT the cases
within the categories) would start a war, you probably have nominal
variables.
ORDINAL VARIABLES
With ordinal variables, the categories themselves can be rank-ordered from highest to lowest.
This means the scores must FIRST be rank-ordered from highest to lowest (or vice versa) before you can use any ordinal measures. Like runners in a race, we can rank scores--and especially the categories themselves--from first to last, most to least, or highest to lowest.
In rank-ordered cases (as opposed to rank-ordered categories), we can literally rank order the finishers in a race or the students by their grade point average (first in class, second in class, and so on down to last in class). Notice that the intervals between cases probably are not the same (or equal). The class valedictorian may have a straight-A or 4.0 average, the salutatorian a 3.6, the third student a 3.5, and so on.
We can also rank-order the categories of a variable in ordinal data. One example is a Likert, or rank-ordered, scale. Respondents are given a statement, such as "I like President Obama," then asked if they:
Strongly Agree, Agree, Disagree, or Strongly Disagree with that statement.
We can surmise that someone who "strongly agrees" supports that statement more intensely than someone who "agrees"--but we don't know how much more intensely.
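In software, one reasonable way to respect this (a sketch using pandas; the responses shown are hypothetical) is to store a Likert item as an ORDERED categorical, which preserves the ranking without pretending the gaps between categories are equal:

```python
# A sketch: a Likert item as an ordered categorical (hypothetical responses).
import pandas as pd

levels = ["Strongly Disagree", "Disagree", "Agree", "Strongly Agree"]
responses = pd.Categorical(
    ["Agree", "Strongly Agree", "Disagree", "Agree"],
    categories=levels, ordered=True)

# Order comparisons are legal; arithmetic (e.g., a mean) is not.
print(responses.min(), "<", responses.max())
```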
Other types of ordinal data include:
the order of finish (e.g., class rank or a horse race)
"yes-no" experiences (someone who answers "yes" to "Do you play the lottery?" clearly plays more than someone who answers "no"), or
collapses of numeric data into categories with unequal widths or intervals (e.g., collapsing years of education into degree level), as in the sketch below.
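For that last case, here is a sketch using pandas' cut function (the bin edges and degree labels below are hypothetical) that collapses a numeric variable into ordered categories with unequal widths:

```python
# A sketch: collapsing years of schooling into unequal-width degree levels.
import pandas as pd

years = pd.Series([8, 11, 12, 14, 16, 18, 20])
degree = pd.cut(years,
                bins=[0, 11, 12, 15, 16, 25],   # unequal interval widths
                labels=["< HS", "HS", "Some college", "BA/BS", "Graduate"])
print(degree)  # the result is ordinal: ranked categories, unequal intervals
```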
INTERVAL AND RATIO VARIABLES
In addition to the properties of nominal and ordinal category systems, interval and ratio variables possess a common and equal unit that separates adjacent or adjoining categories. This common, equal unit is what makes a numeric category system numeric. Ratio variables also have a fixed, non-arbitrary zero point: you can count the number of books in your library, and you cannot have fewer than zero.
EXAMPLES: one year of age or one year of education or one dollar of income. Each of these examples is one equal unit.
These intervals are equal no matter how high up or low down the scale you go.
EXAMPLE: Most "count variables" (years of age or formal education, number of children, dollars) are ratio variables.
It is nonsense to perform arithmetic operations on clearly nominal data.
For example, suppose you have a group of three men and three women. Can you calculate a "mean biological sex" score? What could it possibly be? It can't be a number, because each gender category value is a name or tag ("male," "female") that cannot be added or multiplied.
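Software enforces the same point. In this minimal pandas sketch, asking for the mean of a nominal variable simply raises an error:

```python
# A sketch: arithmetic on a nominal variable fails, as it should.
import pandas as pd

sex = pd.Series(pd.Categorical(["male", "female", "male", "female"]))
try:
    sex.mean()
except TypeError as err:
    print("No 'mean sex' exists:", err)
```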
TYPE OF VARIABLE | CASES CAN BE SEPARATED INTO CATEGORIES | CATEGORIES EXHAUSTIVE | CATEGORIES MUTUALLY EXCLUSIVE | CATEGORIES CAN BE RANK-ORDERED | CATEGORIES SEPARATED BY EQUAL INTERVALS | FIXED OR NON-ARBITRARY ZERO
NOMINAL | X | X | X | | |
ORDINAL | X | X | X | X | |
INTERVAL | X | X | X | X | X |
RATIO | X | X | X | X | X | X
Susan Carol Losh
January 3 2017