EDF 6937-04       SPRING 2017
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 1: ISSUES IN MODELING
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
SOME ANALYTIC PROBLEMS
SOME POTENTIAL SOLUTIONS
WHY ARE WE GOING TO ALL THIS TROUBLE?
WHAT IS THE "TRUE" CAUSAL STATUS OF A RELATIONSHIP?
TYPES AND LEVELS OF DATA (REVIEW)

 

KEY TAKEAWAYS
  • Treating ordered data (ordinal variables) as numeric can cause problems
    • Underestimates of standard errors, and thus often
    • Misleading "statistically significant" findings
  • Logistic regression coefficients often misinterpreted
    • When exponentiated (as odds ratios), they are multiplicative and nonlinear, not additive or linear
    • Binary regression (dichotomous dependent variable) most common
  • Crosstabulation tables seen as one solution
    • They get "messy" fast, after 3 variables
    • No general system in introductory statistics
    • No "indirect effects" (mediators) as in Structural Equation Modeling
    • Statistical interaction (moderators) can be tested
  • Look below for information on the "loglinear" general cell frequency (GCF) model
    • Any number of variables (as practical)
    • "Path type" models can be estimated
    • Most programs these days allow you to use numeric "covariates" too
    • Testing and "Fit" comparable to structural equation models (but not typically Confirmatory Factor Analysis)
  • Logistic regression comes from the GCF model
  • So does something called "probit analysis."
  • Why not use the Linear Probability Model (LPM) with a dummy dependent variable in "regular" regression?
    • Sometimes it's OK, but watch out for the potentially BIG problems described in the section below
  • Any time you want to do "path type" models, you are referencing causality
  • You can review some causal clues below
  • And review levels of data category systems, nominal, ordinal, interval and ratio below too

SOME ANALYTIC PROBLEMS

Many variables of interest in education, the behavioral and social sciences, business, and other fields are distinctly non-numeric and often non-ordered as well. Thus, the phenomena we wish to explain may be categorized as a nominal or ordinal dependent variable. As you know, most multivariate statistical analysis methods (including structural equation modeling) assume at least "quasi" numeric dependent variables.

What is the student who wishes to investigate a nominal or ordinal dependent variable* to do? If the response or dependent variable has ordered categories, analysts will often treat that variable as numeric, and then use statistical methods such as multiple regression or structural equation models. However, these methods can create problems. Because the dependent variable isn't numeric, let alone normally distributed, serious violations of linear model assumptions may occur, and these can produce misleading interpretations of the data. For example, some of these violations produce underestimates of standard errors. Pragmatically, this means that some results that are not statistically significant (i.e., not different from zero) can appear to be statistically significant, leading to inaccurate interpretations of the results on your part.

*Please see the review of types of variables at the bottom of this Web site.

If the dependent or response variable is clearly not numeric and the categories are not ordered, techniques such as regression generally don't make any sense at all. Another possibility is to use logistic regression. Logistic regression is very useful; however, most students have only learned it using dichotomous dependent variables (binary logistic regression). Furthermore, the uninitiated persist in interpreting logit coefficients as additive and linear like OLS regression coefficients when (in exponentiated form) they are multiplicative and nonlinear.

Part of what we will do in this course is to learn to interpret logistic regression coefficients more accurately. We will also extend this technique to dependent non-numeric variables with several categories, not just two.
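
To make that last point concrete, here is a small numeric sketch (my own illustration; the intercept and slope below are invented): a logit coefficient b is additive on the log-odds scale, but exp(b) MULTIPLIES the odds, and the implied change in probability is nonlinear.

```python
import numpy as np

b0, b1 = -1.0, 0.8   # hypothetical logit intercept and slope (invented)
for x in (0, 1, 2):
    log_odds = b0 + b1 * x        # linear and additive in x
    odds = np.exp(log_odds)       # each unit of x multiplies the odds by e^b1
    prob = odds / (1 + odds)      # probability: nonlinear in x
    print(f"x={x}: log-odds={log_odds:+.2f}  odds={odds:.3f}  p={prob:.3f}")
# exp(b1) is about 2.23: each one-unit increase in x multiplies the odds
# by 2.23, but the change in PROBABILITY depends on where x starts.
```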

Yet a third possibility is to use cross-tabulation tables and "control variables" with various "nonparametric" statistical measures. This is the long-venerated tradition of "physical control" (as opposed to "statistical control"), which presents its own set of problems:

  • Tables get "messy" fast, once more than three variables are involved.
  • Introductory statistics provides no general system for analyzing them.
  • There is no way to estimate "indirect effects" (mediators) as in Structural Equation Modeling, although statistical interaction (moderators) can be tested.

Partly as a result, the three-variable cross-tabulation analytic model is of limited utility, although we will use it during the course as an introduction. Often we are interested in several independent variables--as we are in multiple regression analyses or the analysis of variance. We may also be interested in causal chains, or causal models similar to those used in SEM. Yet how can we determine issues such as mediation in the cross-tabular table?
 
SOME POTENTIAL SOLUTIONS

Over the past 45-plus years (see the last chapter on history in the Agresti text!) several techniques have been developed to address analytic issues with categorical response or "dependent" variables. Although the statistical theory is older than that, these iterative methods were not practical until the development of high-speed computers, and they began to spread in the early 1970s. This seminar focuses on these techniques. There are many models for categorical response variables, and we will examine several of them this semester. I don't expect you to remember all this information right now. But mentally "file away" these different techniques because we will revisit them later.

These general cell frequency (GCF) models also lend themselves to SEM-type interpretation and testing. For example, the relationships among three variables ("A", "B" and "C") can be conceptualized as three two-way associations: A-B, A-C, and B-C. If the relationship between an exogenous and an endogenous variable is indirect (the partial A-C association = 0), that two-way association can successfully be dropped from the overall model for the table.

For example, let variable A be the individual's educational level; B, occupational type (such as professional, manager, clerical, etc.); and C, the individual's overall income level.

In your causal model, education is a direct cause of occupational type. In turn, type of occupation is a direct cause of income.

No, these aren't experimental data, and for obvious reasons, cannot be since we generally don't randomly hand out educational degrees and types of occupations. For some "ground rules" on establishing causality in non-experimental data, see below in this guide. I also recommend:
 
Sometimes the best things in life really are free! This AERA book is a good example:
Barbara Schneider, Martin Carnoy, Jeremy Kilpatrick, William H. Schmidt & Richard J. Shavelson (2007). Estimating Causal Effects Using Experimental and Observational Designs. American Educational Research Association.
You can download the ENTIRE BOOK in pdf from the AERA website:

http://www.aera.net/Publications/Books/EstimatingCausalEffectsUsingExperimentaland/tabid/12625/Default.aspx

Suppose your initial causal model looks like this:

EDUCATIONAL LEVEL  -->  OCCUPATIONAL TYPE  -->  INDIVIDUAL INCOME

No direct-effect arrow from Education to Income is shown. Can we drop the direct causal arrow from Educational Level to Individual Income? GCF loglinear models allow us to test this.
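
Here is a minimal sketch of that test (my illustration, not part of the original guide), using invented counts for a 2 x 2 x 2 Education x Occupation x Income table. Loglinear models can be fit as Poisson regressions on the cell counts, and the A-C term is tested by comparing the deviances of the two competing models.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from itertools import product

# Hypothetical cell counts for a 2 x 2 x 2 table (all numbers invented)
counts = [120, 40, 35, 55, 30, 50, 45, 125]
rows = [{"educ": a, "occ": b, "income": c, "count": n}
        for (a, b, c), n in zip(product(["low", "high"], repeat=3), counts)]
table = pd.DataFrame(rows)

# Model 1: [AB][BC] -- Education relates to Income only through Occupation
m1 = smf.glm("count ~ educ*occ + occ*income", data=table,
             family=sm.families.Poisson()).fit()
# Model 2: adds the direct Education-Income association [AC]
m2 = smf.glm("count ~ educ*occ + occ*income + educ*income", data=table,
             family=sm.families.Poisson()).fit()

# Likelihood-ratio test: can the A-C term be dropped?
lr = m1.deviance - m2.deviance
df = m1.df_resid - m2.df_resid
p = stats.chi2.sf(lr, df)
print(f"LR chi-square = {lr:.2f}, df = {df}, p = {p:.4f}")
# A large p-value says the no-direct-effect model fits about as well,
# so the Education --> Income arrow could be dropped.
```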

More recent research in this area has focused on mixed models, using a variety of independent variable measurement types to predict a non-numeric dependent variable. In this type of analysis, the more continuous, numeric predictors that are combined with categorical predictors are often called covariates. (Note: this is not the same use of the term "mixed models" found in HLM, but historically the term originates here.)

WHY ARE WE GOING TO ALL THIS TROUBLE?

One possibility that occurs to analysts when the dependent variable is dichotomous is to use the Linear Probability Model (LPM). The Linear Probability Model is familiar to many students, even if they didn't know the "proper name" for it. In the LPM we use variations on multiple regression (e.g., Weighted Least Squares) to predict a dichotomous dummy variable coded 1 or 0. In the LPM, we have a straightforward, linear, additive model in which the B coefficients are interpreted as raising or lowering the probability of a score of 1. Further, we have straightforward concepts such as the Explained and Unexplained Sums of Squares that are familiar to you from OLS regression (ordinary least squares) and Analysis of Variance (ANOVA). So why go to the trouble to learn an entirely new set of techniques, complete with a new vocabulary?

Unfortunately the LPM is riddled with problems. The dependent variable is hardly continuous (a basic regression assumption) and often contributes to heteroscedasticity, in which the variance of the dependent variable depends on the scores of the independent variable(s). One consequence, again, is that you may think certain B coefficients are statistically significant when in fact they are not. The variance of the dependent variable is also truncated: a 0/1 variable has a variance of at most .25 (reached when the cases split 50-50). You also may have impossible predicted values for the dependent variable, larger than 1 or less than 0. So this is not an easy way out!
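
A quick simulated sketch (my illustration, with invented data) of the out-of-range-prediction problem, comparing the LPM fit by OLS with a binary logistic regression in statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(0.5 + 2.5 * x)))   # true logistic probabilities
y = rng.binomial(1, p)                   # 0/1 dependent variable

X = sm.add_constant(x)
lpm = sm.OLS(y, X).fit()                 # Linear Probability Model
logit = sm.Logit(y, X).fit(disp=0)       # binary logistic regression

print("LPM fitted values outside [0,1]:",
      ((lpm.fittedvalues < 0) | (lpm.fittedvalues > 1)).sum())
print("Logit fitted values outside [0,1]:",
      ((logit.predict(X) < 0) | (logit.predict(X) > 1)).sum())  # always 0
```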

If you do decide to use the LPM, the least you can do is use Weighted Least Squares (WLS) instead of OLS. The most common weight for each case is the inverse of the estimated variance from a first-stage OLS fit, 1/[ŷ(1 - ŷ)]; equivalently, each variable's scores are multiplied by the square root of that weight. WLS will partially address the heteroscedasticity problem by doing some "smoothing" on the standard error estimates.
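
A minimal sketch of that two-step WLS fix, continuing the variables (y, X, lpm) from the sketch above. Clipping the first-stage fitted values away from 0 and 1 is a common practical step I have added here so the weights stay finite:

```python
import numpy as np
import statsmodels.api as sm

p_hat = np.clip(lpm.fittedvalues, 0.01, 0.99)   # first-stage OLS predictions
w = 1.0 / (p_hat * (1.0 - p_hat))               # inverse estimated variances
wls = sm.WLS(y, X, weights=w).fit()
print(wls.bse)  # standard errors, partially corrected for heteroscedasticity
```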

So here is our dilemma: we have many variables of interest to us that are definitely non-numeric. We want to be able to explain and predict them. If we try numeric solutions to analyze these dependent variables, we either risk total nonsense (in the case of multinomial dependent variables, e.g., ethnic group) or misleading results. Some of the common methods taught in earlier statistics classes (two-variable "nonparametric tests"; three-way cross-tabulations and comparisons of partial correlations in each subgroup; binary logistic regression) are either too restrictive or do not allow tests of statistical significance. Other techniques (e.g., the LPM) have many unsatisfactory ramifications.

On the other hand, what we have in this course is a set of techniques that were especially developed for non-numeric dependent variables.

Yes, it is definitely worth the trouble!


 WHAT IS THE "TRUE" CAUSAL STATUS OF A RELATIONSHIP?

Much of the research process centers around identifying the true causal or "independent" variables. And much of modeling in its various forms, including structural equation models and loglinear models, centers around the causal order of variables. Statistical analysis canNOT "prove" causal order, but it can test how well the empirical data "fit" a causal model that is defined a priori.

Concepts of causality are critical: they tell us what is possible, what can be changed and what is difficult, if not impossible, to change. For example, if you are convinced that biological factors cannot be overcome, you probably will not believe that visually impaired children can compensate for their disability. Causality tells us what are the “prime movers” of the phenomena that we observe.

According to science rules, definitive proof via empirical testing does not exist. Science uses the term "proof" (or, rather, "disproof") differently from the way attorneys or journalists do. For example, a correlation could have many causes, only some of which can be identified. Later work can show earlier causes to be spurious, that is, both cause and effect depend on some prior causal (often extraneous) variable (see the ice cream consumption example below).

Before we can discuss causality, even when only two variables (one independent, one dependent) are involved, we must complete two prior steps. First, we must establish that the relationship is REAL, and of at least moderate strength, so we know it is not trivial. Second, we ask about the causal ordering of the variables. To evaluate the causal status of a bivariate relationship, we need to introduce at least one additional variable.

However, even with two variables, we still need to assess whether we can designate an independent variable (cause) and a dependent or response variable (effect). If we can designate an independent variable, the correlation is asymmetric. If we cannot designate an independent variable, it is symmetric.

This is relatively easy with experiments. In creating an intervention, you also created the causal variable. For example, if you assigned cigarette smokers to a nicotine patch group versus a placebo patch group, active nicotine patch (yes or no) becomes the independent variable.

However, in many cases, we cannot manipulate or intervene with variables, such as sex, age or ethnicity.

In other cases, we are dealing with naturally observed variables, which is often the case with surveys, ethnographies, content analysis, and many other methods. So we need some guidelines to establish plausible causal order among variables in non-experimental studies.

Statistics, by the way, may be used to DISCONFIRM a postulated causal order, but NEVER, NEVER, NEVER establish causal order. That's part of the "rules of the game."
 

CAUSALITY IN NON-EXPERIMENTAL DATA: AN EXAMPLE
Cancerous Human Lung
This dissection of human lung tissue shows light-colored cancerous tissue in the center of the photograph. While normal lung tissue is light pink in color, the tissue surrounding the cancer is black and airless, the result of a tarlike residue left by cigarette smoke. Lung cancer accounts for the largest percentage of cancer deaths in the United States, and cigarette smoking is directly responsible for the majority of these cases. 

"Cancerous Human Lung," Microsoft(R) Encarta(R) 96 Encyclopedia. (c) 1993-1995 Microsoft Corporation. All rights reserved.

Most people--and most scientists--accept that smoking cigarettes causes lung cancer, although the evidence (for humans) is strictly correlational rather than experimental. There are many topics where it is neither possible--nor desirable--to use the experimental method. The rules below will help in evaluating such correlational evidence. (SCL)

Many scientists believe that the ONLY way to establish causality is through randomized experiments. However, a moment's reflection will convince you that this cannot be so. Most people now accept that smoking cigarettes causes lung cancer (see the Encarta selection above)--yet no society has ever randomly assigned half its population to smoke cigarettes and the other half not (although such experiments may have been done with laboratory rats). This causal conclusion about smoking and lung cancer in human beings is based on correlational evidence, i.e., observing the systematic covariation of two (or more) variables in a research study, which is exactly what we do when we examine the association between two variables. Cigarette smoking and lung cancer are both "naturalistic" variables, i.e., we must accept the data as nature gave them to us.

There is no doubt that the results from careful, well-controlled experiments are typically easier to interpret in causal terms than results from other methods. However, as you can see, causal inferences are often drawn from correlational studies as well. Non-experimental methods must use a variety of ways to establish causality and ultimately must use statistical control, rather than experimental control.
 
RULES THAT HELP ESTABLISH WHICH VARIABLE IS "CAUSE" AND WHICH IS "EFFECT"

If one variable causes a second variable, they should correlate; thus, causation implies correlation.

However, two variables can be associated without having a causal relationship, for example, because a third variable is the true cause of both the "original" independent variable and the dependent variable. For example, there is a statistical correlation over months of the year between ice cream consumption and the number of assaults.

Does this mean ice cream manufacturers are responsible for crime? No! The correlation occurs statistically because the hot temperatures of summer cause both ice cream consumption and assaults to increase. Thus, correlation does NOT imply causation. Other factors besides cause and effect can create the illusion of an observed correlation.
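
A tiny simulation (my illustration; all numbers invented) makes the point: temperature drives both series, so they correlate, and statistically controlling for temperature makes the association vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
temp = rng.normal(75, 15, size=1_000)                # monthly temperatures
ice_cream = 2.0 * temp + rng.normal(0, 20, 1_000)    # caused by temperature
assaults = 1.5 * temp + rng.normal(0, 20, 1_000)     # also caused by temperature

print(np.corrcoef(ice_cream, assaults)[0, 1])        # sizable "raw" correlation

# Residualize both on temperature, then correlate the residuals
r_ice = ice_cream - np.polyval(np.polyfit(temp, ice_cream, 1), temp)
r_ass = assaults - np.polyval(np.polyfit(temp, assaults, 1), temp)
print(np.corrcoef(r_ice, r_ass)[0, 1])               # near zero: spurious
```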

If one variable causes a second, the cause is the independent variable (also called an explanatory or predictor variable).
The effect is called the dependent variable (sometimes called the criterion, target, or response variable).

If you can designate a distinct cause and effect,  the relationship is called asymmetric.

Two variables may be associated but we may be unable to designate cause and effect. These are symmetric  relationships.

We cannot use experimental treatments on naturalistic variables to determine cause and effect, yet scientists do draw causal conclusions in nonexperimental studies. Here, then, is a set of helpful rules for tentatively establishing causality in correlational data.
 
 

 
When you are asked to "name a causal rule" on exercises, please use one of the six guidelines below.

GUIDE (1) TIME ORDER. The independent variable came first in time, prior to the second variable.

EXAMPLE: Gender or race are fixed at birth.


GUIDE (2) EASE OF CHANGE. Not only did the independent variable come first, but it is harder to change. The dependent variable is easier to change.

EXAMPLE: One's gender is definitely harder to change than scores on an assessment test or years of school. One's chronological age is not usually changed by attitudes, values, education, or much of anything else.


GUIDE (3) "MAJORITY RULE." The independent variable is the cause for most people.

EXAMPLES: Although some people become so fed up with their jobs that they return to school to train for a better job, most people complete their education prior to obtaining a regular year-round, full-time job.
Most people marry prior to having children (although some people have their children first, then marry as a result).


GUIDE (4) NECESSARY OR SUFFICIENT. If one variable is a necessary or sufficient condition for the other variable to occur, or a prerequisite for the second variable, then the first variable is the cause or the independent variable.

EXAMPLES: A certain type of college degree is often required for certain jobs. (Necessary)
At most universities, publications are a prerequisite for being awarded tenure. (Necessary but not sufficient.)
If you can come up with the money, you almost certainly can purchase a meal. (Necessary and usually sufficient.)


GUIDE (5)  GENERAL TO SPECIFIC. If two variables are on the same overall topic and one variable is quite general and the other is more specific, the general variable is usually the cause.

EXAMPLE: Overall ethnic intolerance influences attitudes toward Hispanics.


GUIDE (6) THE "GIGGLE" OR "SANITY" FACTOR. If reversing the causal order of the two variables seems illogical and makes you laugh, reverse the causal order back.

EXAMPLES: We don't  believe choosing a specific college major or engaging in a particular sport determines one's gender.

These rules become important not only because you need to establish causal order to do most multivariate analyses, but also because they will help you to decide which parameters to keep and which to drop in more advanced analyses.
 


Unfortunately, some causal assertions that turn out to be causally problematic may initially seem quite reasonable to the consumer; the recent medical literature offers several examples.

Such cases don't mean that causal models can never be conducted with correlational data, but they certainly caution researchers to be careful!
 
 
SOME SILLY STATEMENTS ABOUT CAUSALITY THAT OCCUR IN THE LITERATURE

1. Only experiments can be used to make causal statements.

ANSWER: Why is this silly? Use the guidelines above to investigate causality in non-experimental data. Ninety-six percent of Americans believe that smoking causes lung cancer, but the data on cigarette smoking and cancer are NOT experimental (with the exception of a few poor rats).

2. Well, yes, but the smoking data is based on epidemiological studies with thousands of cases.

ANSWER: Why is this silly? The number of cases has NOTHING TO DO with causality. Having a large database means your estimates are relatively stable, they have low sampling error. But low sampling variability has nothing to do with causality. Examine the six guidelines above.

3. Nominal variables can serve as causal variables but numeric variables (interval or ratio) cannot be independent variables.

ANSWER: Why is this silly? The level of measurement has NOTHING TO DO with causality. You may be able to do more arithmetically complex statistics with numeric data but that gives it no special causal status. To believe a statement such as number 3 means that you believe that gender (nominal) can have causal status but variables such as age or years of education (ratio) cannot have causal status. A moment's reflection will show you how silly that is. For example, years of education is one of the most powerful predictors of how people live their lives. Do you honestly believe that education has no effect on the occupation you enter, the salary you earn, the health practices you use, how you vote, or the television programs you watch?

Use one of the guidelines above to determine causal status, not the measurement level of your variables.
 

TYPES AND LEVELS OF DATA

Some students who "survived" any of my Introductory Statistics or Methods classes may remember this material and how fanatical I am about students knowing the kind of data they are working with. Astoundingly (since the type of data you have is a major determinant of the kind of statistical analysis you can do), most introductory textbooks relegate "levels of data" to at most a page or two. The section below is intended as a review of this material as well as an amplification for those who have not had an in-depth treatment of the topic before.
 

DISCRETE AND CONTINUOUS
NOMINAL
ORDINAL
INTERVAL-RATIO

DISCRETE AND CONTINUOUS

Variables can be described as discrete or continuous.

CONTINUOUS VARIABLES can take on any of an infinity of values. If you use very fine measurements, for example "mils," it is possible to describe income in U.S. dollars in almost infinitely fine variations. Similarly, age can be divided into "nanoseconds." Obviously this only makes sense if the data take on NUMERIC values: continuous variables are numeric.

DISCRETE VARIABLES take on only a limited number of values, and cannot be infinitely subdivided into finer and finer measures in the way that continuous variables can. Very often, discrete variables take on integer values, such as "1" "2" or "90". The number of books in your library can be counted but not subdivided past the integer level.


Nominal, ordinal and interval-ratio variables are very different types of category systems. These form a cumulative, hierarchical set of data properties, so that nominal properties also hold for ordinal and interval-ratio data, and ordinal properties also hold for interval-ratio data. The reverse does NOT hold. Interval and ratio data are numeric, so arithmetic operations can be used. Nominal and ordinal data are categorical, not numeric; it is nonsense to use arithmetic operations on nominal or ordinal data.

Each variable you analyze is comprised of categories and forms a "category system." When in doubt about the kind of data you have, review the category values (not the empirical distribution of categories) to see the kinds of properties that category system contains.
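
As a concrete illustration (mine, not the original guide's), here is how these levels are typically encoded in Python with pandas: nominal data as an unordered Categorical, ordinal data as an ordered Categorical, and interval-ratio data as plain numbers on which arithmetic is meaningful.

```python
import pandas as pd

# Nominal: categories have no inherent order
nominal = pd.Categorical(["Leo", "Aries", "Leo"],
                         categories=["Aries", "Leo", "Virgo"])
# Ordinal: the categories themselves can be ranked
ordinal = pd.Categorical(["Agree", "Strongly Agree", "Disagree"],
                         categories=["Strongly Disagree", "Disagree",
                                     "Agree", "Strongly Agree"],
                         ordered=True)
# Ratio: counts with a fixed zero; arithmetic is legal
ratio = pd.Series([0, 2, 5])

print(ordinal.min(), "<", ordinal.max())  # order comparisons are legal
print(ratio.mean())                       # a mean makes sense only here
```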
 

NOMINAL VARIABLES

With nominal  variables, you can tell whether two cases or instances fall into the same category or into different categories. Thus, you can sort all cases into mutually exclusive, exhaustive categories. That's it!

Examples of nominal variables include:

  Your Zodiac sign

  Gender

  Ethnicity and

  Religious affiliation (or denomination)

None of these variables have categories that can be ordered into more or less, or higher or lower.

Nominal variables are also sometimes called categorical variables or qualitative variables. The categories are not only not numbers, they do not have any inherent order.

Try these examples:

Who is more? Koreans or Turks? More WHAT? Country of origin is NOT a number or even ordered.

Who is "better"? Women or Men? Better at WHAT? If you suspect that ranking the categories (NOTE: NOT the cases within the categories) would start a war, you probably have nominal variables.
 

ORDINAL VARIABLES

With ordinal variables, the categories themselves can be rank-ordered from highest to lowest.

This means the scores must FIRST be rank-ordered from highest to lowest (or vice versa) before you can use any ordinal measures. Like runners in a race, we can rank scores--and especially the categories themselves--from first to last, most to least, or highest to lowest.

In rank-ordered cases (as opposed to rank-ordered categories), we can literally rank order the finishers in a race or the students by their grade point average (first in class, second in class, and so on down to last in class). Notice that the intervals between cases probably are not the same (or equal). The class valedictorian may have a straight-A or 4.0 average, the salutatorian a 3.6, the third student a 3.5, and so on.

We can also rank-order the categories of a variable in ordinal data. One example is a Likert, or rank-ordered scale. Respondents are given a statement, such as "I like President Obama" then asked if they:

Strongly Agree, Agree, Disagree, or Strongly Disagree with that statement.

We can surmise that someone who "strongly agrees" supports that statement more intensely than someone who "agrees"--but we don't know how much more intensely.
 
 

 
Virtually all Agree-Disagree Likert attitude scales are ordinal data.

This is fairly obvious when there are 5-7 categories but it is also true when there are only two categories: someone who favors raising teacher salaries obviously is more in favor than someone who opposes the raise.

This is also true for many behaviors. Someone who smokes EVEN ONE cigarette per day clearly smokes more than someone who smokes none at all. Notice that what matters is NOT how many categories the variable has, but whether the categories can be ranked as more or less. Some variables, like smoking cigarettes or not, may be ordinal variables with only 2 categories. Some multi-category variables, such as college major, have lots of categories but no rank order; hence they are nominal variables.
 

Other types of ordinal data include:

  • the order of finish (e.g., class rank or a horse race),

  • "yes-no" experiences (someone who answers "yes" to "Do you play the lottery?" clearly plays more than someone who answers "no"), and

  • collapses of numeric data into categories with unequal widths or intervals (e.g., collapsing years of education into degree level; see the sketch below).
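
A brief sketch of that last kind of ordinal variable (my illustration; the cutpoints and labels are invented), using pandas to collapse years of school into ordered, unequal-width degree categories:

```python
import pandas as pd

years = pd.Series([8, 11, 12, 14, 16, 18, 21])   # years of school (invented)
degree = pd.cut(years,
                bins=[0, 11, 12, 15, 16, 25],    # unequal-width intervals
                labels=["< HS", "HS", "Some college", "BA/BS", "Graduate"])
print(degree)   # an ordinal variable: ordered, but the intervals are unequal
```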
 

INTERVAL-RATIO VARIABLES

In addition to the properties of nominal and ordinal category systems, interval and ratio variables possess a common, equal unit that separates adjacent or adjoining categories. This common, equal unit is what makes a numeric category system numeric.

EXAMPLES: one year of age, one year of education, or one dollar of income. Each of these is one equal unit.

These intervals are equal no matter how high up or how low down the scale you go. It is the equal interval between adjacent categories, no matter how small or how large the score may be, that makes the data numeric.

Ratio variables additionally have a fixed, non-arbitrary zero. EXAMPLES: 0 children or 0 years of age. You cannot have fewer than zero children or be less than zero years of age. You cannot have less than zero dollars of income (net worth is another story) or less than zero years of formal education. You can count the number of books in your library, and you cannot have fewer than zero.

Most "count variables" (years of age or formal education, children, dollars) are ratio variables.

It is nonsense to perform arithmetic operations on clearly nominal data.

For example, suppose you have a group of three men and three women. Can you calculate a "mean biological sex" score? What could it possibly be? It can't be a number, because each gender category value is a name or tag ("male," "female") that cannot be added or multiplied.
 

LEVELS OF ANALYSIS SUMMARY

 
PROPERTY                                     NOMINAL   ORDINAL   INTERVAL   RATIO
Cases can be separated into categories          X         X         X        X
Categories exhaustive                           X         X         X        X
Categories mutually exclusive                   X         X         X        X
Categories can be rank-ordered                            X         X        X
Categories separated by equal intervals                             X        X
Fixed or non-arbitrary zero                                                  X

 
 
Susan Carol Losh
January 3 2017