THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 2: TABLE NOMENCLATURE
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
KEY TAKEAWAYS
Here is the link to the web pages on three variable crosstabulation tables and causal issues:
http://myweb.fsu.edu/slosh/IntroStatsGuide6.html
Here is the link to the web pages on online data archives:
http://myweb.fsu.edu/slosh/MethodsGuide8.html
AN AGRESTI NOTE: Agresti places relatively more importance on various logistic regression models, whereas I place relatively more importance on loglinear models. I want to mention a few important reasons for this difference in emphasis, for my preference, and for the order of topics and readings:
Loglinear models are the BASIC models: logistic regression and logit models are derived from, and tested by, the underlying and corresponding loglinear model.
Other models, such as ordinal regression, are also derived from the underlying loglinear model.
If you are comfortable with loglinear models, all the other derived models will be much easier for you (based on my experience learning and teaching this material).
Causal models can be tested relatively easily with loglinear models, but virtually all logistic regression models are essentially two-stage causal models, similar to an OLS regression model. I am often interested in causal models, so loglinear models are the way to go.
A TABLE | TABLES | ODDS-RATIOS
A TABLE is a common and useful way to present data.
Don't be a snob about tables! A table is the most useful basic building block in your tool chest of data analytic techniques and presentation of your results. If you can construct a simple table thoroughly, everyone (including you) will be able to assess your basic results. You could even write an entire dissertation using tables alone. And, of course, crosstabulation tables form the bedrock foundation of this course material.
Further, as we will shortly see, tables can become increasingly complex. You can (and we will) present joint distributions of three or more variables in tables.
A bit boring but necessary (no matter how basic it seems right now):

These are rows: a row stretches from the left-hand side of the table to the right-hand side of the table. By convention, the top row is number 1.

These are columns: columns start at the top of the table and plummet straight down to the bottom of the table. By convention, the FAR LEFT column is designated number 1.
A univariate table addresses one variable at a time. Sometimes univariate distributions are called "the marginals" (since they form the margins of a two-way or N-way table) or "marginal distributions."
A bivariate table addresses the simultaneous joint distribution of two variables. For example, the following combinations of values jointly and simultaneously cross-classify each individual on two characteristics, their college and their gender:
Male Business Major
Female Business Major
Male Education Major
Female Education Major
Male Humanities Major
Female Humanities Major
and so on.
The juxtaposition of a particular row in the table with a particular column produces a "cell" in the table.
By convention, we give the row first and the column second to locate a particular CELL in a bivariate table. The "female, business major" cell is row 1, column 2 or just "1,2" for short. (We don't count the variable and value labels as a row or a column.)
Staying with the example of gender and college major, we could present a bivariate table as follows:
DISTRIBUTION OF COLLEGE MAJOR BY GENDER
AT FLORIDA STATE UNIVERSITY 2004
College Major | Gender: Male | Gender: Female | Row Totals
Business | cell 1,1 | cell 1,2 |
Education | cell 2,1 | cell 2,2 |
Humanities | cell 3,1 | cell 3,2 |
(Other majors, entered row by row) | | |
Column Totals (for males and females) | | |
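Most statistical packages will build such a table directly from raw (case-level) data. Below is a minimal sketch in Python with pandas; the tiny data set and variable names are hypothetical, invented only to illustrate the layout.

    # A minimal sketch (not from the Guide): building a bivariate frequency table
    # in Python with pandas. The miniature data set below is hypothetical.
    import pandas as pd

    students = pd.DataFrame({
        "major":  ["Business", "Business", "Education", "Humanities", "Education", "Business"],
        "gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    })

    # Rows are the college major, columns are gender; margins=True appends the
    # row totals, column totals, and grand total described above.
    table = pd.crosstab(students["major"], students["gender"],
                        margins=True, margins_name="Total")
    print(table)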
The table itself is a rectangular array in at least two dimensional space. It typically shows the value or category of the variable and the number of cases that fall into that category. It may also give the percentage of cases that fall into each category (but usually not both frequencies and percents).
There are a lot of ways to construct univariate and bivariate tables. This Guide follows commonly accepted statistical practices.
Let's start by examining one variable at a time. Consider the following univariate frequency distribution AND percentage table from the 2014 General Social Survey, a national in-person probability survey conducted every other year (in even-numbered years) by NORC at the University of Chicago:
TITLE: Percentage of United States Households with a Particular Type of Telephone (est.), 2014

Type of Telephone Usage | Frequency | Percent
Landline in home only | 1,086 | 42.9%
Cell phone | 1,296 | 51.2
Phone elsewhere | 74 | 2.9
Refused | 63 | 2.5
No telephone | 12 | 0.5
Total households | 2,531 | 100.0%
Source = 2014 General Social Survey,
NORC, University of Chicago: http://gss.norc.org/
1. THE TITLE
Each table must have a title that briefly but accurately describes the contents of the table. This means that if, for example, you have a bivariate distribution, you should include the names of BOTH VARIABLES in the title, such as "Type of Telephone by Marital Status".
2. THE VARIABLE(S) OF INTEREST
In my univariate example, there is ONE variable of interest, what kind of telephone (if any) there is in the household.
3. THE CATEGORIES OF THE VARIABLE
In my example above, there are five categories: Landline in home only, Cell phone, Phone elsewhere (e.g., a neighbor's phone), Refused, and No telephone. The percentage of households with no telephone, by the way, held at about five percent for several years; it is dropping now with "throw-away" cell phones and government financial assistance.
4A. THE NUMBER OF CASES OR THE FREQUENCY IN EACH CATEGORY OF THE VARIABLE
The total collection of every category name with its associated frequency is the univariate frequency distribution.
4B. THE PERCENTAGE OF CASES IN EACH CATEGORY OF THE VARIABLE
Percentages are very handy, because they are a standardized measure, or on a "per 100" standard. The total collection of every category name with its associated percentage is the univariate percentage distribution.
This section reads 4A or 4B because typically either the frequency distribution OR the percentage distribution is presented, but not both. Presenting too many numbers clutters the table and makes it more difficult to read. American Psychological Association (APA) standards call for "lean tables."
5. THE TOTAL CASE BASE
In my example, that's (about) 2531 households. If you have the total case base and the percentages, you can recalculate the category frequencies.
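As a quick check, a reader who has the case base and the percentages can reconstruct the approximate frequencies. A minimal sketch in Python using the telephone table above:

    # Sketch: recovering approximate category frequencies from percentages and the case base.
    # Percentages come from the telephone table above; expect small rounding error.
    case_base = 2531
    percents = {"Landline in home only": 42.9, "Cell phone": 51.2,
                "Phone elsewhere": 2.9, "Refused": 2.5, "No telephone": 0.5}

    for category, pct in percents.items():
        frequency = round(case_base * pct / 100)   # e.g., 2531 * 42.9 / 100 is about 1,086
        print(f"{category}: about {frequency} households")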
You would be amazed how many people omit the total case base, including authors in professional journals. In the back of my mind, I always get just a little suspicious.
What are they hiding? Were they ashamed of the number of cases? ("Three out of every four dentists recommend Blanca Blazing White toothpaste" isn't very impressive when there are only four dentists.)
Did they manage to forget how many cases they collected?
6. THE SOURCE OF THE DATA
Who collected these data? The United States government? NORC? A freshman undergraduate for a psychology project? Your Aunt Millie?
It is important to know the source of the data because clearly some sources have more of a reputation for collecting data in a trustworthy, systematic and reliable way than others.
The United States government, the National Opinion Research Center (NORC), federal agencies of many countries around the world, certain private companies (e.g., the Roper Company), all have good reputations for the care that they take in data collection. You will want to know the collector of the data so that you can interpret the data in context.1
1By the way, despite what you may read, in fact the polls called things correctly in the 2016 U.S. Presidential race. They predicted Hillary Clinton would win the national popular vote. And, indeed, so she did, by nearly three million voters nationwide. However, the U.S. presidency is decided by the Electoral College, which allots a certain number of electoral votes to each state. Donald Trump did win the Electoral College.
A bivariate distribution simultaneously and jointly cross-classifies the scores on a case for two variables.
For example, if we have a bivariate distribution of gender and support for President Obama (favorable/unfavorable) we can simultaneously cross-classify people as favorable males, favorable females, unfavorable males and unfavorable females.
The jointly cross-classified cases form the "cells" or interior of the table. Each cell has a frequency of cases that have a JOINT score considering both variables simultaneously.
The univariate summaries for each variable separately (for example, male or female) appear at the bottom of the table for the independent variable and at the far right of the table for the dependent variable. Because these row and column totals sit in the margins of the table, they are called "the marginals." Remember that the cells are labelled with the row number first, then the column number.
The grand total is usually presented in the lower right corner of the bivariate table.
Title: Generic Bivariate Table

 | Variable X, Value 1 | Variable X, Value 2 | Row Totals
Variable Y, Value 1 | (Cell 1,1) | (Cell 1,2) | Marginal Total, Y Value 1
Variable Y, Value 2 | (Cell 2,1) | (Cell 2,2) | Marginal Total, Y Value 2
Column Totals | Marginal Total, X Value 1 | Marginal Total, X Value 2 | Grand Total
Then, with values supplied for each variable:
Title: Attitude toward President Obama by Gender

 | MALE | FEMALE | Row Totals
FAVORABLE | Male Favorable (Cell 1,1) | Female Favorable (Cell 1,2) | Total Favorable
UNFAVORABLE | Male Unfavorable (Cell 2,1) | Female Unfavorable (Cell 2,2) | Total Unfavorable
Column Totals | Total Male | Total Female | Grand Total
These terms: cell, grand total, and
marginal total are important because each of them, as well as combinations
of table cells, can become a parameter to be modelled in loglinear analysis.
So we need to be familiar with this terminology.
The size of a crosstabulation table (which is the total number of cells) depends on how many rows and columns are in the table.
In turn, the number of rows or columns depends on how many values or categories each variable has.
If the row variable has 3 categories and the column variable has 4 categories, the result is a "3 by 4" table.
CONVENTION: The row number always comes first.
Square tables have the same number of rows and columns (e.g., a 2 X 2 table such as the example above).
The size of the table and the number
of marginal categories for each variable become critical in loglinear models
in setting the degrees of freedom.
The total case base becomes important
in statistical significance testing.
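As a reference point (my summary, not the Guide's wording): the size of a two-way table is simply rows times columns, and for the usual independence model the degrees of freedom are (rows - 1) times (columns - 1). A minimal sketch:

    # Sketch: cells and degrees of freedom for a two-way table.
    def table_size(rows: int, columns: int) -> int:
        return rows * columns                 # total number of cells

    def independence_df(rows: int, columns: int) -> int:
        # Degrees of freedom for the independence model in a two-way table.
        return (rows - 1) * (columns - 1)

    print(table_size(3, 4))        # a "3 by 4" table has 12 cells
    print(independence_df(3, 4))   # 6 degrees of freedom
    print(independence_df(2, 2))   # a 2 x 2 table leaves 1 degree of freedom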
The Bivariate Percentage Table is a variation on our old friend, the univariate percentage table. However, bivariate percentages allow us to compare and contrast group similarities and differences. I have the very simplest bivariate table, a 2 X 2 table, below. There is one column each for women and men, one row for the correct answer and one row for the incorrect answer. The first table shown (also found in Guide 2) is the Bivariate Frequency Distribution:
How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

NOTE: By convention, categories of the independent variable typically form the COLUMNS of the table (see below).

Answer to Question: | Male | Female | Total
Sun goes around Earth (WRONG) | 72 (r1, c1) | 165 | 237
Earth goes around Sun (RIGHT) | 468 | 461 | 929
Total (separate totals for men and women, then a total for everyone combined) | 540 | 626 | 1166
Source: NSF Surveys of Public Understanding
of Science and Technology, 2014, Director, General Social Survey (NORC).
Total n = 1173; MD = 7; valid n = 1166
A key issue is whether to percentize
down the columns or across the rows.
Make no mistake about it, this IS
a key issue and not a matter of semantics. Percentizing in "the wrong direction"
will totally change the meaning of the results that you present.
CONVENTION: Values of the independent variable create the columns of the table.
For example, the two values of gender:
male and female, head each column in my sample table.
Remember, gender might cause science knowledge,
but we know science knowledge CANNOT cause biological sex. Therefore, if
there is an independent variable, gender is it. Science knowledge is the
possible effect, or dependent variable. (Review Guide
1 on causal order in non-experimental data if you are not
convinced.)
IMPORTANT NOTE: AGRESTI AND TERMINOLOGY. Sometimes you will notice that Agresti "flip-flops" on whether the categories of the independent (explanatory) variable or the dependent (response) variable form the columns of the table. Mostly he has the explanatory or independent variable as the row variable rather than the column variable. However, because the norm (see most journal articles in the social and behavioral sciences) is to have the independent variable form the columns of the table, I will follow the convention of having the independent variable comprise the table columns throughout this course.
CONVENTION: Percentize separately within values of the independent variable.
In my example, this means that first I calculate the percentage giving correct and incorrect responses among men. I then repeat the process, calculating the percentage giving correct and incorrect responses among women.
Once I have done so, I can now specify the percentage of men who give the right answer (the Earth goes around the Sun) and the percentage of women who give the right answer, and then directly compare women and men.
These percentages within gender are different numbers, and they mean something entirely different from the following question:
among those who think the Earth goes around the Sun, what percent are female? (Answer: 461/929 X 100, or 49.6%.) Since women are 626/1166 X 100, or 53.7 percent, of the sample, we can see that women are slightly underrepresented among those giving the correct answer. Notice below that neither column of the percentage table has a figure of 49.6%.
CONVENTION: Remember that when the columns are formed by categories of the independent variable, a percent sign ONLY goes at the top of each column (in this case, the "wrong answer") and after the 100 percent at the bottom of each column. (Note: this is because the values of the independent variable form the columns of the table.) This also helps to produce a "lean table."
These conventions are particularly important as the number of values for each variable grows and as the number of variables grows. They help your reader immediately discern which way the percentages are calculated, and they make your table much easier to read.
How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

Answer to Question: | Male | Female
Sun goes around Earth (WRONG) | 13.4% | 26.3%
Earth goes around Sun (RIGHT) | 86.6 | 73.7
Total | 100.0% | 100.0%
Casebase | 540 | 626
Source: NSF Surveys of Public Understanding
of Science and Technology, 2014, Director, General Social Survey (NORC).
n = 1166 valid cases (P.S. Yes, these
are real data and real percentages. Sigh.)
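These column percentages can be reproduced by percentizing separately within each category of the independent variable; a minimal sketch using the frequency table's counts follows (a figure may differ from the published percentages by about a tenth of a point because of rounding or survey weighting).

    # Sketch: percentizing down the columns, i.e., within categories of the independent
    # variable (gender). Counts are from the frequency table above; results may differ
    # from the published percentages by about a tenth of a point.
    counts = {
        "Male":   {"Sun goes around Earth (WRONG)": 72,  "Earth goes around Sun (RIGHT)": 468},
        "Female": {"Sun goes around Earth (WRONG)": 165, "Earth goes around Sun (RIGHT)": 461},
    }

    for gender, answers in counts.items():
        casebase = sum(answers.values())          # 540 men, 626 women
        for answer, n in answers.items():
            print(f"{gender}, {answer}: {100 * n / casebase:.1f}%")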
A multivariate table jointly and simultaneously cross-classifies each individual on at least three characteristics. For example, an Hispanic Female Business Major is simultaneously cross-classified on her ethnicity, her gender, and her college.
For example, we might have a multivariate distribution on gender (male/female), marital status (married/not married), and presence of children (yes/no), so that each person falls into one of eight combinations (married women with children, unmarried men without children, and so on).
With two marital status categories (married versus unmarried), two sexes (female and male), and two parental statuses (with children versus without), this is a 2 X 2 X 2, or 8-cell, table.
We could set up our example this way:
 | Women: Married | Women: Not Married | Men: Married | Men: Not Married
Has Children | MarFeKid | UnMarFeKid | MarMenKid | UnMarMenKid
No Children | MarFeNoKid | UnMarFeNoKid | MarMenNoKid | UnMarMenNoKid
Notice that we have now created TWO separate tables side by side that examine how marital status relates to the presence of children in the home, one for men and one for women.
The use of separate or "partial" tables or "subtables" to examine the original relationship between two variables within categories of a third variable (e.g., looking at how marital status influences the presence or absence of children separately for each category of a control variable such as gender) is also called physical control because we have physically separated the cases into groups using the control variable. Generally physical control is an inefficient way to analyze cross-tabulation tables.
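In software, physical control amounts to splitting the file on the control variable and producing one partial table per category. A minimal sketch with a hypothetical data frame:

    # Sketch: "physical control" -- one partial (sub)table of marital status by children
    # for each category of the control variable (gender). The data are hypothetical.
    import pandas as pd

    people = pd.DataFrame({
        "gender":   ["Female", "Female", "Male", "Male", "Female", "Male"],
        "married":  ["Married", "Not married", "Married", "Not married", "Married", "Married"],
        "children": ["Yes", "No", "Yes", "No", "No", "Yes"],
    })

    for gender, subgroup in people.groupby("gender"):
        print(f"\nPartial table for: {gender}")
        print(pd.crosstab(subgroup["children"], subgroup["married"]))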
The odds on a variable are simply the ratio of the frequency of one category of the variable to the frequency of another category of the same variable. Although we can take odds on any variable, in bivariate and multivariate analysis, if we are able to designate a dependent (response) variable, we typically take the odds on the categories of the dependent variable.
Typically, in a binary (two-category) variable we designate one of the categories as a "success" and the other as a "failure". This has nothing to do with the social or emotive meaning of success and failure. For example, in examining death rates from a disease, we might easily designate death as a "success" and survival as a "failure".
Odds can vary from zero (no successes at all) to infinity. They are undefined when there are no "failures". Odds are fractional when there are more failures than successes. For example, if most people with a disease survive, the odds will be fractional. Unlike the Linear Probability Model, the odds do not have a truncated variance.
Let's re-examine the table for gender and
the "planets" question.
How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

GENDER --> | Male | Female | Total
Answer to Planetary Question:
Sun goes around Earth (WRONG) | 72 | 165 | 237
Earth goes around Sun (RIGHT) | 468 | 461 | 929
Total | 540 | 626 | 1166
Source: NSF Surveys of Public Understanding
of Science and Technology, 2014, Director, General Social Survey (NORC).
n = 1166 valid cases
The odds on GENDER (using Male as
a "success") for the total group is 540/626 = .86
That is, someone is 86% as likely to be
male as to be female.
The odds on the PLANETARY QUESTION (using the right answer as a "success") for the total group is 929/237 or 3.92. Right answers occur nearly four times as often as wrong answers do.
We can calculate conditional odds within each category of the independent (explanatory) variable. For our tabular example above, the odds of giving the correct to the incorrect answer are:
For men: 468/72 or 6.50
For women: 461/165 or 2.79
Although both sexes are more likely to give right answers than wrong answers, among men the odds of a right answer to a wrong answer are more than six to one, whereas among women those odds are nearly three to one.
We can then examine the second order odds or odds-ratio. This is comparing the odds across groups or categories of the independent variable. In this instance (remembering again that "successes" form the numerator) we would divide the odds for men on the planetary question by the odds for women on the planetary question or:
6.50/2.79 = 2.33
When the odds-ratio or second order odds is 1, we have the case of statistical independence, i.e., the independent variable has no influence or is unrelated to the dependent variable. The odds on the dependent variable are the same regardless of whether one is male or female.
In this example, the odds-ratio of 2.33 is suggestive. The odds of giving the right answer rather than the wrong answer are a little more than twice as large for men as for women. It is important in the interpretation of the odds that I give both the successes and the failures. After all, a glance at the table should reassure you that it would be incorrect to say men were twice as likely as women to give the correct answer; that would be untrue. But the odds for men are more than twice the odds for women. (In this example, women were proportionately about twice as likely to give the wrong answer--but that was actually just a coincidence.)
I say this is suggestive because at this point we have not done any formal tests of statistical significance. But have patience because before you know it, we will.
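The conditional odds and the odds-ratio are easy to reproduce by hand or in a few lines of code; a minimal sketch using the counts in the table above:

    # Sketch: conditional odds and the odds-ratio from the 2 x 2 gender-by-answer table.
    right_men, wrong_men = 468, 72
    right_women, wrong_women = 461, 165

    odds_men = right_men / wrong_men           # 6.50: right vs. wrong answers among men
    odds_women = right_women / wrong_women     # 2.79: right vs. wrong answers among women
    odds_ratio = odds_men / odds_women         # 2.33: the second-order odds, or odds-ratio

    print(round(odds_men, 2), round(odds_women, 2), round(odds_ratio, 2))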
What happens if your variable has more than two categories?
In this case, several possibilities open
up. Some of them depend on whether the categories have an inherent order
to them (ordinal data) or whether category names or values are simply labels
or "tags" (nominal data).
One possibility is to examine, in turn, the odds of a "success" against all other categories combined (recall that Jim Davis, of Chicago, Harvard College, and Dartmouth College, said years ago that we can always dichotomize the states of the USA into New Hampshire versus all other states combined). In the case of ordinal categories we may wish to calculate odds on adjacent categories. If we have K categories, we can form K - 1 odds. We cannot form K independent odds because the Kth odds is linearly dependent upon the first K - 1 odds: if we know the first K - 1 categories and the total, we can always recover the Kth category, and hence its odds, through subtraction in the table.
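One way to see the K - 1 odds in practice is to pick a reference category and take each remaining category's frequency against it; the sketch below does this for the five telephone categories from earlier in this Guide, with "Landline in home only" as my (arbitrary) reference category.

    # Sketch: forming K - 1 odds for a K-category variable against a reference category.
    # Frequencies are from the telephone table earlier in this Guide; the choice of
    # "Landline in home only" as the reference category is arbitrary.
    freqs = {"Landline in home only": 1086, "Cell phone": 1296,
             "Phone elsewhere": 74, "Refused": 63, "No telephone": 12}

    reference = "Landline in home only"
    for category, n in freqs.items():
        if category != reference:
            print(f"{category} vs. {reference}: odds = {n / freqs[reference]:.3f}")
    # Five categories yield K - 1 = 4 odds; a fifth would be redundant.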
We call the natural log (base e, Euler's number, i.e., 2.718...) of the odds or the odds-ratio the "log-odds" or logit. I use the abbreviation "ln" to distinguish base e logarithms from base 10 logarithms. Most statistical uses of logarithms use base e (natural) logs.
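For instance, the natural log of the gender odds-ratio computed above is a logit-scale quantity:

    # Sketch: the logit (log-odds) scale is the natural (base e) log of the odds.
    import math

    odds_ratio = 2.33                       # men vs. women odds-ratio from the planetary question
    print(round(math.log(odds_ratio), 3))   # ln(2.33) is about 0.846
    # An odds-ratio of 1 (statistical independence) corresponds to a log value of 0.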
In later guides we will see how we can extend the odds ratio to any number of variables and how to calculate them. Get comfortable with the terminology for now.
We are used to null hypotheses that say in essence: if the parameter we're interested in is zero, what is the probability or likelihood of observing the results we did by chance?
We are used to direct estimation methods for parameters that basically take a one-step procedure, whether using linear (matrix) algebra or partial derivatives in calculus. That's what we do in OLS regression.
But causal model statistics are different, whether they are structural equation models or loglinear models. Most use Maximum Likelihood Estimators (MLEs) and iterative methods to solve.
With MLEs we ask the question: given the data, how likely are the parameters (sort of the reverse of sentence 1)? With MLEs we estimate a host of possible parameters, each with a probability attached to it (see the graphs in Agresti). The MLE solution you see on your computer output is the one with the highest overall probability attached to it, hence MAXIMUM likelihood estimators.
With large samples, MLE models often give
results that are similar to those estimated through more direct methods,
such as Ordinary Least Squares regression. But in many cases MLE parameters
are quite different. This is especially true with more complex models so
do not think you can always substitute different kinds of estimators.
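To make the "maximum" in maximum likelihood concrete, here is a minimal sketch (mine, not Agresti's) that scans many candidate values of a single binomial proportion and keeps the one the observed data make most likely, reusing the 929 correct answers out of 1,166 from the planetary question:

    # Sketch: maximum likelihood by brute-force search for one binomial proportion.
    # We score many candidate parameter values by their log-likelihood given the data
    # (929 "successes" in 1166 trials) and keep the highest-scoring candidate.
    import math

    successes, trials = 929, 1166

    def log_likelihood(p: float) -> float:
        return successes * math.log(p) + (trials - successes) * math.log(1 - p)

    candidates = [i / 1000 for i in range(1, 1000)]     # 0.001, 0.002, ..., 0.999
    mle = max(candidates, key=log_likelihood)
    print(mle)   # about 0.797 -- the sample proportion 929/1166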
Susan Carol Losh
January 15 2017