THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 2: TABLE NOMENCLATURE
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
KEY TAKEAWAYS
Here is the link to the web pages on three variable crosstabulation tables and causal issues:
http://myweb.fsu.edu/slosh/IntroStatsGuide6.html
Here is the link to the web pages on online data archives:
http://myweb.fsu.edu/slosh/MethodsGuide8.html
AN AGRESTI NOTE: Agresti places relatively more importance on various logistic regression models, whereas I place relatively more importance on loglinear models. I want to mention a few important reasons for this difference in emphasis, for my preference, and for the order of topics and readings:
Loglinear models are the BASIC models: logistic regression and logit models are derived from, and tested by, the underlying and corresponding loglinear model.
Other models, such as ordinal regression, are also derived from the underlying loglinear model.
If you are comfortable with loglinear models, all the other derived models will be much easier for you (based on my experience learning and teaching this material).
Causal models can be tested relatively easily with loglinear models, but virtually all logistic regression models are essentially two-stage causal models, similar to an OLS regression model. I am often interested in causal models, so loglinear models are the way to go.
A TABLE | TABLES | ODDS-RATIOS
A TABLE is a common and useful way to present data.
Don't be a snob about tables! A table is the most useful basic building block in your tool chest of data analytic techniques and presentation of your results. If you can construct a simple table thoroughly, everyone (including you) will be able to assess your basic results. You could even write an entire dissertation using tables alone. And, of course, crosstabulation tables form the bedrock foundation of this course material.
Further, as we will shortly see, tables can become increasingly complex. You can (and we will) present joint distributions of three or more variables in tables.
A bit boring but necessary (no matter how basic it seems right now):

These are rows: a row stretches from the left-hand side of the table to the right-hand side of the table. By convention, the top row is number 1.

These are columns: columns start at the top of the table and plummet straight down to the bottom of the table. By convention, the FAR LEFT column is designated number 1.
A univariate table addresses one variable at a time. Sometimes univariate distributions are called "the marginals" (since they form the margins of a two-way or N-way table) or "marginal distributions."
A bivariate table addresses the simultaneous joint distribution of two variables. For example, the following combinations of values jointly and simultaneously cross-classify each individual on two characteristics, their college and their gender:
Male Business Major
Female Business Major
Male Education Major
Female Education Major
Male Humanities Major
Female Humanities Major
and so on.
The juxtaposition of a particular row in the table with a particular column produces a "cell" in the table.
By convention, we give the row first and the column second to locate a particular CELL in a bivariate table. The "female, business major" cell is row 1, column 2 or just "1,2" for short. (We don't count the variable and value labels as a row or a column.)
Staying with the example of gender and college major, we could present a bivariate table as follows:
DISTRIBUTION OF COLLEGE MAJOR BY GENDER
AT FLORIDA STATE UNIVERSITY 2004
College Major | Gender: Male | Gender: Female | Row Totals
Business | cell 1,1 | cell 1,2 |
Education | cell 2,1 | cell 2,2 |
Humanities | cell 3,1 | cell 3,2 |
(Other majors, entered row by row) | | |
Column Totals (for males and females) | | |
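Most statistical packages will build such a table directly from raw (case-level) data. Below is a minimal sketch in Python with pandas; the tiny data set and variable names are hypothetical, invented only to illustrate the layout.

    # A minimal sketch (not from the Guide): building a bivariate frequency table
    # in Python with pandas. The miniature data set below is hypothetical.
    import pandas as pd

    students = pd.DataFrame({
        "major":  ["Business", "Business", "Education", "Humanities", "Education", "Business"],
        "gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    })

    # Rows are the college major, columns are gender; margins=True appends the
    # row totals, column totals, and grand total described above.
    table = pd.crosstab(students["major"], students["gender"],
                        margins=True, margins_name="Total")
    print(table)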
The table itself is a rectangular array in at least two dimensional space. It typically shows the value or category of the variable and the number of cases that fall into that category. It may also give the percentage of cases that fall into each category (but usually not both frequencies and percents).
There are a lot of ways to construct univariate and bivariate tables. This Guide follows commonly accepted statistical practices.
Let's start by examining one variable at a time. Consider the following univariate frequency distribution AND percentage table from the 2014 General Social Survey, a national in-person probability survey conducted every other year (in even-numbered years) by NORC at the University of Chicago:
TITLE: Percentage of United States Households with a Particular Type of Telephone (est.), 2014

Type of Telephone Usage | Frequency | Percent
Landline in home only | 1,086 | 42.9%
Cell phone | 1,296 | 51.2
Phone elsewhere | 74 | 2.9
Refused | 63 | 2.5
No telephone | 12 | 0.5
Total households | 2,531 | 100.0%
Source = 2014 General Social Survey,
NORC, University of Chicago: http://gss.norc.org/
1. THE TITLE
Each table must have a title that briefly but accurately describes the contents of the table. This means that if, for example, you have a bivariate distribution, you should include the names of BOTH VARIABLES in the title, such as "Type of Telephone by Marital Status".
2. THE VARIABLE(S) OF INTEREST
In my univariate example, there is ONE variable of interest, what kind of telephone (if any) there is in the household.
3. THE CATEGORIES OF THE VARIABLE
In my example above, there are five categories: Landline in home only, Cell phone, Phone elsewhere (e.g., a neighbor's phone), Refused, and No telephone. The percentage of households with no telephone, by the way, held at about five percent for several years; it is dropping now with "throw-away" cell phones and government financial assistance.
4A. THE NUMBER OF CASES OR THE FREQUENCY IN EACH CATEGORY OF THE VARIABLE
The total collection of every category name with its associated frequency is the univariate frequency distribution.
4B. THE PERCENTAGE OF CASES IN EACH CATEGORY OF THE VARIABLE
Percentages are very handy, because they are a standardized measure, or on a "per 100" standard. The total collection of every category name with its associated percentage is the univariate percentage distribution.
This section reads 4A or 4B because typically either the frequency distribution OR the percentage distribution is presented, but not both. Presenting too many numbers clutters the table and makes it more difficult to read. American Psychological Association (APA) standards call for "lean tables."
5. THE TOTAL CASE BASE
In my example, that's (about) 2531 households. If you have the total case base and the percentages, you can recalculate the category frequencies.
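As a quick check, a reader who has the case base and the percentages can reconstruct the approximate frequencies. A minimal sketch in Python using the telephone table above:

    # Sketch: recovering approximate category frequencies from percentages and the case base.
    # Percentages come from the telephone table above; expect small rounding error.
    case_base = 2531
    percents = {"Landline in home only": 42.9, "Cell phone": 51.2,
                "Phone elsewhere": 2.9, "Refused": 2.5, "No telephone": 0.5}

    for category, pct in percents.items():
        frequency = round(case_base * pct / 100)   # e.g., 2531 * 42.9 / 100 is about 1,086
        print(f"{category}: about {frequency} households")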
You would be amazed how many people omit the total case base, including authors in professional journals. In the back of my mind, I always get just a little suspicious.
What are they hiding? Were they ashamed of the number of cases? ("Three out of every four dentists recommend Blanca Blazing White toothpaste" isn't very impressive when there are only four dentists.)
Did they manage to forget how many cases they collected?
6. THE SOURCE OF THE DATA
Who collected these data? The United States government? NORC? A freshman undergraduate for a psychology project? Your Aunt Millie?
It is important to know the source of the data because clearly some sources have more of a reputation for collecting data in a trustworthy, systematic and reliable way than others.
The United States government, the National Opinion Research Center (NORC), federal agencies of many countries around the world, certain private companies (e.g., the Roper Company), all have good reputations for the care that they take in data collection. You will want to know the collector of the data so that you can interpret the data in context.1
1By the way, despite what you may read, in fact the polls called things correctly in the 2016 U.S. Presidential race. They predicted Hillary Clinton would win the national popular vote. And, indeed, so she did, by nearly three million voters nationwide. However, the U.S. presidency is decided by the Electoral College, which allots a certain number of electoral votes to each state. Donald Trump did win the Electoral College.
A bivariate distribution simultaneously and jointly cross-classifies the scores on a case for two variables.
For example, if we have a bivariate distribution of gender and support for President Obama (favorable/unfavorable) we can simultaneously cross-classify people as favorable males, favorable females, unfavorable males and unfavorable females.
The jointly cross-classified cases form the "cells" or interior of the table. Each cell has a frequency of cases that have a JOINT score considering both variables simultaneously.
The univariate summaries for each variable separately (for example, male or female) appear at the bottom of the table for the independent variable and at the far right of the table for the dependent variable. Because these row and column totals sit in the margins of the table, they are called "the marginals." Remember that the cells are labelled with the row number first, then the column number.
The grand total is usually presented in the lower right corner of the bivariate table.
Title: Generic Bivariate Table

 | Variable X, Value 1 | Variable X, Value 2 | Row Totals
Variable Y, Value 1 | (Cell 1,1) | (Cell 1,2) | Marginal Total, Y Value 1
Variable Y, Value 2 | (Cell 2,1) | (Cell 2,2) | Marginal Total, Y Value 2
Column Totals | Marginal Total, X Value 1 | Marginal Total, X Value 2 | Grand Total
Then, with values supplied for each variable:
Title: Attitude toward President Obama by Gender

 | MALE | FEMALE | Row Totals
FAVORABLE | Male Favorable (Cell 1,1) | Female Favorable (Cell 1,2) | Total Favorable
UNFAVORABLE | Male Unfavorable (Cell 2,1) | Female Unfavorable (Cell 2,2) | Total Unfavorable
Column Totals | Total Male | Total Female | Grand Total
These terms: cell, grand total, and
marginal total are important because each of them, as well as combinations
of table cells, can become a parameter to be modelled in loglinear analysis.
So we need to be familiar with this terminology.
The size of a crosstabulation table (which is the total number of cells) depends on how many rows and columns are in the table.
In turn, the number of rows or columns depends on how many values or categories each variable has.
If the row variable has 3 categories and the column variable has 4 categories, the result is a "3 by 4" table.
CONVENTION: The row number always comes first.
Square tables have the same number of rows and columns (e.g., a 2 X 2 table such as the example above).
The size of the table and the number
of marginal categories for each variable become critical in loglinear models
in setting the degrees of freedom.
The total case base becomes important
in statistical significance testing.
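As a reference point (my summary, not the Guide's wording): the size of a two-way table is simply rows times columns, and for the usual independence model the degrees of freedom are (rows - 1) times (columns - 1). A minimal sketch:

    # Sketch: cells and degrees of freedom for a two-way table.
    def table_size(rows: int, columns: int) -> int:
        return rows * columns                 # total number of cells

    def independence_df(rows: int, columns: int) -> int:
        # Degrees of freedom for the independence model in a two-way table.
        return (rows - 1) * (columns - 1)

    print(table_size(3, 4))        # a "3 by 4" table has 12 cells
    print(independence_df(3, 4))   # 6 degrees of freedom
    print(independence_df(2, 2))   # a 2 x 2 table leaves 1 degree of freedom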
The Bivariate Percentage Table is a variation on our old friend, the univariate percentage table. However, bivariate percentages allow us to compare and contrast group similarities and differences. I have the very simplest bivariate table, a 2 X 2 table, below. There is one column each for women and men, one row for the correct answer and one row for the incorrect answer. The first table shown (also found in Guide 2) is the Bivariate Frequency Distribution:
How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

NOTE: By convention, categories of the independent variable typically form the COLUMNS of the table (see below).

Answer to Question: | Male | Female | Total
Sun goes around Earth (WRONG) | 72 (r1, c1) | 165 | 237
Earth goes around Sun (RIGHT) | 468 | 461 | 929
Total (separate totals for men and women, then a total for everyone combined) | 540 | 626 | 1166
Source: NSF Surveys of Public Understanding
of Science and Technology, 2014, Director, General Social Survey (NORC).
Total n = 1173; MD = 7; valid n = 1166
A key issue is whether to percentize
down the columns or across the rows.
Make no mistake about it, this IS
a key issue and not a matter of semantics. Percentizing in "the wrong direction"
will totally change the meaning of the results that you present.
CONVENTION: Values of the independent variable create the columns of the table.
For example, the two values of gender:
male and female, head each column in my sample table.
Remember, gender might cause science knowledge,
but we know science knowledge CANNOT cause biological sex. Therefore, if
there is an independent variable, gender is it. Science knowledge is the
possible effect, or dependent variable. (Review Guide
1 on causal order in non-experimental data if you are not
convinced.)
IMPORTANT NOTE: AGRESTI AND TERMINOLOGY. Sometimes you will notice that Agresti "flip-flops" on whether the categories of the independent (explanatory) variable or the dependent (response) variable form the columns of the table. Mostly he has the explanatory or independent variable as the row variable rather than the column variable. However, because the norm (see most journal articles in the social and behavioral sciences) is to have the independent variable form the columns of the table, I will follow the convention of having the independent variable comprise the table columns throughout this course.
CONVENTION: Percentize separately within values of the independent variable.
In my example, this means that first I calculate the percentage giving correct and incorrect responses among men. I then repeat the process, calculating the percentage giving correct and incorrect responses among women.
Once I have done so, I can now specify the percentage of men who give the right answer (the Earth goes around the Sun) and the percentage of women who give the right answer, and then directly compare women and men.
These percentages within gender are different numbers, and they mean something entirely different from the following question:
among those who think the Earth goes around the Sun, what percent are female? (Answer: 461/929 X 100, or 49.6%.) Since women are 626/1166 X 100, or 53.7 percent, of the sample, we can see that women are slightly underrepresented among those giving the correct answer. Notice below that neither column of the percentage table has a figure of 49.6%.
CONVENTION: Remember that when the columns are formed by categories of the independent variable, a percent sign ONLY goes at the top of each column (in this case, the "wrong answer") and after the 100 percent at the bottom of each column. (Note: this is because the values of the independent variable form the columns of the table.) This also helps to produce a "lean table."
These conventions are particularly important as the number of values for each variable grows and as the number of variables grows. They help your reader immediately discern which way the percentages are calculated, and they make your table much easier to read.
How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

Answer to Question: | Male | Female
Sun goes around Earth (WRONG) | 13.4% | 26.3%
Earth goes around Sun (RIGHT) | 86.6 | 73.7
Total | 100.0% | 100.0%
Casebase | 540 | 626
Source: NSF Surveys of Public Understanding
of Science and Technology, 2014, Director, General Social Survey (NORC).
n = 1166 valid cases (P.S. Yes, these
are real data and real percentages. Sigh.)
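These column percentages can be reproduced by percentizing separately within each category of the independent variable; a minimal sketch using the frequency table's counts follows (a figure may differ from the published percentages by about a tenth of a point because of rounding or survey weighting).

    # Sketch: percentizing down the columns, i.e., within categories of the independent
    # variable (gender). Counts are from the frequency table above; results may differ
    # from the published percentages by about a tenth of a point.
    counts = {
        "Male":   {"Sun goes around Earth (WRONG)": 72,  "Earth goes around Sun (RIGHT)": 468},
        "Female": {"Sun goes around Earth (WRONG)": 165, "Earth goes around Sun (RIGHT)": 461},
    }

    for gender, answers in counts.items():
        casebase = sum(answers.values())          # 540 men, 626 women
        for answer, n in answers.items():
            print(f"{gender}, {answer}: {100 * n / casebase:.1f}%")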
A multivariate table jointly and simultaneously cross-classifies each individual on at least three characteristics. For example, an Hispanic Female Business Major is simultaneously cross-classified on her ethnicity, her gender, and her college.
For example, we might have a multivariate distribution on gender (male/female), marital status (married/not married), and presence of children (yes/no), so that each person falls into one of eight combinations (married women with children, unmarried men without children, and so on).
With two marital status categories (married versus unmarried), two sexes (female and male), and two parental statuses (with children versus without), this is a 2 X 2 X 2, or 8-cell, table.
We could set up our example this way:
 | Women: Married | Women: Not Married | Men: Married | Men: Not Married
Has Children | MarFeKid | UnMarFeKid | MarMenKid | UnMarMenKid
No Children | MarFeNoKid | UnMarFeNoKid | MarMenNoKid | UnMarMenNoKid
Notice that we have now created TWO separate tables side by side that examine how marital status relates to the presence of children in the home, one for men and one for women.
The use of separate or "partial" tables or "subtables" to examine the original relationship between two variables within categories of a third variable (e.g., looking at how marital status influences the presence or absence of children separately for each category of a control variable such as gender) is also called physical control because we have physically separated the cases into groups using the control variable. Generally physical control is an inefficient way to analyze cross-tabulation tables.
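In software, physical control amounts to splitting the file on the control variable and producing one partial table per category. A minimal sketch with a hypothetical data frame:

    # Sketch: "physical control" -- one partial (sub)table of marital status by children
    # for each category of the control variable (gender). The data are hypothetical.
    import pandas as pd

    people = pd.DataFrame({
        "gender":   ["Female", "Female", "Male", "Male", "Female", "Male"],
        "married":  ["Married", "Not married", "Married", "Not married", "Married", "Married"],
        "children": ["Yes", "No", "Yes", "No", "No", "Yes"],
    })

    for gender, subgroup in people.groupby("gender"):
        print(f"\nPartial table for: {gender}")
        print(pd.crosstab(subgroup["children"], subgroup["married"]))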
The odds on a variable are simply the ratio of the frequency of one category of the variable to the frequency of another category of the same variable. Although we can take odds on any variable, in bivariate and multivariate analysis, if we are able to designate a dependent (response) variable, we typically take the odds on the categories of the dependent variable.
Typically, in a binary (two-category) variable we designate one of the categories as a "success" and the other as a "failure". This has nothing to do with the social or emotive meaning of success and failure. For example, in examining death rates from a disease, we might easily designate death as a "success" and survival as a "failure".
Odds can vary from zero (no successes at all) to infinity. They are undefined when there are no "failures". Odds are fractional when there are more failures than successes. For example, if most people with a disease survive, the odds will be fractional. Unlike the Linear Probability Model, the odds do not have a truncated variance.
Let's re-examine the table for gender and
the "planets" question.
How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

GENDER --> | Male | Female | Total
Answer to Planetary Question:
Sun goes around Earth (WRONG) | 72 | 165 | 237
Earth goes around Sun (RIGHT) | 468 | 461 | 929
Total | 540 | 626 | 1166
Source: NSF Surveys of Public Understanding
of Science and Technology, 2014, Director, General Social Survey (NORC).
n = 1166 valid cases
The odds on GENDER (using Male as
a "success") for the total group is 540/626 = .86
That is, someone is 86% as likely to be
male as to be female.
The odds on the PLANETARY QUESTION (using the right answer as a "success") for the total group is 929/237 or 3.92. Right answers occur nearly four times as often as wrong answers do.
We can calculate conditional odds within each category of the independent (explanatory) variable. For our tabular example above, the odds of giving the correct to the incorrect answer are:
For men: 468/72 or 6.50
For women: 461/165 or 2.79
Although both sexes are more likely to give right answers than wrong answers, among men the odds of a right answer to a wrong answer are more than six to one, whereas among women those odds are nearly three to one.
We can then examine the second order odds or odds-ratio. This is comparing the odds across groups or categories of the independent variable. In this instance (remembering again that "successes" form the numerator) we would divide the odds for men on the planetary question by the odds for women on the planetary question or:
6.50/2.79 = 2.33
When the odds-ratio or second order odds is 1, we have the case of statistical independence, i.e., the independent variable has no influence or is unrelated to the dependent variable. The odds on the dependent variable are the same regardless of whether one is male or female.
In this example, the odds-ratio of 2.33 is suggestive. The odds of giving the right answer rather than the wrong answer are a little more than twice as large for men as for women. It is important in the interpretation of the odds that I give both the successes and the failures. After all, a glance at the table should reassure you that it would be incorrect to say men were twice as likely as women to give the correct answer; that would be untrue. But the odds for men are more than twice the odds for women. (In this example, women were proportionately about twice as likely to give the wrong answer--but that was actually just a coincidence.)
I say this is suggestive because at this point we have not done any formal tests of statistical significance. But have patience because before you know it, we will.
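The conditional odds and the odds-ratio are easy to reproduce by hand or in a few lines of code; a minimal sketch using the counts in the table above:

    # Sketch: conditional odds and the odds-ratio from the 2 x 2 gender-by-answer table.
    right_men, wrong_men = 468, 72
    right_women, wrong_women = 461, 165

    odds_men = right_men / wrong_men           # 6.50: right vs. wrong answers among men
    odds_women = right_women / wrong_women     # 2.79: right vs. wrong answers among women
    odds_ratio = odds_men / odds_women         # 2.33: the second-order odds, or odds-ratio

    print(round(odds_men, 2), round(odds_women, 2), round(odds_ratio, 2))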
What happens if your variable has more than two categories?
In this case, several possibilities open
up. Some of them depend on whether the categories have an inherent order
to them (ordinal data) or whether category names or values are simply labels
or "tags" (nominal data).
One possibility is to examine, in turn, the odds of a "success" against all other categories combined (recall that Jim Davis, of Chicago, Harvard College, and Dartmouth College, said years ago that we can always dichotomize the states of the USA into New Hampshire versus all other states combined). In the case of ordinal categories we may wish to calculate odds on adjacent categories. If we have K categories, we can form K - 1 odds. We cannot form K independent odds because the Kth odds is linearly dependent upon the first K - 1 odds: if we know the first K - 1 categories and the total, we can always recover the Kth category, and hence its odds, through subtraction in the table.
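One way to see the K - 1 odds in practice is to pick a reference category and take each remaining category's frequency against it; the sketch below does this for the five telephone categories from earlier in this Guide, with "Landline in home only" as my (arbitrary) reference category.

    # Sketch: forming K - 1 odds for a K-category variable against a reference category.
    # Frequencies are from the telephone table earlier in this Guide; the choice of
    # "Landline in home only" as the reference category is arbitrary.
    freqs = {"Landline in home only": 1086, "Cell phone": 1296,
             "Phone elsewhere": 74, "Refused": 63, "No telephone": 12}

    reference = "Landline in home only"
    for category, n in freqs.items():
        if category != reference:
            print(f"{category} vs. {reference}: odds = {n / freqs[reference]:.3f}")
    # Five categories yield K - 1 = 4 odds; a fifth would be redundant.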
We call the natural log (base e, Euler's number, i.e., 2.718...) of the odds or the odds-ratio the "log-odds" or logit. I use the abbreviation "ln" to distinguish base e logarithms from base 10 logarithms. Most statistical uses of logarithms use base e (natural) logs.
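For instance, the natural log of the gender odds-ratio computed above is a logit-scale quantity:

    # Sketch: the logit (log-odds) scale is the natural (base e) log of the odds.
    import math

    odds_ratio = 2.33                       # men vs. women odds-ratio from the planetary question
    print(round(math.log(odds_ratio), 3))   # ln(2.33) is about 0.846
    # An odds-ratio of 1 (statistical independence) corresponds to a log value of 0.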
In later guides we will see how we can extend the odds ratio to any number of variables and how to calculate them. Get comfortable with the terminology for now.
We are used to null hypotheses that say in essence: if the parameter we're interested in is zero, what is the probability or likelihood of observing the results we did by chance?
We are used to direct estimation methods for parameters that basically take a one-step procedure, whether using linear (matrix) algebra or partial derivatives in calculus. That's what we do in OLS regression.
But causal model statistics are different, whether they are structural equation models or loglinear models. Most use Maximum Likelihood Estimators (MLEs) and iterative methods to solve.
With MLEs we ask the question: given the data, how likely are the parameters (sort of the reverse of sentence 1)? With MLEs we estimate a host of possible parameters, each with a probability attached to it (see the graphs in Agresti). The MLE solution you see on your computer output is the one with the highest overall probability attached to it, hence MAXIMUM likelihood estimators.
With large samples, MLE models often give
results that are similar to those estimated through more direct methods,
such as Ordinary Least Squares regression. But in many cases MLE parameters
are quite different. This is especially true with more complex models so
do not think you can always substitute different kinds of estimators.
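To make the "maximum" in maximum likelihood concrete, here is a minimal sketch (mine, not Agresti's) that scans many candidate values of a single binomial proportion and keeps the one the observed data make most likely, reusing the 929 correct answers out of 1,166 from the planetary question:

    # Sketch: maximum likelihood by brute-force search for one binomial proportion.
    # We score many candidate parameter values by their log-likelihood given the data
    # (929 "successes" in 1166 trials) and keep the highest-scoring candidate.
    import math

    successes, trials = 929, 1166

    def log_likelihood(p: float) -> float:
        return successes * math.log(p) + (trials - successes) * math.log(1 - p)

    candidates = [i / 1000 for i in range(1, 1000)]     # 0.001, 0.002, ..., 0.999
    mle = max(candidates, key=log_likelihood)
    print(mle)   # about 0.797 -- the sample proportion 929/1166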
Susan Carol Losh
January 15 2017