READ THIS GUIDE FIRST!
2010
SOME EXAMPLES:
OK, you just collected a lot of data for your thesis or dissertation. OR
You were just designated by your workplace to compile statistics on the number of rehabilitation admissions. OR
You are a public administrator preparing a report for the city commission. OR
Perhaps you availed yourself of one of the increasing numbers of online database archives, for example, to investigate gender and computer access to check out "the digital divide."
How are you going to compile all this information in a form that is succinct, easy to read and understand, and yet does justice to your data?
What you are staring at right now is a pile of questionnaires or a page filled up with numbers. Lots of questionnaires or lots of numbers don't convey any kind of intelligible information to anyone. So, you must systematically reduce all this information to a form that you can easily manage and describe.
A TABLE is a common and useful way to present data.
Don't be a snob about tables. OK, they aren't advanced statistics.
But a table is the most useful basic building block in your tool chest of data analytic techniques and the presentation of your results. If you can construct a simple table thoroughly, everyone (including you) will be able to assess your basic results. You could even write an entire dissertation using tables alone.
Further, tables can become increasingly complex. You can present joint distributions of three or more variables in tables, once you understand how to construct a basic table with just one variable.
Percentages are very useful too. We will see in this guide that you can go a long way with the basic percent and its variations.
These are rows. This is ROW #1.
A row stretches from the left-hand side of the table to the right-hand side of the table.
By convention, the top row is number 1.
Below are columns.
Columns start at the top of the table and plummet straight down to the bottom of the table. By convention, the FAR LEFT column is designated number 1.
A univariate table addresses only one variable at a time.
A bivariate table addresses the joint distribution of two variables. For example, the following combinations of values jointly and simultaneously cross-classify each individual on two characteristics, their college and their gender:
Male Business Major
Female Business Major
Male Education Major
Female Education Major
Male Humanities Major
Female Humanities Major
and so on.
By convention, we give the row first and the column second to locate a particular CELL in a bivariate table. The "female, business major" cell is row 1, column 2 or just "1,2" for short.
Staying with the example of gender and college major, we could present a bivariate table as follows:
DISTRIBUTION OF COLLEGE MAJOR BY GENDER
AT FLORIDA STATE UNIVERSITY 2004
College Major | Gender: Male | Gender: Female |
Business | 1,1 | 1,2 |
Education | 2,1 | 2,2 |
Humanities | 3,1 | 3,2 |
Other majors, entered row by row | | |
Total | | |
A multivariate table jointly and simultaneously cross-classifies each individual on at least three characteristics. For example, a Hispanic Female Business Major is simultaneously classified on her ethnicity, her gender, and her college.
The table itself is a rectangular array in at least two-dimensional space. It typically shows the value or category of the variable and the number of cases that fall into that category. It may also give the percentage of cases that fall into each category (but usually not both frequencies and percents).
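As a rough illustration of cross-classification, here is a minimal Python sketch that counts how many individuals fall into each cell of a bivariate table. The `students` records are hypothetical and made up for this example; only the idea of locating a cell by (row, column) comes from the text above.

```python
from collections import Counter

# Hypothetical records: one (college major, gender) pair per student.
students = [
    ("Business", "Male"), ("Business", "Female"), ("Education", "Female"),
    ("Humanities", "Male"), ("Education", "Female"), ("Business", "Female"),
]

# Each distinct (row, column) combination is one cell of the bivariate table.
cells = Counter(students)

majors = ["Business", "Education", "Humanities"]   # rows
genders = ["Male", "Female"]                       # columns

# Print the table: row label first, then one count per column.
print(f"{'College Major':<15}" + "".join(f"{g:>8}" for g in genders))
for major in majors:
    counts = [cells[(major, g)] for g in genders]
    print(f"{major:<15}" + "".join(f"{c:>8}" for c in counts))
```

A `Counter` returns zero for a combination that never occurs, so empty cells still print as 0 rather than raising an error.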
There are a lot of ways to construct univariate and bivariate tables. This Guide follows commonly accepted statistical practices.
Let's start by examining one variable at a time. Consider the following univariate frequency AND percentage table from the August 2000 Current Population Survey, which is conducted by the United States government on a regular basis:
TITLE: Percentage of United States Households with a Telephone in the Household
Is there a telephone in the household? | Frequency | Percent |
Has telephone | 118,026 | 87.4% |
No telephone | 6,530 | 4.8 |
No Answer | 10,430 | 7.7 |
Total | 134,986 | 100.0% |
n = 134,986
Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
What are these six pieces of information?
1. THE TITLE
Each table must have a title that briefly but accurately describes the contents of the table. This means that if, for example, you have a bivariate distribution, you should include the names of BOTH VARIABLES in the title.
2. THE VARIABLE(S) OF INTEREST
In my univariate example, there is ONE variable of interest, whether the household contains a telephone (presumably, working.)
When you look at a table, don't YOU want to know what the variables are that are in it? I sure do.
3. THE CATEGORIES OF THE VARIABLE
In my example above, there are three categories: Has telephone, No telephone, and No Answer (which also includes the minuscule categories of "refused" and "don't know").
4A. THE NUMBER OF CASES OR THE FREQUENCY IN EACH CATEGORY OF THE VARIABLE
The total collection of every category name with its associated frequency is the univariate frequency distribution.
4B. THE PERCENTAGE OF CASES IN EACH CATEGORY OF THE VARIABLE
Percentages are very handy, because they are a standardized measure, or on a "per 100" standard. We will examine percentages below. The total collection of every category name with its associated percentage is the univariate percentage distribution.
This section reads 4A or 4B because typically either the frequency distribution OR the percentage distribution is presented, but not both. Presenting too many numbers clutters the table and makes it more difficult to read.
5. THE TOTAL CASE BASE
In my example, that's 134,986.
You would be amazed how many people leave out the total case base, including authors in professional journals. In the back of my mind, I always get just a little suspicious.
What are they hiding?
Were they ashamed of the number of cases? ("Three out of every four dentists recommend Blanca toothpaste" isn't very impressive when there are only four dentists.)
Did they manage to forget how many cases they had collected?
The larger the case base, all other things equal, the more stable the results, so this one is very important.
6. THE SOURCE OF THE DATA
Who collected these data? The United States government? A freshman undergraduate for a psychology project? Your Aunt Millie?
It is important to know the source of the data because clearly some sources have more of a reputation than others for collecting data in a systematic and reliable way.
The United States government, the National Opinion Research Center (NORC), federal agencies of many countries around the world, certain private companies (e.g., the Roper Company), all have excellent reputations for the care that they take in data collection.
You will want to know the collector of the data so that you can interpret the data in context.
It is absolutely amazing how many authors omit this one from a table too. Don't you be one of them.
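The frequency and percentage distributions described in items 4A and 4B can be sketched in a few lines of Python. The frequencies are the ones in the telephone table above; the variable names are my own and are only illustrative.

```python
# Frequencies from the telephone table (Current Population Survey, Aug 2000).
frequencies = {
    "Has telephone": 118_026,
    "No telephone": 6_530,
    "No Answer": 10_430,
}

# Item 5: the total case base is the sum of all category frequencies.
n = sum(frequencies.values())

# Item 4B: each percentage is the category frequency divided by the
# total case base, multiplied by 100 and rounded to one decimal place.
percents = {cat: round(100 * f / n, 1) for cat, f in frequencies.items()}

for cat, pct in percents.items():
    print(f"{cat:<15}{pct:>6}")
print(f"n = {n:,}")
```

Because each percent is rounded separately, the rounded values may add up to slightly more or less than 100.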
Before you can do anything with your data, remember to sort the observations into your constructed exhaustive and mutually exclusive categories.
This is often called coding.
Sometimes coding is easy: your questionnaire may have already categorized people by using closed questions. For example, each person you studied answered either "very concerned", "somewhat concerned", "only a little concerned" or "not concerned at all" when asked about having enough money to buy the things they want.
It is typically easy to place the responses to prestructured, closed questionnaire items into categories, although you may have a problem case or two and must create an "other" category to make the category system exhaustive. The manipulation variables and dependent behaviors in an experiment are also often easy to code.
Other times, you have open-ended questions and respondents answer in their own words. You must decide how to collapse all these different answers into a few categories. Or you have less structured data collected through an ethnography or content analysis. Or, you may have so many categories that you must reduce the number of categories to make the responses intelligible. Coding becomes much more of an art form in such instances.
Remember these additional properties (to be applied where possible). Some of them have been especially amended for tabular display. For example, if you were working with the variable "years of age" and wanted to use it as an independent variable in an interval-level statistical technique, you would leave the categories pretty much as is. You wouldn't want to "collapse" categories, that is, group categories together (as I did with No Answer, Refused and Don't Know in the example I used above). To do so in the "years of age" example would be to lose information.
But a table is different. You wouldn't want to present years of age in (for example) 65 different categories. It would take up two pages and no one would be able to simultaneously remember all that information. Click on the link to look at an example of "year of birth" and you will see how cluttered the page is. It is difficult to draw any conclusions about the distribution of year of birth.
SOME TIPS:
ABOVE ALL, MAKE SENSE!
The CARDINAL RULE in constructing coding categories is to place cases that have something in common together in the same category. Twenty year olds have different interests than 50 year olds so you wouldn't (and shouldn't) want to place them in the same category of age group.
In general, try to keep the number of substantive categories (aside from "other" or "no answer") down to about seven if you can do so without distorting the data. It will be easier for your reader to read your table if there are fewer categories in it.
When you combine a large number of categories by collapsing or grouping some of them together to make a smaller number of total categories, make sure that your combinations make sense. In my example, it would be silly to combine "has telephone" and "no telephone" into one gigantic category.
For another bad example of placing people together who have very little in common, review my collapse of formal level of education into just two categories, eighth grade or less, and ninth grade or more. Just click on the box below:
MORE TIPS:
Keep the meaning of the categories simple. You or your reader should be able to know at a glance what the category means.
Give each category a simple, descriptive, but easy to understand category label or value. "Yes" or "No" works if your question had these as precoded responses, or if the action is so simple (locked car door, for example) that yes or no would suffice.
Make sure each category represents only one dimension. For example, if you were creating a category for college major field, Theater majors typically have a very different set of interests from Business majors, so you wouldn't want to include them in a category together. But you might put Marketing majors and Management majors in the same category.
If you are working with numeric data, try for equal intervals IF THE DATA ALLOW AND IF IT MAKES SENSE! Don't try to torture your data into nonsensical equal interval categories.
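To make the idea of collapsing a numeric variable concrete, here is a small Python sketch. Both the ages and the cut points are hypothetical, chosen only to show categories that are exhaustive and mutually exclusive and that keep 20 year olds apart from 50 year olds; your own data may call for different groupings.

```python
from collections import Counter

# Hypothetical ages in years, one per respondent (not from any real survey).
ages = [19, 23, 31, 38, 47, 52, 58, 64, 71]

def age_group(age):
    """Place an age into exactly one of four illustrative categories."""
    if age < 30:
        return "Under 30"
    if age < 45:
        return "30-44"
    if age < 60:
        return "45-59"
    return "60 or older"

# Every age lands in one and only one category, so the counts
# add back up to the total number of respondents.
distribution = Counter(age_group(a) for a in ages)

for label in ["Under 30", "30-44", "45-59", "60 or older"]:
    print(f"{label:<12}{distribution[label]:>3}")
```

The two middle categories are equal 15-year intervals, while the first and last are open-ended so that no respondent is left out.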
The common percent is an extremely useful measure.
You can use a percent with any kind of data: nominal, ordinal, interval and ratio.
A percent is a standardized measure. Per cent means "per 100 cases."
Because the percent is standardized you can use it to compare results from different population bases that have different sizes or total casebases. For example, you could compare the percentage of home computer ownership among people with elementary school, high school, and college educational levels.
EXAMPLE: suppose you wanted to compare women and men on an item of basic science knowledge: do the father's genes determine the sex of a couple's baby (the correct answer is "yes")? Here is the bivariate frequency distribution:
How Gender Influences Answers to the Question: Does the father's gene determine the sex of the couple's baby?
NOTE: By convention, categories of the independent variable (or "cause") form the COLUMNS of the table.
Answer to Question: | Male | Female | Total |
No, father's gene does not determine baby's sex (WRONG) | 318 (r1, c1) | 232 | 550 |
Yes, father's gene does determine baby's sex (RIGHT) | 436 | 588 | 1024 |
Total (at the bottom of each column are SEPARATE totals for women and men, then a total for everyone combined) | 754 | 820 | 1574 |
Source: NSF Surveys of Public Understanding of Science and Technology, 2001, Director, ORC/Macro New York. n = 1574
As you can see, more women (588) than men (436) gave a correct answer to this question. BUT this tells us relatively little, because there are ALSO more women (820) than men (754) in the total group of people studied.
If we put the answers from each sex on a per 100, or a percentage basis, then we can compare women and men directly even though there are different total cases for women and men.
The first step in calculating a percent is to isolate your case base of interest. This is particularly important if you really have two or more separate case bases, as I do for women and men in my bivariate table above. I will first look at the 754 people in the MALE CASEBASE.
The second step is to identify your category of interest. In my example of the "father gene" question, there are only two categories, the "WRONG" answer and the "RIGHT" answer. I will begin with the "wrong answer" category.
The third step is to locate the number of cases in my category of interest ONLY for the group I am looking at. In my first example here, I am only looking at men, and men who gave the wrong answer (the category of interest) to the father gene question. That frequency is found in row 1, column 1 of the bivariate table and it is 318 men.
The fourth step is to take the frequency in my category of interest, in this instance ONLY for men, and divide that frequency by the total number of cases in my casebase of interest (all the males = 754). Or:
318/754 = .422
The number .422 is the proportion of men who gave the wrong answer to the father gene question.
As you can see, a proportion is a fraction.
Proportions vary from 0 to 1.00. All the proportions for a particular group studied (in this case, men) will add up to 1.00 within rounding errors.
In my example, this means if .422 of men gave the wrong answer, by subtraction, .578 of men must have given the correct answer because there are only two categories, and I already know the proportion in the "wrong answer"-male category.
1.000 - .422 = .578
The fifth step is to turn the proportion into a percentage. Multiply the proportion by 100 and the result is a percent.
.422 X 100 = 42.2%
Thus, among men, 42.2 percent said that it is false that the father's gene determines the sex of the baby, and 57.8 percent correctly stated "true," that the father's gene determines the sex of the baby.
You must complete this last step, multiplying by 100, to turn your proportion into a percentage.
Here, I'll repeat the process for women. There are a total of 820 women. 232 women said it is false that the father's gene determines the sex of a baby. So:
(232/820) X 100 = 28.3% of women gave an incorrect answer to the father gene question
(588/820) X 100 = 71.7% of women correctly answered this question
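The five steps above can be sketched in Python as follows. The frequencies come from the bivariate table; the dictionary layout and the function name `column_percents` are assumptions made for this illustration.

```python
# Frequencies from the "father gene" table, one column per gender.
table = {
    "Male":   {"wrong": 318, "right": 436},
    "Female": {"wrong": 232, "right": 588},
}

def column_percents(column):
    """Percentage each category within ONE case base (one column)."""
    base = sum(column.values())              # step 1: isolate the case base
    result = {}
    for category, frequency in column.items():   # steps 2 and 3
        proportion = frequency / base            # step 4: a fraction from 0 to 1
        result[category] = round(proportion * 100, 1)  # step 5: per 100
    return result

for gender, column in table.items():
    print(gender, column_percents(column))
```

Note that each gender is percentaged on its own case base (754 men, 820 women), never on the grand total of 1574.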
Here is my original bivariate table, now reworked as a percentage table. We can compare men and women directly on the father gene question using percentages. We could not compare women and men directly when the data were in frequency form because we had a different number of men (754) than women (820).
How Gender Influences Answers to the Question: Does the father's gene determine the sex of the couple's baby?
Answer to Question: | Male | Female |
No, father's gene does not determine baby's sex (WRONG) | 42.2% | 28.3% |
Yes, father's gene does determine baby's sex (RIGHT) | 57.8 | 71.7 |
Total | 100.0% | 100.0% |
Casebase | 754 | 820 |
Source: NSF Surveys of Public Understanding of Science and Technology, 2001, Director, ORC/Macro New York. n = 1574
The bivariate percentage table that I presented immediately above follows several presentation conventions in use with such tables that make them easier to read.
Categories of the independent variable or "cause" usually form the columns of the table and the category names on the independent variable go at the top of each column. Categories of the dependent variable or "effect" form the rows of the table. Apparently it is easier to read up and down the columns than across the rows.
Here "cause" and "effect" are easy to determine. Gender is fixed at birth. Your basic science knowledge will not create a change in your biological sex.
Only percentages are in the cells of the table. DO NOT include both the frequencies AND the percents in the cells. That clutters up the table and makes the table more difficult to read. It is also redundant because given the percentages and the appropriate case bases, the reader can reconstruct the actual frequencies.
As the American Psychological Association Manual on Style puts it: do not include numbers (frequencies) that can be simply calculated from other numbers (percentages). Percentages, because they are standardized, are typically more informative than frequencies (except for the total number of cases).
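To see why frequencies in the cells are redundant, here is a short Python sketch that reconstructs them from the percentages and the case bases in the table above. The variable names are illustrative only.

```python
# Column percentages and case bases as printed in the percentage table.
percents = {
    "Male":   {"wrong": 42.2, "right": 57.8},
    "Female": {"wrong": 28.3, "right": 71.7},
}
casebases = {"Male": 754, "Female": 820}

# frequency = (percent / 100) * case base, rounded to a whole person.
frequencies = {
    gender: {cat: round(pct / 100 * casebases[gender])
             for cat, pct in column.items()}
    for gender, column in percents.items()
}

print(frequencies)
```

With one decimal place and case bases of this size, rounding recovers the original counts; with very large case bases, one decimal place may not be enough to recover them exactly.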
Give a 100 percent total at the bottom of each column (or at the end of each row if you calculated the percentages across the rows). Do NOT put the 100% in any kind of parentheses or brackets.
That way, your reader knows whether s/he should read down the columns or across the rows.
Notice there are only TWO percent marks in each column if the values of the independent variable form the columns:
(1) the percent mark at the top of each column and
(2) the percent mark with the hundred percent at the bottom of each column.
Again, this helps your reader to know whether to read down the columns or across the rows.
DO NOT put a percent mark in every cell of the table. Again, that clutters up the table and makes the table difficult to read. In addition, the reader can easily be confused about whether to read across or down.
I only went out to ONE decimal place in my percents. In general, DO NOT go beyond two decimal places in percentages. The added decimal places do not typically give increased precision and they will again clutter the table. (NOTE: If you have such a large case base that you believe more detail in your percents is necessary, do a rate instead; see rates below.)
My grand total of 1574 cases was placed UNDERNEATH the table itself, with an n = terminology. That way, the total casebase was less likely to get confused with the total for men or the total for women. In this particular example, there were no missing data, so they are not mentioned. However, large amounts of missing data should be mentioned; if the reason for the missing data is known, it should be briefly described.
In conventional statistical terminology:
N is used for a population total while
n is used for a total coming from some sample from the population.
As you can see, the purpose of these conventions is to make your table simple and thus easier to read.
How Gender Influences Answers to the Question: Does the father's gene determine the sex of the couple's baby?
Answer to Question: | Incorrect | Correct | Total | Casebase |
Male | 42.2% | 57.8 | 100.0% | 754 |
Female | 28.3% | 71.7 | 100.0% | 820 |
Source: NSF Surveys of Public Understanding of Science and Technology, 2001, Director, ORC/Macro New York. n = 1574
IS THERE A TELEPHONE IN THE HOUSEHOLD? THE UNIVARIATE FREQUENCY DISTRIBUTION TRANSFORMED.
Here is my original example univariate frequency distribution table from the Current Population Survey, only now it is a univariate PERCENTAGE distribution. Notice that the table is simpler and easier to read with just the percentages present. ONLY use the percentages. The only frequency will be your total at the bottom of the table.
TITLE: Percentage of United States Households with a Telephone in the Household
Is there a telephone in the household? | Percent |
Has telephone | 87.4% |
No telephone | 4.8 |
No Answer | 7.7 |
Total | 100.0% (134,986) |
Source: Current Population Survey, August, 2000
Another conventional option for handling case bases: instead of placing my n's in a separate row under the 100%, I just put the casebase in parentheses ( ) underneath the 100%. This, too, helps to simplify the table.
What IS the total casebase that you use? Is it all cases (missing values or not) or just cases that have a valid value on the variables that you use?
That is a judgment call. One way to handle it is to put the number of missing cases underneath the table off to the left side, and the number of valid cases as the n (if a sample) casebase. Here's the telephone in the household question, where the interior of the table only contains the 124,556 respondents who gave a "yes" or "no" answer:
Is there a telephone in the household? | Percent |
Has telephone | 94.8% |
No telephone | 5.2 |
Total | 100.0% (124,556) |
Number of missing cases = 10,430
Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
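The choice of case base can be sketched in Python: drop the missing categories first, then percentage on the valid cases only. The frequencies are from the telephone table; treating "No Answer" as the only missing category is how the table above was built, and the variable names are my own.

```python
frequencies = {
    "Has telephone": 118_026,
    "No telephone": 6_530,
    "No Answer": 10_430,
}
missing_categories = {"No Answer"}

# Separate valid answers from missing ones.
valid = {c: f for c, f in frequencies.items() if c not in missing_categories}
n_valid = sum(valid.values())
n_missing = sum(f for c, f in frequencies.items() if c in missing_categories)

# Percentages now use the VALID case base, not the total case base.
valid_percents = {c: round(100 * f / n_valid, 1) for c, f in valid.items()}

print(valid_percents)
print(f"n = {n_valid:,}; number of missing cases = {n_missing:,}")
```

Comparing 94.8 percent here with 87.4 percent on the full case base shows how much the choice of base can move the result, which is why both the base and the missing cases should be reported.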
Some events such as births or divorces happen relatively rarely among the population at large. The overwhelming majority of women do NOT have a baby in any given year. In any given year, most people stay married.
In the case of comparatively rare events, we standardize using a cousin of percents called the rate. Governments often report rates because they have enormous case bases and often collect data on relatively rare events.
We know that the base for a percent is per 100. What is the base for a rate?
The base for a rate varies. The usual bases are:
per 1000
per 10,000 and
per 100,000
The more unusual the event (such as winning the state Lottery), the larger the standardizing base.
Here's how to obtain a rate:
1. Obtain the proportion. In the case of a truly rare event, you will have a very, very small fraction.
2. Next multiply the proportion by the standardizing base figure. If your rate is per 1000, you multiply the proportion by 1000 (instead of by 100 as you did for the percent). If your rate is per 100,000, you multiply the proportion by 100,000.
Rates are easier to interpret than either percentages with lots of decimal places or tiny, fractional proportions.
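The two steps for a rate can be sketched in Python. The event count and population below are hypothetical numbers invented for this illustration, and the function name `rate` is my own.

```python
def rate(events, population, base=1000):
    """Step 1: take the proportion. Step 2: multiply by the standardizing base."""
    proportion = events / population
    return proportion * base

# Hypothetical figures: 1,250 events in a population of 500,000.
events, population = 1_250, 500_000

print(f"proportion  = {events / population}")
print(f"percent     = {rate(events, population, base=100):.2f}")
print(f"per 1000    = {rate(events, population, base=1000):.1f}")
print(f"per 100,000 = {rate(events, population, base=100_000):.0f}")
```

The same underlying proportion is easier to read as "2.5 per 1000" than as a tiny fraction or a percent with several decimal places.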
Be very careful to use the correct population base to calculate any of these figures.
EXAMPLE: Crude divorce rates are the number of divorces per 1000 people, and this includes 5 year olds and, worse yet, unmarried people who are at no risk for divorce!
Crude rates have limited utility. It is the choice of a meaningless base (for example, not correcting for age distributions in crime rates) that can allow the ignorant (charitable) or the unscrupulous to "lie" with statistics.
NOTE: This section can give novices a lot of trouble. Please read carefully.
Cumulative percents aggregate the percentages going up or down the values of the variable. Cumulative percents allow us to make statements such as "at least" or "at most".
IMPORTANT: The categories of your variable must be ordinal, interval or ratio in order to take cumulative percents. You implicitly make "more than" or "less than" statements when you calculate a cumulative percent (e.g., 50 percent of the population has more formal education than a high school degree).
In a cumulative percent, you add the percents sequentially.
You begin with the very lowest value category (or the very highest category), then add the percent from the next lowest category to form a subtotal percent.
Proceeding sequentially, you add the percent from the next category to the first subtotal that you just created to make a new, larger second subtotal.
The process ends when you reach 100 percent at the very highest category (or at the very lowest one, if you started at the top). The process sounds much more complicated than it really is.
EXAMPLE: Here is a new table to serve as an example from the Current Population Survey (August 2000) on Computer and Internet Use in United States households.
TITLE: Number of computers in the household
How many computers or laptops are there in this household? | Percent of Total Cases | Cumulative - down ("Less than" statements, "At most" statements) | Cumulative - up ("At least" statements, "Or more" statements) |
No computer | 47.3% | 47.3 | 52.7 + 47.3 = 100.0 |
1 | 37.6 | 47.3 + 37.6 = 84.9 | 15.1 + 37.6 = 52.7 |
2 | 10.4 | 84.9 + 10.4 = 95.3 | 4.7 + 10.4 = 15.1 |
3 or more* | 4.7 | 95.3 + 4.7 = 100.0 | 4.7 |
Total | 100.0% (134,986) | | |
Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
*oops, here is the dreaded "open-ended" category, 3 or more. Since we are using percents, and not calculating an arithmetic average or other numerical entity, we will just leave the category as it is in this example.
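The two cumulative columns in the table above can be reproduced with a running total. This is a minimal Python sketch using the standard library's `itertools.accumulate`; the list names are assumptions for illustration, and each subtotal is rounded to one decimal place to keep floating-point noise out of the display.

```python
from itertools import accumulate

categories = ["No computer", "1", "2", "3 or more"]   # ordered low to high
percents = [47.3, 37.6, 10.4, 4.7]

# Cumulative DOWN: start at the lowest category and add each next percent.
cum_down = [round(x, 1) for x in accumulate(percents)]

# Cumulative UP: start at the highest category, add toward the lowest,
# then flip the result so it lines up with the category order again.
cum_up = [round(x, 1) for x in accumulate(reversed(percents))][::-1]

for cat, pct, down, up in zip(categories, percents, cum_down, cum_up):
    print(f"{cat:<12}{pct:>6}{down:>8}{up:>8}")
```

Reading across a row, `cum_down` answers "at most this many computers" and `cum_up` answers "at least this many computers".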
Now, let's use the data in the table above and the cumulative percents to make the following kinds of statements (and I PROMISE statements of this type will be on Exam 1):
What percent of United States households contain at least one computer?
ANSWER: 52.7 percent of United States households contain at least one computer.
That's the percent with 1, 2, or 3 or more computers.
37.6% of households have one computer.
10.4% of households have two computers.
4.7% of households have three or more computers.
Cumulate UP from the bottom (3 or more + 2 + 1).
4.7% + 10.4% + 37.6% = 52.7%.
What percent of United States households have at least two computers (that means two or more)?
ANSWER: 15.1 percent of United States households contain two or more computers.
That's the percent with 2, or 3 or more, computers.
10.4% of households have two computers.
4.7% of households have three or more computers.
Start at three or more computers and cumulate UP to and including two computers.
4.7% + 10.4% = 15.1% of households have two or more computers.
What percent of United States households own fewer than two computers?
ANSWER: 84.9 percent of United States households own fewer than two computers.
That's the percent with 0 + 1 computer.
Start at zero computers and cumulate DOWN to and including one computer.
47.3% + 37.6% = 84.9%.
What percent of United States households own at most one computer?
ANSWER: 84.9 percent of United States households own at most one computer.
That's the same thing as fewer than two.
At most one means 0 + 1 computer.
See the prior example.
Many researchers like to assess how much change occurs in a phenomenon that they study. One easy to calculate measure is the percentage change over time.
To calculate the percent change over time, start with the frequency at the later or more recent time; call this more recent time "time 2".
We call this frequency $f_{t2}$.
You will also need the frequency at the original or earlier time, or "time 1".
We call this earlier frequency $f_{t1}$.
To calculate the percent change over time, here's the formula:
$[(f_{t2} - f_{t1}) / f_{t1}] \times 100$
1) Later frequency minus earlier frequency
2) Divide step one by the EARLIER frequency
3) Multiply the step two result by 100
REMEMBER!! Divide by the EARLIER frequency!
Let's apply this to the population growth in the Miami-Fort Lauderdale-Miami Beach, Florida metropolitan area from 1990 to 2002. My figures are in millions of people.
1990 population in millions = 4.056
2002 population in millions = 5.232
Population change in Miami-Fort Lauderdale-Miami Beach area 1990-2002 =
[(5.232-4.056)/4.056] X 100 = 29.0 percent
That's a lot of growth for only 12 years.
Notice that for the same absolute increase, the smaller the base at time 1, the more growth there appears to be. This is why states with small populations like Nevada seem to have such high growth rates compared with more populous states like California.
A calculator typically does this one in two minutes. All you have to make sure is that you feed in the correct figures in the correct order. Remember to divide by the earlier time frequency.
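If you would rather not use a calculator, the three steps can be sketched in Python. The population figures are the Miami-area ones given above; the function name `percent_change` and its argument names are my own.

```python
def percent_change(earlier, later):
    """Percent change over time, always dividing by the EARLIER frequency."""
    difference = later - earlier          # step 1: later minus earlier
    relative = difference / earlier       # step 2: divide by the earlier value
    return relative * 100                 # step 3: multiply by 100

# Miami-Fort Lauderdale-Miami Beach population, in millions.
pop_1990 = 4.056
pop_2002 = 5.232

print(f"{percent_change(pop_1990, pop_2002):.1f} percent")
```

Swapping the two arguments gives a different (and wrong) answer, because the base would then be the later population.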
With ratios, we compare the frequency in one category of a variable with the frequency in a second category of the same variable.
For example, we can look at the ratio of males to females, but a ratio of males to first year graduate students "mixes apples and oranges" and makes no sense.
(You might be thinking of the percent of first year graduate students who are men--but that is a totally different kind of measure and it isn't a ratio.)
Here's how to calculate ratios: divide the frequency in the first category by the frequency in the second category. We then multiply by the appropriate standardizing base. Suppose we wanted the number of males per 100 females. In our sample, we have 30 women and 20 men:
Step One: divide the number of males by the number of females (because we are looking at per 100 females)
20/30 = .667
Step Two: Multiply by the appropriate standardizing base. Since we are looking at the number of males per 100 females, we will multiply by 100
.667 X 100 = 66.7 males per 100 females
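The two steps can be sketched as a small Python function. The name `ratio_per_base` and its parameters are assumptions for this illustration; the 20 men and 30 women are the sample from the text.

```python
def ratio_per_base(first, second, base=100):
    """Number in the first category per `base` cases in the second category."""
    return first / second * base

men, women = 20, 30

# Males per 100 females: men go on top because women are the "per 100" group.
print(f"{ratio_per_base(men, women):.1f} males per 100 females")
```

Reversing the arguments gives females per 100 males instead, so the order matters.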
My example looks at the ratio of males per 100 females by age groups in the United States in 2002. The numbers are in MILLIONS.
AGE GROUPS 2002
Number of | 15-19 years | 20-24 years | 25-29 years | 30-34 years | 35-39 years | 40-44 years | 45-49 years |
Men | 10.471 | 10.350 | 9.640 | 10.563 | 10.954 | 11.413 | 10.492 |
Women | 9.905 | 9.863 | 9.332 | 10.394 | 10.961 | 11.589 | 10.810 |
Ratio Men:Women | 105.7 | 104.9 | 103.3 | 101.6 | 99.9 | 98.5 | 97.1 |
Source: U.S. Bureau of the Census: Resident Population by Age and Sex, 2003
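The ratio row in the table above can be recomputed from the Men and Women rows. Here is a short Python sketch; the list names are my own, and the populations are the table's figures in millions.

```python
age_groups = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
men   = [10.471, 10.350, 9.640, 10.563, 10.954, 11.413, 10.492]   # millions
women = [9.905, 9.863, 9.332, 10.394, 10.961, 11.589, 10.810]     # millions

# Men per 100 women in each age group, rounded to one decimal place.
ratios = [round(m / w * 100, 1) for m, w in zip(men, women)]

for group, r in zip(age_groups, ratios):
    print(f"{group} years: {r} men per 100 women")
```

Because both rows are in the same unit (millions), the unit cancels and the ratio needs no further conversion.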
All the measures in this Guide can be used on any kind of data, with the exception of the cumulative percent, where the data must be at least at the ordinal level of measurement. All the measures in this section also allow you to compare two or more groups that have different case bases. All of these measures are used A LOT in conference papers, journals, textbooks, and mass media reports.
Susan Carol Losh, March 22, 2010