DR SUSAN CAROL LOSH
KEY TO: Agresti and Finlay, Preface (entire); Chapter 1 (1-9), Chapter 2 (12-17 ONLY)
KEY TO: Huff, Introduction (pp. 7-9) and Chapter 1, pp. 1-26
This is the first of several guides that I will publish on the Internet for a basic course on Introductory Data Analysis. Be sure to return because I will develop new course sites over the next few months, making it easy for you to cross reference topical guides. All guides will be linked to the Overview and Readings WEB sites.
Statistics is a language. (Just as math is a language.) If you keep this in mind, you will experience far less anxiety and life will be much, much easier.
Like English, statistics has a vocabulary, often expressed in "unknown placeholders" (such as "X") and Greek letters (such as Σ or μ), and equations that serve as the grammar to tie the vocabulary together.
You could say (very quickly):
"Take the score on a characteristic (such as years of formal education), add all the education scores from each case, then divide by the number of cases you have. (Whew!) And you have the arithmetic mean of years of formal education for this particular population of cases."
Or, you could simply write instead:
(Σ X) / N = μ
which is much more compact.
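The verbal recipe above (add every score, then divide by the number of cases) can also be written as a few lines of code. This is only a minimal sketch: the function name and the five education scores are made up for illustration.

```python
def arithmetic_mean(scores):
    """Add all the scores, then divide by the number of cases (N)."""
    if not scores:
        raise ValueError("need at least one case")
    return sum(scores) / len(scores)


# Hypothetical years of formal education for five cases
education_years = [12, 16, 12, 14, 16]
mean_education = arithmetic_mean(education_years)
print(mean_education)  # 14.0
```

The point is the same as with the equation: once you know the "grammar," the compact version says exactly what the long sentence says.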
Statistics is a highly cumulative topic. Later, more complex information builds upon earlier, simpler information.
This means that trying to cram statistics typically doesn't work. It also means that students who get behind in a statistics course can have a hard time catching up.
You should do quite well if you keep up with the material and complete exercises.
Statistics is about logic and reasoning, and only secondarily about arithmetic. A computer can do the arithmetic for you.
The days of the solely armchair theorists are pretty much over. Theories and conjectures must pass empirical muster: if data constantly disconfirm your hypotheses, it's probably time to revise the hypotheses.
What? Isn't this course STATISTICS? Why am I spending time on this "diversionary topic" called "variables"?
The answer is simple: if you do not understand what a variable is, and what different kinds of variables are, you will not be able to choose the very best statistical tools to describe your data and you will run a very high risk of making elementary mistakes in analyzing your data and interpreting your results (and other people's results).
A variable is a characteristic or factor that has values that vary, for example, levels of education, intelligence, or physical endurance.
A variable has at least two different categories or values.
If all cases have the same score or value, we call that characteristic a constant, not a variable.
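The variable-versus-constant distinction is easy to check mechanically. The sketch below assumes the scores for one characteristic are held in a simple list; the function name and the example data are hypothetical.

```python
def is_variable(scores):
    """A characteristic is a variable only if it takes at least two different values."""
    return len(set(scores)) >= 2


ages = [19, 22, 22, 35]           # values differ across cases: a variable
enrolled = ["yes", "yes", "yes"]  # every case has the same value: a constant

print(is_variable(ages))      # True
print(is_variable(enrolled))  # False
```

Notice that the same characteristic can be a variable in one collection of cases and a constant in another. It depends on the cases you actually have.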
Variables consist of sets or systems of categories with several properties.
Examples of category systems include:
GENDER: Categories = Male and Female
PRIMARY/SECONDARY GRADES: Categories = Kindergarten, first, second, third...and so forth to grade twelve
AGE IN YEARS: Categories = 1, 2, 3, 4, 5, and so forth up to 90 years of age--or even higher.
Each observation that you make on a particular characteristic has a specific value associated with it.
Years of Age
Highest year of formal education completed
We often call the values that a variable takes on the categories of a variable.
Individual observed scores are then fit into the appropriate category.
You may even have categories that do not have cases in them for a particular sample or collection of cases. For example, although it is certainly possible to have a score of 96 on years of age, in your particular project perhaps no one individual that you studied was that old.
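A frequency table makes the idea of an empty category concrete. The sketch below is one possible way to do it, assuming the category system is listed up front; the function name and the observations are invented for illustration.

```python
from collections import Counter


def frequency_table(scores, categories):
    """Count the cases in every category, keeping categories that have no cases."""
    counts = Counter(scores)
    return {category: counts.get(category, 0) for category in categories}


# The category system is defined first; the observations are fit into it afterwards.
grade_categories = ["Kindergarten", "First", "Second", "Third"]
observed = ["First", "Third", "First", "Kindergarten"]

table = frequency_table(observed, grade_categories)
print(table)  # "Second" stays in the table with a count of 0
```

The category "Second" is still part of the variable even though no case in this particular sample falls into it.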
CONCEPTUAL VARIABLES are what you think the entity really is or what it means. Conceptual variables are about abstract constructs. YOU DO NOT DISCUSS MEASUREMENT AT THIS STAGE. Instead you discuss what the construct means. Examples include "achievement motivation" or "endurance" or "group cohesion". You are describing a concept.
On the other hand, OPERATIONAL VARIABLES (sometimes called "operational definitions") are how you actually measured this entity, or the concrete operations, measures or procedures that you used to measure the variable.
A conceptual definition is broader. A particular concept or construct can be operationalized in several different ways. For example, disengagement among students or team members can be measured through absence records, rates of volunteerism, expressions of enthusiasm, and so on. "Culture" could refer to films, paintings, posters, or other media.
However, we apply our statistics to operational variables, that is, to your actual measurements.
|CONCEPTUAL VARIABLE||OPERATIONAL VARIABLE|
|Letter recognition||Scores on a particular test of letter recognition|
|Anaerobic exercise||Maximum number of pounds one can weight-lift|
|Gender||Biological sex: male or female|
|Collective Efficacy||Number of online study groups formed in a distance course|
You usually begin your research problem with CONCEPTUAL VARIABLES and the relationships among them. One of the few exceptions is when your actual purpose is to study a particular operational variable. For example, perhaps you want to study the validity of the FCAT test, the achievement assessment test that kindergarten through twelfth grade students in Florida must take each year.
At a minimum, category systems should be exhaustive (cover all cases). Each case must be able to fit into a category. Sometimes that means we must construct an all-inclusive "other" category.
Categories of a variable should also be mutually exclusive (each case fits into one and only ONE category).
These two features above are the basics.
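Both basics can be checked against your data. The sketch below is a rough illustration, not a standard routine: it assumes each category is described by the set of raw scores it accepts, and the age bands (including the upper limit of 120) are made up. It counts how many categories each case fits; zero means the system is not exhaustive, and two or more means the categories are not mutually exclusive.

```python
def check_category_system(cases, categories):
    """Return (exhaustive, mutually_exclusive) for these cases.

    categories maps a label to the collection of raw scores that belong in it.
    """
    matches = [
        sum(1 for members in categories.values() if case in members)
        for case in cases
    ]
    exhaustive = all(count >= 1 for count in matches)
    mutually_exclusive = all(count <= 1 for count in matches)
    return exhaustive, mutually_exclusive


ages = [5, 17, 18, 40, 96]

# Age 18 fits two categories, and age 96 fits none.
overlapping = {"0-18": range(0, 19), "18-64": range(18, 65)}

# Hypothetical repair: non-overlapping bands that cover every observed age.
repaired = {"0-17": range(0, 18), "18-64": range(18, 65), "65-120": range(65, 121)}

print(check_category_system(ages, overlapping))  # (False, False)
print(check_category_system(ages, repaired))     # (True, True)
```

A check like this only tells you about the cases you actually observed. A well-designed category system should also handle cases you have not seen yet.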
Other desirable category properties--WHEN IT IS POSSIBLE (and it ISN'T always possible)--include:
a good spread of cases over categories (no category with too large or too small a percentage of cases). Possibilities, IF the data allow, include a normal ("bell-shaped" or Gaussian) distribution or an equiprobable distribution in which each category has the same number of cases;
a limited number of categories; and
equal intervals between categories IF POSSIBLE (applies only IF the category values are numeric).
TIP: If you are collecting your own data, try to gather data as completely as possible (for example, get education in number of years rather than degree level--or get BOTH measures if you can) because you can collapse or move around categories later. If you really meant degree level, then ask about degree level explicitly rather than years of education or "how much" education.
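To see why detailed data are worth gathering, here is a small sketch of collapsing years of education after the fact. The function name, the cut-points, and the labels are all hypothetical choices for illustration. The same raw scores can be collapsed in more than one way, but collapsed categories can never be expanded back into years.

```python
from bisect import bisect_right


def collapse(score, cut_points, labels):
    """Place a numeric score into a band; cut_points hold each band's lower bound."""
    return labels[bisect_right(cut_points, score)]


years_of_education = [8, 12, 14, 16, 18]

# One possible collapse into four bands (illustrative cut-points only)
four_bands = [
    collapse(y, [12, 13, 16], ["under 12", "12", "13-15", "16 or more"])
    for y in years_of_education
]

# A different collapse of the SAME raw data into two bands
two_bands = [
    collapse(y, [13], ["12 or fewer", "13 or more"])
    for y in years_of_education
]

print(four_bands)
print(two_bands)
```

Keep in mind that bands of years are not the same thing as degree level, which is why the tip suggests asking about degrees directly if degrees are what you mean.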
TIP: Avoid "open-ended" categories that do not have fixed end points when this is possible (e.g., "graduate degree or more"--or "$75,000 or more"). Keep in mind when you gather data that it may not be possible to use a final closed category with income.
We will spend considerable time in the coming weeks examining causal issues because the causal order of your variables can help determine the statistics you choose for analysis. For right now, you need to know about independent, intervening or mediator, and dependent variables.
Causes are called INDEPENDENT VARIABLES.
If one variable truly causes a second, the cause is the independent variable. Speaking more statistically, variation in the independent variables comes from sources outside our causal system or is "explained" by these sources.
Independent variables are often also called explanatory variables.
Agresti and Finlay use the term "explanatory variable."
Effects are called DEPENDENT VARIABLES.
Statistically speaking, we "explain" the variation in our dependent variable.
Dependent variables are also sometimes called outcome variables.
Agresti and Finlay use the term "response variables" (that term is unusual).
The typical research problem will describe the causal relationships between independent and dependent variables and explain how these relationships come to be.
I define an intervening variable as one that links in between the independent and the dependent variable. Thus, an intervening variable is part of a causal chain:
INDEPENDENT VARIABLE -------> MEDIATOR VARIABLE ------> DEPENDENT VARIABLE
Increasingly in structural equation models, intervening variables are also called mediator or mediating variables (because they "mediate" between an independent and a dependent variable).
EXAMPLE: educational level is a cause of science attitudes because educational level influences the type of occupation someone has, and it is the occupational type that affects science attitudes.
Occupational type in this example is the mediator variable.
Intervening or mediator variables inform us about causal sequences or chains, thus explaining the causal process of a phenomenon. To diagram a second example:
educational level -----> occupational type -----> income level
While I would love to say that employers will pay you just because you have a degree, in fact, it is the job you obtain (often thanks to the degree) that pays the salary.
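If it helps to see the roles spelled out, a causal chain can be written as an ordered list and each variable labeled by its position. This is just a bookkeeping sketch with a hypothetical function name. It says nothing about whether the causal claims are true, which is a matter for theory and data.

```python
def label_roles(chain):
    """Label each variable in an ordered causal chain by its position."""
    if len(chain) < 2:
        raise ValueError("a causal chain needs at least a cause and an effect")
    roles = {}
    last = len(chain) - 1
    for position, variable in enumerate(chain):
        if position == 0:
            roles[variable] = "independent"
        elif position == last:
            roles[variable] = "dependent"
        else:
            roles[variable] = "intervening (mediator)"
    return roles


chain = ["educational level", "occupational type", "income level"]
roles = label_roles(chain)
print(roles["occupational type"])  # intervening (mediator)
```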
Intervening variables are critical to use in nonexperimental research designs, sometimes called observational or naturalistic designs (in contrast to experiments). Other times, intervening variables can specify exactly what it is about the dependent variable that is important.
Variables can be described as discrete or continuous.
CONTINUOUS VARIABLES can take on any of an infinity of values. If you use very fine measurements, for example "mils," it is possible to describe income in U.S. dollars in almost infinitely fine variations. Similarly, age can even be divided into "nanoseconds." Obviously this only makes sense if the data take on NUMERIC values.
DISCRETE VARIABLES take on only a limited number of values, and cannot be infinitely subdivided into finer and finer measures in the way that continuous variables can. Very often, discrete variables take on integer values, such as "1" "2" or "90".
Nominal, ordinal, and interval-ratio variables are very different types of category systems. These form a cumulative and hierarchical set of data properties, so that nominal properties are true for ordinal and interval data, and ordinal properties are also true for interval data. The reverse does NOT hold. Interval and ratio data are numeric, so arithmetic operations can be used. Nominal and ordinal data are categorical, not numeric. It is nonsense to use arithmetic operations on nominal or ordinal data.
With nominal variables, you can tell whether two cases or instances fall into the same category or into different categories. Thus, you can sort all cases into mutually exclusive, exhaustive categories. That's it!
Examples of nominal variables include:
Your Zodiac sign
Religious affiliation (or denomination)
None of these variables have categories that can be ordered into more or less, or higher or lower.
Nominal variables are also sometimes called categorical variables or qualitative variables. The categories are not only not numbers, they do not have any inherent order.
Try these examples:
Who is more? Koreans or Turks? More WHAT? Country of origin is NOT a number.
Who is "better"? Women or Men? Better at WHAT? If you suspect that ranking the categories (NOTE: NOT the cases within the categories) would start a war, you probably have nominal variables.
STATS & PRESENTATION
With ordinal variables, the categories themselves can be rank-ordered from highest to lowest.
This means the scores must FIRST be rank-ordered from highest to lowest (or vice versa) before you can use any ordinal measures. Like runners in a race, we can rank scores--and especially the categories themselves--from first to last, most to least, or highest to lowest.
In rank-ordered cases, we can literally rank order the finishers in a race or the students by their grade point average (first in class, second in class, and so on down to last in class). Notice that the intervals between cases probably are not the same (or equal). The class valedictorian may have a straight-A or 4.0 average, the salutatorian a 3.6, the third student a 3.5, and so on.
We can also rank-order the categories of a variable in ordinal data. One example is a Likert, or rank-ordered, scale. Respondents are given a statement, such as "I like President Bush," and then asked if they:
Strongly Agree, Agree, Disagree, or Strongly Disagree with that statement.
We can surmise that someone who "strongly agrees" supports that statement more intensely than someone who "agrees"--but we don't know how much more intensely.
Other types of ordinal data include:
the order of finish (e.g., class rank or a horse race)
"yes-no" experiences (someone who answers "yes" to "Do you play the lottery?" clearly plays more than someone who answers "no"), or
collapses of numeric data into categories with unequal widths or intervals (e.g., collapsing years of education into degree level).
STATS & PRESENTATION ADVICE: Everything that you can do with nominal data (graphs, modes, etc.) you can do with ordinal data too. In addition, with ordinal data, you can do percentiles, quartiles, and medians (the category that includes the 50th percentile).
Most statistical processing computer programs, such as SPSS, SAS or the online SDA program, assign numbers to all categories, even to non-numeric nominal and ordinal variables. This is for data processing ease and does not give you any clues as to the type of data you have. YOU MUST MAKE THE DECISION ABOUT WHETHER YOUR VARIABLES ARE NOMINAL, ORDINAL, OR INTERVAL-RATIO.
You can count the number of books and you can't have less than zero.
In addition to the properties of nominal and ordinal category systems, interval and ratio variables possess a common and equal unit that separates adjacent or adjoining categories.
EXAMPLES: one year of age or one year of education or one dollar of income. Each of these examples is one equal unit.
These intervals are equal no matter how high up or low down the scale you go.
Most "count variables" (years of age or formal education, children, dollars) are ratio variables.
STATS & PRESENTATION
It is nonsense to perform arithmetic operations on clearly nominal data.
For example, suppose you have a group of three men and three women. Can you calculate a mean biological sex score? What could it possibly be? It can't be a number because gender category value is a name or tag ("male" "female") that cannot be added or multiplied. Would it be a transgendered person or a hermaphrodite?
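The three-men-and-three-women example can be tried out directly. The sketch below assumes the arbitrary codes 1 and 2 (hypothetical; a statistics program could just as easily use 0 and 1). The computer will happily average the codes, but the answer does not correspond to any category.

```python
from collections import Counter

# Arbitrary numeric codes of the kind a statistics program might assign
SEX_CODES = {1: "male", 2: "female"}

coded_cases = [1, 1, 1, 2, 2, 2]  # three men and three women

# Arithmetic "works" on the codes, but the result is not a category at all.
meaningless_mean = sum(coded_cases) / len(coded_cases)
print(meaningless_mean)               # 1.5
print(meaningless_mean in SEX_CODES)  # False

# What you CAN do with nominal data: count the cases in each category.
counts = Counter(SEX_CODES[code] for code in coded_cases)
print(counts["male"], counts["female"])  # 3 3
```

Change the codes to 0 and 1 and the "mean" changes to 0.5, which is another sign that the number describes the coding scheme rather than the people.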
Lots of people make the elementary mistake of using numeric operations
(such as addition or division) on categorical data. Be careful to use data
|TYPE OF VARIABLE||CASES CAN BE SEPARATED INTO CATEGORIES||CATEGORIES EXHAUSTIVE||CATEGORIES MUTUALLY EXCLUSIVE||CATEGORIES CAN BE RANK-ORDERED||CATEGORIES ARE SEPARATED BY EQUAL INTERVAL||FIXED OR NON ARBITRARY ZERO|
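The column headings above can be read as a cumulative checklist. The sketch below is my own reconstruction from the descriptions in this guide, assuming the four conventional levels (nominal, ordinal, interval, ratio). The names and the shortened property labels are hypothetical, and each higher level simply keeps every property of the levels below it and adds one more.

```python
# Properties in the order the column headings list them (labels shortened)
PROPERTIES = [
    "cases can be separated into categories",
    "categories exhaustive",
    "categories mutually exclusive",
    "categories can be rank-ordered",
    "categories separated by equal intervals",
    "fixed or non-arbitrary zero",
]

# How many of the properties (counting from the top) each level possesses
LEVELS = {"nominal": 3, "ordinal": 4, "interval": 5, "ratio": 6}


def properties_of(level):
    """Return the properties that hold for a given level of measurement."""
    return PROPERTIES[:LEVELS[level]]


def has_property(level, prop):
    """True if the given level of measurement has the named property."""
    return prop in properties_of(level)


print(has_property("ordinal", "categories can be rank-ordered"))  # True
print(has_property("nominal", "categories can be rank-ordered"))  # False
```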
Susan Carol Losh March 20