INTRODUCTORY STATISTICS AND DATA ANALYSIS 2010 DR SUSAN CAROL LOSH

 GUIDE 1: INTRODUCTION

 KEY TO: Agresti and Finlay, Preface (entire); Chapter 1 (1-9), Chapter 2 (12-17 ONLY)

 KEY TO: Huff, Introduction (pp. 7-9) and Chapter 1, pp. 1-26

This is the first of several guides that I will publish on the Internet for a basic course on Introductory Data Analysis. Be sure to return because I will develop new course sites over the next few months, making it easy for you to cross reference topical guides. All guides will be linked to the Overview and Readings WEB sites.

 THE KEYS TO LEARNING STATISTICS AND DATA ANALYSIS

Statistics is a language. (Just as math is a language.) If you keep this in mind, you will experience far less anxiety and life will be much, much easier.

Like English, statistics has a vocabulary, often expressed in "unknown placeholders" (such as "X") and Greek letters (such as $\Sigma$ or $\mu$), and equations that serve as the grammar to tie the vocabulary together.

You could say (very quickly):

"Take the score on a characteristic (such as years of formal education), add all the education scores from each case, then divide by the number of cases you have. (Whew!) And you have the arithmetic mean of years of formal education for this particular population of cases."

Or, you could simply write instead:

$(\sum X) / N = \mu$

which is much more compact.
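The same calculation can be handed to a computer. What follows is a minimal sketch in Python, assuming a small made-up list of education scores; the function name `arithmetic_mean` and the data values are illustrative and are not taken from the course readings.

```python
def arithmetic_mean(scores):
    """Return (sum of X) / N for a non-empty list of numeric scores."""
    if not scores:
        raise ValueError("need at least one case")
    return sum(scores) / len(scores)


# Hypothetical years of formal education for five cases.
education_years = [12, 16, 12, 14, 16]

mean_education = arithmetic_mean(education_years)
print(mean_education)  # 14.0
```

Notice that the code does exactly what the sentence in quotation marks describes: add all the scores, then divide by the number of cases.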

Statistics is a highly cumulative topic. Later, more complex information builds upon earlier, simpler information.

This means that trying to cram statistics typically doesn't work. It also means that students who get behind in a statistics course can have a hard time catching up.

You should do quite well if you keep up with the material and complete exercises.

Statistics is about logic and reasoning, and only secondarily about arithmetic. Calculators can do the arithmetic for you.

 AND...WHY SHOULD YOU CARE?

The days of the solely armchair theorist are pretty much over. Theories and conjectures must pass empirical muster. That is, if data constantly disconfirm your hypotheses, it's probably time to revise the hypotheses.

 VARIABLES: CONCEPTUAL AND OPERATIONAL

What? Isn't this course STATISTICS? Why am I spending time on this "diversionary topic" called "variables"?

The answer is simple: if you do not understand what a variable is, and what different kinds of variables are, you will not be able to choose the very best statistical tools to describe your data and you will run a very high risk of making elementary mistakes in analyzing your data and interpreting your results (and other people's results).

So: patience!

A variable is a characteristic or factor that has values that vary, for example, levels of education, intelligence, or physical endurance.

A variable has at least two different categories or values.

If all cases have the same score or value, we call that characteristic a constant, not a variable.
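The difference between a variable and a constant can be checked directly against a set of cases. This is only a sketch: the helper name `is_variable` and the two small samples are hypothetical.

```python
def is_variable(values):
    """Return True when the observed cases take at least two distinct values."""
    return len(set(values)) >= 2


# Hypothetical samples: everyone in the first sample finished 12th grade.
all_same = ["12th grade", "12th grade", "12th grade"]
mixed = ["12th grade", "10th grade", "12th grade"]

print(is_variable(all_same))  # False: a constant in this sample
print(is_variable(mixed))     # True: a variable
```

Keep in mind that the same characteristic can be a variable in one collection of cases and a constant in another.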

Variables consist of  sets or systems of categories with several properties.
Examples of category systems include:

GENDER: Categories = Male and Female

PRIMARY/SECONDARY GRADES: Categories = Kindergarten, first, second, third...and so forth to grade twelve

AGE IN YEARS: Categories = 1, 2, 3, 4, 5, and so forth up to 90 years of age--or even higher.

Each observation that you make on a particular characteristic has a specific value associated with it.

For example:

| VARIABLE | A PARTICULAR VALUE |
|---|---|
| Religion | Southern Baptist |
| Years of Age | 32 years |
| Highest year of formal education completed | 12th grade |

We often call the values that a variable takes on the categories of a variable.

Individual observed scores are then fit into the appropriate category.

You may even have categories that do not have cases in them for a particular sample or collection of cases. For example, although it is certainly possible to have a score of 96 on years of age, in your particular project perhaps no one individual that you studied was that old.

CONCEPTUAL VARIABLES are what you think the entity really is or what it means. Conceptual variables are about abstract constructs. YOU DO NOT DISCUSS MEASUREMENT AT THIS STAGE. Instead you discuss what the construct means.  Examples include "achievement motivation" or "endurance" or "group cohesion". You are describing a concept.

On the other hand, OPERATIONAL VARIABLES (sometimes called "operational definitions") are how you actually measured this entity, or the concrete operations, measures or procedures that you used to measure the variable.

A conceptual definition is broader. A particular concept or construct can be operationalized in several different ways. For example, disengagement among students or team members can be measured through absence records, rates of volunteerism, expressions of enthusiasm, and so on. "Culture" could refer to films, paintings, posters, or other media.

However, we apply our statistics to operational variables, that is, to your actual measurements.

 EXAMPLES

| CONCEPTUAL VARIABLE | OPERATIONAL VARIABLE |
|---|---|
| Letter recognition | Scores on a particular test of letter recognition |
| Anaerobic exercise | Maximum number of pounds one can weight-lift |
| Gender | Biological sex: male or female |
| Collective Efficacy | Number of online study groups formed in a distance course |

You usually begin your research problem with CONCEPTUAL VARIABLES and the relationships among them. One of the few exceptions is when your actual purpose is to study a particular operational variable. For example, perhaps you want to study the validity of the FCAT, the achievement assessment test that kindergarten through twelfth grade students in Florida must take each year.

 WHAT WE WANT IN OPERATIONAL CATEGORY SYSTEMS

At a minimum, category systems should be exhaustive (cover all cases). Each case must be able to fit into a category. Sometimes that means we must construct an all-inclusive "other" category.

Categories of a variable should also be mutually exclusive (each case fits into one and only ONE category).

 These two features above are the basics.
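Both basics can be tested against your actual cases before you analyze anything. The sketch below is one possible way to do it in Python; the function `check_category_system`, the age categories, and the ages themselves are all assumptions made up for illustration.

```python
def check_category_system(cases, categories):
    """Return (exhaustive, mutually_exclusive) for the given cases.

    categories maps a category label to a function that returns True
    when a case belongs in that category.
    """
    exhaustive = True
    mutually_exclusive = True
    for case in cases:
        matches = [label for label, belongs in categories.items() if belongs(case)]
        if len(matches) == 0:
            exhaustive = False          # this case fits nowhere
        if len(matches) > 1:
            mutually_exclusive = False  # this case fits in two or more places
    return exhaustive, mutually_exclusive


# Hypothetical ages in years.
ages = [5, 17, 18, 40, 65, 90]

# Age 18 fits two categories, and ages 65 and 90 fit none.
flawed = {
    "0-18": lambda age: 0 <= age <= 18,
    "18-64": lambda age: 18 <= age <= 64,
}

repaired = {
    "0-17": lambda age: 0 <= age <= 17,
    "18-64": lambda age: 18 <= age <= 64,
    "65-120": lambda age: 65 <= age <= 120,
}

print(check_category_system(ages, flawed))    # (False, False)
print(check_category_system(ages, repaired))  # (True, True)
```

The check is only as good as the cases you feed it: a category system that is exhaustive for one sample can still miss a case that turns up in the next.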

Other desirable category properties--WHEN IT IS POSSIBLE (and it ISN'T always possible)--include:

• a good spread of cases over categories (no category with too large or too small a percentage of cases). Possibilities, IF the data allow, include a normal ("bell-shaped" or Gaussian) distribution or an equiprobable distribution in which each category has the same number of cases;

• a limited number of categories; and

• equal intervals between categories IF POSSIBLE (applies only IF the category values are numeric).

 Examine the scores and the category system on each variable carefully to see if you can fulfill any of the above three properties. For example, if your variable does not take on numeric scores (e.g., biological sex: male or female), you will not be able to have equal intervals between adjacent categories. Equal intervals require numeric categories.

TIP: If you are collecting your own data, try to gather data as completely as possible (for example, get education in number of years rather than degree level--or get BOTH measures if you can) because you can collapse or move around categories later. If you really meant degree level, then ask about degree level explicitly rather than years of education or  "how much" education.
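To see why the finer measure is the safer one to collect, here is a hedged sketch of collapsing years of education into degree levels. The cut points inside `degree_level` are illustrative assumptions, not an official coding scheme, and the data are invented.

```python
def degree_level(years):
    """Collapse years of formal education into a coarser degree-level category.

    The cut points are illustrative assumptions only.
    """
    if years < 12:
        return "less than high school"
    if years < 16:
        return "high school"
    if years < 18:
        return "bachelor's"
    return "graduate"


# Hypothetical years of formal education for five respondents.
education_years = [10, 12, 14, 16, 20]

collapsed = [degree_level(years) for years in education_years]
print(collapsed)
# ['less than high school', 'high school', 'high school', "bachelor's", 'graduate']
```

The collapse only runs in one direction. Once you have recorded "high school" you cannot recover whether the person completed 12 or 14 years, and someone with 16 years of schooling may not actually hold a degree, which is why it pays to ask about degree level explicitly if that is what you really mean.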

TIP: Avoid "open-ended" categories that do not have fixed end points when this is possible (e.g., "graduate degree or more"--or "\$75000 or more"). Keep in mind when you gather data, however, that a closed final category may not be possible with income.

 PRELIMINARY CAUSE AND EFFECT

We will spend considerable time in the coming weeks examining causal issues because the causal order of your variables can help determine the statistics you choose for analysis. For right now, you need to know about independent, intervening or mediator, and dependent variables.

Causes are called  INDEPENDENT VARIABLES.

If one variable truly causes a second, the cause is the independent variable. Speaking more statistically, variation in the independent variables comes from sources outside our causal system or is "explained" by these sources.

Independent variables are often also called explanatory variables or predictors.
Agresti and Finlay use the term "explanatory variable."

Effects are called DEPENDENT VARIABLES.

Statistically speaking, we "explain" the variation in our dependent variable.

Dependent variables are also sometimes called outcome or criterion variables.
Agresti and Finlay use the term "response variables" (that term is unusual).

The typical research problem will describe the causal relationships between independent and dependent variables and explain how these relationships come to be.

INTERVENING OR MEDIATING VARIABLES

I define an intervening variable as one that links in between the independent and the dependent variable. Thus, an intervening variable is part of a causal chain:

INDEPENDENT VARIABLE -------> MEDIATOR VARIABLE ------> DEPENDENT VARIABLE

Increasingly in structural equation models, intervening variables are also called mediator or mediating variables (because they "mediate" between an independent and a dependent variable).

EXAMPLE: educational level is a cause of science attitudes because educational level influences the type of occupation someone has, and it is the occupational type that affects science attitudes.

Occupational type in this example is the mediator variable.

Intervening or mediator variables inform us about causal sequences or chains, thus explaining the causal process  of a phenomenon. To diagram a second example:

educational level -----> occupational type -----> income level

While I would love to say that employers will pay you just because you have a degree, in fact, it is the job you obtain (often thanks to the degree) that pays the salary.

Intervening variables are critical to use in nonexperimental research designs, sometimes called observational or naturalistic designs (in contrast to experiments). Other times, intervening variables can specify exactly what it is about the dependent variable that is important.

 TYPES AND LEVELS OF DATA

 DISCRETE AND CONTINUOUS

Variables can be described as discrete or continuous.

CONTINUOUS VARIABLES can take on any of an infinity of values. If you use very fine measurements, for example "mils," it is possible to describe income in U.S. dollars in almost infinitely fine variations. Similarly, age can even be divided into "nanoseconds." Obviously this only makes sense if the data take on NUMERIC values.

DISCRETE VARIABLES take on only a limited number of values, and cannot be infinitely subdivided into finer and finer measures in the way that continuous variables can. Very often, discrete variables take on integer values, such as "1" "2" or "90".

Nominal, ordinal, and interval-ratio variables are very different types of category systems. These form a cumulative and hierarchical set of data properties: nominal properties are also true for ordinal and interval data, and ordinal properties are also true for interval data. The reverse does NOT hold. Interval and ratio data are numeric, so arithmetic operations can be used. Nominal and ordinal data are categorical, not numeric. It is nonsense to use arithmetic operations on nominal or ordinal data.

 The distinctions among nominal, ordinal, interval, and ratio data are absolutely critical to select the right statistical analytic tools for your variables. Even articles in professional journals often show elementary mistakes in matching the proper statistics to the data the researcher wants to analyze.

 NOMINAL VARIABLES

With nominal  variables, you can tell whether two cases or instances fall into the same category or into different categories. Thus, you can sort all cases into mutually exclusive, exhaustive categories. That's it!

Examples of nominal variables include:

Gender

Ethnicity and

Religious affiliation (or denomination)

None of these variables have categories that can be ordered into more or less, or higher or lower.

Nominal variables are also sometimes called categorical variables or qualitative variables. The categories are not only not numbers, they do not have any inherent order.

Try these examples:

Who is more? Koreans or Turks? More WHAT? Country of origin is NOT a number.

Who is "better"? Women or Men? Better at WHAT? If you suspect that ranking the categories (NOTE: NOT the cases within the categories) would start a war, you probably have nominal variables.

 You can only do very basic statistics or presentations with nominal data, such as percents, ratios, rates, frequency distributions (thus charts and graphs), and modes. Of course, many nominal variables are very important, especially as explanatory variables.

Most statistical processing computer programs, such as SPSS, SAS or the online SDA program, assign numbers to all categories, even to nonnumeric nominal and ordinal variables. This is for data processing ease and does not give you any clues as to the type of data you have. YOU MUST MAKE THE DECISION ABOUT WHETHER YOUR VARIABLES ARE NOMINAL, ORDINAL, OR INTERVAL-RATIO!
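The basic summaries for nominal data (a frequency distribution, percents, and the mode) take only a few lines to produce. This is a sketch using Python's standard `collections.Counter`; the religious affiliation responses are hypothetical.

```python
from collections import Counter

# Hypothetical responses on a nominal variable (religious affiliation).
religion = [
    "Baptist", "Catholic", "Baptist", "None",
    "Jewish", "Baptist", "Catholic", "None",
]

# Frequency distribution: how many cases fall in each category.
counts = Counter(religion)
n = len(religion)

# Percent of all cases in each category.
percents = {category: 100 * count / n for category, count in counts.items()}

# Mode: the category holding the most cases.
mode_category, mode_count = counts.most_common(1)[0]

print(counts["Baptist"])    # 3
print(percents["Baptist"])  # 37.5
print(mode_category)        # Baptist
```

Nothing here adds, subtracts, or ranks the categories themselves. Only the counts of cases are treated as numbers.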

 ORDINAL VARIABLES

With ordinal variables, the categories themselves can be rank-ordered from highest to lowest.

This means the scores must FIRST be rank-ordered from highest to lowest (or vice versa) before you can use any ordinal measures. Like runners in a race, we can rank scores--and especially the categories themselves--from first to last, most to least, or highest to lowest.

In rank-ordered cases, we can literally rank order the finishers in a race or the students by their grade point average (first in class, second in class, and so on down to last in class). Notice that the intervals between cases probably are not the same (or equal). The class valedictorian may have a straight-A or 4.0 average, the salutatorian a 3.6, the third student a 3.5, and so on.
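The class rank example can be worked through in a short sketch. The three grade point averages come from the paragraph above; the dictionary labels and variable names are assumptions for illustration.

```python
# Grade point averages from the example in the text.
gpas = {"valedictorian": 4.0, "salutatorian": 3.6, "third student": 3.5}

# Rank-order the cases from highest to lowest score.
ranked = sorted(gpas.items(), key=lambda item: item[1], reverse=True)
ranks = {name: position for position, (name, _) in enumerate(ranked, start=1)}

# Distance between each pair of adjacent ranks, rounded to one decimal place.
gaps = [round(ranked[i][1] - ranked[i + 1][1], 1) for i in range(len(ranked) - 1)]

print(ranks)  # {'valedictorian': 1, 'salutatorian': 2, 'third student': 3}
print(gaps)   # [0.4, 0.1]
```

The ranks step evenly from 1 to 2 to 3, but the underlying gaps (0.4 and 0.1) are not equal. The ranks tell you the order and nothing about the distance.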

We can also rank-order the categories of a variable in ordinal data. One example is a Likert, or rank-ordered, scale. Respondents are given a statement, such as "I like President Bush," then asked if they:

Strongly Agree, Agree, Disagree, or Strongly Disagree with that statement.

We can surmise that someone who "strongly agrees" supports that statement more intensely than someone who "agrees"--but we don't know how much more intensely.

 Virtually all Agree-Disagree attitude scales are ordinal data. This is fairly obvious when there are 5-7 categories but it is also true when there are only two categories: someone who favors raising teacher salaries obviously is more in favor than someone who opposes the raise. This is also true for many behaviors. Someone who smokes ONLY ONE cigarette per day clearly smokes more than someone who smokes none at all.

Other types of ordinal data include:

the order of finish (e.g., class rank or a horse race)

"yes-no" experiences (someone who answers "yes" to "Do you play the lottery?" clearly plays more than someone who answers "no"), or

collapses of numeric data into categories with unequal widths or intervals (e.g., collapsing years of education into degree level).

STATS & PRESENTATION ADVICE: Everything that you can do  with nominal data (graphs, modes, etc.) you can do with ordinal data too. In addition, with ordinal data, you can do percentiles, quartiles, and medians (the category that includes the 50th percentile).
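Finding the median category means ordering the categories and then locating the one that contains the 50th percentile. The sketch below shows one way to do that; the function `median_category` and the seven agree-disagree responses are hypothetical.

```python
from collections import Counter

# Categories listed from lowest to highest agreement.
ORDER = ["Strongly Disagree", "Disagree", "Agree", "Strongly Agree"]


def median_category(responses, order):
    """Return the first category whose cumulative percent reaches 50."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    n = len(responses)
    cumulative = 0
    for category in order:
        cumulative += counts[category]
        if 100 * cumulative / n >= 50:
            return category
    raise ValueError("responses contain categories missing from order")


# Hypothetical answers from seven respondents.
responses = [
    "Agree", "Strongly Agree", "Disagree", "Agree",
    "Strongly Disagree", "Agree", "Disagree",
]

print(median_category(responses, ORDER))  # Agree
```

The result is a category label, not a number. The code never adds or averages the categories. It only uses their order.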

 INTERVAL-RATIO VARIABLES

You can count the number of books and you can't have less than zero.

In addition to the properties of nominal and ordinal category systems, interval and ratio variables possess a common and equal unit that separates adjacent or adjoining categories.

EXAMPLES: one year of age or one year of education or one dollar of income. Each of these examples is one equal unit.

These intervals are equal no matter how high up or low down the scale you go.

EXAMPLE:

• the difference between two and three children = one child.
• the difference between eight and nine children also = one child.

EXAMPLE:

• the difference between completing ninth grade and tenth grade is one year of school
• the difference between completing junior and senior year of college is one year of school

It is the equal interval between adjacent categories, no matter how small or how large the score may be, that makes the data numeric.

In addition to all the properties of nominal, ordinal, and interval variables, ratio variables also have a fixed/non-arbitrary zero point. Non arbitrary means that it is impossible to go below a score of zero for that variable. For example, any bottom score on IQ or aptitude tests is created by human beings and not nature. On the other hand, scientists believe they have isolated an "absolute zero" degree point. You can't get colder than that.

EXAMPLES: 0 children or 0 years of age. You cannot have fewer than zero children or be less than zero years of age. You cannot have less than zero dollars of income (net worth is another story) or less than zero years of formal education.

Most "count variables" (years of age or formal education, children, dollars) are ratio variables.

 With numeric data (interval or ratio variables), in addition to all the options that you have with nominal and ordinal variables, you can perform arithmetic operations on the scores: add, subtract, divide and multiply them. Thus you can calculate arithmetic means on numeric data. You can also calculate ratios ("half as much", "twice as much") with ratio data.

It is nonsense to perform arithmetic operations on clearly nominal data.

For example, suppose you have a group of three men and three women. Can you calculate a mean biological sex score? What could it possibly be? It can't be a number because a gender category value is a name or tag ("male" "female") that cannot be added or multiplied. Would it be a transgendered person or a hermaphrodite?

CAUTION! Lots of people make the elementary mistake of using numeric operations (such as addition or division) on categorical data. Be careful to use data correctly!
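A computer will not stop you from making this mistake. The sketch below uses the group of three men and three women from the example above and assumes a hypothetical coding scheme (1 = male, 2 = female) of the kind a statistics package might assign.

```python
from collections import Counter

# Hypothetical numeric codes assigned purely for data processing ease.
CODES = {"male": 1, "female": 2}

sex = ["male", "male", "male", "female", "female", "female"]
coded = [CODES[s] for s in sex]

# The arithmetic runs without complaint, but 1.5 names no category at all.
mean_code = sum(coded) / len(coded)
print(mean_code)  # 1.5

# A frequency distribution is a summary that suits nominal data.
frequencies = Counter(sex)
print(frequencies["male"], frequencies["female"])  # 3 3
```

The software happily reports 1.5. Deciding that the number is meaningless is your job, not the program's.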

 LEVELS OF ANALYSIS SUMMARY

| TYPE OF VARIABLE | CASES CAN BE SEPARATED INTO CATEGORIES | CATEGORIES EXHAUSTIVE | CATEGORIES MUTUALLY EXCLUSIVE | CATEGORIES CAN BE RANK-ORDERED | CATEGORIES ARE SEPARATED BY EQUAL INTERVAL | FIXED OR NON ARBITRARY ZERO |
|---|---|---|---|---|---|---|
| NOMINAL | X | X | X | | | |
| ORDINAL | X | X | X | X | | |
| INTERVAL | X | X | X | X | X | |
| RATIO | X | X | X | X | X | X |
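Because the levels are cumulative, the statistics this guide lists for each level can be written as a small lookup. This is only a sketch: the names `LEVELS`, `ADDED_AT_LEVEL`, and `allowed_statistics` are assumptions, and the lists repeat what the sections above say rather than giving a complete catalog.

```python
# Levels of measurement from lowest to highest.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# What each level adds, following the lists given in this guide.
ADDED_AT_LEVEL = {
    "nominal": ["percents", "ratios and rates of counts",
                "frequency distributions", "modes"],
    "ordinal": ["percentiles", "quartiles", "medians"],
    "interval": ["arithmetic means"],
    "ratio": ["ratios of scores"],
}


def allowed_statistics(level):
    """Return every statistic available at this level, including lower levels."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    allowed = []
    for name in LEVELS[: LEVELS.index(level) + 1]:
        allowed.extend(ADDED_AT_LEVEL[name])
    return allowed


print("medians" in allowed_statistics("nominal"))            # False
print("medians" in allowed_statistics("ordinal"))            # True
print("arithmetic means" in allowed_statistics("ordinal"))   # False
print("arithmetic means" in allowed_statistics("interval"))  # True
```

The lookup cannot tell you which level your variable is. As stressed above, YOU must make that decision.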

Susan Carol Losh March 20 2010