Guide 3: Reliability, Validity, Causality, and Experiments

EDF 5481 METHODS OF EDUCATIONAL RESEARCH
FALL 2017

GUIDE 3: RELIABILITY, VALIDITY, CAUSALITY, AND EXPERIMENTS I
SUSAN CAROL LOSH

KEY TAKEAWAYS:

Reliability essentially refers to the stability and repeatability of measures.
Reliable measures still can be biased (differ from the true value) or confounded (measure more than 1 thing simultaneously).
Strong internal validity refers to the unambiguous assignment of causes to effects. Internal validity addresses causal control.
External validity addresses the ability to generalize a study to other people and/or to other situations.
Construct validity is about the correspondence between concepts (constructs) and the actual measurements. A measure with high construct validity accurately reflects the abstract concept that the researcher wants to study.
Measure carefully. Measure more than once. Use more than one measure of a construct.
According to science rules, definitive proof via empirical testing does not exist; alternative causes may be later discovered.
Results from careful, well-controlled experiments are typically easier to interpret in causal terms than results from other methods.
However, causal inferences can sometimes also be drawn from correlational studies. See the rules for establishing causal order below.
Random assignment of participants to treatments in experiments is a powerful causal tool.

RELIABILITY & VALIDITY

ISSUES IN CAUSALITY

RULES FOR CAUSE & EFFECT

AN EXPERIMENTS PRIMER

At this point, you are fairly itching to begin your readings or start your research design. But we still have a bit more basic material to cover. After all, you want your measures to be reliable and valid, your statements about causality to be appropriate, and your findings to be generalizable.

RELIABILITY AND VALIDITY

RELIABILITY

In order to make any kind of causal assessments in your research situation, you must first have reliable measures, i.e., measures that are stable and/or repeatable. If the random variation in your measurements is so large that there is almost no stability in your measures, you can't explain anything! Picture an intelligence test where an individual's scores ranged from below average in the morning of day 1 to genius level in the early evening of day 3. No one would place any confidence in the results of such a "test" because the person's scores were so unstable or unreliable.

Reliability is required to make statements about validity. However, reliable measures could be biased and hence "untrue" measures of a phenomenon, or confounded with other factors such as acquiescence response set (the tendency to agree with statements regardless of their content). Picture a scale that always weighs five pounds too light. The results are reliable, but inaccurate or biased. Or, picture an intelligence test on which women or people of color always score lower (even if this doesn't occur on other tests). Again, the measure may be reliable but biased.
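The bathroom-scale example can be made concrete with a small simulation. This is only an illustrative sketch: the true weight, the 5-pound offset, and the noise levels are assumed values, and `simulate_readings` is a hypothetical helper written for this example.

```python
import random
import statistics

def simulate_readings(true_value, bias, noise_sd, n, seed):
    """Return n simulated readings: true value + constant bias + random noise."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0, noise_sd) for _ in range(n)]

TRUE_WEIGHT = 150.0  # assumed true weight in pounds

# Reliable but biased: very little random error, always about 5 pounds too light.
biased_scale = simulate_readings(TRUE_WEIGHT, bias=-5.0, noise_sd=0.2, n=200, seed=1)
# Unbiased but unreliable: correct on average, but individual readings swing widely.
noisy_scale = simulate_readings(TRUE_WEIGHT, bias=0.0, noise_sd=8.0, n=200, seed=2)

for name, readings in [("biased scale", biased_scale), ("noisy scale", noisy_scale)]:
    mean_error = statistics.mean(readings) - TRUE_WEIGHT
    spread = statistics.stdev(readings)
    print(f"{name}: mean error = {mean_error:+.2f} lb, spread (SD) = {spread:.2f} lb")
```

The first scale gives nearly identical readings every time (high reliability) yet is consistently wrong; the second is right on average but too unstable to trust for any single reading.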

Note that some estimates of reliability are based on the number of items in the test or scale (the famous Cronbach's Alpha is one example). Thus, we might have a very long measure, with dozens of items, so that the entire scale will appear "reliable," yet when we examine the measure closely, we discover that the correlations among the scale items are low. This means that the items in that measure just don't seem to "hang together" or relate well to each other, and your measure may be multidimensional. While this is a "judgement call," be advised that it is desirable for "reliable measures" to also be unidimensional measures, i.e., to measure one and only one construct. It is much easier to interpret unidimensional measures.
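To see how scale length alone can inflate a reliability estimate, here is a minimal sketch using the standardized form of Cronbach's Alpha, which depends only on the number of items k and the average inter-item correlation r: alpha = k*r / (1 + (k - 1)*r). The item counts and correlations below are made-up values chosen for illustration.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized Cronbach's alpha from item count and mean inter-item correlation."""
    k, r = n_items, mean_inter_item_r
    return (k * r) / (1 + (k - 1) * r)

# A long scale whose items barely correlate with one another (assumed values).
long_weak = standardized_alpha(n_items=40, mean_inter_item_r=0.05)
# A short scale whose items hang together much better (assumed values).
short_strong = standardized_alpha(n_items=5, mean_inter_item_r=0.30)

print(f"40 items, mean r = 0.05: alpha = {long_weak:.2f}")
print(f" 5 items, mean r = 0.30: alpha = {short_strong:.2f}")
```

Both scales come out at an alpha of about 0.68, even though the items in the 40-item scale are nearly unrelated. This is why it pays to inspect the inter-item correlations and not just the summary coefficient.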

INTERNAL VALIDITY

Internal validity addresses the "true" causes of the outcomes that you observed in a study. Strong internal validity means that you not only have reliable measures of your independent and dependent variables BUT a strong justification that causally links your independent variables to your dependent variables. At the same time, you are able to rule out extraneous variables, or alternative, often unanticipated, or even spurious (causally "fake") causes for your dependent variables. Thus strong internal validity refers to the unambiguous assignment of causes to effects. Internal validity is about causal control.

Laboratory "true experiments" have the potential to make very strong causal control statements. Random assignment of participants to treatment groups (see below) rules out many threats to internal validity. Furthermore, the lab typically is a controlled setting, very often the experimenter's "stage." If the researcher is very careful, nothing will be in that laboratory setting that the researcher did not place there. When we leave the lab to do studies in natural settings, we can still do random assignment of participants to treatments, but we lose control over potential alternative causal variables in the study setting (dogs bark, telephones ring, the experimental confederate just got run over walking against the "don't walk" sign on West Tennessee).

EXTERNAL VALIDITY

External validity addresses the ability to generalize your study to other people and to other situations. Ideally, to have strong external validity you need a probability sample of participants or respondents drawn using "chance methods" from a clearly defined population (all registered students at Florida State University in the Fall 2017 semester, for example). Ideally, you will also have a good sample of groups (e.g., classes at all ability levels) and a sample of measurements and situations (e.g., you study who follows a confederate who violates the "don't walk" signs at different times of day, on different days, and at different locations on campus). When you have strong external validity, you can generalize to other people and situations with confidence. Many public opinion surveys place considerable emphasis on defining the population of interest and drawing good samples from that population. On the other hand, laboratory experiments often employ "convenience samples," such as intact college classes taught by a friend or students in the College "subject pool." As a result, we may not know whom the subjects represent.

CONSTRUCT VALIDITY

Construct validity is about the correspondence between your concepts (constructs) and the actual measurements that you use. A measure with high construct validity accurately reflects the abstract concept that you are trying to study. Since we can only know about our concepts through the concrete measures that we use, you can see that construct validity is extremely important.

It also becomes clear why it is so important to have very clear conceptual definitions of our variables. Only then can we begin to assess whether our measures, in fact, correspond to these concepts. This is a critical reason why researchers should first work with concepts, and only then begin to work on operationalizing them, if at all possible.

If we only use one measure of a concept, about the best we can do is "face validity," i.e., whether the measure appears "on the face of it" to reflect the concept. Therefore, it is wise to use multiple measures of a concept whenever possible. Furthermore, ideally these will be different kinds of measures and designs.

EXAMPLE: You might measure mathematical skill through a paper and pencil test, through having the student work with more geometric problems, such as a wood puzzle, and having the student make change at a cash register. Our faith that we have accurately measured her high math ability is stronger if she performs well on all three sets of tasks.

Construct validity is often established through the use of what is called a multi-trait, multi-method matrix. At least two constructs are measured. Each construct is measured at least two different ways, and the type of measure is repeated across constructs. For example, each construct first might be measured using a questionnaire, then each construct would be measured using a similar set of behavioral observation categories.

Typically, under conditions of high construct validity, correlations are high for the same construct (or "trait") across a host of different measures. Correlations are low across constructs that are different but measured using the same general technique (e.g., a questionnaire only). Sometimes, this is called "triangulating" measures.

Under low construct validity, the reverse holds. Correlations are high across traits using the same "method" (or type of technique or measurement) but low for the same trait measured in different ways. For example, if our estimate of a student's math ability was wildly divergent depending on whether we examined scores on the questionnaire, making change, or the wood puzzle, we would have low construct validity and a corresponding lack of faith in the results.
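The multi-trait, multi-method logic can be sketched with simulated data. Everything here is an assumption made for illustration: the two traits (math skill and a hypothetical second trait, verbal skill), the two methods (questionnaire and behavioral observation), and the sizes of the method and noise effects. The point is only to show which correlations you would compare.

```python
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)
N = 2000  # simulated students

# Two independent underlying traits (constructs).
math_skill = [rng.gauss(0, 1) for _ in range(N)]
verbal_skill = [rng.gauss(0, 1) for _ in range(N)]
# A per-student "method effect" for each type of measurement.
questionnaire_style = [rng.gauss(0, 1) for _ in range(N)]
observation_style = [rng.gauss(0, 1) for _ in range(N)]

def measure(trait, method):
    """Observed score = trait + small method effect + small random error."""
    return [t + 0.3 * m + 0.3 * rng.gauss(0, 1) for t, m in zip(trait, method)]

math_q = measure(math_skill, questionnaire_style)
math_obs = measure(math_skill, observation_style)
verbal_q = measure(verbal_skill, questionnaire_style)

same_trait_diff_method = pearson_r(math_q, math_obs)  # should be high
diff_trait_same_method = pearson_r(math_q, verbal_q)  # should be low

print(f"same trait, different methods: r = {same_trait_diff_method:.2f}")
print(f"different traits, same method: r = {diff_trait_same_method:.2f}")
```

In this simulated high construct validity case, the same-trait correlation is large and the same-method correlation is small. If you increased the method effect (the 0.3 multiplier on `m`) until it swamped the trait, the pattern would reverse, which is the low construct validity situation described above.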

HELPFUL HINT: One implication of all this material is that, of course, we NEVER, NEVER say phrases such as: "intelligence is what this intelligence test measures." The same goes for any other single kind of "test" or assessment. Be very skeptical of studies that totally equate their concrete measures with their constructs.

ON PROOF AND CAUSALITY

There are many ways of knowing, and different cultures and subcultures use different expectations and norms about proof and causality. Causality is critical: it tells us what is possible, what can be changed and what is difficult, if not impossible, to change. For example, if you are convinced that biological factors cannot be overcome, you probably will not work with visually impaired children because you would believe that they could not compensate for their disabilities. Causality tells us what are the “prime movers” of the phenomena that we observe.

Consider some different perspectives on causality:

God (or some type of Gods) did it.
Nature works with "an unseen hand".
There are "rational laws" to be discovered (and people are capable of discovering these).
Causal relations are an illusion; the universe is random and chaotic, and runs on entropy.

Of course, neither these perspectives nor the "means of proof" below are mutually exclusive in the human cognitive process. Just as a physicist may secretly read his horoscope each morning, people may invoke some, all, or none of these perspectives at the same time.

Here are some different ways and means of "proof":

Controlled experiments in which purported causal factors are manipulated systematically.
Citing recognized authorities, such as Biblical or Quranic scripture, or Sigmund Freud.
Marshalling one's reasonable arguments as in a court of law or journalism.
Precedent as in a court of law.
Intuition...feelings...one just "knows" (in love?).
Reading traces in the environment (Sherlock Holmes stories).
Divine revelation in dreams, visions, bones, tea leaves, etc.
Statistically controlling various purported causal variables.

Why do I bother with these different orientations? Because (again) causality is critical to the research enterprise!

Much of the research process centers around what are the true causal or “independent variables.” What we initially may consider to be “true causal” variables may, instead, turn out to be artifacts of the research process (e.g., questionnaire format response set or experimental reactivity or confounded treatment effects) or the particular group that we studied. Much of science consists of ruling out alternative causes or explanations. While science is one form of knowing and one generic way of gathering evidence that either disconfirms or is suggestive of causality, it is not the only way of doing so. The results of science may or may not be accurate, but without following "the rules" of science, most scientists do not believe one is "doing science."

Considerable disagreement occurs between scientists and members of the general public because scientists don't always make it clear how science methods of "proof" differ from those commonly used among the general public (e.g., legal arguments).

According to science rules, definitive proof via empirical testing does not exist. Science uses the term "proof" (or, rather, "disproof") differently from the way attorneys or journalists do. Our measurements could be later shown to be contaminated by confounding factors. A correlation could have many causes, only some of which have been identified. Later work can show earlier causes to be spurious, that is, both cause and effect depend on some prior causal (often extraneous) variable. Statistics are NEVER EVER considered to "prove" anything although statistical results CAN disconfirm.

Further, science at its best is a self-correcting process. Another researcher can try to duplicate your results. If the results are interesting, in fact, dozens of researchers may try to duplicate the results. If something was awry with your study, the subsequent research projects should discover and correct this.

HELPFUL HINT: We use the rules of science in this course.

CAUSALITY AND METHODS: EXPERIMENTS AND CORRELATIONAL STUDIES

[Photograph: Cancerous Human Lung]
This dissection of human lung tissue shows light-colored cancerous tissue in the center of the photograph. While normal lung tissue is light pink in color, the tissue surrounding the cancer is black and airless, the result of a tarlike residue left by cigarette smoke. Lung cancer accounts for the largest percentage of cancer deaths in the United States, and cigarette smoking is directly responsible for the majority of these cases.

Most people (95% plus of the American public)--and most scientists--accept that smoking cigarettes causes lung cancer although the evidence (for humans) is strictly correlational rather than experimental. There are many topics where it is neither possible--nor desirable--to use the experimental method. To accept more correlational evidence it will help to examine the rules below. (SCL)

Many scientists believe that the ONLY way to establish causality is through randomized experiments. That is one reason why so many methods textbooks designate experiments--and only experiments--as "quantitative research."

Other scholars think causal relations can only be established with numeric data. I have never understood how the numeric level of one's measures can have much to do with cause. After all, variables such as gender, nationality, and ethnicity can have profound causal effects and they are categorical variables. Authors who make this mistake may also misunderstand causality.

Indeed, a moment's reflection will convince you that experiments are far from the only way to establish causality. Most people now accept that smoking cigarettes causes lung cancer (see the Encarta selection above)--yet no society has ever randomly assigned half its population to smoke cigarettes and the other half not (although there are some experiments with rats). This causal conclusion about smoking and lung cancer is based on correlational or observational evidence, i.e., observing the systematic covariation of two (or more) variables. Cigarette smoking and lung cancer are both "naturalistic" variables, i.e., we must accept the data as nature gave them to us.

There is no doubt that the results from careful, well-controlled experiments are typically easier to interpret in causal terms than results from other methods. However, as you can see, causal inferences are often drawn from correlational studies as well. Non-experimental methods must use a variety of ways to establish causality and ultimately must use statistical control, rather than experimental control. The results of the Hormone Replacement Therapy (HRT) experiments, released in the summer of 2002, remind us of the great care that must be taken when designing nonexperimental research. In the original nonexperimental studies, women self-selected into the "hormone" conditions, and the results implied that HRT prevented heart attacks and strokes among women. In fact, when the topic was studied experimentally the reverse was true: HRT increased the risk of heart and circulatory disease among women.

The discrepancy probably occurred because women who take better care of themselves may see a physician on a more regular basis, and thus be in better health to begin with. This self-selection bias probably caused an erroneous and spurious correlation between HRT and women's health.

Some scientists mistakenly believe that large samples can establish causality. Just as numeric measures can't establish cause, neither can the size of the sample or population studied. Large numbers of participants can increase the stability of research results, but do not help to designate cause and effect.

HELPFUL HINT: Watch for some of these fallacies in establishing cause and effect in the research that you encounter.

SOME RULES TO HELP ESTABLISH CAUSAL ORDER

If one variable causes a second variable, they should correlate; thus, causation implies correlation. However, two variables can be associated without having a causal relationship, for example, because a third variable is the true cause of both the "original" independent and dependent variables. For example, there is a statistical correlation over months of the year between ice cream consumption and the number of assaults. Does this mean ice cream manufacturers are responsible for violent crime? No! The correlation occurs statistically because the hot temperatures of summer cause both ice cream consumption and assaults to increase. Thus, correlation does NOT imply causation. Other factors besides cause and effect can create an observed correlation.
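The ice cream example also shows what "statistical control" means in practice. The sketch below simulates data in which temperature drives both variables and then computes a first-order partial correlation, which removes the linear influence of temperature from the ice cream and assaults relationship. The data are simulated under assumed effect sizes; they are not real crime or sales figures.

```python
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after statistically controlling for z."""
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

rng = random.Random(7)
N = 1000  # simulated observation periods

# Temperature causes both outcomes; neither outcome affects the other.
temperature = [rng.gauss(0, 1) for _ in range(N)]
ice_cream = [t + rng.gauss(0, 0.5) for t in temperature]
assaults = [t + rng.gauss(0, 0.5) for t in temperature]

raw = pearson_r(ice_cream, assaults)
controlled = partial_r(
    raw,
    pearson_r(ice_cream, temperature),
    pearson_r(assaults, temperature),
)

print(f"ice cream vs. assaults, raw correlation:       {raw:.2f}")
print(f"ice cream vs. assaults, temperature controlled: {controlled:.2f}")
```

The raw correlation is substantial, but it shrinks toward zero once temperature is controlled, which is the signature of a spurious relationship. Keep in mind that statistical control only works for the third variables you thought to measure.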

If one variable causes a second, the cause is the independent variable (also called the explanatory variable or predictor).
The effect is the dependent variable (also called the outcome or response variable).

If you can designate a distinct cause and effect, the relationship is called asymmetric.

For example, most people would agree that it is nonsense to assume that contracting lung cancer would lead most individuals to smoke cigarettes. For one thing, it takes several years of smoking before lung cancer develops. On the other hand, there is good reason to believe that the carcinogens in tobacco smoke could lead someone to develop lung cancer. Therefore, we can designate a causal variable (smoking) and the relationship is asymmetric.

Two variables may be associated but we may be unable to designate cause and effect. These are symmetric relationships.

For example, men over 30 with higher mental health scores are more likely to be married in the U.S. Aha! Marriage is a "buffer" protecting from the stresses of life, and therefore it promotes greater mental health. Wait! Perhaps the causal direction is the reverse. Men who are in better mental shape to begin with get married. Maybe both are true...When we cannot clearly designate which variable is causal, we have a symmetric relationship.

WHICH VARIABLE IS THE INDEPENDENT VARIABLE IN CORRELATIONAL STUDIES?
RULES AND GUIDANCE

Since we know that we cannot use experimental treatments with naturalistic variables to determine cause and effect, yet we know that scientists can and do draw causal conclusions in nonexperimental studies, here is a set of helpful rules for tentatively establishing causality in correlational data.

For a more detailed discussion, I recommend the following books:

(A) Barbara Schneider, Martin Carnoy, Jeremy Kilpatrick, William H. Schmidt, Richard J. Shavelson (2007): Estimating Causal Effects: Using Experimental and Observational Designs. A think tank white paper prepared under the auspices of the AERA Grants Program.

AERA BARGAIN: You can actually download this book FOR FREE in pdf from the American Educational Research Association.

(B) For a true classic on establishing cause and effect in observational or correlational data, see: Morris Rosenberg (1968) The Logic of Survey Analysis. New York: Basic Books.
This excellent book is still in print! Used copies are available on Amazon and other auction sites (and it covers causal issues in more than just surveys).

By the way, there are always alternative causal explanations in experiments too. The study control group may be flawed. Participants' awareness of being studied may create conditions (e.g., anxiety) that mean we do not measure "true" behavior or performance. So even though it may be easier to establish cause in experiments, keep in mind that nothing is fool-proof.

SOME SUGGESTIONS BELOW

(1) TIME ORDER. The independent variable came first in time, prior to the second variable.

EXAMPLE: Gender or race are fixed at birth. Gender or race can be important causal variables because individuals behave differently toward males or females, and often behave differently toward individuals of different religions or ethnicities.

(2) EASE OF CHANGE. The independent variable is harder to change. The dependent variable is easier to change.

EXAMPLE: One's gender is much harder to change than scores on an assessment test or years of school.

(3) "MAJORITY RULE." The independent variable is the cause for most people.

EXAMPLE: Although some people become so fed up with their jobs that they return to school to train for a better job, most people complete most of their education prior to obtaining a regular year-round, full-time job.

(4) NECESSARY OR SUFFICIENT. If one variable is a necessary or sufficient condition for the other variable to occur, or a prerequisite for the second variable, then the first variable may be the cause or independent variable.

EXAMPLES: A certain type of college degree is often required for certain jobs. At most research universities, publications are a prerequisite for being awarded tenure.

(5) GENERAL TO SPECIFIC. If two variables are on the same overall topic and one variable is quite general and the other is more specific, the general variable is usually the cause.

EXAMPLE: Overall ethnic intolerance influences attitudes toward Hispanics.

(6) THE "GIGGLE" OR "SANITY" FACTOR. If reversing the causal order of the two variables seems illogical and makes you laugh, reverse the causal order back.

EXAMPLE: We don't believe choosing a specific college major or engaging in a particular sport determines one's gender.

MEMORIZE THESE SIX RULES. We will apply them all semester!

A PRIMER ON EXPERIMENTS

Dedicated to health and fitness, you, the researcher, have devised a new exercise plan that you believe will really help people. So you obtain a sample of Educational Psychology undergraduate students. With the flip of a coin, half the students receive a physical and mental health screening and those who are fit begin this new exercise program. The other half also receive a health screening but no exercise regimen. Six weeks later, you re-examine everyone who was physically fit in the screening and compare the two groups. The group receiving the exercise plan now scores happier and healthier than the group that did not.

Jubilant over the results, you assert that your new exercise plan contributes to physical and mental fitness!

Or does it? Are your results internally valid?

Could be.

This study was a "true experiment." In a true experiment--whether laboratory, field, or simulation--participants are randomly assigned to treatment groups using a coin flip or some other probability method that does not rely on human judgment. It is randomization that makes true experiments so strong in internal validity and typically allows us to make relatively strong inferences about causality. It is also random assignment to treatments that distinguishes a true experiment from other kinds of data collection.

Random assignment means that on the average at the beginning of a study, all your treatment groups are about the same. In your physical fitness study, it meant about the same percent of each group "flunked" the screening test and about the same percent exercised on a regular basis, even before your intervention.

Random assignment or "randomization" controls at the beginning for all the variables you can think of, and, more important, all the variables you didn't think of.
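Here is a minimal sketch of random assignment and the balance it tends to produce. The participants and their baseline exercise hours are simulated under assumed values. Note one deliberate difference from a literal coin flip per person: the code shuffles the list and splits it in half, which is still chance-based but guarantees equal group sizes.

```python
import random
import statistics

rng = random.Random(2017)

# Hypothetical participants, each with a pre-existing weekly exercise habit (hours).
participants = [
    {"id": i, "baseline_hours": max(0.0, rng.gauss(3.0, 1.5))}
    for i in range(400)
]

# Random assignment: chance alone, not the researcher, decides who goes where.
rng.shuffle(participants)
half = len(participants) // 2
treatment_group = participants[:half]
control_group = participants[half:]

mean_treatment = statistics.mean(p["baseline_hours"] for p in treatment_group)
mean_control = statistics.mean(p["baseline_hours"] for p in control_group)

print(f"treatment group baseline: {mean_treatment:.2f} hours/week")
print(f"control group baseline:   {mean_control:.2f} hours/week")
```

The two groups start out with very similar average exercise habits even though the code never looked at `baseline_hours` when assigning people. The same balancing happens, on the average, for every other pre-existing characteristic, including the ones nobody measured.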

This study had another important research design aspect: it had a control group which did not receive the special exercise program. Control or comparison groups are critical in all kinds of research. If we did not have a control or comparison group, the study would be open to the criticism--and alternative causal explanation--that improvement in health would have occurred in any event among young adults, even had the exercise program never been instituted. Not only did you have a control group in your study example, but, in an experiment, participants are randomly assigned to it.

In the "Draw a Scientist" study my students and I conducted, a random half of the elementary school students were asked to draw a veterinarian and the other half were asked to draw "a scientist". We wanted to see whether young students were able to differentiate one type of occupation from another.

Studies that lack a control group are sometimes called "one shot" studies or sometimes case studies. While the results may be interesting, we are limited in the causal implications we can make from the results of "one shot" research.

We will later examine facets of the "good" control group.

In the example above, you were pretty sure that you knew what improved the health of your experimental subjects: the new exercise program you initiated. And there is a good chance that you are right, because by using random assignment you controlled for several pre-existing conditions or threats to internal validity: participants' general physical health, previous exercise patterns, incidence of depression, or their general personal histories, which, on the average, should be the same for each group. By using random assignment, you also controlled for any incidental historical conditions (such as an influenza outbreak that year which could influence health in both groups).

Your study has two other important features: a pretest and a posttest. In the pretest, you measured existing conditions on your dependent variables, i.e., mental and physical health among all your participants, whether in the experimental or control group, prior to any intervention at all. This enables you to double-check that your participants were pretty much alike across groups at the beginning of the study. Then, after your intervention, you reassessed scores on your dependent variables in a posttest. Because you have both pretest and posttest information, you can also assess the level of change. A posttest-only design permits neither the baseline check nor the assessment of change.

My exercise study example in this Guide with before and after measures is often called a "pretest-posttest" experimental design.
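The two things a pretest buys you, a baseline check and a measure of change, can be sketched in a few lines. The health scores below are invented for illustration, and comparing mean gains is only one simple way to summarize a pretest-posttest design.

```python
import statistics

# Hypothetical health scores (higher = healthier) for five people per group.
exercise_pre = [60, 55, 70, 65, 58]
exercise_post = [68, 63, 75, 74, 66]
control_pre = [61, 57, 69, 64, 59]
control_post = [63, 58, 70, 67, 60]

# 1. Baseline check: were the groups alike before the intervention?
baseline_gap = statistics.mean(exercise_pre) - statistics.mean(control_pre)

# 2. Change: how much did each person gain from pretest to posttest?
exercise_gain = statistics.mean(post - pre for pre, post in zip(exercise_pre, exercise_post))
control_gain = statistics.mean(post - pre for pre, post in zip(control_pre, control_post))

# 3. Estimated effect: the extra gain in the exercise group beyond the control group.
effect = exercise_gain - control_gain

print(f"baseline gap between groups: {baseline_gap:+.1f}")
print(f"mean gain, exercise group:   {exercise_gain:+.1f}")
print(f"mean gain, control group:    {control_gain:+.1f}")
print(f"estimated program effect:    {effect:+.1f}")
```

With these made-up numbers, the groups start less than half a point apart, the exercise group gains 7.6 points, the control group gains 1.6, and the estimated effect is 6.0 points. The control group's gain is what would have happened anyway, so it is subtracted out.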

You should be advised, however, that the standard pretest-posttest design may pose some threats to internal validity, or the unambiguous assignment of cause and effect. Why? Because simply being measured or observed during the pretest may sensitize some participants and they will behave differently as a result. (For example, being weighed might have sent all subjects to the exercise room for six weeks!) Furthermore, a pretest may interact with an experimental treatment to heighten the effect of the experimental intervention more than it ordinarily would have.

How can you cope with this dilemma? One way is the famous Solomon Four Group Design, considered one of the strongest experimental designs with respect to internal validity. In the Solomon Four Group Design, there are four randomized groups of participants. One group receives a pretest, the experimental treatment and a posttest. The second group is identical, except it does not receive a pretest. The third group receives a pretest and posttest but a different treatment (this could be a group that receives no treatment at all, for example). The final group receives only a posttest and the second treatment (such as no treatment). Below is a diagram of the Solomon Four Group Design:

GROUP          PRETEST    TREATMENT                       POSTTEST
GROUP ONE      Pretest    Treatment 1                     Posttest
GROUP TWO      (none)     Treatment 1                     Posttest only
GROUP THREE    Pretest    Treatment 2 (control group?)    Posttest
GROUP FOUR     (none)     Treatment 2                     Posttest only

Solomon Four Group Designs are more expensive because they require more participants and conditions than other types of experimental designs. But many researchers believe the advantages are worth the expense.
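To see what the four groups let you separate, consider this sketch using hypothetical posttest means. The numbers are invented, and a real analysis would use the individual scores (for example, a 2 x 2 analysis of variance on the posttest); the arithmetic here only shows the logic of the comparisons.

```python
# Hypothetical posttest means for the four groups in the diagram above.
posttest_mean = {
    "group_1": 78,  # pretest + treatment 1
    "group_2": 74,  # no pretest + treatment 1
    "group_3": 66,  # pretest + treatment 2 (control)
    "group_4": 65,  # no pretest + treatment 2 (control)
}

# Treatment effect among people who were never pretested (uncontaminated by pretest).
effect_without_pretest = posttest_mean["group_2"] - posttest_mean["group_4"]
# Treatment effect among people who were pretested.
effect_with_pretest = posttest_mean["group_1"] - posttest_mean["group_3"]
# Did the pretest alone change behavior? Compare the two control groups.
pretest_sensitization = posttest_mean["group_3"] - posttest_mean["group_4"]
# Did the pretest heighten the treatment's effect? Compare the two treatment effects.
pretest_by_treatment = effect_with_pretest - effect_without_pretest

print(f"treatment effect, no pretest:     {effect_without_pretest}")
print(f"treatment effect, with pretest:   {effect_with_pretest}")
print(f"pretest effect alone:             {pretest_sensitization}")
print(f"pretest x treatment interaction:  {pretest_by_treatment}")
```

With these assumed means, the treatment raises scores by 9 points without a pretest and by 12 points with one, so 3 points of the apparent effect in the standard pretest-posttest comparison come from the pretest interacting with the treatment. A two-group pretest-posttest design could not detect that.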

We will revisit experiments, and compare them with "quasi experiments", in Guide 4.

ON "EXPERIMENTAL DESIGNS" WITH INTACT GROUPS

Some textbooks imply that "intact groups" cannot be part of a "true experiment." This is not necessarily true, so assess each situation carefully when reading a study to see if a true experiment really would be possible.

For example, suppose you want to study fourth grade classes. The major way the school divides its fourth grade students into classes is through a systematic alphabetical list. If there are five fourth grade classes, the first student on the list goes to Class 1, the second to Class 2, and so on, so that every fifth student lands in the same class. In other words, there is no reason at this particular school to believe any of the fourth grade classes is distinctive at the very beginning of the school year. If you randomly assign classes to different experimental treatments in this example, you will indeed have a "true experiment." The key is that the intact groups were pretty much assembled using random means in the first place.
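The fourth grade scenario has two separate steps: the school's systematic alphabetical split into classes, and your random assignment of whole classes to treatments. A rough sketch follows; the roster names, the class size, and the treatment labels are all hypothetical.

```python
import random

N_CLASSES = 5

# Step 1 (the school's procedure): an alphabetical roster dealt out in turn,
# so every fifth student ends up in the same class.
roster = sorted(f"Student{i:03d}" for i in range(1, 101))  # hypothetical names
classes = {c + 1: roster[c::N_CLASSES] for c in range(N_CLASSES)}

# Step 2 (the researcher's procedure): randomly assign intact classes to treatments.
rng = random.Random(5481)
class_ids = list(classes)
rng.shuffle(class_ids)
treatments = ["new method", "standard method"]  # hypothetical treatment labels
assignment = {cid: treatments[pos % 2] for pos, cid in enumerate(class_ids)}

for cid in sorted(assignment):
    print(f"Class {cid}: {len(classes[cid])} students -> {assignment[cid]}")
```

Notice that chance decides the treatment for each class, not for each student, so the class is the unit that was randomized. With only five classes there are few units to balance, which is one more reason to check carefully that the classes really were comparable at the start.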

Also, if it is the very beginning of the academic year, students in the different classes have not been exposed to different teachers or teaching methods. This will not be true later in the year. If you come in and do your experiment at the very beginning and before the different teachers have made assignments, begun in-depth lessons, etc., you probably do have a "true experiment."

On the other hand, suppose there was a systematic difference among groups before you applied any kind of intervention, such as Honors classes versus regular classes in school. In such a case, even random assignment of intact groups could not produce a true experimental design. The problem is particularly great if a difference between groups relates to a variable you want to study. For example, Honors math students may react differently to a new way of teaching algebra than students in regular classes.

So, study the situation carefully. "True experiments" with intact groups are possible, but only under a very restricted set of conditions. If those conditions are not met, it is more likely that this is a "quasi-experiment," which we will examine next.

Measure carefully. Measure more than once. Use more than one measure of a construct.
Avoid bias, such as the bathroom scale that always measures 5 pounds too light. (Check things with the doctor's scale, for example.)

Susan Carol Losh
September 13 2017