KEY TAKEAWAYS:
- It is important to recognize that using these databases is not necessarily "easier," and it can be time-consuming.
- What they do make possible is a scope and breadth an individual researcher could not attain on their own.
- Thus research with these databases may have greater construct and external validity.
- Archival databases may be useful reference points to find the most accurate or extensive treatment of a topic in your study area.
- They are also useful to new researchers who are still constrained for resources.
- The student must ask the same questions of an archive that they do when evaluating any study or data source, for example:
- What was the method of data collection (e.g., organizational records? surveys? what kind of surveys?)
- Who or what is the population? What is the estimated level of external validity?
- What possibilities for bias exist? Coding cause of death to spare the family? Biased questions or omitted participants (e.g., never-married mothers)?
- How complete is the description of the data (response rate? coverage, e.g., including cell phones)?
- Are the data available? If so, how (including cost)?
- Is online data analysis of the data possible?
- Have the data been used in books, chapters, or articles? (If so, we can learn more about the topic, too.)
EDF 5481 METHODS OF EDUCATIONAL RESEARCH
INSTRUCTOR: DR. SUSAN CAROL LOSH
FALL 2017
PLEASE NOTE: Your
texts do not give much information about online data and secondary analysis,
so this lecture will be the basis for the topic this term. "Big Data"
are increasingly used in original research! You are responsible for this
material on Quizzes and Assignments.
OPTIONAL: What do we know about "big data"? Check out my address to the AERA Advanced Studies of National Databases Special Interest Group (SIG): HERE
OPTIONAL: Here's a "Big Data" AERA SIG Newsletter (with thanks to Editor Jim Harvey): HERE
WHY EXAMINE WEB-BASED DATABASES?
As you have learned, it is expensive and
time-consuming to collect data, especially datasets that are sizable or
comprehensive. In the early 1970s, the United States Federal government
initiated a series of what have come to be called "Social Indicators."
The idea was to collect data from different domains (education, health,
the status of women and ethnic minorities, public opinion, etc.) and to
continue these series over time, thereby tracking change and continuity
among Americans. At the same time, other countries, particularly Canada,
Western Europe, and Japan, also began indicator series, thus making possible
international comparisons. By now, nearly all regions of the world collect
and store indicator information. In education, one example is the Trends
in International Mathematics and Science Study (TIMSS). Data were collected
in 42 countries in 1995 and in 38 countries in 1999. More recent additions
(2003, 2007, 2011, with 2015 shortly to come) address experience with computers
and the World Wide Web.
Considerable effort has been devoted
to making many of these indicator series compatible over time and many
of these efforts reference concepts we have utilized all semester, e.g.,
internal, construct and external validity; sampling and question format
issues:
- Questions are asked in the same way.
- Changes to questions are established via "split-ballot" testing, i.e., experiments to see whether the revised questions work the same way as the original questions. A good indicator series NEVER arbitrarily shifts question format (or open question codes).
- Variables are defined in the same way.
- Coding categories remain constant.
- If coding changes are made, care is taken to make new coding systems compatible with the old, such as the detailed United States Census three-digit occupational codes.
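The compatibility idea in the last point can be sketched in a few lines of code. This is a minimal illustration, not an actual Census scheme: the category codes and labels below are entirely hypothetical.

```python
# Sketch: harmonizing a revised coding scheme with an older one so a
# time series remains comparable. All codes here are hypothetical,
# not real Census occupational codes.

# Old scheme: 1 = professional, 2 = clerical, 3 = manual.
# Suppose a new scheme splits "professional" into two codes (1 and 4).
NEW_TO_OLD = {1: 1, 4: 1, 2: 2, 3: 3}

def harmonize(new_codes):
    """Map newly coded responses back to the old categories."""
    return [NEW_TO_OLD[code] for code in new_codes]

# A respondent coded 4 ("technical professional") in the new scheme
# still counts as 1 ("professional") in the harmonized series.
print(harmonize([1, 4, 2, 3]))  # -> [1, 1, 2, 3]
```

With a crosswalk like this, counts tabulated under the new scheme can still be compared with earlier years coded under the old one.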
A series may have an "oversight board."
These boards monitor the content and form of the indicator series. Thus,
principal investigators cannot capriciously change either content or form
without input from a panel of expert professionals.
The number of data archives is already
HUGE and it is growing by the minute. Some of the large archives, such
as ICPSR, The Roper Center or the Odum Institute for Research in Social
Science at the University of North Carolina, are simply staggering in the
amount of data that they hold.
As you look through some of the sample
pages, you will see that several times I have given the warning: "set aside
a day to explore this archive." Do take this warning seriously!
One of these archives may hold the answer to questions you may have about
research in your field, or your proposed dissertation or master's thesis,
or provide the basis for a nice conference paper or article. They are definitely
worth exploring.
These archives may be the source to consult
if a new study garners a lot of publicity and possibly "strange" findings.
With resources such as these, the novice--and
even the experienced--researcher should seriously reconsider whether they
really want to gather all of their own data from scratch.
Analyzing data from these archives is often
called SECONDARY ANALYSIS, partially because
the data were originally gathered for other research and information purposes.
WHY THESE ARCHIVES ARE IMPORTANT TO YOU
- There is no point in "reinventing the wheel." Why do a small local study when data already exist on regional, national, or even international levels? An example is using the "CIRP" (often called the "Freshman Surveys") to look at college student beliefs, attitudes, and accomplishments instead of convenience samples of your buddy's classes.
- "There is plenty of gold in them thar hills." Most of these databases are so huge that no one investigator could ever analyze everything in them. With each successive year, the possibilities for analysis grow. Furthermore, other researchers may have ideas for analysis that did not occur to the original Principal Investigator. In other words, there is plenty of data for you to do an original analysis--without all the backbreaking work of collecting the data too.
I practice what I preach! Since 2001
I have worked with the National Science Foundation Surveys of Public Understanding
of Science and Technology. These surveys now span 1979 to 2014, an unprecedented
look at public knowledge, reasoning and attitudes about science and technology.
I have built longitudinal files from these data now available at ICPSR
and The Roper Center.
OPTIONAL: One thing repeated studies make possible is comparing generational effects with chronological aging. For two examples of my examination of generational versus aging effects on science beliefs and attitudes (CLICK HERE) and information technology (CLICK HERE), see the Internet links. Currently I am examining general public perceptions of climatologists over time.
- Many of these archives offer an unprecedented opportunity to track trends over time. How did computer use change from the early 1980s to the early 2000s? What kind of educational preparation do students receive who rise to eminence later on? What are the average student characteristics in research universities as opposed to liberal arts colleges, and how did these characteristics change over time? What are gender differences in Internet use over time?
- YOUR time, resources, and energy. Many researchers, especially junior faculty and doctoral candidates, have limited resources. With one eye on the tenure clock, junior faculty have limited time too. It would be nearly impossible for most young researchers to collect international data or wait many years to collect repeated measures. If existing archives have variables that are directly pertinent to your research interests, it is often in your best professional interest to use--or at least to reference--these archives.
Obviously, using pre-existing archives is not for everyone. Many students in disciplines that lend themselves to experiments or surveys might be able to quickly collect hand-tailored data with relatively little financial investment. However, even these researchers may be interested in "triangulation" with survey data or historical records.
CLICK HERE TO ENTER THE ONLINE DATABASE MENU
QUESTIONS YOU SHOULD CONSIDER ABOUT ONLINE DATABASES
- What is the unit of analysis? Is it an individual? An organization, such as a college or university? A time point for a country or state series? Archives vary, and the unit is not always an individual.
- What kinds of variables does the archive cover? Degree attainment? Spirituality? Symptoms of stress? Health practices? Drug or alcohol usage? Water pollution?
- What is the time frame covered by the archive? Examples: the average school FCAT scores for 1998-2016, or the General Social Survey from 1972-2016.
- What is the geographic frame covered by the archive (state? local? United States? international?)
- Who were the sponsor(s) of the archive (e.g., NSF? NCES? United Faculty of Florida?)
- How did the archive come to be?
- Were the data collected especially for the archive (such as IPEDS or TIMSS)? Or were the data compiled from other sources (such as WebCASPAR)?
- Does the archive contain any tutorials that instruct how to use it (online or otherwise)?
- Are there codebooks that describe the data, the variables, and the file structure?
- How are the data available? Are they ready for online analysis? Are the data available to download onto one's computer? Are the data contained in .pdf format tables? Are there alternative ways to obtain the data (such as a CD)? If so, how can the data be obtained?
- Can you simply download the data, or must you obtain a CD or other device from the archive agency?
- How "clean" are the data? One good example is the U.S. government's famous "Falling Through the Net" data about the "Digital Divide" in computer and Internet usage. This is one of the most cited datasets about early Digital Divides, but the data are appallingly "dirty." Any household resident 14 years of age or older was asked to provide information about all other residents in the household. Considerable data are missing on racial identification. The information I could locate did not say how the data were gathered (in person? Random Digit Dial of landlines?). Apparently, the government was in such a rush to put up the dataset that the data contain a LOT of careless errors. As a result, I consider estimates from the early years of these data to be unreliable despite a usually trustworthy source.
- Is there a charge for the data? If so, what is the cost? Most archival costs are surprisingly reasonable when you consider the effort involved in the first place. For example, the cost of the ENTIRE General Social Survey archive, from 1972 to 2016, in SPSS-ready format is about $500. Compare this with the millions of dollars it cost to gather the data (about three million dollars every other year). Don't forget: a researcher can incur time and financial costs to gather and process their own data. It may, indeed, turn out to be cheaper to use the archive. And university dissertation grants may even cover the acquisition cost.
- What kinds of analyses can be done online? Frequency distributions? Cross-tabulations? Multiple regression or other multivariate analyses? See if the archive uses the California-Berkeley Survey Documentation and Analysis (SDA/DAS) system, which is simple to use, covers most basic statistics, and is unbelievably fast (including on the dial-up system where I first used it: it tore through over 131,000 cases in 7 seconds). Many online datasets are now directly linked to the SDA/DAS system.
- Is a questionnaire available, or some other original document describing each variable in detail? Maybe it is available as a separate link or as a .pdf document. EVERYDAY NOTE: The IRB will want to see your questionnaire(s) if you do a survey design yourself--and any questionnaires from an existing database if you conduct secondary analysis.
- What is mentioned about coverage or response rate? For example, data are missing from several states in early data series about abortion. Some surveys, especially longitudinal or panel studies, may have completed interviews with less than half of the originally contacted respondents. In other cases, such as the CIRP, response rates can vary considerably from college to college.
- Does the user need any kind of license from the data agency? Many data sets at the National Science Foundation, the National Center for Education Statistics, and other agencies require the user to have a license if s/he works with what is called "unit record" data. "Unit record data" is the "raw data" archive where each record or line in the datafile is an individual or an institution. This means the person or institution could plausibly be identified (although in many if not most cases, this is unlikely). Obtaining a license is typically not a problem for legitimate researchers, but it does necessitate some paperwork, so if you are the user: be prepared to check about this and budget some time accordingly.
- What was the mode of data collection? In-person surveys may give different results than telephone surveys. The top administrator of a university may access different data than a rank-and-file faculty member. And, remember, of course, the data in the archive might not be surveys. Instead they might be standardized tests, documents (birth or death records), economic records, or an institutional archive.
- How recently has the database been monitored or updated? See if you can find a date on the page, typically at the very top or the very bottom of the page. "Old pages" may have missing links, unfixed errors, omit the most recent updates to files, or simply may not work.
- Were the data gathered over time by different agencies or different principal investigators? If so, changes in variables, definitions, or coding may have occurred. The user may find differences attributable to these changes, rather than to changes in the concepts they are studying--thus threats to internal validity.
- How far back does the data series extend? The longer the series, the more likely you are to encounter strange alphabetic and non-alphanumeric computer codes, or inconsistencies in definitions or measures. And the more likely the original data are to be flat-out MISSING.
- Were data compiled from different agencies into a single archive? Again, check for consistency in definitions (even of the same variable!) across agencies.
- See if the description of the archive notes any problems or missing information.
- For prospective analysts: what are your computer skills? Some databases are in ASCII format, which you can probably download into a spreadsheet such as Excel. But the field delimiters vary widely: some use spaces, others use commas, still others rely on a format statement so that the data can be read. Do you know how to analyze data using a spreadsheet program? If not, do you know how to transfer spreadsheet data into a statistical program such as SPSS, SAS, M+, R, or other software? Do you have file management skills so that you can insert value labels, variable labels, and missing data codes? In other cases, you may have to save or print tabular displays and hand-enter the data into a spreadsheet (very carefully). As you can see, it is VERY helpful to have good computer skills--or to have some good friends who do.
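The delimiter issue in the last point is easy to see in a few lines of code. Here is a minimal Python sketch that reads the same toy dataset in two common ASCII layouts, comma-delimited and space-delimited; the data and variable names are made up for illustration.

```python
import csv
import io

# A tiny "archive extract" in two common ASCII layouts (hypothetical data).
comma_text = "id,score\n1,85\n2,92\n"
space_text = "id score\n1 85\n2 92\n"

# Comma-delimited: the csv module's default dialect handles it directly.
comma_rows = list(csv.reader(io.StringIO(comma_text)))

# Space-delimited: split each line on whitespace instead.
space_rows = [line.split() for line in space_text.splitlines()]

print(comma_rows)  # [['id', 'score'], ['1', '85'], ['2', '92']]
print(space_rows)  # same structure
```

If instead the archive relies on a format statement (fixed column positions with no delimiter at all), you must slice each line by character position according to the codebook before any analysis is possible.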
Any original problems present when the data were first gathered will STILL be there when the data are archived. See what you can find out about issues with question format, sampling, coding categories, and other sources of bias and random error. Sometimes (for example, the General Social Survey) there will be considerable information about items such as response rate, but sometimes there is not.
Always remember this classic cliché: do the best you can with what you got. Despite any problems, online databases and archives are a terrific resource for us all.
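Once you have downloaded data from an archive, the basic analyses mentioned above (frequency distributions and cross-tabulations, the staples of systems like SDA) can also be run locally. A minimal sketch in Python using only the standard library; the survey variables and responses below are hypothetical.

```python
from collections import Counter

# Hypothetical downloaded extract: (gender, uses_internet) per respondent.
responses = [
    ("F", "yes"), ("M", "yes"), ("F", "no"), ("F", "yes"),
    ("M", "no"), ("M", "yes"), ("F", "yes"), ("M", "no"),
]

# Frequency distribution of one variable.
internet_freq = Counter(use for _, use in responses)
print(internet_freq)  # Counter({'yes': 5, 'no': 3})

# Cross-tabulation of gender by Internet use: counts per cell.
crosstab = {}
for gender, use in responses:
    crosstab.setdefault(gender, Counter())[use] += 1
print(crosstab["F"])  # Counter({'yes': 3, 'no': 1})
```

In practice you would use a statistical package (SPSS, SAS, R) for anything beyond simple tables, but the logic is the same: count cases within categories, then within combinations of categories.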
WHERE TO START HUNTING FOR ONLINE ARCHIVES
- Professional associations in your field
- The FSU on-line library system (schedule a meeting with a librarian; FSU is an ICPSR and a Roper Center member)
- Search engines using your topic of interest
- Major US government or state Web sites (if you are an international student, check out sites from your home country). The National Center for Education Statistics, the National Science Foundation, the Centers for Disease Control--and even the State of Florida website--all contain links to many, many databases. You will find several of them (but far from all of them!) in our course database menu.
- Major archives such as the Inter-university Consortium for Political and Social Research at the University of Michigan (ICPSR), the Pew Research Center for the People and the Press, or the Roper Center (now at Cornell University).
- One link leads to another. I found the International Social Survey Program link from the General Social Survey Web site.
- Check with faculty and graduate students in The College of Information
- Many recent textbooks have online supplements or Web sites that list archives
November 29, 2017
Susan Carol Losh
This page was built with Netscape Composer.
Always under construction as new databases are entered.