\chapter{R Tutorial}
\label{chap:Rtutorial}
\SweaveOpts{keep.source=TRUE, pdf=FALSE, prefix.string=Chap02, grdevice=tikz.Swd}
<>=
options(digits=3, show.signif.stars=FALSE, width=53)
rm(list=ls())
require(tikzDevice)
source("../SweaveTikZ.R")
@
%data files: read.table("US.txt", T); read.table("NAO.txt", T)
%packages: UsingR
%source code: NONE
%third party: NONE
\begin{quote}
``I think it is important for software to avoid imposing a cognitive style on workers and their work.''
\end{quote}
\indent---Edward Tufte\\

We begin with a tutorial on using R. To get the most out of it you should open an R session and type the commands as you read the text. You should be able to use copy-and-paste if you are using an electronic version of the book.

\section{Introduction}

Science requires transparency and reproducibility. The R language for statistical modeling makes this easy. Developing, maintaining, and documenting your R code is simple. R contains numerous functions for organizing, graphing, and modeling your data. Directions for obtaining R, accompanying packages, and other sources of documentation are available at \url{http://www.r-project.org/}. Anyone serious about applying statistics to climate data should learn to use R.

The book is self-contained. It presents R code and data (or links to data) that can be copied and pasted to reproduce the graphs and tables. This reproducibility provides you with an enhanced learning opportunity. Here we present a tutorial to help you get started. This can be skipped if you already know how to work with R.

\subsection{What is R?}

R is the `lingua franca' of data analysis and statistical computing. It helps you perform a variety of computing tasks by giving you access to commands. This is similar to other programming languages like Python and C++. R is particularly useful to researchers because it contains a number of built-in mechanisms for organizing data, performing calculations, and creating graphics.
R is an open-source statistical environment modeled after S. The S language was developed in the late 1980s at AT\&T labs. The R project was started by \underline{R}obert Gentleman and \underline{R}oss Ihaka of the Statistics Department of the University of Auckland in 1995. It now has a large audience. It's currently maintained by the R core-development team, an international group of volunteer developers.

To get to the R project web site, open a browser and in a search window type the key words `R project', or link directly to the web page using \url{http://www.r-project.org/}. Directions for obtaining the software, accompanying packages, and other sources of documentation are provided at the site.

Why use R? It's open-source and free, it runs on the major computing platforms, it has built-in help and excellent graphing capabilities, and it's powerful and extensible, containing thousands of functions.

A drawback to R for many is the lack of a serious graphical user interface (GUI). This means it is harder to learn at the outset and you need to learn its syntax. Modern ways of working with computers include browsers, music players, and spreadsheets. R is nothing like these. Infrequent users forget commands. There are no visual cues; a blank screen is intimidating. At first, working with R might seem daunting. However, with a little effort you will quickly learn the basic commands and then realize how it can help you do much, much more.

R is really a library of modern statistical tools. Researchers looking for methods that they can use will quickly discover that R is unmatched. A climate scientist whose research requires customized scripting, extensive simulation analysis, or state-of-the-art statistical analysis will find R to be a solid foundation.

R is also a language. It has rules and syntax. R can be run interactively, and analysis can be done on the fly. It is not limited by a finite set of functions.
You can download packages to perform specialized analyses and graph the results. R is object-oriented, allowing you to define new objects and create your own methods for them.

Many people use spreadsheets. They are good for tasks like data storage and manipulation. Unfortunately they are unsuitable for serious research. A big drawback is the lack of community support for new methods. Also, nice graphs are hard to make, and reproducibility can be a challenge. If you are serious about your research you should not use a spreadsheet for statistics.

\subsection{Get R}

At the R project website, click on CRAN (Comprehensive R Archive Network) and select a nearby mirror site. Then follow the instructions appropriate for your operating system to download the base distribution. On your computer, click on the download icon and follow the install instructions. Click on the icon to start R. If you are using the Linux operating system, type the letter R from a command window.

R is most easily used in an interactive manner. You ask R a question and it gives you an answer. Questions are asked and answered on the command line. The \verb@>@ (greater than symbol) is used as the prompt. Throughout this book, it is the character that is NOT typed or copied into your R session. If a command is too long, a \verb@+@ (plus symbol) is used as a continuation prompt. If you get completely lost on a command, you can use Ctrl-C or Esc to get the prompt back and start over. Most commands are functions and most functions require parentheses.

\subsection{Packages}

A package is a set of functions for doing specific things. As of early 2012 there were over 3200 of them. Indeed this is one of the main attractions: the availability of thousands of packages for performing statistical analysis and modeling. And the range of packages is enormous.
The package {\bf BiodiversityR} offers a graphical interface for calculations of environmental trends, the package {\bf Emu} analyzes speech patterns, and the package {\bf GenABEL} is used to study the human genome.

To install the package called {\bf UsingR} type
<>=
install.packages("UsingR")
@
Note that the syntax is case sensitive. \verb@UsingR@ is not the same as \verb@usingR@ and \verb@Install.Packages@ is not the same as \verb@install.packages@. After the package is downloaded to your computer, you make it available to your R session by typing
<>=
library(UsingR)
@
Or
<>=
require(UsingR)
@
Note again the syntax is case sensitive. When installing the package, the package name needs to be in quotes (either single or double, but not directional quotes). No quotes are needed when making the package available to your working session. Each time you start a new R session the package needs to be made available, but it does not need to be installed again from CRAN.

If a package is not available from a particular CRAN site you can try another. To change your CRAN site, type
<>=
chooseCRANmirror()
@
Then scroll to a different location.

\subsection{Calculator}

R evaluates commands typed at the prompt. It returns the result of the computation on the screen or saves it to an object. For example, to find the sum of the square root of 25 and 2, type
<>=
sqrt(25) + 2
@
The \verb@[1]@ says `first requested element will follow.' Here there is only one element, the answer 7. The \verb@>@ prompt that follows indicates R is ready for another command. For example, type
<>=
12/3 - 5
@
R uses the order of operations so that, for instance, multiplication comes before addition. How would you calculate the 5th power of 2? Type
<>=
2^5
@
How about the amount of interest on \$1000, compounded annually at 4.5\% (annual percentage) for 5 years? Type
<>=
1000 * (1 + .045)^5 - 1000
@

\subsection{Functions}

There are numerous mathematical and statistical functions available in R.
They are used in a similar manner. A function has a name, which is typed, followed by a pair of parentheses (required). Arguments are added inside this pair of parentheses as needed. For example, the square root of two is given as
<>=
sqrt(2) # the square root
@
The \verb@#@ is the comment character. Any text in the line following this character is treated as a comment and is not evaluated by R. Some other examples include
<>=
sin(pi) # the sine function
log(42) # log of 42 base e
@
Many functions have arguments that allow you to change the default behavior. For example, to use base 10 for the logarithm, you can use either of the following
<>=
log(42, 10)
log(42, base=10)
@
To understand the first function, \verb@log(42, 10)@, you need to know that R expects the base to be the second argument (after the first comma) of the function. The second example uses a named argument of the type \verb@base=@ to explicitly set the base value. The first style requires less typing, but the second style is easier to remember and is good coding practice.

\subsection{Warnings and errors}

When R doesn't understand your function it responds with an error message. For example
<>=
srt(2)
@
<>=
try_out = try(srt(2))
cat(try_out)
@
If you get the function correct, but your input is not acceptable, then
<>=
sqrt(-2)
@
<>=
try_out = try(sqrt(-2))
cat(try_out)
@
The output \verb@NaN@ is used to indicate `not a number.' As mentioned, if R encounters a line that is not complete, a continuation prompt, \verb@+@, is printed, indicating more input is expected. You can complete the line after the continuation prompt.

\subsection{Assignments}

It is often convenient to name an object so that you can use it later. Doing so is called an assignment. Assignments are straightforward. You put a name on the left-hand side of the equals sign and a value, function, object, etc.\ on the right. Assignments generally do not produce printed output.
<>=
x = 2 # assignments return a prompt only
x + 3 # x is now 2
@
Remember, the pound symbol (\#) is used as a comment character. Anything after the \# is ignored. Adding comments to your code is a good way of recalling what you did and why.

You are free to make object names out of letters, numbers, and the dot or underline characters. A name starts with a letter or a dot (a leading dot may not be followed by a number). You are not allowed to use math operators, such as \verb@+@, \verb@-@, \verb@*@, and \verb@/@. The help page for \verb@make.names@ describes this in more detail (\verb@?make.names@). Note that case is also important in names; \verb@x@ is different from \verb@X@.

It is good coding practice to use conventional names for certain types of data. For instance, \verb@n@ is used for the length of a data record or the length of a vector, \verb@x@ and \verb@y@ are used for spatial coordinates, and \verb@i@ and \verb@j@ are used for integers and indices for vectors and matrices. These conventions are not enforced, but consistently using them makes it easier for you (and others) to look back and understand what you've done.

Variables that begin with the dot character are usually reserved for advanced programmers. Unlike in many programming languages, the period in R is used only as punctuation and can be included in an object name (e.g., \verb@my.object@).

\subsection{Help}

Using R to do statistics requires knowing many functions---more than you will likely be able to keep in your head. R has built-in help facilities for information about what is returned by a function, for details on additional arguments, and for examples. If you know the name of a function, for example, you type
<>=
help(var)
@
This brings up a help page dedicated to the function (\verb@?var@ works the same way). Depending on your operating system the help page may come as html. The name of the function and the associated package are given as the preamble.
This is followed by a brief description of the function and how it is used. Arguments are explained along with function and argument details. The examples given toward the bottom of the page are especially important. A good strategy for understanding what a function does is to copy and paste the examples into your R session.

This works great if you know the name of the function. If not, there are other ways to search. For example, the function \verb@help.search("mean")@ searches each entry in the help system and returns matches (often many) of functions that mention the word \verb@"mean"@. The function \verb@apropos@ searches through only function names and variables for matches. Type
<>=
apropos("mean")
@
Most help pages provide examples. The examples help you understand what the function does. You can try the examples individually by copying and pasting them into your R session. You can also try them all at once by using the \verb@example@ function. For instance, type
<>=
example(mean)
@
To end your R session type
<>=
q(save="no")
@
Like most R functions, \verb@q@ needs an open (left) and close (right) parenthesis. The argument \verb@save="no"@ says do not save the workspace. Otherwise the workspace and session history are saved to a file in your working directory. By default the file is called \verb@.RData@. The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions).

\section{Data}

\subsection{Small amounts}

To begin you need to get your data into R. For small amounts you use the \verb@c@ function. It combines (concatenates) items like a set of numbers. Consider a set of hypothetical hurricane counts, where in the first year there were two landfalls, in the second three, and so on. To enter these values type
<>=
h = c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
@
The ten values are stored in a vector object of class numeric called \verb@h@.
<>=
class(h)
@
To show the values type the name of the object.
<>=
h
@
Take note. You assigned the values to an object called \verb@h@. The assignment operator is an equal sign (\verb@=@). Another assignment operator used frequently is \verb@<-@, a left-pointing arrow that consists of two keystrokes (the less-than sign and the hyphen). This is the more common of the assignment operators and it reserves the equal sign for use in declaring argument values. With most assignments only the prompt is returned to the screen with nothing printed. The object to the left of the assignment operator is given the values of whatever is to the right of the operator. They can be printed by typing the object name as you just did. Finally, the values when printed are prefaced with a \verb@[1]@. This indicates that the first entry in the object has a value of 2 (the number immediately to the right of \verb@[1]@). More on this later.

The arrow keys can be used to retrieve previous functions. This saves typing. Each command is stored in a history file; the up arrow key moves backward through the history file and the down arrow moves forward. The left and right arrow keys work as expected. Changes can be made to a mistyped function followed by a return without the need to go to the end of the line.

You can also enter small amounts of data with the \verb@scan@ function. You enter data values one at a time, line by line. When finished, type enter (return). R has a wide range of functions for inputting data. We will look at several of them as needed.

\subsection{Functions}

Once the data are stored as an object you apply functions to do various things. For example, you find the total number of landfalls occurring over the set of years by typing
<>=
sum(h)
@
The number of years is found by typing
<>=
length(h)
@
The average number of hurricanes over this ten-year period is found by typing
<>=
sum(h)/length(h)
@
Or
<>=
mean(h)
@
Other useful functions include \verb@sort@, \verb@min@, \verb@max@, \verb@range@, \verb@diff@, and \verb@cumsum@.
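For instance, here is a quick preview applied to the landfall counts in \verb@h@, with the results shown as comments.
<>=
sort(h)   # 0 0 0 1 1 1 2 2 3 3
max(h)    # 3
cumsum(h) # 2 5 5 8 9 9 9 10 12 13
@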
Try them on the object \verb@h@ of landfall counts. For example, what does the function \verb@diff@ do?

Most functions have a name followed by a left parenthesis, then a set of arguments separated by commas, followed by a right parenthesis. Arguments have names. Some are required, but many are optional, with R providing default values. Throughout this book, function names referenced in line are written without arguments and without parentheses.

In summary, consider the code
<>=
x = log(42, base=10)
@
Here \verb@x@ is the object name, \verb@=@ is the assignment operator, \verb@log@ is the function, 42 is the value for which the logarithm is being computed, and 10 is the argument corresponding to the logarithm base. Note here the equal sign is used in two different ways: as an assignment operator and to specify a value for an argument.

\subsection{Vectors}

Your data object \verb@h@ is stored as a vector. This means that R keeps track of the order in which you entered the data. The vector contains a first element, a second element, and so on. This is convenient. Your data object of landfall counts has a natural order--year 1, year 2, etc.--so you want to keep this order. You would like to be able to make changes to the data item by item instead of reentering the data. R lets you do this. Also, vectors are math objects so math operations can be performed on them.

Let's see how these concepts apply to your data. Suppose \verb@h@ contains the annual landfall counts from the first decade of a longer record. You want to keep track of counts over a second decade. This can be done as follows.
<>=
d1 = h # make a copy
d2 = c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1)
@
Most functions will operate on each vector component all at once.
<>=
d1 + d2
d1 - d2
d1 - mean(d1)
@
In the first two cases, the first year count of the first decade is added to (and subtracted from) the first year count of the second decade, and so on.
In the third case a constant (the average of the first decade) is subtracted from each count of the first decade. This is an example of recycling. R repeats values from one vector so as to match the length of the other vector. Here the mean value is computed then repeated 10 times. The subtraction then follows on each component one at a time.

Suppose you are interested in the variability of hurricane counts from one year to the next. An estimate of this variability is the variance. The sample mean of a set of numbers $y$ is
\begin{equation}
\bar y = \frac{1}{n}\sum_{i=1}^n y_i,
\end{equation}
where $n$ is the sample size. And the sample variance is
\begin{equation}
s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2.
\end{equation}
Although the function \verb@var@ will compute the sample variance, to see how vectorization works in R, you write a script.
<>=
dbar = mean(d1)
dif = d1 - dbar
ss = sum(dif^2)
n = length(d1)
ss/(n - 1)
@
Note how the different parts of the equation for the variance match what you type in R. To verify your code, type
<>=
var(d1)
@
To change the number of significant digits printed to the screen from the default of 7, type
<>=
options(digits=3)
var(d1)
@
Similarly the standard deviation, which is the square root of the variance, is obtained by typing
<>=
sd(d1)
@
One restriction on vectors is that all the components have to have the same type. You can't create a vector with the first component a numeric value and the second component a character text. A character vector can be a set of text strings as in
<>=
Elsners = c("Jim", "Svetla", "Ian", "Diana")
Elsners
@
Note that character strings are made by matching quotes on both sides of the string, either double, \verb@"@, or single, \verb@'@. Caution: the quotes must not be directional. This can cause problems if you type your code in a word processor (like MS Word) as the program will insert directional quotes. It is better to use a text editor like Notepad.
You add another component to the vector \verb@Elsners@ by using the \verb@c@ function.
<>=
c(Elsners, 1.5)
@
The component 1.5 gets coerced to a character string. Coercion occurs for mixed types, where the components get changed to the lowest common type, which is usually a character. This will prevent arithmetic operations with the vector.

Elements of a vector can have names. The names will appear when you print the vector. You use the \verb@names@ function to retrieve and set names as character strings. For instance you type
<>=
names(Elsners) = c("Dad", "Mom", "Son", "Daughter")
Elsners
@
Unlike most functions, \verb@names@ appears on the left side of the assignment operator. The function adds the names attribute to the vector. Names can be used on vectors of any type.

Returning to your hurricane example, suppose the National Hurricane Center (NHC) finds a previously undocumented hurricane in the 6th year of the 2nd decade. In this case you type
<>=
d2[6] = 1
@
This changes the 6th element (component) of vector \verb@d2@ to 1, leaving the other components alone. Note the use of square brackets (\verb@[]@). Square brackets are used to subset components of vectors (and arrays, lists, etc.), whereas parentheses are used with functions to enclose the set of arguments. You list all values in the vector \verb@d2@ by typing
<>=
d2
@
To print the number of hurricanes during the 3rd year only, type
<>=
d2[3]
@
To print all the hurricane counts except from the 4th year, type
<>=
d2[-4]
@
To print the hurricane counts for the odd years only, type
<>=
d2[c(1, 3, 5, 7, 9)]
@
Here you use the \verb@c@ function inside the subset operator \verb@[@. Since here you want a regular sequence of years, the expression \verb@c(1, 3, 5, 7, 9)@ can be simplified using structured data.

\subsection{Structured data}

Sometimes a set of values has a pattern. The integers from 1 through 10 for example. To enter these one by one using the \verb@c@ function is tedious.
Instead, the colon function is used to create sequences. For example, to sequence the first ten positive integers you type
<>=
1:10
@
Or to reverse the sequence you type
<>=
10:1
@
You create the same reversed sequence of integers using the \verb@rev@ function together with the colon function as
<>=
rev(1:10)
@
The \verb@seq@ function is more general than the colon function. It allows for start and end values, but also a step size or sequence length. Some examples include
<>=
seq(1, 9, by=2)
seq(1, 10, by=2)
seq(1, 9, length=5)
@
Use the \verb@rep@ function to create a vector with elements having repeat values. The simplest usage of the function is to replicate the value of the first argument the number of times specified by the value of the second argument.
<>=
rep(1, times=10)
@
Or
<>=
rep(1:3, times=3)
@
You create more complicated patterns by specifying pairs of equal-sized vectors. In this case, each component of the first vector is replicated the corresponding number of times as specified in the second vector.
<>=
rep(c("cold", "warm"), c(1, 2))
@
Here the vectors are implicitly defined using the \verb@c@ function and the name of the second argument (\verb@times@) is left off. As noted above, it is good coding practice to name the arguments as then the order in which they appear in the function is not important.

Suppose you want to repeat the sequence of cold, warm, warm three times. You nest the above sequence generator in another repeat function as follows.
<>=
rep(rep(c("cold", "warm"), c(1, 2)), 3)
@
Function nesting gives you a lot of flexibility.

\subsection{Logic}

As you've seen, there are functions like \verb@mean@ and \verb@var@ that when applied to a vector of data output a statistic. Another example is the \verb@max@ function. To find the maximum number of hurricanes in a single year during the first decade type
<>=
max(d1)
@
To determine which years had this many hurricanes type
<>=
d1 == 3
@
Note the double equal sign.
A single equal sign assigns \verb@d1@ the value 3. Instead, with the double equal sign you are performing a logical operation on the components of the vector. Each component is matched in value with the value 3 and a true or false is returned. That is, is component one equal to 3? No, so return \verb@FALSE@; is component two equal to 3? Yes, so return \verb@TRUE@; etc. The length of the output will match the length of the vector.

Now how can you get the years corresponding to the \verb@TRUE@ values? To rephrase, which years have 3 hurricanes? If you guessed the function \verb@which@, you're on your way to mastering R.
<>=
which(d1 == 3)
@
You might be interested in the number of years in each decade without a hurricane.
<>=
sum(d1 == 0); sum(d2 == 0)
@
Here we apply two functions on a single line by separating the functions with a semicolon. Or how about the ratio of the number of hurricanes over the two decades?
<>=
mean(d2)/mean(d1)
@
So there are \Sexpr{round((mean(d2)/mean(d1)-1)*100,digits=0)}\% more landfalls during the second decade. The statistical question is, is this difference significant?

Before moving on, it is recommended that you remove objects from your workspace that are no longer needed. This helps you recycle names. First, to see what objects reside in your workspace, type
<>=
objects()
@
Then, to remove only selected objects, type
<>=
rm(d1, d2, Elsners)
@
To remove all objects type
<>=
rm(list=objects())
@
This will clean your workspace completely. To avoid name conflicts it is good practice to start a session with a clean workspace.

\subsection{Imports}

Most of what you do in R involves data. To import data there are a few things to know. First, you need to know your working directory. You do this with the \verb@getwd@ function by typing
<>=
getwd()
@
The output is a character string in quotes that indicates the full path of your working directory on your computer. It is the directory where R will look for data.
To change your working directory you use the \verb@setwd@ function and specify the path name within quotes. Alternatively you should be able to use one of the menu options in the R interface. To list the files in your working directory you type \verb@dir()@.

Second, you need to know the file type of your data. This will determine the read function. For example, the data set {\it US.txt} contains a list of tropical cyclone counts by year making landfall in the United States (excluding Hawaii) at hurricane intensity. The file is a space-delimited text file. In this case you use the \verb@read.table@ function to import the data.

Third, you need to know whether your data file has column names. These are given in the first line of your file, usually as a series of character strings. The line is called a `header' and if your data has one, you need to specify \verb@header=TRUE@.

Assuming the text file {\it US.txt} is in your working directory, type
<>=
H = read.table("US.txt", header=TRUE)
@
If R returns a prompt without an error message, the data has been imported. Here your file contains a header so the argument \verb@header@ is used. At this stage the most common mistake is that your data file is not in your working directory. This will result in an error message along the lines of `cannot open the connection' or `cannot open file.'

The function has options for specifying the separator character or characters between columns in the file. For example, if your file has commas between columns, then you use the argument \verb@sep=","@ in the \verb@read.table@ function. If the file has tabs then you use \verb@sep="\t"@. Note that R does not make changes to your file. You can also change the missing value character. By default it is NA. If the missing value character in your file is coded as 99, specify \verb@na.strings="99"@ and it will be changed to NA in your R data object.

There are several variants of \verb@read.table@ that differ only in the default argument settings.
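The options above can be combined in a single call. As an illustrative sketch (the file name and missing-value code here are hypothetical), a tab-delimited file with a header and 99 marking missing values would be read with
<>=
H2 = read.table("counts.txt", header=TRUE, sep="\t", na.strings="99")
@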
Note in particular \verb@read.csv@, which has settings that are suitable for comma-delimited (csv) files that have been exported from a spreadsheet. Thus the typical work flow is to export your data from a spreadsheet to a csv file, then import it to R using the \verb@read.csv@ function.

You can also import data directly from the web by specifying the URL instead of the local file name.
<>=
loc = "http://myweb.fsu.edu/jelsner/US.txt"
H = read.table(loc, header=TRUE)
@
The object \verb@H@ is a data frame; the function \verb@read.table@ and its variants return data frames. Data frames are similar to a spreadsheet. The data are arranged in rows and columns. The rows are the cases and the columns are the variables. To check the dimensions of your data frame type
<>=
dim(H)
@
This tells you there are \Sexpr{dim(H)[1]} rows and \Sexpr{dim(H)[2]} columns in your data frame. To list the first six lines of the data object, type
<>=
head(H)
@
The columns include, in order, year, number of U.S. hurricanes, number of major U.S. hurricanes, number of U.S. Gulf coast hurricanes, number of Florida hurricanes, and number of East coast hurricanes. Note the column names are given as well. The last six lines of your data frame are listed similarly using the \verb@tail@ function. The number of lines listed is changed using the argument \verb@n@.

If your data reside in a directory other than your working directory, you can use the \verb@file.choose@ function. This will open a dialog box allowing you to scroll and choose the file. Note this function is used without arguments: \verb@file.choose()@.

To make the individual columns available by column name, type
<>=
attach(H)
@
The total number of years in the record is obtained and saved in \verb@n@, and the average number of U.S. hurricanes is saved in \verb@rate@, using the following two lines of code.
<>=
n = length(All)
rate = mean(All)
@
By typing the names of the saved objects, the values are printed.
<>=
n
rate
@
Thus over the \Sexpr{n} years of data the average number of U.S. hurricanes per year is \Sexpr{round(rate,digits=2)}.

If you want to change the names of the data frame, type
<>=
names(H)[4] = "GC"
names(H)
@
This changes the 4th column name from G to GC. Note that this is done to the data frame in R and not to your data file. While attaching a data frame is convenient, it is not a good strategy when writing R code, as name conflicts can easily arise. If you do attach your data frame, make sure you use the function \verb@detach@ after you are finished.

\section{Tables and Plots}

Now that you know a bit about using R, you are ready for some data analysis. R has a wide variety of data structures including scalars, vectors, matrices, data frames, and lists.

\subsection{Tables and summaries}

Vectors and matrices must have a single class. For example, the vectors \verb@A@, \verb@B@, and \verb@C@ below are constructed as numeric, logical, and character, respectively.
<>=
A = c(1, 2.2, 3.6, -2.8) # numeric vector
B = c(TRUE, TRUE, FALSE, TRUE) # logical vector
C = c("Cat 1", "Cat 2", "Cat 3") # character vector
@
To view the data class, type
<>=
class(A); class(B); class(C)
@
Let the vector \verb@wx@ denote the weather conditions for five forecast periods as character data.
<>=
wx = c("sunny", "clear", "cloudy", "cloudy", "rain")
class(wx)
@
Character data is summarized using the \verb@table@ function. To summarize the weather conditions over the five periods, type
<>=
table(wx)
@
The output is a list of the unique character strings and the corresponding number of occurrences of each string.

As another example, let the object \verb@ss@ denote the Saffir-Simpson category for a set of five hurricanes.
<>=
ss = c("Cat 3", "Cat 2", "Cat 1", "Cat 3", "Cat 3")
table(ss)
@
Here the character strings correspond to different intensity levels as ordered categories with \verb@Cat 1@ $<$ \verb@Cat 2@ $<$ \verb@Cat 3@. In this case it is better to convert the character vector \verb@ss@ to an ordered factor with levels. This is done using the \verb@factor@ function.
<>=
ss = factor(ss, order=TRUE)
class(ss)
ss
@
The class of \verb@ss@ gets changed to an ordered factor. A print of the object results in a list of the elements in the vector and a list of the levels in order. Note, if you do the same for the \verb@wx@ object, the order is alphabetical by default. Try it.

You can also use the \verb@table@ function on discrete numeric data. For example,
<>=
table(All)
@
The table tells you that your data has \Sexpr{table(All)[1]} zeros, \Sexpr{table(All)[2]} ones, etc. Since these are annual U.S. hurricane counts you know, for instance, that there are \Sexpr{table(All)[5]} years with four hurricanes, and so on. Recall from the previous section that you attached the data frame \verb@H@, so you can use the column names as separate vectors. The data frame remains attached for the entire session. Remember that you detach it with the function \verb@detach@.

The \verb@summary@ function is used to get a concise description of your object. The form of the value(s) returned depends on the class of the object being summarized. If your object is a data frame consisting of numeric data, the output is six summary statistics (mean, median, minimum, maximum, first quartile, and third quartile) for each column.
<>=
summary(H)
@
Each column of your data frame \verb@H@ is labeled and summarized by six numbers including the minimum, the maximum, the mean, the median, and the first (lower) and third (upper) quartiles. For example, you see that the maximum number of major U.S. hurricanes (\verb@MUS@) in a single season is \Sexpr{max(H$MUS)}. Since the first column is the year, its summary is not particularly meaningful.

\subsection{Quantiles}

The quartiles output from the \verb@summary@ method are examples of quantiles. Sample quantiles cut a set of ordered data into equal-sized data bins. The ordering comes from rearranging the data from lowest to highest. The first, or lower, quartile, corresponding to the .25 quantile (25th percentile), indicates that 25\% of the data have a value less than this quartile value. The third, or upper, quartile, corresponding to the .75 quantile (75th percentile), indicates that 75\% of the data have a smaller value than this quartile value. The \verb@quantile@ function calculates sample quantiles on a vector of data.
For example, consider the set of North Atlantic Oscillation (NAO) index values for the month of June over the period 1851--2010. The NAO is a variation in the climate over the North Atlantic Ocean featuring fluctuations in the difference of atmospheric pressure at sea level between Iceland and the Azores. The index is computed as the difference in standardized sea-level pressures. The standardization is done by subtracting the mean and dividing by the standard deviation. The units of the index are standard deviations. See Chapter~\ref{chap:datasets} for more details on these data. First read the data consisting of monthly NAO values, then apply the \verb@quantile@ function to the June values. <>= NAO = read.table("NAO.txt", header=TRUE) quantile(NAO$Jun, probs=c(.25, .5)) @ Note the use of the \verb@$@ sign to point to a particular column in the data frame. Recall that to list the column names of the data frame object called \verb@NAO@ you type \verb@names(NAO)@. Of the \Sexpr{length(NAO$Jun)} values, 25\% are less than \Sexpr{round(quantile(NAO$Jun,probs=.25),2)} standard deviations (sd) and 50\% are less than \Sexpr{round(quantile(NAO$Jun,probs=.5),2)} sd. Thus the number of years with June NAO values between \Sexpr{round(quantile(NAO$Jun,probs=.25),2)} and \Sexpr{round(quantile(NAO$Jun,probs=.5),2)} sd equals the number of years with values below \Sexpr{round(quantile(NAO$Jun,probs=.25),2)} sd. The third quartile value, corresponding to the .75 quantile (75th percentile), indicates that 75\% of the data have a value less than this. The difference between the first and third quartile values is called the interquartile range (IQR). Fifty percent of all values lie within the IQR. The IQR can be found using the \verb@IQR@ function. \subsection{Plots} %tikz graphics device was here R has a wide range of plotting capabilities. Mastering them takes time, but a little effort goes a long way. You will create many plots as you work through this book. Here are a few examples to get you started.
Chapter~\ref{chap:graphsandmaps} provides more details. \subsubsection{Bar plots} The bar plot (or bar chart) is a way to easily compare categorical or discrete data. Levels of the variable are arranged in some order along the horizontal axis and the frequency of values in each group is plotted as a bar with the bar height proportional to the frequency. To make a bar plot of your U.S. hurricane counts, type <>= barplot(table(All), ylab="Number of Years", xlab="Number of Hurricanes") @ \begin{figure} \centering <>= par(las=1, mgp=c(2.2, .75, 0)) x = table(All) barplot(x, ylab="Number of years", xlab="Number of hurricanes") @ \vspace{-.5cm} \caption{Bar plot of U.S. hurricanes.} \label{fig:barplot} \end{figure} The plot in Fig.~\ref{fig:barplot} is a concise summary of the number of hurricanes. The bar heights are proportional to the number of years with that many hurricanes. The plot conveys the same information as the table. The purpose of the bar plot is to illustrate the difference between data values. Readers expect the plot to start at zero, so you should draw it that way. Also, there is usually little scientific reason to make the bars appear three dimensional. Note that the axis labels are set using the \verb@ylab@ and \verb@xlab@ arguments with the actual label as a character string in quotes. You must be careful to use straight quotation marks, not the directional (curly) quotes produced by word processors. Although many of the plotting commands are simple and somewhat intuitive, getting a publication-quality figure requires tweaking the default settings. You will see some of these tweaks as you work through the book. When a plot function like \verb@barplot@ is called, the plot is sent to the graphics device for your computer screen. This might be windows, quartz, or X11. There are also devices for creating postscript, pdf, png, and jpeg output and sending it to a file in your working directory.
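As an aside, you can query the graphics devices from within R: \verb@dev.cur()@ returns the active device and \verb@dev.list()@ lists all open devices. A short sketch (the device name returned depends on your system):
<>=
dev.cur()   # name and number of the active device
dev.list()  # all open devices, if any
@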
For publication-quality graphics, the postscript and pdf devices are preferred because they produce scalable images. Use bitmap devices for drafts. The sequence is to first specify a graphics device, then call your graphics functions, and finally close the device. For example, to create an encapsulated postscript file (eps) of your bar plot in your working directory, type <>= postscript(file="MyFirstRPlot.eps") barplot(table(All), ylab="Number of Years", xlab="Number of Hurricanes") dev.off() # close the graphics device @ The file containing the bar plot is placed in your working directory. Note that the \verb@postscript@ function opens the device and \verb@dev.off()@ closes it. Make sure you remember to close the device. To list the files in your working directory type \verb@dir()@. The pie chart is an alternative to the bar chart for displaying relative frequencies or proportions (\verb@?pie@). It represents this information with wedges of a circle or pie. Since your eye has some difficulty judging relative areas \citep{Cleveland1985}, another alternative to the bar chart is the dot chart. To find out more about this graph function, type \verb@?dotchart@. \subsubsection{Scatter plots} Perhaps the most useful graph is the scatter plot. You use it to represent the relationship between two continuous variables. It is a graph of the values of one variable against the values of the other as points $(x_i, y_i)$ in a Cartesian plane. You use the \verb@plot@ function to make a scatter plot. The syntax is \verb@plot(x, y)@ where \verb@x@ and \verb@y@ are vectors containing the paired data. Values of the variable named in the first argument (here \verb@x@) are plotted along the horizontal axis.
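Before plotting real data, a minimal sketch using made-up vectors (\verb@x@ and \verb@y@ here are illustrative only):
<>=
x = 1:10
y = x^2
plot(x, y)
@
Each pair of values is drawn as a point, with the first vector along the horizontal axis.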
For example, to graph the relationship between the February and March values of the NAO type <>= plot(NAO$Feb, NAO$Mar, xlab="February NAO", ylab="March NAO") @ \begin{figure} \centering <>= par(las=1, pty="s", mgp=c(2.2, .75, 0)) plot(NAO$Feb, NAO$Mar, xlab="February NAO [s.d.]", ylab="March NAO [s.d.]", pch=16) grid() points(NAO$Feb, NAO$Mar, pch=16) @ \vspace{-.5cm} \caption{Scatter plot of February and March NAO values.} \label{fig:scatterplotNAO} \end{figure} The plot is shown in Fig.~\ref{fig:scatterplotNAO}. It is a summary of the relationship between the NAO values for February and March. Low values of the index during February tend to be followed by low values in March, and high values in February tend to be followed by high values in March. You observe there is a direct (or positive) relationship between the two variables, although the points are scattered widely, indicating the relationship is not tight. You can change the point symbol with the argument \verb@pch@. If your goal is to model the relationship, you should plot the dependent variable (the variable you are interested in modeling) on the vertical axis. Here it makes the most sense to put the March values on the vertical axis since a predictive model would use February values to forecast March values. The plot produces points by default. This is changed using the argument \verb@type@, where the letter ell (\verb@"l"@, for lines) is given in quotes. For example, to plot the February NAO values as a time series type <>= plot(NAO$Year, NAO$Feb, ylab="February NAO", xlab="Year", type="l") @ \begin{figure} \centering <>= par(las=1, mgp=c(2.2, .75, 0)) plot(NAO$Year, NAO$Feb, ylab="February NAO [s.d.]", xlab="Year", type="l") abline(h=0, col="lightgray") grid() lines(NAO$Year, NAO$Feb) @ \vspace{-1cm} \caption{Time series of February NAO.} \label{fig:timeseriesplot} \end{figure} The plot is shown in Fig.~\ref{fig:timeseriesplot}.
The values fluctuate about zero and do not appear to have a long-term trend. With time series data it is better to connect the values with lines rather than use points, unless values are missing. More details on how to make time series and other graphs are given throughout the book. This concludes your introduction to R. We showed you where to get R, how to install it, and how to obtain add-on packages. We showed you how to use R as a calculator, how to work with functions, make assignments, and get help. We also showed you how to work with small amounts of data and how to import data from a file. We concluded with how to tabulate, summarize, and make some simple plots. There is much more ahead, but you've made a good start. Table~\ref{tab:commonRfunctions} lists most of the functions used in this chapter. A complete list of the functions used in the book is given in Appendix A. \begin{table} \begin{center} \caption{\label{tab:commonRfunctions} R functions used in this chapter.} \begin{tabular}{ll} \hline Function & Description \\ \hline \multicolumn{2}{l}{Numeric Functions} \\ \verb@sqrt(x)@ & square root of x \\ \verb@log(x)@ & natural logarithm of x \\ \verb@length(v)@ & number of elements in vector v \\ \verb@summary(d)@ & statistical summary of columns in data frame d \\ & \\ \multicolumn{2}{l}{Statistical Functions} \\ \verb@sum(v)@ & summation of the elements in v \\ \verb@max(v)@ & maximum value in v \\ \verb@mean(v)@ & average of the elements in v \\ \verb@var(v)@ & variance of the elements in v \\ \verb@sd(v)@ & standard deviation of the elements in v \\ \verb@quantile(x, prob)@ & prob quantile of the elements in x \\ & \\ \multicolumn{2}{l}{Structured Data Functions} \\ \verb@c(x, y, z)@ & concatenate the objects x, y, and z \\ \verb@seq(from, to, by)@ & generate a sequence of values\\ \verb@rep(x, n)@ & replicate x n times \\ & \\ \multicolumn{2}{l}{Table and Plot Functions} \\ \verb@table(a)@ & tabulate the characters or factors in a \\ \verb@barplot(h)@ & bar
plot with heights h\\ \verb@plot(x, y)@ & scatter plot of the values in x and y \\ & \\ \multicolumn{2}{l}{Input, Package and Help Functions} \\ \verb@read.table("file")@ & input the data from the file \\ \verb@head(d)@ & list the first six rows of data frame d \\ \verb@objects()@ & list all objects in the workspace \\ \verb@help(fun)@ & open help documentation for function fun\\ \verb@install.packages("pk")@& install the package pk on your computer \\ \verb@require(pk)@ & make functions in package pk available\\ \hline \end{tabular} \end{center} \end{table} The next chapter provides an introduction to statistics. If you've had a course in statistics this will be a review, but we encourage you to follow along anyway, as you will learn new things about using R.