\chapter{R Tutorial}
\label{chap:Rtutorial}
\SweaveOpts{keep.source=TRUE, pdf=FALSE, prefix.string=Chap02, grdevice=tikz.Swd}
<>=
options(digits=3, show.signif.stars=FALSE, width=53)
rm(list=ls())
require(tikzDevice)
source("../SweaveTikZ.R")
@
%data files: read.table("US.txt", T); read.table("NAO.txt", T)
%packages: UsingR
%source code: NONE
%third party: NONE
\begin{quote}
``I think it is important for software to avoid imposing a cognitive style on workers and their work.''
\end{quote}
\indent---Edward Tufte\\

We begin with a tutorial on using R. To get the most out of it you should open an R session and type the commands as you read the text. You should be able to use copy-and-paste if you are using an electronic version of the book.

\section{Introduction}

Science requires transparency and reproducibility. The R language for statistical modeling makes this easy. Developing, maintaining, and documenting your R code is simple. R contains numerous functions for organizing, graphing, and modeling your data. Directions for obtaining R, accompanying packages, and other sources of documentation are available at \url{http://www.r-project.org/}. Anyone serious about applying statistics to climate data should learn to use R.

The book is self-contained. It presents R code and data (or links to data) that can be copied and pasted to reproduce the graphs and tables. This reproducibility provides you with an enhanced learning opportunity. Here we present a tutorial to help you get started. This can be skipped if you already know how to work with R.

\subsection{What is R?}

R is the `lingua franca' of data analysis and statistical computing. It helps you perform a variety of computing tasks by giving you access to commands. This is similar to other programming languages like Python and C++. R is particularly useful to researchers because it contains a number of built-in mechanisms for organizing data, performing calculations, and creating graphics.
R is an open-source statistical environment modeled after S. The S language was developed in the late 1980s at AT\&T labs. The R project was started by \underline{R}obert Gentleman and \underline{R}oss Ihaka of the Statistics Department of the University of Auckland in 1995. It now has a large audience. It's currently maintained by the R core-development team, an international group of volunteer developers.

To get to the R project web site, open a browser and in a search window type the key words `R project', or link directly to the web page using \url{http://www.r-project.org/}. Directions for obtaining the software, accompanying packages, and other sources of documentation are provided at the site.

Why use R? It's open-source and free, it runs on the major computing platforms, it has built-in help and excellent graphing capabilities, and it's powerful and extensible, containing thousands of functions.

A drawback to R for many is the lack of a serious graphical user interface (GUI). This means it is harder to learn at the outset and you need to learn its syntax. Modern ways of working with computers include browsers, music players, and spreadsheets. R is nothing like these. Infrequent users forget commands. There are no visual cues; a blank screen is intimidating. At first, working with R might seem daunting. However, with a little effort you will quickly learn the basic commands and then realize how it can help you do much, much more.

R is really a library of modern statistical tools. Researchers looking for methods that they can use will quickly discover that R is unmatched. A climate scientist whose research requires customized scripting, extensive simulation analysis, or state-of-the-art statistical analysis will find R to be a solid foundation.

R is also a language. It has rules and syntax. R can be run interactively, and analysis can be done on the fly. It is not limited by a finite set of functions.
You can download packages to perform specialized analyses and graph the results. R is object-oriented, allowing you to define new objects and create your own methods for them.

Many people use spreadsheets. They are good for tasks like data storage and manipulation. Unfortunately they are unsuitable for serious research. A big drawback is the lack of community support for new methods. Also, nice graphs are hard to make, and reproducibility can be a challenge. If you are serious about your research you should not use a spreadsheet for statistics.

\subsection{Get R}

At the R project website, click on CRAN (Comprehensive R Archive Network) and select a nearby mirror site. Then follow the instructions appropriate for your operating system to download the base distribution. On your computer, click on the download icon and follow the install instructions. Click on the icon to start R. If you are using the Linux operating system, type the letter R from a command window.

R is most easily used in an interactive manner. You ask R a question and it gives you an answer. Questions are asked and answered on the command line. The \verb@>@ (greater than symbol) is used as the prompt. Throughout this book, it is the character that is NOT typed or copied into your R session. If a command is too long, a \verb@+@ (plus symbol) is used as a continuation prompt. If you get completely lost on a command, you can use Ctrl-C or Esc to get the prompt back and start over. Most commands are functions and most functions require parentheses.

\subsection{Packages}

A package is a set of functions for doing specific things. As of early 2012 there were over 3200 of them. Indeed this is one of the main attractions: the availability of thousands of packages for performing statistical analysis and modeling. And the range of packages is enormous.
The package {\bf BiodiversityR} offers a graphical interface for calculations of environmental trends, the package {\bf Emu} analyzes speech patterns, and the package {\bf GenABEL} is used to study the human genome.

To install the package called {\bf UsingR} type
<>=
install.packages("UsingR")
@
Note that the syntax is case sensitive. \verb@UsingR@ is not the same as \verb@usingR@ and \verb@Install.Packages@ is not the same as \verb@install.packages@. After the package is downloaded to your computer, you make it available to your R session by typing
<>=
library(UsingR)
@
Or
<>=
require(UsingR)
@
Note again the syntax is case sensitive. When installing the package, the package name needs to be in quotes (either single or double, but not directional quotes). No quotes are needed when making the package available to your working session. Each time you start a new R session the package needs to be made available, but it does not need to be installed again from CRAN.

If a package is not available from a particular CRAN site you can try another. To change your CRAN site, type
<>=
chooseCRANmirror()
@
Then scroll to a different location.

\subsection{Calculator}

R evaluates commands typed at the prompt. It returns the result of the computation on the screen or saves it to an object. For example, to find the sum of the square root of 25 and 2, type
<>=
sqrt(25) + 2
@
The \verb@[1]@ says `first requested element will follow.' Here there is only one element, the answer 7. The \verb@>@ prompt that follows indicates R is ready for another command. For example, type
<>=
12/3 - 5
@
R uses the order of operations so that, for instance, multiplication comes before addition. How would you calculate the 5th power of 2? Type
<>=
2^5
@
How about the amount of interest on \$1000, compounded annually at 4.5\% (annual percentage) for 5 years? Type
<>=
1000 * (1 + .045)^5 - 1000
@

\subsection{Functions}

There are numerous mathematical and statistical functions available in R.
They are used in a similar manner. A function has a name, which is typed, followed by a pair of parentheses (required). Arguments are added inside this pair of parentheses as needed. For example, the square root of two is given as
<>=
sqrt(2) # the square root
@
The \verb@#@ is the comment character. Any text in the line following this character is treated as a comment and is not evaluated by R. Some other examples include
<>=
sin(pi) # the sine function
log(42) # log of 42 base e
@
Many functions have arguments that allow you to change the default behavior. For example, to use base 10 for the logarithm, you can use either of the following
<>=
log(42, 10)
log(42, base=10)
@
To understand the first function, \verb@log(42, 10)@, you need to know that R expects the base to be the second argument (after the first comma) of the function. The second example uses a named argument of the type \verb@base=@ to explicitly set the base value. The first style requires less typing, but the second style is easier to remember and is good coding practice.

\subsection{Warnings and errors}

When R doesn't understand your function it responds with an error message. For example
<>=
srt(2)
@
<>=
try_out = try(srt(2))
cat(try_out)
@
If you get the function correct, but your input is not acceptable, then
<>=
sqrt(-2)
@
<>=
try_out = try(sqrt(-2))
cat(try_out)
@
The output \verb@NaN@ is used to indicate `not a number.' As mentioned, if R encounters a line that is not complete, a continuation prompt, \verb@+@, is printed, indicating more input is expected. You can complete the line after the continuation prompt.

\subsection{Assignments}

It is often convenient to name an object so that you can use it later. Doing so is called an assignment. Assignments are straightforward. You put a name on the left-hand side of the equals sign and a value, function, object, etc.\ on the right. Assignments generally do not produce printed output.
<>=
x = 2 # assignments return a prompt only
x + 3 # x is now 2
@
Remember, the pound symbol (\#) is used as a comment character. Anything after the \# is ignored. Adding comments to your code is a good way of recalling what you did and why.

You are free to make object names out of letters, numbers, and the dot or underline characters. A name starts with a letter or a dot (a leading dot may not be followed by a number). You are not allowed to use math operators, such as \verb@+@, \verb@-@, \verb@*@, and \verb@/@. The help page for \verb@make.names@ describes this in more detail (\verb@?make.names@). Note that case is also important in names; \verb@x@ is different from \verb@X@.

It is good coding practice to use conventional names for certain types of data. For instance, \verb@n@ is used for the length of a data record or the length of a vector, \verb@x@ and \verb@y@ are used for spatial coordinates, and \verb@i@ and \verb@j@ are used for integers and indices for vectors and matrices. These conventions are not enforced, but consistently using them makes it easier for you (and others) to look back and understand what you've done.

Variables that begin with the dot character are usually reserved for advanced programmers. Unlike in many programming languages, the period in R is used only as punctuation and can be included in an object name (e.g., \verb@my.object@).

\subsection{Help}

Using R to do statistics requires knowing many functions---more than you will likely be able to keep in your head. R has built-in help facilities for information about what is returned by a function, for details on additional arguments, and for examples. If you know the name of a function, for example, you type
<>=
help(var)
@
This brings up a help page dedicated to the function (\verb@?var@ works the same way). Depending on your operating system the help page may come as html. The name of the function and the associated package are given as the preamble.
This is followed by a brief description of the function and how it is used. Arguments are explained along with function and argument details. The examples given toward the bottom of the page are especially important. A good strategy for understanding what a function does is to copy and paste the examples into your R session.

This works great if you know the name of the function. If not, there are other ways to search. For example, the function \verb@help.search("mean")@ searches each entry in the help system and returns matches (often many) of functions that mention the word \verb@"mean"@. The function \verb@apropos@ searches through only function names and variables for matches. Type
<>=
apropos("mean")
@
Most help pages provide examples. The examples help you understand what the function does. You can try the examples individually by copying and pasting them into your R session. You can also try them all at once by using the \verb@example@ function. For instance, type
<>=
example(mean)
@
To end your R session type
<>=
q(save="no")
@
Like most R functions, \verb@q@ needs an open (left) and close (right) parenthesis. The argument \verb@save="no"@ says do not save the workspace. Otherwise the workspace and session history are saved to a file in your working directory. By default the file is called \verb@.RData@. The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions).

\section{Data}

\subsection{Small amounts}

To begin you need to get your data into R. For small amounts you use the \verb@c@ function. It combines (concatenates) items like a set of numbers. Consider a set of hypothetical hurricane counts, where in the first year there were two landfalls, in the second three, and so on. To enter these values type
<>=
h = c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
@
The ten values are stored in a vector object of class numeric called \verb@h@.
<>=
class(h)
@
To show the values type the name of the object.
<>=
h
@
Take note. You assigned the values to an object called \verb@h@. The assignment operator is an equal sign (\verb@=@). Another assignment operator used frequently is \verb@<-@, a left-pointing arrow that consists of two keystrokes (the less-than sign and the hyphen). This is the more common of the assignment operators and it reserves the equal sign for use in declaring argument values. With most assignments only the prompt is returned to the screen with nothing printed. The object to the left of the assignment operator is given the values of whatever is to the right of the operator. They can be printed by typing the object name as you just did. Finally, the values when printed are prefaced with a \verb@[1]@. This indicates that the first entry in the object has a value of 2 (the number immediately to the right of \verb@[1]@). More on this later.

The arrow keys can be used to retrieve previous functions. This saves typing. Each command is stored in a history file; the up arrow key moves backward through the history file and the down arrow moves forward. The left and right arrow keys work as expected. Changes can be made to a mistyped function followed by a return without the need to go to the end of the line.

You can also enter small amounts of data with the \verb@scan@ function. You enter data values one at a time, line by line. When finished, type enter (return). R has a wide range of functions for inputting data. We will look at several of them as needed.

\subsection{Functions}

Once the data are stored as an object you apply functions to do various things. For example, you find the total number of landfalls occurring over the set of years by typing
<>=
sum(h)
@
The number of years is found by typing
<>=
length(h)
@
The average number of hurricanes over this ten-year period is found by typing
<>=
sum(h)/length(h)
@
Or
<>=
mean(h)
@
Other useful functions include \verb@sort@, \verb@min@, \verb@max@, \verb@range@, \verb@diff@, and \verb@cumsum@.
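For instance, here is a quick preview applied to the landfall counts in \verb@h@, with the results shown as comments.
<>=
sort(h)   # 0 0 0 1 1 1 2 2 3 3
max(h)    # 3
cumsum(h) # 2 5 5 8 9 9 9 10 12 13
@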
Try them on the object \verb@h@ of landfall counts. For example, what does the function \verb@diff@ do?

Most functions have a name followed by a left parenthesis, then a set of arguments separated by commas, followed by a right parenthesis. Arguments have names. Some are required, but many are optional, with R providing default values. Throughout this book, function names referenced in line are written without arguments and without parentheses.

In summary, consider the code
<>=
x = log(42, base=10)
@
Here \verb@x@ is the object name, \verb@=@ is the assignment operator, \verb@log@ is the function, 42 is the value for which the logarithm is being computed, and 10 is the argument corresponding to the logarithm base. Note here the equal sign is used in two different ways: as an assignment operator and to specify a value for an argument.

\subsection{Vectors}

Your data object \verb@h@ is stored as a vector. This means that R keeps track of the order in which you entered the data. The vector contains a first element, a second element, and so on. This is convenient. Your data object of landfall counts has a natural order--year 1, year 2, etc.--so you want to keep this order. You would like to be able to make changes to the data item by item instead of reentering the data. R lets you do this. Also, vectors are math objects so math operations can be performed on them.

Let's see how these concepts apply to your data. Suppose \verb@h@ contains the annual landfall counts from the first decade of a longer record. You want to keep track of counts over a second decade. This can be done as follows.
<>=
d1 = h # make a copy
d2 = c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1)
@
Most functions will operate on each vector component all at once.
<>=
d1 + d2
d1 - d2
d1 - mean(d1)
@
In the first two cases, the first year count of the first decade is added to (and subtracted from) the first year count of the second decade, and so on.
In the third case a constant (the average of the first decade) is subtracted from each count of the first decade. This is an example of recycling. R repeats values from one vector so as to match the length of the other vector. Here the mean value is computed then repeated 10 times. The subtraction then follows on each component one at a time.

Suppose you are interested in the variability of hurricane counts from one year to the next. An estimate of this variability is the variance. The sample mean of a set of numbers $y$ is
\begin{equation}
\bar y = \frac{1}{n}\sum_{i=1}^n y_i,
\end{equation}
where $n$ is the sample size. And the sample variance is
\begin{equation}
s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2.
\end{equation}
Although the function \verb@var@ will compute the sample variance, to see how vectorization works in R, you write a script.
<>=
dbar = mean(d1)
dif = d1 - dbar
ss = sum(dif^2)
n = length(d1)
ss/(n - 1)
@
Note how the different parts of the equation for the variance match what you type in R. To verify your code, type
<>=
var(d1)
@
To change the number of significant digits printed to the screen from the default of 7, type
<>=
options(digits=3)
var(d1)
@
Similarly the standard deviation, which is the square root of the variance, is obtained by typing
<>=
sd(d1)
@
One restriction on vectors is that all the components have to have the same type. You can't create a vector with the first component a numeric value and the second component a character text. A character vector can be a set of text strings as in
<>=
Elsners = c("Jim", "Svetla", "Ian", "Diana")
Elsners
@
Note that character strings are made by matching quotes on both sides of the string, either double, \verb@"@, or single, \verb@'@. Caution: the quotes must not be directional. This can cause problems if you type your code in a word processor (like MS Word) as the program will insert directional quotes. It is better to use a text editor like Notepad.
You add another component to the vector \verb@Elsners@ by using the \verb@c@ function.
<>=
c(Elsners, 1.5)
@
The component 1.5 gets coerced to a character string. Coercion occurs for mixed types, where the components get changed to the lowest common type, which is usually a character. This will prevent arithmetic operations with the vector.

Elements of a vector can have names. The names will appear when you print the vector. You use the \verb@names@ function to retrieve and set names as character strings. For instance you type
<>=
names(Elsners) = c("Dad", "Mom", "Son", "Daughter")
Elsners
@
Unlike most functions, \verb@names@ appears on the left side of the assignment operator. The function adds the names attribute to the vector. Names can be used on vectors of any type.

Returning to your hurricane example, suppose the National Hurricane Center (NHC) finds a previously undocumented hurricane in the 6th year of the 2nd decade. In this case you type
<>=
d2[6] = 1
@
This changes the 6th element (component) of vector \verb@d2@ to 1, leaving the other components alone. Note the use of square brackets (\verb@[]@). Square brackets are used to subset components of vectors (and arrays, lists, etc.), whereas parentheses are used with functions to enclose the set of arguments. You list all values in the vector \verb@d2@ by typing
<>=
d2
@
To print the number of hurricanes during the 3rd year only, type
<>=
d2[3]
@
To print all the hurricane counts except from the 4th year, type
<>=
d2[-4]
@
To print the hurricane counts for the odd years only, type
<>=
d2[c(1, 3, 5, 7, 9)]
@
Here you use the \verb@c@ function inside the subset operator \verb@[@. Since here you want a regular sequence of years, the expression \verb@c(1, 3, 5, 7, 9)@ can be simplified using structured data.

\subsection{Structured data}

Sometimes a set of values has a pattern. The integers from 1 through 10 for example. To enter these one by one using the \verb@c@ function is tedious.
Instead, the colon function is used to create sequences. For example, to sequence the first ten positive integers you type
<>=
1:10
@
Or to reverse the sequence you type
<>=
10:1
@
You create the same reversed sequence of integers using the \verb@rev@ function together with the colon function as
<>=
rev(1:10)
@
The \verb@seq@ function is more general than the colon function. It allows for start and end values, but also a step size or sequence length. Some examples include
<>=
seq(1, 9, by=2)
seq(1, 10, by=2)
seq(1, 9, length=5)
@
Use the \verb@rep@ function to create a vector with elements having repeat values. The simplest usage of the function is to replicate the value of the first argument the number of times specified by the value of the second argument.
<>=
rep(1, times=10)
@
Or
<>=
rep(1:3, times=3)
@
You create more complicated patterns by specifying pairs of equal-sized vectors. In this case, each component of the first vector is replicated the corresponding number of times as specified in the second vector.
<>=
rep(c("cold", "warm"), c(1, 2))
@
Here the vectors are implicitly defined using the \verb@c@ function and the name of the second argument (\verb@times@) is left off. As noted above, it is good coding practice to name the arguments as then the order in which they appear in the function is not important.

Suppose you want to repeat the sequence of cold, warm, warm three times. You nest the above sequence generator in another repeat function as follows.
<>=
rep(rep(c("cold", "warm"), c(1, 2)), 3)
@
Function nesting gives you a lot of flexibility.

\subsection{Logic}

As you've seen, there are functions like \verb@mean@ and \verb@var@ that when applied to a vector of data output a statistic. Another example is the \verb@max@ function. To find the maximum number of hurricanes in a single year during the first decade type
<>=
max(d1)
@
To determine which years had this many hurricanes type
<>=
d1 == 3
@
Note the double equal sign.
A single equal sign assigns \verb@d1@ the value 3. Instead, with the double equal sign you are performing a logical operation on the components of the vector. Each component is matched in value with the value 3 and a true or false is returned. That is, is component one equal to 3? No, so return \verb@FALSE@; is component two equal to 3? Yes, so return \verb@TRUE@; etc. The length of the output will match the length of the vector.

Now how can you get the years corresponding to the \verb@TRUE@ values? To rephrase, which years have 3 hurricanes? If you guessed the function \verb@which@, you're on your way to mastering R.
<>=
which(d1 == 3)
@
You might be interested in the number of years in each decade without a hurricane.
<>=
sum(d1 == 0); sum(d2 == 0)
@
Here we apply two functions on a single line by separating the functions with a semicolon. Or how about the ratio of the number of hurricanes over the two decades?
<>=
mean(d2)/mean(d1)
@
So there are \Sexpr{round((mean(d2)/mean(d1)-1)*100,digits=0)}\% more landfalls during the second decade. The statistical question is, is this difference significant?

Before moving on, it is recommended that you remove objects from your workspace that are no longer needed. This helps you recycle names. First, to see what objects reside in your workspace, type
<>=
objects()
@
Then, to remove only selected objects, type
<>=
rm(d1, d2, Elsners)
@
To remove all objects type
<>=
rm(list=objects())
@
This will clean your workspace completely. To avoid name conflicts it is good practice to start a session with a clean workspace.

\subsection{Imports}

Most of what you do in R involves data. To import data there are a few things to know. First, you need to know your working directory. You do this with the \verb@getwd@ function by typing
<>=
getwd()
@
The output is a character string in quotes that indicates the full path of your working directory on your computer. It is the directory where R will look for data.
To change your working directory you use the \verb@setwd@ function and specify the path name within quotes. Alternatively you should be able to use one of the menu options in the R interface. To list the files in your working directory you type \verb@dir()@.

Second, you need to know the file type of your data. This will determine the read function. For example, the data set {\it US.txt} contains a list of tropical cyclone counts by year making landfall in the United States (excluding Hawaii) at hurricane intensity. The file is a space-delimited text file. In this case you use the \verb@read.table@ function to import the data.

Third, you need to know whether your data file has column names. These are given in the first line of your file, usually as a series of character strings. The line is called a `header' and if your data has one, you need to specify \verb@header=TRUE@.

Assuming the text file {\it US.txt} is in your working directory, type
<>=
H = read.table("US.txt", header=TRUE)
@
If R returns a prompt without an error message, the data has been imported. Here your file contains a header so the argument \verb@header@ is used. At this stage the most common mistake is that your data file is not in your working directory. This will result in an error message along the lines of `cannot open the connection' or `cannot open file.'

The function has options for specifying the separator character or characters between columns in the file. For example, if your file has commas between columns, then you use the argument \verb@sep=","@ in the \verb@read.table@ function. If the file has tabs then you use \verb@sep="\t"@. Note that R does not make changes to your file. You can also change the missing value character. By default it is NA. If the missing value character in your file is coded as 99, specify \verb@na.strings="99"@ and it will be changed to NA in your R data object.

There are several variants of \verb@read.table@ that differ only in the default argument settings.
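The options above can be combined in a single call. As an illustrative sketch (the file name and missing-value code here are hypothetical), a tab-delimited file with a header and 99 marking missing values would be read with
<>=
H2 = read.table("counts.txt", header=TRUE, sep="\t", na.strings="99")
@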
Note in particular \verb@read.csv@, which has settings that are suitable for comma-delimited (csv) files that have been exported from a spreadsheet. Thus the typical work flow is to export your data from a spreadsheet to a csv file, then import it to R using the \verb@read.csv@ function.

You can also import data directly from the web by specifying the URL instead of the local file name.
<>=
loc = "http://myweb.fsu.edu/jelsner/US.txt"
H = read.table(loc, header=TRUE)
@
The object \verb@H@ is a data frame; the function \verb@read.table@ and its variants return data frames. Data frames are similar to a spreadsheet. The data are arranged in rows and columns. The rows are the cases and the columns are the variables. To check the dimensions of your data frame type
<>=
dim(H)
@
This tells you there are \Sexpr{dim(H)[1]} rows and \Sexpr{dim(H)[2]} columns in your data frame. To list the first six lines of the data object, type
<>=
head(H)
@
The columns include, in order, year, number of U.S. hurricanes, number of major U.S. hurricanes, number of U.S. Gulf coast hurricanes, number of Florida hurricanes, and number of East coast hurricanes. Note the column names are given as well. The last six lines of your data frame are listed similarly using the \verb@tail@ function. The number of lines listed is changed using the argument \verb@n@.

If your data reside in a directory other than your working directory, you can use the \verb@file.choose@ function. This will open a dialog box allowing you to scroll and choose the file. Note this function is used without arguments: \verb@file.choose()@.

To make the individual columns available by column name, type
<>=
attach(H)
@
The total number of years in the record is obtained and saved in \verb@n@, and the average number of U.S. hurricanes is saved in \verb@rate@, using the following two lines of code.
<>=
n = length(All)
rate = mean(All)
@
By typing the names of the saved objects, the values are printed.
<>=
n
rate
@
Thus over the \Sexpr{n} years of data the average number of U.S. hurricanes per year is \Sexpr{round(rate,digits=2)}.

If you want to change the names of the data frame, type
<>=
names(H)[4] = "GC"
names(H)
@
This changes the 4th column name from G to GC. Note that this is done to the data frame in R and not to your data file. While attaching a data frame is convenient, it is not a good strategy when writing R code, as name conflicts can easily arise. If you do attach your data frame, make sure you use the function \verb@detach@ after you are finished.

\section{Tables and Plots}

Now that you know a bit about using R, you are ready for some data analysis. R has a wide variety of data structures including scalars, vectors, matrices, data frames, and lists.

\subsection{Tables and summaries}

Vectors and matrices must have a single class. For example, the vectors \verb@A@, \verb@B@, and \verb@C@ below are constructed as numeric, logical, and character, respectively.
<>=
A = c(1, 2.2, 3.6, -2.8) # numeric vector
B = c(TRUE, TRUE, FALSE, TRUE) # logical vector
C = c("Cat 1", "Cat 2", "Cat 3") # character vector
@
To view the data class, type
<>=
class(A); class(B); class(C)
@
Let the vector \verb@wx@ denote the weather conditions for five forecast periods as character data.
<>=
wx = c("sunny", "clear", "cloudy", "cloudy", "rain")
class(wx)
@
Character data is summarized using the \verb@table@ function. To summarize the weather conditions over the five periods, type
<>=
table(wx)
@
The output is a list of the unique character strings and the corresponding number of occurrences of each string.

As another example, let the object \verb@ss@ denote the Saffir-Simpson category for a set of five hurricanes.
<>=
ss = c("Cat 3", "Cat 2", "Cat 1", "Cat 3", "Cat 3")
table(ss)
@
Here the character strings correspond to different intensity levels as ordered categories with \verb@Cat 1@ $<$ \verb@Cat 2@ $<$ \verb@Cat 3@. In this case it is better to convert the character vector \verb@ss@ to an ordered factor with levels. This is done using the \verb@factor@ function.
<>=
ss = factor(ss, order=TRUE)
class(ss)
ss
@
The class of \verb@ss@ gets changed to an ordered factor. A print of the object results in a list of the elements in the vector and a list of the levels in order. Note, if you do the same for the \verb@wx@ object, the order is alphabetical by default. Try it.

You can also use the \verb@table@ function on discrete numeric data. For example,
<>=
table(All)
@
The table tells you that your data has \Sexpr{table(All)[1]} zeros, \Sexpr{table(All)[2]} ones, etc. Since these are annual U.S. hurricane counts you know, for instance, that there are \Sexpr{table(All)[5]} years with four hurricanes, and so on. Recall from the previous section that you attached the data frame \verb@H@, so you can use the column names as separate vectors. The data frame remains attached for the entire session. Remember that you detach it with the function \verb@detach@.

The \verb@summary@ function is used to get a concise description of your object. The form of the value(s) returned depends on the class of the object being summarized. If your object is a data frame consisting of numeric data, the output is six summary statistics (mean, median, minimum, maximum, first quartile, and third quartile) for each column.
<>=
summary(H)
@
Each column of your data frame \verb@H@ is labeled and summarized by six numbers including the minimum, the maximum, the mean, the median, and the first (lower) and third (upper) quartiles. For example, you see that the maximum number of major U.S. hurricanes (\verb@MUS@) in a single season is \Sexpr{max(H$MUS)}. Since the first column is the year, its summary is not particularly meaningful.

\subsection{Quantiles}

The quartiles output from the \verb@summary@ method are examples of quantiles. Sample quantiles cut a set of ordered data into equal-sized data bins. The ordering comes from rearranging the data from lowest to highest. The first, or lower, quartile, corresponding to the .25 quantile (25th percentile), indicates that 25\% of the data have a value less than this quartile value. The third, or upper, quartile, corresponding to the .75 quantile (75th percentile), indicates that 75\% of the data have a smaller value than this quartile value. The \verb@quantile@ function calculates sample quantiles on a vector of data.
For example, consider the set of North Atlantic Oscillation (NAO) index values for the month of June over the period 1851--2010. The NAO is a variation in the climate over the North Atlantic Ocean featuring fluctuations in the difference of atmospheric pressure at sea level between Iceland and the Azores. The index is computed as the difference in standardized sea-level pressures. The standardization is done by subtracting the mean and dividing by the standard deviation. The units of the index are standard deviations. See Chapter~\ref{chap:datasets} for more details on these data. First read the data consisting of monthly NAO values, then apply the \verb@quantile@ function to the June values. <>= NAO = read.table("NAO.txt", header=TRUE) quantile(NAO$Jun, probs=c(.25, .5)) @ Note the use of the \verb@$@ sign to point to a particular column in the data frame. Recall that to list the column names of the data frame object called \verb@NAO@ you type \verb@names(NAO)@. Of the \Sexpr{length(NAO$Jun)} values, 25\% are less than \Sexpr{round(quantile(NAO$Jun,probs=.25),2)} standard deviations (sd) and 50\% are less than \Sexpr{round(quantile(NAO$Jun,probs=.5),2)} sd. Thus the number of years with June NAO values between \Sexpr{round(quantile(NAO$Jun,probs=.25),2)} and \Sexpr{round(quantile(NAO$Jun,probs=.5),2)} sd equals the number of years with values below \Sexpr{round(quantile(NAO$Jun,probs=.25),2)} sd. The third quartile value, corresponding to the .75 quantile (75th percentile), indicates that 75\% of the data have a value less than this. The difference between the first and third quartile values is called the interquartile range (IQR). Fifty percent of all values lie within the IQR. The IQR can be found using the \verb@IQR@ function. \subsection{Plots} %tikz graphics device was here R has a wide range of plotting capabilities. Mastering them takes time, but a little effort goes a long way. You will create many plots as you work through this book. Here are a few examples to get you started.
Chapter~\ref{chap:graphsandmaps} provides more details. \subsubsection{Bar plots} The bar plot (or bar chart) is a way to easily compare categorical or discrete data. Levels of the variable are arranged in some order along the horizontal axis and the frequency of values in each group is plotted as a bar with the bar height proportional to the frequency. To make a bar plot of your U.S. hurricane counts, type <>= barplot(table(All), ylab="Number of Years", xlab="Number of Hurricanes") @ \begin{figure} \centering <>= par(las=1, mgp=c(2.2, .75, 0)) x = table(All) barplot(x, ylab="Number of years", xlab="Number of hurricanes") @ \vspace{-.5cm} \caption{Bar plot of U.S. hurricanes.} \label{fig:barplot} \end{figure} The plot in Fig.~\ref{fig:barplot} is a concise summary of the number of hurricanes. The bar heights are proportional to the number of years with that many hurricanes. The plot conveys the same information as the table. The purpose of the bar plot is to illustrate the difference between data values. Readers expect the plot to start at zero, so you should draw it that way. Also, there is usually little scientific reason to make the bars appear three dimensional. Note that the axis labels are set using the \verb@ylab@ and \verb@xlab@ arguments with the actual label as a character string in quotes. You must be careful to use straight quotation marks, not the directional (curly) quotes produced by word processors. Although many of the plotting commands are simple and somewhat intuitive, getting a publication-quality figure requires tweaking the default settings. You will see some of these tweaks as you work through the book. When a plot function like \verb@barplot@ is called, the plot is sent to the graphics device for your computer screen. This might be windows, quartz, or X11. There are also devices for creating postscript, pdf, png, and jpeg output and sending it to a file in your working directory.
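As an aside, you can query the graphics devices from within R: \verb@dev.cur()@ returns the active device and \verb@dev.list()@ lists all open devices. A short sketch (the device name returned depends on your system):
<>=
dev.cur()   # name and number of the active device
dev.list()  # all open devices, if any
@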
For publication-quality graphics, the postscript and pdf devices are preferred because they produce scalable images. Use bitmap devices for drafts. The sequence is to first specify a graphics device, then call your graphics functions, and finally close the device. For example, to create an encapsulated postscript file (eps) of your bar plot in your working directory, type <>= postscript(file="MyFirstRPlot.eps") barplot(table(All), ylab="Number of Years", xlab="Number of Hurricanes") dev.off() # close the graphics device @ The file containing the bar plot is placed in your working directory. Note that the \verb@postscript@ function opens the device and \verb@dev.off()@ closes it. Make sure you remember to close the device. To list the files in your working directory type \verb@dir()@. The pie chart is an alternative to the bar chart for displaying relative frequencies or proportions (\verb@?pie@). It represents this information with wedges of a circle or pie. Since your eye has some difficulty judging relative areas \citep{Cleveland1985}, another alternative to the bar chart is the dot chart. To find out more about this graph function, type \verb@?dotchart@. \subsubsection{Scatter plots} Perhaps the most useful graph is the scatter plot. You use it to represent the relationship between two continuous variables. It is a graph of the values of one variable against the values of the other as points $(x_i, y_i)$ in a Cartesian plane. You use the \verb@plot@ function to make a scatter plot. The syntax is \verb@plot(x, y)@ where \verb@x@ and \verb@y@ are vectors containing the paired data. Values of the variable named in the first argument (here \verb@x@) are plotted along the horizontal axis.
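Before plotting real data, a minimal sketch using made-up vectors (\verb@x@ and \verb@y@ here are illustrative only):
<>=
x = 1:10
y = x^2
plot(x, y)
@
Each pair of values is drawn as a point, with the first vector along the horizontal axis.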
For example, to graph the relationship between the February and March values of the NAO type <>= plot(NAO$Feb, NAO$Mar, xlab="February NAO", ylab="March NAO") @ \begin{figure} \centering <>= par(las=1, pty="s", mgp=c(2.2, .75, 0)) plot(NAO$Feb, NAO$Mar, xlab="February NAO [s.d.]", ylab="March NAO [s.d.]", pch=16) grid() points(NAO$Feb, NAO$Mar, pch=16) @ \vspace{-.5cm} \caption{Scatter plot of February and March NAO values.} \label{fig:scatterplotNAO} \end{figure} The plot is shown in Fig.~\ref{fig:scatterplotNAO}. It is a summary of the relationship between the NAO values for February and March. Low values of the index during February tend to be followed by low values in March, and high values in February tend to be followed by high values in March. You observe there is a direct (or positive) relationship between the two variables, although the points are scattered widely, indicating the relationship is not tight. You can change the point symbol with the argument \verb@pch@. If your goal is to model the relationship, you should plot the dependent variable (the variable you are interested in modeling) on the vertical axis. Here it makes the most sense to put the March values on the vertical axis since a predictive model would use February values to forecast March values. The plot produces points by default. This is changed using the argument \verb@type@, where the letter ell (\verb@"l"@, for lines) is given in quotes. For example, to plot the February NAO values as a time series type <>= plot(NAO$Year, NAO$Feb, ylab="February NAO", xlab="Year", type="l") @ \begin{figure} \centering <>= par(las=1, mgp=c(2.2, .75, 0)) plot(NAO$Year, NAO$Feb, ylab="February NAO [s.d.]", xlab="Year", type="l") abline(h=0, col="lightgray") grid() lines(NAO$Year, NAO$Feb) @ \vspace{-1cm} \caption{Time series of February NAO.} \label{fig:timeseriesplot} \end{figure} The plot is shown in Fig.~\ref{fig:timeseriesplot}.
The values fluctuate about zero and do not appear to have a long-term trend. With time series data it is better to connect the values with lines rather than use points, unless values are missing. More details on how to make time series and other graphs are given throughout the book. This concludes your introduction to R. We showed you where to get R, how to install it, and how to obtain add-on packages. We showed you how to use R as a calculator, how to work with functions, make assignments, and get help. We also showed you how to work with small amounts of data and how to import data from a file. We concluded with how to tabulate, summarize, and make some simple plots. There is much more ahead, but you've made a good start. Table~\ref{tab:commonRfunctions} lists most of the functions used in this chapter. A complete list of the functions used in the book is given in Appendix A. \begin{table} \begin{center} \caption{\label{tab:commonRfunctions} R functions used in this chapter.} \begin{tabular}{ll} \hline Function & Description \\ \hline \multicolumn{2}{l}{Numeric Functions} \\ \verb@sqrt(x)@ & square root of x \\ \verb@log(x)@ & natural logarithm of x \\ \verb@length(v)@ & number of elements in vector v \\ \verb@summary(d)@ & statistical summary of columns in data frame d \\ & \\ \multicolumn{2}{l}{Statistical Functions} \\ \verb@sum(v)@ & summation of the elements in v \\ \verb@max(v)@ & maximum value in v \\ \verb@mean(v)@ & average of the elements in v \\ \verb@var(v)@ & variance of the elements in v \\ \verb@sd(v)@ & standard deviation of the elements in v \\ \verb@quantile(x, prob)@ & prob quantile of the elements in x \\ & \\ \multicolumn{2}{l}{Structured Data Functions} \\ \verb@c(x, y, z)@ & concatenate the objects x, y, and z \\ \verb@seq(from, to, by)@ & generate a sequence of values\\ \verb@rep(x, n)@ & replicate x n times \\ & \\ \multicolumn{2}{l}{Table and Plot Functions} \\ \verb@table(a)@ & tabulate the characters or factors in a \\ \verb@barplot(h)@ & bar
plot with heights h\\ \verb@plot(x, y)@ & scatter plot of the values in x and y \\ & \\ \multicolumn{2}{l}{Input, Package and Help Functions} \\ \verb@read.table("file")@ & input the data from the file \\ \verb@head(d)@ & list the first six rows of data frame d \\ \verb@objects()@ & list all objects in the workspace \\ \verb@help(fun)@ & open help documentation for function fun\\ \verb@install.packages("pk")@& install the package pk on your computer \\ \verb@require(pk)@ & make functions in package pk available\\ \hline \end{tabular} \end{center} \end{table} The next chapter provides an introduction to statistics. If you've had a course in statistics this will be a review, but we encourage you to follow along anyway, as you will learn new things about using R.