Chapter 2: R Tutorial

“I think it is important for software to avoid imposing a cognitive style on workers and their work.” — Edward Tufte

date()

## [1] "Sun Jan  6 14:26:12 2013"

Packages

A package is a set of functions for doing specific things. There are thousands of them, and their availability for statistical analysis and modeling is one of the main attractions of R. The range of packages is enormous. The package BiodiversityR offers a graphical interface for calculations of environmental trends, the package Emu analyzes speech patterns, and the package GenABEL is used to study the human genome.

To install the package called UsingR, type

install.packages("UsingR")

Syntax is case sensitive.

After the package is downloaded to your computer, you make it available to your R session by typing

require(UsingR)

## Loading required package: UsingR

## Loading required package: MASS

If a package is not available from a particular CRAN site you can try another. To change your CRAN site, type

chooseCRANmirror()
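As a minimal sketch of the install-then-load workflow, the code below checks whether a package is available before attaching it. The package stats is used only because it ships with R, so the example runs anywhere; substitute UsingR or another contributed package as needed. The object names pkg and installed are illustrative.

```r
# "stats" ships with every R installation, so it stands in here
# for a contributed package such as UsingR (an assumption made
# only to keep the example runnable without downloading anything).
pkg <- "stats"

# requireNamespace() returns TRUE or FALSE without attaching the package.
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  library(pkg, character.only = TRUE)  # attach it to the session
} else {
  message("Package ", pkg, " is not installed; run install.packages() first")
}
installed
```

Checking first avoids a failed load in a script that you share with someone who has not yet installed the package.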

Calculations

R evaluates commands typed at the prompt. It returns the result of the computation on the screen or saves it to an object. For example, to find the sum of the square root of 25 and 2, type

sqrt(25) + 2

## [1] 7

The [1] says "first requested element will follow." Here there is only one element, the answer 7.

12/3 - 5

## [1] -1

How would you calculate the 5th power of 2?

2^5

## [1] 32
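The calculations above follow the usual order of operations. The lines below are a small illustration; the object names a1 through a4 are chosen only for this example.

```r
a1 <- 12 / 3 - 5      # division is done before subtraction: -1
a2 <- 12 / (3 - 5)    # parentheses force the subtraction first: -6
a3 <- -2^2            # the exponent is applied before the minus sign: -4
a4 <- (-2)^2          # parentheses make the base negative: 4
c(a1, a2, a3, a4)
```

When in doubt about the order, add parentheses.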

Functions

There are many math and statistical functions available in R. A function has a name, which is typed, followed by a pair of parentheses (required). Arguments are added inside this pair of parentheses as needed. For example, the square root of two is given as

sqrt(2)  # the square root

## [1] 1.414

The pound symbol (#) is the comment character. Any text in the line following this character is treated as a comment and is not evaluated by R.

sin(pi)  # the sine function

## [1] 1.225e-16

log(42)  # log of 42 base e

## [1] 3.738

Many functions have arguments that allow you to change the default behavior. For example, to use base 10 for the logarithm, you can use either of the following

log(42, 10)

## [1] 1.623

log(42, base = 10)

## [1] 1.623

To understand the first function, log(42, 10), you need to know that R expects the base to be the second argument (after the first comma) of the function. The second example uses a named argument of the type base= to explicitly set the base value. The first style requires less typing, but the second style is easier to remember and is good coding practice.
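The difference between positional and named arguments can be seen in a short sketch. The object names are illustrative; x is the name of the first argument of log(), and log10() is a base R function dedicated to base 10.

```r
by_position <- log(42, 10)            # base matched by position
by_name     <- log(42, base = 10)     # base matched by name
reordered   <- log(base = 10, x = 42) # named arguments may come in any order
base10      <- log10(42)              # dedicated base-10 function

# All four calls compute the same value.
c(by_position, by_name, reordered, base10)
```

Because named arguments are matched by name rather than position, the third call works even though the base is listed first.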

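When R doesn't understand your function, it responds with an error message. When a function is given a value it cannot handle, R may instead issue a warning and still return a result. For example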

srt(2)

## Error: could not find function "srt"

sqrt(-2)

## Warning: NaNs produced

## [1] NaN

The output NaN is used to indicate the value is not a number.
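You can test for NaN in your own code. The sketch below uses is.nan() for the test and suppressWarnings() to silence the warning; the object names r and z are illustrative. It also shows that the square root of a negative number is defined if the input is first converted to a complex number.

```r
r <- suppressWarnings(sqrt(-2))  # same calculation, warning silenced
is.nan(r)                        # TRUE: the result is not a number

# For complex input the square root of -2 exists.
z <- sqrt(as.complex(-2))
z
```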

Assignments

It is convenient to name an object so you can use it later. Doing so is called an assignment. You put a name on the left-hand side of the equals sign and a value, function, object, etc on the right. Assignments generally do not produce output.

x = 2
x + 3

## [1] 5

Here you assign 2 to x and then add 3 to x.

You can make names out of letters, numbers, and the dot or underline characters. A name starts with a letter or a dot (a leading dot may not be followed by a number). You are not allowed to use math operators, such as + and -. The help page for make.names describes this in more detail (?make.names).

Case is also important in names; x is different than X. It is good coding practice to use conventional names for certain types of data. For instance, n is used for the length of a data record or the length of a vector, x and y are used for spatial coordinates, and i and j are for integers and indices for vectors and matrices.

These conventions are not forced, but consistently using them makes it easier for you (and others) to look back and understand what you've done.

Variables that begin with the dot character are usually reserved for advanced programmers. Unlike many programming languages, the period in R is only used as punctuation and can be included in an object name (e.g., my.object).
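The naming rules can be checked with make.names(), which returns its input unchanged when the name is valid and otherwise repairs it. The candidate names below are made up for illustration.

```r
candidates <- c("my.object", "rate_2", ".hidden", "2nd", "a+b")

# Valid names come back unchanged; invalid ones are repaired.
valid <- make.names(candidates)
valid

# TRUE where the candidate was already a valid name.
candidates == valid
```

The name 2nd is repaired by prefixing an X, and the plus sign in a+b is replaced with a dot.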

Getting help

Using R requires knowing many functions. R has a built-in help system with information about what each function returns, details on additional arguments, and examples. If you know the name of a function, for example var, type

help(var)

This brings up a help page dedicated to the variance function. ?var works the same way. The name of the function and the associated package are given as the preamble, followed by a brief description of the function and how it is used. Arguments are explained along with function and argument details. Examples are given toward the bottom of the page.

The examples help you understand what the function does. You can try them individually by copying and pasting them into your R session. You can also try them all at once by using the example() function. For instance, type

example(mean)

## 
## mean> x <- c(0:10, 50)
## 
## mean> xm <- mean(x)
## 
## mean> c(xm, mean(x, trim = 0.10))
## [1] 8.75 5.50

The help facility works great if you know the name of the function. If not, there are other ways to search. For example, the function help.search("mean") searches each entry in the help system and returns matches (often many) of functions that mention the word "mean". The function apropos() searches through function names and variables for matches.

apropos("mean")

##  [1] ".colMeans"       ".rowMeans"       "colMeans"       
##  [4] "kmeans"          "mean"            "mean.data.frame"
##  [7] "mean.Date"       "mean.default"    "mean.difftime"  
## [10] "mean.POSIXct"    "mean.POSIXlt"    "rowMeans"       
## [13] "weighted.mean"

Time to quit

To end your R session type

q(save = "no")

Like most R functions, q needs an open (left) and a close (right) parenthesis. The argument save="no" says do not save the workspace. If you do save, the workspace is written to a file in your working directory, called .RData by default, and the session history is written to a file called .Rhistory. The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions).

Small amounts of data

To begin your analysis you will need to get your data into R. For small amounts you use the concatenate function c(). It combines items such as a set of numbers. Consider a set of hypothetical hurricane counts, where in the first year there were two landfalls, in the second three, and so on. To enter these values, type

h = c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
class(h)

## [1] "numeric"

h

##  [1] 2 3 0 3 1 0 0 1 2 1

The ten values are stored in a vector object of class numeric called h.

The assignment operator here is an equal sign. Another frequently used assignment operator is the left-pointing arrow, which consists of two keystrokes: the less-than sign and the hyphen (<-). The arrow is the more common of the two, and using it reserves the equal sign for declaring argument values.

With most assignments only the prompt is returned to the screen with nothing printed. The object to the left of the assignment operator is given the values of whatever is to the right of the operator. They can be printed by typing the object name as you just did. Finally, the values when printed are prefaced with a [1]. This indicates that the first entry in the object has a value of 2.
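The two assignment operators can be compared directly. This is a small sketch; the names h1, h2, and total are illustrative.

```r
h1 = c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)   # assignment with the equal sign
h2 <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)  # the same assignment with the arrow
identical(h1, h2)                      # both create the same object

# Wrapping an assignment in parentheses assigns and prints in one step.
(total <- sum(h2))
```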

Arrow keys

The arrow keys on your keyboard can be used to retrieve previous commands. This saves typing. Each command is stored in a history file; the up arrow key moves backward through the history and the down arrow moves forward. The left and right arrow keys work as expected. You can correct a mistyped command and press return without first moving the cursor to the end of the line.

Functions

Once the data are stored as an object you apply functions to do various things. For example, you find the total number of landfalls occurring over the set of years by typing:

sum(h)

## [1] 13

The number of years is found by typing

length(h)

## [1] 10

The average number of hurricanes over this ten-year period is found by typing:

sum(h)/length(h)

## [1] 1.3

mean(h)

## [1] 1.3

Other useful functions include sort, min, max, range, diff, and cumsum. Try them on the object h of landfall counts. For example, what does the function diff do?
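If you want to check your answers, the following sketch applies four of these functions to the same counts. The vector h is re-entered so the block runs on its own.

```r
h <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)

sort(h)    # counts from lowest to highest: 0 0 0 1 1 1 2 2 3 3
range(h)   # smallest and largest count: 0 3
diff(h)    # change from each year to the next: 1 -3 3 -2 -1 0 1 1 -1
cumsum(h)  # running total of landfalls: 2 5 5 8 9 9 9 10 12 13
```

Note that diff() returns one fewer value than the length of h because it works on successive pairs.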

Most functions have a name followed by a left parenthesis, then a set of arguments separated by commas followed by a right parenthesis. Arguments have names. Some are required, but many are optional with R providing default values.

In summary consider the code

x = log(42, base = 10)

Here x is the object name, = is the assignment operator, log is the function, 42 is the value for which the logarithm is being computed, and 10 is the argument corresponding to the logarithm base. Note that here the equal sign is used in two different ways: as an assignment operator and to specify a value for an argument.

Vectors

Your data object h is stored as a vector. This means that R keeps track of the order the data were entered. The vector contains a first element, a second element, and so on. This is convenient.

Your data object of landfall counts has a natural order–year 1, year 2, etc., so you want to keep this order. You would like to be able to make changes to the data item by item instead of reentering the data. R lets you do this. Also, vectors are math objects so math operations can be performed on them.

Suppose h contains the annual landfall counts from the first decade of a longer record. You want to keep track of counts over a second decade.

d1 = h
d2 = c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1)

Most functions operate on the vector components all at once.

d1 + 2

##  [1] 4 5 2 5 3 2 2 3 4 3

d1 + d2

##  [1] 2 8 4 5 4 0 3 4 4 2

d1 - d2

##  [1]  2 -2 -4  1 -2  0 -3 -2  0  0

d1 - mean(d1)

##  [1]  0.7  1.7 -1.3  1.7 -0.3 -1.3 -1.3 -0.3  0.7 -0.3

In the second and third cases, the first-year count of the second decade is added to (or subtracted from) the first-year count of the first decade, and so on. In the fourth case a constant (the average of the first decade) is subtracted from each count of the first decade. This is an example of recycling. R repeats values from one vector so as to match the length of the other vector. Here the mean value is computed then repeated 10 times. Subtraction then follows on each component one at a time.
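Recycling is easier to see with two short vectors of different lengths. The vectors long and short below are made up for illustration.

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

# short is repeated to give 10 20 10 20 before the addition.
long + short   # 11 22 13 24

# A single number is recycled in the same way.
long * 2       # 2 4 6 8

# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning; here the warning text is captured.
msg <- tryCatch(c(1, 2, 3) + c(10, 20),
                warning = function(w) conditionMessage(w))
msg
```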

Suppose you are interested in the variability of hurricane counts from one year to the next. An estimate of this variability is the variance given by \[ s^2 = \frac{1}{n-1} \sum_{i=1}^n (h_i - \bar h)^2, \] where \( n \) is the sample size and \( \bar h \) is the sample mean \[ \bar h = \frac{1}{n}\sum_{i=1}^n h_i \]

The function var() computes the sample variance, but to see how vectorization works you can write a short script.

dbar = mean(d1)
dif = d1 - dbar
ss = sum(dif^2)
n = length(d1)
ss/(n - 1)

## [1] 1.344

Note how the different parts of the equation for the variance match what you type in R. To verify your answer, type

var(d1)

## [1] 1.344
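The steps of the script can also be collected into a function. This is a sketch for illustration only; the name sample_var is hypothetical, and in practice you would use var().

```r
# Sample variance: sum of squared deviations from the mean, divided by n - 1.
sample_var <- function(v) {
  n <- length(v)
  sum((v - mean(v))^2) / (n - 1)
}

d1 <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
sample_var(d1)

# Agreement with the built-in functions.
all.equal(sample_var(d1), var(d1))
all.equal(sqrt(sample_var(d1)), sd(d1))
```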

Similarly the standard deviation, which is the square root of the variance, is obtained by typing

sd(d1)

## [1] 1.16

To change the number of significant digits printed to the screen from the default of 7, type

options(digits = 3)
var(d1)

## [1] 1.34

A restriction on vectors is that all the components have to have the same type. You can't create a vector with the first component a numeric value and the second component a character text. A character vector can be a set of text strings as in

Elsners = c("Jim", "Svetla", "Ian", "Diana")
Elsners

## [1] "Jim"    "Svetla" "Ian"    "Diana"

Note that character strings are made by matching quotes on both sides of the string, either double or single. Caution: The quotes must not be directional. This can cause problems if you type your code in a word processor (like MS Word) as the program will insert directional quotes. It is better to use a text editor like Notepad.

You add another component to the vector Elsners by using the concatenate function.

c(Elsners, 1.5)

## [1] "Jim"    "Svetla" "Ian"    "Diana"  "1.5"

The component 1.5 gets coerced to a character string. Coercion occurs for mixed types, where the components get changed to the lowest common type, which is usually character. This will prevent you from using the vector in arithmetic operations.
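A short sketch shows coercion in both directions. The object name mixed is illustrative.

```r
class(c(TRUE, 2))    # "numeric": the logical TRUE becomes 1
class(c(1, "a"))     # "character": the number becomes the string "1"

mixed <- c("Jim", 1.5)
mixed[2]                  # "1.5", a character string

# Arithmetic needs an explicit conversion back to a number.
as.numeric(mixed[2]) + 1  # 2.5
```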

Vector elements can have names. The names will appear when the vector is printed. You use the names() function to retrieve and set names as character strings. For instance you type

names(Elsners) = c("Dad", "Mom", "Son", "Daughter")
Elsners

##      Dad      Mom      Son Daughter 
##    "Jim" "Svetla"    "Ian"  "Diana"

The names() function appears on the left side of the assignment operator. The function adds the names attribute to the vector. Names can be used on vectors of any type.
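Once a vector has names, you can use them in place of positions when subsetting. In this sketch the vector is created with its names in one step, which is an alternative to the names() assignment used above.

```r
Elsners <- c(Dad = "Jim", Mom = "Svetla", Son = "Ian", Daughter = "Diana")

Elsners["Mom"]                  # one component, selected by name
Elsners[c("Son", "Daughter")]   # several components
names(Elsners)                  # retrieve the names
unname(Elsners)                 # the values without the names
```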

Returning to your hurricane example, suppose the National Hurricane Center (NHC) finds a previously undocumented hurricane in the 6th year of the 2nd decade. In this case you type

d2[6] = 1
d2

##  [1] 0 5 4 2 3 1 3 3 2 1

This changes the 6th element (component) of vector d2 to 1 leaving the other components alone. Note the use of square brackets. Square brackets are used to subset components of vectors (and arrays, lists, etc), whereas parentheses are used with functions to enclose the set of arguments.

To print the number of hurricanes during the 3rd year only, type

d2[3]

## [1] 4

To print all the hurricane counts except from the 4th year, type

d2[-4]

## [1] 0 5 4 3 1 3 3 2 1

To print the hurricane counts for the odd years only, type

d2[c(1, 3, 5, 7, 9)]

## [1] 0 4 3 3 2

Here you use the c() function inside the subset operator [.

Since here you want a regular sequence of years, the expression c(1, 3, 5, 7, 9) can be simplified using structured data.

Structured data

Creating structured data can save you time. For example the colon function is used to create sequences of numbers.

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

10:1

##  [1] 10  9  8  7  6  5  4  3  2  1

rev(1:10)

##  [1] 10  9  8  7  6  5  4  3  2  1

The seq() function is more general. It allows for start and end values, but also a step size or sequence length.

seq(1, 9, by = 2)

## [1] 1 3 5 7 9

seq(1, 10, by = 2)

## [1] 1 3 5 7 9

seq(1, 9, length = 5)

## [1] 1 3 5 7 9
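The sequence functions combine naturally with the subset operator. As a sketch, the odd-year subset of d2 can be written with seq(). The names odd_years and even_years are illustrative, and d2 is re-entered here with the corrected sixth-year count so the block runs on its own.

```r
d2 <- c(0, 5, 4, 2, 3, 1, 3, 3, 2, 1)

odd_years <- seq(1, 9, by = 2)
d2[odd_years]    # 0 4 3 3 2

# Using length(d2) as the end value avoids hard-coding the record length.
even_years <- seq(2, length(d2), by = 2)
d2[even_years]   # 5 2 1 3 1
```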

Use the rep() function to create a vector with repeated values. The simplest usage of the function is to replicate the value of the first argument the number of times specified by the value of the second argument.

rep(1, times = 10)

##  [1] 1 1 1 1 1 1 1 1 1 1

rep(1:3, times = 3)

## [1] 1 2 3 1 2 3 1 2 3

You create more complicated patterns by specifying pairs of equal-sized vectors. In this case, each component of the first vector is replicated the corresponding number of times as specified in the second vector.

rep(c("cold", "warm"), c(1, 2))

## [1] "cold" "warm" "warm"

Here the vectors are implicitly defined using the c() function and the name of the second argument times= is left off. It's good coding practice to name the arguments so the order they appear in the function is not important.

Suppose you want to repeat the sequence of cold, warm, warm three times. You nest the above sequence generator in another repeat function as follows.

rep(rep(c("cold", "warm"), c(1, 2)), 3)

## [1] "cold" "warm" "warm" "cold" "warm" "warm" "cold" "warm" "warm"

Function nesting provides flexibility.
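The same patterns can be written with named arguments. The sketch below uses the times and each arguments of rep(); the each argument is not used elsewhere in this chapter.

```r
# Naming the second argument makes its role explicit.
rep(c("cold", "warm"), times = c(1, 2))     # "cold" "warm" "warm"

# each repeats every component before moving to the next one.
rep(1:3, each = 2)                          # 1 1 2 2 3 3

# each is applied first, then the result is repeated times times.
rep(c("cold", "warm"), each = 2, times = 2)
```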

Logic

Functions like mean() and var() when applied to a vector of data output a statistic. max() is another. To find the greatest number of hurricanes in a single year during the first decade type

max(d1)

## [1] 3

To determine which years had the greatest number type

d1 == 3

##  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

A single equal sign is an assignment statement, but a double equal sign is a logical operator. Each element of the vector is matched in value with 3 and a true or false returned. The length of the output will match the length of the vector.

To get the years corresponding to the truth of the logical operation, type

which(d1 == 3)

## [1] 2 4

How would you get the number of years in each decade without a hurricane?

sum(d1 == 0)

## [1] 3

sum(d2 == 0)

## [1] 1

Note that you can place two commands on a single line by separating them with a semicolon, as in sum(d1 == 0); sum(d2 == 0).
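Logical vectors can also be placed inside the square brackets or combined with the & operator. The following sketch re-enters d1 so it runs on its own.

```r
d1 <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)

d1[d1 >= 2]            # counts in years with two or more hurricanes: 2 3 3 2
which(d1 == max(d1))   # years with the greatest count: 2 4
mean(d1 == 0)          # proportion of years without a hurricane: 0.3
sum(d1 > 0 & d1 < 3)   # number of years with one or two hurricanes: 5
```

The functions sum() and mean() work on logical vectors because TRUE is treated as 1 and FALSE as 0.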

The ratio of the mean number of hurricanes over the two decades is

mean(d2)/mean(d1)

## [1] 1.85

So there were 85% more landfalls during the second decade. Is this difference statistically significant?

You should remove objects from your workspace that are no longer needed. This helps you to recycle names. First, to see what objects reside in your workspace, type

objects()

##  [1] "d1"      "d2"      "dbar"    "dif"     "Elsners" "h"       "n"      
##  [8] "ss"      "x"       "xm"

To remove objects, type

rm(d1, d2, Elsners)

To remove all objects type

rm(list = objects())

This will clean your workspace completely. To avoid name conflicts it is good practice to start a session with a clean workspace. Caution: don't include this command in code you share with others.

Getting data into R

First check your working directory.

getwd()

## [1] "/Users/jelsner/Dropbox/book/Chap02"

This is the directory where R will look for data. To change your working directory you use the setwd() function and specify the path name within quotes. Alternatively you should be able to use one of the menu options in the R interface. To list the files in your working directory you type dir().

You also need to know the file type of your data. This will determine the read function. For example, the data set US.txt contains a list of tropical cyclone counts by year making landfall in the United States (excluding Hawaii) at hurricane intensity. The file is a space-delimited text file. In this case you use the read.table() function to import the data.

Does your data file have column names? These are given in the first line of your file, usually as a series of character strings. The line is called a "header" and if your data has one, you need to specify header=TRUE.

Make sure the text file US.txt is in your working directory and type

H = read.table("US.txt", header = TRUE)

If you get a prompt without an error message, the data has been imported.

At this stage the most common mistake is that your data file is not in your working directory. This will result in an error message along the lines of 'cannot open the connection' or 'cannot open file.'

If your file has commas between columns, then use the argument sep="," in the function. No changes are made to your file.

You can also change the missing value character. By default it is NA. If the missing value character in your file is coded as 99, specify na.strings="99" and it will be changed to NA in your data object.

Several variants of read.table() differ only in the default argument settings. read.csv() has settings that are suitable for comma delimited (csv) files that have been exported from a spreadsheet. Your work flow is to export your data from a spreadsheet to a csv file then import it to R using the read.csv() function.
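The sep, header, and na.strings arguments can be tried without a data file of your own. The sketch below writes a small comma-delimited file to a temporary location and reads it back. The file contents are made up for illustration, with 99 standing in for a missing value; the names tmp, df1, and df2 are illustrative.

```r
# Write a small comma-delimited file with a header line.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Year,All,MUS",
             "1851,1,1",
             "1852,3,99",
             "1853,0,0"), tmp)

# read.table() needs the separator and header stated explicitly.
df1 <- read.table(tmp, header = TRUE, sep = ",", na.strings = "99")

# read.csv() has these as defaults.
df2 <- read.csv(tmp, na.strings = "99")

df1
all.equal(df1, df2)   # both calls give the same data frame
is.na(df1$MUS)        # the value coded 99 is now NA

unlink(tmp)           # remove the temporary file
```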

You can also import data directly from the web by specifying the URL instead of the local file name.

loc = "http://www.hurricaneclimate.com/storage/chapter-2/US.txt"
H = read.table(loc, header = TRUE)

Data frames

The object H is a data frame and the function read.table() and its variants return data frames. Data frames are similar to a spreadsheet. The data are arranged in rows and columns. The rows are the cases and the columns are the variables. To see how big your data frame is, type

dim(H)

## [1] 160   6

This tells you there are 160 rows and 6 columns in your data frame.

To list the first six lines of the data object, type

head(H)

##   Year All MUS G FL E
## 1 1851   1   1 0  1 0
## 2 1852   3   1 1  2 0
## 3 1853   0   0 0  0 0
## 4 1854   2   1 1  0 1
## 5 1855   1   1 1  0 0
## 6 1856   2   1 1  1 0

The columns include year, number of hurricanes, number of major hurricanes, number of Gulf coast hurricanes, number of Florida hurricanes, and number of East coast hurricanes in order. Note the column names are given as well. The last six lines of your data frame are listed similarly using the tail() function. The number of lines listed is changed using the argument n.
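The argument n works the same way in head() and tail(). The sketch below uses a small stand-in data frame, here called H6, built from the first three columns of the six rows listed above so that it runs without the data file.

```r
# Stand-in for H: first three columns of the six rows printed by head(H).
H6 <- data.frame(Year = 1851:1856,
                 All  = c(1, 3, 0, 2, 1, 2),
                 MUS  = c(1, 1, 0, 1, 1, 1))

dim(H6)           # 6 rows and 3 columns
head(H6, n = 2)   # first two rows
tail(H6, n = 2)   # last two rows
```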

If your data reside in a directory other than your working directory, you can use the file.choose() function. This will open a dialog box allowing you to scroll and choose the file.

To make the individual columns available by column name, type

attach(H)

The number of years in the record is obtained and saved in n and the average number of U.S. hurricanes is saved in rate using the following two lines of code.

n = length(All)
rate = mean(All)

By typing the names of the saved objects, the values are printed.

n

## [1] 160

rate

## [1] 1.69

Thus over the 160 years of data the average number of hurricanes per year is 1.69.

If you want to change the names of the columns in the data frame, type

names(H)[4] = "GC"
names(H)

## [1] "Year" "All"  "MUS"  "GC"   "FL"   "E"

This changes the 4th column name from G to GC. Note that this is done to the data frame in R and not to your data file.

While attaching a data frame is convenient, it is not a good strategy when writing lots of code as name conflicts can easily arise. If you do attach your data frame, make sure you use the function detach() after finishing.
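Two alternatives avoid attaching altogether: the dollar sign and the with() function. The sketch below uses a small stand-in data frame, here called H6, holding the first six years of the All column shown earlier.

```r
# Stand-in for H: the first six years of the All column.
H6 <- data.frame(Year = 1851:1856,
                 All  = c(1, 3, 0, 2, 1, 2))

mean(H6$All)          # the dollar sign points to one column: 1.5
with(H6, mean(All))   # with() evaluates the expression inside the data frame
```

Neither approach leaves column names in your workspace, so there is nothing to detach afterward.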

Tables and summaries

Vectors and matrices must have a single class. For example, the vectors A, B, and C below are constructed as numeric, logical, and character, respectively.

A = c(1, 2.2, 3.6, -2.8)  #numeric vector
B = c(TRUE, TRUE, FALSE, TRUE)  #logical vector
C = c("Cat 1", "Cat 2", "Cat 3")  #character vector
class(A)

## [1] "numeric"

class(B)

## [1] "logical"

class(C)

## [1] "character"

Let the vector wx denote the weather conditions for five forecast periods as character data.

wx = c("sunny", "clear", "cloudy", "cloudy", "rain")
class(wx)

## [1] "character"

table(wx)

## wx
##  clear cloudy   rain  sunny 
##      1      2      1      1

The output from the table() function is a list of the unique character strings and the corresponding number of occurrences of each string.

As another example, let the object ss denote the Saffir-Simpson category for a set of five hurricanes.

ss = c("Cat 3", "Cat 2", "Cat 1", "Cat 3", "Cat 3")
table(ss)

## ss
## Cat 1 Cat 2 Cat 3 
##     1     1     3

Here the character strings correspond to different intensity levels as ordered categories with Cat 1 < Cat 2 < Cat 3. In this case convert the character vector to an ordered factor with levels. This is done using the factor() function.

ss = factor(ss, ordered = TRUE)
class(ss)

## [1] "ordered" "factor"

ss

## [1] Cat 3 Cat 2 Cat 1 Cat 3 Cat 3
## Levels: Cat 1 < Cat 2 < Cat 3

The class gets changed to an ordered factor. A print of the object results in a list of the elements in the vector and a list of the levels in order. Note, if you do the same for the wx object, the order is alphabetical by default. Try it.
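When alphabetical order is not the order you want, you can state the levels explicitly with the levels argument of factor(). The intensity labels below are made up for illustration.

```r
wx <- c("sunny", "clear", "cloudy", "cloudy", "rain")
levels(factor(wx))   # alphabetical: "clear" "cloudy" "rain" "sunny"

# Hypothetical labels whose natural order is not alphabetical.
intensity <- c("medium", "low", "high", "low")
f <- factor(intensity, levels = c("low", "medium", "high"), ordered = TRUE)
f

f < "high"   # comparisons respect the stated order
table(f)     # counts are listed in level order
```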

You can also use the table() function on discrete numeric data. For example,

table(All)

## All
##  0  1  2  3  4  5  6  7 
## 34 48 38 27  6  1  5  1

The summary() method provides a summary description of your object. What gets returned depends on the class of the object being summarized. If your object is a data frame consisting of numeric data, the output includes six statistics (mean, median, minimum, maximum, first quartile, and third quartile) for each column.

summary(H)

##       Year           All            MUS            GC             FL      
##  Min.   :1851   Min.   :0.00   Min.   :0.0   Min.   :0.00   Min.   :0.00  
##  1st Qu.:1891   1st Qu.:1.00   1st Qu.:0.0   1st Qu.:0.00   1st Qu.:0.00  
##  Median :1930   Median :1.00   Median :0.0   Median :0.50   Median :0.00  
##  Mean   :1930   Mean   :1.69   Mean   :0.6   Mean   :0.69   Mean   :0.68  
##  3rd Qu.:1970   3rd Qu.:2.25   3rd Qu.:1.0   3rd Qu.:1.00   3rd Qu.:1.00  
##  Max.   :2010   Max.   :7.00   Max.   :4.0   Max.   :4.00   Max.   :4.00  
##        E        
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.469  
##  3rd Qu.:1.000  
##  Max.   :3.000

Quantiles

The quartiles output from the summary() method are examples of quantiles. Sample quantiles cut a set of ordered data into equal-sized data bins. The ordering comes from rearranging the data from lowest to highest. The first, or lower, quartile corresponding to the .25 quantile (25th percentile) indicates that 25% of the data have a value less than this quartile value. The third, or upper, quartile corresponding to the .75 quantile (75th percentile) indicates that 75% of the data have a smaller value than this quartile value.

The quantile() function calculates sample quantiles on a vector of data. For example, consider the set of North Atlantic Oscillation (NAO) index values for the month of June from the period 1851–2010. The NAO is a variation in the climate over the North Atlantic Ocean featuring fluctuations in the difference of atmospheric pressure at sea level between Iceland and the Azores. The index is computed as the difference in standardized sea-level pressures. The standardization is done by subtracting the mean and dividing by the standard deviation. The index is expressed in units of standard deviation.

First read the data consisting of monthly NAO values, then apply the quantile() function to the June values.

loc = "http://www.hurricaneclimate.com/storage/chapter-2/NAO.txt"
NAO = read.table(loc, header = TRUE)
quantile(NAO$Jun, probs = c(0.25, 0.5))

##    25%    50% 
## -1.405 -0.325

Note the use of the dollar sign to point to a particular column in the data frame. Recall that to list the column names of the data frame object called NAO type names(NAO).

Of the 160 values, 25% are less than -1.4 standard deviations (sd) and 50% are less than -0.32 sd. Thus the number of years with June NAO values between -1.4 and -0.32 sd equals the number of years with values below -1.4 sd.

The third quartile value corresponding to the .75 quantile (75th percentile) indicates that 75% of the data have a value less than this. The difference between the first and third quartile values is called the interquartile range (IQR). Fifty percent of all values lie within the IQR. The IQR can be found directly by using the IQR() function.
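The relationship between the quartiles and the IQR can be checked on a small vector. The values 1 through 9 are chosen only because the quartiles are easy to verify by eye.

```r
v <- 1:9

# First quartile, median, and third quartile.
q <- quantile(v, probs = c(0.25, 0.5, 0.75))
q                     # 3 5 7

unname(q[3] - q[1])   # third quartile minus first quartile: 4
IQR(v)                # the same value computed directly: 4
```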

Bar plots

R has a wide range of plotting capabilities. It takes time to master, but a little effort goes a long way.

The bar plot (or bar chart) is a way to easily compare categorical or discrete data. Levels of the variable are arranged in some order along the horizontal axis and the frequency of values in each group is plotted as a bar with the bar height proportional to the frequency.

To make a bar plot of your U.S. hurricane counts, type

barplot(table(All), ylab = "Number of Years", xlab = "Number of Hurricanes")

Figure: Bar plot of U.S. hurricane counts.

The plot is a concise summary of the number of hurricanes. Bar heights are proportional to the number of years with that many hurricanes. The plot conveys the same information as the table. The purpose of the bar plot is to illustrate the difference between data values. Readers expect the plot to start at zero so draw it that way. Usually there is little scientific reason to make the bars appear three dimensional.

Note that the axis labels are set using the ylab and xlab arguments with the actual label as a character string in quotation marks. You must be careful to use non-directional (straight) quotes, not the directional quotes that appear in word-processing type.

Although many of the plotting commands are simple and somewhat intuitive, to get a publication quality figure requires tweaking the default settings. You will see some of these tweaks as you work through the book.

By default the plot is sent to the graphics device for your computer screen. This might be windows, quartz, or X11. There are also devices for creating postscript, pdf, png, and jpeg output and sending them to a file in your working directory. For publication-quality graphics, the postscript and pdf devices are preferred because they produce scalable images. Use bitmap devices for drafts.

The sequence is to first specify a graphics device, then call your graphics functions, and finally close the device. For example, to create an encapsulated postscript file (eps) of your bar plot placed in your working directory, type

postscript(file = "MyFirstRPlot.eps")
barplot(table(All), ylab = "Number of Years", xlab = "Number of Hurricanes")
dev.off()

## pdf 
##   2

The file containing the bar plot is placed in your working directory. Note that the postscript() function opens the device and dev.off() closes it. Make sure you remember to close the device. To list the files in your working directory type dir().
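The same open, plot, close sequence works for the other file devices. The sketch below uses the pdf() device and writes to R's temporary directory so it does not clutter your working directory. The counts are the ten hypothetical values from earlier in the chapter, and the width and height (in inches) are arbitrary choices.

```r
# Tabulate the ten hypothetical hurricane counts used earlier.
counts <- table(c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1))

out <- file.path(tempdir(), "MyFirstRPlot.pdf")

pdf(file = out, width = 6, height = 4)   # open the device
barplot(counts, ylab = "Number of Years", xlab = "Number of Hurricanes")
invisible(dev.off())                     # close the device

file.exists(out)   # TRUE once the device has been closed
```

Wrapping dev.off() in invisible() suppresses the message about which device is now active.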

The pie chart is an alternative to the bar chart for displaying relative frequencies or proportions (?pie). It represents this information with wedges of a circle or pie. Since your eye has some difficulty judging relative areas, another alternative to the bar chart is the dot chart. To find out more about this graph function, type ?dotchart.

Scatter plots

Perhaps the most useful graph of all is the scatter plot. You use it to represent the relationship between two continuous variables. It is a graph of the values of one variable against the values of the other as points \( (x_i, y_i) \) in a Cartesian plane.

You use the plot() method to make a scatter plot. The syntax is plot(x, y) where x and y are vectors containing the paired data. Values of the variable named in the first argument (here x) are plotted along the horizontal axis.

For example, to graph the relationship between the February and March values of the NAO type

plot(NAO$Feb, NAO$Mar, xlab = "February NAO", ylab = "March NAO")

Figure: Scatter plot of February and March NAO values.

The plot shows the relationship between the NAO values during February and March. Low values of the index during February tend to be followed by low values in March and high values in February tend to be followed by high values in March. You observe there is a direct (or positive) relationship between the two variables although the points are scattered widely indicating the relationship is not tight.

The relationship between two variables can be easily visualized with a scatter plot. You can change the point symbol with the argument pch. If your goal is to model the relationship, you should plot the dependent variable (the variable you are interested in modeling) on the vertical axis. Here it might make the most sense to put the March values on the vertical axis since a predictive model would use February values to forecast March values.

The plot produces points by default. This is changed using the argument type, where the letter ell is placed in quotation marks. For example, to plot the February NAO values as a time series, type

plot(NAO$Year, NAO$Feb, ylab = "February NAO", xlab = "Year", type = "l")

Figure: Time series of February NAO values.

The values fluctuate about zero and do not appear to have a long-term trend. With time series data it is better to connect the values with lines rather than use points unless values are missing.

This concludes your introduction to R. We showed you where to get R, how to install it, and how to install packages. We showed you how to use R as a calculator, how to work with functions, make assignments, and get help. We also showed you how to work with small amounts of data and how to import data from a file. We concluded with how to tabulate, summarize, and make some simple plots.