# Introduction to R

Junvie Pailden, Ph.D.
May 27, 2014

SIUE - Stat 575 - Summer 2014

### Why R?

• R offers a powerful and appealing interactive environment for exploring data, running simulations, etc.

• R is platform independent, meaning it is available on Windows, Mac, and Linux.

• R has extensive help resources, both online (a web search for almost any issue or question turns up answers) and built in through help(…), e.g. help(lm).

• R is not black-box software, i.e., you can trace how a function or package works by reading its R source code, e.g. for lm().

• Many more!!!

### Install R and RStudio on Windows

2. Install R. Leave all default settings in the installation options.

4. Open RStudio.

### Commands on R Console

# create an integer sequence
3:7

[1] 3 4 5 6 7

# create a sequence from 0 to 3 in increments of 0.5
seq(0,3,by=0.5)

[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0

# create a repeated sequence
rep(pi,4)

[1] 3.142 3.142 3.142 3.142
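As a further sketch of the same two functions, seq() can be told how many values to produce with length.out, and rep() can repeat a whole vector (times) or each element in turn (each). The variable names below are chosen only for this illustration.

```r
# five equally spaced values from 0 to 1
s <- seq(0, 1, length.out = 5)
s

# repeat the whole vector twice: 1 2 1 2
r_times <- rep(c(1, 2), times = 2)
r_times

# repeat each element twice: 1 1 2 2
r_each <- rep(c(1, 2), each = 2)
r_each
```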


Basic Operations

17+3+1

[1] 21

(2-3)*4

[1] -4


### Concatenate operator

c(6,20,-3) # numbers

[1]  6 20 -3

c("words","are","wind") # strings

[1] "words" "are"   "wind"


Operations

c(1,2,3,4) + 1

[1] 2 3 4 5

1/ c(1,2,3,4)

[1] 1.0000 0.5000 0.3333 0.2500

c(1,2,3,4)^2

[1]  1  4  9 16
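The operations above work element by element. When the two vectors differ in length, R recycles the shorter one; c(1,2,3,4) + 1 is the simplest case, with the single value 1 reused four times. A small sketch of the same rule, with illustrative variable names:

```r
v <- c(1, 2, 3, 4)

# element-by-element addition of two vectors of the same length
same_len <- v + c(10, 20, 30, 40)   # 11 22 33 44

# the shorter vector c(10, 20) is recycled as 10, 20, 10, 20
recycled <- v + c(10, 20)           # 11 22 13 24
```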


### Variables

Variable

# assignment
x <- 3
# is the same as
3 -> x
# and
x = 3


In this class, we will use <- for convenience. Be careful with = because it assigns a value; it does not test whether two values are equal. For that, you need the == operator.

one <- 1
two <- 2

one = two # This means: assign the value of "two" to the variable "one"

one

[1] 2

two

[1] 2


Let's start again

one <- 1
two <- 2

one == two  # This means: does the value of "one" equal the value of "two"?

[1] FALSE
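Like the arithmetic operators, == compares vectors element by element and returns a logical vector. A brief sketch, with names chosen only for this example:

```r
p <- c(1, 2, 3)
q <- c(1, 5, 3)

# element-wise comparison gives one TRUE/FALSE per position
same <- p == q            # TRUE FALSE TRUE

# all() and any() reduce the result to a single TRUE/FALSE
all_same <- all(same)     # FALSE
any_same <- any(same)     # TRUE
```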


### Commonly Used Operators

a <- sqrt(2); b <- 1:3; c <- 2:4


a + b

[1] 2.414 3.414 4.414

a * b

[1] 1.414 2.828 4.243


Entrywise multiplication

b * c

[1]  2  6 12


Modulus (x %% y gives the remainder when x is divided by y)

17 %% 5

[1] 2


Integer Division

17 %/% 5

[1] 3
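The two operators fit together: the integer quotient times the divisor, plus the remainder, gives back the original number. A quick check with the same values as above (variable names are illustrative):

```r
x <- 17
y <- 5

quotient  <- x %/% y   # 3
remainder <- x %% y    # 2

# the two results recombine to give back x
quotient * y + remainder == x   # TRUE
```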


### Some built-in functions in R

General Form

f(argument1, argument2,...)


sum(),mean(),sd()

b <- c(1,2,3)
sum(b)

[1] 6

mean(b)

[1] 2

sd(b)

[1] 1


exp(),cos(),log()

exp(1)

[1] 2.718

cos(3.141593)

[1] -1

log2(1)

[1] 0

log(x=64,base=4)

[1] 3
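The last call above names its arguments. R matches arguments by name first and then by position, so the calls in this sketch should all give the same value (the variable names are illustrative):

```r
by_position <- log(64, 4)             # x = 64, base = 4 matched by position
by_name     <- log(x = 64, base = 4)  # matched by name
reordered   <- log(base = 4, x = 64)  # named arguments can appear in any order

all.equal(by_position, by_name)
all.equal(by_name, reordered)
```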


### Simple Summaries

height <- 58:72
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)

hbar <- mean(height); hbar # mean of height, OR

[1] 65

n <- length(height);
sum(height)/n

[1] 65

var(height) # variance of height, OR

[1] 20

sum((height-hbar)^2)/(n-1)

[1] 20


Find the correlation of height and weight.

height <- 58:72
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)
# size
n <- length(height)
# mean
hbar <- mean(height)
wbar <- mean(weight)
# standard deviation
sdh <- sd(height)
sdw <- sd(weight)
# correlation coefficient
r <- sum((height-hbar)*(weight-wbar))/(sdh*sdw*(n-1))


### Correlation

Find the correlation of height and weight.

height <- 58:72
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)
n <- length(height)
hbar <- mean(height)
wbar <- mean(weight)
sdh <- sd(height)
sdw <- sd(weight)
r <- sum((height-hbar)*(weight-wbar))/(sdh*sdw*(n-1))
# printing the results
print(c(n,hbar,wbar,sdh,sdw,r))

[1]  15.0000  65.0000 136.7333   4.4721  15.4987   0.9955

# lazy way
cor(height,weight)

[1] 0.9955
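The hand computation above is the sample covariance divided by the product of the two standard deviations. A short check using the built-in cov(), which is not used elsewhere in these notes; r_cov is a name introduced only for this sketch:

```r
height <- 58:72
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)

# sample covariance scaled by both standard deviations
r_cov <- cov(height, weight) / (sd(height) * sd(weight))

# agrees with cor() up to floating-point rounding
all.equal(r_cov, cor(height, weight))
```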


### Writing Functions in R

General Form

function(arglist) expr
return(value)


I want a function that will add two numbers

my_fun <- function(x,y){
x + y
}
my_fun(1,2)

[1] 3


The body of a function does not need to be on separate lines. If the body is a single expression, the braces aren't necessary.

my_fun2 <- function(x,y) x + y
my_fun2(1,2)

[1] 3


### More on functions in R

I can set default values, say y=5

my_fun2 <- function(x,y=5) x + y
my_fun2(1)

[1] 6
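A default value is used only when the caller leaves that argument out. The sketch below overrides the default both by position and by name; the result names are illustrative.

```r
my_fun2 <- function(x, y = 5) x + y

default_y <- my_fun2(1)              # y falls back to 5, result 6
override  <- my_fun2(1, 10)          # y given by position, result 11
by_name   <- my_fun2(y = 10, x = 1)  # same call with named arguments
```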


The sapply() function accepts a vector or list and a function, then applies the function to every element and returns the result.

Because functions are also objects, I can pass a function into another function as the argument.

l <- 1:5
sapply(l, my_fun2)

[1]  6  7  8  9 10
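Two common variations on the call above, given as a sketch: extra arguments placed after the function are passed on to it, and the function can be written in place without a name.

```r
my_fun2 <- function(x, y = 5) x + y

# extra arguments after the function are passed on to it
plus_ten <- sapply(1:5, my_fun2, y = 10)   # 11 12 13 14 15

# an anonymous function defined in place
squares <- sapply(1:5, function(i) i^2)    # 1 4 9 16 25
```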


Write a function that computes the correlation between two variables!

my_corr <- function(a,b){
# size
n <- length(a)
# mean
abar <- mean(a)
bbar <- mean(b)
# standard deviation
sda <- sd(a)
sdb <- sd(b)
# correlation coefficient
r <- sum((a-abar)*(b-bbar))/(sda*sdb*(n-1))
return(r)
}

height <- 58:72
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)
my_corr(height,weight)



[1] 0.9955


### Special Values

There are a few special values that are used in R

The NA values are used to represent missing values. You may encounter NA values in text loaded in R or in data loaded from databases (to replace NULL values).

v <- c(1,2,3)
v

[1] 1 2 3

length(v) <- 4
v

[1]  1  2  3 NA


NA values also appear when you expand a vector (matrix, array) beyond the size where values are defined, as with length(v) <- 4 above.
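NA values propagate through most computations, so it helps to detect or skip them. A minimal sketch using is.na() and the na.rm argument of mean(); the names v_na and m_clean are illustrative:

```r
v_na <- c(1, 2, 3, NA)

is.na(v_na)                          # FALSE FALSE FALSE TRUE
mean(v_na)                           # NA, because one value is missing
m_clean <- mean(v_na, na.rm = TRUE)  # 2, the missing value is dropped
```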

If a computation results in a number that is too big, R will return Inf for a positive number and -Inf for a negative number.

2^1024

[1] Inf

-2 ^ 1024

[1] -Inf

1/0

[1] Inf

Inf-Inf # will return NaN

[1] NaN
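R provides test functions for each of these special values. The sketch below applies them to one vector (vals is an illustrative name); note that is.na() is also TRUE for NaN.

```r
vals <- c(1, Inf, -Inf, NaN, NA)

is.finite(vals)    # TRUE FALSE FALSE FALSE FALSE
is.infinite(vals)  # FALSE TRUE TRUE FALSE FALSE
is.nan(vals)       # FALSE FALSE FALSE TRUE FALSE
is.na(vals)        # FALSE FALSE FALSE TRUE TRUE
```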


### Lists

A list, created in R with list(), is an ordered collection of objects of possibly different types. Lists are frequently used to return several results of a function in a single object.

arya <- list(name='Arya of Winterfell',age=11,northman=TRUE)
arya

$name
[1] "Arya of Winterfell"

$age
[1] 11

$northman
[1] TRUE


You can see that the name of each item is preceded by a $. You can then reference each item in the list by its position or its name:

arya[1]

$name
[1] "Arya of Winterfell"


arya$name

[1] "Arya of Winterfell"
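Single brackets return a shorter list, while double brackets or $ return the item itself. The sketch below also shows a function returning several results in one list; summarize_vec is a hypothetical helper written only for this example.

```r
arya <- list(name = 'Arya of Winterfell', age = 11, northman = TRUE)

class(arya[1])      # "list": single brackets keep the list wrapper
class(arya[[1]])    # "character": double brackets extract the item
arya[["age"]]       # 11, same as arya$age

# hypothetical helper returning two results in a single list
summarize_vec <- function(v) {
  list(mean = mean(v), sd = sd(v))
}
res <- summarize_vec(c(1, 2, 3))
res$mean            # 2
res$sd              # 1
```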

arya$age>15

[1] FALSE


### Matrices

A matrix is a two-dimensional array. Matrices (same as vectors) can hold elements only of the same type.

# 5 by 4 matrix
m <- matrix(1:20,nrow=5,ncol=4)
m

     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20


By default, the matrix is populated by column. Use byrow=TRUE to fill it by row instead.

m <- matrix(1:20,nrow=5,ncol=4,byrow=TRUE)
m

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20


To access the matrix, use square brackets.

m[10] # 10th entry columnwise

[1] 18

m[3,4] # entry on 3rd row, 4th column

[1] 12

m[3:5] # 3rd to 5th entries columnwise

[1]  9 13 17

m[3:5,2:3] # entries from 3rd thru 5th rows and 2nd thru 3rd columns

     [,1] [,2]
[1,]   10   11
[2,]   14   15
[3,]   18   19


### Matrices (cont'd)

You can also give names to each row and each column using dimnames().

dimnames(m) <- list(c('a','b','c','d','e'),c('p','q','r','s'))
m

   p  q  r  s
a  1  2  3  4
b  5  6  7  8
c  9 10 11 12
d 13 14 15 16
e 17 18 19 20


Combine objects by rows with rbind() or by columns with cbind().

S <- rbind(rep(FALSE,5),rep(NA,5))
rownames(S) <- c('All False','All NA')
S

           [,1]  [,2]  [,3]  [,4]  [,5]
All False FALSE FALSE FALSE FALSE FALSE
All NA       NA    NA    NA    NA    NA


### Arrays

An array is an extension of the vector to more than two dimensions.

# 2 by 4 by 2 array
A <- array(1:16,c(2,4,2))
A

, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

, , 2

     [,1] [,2] [,3] [,4]
[1,]    9   11   13   15
[2,]   10   12   14   16


Interchange the first two subscripts on a 3-way array A

At <- aperm(A, c(2,1,3))
At

, , 1

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8

, , 2

     [,1] [,2]
[1,]    9   10
[2,]   11   12
[3,]   13   14
[4,]   15   16


### Factors in R

Values can be nominal, ordinal, or continuous.
In R, nominal and ordinal values are represented by factor().

houses <- c('Stark','Lannister','Tully','Arryn','Tyrells','Baratheon','Martell')
factor(houses)

[1] Stark Lannister Tully Arryn Tyrells Baratheon Martell
Levels: Arryn Baratheon Lannister Martell Stark Tully Tyrells


By default, factor levels are created in alphabetical order. To get an ordered factor with the levels in a chosen order, pass ordered=TRUE and levels:

factor(houses,ordered=TRUE,levels=houses)

[1] Stark Lannister Tully Arryn Tyrells Baratheon Martell
7 Levels: Stark < Lannister < Tully < Arryn < Tyrells < ... < Martell


### Apply function in R

Applies a function to sections of an array (or matrix) and returns the results in an array (or matrix).

apply(array, margin, function, ...)


The margin argument is used to specify which margin we want to apply the function to and which margin we wish to keep.

mat1 <- matrix(rep(seq(4), 4), ncol = 4)
mat1

     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    2    2    2
[3,]    3    3    3    3
[4,]    4    4    4    4

# row sums of mat1, margin is 1
apply(mat1, 1, sum)

[1]  4  8 12 16

# column sums of mat1, margin is 2
apply(mat1, 2, sum)

[1] 10 10 10 10

# using a user defined function
sum.plus.2 <- function(x){
  sum(x) + 2
}
# using the sum.plus.2 function on the rows of mat1
apply(mat1, 1, sum.plus.2)

[1]  6 10 14 18


### Data Frames

A data frame is the data structure we will be using most often in this class. A data frame is a list that contains multiple named vectors of the same length. Whereas we usually read a spreadsheet or database table row by row, a data frame is constructed column by column.

# head() returns the first rows of the data frame "cars"
head(cars)

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

# faster summary measures
summary(cars)

     speed           dist
 Min.   : 4.0   Min.   :  2
 1st Qu.:12.0   1st Qu.: 26
 Median :15.0   Median : 36
 Mean   :15.4   Mean   : 43
 3rd Qu.:19.0   3rd Qu.: 56
 Max.   :25.0   Max.   :120


### Conditionals

General Form

if (condition) {
  do this one
} else {
  do this two
}


Create a function that tells you whether a variable is greater than 20 or not

my_cond <- function(x){
  if (x > 20) {
    print("x is greater than 20")
  } else {
    print("x is not greater than 20")
  }
}
x <- 10
my_cond(x)

[1] "x is not greater than 20"


### Repeat Loops in R

R has three forms of loops. The first is repeat, which repeats a particular expression until it hits a break keyword.

x <- 0
repeat{if (x>10) break else {print(x); x <- x+1} }

[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

• Within the outermost braces is an if-else expression: if (x>10) break else {print(x); x <- x+1}. The inner set of braces is part of the else clause: print(x); x <- x+1.

• The semicolon separates the clause into two parts. The first is the print statement, and the second increments x so that the condition that terminates the loop, x>10, is eventually satisfied.

### While Loops in R

x <- 0
while (x < 10) {print(x); x <- x + 1}

[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9


### For Loops in R

A for loop in R iterates through each item in a vector or a list:

x <- 0
for (x in 1:10) print(x)

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

The colon creates a vector, passing each integer from 1 to 10 to the loop.

### Fibonacci Sequence

len <- 10
fibvals <- numeric(len) # creates a vector of 0's of length 10
fibvals

 [1] 0 0 0 0 0 0 0 0 0 0

fibvals[1] <- 1
fibvals[2] <- 1
for (i in 3:len) {
  fibvals[i] <- fibvals[i-1]+fibvals[i-2]
}
fibvals

 [1]  1  1  2  3  5  8 13 21 34 55


### Loops: Your Turn!

1. Create a function that returns a Fibonacci sequence of any length.

2. Create a function that returns a sequence of odd numbers of any length.

### R package

• An R package is a set of related functions and help files, bundled together.

• It is similar to libraries in C or toolboxes in Matlab.

• Normally, all functions within a single package are related: for example, the stats package contains functions for statistical analysis.

• There are a few public repositories of packages: the largest is CRAN, hosted by the R foundation, with more than 4000 packages, and it is mirrored on many sites worldwide. Of course, you need an internet connection to download packages from it.

• To use a package, you first need to install it into R.

• If you're using the R console user interface, you can use the package installer from the menu.

• You can also install R packages directly through the R console using install.packages().

• To load an R package, use library().

### Visualization in R

There are many ways to create a scatterplot in R. The basic function is plot(x, y), where x and y are numeric vectors denoting the (x,y) points to plot.

# Simple Scatterplot
attach(mtcars)
plot(wt, mpg, main="Scatterplot Example",
xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)


### Scatterplot

# Simple Scatterplot
plot(wt, mpg, main="Scatterplot Example",
xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)
# Add fit lines
abline(lm(mpg~wt), col="red") # regression line (y~x)
lines(lowess(wt,mpg), col="blue") # lowess line (x,y)


### Basic Scatterplot Matrix

names(mtcars)

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

# Consider only the variables mpg, disp, drat, and wt
pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")


### Boxplots

Boxplots can be created for individual variables or for variables by group. The format is boxplot(x, data=), where x is a formula and data= denotes the data frame providing the data.

# Boxplot of MPG by Car Cylinders
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon")


### Dotplots

# Dotplot: Grouped Sorted and Colored
# Sort by mpg, group and color by cylinder
x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
main="Gas Mileage for Car Models\ngrouped by cylinder",
xlab="Miles Per Gallon", gcolor="black", color=x$color)
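
The three color assignments above can also be written as a single lookup in a named vector. This is only a sketch of an alternative; cyl_colors is a name introduced here, and the result should match the x$color column built above.

```r
x <- mtcars[order(mtcars$mpg), ]  # sort by mpg, as above
x$cyl <- factor(x$cyl)

# named vector mapping each cylinder count to a color
cyl_colors <- c("4" = "red", "6" = "blue", "8" = "darkgreen")

# index the lookup by the factor's labels; unname() drops the "4"/"6"/"8" names
x$color <- unname(cyl_colors[as.character(x$cyl)])

# cross-tabulate to confirm each cylinder group received one color
table(x$cyl, x$color)
```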