Introduction to R

Junvie Pailden, Ph.D.
May 27, 2014

SIUE - Stat 575 - Summer 2014

Why R?

  • R offers a powerful and appealing interactive environment for exploring data, running simulations, etc.

  • R is platform independent, meaning it is available on Windows, Mac, and Linux.

  • R has extensive help resources, both online (search for any issue or question) and built in through help(…), e.g. help(lm).

  • R is not black-box software: you can trace how a function or package works by reading its R source, e.g. lm().

  • Many more!!!

Install R and RStudio on Windows

  1. Download R from http://cran.us.r-project.org/

  2. Install R. Leave all default settings in the installation options.

  3. Download RStudio from http://rstudio.org/download/desktop and install it. Leave all default settings in the installation options.

  4. Open RStudio.

Commands on R Console

# create an integer sequence
3:7
[1] 3 4 5 6 7
# create a sequence from 0 to 3 with 0.5 increment
seq(0,3,by=0.5)
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0
# create a repeated sequence
rep(pi,4)
[1] 3.142 3.142 3.142 3.142
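
Two further arguments are often handy with these functions. The sketch below uses length.out for seq() and times/each for rep(); the variable names are only illustrative.

```r
# five equally spaced points from 0 to 1
grid <- seq(0, 1, length.out = 5)
grid
# repeat the whole vector twice: 1 2 1 2
twice <- rep(c(1, 2), times = 2)
twice
# repeat each element twice: 1 1 2 2
each2 <- rep(c(1, 2), each = 2)
each2
```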

Basic Operations

17+3+1
[1] 21
(2-3)*4
[1] -4

Concatenate operator

c(6,20,-3) # numbers
[1]  6 20 -3
c("words","are","wind") # strings  
[1] "words" "are"   "wind" 

Operations

c(1,2,3,4) + 1
[1] 2 3 4 5
1/ c(1,2,3,4)
[1] 1.0000 0.5000 0.3333 0.2500
c(1,2,3,4)^2
[1]  1  4  9 16

Variables

# assignment
x <- 3
# is the same as
3 -> x
# and
x = 3

In this class, we will use <- for assignment. Be careful with =, because it does not test equality. For that, you need the == operator.

one <- 1
two <- 2
one = two # This means: assign the value of "two" to the variable "one"
one
[1] 2
two
[1] 2

Let's start again

one <- 1
two <- 2
one == two  # This means: does the value of "one" equal the value of "two"?
[1] FALSE
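
A short sketch of how == behaves beyond single numbers. It compares vectors element by element, while identical() gives one answer for the whole object. The vectors u and w are made up for illustration. The last lines show why all.equal() is usually safer than == for decimal arithmetic.

```r
u <- c(1, 2, 3)
w <- c(1, 5, 3)
elementwise <- u == w            # one TRUE/FALSE per position
elementwise
same_object <- identical(u, w)   # a single TRUE/FALSE for the whole object
same_object
# floating-point arithmetic: exact comparison can surprise
exact  <- (0.1 + 0.2) == 0.3
nearly <- isTRUE(all.equal(0.1 + 0.2, 0.3))
c(exact, nearly)
```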

Commonly Used Operators

a <- sqrt(2); b <- 1:3; c <- 2:4

Scalar addition and multiplication

a + b
[1] 2.414 3.414 4.414
a * b
[1] 1.414 2.828 4.243

Entrywise multiplication

b * c
[1]  2  6 12

Modulus (remainder of x divided by y)

17 %% 5
[1] 2

Integer Division

17 %/% 5 
[1] 3
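
The two operators fit together: the integer quotient times the divisor, plus the remainder, gives back the original number. A minimal check, with illustrative variable names:

```r
x <- 17
y <- 5
quot <- x %/% y   # integer quotient
rem  <- x %% y    # remainder
quot * y + rem == x
# both operators work elementwise on vectors
parity <- (1:6) %% 2
parity
```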

Some built-in functions in R

General Form

f(argument1, argument2,...)

sum(),mean(),sd()

b <- c(1,2,3)
sum(b)
[1] 6
mean(b)
[1] 2
sd(b)
[1] 1

exp(),cos(),log()

exp(1)
[1] 2.718
cos(3.141593)
[1] -1
log2(1)
[1] 0
log(x=64,base=4)
[1] 3

Simple Summaries

height <- 58:72
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)

hbar <- mean(height); hbar # mean of height, OR
[1] 65
n <- length(height);
sum(height)/n
[1] 65
var(height) # variance of height, OR
[1] 20
sum((height-hbar)^2)/(n-1)
[1] 20
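
In the same spirit, the standard deviation is the square root of the variance. The sketch below checks this on height and shows two other built-in summaries, median() and range().

```r
height <- 58:72
sd_direct   <- sd(height)
sd_from_var <- sqrt(var(height))   # same value, computed from the variance
c(sd_direct, sd_from_var)
median(height)
range(height)                      # smallest and largest value
```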

Correlation : Your Turn!

Find the correlation of height and weight.

height <- 58:72
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)
# size
n <- length(height)
# mean
hbar <- mean(height)
wbar <- mean(weight)
# standard deviation
sdh <- sd(height)  
sdw <- sd(weight)
# correlation coefficient
r <- sum((height-hbar)*(weight-wbar))/(sdh*sdw*(n-1))        

# printing the results
print(c(n,hbar,wbar,sdh,sdw,r))
[1]  15.0000  65.0000 136.7333   4.4721  15.4987   0.9955
# lazy way
cor(height,weight)          
[1] 0.9955
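
Another way to look at the same quantity: the correlation is the sample covariance divided by the product of the two standard deviations. A minimal sketch using the built-in cov(); the name r_cov is illustrative.

```r
height <- 58:72
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)
# covariance scaled by both standard deviations
r_cov <- cov(height, weight) / (sd(height) * sd(weight))
r_cov
all.equal(r_cov, cor(height, weight))
```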

Writing Functions in R

General Form

function(arglist) expr
return(value)

I want a function that will add two numbers

my_fun <- function(x,y){
  x + y
}
my_fun(1,2)
[1] 3

The body of a function does not need to be on separate lines. If the body is a single expression, the braces are not necessary.

my_fun2 <- function(x,y) x + y
my_fun2(1,2)
[1] 3

More on functions in R

I can set default values, say y=5

my_fun2 <- function(x,y=5) x + y
my_fun2(1)
[1] 6

The sapply() function accepts a vector or list and a function, then applies the function to every element and returns the result.

Because functions are also objects, I can pass a function into another function as the argument.

l <- 1:5
sapply(l, my_fun2)
[1]  6  7  8  9 10
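
The function passed to sapply() does not need a name, and any extra arguments given after it are handed on to it. A small sketch, redefining my_fun2 so the block runs on its own:

```r
my_fun2 <- function(x, y = 5) x + y
# an anonymous function defined on the spot
squares <- sapply(1:5, function(i) i^2)
squares
# y = 10 is passed along to my_fun2, overriding the default
plus_ten <- sapply(1:5, my_fun2, y = 10)
plus_ten
```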

Function : Your Turn!

Write a function that computes the correlation between two variables!

my_corr <- function(a,b){
  # size
  n <- length(a)
  # mean
  abar <- mean(a)
  bbar <- mean(b)
  # standard deviation
  sda <- sd(a)
  sdb <- sd(b)
  # correlation coefficient
  r <- sum((a-abar)*(b-bbar))/(sda*sdb*(n-1))
  return(r)
}
height <- 58:72
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)
my_corr(height,weight)
[1] 0.9955

Special Values

There are a few special values that are used in R

The NA values are used to represent missing values. You may encounter NA values in text loaded in R or in data loaded from databases (to replace NULL values).

v <- c(1,2,3)
v
[1] 1 2 3
length(v) <- 4
v
[1]  1  2  3 NA

NA values also appear when you expand a vector (matrix, array) beyond the size where values are defined, as in the example above.

If a computation results in a number that is too big, R will return Inf for a positive and -Inf for a negative.

2^1024
[1] Inf
-2 ^ 1024
[1] -Inf
1/0
[1] Inf
Inf-Inf # will return `NaN`
[1] NaN
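
Special values need special tests, because an ordinary comparison with NA just yields NA. The sketch below shows the usual helpers is.na(), is.nan() and is.finite(), and the na.rm argument that many summary functions accept.

```r
v <- c(1, 2, NA)
missing <- is.na(v)                 # FALSE FALSE TRUE
missing
m_raw   <- mean(v)                  # NA: the missing value propagates
m_clean <- mean(v, na.rm = TRUE)    # drop NA first, giving 1.5
c(m_raw, m_clean)
is.nan(Inf - Inf)
finite <- is.finite(c(1, Inf, NA, NaN))   # only the first is finite
finite
```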

Lists

A list, created in R with list(), is an ordered collection of objects of possibly different types. Lists are frequently used to return several results of a function in a single object.

arya <- list(name='Arya of Winterfell',age=11,northman=TRUE)
arya
$name
[1] "Arya of Winterfell"

$age
[1] 11

$northman
[1] TRUE

You can see that the name of each item is preceded by a $. You can then reference each item in the list by its position or its name:

arya[1]
$name
[1] "Arya of Winterfell"
arya$name
[1] "Arya of Winterfell"
arya$age>15
[1] FALSE
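
Note that single brackets return a list containing the item, while double brackets (or $) return the item itself. The sketch below shows the difference, followed by a small function that returns two results in one list; mean_and_sd is a hypothetical helper written for this illustration.

```r
arya <- list(name = 'Arya of Winterfell', age = 11, northman = TRUE)
sub_list <- arya[1]      # still a list, of length 1
element  <- arya[[1]]    # the character string itself
class(sub_list)
class(element)
arya[["age"]] + 1        # works, because [[ ]] returns the number
# returning several results from one function
mean_and_sd <- function(x) list(mean = mean(x), sd = sd(x))
res <- mean_and_sd(c(1, 2, 3))
res$mean
res$sd
```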

Matrices

A matrix is a two-dimensional array. Matrices (same as vectors) can hold elements only of the same type.

# 5 by 4 matrix
m <- matrix(1:20,nrow=5,ncol=4) 
m
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

By default, the matrix is populated by column. Set byrow=TRUE to fill it by row:

m <- matrix(1:20,nrow=5,ncol=4,byrow=TRUE) 
m
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20

To access the matrix, use square brackets.

m[10] # 10th entry columnwise 
[1] 18
m[3,4]  # entry on 3rd row, 4th column
[1] 12
m[3:5]  # 3rd to 5th entries columnwise
[1]  9 13 17
m[3:5,2:3] # entries from 3rd thru 5th rows and 2nd thru 3rd columns
     [,1] [,2]
[1,]   10   11
[2,]   14   15
[3,]   18   19
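
Leaving an index empty selects everything along that dimension. A short sketch on the same row-filled matrix m, rebuilt here so the block is self-contained:

```r
m <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
dim(m)            # 5 rows, 4 columns
row2 <- m[2, ]    # entire 2nd row: 5 6 7 8
row2
col3 <- m[, 3]    # entire 3rd column: 3 7 11 15 19
col3
t(m)[4, 3]        # transposing swaps the indices, so this equals m[3, 4]
```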

Matrices (cont'd)

You can also give names to each row and each column using dimnames().

dimnames(m) <- list(c('a','b','c','d','e'),c('p','q','r','s'))
m
   p  q  r  s
a  1  2  3  4
b  5  6  7  8
c  9 10 11 12
d 13 14 15 16
e 17 18 19 20

Combine objects by rows rbind() or columns cbind()

S <- rbind(rep(FALSE,5),rep(NA,5))
rownames(S) <- c('All False','All NA')
S
           [,1]  [,2]  [,3]  [,4]  [,5]
All False FALSE FALSE FALSE FALSE FALSE
All NA       NA    NA    NA    NA    NA
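
For comparison, a cbind() sketch that glues two vectors together as columns. The ids and scores are made-up values used only for illustration.

```r
ids    <- 1:3
scores <- c(90, 85, 77)                  # made-up values
tab <- cbind(id = ids, score = scores)   # names become column names
tab
dim(tab)                                 # 3 rows, 2 columns
colnames(tab)
```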

Arrays

An array extends the vector and the matrix to more than two dimensions.


# 2 by 4 by 2 array
A <- array(1:16,c(2,4,2)) 
A
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

, , 2

     [,1] [,2] [,3] [,4]
[1,]    9   11   13   15
[2,]   10   12   14   16

Interchange the first two subscripts on a 3-way array A

At <- aperm(A, c(2,1,3))
At
, , 1

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8

, , 2

     [,1] [,2]
[1,]    9   10
[2,]   11   12
[3,]   13   14
[4,]   15   16
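
To see what aperm() did, it can help to compare dimensions and single entries. In this sketch the entry at position [i, j, k] of A shows up at [j, i, k] of At.

```r
A  <- array(1:16, c(2, 4, 2))
At <- aperm(A, c(2, 1, 3))
dim(A)         # 2 4 2
dim(At)        # 4 2 2
A[2, 3, 1]     # row 2, column 3, first slice
At[3, 2, 1]    # same entry after swapping the first two subscripts
```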

Factors in R

Values can be nominal, ordinal, or continuous. In R, nominal and ordinal values are represented by factors, created with factor().

houses <- c('Stark','Lannister','Tully','Arryn','Tyrells','Baratheon','Martell')
factor(houses)
[1] Stark     Lannister Tully     Arryn     Tyrells   Baratheon Martell  
Levels: Arryn Baratheon Lannister Martell Stark Tully Tyrells

By default, factor levels are created in alphabetical order.

factor(houses,ordered=TRUE,levels=houses)
[1] Stark     Lannister Tully     Arryn     Tyrells   Baratheon Martell  
7 Levels: Stark < Lannister < Tully < Arryn < Tyrells < ... < Martell
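
Internally a factor stores integer codes that index into its levels. The sketch below inspects both and uses table() to count occurrences; the votes vector is made up for illustration.

```r
houses <- c('Stark','Lannister','Tully','Arryn','Tyrells','Baratheon','Martell')
f <- factor(houses)
levels(f)               # the distinct values, in alphabetical order
codes <- as.integer(f)  # position of each value within levels(f)
codes
votes <- factor(c('yes', 'no', 'yes', 'yes'))   # made-up data
counts <- table(votes)  # how often each level occurs
counts
```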

Apply function in R

Applies a function to sections of an array (or matrix) and returns the results in an array (or matrix).

apply(array, margin, function, ...)

The margin argument is used to specify which margin we want to apply the function to and which margin we wish to keep.

mat1 <- matrix(rep(seq(4), 4), ncol = 4)
mat1
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    2    2    2
[3,]    3    3    3    3
[4,]    4    4    4    4
#row sums of mat1, margin is 1
apply(mat1, 1, sum)
[1]  4  8 12 16
#column sums of mat1, margin is 2
apply(mat1, 2, sum)
[1] 10 10 10 10
#using a user defined function
sum.plus.2 <- function(x){
  sum(x) + 2
}
#using the sum.plus.2 function on the rows of mat1
apply(mat1, 1, sum.plus.2)
[1]  6 10 14 18
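
For plain sums, base R also provides the shortcuts rowSums() and colSums(), which should agree with the apply() calls. When the applied function returns more than one value, apply() stacks the results into a matrix, as in the range() example in this sketch.

```r
mat1 <- matrix(rep(seq(4), 4), ncol = 4)
all.equal(apply(mat1, 1, sum), rowSums(mat1))
all.equal(apply(mat1, 2, sum), colSums(mat1))
# range() returns two values per column, so the result is a 2 x 4 matrix
rng <- apply(mat1, 2, range)
rng
```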

Data Frames

A data frame is the data structure we will use most often in this class. A data frame is a list that contains multiple named vectors of the same length. Whereas we usually read a spreadsheet or database table by row, data frames are constructed by column.

# head() returns the first rows of the data frame "cars"
head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
# faster summary measures
summary(cars)
     speed           dist    
 Min.   : 4.0   Min.   :  2  
 1st Qu.:12.0   1st Qu.: 26  
 Median :15.0   Median : 36  
 Mean   :15.4   Mean   : 43  
 3rd Qu.:19.0   3rd Qu.: 56  
 Max.   :25.0   Max.   :120  
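
A small sketch of building a data frame column by column and pulling pieces out of it. It reuses the first five height and weight values from the earlier slides; the names df and tall are illustrative.

```r
# each argument becomes one named column
df <- data.frame(height = 58:62, weight = c(115, 117, 120, 123, 126))
nrow(df)
ncol(df)
df$weight                      # one column, as a plain vector
tall <- df[df$height > 60, ]   # rows that satisfy a condition
tall
mean(cars$speed)               # columns of built-in data work the same way
```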

Conditionals

General Form

if (condition) {
  do this one
} else {
  do this two
}

Create a function that tells you whether a variable is greater than 20 or not

my_cond <- function(x){
  if (x > 20) {
    print("x is greater than 20")
  } else {
    print("x is not greater than 20")
  }
}
x <- 10
my_cond(x)
[1] "x is not greater than 20"
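
Note that if() expects a single TRUE or FALSE. To make the same decision for every element of a vector, base R offers the vectorized ifelse(). A minimal sketch, with a made-up input vector:

```r
x <- c(10, 20, 30)
# one label per element of x
labels <- ifelse(x > 20, "greater than 20", "not greater than 20")
labels
```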

Repeat Loops in R

R has three forms of loops: repeat, while, and for.

The first is repeat, which repeats a particular expression until it hits a break keyword.

x <- 0
repeat {
  if (x>10) break
  else {print(x); x <- x+1}
}
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
  • Within the outermost braces is an if-else expression: if (x>10) break else {print(x); x <- x+1}. The inner set of braces encloses the else clause: print(x); x <- x+1.

  • The semicolon separates the clause into two statements. The first is a print statement, and the second increments x so that the condition that terminates the loop, x>10, is eventually satisfied.

While Loops in R

x <- 0
while (x < 10) {print(x); x <- x + 1}
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9

For Loops in R

A for loop iterates through each item in a vector or a list:

for (x in 1:10) print(x)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

The colon creates a vector, passing each integer from 1 to 10 to the loop.
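
The loop variable does not have to be a number. The sketch below loops over a character vector directly, then over its positions with seq_along(), which is a safer choice than 1:length(x) when the vector might be empty.

```r
houses <- c('Stark', 'Lannister', 'Tully')
for (h in houses) print(h)            # loop over the values
for (i in seq_along(houses)) {        # loop over the positions 1, 2, 3
  print(paste(i, houses[i]))
}
# with an empty vector, 1:length(x) still produces two iterations
empty <- character(0)
length(1:length(empty))    # 2, because 1:0 is c(1, 0)
length(seq_along(empty))   # 0, so the loop body would never run
```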

Fibonacci Sequence

len <- 10
fibvals <- numeric(len)  # creates a vector of 0's of length 10
fibvals
 [1] 0 0 0 0 0 0 0 0 0 0
fibvals[1] <- 1
fibvals[2] <- 1
for (i in 3:len) { 
   fibvals[i] <- fibvals[i-1]+fibvals[i-2]
} 
fibvals
 [1]  1  1  2  3  5  8 13 21 34 55

Loops: Your Turn!

  1. Create a function that returns a Fibonacci sequence of any length.
  2. Create a function that returns a sequence of odd numbers of any length.
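
One possible sketch for both exercises, built on the loop from the Fibonacci slide. The names fib_seq and odd_seq are hypothetical, and other solutions are equally valid.

```r
# Fibonacci sequence of length len (len must be at least 1)
fib_seq <- function(len) {
  stopifnot(len >= 1)
  fibvals <- numeric(len)
  fibvals[seq_len(min(len, 2))] <- 1    # first one or two terms are 1
  if (len >= 3) {
    for (i in 3:len) fibvals[i] <- fibvals[i-1] + fibvals[i-2]
  }
  fibvals
}
# first len odd numbers: 1, 3, 5, ...
odd_seq <- function(len) {
  odds <- numeric(len)
  for (i in seq_len(len)) odds[i] <- 2 * i - 1
  odds
}
fib_seq(10)
odd_seq(5)
```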

R package

  • An R package is a set of related functions and help files, bundled together.
  • It is similar to libraries in C or toolbox in Matlab.
  • Normally, all functions within a single package are related: for example, the stats package contains functions for statistical analysis.
  • There are a few public repositories of packages. The largest is CRAN, hosted by the R Foundation, with more than 4000 packages; it is mirrored at many sites worldwide.
  • To use a package, you first need to install it into R. Of course, you need an internet connection to do this.
  • If you're using the R console user interface, you can use the package installer from the menu.
  • You can also install R packages directly from the R console using install.packages().
  • To load an installed R package, use library().
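
The install-then-load workflow can be sketched as follows. MASS is used only as an example package name; the install line is commented out because it needs an internet connection and only has to be run once.

```r
# install.packages("MASS")   # one-time download from CRAN
# load the package only if it is installed on this machine
if (requireNamespace("MASS", quietly = TRUE)) {
  library(MASS)
}
library(stats)                 # ships with R and is normally attached already
"package:stats" %in% search()  # TRUE once the package is attached
```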

Visualization in R

There are many ways to create a scatterplot in R. The basic function is plot(x, y), where x and y are numeric vectors denoting the (x,y) points to plot.

# Simple Scatterplot
attach(mtcars)
plot(wt, mpg, main="Scatterplot Example", 
     xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)


Scatterplot

# Simple Scatterplot
plot(wt, mpg, main="Scatterplot Example", 
     xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)
# Add fit lines
abline(lm(mpg~wt), col="red") # regression line (y~x) 
lines(lowess(wt,mpg), col="blue") # lowess line (x,y)
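
The red regression line is defined by an intercept and a slope, which can be read off the fitted model. A minimal sketch that passes the data frame through the data argument instead of relying on attach(); the names fit and slope are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                     # intercept and slope of the fitted line
slope <- coef(fit)[["wt"]]
slope < 0                     # heavier cars tend to get fewer miles per gallon
```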


Basic Scatterplot Matrix

names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
# Consider only the variables mpg, disp, drat, and wt
pairs(~mpg+disp+drat+wt,data=mtcars, 
   main="Simple Scatterplot Matrix")


Boxplots

Boxplots can be created for individual variables or for variables by group. The format is boxplot(x, data=), where x is a formula and data= denotes the data frame providing the data.

# Boxplot of MPG by Car Cylinders 
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data", 
     xlab="Number of Cylinders", ylab="Miles Per Gallon")
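
The thick line inside each box marks the group median. As a numeric companion to the plot, the sketch below computes those medians with tapply(); the name med_by_cyl is illustrative.

```r
# median mpg within each cylinder group
med_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, median)
med_by_cyl
```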


Dotplots

# Dotplot: Grouped Sorted and Colored
# Sort by mpg, group and color by cylinder 
x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"  
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
     main="Gas Mileage for Car Models\ngrouped by cylinder",
   xlab="Miles Per Gallon", gcolor="black", color=x$color)
