R is the *lingua franca* of statistical software. It is a dialect of S, a computer language developed by John Chambers in 1976 at Bell Labs. In 1988, the system was rewritten in C. The software changed hand frequently, until in 1993, R appeared in the scene. It was a *free* implementation of S, written by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.

If you’re in command-line environment you can start `R`

by typing the command `R`

. In our example will will use a far nicer environment called RStudio. It is available from: https://www.rstudio.com/.

`getwd()`

- Get the current directory

`setwd()`

- Set the current directory

`ls()`

- Lists the objects created.

`rm()`

- Will remove objects.

`list.files()`

or`dir()`

- List the files in the current directory

`source()`

- Reads a R file

Create a file called `source_test.R`

to have the following code.

```
myfunction <- function() {
x <- rnorm(100)
mean(x)
}
addnoise <- function() {
y <- x + rnorm(length(x))
}
ls()
```

`## [1] "addnoise" "myfunction"`

Type `source("source_test.R")`

in the R console. Type `ls()`

, you’ll see two objects, `myfunction`

and `adnoise`

have been created.

`CRAN`

- The Comprehensive R Archive Network (http://cran.r-project.org/). About ~4000 packages and data available for use.- R website (http://www.r-project.org/) and FAQs (http://cran.r-project.org/doc/FAQ/R-FAQ.html)
- R related projects are difficult to search using Google. Use R-specific search engine RSEEK (http://www.rseek.org/).

`help.search("functionname")`

will search the help page of the a function name. You can also search the help page by putting `"?"`

before the function, e.g., `?plot`

.

- An introduction to R (http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf)
- R Data Import/Export (http://cran.r-project.org/doc/manuals/r-release/R-data.pdf)

A full list of such manuals can be found at http://cran.r-project.org/manuals.html.

- Bioconductor project http://www.bioconductor.org/
- A set of R modules specifically meant for biological data analysis. Distributed separately from CRAN.

- Some biological data related module in CRAN, such as
`ape`

, a phylogenetic analysis package. Or,`seqinr`

, a basic sequence analysis package. A brief introduction to using R for bioinformatics can be found here.

`install.packages('Packagename')`

will install any package in R. To use the package you need to use `library('Packagename')`

.

Assignment operator in R is `<-`

.

```
x <- 1:20
x
```

`## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20`

- character
- numeric All numbers in R are double precision real numbers. If you want an integer, you need to specify
`L`

suffix.`1`

is`numeric`

, but`1L`

is integer. - integer
- complex
- logical (True/False)

Undefined data in R is represented as either ‘NaN’ or ‘NA’ types. Infinity is represented as `Inf`

.

- A
`array`

like objects. Vector contains only one type of objects. - A list can contain various data types.
- Empty vector can be created using
`vector()`

function. - Empty list can be created using
`list()`

function. - Properties are
`attributes()`

,`class()`

,`length()`

. - Individual elements can be
*concatenated*using`c()`

function. - Can be converted to other data types using
`as.x()`

functions.

```
x <- 0:6
as.numeric(x)
```

`## [1] 0 1 2 3 4 5 6`

`as.logical(x)`

`## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE`

`as.character(x)`

`## [1] "0" "1" "2" "3" "4" "5" "6"`

`as.complex(x)`

`## [1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i`

Special type of vector with “dimension” attributes.

```
m <- matrix(nrow = 2, ncol = 3)
dim(m)
```

`## [1] 2 3`

`attributes(m)`

```
## $dim
## [1] 2 3
```

Matrix is constructed column-wise.

```
m <- matrix(1:6, nrow = 2, ncol = 3)
m
```

```
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
```

```
m <- 1:10
dim(m) <- c(2, 5)
m
```

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
```

You can create matrix out of vectors using column-binding or row-binding:

```
x <- 1:3
y <- 10:12
cbind(x, y)
```

```
## x y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
```

`rbind(x, y)`

```
## [,1] [,2] [,3]
## x 1 2 3
## y 10 11 12
```

Categorical data. Can be ordered or unordered.

`x <- factor(c("yes", "yes", "no"))`

Factors have levels:

`table(x)`

```
## x
## no yes
## 1 2
```

Factors are basically numerical data under the hood. Each string is given a number. We can view the underlying assignment using this:

`unclass(x)`

```
## [1] 2 2 1
## attr(,"levels")
## [1] "no" "yes"
```

Explicit ordering can be specified using `levels`

arguments to `factor`

. If we want to make `no`

before `yes`

:

`x <- factor(c("yes", "no"), levels = c("Yes", "no"))`

Undefined values in R are represented by `"NaN"`

and `"NA"`

. NA values can have class of numeric, integer, etc. NaN is also NA but the converse is not true. There are two functions to find `NaN`

and `NA`

is R:

```
is.na()
is.nan()
```

```
x <- c(1, 2, NA, 10, 3)
is.na(x)
```

`## [1] FALSE FALSE TRUE FALSE FALSE`

`is.nan(x)`

`## [1] FALSE FALSE FALSE FALSE FALSE`

```
x <- c(1, 2, NaN, NA, 4)
is.na(x)
```

`## [1] FALSE FALSE TRUE TRUE FALSE`

`is.nan(x)`

`## [1] FALSE FALSE TRUE FALSE FALSE`

Key data type in R. It stores tabular data. Column of a data frame can have different types of data, unlike matrix (same type). Special attributes `row.names`

. To read use `read.table()`

or `read.csv()`

. Data frame can be converted to matrix `data.matrix()`

.

```
x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
x
```

```
## foo bar
## 1 1 TRUE
## 2 2 TRUE
## 3 3 FALSE
## 4 4 FALSE
```

`nrow(x)`

`## [1] 4`

`ncol(x)`

`## [1] 2`

```
data(iris) # Read iris data sets
head(iris) # View the first few line of the iris data
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
```

```
write.table(iris, "iris.txt", quote = F) # Write a sample file out
dir() # Check that we have written a file
```

```
## [1] "airquality.pdf" "figure" "final_problems.md"
## [4] "gbs_746-1.2.pdf" "gbs_746.mdown" "gbs_746.pdf"
## [7] "Intro_to_R.html" "Intro_to_R.md" "Intro_to_R.mdown"
## [10] "Intro_to_R.pdf" "Intro_to_R.rmd" "iris.txt"
## [13] "knit.R" "Makefile" "r_style_files"
## [16] "r_style.html" "Rplots.pdf"
```

```
test_data <- read.table("iris.txt", header = T) # read the data back into test_data
head(test_data) # Check the first few lines of the data
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
```

To read a data frame from a gzipped file:

`d <- read.table(gzfile("myzipped.gz"))`

Each type of of data in R can have names. But they are most important for Matrix and Data-frames.

```
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
m
```

```
## c d
## a 1 3
## b 2 4
```

```
d <- data.frame(c(1, 2, 3), c(4, 5, 6))
d
```

```
## c.1..2..3. c.4..5..6.
## 1 1 4
## 2 2 5
## 3 3 6
```

```
names(d) <- c("A", "B")
d
```

```
## A B
## 1 1 4
## 2 2 5
## 3 3 6
```

Subsetting is a way to select a subset of data. There are 3 basic ways to subset data in R:

`[]`

`[[]]`

`$`

```
x <- c("a", "b", "c")
x[1]
```

`## [1] "a"`

`x[2]`

`## [1] "b"`

`x[1:3]`

`## [1] "a" "b" "c"`

`x[-2]`

`## [1] "a" "c"`

```
x[x > "a"]
```

`## [1] "b" "c"`

```
u <- x > "a"
u
```

`## [1] FALSE TRUE TRUE`

`x[u]`

`## [1] "b" "c"`

```
x <- list(foo = 1:4, bar = 0.6)
x[1]
```

```
## $foo
## [1] 1 2 3 4
```

`x[[1]]`

`## [1] 1 2 3 4`

`x$bar`

`## [1] 0.6`

`x[["bar"]]`

`## [1] 0.6`

`x["bar"]`

```
## $bar
## [1] 0.6
```

```
x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)
x[!bad]
```

`## [1] 1 2 4 5`

```
x <- matrix(1:6, 2, 3) # Create a sample 2x3 matrix
x # print the matrix out
```

```
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
```

`x[1, 2] # what is the value in row 1, column 2`

`## [1] 3`

`x[2, 1] # what is the value in row 2, column 1`

`## [1] 2`

We can select the whole row or columns:

`x[1, ] # select whole row 1`

`## [1] 1 3 5`

`x[, 2] # select whole column 2`

`## [1] 3 4`

All the operations of matrix also work with data-frame. However, you will more often use `$`

to extract individual columns of a data-frame.

```
x <- data.frame(c(1:3), c(4:6)) # create a small data frame
names(x) <- c("A", "B") # Give each column a nice name
x
```

```
## A B
## 1 1 4
## 2 2 5
## 3 3 6
```

`x$A # Extract the column A`

`## [1] 1 2 3`

Instead of using loops, you should all the time try to use “vectorized” operations. That means if you are interested in operations performed on each element of a “collection”, you should use the collection as a whole, not individual elements. This gives the computer to parallelize the operation and usually results in faster runtime.

```
x <- 1:4
y <- 6:9
x + y
```

`## [1] 7 9 11 13`

`x - y`

`## [1] -5 -5 -5 -5`

`x * y`

`## [1] 6 14 24 36`

`x/y`

`## [1] 0.1667 0.2857 0.3750 0.4444`

When you multiply two matrices using the `"*"`

, it multiplies element by element multiplications. This is different that standard matrix multiplication. If you are interested standard matrix multiplication, use `"%*%"`

operator.

```
m1 <- matrix(1:4, 2, 2)
m1
```

```
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
```

`m1 * m1`

```
## [,1] [,2]
## [1,] 1 9
## [2,] 4 16
```

`m1 %*% m1`

```
## [,1] [,2]
## [1,] 7 15
## [2,] 10 22
```

```
myfunction <- function() {
cat("Hello world\n")
}
```

```
for (i in 1:3) {
if (i > 1) {
cat("I said hello", i, "times\n")
} else {
cat("I said hello", i, "time\n")
}
}
```

```
## I said hello 1 time
## I said hello 2 times
## I said hello 3 times
```

```
x <- rnorm(50)
mean(x)
```

`## [1] 0.2844`

`sd(x)`

`## [1] 0.9511`

`var(x)`

`## [1] 0.9046`

`median(x)`

`## [1] 0.3103`

`sum(x)`

`## [1] 14.22`

You need to skip `NA`

for doing summary statistics:

```
data(airquality)
mean(airquality$Ozone)
```

`## [1] NA`

`mean(airquality$Ozone, na.rm = T)`

`## [1] 42.13`

`summary()`

will give you a nice summary of data:

`summary(airquality)`

```
## Ozone Solar.R Wind Temp
## Min. : 1.0 Min. : 7 Min. : 1.70 Min. :56.0
## 1st Qu.: 18.0 1st Qu.:116 1st Qu.: 7.40 1st Qu.:72.0
## Median : 31.5 Median :205 Median : 9.70 Median :79.0
## Mean : 42.1 Mean :186 Mean : 9.96 Mean :77.9
## 3rd Qu.: 63.2 3rd Qu.:259 3rd Qu.:11.50 3rd Qu.:85.0
## Max. :168.0 Max. :334 Max. :20.70 Max. :97.0
## NA's :37 NA's :7
## Month Day
## Min. :5.00 Min. : 1.0
## 1st Qu.:6.00 1st Qu.: 8.0
## Median :7.00 Median :16.0
## Mean :6.99 Mean :15.8
## 3rd Qu.:8.00 3rd Qu.:23.0
## Max. :9.00 Max. :31.0
##
```

We can generate random numbers from the normal distribution using `rnorm()`

function.

```
x <- rnorm(10)
x
```

```
## [1] -1.5152 -0.4647 2.4993 -0.1411 1.0314 0.5282 0.0562 -0.3260
## [9] -0.6591 -0.3245
```

If we are interested in generating random numbers using a particular mean and standard deviation, we should you this:

```
x <- rnorm(10, mean = 5, sd = 2)
x
```

`## [1] 2.710 3.107 3.573 5.697 3.507 6.808 5.742 3.548 3.792 3.006`

In R, you can sample other distributions, I will leave it up to you to explore.

There are 3 graphical libraries in R:

- Base graphics
- Lattice
`gglplot2()`

In this tutorial we will discuss only base graphics.

`Plot`

`plot`

function is pretty intelligent. If you just give a dataframe, it will plot a graph with all vs all variables.

```
data(iris)
plot(iris)
```

```
x <- runif(50, 0, 2) # create a vector from uniform distribution
y <- runif(50, 0, 2) # create another one
plot(x, y, main = "Main title", sub = "subtitle", xlab = "x-label", ylab = "y-label")
abline(h = 0.6, v = 0.6)
```

Each side of the plot have specific margins.

```
plot(1, 1)
for (side in 0:3) mtext(0:3, side = side, at = 0.7, line = 0:3)
mtext(paste("side", 1:4), side = 1:4, line = -1, font = 2)
```

```
plot(x, y, type = "n", xlab = "", ylab = "", axes = F) # Draw an empty plot area
points(x, y) # Draw point
axis(1) # Draw the first axis
axis(2, at = seq(0.2, 1.8, 0.2)) # Draw second axis with tick at specific scale
box() # Draw the surrounding box
title(main = "Main title", xlab = "x", ylab = "y") # Write the main title and the labels
legend("topleft", legend = "Some random data", pch = 16)
```

`pchShow()`

will show you all the plotting characters.

Normally when you create a plot, it is drawn on the default computer screen. But you can create all sorts of images, such PDF (good for high resolution publication quality graphs), or bitmap images such as PNG. In all cases, the steps are identical. You open a device, draw your plot and close the device. For e.g., you can create a PDF drawing like this.

```
data(airquality)
pdf("airquality.pdf", width = 8.5, height = 11)
plot(airquality)
dev.off()
```

```
## pdf
## 2
```

`par`

`par`

function controls every parameter of a plot. The function has several parameters. But the most common ones are as follows:

`pch`

- point type used for plotting`lty`

- Line type`lwd`

- line width`cex`

- character magnification. It has several variations for different regions of the plot.`las`

- rotation of the axis labels`bg`

- background color of the plot`mar`

- margin of the plot`oma`

- outer margin of the plot`mfrow`

- draws multiple plots

You can modify `mfrow`

in `par()`

. For e.g, to draw a 1 row two column plot,

```
data(airquality)
par(mfrow = c(1, 2))
with(airquality, {
plot(Wind, Ozone)
plot(Temp, Wind)
})
```

```
data("airquality")
hist(airquality$Ozone)
```

No of breaks in the histogram can be controlled:

`hist(airquality$Ozone, breaks = 50)`

Boxplot is an summary plot showing the median and the distributions of data. You can give it a data frame.

`boxplot(airquality)`

```
data(mtcars)
head(mtcars)
```

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

```
counts <- table(mtcars$gear)
barplot(counts, main = "Car number of gear distribution", xlab = "Number of gears")
```

You can make a horizontal bar plot.

```
barplot(counts, main = "Car distribution", horiz = T, names.arg = c("3 Gears",
"4 Gears", "5 Gears"))
```

You can draw a grouped bar plot like this.

```
counts <- table(mtcars$cyl, mtcars$gear)
barplot(counts, main = "Gear and cylinder", xlab = "Number of gears", col = c("black",
"grey", "white"), legend = rownames(counts), beside = T)
```

There is no inbuilt way to generate error bars in R. But we have to hack a solution. We will use arrows to generate error bars. The trick is arrows take an argument of `angle`

. If you give `angle=90`

it will generate an error bar.

```
## Generate some data
heights <- vector() # Empty vector to hold the heights of the bar plot
stddevs <- vector() # Empty vector to hold the standard deviations
for (i in 1:3) {
# We will generate 3 sets of data
d <- rnorm(10, mean = 5, sd = 1) # Be sure that we have some positive data
heights[i] <- mean(d)
stddevs[i] <- sd(d)
}
## Draw the barplot
bp <- barplot(heights, names = c(1:3), ylim = c(0, 8), space = 2) # Returns a matrix
mids <- bp[, 1] # Get the midpoint
for (i in 1:3) {
# We will draw the errorbars
arrows(x0 = mids[i], y0 = heights[i] - stddevs[i], x1 = mids[i], y1 = heights[i] +
stddevs[i], code = 3, angle = 90)
}
```

If you have drawn figures in Excel, you might have used colors like this

```
barplot(counts, main = "Gear and cylinder", xlab = "Number of gears", col = c("red",
"green", "blue"), legend = rownames(counts), beside = T)
```

This is what I call “Christmas” plot. You should never use color like this. In R you can use sophisticated color palette.

`library(RColorBrewer)`

`## Warning: package 'RColorBrewer' was built under R version 3.0.2`

`display.brewer.all()`

There are 3 palettes: sequential, quantitative, and contrasting. Now choose a nice color for our plot.

```
nice <- brewer.pal(3, "Pastel2")
barplot(counts, main = "Gear and cylinder", xlab = "Number of gears", col = nice,
legend = rownames(counts), beside = T)
```

We will use the `women`

datasets in R for this example. These is a simple datasets with “weight” and “height” of women in the US.

```
data(women)
head(women)
```

```
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
```

If we plot a simple scatterplot of this data, we will see there is an obvious correlation between this two variables.

`plot(women)`

Let’s do a linear regression between this two variables.

```
l <- lm(women$weight ~ women$height)
summary(l)
```

```
##
## Call:
## lm(formula = women$weight ~ women$height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.733 -1.133 -0.383 0.742 3.117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.5167 5.9369 -14.7 1.7e-09 ***
## women$height 3.4500 0.0911 37.9 1.1e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.99
## F-statistic: 1.43e+03 on 1 and 13 DF, p-value: 1.09e-14
```

You can safely ignore the `(Intercept)`

line. But the adjusted r^{2} is 0.99 with P-value 1.1e-14, which is highly significant. Let’s put the regression line on the plot.

```
plot(women$height, women$weight, main = "Women height vs weight", xlab = "Height",
ylab = "Weight")
abline(l, lwd = 2, col = "red")
```

You can extract the P-value and the r^{2} values for printing

```
s <- summary(l)
s$adj.r.squared
```

`## [1] 0.9903`

`s$coefficients[2, 4]`

`## [1] 1.091e-14`

We can also do a correlation between this two variables

```
s <- cor.test(women$weight, women$height, alternatives = "tow.sided")
s
```

```
##
## Pearson's product-moment correlation
##
## data: women$weight and women$height
## t = 37.86, df = 13, p-value = 1.088e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9861 0.9985
## sample estimates:
## cor
## 0.9955
```

`s$p.value`

`## [1] 1.088e-14`

`s$estimate`

```
## cor
## 0.9955
```