R is the lingua franca of statistical software. It is a dialect of S, a computer language developed by John Chambers in 1976 at Bell Labs. In 1988, the system was rewritten in C. The software changed hand frequently, until in 1993, R appeared in the scene. It was a free implementation of S, written by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
If you’re in command-line environment you can start R
by typing the command R
. In our example will will use a far nicer environment called RStudio. It is available from: https://www.rstudio.com/.
getwd()
setwd()
ls()
rm()
list.files()
or dir()
source()
Create a file called source_test.R
to have the following code.
myfunction <- function() {
x <- rnorm(100)
mean(x)
}
addnoise <- function() {
y <- x + rnorm(length(x))
}
ls()
## [1] "addnoise" "myfunction"
Type source("source_test.R")
in the R console. Type ls()
, you’ll see two objects, myfunction
and adnoise
have been created.
CRAN
- The Comprehensive R Archive Network (http://cran.r-project.org/). About ~4000 packages and data available for use.help.search("functionname")
will search the help page of the a function name. You can also search the help page by putting "?"
before the function, e.g., ?plot
.
A full list of such manuals can be found at http://cran.r-project.org/manuals.html.
ape
, a phylogenetic analysis package. Or, seqinr
, a basic sequence analysis package. A brief introduction to using R for bioinformatics can be found here.install.packages('Packagename')
will install any package in R. To use the package you need to use library('Packagename')
.
Assignment operator in R is <-
.
x <- 1:20
x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
L
suffix. 1
is numeric
, but 1L
is integer.Undefined data in R is represented as either ‘NaN’ or ‘NA’ types. Infinity is represented as Inf
.
array
like objects. Vector contains only one type of objects.vector()
function.list()
function.attributes()
, class()
, length()
.c()
function.as.x()
functions.x <- 0:6
as.numeric(x)
## [1] 0 1 2 3 4 5 6
as.logical(x)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
as.character(x)
## [1] "0" "1" "2" "3" "4" "5" "6"
as.complex(x)
## [1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i
Special type of vector with “dimension” attributes.
m <- matrix(nrow = 2, ncol = 3)
dim(m)
## [1] 2 3
attributes(m)
## $dim
## [1] 2 3
Matrix is constructed column-wise.
m <- matrix(1:6, nrow = 2, ncol = 3)
m
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
m <- 1:10
dim(m) <- c(2, 5)
m
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
You can create matrix out of vectors using column-binding or row-binding:
x <- 1:3
y <- 10:12
cbind(x, y)
## x y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind(x, y)
## [,1] [,2] [,3]
## x 1 2 3
## y 10 11 12
Categorical data. Can be ordered or unordered.
x <- factor(c("yes", "yes", "no"))
Factors have levels:
table(x)
## x
## no yes
## 1 2
Factors are basically numerical data under the hood. Each string is given a number. We can view the underlying assignment using this:
unclass(x)
## [1] 2 2 1
## attr(,"levels")
## [1] "no" "yes"
Explicit ordering can be specified using levels
arguments to factor
. If we want to make no
before yes
:
x <- factor(c("yes", "no"), levels = c("Yes", "no"))
Undefined values in R are represented by "NaN"
and "NA"
. NA values can have class of numeric, integer, etc. NaN is also NA but the converse is not true. There are two functions to find NaN
and NA
is R:
is.na()
is.nan()
x <- c(1, 2, NA, 10, 3)
is.na(x)
## [1] FALSE FALSE TRUE FALSE FALSE
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE
x <- c(1, 2, NaN, NA, 4)
is.na(x)
## [1] FALSE FALSE TRUE TRUE FALSE
is.nan(x)
## [1] FALSE FALSE TRUE FALSE FALSE
Key data type in R. It stores tabular data. Column of a data frame can have different types of data, unlike matrix (same type). Special attributes row.names
. To read use read.table()
or read.csv()
. Data frame can be converted to matrix data.matrix()
.
x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
x
## foo bar
## 1 1 TRUE
## 2 2 TRUE
## 3 3 FALSE
## 4 4 FALSE
nrow(x)
## [1] 4
ncol(x)
## [1] 2
data(iris) # Read iris data sets
head(iris) # View the first few line of the iris data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
write.table(iris, "iris.txt", quote = F) # Write a sample file out
dir() # Check that we have written a file
## [1] "airquality.pdf" "figure" "final_problems.md"
## [4] "gbs_746-1.2.pdf" "gbs_746.mdown" "gbs_746.pdf"
## [7] "Intro_to_R.html" "Intro_to_R.md" "Intro_to_R.mdown"
## [10] "Intro_to_R.pdf" "Intro_to_R.rmd" "iris.txt"
## [13] "knit.R" "Makefile" "r_style_files"
## [16] "r_style.html" "Rplots.pdf"
test_data <- read.table("iris.txt", header = T) # read the data back into test_data
head(test_data) # Check the first few lines of the data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
To read a data frame from a gzipped file:
d <- read.table(gzfile("myzipped.gz"))
Each type of of data in R can have names. But they are most important for Matrix and Data-frames.
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
m
## c d
## a 1 3
## b 2 4
d <- data.frame(c(1, 2, 3), c(4, 5, 6))
d
## c.1..2..3. c.4..5..6.
## 1 1 4
## 2 2 5
## 3 3 6
names(d) <- c("A", "B")
d
## A B
## 1 1 4
## 2 2 5
## 3 3 6
Subsetting is a way to select a subset of data. There are 3 basic ways to subset data in R:
[]
[[]]
$
x <- c("a", "b", "c")
x[1]
## [1] "a"
x[2]
## [1] "b"
x[1:3]
## [1] "a" "b" "c"
x[-2]
## [1] "a" "c"
x[x > "a"]
## [1] "b" "c"
u <- x > "a"
u
## [1] FALSE TRUE TRUE
x[u]
## [1] "b" "c"
x <- list(foo = 1:4, bar = 0.6)
x[1]
## $foo
## [1] 1 2 3 4
x[[1]]
## [1] 1 2 3 4
x$bar
## [1] 0.6
x[["bar"]]
## [1] 0.6
x["bar"]
## $bar
## [1] 0.6
x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)
x[!bad]
## [1] 1 2 4 5
x <- matrix(1:6, 2, 3) # Create a sample 2x3 matrix
x # print the matrix out
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
x[1, 2] # what is the value in row 1, column 2
## [1] 3
x[2, 1] # what is the value in row 2, column 1
## [1] 2
We can select the whole row or columns:
x[1, ] # select whole row 1
## [1] 1 3 5
x[, 2] # select whole column 2
## [1] 3 4
All the operations of matrix also work with data-frame. However, you will more often use $
to extract individual columns of a data-frame.
x <- data.frame(c(1:3), c(4:6)) # create a small data frame
names(x) <- c("A", "B") # Give each column a nice name
x
## A B
## 1 1 4
## 2 2 5
## 3 3 6
x$A # Extract the column A
## [1] 1 2 3
Instead of using loops, you should all the time try to use “vectorized” operations. That means if you are interested in operations performed on each element of a “collection”, you should use the collection as a whole, not individual elements. This gives the computer to parallelize the operation and usually results in faster runtime.
x <- 1:4
y <- 6:9
x + y
## [1] 7 9 11 13
x - y
## [1] -5 -5 -5 -5
x * y
## [1] 6 14 24 36
x/y
## [1] 0.1667 0.2857 0.3750 0.4444
When you multiply two matrices using the "*"
, it multiplies element by element multiplications. This is different that standard matrix multiplication. If you are interested standard matrix multiplication, use "%*%"
operator.
m1 <- matrix(1:4, 2, 2)
m1
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
m1 * m1
## [,1] [,2]
## [1,] 1 9
## [2,] 4 16
m1 %*% m1
## [,1] [,2]
## [1,] 7 15
## [2,] 10 22
myfunction <- function() {
cat("Hello world\n")
}
for (i in 1:3) {
if (i > 1) {
cat("I said hello", i, "times\n")
} else {
cat("I said hello", i, "time\n")
}
}
## I said hello 1 time
## I said hello 2 times
## I said hello 3 times
x <- rnorm(50)
mean(x)
## [1] 0.2844
sd(x)
## [1] 0.9511
var(x)
## [1] 0.9046
median(x)
## [1] 0.3103
sum(x)
## [1] 14.22
You need to skip NA
for doing summary statistics:
data(airquality)
mean(airquality$Ozone)
## [1] NA
mean(airquality$Ozone, na.rm = T)
## [1] 42.13
summary()
will give you a nice summary of data:
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.0 Min. : 7 Min. : 1.70 Min. :56.0
## 1st Qu.: 18.0 1st Qu.:116 1st Qu.: 7.40 1st Qu.:72.0
## Median : 31.5 Median :205 Median : 9.70 Median :79.0
## Mean : 42.1 Mean :186 Mean : 9.96 Mean :77.9
## 3rd Qu.: 63.2 3rd Qu.:259 3rd Qu.:11.50 3rd Qu.:85.0
## Max. :168.0 Max. :334 Max. :20.70 Max. :97.0
## NA's :37 NA's :7
## Month Day
## Min. :5.00 Min. : 1.0
## 1st Qu.:6.00 1st Qu.: 8.0
## Median :7.00 Median :16.0
## Mean :6.99 Mean :15.8
## 3rd Qu.:8.00 3rd Qu.:23.0
## Max. :9.00 Max. :31.0
##
We can generate random numbers from the normal distribution using rnorm()
function.
x <- rnorm(10)
x
## [1] -1.5152 -0.4647 2.4993 -0.1411 1.0314 0.5282 0.0562 -0.3260
## [9] -0.6591 -0.3245
If we are interested in generating random numbers using a particular mean and standard deviation, we should you this:
x <- rnorm(10, mean = 5, sd = 2)
x
## [1] 2.710 3.107 3.573 5.697 3.507 6.808 5.742 3.548 3.792 3.006
In R, you can sample other distributions, I will leave it up to you to explore.
There are 3 graphical libraries in R:
gglplot2()
In this tutorial we will discuss only base graphics.
Plot
plot
function is pretty intelligent. If you just give a dataframe, it will plot a graph with all vs all variables.
data(iris)
plot(iris)
x <- runif(50, 0, 2) # create a vector from uniform distribution
y <- runif(50, 0, 2) # create another one
plot(x, y, main = "Main title", sub = "subtitle", xlab = "x-label", ylab = "y-label")
abline(h = 0.6, v = 0.6)
Each side of the plot have specific margins.
plot(1, 1)
for (side in 0:3) mtext(0:3, side = side, at = 0.7, line = 0:3)
mtext(paste("side", 1:4), side = 1:4, line = -1, font = 2)
plot(x, y, type = "n", xlab = "", ylab = "", axes = F) # Draw an empty plot area
points(x, y) # Draw point
axis(1) # Draw the first axis
axis(2, at = seq(0.2, 1.8, 0.2)) # Draw second axis with tick at specific scale
box() # Draw the surrounding box
title(main = "Main title", xlab = "x", ylab = "y") # Write the main title and the labels
legend("topleft", legend = "Some random data", pch = 16)
pchShow()
will show you all the plotting characters.
Normally when you create a plot, it is drawn on the default computer screen. But you can create all sorts of images, such PDF (good for high resolution publication quality graphs), or bitmap images such as PNG. In all cases, the steps are identical. You open a device, draw your plot and close the device. For e.g., you can create a PDF drawing like this.
data(airquality)
pdf("airquality.pdf", width = 8.5, height = 11)
plot(airquality)
dev.off()
## pdf
## 2
par
par
function controls every parameter of a plot. The function has several parameters. But the most common ones are as follows:
pch
- point type used for plottinglty
- Line typelwd
- line widthcex
- character magnification. It has several variations for different regions of the plot.las
- rotation of the axis labelsbg
- background color of the plotmar
- margin of the plotoma
- outer margin of the plotmfrow
- draws multiple plotsYou can modify mfrow
in par()
. For e.g, to draw a 1 row two column plot,
data(airquality)
par(mfrow = c(1, 2))
with(airquality, {
plot(Wind, Ozone)
plot(Temp, Wind)
})
data("airquality")
hist(airquality$Ozone)
No of breaks in the histogram can be controlled:
hist(airquality$Ozone, breaks = 50)
Boxplot is an summary plot showing the median and the distributions of data. You can give it a data frame.
boxplot(airquality)
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
counts <- table(mtcars$gear)
barplot(counts, main = "Car number of gear distribution", xlab = "Number of gears")
You can make a horizontal bar plot.
barplot(counts, main = "Car distribution", horiz = T, names.arg = c("3 Gears",
"4 Gears", "5 Gears"))
You can draw a grouped bar plot like this.
counts <- table(mtcars$cyl, mtcars$gear)
barplot(counts, main = "Gear and cylinder", xlab = "Number of gears", col = c("black",
"grey", "white"), legend = rownames(counts), beside = T)
There is no inbuilt way to generate error bars in R. But we have to hack a solution. We will use arrows to generate error bars. The trick is arrows take an argument of angle
. If you give angle=90
it will generate an error bar.
## Generate some data
heights <- vector() # Empty vector to hold the heights of the bar plot
stddevs <- vector() # Empty vector to hold the standard deviations
for (i in 1:3) {
# We will generate 3 sets of data
d <- rnorm(10, mean = 5, sd = 1) # Be sure that we have some positive data
heights[i] <- mean(d)
stddevs[i] <- sd(d)
}
## Draw the barplot
bp <- barplot(heights, names = c(1:3), ylim = c(0, 8), space = 2) # Returns a matrix
mids <- bp[, 1] # Get the midpoint
for (i in 1:3) {
# We will draw the errorbars
arrows(x0 = mids[i], y0 = heights[i] - stddevs[i], x1 = mids[i], y1 = heights[i] +
stddevs[i], code = 3, angle = 90)
}
If you have drawn figures in Excel, you might have used colors like this
barplot(counts, main = "Gear and cylinder", xlab = "Number of gears", col = c("red",
"green", "blue"), legend = rownames(counts), beside = T)
This is what I call “Christmas” plot. You should never use color like this. In R you can use sophisticated color palette.
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 3.0.2
display.brewer.all()
There are 3 palettes: sequential, quantitative, and contrasting. Now choose a nice color for our plot.
nice <- brewer.pal(3, "Pastel2")
barplot(counts, main = "Gear and cylinder", xlab = "Number of gears", col = nice,
legend = rownames(counts), beside = T)
We will use the women
datasets in R for this example. These is a simple datasets with “weight” and “height” of women in the US.
data(women)
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
If we plot a simple scatterplot of this data, we will see there is an obvious correlation between this two variables.
plot(women)
Let’s do a linear regression between this two variables.
l <- lm(women$weight ~ women$height)
summary(l)
##
## Call:
## lm(formula = women$weight ~ women$height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.733 -1.133 -0.383 0.742 3.117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.5167 5.9369 -14.7 1.7e-09 ***
## women$height 3.4500 0.0911 37.9 1.1e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.99
## F-statistic: 1.43e+03 on 1 and 13 DF, p-value: 1.09e-14
You can safely ignore the (Intercept)
line. But the adjusted r2 is 0.99 with P-value 1.1e-14, which is highly significant. Let’s put the regression line on the plot.
plot(women$height, women$weight, main = "Women height vs weight", xlab = "Height",
ylab = "Weight")
abline(l, lwd = 2, col = "red")
You can extract the P-value and the r2 values for printing
s <- summary(l)
s$adj.r.squared
## [1] 0.9903
s$coefficients[2, 4]
## [1] 1.091e-14
We can also do a correlation between this two variables
s <- cor.test(women$weight, women$height, alternatives = "tow.sided")
s
##
## Pearson's product-moment correlation
##
## data: women$weight and women$height
## t = 37.86, df = 13, p-value = 1.088e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9861 0.9985
## sample estimates:
## cor
## 0.9955
s$p.value
## [1] 1.088e-14
s$estimate
## cor
## 0.9955