In the earlier tutorials we learnt about vectors, lists, arrays and matrices. In this tutorial we will look at factors and data frames.
Factors
Factors in R store categorical data. Categorical data takes discrete values from a fixed set of categories. For example, a vector holding the birth months of the students in a class is categorical data.
- Their class is ‘factor’
- They have an attribute called levels of ‘character’ mode. The levels have to be unique
- If the factor is ordered then it has an additional class called ‘ordered’
- A factor cannot contain values that are not in its levels
Let's start by looking at some examples

    > x=c('jan','jan','march','april')
    # the levels of the factor are the unique values in x: jan, march, april
    > factor(x)
    [1] jan   jan   march april
    Levels: april jan march
    # the class is 'factor'
    > f=factor(x)
    > class(f)
    [1] "factor"
    # the levels are
    > levels(f)
    [1] "april" "jan"   "march"
    # if you order the levels (this orders by name)
    > o=ordered(f)
    > o
    [1] jan   jan   march april
    Levels: april < jan < march
    > class(o)
    [1] "ordered" "factor"
In the above example we let R determine the levels. We can also give it the levels explicitly

    > factor(x,levels=c('jan','feb','march'))
    [1] jan   jan   march <NA>
    Levels: jan feb march

We gave it three levels, but the data also contains the value 'april', which is not among them, so R assigned NA to that element.
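Explicit levels also fix the order of the levels, which matters once a factor is ordered. The following is a minimal sketch, assuming we want calendar order instead of the alphabetical default; the variable names are only illustrative.

```r
x <- c("jan", "jan", "march", "april")

# Supply the levels in calendar order and mark the factor as ordered
f <- factor(x, levels = c("jan", "march", "april"), ordered = TRUE)

levels(f)       # the levels keep the order we supplied
as.integer(f)   # integer codes follow the position of each level
f[1] < f[4]     # comparisons use the level order: is jan before april?
```

With alphabetical levels the same comparison would put april before jan, so spelling out the levels is usually the safer choice for ordered data.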
We can tell R to calculate levels on its own, but exclude some values
    # we want to exclude "jan"
    > factor(x,exclude=c("jan"))
    [1] <NA>  <NA>  march april
    Levels: april march
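To make factor names more meaningful we can assign labels to each level. Labels are matched to the levels by position, and by default the levels are sorted alphabetically (april, jan, march), so the labels have to be given in that same order

    > factor(x,labels=c("April","January","March"))
    [1] January January March   April
    Levels: April January March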
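Because labels are paired with levels by position, it can be safer to spell out both arguments together. This is a small sketch under that assumption; the level order chosen here is just one possibility.

```r
x <- c("jan", "jan", "march", "april")

# levels and labels are paired position by position
f <- factor(x,
            levels = c("jan", "march", "april"),
            labels = c("January", "March", "April"))

as.character(f)   # each value now carries its intended label
levels(f)
```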
If you want to allow NA as a level, pass exclude=NULL

    # NA is NOT a level here
    > x=c('jan','jan','march','april',NA)
    > factor(x)
    [1] jan   jan   march april <NA>
    Levels: april jan march
    # NA is a level here
    > x=c('jan','jan','march','april',NA)
    > factor(x,exclude=NULL)
    [1] jan   jan   march april <NA>
    Levels: april jan march <NA>
If you want to drop the levels that do not occur, pass the factor through the factor function again

    # the level feb is not used
    > factor(x,levels=c('jan','feb','march'))
    [1] jan   jan   march <NA>  <NA>
    Levels: jan feb march
    # the level feb is dropped
    > factor(factor(x,levels=c('jan','feb','march')))
    [1] jan   jan   march <NA>  <NA>
    Levels: jan march
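Base R also provides droplevels() for the same purpose. The sketch below assumes the same vector as above and is only meant to show that both approaches remove the unused level.

```r
x <- c("jan", "jan", "march", "april", NA)
f <- factor(x, levels = c("jan", "feb", "march"))

levels(f)            # "feb" is a level even though it never occurs

g <- droplevels(f)   # drop levels that have no matching values
levels(g)
```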
Let's now look at ways to extract or replace parts of a factor

    > x=c('jan','jan','march','april')
    > f=factor(x)
    > f
    [1] jan   jan   march april
    Levels: april jan march
    # let's extract the first two elements
    > f[1:2]
    [1] jan jan
    Levels: april jan march
    # also drop unused levels
    > f[1:2,drop=TRUE]
    [1] jan jan
    Levels: jan
    # replace an element with one of the levels
    > f[1] <- 'march'
    > f
    [1] march jan   march april
    Levels: april jan march
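The replacement above works because 'march' is already a level. As a rough sketch of what happens otherwise, the code below assigns a value outside the levels and then shows one way to make the assignment valid; the names g and h are illustrative.

```r
f <- factor(c("jan", "jan", "march", "april"))

# Assigning a value that is not a level yields NA (R also issues a
# warning, silenced here so the sketch runs quietly)
g <- f
suppressWarnings(g[1] <- "feb")
is.na(g[1])

# Extending the levels first makes the same assignment valid
h <- f
levels(h) <- c(levels(h), "feb")
h[1] <- "feb"
as.character(h)
```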
Data Frames
- A data frame is a matrix-like data structure, i.e. it has rows and columns. However, unlike a matrix, a data frame can contain columns with different types of values (integer, character etc.)
- A data frame has unique row names.
- A data frame has the class “data.frame”
- A data frame can be thought of as a list of equal-length vectors, one per column. Therefore all values in a column are of the same type, but a row can have values of different types
- A data frame can have non-unique column names
- In R versions before 4.0.0, character vectors passed to data.frame() are converted to factors by default. From R 4.0.0 onwards they stay character unless stringsAsFactors=TRUE is given. The output in this tutorial reflects the older default.
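The properties listed above can be checked directly. This is a minimal sketch with made-up column names; stringsAsFactors is set explicitly so that the result does not depend on the R version.

```r
d <- data.frame(id = c(1, 2, 3),
                name = c("a", "b", "c"),
                stringsAsFactors = FALSE)

class(d)           # "data.frame"
is.list(d)         # a data frame is a list underneath
length(d)          # the list has one element per column
sapply(d, class)   # each column has its own type
```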
Let's look at some ways to create a data frame
- Data frame from two numeric vectors

    > a=c(1,2,3,4)
    > b=c(5,6,7,8)
    > d=data.frame(a,b)
    > d
      a b
    1 1 5
    2 2 6
    3 3 7
    4 4 8
In the example above we created a data frame from two numeric vectors. The row and column names were assigned automatically. Let's specify the column names

    > d=data.frame(v1=a,v2=b)
    > d
      v1 v2
    1  1  5
    2  2  6
    3  3  7
    4  4  8

Let's also specify the row names

    > d=data.frame(v1=a,v2=b,row.names=c("r1","r2","r3","r4"))
    > d
       v1 v2
    r1  1  5
    r2  2  6
    r3  3  7
    r4  4  8
- Data frame from a numeric vector and a character vector

    > a=c(1,2,3,4)
    > b=c("b1","b2","b3","b4")
    > d=data.frame(v1=a,v2=b,row.names=c("r1","r2","r3","r4"))
    > d
       v1 v2
    r1  1 b1
    r2  2 b2
    r3  3 b3
    r4  4 b4
This looks obvious. However, if you look at the structure of the data frame you will realize that v2 does not contain characters

    > str(d)
    'data.frame':	4 obs. of  2 variables:
     $ v1: num  1 2 3 4
     $ v2: Factor w/ 4 levels "b1","b2","b3",..: 1 2 3 4

The second column contains a factor and not characters
- Other ways to create data frames
Let's see what constructing a data frame from a list looks like

    > a=c(1,2,3,4)
    > b=list("one","two")
    > data.frame(a=a,b=b)
      a b..one. b..two.
    1 1     one     two
    2 2     one     two
    3 3     one     two
    4 4     one     two

Each element of the list becomes its own column and is recycled to fill all the rows. A data frame can also be created from another data frame and a vector

    > a=c(1,2,3,4)
    > b=c("one","two","three","four")
    > d=data.frame(a=a,b=b)
    > d
      a     b
    1 1   one
    2 2   two
    3 3 three
    4 4  four
    # Now we create a data frame from 'd' and another boolean vector
    > m=c(TRUE,FALSE,FALSE,TRUE)
    > e=data.frame(d,m)
    > e
      a     b     m
    1 1   one  TRUE
    2 2   two FALSE
    3 3 three FALSE
    4 4  four  TRUE
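The recycling seen with the list example is easier to follow when the list elements are named. The sketch below assumes a hypothetical named list called cols and converts it with as.data.frame(); shorter elements are recycled only when their length divides the number of rows evenly.

```r
# A named list: each element becomes one column
cols <- list(id = 1:4, label = "one", flag = c(TRUE, FALSE))

df <- as.data.frame(cols, stringsAsFactors = FALSE)

dim(df)     # 4 rows, 3 columns
names(df)   # column names come from the list names
df$flag     # the length-2 vector was recycled to fill 4 rows
```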
We now turn our attention to subsetting a data frame. Subsetting is the act of selecting specific values or a range of values from the data frame. Since the data frame is a two-dimensional structure, the easiest way to select a single value is to specify the row and column. Here’s an example.

    > d=data.frame(a=c(1,2,3,4,5),b=c('a','b','c','d','e'),c=c(10L,20L,30L,40L,50L))
    > d
      a b  c
    1 1 a 10
    2 2 b 20
    3 3 c 30
    4 4 d 40
    5 5 e 50
    # we first retrieve the element at the first row, second column
    > d[1,2]
    [1] a
    Levels: a b c d e
    # recall that R (before version 4.0.0) stores characters as factors in a data frame by default
    # Another way to select is to use the row and column name
    > d["1","a"]
    [1] 1
    # It is good to give rows useful names.
    > rownames(d)=c("r1","r2","r3","r4","r5")
    > d
       a b  c
    r1 1 a 10
    r2 2 b 20
    r3 3 c 30
    r4 4 d 40
    r5 5 e 50
    > d["r1","a"]
    [1] 1
    # maybe assign names to columns as well
    > colnames(d)=c("c1","c2","c3")
    > d
       c1 c2 c3
    r1  1  a 10
    r2  2  b 20
    r3  3  c 30
    r4  4  d 40
    r5  5  e 50
    # note that column names need not be unique. We could have done this
    > colnames(d)=c("c1","c1","c1")
    > d
       c1 c1 c1
    r1  1  a 10
    r2  2  b 20
    r3  3  c 30
    r4  4  d 40
    r5  5  e 50
    # although there wouldn't be many reasons to do this.
    # We can't do this for rows though
    > rownames(d)=c("r1","r2","r3","r4","r4")
    Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
      duplicate 'row.names' are not allowed
    In addition: Warning message:
    non-unique value when setting 'row.names': ‘r4’
    # restore the unique column names before moving on
    > colnames(d)=c("c1","c2","c3")
It's also possible to select multiple elements

    # To select a complete row (the object returned is of class data.frame)
    > d["r1",]
       c1 c2 c3
    r1  1  a 10
    # To select a complete column (the object returned is a plain vector)
    > d[,"c1"]
    [1] 1 2 3 4 5
    # Another way to select a column is to specify a subscript.
    # With a single subscript the object returned is of class data.frame
    > d[1]
       c1
    r1  1
    r2  2
    r3  3
    r4  4
    r5  5
    > d[,1]
    [1] 1 2 3 4 5
    # To select multiple rows/columns do this
    > d[1:2]
       c1 c2
    r1  1  a
    r2  2  b
    r3  3  c
    r4  4  d
    r5  5  e
    > d[1:2,]
       c1 c2 c3
    r1  1  a 10
    r2  2  b 20
    # another way to select multiple columns
    > d[,1:2]
       c1 c2
    r1  1  a
    r2  2  b
    r3  3  c
    r4  4  d
    r5  5  e
    # to get a subset of the data frame do this
    > d[1:2,1:2]
       c1 c2
    r1  1  a
    r2  2  b
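Two further subsetting patterns tend to come up quickly: keeping a single column as a data frame with drop=FALSE, and selecting rows with a logical condition. The following is a small sketch that rebuilds a data frame like the one above; the condition on c3 is an arbitrary example.

```r
d <- data.frame(c1 = c(1, 2, 3, 4, 5),
                c2 = c("a", "b", "c", "d", "e"),
                c3 = c(10L, 20L, 30L, 40L, 50L),
                stringsAsFactors = FALSE)

v <- d[, "c1"]                 # a single column comes back as a vector
k <- d[, "c1", drop = FALSE]   # drop = FALSE keeps it a one-column data frame
class(v)
class(k)

big <- d[d$c3 > 20, ]          # keep only the rows where c3 exceeds 20
big
```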
Another way to access a column in a data frame is to use the variable name with the symbol ‘$’
    > d$c1
    [1] 1 2 3 4 5
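When the column name is held in a variable, the `$` form cannot be used, but double brackets can. A brief sketch, assuming a variable called col that holds the name:

```r
d <- data.frame(c1 = c(1, 2, 3), c2 = c(10L, 20L, 30L))

col <- "c2"                   # column name stored in a variable (illustrative)

identical(d$c2, d[["c2"]])    # both forms return the same vector
d[[col]]                      # [[ ]] accepts the name from a variable
```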
dim()
– This function gives the dimensions of the data frame

    > dim(d)
    [1] 5 3
names()
– This function can be used to get the names of the variables (columns) of a data frame. It can also be used to change their names

    > names(d) = c("col1","col2","col3")
    > d
       col1 col2 col3
    r1    1    a   10
    r2    2    b   20
    r3    3    c   30
    r4    4    d   40
    r5    5    e   50
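To rename just one column, names() can be combined with a logical index instead of retyping every name. This is a sketch with a hypothetical new name, "value":

```r
d <- data.frame(col1 = c(1, 2), col2 = c(10L, 20L), col3 = c(TRUE, FALSE))

# Replace only the name that currently equals "col2"
names(d)[names(d) == "col2"] <- "value"
names(d)
```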
I()
– Use the I() function to create a data frame with characters instead of factors

    > str(data.frame(x=c(1,2,3),y=I(c("a","b","c"))))
    'data.frame':	3 obs. of  2 variables:
     $ x: num  1 2 3
     $ y:Class 'AsIs'  chr [1:3] "a" "b" "c"

In the example above we create a data frame with two variables. However, we specify that the second variable should contain characters and should not be converted to a factor, which is the default behaviour for a data frame in R versions before 4.0.0.
head()
– Use the head function to get the first n rows of a data frame

    > head(d,n=2L)
       col1 col2 col3
    r1    1    a   10
    r2    2    b   20
tail()
– Similar to the head function, the tail function can be used to get the last n rows of a data frame.

    > tail(d,n=2L)
       col1 col2 col3
    r4    4    d   40
    r5    5    e   50
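Both functions also accept a negative n, which means "all rows except n of them". A short sketch on a small, made-up data frame:

```r
d <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10L, 20L, 30L, 40L, 50L))

first <- head(d, n = -2)   # everything except the last two rows
last  <- tail(d, n = -2)   # everything except the first two rows

first$a
last$a
```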
rbind()
– This function can be used to add a row to the data frame

    > rbind(d,I(c(4,"f",60)))
       col1 col2 col3
    r1    1    a   10
    r2    2    b   20
    r3    3    c   30
    r4    4    d   40
    r5    5    e   50
    6     4 <NA>   60
    Warning message:
    In `[<-.factor`(`*tmp*`, ri, value = "f") :
      invalid factor level, NA generated

We tried to add a row containing the value "f", which is not one of the levels of the factor column, so R complained and stored NA instead. We should rather create the data frame so that R treats strings as strings.
    > d=data.frame(a=c(1,2,3,4,5),b=c('a','b','c','d','e'),c=c(10L,20L,30L,40L,50L),stringsAsFactors=FALSE)
    > d
      a b  c
    1 1 a 10
    2 2 b 20
    3 3 c 30
    4 4 d 40
    5 5 e 50
    > rbind(d,I(c(4,"f",60)))
      a b  c
    1 1 a 10
    2 2 b 20
    3 3 c 30
    4 4 d 40
    5 5 e 50
    6 4 f 60
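Note that c(4,"f",60) is itself a character vector, because R coerces mixed values to a common type, so binding it can leave the numeric columns stored as character. One way to keep the column types, sketched here with a hypothetical new_row, is to bind a one-row data frame instead.

```r
d <- data.frame(a = c(1, 2, 3),
                b = c("a", "b", "c"),
                c = c(10L, 20L, 30L),
                stringsAsFactors = FALSE)

# The new row is a data frame with the same column names and types
new_row <- data.frame(a = 4, b = "f", c = 60L, stringsAsFactors = FALSE)

d2 <- rbind(d, new_row)
sapply(d2, class)   # column types are preserved
```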
This adds the row to the end of the data frame. To add the row at a particular index, two solutions have been suggested on StackOverflow
    # Shift rows r..n down by one position, then write the new row at position r
    insertRow <- function(existingDF, newrow, r) {
      existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
      existingDF[r,] <- newrow
      existingDF
    }

    # Append the new row, then reorder so that it lands at position r
    insertRow2 <- function(existingDF, newrow, r) {
      existingDF <- rbind(existingDF,newrow)
      existingDF <- existingDF[order(c(1:(nrow(existingDF)-1),r-0.5)),]
      row.names(existingDF) <- 1:nrow(existingDF)
      return(existingDF)
    }
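As a usage sketch for the second function quoted above, the code below inserts a made-up row at position 3 of a small data frame. The function is repeated so the example is self-contained; the data and the name new_row are assumptions for illustration.

```r
insertRow2 <- function(existingDF, newrow, r) {
  existingDF <- rbind(existingDF, newrow)
  # the appended row gets sort key r - 0.5, so it sorts just before old row r
  existingDF <- existingDF[order(c(1:(nrow(existingDF) - 1), r - 0.5)), ]
  row.names(existingDF) <- 1:nrow(existingDF)
  existingDF
}

d <- data.frame(a = c(1, 2, 3, 4, 5),
                b = c("a", "b", "c", "d", "e"),
                stringsAsFactors = FALSE)
new_row <- data.frame(a = 9, b = "z", stringsAsFactors = FALSE)

res <- insertRow2(d, new_row, 3)
res$a   # the new value sits in third position
```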
cbind()
– This function can be used to add a new variable or column

    > cbind(d,k=c(TRUE,TRUE,FALSE,TRUE,FALSE))
      a b  c     k
    1 1 a 10  TRUE
    2 2 b 20  TRUE
    3 3 c 30 FALSE
    4 4 d 40  TRUE
    5 5 e 50 FALSE
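cbind() returns a new data frame and leaves the original untouched, so the result has to be assigned to keep it. A minimal sketch, also showing the `$` assignment form as an alternative; the names e and d2 are illustrative.

```r
d <- data.frame(a = c(1, 2, 3), b = c(10L, 20L, 30L))

e <- cbind(d, k = c(TRUE, FALSE, TRUE))   # d itself is not modified
ncol(d)
ncol(e)

# Assigning to a new name with $ adds the column to a copy in place
d2 <- d
d2$k <- c(TRUE, FALSE, TRUE)
names(d2)
```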
In the next tutorial we look at some more functions that can be applied to data frames.