Data Structures in R – Factors and Data Frames

In the earlier tutorials we learnt about vectors, lists, arrays and matrices. In this tutorial we will look at factors and data frames.

Factors

Factors in R store categorical data. Categorical data takes a limited set of discrete values. For example, a vector holding the birth months of the students in a class contains categorical data.

  • Their class is ‘factor’
  • They have an attribute called levels, which is a character vector. The levels have to be unique
  • If the factor is ordered then it has an additional class called ‘ordered’
  • A factor cannot contain values that are not in its levels (such values become NA)
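
The properties listed above can be checked directly. The following is a minimal sketch; the variable names (x, f, o, g) and the value "may" are chosen here purely for illustration.

```r
x <- c("jan", "jan", "march", "april")
f <- factor(x)

class(f)                  # "factor"
mode(levels(f))           # "character"
anyDuplicated(levels(f))  # 0, i.e. no level is repeated

# An ordered factor carries the additional class "ordered"
o <- factor(x, levels = c("jan", "march", "april"), ordered = TRUE)
class(o)                  # "ordered" "factor"

# Assigning a value that is not a level stores NA (R also warns,
# which is silenced here to keep the sketch quiet)
g <- f
suppressWarnings(g[1] <- "may")
is.na(g[1])               # TRUE
```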

Let's start by looking at some examples


> x=c('jan','jan','march','april')
# the levels of factor(x) are the unique values in x: jan, march, april (sorted alphabetically)
> factor(x)
[1] jan   jan   march april
Levels: april jan march
# the class 'factor'
> f=factor(x)
> class(f)
[1] "factor"
#the levels are
> levels(f)
[1] "april" "jan"   "march"
#if you order the levels (by default this orders them alphabetically)
> o=ordered(f)
> o
[1] jan   jan   march april
Levels: april < jan < march
> class(o)
[1] "ordered" "factor" 

In the above example we let R determine the levels. We can also specify the levels explicitly

> factor(x,levels=c('jan','feb','march'))
[1] jan   jan   march <NA> 
Levels: jan feb march

We gave it three levels, but the data contains a value (‘april’) that is not among them, so R assigned NA to that element

We can tell R to calculate levels on its own, but exclude some values

# we want to exclude "jan"
> factor(x,exclude=c("jan"))
[1] <NA>  <NA>  march april
Levels: april march

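To make factor names more meaningful we can assign labels to each level. Labels are matched to the levels by position, and the default levels are sorted alphabetically (april, jan, march), so the labels must be given in that order

# labels follow the level order: april, jan, march
> factor(x,labels=c("April","January","March"))
[1] January January March   April  
Levels: April January March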
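
Since labels are paired with levels by position, one way to make the pairing explicit is to pass levels and labels together instead of relying on the default alphabetical level order. This is a small sketch with illustrative variable names:

```r
x <- c("jan", "jan", "march", "april")

# levels and labels are given in the same (calendar) order,
# so "jan" -> "January", "march" -> "March", "april" -> "April"
f <- factor(x,
            levels = c("jan", "march", "april"),
            labels = c("January", "March", "April"))

as.character(f)  # "January" "January" "March" "April"
levels(f)        # "January" "March" "April"
```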

If you want NA to be treated as a level, pass exclude=NULL

# NA is NOT a level here
> x=c('jan','jan','march','april',NA)
> factor(x)
[1] jan   jan   march april <NA> 
Levels: april jan march
#NA is a level here
> x=c('jan','jan','march','april',NA)
> factor(x,exclude=NULL)
[1] jan   jan   march april <NA> 
Levels: april jan march <NA>
			

If you want to drop the levels that do not occur then pass the factor again through the factor function

# the level feb is not used
> factor(x,levels=c('jan','feb','march'))
[1] jan   jan   march <NA>   <NA>  
Levels: jan feb march
# the level feb is dropped
> factor(factor(x,levels=c('jan','feb','march')))
[1] jan   jan   march <NA>   <NA>  
Levels: jan march
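
Base R also provides droplevels(), which removes unused levels without calling factor() twice. A minimal sketch using the same data as above:

```r
x <- c("jan", "jan", "march", "april", NA)
f <- factor(x, levels = c("jan", "feb", "march"))
levels(f)            # "jan" "feb" "march"

# droplevels() keeps only the levels that actually occur in f
g <- droplevels(f)
levels(g)            # "jan" "march"
```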

Let's now look at ways to extract or replace parts of a factor

> x=c('jan','jan','march','april')
> f=factor(x)
> f
[1] jan   jan   march april
Levels: april jan march
#let's extract the first two elements
> f[1:2]
[1] jan jan
Levels: april jan march
#also drop unused levels
> f[1:2,drop=TRUE]
[1] jan jan
Levels: jan
#replace an element with one of the existing levels
> f[1] <- 'march'
> f
[1] march jan   march april
Levels: april jan march
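
The replacement above works because 'march' is already a level. To assign a value that is not yet a level, the level has to be added first. A short sketch, where "feb" is just an illustrative new value:

```r
f <- factor(c("jan", "jan", "march", "april"))

# Append the new level; existing elements are unaffected
levels(f) <- c(levels(f), "feb")

# Now the assignment is valid and does not produce NA
f[1] <- "feb"

as.character(f)  # "feb" "jan" "march" "april"
levels(f)        # "april" "jan" "march" "feb"
```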

Data Frames

The data frame is the most commonly used data structure in R modeling packages. These are the characteristics of a data frame:

  • A data frame is a matrix-like data structure, i.e. it has rows and columns. However, unlike a matrix, a data frame can contain columns with different types of values (integer, character etc.)
  • A data frame has unique row names.
  • A data frame has the class “data.frame”
  • A data frame can be thought of as a list of equal-length vectors, one per column. Therefore all values in a column are of the same type, however a row can have values of different types
  • A data frame can have non-unique column names
  • In R versions before 4.0.0, character vectors/variables passed to a data frame are converted to factors by default (stringsAsFactors=TRUE). From R 4.0.0 onwards the default is FALSE and they stay as characters. The outputs in this tutorial reflect the older default.
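
The "list of columns" view can be verified with a few base functions. This is a minimal sketch; stringsAsFactors is set explicitly so that the result does not depend on the R version.

```r
d <- data.frame(a = c(1, 2, 3),
                b = c("x", "y", "z"),
                stringsAsFactors = FALSE)

class(d)          # "data.frame"
is.list(d)        # TRUE: a data frame is a list underneath
length(d)         # 2, the number of columns
sapply(d, class)  # a: "numeric", b: "character"
d[[2]]            # the second column as a plain vector: "x" "y" "z"
```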

Lets look at some ways to create a data frame

  • data frame from two numeric vectors

    > a=c(1,2,3,4)
    > b=c(5,6,7,8)
    > d=data.frame(a,b)
    > d
      a b
    1 1 5
    2 2 6
    3 3 7
    4 4 8
    

    In the example above we created a data frame from two numeric vectors. The row and column names were assigned automatically. Let's specify the column names

    > d=data.frame(v1=a,v2=b)
    > d
      v1 v2
    1  1  5
    2  2  6
    3  3  7
    4  4  8
    

    Let's also specify the row names

    > d=data.frame(v1=a,v2=b,row.names=c("r1","r2","r3","r4"))
    > d
       v1 v2
    r1  1  5
    r2  2  6
    r3  3  7
    r4  4  8
    
  • data frame from a numeric vector and a character vector

    > a=c(1,2,3,4)
    > b=c("b1","b2","b3","b4")
    > d=data.frame(v1=a,v2=b,row.names=c("r1","r2","r3","r4"))
    > d
       v1 v2
    r1  1 b1
    r2  2 b2
    r3  3 b3
    r4  4 b4
    

    This looks obvious. However, if you look at the structure of the data frame you will see that v2 does not contain characters

    > str(d)
    'data.frame':	4 obs. of  2 variables:
     $ v1: num  1 2 3 4
     $ v2: Factor w/ 4 levels "b1","b2","b3",..: 1 2 3 4
    

    The second column contains factors and not characters

  • Other ways to create data frames

    Let's see what constructing a data frame from a list looks like

    > a=c(1,2,3,4)
    > b=list("one","two")
    > data.frame(a=a,b=b)
      a b..one. b..two.
    1 1     one     two
    2 2     one     two
    3 3     one     two
    4 4     one     two
    

    Each element of the list becomes its own column, and its single value is recycled to fill all the rows. A data frame can also be created from another data frame and a vector

    > a=c(1,2,3,4)
    > b=c("one","two","three","four")
    > d=data.frame(a=a,b=b)
    > d
      a     b
    1 1   one
    2 2   two
    3 3 three
    4 4  four
    #Now we create a data frame from 'd' and a logical vector
    > m=c(TRUE,FALSE,FALSE,TRUE)
    > e=data.frame(d,m)
    > e
      a     b     m
    1 1   one  TRUE
    2 2   two FALSE
    3 3 three FALSE
    4 4  four  TRUE
    

We now turn our attention to subsetting a data frame. Subsetting is the act of selecting specific values or a range of values from the data frame. Since the data frame is a two-dimensional structure, the easiest way to select a single value is to specify the row and column. Here’s an example.

> d=data.frame(a=c(1,2,3,4,5),b=c('a','b','c','d','e'),c=c(10L,20L,30L,40L,50L))
> d
  a b  c
1 1 a 10
2 2 b 20
3 3 c 30
4 4 d 40
5 5 e 50
# we first retrieve the element at first row,second column
> d[1,2]
[1] a
Levels: a b c d e
#recall that by default R stores characters as factors in a data frame
# Another way to select is to use the row and column name
> d["1","a"]
[1] 1
#It will be good to give rows useful names. 
> rownames(d)=c("r1","r2","r3","r4","r5")
> d
   a b  c
r1 1 a 10
r2 2 b 20
r3 3 c 30
r4 4 d 40
r5 5 e 50
> d["r1","a"]
[1] 1
#maybe assign names to columns as well
> colnames(d)=c("c1","c2","c3")
> d
   c1 c2 c3
r1  1  a 10
r2  2  b 20
r3  3  c 30
r4  4  d 40
r5  5  e 50
# note that column names need not be unique. we could have done this
> colnames(d)=c("c1","c1","c1")
> d
   c1 c1 c1
r1  1  a 10
r2  2  b 20
r3  3  c 30
r4  4  d 40
r5  5  e 50
#although there wouldn't be too many reasons to do this.
# We can't do this for rows though
> rownames(d)=c("r1","r2","r3","r4","r4")
Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘r4’

It's also possible to select multiple elements

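# first restore the unique column names, since the output below relies on them
> colnames(d)=c("c1","c2","c3")
#To select a complete row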
> d["r1",]
   c1 c2 c3
r1  1  a 10
# To select a complete column
> d[,"c1"]
[1] 1 2 3 4 5
#Note that d["r1",] returns a data.frame, whereas d[,"c1"] returns a plain vector
#another way to select a row or a column is to specify a subscript
> d[1]
   c1
r1  1
r2  2
r3  3
r4  4
r5  5
> d[,1]
[1] 1 2 3 4 5
# To select multiple rows/columns do this
> d[1:2]
   c1 c2
r1  1  a
r2  2  b
r3  3  c
r4  4  d
r5  5  e
> d[1:2,]
   c1 c2 c3
r1  1  a 10
r2  2  b 20
# another way to select multiple columns
> d[,1:2]
   c1 c2
r1  1  a
r2  2  b
r3  3  c
r4  4  d
r5  5  e
# to get a subset of the data frame do this
> d[1:2,1:2]
   c1 c2
r1  1  a
r2  2  b

Another way to access a column in a data frame is to use the variable name with the symbol ‘$’

> d$c1
[1] 1 2 3 4 5
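
Selecting a single column with d[,"c1"] or d$c1 gives a plain vector. If a one-column data frame is wanted instead, the drop=FALSE argument can be used. A minimal sketch with an illustrative data frame:

```r
d <- data.frame(c1 = c(1, 2, 3),
                c2 = c("a", "b", "c"),
                stringsAsFactors = FALSE)

v  <- d[, "c1"]                # plain numeric vector
df <- d[, "c1", drop = FALSE]  # still a data frame with one column

class(v)                       # "numeric"
class(df)                      # "data.frame"

# d[["c1"]] is equivalent to d$c1
identical(d[["c1"]], d$c1)     # TRUE
```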

  • dim() – This function gives the dimensions of the data frame
    > dim(d)
    [1] 5 3
    
  • names() – This function can be used to get the names of the variables (columns) of a data frame. It can also be used to change their names
    > names(d) = c("col1","col2","col3")
    > d
       col1 col2 col3
    r1    1    a   10
    r2    2    b   20
    r3    3    c   30
    r4    4    d   40
    r5    5    e   50
    
  • I() – use the I() function to create a data frame with characters instead of factors
    
    > str(data.frame(x=c(1,2,3),y=I(c("a","b","c"))))
    'data.frame':	3 obs. of  2 variables:
     $ x: num  1 2 3
     $ y:Class 'AsIs'  chr [1:3] "a" "b" "c"

    In the example above we create a data frame with two variables. We specify that the second variable should contain characters and should not be converted to a factor, which is the default behaviour for data frames in R versions before 4.0.0.

  • head() – Use the head function to get the first n rows of a data frame
    > head(d,n=2L)
       col1 col2 col3
    r1    1    a   10
    r2    2    b   20
    
  • tail() – Similar to the head function, the tail function can be used to get the last n rows of a data frame.
    > tail(d,n=2L)
       col1 col2 col3
    r4    4    d   40
    r5    5    e   50
    
  • rbind() – This function can be used to add a row to the data frame
    > rbind(d,I(c(4,"f",60)))
       col1 col2 col3
    r1    1    a   10
    r2    2    b   20
    r3    3    c   30
    r4    4    d   40
    r5    5    e   50
    6     4 <NA>   60
    Warning message:
    In `[<-.factor`(`*tmp*`, ri, value = "f") :
      invalid factor level, NA generated
    

    We tried to add a row containing a value ("f") that is not one of the existing levels of the factor column, so R stored NA and issued a warning. It is better to create the data frame so that R treats strings as strings.

    > d=data.frame(a=c(1,2,3,4,5),b=c('a','b','c','d','e'),c=c(10L,20L,30L,40L,50L),stringsAsFactors=FALSE)
    > d
      a b  c
    1 1 a 10
    2 2 b 20
    3 3 c 30
    4 4 d 40
    5 5 e 50
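    # pass the new row as a one-row data frame so that each column keeps its type
    > rbind(d,data.frame(a=4,b="f",c=60L,stringsAsFactors=FALSE))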
      a b  c
    1 1 a 10
    2 2 b 20
    3 3 c 30
    4 4 d 40
    5 5 e 50
    6 4 f 60
    

    This adds the row to the end of the data frame. To add the row at a particular index, two solutions have been suggested on StackOverflow

    # Shift rows r..n down by one position, then overwrite row r with the new row
    insertRow <- function(existingDF, newrow, r) {
      existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
      existingDF[r,] <- newrow
      existingDF
    }
    
    # Append the new row, then reorder so that it sorts just before the old row r
    insertRow2 <- function(existingDF, newrow, r) {
      existingDF <- rbind(existingDF,newrow)
      existingDF <- existingDF[order(c(1:(nrow(existingDF)-1),r-0.5)),]
      row.names(existingDF) <- 1:nrow(existingDF)
      return(existingDF)  
    }
    
  • cbind() – This function can be used to add a new variable or column
    > cbind(d,k=c(TRUE,TRUE,FALSE,TRUE,FALSE))
      a b  c     k
    1 1 a 10  TRUE
    2 2 b 20  TRUE
    3 3 c 30 FALSE
    4 4 d 40  TRUE
    5 5 e 50 FALSE
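
To see how the append-and-reorder approach to inserting a row at a given index (described under rbind() above) behaves, here is a small self-contained sketch. The function name insert_row_at and the sample data are hypothetical and chosen only for illustration. The new row is passed as a one-row data frame so that the column types are preserved.

```r
# Hypothetical helper: insert `newrow` so that it becomes row `r` of `df`
insert_row_at <- function(df, newrow, r) {
  n <- nrow(df)
  combined <- rbind(df, newrow)
  # Existing rows get sort keys 1..n; the new row gets r - 0.5,
  # which places it immediately before the old row r
  combined <- combined[order(c(seq_len(n), r - 0.5)), , drop = FALSE]
  row.names(combined) <- NULL  # reset row names to 1..(n + 1)
  combined
}

d <- data.frame(a = c(1, 2, 3),
                b = c("a", "b", "c"),
                stringsAsFactors = FALSE)
new <- data.frame(a = 9, b = "z", stringsAsFactors = FALSE)

out <- insert_row_at(d, new, 2)
out$a  # 1 9 2 3
out$b  # "a" "z" "b" "c"
```

Resetting the row names at the end keeps them consecutive. Without that step the inserted row would keep the row name it received when it was appended.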
    
    

In the next tutorial we look at some more functions that can be applied to data frames.
