In this tutorial we will look at Reshape R package, reshape2, melt and cast functions.
Reshape R package – Introduction and concepts
In this tutorial on data frame we saw the reshape function that can be used to convert multiple columns of a dataframe to multiple rows. In the current tutorial we look at the reshape package that was designed and developed by Hadley Wickham to overcome the shortcomings of the reshape and other related functions in base R. Before we look at the methods in the package, lets look at some concepts.
Why Reshape R Package?
The data obtained as a result of an experiment or study is generally different from data required as an input to the analytic functions. Generally the data from a study has one or more columns that uniquely identify a row followed by a number of columns that represent the values measured. The columns that uniquely identify the row can be thought of as composite key of a database column. For example, consider a data frame representing climate statistics. The unique columns identifying a row can be city name and month. (Each row gives monthly statistics for a particular city). The table can report multiple statistics such as min and max temperatures, max wind speed, max humidity and so on. Each statistic is a column in this table. In most cases we will need to convert this table so that it contains the id columns and ONE column representing the statistic. Here’s a diagrammatic representation of what we want.
Definitions of terms in reshape R package
Lets define some terms :
-
Identifier (id)
– These variables can uniquely identify a row. In the example above city name and month are the identifiers for the first table and city name, month and Variable are the identifiers for the second table.
-
Measured Variables
These are the variables or columns that contain the values. In the first table Min Temp, Max Temp and Max Wind are the measured variables. In the second table the column called “value” is the measured variable.
-
Melting
– This is the process that produces table 2 from table 1.
-
Molten Table
– A molten table has id columns (one of which is a column called ‘variable’) and one column called ‘value’.
Note that all the measured variables have to be of the same type (numeric, factor or date) since they will be saved in a single column called ‘value’ after melting.
melt() example
We haven’t yet looked at casting, but lets see an example of melting before looking at casting. We will use the airquality data frame in the ‘datasets’ package.
> library(datasets) > str(airquality) 'data.frame': 153 obs. of 6 variables: $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ... $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ... $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ... $ Temp : int 67 72 74 62 56 66 65 59 61 69 ... $ Month : int 5 5 5 5 5 5 5 5 5 5 ... $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
The data frame contains 6 variables and 153 observations. The variables Day and Month are the “id” variables and the others are the measured variables. We would like to melt this data frame. Here’s how it is done :
> molted = melt(airquality,id.vars=c("Month","Day"),measured.vars=c("Ozone","Solar.R","Wind","Temp")) > str(molted) 'data.frame': 612 obs. of 4 variables: $ Month : int 5 5 5 5 5 5 5 5 5 5 ... $ Day : int 1 2 3 4 5 6 7 8 9 10 ... $ variable: Factor w/ 4 levels "Ozone","Solar.R",..: 1 1 1 1 1 1 1 1 1 1 ... $ value : num 41 36 12 18 NA 28 23 19 8 NA ...
The function melt is use to melt the airquality data frame. The id variables are Month and day. The other variables are specified as measured variables. The resulting object contains 4 columns as expected. Lets look at 10 random rows from this object.
> molted[sample(nrow(molted),10),] Month Day variable value 32 6 1 Ozone NA 353 6 16 Wind 14.9 548 7 28 Temp 88.0 85 7 24 Ozone 80.0 211 6 27 Solar.R 47.0 294 9 18 Solar.R 27.0 394 7 27 Wind 12.0 554 8 3 Temp 82.0 164 5 11 Solar.R NA 3 5 3 Ozone 12.0
We notice that there are rows where the value is NA and we dont really need these rows. We can pass in na.rm=TRUE to remove NA values
> molted = melt(airquality,id.vars=c("Month","Day"),measured.vars=c("Ozone","Solar.R","Wind","Temp"),na.rm=TRUE) > molted[sample(nrow(molted),10),] Month Day variable value 355 8 1 Wind 6.9 565 9 27 Temp 77.0 523 8 16 Temp 77.0 421 5 6 Temp 66.0 476 6 30 Temp 83.0 307 6 14 Wind 13.8 301 6 8 Wind 6.9 346 7 23 Wind 11.5 290 5 28 Wind 12.0 535 8 28 Temp 97.0 > str(molted) 'data.frame': 568 obs. of 4 variables: $ Month : int 5 5 5 5 5 5 5 5 5 5 ... $ Day : int 1 2 3 4 6 7 8 9 11 12 ... $ variable: Factor w/ 4 levels "Ozone","Solar.R",..: 1 1 1 1 1 1 1 1 1 1 ... $ value : num 41 36 12 18 28 23 19 8 7 16 ...
The molten object now does not contain rows with NA values. There are only 568 rows now instead of 612 that we saw earlier. You can also omit measured.vars if you specify id.vars. You can also specify another name for the “variable” column
> molted = melt(airquality,id.vars=c("Month","Day"),measured.vars=c("Ozone","Solar.R","Wind"),variable_name="measured") > str(molted) 'data.frame': 612 obs. of 4 variables: $ Month : int 5 5 5 5 5 5 5 5 5 5 ... $ Day : int 1 2 3 4 5 6 7 8 9 10 ... $ measured: Factor w/ 4 levels "Ozone","Solar.R",..: 1 1 1 1 1 1 1 1 1 1 ... $ value : num 41 36 12 18 NA 28 23 19 8 NA ...
cast()
Once a data frame is molten you can then modify the molten data frame to whatever form that is required for analysis. This is known as casting. In its simplest form it gets back the original dataframe that was used in the melt function, however, that’s not much fun. Lets see why cast is so good. You can also pass in another argument to cast. This argument is a formula of type col_1+col_2+col_3~row_1+row_2+row_3. On the left side of the ~ operator you specify all variables that you want as columns and on the right side of ~ you specify all variables that you want as rows. If there are multiple variables on either side then separate them by +. Lets see how to get the original dataframe using a formula
> head(cast(molted,Month+Day~variable)) Month Day Ozone Solar.R Wind Temp 1 5 1 41 190 7.4 67 2 5 2 36 118 8.0 72 3 5 3 12 149 12.6 74 4 5 4 18 313 11.5 62 5 5 5 NA NA 14.3 56 6 5 6 28 NA 14.9 66
We can also use the … operator. This operator instructs R to use all variables that have not been used in the formula.
> head(cast(molted,...~variable)) Month Day Ozone Solar.R Wind Temp 1 5 1 41 190 7.4 67 2 5 2 36 118 8.0 72 3 5 3 12 149 12.6 74
Lets see what happens if we use only one of Month or Day
> cast(molted,Month~variable,fun.aggregate=mean) Month Ozone Solar.R Wind Temp 1 5 23.61538 181.2963 11.622581 65.54839 2 6 29.44444 190.1667 10.266667 79.10000 3 7 59.11538 216.4839 8.941935 83.90323 4 8 59.96154 171.8571 8.793548 83.96774 5 9 31.44828 167.4333 10.180000 76.90000
Here We specified only Month and not Day, The values that we get are for a particular month. Its quite obvious that we need to use a function that tell how to combine values for a month from multiple days. In our example we tell it to get a mean.
We can get a transposed version too.
> cast(molted,variable~Month,fun.aggregate=mean) variable 5 6 7 8 9 1 Ozone 23.61538 29.44444 59.115385 59.961538 31.44828 2 Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333 3 Wind 11.62258 10.26667 8.941935 8.793548 10.18000 4 Temp 65.54839 79.10000 83.903226 83.967742 76.90000
Another useful feature is to use . (dot) which signifies no variables.
> cast(molted,variable~.,fun.aggregate=mean) variable (all) 1 Ozone 42.129310 2 Solar.R 185.931507 3 Wind 9.957516 4 Temp 77.882353
In the above example we get a mean for all months and Days. We can also do this
> cast(molted,Month~.,fun.aggregate=mean) Month (all) 1 5 68.70696 2 6 87.38384 3 7 93.49748 4 8 79.71207 5 9 71.82689
Of course, in this case it does not make sense to average out Ozone, Wind etc, but this just demonstrates what’s possible.
We can also create multi dimensional arrays. If you have two sets of hierarchies then this can help in create aggregates in multiple dimensions. For example if you had a data set that has sales with location hierarchy (state, cities etc) and time hierarchy (Months, days etc) then you can create arrays that give sales aggregated by state for a month and so on. Here an example
> dim(cast(molted,Month~Day~variable,fun.aggregate=mean)) [1] 5 31 4
You can use the | operator to create lists (not all output is shown below)
> print(cast(molted,variable~Day|Month,fun.aggregate=mean)) $`5` variable 1 2 3 4 5 6 7 8 9 10 11 12 1 Ozone 41.0 36 12.0 18.0 NaN 28.0 23.0 19.0 8.0 NaN 7.0 16.0 2 Solar.R 190.0 118 149.0 313.0 NaN NaN 299.0 99.0 19.0 194.0 NaN 256.0 3 Wind 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 6.9 9.7 4 Temp 67.0 72 74.0 62.0 56.0 66.0 65.0 59.0 61.0 69.0 74.0 69.0 .... $`6` variable 1 2 3 4 5 6 7 8 9 10 11 1 Ozone NaN NaN NaN NaN NaN NaN 29.0 NaN 71.0 39.0 NaN 2 Solar.R 286.0 287.0 242.0 186.0 220.0 264.0 127.0 273.0 291.0 323.0 259.0 3 Wind 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 10.9 4 Temp 78.0 74.0 67.0 84.0 85.0 79.0 82.0 87.0 90.0 87.0 93.0 ..
You can create margins by setting margins=TRUE. Margins can be thought of as “Column totals” or “Row Totals” or “Grand Total”
> cast(molted,variable~Month,fun.aggregate=mean,margins=TRUE) variable 5 6 7 8 9 (all) 1 Ozone 23.61538 29.44444 59.115385 59.961538 31.44828 42.129310 2 Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333 185.931507 3 Wind 11.62258 10.26667 8.941935 8.793548 10.18000 9.957516 4 Temp 65.54839 79.10000 83.903226 83.967742 76.90000 77.882353 5 (all) 68.70696 87.38384 93.497479 79.712069 71.82689 80.057218
Its also possible to pass in multiple functions (so far we have been using only mean)
> cast(molted,variable~Month,fun.aggregate=c(mean,sum)) variable 5_mean 5_sum 6_mean 6_sum 7_mean 7_sum 8_mean 8_sum 1 Ozone 23.61538 614.0 29.44444 265 59.115385 1537.0 59.961538 1559.0 2 Solar.R 181.29630 4895.0 190.16667 5705 216.483871 6711.0 171.857143 4812.0 3 Wind 11.62258 360.3 10.26667 308 8.941935 277.2 8.793548 272.6 4 Temp 65.54839 2032.0 79.10000 2373 83.903226 2601.0 83.967742 2603.0 9_mean 9_sum 1 31.44828 912.0 2 167.43333 5023.0 3 10.18000 305.4 4 76.90000 2307.0
I hope that at the end of this tutorial you are as amazed by this package as I am. Its truly quite powerful, but needless to say, be cautious while using it for huge amounts of data, for everything happens in memory.
Hi,
I explicitly stated the id vars and measured vars as in df2 <- melt(), but still got Error: id variables not found in data: source_id. with source_id being the first column of several factors(loc,ctr,week,yr..) & 10 measured cols into (Events) & EvntCount . I had to use Notepad ++ to remove utf-8-bom (ï..¿source_id) characters from the source_id column name. Error: id variables not found in data: source_id.. Any sugesstions?