Reshape data from wide to long or from long to wide in R

reshape data wide long r

Reshaping though frequently required in data analysis, so often it remains confusing even if you are frequent user of reshape function. Here I have provided an simple example to elaborate more on each argument of reshape.

Reshape data from wide to long

Your data in wide form if the multiple observations of item, place or person (i.e. units) has been recorded in single row (but in multiple column). This multiple observations may be of repeated measure type (observation are made repeatedly at different time point) or multiple characteristics of some unit (eg. height, length and width of square).

Let’s take an example of wide data of repeated measure type. Person A is visiting clinic every month for their blood pressure check, and nurse note down the value for each month in single row which belong to this specific person A. See example below,

# wide data
df <- data.frame(matrix(data = NA, nrow = 3, ncol = 5, dimnames = list(NULL, paste0(c("id","name","jan","feb","march")))))
df[1,] <- c(1,"A",123,120,125)
df[2,] <- c(2,"B",140,150,155)
df[3,] <- c(3,"C",96,86,97)

Our “wide” data look like

kable(df,format = "pandoc", caption = "wide data")
Table 1: wide data
id name jan feb march
1 A 123 120 125
2 B 140 150 155
3 C 96 86 97

We want a data in which each new observation in new row but within same column i.e. long data. To achieve that we need to use standard reshape command in R. The reshape has following argument:

  • idvar : unique identifier for person,place or object on which observations(measurements) are made at different time points or repeatedly. Example Case ID

  • varying : if observation for specific individuals are made at different time points, in which columns values are recorded i.e time-varying columns. Example Jan, Feb, March

  • timevar: what will be the name of column once the time-varying columns above has been staked in rows. Example Month

  • times : what will be the values (of time) once the time-varying columns above has been staked in rows. Example Jan, Feb, March

  • v.names: what will be the values (of observations) once the time-varying columns above has been staked in rows. Example BP

  • direction: data needs to converted from wide to long format.

df_long <- reshape(df,
idvar = "id",  
        #[unique identifier for person,place or object on which observations(measurments) are made at different time points or repeatedly]

varying = c("jan","feb","march"),  
        # [if observation for specific individulas are made at different time points, in which columns values are recorded i.e time-varying columns ]

timevar = "month", 
        # [what will be the name of column once the timevarying columns above has been staked in rows]

times = c("jan","feb","march"),
        # [what will be the values (of time) once the timevarying columns above has been staked in rows]

v.names = "BP", 
        # [what will be the values (of observations) once the timevarying columns above has been staked in rows]

direction = "long")   
        # [we want to convert wide df into long one])

Our “long” data look like

df_long <- arrange(df_long, id)

kable(df_long,format = "pandoc", caption = "long data")
Table 2: long data
id name month BP
1 A jan 123
1 A feb 120
1 A march 125
2 B jan 140
2 B feb 150
2 B march 155
3 C jan 96
3 C feb 86
3 C march 97

Sometime, not only one type of measurement (BP) but also other types (such as heart rate-HR) are measured and recorded row wise. For example,

# wide data
df <- data.frame(matrix(data = NA, nrow = 3, ncol = 8, dimnames = list(NULL, paste0(c("id","name","BP_jan","BP_feb","BP_march","HR_jan","HR_feb","HR_march")))))
df[1,] <- c(1,"A",123,120,125,72,70,71)
df[2,] <- c(2,"B",140,150,155,85,82,86)
df[3,] <- c(3,"C",96,86,97,65,52,59)
kable(df,format = "pandoc", caption = "wide data- multiple category")
Table 3: wide data- multiple category
id name BP_jan BP_feb BP_march HR_jan HR_feb HR_march
1 A 123 120 125 72 70 71
2 B 140 150 155 85 82 86
3 C 96 86 97 65 52 59

This data can be converted into “long” by using list for group of time-varying columns for varying

df_long <- reshape(df,
idvar = "id",  
varying = list(c("BP_jan","BP_feb","BP_march"),c("HR_jan","HR_feb","HR_march") ),  
timevar = "month", 
times = c("jan","feb","march"),
v.names = c("BP","HR"), 
direction = "long")   
df_long <- arrange(df_long, id)
kable(df_long,format = "pandoc", caption = "long data")
Table 4: long data
id name month BP HR
1 A jan 123 72
1 A feb 120 70
1 A march 125 71
2 B jan 140 85
2 B feb 150 82
2 B march 155 86
3 C jan 96 65
3 C feb 86 52
3 C march 97 59

Reshape data from long to wide

To make data “wide” from long, the reshape function will need only two main arguments

  • idvar: unique identifier of unit on which measurement are made

  • timevar : which column represent the timing of the observations ( so that reshape function associate it with the value for given time for each ID )

If you do not specify above two arguments, function will drop an error-

Error in [.data.frame (data, , idvar) : undefined columns selected

If you read above error carefully, it already specifying which arguments were missing. Here in above case missing argument was idvar.

You can optionally provide,

  • v.names : which column represent values of the observations in long data (so that reshape function can transform these values into rows for each ID)

  • sep: column names in wide format are going to be created using value of timesvar and integers. Specify how both will be seperated in column names.

Here is the example

df_wide <- reshape(df_long,
       idvar = "id",
       # unique identifier
       timevar = "month",
       # the column represent the timing of the observations
       v.names = c("BP","HR"),
       # the columns represent the value of the observation (BP,HR)
       direction = "wide",
       sep = "_"
        )

Here is the wide data

Table 5: wide data
id name BP_jan HR_jan BP_feb HR_feb BP_march HR_march
1 1 A 123 72 120 70 125 71
4 2 B 140 85 150 82 155 86
7 3 C 96 65 86 52 97 59