Purpose

In this notebook session, I’ll be exploring different methods of recoding variables in the R language. I intend to explore solutions in base R and in the dplyr and plyr packages.


Recoding with dplyr

One of the most used packages for data transformations in R is dplyr from the tidyverse. We can load the dplyr and magrittr (for the %>% operator) packages individually, or we can choose to load the tidyverse, which attaches dplyr (which itself makes %>% available) along with other packages.


#Load in tidyverse.
library(tidyverse)


Example Data-1

To practice recoding with dplyr, let’s create a dataframe to practice on and call it example_data_1. We’ll make it a tibble since we’re using dplyr.


#Setting a seed for reproducibility.
set.seed(99)

#Creating the example data frame.
example_data_1 <- tibble("ID" = c(1:10),
                         "Gender" = sample(0:3, 10, replace = TRUE),
                         "Levels" = sample(c("Low","Moderate","High"), 10, replace = TRUE),
                         "Medication Type" = sample(1:5, 10, replace = TRUE))


Which gives us…

ID Gender Levels Medication Type
1 3 Moderate 1
2 0 Low 2
3 3 High 5
4 1 Low 2
5 1 Moderate 5
6 0 Moderate 4
7 2 Moderate 2
8 1 Low 1
9 1 Moderate 4
10 1 Moderate 4


case_when() function

The dplyr package has a case_when function that is a “go-to” for most situations you’ll encounter! Let’s use it to recode our Medication Type variable with the following codes:


Medication Category Code
Antidepressants 1
Anxiolytics 2
Stimulants 3
Antipsychotics 4
Mood Stabilizers 5


We can apply the function by directly mutating the variable in the data set with tidy code that utilizes magrittr’s pipe operator. You can read more about the tidyverse style of coding like this here.


#Using tidyverse-styled code to mutate the "Medication Type" variable to recode the numbers into strings.
example_data_1 <- example_data_1 %>%
  mutate(`Medication Type` = case_when(`Medication Type` == 1 ~ "Antidepressants",
                                       `Medication Type` == 2 ~ "Anxiolytics",
                                       `Medication Type` == 3 ~ "Stimulants",
                                       `Medication Type` == 4 ~ "Antipsychotics",
                                       `Medication Type` == 5 ~ "Mood Stabilizers"))


In this block, we’re using the mutate function to change the Medication Type variable. It’s here that we apply the case_when function to the variable to recode our numeric codes to strings.


The result…

ID Gender Levels Medication Type
1 3 Moderate Antidepressants
2 0 Low Anxiolytics
3 3 High Mood Stabilizers
4 1 Low Anxiolytics
5 1 Moderate Mood Stabilizers
6 0 Moderate Antipsychotics
7 2 Moderate Anxiolytics
8 1 Low Antidepressants
9 1 Moderate Antipsychotics
10 1 Moderate Antipsychotics


We can also use case_when to convert existing strings to numeric values. We can test this out by recoding the Medication Type variable back to its numeric codes…


#Using tidyverse-styled code to mutate the "Medication Type" variable back into numeric values.
example_data_1 <- example_data_1 %>%
  mutate(`Medication Type` = case_when(`Medication Type` == "Antidepressants" ~ 1,
                                       `Medication Type` == "Anxiolytics" ~ 2,
                                       `Medication Type` == "Stimulants" ~ 3,
                                       `Medication Type` == "Antipsychotics" ~ 4,
                                       `Medication Type` == "Mood Stabilizers" ~ 5))


Which gives us our original table back:

ID Gender Levels Medication Type
1 3 Moderate 1
2 0 Low 2
3 3 High 5
4 1 Low 2
5 1 Moderate 5
6 0 Moderate 4
7 2 Moderate 2
8 1 Low 1
9 1 Moderate 4
10 1 Moderate 4


The case_when() function is great to use for more complex/conditional recoding as well. Check out the tidyverse reference page for more possibilities/limitations!
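As a rough sketch of the more conditional recoding mentioned above, the block below recodes a small made-up data frame (the meds data frame and its code and dose columns are hypothetical, not part of example_data_1). It assumes dplyr is installed, and it uses a final TRUE condition as a catch-all so that values matching no earlier condition get an explicit label instead of NA.

```r
library(dplyr)

# Hypothetical data: a medication code (including an unknown code, 9) and a dose.
meds <- data.frame(code = c(1, 2, 9, 4),
                   dose = c(10, 50, 20, 200))

meds <- meds %>%
  mutate(
    # Conditions are checked in order; TRUE catches anything left over.
    label = case_when(code == 1 ~ "Antidepressants",
                      code == 2 ~ "Anxiolytics",
                      code == 4 ~ "Antipsychotics",
                      TRUE ~ "Unknown"),
    # Range-based recoding: the first condition that is TRUE wins.
    dose_band = case_when(dose < 25 ~ "Low",
                          dose < 100 ~ "Moderate",
                          TRUE ~ "High")
  )
```

Without the TRUE line, the row with code 9 would be recoded to NA, which may or may not be what you want.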



recode() function

The dplyr package also has a recode() function that can be useful. Let’s use it to recode our Levels variable with the following codes:


Levels Category Code
Low 1
Moderate 2
High 3


To do this, we can directly apply and assign the function to the variable we wish to recode…


#Applying the recode function to the "Levels" variable in the data set.
example_data_1$Levels <- recode(example_data_1$Levels, 
                                "Low" = 1,
                                "Moderate" = 2,
                                "High" = 3)
#Alternatively (run one approach or the other, not both), we can use the tidyverse-style as well...
example_data_1 <- example_data_1 %>%
  mutate(Levels = recode(Levels,
                         "Low" = 1,
                         "Moderate" = 2,
                         "High" = 3))


Either way, we get the same result…


ID Gender Levels Medication Type
1 3 2 1
2 0 1 2
3 3 3 5
4 1 1 2
5 1 2 5
6 0 2 4
7 2 2 2
8 1 1 1
9 1 2 4
10 1 2 4


Bare numbers cannot be used as the names of the old values


Note that if we try to use the recode() function with bare numbers as the names of the old values, an error will be thrown. We can see this by trying to convert the Levels variable back into its string categories.


#Attempting to convert numeric data types with the "recode" function...
example_data_1 <- example_data_1 %>%
  mutate(Levels = recode(Levels,
                         1 = "Low",
                         2 = "Moderate",
                         3 = "High"))
## Error: <text>:4:28: unexpected '='
## 3:   mutate(Levels = recode(Levels,
## 4:                          1 =
##                               ^


We get an error telling us that ‘=’ was unexpected. This is a syntax error raised by R itself before recode() ever runs: an argument name cannot be a bare number. To use the recode() function with numeric types, you can either quote the names with backticks (for example, `1` = "Low") or convert the numbers to strings first and then run the function…


#Converting the numeric values to strings first, then applying the "recode" function...
example_data_1 <- example_data_1 %>%
  mutate(Levels = recode(as.character(Levels),
                         "1" = "Low",
                         "2" = "Moderate",
                         "3" = "High"))


Which gives us…

ID Gender Levels Medication Type
1 3 Moderate 1
2 0 Low 2
3 3 High 5
4 1 Low 2
5 1 Moderate 5
6 0 Moderate 4
7 2 Moderate 2
8 1 Low 1
9 1 Moderate 4
10 1 Moderate 4
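The parse error shown earlier comes from R's syntax rather than from recode() itself, so numeric values can also be recoded without converting them to strings. The sketch below assumes dplyr is installed and uses a made-up vector (levels_numeric is hypothetical) to show numeric old values written as backtick-quoted names.

```r
library(dplyr)

# Hypothetical numeric codes to turn back into labels.
levels_numeric <- c(2, 1, 3, 1)

# Backticks make the numbers valid argument names, so no as.character() is needed.
levels_text <- recode(levels_numeric,
                      `1` = "Low",
                      `2` = "Moderate",
                      `3` = "High")
```

Whether backticks or as.character() reads better is mostly a matter of taste; both give the same labels here.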


It’s important to note that as of 04/26/2021 the lifecycle for the recode() function has a questioning status because of the order in which the function takes in values. There is a possibility a new function will be created to replace this one. Additionally, there is a specific function for recoding factors as well called recode_factor() that can be read about here, although the forcats package could be used for easier factor processing.


if_else() function

Although you can use case_when for cleaner code that conditionally recodes variables, maybe you want to try dplyr’s if_else(). This function is similar to base R’s ifelse() but can be a bit faster and is stricter about the data types it allows. We can use it to recode variables as well. Let’s use it to recode the ID variable in our set with the following codes…


IDs Category Code
South Wing 1-3
East Wing 4-7
North Wing 8-10


We can use the mutate() function to conditionally recode the ID variable with the following…


#Using the if_else function with mutate for conditional recoding.
example_data_1 <- example_data_1 %>%
  mutate(ID = if_else(ID <= 3, "South Wing",
                      if_else(ID > 3 & ID <= 7, "East Wing","North Wing")))


Which gives us this…

ID Gender Levels Medication Type
South Wing 3 Moderate 1
South Wing 0 Low 2
South Wing 3 High 5
East Wing 1 Low 2
East Wing 1 Moderate 5
East Wing 0 Moderate 4
East Wing 2 Moderate 2
North Wing 1 Low 1
North Wing 1 Moderate 4
North Wing 1 Moderate 4


Note that conditional recoding like this can also be done in the case_when() function we reviewed previously.
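For comparison, here is a minimal sketch of the same wing recoding written with case_when() instead of nested if_else() calls. It assumes dplyr is installed and works on a small stand-in data frame (wards is a hypothetical name) rather than on example_data_1, and it writes the result to a new Wing column so the original IDs are kept.

```r
library(dplyr)

# Hypothetical stand-in for the ID column of the example data.
wards <- data.frame(ID = 1:10)

wards <- wards %>%
  mutate(Wing = case_when(ID <= 3 ~ "South Wing",  # IDs 1-3
                          ID <= 7 ~ "East Wing",   # IDs 4-7 (1-3 already matched above)
                          TRUE ~ "North Wing"))    # everything else, here IDs 8-10
```

Because conditions are evaluated in order, the second condition does not need to repeat ID > 3.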



Recoding with plyr


mapvalues() function

If we have vectors of set values we want to recode with, we can use the mapvalues function from the plyr package. Note that this method only works when the vectors of old and new values are the same length. Let’s explore this by repeating the recoding for the Medication Type variable. Let’s look at our codes again for a reminder:


Medication Category Code
Antidepressants 1
Anxiolytics 2
Stimulants 3
Antipsychotics 4
Mood Stabilizers 5


Let’s say we have a vector of the string medication categories. We can remap the values in our original data set with the following…


#Creating a vector of medication categories.
med_cats <- c("Antidepressants","Anxiolytics","Stimulants","Antipsychotics","Mood Stabilizers")
med_codes <- 1:5

#If you're solely using plyr, you can load in the plyr package. Note that if you have dplyr loaded as well, you will get a warning that plyr is masking a lot of functions in dplyr. In this case, it's best to use plyr functions by directly calling its namespace. This is shown below:

#Using the mapvalues function for recoding.
example_data_1$`Medication Type` <- plyr::mapvalues(example_data_1$`Medication Type`,
                                                    from = med_codes,
                                                    to = med_cats)
## The following `from` values were not present in `x`: 3


Which results in…


ID Gender Levels Medication Type
South Wing 3 Moderate Antidepressants
South Wing 0 Low Anxiolytics
South Wing 3 High Mood Stabilizers
East Wing 1 Low Anxiolytics
East Wing 1 Moderate Mood Stabilizers
East Wing 0 Moderate Antipsychotics
East Wing 2 Moderate Anxiolytics
North Wing 1 Low Antidepressants
North Wing 1 Moderate Antipsychotics
North Wing 1 Moderate Antipsychotics


Note that the first argument is the object you wish to change. The second (from=) is a set of values that you wish to find within your object to change, and the last argument (to =) is the set of values you wish to replace with.


Let’s try to recode the ID variable back into numbers to test whether we can use a vector with duplicate values as the from input for the mapvalues function. Because the values are duplicated, we can set the warn_missing argument to FALSE to prevent any messages about unmatched from values from printing to the console. Let’s try this out by attempting to replace these values with a range of integers…


#Using mapvalues with existing dataframe columns and number ranges.
example_data_1$ID <- plyr::mapvalues(example_data_1$ID, 
                                     from = example_data_1$ID,
                                     to = 1:10,
                                     warn_missing = FALSE)


This code ran without error, but let’s see the dataframe…

ID Gender Levels Medication Type
1 3 Moderate Antidepressants
1 0 Low Anxiolytics
1 3 High Mood Stabilizers
4 1 Low Anxiolytics
4 1 Moderate Mood Stabilizers
4 0 Moderate Antipsychotics
4 2 Moderate Anxiolytics
8 1 Low Antidepressants
8 1 Moderate Antipsychotics
8 1 Moderate Antipsychotics


Not what we were expecting. The mapvalues() function did recode our variable, but because the from vector contained duplicates, every occurrence of a value received the replacement that sits at the position of its first occurrence (1, 4, and 8). This is definitely a limitation of the mapvalues() function. Regardless, mapvalues can be really convenient if you have a lot of values to recode, as it only requires creating the vectors once. It’s also important to note that you can’t do conditional recoding with this unless you transform your values first in the from= vector. Because of this, mapvalues is good for quick, basic recoding when vectors of unique values are present or created for the purpose of recoding.
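The 1, 4, 8 pattern can be reproduced in base R. The sketch below assumes that mapvalues() resolves each value by the position of its first match in from, which is how base R's match() behaves; the wings vector is a made-up stand-in for the recoded ID column. It also shows one way to avoid the problem: build the lookup from the unique values only.

```r
# Hypothetical stand-in for the ID column after it was recoded into wings.
wings <- c(rep("South Wing", 3), rep("East Wing", 4), rep("North Wing", 3))

# match() returns the position of the FIRST match, so duplicates collapse.
first_positions <- match(wings, wings)

# Building the lookup from unique values gives one code per category instead.
wing_names <- unique(wings)
wing_codes <- seq_along(wing_names)[match(wings, wing_names)]
```

Here first_positions repeats 1, 4, and 8, while wing_codes assigns 1, 2, and 3 to the three wings in order of first appearance.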


For instances like this when we just want to recode something into a range of numbers it can simply be applied as such…


#Recoding ID variable simply with desired number ranges.
example_data_1$ID <- 1:10

revalue() function


Another function in the plyr package is revalue(). This function recodes character and factor vectors only. Because of this limitation, most people using plyr for recoding will opt for the mapvalues() function. The revalue function can still be useful as a light form of data validation when you want to be sure that the data in question is in fact a character or factor vector.


As an example, let’s try to convert the Gender variable with the revalue function to its appropriate categories with the following codes:


Gender Category Code
Female 0
Male 1
Transgender 2
Non-Binary 3

Because the input needs to be a character or factor, we can coerce the Gender variable to fit this requirement. Let’s change it into a character vector with the as.character() function:


#Using the revalue function for recoding.
example_data_1$Gender <- plyr::revalue(as.character(example_data_1$Gender),
                                       replace = c("0" = "Female", "1" = "Male", "2" = "Transgender", "3" = "Non-Binary"))


Which gives us…


ID Gender Levels Medication Type
1 Non-Binary Moderate Antidepressants
2 Female Low Anxiolytics
3 Non-Binary High Mood Stabilizers
4 Male Low Anxiolytics
5 Male Moderate Mood Stabilizers
6 Female Moderate Antipsychotics
7 Transgender Moderate Anxiolytics
8 Male Low Antidepressants
9 Male Moderate Antipsychotics
10 Male Moderate Antipsychotics

Recoding with Base R

Maybe you want to stay in base R and don’t want to deal with alternative packages. Although the previously mentioned packages can help make recoding efficient, they aren’t the only way.


Example Data-2

We’ll create a second example dataframe for the rest of the notebook…

#Setting a seed for reproducibility.
set.seed(1234)

#Creating the example data frame.
example_data_2 <- data.frame("ID" = c(1:10), 
                           "Gender" = sample(0:3, 10, replace = TRUE),
                           "Illness" = sample(1:3, 10, replace = TRUE),
                           "Severity" = sample(c("Low","Moderate","High"), 10, replace = TRUE),
                           "Medications" = sample(0:1, 10, replace = TRUE))


ID Gender Illness Severity Medications
1 3 2 Low 1
2 3 3 High 0
3 1 2 High 1
4 1 2 High 1
5 0 2 Low 1
6 3 3 Moderate 0
7 2 2 Low 0
8 0 2 Moderate 0
9 0 2 Moderate 0
10 1 2 High 1


More often than not, we’ll see data like this where categorical variables will be numerically coded. Depending on the analyses, we may need to switch back and forth. Let’s recode the gender variable into categories. In this example our codes are the following:


Gender Category Code
Female 0
Male 1
Transgender 2
Non-Binary 3

Named Vectors

We can recode our Gender variable with a named vector where we directly give names to the values that are already present in our data frame. Let’s call ours gender_codes and then directly apply it to our Gender variable in our example_data_2 data set…


#Creating the named vector for gender.
gender_codes <- c("Female" = 0, 
                  "Male" = 1,
                  "Transgender" = 2,
                  "Non-Binary" = 3)

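#Applying the named vector to the gender variable in the original data set. Note how we convert the gender variable to a factor and then wrap the "names" function around everything. This relies on every code (0-3) appearing in the data, so that the factor's integer positions line up with the order of gender_codes.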
example_data_2$Gender <- names(gender_codes[as.factor(example_data_2$Gender)])

Which gives us….

ID Gender Illness Severity Medications
1 Non-Binary 2 Low 1
2 Non-Binary 3 High 0
3 Male 2 High 1
4 Male 2 High 1
5 Female 2 Low 1
6 Non-Binary 3 Moderate 0
7 Transgender 2 Low 0
8 Female 2 Moderate 0
9 Female 2 Moderate 0
10 Male 2 High 1


We’re able to do this by converting our original Gender variable to a factor, subsetting it inside of our gender_codes vector and applying the resulting names into the Gender variable.
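One caveat with the factor approach is that subsetting by a factor uses the factor's integer positions, so it only lines up when every code appears in the data. The sketch below uses a made-up vector with codes 0 and 2 missing (gender_subset is hypothetical) and shows an alternative that looks values up with match() instead. It is one possible variation, not the only way to do this.

```r
# The same named vector of codes used above.
gender_codes <- c("Female" = 0,
                  "Male" = 1,
                  "Transgender" = 2,
                  "Non-Binary" = 3)

# Hypothetical data in which codes 0 and 2 never occur.
gender_subset <- c(3, 3, 1, 1, 3)

# Look each value up by its code, regardless of which codes are present.
by_value <- names(gender_codes)[match(gender_subset, gender_codes)]

# The factor approach now has only two levels ("1" and "3"), so positions shift.
by_factor <- names(gender_codes[as.factor(gender_subset)])
```

With this input, by_value gives the intended labels, while by_factor labels the 3s as "Male" and the 1s as "Female".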




Vector Indexing

In base R we can recode variables with vector indexing. This approach can be used if you have a few values that need to be recoded. For this example, let’s recode the Illness variable with the following codes:

Illness Category Code
Bipolar I 1
Bipolar II 2
Cyclothymia 3


Looking at our data set, we see that there are no observations with the value 1 (“Bipolar I”). With that knowledge, we know that we only have to recode values 2 and 3.


#Accessing the "Illness" vector to convert 2's into "Bipolar II".
example_data_2$Illness[example_data_2$Illness == 2] <- "Bipolar II"

#Accessing the "Illness" vector to convert 3's into "Cyclothymia".
example_data_2$Illness[example_data_2$Illness == 3] <- "Cyclothymia"

#Note that trying to recode the value "1" will not result in any errors, even though there aren't any 1s present. This code will run.  
example_data_2$Illness[example_data_2$Illness == 1] <- "Bipolar I"


Our result…

ID Gender Illness Severity Medications
1 Non-Binary Bipolar II Low 1
2 Non-Binary Cyclothymia High 0
3 Male Bipolar II High 1
4 Male Bipolar II High 1
5 Female Bipolar II Low 1
6 Non-Binary Cyclothymia Moderate 0
7 Transgender Bipolar II Low 0
8 Female Bipolar II Moderate 0
9 Female Bipolar II Moderate 0
10 Male Bipolar II High 1

Vector indexing can be great in a pinch, but can get a bit messy the more values you have. This approach also won’t let you know if any of the values you’ve declared are not present in your data, which could lead to potential issues at some point.
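Since vector indexing stays silent about absent or unexpected values, a quick check before recoding can help. The sketch below uses base R's setdiff() on a copy of the Illness codes from the table above (the illness and expected_codes names are made up for illustration).

```r
# The Illness codes as they appear in example_data_2 before recoding.
illness <- c(2, 3, 2, 2, 2, 3, 2, 2, 2, 2)

# The codes we plan to recode.
expected_codes <- c(1, 2, 3)

# Values in the data that we have no recoding rule for (none here).
unexpected <- setdiff(illness, expected_codes)

# Codes we declared that never occur in the data (1, "Bipolar I").
unused <- setdiff(expected_codes, illness)
```

An empty unexpected vector means every value will be recoded; a non-empty unused vector is only informational.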



If-Else Statements

In base R, we can also use if-else chains to recode variables. This code can get messier the more values you have to recode. If this method is used for recoding, it might be best to limit it to recoding two or three values. For this example, let’s recode the Medications variable. Our codes for this variable are the following:

Medications Category Code
No 0
Yes 1


To recode this variable, we can use the ifelse() function…

#Applying the if-else statement to the Medications variable.
example_data_2$Medications <- ifelse(example_data_2$Medications == 0,"No","Yes")


Which gives us….

ID Gender Illness Severity Medications
1 Non-Binary Bipolar II Low Yes
2 Non-Binary Cyclothymia High No
3 Male Bipolar II High Yes
4 Male Bipolar II High Yes
5 Female Bipolar II Low Yes
6 Non-Binary Cyclothymia Moderate No
7 Transgender Bipolar II Low No
8 Female Bipolar II Moderate No
9 Female Bipolar II Moderate No
10 Male Bipolar II High Yes


The ifelse() call here evaluates the Medications variable in the example_data_2 data set. For each Medications value that is 0, R returns “No”; otherwise, it returns the other value we’ve supplied, “Yes”. Theoretically, we can make if-else chains as long as we want to account for more than two values, but this isn’t recommended for a large number of values as it can get messy.

Let’s use an if-else chain to recode the ID column’s numerical values into spelled out characters of each number.


#Applying a chain of if-else statements to the ID variable.
example_data_2$ID <- ifelse(example_data_2$ID == 1,"one",
                     ifelse(example_data_2$ID == 2,"two",
                       ifelse(example_data_2$ID == 3,"three",
                         ifelse(example_data_2$ID == 4,"four",
                           ifelse(example_data_2$ID == 5,"five",
                             ifelse(example_data_2$ID == 6,"six",
                               ifelse(example_data_2$ID == 7,"seven",
                                 ifelse(example_data_2$ID == 8,"eight",
                                   ifelse(example_data_2$ID == 9,"nine","ten")))))))))

Which gives us….

ID Gender Illness Severity Medications
one Non-Binary Bipolar II Low Yes
two Non-Binary Cyclothymia High No
three Male Bipolar II High Yes
four Male Bipolar II High Yes
five Female Bipolar II Low Yes
six Non-Binary Cyclothymia Moderate No
seven Transgender Bipolar II Low No
eight Female Bipolar II Moderate No
nine Female Bipolar II Moderate No
ten Male Bipolar II High Yes


While something like this may work in a pinch, it’s not an efficient way to recode. This approach requires that you know what your data contains beforehand. If we had an ID value of 11, the chain would not have caught it; it would have silently fallen through to the final value, “ten”. You can always add conditions that catch unknown values, but there are more efficient ways to recode many values, some of which have already been presented in this notebook.
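As one base R alternative to the long chain, a plain lookup vector can be indexed by the numeric IDs. This is a small sketch with a made-up ids vector that includes an out-of-range value (11) to show that it comes back as NA rather than being mislabelled.

```r
# Lookup vector: position 1 holds "one", position 2 holds "two", and so on.
number_words <- c("one", "two", "three", "four", "five",
                  "six", "seven", "eight", "nine", "ten")

# Hypothetical IDs, including one (11) that the lookup does not cover.
ids <- c(1, 5, 10, 11)

# Indexing past the end of the vector returns NA instead of a wrong label.
recoded <- number_words[ids]
```

This only works directly because the IDs are positive whole numbers; for other codes, a named vector or match() would be needed.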


Fun Fact: If you ever need to convert numbers to words like this, you can use the numbers_to_words function from the xfun package. Conversely, if you ever want to convert words into numbers, you can try out the wordstonumbers package by fsingletonthorn over on GitHub!