Purpose

In this notebook session, I’ll be exploring different methods of recoding variables in the R language. I intend to explore solutions in base R, dplyr, and plyr packages.

Recoding with dplyr

One of the most widely used packages for data transformation in R is dplyr, part of the tidyverse. We can load dplyr and magrittr (for the %>% operator) individually, or we can choose to load the tidyverse, which brings in both along with several other packages.

#Load in tidyverse.
library(tidyverse)

Example Data-1

To practice recoding with dplyr, let’s create a dataframe to practice on and call it example_data_1. We’ll make it a tibble since we’re using dplyr.

#Setting a seed for reproducibility.
set.seed(99)

#Creating the example data frame.
example_data_1 <- tibble("ID" = c(1:10),
                         "Gender" = sample(0:3, 10, replace = TRUE),
                         "Levels" = sample(c("Low","Moderate","High"), 10, replace = TRUE),
                         "Medication Type" = sample(1:5, 10, replace = TRUE))

Which gives us…

ID	Gender	Levels	Medication Type
1	3	Moderate	1
2	0	Low	2
3	3	High	5
4	1	Low	2
5	1	Moderate	5
6	0	Moderate	4
7	2	Moderate	2
8	1	Low	1
9	1	Moderate	4
10	1	Moderate	4

case_when() function

The dplyr package has a case_when() function that is a “go-to” for most situations you’ll encounter! Let’s use it to recode our Medication Type variable with the following codes:

Medication Category	Code
Antidepressants	1
Anxiolytics	2
Stimulants	3
Antipsychotics	4
Mood Stabilizers	5

We can apply the function by directly mutating the variable in the data set with tidy code that utilizes magrittr’s pipe operator. You can read more about the tidyverse style of coding like this here.

#Using tidyverse-styled code to mutate the "Medication Type" variable to recode the numbers into strings.
example_data_1 <- example_data_1 %>%
  mutate(`Medication Type` = case_when(`Medication Type` == 1 ~ "Antidepressants",
                                       `Medication Type` == 2 ~ "Anxiolytics",
                                       `Medication Type` == 3 ~ "Stimulants",
                                       `Medication Type` == 4 ~ "Antipsychotics",
                                       `Medication Type` == 5 ~ "Mood Stabilizers"))

In this block, we’re using the mutate function to change the Medication Type variable. It’s here that we apply the case_when function to the variable to recode our numeric codes to strings.

The result…

ID	Gender	Levels	Medication Type
1	3	Moderate	Antidepressants
2	0	Low	Anxiolytics
3	3	High	Mood Stabilizers
4	1	Low	Anxiolytics
5	1	Moderate	Mood Stabilizers
6	0	Moderate	Antipsychotics
7	2	Moderate	Anxiolytics
8	1	Low	Antidepressants
9	1	Moderate	Antipsychotics
10	1	Moderate	Antipsychotics

We can also use case_when() to convert existing strings to numeric values. We can test this out by recoding the Medication Type variable back to its numeric codes…

#Using tidyverse-styled code to mutate the "Medication Type" variable back into numeric values.
example_data_1 <- example_data_1 %>%
  mutate(`Medication Type` = case_when(`Medication Type` == "Antidepressants" ~ 1,
                                       `Medication Type` == "Anxiolytics" ~ 2,
                                       `Medication Type` == "Stimulants" ~ 3,
                                       `Medication Type` == "Antipsychotics" ~ 4,
                                       `Medication Type` == "Mood Stabilizers" ~ 5))

Which gives us our original table back:

ID	Gender	Levels	Medication Type
1	3	Moderate	1
2	0	Low	2
3	3	High	5
4	1	Low	2
5	1	Moderate	5
6	0	Moderate	4
7	2	Moderate	2
8	1	Low	1
9	1	Moderate	4
10	1	Moderate	4

The case_when() function is great to use for more complex/conditional recoding as well. Check out the tidyverse reference page for more possibilities/limitations!
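As a rough sketch of what more conditional recoding can look like, the block below groups codes with %in% and adds a catch-all branch. The data frame med_demo, its column names, and the group labels are made up for illustration and are not part of the example data above. Without the final TRUE branch, any value that matches none of the conditions would come back as NA.

```r
library(dplyr)

# Hypothetical data: a few medication codes, including an unexpected 9 and an NA.
med_demo <- tibble(med_code = c(1, 2, 5, 9, NA))

# Conditions are checked top to bottom; the first one that is TRUE wins.
med_demo <- med_demo %>%
  mutate(med_group = case_when(
    is.na(med_code)       ~ "Missing",
    med_code %in% c(1, 2) ~ "Antidepressants/Anxiolytics",
    med_code %in% 3:5     ~ "Other listed medication",
    TRUE                  ~ "Unknown code"  # fallback for anything not matched above
  ))

med_demo$med_group
```

Placing the is.na() check first keeps missing values from falling through to the fallback branch.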

recode() function

The dplyr package also has a recode() function that can be useful. Let’s use it to recode our Levels variable with the following codes:

Levels Category	Code
Low	1
Moderate	2
High	3

To do this, we can directly apply and assign the function to the variable we wish to recode…

#Applying the recode function to the "Levels" variable in the data set.
example_data_1$Levels <- recode(example_data_1$Levels, 
                                "Low" = 1,
                                "Moderate" = 2,
                                "High" = 3)

#Alternatively, we can write the same recode in the tidyverse style (run one version or the other, not both)...
example_data_1 <- example_data_1 %>%
  mutate(Levels = recode(Levels,
                         "Low" = 1,
                         "Moderate" = 2,
                         "High" = 3))

Either way, we get the same result…

ID	Gender	Levels	Medication Type
1	3	2	1
2	0	1	2
3	3	3	5
4	1	1	2
5	1	2	5
6	0	2	4
7	2	2	2
8	1	1	1
9	1	2	4
10	1	2	4

Recoding numeric values requires backticks or a conversion to strings

Note that if we pass bare numbers as the names inside recode(), an error will be thrown. We can see this by trying to convert the Levels variable back into its string categories.

#Attempting to convert numeric data types with the "recode" function...
example_data_1 <- example_data_1 %>%
  mutate(Levels = recode(Levels,
                         1 = "Low",
                         2 = "Moderate",
                         3 = "High"))

## Error: <text>:4:28: unexpected '='
## 3:   mutate(Levels = recode(Levels,
## 4:                          1 =
##                               ^

We get an error telling us that ‘=’ was unexpected. This comes from R’s parser rather than from recode() itself: a bare number such as 1 is not a valid argument name, so the call never runs. recode() does accept numeric input when the old values are wrapped in backticks (for example, `1` = "Low"). Another option is to convert the numbers to strings first and then run the function…

#Converting the numeric codes to character before applying the "recode" function.
example_data_1 <- example_data_1 %>%
  mutate(Levels = recode(as.character(Levels),
                         "1" = "Low",
                         "2" = "Moderate",
                         "3" = "High"))

Which gives us…

ID	Gender	Levels	Medication Type
1	3	Moderate	1
2	0	Low	2
3	3	High	5
4	1	Low	2
5	1	Moderate	5
6	0	Moderate	4
7	2	Moderate	2
8	1	Low	1
9	1	Moderate	4
10	1	Moderate	4
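For comparison, here is a minimal sketch of recoding numeric codes directly by wrapping the old values in backticks, which is the form the dplyr documentation shows for numeric vectors. The vector levels_num is a made-up stand-in for the Levels column.

```r
library(dplyr)

# Hypothetical numeric codes standing in for the Levels column.
levels_num <- c(2, 1, 3, 1, 2)

# Backticks turn the numbers into valid argument names,
# so no as.character() conversion is needed.
levels_chr <- recode(levels_num,
                     `1` = "Low",
                     `2` = "Moderate",
                     `3` = "High")

levels_chr
```

Every code in this sketch has a replacement. Codes without one are worth handling explicitly through the .default argument.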

It’s important to note that as of 04/26/2021 the lifecycle for the recode() function has a questioning status because of the order in which the function takes in values. There is a possibility a new function will be created to replace this one. Additionally, there is a specific function for recoding factors as well called recode_factor() that can be read about here, although the forcats package could be used for easier factor processing.

if_else() function

Although you can use case_when() for cleaner code that conditionally recodes variables, maybe you want to try dplyr’s if_else(). This function is similar to base R’s ifelse() but can be a bit faster and is stricter about the data types it allows: the true and false results must be of compatible types. We can use it to recode variables as well. Let’s use it to recode the ID variable in our set with the following codes…

IDs Category	Code
South Wing	1-3
East Wing	4-7
North Wing	8-10

We can use the mutate() function to conditionally recode the ID variable with the following…

#Using the if_else function with mutate for conditional recoding.
example_data_1 <- example_data_1 %>%
  mutate(ID = if_else(ID <= 3, "South Wing",
                      if_else(ID > 3 & ID <= 7, "East Wing","North Wing")))

Which gives us this…

ID	Gender	Levels	Medication Type
South Wing	3	Moderate	1
South Wing	0	Low	2
South Wing	3	High	5
East Wing	1	Low	2
East Wing	1	Moderate	5
East Wing	0	Moderate	4
East Wing	2	Moderate	2
North Wing	1	Low	1
North Wing	1	Moderate	4
North Wing	1	Moderate	4

Note that conditional recoding like this can also be done in the case_when() function we reviewed previously.
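A sketch of that case_when() version is below. It builds a small stand-in tibble (wing_demo, a hypothetical name) rather than touching example_data_1, and writes the result to a new Wing column so the original IDs stay available.

```r
library(dplyr)

# Hypothetical stand-in for the ID column of the example data.
wing_demo <- tibble(ID = 1:10) %>%
  mutate(Wing = case_when(ID <= 3  ~ "South Wing",
                          ID <= 7  ~ "East Wing",   # only reached when ID > 3
                          ID <= 10 ~ "North Wing")) # IDs above 10 become NA

wing_demo$Wing
```

Because the conditions are evaluated in order, the second branch does not need to repeat ID > 3, which keeps the ladder flatter than nested if_else() calls.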

Recoding with plyr

mapvalues() function

If we have vectors full of set values we want to use to recode with, we can use the mapvalues function from the plyr package. Note that this method only works when the vector of the old values and new values are of the same length. Let’s explore this by repeating the recoding for the Medication Type variable. Let’s look at our codes again for a reminder:

Medication Category	Code
Antidepressants	1
Anxiolytics	2
Stimulants	3
Antipsychotics	4
Mood Stabilizers	5

Let’s say we have a vector of the string medication categories. We can remap the values in our original data set with the following…

#Creating a vector of medication categories.
med_cats <- c("Antidepressants","Anxiolytics","Stimulants","Antipsychotics","Mood Stabilizers")
med_codes <- 1:5

#If you're solely using plyr, you can load in the plyr package. Note that if you have dplyr loaded as well, you will get a warning that plyr is masking a lot of functions in dplyr. In this case, it's best to call plyr functions directly through its namespace. This is shown below:

#Using the mapvalues function for recoding.
example_data_1$`Medication Type` <- plyr::mapvalues(example_data_1$`Medication Type`,
                                                    from = med_codes,
                                                    to = med_cats)

## The following `from` values were not present in `x`: 3

Which results in…

ID	Gender	Levels	Medication Type
South Wing	3	Moderate	Antidepressants
South Wing	0	Low	Anxiolytics
South Wing	3	High	Mood Stabilizers
East Wing	1	Low	Anxiolytics
East Wing	1	Moderate	Mood Stabilizers
East Wing	0	Moderate	Antipsychotics
East Wing	2	Moderate	Anxiolytics
North Wing	1	Low	Antidepressants
North Wing	1	Moderate	Antipsychotics
North Wing	1	Moderate	Antipsychotics

Note that the first argument is the object you wish to change. The second (from=) is a set of values that you wish to find within your object to change, and the last argument (to =) is the set of values you wish to replace with.

Let’s try to recode the ID variable back into numbers to test if we can just use a vector with duplicate values as an input for the mapvalues function. Because the values are duplicated, we can use the warn_missing argument to prevent any warnings from printing to the console. Let’s try this out by attempting to replace these values with a range of integers…

#Using mapvalues with existing dataframe columns and number ranges.
example_data_1$ID <- plyr::mapvalues(example_data_1$ID, 
                                     from = example_data_1$ID,
                                     to = 1:10,
                                     warn_missing = FALSE)

This code ran without error, but let’s see the dataframe…

ID	Gender	Levels	Medication Type
1	3	Moderate	Antidepressants
1	0	Low	Anxiolytics
1	3	High	Mood Stabilizers
4	1	Low	Anxiolytics
4	1	Moderate	Mood Stabilizers
4	0	Moderate	Antipsychotics
4	2	Moderate	Anxiolytics
8	1	Low	Antidepressants
8	1	Moderate	Antipsychotics
8	1	Moderate	Antipsychotics

Not what we were expecting. The mapvalues() function did recode the variable, but because the from vector contained duplicates, every ID was given the to value paired with the first occurrence of its wing (1, 4, and 8). This is definitely a limitation of the mapvalues() function. Regardless, mapvalues() can be really convenient if you have a lot of values to recode, as it only requires creating the from and to vectors once. It’s also important to note that you can’t do conditional recoding with it unless you first transform your values in the from vector. Because of this, mapvalues() is good for quick, basic recoding when vectors of unique values are present or created for the purpose of recoding.
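The 1, 4, 8 pattern can be reproduced in base R with match(), which returns the position of the first occurrence of each value. The sketch below is only an illustration that is consistent with the output above, not a statement about how plyr implements mapvalues() internally. The wings vector and the wing_codes lookup are made up for the example.

```r
# Hypothetical copy of the recoded ID column.
wings <- c(rep("South Wing", 3), rep("East Wing", 4), rep("North Wing", 3))

# match() finds the FIRST position of each value in the lookup vector,
# so duplicated values all point at the same position.
first_pos <- match(wings, wings)
first_match_result <- (1:10)[first_pos]
first_match_result

# With one entry per unique value, the lookup behaves as intended.
wing_codes <- c("South Wing" = 1, "East Wing" = 2, "North Wing" = 3)
wing_numbers <- unname(wing_codes[wings])
wing_numbers
```

The second half shows the general fix: give the lookup exactly one old value per new value.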

For instances like this when we just want to recode something into a range of numbers it can simply be applied as such…

#Recoding ID variable simply with desired number ranges.
example_data_1$ID <- 1:10

revalue() function

Another function that can be used from the plyr package is the revalue() function. This function works to recode character and factor vectors only. Because of this limitation, most using the plyr package for recoding will opt for the mapvalues() function. The revalue function can be useful if you’d like to incorporate a level of data validation if you want to be sure that the data in question is in fact characters or factors.

As an example, let’s try to convert the Gender variable with the revalue function to its appropriate categories with the following codes:

Gender Category	Code
Female	0
Male	1
Transgender	2
Non-Binary	3

Because the input needs to be a character or factor, we can coerce the Gender variable to fit this requirement. Let’s change it into a character vector with the as.character() function:

#Using the revalue function for recoding.
example_data_1$Gender <- plyr::revalue(as.character(example_data_1$Gender),
                                       replace = c("0" = "Female", "1" = "Male", "2" = "Transgender", "3" = "Non-Binary"))

Which gives us…

ID	Gender	Levels	Medication Type
1	Non-Binary	Moderate	Antidepressants
2	Female	Low	Anxiolytics
3	Non-Binary	High	Mood Stabilizers
4	Male	Low	Anxiolytics
5	Male	Moderate	Mood Stabilizers
6	Female	Moderate	Antipsychotics
7	Transgender	Moderate	Anxiolytics
8	Male	Low	Antidepressants
9	Male	Moderate	Antipsychotics
10	Male	Moderate	Antipsychotics

Recoding with Base R

Maybe you want to stay in base R and don’t want to deal with alternative packages. Although the previously mentioned packages can help make recoding efficient, they aren’t the only way.

Example Data-2

We’ll create a second example dataframe for the rest of the notebook…

#Setting a seed for reproducibility.
set.seed(1234)

#Creating the example data frame.
example_data_2 <- data.frame("ID" = c(1:10), 
                           "Gender" = sample(0:3, 10, replace = TRUE),
                           "Illness" = sample(1:3, 10, replace = TRUE),
                           "Severity" = sample(c("Low","Moderate","High"), 10, replace = TRUE),
                           "Medications" = sample(0:1, 10, replace = TRUE))

ID	Gender	Illness	Severity	Medications
1	3	2	Low	1
2	3	3	High	0
3	1	2	High	1
4	1	2	High	1
5	0	2	Low	1
6	3	3	Moderate	0
7	2	2	Low	0
8	0	2	Moderate	0
9	0	2	Moderate	0
10	1	2	High	1

More often than not, we’ll see data like this where categorical variables will be numerically coded. Depending on the analyses, we may need to switch back and forth. Let’s recode the gender variable into categories. In this example our codes are the following:

Gender Category	Code
Female	0
Male	1
Transgender	2
Non-Binary	3

Named Vectors

We can recode our Gender variable with a named vector where we directly give names to the values that are already present in our data frame. Let’s call ours gender_codes and then directly apply it to our Gender variable in our example_data_2 data set…

#Creating the named vector for gender.
gender_codes <- c("Female" = 0, 
                  "Male" = 1,
                  "Transgender" = 2,
                  "Non-Binary" = 3)

#Applying the named vector to the gender variable in the original data set. Note how we convert the gender variable to a factor and then wrap the "names" function around everything.
example_data_2$Gender <- names(gender_codes[as.factor(example_data_2$Gender)])

Which gives us….

ID	Gender	Illness	Severity	Medications
1	Non-Binary	2	Low	1
2	Non-Binary	3	High	0
3	Male	2	High	1
4	Male	2	High	1
5	Female	2	Low	1
6	Non-Binary	3	Moderate	0
7	Transgender	2	Low	0
8	Female	2	Moderate	0
9	Female	2	Moderate	0
10	Male	2	High	1

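We’re able to do this by converting our original Gender variable to a factor, subsetting our gender_codes vector with it, and assigning the resulting names to the Gender variable. Indexing with a factor uses its underlying integer codes (1 to 4 here), so this works because all four codes appear in the data and their sorted order matches the order of gender_codes.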
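If one of the codes happens to be absent from the data, the factor approach can shift the labels. A sketch of a lookup based on match() that avoids this is shown below. The vector gender_raw is hypothetical and deliberately contains no 2.

```r
gender_codes <- c("Female" = 0,
                  "Male" = 1,
                  "Transgender" = 2,
                  "Non-Binary" = 3)

# Hypothetical data in which the code 2 never appears.
gender_raw <- c(3, 3, 1, 0)

# Factor indexing uses the factor's integer codes (levels "0", "1", "3"),
# so a 3 is looked up in position 3 and mislabeled.
by_factor <- names(gender_codes[as.factor(gender_raw)])

# match() looks each value up by its actual code instead.
by_match <- names(gender_codes)[match(gender_raw, gender_codes)]

by_factor
by_match
```

The match() version also returns NA for any code that is missing from gender_codes, which makes unexpected values easier to spot.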

Vector Indexing

In base R we can recode variables with vector indexing. This approach can be used if you have a few values that need to be recoded. For this example, let’s recode the Illness variable with the following codes:

Illness Category	Code
Bipolar I	1
Bipolar II	2
Cyclothymia	3

When looking at our data set, we actually see that we have no observations with the value of 1 or “Bipolar I” present in the set. With that knowledge, we know that we only have to recode values 2 and 3…

#Accessing the "Illness" vector to convert 2's into "Bipolar II".
example_data_2$Illness[example_data_2$Illness == 2] <- "Bipolar II"

#Accessing the "Illness" vector to convert 3's into "Cyclothymia".
example_data_2$Illness[example_data_2$Illness == 3] <- "Cyclothymia"

#Note that trying to recode the value "1" will not result in any errors, even though there aren't any 1s present. This code will run.  
example_data_2$Illness[example_data_2$Illness == 1] <- "Bipolar I"

Our result…

ID	Gender	Illness	Severity	Medications
1	Non-Binary	Bipolar II	Low	1
2	Non-Binary	Cyclothymia	High	0
3	Male	Bipolar II	High	1
4	Male	Bipolar II	High	1
5	Female	Bipolar II	Low	1
6	Non-Binary	Cyclothymia	Moderate	0
7	Transgender	Bipolar II	Low	0
8	Female	Bipolar II	Moderate	0
9	Female	Bipolar II	Moderate	0
10	Male	Bipolar II	High	1

Vector indexing can be great in a pinch, but it can get a bit messy the more values you have. This approach also won’t let you know if any of the values you’ve declared are not present in your data, which could lead to issues later on.
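One way to get that feedback is to compare the declared codes against the data before recoding. The sketch below assumes a small hypothetical illness vector and a named illness_codes lookup. Neither name comes from the example data frame.

```r
# Hypothetical Illness values and the full set of declared codes.
illness <- c(2, 3, 2, 2, 3)
illness_codes <- c("Bipolar I" = 1, "Bipolar II" = 2, "Cyclothymia" = 3)

# Codes we declared that never occur in the data.
unused_codes <- illness_codes[!illness_codes %in% illness]

# Values in the data that have no declared code.
unexpected_values <- setdiff(illness, illness_codes)

names(unused_codes)
unexpected_values
```

Running a check like this first makes it clear which of the indexing assignments will actually change anything.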

If-Else Statements

In base R, we can also use if-else chains to recode variables. This code can get messier the more values you have to recode. If this method is used for recoding, it might be best to limit it to recoding two or three values. For this example, let’s recode the Medications variable. Our codes for this variable are the following:

Medications Category	Code
No	0
Yes	1

To recode this variable, we can use an ifelse statement…

#Applying the if-else statement to the Medications variable.
example_data_2$Medications <- ifelse(example_data_2$Medications == 0,"No","Yes")

Which gives us….

ID	Gender	Illness	Severity	Medications
1	Non-Binary	Bipolar II	Low	Yes
2	Non-Binary	Cyclothymia	High	No
3	Male	Bipolar II	High	Yes
4	Male	Bipolar II	High	Yes
5	Female	Bipolar II	Low	Yes
6	Non-Binary	Cyclothymia	Moderate	No
7	Transgender	Bipolar II	Low	No
8	Female	Bipolar II	Moderate	No
9	Female	Bipolar II	Moderate	No
10	Male	Bipolar II	High	Yes

The if-else statement here evaluates the Medications variable in the example_data_2 data set. For each Medications value that is 0, R will replace the value with “No”, otherwise, it will replace it with the other values we’ve supplied, “Yes”. Theoretically, we can make if-else chains as big as we want to account for more than two values, but this isn’t recommended for a large amount of values as it can get messy.

Let’s use an if-else chain to recode the ID column’s numerical values into spelled out characters of each number.

#Applying a nested if-else chain to the ID variable.
example_data_2$ID <- ifelse(example_data_2$ID == 1,"one",
                     ifelse(example_data_2$ID == 2,"two",
                       ifelse(example_data_2$ID == 3,"three",
                         ifelse(example_data_2$ID == 4,"four",
                           ifelse(example_data_2$ID == 5,"five",
                             ifelse(example_data_2$ID == 6,"six",
                               ifelse(example_data_2$ID == 7,"seven",
                                 ifelse(example_data_2$ID == 8,"eight",
                                   ifelse(example_data_2$ID == 9,"nine","ten")))))))))

Which gives us….

ID	Gender	Illness	Severity	Medications
one	Non-Binary	Bipolar II	Low	Yes
two	Non-Binary	Cyclothymia	High	No
three	Male	Bipolar II	High	Yes
four	Male	Bipolar II	High	Yes
five	Female	Bipolar II	Low	Yes
six	Non-Binary	Cyclothymia	Moderate	No
seven	Transgender	Bipolar II	Low	No
eight	Female	Bipolar II	Moderate	No
nine	Female	Bipolar II	Moderate	No
ten	Male	Bipolar II	High	Yes

While something like this may work in a pinch, it’s not really efficient to recode this way. This approach requires that you know what your data contains beforehand. If we had an ID value of 11, the chain would not have caught it; it would have been silently labeled “ten” by the final else branch. You can always add statements that catch unknown values, but there are more efficient ways to recode many values, some of which have already been presented in this notebook.
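As one possible alternative to the chain, a lookup vector combined with match() leaves unknown values as NA instead of folding them into the last branch. The names id_words and ids below are made up for this sketch, and ids deliberately includes an 11.

```r
# Lookup vector: position i holds the word for the number i.
id_words <- c("one", "two", "three", "four", "five",
              "six", "seven", "eight", "nine", "ten")

# Hypothetical IDs, including one value outside the expected range.
ids <- c(1, 5, 10, 11)

# match() returns NA for 11, so the recoded value is NA rather than "ten".
words <- id_words[match(ids, 1:10)]
words

# A simple check that flags anything the lookup did not cover.
if (anyNA(words)) {
  message("Unrecognized ID(s): ", paste(ids[is.na(words)], collapse = ", "))
}
```

The lookup stays one line long no matter how many values need recoding, which is the main advantage over the nested ifelse() ladder.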

Fun Fact: If you ever need to convert numbers to words like this, you can use the numbers_to_words function from the xfun package. Alternatively, if you ever want to convert words into numbers, you can try out the wordstonumbers package by fsingletonthorn over on GitHub!
