Exploring Data Validation

“With The pointblank Package”

Meghan Harris, MPH

Cleveland R User Group

October 26th 2022 | 2022-10-26

About Me

I’m Meghan Harris 👋🏾
Data Scientist 💻
Prostate Cancer Clinical Trials Consortium (PCCTC) - Memorial Sloan Kettering ⚛️
thetidytrekker.com 🌐
@meghansharris
meghan-harris

What are We Doing Today?

What is Validation and Why Do We Care?
Different Types of Validations
What is pointblank?
Why pointblank?
pointblank Use-Cases

Disclaimer

I’m ABSOLUTELY not an expert in Data Validation or the pointblank package

What is Validation?
and Why Do We Care?

What is Validation? and Why Do We Care?

Validation: the action of checking or proving the accuracy of something.

This can look like a lot of things:

Testing Data Inputs
Testing Data Outputs
Testing Pipeline Logic
Enforcing Pipeline/Analysis/Data Rules

What is Validation? and Why Do We Care?

Why is it important? … Because…

What is Validation? and Why Do We Care?

Why is it important? … Because…

Your work is no good if it’s not right
Incorrect assumptions about data can be costly and unethical
Real world data is usually complex and nuanced
External controls may be lacking or non-existent

What is Validation? and Why Do We Care?

The art of Data Validation is a Rabbit Hole

What is Validation? and Why Do We Care?

Don’t fall for the “Validation Crux”

What is Validation? and Why Do We Care?

Data Validation is a “rabbithole” because you could probably spend an eternity thinking about the “best” plans for validation.

Deciding to think seriously about data validation is a humbling experience that should be done with care.

Data validation plans should be developed with a holistic, “systems” approach that includes any applicable third parties, data management teams, data collectors, etc.

What is Validation? and Why Do We Care?

💡Remember:

Start “small” by looking at the most important pieces of your data and figuring out the minimum building blocks of logic or assumptions that are needed to make a fairly confident guess that the data is actually accurate. It is normal for this to be an iterative process.

Different “types” of validations can be considered as the building blocks you need to get started.

Different Types
of Validations

Different Types of Validations

There are eight basic types that I always mention…but this list is NOT exhaustive:

Different Types of Validations

Variable Type

“Is this variable the same class/type we are expecting?”

Example: - Is R reading this as a numeric or integer variable, or is it just a string impostor?

# Variable Type Check - Base R
actual_number <- 2022
fake_number <- "2022"

#Actual Number?
is.numeric(actual_number)

[1] TRUE

#Actual Number?
is.numeric(fake_number)

[1] FALSE

Allowed/Expected Values

“Is the value of the variable one that is allowed or expected?”

Example: - Can we confirm that the value R is reading makes sense given the context of allowed or expected values for this variable?

# Allowed/Expected Values Check - Base R
age <- -99
first_name <- "Meghan"

#Is age a valid number and over the value of zero?
is.numeric(age) & age > 0

[1] FALSE

#Is first name a character string?
is.character(first_name)

[1] TRUE

Missing Data

“Is there any missing data where there should/shouldn’t be?”

Example: - Do we logically expect any values in a variable to be missing or should missing data prompt us to investigate the data further?

# Missing Data Check - Base R
subject_data

  subject_id consent
1        345     Yes
2        123     Yes
3         NA      No

#Are IDs missing only for those w/ no consent?
which(is.na(subject_data$subject_id)) == which(subject_data$consent == "No")

[1] TRUE

Logic Checks

“Is the data logically making sense upon delivery or after transformation/analyses/processing?”

Example: - Does the data make sense given the context of collection, processing, or analysis?

# Logic Data Check - Base R

# Is a new data frame different from an old data frame?#
!identical(old_data, new_data)

[1] TRUE

Uniqueness

“Should we expect data values to be unique or duplicated?”

Example: - Given the context of the data, do we expect R to find any duplicates? Is it a bad thing or have any meaning if present?

# Uniqueness Check - Base R
#Check to see there are 0 duplicated IDs
anyDuplicated(subject_data$subject_id) == 0

[1] TRUE

String Length

“Is the string/character value of the right left given the context of the data?”

Example: - Is the string’s length appropriate given the data’s meaning?

# String Length Check - Base R
#Do all of our "US zipcodes" have a string length of 5 characters?

sapply(us_zipcodes, function(x) nchar(x) == 5)

14215 14120   142 08796   867 
 TRUE  TRUE FALSE  TRUE FALSE

Formatting

“Is the general format of the data as we expect?”

Example: - Can R confirm that our data matches a specific format that is needed for our work?

# Data Format Check - Lubridate
#Do all of our dates match the YYYY-MM-DD format?
dates

[1] "2022-08-04" "1998/06/12" "01-15-1943"

library(lubridate)
!is.na(parse_date_time(dates,orders="ymd"))

[1]  TRUE  TRUE FALSE

Range Checks

“Does the data fall into an accepted pre-determined range?”

Example: - Can R confirm that our data matches a specific range that fits within the context of the data?

# Range Check - Base R
#Are all of the subjects 18 - 65 years old?

subject_data$over_18 <- sapply(subject_data$age, 
                               function(x) x >= 18 & x <= 65)

subject_data

  age over_18
1  38    TRUE
2  78   FALSE
3  74   FALSE

What is `pointblank`?

Pointblank is an R package by Rich Iannone (author/maintainer) and Mauricio Vargas (author) that was created to assist with methodically validating data and keeping track of relevant metadata (data about data) in R.

What is `pointblank`?

pointblank currently has 6 presented validation workflows in the package:

VALID-V: Table Scan

VALID-VI: R Markdown Document Validation

Why `pointblank`?

While you can use base R and other relevant packages like validate or testthat for data validation and testing in R, pointblank is a validation package that has a heavy focus on implementing easy reporting and methodological validation schema with ease.

However…

To a lowly beginner (like myself) pointblank can be overwhelming…

Why `pointblank`?

`pointblank` Use-Cases

`pointblank` Use-Cases: Data Quality Reporting

Example Scenario:

We’ve got some aggregate study data that includes the total number of patients present in a US State for our clinical trial studies. Each total also has an associated report date that’s recorded. Because this is the real world, there’s no external control in the data collection process and it’s pretty F-tier️🙃

`pointblank` Use-Cases: Data Quality Reporting

Example Data:

#example_script_1.R in the example_scripts folder#
library(readr)

#Load in the data#
example_data <- read_csv("data/example_data.csv")

#view it - or not 🤷🏾‍♀️
example_data

# A tibble: 5 × 3
  state         total_patients report_date   
  <chr>                  <dbl> <chr>         
1 New Jersey                12 July, 1st 2020
2 New York                  58 8/15/2020     
3 Pennslyvania              34 6/13/2020     
4 Mainne                    12 9/1/2020      
5 New Hampshire             -5 8/20/2022

`pointblank` Use-Cases: Data Quality Reporting

state	total_patients	report_date
Study Patient Totals
New Jersey	12	July, 1st 2020
New York	58	8/15/2020
Pennslyvania	34	6/13/2020
Mainne	12	9/1/2020
New Hampshire	-5	8/20/2022

Make a Plan:

There’s 3 variables (columns) and 5 observations (rows).
state: Official U.S states that we’d expect to be spelled correctly and capitalized.
total_patients: A total number of patients reported from each state. We’d expect this to be a numeric type and make sense. Only positive values.
report_date: A reported date of entry. We’d expect this to be a date type and have consistent formatting for each observations.

`pointblank` Use-Cases: Data Quality Reporting

Create an Agent:

library(pointblank) #For validation help

#Make an agent
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals")

Our Steps:

Create an “agent”

`pointblank` Use-Cases: Data Quality Reporting

Pipe in Valdation Rules:

library(pointblank) #For validation help

#Make an agent
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals") %>%
  col_vals_in_set(state, state.name ) %>% #Only valid states in the column?
  col_is_numeric(total_patients) %>% #Is column type numeric?
  col_vals_gte(total_patients,0) %>% #Only values greater than 0 in the column?
  col_is_date(report_date) #Is column type date?

Our Steps:

Create an “Agent”
Pipe in Validation Rules

`pointblank` Use-Cases: Data Quality Reporting

Interrogate the Agent:

library(pointblank) #For validation help

#Make an agent#
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals") %>%
  col_vals_in_set(state, state.name ) %>% #Only valid states in the column?
  col_is_numeric(total_patients) %>% #Is column type numeric?
  col_vals_gte(total_patients,0) %>% #Only values greater than 0 in the column?
  col_is_date(report_date) #Is column type date?

#interrogate it#
patient_agent %>%
  interrogate()

Our Steps:

Create an “Agent”
Pipe in Validation Rules
Interrogate the Agent

`pointblank` Use-Cases: Data Quality Reporting

Interrogation Results in Viewer Pane:

`pointblank` Use-Cases: Data Quality Reporting

Breaking it Down:

Date/Time/Object Metadata

`pointblank` Use-Cases: Data Quality Reporting

Breaking it Down:

STEP: The name of the validation functions used. Color-coded tabs let us know if a step was completed. Darker green means everything in the step passed

`pointblank` Use-Cases: Data Quality Reporting

Breaking it Down:

COLUMNS: The target columns we told the agent to interrogate via our validation rules

`pointblank` Use-Cases: Data Quality Reporting

Breaking it Down:

VALUES: Any required values needed/used to test for validation if applicable.

`pointblank` Use-Cases: Data Quality Reporting

Breaking it Down:

TBL: Let’s us know if the table was mutated in a validation step.

EVAL: Let’s us know if there’s issues R might have evaluating the table itself.

`pointblank` Use-Cases: Data Quality Reporting

Breaking it Down:

UNITS: Gives the total number of tests ran for each step

Steps that check all values in a column = 5 because we have five rows of data

Steps that just check a whole column = 1 because it’s just evaluating one column

`pointblank` Use-Cases: Data Quality Reporting

Breaking it Down:

PASS/FAIL: Gives the number/percentage of passing and failing unit tests

`pointblank` Use-Cases: Data Quality Reporting

Breaking it Down:

W,S,N: Tells us if the validation steps have entered WARN, STOP, or NOTIFY. This is empty because there’s no action levels set.

`pointblank` Use-Cases: Data Quality Reporting

Breaking it Down:

EXT: Provides a download of a data extract of observations that failed any validations if applicable.

`pointblank` Use-Cases: Data Quality Reporting

Data Extracts:

#Generate a data extract#
patient_agent %>%
  interrogate() %>%
  get_data_extracts() %>%
  setNames(c("State Validation Fails",
             "Patient Total Validation Fails"))

$`State Validation Fails`
# A tibble: 2 × 3
  state        total_patients report_date
  <chr>                 <dbl> <chr>      
1 Pennslyvania             34 6/13/2020  
2 Mainne                   12 9/1/2020   

$`Patient Total Validation Fails`
# A tibble: 1 × 3
  state         total_patients report_date
  <chr>                  <dbl> <chr>      
1 New Hampshire             -5 8/20/2022

`pointblank` Use-Cases: Data Quality Reporting

Setting optional `action_levels()`:

#Make an action levels object#
al <- action_levels(warn_at = 0.2, 
                    stop_at = 0.5,
                    notify_at = 1)

#Make an agent#
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals",
                              actions = al) %>%
  col_vals_in_set(state, state.name ) %>% #Only valid states in the column?
  col_is_numeric(total_patients) %>% #Is column type numeric?
  col_vals_gte(total_patients,0) %>% #Only values greater than 0 in the column?
  col_is_date(report_date) #Is column type date?

#interrogate it#
patient_agent %>%
  interrogate()

- We can use action_levels() to give our agent more directives when interrogating our data.

- We can set the fraction/percentage levels of validation failure that determines when the agent warns us, or stops the process altogether.

`pointblank` Use-Cases: Data Quality Reporting

Interrogation Results (With `action_levels()`) in Viewer Pane:

`pointblank` Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:

example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

`pointblank` Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:

example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

Action Possibilities:

Use warn_on_fail() for targeted warnings

`pointblank` Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:

example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

Action Possibilities:

Use warn_on_fail() for targeted warnings
Use stop_on_fail() for targeted stops/error catches

`pointblank` Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:

example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

Action Possibilities:

Use warn_on_fail() for targeted warnings
Use stop_on_fail() for targeted stops/error catches
Use an action_levels() object for more control

`pointblank` Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:

example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

Action Possibilities:

Use warn_on_fail() for targeted warnings
Use stop_on_fail() for targeted stops/error catches
Use an action_levels() object for more control
Use no actions at all for basic testing

`pointblank` Use-Cases: Pipeline Data Validation

The goal is the side-effects to maintain/halt the workflow:

example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

Error: Failure to validate that column `report_date` is of type: Date.
The `col_is_date()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)

Fails will halt R’s workflow

`pointblank` Use-Cases: Pipeline Data Validation

The goal is the side-effects to maintain/halt the workflow:

example_data %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1))

# A tibble: 5 × 3
  state         total_patients report_date   
  <chr>                  <dbl> <chr>         
1 New Jersey                12 July, 1st 2020
2 New York                  58 8/15/2020     
3 Pennslyvania              34 6/13/2020     
4 Mainne                    12 9/1/2020      
5 New Hampshire             -5 8/20/2022

Passes will maintain R’s workflow

`pointblank` Use-Cases: Table Scans

Sometimes (always)…we should just look at our data… 🤷🏾‍

#Use the scan_data function to generate an HTML output#
example_data %>%
  scan_data()

Simply use scan_data() to generate an automated HTML output that gives information about the table.

`pointblank` Use-Cases: Table Scans

We can also set some options…

#Use the scan_data function to generate an HTML output#
#With options#
example_data %>%
  scan_data(sections = "OV",
            navbar = FALSE,
            lang = "es",
            width = 500,
            locale = "es_ES")

sections - “OVICMS” (Overview, Variables, Interactions, Correlations, Missing Values, Sample)
navbar - Toggles the navigation bar on/off
lang - Chooses a language to present the report in: (English, French, German, Italian, Spanish, Portuguese, Chinese, Russian)
width - Width of the HTML report
locale - Sets the region/locale for formatting numerical values.

Exploring Data Validation

About Me

What are We Doing Today?

Disclaimer

What is Validation?and Why Do We Care?

What is Validation? and Why Do We Care?

What is Validation? and Why Do We Care?

What is Validation? and Why Do We Care?

What is Validation? and Why Do We Care?

What is Validation? and Why Do We Care?

What is Validation? and Why Do We Care?

What is Validation? and Why Do We Care?

Different Typesof Validations

Different Types of Validations

Different Types of Validations

Variable Type

Allowed/Expected Values

Missing Data

Logic Checks

Uniqueness

String Length

Formatting

Range Checks

What is pointblank?

What is pointblank?

What is pointblank?

Why pointblank?

Why pointblank?

Why pointblank?

pointblank Use-Cases

pointblank Use-Cases: Data Quality Reporting

pointblank Use-Cases: Data Quality Reporting

pointblank Use-Cases: Data Quality Reporting

pointblank Use-Cases: Data Quality Reporting

Example Scenario:

pointblank Use-Cases: Data Quality Reporting

Example Data:

pointblank Use-Cases: Data Quality Reporting

Make a Plan:

pointblank Use-Cases: Data Quality Reporting

Create an Agent:

Our Steps:

pointblank Use-Cases: Data Quality Reporting

Pipe in Valdation Rules:

Our Steps:

pointblank Use-Cases: Data Quality Reporting

Interrogate the Agent:

Our Steps:

pointblank Use-Cases: Data Quality Reporting

Interrogation Results in Viewer Pane:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:

pointblank Use-Cases: Data Quality Reporting

Data Extracts:

pointblank Use-Cases: Data Quality Reporting

Setting optional action_levels():

pointblank Use-Cases: Data Quality Reporting

Interrogation Results (With action_levels()) in Viewer Pane:

pointblank Use-Cases: Pipeline Data Validation

pointblank Use-Cases: Pipeline Data Validation

pointblank Use-Cases: Pipeline Data Validation

pointblank Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:

pointblank Use-Cases: Pipeline Data Validation

What is Validation?
and Why Do We Care?

Different Types
of Validations

What is `pointblank`?

What is `pointblank`?

What is `pointblank`?

Why `pointblank`?

Why `pointblank`?

Why `pointblank`?

`pointblank` Use-Cases

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

`pointblank` Use-Cases: Data Quality Reporting

Setting optional `action_levels()`:

`pointblank` Use-Cases: Data Quality Reporting

Interrogation Results (With `action_levels()`) in Viewer Pane:

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Pipeline Data Validation

`pointblank` Use-Cases: Table Scans

`pointblank` Use-Cases: Table Scans

`pointblank` Use-Cases: Table Scans