Exploring Data Validation

“With The pointblank Package”

Meghan Harris, MPH

Cleveland R User Group

October 26th 2022 | 2022-10-26

About Me


What are We Doing Today?


  • What is Validation and Why Do We Care?

  • Different Types of Validations

  • What is pointblank?

  • Why pointblank?

  • pointblank Use-Cases

Disclaimer

I’m ABSOLUTELY not an expert in Data Validation or the pointblank package

What is Validation?
and Why Do We Care?

What is Validation? and Why Do We Care?

Validation: the action of checking or proving the accuracy of something.


This can look like a lot of things:

  • Testing Data Inputs
  • Testing Data Outputs
  • Testing Pipeline Logic
  • Enforcing Pipeline/Analysis/Data Rules

What is Validation? and Why Do We Care?

Why is it important? … Because…


.

What is Validation? and Why Do We Care?

Why is it important? … Because…


  • Your work is no good if it’s not right
  • Incorrect assumptions about data can be costly and unethical
  • Real world data is usually complex and nuanced
  • External controls may be lacking or non-existent

What is Validation? and Why Do We Care?

The art of Data Validation is a Rabbit Hole

What is Validation? and Why Do We Care?

Don’t fall for the “Validation Crux”

What is Validation? and Why Do We Care?


  • Data Validation is a “rabbithole” because you could probably spend an eternity thinking about the “best” plans for validation.
  • Deciding to think seriously about data validation is a humbling experience that should be done with care.
  • Data validation plans should be developed with a holistic, “systems” approach that includes any applicable third parties, data management teams, data collectors, etc.

What is Validation? and Why Do We Care?


💡Remember:


Start “small” by looking at the most important pieces of your data and figuring out the minimum building blocks of logic or assumptions that are needed to make a fairly confident guess that the data is actually accurate. It is normal for this to be an iterative process.


Different “types” of validations can be considered as the building blocks you need to get started.


Different Types
of Validations

Different Types of Validations

  • There are eight basic types that I always mention…but this list is NOT exhaustive:

.

Different Types of Validations

  • Variable Type
  • Allowed/Expected Values
  • Missing Data
  • Logic Checks
  • Uniqueness
  • String Length
  • Formatting
  • Range Checks

Variable Type

“Is this variable the same class/type we are expecting?”


Example: - Is R reading this as a numeric or integer variable, or is it just a string impostor?


# Variable Type Check - Base R
actual_number <- 2022
fake_number <- "2022"

#Actual Number?
is.numeric(actual_number)
[1] TRUE
#Actual Number?
is.numeric(fake_number)
[1] FALSE

Allowed/Expected Values

“Is the value of the variable one that is allowed or expected?”


Example: - Can we confirm that the value R is reading makes sense given the context of allowed or expected values for this variable?


# Allowed/Expected Values Check - Base R
age <- -99
first_name <- "Meghan"

#Is age a valid number and over the value of zero?
is.numeric(age) & age > 0 
[1] FALSE
#Is first name a character string?
is.character(first_name)
[1] TRUE

Missing Data

“Is there any missing data where there should/shouldn’t be?”


Example: - Do we logically expect any values in a variable to be missing or should missing data prompt us to investigate the data further?


# Missing Data Check - Base R
subject_data
  subject_id consent
1        345     Yes
2        123     Yes
3         NA      No
#Are IDs missing only for those w/ no consent?
which(is.na(subject_data$subject_id)) == which(subject_data$consent == "No")
[1] TRUE

Logic Checks

“Is the data logically making sense upon delivery or after transformation/analyses/processing?”


Example: - Does the data make sense given the context of collection, processing, or analysis?


# Logic Data Check - Base R

# Is a new data frame different from an old data frame?#
!identical(old_data, new_data)
[1] TRUE

Uniqueness

“Should we expect data values to be unique or duplicated?”


Example: - Given the context of the data, do we expect R to find any duplicates? Is it a bad thing or have any meaning if present?


# Uniqueness Check - Base R
#Check to see there are 0 duplicated IDs
anyDuplicated(subject_data$subject_id) == 0
[1] TRUE

String Length

“Is the string/character value of the right left given the context of the data?”


Example: - Is the string’s length appropriate given the data’s meaning?


# String Length Check - Base R
#Do all of our "US zipcodes" have a string length of 5 characters?

sapply(us_zipcodes, function(x) nchar(x) == 5)
14215 14120   142 08796   867 
 TRUE  TRUE FALSE  TRUE FALSE 

Formatting

“Is the general format of the data as we expect?”


Example: - Can R confirm that our data matches a specific format that is needed for our work?


# Data Format Check - Lubridate
#Do all of our dates match the YYYY-MM-DD format?
dates
[1] "2022-08-04" "1998/06/12" "01-15-1943"
library(lubridate)
!is.na(parse_date_time(dates,orders="ymd"))
[1]  TRUE  TRUE FALSE

Range Checks

“Does the data fall into an accepted pre-determined range?”


Example: - Can R confirm that our data matches a specific range that fits within the context of the data?


# Range Check - Base R
#Are all of the subjects 18 - 65 years old?

subject_data$over_18 <- sapply(subject_data$age, 
                               function(x) x >= 18 & x <= 65)

subject_data
  age over_18
1  38    TRUE
2  78   FALSE
3  74   FALSE

What is pointblank?

What is pointblank?


Pointblank is an R package by Rich Iannone (author/maintainer) and Mauricio Vargas (author) that was created to assist with methodically validating data and keeping track of relevant metadata (data about data) in R.

pointblank

What is pointblank?

pointblank currently has 6 presented validation workflows in the package:

  1. VALID-I: Data Quality Reporting

  2. VALID-II: Pipeline Data Validation

  1. VALID-V: Table Scan

Why pointblank?

Why pointblank?

While you can use base R and other relevant packages like validate or testthat for data validation and testing in R, pointblank is a validation package that has a heavy focus on implementing easy reporting and methodological validation schema with ease.


However…

To a lowly beginner (like myself) pointblank can be overwhelming…

Why pointblank?

pointblank Use-Cases

pointblank Use-Cases: Data Quality Reporting

pointblank Use-Cases: Data Quality Reporting

pointblank Use-Cases: Data Quality Reporting

pointblank Use-Cases: Data Quality Reporting

Example Scenario:


We’ve got some aggregate study data that includes the total number of patients present in a US State for our clinical trial studies. Each total also has an associated report date that’s recorded. Because this is the real world, there’s no external control in the data collection process and it’s pretty F-tier️🙃

pointblank Use-Cases: Data Quality Reporting

Example Data:

#example_script_1.R in the example_scripts folder#
library(readr)

#Load in the data#
example_data <- read_csv("data/example_data.csv")

#view it - or not 🤷🏾‍♀️
example_data
# A tibble: 5 × 3
  state         total_patients report_date   
  <chr>                  <dbl> <chr>         
1 New Jersey                12 July, 1st 2020
2 New York                  58 8/15/2020     
3 Pennslyvania              34 6/13/2020     
4 Mainne                    12 9/1/2020      
5 New Hampshire             -5 8/20/2022     

pointblank Use-Cases: Data Quality Reporting


Study Patient Totals
state total_patients report_date
New Jersey 12 July, 1st 2020
New York 58 8/15/2020
Pennslyvania 34 6/13/2020
Mainne 12 9/1/2020
New Hampshire -5 8/20/2022

Make a Plan:

  • There’s 3 variables (columns) and 5 observations (rows).

  • state: Official U.S states that we’d expect to be spelled correctly and capitalized.

  • total_patients: A total number of patients reported from each state. We’d expect this to be a numeric type and make sense. Only positive values.

  • report_date: A reported date of entry. We’d expect this to be a date type and have consistent formatting for each observations.

pointblank Use-Cases: Data Quality Reporting

Create an Agent:

library(pointblank) #For validation help

#Make an agent
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals")

Our Steps:

  1. Create an “agent”

pointblank Use-Cases: Data Quality Reporting

Pipe in Valdation Rules:

library(pointblank) #For validation help

#Make an agent
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals") %>%
  col_vals_in_set(state, state.name ) %>% #Only valid states in the column?
  col_is_numeric(total_patients) %>% #Is column type numeric?
  col_vals_gte(total_patients,0) %>% #Only values greater than 0 in the column?
  col_is_date(report_date) #Is column type date?

Our Steps:

  1. Create an “Agent”
  2. Pipe in Validation Rules

pointblank Use-Cases: Data Quality Reporting

Interrogate the Agent:

library(pointblank) #For validation help

#Make an agent#
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals") %>%
  col_vals_in_set(state, state.name ) %>% #Only valid states in the column?
  col_is_numeric(total_patients) %>% #Is column type numeric?
  col_vals_gte(total_patients,0) %>% #Only values greater than 0 in the column?
  col_is_date(report_date) #Is column type date?

#interrogate it#
patient_agent %>%
  interrogate()

Our Steps:

  1. Create an “Agent”
  2. Pipe in Validation Rules
  3. Interrogate the Agent

pointblank Use-Cases: Data Quality Reporting

Interrogation Results in Viewer Pane:

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:




Date/Time/Object Metadata

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:



STEP: The name of the validation functions used. Color-coded tabs let us know if a step was completed. Darker green means everything in the step passed

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:




COLUMNS: The target columns we told the agent to interrogate via our validation rules

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:



VALUES: Any required values needed/used to test for validation if applicable.

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:



TBL: Let’s us know if the table was mutated in a validation step.

EVAL: Let’s us know if there’s issues R might have evaluating the table itself.

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:


UNITS: Gives the total number of tests ran for each step

Steps that check all values in a column = 5 because we have five rows of data

Steps that just check a whole column = 1 because it’s just evaluating one column

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:




PASS/FAIL: Gives the number/percentage of passing and failing unit tests

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:



W,S,N: Tells us if the validation steps have entered WARN, STOP, or NOTIFY. This is empty because there’s no action levels set.

pointblank Use-Cases: Data Quality Reporting

Breaking it Down:




EXT: Provides a download of a data extract of observations that failed any validations if applicable.

pointblank Use-Cases: Data Quality Reporting

Data Extracts:

#Generate a data extract#
patient_agent %>%
  interrogate() %>%
  get_data_extracts() %>%
  setNames(c("State Validation Fails",
             "Patient Total Validation Fails"))
$`State Validation Fails`
# A tibble: 2 × 3
  state        total_patients report_date
  <chr>                 <dbl> <chr>      
1 Pennslyvania             34 6/13/2020  
2 Mainne                   12 9/1/2020   

$`Patient Total Validation Fails`
# A tibble: 1 × 3
  state         total_patients report_date
  <chr>                  <dbl> <chr>      
1 New Hampshire             -5 8/20/2022  

pointblank Use-Cases: Data Quality Reporting

Setting optional action_levels():

#Make an action levels object#
al <- action_levels(warn_at = 0.2, 
                    stop_at = 0.5,
                    notify_at = 1)

#Make an agent#
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals",
                              actions = al) %>%
  col_vals_in_set(state, state.name ) %>% #Only valid states in the column?
  col_is_numeric(total_patients) %>% #Is column type numeric?
  col_vals_gte(total_patients,0) %>% #Only values greater than 0 in the column?
  col_is_date(report_date) #Is column type date?

#interrogate it#
patient_agent %>%
  interrogate()


- We can use action_levels() to give our agent more directives when interrogating our data.


- We can set the fraction/percentage levels of validation failure that determines when the agent warns us, or stops the process altogether.

pointblank Use-Cases: Data Quality Reporting

Interrogation Results (With action_levels()) in Viewer Pane:

pointblank Use-Cases: Pipeline Data Validation

pointblank Use-Cases: Pipeline Data Validation

pointblank Use-Cases: Pipeline Data Validation

pointblank Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:


example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

pointblank Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:


example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

Action Possibilities:

  • Use warn_on_fail() for targeted warnings

pointblank Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:


example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

Action Possibilities:

  • Use warn_on_fail() for targeted warnings

  • Use stop_on_fail() for targeted stops/error catches

pointblank Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:


example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

Action Possibilities:

  • Use warn_on_fail() for targeted warnings

  • Use stop_on_fail() for targeted stops/error catches

  • Use an action_levels() object for more control

pointblank Use-Cases: Pipeline Data Validation

Apply Validation Rules Directly to the Table:


example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)

Action Possibilities:

  • Use warn_on_fail() for targeted warnings

  • Use stop_on_fail() for targeted stops/error catches

  • Use an action_levels() object for more control

  • Use no actions at all for basic testing

pointblank Use-Cases: Pipeline Data Validation

The goal is the side-effects to maintain/halt the workflow:


example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>% 
  col_vals_gte(total_patients,-0,
               actions = al) %>%
  col_is_date(report_date)
Error: Failure to validate that column `report_date` is of type: Date.
The `col_is_date()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)




Fails will halt R’s workflow

pointblank Use-Cases: Pipeline Data Validation

The goal is the side-effects to maintain/halt the workflow:


example_data %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) 
# A tibble: 5 × 3
  state         total_patients report_date   
  <chr>                  <dbl> <chr>         
1 New Jersey                12 July, 1st 2020
2 New York                  58 8/15/2020     
3 Pennslyvania              34 6/13/2020     
4 Mainne                    12 9/1/2020      
5 New Hampshire             -5 8/20/2022     




Passes will maintain R’s workflow

pointblank Use-Cases: Table Scans

pointblank Use-Cases: Table Scans


Sometimes (always)…we should just look at our data… 🤷🏾‍


#Use the scan_data function to generate an HTML output#
example_data %>%
  scan_data()
  • Simply use scan_data() to generate an automated HTML output that gives information about the table.

pointblank Use-Cases: Table Scans


We can also set some options


#Use the scan_data function to generate an HTML output#
#With options#
example_data %>%
  scan_data(sections = "OV",
            navbar = FALSE,
            lang = "es",
            width = 500,
            locale = "es_ES")
  • sections - “OVICMS” (Overview, Variables, Interactions, Correlations, Missing Values, Sample)
  • navbar - Toggles the navigation bar on/off
  • lang - Chooses a language to present the report in: (English, French, German, Italian, Spanish, Portuguese, Chinese, Russian)
  • width - Width of the HTML report
  • locale - Sets the region/locale for formatting numerical values.

Questions?