# Variable Type Check - Base R
actual_number <- 2022
fake_number <- "2022"
#Actual Number?
is.numeric(actual_number)
[1] TRUE
#Fake Number?
is.numeric(fake_number)
[1] FALSE
“With The pointblank Package”
Meghan Harris, MPH
Cleveland R User Group
October 26th 2022 | 2022-10-26
What is Validation and Why Do We Care?
Different Types of Validations
What is pointblank?
Why pointblank?
pointblank Use-Cases
I’m ABSOLUTELY not an expert in Data Validation or the pointblank package
Validation: the action of checking or proving the accuracy of something.
This can look like a lot of things:
Why is it important? … Because…
The art of Data Validation is a Rabbit Hole
Don’t fall for the “Validation Crux”
💡Remember:
Start “small” by looking at the most important pieces of your data and figuring out the minimum building blocks of logic or assumptions that are needed to make a fairly confident guess that the data is actually accurate. It is normal for this to be an iterative process.
Different “types” of validations can be considered as the building blocks you need to get started.
“Is this variable the same class/type we are expecting?”
Example: Is R reading this as a numeric or integer variable, or is it just a string impostor?
“Is the value of the variable one that is allowed or expected?”
Example: Can we confirm that the value R is reading makes sense given the context of allowed or expected values for this variable?
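A minimal base R sketch of this kind of check. The `states` vector is made up for illustration; `state.name` is the built-in vector of the 50 U.S. state names.

```r
# Hypothetical values to check
states <- c("New Jersey", "New York", "Mainne")

# TRUE where the value is one of the allowed/expected values
is_allowed <- states %in% state.name
is_allowed

# Pull out the values that are not allowed
states[!is_allowed]
```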
“Is there any missing data where there should/shouldn’t be?”
Example: Do we logically expect any values in a variable to be missing, or should missing data prompt us to investigate the data further?
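One way this could look in base R, assuming a made-up `totals` vector with a single missing value:

```r
# Hypothetical patient totals with one missing value
totals <- c(12, 58, NA, 12)

any_missing <- anyNA(totals)          # is anything missing at all?
n_missing <- sum(is.na(totals))       # how many values are missing?
where_missing <- which(is.na(totals)) # which positions are missing?
```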
“Is the data logically making sense upon delivery or after transformation/analyses/processing?”
Example: Does the data make sense given the context of collection, processing, or analysis?
“Should we expect data values to be unique or duplicated?”
Example: Given the context of the data, do we expect R to find any duplicates? If duplicates are present, is that a bad thing, or does it mean something?
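A small base R sketch of a uniqueness check, assuming a hypothetical `ids` vector where each value is supposed to appear once:

```r
# Hypothetical record IDs; "A02" appears twice
ids <- c("A01", "A02", "A02", "A03")

has_duplicates <- anyDuplicated(ids) > 0  # any repeats at all?
repeated_ids <- ids[duplicated(ids)]      # which values are repeated?
```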
“Is the string/character value of the right length given the context of the data?”
Example: Is the string’s length appropriate given the data’s meaning?
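For instance, if a column were supposed to hold two-letter state abbreviations (an assumption made only for this sketch), `nchar()` can flag entries of the wrong length:

```r
# Hypothetical two-letter state abbreviations; one entry is too long
abbr <- c("NJ", "NY", "PEN")

right_length <- nchar(abbr) == 2
right_length
```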
“Is the general format of the data as we expect?”
Example: Can R confirm that our data matches a specific format that is needed for our work?
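A rough sketch of a format check with a regular expression. The month/day/year pattern below is an assumed target format, not something the data itself defines:

```r
# Hypothetical date strings in mixed formats
dates <- c("8/15/2020", "July, 1st 2020", "9/1/2020")

# Assumed format: month/day/year with a 1-2 digit month and day
pattern <- "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$"

matches_format <- grepl(pattern, dates)
matches_format
```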
“Does the data fall into an accepted pre-determined range?”
Example: Can R confirm that our data matches a specific range that fits within the context of the data?
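A minimal range check in base R. The accepted range of 0 to 1000 is invented for illustration:

```r
# Hypothetical patient totals
totals <- c(12, 58, -5)

# Assumed accepted range: 0 to 1000 patients
in_range <- totals >= 0 & totals <= 1000
in_range
```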
What is pointblank?
pointblank is an R package by Rich Iannone (author/maintainer) and Mauricio Vargas (author) that was created to assist with methodically validating data and keeping track of relevant metadata (data about data) in R.
pointblank currently presents 6 validation workflows in the package.
Why pointblank?
While you can use base R and other relevant packages like validate or testthat for data validation and testing in R, pointblank is a validation package with a heavy focus on easy reporting and methodical validation schemas.
However… pointblank can be overwhelming…
pointblank Use-Cases
pointblank Use-Cases: Data Quality Reporting
We’ve got some aggregate study data that includes the total number of patients present in a US State for our clinical trial studies. Each total also has an associated report date that’s recorded. Because this is the real world, there’s no external control in the data collection process and it’s pretty F-tier 🙃
#example_script_1.R in the example_scripts folder#
library(readr)
#Load in the data#
example_data <- read_csv("data/example_data.csv")
#view it - or not 🤷🏾♀️
example_data
# A tibble: 5 × 3
state total_patients report_date
<chr> <dbl> <chr>
1 New Jersey 12 July, 1st 2020
2 New York 58 8/15/2020
3 Pennslyvania 34 6/13/2020
4 Mainne 12 9/1/2020
5 New Hampshire -5 8/20/2022
Study Patient Totals
state | total_patients | report_date |
---|---|---|
New Jersey | 12 | July, 1st 2020 |
New York | 58 | 8/15/2020 |
Pennslyvania | 34 | 6/13/2020 |
Mainne | 12 | 9/1/2020 |
New Hampshire | -5 | 8/20/2022 |
There are 3 variables (columns) and 5 observations (rows).
state: Official U.S. states that we’d expect to be spelled correctly and capitalized.
total_patients: A total number of patients reported from each state. We’d expect this to be a numeric type and to make sense: no negative values.
report_date: A reported date of entry. We’d expect this to be a date type and have consistent formatting for each observation.
library(pointblank) #For validation help
#Make an agent#
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals") %>%
  col_vals_in_set(state, state.name) %>% #Only valid states in the column?
  col_is_numeric(total_patients) %>% #Is column type numeric?
  col_vals_gte(total_patients, 0) %>% #Only values greater than or equal to 0 in the column?
  col_is_date(report_date) #Is column type date?
#interrogate it#
patient_agent %>%
  interrogate()
STEP: The name of the validation function used in each step. Color-coded tabs let us know if a step was completed. Darker green means everything in the step passed.
COLUMNS: The target columns we told the agent to interrogate via our validation rules.
VALUES: Any values needed to test for validation, if applicable.
TBL: Lets us know if the table was mutated in a validation step.
EVAL: Lets us know if there are issues R might have evaluating the table itself.
UNITS: Gives the total number of tests run for each step.
Steps that check all values in a column = 5 because we have five rows of data.
Steps that just check a whole column = 1 because it’s just evaluating one column.
PASS/FAIL: Gives the number/percentage of passing and failing test units.
W, S, N: Tells us if the validation steps have entered WARN, STOP, or NOTIFY. This is empty because there are no action levels set.
EXT: Provides a download of a data extract of observations that failed any validations, if applicable.
$`State Validation Fails`
# A tibble: 2 × 3
state total_patients report_date
<chr> <dbl> <chr>
1 Pennslyvania 34 6/13/2020
2 Mainne 12 9/1/2020
$`Patient Total Validation Fails`
# A tibble: 1 × 3
state total_patients report_date
<chr> <dbl> <chr>
1 New Hampshire -5 8/20/2022
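The unit counts from the report and extracts like the ones above can also be pulled out in code. This is a sketch under a few assumptions: the example table is rebuilt by hand as a data frame rather than read from the CSV, and the list names are simply copied from the output above. It uses pointblank's `get_agent_x_list()` and `get_data_extracts()`; the exact code used in the talk may differ.

```r
library(pointblank)

# Rebuild the example table by hand (an assumption; the talk reads it from a CSV)
example_data <- data.frame(
  state = c("New Jersey", "New York", "Pennslyvania", "Mainne", "New Hampshire"),
  total_patients = c(12, 58, 34, 12, -5),
  report_date = c("July, 1st 2020", "8/15/2020", "6/13/2020", "9/1/2020", "8/20/2022"),
  stringsAsFactors = FALSE
)

# Same four validation steps as the agent in the talk, then interrogate
agent <- create_agent(tbl = example_data, tbl_name = "Patient Totals") %>%
  col_vals_in_set(columns = state, set = state.name) %>%
  col_is_numeric(columns = total_patients) %>%
  col_vals_gte(columns = total_patients, value = 0) %>%
  col_is_date(columns = report_date) %>%
  interrogate()

# The x-list holds per-step results, including test unit counts
x <- get_agent_x_list(agent)
x$n         # test units per step (UNITS)
x$n_passed  # passing units per step (PASS)
x$n_failed  # failing units per step (FAIL)

# Failing rows for the row-wise steps (step 1 = states, step 3 = patient totals)
fails <- list(
  `State Validation Fails` = get_data_extracts(agent, i = 1),
  `Patient Total Validation Fails` = get_data_extracts(agent, i = 3)
)
fails
```

Only steps that check individual values produce extracts; the whole-column type checks (steps 2 and 4) have no rows to extract.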
action_levels():
#Make an action levels object#
al <- action_levels(warn_at = 0.2,
                    stop_at = 0.5,
                    notify_at = 1)
#Make an agent#
patient_agent <- create_agent(tbl = example_data,
                              tbl_name = "Patient Totals",
                              actions = al) %>%
  col_vals_in_set(state, state.name) %>% #Only valid states in the column?
  col_is_numeric(total_patients) %>% #Is column type numeric?
  col_vals_gte(total_patients, 0) %>% #Only values greater than or equal to 0 in the column?
  col_is_date(report_date) #Is column type date?
#interrogate it#
patient_agent %>%
  interrogate()
- We can use action_levels() to give our agent more directives when interrogating our data.
- We can set the fraction/percentage levels of validation failure that determine when the agent warns us, or stops the process altogether.
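To make the thresholds concrete, here is a rough base R sketch of the rule as I understand it from the pointblank documentation: a threshold below 1 is read as a fraction of failing test units, and a threshold of 1 or more is read as an absolute count of failing test units. The `crosses()` helper is hypothetical and is not part of pointblank.

```r
# Hypothetical helper (not part of pointblank) mirroring the threshold rule:
# thresholds below 1 are fractions of failing test units, thresholds of 1 or
# more are absolute counts of failing test units
crosses <- function(n_failed, n_units, threshold) {
  if (threshold < 1) {
    (n_failed / n_units) >= threshold
  } else {
    n_failed >= threshold
  }
}

# State check from the example: 2 of 5 test units fail (a fraction of 0.4)
warn   <- crosses(2, 5, threshold = 0.2)  # warn_at = 0.2
stop   <- crosses(2, 5, threshold = 0.5)  # stop_at = 0.5
notify <- crosses(2, 5, threshold = 1)    # notify_at = 1

c(warn = warn, stop = stop, notify = notify)
```

Under this reading, the state check would enter WARN and NOTIFY but not STOP, because `notify_at = 1` counts as "at least one failing unit" rather than "100% failing".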
[Figure: report with action_levels() in Viewer Pane]
pointblank Use-Cases: Pipeline Data Validation
Use warn_on_fail() for targeted warnings
Use stop_on_fail() for targeted stops/error catches
Use an action_levels() object for more control
Use no actions at all for basic testing
example_data %>%
  col_vals_in_set(state, state.name,
                  actions = warn_on_fail(warn_at = .6)) %>%
  col_is_numeric(total_patients,
                 actions = stop_on_fail(stop_at = 1)) %>%
  col_vals_gte(total_patients, 0,
               actions = al) %>%
  col_is_date(report_date)
Error: Failure to validate that column `report_date` is of type: Date.
The `col_is_date()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)
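In a pipeline, the useful part is that data passing every check simply flows through, while a stop can be caught and handled. A minimal sketch, assuming two small made-up tables (`clean_data` and `bad_data` are hypothetical) and base R's `tryCatch()`:

```r
library(pointblank)

# Hypothetical cleaned-up version of the example table
clean_data <- data.frame(
  state = c("New Jersey", "New York"),
  total_patients = c(12, 58),
  stringsAsFactors = FALSE
)

# When every step passes, the data comes out the other end of the pipe
checked <- clean_data %>%
  col_vals_in_set(columns = state, set = state.name) %>%
  col_vals_gte(columns = total_patients, value = 0)

# Same table with one negative total
bad_data <- transform(clean_data, total_patients = c(12, -5))

# A failing step with stop_on_fail() raises an error that tryCatch() can intercept
outcome <- tryCatch(
  {
    bad_data %>%
      col_vals_gte(columns = total_patients, value = 0,
                   actions = stop_on_fail(stop_at = 1))
    "passed"
  },
  error = function(e) "stopped"
)
outcome
```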
pointblank Use-Cases: Table Scans
Use scan_data() to generate an automated HTML output that gives information about the table.
sections - “OVICMS” (Overview, Variables, Interactions, Correlations, Missing Values, Sample)
navbar - Toggles the navigation bar on/off
lang - Chooses a language to present the report in (English, French, German, Italian, Spanish, Portuguese, Chinese, Russian)
width - Width of the HTML report
locale - Sets the region/locale for formatting numerical values.
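A small sketch of how these arguments might be combined. The table is made up, and the choice of sections ("OVS" for Overview, Variables, Sample) is just one possibility; scan_data() may also need extra plotting packages installed, depending on the sections requested and the pointblank version.

```r
library(pointblank)

# Hypothetical small table to scan
small_tbl <- data.frame(
  state = c("New Jersey", "New York", "New Hampshire"),
  total_patients = c(12, 58, -5),
  stringsAsFactors = FALSE
)

# Overview, Variables, and Sample sections only, no navigation bar, in English
report <- scan_data(
  small_tbl,
  sections = "OVS",
  navbar = FALSE,
  lang = "en"
)

# Optionally write the report to an HTML file (file name is an assumption)
# export_report(report, filename = "table_scan.html")
```

Dropping the Interactions and Correlations sections keeps the scan quick on larger tables, since those are typically the slowest parts to build.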