Validation Checks
This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.
There are various validation checks included in the core Frictionless Framework along with an ability to create custom checks. Let's review what's in the box.
Baseline Check

The Baseline Check is always enabled. It makes various small checks that reveal a great deal of tabular errors. There is a report.tasks[].scope property to check which exact errors have been checked for:
Download capital-invalid.csv to reproduce the examples (right-click and "Save link as").
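Here is a minimal sketch of running the baseline validation and inspecting the scope, assuming the v4-style Python API (validate, report.task.scope, and report.flatten; attribute and key names may differ in other framework versions):

```python
from pprint import pprint
from frictionless import validate

# The Baseline Check runs automatically; no `checks` argument is needed
report = validate("capital-invalid.csv")

# Which error types were checked for in this validation task
pprint(report.task.scope)

# The errors that were actually found (v4-style key names)
pprint(report.flatten(["rowPosition", "fieldPosition", "code"]))
```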
The Baseline Check is incorporated into base Frictionless classes such as Resource, Header, and Row. The errors are not revealed in any fixed order because validation is highly optimized. One should consider the Baseline Check as one unit of validation.
Heuristic Checks

There is a group of checks that indicate probable errors. You need to use the checks argument of the validate function to activate one or more of these checks.
Duplicate Row

This check finds duplicate rows. Keep in mind that checking for duplicate rows can lead to high memory consumption on big files. Here is an example:
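A minimal sketch with inline illustrative data, assuming the v4-style checks.duplicate_row check:

```python
from pprint import pprint
from frictionless import validate, checks

# Inline illustrative data: the second and third rows are identical
data = [["name"], ["value"], ["value"]]
report = validate(data, checks=[checks.duplicate_row()])
pprint(report.flatten(["rowPosition", "code", "note"]))
```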
ASCII Value

If you want to skip non-ASCII characters, this check notifies you if any are present in the data during validation. Here is how we can use this check:
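A sketch with inline illustrative data, assuming the v4-style checks.ascii_value check:

```python
from pprint import pprint
from frictionless import validate, checks

# Inline illustrative data: the last cell contains a non-ASCII character
data = [["name"], ["john"], ["©orp"]]
report = validate(data, checks=[checks.ascii_value()])
pprint(report.flatten(["rowPosition", "fieldPosition", "code"]))
```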
Deviated Cell

This check identifies cells that deviate from the normal ones. To flag a deviated cell, the check compares the length of the characters in each cell against a threshold value. The threshold is either 5000 or a value calculated with Python's built-in statistics module as the average plus three standard deviations. The exact algorithm can be found here. For example:
Download issue-1066.csv to reproduce the examples (right-click and "Save link as").
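A sketch assuming the v4-style checks.deviated_cell check is available in your installed version:

```python
from pprint import pprint
from frictionless import validate, checks

# Flag cells whose character length deviates strongly from the field's norm
report = validate("issue-1066.csv", checks=[checks.deviated_cell()])
pprint(report.flatten(["code", "note"]))
```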
Deviated Value

This check uses Python's built-in statistics module to check a field's data for deviations. By default, deviated values are those outside of the average plus or minus three standard deviations. Take a look at the API Reference for more details about available options and default values. The exact algorithm can be found here. For example:
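A sketch with inline illustrative data, assuming the v4-style checks.deviated_value check and its field_name argument:

```python
from pprint import pprint
from frictionless import validate, checks

# Inline illustrative data: 1000 is an obvious outlier in the "temperature" field
data = [["temperature"], [1], [-2], [7], [0], [1], [2], [5], [-4], [1000], [8], [3]]
report = validate(data, checks=[checks.deviated_value(field_name="temperature")])
pprint(report.flatten(["code", "note"]))
```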
Truncated Value

Sometimes, during data export from a database or other storage, data values can be truncated. This check tries to detect such truncation. Let's explore some truncation indicators:
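A sketch with inline illustrative data, assuming the v4-style checks.truncated_value check:

```python
from pprint import pprint
from frictionless import validate, checks

# Common truncation indicators: a 255-character string and an int32 boundary value
data = [["int", "str"], [32767, "a" * 255], [2147483647, "good"]]
report = validate(data, checks=[checks.truncated_value()])
pprint(report.flatten(["rowPosition", "fieldPosition", "code"]))
```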
Regulation Checks

Contrary to heuristic checks, regulation checks give you the ability to provide additional rules for your data. Use the checks argument of the validate function to activate one or more of these checks.
Forbidden Value

This check ensures that a field doesn't contain any forbidden or denylisted values. For example:
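A sketch with inline illustrative data, assuming the v4-style checks.forbidden_value check with field_name and values arguments:

```python
from pprint import pprint
from frictionless import validate, checks

# Forbid "value2" in the "header" field
data = [["header"], ["value1"], ["value2"]]
report = validate(
    data,
    checks=[checks.forbidden_value(field_name="header", values=["value2"])],
)
pprint(report.flatten(["rowPosition", "fieldPosition", "code"]))
```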
Sequential Value

This check gives us an opportunity to validate sequential fields like primary keys or other similar data. The sequence doesn't need to start from 0 or 1. We're providing a field name:
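A sketch with inline illustrative data, assuming the v4-style checks.sequential_value check:

```python
from pprint import pprint
from frictionless import validate, checks

# The "id" field breaks the sequence at the last row (2 -> 5)
data = [["id", "name"], [1, "alex"], [2, "brad"], [5, "carl"]]
report = validate(data, checks=[checks.sequential_value(field_name="id")])
pprint(report.flatten(["rowPosition", "fieldPosition", "code"]))
```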
Row Constraint

This check is the most powerful one, as it uses the external simpleeval package, allowing you to evaluate arbitrary Python expressions on data rows. Let's look at an example:
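A sketch with inline illustrative data, assuming the v4-style checks.row_constraint check and its formula argument:

```python
from pprint import pprint
from frictionless import validate, checks

# The last row violates the expression salary > bonus
data = [["salary", "bonus"], [1000, 200], [2500, 500], [300, 1000]]
report = validate(data, checks=[checks.row_constraint(formula="salary > bonus")])
pprint(report.flatten(["rowPosition", "code", "note"]))
```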
Table Dimensions

This check is used to validate whether your data has the expected dimensions: an exact number of rows (num_rows), a minimum (min_rows) and maximum (max_rows) number of rows, an exact number of fields (num_fields), and a minimum (min_fields) and maximum (max_fields) number of fields.
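For example, requiring an exact number of rows. This is a sketch with inline illustrative data, assuming the v4-style checks.table_dimensions check:

```python
from pprint import pprint
from frictionless import validate, checks

# The data has only 2 rows, so requiring 3 produces an error
data = [["id", "name"], [1, "alex"], [2, "brad"]]
report = validate(data, checks=[checks.table_dimensions(num_rows=3)])
pprint(report.flatten(["code", "note"]))
```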
You can also give multiple limits at the same time:
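A sketch combining several limits in one check, under the same v4-style API assumption:

```python
from pprint import pprint
from frictionless import validate, checks

# Require at least 3 fields and at most 10 rows in a single check
data = [["id", "name"], [1, "alex"], [2, "brad"]]
report = validate(
    data,
    checks=[checks.table_dimensions(min_fields=3, max_rows=10)],
)
pprint(report.flatten(["code", "note"]))
```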
It is possible to use the check declaratively as:
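A sketch of the declarative form, assuming the v4-style convention of passing a check descriptor dict with a "code" key:

```python
from pprint import pprint
from frictionless import validate

# The check as a descriptor; note the camelCase keys described below
data = [["id", "name"], [1, "alex"], [2, "brad"]]
report = validate(
    data,
    checks=[{"code": "table-dimensions", "minFields": 3, "maxRows": 10}],
)
pprint(report.flatten(["code", "note"]))
```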
Note that in the declarative form the table dimensions check arguments num_rows, min_rows, max_rows, num_fields, min_fields, and max_fields must be passed in camelCase format, as in the example above, i.e. numRows, minRows, maxRows, numFields, minFields, and maxFields.