Checks Configurations
Performing checks over metric results is an important step in the Checkita framework. Once metric results are calculated, checks can be configured to identify whether there are any problems with data quality.
In Checkita there are three main groups of checks:
- Snapshot checks - allow comparison of metric results with static thresholds or with other metric results in the same Data Quality job.
- Trend checks - allow evaluation of how a metric result changes over a certain period of time. Checks of this type are used to detect anomalies in data. For trend checks to work, Data Quality storage must be set up, since the check calculator needs to fetch historical results for the metric of interest.
- Expression checks - allow definition of a boolean expression (formula) which can refer to multiple metric results. Check status is set depending on the result of the boolean expression evaluation.
IMPORTANT: Trend checks are scheduled for deprecation in Checkita 3.0 release as their functionality is now covered (and also enhanced) by trend metrics.
After evaluation, each check is assigned a status as described in the Status Model used in Results chapter.
Snapshot Checks
Snapshot checks represent a simple comparison of a metric result with a static threshold or with another metric result.
The following snapshot checks are supported:
- equalTo - checks if metric result is equal to a given threshold value or to other metric result.
- lessThan - checks if metric result is less than a given threshold value or other metric result.
- greaterThan - checks if metric result is greater than a given threshold value or other metric result.
- differByLT - checks if relative difference between two metric results is less than a given threshold. This check succeeds when the following expression is true: | metric - compareMetric | / compareMetric < threshold.
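The differByLT condition can be illustrated with a short Python sketch. This is not Checkita code: the function name and the handling of a zero compareMetric are assumptions made for illustration only.

```python
def differ_by_lt(metric: float, compare_metric: float, threshold: float) -> bool:
    """Return True when |metric - compare_metric| / compare_metric < threshold."""
    if compare_metric == 0:
        # Assumption: the relative difference is undefined for a zero
        # denominator, so this sketch simply reports a failed check.
        return False
    return abs(metric - compare_metric) / compare_metric < threshold


# 30 / 1000 = 0.03, which is below the 0.05 threshold.
print(differ_by_lt(1030, 1000, 0.05))
# 100 / 1000 = 0.10, which is not below the 0.05 threshold.
print(differ_by_lt(1100, 1000, 0.05))
```

Note that the threshold is a fraction of compareMetric, so 0.05 corresponds to a 5% relative difference.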
Snapshot checks are configured using a common set of parameters, which are:
- id - Required. Check ID.
- description - Optional. Description of the check.
- metric - Required. Metric ID which result is checked.
- compareMetric - Optional. Metric ID which result is used as a threshold.
- threshold - Optional. Explicit threshold value.
- isCritical - Optional, default is false. Boolean flag identifying whether this check is critical or not. Check criticality is important when application check failure tolerance is set to CRITICAL. For more information, see Check Failure Tolerance.
- metadata - Optional. List of user-defined metadata parameters specific to this check where each parameter is a string in format: param.name=param.value.
IMPORTANT. When configuring a check, specify either an explicit threshold value in the threshold field or another metric ID in the compareMetric field, whose result will be used as the threshold value. The only exception to this rule is the differByLT check, for which both the threshold value and the metric ID to compare with are required.
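The rule above can be expressed as a small validation sketch in Python. The function is hypothetical and not part of Checkita. It also assumes that "either ... or" means exactly one of the two fields is set for checks other than differByLT.

```python
def is_valid_snapshot_config(check_type, threshold=None, compare_metric=None):
    """Check the threshold / compareMetric combination for a snapshot check."""
    has_threshold = threshold is not None
    has_metric = compare_metric is not None
    if check_type == "differByLT":
        # differByLT needs both the threshold and the metric to compare with.
        return has_threshold and has_metric
    # Assumption: all other snapshot checks take exactly one of the two.
    return has_threshold != has_metric


print(is_valid_snapshot_config("equalTo", threshold=0))
print(is_valid_snapshot_config("differByLT", threshold=0.05))
```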
Trend Checks
Trend checks are used to detect anomalies in data. Checks of this type verify that the current value of a metric stays within a given deviation from its average value over a certain period of time. The maximum allowed deviation is configured by providing a threshold value.
The following trend checks are supported:
- averageBoundFull - sets the same upper and lower deviation from metric average result. Check succeeds when the following expression is true: (1 - threshold) * avgResult <= currentResult <= (1 + threshold) * avgResult.
- averageBoundUpper - verifies only upper deviation from the metric average result. Check succeeds when the following expression is true: currentResult <= (1 + threshold) * avgResult.
- averageBoundLower - verifies only lower deviation from the metric average result. Check succeeds when the following expression is true: (1 - threshold) * avgResult <= currentResult.
- averageBoundRange - sets different thresholds for upper and lower deviations from metric average result. Check succeeds when the following expression is true: (1 - thresholdLower) * avgResult <= currentResult <= (1 + thresholdUpper) * avgResult.
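The four conditions can be compared side by side in a minimal Python sketch. The function names are illustrative and not Checkita APIs. The historical values are made up, and their plain arithmetic mean stands in for avgResult.

```python
def average_bound_range(current, avg, threshold_lower, threshold_upper):
    # (1 - thresholdLower) * avgResult <= currentResult <= (1 + thresholdUpper) * avgResult
    return (1 - threshold_lower) * avg <= current <= (1 + threshold_upper) * avg


def average_bound_full(current, avg, threshold):
    # Same deviation allowed in both directions.
    return average_bound_range(current, avg, threshold, threshold)


def average_bound_upper(current, avg, threshold):
    return current <= (1 + threshold) * avg


def average_bound_lower(current, avg, threshold):
    return (1 - threshold) * avg <= current


history = [100.0, 110.0, 90.0, 100.0]   # hypothetical historical results
avg_result = sum(history) / len(history)  # 100.0
current_result = 120.0

print(average_bound_full(current_result, avg_result, 0.25))        # bounds 75..125
print(average_bound_upper(current_result, avg_result, 0.10))       # upper bound 110
print(average_bound_lower(current_result, avg_result, 0.10))       # lower bound 90
print(average_bound_range(current_result, avg_result, 0.20, 0.40)) # bounds 80..140
```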
Trend checks are configured using the following set of parameters:
- id - Required. Check ID.
- description - Optional. Description of the check.
- metric - Required. Metric ID which result is checked.
- rule - Required. The rule for calculating historical average value of the metric. There are two rules supported:
  - record - calculates the average value of metric for the configured number of historical records.
  - datetime - calculates the average value of metric for the configured datetime window.
- windowSize - Required. Size of the window for average metric value calculation:
  - If rule is set to record then window size is the number of records to retrieve.
  - If rule is set to datetime then window size is a duration string which should conform to Scala Duration.
- windowOffset - Optional, default is 0 or 0s. Sets window offset back from current reference date (see Working with Date and Time chapter for more details on reference date). By default, offset is absent and window starts from current reference date.
  - If rule is set to record then window offset is the number of records to skip from reference date.
  - If rule is set to datetime then window offset is a duration string which should conform to Scala Duration.
- threshold - Required. Sets maximum allowed deviation from historical average metric result. Not used with averageBoundRange check.
- thresholdLower - Required. Sets maximum allowed lower deviation from historical average metric result. Used only for averageBoundRange check.
- thresholdUpper - Required. Sets maximum allowed upper deviation from historical average metric result. Used only for averageBoundRange check.
- isCritical - Optional, default is false. Boolean flag identifying whether this check is critical or not. Check criticality is important when application check failure tolerance is set to CRITICAL. For more information, see Check Failure Tolerance.
- metadata - Optional. List of user-defined metadata parameters specific to this check where each parameter is a string in format: param.name=param.value.
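How rule, windowSize and windowOffset interact can be sketched in Python as follows. This is only an illustration of the descriptions above, not the actual Checkita implementation. The function name, the inclusive window boundaries and the use of timedelta in place of a parsed duration string are all assumptions.

```python
from datetime import datetime, timedelta


def select_window(history, reference, rule, window_size, window_offset=None):
    """Pick historical values for averaging.

    history is a list of (datetime, value) pairs; values are returned newest first.
    """
    # Keep only results at or before the reference date, newest first.
    past = sorted((h for h in history if h[0] <= reference),
                  key=lambda h: h[0], reverse=True)
    if rule == "record":
        # Skip `offset` records, then take `window_size` records.
        offset = window_offset or 0
        return [value for _, value in past[offset:offset + window_size]]
    if rule == "datetime":
        # Shift the window end back by the offset, then look back by the window size.
        # Assumption: both window boundaries are inclusive.
        end = reference - (window_offset or timedelta(0))
        start = end - window_size
        return [value for ts, value in past if start <= ts <= end]
    raise ValueError(f"unknown rule: {rule}")


reference_date = datetime(2024, 1, 10)
# Hypothetical daily results for 1..9 January, with the day number as the value.
daily = [(datetime(2024, 1, day), float(day)) for day in range(1, 10)]

print(select_window(daily, reference_date, "record", 3))
print(select_window(daily, reference_date, "record", 3, 2))
print(select_window(daily, reference_date, "datetime", timedelta(days=3)))
```

The average of the returned values would then play the role of the historical average that the trend check compares against.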
NOTE. Scala Duration string has a format of <length><unit> where the following units are allowed: d, day, h, hr, hour, m, min, minute, s, sec, second, ms, milli, millisecond, µs, micro, microsecond, ns, nano, nanosecond.
Top N Rank Check
This is a special check designed specifically for the Top N Metric and works only with it. Top N rank check calculates the Jaccard distance between the current and previous sets of top N metric results and checks that it does not exceed the threshold value.
IMPORTANT: Calculation of this check is currently supported only between the current and previous topN metric sets.
Top N rank check is configured using the following parameters:
- id - Required. Check ID.
- description - Optional. Description of the check.
- metric - Required. Metric ID which result is checked.
- targetNumber - Required. Number of records from the set of top N metric results that is considered. This number should be less than or equal to the number of collected top values in top N metric.
- threshold - Required. Maximum allowed Jaccard distance between current and previous sets of records from top N metric result. Should be a number in interval [0, 1].
- isCritical - Optional, default is false. Boolean flag identifying whether this check is critical or not. Check criticality is important when application check failure tolerance is set to CRITICAL. For more information, see Check Failure Tolerance.
- metadata - Optional. List of user-defined metadata parameters specific to this check where each parameter is a string in format: param.name=param.value.
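For reference, the Jaccard distance between two sets A and B is 1 - |A ∩ B| / |A ∪ B|. The Python sketch below applies it to two top N lists. The function names and currency values are illustrative, and treating two empty sets as having zero distance is an assumption.

```python
def jaccard_distance(current, previous):
    """Jaccard distance between two collections, compared as sets."""
    a, b = set(current), set(previous)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    return 1 - len(a & b) / len(union)


def top_n_rank(current_top, previous_top, target_number, threshold):
    """Check passes when the distance over the first target_number values does not exceed threshold."""
    distance = jaccard_distance(current_top[:target_number],
                                previous_top[:target_number])
    return distance <= threshold


current_top = ["USD", "EUR", "GBP"]   # hypothetical current top 3
previous_top = ["USD", "GBP", "EUR"]  # hypothetical previous top 3

# Top 2: {USD, EUR} vs {USD, GBP} share 1 of 3 distinct values, distance is 2/3.
print(top_n_rank(current_top, previous_top, target_number=2, threshold=0.1))
# Top 3: both sets are equal, distance is 0.
print(top_n_rank(current_top, previous_top, target_number=3, threshold=0.1))
```

Because the comparison is made on sets, a reordering within the first targetNumber values does not change the distance; only values entering or leaving that group do.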
Expression Checks
Expression checks represent a boolean expression referring to one or multiple metric results which evaluates to true or false. Metrics must be referenced by their IDs.
Formula must be written using Mustache Template notation, e.g.: {{ metric_1 }} + {{ metric_2 }}.
The following operations are supported for building boolean expressions:
- Basic math operations +, -, *, / and exponentiation ^.
- Grouping using parentheses.
- Mathematical functions of single argument: abs, sqrt, floor, ceil, round (rounds to the closest integer), ln (natural logarithm), lg (decimal logarithm), exp.
- Mathematical functions of two arguments: max and min.
- Logical operators: && (and), || (or), not.
- Comparison operators: ==, <>, >=, <=, >, <.
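To make the evaluation flow concrete, here is a rough Python sketch: metric placeholders are replaced with results, and the resulting expression is evaluated. This is not how Checkita parses formulas. The helper names are hypothetical, operator precedence follows Python rather than the framework, and Python's round resolves ties to the nearest even integer, which may differ from the framework's behaviour.

```python
import math
import re

# Functions listed above, mapped to Python equivalents (assumed semantics).
FUNCTIONS = {
    "abs": abs, "sqrt": math.sqrt, "floor": math.floor, "ceil": math.ceil,
    "round": round, "ln": math.log, "lg": math.log10, "exp": math.exp,
    "max": max, "min": min,
}

PLACEHOLDER = re.compile(r"\{\{\s*([\w.\-]+)\s*\}\}")


def render(formula, results):
    """Replace every {{ metric_id }} placeholder with its parenthesised result."""
    return PLACEHOLDER.sub(lambda m: f"({float(results[m.group(1)])!r})", formula)


def evaluate(formula, results):
    expr = render(formula, results)
    # Translate the operators whose spelling differs from Python syntax.
    for src, dst in (("&&", " and "), ("||", " or "), ("<>", "!="), ("^", "**")):
        expr = expr.replace(src, dst)
    # eval without builtins is acceptable for a sketch, not for untrusted input.
    return bool(eval(expr, {"__builtins__": {}}, FUNCTIONS))


results = {"hive_table_row_cnt": 1000, "csv_file_row_cnt": 950}  # hypothetical results
formula = "abs({{ hive_table_row_cnt }} - {{ csv_file_row_cnt }}) <= 100"
print(render(formula, results))
print(evaluate(formula, results))
```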
Expression checks are configured using the following set of parameters:
- id - Required. Check ID.
- description - Optional. Description of the check.
- formula - Required. Check formula: boolean expression referring to metric results.
- isCritical - Optional, default is false. Boolean flag identifying whether this check is critical or not. Check criticality is important when application check failure tolerance is set to CRITICAL. For more information, see Check Failure Tolerance.
- metadata - Optional. List of user-defined metadata parameters specific to this check where each parameter is a string in format: param.name=param.value.
IMPORTANT. When a check result is stored in DQ storage, the first metric ID referenced in the formula is considered the base metric, and all subsequent metric IDs are considered compare metrics.
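The base/compare split can be sketched in Python by scanning the formula for placeholders in order of appearance. The function name is hypothetical, and collapsing repeated references to the same metric ID is an assumption of this sketch.

```python
import re


def split_metrics(formula):
    """Return (base_metric, compare_metrics) in order of first appearance."""
    ids = re.findall(r"\{\{\s*([\w.\-]+)\s*\}\}", formula)
    if not ids:
        raise ValueError("formula does not reference any metric")
    # Assumption: repeated references are collapsed, keeping first-occurrence order.
    unique = list(dict.fromkeys(ids))
    return unique[0], unique[1:]


base, compare = split_metrics(
    "abs({{ hive_table_row_cnt }} - {{ csv_file_row_cnt }}) <= 100"
)
print(base)
print(compare)
```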
Checks Configuration Example
As shown in the example below, checks are grouped into three subsections: trend, snapshot and expression.
Within the trend and snapshot subsections, checks of the same type are grouped within subsections named after the type of the checks.
These subsections should contain a list of check configurations of the corresponding type. Expression checks are listed directly in the expression subsection.
jobConfig: {
  checks: {
    trend: {
      averageBoundFull: [
        {
          id: "avg_bal_check",
          description: "Check that average balance stays within +/-25% of the week average"
          metric: "avro_file1_avg_bal",
          rule: "datetime"
          windowSize: "8d"
          threshold: 0.25
          metadata: [
            "requestor=some.person@some.domain",
            "critical.check=true"
          ]
        }
      ]
      averageBoundUpper: [
        {id: "avg_pct_null", metric: "pct_of_null", rule: "datetime", windowSize: "15d", threshold: 0.5}
      ]
      averageBoundLower: [
        {id: "avg_distinct", metric: "fixed_file_dist_name", rule: "record", windowSize: 31, threshold: 0.3}
      ]
      averageBoundRange: [
        {
          id: "avg_inn_match",
          metric: "table_source1_inn_regex",
          rule: "datetime",
          windowSize: "8d",
          thresholdLower: 0.2
          thresholdUpper: 0.4
        }
      ]
      topNRank: [
        {id: "top2_curr_match", metric: "filterVS_top3_currency", targetNumber: 2, threshold: 0.1}
      ]
    }
    snapshot: {
      differByLT: [
        {
          id: "row_cnt_diff",
          description: "Number of rows in two tables should not differ on more than 5%.",
          metric: "hive_table_row_cnt"
          compareMetric: "csv_file_row_cnt"
          threshold: 0.05
        }
      ]
      equalTo: [
        {id: "zero_nulls", description: "Hive Table1 mustn't contain nulls", metric: "hive_table_nulls", threshold: 0, isCritical: true}
      ]
      greaterThan: [
        {id: "completeness_check", metric: "orc_data_compl", threshold: 0.99, isCritical: true}
      ]
      lessThan: [
        {id: "null_threshold", metric: "pct_of_null", threshold: 0.01, isCritical: true}
      ]
    }
    expression: [
      {
        id: "row_cnt_diff_abs"
        description: "Number of rows in two tables should not differ on more than 100 rows."
        formula: "abs({{ hive_table_row_cnt }} - {{ csv_file_row_cnt }}) <= 100"
      }
    ]
  }
}