Application Settings
General Checkita Data Quality settings are configured in a HOCON file application.conf which is supplied to the application on startup. All configurations are set within the appConfig section. There is only one parameter that is set at the top level, and this is applicationName - the name of the Spark application. This parameter is optional; if not set, then the Checkita Data Quality application name is used by default.
The rest of the parameters are defined in the subsections that are described below.
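For instance, a minimal sketch of an application.conf that only overrides the application name could look as follows (the name value is arbitrary); all other settings then fall back to their defaults:

appConfig: {
  # Optional top-level parameter: Spark application name.
  applicationName: "My Custom Data Quality Application"
}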
DateTime Settings
DateTime configurations are set in the dateTimeOptions section. Please see the Working with Date and Time section for more details on working with date and time in the Checkita Framework. DateTime settings include the following:
- timeZone - Time zone in which string representations of the reference date and execution date are parsed and rendered. Optional, default is "UTC".
- referenceDateFormat - Datetime format used to parse and render the reference date. Optional, default is "yyyy-MM-dd'T'HH:mm:ss.SSS".
- executionDateFormat - Datetime format used to parse and render the execution date. Optional, default is "yyyy-MM-dd'T'HH:mm:ss.SSS".
If the dateTimeOptions section is missing, then default values are used for all of the parameters above.
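For instance, the dateTimeOptions section from the full example at the end of this page overrides all three parameters:

dateTimeOptions: {
  timeZone: "GMT+3"                           # parse and render dates in the GMT+3 time zone
  referenceDateFormat: "yyyy-MM-dd"           # reference date carries no time part
  executionDateFormat: "yyyy-MM-dd-HH-mm-ss"  # custom execution date format
}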
Streaming Settings
These settings are only applicable to streaming applications and define various aspects of running data quality checks for streaming sources. Please see the Data Quality Checks over Streaming Sources section for more details on running data quality checks over streaming sources.
- trigger - Trigger interval: defines the time interval for which micro-batches are collected. Optional, default is 10s.
- window - Window interval: defines the tumbling window size used to accumulate metrics. All metric results and checks are evaluated per each window once it is finalized. Optional, default is 10m.
- watermark - Watermark level: defines the time interval after which late records are no longer processed. Optional, default is 5m.
- allowEmptyWindows - Boolean flag indicating whether empty windows are allowed. Thus, in a situation when a window is below the watermark and there are no results for some of the processed streams, all related checks will be skipped if this flag is set to true. Otherwise, checks will be processed and will return an error status with a "... metric results were not found ..." type of message. Optional, default is false.
- checkpointDir - Checkpoint directory location. Optional; if empty, then checkpoints will not be written.
IMPORTANT All intervals must be defined as duration strings that conform to the Scala Duration format.
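A sketch of such a configuration is shown below. It is assumed here that the streaming settings live in a streaming subsection of appConfig and that the checkpoint path is just a placeholder:

streaming: {
  trigger: "20s"            # collect micro-batches every 20 seconds
  window: "30m"             # evaluate metric results and checks per 30-minute window
  watermark: "10m"          # records arriving more than 10 minutes late are not processed
  allowEmptyWindows: true   # skip checks for windows that received no results
  checkpointDir: "hdfs://some/path/to/checkpoints"  # placeholder path; omit to disable checkpointing
}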
Enablers
The enablers section of the application configuration file defines various boolean switches and single-value parameters that control various aspects of data quality job execution:
- allowSqlQueries - Enables usage of arbitrary SQL queries in data quality job configuration. Optional, default is false.
- allowNotifications - Enables notifications to be sent from the DQ application. Optional, default is false.
- aggregatedKafkaOutput - Enables sending aggregated messages for Kafka targets (one per each target type). By default, Kafka messages are sent per each result entity. Optional, default is false.
- enableCaseSensitivity - Enables column name case sensitivity. Controls column name comparison and lookup. Optional, default is false.
- errorDumpSize - Maximum number of errors to be collected per single metric. The framework is able to collect source data rows where metric evaluation yielded errors, but in order to prevent OOM errors the number of collected errors has to be limited to a reasonable value. Thus, the maximum allowable number of errors per metric is 10000. It is possible to lower this number by setting this parameter. Optional, default is 10000.
- outputRepartition - Sets the number of partitions when writing outputs. By default, a single file is written. Optional, default is 1.
- metricEngineAPI - Sets the engine to be used for regular metric processing: rdd (RDD-engine) or df (DF-engine) are available. It is recommended to use the DF-engine for batch applications, while streaming applications support only the RDD-engine. Optional, default is rdd.
- checkFailureTolerance - Sets the check failure tolerance for the application, i.e. whether the application should return a non-zero exit code when some of the checks have failed. For more info, see Check Failure Tolerance. Optional, default is none.
If the enablers section is missing, then default values are used for all of the parameters above.
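For illustration, an enablers section that switches metric processing to the DF-engine and lowers the error dump limit might look like the sketch below (the particular values are illustrative, not recommendations):

enablers: {
  allowNotifications: true      # allow notifications to be sent from the DQ application
  enableCaseSensitivity: true   # compare and look up column names case-sensitively
  errorDumpSize: 1000           # collect at most 1000 errors per metric (upper limit is 10000)
  metricEngineAPI: "df"         # use the DF-engine, recommended for batch applications
  checkFailureTolerance: "none" # default tolerance; see Check Failure Tolerance
}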
Storage Configuration
Parameters for connecting to the Data Quality results storage are defined in the storage section of the application configuration. For more information on the results storage, refer to the Data Quality Results Storage chapter of the documentation. The connection to the storage is configured using the following parameters:
- dbType - Type of database used to store Data Quality results. Required.
- url - Database connection URL (without protocol identifiers). Required.
- username - Username to connect to the database with (if required). Optional.
- password - Password to connect to the database with (if required). Optional.
- schema - Schema where data quality tables are located (if required). Optional.
- saveErrorsToStorage - Enables metric errors to be stored in the storage database. Optional, default is false.
IMPORTANT If the storage section is missing, then the application will run without a results storage:
- results won't be saved (only targets can be sent);
- trend checks (used for anomaly detection in data) won't be performed as they require historical data.
In addition, be mindful when storing metric errors in the storage database. Depending on the errorDumpSize setting, the number of collected errors could be quite large. This can lead to overloading the DQ storage as well as increasing the execution time of database write operations. Another concern is that metric errors contain data excerpts from the sources being checked. These excerpts might contain sensitive information that should rather not be stored in the DQ storage database. Alternatively, these excerpts can be encrypted before storing. See the Encryption configuration for more details.
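For instance, a minimal storage section only has to provide the database type and connection URL; the optional parameters are added when the target database requires them (the values below are placeholders):

storage: {
  dbType: "postgres"            # type of database used to store results
  url: "localhost:5432/public"  # connection URL without protocol identifiers
}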
Email Configuration
In order to send notifications via email, it is necessary to configure a connection to an SMTP server, which should be defined in the email section of the application configuration with the following parameters:
- host - SMTP server host. Required.
- port - SMTP server port. Required.
- address - Email address to send notifications from. Required.
- name - Name of the sender. Required.
- sslOnConnect - Boolean parameter indicating whether to use SSL on connect. Optional, default is false.
- tlsEnabled - Boolean parameter indicating whether to enable TLS. Optional, default is false.
- username - Username for connection to the SMTP server (if required). Optional.
- password - Password for connection to the SMTP server (if required). Optional.
If the email section is missing, then email notifications cannot be sent. If any were configured in the job configuration, then an exception will be thrown at runtime.
Mattermost Configuration
In order to send notifications to Mattermost, it is necessary to configure a connection to the Mattermost API, which should be defined in the mattermost section of the application configuration with the following parameters:
- host - Mattermost API host.
- token - Mattermost API token (using bot accounts for notifications is preferable).
If the mattermost section is missing, then the corresponding notifications cannot be sent. If any were configured in the job configuration, then an exception will be thrown at runtime.
Default Spark Parameters
It is also possible to provide a list of default Spark configuration parameters used across multiple jobs. These parameters should be provided as the defaultSparkOptions list, where each parameter is a string in the format spark.param.name=spark.param.value.
Encryption
When the storage section is defined, it is also recommended to use the encryption section in order to protect sensitive information in the job config. This is done by defining the following parameters within the application configuration file:
- secret - Secret string used to encrypt/decrypt sensitive fields. This string should contain at least 32 characters. Required.
- keyFields - List of key fields used to identify fields that require encryption/decryption. Optional, default is [password, secret].
- encryptErrorData - Boolean flag indicating whether it is necessary to encrypt data excerpts within collected metric errors. Optional, default is false.
If the encryption section is missing, then sensitive information will not be encrypted.
IMPORTANT Both the keys of the job configuration and the data excerpts contained in metric errors might include sensitive information. Storing raw sensitive information in the DQ storage database might not satisfy security requirements. Therefore, the DQ framework offers functionality to encrypt sensitive data with the AES256 encryption algorithm. As AES256 is a symmetric algorithm, encrypted data can be decrypted with the same secret key if needed.
Example of Application Configuration File
The HOCON configuration format supports variable substitution, and the Checkita Data Quality framework has a mechanism to feed configuration files with extra variables at runtime (the ${dqMattermostToken} substitution in the example below relies on it). For more information, see the Usage of Environment Variables and Extra Variables chapter of the documentation.
appConfig: {
  applicationName: "Custom Data Quality Application Name"
  dateTimeOptions: {
    timeZone: "GMT+3"
    referenceDateFormat: "yyyy-MM-dd"
    executionDateFormat: "yyyy-MM-dd-HH-mm-ss"
  }
  enablers: {
    allowSqlQueries: false
    allowNotifications: true
    aggregatedKafkaOutput: true
  }
  defaultSparkOptions: [
    "spark.sql.orc.enabled=true"
    "spark.sql.parquet.compression.codec=snappy"
    "spark.sql.autoBroadcastJoinThreshold=-1"
  ]
  storage: {
    dbType: "postgres"
    url: "localhost:5432/public"
    username: "postgres"
    password: "postgres"
    schema: "dqdb"
    saveErrorsToStorage: true
  }
  email: {
    host: "smtp.some-company.domain"
    port: "25"
    username: "emailUser"
    password: "emailPassword"
    address: "some.service@some-company.domain"
    name: "Data Quality Service"
    sslOnConnect: true
  }
  mattermost: {
    host: "https://some-team.mattermost.com"
    token: ${dqMattermostToken}
  }
  encryption: {
    secret: "secretmustbeatleastthirtytwocharacters"
    keyFields: ["password", "username", "url"]
    encryptErrorData: true
  }
}