Schemas Configuration
Schemas are used in Data Quality jobs for two purposes:
- To provide an explicit schema when reading sources that have no embedded schema, such as delimited or fixed-width text files.
- To provide a reference schema against which the actual source schema is validated. Reference schemas are used in schemaMatch load checks. See Schema Match Check.
Schemas are set in the schemas section of the job configuration and can be defined in several formats, as described below.
The format in which a schema is defined is set in the kind field, which determines what other fields need to be provided.
Apart from the kind field, all kinds of schema configuration share the following common parameters:
- id - Schema ID that uniquely identifies its configuration;
- description - Optional schema description;
- metadata - Optional list of arbitrary user-defined metadata parameters.
Delimited Schema Configuration
This kind of schema definition is primarily used to provide schemas for delimited text files such as CSV or TSV.
Nevertheless, these schemas can be used for schemaMatch load checks as well. With this type of configuration,
only flat schemas can be defined (nested columns are not allowed).
The delimited schema definition contains the following parameters:
- kind: "delimited" - Required. Sets delimited schema definition format.
- id - Required. Schema ID;
- description - Optional. Schema description;
- schema - Required. List of schema columns where each column is an object with following fields:
  - name - Required. Name of the column;
  - type - Required. Type of the column. See Supported Type Literals for allowed types.
- metadata - Optional. List of user-defined metadata parameters specific to this schema where each parameter is a string in format: param.name=param.value.
Fixed-Full Schema Configuration
The fixed-full kind of schema definition is used to provide schemas for reading fixed-width text files. The key difference
from other schema definitions is that column widths are also provided, which is crucial information for parsing
fixed-width files. This kind of schema may also be used for reading delimited files and as a reference in schemaMatch
load checks.
With this type of configuration, only flat schemas can be defined (nested columns are not allowed).
The fixed-full schema definition contains the following parameters:
- kind: "fixedFull" - Required. Sets fixed-full schema definition format.
- id - Required. Schema ID;
- description - Optional. Schema description;
- schema - Required. List of schema columns where each column is an object with following fields:
  - name - Required. Name of the column;
  - type - Required. Type of the column. See Supported Type Literals for allowed types.
  - width - Required. Integer width of column (number of symbols).
- metadata - Optional. List of user-defined metadata parameters specific to this schema where each parameter is a string in format: param.name=param.value.
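To make the role of width more concrete, the sketch below defines a fixed-full schema and annotates, in comments, how one line of a fixed-width file would be cut into columns by width. The schema ID, column names and sample line are hypothetical. The comments only illustrate the slicing by width; how padding and type conversion are handled is not covered here.

```hocon
jobConfig: {
  schemas: [
    {
      id: "fixed_full_example"   # hypothetical schema ID
      kind: "fixedFull"
      schema: [
        {name: "id", type: "integer", width: 5},      # symbols 1-5
        {name: "amount", type: "double", width: 8},   # symbols 6-13
        {name: "flag", type: "boolean", width: 5}     # symbols 14-18
      ]
    }
  ]
}
# Hypothetical input line (18 symbols): "00042 1234.50true "
#   id     <- "00042"
#   amount <- " 1234.50"
#   flag   <- "true "
```

The widths add up to the full line length (5 + 8 + 5 = 18), so every symbol of the line belongs to exactly one column.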
Fixed-Short Schema Configuration
The fixed-short kind of schema definition provides a more compact syntax for defining schemas used for reading
fixed-width files. Columns are defined by their name and width only. Consequently, all columns have StringType.
This kind of schema may also be used for reading delimited files and as a reference in schemaMatch
load checks.
With this type of configuration, only flat schemas can be defined (nested columns are not allowed).
The fixed-short schema definition contains the following parameters:
- kind: "fixedShort" - Required. Sets fixed-short schema definition format.
- id - Required. Schema ID;
- description - Optional. Schema description;
- schema - Required. List of schema columns where each column is a string in format columnName:columnWidth. Type of columns is always a StringType.
- metadata - Optional. List of user-defined metadata parameters specific to this schema where each parameter is a string in format: param.name=param.value.
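Since fixed-short columns always have string type, a fixed-short definition can be read as shorthand for a fixed-full definition in which every column is typed as string. The sketch below shows the two forms side by side. The schema IDs and column names are hypothetical, and the equivalence assumes that the string type literal corresponds to StringType.

```hocon
jobConfig: {
  schemas: [
    # Compact form: each column is "columnName:columnWidth"
    {id: "short_form", kind: "fixedShort", schema: ["code:3", "title:20"]}

    # Same columns and widths in fixed-full form, with types spelled out
    {
      id: "full_form"
      kind: "fixedFull"
      schema: [
        {name: "code", type: "string", width: 3},
        {name: "title", type: "string", width: 20}
      ]
    }
  ]
}
```

The compact form is convenient when every field is read as text anyway. The fixed-full form is needed as soon as any column requires a non-string type.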
Avro Schema Configuration
The avro kind of schema configuration reads a schema from an Avro schema file (.avsc). A schema read this way
can be used to read both Avro files and delimited text files, and can also serve as a reference
in schemaMatch load checks. In addition, the Avro schema format supports complex schemas with nested columns.
To read a schema from an Avro schema file, supply the following parameters:
- kind: "avro" - Required. Sets avro schema definition format.
- id - Required. Schema ID;
- description - Optional. Schema description;
- schema - Required. Path to avro schema file (.avsc) to read schema from.
- validateDefaults - Optional, default is false. Boolean flag enabling or disabling default values validation in Avro schema.
- metadata - Optional. List of user-defined metadata parameters specific to this schema where each parameter is a string in format: param.name=param.value.
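As an illustration of the parameters above, the following sketch reads a schema from an .avsc file with validation of default values switched on. The schema ID and the file path are placeholders, not values taken from a real project.

```hocon
jobConfig: {
  schemas: [
    {
      id: "avro_with_defaults_check"        # hypothetical schema ID
      kind: "avro"
      description: "Schema read from an Avro schema file"
      schema: "path/to/avro_schema.avsc"    # placeholder path
      validateDefaults: true                # default is false
    }
  ]
}
```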
Hive Schema Configuration
The Hive catalogue can be used as a source of schemas. The hive kind of schema definition retrieves schemas from
Hive tables. These schemas can be used to read both Avro files and delimited text files, and can also serve as
a reference in schemaMatch load checks.
To retrieve a schema from a Hive table, set the following parameters:
- kind: "hive" - Required. Sets hive schema definition format.
- id - Required. Schema ID;
- description - Optional. Schema description;
- schema - Required. Hive schema to search for a table.
- table - Required. Hive table to retrieve schema from.
- excludeColumns - Optional. List of column names to exclude from the schema. This is sometimes needed, e.g. to exclude partition columns.
- metadata - Optional. List of user-defined metadata parameters specific to this schema where each parameter is a string in format: param.name=param.value.
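A minimal sketch of a hive schema that drops a partition column is shown below. The schema ID and the column name load_date are hypothetical; the Hive schema and table names are placeholders.

```hocon
jobConfig: {
  schemas: [
    {
      id: "hive_no_partitions"        # hypothetical schema ID
      kind: "hive"
      schema: "some_schema"           # Hive schema (database) holding the table
      table: "some_table"
      excludeColumns: ["load_date"]   # hypothetical partition column to leave out
    }
  ]
}
```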
Registry Schema Configuration
Confluent Schema Registry can be used as a source of schemas.
Thus, the registry kind of schema definition is intended to read schemas from
the Confluent Schema Registry. These schemas can be used to read both Avro files and delimited text files, and can
also serve as a reference in schemaMatch load checks.
To retrieve a schema from the Confluent Schema Registry, set the following parameters:
- kind: "registry" - Required. Sets registry schema definition format.
- id - Required. Schema ID;
- description - Optional. Schema description;
- baseUrls - Required. List of URLs to connect to the schema registry.
- schemaId - Optional. Schema ID to search for in the schema registry.
- schemaSubject - Optional. Schema subject to search for in the schema registry.
- version - Optional. Schema version to read from the schema registry. If empty, the latest available version is read.
- validateDefaults - Optional, default is false. Boolean flag enabling or disabling default values validation in Avro schema.
- properties - Optional, default is empty. List of additional connection properties: sequence of strings in format key=value.
- headers - Optional, default is empty. List of additional HTTP headers: sequence of strings in format key=value.
- metadata - Optional. List of user-defined metadata parameters specific to this schema where each parameter is a string in format: param.name=param.value.
- connectionTimeoutMs - Optional, default is 60000. Maximum time in milliseconds to wait for a response from the Schema Registry.
- retryAttempts - Optional, default is 3. Number of retry attempts in case of a failure.
- retryIntervalMs - Optional, default is 5000. Delay in milliseconds between retry attempts.
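The sketch below shows what a registry schema definition might look like when the schema is looked up by subject. The URL, subject name and header are hypothetical placeholders. The version parameter is left out, so the latest available version would be read, and the timeout and retry settings are set to non-default values purely for illustration.

```hocon
jobConfig: {
  schemas: [
    {
      id: "registry_schema"                                   # hypothetical schema ID
      kind: "registry"
      description: "Schema retrieved from Confluent Schema Registry"
      baseUrls: ["http://schema-registry.example.com:8081"]   # placeholder URL
      schemaSubject: "some-topic-value"                       # hypothetical subject
      headers: ["X-Client-Name=dq-job"]                       # hypothetical header, key=value
      connectionTimeoutMs: 30000                              # default is 60000
      retryAttempts: 5                                        # default is 3
      retryIntervalMs: 2000                                   # default is 5000
    }
  ]
}
```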
Supported Type Literals
The following type literals are supported when defining schema columns in job configuration file:
string
boolean
date
timestamp
integer (32-bit integer)
long (64-bit integer)
short (16-bit integer)
byte (signed integer in a single byte)
double
float
decimal(precision, scale) (precision <= 38; scale <= precision)
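As a short illustration, the sketch below uses several of these literals in a delimited schema, including two decimal types that satisfy the stated constraints. The schema ID and column names are hypothetical.

```hocon
jobConfig: {
  schemas: [
    {
      id: "typed_columns"   # hypothetical schema ID
      kind: "delimited"
      schema: [
        {name: "event_id", type: "long"},
        {name: "event_date", type: "date"},
        {name: "is_valid", type: "boolean"},
        {name: "price", type: "decimal(12, 2)"},   # precision 12, scale 2
        {name: "ratio", type: "decimal(5, 5)"}     # scale may equal precision
      ]
    }
  ]
}
```

A literal such as decimal(40, 2) or decimal(5, 6) would violate the constraints, since the first exceeds the maximum precision of 38 and the second has a scale greater than its precision.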
Schemas Configuration Example
As shown in the example below, the schemas section is a list of schema definitions of various kinds.
jobConfig: {
  schemas: [
    {
      id: "schema1"
      kind: "delimited"
      description: "Schema describing content of CSV file"
      schema: [
        {name: "colA", type: "string"},
        {name: "colB", type: "timestamp"},
        {name: "colC", type: "decimal(10, 3)"}
      ]
    }
    {
      id: "schema2"
      kind: "fixedFull"
      schema: [
        {name: "col1", type: "integer", width: 5},
        {name: "col2", type: "double", width: 6},
        {name: "col3", type: "boolean", width: 4}
      ]
    }
    {id: "schema3", kind: "fixedShort", schema: ["colOne:5", "colTwo:7", "colThree:9"]}
    {id: "hive_schema", kind: "hive", schema: "some_schema", table: "some_table"}
    {
      id: "avro_schema"
      kind: "avro"
      schema: "path/to/avro_schema.avsc"
      metadata: [
        "schema.origin=http://some-schema-registry-location"
      ]
    }
  ]
}