Streaming Sources Configuration
When running Data Quality checks over streaming data sources, it is required to define them in the `streams` section of the job configuration. Sources defined in this section are read as streaming dataframes using the Spark Structured Streaming API. More details on running data quality checks over streaming sources are given in the Data Quality Checks over Streaming Sources chapter.
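For context, the sketch below shows roughly how a streaming dataframe is obtained via the Spark Structured Streaming API. The schema, format and path are illustrative placeholders, not part of the framework's configuration schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

val spark = SparkSession.builder().appName("dq-streaming-sketch").getOrCreate()

// Streaming file sources require an explicit schema (illustrative fields here).
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("createdAt", TimestampType)
))

// A file source read as a streaming dataframe (hypothetical path).
val fileStream = spark.readStream
  .schema(schema)
  .format("parquet")
  .load("/data/landing/events")
```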
The configuration of streaming sources is the same as for static ones. See the Sources Configuration chapter for more details.
It is important to note that not all supported sources can be read in streaming mode. Currently, only the sources below can be read as streams:
- File sources.
- Kafka sources. Note that the default value of the `startingOffsets` parameter differs: when defining a streaming Kafka source, this parameter defaults to `latest`. Also, for streaming Kafka sources the `endingOffsets` parameter is ignored (all new records are processed until the application is stopped). See the sketch after this list.
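The offset behaviour described above matches how a Kafka stream is read in Spark Structured Streaming. A minimal sketch, with placeholder broker address and topic name, reusing the `spark` session from the sketch above:

```scala
// Minimal sketch of a Kafka source read as a stream.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder brokers
  .option("subscribe", "some-topic")                    // placeholder topic
  .option("startingOffsets", "latest")                  // default for streaming reads
  // Note: endingOffsets is not applicable to streaming queries -
  // new records keep being processed until the application stops.
  .load()
```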
The only additional parameter that can be defined for all streaming sources is the following:

- `windowBy` - *Optional, default is `processingTime`*. Source of the timestamp used to assign records to a particular streaming window and also to skip "late" records. Applicable only for streaming jobs! The following options are supported (see the sketch after this list):
  - `processingTime` - uses the current timestamp at the moment when Spark processes the record.
  - `eventTime` - mostly applicable to Kafka sources. Uses the column named `timestamp` to retrieve the time value from. This column must be of Timestamp type.
  - `custom(columnName)` - uses an arbitrary user-defined column to retrieve the time value from. The specified column must be of Timestamp type. In addition, SQL expressions are supported; an expression must also evaluate to a value of Timestamp type. For example, `custom(value.createdAt)` retrieves the time value for a record from the message value's field named `createdAt`.
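Conceptually, the three `windowBy` options correspond to deriving a timestamp column for windowing as sketched below. This is an illustrative mapping, not the framework's actual implementation, and the column name `dq_window_ts` is hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, current_timestamp, expr}

// Illustrative mapping of windowBy options to the column used for windowing.
def withWindowTimestamp(df: DataFrame, windowBy: String): DataFrame = windowBy match {
  case "processingTime" =>
    // Timestamp assigned at the moment Spark processes the record.
    df.withColumn("dq_window_ts", current_timestamp())
  case "eventTime" =>
    // Kafka's own 'timestamp' column (must be of Timestamp type).
    df.withColumn("dq_window_ts", col("timestamp"))
  case c if c.startsWith("custom(") && c.endsWith(")") =>
    // Arbitrary column or SQL expression evaluating to a Timestamp,
    // e.g. custom(value.createdAt).
    df.withColumn("dq_window_ts", expr(c.stripPrefix("custom(").stripSuffix(")")))
  case other =>
    throw new IllegalArgumentException(s"Unsupported windowBy value: $other")
}
```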