Row Key Definition
Define a Row Key in a DDL statement
Any #_stream or #_changelog can have a Row Key according to the messages in its #entity. Row key definition in a DDL statement is optional and currently is only available for relations that are backed by a Kafka topic. If a row key is defined for a relation, source Kafka messages’ keys are used to set the row keys in the relation’s records. The row key in a record can be accessed via the rowkey
function listed in the Row Metadata Functions.
You can define the row key for a relation by adding certain properties to the relation’s DDL statement in the WITH
clause:
key.format
: This is the data format for the row key (that is, how key bytes are serialized) in the messages coming from the relation’s topic. Supported formats for a row key are:primitive
json
protobuf
avro
key.type
: This defines the structure of the row key’s fields:For a row key with a
primitive
format, thekey.type
defines the data type of key values. Valid data types for a primitive row key areSMALLINT
INTEGER
BIGINT
DOUBLE
VARCHAR
.
For a row key in the
json
,avro
, orprotobuf
format, thekey.type
is astruct
whose fields define the names and data types of the key’s fields.
key.descriptor.name
: When you define a relation with a row key in theprotobuf
format, you must upload a descriptor for the key to DeltaStream prior to running the DDL statement. This descriptor must be associated with the relation’s #entity. In the DDL statement, thekey.descriptor.name
parameter defines the name of theprotobuf
message descriptor in the descriptor source, which should be used for reading and writing records’ row keys.
Examples
The following DDL statement defines a stream that has a primitive
row key of the VARCHAR
type:
The following DDL statement defines a stream that has a row key in the json
format. The row key in each record can be accessed via the rowkey
function. It is a struct
that has a field named id
of the INTEGER
data type:
The following DDL statement defines a stream that has a row key in the protobuf
format. The row key in each record has two fields
an
INTEGER
field namedid
a
VARCHAR
field namedregion
.
A protobuf descriptor for the row key is defined with the name customer_key_msg
within the descriptor source of the relation’s topic.
Row Key in CREATE AS SELECT
Statements
CREATE AS SELECT
StatementsIn addition to a DDL statement, a new #_stream or #_changelog can be defined using a CREATE AS SELECT
(CAS) statement in CREATE STREAM AS SELECT and CREATE CHANGELOG AS SELECT, respectively. The query in the CAS statement has an impact on the row key definition of the new relation.
CAS with Simple SELECT
SELECT
In a CAS statement with a SELECT clause consisting of only projection and filter operators, using SELECT
and WHERE
clauses, respectively, the row key definition for the new relation is the same as the row key definition (if any) of the source.
As an example, the below CSAS statement creates a new pagevisits
stream by filtering some records from the pageviews
stream, which we defined above (see Examples). Given that pageviews
records have row keys with the primitive format,pagevisits
, records have row keys with a similar definition:
When running CAS, you can specify the key.format
for the new relation (sink) in the sink’s WITH
clause. For example, the below CSAS statement is similar to the above one, except that it changes the row key format to json
for the new pagevisits
stream, while the source relation pageviews
has its row key with the primitive
format:
CCAS with GROUP BY
GROUP BY
If the SELECT clause in a CREATE CHANGELOG AS SELECT statement has a GROUP BY
clause, the row key for the new relation consists of columns in the GROUP BY
clause.
The key.format
for the new relation is the same as a source relation’s key.format
. Similar to the example above, you can change the key.format
using the sink’s WITH
clause.
For example, the below query runs grouping and aggregation on pageviews
using a tumbling window to create a new changelog visits_rate
. The row key for visits_rate
has three fields:
window_start
window_end
userid
as they are the columns referred in the GROUP BY
clause:
The below query creates a new changelog region_stats
by running a grouping and aggregation on the source stream users
. Each record in region_stats
has a row key with a single field named location
:
CSAS with JOIN
JOIN
If the SELECT clause in a CREATE STREAM AS SELECT statement has a JOIN
clause, the row key for the new relation consists of the column from the left relation in the JOIN
criteria.
The key.format
for the new relation is the same as the left source relation’s key.format
. Similar to the example above, you can change the key.format
using the sink’s WITH
clause.
For example, the below query runs an interval join on two streams, pageviews
and users
, to create a new stream named pvusers
. Each record in pvusers
has a row key with one field, userid
, as userid
is the join column in pageviews
, which is the left-side of the join, referred in the join criteria: p.userid = u.userid
:
Last updated