Row Key Definition
Last updated
Last updated
Any or can have a according to the messages in its . Row key definition in a DDL statement is optional and currently is only available for relations that are backed by a Kafka topic. If a row key is defined for a relation, source Kafka messages’ keys are used to set the row keys in the relation’s records. The row key in a record can be accessed via the rowkey
function listed in the Row Metadata Functions.
The user can define the row key for a relation by adding certain properties to the relation’s DDL statement in the WITH
clause:
key.format
: This is the data format for the row key (i.e. how key bytes are serialized) in the messages coming from the relation’s topic. Supported formats for a row key are: primitive
, json
, protobuf
, and avro
.
key.type
: It defines the structure of the row key’s fields:
For a row key with a primitive
format, the key.type
defines the data type of key values. Valid data types for a primitive row key are: SMALLINT
, INTEGER
, BIGINT
, DOUBLE
, and VARCHAR
.
For a row key in the json
, avro
, or protobuf
format, the key.type
is a struct
whose fields define the names and data types of the key’s fields.
key.descriptor.name
: When defining a relation with a row key in the protobuf
format, a Descriptor for the key has to be uploaded to DeltaStream prior to running the DDL statement. This descriptor has to be associated with the relation’s . In the DDL statement, the key.descriptor.name
parameter defines the name of the protobuf
message descriptor in the descriptor source, which should be used for reading and writing records’ row keys.
The following DDL statement defines a stream that has a primitive
row key of the VARCHAR
type:
The following DDL statement defines a stream that has a row key in the json
format. The row key in each record can be accessed via the rowkey
function, and it is a struct
that has a field named id
of the INTEGER
data type:
The following DDL statement defines a stream that has a row key in the protobuf
format. The row key in each record has two fields: an INTEGER
field named id
and a VARCHAR
field named region
. A protobuf descriptor for the row key is defined with the name customer_key_msg
within the descriptor source of the relation’s topic.
CREATE AS SELECT
StatementsSELECT
In a CAS statement with a SELECT clause consisting of only projection and filter operators, using SELECT
and WHERE
clauses, respectively, the row key definition for the new relation will be the same as the row key definition (if any) of the source.
As an example, the below CSAS statement creates a new pagevisits
stream by filtering some records from the pageviews
stream, which we defined above (see Examples). Given that pageviews
records have row keys with the primitive format,pagevisits
records will have row keys with a similar definition:
When running CAS, the key.format
for the new relation (sink) can be specified in the sink’s WITH
clause. For example, the below CSAS statement is similar to the above one; However, it changes the row key format to json
for the new pagevisits
stream, while the source relation pageviews
has its row key with the primitive
format:
GROUP BY
If the SELECT clause in a CREATE CHANGELOG AS SELECT statement has a GROUP BY
clause, the row key for the new relation will consist of columns in the GROUP BY
clause.
The key.format
for the new relation will be the same as a source relation’s key.format
. Similar to the example above, the key.format
can be changed using the sink’s WITH
clause.
For example, the below query runs grouping and aggregation on pageviews
using a tumbling window to create a new changelog visits_rate
. The row key for visits_rate
will have three fields: window_start
, window_end
, and userid
as they are the columns referred in the GROUP BY
clause:
The below query creates a new changelog region_stats
by running a grouping and aggregation on the source stream users
. Each record in region_stats
will have a row key with a single field named location
:
JOIN
If the SELECT clause in a CREATE STREAM AS SELECT statement has a JOIN
clause, the row key for the new relation will consist of the column from the left relation in the JOIN
criteria.
The key.format
for the new relation will be the same as the left source relation’s key.format
. Similar to the example above, the key.format
can be changed using the sink’s WITH
clause.
For example, the below query runs an interval join on two streams, pageviews
and users
, to create a new stream named pvusers
. Each record in pvusers
will have a row key with one field, userid
, as userid
is the join column in pageviews
, which is left-side of the join, referred in the join criteria: p.userid = u.userid
:
In addition to a DDL statement, a new or can be defined using a CREATE AS SELECT
(CAS) statement in CREATE STREAM AS SELECT and CREATE CHANGELOG AS SELECT, respectively. The query in the CAS statement has an impact on the row key definition of the new relation.