JOIN
Syntax
Description
JOIN
is the SQL operator to combine records from two Relations according to a given join criteria. Join criteria, specified as the ON
clause, can only be a Boolean comparison expression that declares equality of a pair of matching columns, one from each Relation, e.g. left.col = right.col
. A Relation alias can be specified for each Relation in the join clause. The alias can then be used to refer to the Relation in other clauses in the query, such as SELECT
or WHERE
clauses. This is useful in case referred Relations have duplicate column names. DeltaStream has support for LEFT JOIN
, INNER JOIN
, and FULL JOIN
. Currently, two types of join are supported: interval join and temporal join.
Arguments
Join Types
Join Type | Description |
---|---|
| Returns records that have matching values in both Relations. |
| Same as |
| Returns all records when there is a match in either the left or right Relation. |
| Same as |
| Returns all records from the left Relation and the matched records from the right Relation. |
| Same as |
| Not supported. |
| Not supported. |
Interval Join (Stream/Stream)
Interval join combines records coming from two Streams according to a join condition and a time dimension. The ON
clause defines the join condition in the form of equality of two columns from the source Streams. The time dimension is specified using the WITHIN
clause, which defines a time interval. Two records are matching if they have the same value for referred columns in the join criteria and the difference between their timestamps is valid according to the time interval in the WITHIN
clause. For interval joins, INNER
, LEFT
, and FULL
join types are supported.
Examples
Join two Streams
Given orders
and shipments
#_streams with the following example records:
When the following SELECT
statement is run:
We get the following output:
NOTE: The output results in this example are not fully deterministic as factors such as late arriving data can alter results.
The above example shows a sample of outputs for this join query. Since this is a streaming query, more records will be outputted as data arrives in the source. Notice that not every record from orders
is joined and outputted. The reason for this is because the join type is JOIN
and the join condition is that the orders
record’s itemid
matches the shipments
record’s orderid
within 10 seconds of the orders
record’s ordertime
. For example, the second orders
record has an ordertime
of 3000
, but the only shipments
record with the corresponding orderid
has a shipmenttime
of 14000
, which is not within 10 seconds of the ordertime
. Similarly, the orders
record with itemid=2
and ordertime=26000
is not within 10 seconds of either of the shipments
records with the corresponding orderid
. However, the orders
record with itemid=2
and ordertime=10000
is within 10 seconds of both shipments
records with a corresponding orderid
, so the orders
record joins on both shipments
records and produces 2 records in the output, one for each successful join.
Interval self-join
The following query performs a temporal join, joining the pageviews
Stream with itself when the pageid
values match and the userid
values don’t match within 5 minutes of the left Relation’s timestamp. Essentially, the query finds all the pairs of unique users that visited the same pageid
within 5 minutes of the event’s timestamp.
Temporal Join (Stream/Changelog)
A temporal join combines records coming from a Stream with a Changelog according to a join condition on records’ timestamps. A join condition is defined using the ON
clause and specifies the equality of a pair of columns, one from the Stream side and one from the Changelog. Any column or field can be picked from the Stream, as long as it is defined with the NOT NULL
constraint. The column from the Changelog has to be the primary column. Records’ timestamps are used to correlate records from the Stream side with the relevant version of a record from the Changelog side. For temporal joins, only INNER
and LEFT
join types are supported.
Examples
Join a Stream and Changelog
Given pageviews
Stream and users_log
Changelog with a PRIMARY KEY
of userid
with the following example records:
When the following SELECT
statement is run:
We get the following output:
NOTE: The output results in this example are not fully deterministic as factors such as late arriving data can alter results.
The above example shows a sample of outputs for this join query. Since this is a streaming query, more records will be outputted as data arrives in the source. Notice that not every record from pageviews
is joined and outputted. The reason for this is because the join type is JOIN
and the join condition is that the pageviews
record’s userid
matches the users_log
record’s userid
at the event timestamp. For example, the first pageviews
record has a viewtime
of 2000
, but there has yet to be any user_logs
records with registertime <= 2000
. Thus, there is no corresponding record to join with and nothing is returned. Notice the pageviews
record with userid=User_2
and viewtime=10000
matches with the users_log
record with userid=User_2
and registertime=6000
, while the pageviews
record with userid=User_2
and viewtime=18000
matches with the users_log
record with userid=User_2
and registertime=15000
. Think of the users_log
Changelog as a constantly updating state of the world with a userid
as the PRIMARY KEY
. In this example, for each pageviews
record, we are looking up the current information in the users_log
Changelog for the corresponding userid
. At timestamp 10000
, User_2
has a registered top_interest
as News
. At timestamp 15000
, User_2
’s top_interest
changes to Sports
, so for pageviews
records with userid=User_2
with a timestamp > 15000
, the joined top interest will report Sports
instead of News
.
Temporal join with separate source properties
The following performs a temporal join between Stream pageviews
and Changelog Users
on equal userid
values. The Stream on the left side of the join will start ingesting data starting from the epoch timestamp 1675225325286
, and the right side of the join will start ingesting data from the earliest available records.
Temporal join on columns with non-primitive data types
The following shows an example of temporal join using a join criteria defined on a pair of columns whose data type is STRUCT
(i.e. non-primitive). The example includes the DDL statements for the Changelog and Stream relations, along with the temporal join's query. Note that the contactinfo
column in the users
Stream is defined with the NOT NULL constraint. This requirement is essential for temporal join, as both columns involved in the join criteria must not contain NULL values.
Last updated