MATCH_RECOGNIZE
Syntax
Description
The MATCH_RECOGNIZE
clause is used for searching patterns in a set of events from a #_relationas its input within the FROM clause.
A MATCH_RECOGNIZE
clause enables advanced event processing using the following tasks:
Logical partitioning of events using
PARTITION BY
Logical ordering of events for each partition based on one or many ordering columns using
ORDER BY
Definition of variables in the
DEFINE
subclause that are accessible fromPATTERN
, similar to a regular expression syntaxDefinition of one or many
MEASURE
columns using theDEFINE
variables, within the scope of matched events inPATTERN
Zero or one event per pattern recognition expression defined in
PATTERN
A SELECT with *
projection is not supported on a MATCH_RECOGNIZE
operator.
Arguments
PARTITION BY expression [, ...]
Optionally, this subclause can be used to parition the data to be looked at by the PATTERN
. This subclause groups the underlying data based on the paritioning expression(s), and optimizes DeltaStream's compute resources for parallel processing of the source data.
If PARTITION BY
subclause is not provided, all data is scanned sequentially to ensure global ordering for ORDER BY
, hence, it's recommended to use the PARTITION BY
subclause with at least one column.
The columns mentioned in this subclause contribute to the final schema of the MATCH_RECOGNIZE
operator. See #measures-expression-as-alias-... for definition of the full schema.
ORDER BY expression ASC [, expression [ASC | DESC], ...]
This cause defines the order in which the data should be sorted for each partition before they are passed to the pattern recognition in the MATCH_RECOGNIZE
operator.
The first column for the ORDER BY
subclause must be an event-time column defined for the input Relation in ascending order. The event-time column can be the record time at which a record was added to an Entity, referred to as rowtime()
. This can also be a user-defined timestamp column from a list of value columns. See #_source_properties for a user-defined timestamp column.
MEASURES expression AS alias [, ...]
This subclause is used for providing the output schema for the MATCH_RECOGNIZE
operator, much like the columns provided as a result of a SELECT statement. This means that MEASURES
can project source columns from the PATTERN
, or evaluated expressions that are based on projection columns.
The final schema of the MATCH_RECOGNIZE
is defined by the optional PARTITION BY
columns followed by the MEASURES
columns. ONE ROW PER MATCH defines how many rows to output for this schema, per match.
Since the MATCH_RECOGNIZE
operator alters the projection columns of its input #_relation, it is required to provide which variable from DEFINE
would provide the row metadata columns for the projected MATCH_RECOGNIZE
columns. See Row Metadata Functions for more information on how the row metadata columns are used.
Aggregation or offset functions may be used to do further processing on the variable definitions. See Built-in Functions for further function information.
ONE ROW PER MATCH
Optionally, this output mode can be provided for completeness of a MATCH_RECOGNIZE
operator, but is the only option for how many rows to output per match.
AFTER MATCH strategy
Optionally, start from a different position that is different than the AFTER MATCH SKIP PAST LAST ROW
strategy
, after a successful match.
strategy
can be set to:
SKIP TO NEXT ROW
: Continues matching starting at the first row after the current match.SKIP PAST LAST ROW
: As the default, continues matching for new rows starting after the last row of the current match.SKIP TO FIRST variable
: Continues matching starting at the first row that matched the providedvariable
. SeeDEFINES
for variable definitions.SKIP TO LAST variable
: Continues matching starting at the last row that matched the providedvariable
. SeeDEFINES
for variable definitions.
PATTERN (variable [, ...])
This subclause is the core processing power for the MATCH_RECOGNIZE
operator, and it explains what pattern the rows should be matched against. DEFINE
variables in addition to quantifiers or modifiers may build a complex pattern.
A quantifier is used to denote the number of DEFINE
variable or operation to be considered in the matching process. Supported quantifiers include:
+
: 1 or more rows*
: 0 or more rows?
: 0 or 1 row{n}
: Exactlyn
rows, wheren
> 0{n,}
:n
or more rows, wheren
is a non-negative number{,m}
: 0 tom
rows, inclusive, wherem
is a positive number{n,m}
:n
tom
rows, wheren
is a non-negative number andm
is greater thann
By default, quantifiers are greedy unless they are defined to be reluctant using the ?
operation.
WITHIN size time_unit
Optionally, this subclause is used to provide a time constraint for potential matches. While the WITHIN
clause is not part of the SQL standard for MATCH_RECOGNIZE
, we provide this subclause as a way to manage the amount of state that your query needs to maintain. If the WITHIN
subclause is not provided, then the default is that there is no time constraint for matches.
All of a match's records must fall within the time constraint to be considered a valid match. For example, if the time between the first and last record in a match is longer than the time constraint, then that match would be invalid and not produced as an output. The time of a record is determined by its event time (see timestamp
under #_source_properties).
Note that it is generally recommended to provide the WITHIN
clause. Many MATCH_RECOGNIZE
queries, especially those that use a greedy quantifier, will keep track of all the records it consumes in memory. Without providing a WITHIN
time constraint, these queries will eventually run out of memory and fail. The WITHIN
time constraint allows the query to safely clean up its memory of records that have fallen out of the time interval.
DEFINE variable AS expression [, ...]
This subclause is used as the building block for the entire MATCH_RECOGNIZE
operator for providing name for expressions that are used in building a PATTERN
. By default, a variable that is not defined, but used in a PATTERN
, evaluates to true for all rows that contibute to a recognized pattern.
Aggregation or offset functions may be used to do further processing on the variable definitions. See Built-in Functions for further function information.
Examples
Aggregated price trends over time
One of the common use cases for pattern recognition is detecting market price changes over time for a specific symbol to allow downstream notification for an end user or application:
This CSAS statement detects when a symbol
's average price drops under 15
. When such an event is detected, it projects the time at which this trend started and ended with, in addition to what average price the symbol
ended with.
These events are orderd by the time at which each price change was reported to market_ticker
.
Patterns within a time interval
When recognizing a pattern over a continuous stream, it's important to recognize that matched patterns are recognized with the help of an internal state over the life of the PATTERN
definition. Hence, the more events are involved in a pattern, the larger the state of the query becomes, and the larger the state of the query, the more risk there is to the health of the operations involved.
For that matter, the state of a pattern can be controlled over time, by providing a WITHIN
subclause, which dictates the amount of time a PATTERN
should be tracking against:
This example looks for a pattern starting with price > 10
, with zero to many events where price is under 15
, and ends with an event where price > 15
within the last hour. It's worth mentioning that this pattern recognition is true iff this pattern occurs within the last hour. If the events matching this pattern span more than an hour, the series won't be matched and no output will be recroded for it. See WITHIN size time_unit for more information.
One more thing to point out here is that we used AFTER MATCH SKIP TO LAST C
to ensure we're capturing continuous patterns, instead of AFTER MATCH SKIP PAST LAST ROW
used for isolating matched pattern from one another.
Non-overlapping PATTERNs with multiple ORDER BY columns
When matching patterns over more complex #_relations, we can use multiple ordering columns to narrow down and shape the pattern in a way that matters the most to the pattern we're looking for. This example uses ORDER BY ts, price
to order the events by the defined timestamp column ts
and the price
column:
We're defining our timestamp column to be ts
to ensure ordering on when a price change was reported by the upstream source rather than the time an event was recorded in market_ticker
. See timestamp
under #_source_properties for more information on defining an event time for source Relation.
Last updated