MATCH_RECOGNIZE
The MATCH_RECOGNIZE clause is used for searching patterns in a set of events from a Relation, provided as its input within the FROM clause.
A MATCH_RECOGNIZE clause enables advanced event processing through the following tasks:
Logical partitioning of events using PARTITION BY
Logical ordering of events for each partition based on one or more ordering columns using ORDER BY
Definition of variables in the DEFINE subclause that are accessible from PATTERN, similar to a regular expression syntax
Definition of one or more MEASURES columns using the DEFINE variables, within the scope of matched events in PATTERN
Output of zero or one event per pattern recognition expression defined in PATTERN
A SELECT with *
projection is not supported on a MATCH_RECOGNIZE
operator.
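As an orientation, here is a sketch of how these subclauses fit together in a single query. The relation name trades and all column and variable names are illustrative assumptions rather than a fixed schema; each subclause is described in detail below:

```sql
-- Hypothetical input: a stream named trades with columns symbol, ts, and price.
SELECT symbol, start_price, end_price        -- explicit projection; SELECT * is not supported
FROM trades
MATCH_RECOGNIZE(
  PARTITION BY symbol                        -- logical partitioning of events
  ORDER BY ts                                -- logical ordering within each partition
  MEASURES                                   -- output columns of the operator
    FIRST(A.price) AS start_price,
    LAST(C.price)  AS end_price
  ONE ROW PER MATCH                          -- one output row per recognized pattern
  AFTER MATCH SKIP PAST LAST ROW             -- where matching resumes after a match
  PATTERN (A B* C) WITHIN 1 HOUR             -- the pattern of variables, plus a time constraint
  DEFINE                                     -- conditions each variable must satisfy
    A AS A.price > 10,
    B AS B.price < 15,
    C AS C.price > 15
);
```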
The PARTITION BY subclause can optionally be used to partition the data to be looked at by the PATTERN. This subclause groups the underlying data based on the partitioning expression(s), and optimizes DeltaStream's compute resources for parallel processing of the source data.
If the PARTITION BY subclause is not provided, all data is scanned sequentially to ensure global ordering for ORDER BY; hence, it is recommended to use the PARTITION BY subclause with at least one column.
The columns mentioned in this subclause contribute to the final schema of the MATCH_RECOGNIZE operator. See the MEASURES subclause for the definition of the full schema.
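For example, partitioning by a symbol column (an assumed column name) lets each symbol's events be matched independently and adds symbol to the operator's output schema:

```sql
PARTITION BY symbol   -- events for each symbol are matched in parallel, independently of other symbols
```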
The ORDER BY subclause defines the order in which the data is sorted for each partition before it is passed to pattern recognition in the MATCH_RECOGNIZE operator.
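For example, with ts as the input's event-time column (an assumed name, as in the later examples), each partition is sorted by ts before patterns are evaluated:

```sql
ORDER BY ts   -- rows in each partition are sorted by event time before matching
```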
The MEASURES subclause provides the output schema for the MATCH_RECOGNIZE operator, much like the columns provided as a result of a SELECT statement. This means that MEASURES can project source columns from the PATTERN, or evaluated expressions based on those projection columns.
The final schema of the MATCH_RECOGNIZE operator is defined by the optional PARTITION BY columns followed by the MEASURES columns. ONE ROW PER MATCH defines how many rows to output for this schema, per match.
Aggregation or offset functions may be used to do further processing on the variable definitions. See Built-in Functions for further function information.
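A sketch of a MEASURES list, assuming pattern variables A and C from DEFINE and the FIRST, LAST, and AVG functions referenced under Built-in Functions:

```sql
MEASURES
  FIRST(A.ts)  AS start_time,   -- event time of the first row matched by A
  LAST(C.ts)   AS end_time,     -- event time of the last row matched by C
  AVG(C.price) AS avg_price     -- aggregation over all rows matched by C
```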
The ONE ROW PER MATCH output mode can optionally be provided for completeness of a MATCH_RECOGNIZE operator, but it is the only option for how many rows to output per match.
Optionally, after a successful match, matching can resume from a position that is different from the default AFTER MATCH SKIP PAST LAST ROW strategy. The strategy can be set to:
SKIP TO NEXT ROW: Continues matching starting at the first row after the current match.
SKIP PAST LAST ROW: As the default, continues matching for new rows starting after the last row of the current match.
SKIP TO FIRST variable: Continues matching starting at the first row that matched the provided variable. See DEFINE for variable definitions.
SKIP TO LAST variable: Continues matching starting at the last row that matched the provided variable. See DEFINE for variable definitions.
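For instance, to resume matching at the last row mapped to a variable C from DEFINE, the strategy would be written as:

```sql
AFTER MATCH SKIP TO LAST C   -- the next match attempt starts at the last row matched by C
```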
The PATTERN subclause is the core processing power of the MATCH_RECOGNIZE operator; it specifies the pattern the rows should be matched against. DEFINE variables, in addition to quantifiers or modifiers, may build a complex pattern.
A quantifier is used to denote the number of rows a DEFINE variable or operation may match in the matching process. Supported quantifiers include:
+: 1 or more rows
*: 0 or more rows
?: 0 or 1 row
{n}: Exactly n rows, where n > 0
{n,}: n or more rows, where n is a non-negative number
{,m}: 0 to m rows, inclusive, where m is a positive number
{n,m}: n to m rows, where n is a non-negative number and m is greater than n
By default, quantifiers are greedy unless they are defined to be reluctant using the ? modifier.
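For example, assuming variables A, B, and C from DEFINE, the following pattern matches exactly one A row, zero or more B rows (matched reluctantly because of the trailing ?), and one to three C rows:

```sql
PATTERN (A B*? C{1,3})
```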
The WITHIN subclause can optionally be used to provide a time constraint for potential matches. While the WITHIN clause is not part of the SQL standard for MATCH_RECOGNIZE, we provide this subclause as a way to manage the amount of state that your query needs to maintain. If the WITHIN subclause is not provided, there is no time constraint for matches.
Note that it is generally recommended to provide the WITHIN
clause. Many MATCH_RECOGNIZE
queries, especially those that use a greedy quantifier, will keep track of all the records they consume in memory. Without providing a WITHIN
time constraint, these queries will eventually run out of memory and fail. The WITHIN
time constraint allows the query to safely clean up its memory of records that have fallen out of the time interval.
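As a sketch, following the WITHIN size time_unit form, constraining a pattern to one hour lets the query discard state for rows older than the interval:

```sql
PATTERN (A B* C) WITHIN 1 HOUR   -- a match must complete within 1 hour; older rows can be dropped from state
```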
The DEFINE subclause is the building block of the entire MATCH_RECOGNIZE operator; it provides names for the expressions that are used in building a PATTERN. By default, a variable that is used in a PATTERN but not defined evaluates to true for all rows that contribute to a recognized pattern.
Aggregation or offset functions may be used to do further processing on the variable definitions. See Built-in Functions for further function information.
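A sketch of a DEFINE list for three variables referenced by a PATTERN, with price as an assumed column:

```sql
DEFINE
  A AS A.price > 10,   -- rows whose price is above 10
  B AS B.price < 15,   -- rows whose price is below 15
  C AS C.price > 15    -- rows whose price is above 15
```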
One of the common use cases for pattern recognition is detecting market price changes over time for a specific symbol to allow downstream notification for an end user or application:
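A minimal sketch of such a CSAS statement, assuming a target Stream named price_drops, illustrative measure and variable names, and rowtime() as the event time (see the note on ORDER BY later on this page):

```sql
CREATE STREAM price_drops AS
SELECT symbol, start_time, end_time, avg_price
FROM market_ticker
MATCH_RECOGNIZE(
  PARTITION BY symbol
  ORDER BY rowtime()                     -- the time each price change was reported to market_ticker
  MEASURES
    FIRST(B.rowtime()) AS start_time,    -- when the downward trend started
    LAST(B.rowtime())  AS end_time,      -- when the downward trend ended
    AVG(B.price)       AS avg_price      -- the average price the symbol ended with
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A B+)
  DEFINE
    B AS AVG(B.price) < 15               -- rows while the symbol's running average price stays under 15
);
```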
This CSAS statement detects when a symbol's average price drops under 15. When such an event is detected, it projects the times at which this trend started and ended, in addition to the average price the symbol ended with.
These events are ordered by the time at which each price change was reported to market_ticker.
When recognizing a pattern over a continuous stream, it's important to note that matched patterns are found with the help of internal state kept over the life of the PATTERN definition. Hence, the more events are involved in a pattern, the larger the state of the query becomes, and the larger the state of the query, the more risk there is to the health of the operations involved.
For that matter, the state of a pattern can be bounded over time by providing a WITHIN subclause, which dictates the amount of time a PATTERN should be tracked for.
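A sketch of such a query over market_ticker follows, with assumed Stream and column names; the WITHIN bound and the skip strategy match the explanation below:

```sql
CREATE STREAM hourly_recoveries AS
SELECT symbol, start_price, end_price
FROM market_ticker
MATCH_RECOGNIZE(
  PARTITION BY symbol
  ORDER BY rowtime()
  MEASURES
    A.price AS start_price,          -- price at the start of the matched pattern
    C.price AS end_price             -- price of the event that completes the pattern
  ONE ROW PER MATCH
  AFTER MATCH SKIP TO LAST C         -- resume at the last C row to capture continuous patterns
  PATTERN (A B* C) WITHIN 1 HOUR     -- the entire match must occur within one hour
  DEFINE
    A AS A.price > 10,
    B AS B.price < 15,
    C AS C.price > 15
);
```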
This example looks for a pattern starting with price > 10, followed by zero or more events where price is under 15, and ending with an event where price > 15, all within the last hour. It's worth mentioning that this pattern recognition holds only if the pattern occurs within one hour. If the events matching this pattern span more than an hour, the series won't be matched and no output will be recorded for it. See WITHIN size time_unit for more information.
One more thing to point out here is that we used AFTER MATCH SKIP TO LAST C to ensure we're capturing continuous patterns, instead of AFTER MATCH SKIP PAST LAST ROW, which isolates matched patterns from one another.
The first column of the ORDER BY subclause must be an event-time column defined for the input Relation, in ascending order. The event-time column can be the record time at which a record was added to an Entity, referred to as rowtime(). It can also be a user-defined timestamp column from the list of value columns; see the timestamp property for defining a user-defined timestamp column.
Since the MATCH_RECOGNIZE operator alters the projection columns of its input Relation, it is required to specify which variable from DEFINE provides the row metadata columns for the projected MATCH_RECOGNIZE columns. See Row Metadata Functions for more information on how the row metadata columns are used.
All of a match's records must fall within the time constraint to be considered a valid match. For example, if the time between the first and last record in a match is longer than the time constraint, then that match would be invalid and not produced as an output. The time of a record is determined by its event time (see the timestamp property).
When matching patterns over more complex Relations, we can use multiple ordering columns to narrow down and shape the pattern in a way that matters the most to the pattern we're looking for. This example uses ORDER BY ts, price to order the events by the defined timestamp column ts and the price column.
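A sketch of such a query follows; declaring ts as the source's timestamp via a WITH property is an assumption about how the event time would be set, and the measure names are illustrative:

```sql
CREATE STREAM ordered_trends AS
SELECT symbol, start_time, end_time
FROM market_ticker WITH ('timestamp' = 'ts')   -- treat the ts value column as the event time (assumed property syntax)
MATCH_RECOGNIZE(
  PARTITION BY symbol
  ORDER BY ts, price            -- order by the reported time first, then by price
  MEASURES
    FIRST(A.ts) AS start_time,
    LAST(C.ts)  AS end_time
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A B* C) WITHIN 1 HOUR
  DEFINE
    A AS A.price > 10,
    B AS B.price < 15,
    C AS C.price > 15
);
```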
We're defining our timestamp column to be ts to ensure ordering by when a price change was reported by the upstream source rather than the time an event was recorded in market_ticker. See the timestamp property for more information on defining an event time for a source Relation.