MATCH_RECOGNIZE

Syntax

MATCH_RECOGNIZE (
    [PARTITION BY expression [, ...]]
    ORDER BY expression ASC [, expression [ASC | DESC], ...]
    MEASURES expression AS alias [, ...]
    [ONE ROW PER MATCH]
    [AFTER MATCH strategy]
    PATTERN (variable [, ...])
    [WITHIN size time_unit]
    DEFINE variable AS expression [, ...]
)

Description

A MATCH_RECOGNIZE clause enables advanced event processing using the following tasks:

  • Logical partitioning of events using PARTITION BY

  • Logical ordering of events for each partition based on one or many ordering columns using ORDER BY

  • Definition of variables in the DEFINE subclause that are accessible from PATTERN, similar to a regular expression syntax

  • Definition of one or many MEASURE columns using the DEFINE variables, within the scope of matched events in PATTERN

  • Zero or one event per pattern recognition expression defined in PATTERN

A SELECT with * projection is not supported on a MATCH_RECOGNIZE operator.

Arguments

PARTITION BY expression [, ...]

Optionally, this subclause can be used to parition the data to be looked at by the PATTERN. This subclause groups the underlying data based on the paritioning expression(s), and optimizes DeltaStream's compute resources for parallel processing of the source data.

If PARTITION BY subclause is not provided, all data is scanned sequentially to ensure global ordering for ORDER BY, hence, it's recommended to use the PARTITION BY subclause with at least one column.

The columns mentioned in this subclause contribute to the final schema of the MATCH_RECOGNIZE operator. See #measures-expression-as-alias-... for definition of the full schema.

ORDER BY expression ASC [, expression [ASC | DESC], ...]

This cause defines the order in which the data should be sorted for each partition before they are passed to the pattern recognition in the MATCH_RECOGNIZE operator.

The first column for the ORDER BY subclause must be an event-time column defined for the input Relation in ascending order. The event-time column can be the record time at which a record was added to a Topic, referred to as rowtime(). This can also be a user-defined timestamp column from a list of value columns. See #_source_properties for a user-defined timestamp column.

MEASURES expression AS alias [, ...]

This subclause is used for providing the output schema for the MATCH_RECOGNIZE operator, much like the columns provided as a result of a SELECT statement. This means that MEASURES can project source columns from the PATTERN, or evaluated expressions that are based on projection columns.

The final schema of the MATCH_RECOGNIZE is defined by the optional PARTITION BY columns followed by the MEASURES columns. ONE ROW PER MATCH defines how many rows to output for this schema, per match.

Aggregation or offset functions may be used to do further processing on the variable definitions. See Built-in Functions for further function information.

ONE ROW PER MATCH

Optionally, this output mode can be provided for completeness of a MATCH_RECOGNIZE operator, but is the only option for how many rows to output per match.

AFTER MATCH strategy

Optionally, start from a different position that is different than the AFTER MATCH SKIP PAST LAST ROW strategy, after a successful match.

strategy can be set to:

  • SKIP TO NEXT ROW: Continues matching starting at the first row after the current match.

  • SKIP PAST LAST ROW: As the default, continues matching for new rows starting after the last row of the current match.

  • SKIP TO FIRST variable: Continues matching starting at the first row that matched the provided variable. See DEFINES for variable definitions.

  • SKIP TO LAST variable: Continues matching starting at the last row that matched the provided variable. See DEFINES for variable definitions.

PATTERN (variable [, ...])

This subclause is the core processing power for the MATCH_RECOGNIZE operator, and it explains what pattern the rows should be matched against. DEFINE variables in addition to quantifiers or modifiers may build a complex pattern.

A quantifier is used to denote the number of DEFINE variable or operation to be considered in the matching process. Supported quantifiers include:

  • +: 1 or more rows

  • *: 0 or more rows

  • ?: 0 or 1 row

  • {n}: Exactly n rows, where n > 0

  • {n,}: n or more rows, where n is a non-negative number

  • {,m}: 0 to m rows, inclusive, where m is a positive number

  • {n,m}: n to m rows, where n is a non-negative number and m is greater than n

By default, quantifiers are greedy unless they are defined to be reluctant using the ? operation.

WITHIN size time_unit

Optionally, this subclause is used to provide a time constraint for potential matches. While the WITHIN clause is not part of the SQL standard for MATCH_RECOGNIZE, we provide this subclause as a way to manage the amount of state that your query needs to maintain. If the WITHIN subclause is not provided, then the default is that there is no time constraint for matches.

All of a match's records must fall within the time constraint to be considered a valid match. For example, if the time between the first and last record in a match is longer than the time constraint, then that match would be invalid and not produced as an output. The time of a record is determined by its event time (see timestamp under #_source_properties).

Note that it is generally recommended to provide the WITHIN clause. Many MATCH_RECOGNIZE queries, especially those that use a greedy quantifier, will keep track of all the records it consumes in memory. Without providing a WITHIN time constraint, these queries will eventually run out of memory and fail. The WITHIN time constraint allows the query to safely clean up its memory of records that have fallen out of the time interval.

DEFINE variable AS expression [, ...]

This subclause is used as the building block for the entire MATCH_RECOGNIZE operator for providing name for expressions that are used in building a PATTERN. By default, a variable that is not defined, but used in a PATTERN, evaluates to true for all rows that contibute to a recognized pattern.

Aggregation or offset functions may be used to do further processing on the variable definitions. See Built-in Functions for further function information.

Examples

One of the common use cases for pattern recognition is detecting market price changes over time for a specific symbol to allow downstream notification for an end user or application:

CREATE STREAM ticker_average_drop
AS SELECT symbol, start_tstamp, end_tstamp, avg_price
FROM market_ticker
  MATCH_RECOGNIZE (
    PARTITION BY symbol
    ORDER BY rowtime()
    MEASURES
      A.row_timestamp AS row_timestamp,
      A.row_key AS row_key,
      A.row_metadata AS row_metadata,
      FIRST(A.ts) AS start_tstamp,
      LAST(A.ts) AS end_tstamp,
      AVG(A.price) AS avg_price
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A+ B)
    DEFINE
      A AS AVG(A.price) < 15
  ) MR;

This CSAS statement detects when a symbol's average price drops under 15. When such an event is detected, it projects the time at which this trend started and ended with, in addition to what average price the symbol ended with.

These events are orderd by the time at which each price change was reported to market_ticker.

Patterns within a time interval

When recognizing a pattern over a continuous stream, it's important to recognize that matched patterns are recognized with the help of an internal state over the life of the PATTERN definition. Hence, the more events are involved in a pattern, the larger the state of the query becomes, and the larger the state of the query, the more risk there is to the health of the operations involved.

For that matter, the state of a pattern can be controlled over time, by providing a WITHIN subclause, which dictates the amount of time a PATTERN should be tracking against:

CREATE STREAM ticker_price_fluctuation
AS SELECT symbol, start_price, last_price
FROM market_ticker
  MATCH_RECOGNIZE(
    PARTITION BY symbol
    ORDER BY rowtime()
    MEASURES
      C.row_timestamp AS row_timestamp,
      C.row_key AS row_key,
      C.row_metadata AS row_metadata,
      A.price AS start_price,
      C.price AS last_price
    ONE ROW PER MATCH
    AFTER MATCH SKIP TO LAST C
    PATTERN (A B* C)
    WITHIN 1 HOUR
    DEFINE
      A AS price > 10,
      B AS price < 15,
      C AS price > 15
  );

This example looks for a pattern starting with price > 10, with zero to many events where price is under 15, and ends with an event where price > 15 within the last hour. It's worth mentioning that this pattern recognition is true iff this pattern occurs within the last hour. If the events matching this pattern span more than an hour, the series won't be matched and no output will be recroded for it. See WITHIN size time_unit for more information.

One more thing to point out here is that we used AFTER MATCH SKIP TO LAST C to ensure we're capturing continuous patterns, instead of AFTER MATCH SKIP PAST LAST ROW used for isolating matched pattern from one another.

Non-overlapping PATTERNs with multiple ORDER BY columns

CREATE STREAM ticker_average_drop
AS SELECT start_time, row_timestamp AS end_time, symbol, last_price
FROM market_ticker
  MATCH_RECOGNIZE(
    PARTITION BY symbol
    ORDER BY ts, price
    MEASURES
      C.row_timestamp AS row_timestamp,
      C.row_key AS row_key,
      C.row_metadata AS row_metadata,
      A.row_timestamp AS start_time,
      C.price AS last_price
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A B* C)
    DEFINE
      A AS price > 10,
      B AS price < 15,
      C AS price > 12
  ) AS MR WITH ('timestamp'='ts');

We're defining our timestamp column to be ts to ensure ordering on when a price change was reported by the upstream source rather than the time an event was recorded in market_ticker. See timestamp under #_source_properties for more information on defining an event time for source Relation.

Last updated