Skip to content
This repository has been archived by the owner on Nov 26, 2021. It is now read-only.

Pattern Matching with MATCH_RECOGNIZE

Timo Walther edited this page Apr 1, 2019 · 8 revisions

This session discusses how to write SQL queries that detect patterns in streaming data.

Specifically, you will learn about

  • The MATCH_RECOGNIZE clause defined in SQL:2016
  • Supported semantics and limitations of Flink's implementation
  • Examples of how to perform pattern matching on append tables

Slides

Hands-On Exercise

This exercises will teach you how to detect a pattern in data streams and to perform a computation on the matching set of rows.

Rides and Rest

In this exercise, we want to detect how long taxis rest before they start a new ride.

Use the MATCH_RECOGNIZE clause to detect a pattern of:

  • a start event,
  • an arbitrary number of intermediate events for the same taxi but from different rides,
  • an end event,
  • and the next start event.

Compute the resting times in minutes.

Click here for hints.
  • Use a AFTER MATCH SKIP TO LAST variable strategy to include the last start event in the next pattern matching.
  • Use the built-in TIMESTAMPDIFF function to calculate the difference in minutes.


The output should look similar to:
                   taxiId                ride_start                  ride_end           next_ride_start           minutes_of_rest
                2013000002     2013-01-01 00:00:00.0     2013-01-01 00:06:00.0     2013-01-01 00:16:00.0                        10
                2013000004     2013-01-01 00:00:00.0     2013-01-01 00:08:00.0     2013-01-01 00:13:00.0                         5
                2013000032     2013-01-01 00:00:00.0     2013-01-01 00:05:00.0     2013-01-01 00:06:00.0                         1
                2013000128     2013-01-01 00:01:00.0     2013-01-01 00:04:00.0     2013-01-01 00:12:00.0                         8
                2013000256     2013-01-01 00:02:00.0     2013-01-01 00:10:00.0     2013-01-01 00:10:00.0                         0
                2013000512     2013-01-01 00:03:25.0     2013-01-01 00:04:51.0     2013-01-01 00:10:00.0                         5
                2013000512     2013-01-01 00:10:00.0     2013-01-01 00:13:31.0     2013-01-01 00:14:19.0                         0
                2013001028     2013-01-01 00:05:52.0     2013-01-01 00:13:20.0     2013-01-01 00:14:12.0                         0
                2013000258     2013-01-01 00:02:00.0     2013-01-01 00:08:00.0     2013-01-01 00:11:00.0                         3
                2013002070     2013-01-01 00:08:57.0     2013-01-01 00:12:26.0     2013-01-01 00:13:46.0                         1
Click to see the solution.
SELECT * FROM Rides
MATCH_RECOGNIZE(
  PARTITION BY taxiId
  ORDER BY rideTime
  MEASURES
    START_RIDE.rideTime AS ride_start,
    END_RIDE.rideTime AS ride_end,
    NEXT_RIDE.rideTime AS next_ride_start,
    TIMESTAMPDIFF(MINUTE, END_RIDE.rideTime, NEXT_RIDE.rideTime) AS minutes_of_rest
  AFTER MATCH SKIP TO LAST NEXT_RIDE
  PATTERN (START_RIDE M* END_RIDE NEXT_RIDE)
  DEFINE
    START_RIDE AS START_RIDE.isStart = true,
    M AS M.rideId <> START_RIDE.rideId,
    END_RIDE AS END_RIDE.isStart = false,
    NEXT_RIDE AS NEXT_RIDE.isStart = true
);

The query matches the pattern mentioned above. The variable START_RIDE detects the start event. M defines a greedy set of ride events that don't belong to the same ride of the start event. The END_RIDE detects the end event. The NEXT_RIDE variable defines the following start event for computing the resting time. Once a match has been detected, we compute the difference in minutes and return it.