At various points, people (mainly other people on Hatchet/Thicket) have noted how slow the query language (QL) is. After thinking about it, I realized that the QL could be sped up by changing the representation of the predicates in the base syntax. That realization led me to discover several other ways in which the QL could be improved.
This RFC details the changes that can be made for a "Query Language v2". The primary changes are as follows:
Change the representation of predicates in the base syntax to leverage Pandas vectorization
Replace the algorithm powering the QL with a newer and faster one
After discussing the changes, this RFC will touch on the benefits they will bring to Hatchet.
Changing the Representation of Predicates in the Base Syntax
Currently, predicates are represented as functions that accept the piece of the DataFrame for a node as input and return a single boolean (True if the node satisfies the predicate; False otherwise). Although this representation is easy to understand, there are several downsides to it, namely:
It requires users to design predicates differently depending on whether or not the DataFrame has a MultiIndex
It negatively impacts performance by minimizing opportunities for parallelism, vectorization, and relegation to lower-level languages (e.g., C/C++)
It requires complex, conditional parsing for the object- and string-based dialects due to downside 1
To mitigate these downsides, I propose changing the representation of predicates so that they accept the entire DataFrame as input (really a reference to the DataFrame, since Python passes objects by reference) and return a pd.Series of booleans as output (each boolean is the result of applying the predicate to a particular row of the DataFrame). Designing predicates this way allows their evaluation to be vectorized.
Allowing for Pandas Vectorization
When processing certain types of column- or row-based operations, Pandas is able to do one or both of the following:
Vectorize the operation
Speed up the operation using either its Cython or Numba engines
As is hinted at in [this article](https://pythonspeed.com/articles/pandas-vectorization/), these benefits can only reliably be obtained by operating on entire DataFrames, Series, etc. Since the existing representation requires predicates to be applied repeatedly in Python for loops, it cannot exploit these optimizations. However, the proposed representation can, because it allows predicates to be written as Pandas expressions. For example, consider the predicate below:
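The original code example did not survive extraction, so here is a sketch of what a predicate in the existing representation looks like, assuming a DataFrame without a MultiIndex (the "name" and "time" columns are illustrative):

```python
import pandas as pd

# Sketch of a predicate in the existing representation: it receives the
# row for a single node and returns a single boolean.
def predicate(row):
    return row["name"] == "MPI_Send" and row["time"] > 5.0
```

Applying this predicate to every node requires a Python-level loop over rows (e.g., `df.apply(predicate, axis=1)` or an explicit `for` loop).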
This predicate looks for MPI_Send nodes with a time greater than 5.0. Because this predicate (in the existing representation) is applied row by row in a Python-level loop, it cannot be vectorized or sped up by Pandas. Now, consider the equivalent predicate in the new representation:
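Again, the original example was lost in extraction; a sketch of the equivalent predicate under the proposed representation (same illustrative column names) is:

```python
import pandas as pd

# Sketch of the same predicate in the proposed representation: it receives
# the whole DataFrame and returns a boolean pd.Series, one entry per row.
def predicate(df):
    return (df["name"] == "MPI_Send") & (df["time"] > 5.0)
```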
Because each part of this predicate operates over entire columns (i.e., pd.Series), each part can be vectorized and sped up, speeding up the entire evaluation of the predicate.
Speeding up the QL with a New Algorithm
Currently, queries are applied to data using a modified version of Ullmann's algorithm. However, if we change the representation of the predicates (as described above), we can consider other algorithms. The most notable alternative I found is the [Glasgow Subgraph Solver](https://doi.org/10.1007/978-3-030-51372-6_19) (GSS). Below, I'll explain both algorithms and why GSS can speed up the QL.
Ullmann's Algorithm
Ullmann's algorithm is the de facto standard algorithm for solving subgraph isomorphism (the general class of computational problem that the QL solves). At a high level, the algorithm can be split into an outer algorithm and an inner algorithm. The outer algorithm traverses the graph, pausing and calling the inner algorithm whenever a node that matches the first query node is reached. The inner algorithm traverses the subgraph from the node designated by the outer algorithm. As it traverses, it iteratively builds paths matching the query, pruning paths in the subgraph if it reaches nodes that cannot satisfy the query. Once the traversal is done, it stores the matching paths and returns control to the outer algorithm.
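The outer/inner split can be sketched in Python, modeling a query as a simple path of single-node predicates (the `Node` and `Query` classes are hypothetical stand-ins for the real graph and query types, included only to make the sketch self-contained):

```python
# Hypothetical graph node: just a name and a list of children.
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

# Hypothetical query: a sequence of single-node predicates.
class Query:
    def __init__(self, predicates):
        self.predicates = list(predicates)
    def __len__(self):
        return len(self.predicates)
    def matches(self, i, node):
        return self.predicates[i](node)

def outer(roots, query):
    """Traverse the whole graph; start the inner search wherever
    the first query node matches."""
    matches, stack, visited = [], list(roots), set()
    while stack:
        node = stack.pop()
        if id(node) in visited:
            continue
        visited.add(id(node))
        if query.matches(0, node):
            matches.extend(inner(node, query, 0))
        stack.extend(node.children)
    return matches

def inner(node, query, depth):
    """Extend a partial match rooted at `node`, pruning paths whose
    nodes cannot satisfy the query."""
    if not query.matches(depth, node):
        return []  # prune: this path cannot satisfy the query
    if depth == len(query) - 1:
        return [[node]]
    paths = []
    for child in node.children:
        for tail in inner(child, query, depth + 1):
            paths.append([node] + tail)
    return paths
```

Note that `outer` must visit every node in the graph, since a match could begin anywhere; this is one of the costs that GSS avoids, as discussed below.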
Glasgow Subgraph Solver
The Glasgow Subgraph Solver (GSS) frames the subgraph isomorphism problem in a fundamentally different way than Ullmann's algorithm (and its derivatives). GSS views subgraph isomorphism as a constraint programming problem that can be solved with dynamic programming. GSS can be summarized by the following steps:
Apply all predicates and build sets of nodes (i.e., "matching sets") that can satisfy each query node
For each node in the matching set of the first query node, recursively go through the subgraph rooted at the node and build matching paths
As the algorithm moves back up the recursion, cache intermediate results for later use
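Under the proposed predicate representation, these three steps might be sketched as follows (the DataFrame-plus-adjacency-dict graph interface is hypothetical, used only to keep the sketch self-contained):

```python
import pandas as pd
from functools import lru_cache

def gss_match(df, children, predicates):
    """GSS-style matching sketch. `df` holds one row per graph node,
    `children` maps a node id to its child ids, and `predicates` is a
    list of vectorized predicates, one per query node."""
    # Step 1: apply every predicate once (vectorized) and reduce each
    # boolean Series to a matching set of node ids.
    matching_sets = [set(df.index[pred(df)]) for pred in predicates]

    # Steps 2-3: recursively build matching paths; lru_cache stands in
    # for the caching of intermediate results on the way back up.
    @lru_cache(maxsize=None)
    def extend(node, depth):
        if node not in matching_sets[depth]:
            return ()
        if depth == len(predicates) - 1:
            return ((node,),)
        paths = []
        for child in children.get(node, ()):
            for tail in extend(child, depth + 1):
                paths.append((node,) + tail)
        return tuple(paths)

    # Step 2: only start from candidates for the first query node.
    return [p for start in matching_sets[0] for p in extend(start, 0)]
```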
Why does GSS speed up the QL?
In a few words: caching, vectorization, and less work.
GSS performs two types of caching: (1) caching of the results of the predicates and (2) caching of intermediate results.
Although the current Ullmann's algorithm approach also caches predicate results, it cannot do this caching as quickly as GSS because of the existing predicate representation (GSS requires the proposed representation, so it gets the vectorization benefits automatically). Additionally, GSS can immediately reduce the cached predicate results to a set of candidate matches per query node, whereas Ullmann's algorithm cannot reduce this cache. As a result, GSS can do this type of caching faster and with a smaller memory footprint than Ullmann's algorithm.
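To make the reduction concrete, here is an illustrative micro-example: the boolean Series produced by a vectorized predicate collapses to a compact set of candidate node ids, rather than one stored boolean per (query node, graph node) pair:

```python
import pandas as pd

# Illustrative: a cached predicate result (boolean Series over node ids)
# reduces immediately to a set of candidate node ids.
mask = pd.Series([True, False, True], index=["n0", "n1", "n2"])
candidates = set(mask.index[mask])
```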
As for the second type of caching (intermediate results), GSS supports it naturally, while it is nearly impossible to implement in Ullmann's algorithm.
This caching also allows GSS to do less work. Since GSS precomputes possible matches to the query nodes, we can start the algorithm from those nodes. On the other hand, Ullmann's algorithm needs to start at the roots of the graph to correctly identify matches to the query.