At various points, people (mainly other people on Hatchet/Thicket) have noted how slow the query language (QL) is. After thinking about it, I realized that the QL could be sped up by changing the representation of the predicates in the base syntax. That realization led me to discover several other ways in which the QL could be improved.
This RFC details the changes that can be made for a "Query Language v2". The primary changes are as follows:
Change the representation of predicates in the base syntax to leverage Pandas vectorization
Replace the algorithm powering the QL with a newer and faster one
After discussing the changes, this RFC will touch on the benefits they will bring to Hatchet.
Changing the Representation of Predicates in the Base Syntax
Currently, predicates are represented as functions that accept the piece of the DataFrame for a node as input and return a single boolean (True if the node satisfies the predicate; False otherwise). Although this representation is easy to understand, there are several downsides to it, namely:
It requires users to design predicates differently depending on whether or not the DataFrame has a MultiIndex
It negatively impacts performance by minimizing opportunities for parallelism, vectorization, and relegation to lower-level languages (e.g., C/C++)
It requires complex, conditional parsing for the object- and string-based dialects due to downside 1
To mitigate these downsides, I propose changing the representation of predicates so that they accept the entire DataFrame as input (really a reference to the DataFrame, since Python passes objects by reference) and return a pd.Series of booleans as output (each boolean is the result of applying the predicate to a particular row of the DataFrame). Designing predicates this way allows their evaluation to be vectorized.
Allowing for Pandas Vectorization
When processing certain types of column- or row-based operations, Pandas is able to do one or both of the following:
Vectorize the operation
Speed up the operation using either its Cython or Numba engines
As is hinted at in [this article](https://pythonspeed.com/articles/pandas-vectorization/), these benefits can only reliably be obtained by operating on entire DataFrames, Series, etc. Since the existing representation requires predicates to be applied repeatedly in Python for loops, it cannot exploit these optimizations. However, the proposed representation can, because it allows predicates to be written as Pandas expressions. For example, consider the predicate below:
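The original code example did not survive extraction, so here is a sketch of what a predicate in the existing representation looks like, assuming a DataFrame without a MultiIndex (the "name" and "time" columns are illustrative):

```python
import pandas as pd

# Sketch of a predicate in the existing representation: it receives the
# row for a single node and returns a single boolean.
def predicate(row):
    return row["name"] == "MPI_Send" and row["time"] > 5.0
```

Applying this predicate to every node requires a Python-level loop over rows (e.g., `df.apply(predicate, axis=1)` or an explicit `for` loop).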
This predicate looks for MPI_Send nodes with a time greater than 5.0. Because this predicate (in the existing representation) is applied row by row in a Python-level loop, it cannot be vectorized or sped up by Pandas. Now, consider the equivalent predicate in the new representation:
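Again, the original example was lost in extraction; a sketch of the equivalent predicate under the proposed representation (same illustrative column names) is:

```python
import pandas as pd

# Sketch of the same predicate in the proposed representation: it receives
# the whole DataFrame and returns a boolean pd.Series, one entry per row.
def predicate(df):
    return (df["name"] == "MPI_Send") & (df["time"] > 5.0)
```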
Because each part of this predicate operates over entire columns (i.e., pd.Series), each part can be vectorized and sped up, speeding up the entire evaluation of the predicate.
Speeding up the QL with a New Algorithm
Currently, queries are applied to data using a modified version of Ullmann's algorithm. However, if we change the representation of the predicates (as described above), we can consider other algorithms. The most notable alternative I found is the [Glasgow Subgraph Solver](https://doi.org/10.1007/978-3-030-51372-6_19) (GSS). Below, I'll explain both algorithms and why GSS can speed up the QL.
Ullmann's Algorithm
Ullmann's algorithm is the de facto standard algorithm for solving subgraph isomorphism (the general class of computational problem that the QL solves). At a high level, the algorithm can be split into an outer algorithm and an inner algorithm. The outer algorithm traverses the graph, pausing and calling the inner algorithm whenever a node that matches the first query node is reached. The inner algorithm traverses the subgraph from the node designated by the outer algorithm. As it traverses, it iteratively builds paths matching the query, pruning paths in the subgraph if it reaches nodes that cannot satisfy the query. Once the traversal is done, it stores the matching paths and returns control to the outer algorithm.
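The outer/inner split can be sketched in Python, modeling a query as a simple path of single-node predicates (the `Node` and `Query` classes are hypothetical stand-ins for the real graph and query types, included only to make the sketch self-contained):

```python
# Hypothetical graph node: just a name and a list of children.
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

# Hypothetical query: a sequence of single-node predicates.
class Query:
    def __init__(self, predicates):
        self.predicates = list(predicates)
    def __len__(self):
        return len(self.predicates)
    def matches(self, i, node):
        return self.predicates[i](node)

def outer(roots, query):
    """Traverse the whole graph; start the inner search wherever
    the first query node matches."""
    matches, stack, visited = [], list(roots), set()
    while stack:
        node = stack.pop()
        if id(node) in visited:
            continue
        visited.add(id(node))
        if query.matches(0, node):
            matches.extend(inner(node, query, 0))
        stack.extend(node.children)
    return matches

def inner(node, query, depth):
    """Extend a partial match rooted at `node`, pruning paths whose
    nodes cannot satisfy the query."""
    if not query.matches(depth, node):
        return []  # prune: this path cannot satisfy the query
    if depth == len(query) - 1:
        return [[node]]
    paths = []
    for child in node.children:
        for tail in inner(child, query, depth + 1):
            paths.append([node] + tail)
    return paths
```

Note that `outer` must visit every node in the graph, since a match could begin anywhere; this is one of the costs that GSS avoids, as discussed below.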
Glasgow Subgraph Solver
The Glasgow Subgraph Solver (GSS) frames the subgraph isomorphism problem in a fundamentally different way than Ullmann's algorithm (and its derivatives). GSS views subgraph isomorphism as a constraint programming problem that can be solved with dynamic programming. GSS can be summarized by the following steps:
Apply all predicates and build sets of nodes (i.e., "matching sets") that can satisfy each query node
For each node in the matching set of the first query node, recursively go through the subgraph rooted at the node and build matching paths
As the algorithm moves back up the recursion, cache intermediate results for later use
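Under the proposed predicate representation, these three steps might be sketched as follows (the DataFrame-plus-adjacency-dict graph interface is hypothetical, used only to keep the sketch self-contained):

```python
import pandas as pd
from functools import lru_cache

def gss_match(df, children, predicates):
    """GSS-style matching sketch. `df` holds one row per graph node,
    `children` maps a node id to its child ids, and `predicates` is a
    list of vectorized predicates, one per query node."""
    # Step 1: apply every predicate once (vectorized) and reduce each
    # boolean Series to a matching set of node ids.
    matching_sets = [set(df.index[pred(df)]) for pred in predicates]

    # Steps 2-3: recursively build matching paths; lru_cache stands in
    # for the caching of intermediate results on the way back up.
    @lru_cache(maxsize=None)
    def extend(node, depth):
        if node not in matching_sets[depth]:
            return ()
        if depth == len(predicates) - 1:
            return ((node,),)
        paths = []
        for child in children.get(node, ()):
            for tail in extend(child, depth + 1):
                paths.append((node,) + tail)
        return tuple(paths)

    # Step 2: only start from candidates for the first query node.
    return [p for start in matching_sets[0] for p in extend(start, 0)]
```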
Why does GSS speed up the QL?
In a few words: caching, vectorization, and less work.
GSS performs two types of caching: (1) caching of the results of the predicates and (2) caching of intermediate results.
Although the current Ullmann's algorithm approach also caches predicate results, it cannot do this caching as quickly as GSS because of the existing predicate representation (GSS requires the proposed representation, so it gets the vectorization benefits automatically). Additionally, GSS can immediately reduce the cached predicate results to a set of candidate matches per query node, whereas Ullmann's algorithm cannot reduce this cache. As a result, GSS can do this type of caching faster and with a smaller memory footprint than Ullmann's algorithm.
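To make the reduction concrete, here is an illustrative micro-example: the boolean Series produced by a vectorized predicate collapses to a compact set of candidate node ids, rather than one stored boolean per (query node, graph node) pair:

```python
import pandas as pd

# Illustrative: a cached predicate result (boolean Series over node ids)
# reduces immediately to a set of candidate node ids.
mask = pd.Series([True, False, True], index=["n0", "n1", "n2"])
candidates = set(mask.index[mask])
```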
As for the second type of caching (intermediate results), GSS supports it naturally, while it is nearly impossible to implement in Ullmann's algorithm.
This caching also allows GSS to do less work. Since GSS precomputes possible matches to the query nodes, we can start the algorithm from those nodes. On the other hand, Ullmann's algorithm needs to start at the roots of the graph to correctly identify matches to the query.