Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: lambda expression #69

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

RFC: lambda expression #69

wants to merge 6 commits into from

Conversation

st1page
Copy link
Contributor

@st1page st1page commented Aug 2, 2023

use lambda expression to enhance array's complex processing


### level 2: scalar lambda expression

Implement lambda expression function to make user can define there `transform` logic for array's each elements.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

risingwavelabs/risingwave#11123
Another part of this solution is to implement transform and reduce functions, whose inputs contain both data column and the lambda expression.

@chenzl25
Copy link
Contributor

chenzl25 commented Aug 2, 2023

I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. risingwavelabs/risingwave#11390.
If we support the following two optimizer rules later, we can resolve this issue.

  1. transposing Apply with TableFunction which I think it can be reduced into ProjectSet + Value in the future.
  2. reducing trivial nested-loop join generated by subquery unnesting.

@st1page
Copy link
Contributor Author

st1page commented Aug 3, 2023

I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. risingwavelabs/risingwave#11390.
If we support the following two optimizer rules later, we can resolve this issue.

LGTM, but still not the best because it introduce unnessary Join operator 🤔

@chenzl25
Copy link
Contributor

chenzl25 commented Aug 3, 2023

I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. risingwavelabs/risingwave#11390.
If we support the following two optimizer rules later, we can resolve this issue.

LGTM, but still not the best because it introduce unnessary Join operator 🤔

True. It could be a temporal solution. From a performance perspective, this RFC could do better.

Co-authored-by: CAJan93 <[email protected]>
```SQL
select *,
array_sum(arr_i),
array_sum(transform(arr_json, v -> (v->'x')::int))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be a little confused that use -> for two purposes, and increase the complexity of parser and binder.

I'd propose using => for lambda function.

@fuyufjh fuyufjh changed the title WIP RFC: lambda expression RFC: lambda expression Aug 22, 2023
@TennyZhuang
Copy link
Contributor

TennyZhuang commented Sep 1, 2023

The latest progress:

TL;DR

  1. We use |x| x * 2 syntax for lambda function MVP.
  2. The array_reduce is very heavy in our system, so we implement the utility array functions array_{sum|min|max} first.

The lambda syntax

In this RFC, we proposed using the arrow symbol “->” as the syntax for lambda expressions, such as x -> x * 2, since it's widely supported by different systems. However, this symbol conflicts with JsonAccess and creates significant ambiguity.

A bad case: how should f(ARRAY[1,2,3], x -> 'y' or z) be parsed? (x -> 'y') or z or x -> ('y' or z)? The AST will be much different after the parser phase.

We also proposed the => symbol, which is used by JavaScript and also widely accepted by programmers. Unfortunately, it's still conflict with a pg syntax, Named notation.

-- Lambda function
SELECT f(ARRAY[1,2,3], x => x * 2);
-- Named notation
SELECT abs(a => -1)
-- ???
SELECT f(apply => (x => x * 2), array => ARRAY[1,2,3]);

There are definitely many compiler techniques that can solve the problem of ambiguity in the above syntax. After all, it is much simpler compared to C++ syntax. However, as an experimental feature, we do not want to introduce very complex refactors into the parser for it. We have decided to look for a syntax that is not ambiguous and relatively simple to support this feature. The final choice is a Rust-like syntax |x| x * 2.

In the current pg 15, there is no built-in operator that allows the prefix | operator. It may conflict with custom operators, but currently we do not support the feature. Even if we support it, it 's still acceptable to forbid “|”” as a unary operator.

Of course, another important reason is that we like Rust.

Choosing the Rust-like syntax in the MVP version does not mean we will not change, but because the cost of implementing it is very low, even if we eventually choose another syntax, we can simply deprecate the rust-like one and maintain compatibility with it forever without much effort.

If the PostgreSQL eventually support the lambda function using another syntax, or our users really like the -> symbol such as duckdb or prestodb, we can support that (with a huge refactor in the parser module).

The reduce performance issue

The RFC also proposed a reduce function to create an aggregate function or a scalar function which aggregates an array. Unfortunately, the implementation is heavy since we can't pass a column-based chunk to the lambda function.

-- Calculate the sum of an array
select array_reduce(ARRAY[1,2,3], (x, y) => x + y);

The lambda arg will eventually be converted to a BoxedExpression in the backend. However, we cannot invoke the eval(DataChunk) method and can only choose the inefficient eval_row(OwnedRow) method instead. The accumulation logic is very unfriendly to chunk-based evaluation.

There many overheads when evaluating one row.

  • The chained dynamic dispatch Box<dyn Expression>
  • Box and unbox of OwnedRow and Datum.

The overhead is much heavy, and the only solution is to support query compilation for expression, which is equivalent to fully rewrite the expression framework.

As a conclusion, even if we supported array_reduce, we still need to implement the array_{sum|min|max|etc} to optimize the performance. So we decide to implement the utility functions first instead of introducing array_reduce.

The bridge between aggregate functions and array scalar functions

The logic are almost identical for the scalar function array_sum and the aggregate function sum. Should we call the sum batch aggregator directly when implementing array_sum?

My personal opinion is to first try to implement array_sum manually, observe the commonalities, and then draw conclusions.

@wangrunji0408
Copy link

wangrunji0408 commented Sep 1, 2023

It is possible to introduce another macro to generate array scalar functions from aggregate definitions. It may look like:

#[aggregate("sum(int64) -> int64")]                // original aggregate
#[array_function("array_sum(int64[]) -> int64")]   // the new macro
fn sum(state: i64, input: i64) -> i64 {
    state + input
}

This approach requires no extra code for each function, instead it puts all complexity into the macro.

By contrast, manually implement each array_* function is a bit tedious but trivial:

#[function("array_sum(int64[]) -> int64")]
fn array_sum(array: ListRef<'_>) -> i64 {
    array.as_int64().iter().sum()
}

So I would also prefer manual implementing each array function in the first step.

@xiangjinwu
Copy link

Just ran into this:
https://www.postgresql.org/docs/current/intarray.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants