Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sql-udf): support basic anonymous sql udf #14139

Merged
merged 28 commits into from
Jan 5, 2024
Merged

feat(sql-udf): support basic anonymous sql udf #14139

merged 28 commits into from
Jan 5, 2024

Conversation

xzhseh
Copy link
Contributor

@xzhseh xzhseh commented Dec 21, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR adds basic support for anonymous sql udf, the current supported syntax is simple, and is a subset of pg's create function standard (refer: https://www.postgresql.org/docs/current/sql-createfunction.html), for all to-be-supported syntax in RW, check #10151 for details.

The current supported syntax (anonymous parameter list) is as follows:

  • single quote definition
create function sub(INT, INT) returns int language sql as 'select $1 - $2';
  • double dollar definition
create function add(INT, INT) returns int language sql as $$select $1 + $2$$;
  • return expression
create function add(INT, INT) returns int language sql return $1 + $2;
  • multiple types interleaving
create function add_sub(INT, FLOAT, INT) returns float language sql as $$select -$1 + $2 - $3$$;
  • calls other pre-defined sql udfs / built-in functions
create function add_sub_binding() returns int language sql as 'select add(1, 1) + sub(2, 2)';
create function add_sub_wrapper(INT, INT) returns int language sql as 'select add($1, $2) + sub($1, $2) + 114512';
create function call_regexp_replace() returns varchar language sql as $$select regexp_replace('💩💩💩💩💩foo🤔️bar亲爱的😭baz这不是爱情❤️‍🔥', 'baz(...)', '这是🥵', 'ic')$$;

Note: recursive definition is NOT supported, e.g.,

# Recursive definition is forbidden
statement error recursive definition is forbidden, please recheck your function syntax
create function recursive(INT, INT) returns int language sql as 'select recursive($1, $2) + recursive($1, $2)';

See e2e_test/udf/sql_udf.slt for detailed use cases.

P.S. The current implementation will only check for syntax correctness (i.e., the sql body in as clause) by utilizing Parser::parse_sql, semantic check will be implemented in the future.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

Support basic anonymous sql user defined function, the current supported syntax is limited. See the above documentation for details.

@xzhseh

This comment was marked as outdated.

xxchan

This comment was marked as resolved.

src/frontend/src/handler/create_sql_function.rs Outdated Show resolved Hide resolved
src/frontend/src/handler/create_sql_function.rs Outdated Show resolved Hide resolved
src/frontend/src/handler/create_sql_function.rs Outdated Show resolved Hide resolved
src/frontend/src/handler/mod.rs Show resolved Hide resolved
src/frontend/src/binder/expr/function.rs Outdated Show resolved Hide resolved
database_id,
name: function_name,
kind: Some(kind),
arg_types: arg_types.into_iter().map(|t| t.into()).collect(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the type of this field should be changed to Vec<Field> or Schema so that we can find argument names here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change this field after reviewing the anonymous implementation logic

src/frontend/src/binder/expr/function.rs Outdated Show resolved Hide resolved
src/frontend/src/binder/expr/function.rs Outdated Show resolved Hide resolved
@wangrunji0408
Copy link
Contributor

P.S. the current parsing will be triggered per bind_function call which is super inefficient, and this will be refactored later.

Actually I think it is acceptable to parse the function every time it is bound. In order to be efficient, we can cache the parsed AST in the binder in case there are multiple calls in a query.

I prefer to store the original string of function body rather than any parsed form.

@xzhseh xzhseh changed the title feat(sql-udf): support basic sql udf function feat(sql-udf): support basic anonymous sql udf Dec 26, 2023
@xzhseh xzhseh marked this pull request as ready for review December 26, 2023 22:17
Copy link
Contributor

@wangrunji0408 wangrunji0408 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM. Good work!

src/frontend/src/session.rs Outdated Show resolved Hide resolved
src/frontend/src/handler/mod.rs Outdated Show resolved Hide resolved
src/frontend/src/catalog/schema_catalog.rs Outdated Show resolved Hide resolved
src/frontend/src/binder/expr/column.rs Outdated Show resolved Hide resolved
src/frontend/src/binder/expr/mod.rs Outdated Show resolved Hide resolved
src/frontend/src/binder/expr/function.rs Outdated Show resolved Hide resolved
src/frontend/src/binder/expr/function.rs Outdated Show resolved Hide resolved
src/frontend/src/binder/expr/function.rs Outdated Show resolved Hide resolved
src/frontend/src/binder/expr/function.rs Outdated Show resolved Hide resolved
Comment on lines 216 to 217
create_udf_context(self, &args, &Arc::clone(func));
return self.bind_expr(extract_udf_expression(ast));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we should create a new binder to do the inline. Or at least we should clean the udf_context after inlining, otherwise the following binding will be affected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the latter solution, creating a new binder to do the inline may add extra cost and is hard to integrate with the on-going binding process in the main routine 🤔️?

Copy link
Member

@xxchan xxchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

Some comments for missing tests. Generally, I expect there are test covering all errors. They can also kind of function as the specification.

src/sqlparser/tests/sqlparser_postgres.rs Outdated Show resolved Hide resolved
e2e_test/udf/sql_udf.slt Outdated Show resolved Hide resolved
src/frontend/src/handler/mod.rs Outdated Show resolved Hide resolved
e2e_test/udf/sql_udf.slt Outdated Show resolved Hide resolved
e2e_test/udf/sql_udf.slt Show resolved Hide resolved
src/frontend/src/catalog/root_catalog.rs Outdated Show resolved Hide resolved
@xxchan
Copy link
Member

xxchan commented Dec 28, 2023

lackoftests

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license-eye has totally checked 4688 files.

Valid Invalid Ignored Fixed
2048 1 2639 0
Click to see the invalid file list
  • src/frontend/src/handler/create_sql_function.rs

src/frontend/src/handler/create_sql_function.rs Outdated Show resolved Hide resolved
xzhseh and others added 2 commits January 1, 2024 22:15
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Copy link
Contributor

@chenzl25 chenzl25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way for us to view the definition of SQL UDFs? show functions and show create function xxx seems can't.

@xzhseh
Copy link
Contributor Author

xzhseh commented Jan 2, 2024

show functions and show create function xxx seems can't.

show functions works fine (without definition), but show create function xxx is not supported.
And since the definitions of sql udfs are also stored in function catalog, it's easy to implement such feature if needed. (I can come up with a PR that supports this)

CleanShot 2024-01-01 at 23 29 36@2x

Comment on lines +18 to +22
statement ok
create function call_regexp_replace() returns varchar language sql as $$select regexp_replace('💩💩💩💩💩foo🤔️bar亲爱的😭baz这不是爱情❤️‍🔥', 'baz(...)', '这是🥵', 'ic')$$;

statement error Expected end of statement, found: 💩
create function call_regexp_replace() returns varchar language sql as 'select regexp_replace('💩💩💩💩💩foo🤔️bar亲爱的😭baz这不是爱情❤️‍🔥', 'baz(...)', '这是🥵', 'ic')';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄

Comment on lines +79 to +86
// We do NOT allow recursive calling inside sql udf
// Since there does not exist the base case for this definition
if body.contains(format!("{}(", name.real_value()).as_str()) {
return Err(ErrorCode::InvalidInputSyntax(
"recursive definition is forbidden, please recheck your function syntax".to_string(),
)
.into());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if create function recursive(INT, INT) returns int language sql as 'select not_recursive($1, $2) + not_recursive($1, $2)';?

Or create function foo(INT) returns int language sql as 'select bar($1)'; create function bar(INT) returns int language sql as 'select foo($1)';?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if create function foo(INT) returns varchar language sql as $$select 'foo(INT)'$$;?

Copy link
Contributor Author

@xzhseh xzhseh Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After testing on my local pg server, it seems that pg leaves this problem to users 😅, and for select not_recursive($1, $2) we can use a regexp to exact match the pattern, but I can't come up with a graceful way for the nested calling recursive chain. (And pg does not support this too)
CleanShot 2024-01-02 at 02 12 07@2x

So should we handle this gracefully for the users, or just leave it to them just as what pg does?

P.S. RW will quit after stack overflow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if create function foo(INT) returns varchar language sql as $$select 'foo(INT)'$$;?

Indeed, under current implementation this definition will be rejected though it's totally valid, but it will definitely add complexity to the codebase if we add check conditions for this.
The problem remains: should we handle this for users?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I can't come up with a graceful way for the nested calling recursive chain.

we can also have a max_recursive_depth to prevent infinite recursion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently there are two solutions (by @wangrunji0408) to this problem,

  1. Simulate pg's behavior, add a max_recursive_depth field somewhere to prevent infinite recursion.
  2. Add a field to binder indicating which functions are currently being inlined and should not be called again.

Which one do you prefer? cc @xxchan @stdrc @BugenZhao.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer solution 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little side note: RC's case (mutual recursion) will be rejected by PG due to function does not exist😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait a minute. Why we can't do recursive check here? Shouldn't it also fail with "function does not exist", similar with mutual recursion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be rejected by PG due to function does not exist

Good catch!
Currently we don't have the ability to check if the inner calling functions exist during definition time, the problem will only be exposed when actual calling the function. (i.e., during binding phase)
I prefer to support the checking in later PRs.

src/frontend/src/handler/mod.rs Outdated Show resolved Hide resolved
src/frontend/src/handler/create_sql_function.rs Outdated Show resolved Hide resolved
Comment on lines +269 to +272
// If the `udf_context` is not empty, this means the current binding
// function is not the root binding sql udf, thus we should NOT
// clean the context after binding.
clean_flag = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling that it's time to create a new binder to do recursive scoped binding, which could simplify state management. 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an extra field in binder, e.g., udf_scope_binder: Option<Binder>?
But I'm worrying the extra cost of adding a new binder when using sql udf, it should be acceptable though 🤣

Copy link
Contributor

@wangrunji0408 wangrunji0408 Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I mean creating a new binder on the stack (with only function parameters as available identifiers) and calling the bind method. This could resolve the name conflict problem.
The cost can be ignored in the binding stage I think. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will put this to later PRs. 🤡

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: Can prepared statements work correctly? And are there any tests?

Copy link
Member

@BugenZhao BugenZhao Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be surprising but I couldn't find any e2e tests for the prepared statement itself. 😕

#14141

Copy link
Contributor Author

@xzhseh xzhseh Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can prepared statements work correctly?

I think so, under current implementation the check of udf_context will be immediately cleared after bind_function, which ensure it will not interleave with the parameter bindings of prepare-related statements. (e.g., triggered by third party drivers, etc)

@xzhseh
Copy link
Contributor Author

xzhseh commented Jan 2, 2024

Below is the list for future to-dos, just for reference:

@xzhseh xzhseh requested a review from wangrunji0408 January 4, 2024 06:49
Copy link
Contributor

@wangrunji0408 wangrunji0408 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

@xzhseh xzhseh added this pull request to the merge queue Jan 5, 2024
Merged via the queue into main with commit 75735e2 Jan 5, 2024
28 checks passed
@xzhseh xzhseh deleted the xzhseh/sql-udf branch January 5, 2024 08:34
@xxchan
Copy link
Member

xxchan commented Jan 7, 2024

btw why is this called “ anonymous”?

@xzhseh
Copy link
Contributor Author

xzhseh commented Jan 7, 2024

btw why is this called “ anonymous”?

The parameters have no name. (i.e., $1 / $2 / ..., so I name it anonymous 😄)

@TennyZhuang
Copy link
Contributor

TennyZhuang commented Jan 8, 2024

In risingwave even building functions parameters also don’t have names because we don’t support named parameters yet

@wangrunji0408
Copy link
Contributor

In writing wave even building functions parameters also don’t have names because we don’t support named parameters yet

Should we support it? It can facilitate some function callings, such as:

jsonb_path_query(jsonb, jsonpath, silent => true);

@hengm3467
Copy link
Contributor

Making sure I understand it correctly. We now have SQL-UDF in parralel with external UDFs (Javan and Python), which requires deploying additional servers/instances. Correct? Thanks. @wangrunji0408 @BugenZhao

@xzhseh
Copy link
Contributor Author

xzhseh commented Jan 22, 2024

We now have SQL-UDF in parralel with external UDFs (Javan and Python), which requires deploying additional servers/instances.

Exactly. And sql udf does not rely on external service, basically performs inlining logic, cc @hengm3467.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants