-
Notifications
You must be signed in to change notification settings - Fork 591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(sql-udf): support basic anonymous sql udf #14139
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
database_id, | ||
name: function_name, | ||
kind: Some(kind), | ||
arg_types: arg_types.into_iter().map(|t| t.into()).collect(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the type of this field should be changed to Vec<Field>
or Schema
so that we can find argument names here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change this field after reviewing the anonymous implementation logic
Actually I think it is acceptable to parse the function every time it is bound. In order to be efficient, we can cache the parsed AST in the binder in case there are multiple calls in a query. I prefer to store the original string of function body rather than any parsed form. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM. Good work!
create_udf_context(self, &args, &Arc::clone(func)); | ||
return self.bind_expr(extract_udf_expression(ast)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we should create a new binder to do the inline. Or at least we should clean the udf_context
after inlining, otherwise the following binding will be affected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the latter solution, creating a new binder to do the inline may add extra cost and is hard to integrate with the on-going binding process in the main routine 🤔️?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
Some comments for missing tests. Generally, I expect there are test covering all errors. They can also kind of function as the specification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has totally checked 4688 files.
Valid | Invalid | Ignored | Fixed |
---|---|---|---|
2048 | 1 | 2639 | 0 |
Click to see the invalid file list
- src/frontend/src/handler/create_sql_function.rs
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way for us to view the definition of SQL UDFs? show functions
and show create function xxx
seems can't.
statement ok | ||
create function call_regexp_replace() returns varchar language sql as $$select regexp_replace('💩💩💩💩💩foo🤔️bar亲爱的😭baz这不是爱情❤️🔥', 'baz(...)', '这是🥵', 'ic')$$; | ||
|
||
statement error Expected end of statement, found: 💩 | ||
create function call_regexp_replace() returns varchar language sql as 'select regexp_replace('💩💩💩💩💩foo🤔️bar亲爱的😭baz这不是爱情❤️🔥', 'baz(...)', '这是🥵', 'ic')'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄
// We do NOT allow recursive calling inside sql udf | ||
// Since there does not exist the base case for this definition | ||
if body.contains(format!("{}(", name.real_value()).as_str()) { | ||
return Err(ErrorCode::InvalidInputSyntax( | ||
"recursive definition is forbidden, please recheck your function syntax".to_string(), | ||
) | ||
.into()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if create function recursive(INT, INT) returns int language sql as 'select not_recursive($1, $2) + not_recursive($1, $2)';
?
Or create function foo(INT) returns int language sql as 'select bar($1)'; create function bar(INT) returns int language sql as 'select foo($1)';
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if create function foo(INT) returns varchar language sql as $$select 'foo(INT)'$$;
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After testing on my local pg server, it seems that pg leaves this problem to users 😅, and for select not_recursive($1, $2)
we can use a regexp to exact match the pattern, but I can't come up with a graceful way for the nested calling recursive chain. (And pg does not support this too)
So should we handle this gracefully for the users, or just leave it to them just as what pg does?
P.S. RW will quit after stack overflow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if create function foo(INT) returns varchar language sql as $$select 'foo(INT)'$$;?
Indeed, under current implementation this definition will be rejected though it's totally valid, but it will definitely add complexity to the codebase if we add check conditions for this.
The problem remains: should we handle this for users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I can't come up with a graceful way for the nested calling recursive chain.
we can also have a max_recursive_depth
to prevent infinite recursion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently there are two solutions (by @wangrunji0408) to this problem,
- Simulate pg's behavior, add a
max_recursive_depth
field somewhere to prevent infinite recursion. - Add a field to binder indicating which functions are currently being inlined and should not be called again.
Which one do you prefer? cc @xxchan @stdrc @BugenZhao.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer solution 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little side note: RC's case (mutual recursion) will be rejected by PG due to function does not exist😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait a minute. Why we can't do recursive check here? Shouldn't it also fail with "function does not exist", similar with mutual recursion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will be rejected by PG due to function does not exist
Good catch!
Currently we don't have the ability to check if the inner calling functions exist during definition time, the problem will only be exposed when actual calling the function. (i.e., during binding phase)
I prefer to support the checking in later PRs.
// If the `udf_context` is not empty, this means the current binding | ||
// function is not the root binding sql udf, thus we should NOT | ||
// clean the context after binding. | ||
clean_flag = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm feeling that it's time to create a new binder to do recursive scoped binding, which could simplify state management. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add an extra field in binder, e.g., udf_scope_binder: Option<Binder>
?
But I'm worrying the extra cost of adding a new binder when using sql udf, it should be acceptable though 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I mean creating a new binder on the stack (with only function parameters as available identifiers) and calling the bind
method. This could resolve the name conflict problem.
The cost can be ignored in the binding stage I think. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will put this to later PRs. 🤡
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: Can prepared statements work correctly? And are there any tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be surprising but I couldn't find any e2e tests for the prepared statement itself. 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can prepared statements work correctly?
I think so, under current implementation the check of udf_context
will be immediately cleared after bind_function
, which ensure it will not interleave with the parameter bindings of prepare-related statements. (e.g., triggered by third party drivers, etc)
Below is the list for future to-dos, just for reference:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks!
btw why is this called “ anonymous”? |
The parameters have no name. (i.e., $1 / $2 / ..., so I name it anonymous 😄) |
In risingwave even building functions parameters also don’t have names because we don’t support named parameters yet |
Should we support it? It can facilitate some function callings, such as: jsonb_path_query(jsonb, jsonpath, silent => true); |
Making sure I understand it correctly. We now have SQL-UDF in parralel with external UDFs (Javan and Python), which requires deploying additional servers/instances. Correct? Thanks. @wangrunji0408 @BugenZhao |
Exactly. And sql udf does not rely on external service, basically performs inlining logic, cc @hengm3467. |
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
This PR adds basic support for anonymous sql udf, the current supported syntax is simple, and is a subset of pg's
create function
standard (refer: https://www.postgresql.org/docs/current/sql-createfunction.html), for all to-be-supported syntax in RW, check #10151 for details.The current supported syntax (anonymous parameter list) is as follows:
Note: recursive definition is NOT supported, e.g.,
See
e2e_test/udf/sql_udf.slt
for detailed use cases.P.S. The current implementation will only check for syntax correctness (i.e., the sql body in as clause) by utilizing
Parser::parse_sql
, semantic check will be implemented in the future.Checklist
./risedev check
(or alias,./risedev c
)Documentation
Release note
Support basic anonymous sql user defined function, the current supported syntax is limited. See the above documentation for details.