Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(binder): bind RCTE #15522

Merged
merged 24 commits into from
Apr 9, 2024
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
27b5439
feat(binder): bind RCTE
TennyZhuang Mar 7, 2024
24d763b
Merge branch 'main' into bind_rcte
TennyZhuang Mar 13, 2024
51cefe3
Merge branch 'main' into bind_rcte
TennyZhuang Mar 14, 2024
20493ab
Update src/frontend/src/binder/bind_context.rs
TennyZhuang Mar 14, 2024
cfb9920
Update src/frontend/src/expr/mod.rs
TennyZhuang Mar 14, 2024
a5c7d81
Merge branch 'main' into bind_rcte
TennyZhuang Mar 14, 2024
f8e3788
Merge remote-tracking branch 'origin/main' into bind_rcte
TennyZhuang Mar 29, 2024
76cc79c
update
TennyZhuang Mar 29, 2024
8e5b68a
refine
TennyZhuang Mar 29, 2024
e2292a3
update comment; fix typo(s)
xzhseh Mar 30, 2024
78a2422
feat(binder): correctly bind rcte in `bind_with` & `bind_relation_by_…
xzhseh Apr 3, 2024
ced2f59
Merge branch 'main' into bind_rcte
xzhseh Apr 3, 2024
6b84174
fix nit; cleanup comment
xzhseh Apr 4, 2024
8ada38b
fix warning
TennyZhuang Apr 7, 2024
3175084
add a binder test
TennyZhuang Apr 7, 2024
b0f1446
Merge branch 'main' into bind_rcte
TennyZhuang Apr 7, 2024
65379b0
fix check in bind_context when merge_context with rcte
xzhseh Apr 7, 2024
f81531e
remove BoundRecursiveUnion; refactor input field for BoundShare
xzhseh Apr 8, 2024
4044d54
update fmt
xzhseh Apr 8, 2024
da9776c
add bail_not_implemented to prevent stack overflow in plan_relation
xzhseh Apr 8, 2024
ebaa6c2
update comments
xzhseh Apr 8, 2024
79b7015
update unit test accordingly
xzhseh Apr 8, 2024
249408e
update test's format to fix ci
xzhseh Apr 9, 2024
aa6cdb4
refactor(binder): update bind_rcte to be more rusty (#16214)
xiangjinwu Apr 9, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
83 changes: 73 additions & 10 deletions src/frontend/src/binder/bind_context.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,19 +12,22 @@
// See the License for the specific language governing permissions and
// limitations under the License.

use std::cell::RefCell;
use std::collections::hash_map::Entry;
use std::collections::{BTreeMap, HashMap, HashSet};
use std::rc::Rc;

use either::Either;
use parse_display::Display;
use risingwave_common::catalog::Field;
use risingwave_common::catalog::{Field, Schema};
use risingwave_common::types::DataType;
use risingwave_sqlparser::ast::TableAlias;

use crate::error::{ErrorCode, Result};

type LiteResult<T> = std::result::Result<T, ErrorCode>;

use super::BoundSetExpr;
use crate::binder::{BoundQuery, ShareId, COLUMN_GROUP_PREFIX};

#[derive(Debug, Clone)]
Expand Down Expand Up @@ -67,6 +70,60 @@ pub struct LateralBindContext {
pub context: BindContext,
}

/// For recursive CTE, we may need to store it in `cte_to_relation` first,
/// and then bind it *step by step*.
///
/// note: the below sql example is to illustrate when we get the
/// corresponding binding state when handling a recursive CTE like this.
///
/// ```sql
/// WITH RECURSIVE t(n) AS (
/// # -------------^ => Init
/// VALUES (1)
/// # ----------^ => BaseResolved (after binding the base term)
/// UNION ALL
/// SELECT n+1 FROM t WHERE n < 100
/// # ------------------^ => Bound (we know exactly what the entire cte looks like)
/// )
/// SELECT sum(n) FROM t;
/// ```
#[derive(Default, Debug, Clone)]
pub enum BindingCteState {
/// We know nothing about the CTE before resolving the body.
#[default]
Init,
/// We know the schema form after the base term resolved.
BaseResolved { schema: Schema },
/// We get the whole bound result of the (recursive) CTE.
Bound {
query: Either<BoundQuery, RecursiveUnion>,
},
}

/// the entire `RecursiveUnion` represents a *bound* recursive cte.
/// reference: <https://github.com/risingwavelabs/risingwave/pull/15522/files#r1524367781>
#[derive(Debug, Clone)]
pub struct RecursiveUnion {
/// currently this *must* be true,
/// otherwise binding will fail.
pub all: bool,
/// lhs part of the `UNION ALL` operator
pub base: Box<BoundSetExpr>,
/// rhs part of the `UNION ALL` operator
pub recursive: Box<BoundSetExpr>,
/// the aligned schema for this union
/// will be the *same* schema as recursive's
/// this is just for a better readability
pub schema: Schema,
}

#[derive(Clone, Debug)]
pub struct BindingCte {
pub share_id: ShareId,
pub state: BindingCteState,
pub alias: TableAlias,
}

#[derive(Default, Debug, Clone)]
pub struct BindContext {
// Columns of all tables.
Expand All @@ -79,9 +136,9 @@ pub struct BindContext {
pub clause: Option<Clause>,
// The `BindContext`'s data on its column groups
pub column_group_context: ColumnGroupContext,
/// Map the cte's name to its `Relation::Subquery`.
/// The `ShareId` of the value is used to help the planner identify the share plan.
pub cte_to_relation: HashMap<String, Rc<(ShareId, BoundQuery, TableAlias)>>,
/// Map the cte's name to its binding state.
/// The `ShareId` in `BindingCte` of the value is used to help the planner identify the share plan.
pub cte_to_relation: HashMap<String, Rc<RefCell<BindingCte>>>,
/// Current lambda functions's arguments
pub lambda_args: Option<HashMap<String, (usize, DataType)>>,
}
Expand Down Expand Up @@ -305,13 +362,19 @@ impl BindContext {
entry.extend(v.into_iter().map(|x| x + begin));
}
for (k, (x, y)) in other.range_of {
match self.range_of.entry(k) {
match self.range_of.entry(k.clone()) {
Entry::Occupied(e) => {
return Err(ErrorCode::InternalError(format!(
"Duplicated table name while merging adjacent contexts: {}",
e.key()
))
.into());
if let BindingCteState::Bound { .. } =
self.cte_to_relation.get(&k).unwrap().borrow().state.clone()
{
// do nothing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#16023 (comment)

We shall add the test case to showcase the necessity of this special check. (Btw I am still trying to understand it at this time. The ask for test case is not an endorsement of it being valid 😂 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panicked here:

thread 'rw-main' panicked at src/frontend/src/binder/bind_context.rs:368:54:
called `Option::unwrap()` on a `None` value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be fixed now - just update the logic when checking if merge_context is applying for rcte.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this special bypass just suppress something that is actually wrong behind the scene:

with recursive t(n) as (
values (1)
union all
select n + 1 from t where n < 100
)
select * from t;
______________^ here

Without the update here, the error is saying that t is already in BindContext before binding t. This error is TRUE, and adding this bypass does not fix the actual dirty-context problem: before binding t, the context is supposed to be empty.

As an example, having a dirty initial context makes the following malformed query to pass binder validation (and panic in planner):

with recursive t(n) as (
values (1)
union all
select n + 1 from t where n < 100
)
select n;  -- n is not supposed to be in context, but the binder implementation here contains it somehow

I am still tracing down how the context become dirty.

Copy link
Contributor

@xiangjinwu xiangjinwu Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} else {
let bound_query = self.bind_query(query)?;

The nonrecursive branch's bind_query pushes a new context upon entry and clears it before returning, while the recursive branch's two bind_set_expr calls write directly into current context. Maybe (?) this is the reason 🤔

The problem would not arise if both sides of RecursiveUnion were considered Query rather than SetExpr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix available for review: #16282

} else {
return Err(ErrorCode::InternalError(format!(
"Duplicated table name while merging adjacent contexts: {}",
e.key()
))
.into());
}
}
Entry::Vacant(entry) => {
entry.insert((begin + x, begin + y));
Expand Down
135 changes: 122 additions & 13 deletions src/frontend/src/binder/query.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,18 +12,22 @@
// See the License for the specific language governing permissions and
// limitations under the License.

use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

use risingwave_common::bail_not_implemented;
use risingwave_common::catalog::Schema;
use risingwave_common::types::DataType;
use risingwave_common::util::sort_util::{ColumnOrder, OrderType};
use risingwave_sqlparser::ast::{Cte, Expr, Fetch, OrderByExpr, Query, Value, With};
use risingwave_sqlparser::ast::{
Cte, Expr, Fetch, OrderByExpr, Query, SetExpr, SetOperator, Value, With,
};
use thiserror_ext::AsReport;

use super::bind_context::BindingCteState;
use super::statement::RewriteExprsRecursive;
use super::BoundValues;
use crate::binder::bind_context::{BindingCte, RecursiveUnion};
use crate::binder::{Binder, BoundSetExpr};
use crate::error::{ErrorCode, Result};
use crate::expr::{CorrelatedId, Depth, ExprImpl, ExprRewriter};
Expand Down Expand Up @@ -279,20 +283,125 @@ impl Binder {
}

fn bind_with(&mut self, with: With) -> Result<()> {
if with.recursive {
bail_not_implemented!("recursive cte");
} else {
for cte_table in with.cte_tables {
let Cte { alias, query, .. } = cte_table;
let table_name = alias.name.real_value();
let bound_query = self.bind_query(query)?;
let share_id = self.next_share_id();
self.context
for cte_table in with.cte_tables {
let share_id = self.next_share_id();
let Cte { alias, query, .. } = cte_table;
let table_name = alias.name.real_value();

if with.recursive {
Copy link
Contributor

@xiangjinwu xiangjinwu Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this recursive is a single toggle for all CTEs (doc). When absent it allows a binder shortcut, but when specified we need to attempt in both directions for all CTEs. (Update on April 2nd: attempt and recover context is hard)

Example in PostgreSQL:

test=# with recursive t as (select 1) select * from t; -- unnecessary `recursive` allowed
 ?column? 
----------
        1
(1 row)

let Query {
with,
body,
order_by,
limit,
offset,
fetch,
} = query;

/// the input clause should not be supported.
fn should_be_empty<T>(v: Option<T>, clause: &str) -> Result<()> {
if v.is_some() {
return Err(ErrorCode::BindError(format!(
"`{clause}` is not supported in recursive CTE"
))
.into());
}
Ok(())
}

should_be_empty(order_by.first(), "ORDER BY")?;
should_be_empty(limit, "LIMIT")?;
should_be_empty(offset, "OFFSET")?;
should_be_empty(fetch, "FETCH")?;

let SetExpr::SetOperation {
op: SetOperator::Union,
all,
left,
right,
} = body.clone()
else {
return Err(ErrorCode::BindError(
"`UNION` is required in recursive CTE".to_string(),
)
.into());
};

if !all {
return Err(ErrorCode::BindError(
"only `UNION ALL` is supported in recursive CTE now".to_string(),
)
.into());
}

let entry = self
.context
.cte_to_relation
.insert(table_name, Rc::new((share_id, bound_query, alias)));
.entry(table_name)
.insert_entry(Rc::new(RefCell::new(BindingCte {
share_id,
state: BindingCteState::Init,
alias,
})))
.get()
.clone();

if let Some(with) = with {
self.bind_with(with)?;
}

// We assume `left` is the base term, otherwise the implementation may be very hard.
// The behavior is the same as PostgreSQL's.
// reference: <https://www.postgresql.org/docs/16/sql-select.html#:~:text=the%20recursive%20self%2Dreference%20must%20appear%20on%20the%20right%2Dhand%20side%20of%20the%20UNION>
let mut base = self.bind_set_expr(*left)?;

entry.borrow_mut().state = BindingCteState::BaseResolved {
schema: base.schema().clone(),
};

// bind the rest of the recursive cte
let mut recursive = self.bind_set_expr(*right)?;

// todo: add validate check here for *bound* `base` and `recursive`
Self::align_schema(&mut base, &mut recursive, SetOperator::Union)?;

// please note that even after aligning, the schema of `left`
// may not be the same as `right`; this is because there may
// be case(s) where the `base` term is just a value, and the
// `recursive` term is a select expression / statement.
xzhseh marked this conversation as resolved.
Show resolved Hide resolved
let schema = recursive.schema().clone();
// yet another sanity check
assert_eq!(
schema,
recursive.schema().clone(),
"expect `schema` to be the same as recursive's"
);

let recursive_union = RecursiveUnion {
all,
base: Box::new(base),
recursive: Box::new(recursive),
schema,
};

entry.borrow_mut().state = BindingCteState::Bound {
query: either::Either::Right(recursive_union),
};
} else {
let bound_query = self.bind_query(query)?;
self.context.cte_to_relation.insert(
table_name,
Rc::new(RefCell::new(BindingCte {
share_id,
state: BindingCteState::Bound {
query: either::Either::Left(bound_query),
},
alias,
})),
);
}
Ok(())
}
Ok(())
}
}

Expand Down
28 changes: 28 additions & 0 deletions src/frontend/src/binder/relation/cte_ref.rs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively, if we have decoupled the declaration (knowing its schema) and definition (binding the recursive query) stage of CTE, we can use the existing BoundShare to refer to all CTEs no matter they are recursive or not. I feel that this new structure is not necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's a back reference. If you take a look at the definition of BoundShare, you'll know it's impossible to use that, otherwise the loop will not end.

Copy link
Contributor Author

@TennyZhuang TennyZhuang Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

DuckDB also have two kind of CTE nodes.

Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
// Copyright 2024 RisingWave Labs
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

use crate::binder::statement::RewriteExprsRecursive;
use crate::binder::ShareId;

/// A CTE reference, currently only used in the back reference of recursive CTE.
/// For the non-recursive one, see [`BoundShare`](super::BoundShare).
#[derive(Debug, Clone)]
pub struct BoundBackCteRef {
#[expect(dead_code)]
pub(crate) share_id: ShareId,
}

impl RewriteExprsRecursive for BoundBackCteRef {
fn rewrite_exprs_recursive(&mut self, _rewriter: &mut impl crate::expr::ExprRewriter) {}
}
Loading
Loading