Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disallow usage of jsonb/map type in group by / order by / primary key #7981

Open
Tracked by #17743
xiangjinwu opened this issue Feb 16, 2023 · 10 comments
Open
Tracked by #17743
Assignees
Labels
help wanted Issues that need help from contributors type/feature

Comments

@xiangjinwu
Copy link
Contributor

No description provided.

@github-actions github-actions bot added this to the release-0.1.18 milestone Feb 16, 2023
@xiangjinwu xiangjinwu added the help wanted Issues that need help from contributors label Mar 20, 2023
@xiangjinwu xiangjinwu modified the milestones: release-0.18, release-0.19 Mar 22, 2023
@xiangjinwu xiangjinwu self-assigned this Apr 4, 2023
@xiangjinwu xiangjinwu modified the milestones: release-0.19, release-0.20 Apr 26, 2023
@fuyufjh
Copy link
Member

fuyufjh commented Oct 15, 2023

I think we need to prioritize it

@fuyufjh
Copy link
Member

fuyufjh commented Jan 5, 2024

@xiangjinwu Any reason we postpone this?

@xiangjinwu
Copy link
Contributor Author

Until recently, it has been inactive because there was no mechanism to disallow a specific DataType in certain Usage. The initial motivation includes replacing Scalar: Hash by switching the caller from T: Scalar to T: Scalar + Hash, and report proper error when dispatching from DataType/ScalarRefImpl with a match. This can be a big refactor.

Recently we also run into a case where jsonb can be implicitly used as key due to subquery unnesting domain calculation / lateral join:

create table t (msg varchar, ext jsonb);
insert into t values ('foo', '[123, "abc"]'), ('bar', '[true, null]');

select msg, (select count(*) from jsonb_array_elements(ext)) from t;
select msg, ext_element from t, jsonb_array_elements(ext) as ext_elements(ext_element);

Disallowing it would effectively mark these queries as unsupported.

That being said, these queries perform badly today and are likely unpractical in heavier loads. But they do look useful and may need alternative plans to support.

@xxchan
Copy link
Member

xxchan commented Jul 18, 2024

Regarding the subqueries, I feel they can be rewritten. Basically it's table functions involved.

select col, agg(ele) from tf(json) as tf_results(ele) from t
== 
with expanded(pk, col, ele) as (select pk, col, tf(json) from t)
select first_value(col), agg(ele) from expanded
group by pk



select col, ele from t, tf(json) as tf_results(ele);
== 
select col, tf(json) from t;

Currently, they are not banned just because #14442 puts the Visitor at a too early position.

@xiangjinwu
Copy link
Contributor Author

xiangjinwu commented Jul 18, 2024

The subquery concern is about the general rule / algorithm, rather than these 2 specific cases.

In Unnesting Arbitrary Queries, the domain is defined as the set of unique values of correlated columns on the outer side. Without the ability to calculate unique values with agg and the ability to join on them, for certain data types, this decorrelation approach is effectively banned to support columns of such data types to be correlated.

^ This requires real agg & join at the executor level, unless simplified away by plan optimization.

@xxchan
Copy link
Member

xxchan commented Jul 18, 2024

Note that jsonb is not good for key not only because of performance. The semantic can also be controversial:

-- in risingwave
> select '{"a":1,"b":2}'::jsonb = '{"b":2, "a":1}';
t
> select '{"a":1,"b":2}'::jsonb < '{"b":2, "a":1}';
Unsupported function: less_than

-- in duckdb (btw, they seem to preserve original json text)
> SELECT {b:2,a:1}::JSON = {a:1,b:2}::JSON;
false
> SELECT {b:2,a:1}::JSON < {a:1,b:2}::JSON;
false

We chose to make it defined, but sacrificed performance.

#12952 (comment)

impl Ord for JsonbRef<'_> {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
// We do not intend to support ordering `jsonb` type.
// Before #7981 is done, we do not panic but just compare its string representation.
// Note that `serde_json` without feature `preserve_order` uses `BTreeMap` for json object.
// So its string form always have keys sorted.
//
// In PostgreSQL, Object > Array > Boolean > Number > String > Null.
// But here we have Object > true > Null > false > Array > Number > String.
// Because in ascii: `{` > `t` > `n` > `f` > `[` > `9` `-` > `"`.
//
// This is just to keep consistent with the memcomparable encoding, which uses string form.
// If we implemented the same typed comparison as PostgreSQL, we would need a corresponding
// memcomparable encoding for it.
self.0.to_string().cmp(&other.0.to_string())
}
}

For the new map type #17743 (which is physically a list), it has the same problem:

-- in duckdb
> SELECT MAP {'key1':10,'key2':20}=MAP {'key2':20,'key1':10};
false

@xiangjinwu
Copy link
Contributor Author

in duckdb (btw, they seem to preserve original json text)

I would consider this as a difference between json and jsonb. In PostgreSQL:

Because the json type stores an exact copy of the input text, it will preserve semantically-insignificant white space between tokens, as well as the order of keys within JSON objects. Also, if a JSON object within the value contains the same key more than once, all the key/value pairs are kept. (The processing functions consider the last value as the operative one.) By contrast, jsonb does not preserve white space, does not preserve the order of object keys, and does not keep duplicate object keys. If duplicate keys are specified in the input, only the last value is kept.

In general, most applications should prefer to store JSON data as jsonb, unless there are quite specialized needs, such as legacy assumptions about ordering of object keys.

@xxchan
Copy link
Member

xxchan commented Jul 29, 2024

Another problem found: for table without pk (i.e., _row_id as pk), there will be a HashDispatcher calculating the hash using all columns. So essentially all columns are used as key.

#17690 (comment)

image

@xxchan
Copy link
Member

xxchan commented Jul 30, 2024

Some discussion https://risingwave-labs.slack.com/archives/C034TRPKN1F/p1722317649393599

For the _row_id problem

Why the exchange is needed for DML?

UPDATE/DELETE will have _row_id present.

Why we chose to hash the entire row when the _row_id is missing, instead of sth like random or round-robin?

  1. prevent data skew
  2. make the shuffle deterministic

Can we change it to the random or round robin?

Perhaps we can give it a try.

Concern: Just unsure whether we can correctly handle round-robin distribution in streaming (like scheduler).

Alternative: Can we just generate vnode round-robin for the whole chunk in HashShuffle instead of using round-robin distribution if _row_id is missing?

For the "Key" problem

I’m not sure whether we really want to ban jsonb/map from keys, and to what extent:

  1. completely and seriously, solve all problems incl. the _row_id;
  2. only partially ban (e.g., not used in state table, but can use in dispatcher), and tolerate problems like _row_id
  3. (status quo) no special treatment, just discourage users to use it (like undefined behavior, or a defined problematic behavior).

@fuyufjh
Copy link
Member

fuyufjh commented Jul 30, 2024

Another problem found: for table without pk (i.e., _row_id as pk), there will be a HashDispatcher calculating the hash using all columns. So essentially all columns are used as key.

Actually, it's a different problem. This issue is mainly targeted at user interface, that is, we may forbid JSONB type in group by / order by / primary key, while internally keep its hash function for some corner cases like your example.

@xxchan xxchan changed the title disallow usage of jsonb type in group by / order by / primary key disallow usage of jsonb/map type in group by / order by / primary key Aug 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Issues that need help from contributors type/feature
Projects
None yet
Development

No branches or pull requests

4 participants