disallow usage of `jsonb`/`map` type in `group by` / `order by` / `primary key` #7981

xiangjinwu · 2023-02-16T12:08:51Z

No description provided.

fuyufjh · 2023-10-15T10:38:22Z

I think we need to prioritize it

fuyufjh · 2024-01-05T09:05:02Z

@xiangjinwu Any reason we postpone this?

xiangjinwu · 2024-01-09T08:45:47Z

Until recently, it has been inactive because there was no mechanism to disallow a specific DataType in certain Usage. The initial motivation includes replacing Scalar: Hash by switching the caller from T: Scalar to T: Scalar + Hash, and report proper error when dispatching from DataType/ScalarRefImpl with a match. This can be a big refactor.

Recently we also run into a case where jsonb can be implicitly used as key due to subquery unnesting domain calculation / lateral join:

create table t (msg varchar, ext jsonb);
insert into t values ('foo', '[123, "abc"]'), ('bar', '[true, null]');

select msg, (select count(*) from jsonb_array_elements(ext)) from t;
select msg, ext_element from t, jsonb_array_elements(ext) as ext_elements(ext_element);

Disallowing it would effectively mark these queries as unsupported.

That being said, these queries perform badly today and are likely unpractical in heavier loads. But they do look useful and may need alternative plans to support.

xxchan · 2024-07-18T08:46:13Z

Regarding the subqueries, I feel they can be rewritten. Basically it's table functions involved.

select col, agg(ele) from tf(json) as tf_results(ele) from t
== 
with expanded(pk, col, ele) as (select pk, col, tf(json) from t)
select first_value(col), agg(ele) from expanded
group by pk



select col, ele from t, tf(json) as tf_results(ele);
== 
select col, tf(json) from t;

Currently, they are not banned just because #14442 puts the Visitor at a too early position.

xiangjinwu · 2024-07-18T09:08:14Z

The subquery concern is about the general rule / algorithm, rather than these 2 specific cases.

In Unnesting Arbitrary Queries, the domain is defined as the set of unique values of correlated columns on the outer side. Without the ability to calculate unique values with agg and the ability to join on them, for certain data types, this decorrelation approach is effectively banned to support columns of such data types to be correlated.

^ This requires real agg & join at the executor level, unless simplified away by plan optimization.

xxchan · 2024-07-18T09:17:20Z

Note that jsonb is not good for key not only because of performance. The semantic can also be controversial:

-- in risingwave
> select '{"a":1,"b":2}'::jsonb = '{"b":2, "a":1}';
t
> select '{"a":1,"b":2}'::jsonb < '{"b":2, "a":1}';
Unsupported function: less_than

-- in duckdb (btw, they seem to preserve original json text)
> SELECT {b:2,a:1}::JSON = {a:1,b:2}::JSON;
false
> SELECT {b:2,a:1}::JSON < {a:1,b:2}::JSON;
false

We chose to make it defined, but sacrificed performance.

#12952 (comment)

risingwave/src/common/src/types/jsonb.rs

Lines 93 to 109 in 0883df5

    
           impl Ord for JsonbRef<'_> { 
        
               fn cmp(&self, other: &Self) -> std::cmp::Ordering { 
        
                   // We do not intend to support ordering `jsonb` type. 
        
                   // Before #7981 is done, we do not panic but just compare its string representation. 
        
                   // Note that `serde_json` without feature `preserve_order` uses `BTreeMap` for json object. 
        
                   // So its string form always have keys sorted. 
        
                   // 
        
                   // In PostgreSQL, Object > Array > Boolean > Number > String > Null. 
        
                   // But here we have Object > true > Null > false > Array > Number > String. 
        
                   // Because in ascii: `{` > `t` > `n` > `f` > `[` > `9` `-` > `"`. 
        
                   // 
        
                   // This is just to keep consistent with the memcomparable encoding, which uses string form. 
        
                   // If we implemented the same typed comparison as PostgreSQL, we would need a corresponding 
        
                   // memcomparable encoding for it. 
        
                   self.0.to_string().cmp(&other.0.to_string()) 
        
               } 
        
           }

For the new map type #17743 (which is physically a list), it has the same problem:

-- in duckdb
> SELECT MAP {'key1':10,'key2':20}=MAP {'key2':20,'key1':10};
false

xiangjinwu · 2024-07-18T09:24:01Z

in duckdb (btw, they seem to preserve original json text)

I would consider this as a difference between json and jsonb. In PostgreSQL:

Because the json type stores an exact copy of the input text, it will preserve semantically-insignificant white space between tokens, as well as the order of keys within JSON objects. Also, if a JSON object within the value contains the same key more than once, all the key/value pairs are kept. (The processing functions consider the last value as the operative one.) By contrast, jsonb does not preserve white space, does not preserve the order of object keys, and does not keep duplicate object keys. If duplicate keys are specified in the input, only the last value is kept.

In general, most applications should prefer to store JSON data as jsonb, unless there are quite specialized needs, such as legacy assumptions about ordering of object keys.

xxchan · 2024-07-29T12:41:30Z

Another problem found: for table without pk (i.e., _row_id as pk), there will be a HashDispatcher calculating the hash using all columns. So essentially all columns are used as key.

#17690 (comment)

xxchan · 2024-07-30T07:28:18Z

Some discussion https://risingwave-labs.slack.com/archives/C034TRPKN1F/p1722317649393599

For the `_row_id` problem

Why the exchange is needed for DML?

UPDATE/DELETE will have _row_id present.

Why we chose to hash the entire row when the `_row_id` is missing, instead of sth like random or round-robin?

prevent data skew
make the shuffle deterministic

Can we change it to the random or round robin?

Perhaps we can give it a try.

Concern: Just unsure whether we can correctly handle round-robin distribution in streaming (like scheduler).

Alternative: Can we just generate vnode round-robin for the whole chunk in HashShuffle instead of using round-robin distribution if _row_id is missing?

For the "Key" problem

I’m not sure whether we really want to ban jsonb/map from keys, and to what extent:

completely and seriously, solve all problems incl. the _row_id;
only partially ban (e.g., not used in state table, but can use in dispatcher), and tolerate problems like _row_id
(status quo) no special treatment, just discourage users to use it (like undefined behavior, or a defined problematic behavior).

fuyufjh · 2024-07-30T08:50:19Z

Another problem found: for table without pk (i.e., _row_id as pk), there will be a HashDispatcher calculating the hash using all columns. So essentially all columns are used as key.

Actually, it's a different problem. This issue is mainly targeted at user interface, that is, we may forbid JSONB type in group by / order by / primary key, while internally keep its hash function for some corner cases like your example.

xiangjinwu added the type/feature label Feb 16, 2023

github-actions bot added this to the release-0.1.18 milestone Feb 16, 2023

xiangjinwu mentioned this issue Feb 17, 2023

feat(common): support insert and select for jsonb type #7994

Merged

4 tasks

xiangjinwu added the help wanted Issues that need help from contributors label Mar 20, 2023

xiangjinwu modified the milestones: release-0.18, release-0.19 Mar 22, 2023

xiangjinwu self-assigned this Apr 4, 2023

xiangjinwu modified the milestones: release-0.19, release-0.20 Apr 26, 2023

xiangjinwu modified the milestones: release-0.20, future-release-0.21 May 26, 2023

xiangjinwu modified the milestones: release-1.1, future-release-1.3 Aug 2, 2023

xiangjinwu modified the milestones: release-1.3, future-release-1.5 Oct 10, 2023

xiangjinwu mentioned this issue Oct 20, 2023

refactor(type): switch jsonb to flat representation #12952

Merged

7 tasks

xiangjinwu modified the milestones: release-1.5, future-release-1.6, release-1.6, future-release-1.7 Nov 30, 2023

xiangjinwu modified the milestones: future-release-1.7, future-release-1.8 Dec 29, 2023

xxchan mentioned this issue Jan 5, 2024

Discussion: union causes unreasonable long key #14314

Closed

fuyufjh assigned yuhao-su Jan 5, 2024

xiangjinwu mentioned this issue Jan 10, 2024

feat(frontend): ban jsonb in aggregation stream key #14442

Merged

9 tasks

xiangjinwu modified the milestones: release-1.8, future-release-1.9 Mar 6, 2024

xiangjinwu modified the milestones: release-1.9, future-release-1.10 Apr 8, 2024

xiangjinwu modified the milestones: release-1.10, future-release-1.11 May 13, 2024

xiangjinwu modified the milestones: release-1.11, future-release-1.12 Jul 10, 2024

This was referenced Jul 24, 2024

support native map datatype #17743

Closed

feat: add map type to scalar/array/expr #17690

Merged

xxchan changed the title ~~disallow usage of jsonb type in group by / order by / primary key~~ disallow usage of jsonb/map type in group by / order by / primary key Aug 12, 2024

xiangjinwu modified the milestones: future-release-2.1, future-release-2.2 Aug 19, 2024

xiangjinwu modified the milestones: release-2.2, future-release-2.3 Oct 17, 2024

xiangjinwu modified the milestones: release-2.3, future-release-2.5 Jan 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

disallow usage of `jsonb`/`map` type in `group by` / `order by` / `primary key` #7981

disallow usage of `jsonb`/`map` type in `group by` / `order by` / `primary key` #7981

xiangjinwu commented Feb 16, 2023

fuyufjh commented Oct 15, 2023 •

edited

Loading

fuyufjh commented Jan 5, 2024

xiangjinwu commented Jan 9, 2024

xxchan commented Jul 18, 2024 •

edited

Loading

xiangjinwu commented Jul 18, 2024 •

edited

Loading

xxchan commented Jul 18, 2024

xiangjinwu commented Jul 18, 2024

xxchan commented Jul 29, 2024

xxchan commented Jul 30, 2024 •

edited

Loading

fuyufjh commented Jul 30, 2024

disallow usage of jsonb/map type in group by / order by / primary key #7981

disallow usage of jsonb/map type in group by / order by / primary key #7981

Comments

xiangjinwu commented Feb 16, 2023

fuyufjh commented Oct 15, 2023 • edited Loading

fuyufjh commented Jan 5, 2024

xiangjinwu commented Jan 9, 2024

xxchan commented Jul 18, 2024 • edited Loading

xiangjinwu commented Jul 18, 2024 • edited Loading

xxchan commented Jul 18, 2024

xiangjinwu commented Jul 18, 2024

xxchan commented Jul 29, 2024

xxchan commented Jul 30, 2024 • edited Loading

For the _row_id problem

Why the exchange is needed for DML?

Why we chose to hash the entire row when the _row_id is missing, instead of sth like random or round-robin?

Can we change it to the random or round robin?

For the "Key" problem

fuyufjh commented Jul 30, 2024

disallow usage of `jsonb`/`map` type in `group by` / `order by` / `primary key` #7981

disallow usage of `jsonb`/`map` type in `group by` / `order by` / `primary key` #7981

fuyufjh commented Oct 15, 2023 •

edited

Loading

xxchan commented Jul 18, 2024 •

edited

Loading

xiangjinwu commented Jul 18, 2024 •

edited

Loading

xxchan commented Jul 30, 2024 •

edited

Loading

For the `_row_id` problem

Why we chose to hash the entire row when the `_row_id` is missing, instead of sth like random or round-robin?