Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CH] ORC read/write mismatch when map column contains null #8021

Open
taiyang-li opened this issue Nov 22, 2024 · 2 comments · May be fixed by #8023
Open

[CH] ORC read/write mismatch when map column contains null #8021

taiyang-li opened this issue Nov 22, 2024 · 2 comments · May be fixed by #8023
Labels
enhancement New feature or request

Comments

@taiyang-li
Copy link
Contributor

taiyang-li commented Nov 22, 2024

Description

First execute below sql with native write enabled.

drop table if exists tmp.tnm;
create table tmp.tnm using orc as 
with data_source as (
select
id as uid,
case when random() < 0.1 then null else floor(random() * 100) end as rec_room_id,
case when random() < 0.1 then null else floor(random() * 100) end as room_id,
case when random() < 0.1 then null else floor(random() * 100) end as dispatch_id,
case when random() < 0.1 then null else floor(random() * 100) end as gift_value_total,
case when random() < 0.1 then null else floor(random() * 100) end as follow_channel,
case when random() < 0.1 then null else floor(random() * 100) end as follow_user,
case when random() < 0.1 then null else floor(random() * 100) end as followed_channel,
case when random() < 0.1 then null else floor(random() * 100) end as need_filter,
case when random() < 0.1 then null else floor(random() * 100) end as mic_time
from range(100000)
)
select
uid,
rec_room_id,
room_id,
dispatch_id,
str_to_map(
concat(
'gift_value_total:', gift_value_total,
',follow_channel:', follow_channel,
',follow_user:', follow_user,
',followed_channel:', followed_channel,
',filter:', need_filter,
',mic_time:', mic_time
)
) as label_map,
mic_time
from
data_source;

Then download the orc file and view its content. We can see that the values of mic_time and label_map['mic_time'] don't match.

{"uid": 0, "rec_room_id": 28, "room_id": null, "dispatch_id": 65, "label_map": [{"key": "gift_value_total", "value": "75"}, {"key": "follow_channel", "value": "96"}, {"key": "follow_user", "value": "45"}, {"key": "followed_channel", "value": "64"}, {"key": "filter", "value": "62"}, {"key": "mic_time", "value": "86"}], "mic_time": 86}
{"uid": 1, "rec_room_id": 90, "room_id": 55, "dispatch_id": null, "label_map": null, "mic_time": 64}
{"uid": 2, "rec_room_id": 84, "room_id": 40, "dispatch_id": 38, "label_map": [{"key": "gift_value_total", "value": "2"}, {"key": "follow_channel", "value": "54"}, {"key": "follow_user", "value": "44"}, {"key": "followed_channel", "value": "62"}, {"key": "filter", "value": "68"}, {"key": "mic_time", "value": "64"}], "mic_time": 7}
{"uid": 3, "rec_room_id": 37, "room_id": null, "dispatch_id": 54, "label_map": [{"key": "gift_value_total", "value": "22"}, {"key": "follow_channel", "value": "2"}, {"key": "follow_user", "value": "66"}, {"key": "followed_channel", "value": "69"}, {"key": "filter", "value": "34"}, {"key": "mic_time", "value": "7"}], "mic_time": 86}
{"uid": 4, "rec_room_id": 2, "room_id": 64, "dispatch_id": 41, "label_map": [{"key": "gift_value_total", "value": "66"}, {"key": "follow_channel", "value": "24"}, {"key": "follow_user", "value": "3"}, {"key": "followed_channel", "value": "94"}, {"key": "filter", "value": "32"}, {"key": "mic_time", "value": "86"}], "mic_time": 63}
{"uid": 5, "rec_room_id": 10, "room_id": 28, "dispatch_id": 49, "label_map": null, "mic_time": 14}
{"uid": 6, "rec_room_id": null, "room_id": 74, "dispatch_id": 23, "label_map": null, "mic_time": 44}
{"uid": 7, "rec_room_id": 61, "room_id": 65, "dispatch_id": 46, "label_map": [{"key": "gift_value_total", "value": "22"}, {"key": "follow_channel", "value": "15"}, {"key": "follow_user", "value": "28"}, {"key": "followed_channel", "value": "70"}, {"key": "filter", "value": "30"}, {"key": "mic_time", "value": "63"}], "mic_time": 51}
{"uid": 8, "rec_room_id": null, "room_id": 20, "dispatch_id": 90, "label_map": [{"key": "gift_value_total", "value": "14"}, {"key": "follow_channel", "value": "97"}, {"key": "follow_user", "value": "99"}, {"key": "followed_channel", "value": "10"}, {"key": "filter", "value": "77"}, {"key": "mic_time", "value": "14"}], "mic_time": 41}
{"uid": 9, "rec_room_id": 16, "room_id": null, "dispatch_id": null, "label_map": null, "mic_time": 9} 
@taiyang-li taiyang-li added the enhancement New feature or request label Nov 22, 2024
@taiyang-li
Copy link
Contributor Author

taiyang-li commented Nov 22, 2024

Reason: orc requires offsets[i+1] == offset[i] when the i-th row is null in MapVectorBatch, otherwise the written orc batch is not consistent with CH column. So is ListVectorBatch. But in CH, the nullable map column returned from function str_to_map maybe like:

| nullmap | offsets | keys |
|---------|---------|------|
| 1       | 2      | [k1, k2]  |
| 1       | 4      | [k3, k4]  |
| 0       | 6      | [k5, k6]  |
| 1       | 8      | [k7, k8]  |

@taiyang-li
Copy link
Contributor Author

Solution: Recursively truncate non-empty nested data when current row is null in CH Map column before writing to ORC/Parquet.

@taiyang-li taiyang-li changed the title [CH] Fix orc writing Nullable(Map(K, V)) issue [CH] ORC read/write mismatch when map column contains null Nov 22, 2024
taiyang-li added a commit to bigo-sg/gluten that referenced this issue Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
1 participant