Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Querybuilder suggestions for json field (nested json attributes) #542

Closed
yautkin opened this issue Oct 4, 2023 · 7 comments
Closed

Querybuilder suggestions for json field (nested json attributes) #542

yautkin opened this issue Oct 4, 2023 · 7 comments

Comments

@yautkin
Copy link

yautkin commented Oct 4, 2023

Hello, colleagues.

Querybuilder web interface doesn't suggest some options for nested JSON attributes in JSON field.

I have a 2 column table

  • timestamp Datetime,
  • message JSON

When write sql query in querybuilder it show timestamp and message field for me, but it don't know something about nested json attributes.

In example below you can see, that query return successful result for message.status, but querybuilder doesn't suggest some nested JSON attributes from message field for selection.

json_field_suggestions

@yautkin yautkin added datasource/ClickHouse type/bug Something isn't working labels Oct 4, 2023
@SpencerTorres SpencerTorres added enhancement New feature or request and removed type/bug Something isn't working labels Oct 18, 2023
@SpencerTorres
Copy link
Collaborator

From my understanding, the JSON column type doesn't always have a consistent schema. For the current suggestions the plugin will sample the table's schema, but in this case we would have to sample rows and determine which fields are on a JSON payload. The JSON's schema could change between rows too.

I think for now a good workaround is to write the column in manually if you know your schema. Any thoughts @aangelisc ?

@aangelisc
Copy link
Contributor

Yeah this is a nuanced topic that can be quite complicated to resolve. The following question if we were to sample rows would then be how to handle nested JSON or, as @SpencerTorres mentioned, columns where the JSON is inconsistent between rows.

The Fields input allows custom values to be used so I agree a good workaround is to manually write the column and we can treat this as a feature request to investigate.

@aangelisc aangelisc moved this from Incoming to Waiting in Partner Datasources Oct 19, 2023
@yautkin
Copy link
Author

yautkin commented Oct 19, 2023

Look at example.
set allow_experimental_object_type = 1

CREATE TABLE test.logs2
(
timestamp DateTime64(3) DEFAULT now(),
message Object('json')
)
ENGINE = MergeTree()
ORDER BY timestamp

insert into test.logs2 FORMAT JSONEachRow {"message": {"row": 2, "second_field": "new field"}}

select * from logs2 FORMAT JSONEachRow
output:
"""
{"timestamp":"2023-10-19 15:51:51.000","message":{"row":1}}
"""
insert into test.logs2 FORMAT JSONEachRow {"message": {"row": 2, "second_field": "new field"}}

select * from logs2 FORMAT JSONEachRow
output:
"""
{"timestamp":"2023-10-19 15:51:51.000","message":{"row":1,"second_field":""}}
{"timestamp":"2023-10-19 15:52:21.000","message":{"row":2,"second_field":"new field"}}
"""

If we have in first row JSON {"row": 1}, after we add new row with JSON {"second_field": "new field"}, clickhouse server adds field "second_field" for the first row with empty string value.

@SpencerTorres
Copy link
Collaborator

If we have in first row JSON {"row": 1}, after we add new row with JSON {"second_field": "new field"}, clickhouse server adds field "second_field" for the first row with empty string value.

I appreciate the test queries you ran there, I see how that could solve the row inconsistency problem. This brings up some more questions though:

  • Is it only returning second_field as an empty string because it's present in the next row? If you added LIMIT 1, would it still append the empty field? Does it build a unique set of keys for the entire table?
  • What about nested objects? If second_field was a nested object, what does it put for an empty value? null ?
  • What happens when second_field is a mixed data type (sometimes a string, sometimes a number)?

@yautkin
Copy link
Author

yautkin commented Oct 20, 2023

I appreciate the test queries you ran there, I see how that could solve the row inconsistency problem. This brings up some more questions though:

  • Is it only returning second_field as an empty string because it's present in the next row? If you added LIMIT 1, would it still append the empty field? Does it build a unique set of keys for the entire table?
  • What about nested objects? If second_field was a nested object, what does it put for an empty value? null ?
  • What happens when second_field is a mixed data type (sometimes a string, sometimes a number)?
  1. With limit 1 result is the same
    select * from logs2 limit 1 FORMAT JSONEachRow
    {"timestamp":"2023-10-19 15:51:51.000","message":{"row":1,"second_field":""}}

I find it difficult to answer about unique set of keys for the entire table. ORDER BY tuple() syntax create a table with no primary key. In this case, Clickhouse stores data in insert order.

EXPLAIN indexes=1 select * from logs2 order by message.row
┌─explain──────────────────────────────┐
│ Expression (Projection)
│ Sorting (Sorting for ORDER BY)
│ Expression (Before ORDER BY)
│ ReadFromMergeTree (test.logs2)
│ Indexes:
│ PrimaryKey
│ Condition: true
│ Parts: 1/1
│ Granules: 1/1
└──────────────────────────────────────┘

  1. All nested field creates recursive but have empty value. And number field has zero as empty value (look at the end of cooment)
    insert into test.logs2 FORMAT JSONEachRow {"message": {"row": 3, "third_column_with_json": {"nested_field": "test"}}}
    select * from logs2 order by message.row FORMAT JSONEachRow
    {"timestamp":"2023-10-19 15:51:51.000","message":{"row":1,"second_field":"","third_column_with_json":{"nested_field":""}}}
    {"timestamp":"2023-10-19 15:52:21.000","message":{"row":2,"second_field":"new field","third_column_with_json":{"nested_field":""}}}
    {"timestamp":"2023-10-20 10:53:47.000","message":{"row":3,"second_field":"","third_column_with_json":{"nested_field":"test"}}}

and limit 1 for example above return only same first row:
{"timestamp":"2023-10-19 15:51:51.000","message":{"row":1,"second_field":"","third_column_with_json":{"nested_field":""}}}

  1. Looks like clickhouse choose first datatype (string in our example, but last 4th row has number into "second_field" of inserted row)
    insert into test.logs2 FORMAT JSONEachRow {"message": {"row": 4, "second_field": 4}}
    select * from logs2 order by message.row FORMAT JSONEachRow
    {"timestamp":"2023-10-19 15:51:51.000","message":{"row":1,"second_field":"","third_column_with_json":{"nested_field":""}}}
    {"timestamp":"2023-10-19 15:52:21.000","message":{"row":2,"second_field":"new field","third_column_with_json":{"nested_field":""}}}
    {"timestamp":"2023-10-20 10:53:47.000","message":{"row":3,"second_field":"","third_column_with_json":{"nested_field":"test"}}}
    {"timestamp":"2023-10-20 10:58:21.000","message":{"row":4,"second_field":"4","third_column_with_json":{"nested_field":""}}}

2.2) Zero as number empty
insert into test.logs2 FORMAT JSONEachRow {"message": {"test_nested_number": {"number": 1}}}
{"timestamp":"2023-10-20 11:20:22.000","message":{"row":0,"second_field":"","test_nested_number":{"number":1},"third_column_with_json":{"nested_field":""}}}
{"timestamp":"2023-10-19 15:51:51.000","message":{"row":1,"second_field":"","test_nested_number":{"number":0},"third_column_with_json":{"nested_field":""}}}
{"timestamp":"2023-10-20 11:10:21.000","message":{"row":1,"second_field":"","test_nested_number":{"number":0},"third_column_with_json":{"nested_field":""}}}
{"timestamp":"2023-10-19 15:52:21.000","message":{"row":2,"second_field":"new field","test_nested_number":{"number":0},"third_column_with_json":{"nested_field":""}}}
{"timestamp":"2023-10-20 10:53:47.000","message":{"row":3,"second_field":"","test_nested_number":{"number":0},"third_column_with_json":{"nested_field":"test"}}}
{"timestamp":"2023-10-20 10:58:21.000","message":{"row":4,"second_field":"4","test_nested_number":{"number":0},"third_column_with_json":{"nested_field":""}}}

@SpencerTorres
Copy link
Collaborator

Thanks for sharing the example queries, I didn't know the JSON type had this behavior!
It's not as detailed as DESC table with specific column types, but I can see how this could be used to build column selections from a sampled row. I agree with @aangelisc, we can investigate this further as a feature request. I don't think there's an implementation for this right now that would be as reliable as simply writing in the column yourself.

@asimpson asimpson moved this from Waiting to Backlog in Partner Datasources Dec 1, 2023
@alyssabull alyssabull moved this from Backlog to Feature Requests in Partner Datasources Mar 12, 2024
@SpencerTorres
Copy link
Collaborator

After revisiting this thread, I don't think this is something we will add based on how the plugin functions now. The JSON column type is experimental and obsolete, but I will keep this concept in mind for Map types if we ever expand on query suggestions.

@github-project-automation github-project-automation bot moved this from Feature Requests to Done in Partner Datasources Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

3 participants