Added doc on unnesting JSON array into separate records (pinot-contri…
rajagopr authored Oct 30, 2024
1 parent 275ab0f commit e25dbd2
Showing 4 changed files with 176 additions and 0 deletions.
1 change: 1 addition & 0 deletions SUMMARY.md
@@ -66,6 +66,7 @@
* [Google Cloud Storage](basics/data-import/pinot-file-system/import-from-gcp.md)
* [Input formats](basics/data-import/pinot-input-formats.md)
* [Complex Type (Array, Map) Handling](basics/data-import/complex-type.md)
* [Unnest JSON Array](basics/data-import/unnest-json-array.md)
* [Ingest records with dynamic schemas](basics/data-import/schema-conforming-transformer.md)
* [Reload a table segment](basics/data-import/segment-reload.md)
* [Upload a table segment](basics/data-import/segment-upload.md)
6 changes: 6 additions & 0 deletions basics/data-import/README.md
@@ -84,6 +84,12 @@ This guide shows you how to handle the complex type in the ingested data, such a
[complex-type.md](complex-type.md)
{% endcontent-ref %}

This guide shows you how to unnest JSON records that are grouped into an array at the root level.

{% content-ref url="unnest-json-array.md" %}
[unnest-json-array.md](unnest-json-array.md)
{% endcontent-ref %}

This guide shows you how to handle records with dynamic schemas, like JSON log events.

{% content-ref url="schema-conforming-transformer.md" %}
169 changes: 169 additions & 0 deletions basics/data-import/unnest-json-array.md
@@ -0,0 +1,169 @@
---
description: Unnest JSON records in Apache Pinot.
---

# Unnest JSON records
This example shows how to unnest JSON records that are batched together in an array under a single key at the root
level. It uses the [ComplexType](complex-type.md) configs to persist the individual student records as
separate rows in Pinot.

Consider the following record, which holds an array of student records under the `students` key.
```json
{
  "students": [
    {
      "firstName": "Jane",
      "id": "100",
      "scores": {
        "physics": 91,
        "chemistry": 93,
        "maths": 99
      }
    },
    {
      "firstName": "John",
      "id": "101",
      "scores": {
        "physics": 97,
        "chemistry": 98,
        "maths": 99
      }
    },
    {
      "firstName": "Jen",
      "id": "102",
      "scores": {
        "physics": 96,
        "chemistry": 95,
        "maths": 100
      }
    }
  ]
}
```
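
The effect of unnesting can be sketched in a few lines of Python. This is only a conceptual illustration of the row shape the rest of this page relies on, not Pinot's implementation. The helper name `unnest` and the `.` delimiter are assumptions made for the sketch, and nested objects are kept as JSON strings to mirror the `JSON` column used in the schema.

```python
import json


def unnest(record, field, delimiter="."):
    """Return one flat row per element of the array stored under record[field].

    Hypothetical helper for illustration only; it is not part of Pinot.
    """
    rows = []
    for element in record.get(field, []):
        # Carry over any sibling fields that sit next to the array.
        row = {k: v for k, v in record.items() if k != field}
        for key, value in element.items():
            # Keep nested objects as JSON strings instead of flattening them.
            if isinstance(value, (dict, list)):
                value = json.dumps(value)
            row[f"{field}{delimiter}{key}"] = value
        rows.append(row)
    return rows


batch = {
    "students": [
        {"firstName": "Jane", "id": "100",
         "scores": {"physics": 91, "chemistry": 93, "maths": 99}},
        {"firstName": "John", "id": "101",
         "scores": {"physics": 97, "chemistry": 98, "maths": 99}},
        {"firstName": "Jen", "id": "102",
         "scores": {"physics": 96, "chemistry": 95, "maths": 100}},
    ]
}

rows = unnest(batch, "students")
for row in rows:
    print(row)
```

Each input element becomes its own row keyed by `students.firstName`, `students.id`, and `students.scores`, which are the column names used in the schema.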


# Pinot Schema
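The Pinot schema for this example looks as follows. Each unnested field is named with the `students.` prefix, and the
nested `scores` object is declared as a `JSON` column.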

```json
{
  "schemaName": "students001",
  "enableColumnBasedNullHandling": false,
  "dimensionFieldSpecs": [
    {
      "name": "students.firstName",
      "dataType": "STRING",
      "notNull": false,
      "fieldType": "DIMENSION"
    },
    {
      "name": "students.id",
      "dataType": "STRING",
      "notNull": false,
      "fieldType": "DIMENSION"
    },
    {
      "name": "students.scores",
      "dataType": "JSON",
      "notNull": false,
      "fieldType": "DIMENSION"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "fieldType": "DATE_TIME",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ],
  "metricFieldSpecs": []
}
```
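
Because the schema column names must match the prefixed keys produced by unnesting, a quick local check can catch a mismatch before ingestion. The following is a minimal sketch: `missing_columns` is a hypothetical helper, the schema is trimmed to the fields the check needs, and `students.lastName` is a made-up key used only to show a mismatch.

```python
def missing_columns(schema, row):
    """Return the row keys that the schema does not declare as columns.

    Hypothetical helper for illustration only; it is not part of Pinot.
    """
    declared = set()
    for spec_list in ("dimensionFieldSpecs", "metricFieldSpecs", "dateTimeFieldSpecs"):
        for spec in schema.get(spec_list, []):
            declared.add(spec["name"])
    return sorted(key for key in row if key not in declared)


# Trimmed copy of the schema above: only names and types are kept.
schema = {
    "schemaName": "students001",
    "dimensionFieldSpecs": [
        {"name": "students.firstName", "dataType": "STRING"},
        {"name": "students.id", "dataType": "STRING"},
        {"name": "students.scores", "dataType": "JSON"},
    ],
    "dateTimeFieldSpecs": [{"name": "ts", "dataType": "LONG"}],
    "metricFieldSpecs": [],
}

# Shape of one unnested row, with the nested object kept as a JSON string.
row = {
    "students.firstName": "Jane",
    "students.id": "100",
    "students.scores": '{"physics": 91, "chemistry": 93, "maths": 99}',
}

print(missing_columns(schema, row))

# A key the schema does not declare (made up for this example) is reported.
row_with_extra = dict(row, **{"students.lastName": "Doe"})
print(missing_columns(schema, row_with_extra))
```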

# Pinot Table Configuration

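The Pinot table configuration for this schema looks as follows. The `complexTypeConfig.fieldsToUnnest` entry lists
`students` as the array to unnest, and the `ts` time column is populated at ingestion time with `now()`.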

```json
{
  "description": "Pinot table config inferred for: S3",
  "type": "PINOT",
  "config": {
    "tableName": "students001_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "deletedSegmentsRetentionPeriod": "7d",
      "segmentPushType": "APPEND",
      "minimizeDataMovement": false,
      "replication": "1",
      "timeColumnName": "ts",
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "180"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "optimizeDictionaryForMetrics": false,
      "noDictionarySizeRatioThreshold": 0,
      "aggregateMetrics": false,
      "columnMajorSegmentBuilderEnabled": true,
      "loadMode": "MMAP",
      "varLengthDictionaryColumns": [
        "students.firstName",
        "students.id",
        "students.scores"
      ],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "nullHandlingEnabled": true,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": true,
      "rangeIndexVersion": 2,
      "optimizeDictionary": false,
      "invertedIndexColumns": [
        "students.firstName",
        "students.id"
      ]
    },
    "metadata": {},
    "task": {
      "taskTypeConfigsMap": {}
    },
    "ingestionConfig": {
      "complexTypeConfig": {
        "fieldsToUnnest": [
          "students"
        ]
      },
      "transformConfigs": [
        {
          "columnName": "ts",
          "transformFunction": "now()"
        }
      ],
      "rowTimeValueCheck": true,
      "segmentTimeValueCheck": false,
      "continueOnError": true,
      "batchIngestionConfig": {
        "segmentIngestionType": "APPEND",
        "consistentDataPush": false
      }
    },
    "isDimTable": false
  }
}
```

# Data in Pinot

After ingestion, the student records appear as separate rows in Pinot. Note that the nested field `scores` is
captured as a JSON field.

![Unnested Student Records](../../.gitbook/unnested-student-records-json.png)
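
Since `students.scores` is stored as JSON, an individual score has to be extracted from it when reading. The sketch below emulates that in Python on one stored row. The `score` helper is hypothetical, and the query string is an assumption about how the same lookup might be written in Pinot SQL: it assumes the `JSON_EXTRACT_SCALAR` function is available in your Pinot version and that column names containing dots are wrapped in double quotes.

```python
import json

# One row as it would look after unnesting, with scores held as a JSON string.
stored_row = {
    "students.firstName": "Jane",
    "students.id": "100",
    "students.scores": '{"physics": 91, "chemistry": 93, "maths": 99}',
}


def score(row, subject):
    """Read one subject's score out of the JSON-encoded scores column.

    Hypothetical helper for illustration only; it is not part of Pinot.
    """
    return json.loads(row["students.scores"])[subject]


print(score(stored_row, "physics"))

# Assumed Pinot SQL equivalent; verify the function and quoting in your setup.
QUERY = (
    'SELECT "students.firstName", '
    "JSON_EXTRACT_SCALAR(\"students.scores\", '$.physics', 'INT') AS physics "
    "FROM students001"
)
print(QUERY)
```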
