Nested Record Processing #41

snork-alt (Maintainer) · 2022-09-26T14:43:59Z

Let's consider the following example source tables:

users

id (pk)	username	fn	ln	country
u1	john123	John	Smith	SG

users_contact

id (pk)	user_id (fk)	type	contact	active
c1	u1	phone	+65 776536	true
c2	u1	email	[email protected]	true
c3	u1	phone	+65 13349879	false

We want to populate the cache with an object containing the following information:

- User details for all Singapore users: username, fn, ln, country from the users table
- All active contact details for that specific user: type, contact from the users_contact table

The final object will look like this:

{
  username: "john123",
  fn: "John",
  ln: "Smith",
  country: "SG",
  contacts: [
    {type: "phone", contact: "+65 776536"},
    {type: "email", contact: "[email protected]"}
  ]
}

Query definition and schema registration

In order to obtain the above object, the user should define a new endpoint specifying the following query:

SELECT
    username, fn, ln,
    NESTED_ARR(
        SELECT type, contact FROM users_contact
        WHERE active = true
        INDEX BY user_id, id
    ) AS contacts
FROM users
WHERE country = 'SG'
INDEX BY id

Two new SQL keywords are defined by DOZER SQL:

NESTED_ARR: Allows the user to specify a subquery that will be represented as an array of records in the cache
INDEX BY: Allows the user to specify which fields are used for indexing in the cache. In the example above, the parent object uses id as its primary key. The nested query, however, uses two fields: user_id and id.
- user_id creates the relationship between the nested object and the parent object. In the source tables, user_id is a foreign key relating the users_contact table to the users table.
- The second field, id, is the primary key of the users_contact table.
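To make the intended result of the query concrete, here is a minimal in-memory sketch of what NESTED_ARR would compute for the example tables. It is only an illustration of the semantics described above, not Dozer code: the structs, `sample_tables`, and `nested_users` are hypothetical names, and `fn`, `ln`, and `type` are renamed because `fn` and `type` are Rust keywords.

```rust
// Hypothetical row types for the two source tables (not Dozer types).
#[derive(Debug, Clone, PartialEq)]
struct User {
    id: String,
    username: String,
    first_name: String, // `fn` in the source table
    last_name: String,  // `ln` in the source table
    country: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Contact {
    id: String,
    user_id: String, // foreign key to User::id
    kind: String,    // `type` in the source table
    contact: String,
    active: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct NestedContact {
    kind: String,
    contact: String,
}

#[derive(Debug, Clone, PartialEq)]
struct UserWithContacts {
    username: String,
    first_name: String,
    last_name: String,
    contacts: Vec<NestedContact>,
}

// The example rows from the thread; the email value is kept as it appears there.
fn sample_tables() -> (Vec<User>, Vec<Contact>) {
    let users = vec![User {
        id: "u1".into(),
        username: "john123".into(),
        first_name: "John".into(),
        last_name: "Smith".into(),
        country: "SG".into(),
    }];
    let contact = |id: &str, kind: &str, value: &str, active: bool| Contact {
        id: id.into(),
        user_id: "u1".into(),
        kind: kind.into(),
        contact: value.into(),
        active,
    };
    let contacts = vec![
        contact("c1", "phone", "+65 776536", true),
        contact("c2", "email", "[email protected]", true),
        contact("c3", "phone", "+65 13349879", false),
    ];
    (users, contacts)
}

// Outer WHERE filters users by country; the nested query keeps active
// contacts whose user_id matches the parent id (the first INDEX BY field).
fn nested_users(users: &[User], contacts: &[Contact], country: &str) -> Vec<UserWithContacts> {
    users
        .iter()
        .filter(|u| u.country == country)
        .map(|u| UserWithContacts {
            username: u.username.clone(),
            first_name: u.first_name.clone(),
            last_name: u.last_name.clone(),
            contacts: contacts
                .iter()
                .filter(|c| c.active && c.user_id == u.id)
                .map(|c| NestedContact {
                    kind: c.kind.clone(),
                    contact: c.contact.clone(),
                })
                .collect(),
        })
        .collect()
}
```

Calling `nested_users(&users, &contacts, "SG")` on the sample rows yields one user whose `contacts` holds the two active entries; the inactive phone number c3 is dropped.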

Upon definition of the query, Dozer will register a new schema like the following, assigning it a new unique id:

Schema {
    id: Some(100),
    indexed_by: vec![
        FieldDef {
            name: "id".to_string(),
            field_type: Str,
            index: 0
        }
    ],
    values: vec![
        FieldDef {
            name: "username".to_string(),
            field_type: Str,
            index: 1
        },
        FieldDef {
            name: "fn".to_string(),
            field_type: Str,
            index: 2
        },
        FieldDef {
            name: "ln".to_string(),
            field_type: Str,
            index: 3
        },
        FieldDef {
            name: "contacts".to_string(),
            field_type: RecordArray {
                schema : Schema {
                    id: None,
                    indexed_by: vec![
                        FieldDef {
                            name: "user_id".to_string(),
                            field_type: Str,
                            index: 0
                        },
                        FieldDef {
                            name: "id".to_string(),
                            field_type: Str,
                            index: 1
                        }
                    ],
                    values: vec![
                        FieldDef {
                            name: "type".to_string(),
                            field_type: Str,
                            index: 2
                        },
                        FieldDef {
                            name: "contact".to_string(),
                            field_type: Str,
                            index: 3
                        }
                    ]
                }
            },
            index: 4
        }
    ]
}
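The listing above is shorthand rather than compilable Rust. As a rough sketch, these are type definitions it could compile against, plus a small check that the index offsets of a schema (and of any nested schema) run from 0 upward without gaps. The type shapes, the `field` helper, and `offsets_are_contiguous` are assumptions made for illustration and may not match Dozer's actual definitions.

```rust
// Hypothetical type definitions matching the shape of the schema listing.
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Int,
    Str,
    // A nested array of records carries its own schema.
    RecordArray { schema: Schema },
}

#[derive(Debug, Clone, PartialEq)]
struct FieldDef {
    name: String,
    field_type: FieldType,
    index: usize, // positional offset of the field inside a record
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    id: Option<u32>,
    indexed_by: Vec<FieldDef>,
    values: Vec<FieldDef>,
}

// Small constructor to keep schema literals short.
fn field(name: &str, field_type: FieldType, index: usize) -> FieldDef {
    FieldDef {
        name: name.to_string(),
        field_type,
        index,
    }
}

// Returns true when the offsets of indexed_by + values are exactly 0..n,
// and the same holds for every nested RecordArray schema.
fn offsets_are_contiguous(schema: &Schema) -> bool {
    let mut offsets: Vec<usize> = schema
        .indexed_by
        .iter()
        .chain(schema.values.iter())
        .map(|f| f.index)
        .collect();
    offsets.sort_unstable();
    let top_level_ok = offsets.iter().enumerate().all(|(pos, &idx)| pos == idx);
    let nested_ok = schema.values.iter().all(|f| match &f.field_type {
        FieldType::RecordArray { schema } => offsets_are_contiguous(schema),
        _ => true,
    });
    top_level_ok && nested_ok
}
```

A check like this is one way to catch a schema whose offsets would not line up with the positional values carried by a record.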

Pipeline

The pipeline will take care of processing any CDC event produced by the two tables above, processing the query and generating Operation objects following the schema above. Let's consider an example:

New record added in `users_contact`

In the following example, a new email contact is being added for the user john123. Here is the Operation message produced at the end of the pipeline, to be consumed by the cache writer.

Operation::Insert {
    new: Record {
        schema_id: Some(100),
        values: vec![
            Field::Str("u1".to_string()),
            Field::Invalid,
            Field::Invalid,
            Field::Invalid,
            Field::Record(vec![
                Field::Str("u1".to_string()),
                Field::Str("c4".to_string()),
                Field::Str("email".to_string()),
                Field::Str("[email protected]".to_string())
            ])
        ]
    }
}

In the above example, only the fields from the users_contact table are propagated. The only field propagated from the parent table is the id, which is needed for correct indexing in the cache. All field values are positional and match the schema previously registered. The index field in the schema definition defines the offset of each field.
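As an illustration of the positional matching, the sketch below pairs each value of the Insert record with the field name at the same offset and skips the `Field::Invalid` placeholders. The `Field` enum here is a cut-down, hypothetical stand-in with only the variants used in the example, and `insert_example` and `populated` are names invented for this sketch.

```rust
// Hypothetical, reduced version of the Field enum used in the example.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Str(String),
    Invalid, // placeholder for a field that is not propagated
    Record(Vec<Field>),
}

// The values carried by the Insert operation in the example above.
fn insert_example() -> Vec<Field> {
    vec![
        Field::Str("u1".to_string()),
        Field::Invalid,
        Field::Invalid,
        Field::Invalid,
        Field::Record(vec![
            Field::Str("u1".to_string()),
            Field::Str("c4".to_string()),
            Field::Str("email".to_string()),
            Field::Str("[email protected]".to_string()),
        ]),
    ]
}

// Pairs each value with the field name at the same offset and drops the
// Invalid placeholders, leaving only the fields that were propagated.
// `names` must be ordered by the schema's index offsets.
fn populated<'a>(names: &[&'a str], values: &'a [Field]) -> Vec<(&'a str, &'a Field)> {
    names
        .iter()
        .copied()
        .zip(values.iter())
        .filter(|(_, v)| !matches!(**v, Field::Invalid))
        .collect()
}
```

With the names ordered as `["id", "username", "fn", "ln", "contacts"]`, only `id` and `contacts` survive for the example record, which matches the description above.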

Cache population

Each nested record in the cache is mapped to its own key/value entry. For the object above (with the newly inserted contact c4), the physical cache representation is the following:

key	value
`/u1`	`{username: "john123", fn: "John", ln: "Smith", country: "SG"}`
`/u1/4:c1`	`{type: "phone", contact: "+65 776536"}`
`/u1/4:c2`	`{type: "email", contact: "[email protected]"}`
`/u1/4:c4`	`{type: "email", contact: "[email protected]"}`

For the parent object, the key is simply the value of the field named in the INDEX BY clause of the parent query. For the inner object, the key is built from the values of the two fields named in the INDEX BY clause of the nested query. Note that, for every inner object, the nested record's id is prefixed with the index of the field it belongs to (4 in this schema, as shown in the schema definition).
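The key layout can be sketched with two small helpers. This is a guess at the format based only on the table above (`/u1`, `/u1/4:c1`); the function names are hypothetical. The third helper assumes an ordered key/value store, modelled here with a `BTreeMap`, and shows that under this layout all nested records of one field share a common key prefix.

```rust
use std::collections::BTreeMap;

// Key of a parent object: "/" followed by the parent INDEX BY value.
fn parent_key(parent_id: &str) -> String {
    format!("/{}", parent_id)
}

// Key of a nested record: parent key, then the offset of the nested field
// in the parent schema, a colon, and the nested record's own id.
fn nested_key(parent_id: &str, field_index: usize, nested_id: &str) -> String {
    format!("/{}/{}:{}", parent_id, field_index, nested_id)
}

// Assumption: the cache iterates keys in sorted order. All nested records
// of one parent field can then be found with a prefix scan.
fn nested_keys_for<'a>(
    cache: &'a BTreeMap<String, String>,
    parent_id: &str,
    field_index: usize,
) -> Vec<&'a str> {
    let prefix = format!("/{}/{}:", parent_id, field_index);
    cache
        .range(prefix.clone()..)
        .take_while(|(k, _)| k.starts_with(&prefix))
        .map(|(k, _)| k.as_str())
        .collect()
}
```

For the example, `nested_key("u1", 4, "c1")` gives `/u1/4:c1`, and a scan for parent `u1` and field `4` returns only that user's contact keys.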

snork-alt (Maintainer, Author) · 2022-09-26T14:46:26Z

@v3g42 @duonganhthu43 I felt that using a WHERE condition between child and parent was getting very complex to implement. See what you think of this, which is easier to implement. Later we could also consider implementing the WHERE.


v3g42 (Maintainer) · 2022-09-27T02:08:12Z

@snork-alt We will still filter by the foreign key value, I am guessing? Would this syntax be more appropriate?

SELECT
    username, fn, ln,
    NESTED_ARR(
        SELECT type, contact FROM users_contact
        WHERE active = true
        INDEX BY id
    ) AS contacts ON users_contact.user_id = users.id
FROM users
WHERE country = 'SG'
INDEX BY id
