Equip every record with an id
#1727
chubei started this conversation in Feature Requests
Every processor in Dozer SQL takes in one or more input tables and produces one or more output tables. Because the tables are always streaming, the question of identifying a row often arises: upon delete or update, how do I know which row is affected? There is also a less obvious question: upon insert, how do I know whether it's a new row or a previously deleted row?
The first question deeply affects how join is implemented. If an upstream delete or update can't identify the affected row, join is forced to select a row from several possible rows. If we're using a hash to index into the possible rows, we have to deserialize them and run an equality check to ensure correctness. On the other hand, if we have the identifier, it's just one lookup.
Another practical limitation of using the hash as the index is that, if we use an LMDB multimap, the serialized record length is limited to 512 bytes.
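To make the difference concrete, here is a minimal sketch of the two lookup strategies a join's state store could use. The type and function names are illustrative assumptions, not Dozer's actual API:

```rust
use std::collections::HashMap;

// Illustrative types only; these are assumptions, not Dozer's actual types.
type RecordId = u64;
type SerializedRecord = Vec<u8>;

/// State indexed by record id: a delete or update that carries the id
/// resolves the affected row with a single lookup.
struct IdIndexedState {
    rows: HashMap<RecordId, SerializedRecord>,
}

impl IdIndexedState {
    fn remove(&mut self, id: RecordId) -> Option<SerializedRecord> {
        self.rows.remove(&id)
    }
}

/// State indexed by record hash: several rows may collide on the hash, so the
/// processor must deserialize the candidates and equality-check each one.
struct HashIndexedState {
    rows: HashMap<u64, Vec<SerializedRecord>>,
}

impl HashIndexedState {
    fn remove(&mut self, hash: u64, record: &[u8]) -> Option<SerializedRecord> {
        let candidates = self.rows.get_mut(&hash)?;
        // Find the candidate that is actually equal to the incoming record.
        let position = candidates
            .iter()
            .position(|candidate| candidate.as_slice() == record)?;
        Some(candidates.remove(position))
    }
}
```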
Dozer cache has answers to both questions. It keeps track of a row "id", which identifies a logical row. Upon delete or update, the affected row id is always included in the output event. Upon insert, a previously deleted row's id is reused, while a new row gets a new id.
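For illustration, an output event carrying this row id could be shaped roughly like this. The names below are assumptions made for the sketch, not Dozer's actual types:

```rust
// Illustrative only; type and field names are assumptions, not Dozer's API.
struct Record(Vec<u8>);
type RecordId = u64;

enum Operation {
    /// The id is reused from a previously deleted row if one matches,
    /// otherwise freshly generated.
    Insert { id: RecordId, new: Record },
    /// The id pins down exactly which logical row is deleted.
    Delete { id: RecordId, old: Record },
    /// The id identifies the logical row being updated.
    Update { id: RecordId, old: Record, new: Record },
}
```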
Here we propose to add this row id to all the sources and processors, forcing them to answer the two questions.
Implementation
The source is responsible for introducing the ids, so it has some bookkeeping to do. The processors can then rely on the input ids and avoid working with hashes.
Id generation in the source could work like this (a sketch follows the two cases):
If the schema has a primary key, we can keep track of a map from primary key to id. Upon delete and update, we look up the id based on the primary key. Upon insert, if the primary key already exists in the map, it's a previously deleted row; otherwise it's a new row.
If the schema doesn't have a primary key, we can keep track of a multimap from hash to id. Upon delete and update, we look up the id based on the hash and run an equality check. Upon insert, we always generate a new id.
Note that this way the equality check only happens once, in the source. No downstream processor needs to use the record hash as an index.
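A minimal sketch of this source-side bookkeeping, assuming primary keys and records are available as serialized byte vectors; all names here are illustrative, not Dozer's actual API:

```rust
use std::collections::HashMap;

type RecordId = u64;

/// Id bookkeeping when the schema has a primary key.
struct PrimaryKeyIdTracker {
    next_id: RecordId,
    /// Serialized primary key -> id. Entries are kept across deletes so that
    /// a re-inserted row gets its old id back.
    ids: HashMap<Vec<u8>, RecordId>,
}

impl PrimaryKeyIdTracker {
    /// Insert: reuse the id if the primary key was seen before (previously
    /// deleted row), otherwise allocate a new one.
    fn id_for_insert(&mut self, primary_key: Vec<u8>) -> RecordId {
        if let Some(&id) = self.ids.get(&primary_key) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(primary_key, id);
        id
    }

    /// Delete or update: the id is resolved by primary key alone.
    fn id_for_delete_or_update(&self, primary_key: &[u8]) -> Option<RecordId> {
        self.ids.get(primary_key).copied()
    }
}

/// Id bookkeeping when the schema has no primary key: index by record hash
/// and run the equality check here, once, so downstream processors never do.
struct HashIdTracker {
    next_id: RecordId,
    /// Distinct records may share a hash, hence the multimap shape.
    ids: HashMap<u64, Vec<(Vec<u8>, RecordId)>>,
}

impl HashIdTracker {
    /// Insert: always a new id.
    fn id_for_insert(&mut self, hash: u64, record: Vec<u8>) -> RecordId {
        let id = self.next_id;
        self.next_id += 1;
        self.ids.entry(hash).or_default().push((record, id));
        id
    }

    /// Delete or update: look up by hash, then equality-check the candidates.
    fn id_for_delete_or_update(&self, hash: u64, record: &[u8]) -> Option<RecordId> {
        self.ids
            .get(&hash)?
            .iter()
            .find(|(candidate, _)| candidate.as_slice() == record)
            .map(|(_, id)| *id)
    }
}
```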