Avoid record duplication in JOIN processor #1771

chubei · 2023-07-20T05:20:23Z


Problem

The JOIN processor has to keep track of all its historical inputs, so that when a new record arrives, it can produce the Cartesian product of the new record with the history and filter the result by the join condition.

Currently we store each history record as Vec<Field>. Consider a JOIN processor (call it JOIN2) that uses the output of a previous JOIN processor (call it JOIN1) as an input. JOIN2 has to store all the records that JOIN1 produced, but JOIN1 already holds the same field values in its own history. Effectively, memory usage is doubled.

This gets worse as more JOIN processors are cascaded.

Solution

We can store a record as Vec<RefOrField> instead of Vec<Field>. RefOrField is an enum that is either a reference to a record or a direct field.

One may wonder how a RefOrField can be more memory efficient than a Field, given that most fields are small. The point is that a RefOrField references a whole record, so a single reference can stand in for many fields at once.
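To make the slot-count argument concrete, here is a minimal sketch. The `Field` enum is a hypothetical two-variant stand-in for the real type, and `direct_record` and `slot_counts` are illustrative helpers that are not part of the proposal.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real `Field` type.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Int(i64),
    String(String),
}

#[derive(Debug, Clone)]
enum RefOrField {
    Ref(Arc<RefRecord>),
    Field(Field),
}

#[derive(Debug, Default)]
struct RefRecord {
    values: Vec<RefOrField>,
}

/// Illustrative helper: wraps plain fields into a record of direct fields.
fn direct_record(fields: Vec<Field>) -> Arc<RefRecord> {
    Arc::new(RefRecord {
        values: fields.into_iter().map(RefOrField::Field).collect(),
    })
}

/// Returns (slots in a flat joined record, slots in a referencing joined record).
fn slot_counts() -> (usize, usize) {
    let left = vec![
        Field::Int(1),
        Field::String("alice".to_string()),
        Field::Int(30),
    ];
    let right = vec![Field::Int(1), Field::String("order-9".to_string())];

    // Flat representation: every field value is cloned into the output.
    let flat: Vec<Field> = left.iter().chain(right.iter()).cloned().collect();

    // Referencing representation: one `Arc` per input record, no field is cloned.
    let joined = RefRecord {
        values: vec![
            RefOrField::Ref(direct_record(left)),
            RefOrField::Ref(direct_record(right)),
        ],
    };

    (flat.len(), joined.values.len())
}
```

With a three-field left input and a two-field right input, the flat output holds five cloned fields while the referencing output holds two pointers. The saving grows with the width of the input records.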

The information of which fields are actually referenced in a RefOrField is stored in the corresponding RefSchema, a referencing version of Schema. While Schema stores Vec<FieldDefinition>, RefSchema stores Vec<RefOrFieldDefinition>. RefOrFieldDefinition is an enum that can be either a reference to a schema and some of the referee's field definitions, or a direct field definition.

Note that we only allow one level of reference, meaning that all the fields referenced in a RefOrFieldDefinition::Ref must be direct field definitions.

Code

use std::sync::Arc;

#[derive(Debug, Clone)]
enum RefOrFieldDefinition {
    Ref {
        /// The referenced `Schema`.
        schema: Arc<RefSchema>,
        /// The indexes of referenced direct fields in the referenced `Schema`.
        indexes: Vec<u32>,
    },
    FieldDefinition(FieldDefinition),
}

#[derive(Debug, Default)]
struct RefSchema {
    fields: Vec<RefOrFieldDefinition>,
}

#[derive(Debug, Clone)]
enum RefOrField {
    Ref(Arc<RefRecord>),
    Field(Field),
}

#[derive(Debug, Default)]
struct RefRecord {
    values: Vec<RefOrField>,
}
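The one-level rule from the Solution section could be checked with a small validation helper. The following is only a sketch: `has_single_level_refs`, `direct` and `valid_and_invalid` are hypothetical names, and `FieldDefinition` is reduced to a stand-in with just a name.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real `FieldDefinition`.
#[derive(Debug, Clone, PartialEq)]
struct FieldDefinition {
    name: String,
}

#[derive(Debug, Clone)]
enum RefOrFieldDefinition {
    Ref {
        schema: Arc<RefSchema>,
        indexes: Vec<u32>,
    },
    FieldDefinition(FieldDefinition),
}

#[derive(Debug, Default)]
struct RefSchema {
    fields: Vec<RefOrFieldDefinition>,
}

/// Returns `true` if every index listed in a `Ref` entry points to a direct
/// field of the referee. Illustrative helper, not part of the proposal.
fn has_single_level_refs(schema: &RefSchema) -> bool {
    schema.fields.iter().all(|entry| match entry {
        RefOrFieldDefinition::Ref {
            schema: referee,
            indexes,
        } => indexes.iter().all(|&index| {
            matches!(
                referee.fields.get(index as usize),
                Some(RefOrFieldDefinition::FieldDefinition(_))
            )
        }),
        RefOrFieldDefinition::FieldDefinition(_) => true,
    })
}

fn direct(name: &str) -> RefOrFieldDefinition {
    RefOrFieldDefinition::FieldDefinition(FieldDefinition {
        name: name.to_string(),
    })
}

/// Builds one schema that respects the rule and one that violates it.
fn valid_and_invalid() -> (bool, bool) {
    let base = Arc::new(RefSchema {
        fields: vec![direct("id"), direct("name")],
    });
    let valid = Arc::new(RefSchema {
        fields: vec![RefOrFieldDefinition::Ref {
            schema: base,
            indexes: vec![0, 1],
        }],
    });
    // Entry 0 of `valid` is itself a reference, so referencing it
    // would create two levels of indirection.
    let invalid = RefSchema {
        fields: vec![RefOrFieldDefinition::Ref {
            schema: valid.clone(),
            indexes: vec![0],
        }],
    };
    (has_single_level_refs(&valid), has_single_level_refs(&invalid))
}
```

A check like this would fit in a debug assertion at schema construction time, so the `panic!` branches in the accessors shown later are never reached in practice.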

Field indexing

To index into a RefSchema or RefRecord, we need two levels of indexing. The first level indexes into the Vec<RefOrFieldDefinition> or Vec<RefOrField>; the second level indexes into the direct fields of the referenced RefSchema or RefRecord.

/// To get the `FieldDefinition` or `Field` out of a `Schema` or `Record`, we need two levels of indirection:
///
/// 1. Outer index is to determine if this field is a referenced field in another `Schema`/`Record`, or a direct field.
/// 2. Inner index is to index into the referenced `Schema`/`Record` to get the `FieldDefinition`/`Field`.
///
/// Invariants:
///
/// - If `outer` points to a direct field, `inner` must be 0.
/// - If `outer` points to a referenced `Schema`/`Record`, `inner` must point to a direct field in that `Schema`/`Record`. In other words, we don't allow indirections more than 1 level deep.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct RefFieldIndex {
    /// The index of the `RefOrFieldDefinition` or `RefOrField` in the `Schema` or `Record`.
    outer: u32,
    /// The index of the `FieldDefinition` or `Field` in the referenced `Schema` or `Record`. Must be 0 for direct fields.
    inner: u32,
}

impl RefSchema {
    fn get_field(&self, index: RefFieldIndex) -> &FieldDefinition {
        match &self.fields[index.outer as usize] {
            RefOrFieldDefinition::Ref {
                schema,
                indexes: inner_indexes,
            } => {
                debug_assert!(
                    inner_indexes.contains(&index.inner),
                    "Field index {} is not referenced: referrer: {:?}, referee: {:?}",
                    index.inner,
                    self,
                    schema
                );

                let RefOrFieldDefinition::FieldDefinition(field) = &schema.fields[index.inner as usize] else {
                    panic!("Invariant violated: inner index must point to a direct field in the referenced schema");
                };
                field
            }
            RefOrFieldDefinition::FieldDefinition(field_definition) => {
                debug_assert!(
                    index.inner == 0,
                    "Invariant violated: inner index must be 0 for direct fields"
                );
                field_definition
            }
        }
    }
}

impl RefRecord {
    fn get_field(&self, index: RefFieldIndex) -> &Field {
        match &self.values[index.outer as usize] {
            RefOrField::Ref(record) => {
                let RefOrField::Field(field) = &record.values[index.inner as usize] else {
                    panic!("Invariant violated: inner index must point to a direct field in the referenced record");
                };
                field
            }
            RefOrField::Field(field) => {
                debug_assert!(
                    index.inner == 0,
                    "Invariant violated: inner index must be 0 for direct fields"
                );
                field
            }
        }
    }
}
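A short usage sketch of the two-level lookup may help. It repeats a condensed `RefRecord::get_field` so that the block stands alone. `Field` is a hypothetical stand-in for the real type and `lookup_demo` is an illustrative name.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real `Field` type.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Int(i64),
    String(String),
}

#[derive(Debug, Clone)]
enum RefOrField {
    Ref(Arc<RefRecord>),
    Field(Field),
}

#[derive(Debug, Default)]
struct RefRecord {
    values: Vec<RefOrField>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct RefFieldIndex {
    outer: u32,
    inner: u32,
}

impl RefRecord {
    /// Condensed version of the accessor shown above.
    fn get_field(&self, index: RefFieldIndex) -> &Field {
        match &self.values[index.outer as usize] {
            RefOrField::Ref(record) => match &record.values[index.inner as usize] {
                RefOrField::Field(field) => field,
                RefOrField::Ref(_) => panic!("inner index must point to a direct field"),
            },
            RefOrField::Field(field) => {
                debug_assert!(index.inner == 0, "inner index must be 0 for direct fields");
                field
            }
        }
    }
}

/// Looks up one referenced field and one direct field of a joined record.
fn lookup_demo() -> (Field, Field) {
    let user = Arc::new(RefRecord {
        values: vec![
            RefOrField::Field(Field::Int(7)),
            RefOrField::Field(Field::String("alice".to_string())),
        ],
    });
    let joined = RefRecord {
        values: vec![RefOrField::Ref(user), RefOrField::Field(Field::Int(100))],
    };

    // outer = 0 selects the referenced `user` record, inner = 1 selects its name.
    let name = joined.get_field(RefFieldIndex { outer: 0, inner: 1 }).clone();
    // outer = 1 is a direct field, so inner must be 0.
    let amount = joined.get_field(RefFieldIndex { outer: 1, inner: 0 }).clone();
    (name, amount)
}
```

Here `lookup_demo` returns the name stored in the referenced record and the amount stored directly in the joined record.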

Dereferencing through cloning

RefSchema/RefRecord are not suitable for serialization because if we directly derive Serialize, all the referenced RefSchema/RefRecord will be serialized recursively. For ease of implementation, instead of implementing a custom serializer, we can "dereference" the RefSchema/RefRecord to a Schema/Record, cloning all the referenced FieldDefinition/Field in the process.

// The primary index also needs to be dereferenced. We're not showing it here to avoid clutter.
impl RefSchema {
    fn deref(&self) -> Schema {
        let mut fields = vec![];
        for ref_or_field_definition in &self.fields {
            match ref_or_field_definition {
                RefOrFieldDefinition::Ref { schema, indexes } => {
                    for index in indexes {
                        let RefOrFieldDefinition::FieldDefinition(field_definition) = &schema.fields[*index as usize] else {
                            panic!("Invariant violated: inner index must point to a direct field in the referenced schema");
                        };
                        fields.push(field_definition.clone());
                    }
                }
                RefOrFieldDefinition::FieldDefinition(field_definition) => {
                    fields.push(field_definition.clone());
                }
            }
        }

        Schema {
            fields,
        }
    }
}

impl RefRecord {
    fn deref(&self, schema: &RefSchema) -> Record {
        let mut values = vec![];

        for (ref_or_field, ref_or_field_definition) in self.values.iter().zip(&schema.fields) {
            match (ref_or_field, ref_or_field_definition) {
                (RefOrField::Ref(record), RefOrFieldDefinition::Ref { indexes, .. }) => {
                    for index in indexes {
                        let RefOrField::Field(field) = &record.values[*index as usize] else {
                            panic!("Invariant violated: inner index must point to a direct field in the referenced record");
                        };
                        values.push(field.clone());
                    }
                }
                (RefOrField::Field(field), RefOrFieldDefinition::FieldDefinition(_)) => {
                    values.push(field.clone());
                }
                _ => panic!("Record and schema must match"),
            }
        }

        Record {
            values,
        }
    }
}
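The effect of dereferencing is easiest to see on a reference that selects only some of the referee's fields. The sketch below assumes hypothetical stand-ins for `Field` and `FieldDefinition`. The free function `flatten` mirrors the logic of `RefRecord::deref` but returns a bare `Vec<Field>` to keep the example short.

```rust
use std::sync::Arc;

/// Hypothetical stand-ins for the real `Field` and `FieldDefinition` types.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Int(i64),
    String(String),
}

#[derive(Debug, Clone)]
struct FieldDefinition {
    name: String,
}

#[derive(Debug, Clone)]
enum RefOrFieldDefinition {
    Ref { schema: Arc<RefSchema>, indexes: Vec<u32> },
    FieldDefinition(FieldDefinition),
}

#[derive(Debug, Default)]
struct RefSchema {
    fields: Vec<RefOrFieldDefinition>,
}

#[derive(Debug, Clone)]
enum RefOrField {
    Ref(Arc<RefRecord>),
    Field(Field),
}

#[derive(Debug, Default)]
struct RefRecord {
    values: Vec<RefOrField>,
}

/// Same traversal as `RefRecord::deref`, returning the cloned fields directly.
fn flatten(record: &RefRecord, schema: &RefSchema) -> Vec<Field> {
    let mut values = vec![];
    for pair in record.values.iter().zip(&schema.fields) {
        match pair {
            (RefOrField::Ref(referee), RefOrFieldDefinition::Ref { indexes, .. }) => {
                for index in indexes {
                    match &referee.values[*index as usize] {
                        RefOrField::Field(field) => values.push(field.clone()),
                        RefOrField::Ref(_) => panic!("only one level of reference is allowed"),
                    }
                }
            }
            (RefOrField::Field(field), RefOrFieldDefinition::FieldDefinition(_)) => {
                values.push(field.clone());
            }
            _ => panic!("Record and schema must match"),
        }
    }
    values
}

fn def(name: &str) -> RefOrFieldDefinition {
    RefOrFieldDefinition::FieldDefinition(FieldDefinition { name: name.to_string() })
}

/// Flattens a joined record that references fields 0 and 2 of a user record.
fn flatten_demo() -> Vec<Field> {
    let user_schema = Arc::new(RefSchema {
        fields: vec![def("id"), def("name"), def("email")],
    });
    let user = Arc::new(RefRecord {
        values: vec![
            RefOrField::Field(Field::Int(7)),
            RefOrField::Field(Field::String("alice".to_string())),
            RefOrField::Field(Field::String("a@example.com".to_string())),
        ],
    });
    let schema = RefSchema {
        fields: vec![
            RefOrFieldDefinition::Ref { schema: user_schema, indexes: vec![0, 2] },
            def("amount"),
        ],
    };
    let record = RefRecord {
        values: vec![RefOrField::Ref(user), RefOrField::Field(Field::Int(100))],
    };
    flatten(&record, &schema)
}
```

The record side stores a reference to the whole user record, and the schema's `indexes` decide which of its fields end up in the flattened output. In this example the unreferenced `name` field is skipped.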

Use in JOIN

In the JOIN processor, we produce the output schema/record by referencing the left and right input schemas/records. Here we define two helper functions to do that.

impl RefSchema {
    fn extend_fields(&mut self, referee: Arc<RefSchema>) {
        let mut direct_indexes = vec![];
        for (index, ref_or_field_definition) in referee.fields.iter().enumerate() {
            match ref_or_field_definition {
                // Any reference gets copied over as is.
                RefOrFieldDefinition::Ref { .. } => {
                    self.fields.push(ref_or_field_definition.clone())
                }
                // Any direct field gets referenced.
                RefOrFieldDefinition::FieldDefinition(_) => direct_indexes.push(index as u32),
            }
        }

        if !direct_indexes.is_empty() {
            self.fields.push(RefOrFieldDefinition::Ref {
                schema: referee,
                indexes: direct_indexes,
            });
        }
    }
}

impl RefRecord {
    fn extend(&mut self, record: Arc<RefRecord>) {
        let mut record_should_be_referenced = false;
        for ref_or_field in &record.values {
            if let RefOrField::Ref(_) = ref_or_field {
                // Any reference gets copied over as is.
                self.values.push(ref_or_field.clone());
            } else {
                record_should_be_referenced = true;
            }
        }

        if record_should_be_referenced {
            self.values.push(RefOrField::Ref(record));
        }
    }
}

Note that in the above functions, columns are reordered so that we don't keep duplicate references to the same RefSchema/RefRecord.
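The reordering can be observed on the schema side. The sketch below uses a hypothetical input schema that mixes direct fields with an existing reference. `FieldDefinition` is a name-only stand-in, and `flat_names`, `direct` and `reorder_demo` are illustrative helpers.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real `FieldDefinition`.
#[derive(Debug, Clone)]
struct FieldDefinition {
    name: String,
}

#[derive(Debug, Clone)]
enum RefOrFieldDefinition {
    Ref { schema: Arc<RefSchema>, indexes: Vec<u32> },
    FieldDefinition(FieldDefinition),
}

#[derive(Debug, Default)]
struct RefSchema {
    fields: Vec<RefOrFieldDefinition>,
}

impl RefSchema {
    /// Same logic as `extend_fields` above.
    fn extend_fields(&mut self, referee: Arc<RefSchema>) {
        let mut direct_indexes = vec![];
        for (index, entry) in referee.fields.iter().enumerate() {
            match entry {
                // Existing references are copied over as is.
                RefOrFieldDefinition::Ref { .. } => self.fields.push(entry.clone()),
                // Direct fields are collected and referenced in one entry.
                RefOrFieldDefinition::FieldDefinition(_) => direct_indexes.push(index as u32),
            }
        }
        if !direct_indexes.is_empty() {
            self.fields.push(RefOrFieldDefinition::Ref {
                schema: referee,
                indexes: direct_indexes,
            });
        }
    }
}

/// Lists column names in flattened order (illustrative helper).
fn flat_names(schema: &RefSchema) -> Vec<String> {
    let mut names = vec![];
    for entry in &schema.fields {
        match entry {
            RefOrFieldDefinition::Ref { schema: referee, indexes } => {
                for &index in indexes {
                    if let RefOrFieldDefinition::FieldDefinition(def) =
                        &referee.fields[index as usize]
                    {
                        names.push(def.name.clone());
                    }
                }
            }
            RefOrFieldDefinition::FieldDefinition(def) => names.push(def.name.clone()),
        }
    }
    names
}

fn direct(name: &str) -> RefOrFieldDefinition {
    RefOrFieldDefinition::FieldDefinition(FieldDefinition { name: name.to_string() })
}

/// The input columns are a, x, y, b in flattened order.
fn reorder_demo() -> Vec<String> {
    let inner = Arc::new(RefSchema { fields: vec![direct("x"), direct("y")] });
    let input = Arc::new(RefSchema {
        fields: vec![
            direct("a"),
            RefOrFieldDefinition::Ref { schema: inner, indexes: vec![0, 1] },
            direct("b"),
        ],
    });
    let mut output = RefSchema::default();
    output.extend_fields(input);
    flat_names(&output)
}
```

The input's flattened column order is a, x, y, b, while the output's is x, y, a, b: copied references come first, followed by a single reference that covers all direct fields of the input. Any code that maps column positions across a JOIN has to account for this.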

Now the JOIN processor can implement joining using referencing.

fn join_schema_fields(left: Arc<RefSchema>, right: Arc<RefSchema>) -> RefSchema {
    let mut schema = RefSchema::default();
    schema.extend_fields(left);
    schema.extend_fields(right);
    schema
}

fn join_record(left: Arc<RefRecord>, right: Arc<RefRecord>) -> RefRecord {
    let mut record = RefRecord::default();
    record.extend(left);
    record.extend(right);
    record
}
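To see how this behaves with cascaded JOINs, the sketch below joins three single-field records in two steps and inspects the `Arc` reference counts. `Field` is a hypothetical one-variant stand-in, and `leaf` and `cascade_demo` are illustrative helpers. `RefRecord::default()` is used for construction because the struct derives `Default`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real `Field` type.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Int(i64),
}

#[derive(Debug, Clone)]
enum RefOrField {
    Ref(Arc<RefRecord>),
    Field(Field),
}

#[derive(Debug, Default)]
struct RefRecord {
    values: Vec<RefOrField>,
}

impl RefRecord {
    /// Same logic as `extend` above.
    fn extend(&mut self, record: Arc<RefRecord>) {
        let mut record_should_be_referenced = false;
        for ref_or_field in &record.values {
            if let RefOrField::Ref(_) = ref_or_field {
                // Existing references are copied over as is.
                self.values.push(ref_or_field.clone());
            } else {
                record_should_be_referenced = true;
            }
        }
        if record_should_be_referenced {
            self.values.push(RefOrField::Ref(record));
        }
    }
}

fn join_record(left: Arc<RefRecord>, right: Arc<RefRecord>) -> RefRecord {
    let mut record = RefRecord::default();
    record.extend(left);
    record.extend(right);
    record
}

/// Illustrative helper: a source record with a single direct field.
fn leaf(value: i64) -> Arc<RefRecord> {
    Arc::new(RefRecord {
        values: vec![RefOrField::Field(Field::Int(value))],
    })
}

/// Returns (slots in the JOIN2 output, strong count of `a`, strong count of `ab`).
fn cascade_demo() -> (usize, usize, usize) {
    let (a, b, c) = (leaf(1), leaf(2), leaf(3));

    // JOIN1 output: references `a` and `b`.
    let ab = Arc::new(join_record(a.clone(), b.clone()));
    // JOIN2 output: copies JOIN1's references and adds one to `c`.
    let abc = join_record(ab.clone(), c.clone());

    (abc.values.len(), Arc::strong_count(&a), Arc::strong_count(&ab))
}
```

The JOIN2 output ends up with three slots that point directly at `a`, `b` and `c`. The count for `a` is three (the original handle, JOIN1's output and JOIN2's output), while the intermediate `ab` is held only by its own handle because it has no direct fields to reference. This is how the one-level invariant is preserved across cascaded JOINs without cloning any field.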

@mediuminvader @v3g42 @snork-alt
