fix: column aware row encoding: improve the implementation and add bench #17818

Merged
merged 9 commits into from
Jul 30, 2024
build a default row in advance
fuyufjh committed Jul 30, 2024
commit b76984c7d5491feedfe0ebb652fba12fe43b5ab7
21 changes: 11 additions & 10 deletions src/common/src/util/value_encoding/column_aware_row_encoding.rs
@@ -170,7 +170,9 @@ impl ValueRowSerializer for Serializer {
 pub struct Deserializer {
     required_column_ids: HashMap<i32, usize>,
Member

@BugenZhao BugenZhao Jul 26, 2024

Interesting. I thought BTreeMap is faster when there are only a few entries: https://arc.net/l/quote/okdycbqi

The current value of B is 6. According to this, can we also benchmark the case of a table containing fewer than 6 columns?

Member Author

Added bench column_aware_row_encoding_4_columns. Observed a similar performance improvement.

Benchmarking column_aware_row_encoding_16_columns_encode: Collecting 100 samples in estimated 5.0015
column_aware_row_encoding_16_columns_encode
                        time:   [491.00 ns 492.30 ns 493.95 ns]
                        change: [-3.6361% -2.8341% -2.0340%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

Benchmarking column_aware_row_encoding_16_columns_decode: Collecting 100 samples in estimated 5.0011
column_aware_row_encoding_16_columns_decode
                        time:   [445.62 ns 448.76 ns 451.61 ns]
                        change: [-51.073% -50.248% -49.515%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

Benchmarking column_aware_row_encoding_4_columns_encode: Collecting 100 samples in estimated 5.0009 s (23M iter
column_aware_row_encoding_4_columns_encode
                        time:   [221.45 ns 224.51 ns 228.98 ns]
                        change: [+0.1302% +0.7332% +1.4856%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Benchmarking column_aware_row_encoding_4_columns_decode: Collecting 100 samples in estimated 5.0004 s (38M iter
column_aware_row_encoding_4_columns_decode
                        time:   [131.47 ns 131.82 ns 132.26 ns]
                        change: [-46.562% -46.074% -45.656%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe
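
For anyone who wants to poke at the small-map question outside the repository, a std-only sketch along these lines may be a starting point. It is not the criterion benchmark added in this PR: `build_maps`, `sum_hash`, and `sum_btree` are hypothetical helpers, the key distribution is an assumption, and wall-clock timing with `Instant` is much noisier than criterion's statistics.

```rust
use std::collections::{BTreeMap, HashMap};
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical helper: map column id -> row position, in both map types.
fn build_maps(n: usize) -> (HashMap<i32, usize>, BTreeMap<i32, usize>) {
    let hash: HashMap<i32, usize> = (0..n).map(|i| (i as i32, i)).collect();
    let btree: BTreeMap<i32, usize> = (0..n).map(|i| (i as i32, i)).collect();
    (hash, btree)
}

/// Look up every id and sum the positions found, so the work cannot be optimized away.
fn sum_hash(map: &HashMap<i32, usize>, ids: &[i32]) -> usize {
    ids.iter().filter_map(|id| map.get(id)).sum()
}

fn sum_btree(map: &BTreeMap<i32, usize>, ids: &[i32]) -> usize {
    ids.iter().filter_map(|id| map.get(id)).sum()
}

fn main() {
    const ROUNDS: usize = 200_000;
    for &n in &[4usize, 16] {
        let (hash, btree) = build_maps(n);
        let ids: Vec<i32> = (0..n as i32).collect();

        let start = Instant::now();
        for _ in 0..ROUNDS {
            black_box(sum_hash(black_box(&hash), black_box(&ids)));
        }
        let hash_time = start.elapsed();

        let start = Instant::now();
        for _ in 0..ROUNDS {
            black_box(sum_btree(black_box(&btree), black_box(&ids)));
        }
        let btree_time = start.elapsed();

        println!("{} columns: HashMap {:?}, BTreeMap {:?}", n, hash_time, btree_time);
    }
}
```

Build with optimizations (for example `rustc -O`) before reading anything into the numbers, and treat the result as a rough sanity check rather than a replacement for the figures above.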

     schema: Arc<[DataType]>,
-    default_column_values: Vec<(usize, Datum)>,
+
+    /// A row with default values for each column or `None` if no default value is specified
+    default_row: Vec<Datum>,
 }

 impl Deserializer {
@@ -180,14 +182,18 @@ impl Deserializer {
         column_with_default: impl Iterator<Item = (usize, Datum)>,
     ) -> Self {
         assert_eq!(column_ids.len(), schema.len());
+        let mut default_row: Vec<Datum> = vec![None; schema.len()];
+        for (i, datum) in column_with_default {
+            default_row[i] = datum;
+        }
         Self {
             required_column_ids: column_ids
                 .iter()
                 .enumerate()
                 .map(|(i, c)| (c.get_id(), i))
                 .collect::<HashMap<_, _>>(),
             schema,
-            default_column_values: column_with_default.collect(),
+            default_row,
         }
     }
 }
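
The constructor change above moves default-value setup from every decode call to a one-time step. A minimal sketch of the two initialization strategies, showing that they produce the same starting row, might look like this. `Datum = Option<i64>` is a simplified stand-in for the real type, and `init_per_call` and `build_default_row` are hypothetical names used only for illustration.

```rust
/// Simplified stand-in for the real `Datum` type (an assumption for this sketch).
type Datum = Option<i64>;

/// Previous approach: rebuild the defaults on every decode call.
fn init_per_call(num_columns: usize, defaults: &[(usize, Datum)]) -> Vec<Datum> {
    let mut datums: Vec<Datum> = vec![None; num_columns];
    for (i, datum) in defaults {
        datums[*i].clone_from(datum);
    }
    datums
}

/// New approach, step 1: build the default row once, at construction time.
fn build_default_row(
    num_columns: usize,
    defaults: impl Iterator<Item = (usize, Datum)>,
) -> Vec<Datum> {
    let mut default_row: Vec<Datum> = vec![None; num_columns];
    for (i, datum) in defaults {
        default_row[i] = datum;
    }
    default_row
}

fn main() {
    let defaults: Vec<(usize, Datum)> = vec![(1, Some(7)), (3, Some(-1))];
    let default_row = build_default_row(4, defaults.iter().cloned());

    // New approach, step 2: each decode call starts from a plain clone.
    let row = default_row.clone();
    assert_eq!(row, init_per_call(4, &defaults));
    println!("{:?}", row);
}
```

Both paths yield the same initial row; the difference is only where the per-column loop runs.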
@@ -207,12 +213,7 @@ impl ValueRowDeserializer for Deserializer {
         let offsets = &encoded_bytes[offsets_start_idx..data_start_idx];
         let data = &encoded_bytes[data_start_idx..];

-        // initialize datums with default values
-        let mut datums: Vec<Datum> = vec![None; self.schema.len()];
-        for (i, datum) in &self.default_column_values {
-            datums[*i].clone_from(datum);
-        }
-
+        let mut row = self.default_row.clone();
         for i in 0..datum_num {
             let this_id = encoded_bytes.get_i32_le();
             if let Some(&decoded_idx) = self.required_column_ids.get(&this_id) {
@@ -242,10 +243,10 @@ impl ValueRowDeserializer for Deserializer {
                         &mut data_slice,
                     )?)
                 };
-                datums[decoded_idx] = data;
+                row[decoded_idx] = data;
             }
         }
-        Ok(datums)
+        Ok(row)
     }
 }
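
Taken together, the decode path after this commit starts from a clone of the prebuilt default row and overwrites only the positions whose column id is still required. The sketch below illustrates that flow under simplifying assumptions: `SketchDeserializer`, the `(i32, Datum)` pair input, and `Datum = Option<i64>` are hypothetical stand-ins, whereas the real code parses ids, offsets, and values out of `encoded_bytes`.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the real `Datum` type (an assumption for this sketch).
type Datum = Option<i64>;

/// Hypothetical, simplified deserializer; not the struct from the PR.
struct SketchDeserializer {
    /// Column id -> position in the output row.
    required_column_ids: HashMap<i32, usize>,
    /// Prebuilt row holding each column's default value, or `None`.
    default_row: Vec<Datum>,
}

impl SketchDeserializer {
    fn new(
        column_ids: &[i32],
        column_with_default: impl Iterator<Item = (usize, Datum)>,
    ) -> Self {
        let mut default_row: Vec<Datum> = vec![None; column_ids.len()];
        for (i, datum) in column_with_default {
            default_row[i] = datum;
        }
        Self {
            required_column_ids: column_ids
                .iter()
                .enumerate()
                .map(|(i, &id)| (id, i))
                .collect(),
            default_row,
        }
    }

    /// `encoded` stands in for the (column id, value) pairs parsed from the bytes.
    fn decode(&self, encoded: &[(i32, Datum)]) -> Vec<Datum> {
        let mut row = self.default_row.clone();
        for (id, value) in encoded {
            // Ids that are not in the map are skipped.
            if let Some(&idx) = self.required_column_ids.get(id) {
                row[idx] = *value;
            }
        }
        row
    }
}

fn main() {
    // Three required columns; the third (id 30) has a default of 0.
    let de = SketchDeserializer::new(&[10, 20, 30], vec![(2, Some(0))].into_iter());
    // The encoded row carries id 10 and an id that is not required (99).
    let row = de.decode(&[(10, Some(1)), (99, Some(5))]);
    println!("{:?}", row);
}
```

Here the column with id 30 is absent from the encoded input and keeps its default, while id 99 is ignored because it is not in `required_column_ids`.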