Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: introduce new distribution to avoid the hack of hash of SERIAL #18677

Open
fuyufjh opened this issue Sep 24, 2024 · 1 comment
Open
Assignees

Comments

@fuyufjh
Copy link
Member

fuyufjh commented Sep 24, 2024

IIUC, the root cause of the problem is because of the improper way to hack the hash() function of Serial.

Currently, we did a hack here: if there is only one column and the column type is Serial, we will use extract_vnode_id_from_row_id instead of the standard hash function.

// `compute_chunk` is used to calculate the `VirtualNode` for the columns in the
// chunk. When only one column is provided and its type is `Serial`, we consider the column to
// be the one that contains RowId, and use a special method to skip the calculation of Hash
// and directly extract the `VirtualNode` from `RowId`.
pub fn compute_chunk(
data_chunk: &DataChunk,
keys: &[usize],
vnode_count: usize,
) -> Vec<VirtualNode> {
if let Ok(idx) = keys.iter().exactly_one()
&& let ArrayImpl::Serial(serial_array) = &**data_chunk.column_at(*idx)
{
return serial_array
.iter()
.enumerate()
.map(|(idx, serial)| {
if let Some(serial) = serial {
extract_vnode_id_from_row_id(serial.as_row_id())
} else {
// NOTE: here it will hash the entire row when the `_row_id` is missing,
// which could result in rows from the same chunk being allocated to different chunks.
// This process doesn’t guarantee the order of rows, producing indeterminate results in some cases,
// such as when `distinct on` is used without an `order by`.
let (row, _) = data_chunk.row_at(idx);
row.hash(Crc32FastBuilder).to_vnode(vnode_count)
}
})
.collect();
}
data_chunk
.get_hash_values(keys, Crc32FastBuilder)
.into_iter()
.map(|hash| hash.to_vnode(vnode_count))
.collect()
}

I think the best solution is to use a special distribution, e.g. RowIdDistribution instead of HashDistribution. This essentially remove the hack and make everything clear.

cc. @st1page

Originally posted by @fuyufjh in #18529 (comment)

@github-actions github-actions bot added this to the release-2.1 milestone Sep 24, 2024
@BugenZhao
Copy link
Member

The problem is that, we still need to maintain compatibility with existing jobs and ensure consistent behavior. 😢 So I'm afraid the hack cannot be fully removed.

@fuyufjh fuyufjh removed this from the release-2.1 milestone Oct 17, 2024
@fuyufjh fuyufjh self-assigned this Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants