Optimize integer reading using bmi2 instructions #4

jhorstmann · 2024-05-25T15:21:49Z

No description provided.

jhorstmann · 2024-05-27T09:57:53Z

Unclear how big the benefit actually would be. Many integers and especially enum identifiers would be small so the small bytewise loop should actually be faster. The optimization would require an additional branch to make sure we can read 4 or 8 bytes and fallback if not.

For writing to a vec, bmi2 could be more beneficial since could always write bigger chunks and then adjust the len afterwards.

jhorstmann · 2024-05-27T13:37:25Z

For reference, code for branchless decoding of up to 56 bits using bmi2 instructions.

/// Reads an uleb128 encoded integer with at most 56 bits (8 bytes with 7 bits worth of payload each).
/// Returns the integer and the number of bytes that made up this integer.
/// If the returned length is bigger than 8 this means the integer required more than 8 bytes and the remaining bytes need to be read sequentially and combined with the return value.
/// Safety: `data` needs to contain at least 8 bytes.
#[target_feature(enable="bmi2")]
pub unsafe fn decode_uleb_bmi2(data: &[u8]) -> (u64, usize) {
    const CONT_MARKER: u64 = 0x80808080_80808080;
    debug_assert!(data.len() >= 8);
    
    unsafe {
        let word = data.as_ptr().cast::<u64>().read_unaligned();
        // mask indicating continuation bytes
        let mask = std::arch::x86_64::_pext_u64(word, CONT_MARKER);
        let len = (!mask).trailing_zeros() + 1;
        // which payload bits to extract
        let ext = std::arch::x86_64::_bzhi_u64(!CONT_MARKER, 8*len);
        let payload = std::arch::x86_64::_pext_u64(word, ext);
        
        (payload, len as _)
    }
}

jhorstmann · 2024-05-27T14:27:19Z

Branchless uleb encoding for the writing side (not yet tested):

const REQUIRED_BYTES: [u8; 33] = [
    1, 1, 1, 1, 1, 1, 1,
    2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4,
    5, 5, 5, 5, 5
];

/// Safety: `data` needs to have space for at least 8 bytes.
#[target_feature(enable="bmi2")]
pub unsafe fn encode_uleb_bmi2(value: u32, output: &mut [u8]) -> usize {
    const CONT_MARKER: u64 = 0x80808080_80808080;
    const EXPAND_MASK: u64 = 0x7F7F7F_7F7F7F7F;
    unsafe {
        let required_bits = 32-value.leading_zeros();

        //let required_bytes = required_bits/7 + 1;
        let required_bytes = REQUIRED_BYTES[required_bits as usize];
        let expanded = std::arch::x86_64::_pdep_u64(value as u64, EXPAND_MASK);
        let marker = std::arch::x86_64::_bzhi_u64(CONT_MARKER, required_bytes as u32 * 8);

        let marked = expanded | marker;
        output.get_unchecked_mut(0..8).copy_from_slice(&marked.to_le_bytes());
        required_bytes as _
    }
}

jhorstmann · 2024-06-08T09:14:03Z

Skipping numbers should to be much simpler using sse2:

/// Safety: `input` needs to contains at least 16 bytes
unsafe fn skip_uleb_sse2(input: &[u8]) -> &[u8] {
    use std::arch::x86_64::*;
    let chunk = _mm_loadu_si128(input.as_ptr().cast());
    let mask = !_mm_movemask_epi8(chunk);
    let len = (mask.trailing_zeros() as usize) + 1;
    input.get_unchecked(len..)
}

jhorstmann mentioned this issue Jun 7, 2024

Selective decoding of a subset (e.g. columns or row groups) of parquet metadata apache/arrow-rs#5855

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize integer reading using bmi2 instructions #4

Optimize integer reading using bmi2 instructions #4

jhorstmann commented May 25, 2024

jhorstmann commented May 27, 2024 •

edited

Loading

jhorstmann commented May 27, 2024

jhorstmann commented May 27, 2024

jhorstmann commented Jun 8, 2024

Optimize integer reading using bmi2 instructions #4

Optimize integer reading using bmi2 instructions #4

Comments

jhorstmann commented May 25, 2024

jhorstmann commented May 27, 2024 • edited Loading

jhorstmann commented May 27, 2024

jhorstmann commented May 27, 2024

jhorstmann commented Jun 8, 2024

jhorstmann commented May 27, 2024 •

edited

Loading