
parquet arrow writer doesn't track memory size correctly for fixed sized lists #6839

Closed
kszlim opened this issue Dec 5, 2024 · 5 comments · Fixed by #6862
Labels: bug · good first issue · help wanted · parquet

Comments

@kszlim (Contributor) commented Dec 5, 2024

Describe the bug
The arrow writer doesn't track memory size correctly: it appears to treat FixedSizeList columns as having a fixed memory footprint, i.e. the reported memory usage doesn't grow even though the underlying buffers do.

To Reproduce

Cargo.toml:
[package]
name = "repro"
version = "0.1.0"
edition = "2021"

[dependencies]
arrow = "53.3.0"
parquet = "53.3.0"
rand = "0.8.5"

main.rs:
use arrow::array::{FixedSizeListBuilder, UInt8Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use rand::Rng;
use std::fs::File;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define the field and schema for a single column that is a fixed-size list of floats.
    let list_length = 1_048_576;
    let field = Field::new(
        "mylist",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::UInt8, true)), list_length),
        true,
    );
    let schema = Arc::new(Schema::new(vec![field]));

    // Create a writer for the Parquet file
    let file = File::create("output_randomized.parquet")?;
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    let iterations = 10000;
    let values_per_batch = list_length;

    let mut list_arr_builder = FixedSizeListBuilder::new(UInt8Builder::new(), list_length);
    let mut rng = rand::thread_rng();
    for _ in 0..iterations {
        // Generate a batch of random values so each list is distinct
        let values: Vec<u8> = (0..values_per_batch)
            .map(|_| rng.gen())
            .collect();

        list_arr_builder.values().append_slice(&values);
        list_arr_builder.append(true);
        let output = list_arr_builder.finish();
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(output)])?;

        // The reported usage should grow with every buffered batch, but stays flat
        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let before_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
        writer.write(&batch)?;
        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let after_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
        let change_in_usage = after_in_memory_size_mb - before_in_memory_size_mb;
        dbg!(change_in_usage, after_in_memory_size_mb, before_in_memory_size_mb);
    }

    writer.close()?;

    Ok(())
}

Expected behavior
The reported memory usage should rise over time, drop back to near zero whenever a flush is triggered, and then repeat.

@kszlim added the bug label Dec 5, 2024
@tustvold (Contributor) commented Dec 5, 2024

The issue is that GenericColumnWriter::memory_size does not account for the data_pages it has buffered while waiting for the dictionary page to be flushed. This should be a relatively straightforward case of changing it to:

pub(crate) fn memory_size(&self) -> usize {
    self.data_pages.iter().map(|x| x.data().len()).sum::<usize>()
        + self.column_metrics.total_bytes_written as usize
        + self.encoder.estimated_memory_size()
}

And adding an appropriate test.
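
For illustration, a minimal sketch of what such a test could look like, using only the public ArrowWriter API from the reproduction above (the actual fix would more likely be exercised by a unit test inside the parquet crate's column writer module, so the name and structure below are assumptions):

// Hypothetical regression test sketch, not the test added by the fix:
// asserts that the writer's reported memory use grows as FixedSizeList
// batches are buffered, which fails while buffered data pages are left
// out of memory_size.
#[test]
fn memory_size_grows_with_buffered_fixed_size_lists() {
    use std::sync::Arc;
    use arrow::array::{FixedSizeListBuilder, UInt8Builder};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;

    let list_length = 1024;
    let field = Field::new(
        "mylist",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::UInt8, true)), list_length),
        true,
    );
    let schema = Arc::new(Schema::new(vec![field]));
    // Write into an in-memory Vec<u8> so the test needs no filesystem
    let mut writer = ArrowWriter::try_new(Vec::new(), schema.clone(), None).unwrap();

    let mut sizes = Vec::new();
    for i in 0..4u8 {
        let mut builder = FixedSizeListBuilder::new(UInt8Builder::new(), list_length);
        builder.values().append_slice(&vec![i; list_length as usize]);
        builder.append(true);
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(builder.finish())]).unwrap();
        writer.write(&batch).unwrap();
        sizes.push(writer.memory_size() + writer.in_progress_size());
    }
    // Before the fix the reported sizes stayed flat as the buffers grew
    assert!(sizes.last().unwrap() > sizes.first().unwrap(), "sizes did not grow: {sizes:?}");
    writer.close().unwrap();
}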

FYI @wiedld who added this in #5967

Edit: This should probably actually be reported as part of get_estimated_total_bytes
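
A sketch of that alternative, reusing only the field names that appear in the snippet above (the existing body of get_estimated_total_bytes is an assumption here, not quoted from the crate):

// Sketch only: fold the buffered-but-unflushed data pages into the
// total-bytes estimate rather than into memory_size. Field names are
// taken from the snippet above; the rest of the method is assumed.
pub(crate) fn get_estimated_total_bytes(&self) -> u64 {
    self.data_pages
        .iter()
        .map(|page| page.data().len() as u64)
        .sum::<u64>()
        + self.column_metrics.total_bytes_written
        + self.encoder.estimated_memory_size() as u64
}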

@alamb (Contributor) commented Dec 7, 2024

Thanks @kszlim and @tustvold

@kszlim are you planning to propose a fix for this issue?

@kszlim (Contributor, Author) commented Dec 7, 2024

Unfortunately I've been swamped and probably don't have time to fix it, @alamb. I hope my reproduction will be enough for someone to pick it up!

@alamb (Contributor) commented Dec 9, 2024

> Unfortunately I've been swamped and probably don't have time to fix it, @alamb. I hope my reproduction will be enough for someone to pick it up!

100% -- much appreciated 🙏

@alamb (Contributor) commented Dec 17, 2024

label_issue.py automatically added labels {'parquet'} from #6862
