
parquet arrow writer doesn't track memory size correctly for fixed sized lists #6839

Closed
kszlim opened this issue Dec 5, 2024 · 5 comments · Fixed by #6862
Labels: bug · good first issue · help wanted · parquet

Comments

@kszlim (Contributor) commented Dec 5, 2024

Describe the bug
The arrow writer doesn't track memory size correctly: it appears to treat FixedSizeList columns as having a fixed memory footprint, i.e. the reported memory usage doesn't grow even though the underlying buffers do.

To Reproduce

Cargo.toml:
[package]
name = "repro"
version = "0.1.0"
edition = "2021"

[dependencies]
arrow = "53.3.0"
parquet = "53.3.0"
rand = "0.8.5"

main.rs:
use arrow::array::{FixedSizeListBuilder, UInt8Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use rand::Rng;
use std::fs::File;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define the field and schema for a single column that is a fixed-size list of floats.
    let list_length = 1_048_576;
    let field = Field::new(
        "mylist",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::UInt8, true)), list_length),
        true,
    );
    let schema = Arc::new(Schema::new(vec![field]));

    // Create a writer for the Parquet file
    let file = File::create("output_randomized.parquet")?;
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    let iterations = 10000;
    let values_per_batch = list_length;

    let mut list_arr_builder = FixedSizeListBuilder::new(UInt8Builder::new(), list_length);
    let mut rng = rand::thread_rng();
    for _ in 0..iterations {
        // Generate a batch of random values so each list is distinct
        let values: Vec<u8> = (0..values_per_batch)
            .map(|_| rng.gen())
            .collect();

        list_arr_builder.values().append_slice(&values);
        list_arr_builder.append(true);
        let output = list_arr_builder.finish();
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(output)])?;

        // The reported usage should grow with every buffered batch, but stays flat
        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let before_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
        writer.write(&batch)?;
        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let after_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
        let change_in_usage = after_in_memory_size_mb - before_in_memory_size_mb;
        dbg!(change_in_usage, after_in_memory_size_mb, before_in_memory_size_mb);
    }

    writer.close()?;

    Ok(())
}

Expected behavior
The reported memory usage should rise over time, drop back to near zero whenever a flush is triggered, and then repeat.

@kszlim added the bug label Dec 5, 2024
@tustvold (Contributor) commented Dec 5, 2024

The issue is that GenericColumnWriter::memory_size does not account for the data_pages it has buffered while waiting for the dictionary page to be flushed. This should be a relatively straightforward case of changing it to:

pub(crate) fn memory_size(&self) -> usize {
    self.data_pages.iter().map(|x| x.data().len()).sum::<usize>()
        + self.column_metrics.total_bytes_written as usize
        + self.encoder.estimated_memory_size()
}

And adding an appropriate test.
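
For illustration, a minimal sketch of what such a test could look like, using only the public ArrowWriter API from the reproduction above (the actual fix would more likely be exercised by a unit test inside the parquet crate's column writer module, so the name and structure below are assumptions):

// Hypothetical regression test sketch, not the test added by the fix:
// asserts that the writer's reported memory use grows as FixedSizeList
// batches are buffered, which fails while buffered data pages are left
// out of memory_size.
#[test]
fn memory_size_grows_with_buffered_fixed_size_lists() {
    use std::sync::Arc;
    use arrow::array::{FixedSizeListBuilder, UInt8Builder};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;

    let list_length = 1024;
    let field = Field::new(
        "mylist",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::UInt8, true)), list_length),
        true,
    );
    let schema = Arc::new(Schema::new(vec![field]));
    // Write into an in-memory Vec<u8> so the test needs no filesystem
    let mut writer = ArrowWriter::try_new(Vec::new(), schema.clone(), None).unwrap();

    let mut sizes = Vec::new();
    for i in 0..4u8 {
        let mut builder = FixedSizeListBuilder::new(UInt8Builder::new(), list_length);
        builder.values().append_slice(&vec![i; list_length as usize]);
        builder.append(true);
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(builder.finish())]).unwrap();
        writer.write(&batch).unwrap();
        sizes.push(writer.memory_size() + writer.in_progress_size());
    }
    // Before the fix the reported sizes stayed flat as the buffers grew
    assert!(sizes.last().unwrap() > sizes.first().unwrap(), "sizes did not grow: {sizes:?}");
    writer.close().unwrap();
}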

FYI @wiedld who added this in #5967

Edit: This should probably actually be reported as part of get_estimated_total_bytes
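
A sketch of that alternative, reusing only the field names that appear in the snippet above (the existing body of get_estimated_total_bytes is an assumption here, not quoted from the crate):

// Sketch only: fold the buffered-but-unflushed data pages into the
// total-bytes estimate rather than into memory_size. Field names are
// taken from the snippet above; the rest of the method is assumed.
pub(crate) fn get_estimated_total_bytes(&self) -> u64 {
    self.data_pages
        .iter()
        .map(|page| page.data().len() as u64)
        .sum::<u64>()
        + self.column_metrics.total_bytes_written
        + self.encoder.estimated_memory_size() as u64
}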

@alamb (Contributor) commented Dec 7, 2024

Thanks @kszlim and @tustvold

@kszlim are you planning to propose a fix for this issue?

@kszlim (Contributor, Author) commented Dec 7, 2024

Unfortunately I've been swamped and probably don't have time to fix it, @alamb. I hope my reproduction will be enough for someone to pick it up!

@alamb (Contributor) commented Dec 9, 2024

> Unfortunately I've been swamped and probably don't have time to fix it, @alamb. I hope my reproduction will be enough for someone to pick it up!

100% -- much appreciated 🙏

@alamb (Contributor) commented Dec 17, 2024

label_issue.py automatically added labels {'parquet'} from #6862
