Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow #15875

Merged
Changes from 52 commits
64 commits
7e58a7a
Python bindings + initial artifacts for arrow schema in PQ writer
mhaseeb123 May 22, 2024
7351f91
Add artifacts to build flatbuffers.
mhaseeb123 May 23, 2024
9aca785
Add basic artifacts to construct the field vector.
mhaseeb123 May 23, 2024
de0fc40
Add artifacts for arrow schema in pq writer
mhaseeb123 May 24, 2024
9080d36
Merge branch 'arrow-schema-support-pq-writer' of https://github.com/m…
mhaseeb123 May 24, 2024
497727e
merge with upstream
mhaseeb123 May 24, 2024
d166fe6
Working arrow schema builder. Need to handle nested_types and dict32
mhaseeb123 May 25, 2024
7dad37b
Handle structs and lists
mhaseeb123 May 29, 2024
65d2ab5
Merge branch 'rapidsai:branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 May 29, 2024
e1fc02e
Remove unused code borrowed from arrow.
mhaseeb123 May 29, 2024
44fb0ef
minor improvements to tests and code
mhaseeb123 May 29, 2024
e733ff1
Code cleanup and add API docs.
mhaseeb123 May 29, 2024
f4a9595
Revert changes to types.hpp
mhaseeb123 May 29, 2024
ede6191
Minor code and doc cleanup
mhaseeb123 May 29, 2024
62a2684
Minor fix for failing pytest
mhaseeb123 May 29, 2024
6e448ab
Handle int96 timestamps.
mhaseeb123 May 29, 2024
e003d65
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 May 29, 2024
3c800f5
Add `stats_dtype` to INT64 duration columns
mhaseeb123 May 29, 2024
f7aaaad
turn arrow schema off by default
mhaseeb123 May 30, 2024
04a1998
Disable the missed `store_schema` in parquet.py
mhaseeb123 May 30, 2024
ff22e7d
minor bug fixing
mhaseeb123 May 30, 2024
d5f01be
Fixes for tests
mhaseeb123 May 30, 2024
55296df
Cleanup and restore int96timestamps for this PR.
mhaseeb123 May 30, 2024
706eb18
Modify int96 and arrow schema option behavior
mhaseeb123 May 30, 2024
fa247b7
Revert _use_arrow_schema to true
mhaseeb123 May 30, 2024
9607618
Add tests
mhaseeb123 May 31, 2024
a044f3f
remove temp variables
mhaseeb123 May 31, 2024
a5ab9fb
minor comments cleanup
mhaseeb123 May 31, 2024
844a1d6
revert convertedtype setting
mhaseeb123 May 31, 2024
119e814
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 May 31, 2024
95e860b
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jun 2, 2024
dc1608c
Add decimal column conversion
mhaseeb123 Jun 4, 2024
0946eb4
minor ruff-formatting fix
mhaseeb123 Jun 4, 2024
050837b
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jun 4, 2024
ea01aee
refactor and move some helpers to writer_impl_helpers.cpp
mhaseeb123 Jun 4, 2024
7b3de64
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jun 6, 2024
b9f2989
resolve conflicts, minor doc and pytest updates.
mhaseeb123 Jun 6, 2024
9fea41c
Merge branch 'branch-24.08' of https://github.com/rapidsai/cudf into …
mhaseeb123 Jun 10, 2024
bdedaec
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jun 10, 2024
8e77687
Changes from reviewer suggestions
mhaseeb123 Jun 11, 2024
30057c0
Minor changes from reviewer suggestions.
mhaseeb123 Jun 11, 2024
c725273
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jun 11, 2024
0f13642
minor update. add nodiscard.
mhaseeb123 Jun 11, 2024
0a7df57
Merge branch 'rapidsai:branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jun 14, 2024
f9c123b
Minor changes addressing reviewer comments.
mhaseeb123 Jun 14, 2024
1ee8732
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jun 14, 2024
92d88a0
Rename `is_col_nullable` to `is_output_col_nullable`
mhaseeb123 Jun 14, 2024
df11288
minor comment update
mhaseeb123 Jun 14, 2024
578c8e1
minor comment update
mhaseeb123 Jun 14, 2024
9973175
Merge branch 'branch-24.08' of https://github.com/mhaseeb123/cudf int…
mhaseeb123 Jun 24, 2024
b1e6b6f
Minor refactor
mhaseeb123 Jun 26, 2024
cb39159
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jun 26, 2024
c011e51
Incorporating minor suggestions from review
mhaseeb123 Jun 27, 2024
2d45bd0
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jun 27, 2024
649a92d
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jul 4, 2024
b6a54ec
Test for exception handling to_parquet with int96 and arrow schema en…
mhaseeb123 Jul 4, 2024
bddfabe
Minor fix for failing pytests
mhaseeb123 Jul 5, 2024
e9ab52f
Minor changes from reviewer suggestions
mhaseeb123 Jul 9, 2024
ca9fc3f
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jul 9, 2024
9b163f7
Update cpp/src/io/parquet/arrow_schema_writer.hpp
mhaseeb123 Jul 9, 2024
5017f2a
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jul 9, 2024
13a06ac
Apply clang-format
mhaseeb123 Jul 9, 2024
1ceca42
Add details to `store_schema` docstring
mhaseeb123 Jul 9, 2024
db54c0b
Merge branch 'branch-24.08' into arrow-schema-support-pq-writer
mhaseeb123 Jul 9, 2024
2 changes: 2 additions & 0 deletions cpp/CMakeLists.txt
@@ -406,6 +406,7 @@ add_library(
src/io/orc/stripe_init.cu
src/datetime/timezone.cpp
src/io/orc/writer_impl.cu
src/io/parquet/arrow_schema_writer.cpp
src/io/parquet/compact_protocol_reader.cpp
src/io/parquet/compact_protocol_writer.cpp
src/io/parquet/decode_preprocess.cu
@@ -422,6 +423,7 @@ add_library(
src/io/parquet/reader_impl_helpers.cpp
src/io/parquet/reader_impl_preprocess.cu
src/io/parquet/writer_impl.cu
src/io/parquet/writer_impl_helpers.cpp
src/io/parquet/decode_fixed.cu
src/io/statistics/orc_column_statistics.cu
src/io/statistics/parquet_column_statistics.cu
25 changes: 25 additions & 0 deletions cpp/include/cudf/io/parquet.hpp
@@ -597,6 +597,8 @@ class parquet_writer_options_base {
// Parquet writer can write timestamps as UTC
// Defaults to true because libcudf timestamps are implicitly UTC
bool _write_timestamps_as_UTC = true;
// Whether to write ARROW schema
bool _write_arrow_schema = false;
// Maximum size of each row group (unless smaller than a single page)
size_t _row_group_size_bytes = default_row_group_size_bytes;
// Maximum number of rows in row group (unless smaller than a single page)
@@ -689,6 +691,13 @@ class parquet_writer_options_base {
*/
[[nodiscard]] auto is_enabled_utc_timestamps() const { return _write_timestamps_as_UTC; }

/**
* @brief Returns `true` if arrow schema will be written
*
* @return `true` if arrow schema will be written
*/
[[nodiscard]] auto is_enabled_write_arrow_schema() const { return _write_arrow_schema; }

/**
* @brief Returns maximum row group size, in bytes.
*
@@ -824,6 +833,13 @@ class parquet_writer_options_base {
*/
void enable_utc_timestamps(bool val);

/**
* @brief Sets preference for writing arrow schema. Write arrow schema if set to `true`.
*
* @param val Boolean value to enable/disable writing of arrow schema.
*/
void enable_write_arrow_schema(bool val);

/**
* @brief Sets the maximum row group size, in bytes.
*
@@ -1084,6 +1100,15 @@ class parquet_writer_options_builder_base {
* @return this for chaining
*/
BuilderT& utc_timestamps(bool enabled);

/**
* @brief Set to true if arrow schema is to be written
*
* @param enabled Boolean value to enable/disable writing of arrow schema
* @return this for chaining
*/
BuilderT& write_arrow_schema(bool enabled);

/**
* @brief Set to true if V2 page headers are to be written.
*
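The "@return this for chaining" contract on the new `write_arrow_schema` builder setter can be illustrated with a small CRTP sketch. Everything named `*_sketch` below is a hypothetical stand-in written for this note, not a libcudf class; only the `static_cast<BuilderT&>(*this)` return pattern is taken from the builder definition in this PR.

```cpp
// Hypothetical options type: mirrors only the two flags relevant here.
struct options_sketch {
  bool write_arrow_schema = false;  // off by default, as in the diff
  bool utc_timestamps     = true;
};

// Hypothetical CRTP base: each setter returns the derived builder type
// so that calls can be chained without losing the concrete type.
template <class BuilderT, class OptionsT>
class builder_base_sketch {
 protected:
  OptionsT _options{};

 public:
  BuilderT& write_arrow_schema(bool enabled)
  {
    _options.write_arrow_schema = enabled;
    return static_cast<BuilderT&>(*this);
  }

  BuilderT& utc_timestamps(bool enabled)
  {
    _options.utc_timestamps = enabled;
    return static_cast<BuilderT&>(*this);
  }

  // Hands back a copy of the accumulated options.
  OptionsT build() const { return _options; }
};

// Concrete builder: inherits the chaining setters from the CRTP base.
class builder_sketch : public builder_base_sketch<builder_sketch, options_sketch> {};

// Example chain: enable the arrow schema, disable UTC timestamps.
inline options_sketch make_options_sketch()
{
  builder_sketch builder;
  return builder.write_arrow_schema(true).utc_timestamps(false).build();
}
```

Returning `BuilderT&` rather than a reference to the base keeps any derived-only setters available later in the chain, which is presumably why the base is templated on the builder type.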
18 changes: 18 additions & 0 deletions cpp/src/io/functions.cpp
@@ -762,6 +762,9 @@ void parquet_writer_options_base::set_compression(compression_type compression)

void parquet_writer_options_base::enable_int96_timestamps(bool req)
{
CUDF_EXPECTS(not req or not is_enabled_write_arrow_schema(),
"INT96 timestamps and arrow schema cannot be simultaneously "
"enabled as INT96 timestamps are deprecated in Arrow.");
_write_timestamps_as_int96 = req;
}

@@ -770,6 +773,14 @@ void parquet_writer_options_base::enable_utc_timestamps(bool val)
_write_timestamps_as_UTC = val;
}

void parquet_writer_options_base::enable_write_arrow_schema(bool val)
{
CUDF_EXPECTS(not val or not is_enabled_int96_timestamps(),
"arrow schema and INT96 timestamps cannot be simultaneously "
"enabled as INT96 timestamps are deprecated in Arrow.");
_write_arrow_schema = val;
}

void parquet_writer_options_base::set_row_group_size_bytes(size_t size_bytes)
{
CUDF_EXPECTS(
@@ -974,6 +985,13 @@ BuilderT& parquet_writer_options_builder_base<BuilderT, OptionsT>::utc_timestamp
return static_cast<BuilderT&>(*this);
}

template <class BuilderT, class OptionsT>
BuilderT& parquet_writer_options_builder_base<BuilderT, OptionsT>::write_arrow_schema(bool enabled)
{
_options.enable_write_arrow_schema(enabled);
return static_cast<BuilderT&>(*this);
}

template <class BuilderT, class OptionsT>
BuilderT& parquet_writer_options_builder_base<BuilderT, OptionsT>::write_v2_headers(bool enabled)
{
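The two `CUDF_EXPECTS` guards in functions.cpp make INT96 timestamps and the arrow schema mutually exclusive, whichever setter is called second. A minimal standalone sketch of that behaviour follows. `writer_options_sketch` and `rejects_conflict` are hypothetical names invented for this note, and `std::logic_error` stands in for the exception that `CUDF_EXPECTS` raises; the predicates (`not req or not is_enabled_...`) are rewritten in their equivalent `req && is_enabled_...` rejection form.

```cpp
#include <stdexcept>

// Hypothetical stand-in for parquet_writer_options_base; not the real class.
class writer_options_sketch {
  bool _write_timestamps_as_int96 = false;
  bool _write_arrow_schema        = false;

 public:
  [[nodiscard]] bool is_enabled_int96_timestamps() const { return _write_timestamps_as_int96; }
  [[nodiscard]] bool is_enabled_write_arrow_schema() const { return _write_arrow_schema; }

  void enable_int96_timestamps(bool req)
  {
    // Reject only when this call would leave both options enabled.
    if (req && is_enabled_write_arrow_schema()) {
      throw std::logic_error(
        "INT96 timestamps and arrow schema cannot be simultaneously enabled");
    }
    _write_timestamps_as_int96 = req;
  }

  void enable_write_arrow_schema(bool val)
  {
    // Symmetric guard: the check runs regardless of call order.
    if (val && is_enabled_int96_timestamps()) {
      throw std::logic_error(
        "arrow schema and INT96 timestamps cannot be simultaneously enabled");
    }
    _write_arrow_schema = val;
  }
};

// Returns true if enabling the second option throws, for either call order.
inline bool rejects_conflict(bool int96_first)
{
  writer_options_sketch opts;
  try {
    if (int96_first) {
      opts.enable_int96_timestamps(true);
      opts.enable_write_arrow_schema(true);
    } else {
      opts.enable_write_arrow_schema(true);
      opts.enable_int96_timestamps(true);
    }
  } catch (std::logic_error const&) {
    return true;
  }
  return false;
}
```

Because each guard only fires when its own argument is `true`, disabling either option always succeeds, so a caller can switch from one mode to the other by turning the first off before turning the second on.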