Add Parquet Plugin
Add basic EmbedFunctionContext in cpp files.

Add Arrow Includes.

Fix Header conflicts before rebase.

Fix Header conflicts before rebase.

Changes after rebase.

Added example file to show reading from and writing to a parquet file.

Add function calls to ecl file.

Added some basic options for embedding the parquet functions.
I am not sure if these will stay the same.

Added install calls so the shared library gets installed.

Fixed type.

Added function instantiations for header functions.

Added some useful function definitions.

Added definitions for some of the ParquetEmbedFunctionContext and ParquetRecordBinder functions.

Fix typo

Added ParquetHelper class for holding the user inputs and opening the stream reader and writer.

Added dataset argument to the write function to test writing ECL datasets.

Add function to ParquetDatasetBinder to get datatypes from RtlFieldInfo object and build the schema.

Fix error in getFieldTypes method.

Fix addField method. Schema now gets built from the parameter dataset.

Can write to parquet files.

Set up ParquetRowStream to start reading parquet and output it to an ECL record.

Minor changes.

Can read a parquet table from a parquet file.

Add RowBuilder methods for building result rows.

Add RowBatchBuilder for converting parquet tables to rows. There seems to be a namespace conflict with rapidjson::Value.

Add #define RAPIDJSON_HAS_STDSTRING 1.
This allows for the use of some helpful GenericValue functions.

Implemented Parquet::RowStream::nextRow which gets each result row from the parquet iterator and converts it to json using rapidjson.

Fixed error where reading from the file read one extra row and failed.

Remove Credentials.

Fix issue where utf8 encoded parameters were not properly converted to strings.

Fix issues dealing with utf8 strings and reading and writing them.

Add code for getting the schema for nested types.

Add additional feature from Apache Arrow.

Add code for reading partitioned files.

Comment out Partition code

Edit ECL files

Linker not properly linking dataset shared library.

Manually link Arrow Dataset shared library.

Added part of the code for writing to partitioned files.

Replace StreamWriter API with rapidjson API.
Add classes for converting rapidjson::Documents to arrow::RecordBatches for writing.

Moved allocator for rapidjson documents to top level in the parquetembed namespace.

Minor changes. Schema needs to be correctly passed through.

Fix issue with writing to vector of rapidjson documents.
Add visit method for HalfFloat datatype.

Changed FileWriter creation to write in the parquet format.

Moved FileWriter from ExecuteAll to openWriteFile. This is so it only gets called once.
Changed write() to return the FileWriter instead of the FileOutputStream.
Writing in batches now works! Further testing on edge cases and vector reuse is needed!

Removed old code. Currently in working state though not tested very much.

Read Files in RowGroups one at a time then iterate through each row.

Added file for testing reading and writing large parquet files.

Minor: Changes to simple example.

rapidjson MemoryPoolAllocator is now cleared after writing each RowGroup.

Tidy up a bit...

Created file for testing nested data.

Added support for nested objects.

Refactored code and moved function implementations to cpp file.

Added comment.

Added comment.

Added new comment.

Documentation.

Rebase 8.12.x complete and tested.

Initial commit

Fixed and Tested rebase.

Minor.

Fixup build and install

Signed-off-by: Dan S. Camper <[email protected]>

Change type mapping. Unsigned and Signed are now mapped explicitly to 64 and 32 bit versions of each.

Add different byte sizes for testing type mapping.

Refactor JsonValueConverter.
Add additional methods for UInt64/32 and Int32 for explicit type mapping.

Change interface for calling plugin.

Added information for which thor worker we are running on.
Divides the row groups (poorly) by the number of workers.

Minor: Updated test files.

Change which API is used for opening files.

WIP

Minor changes to opening parquet file

HPCC-28689 Refactor jhtree class structure to support additional node compression methods

Sort out concerns about usage and semantics of isLeaf(). Also
a few trivial changes suggested by earlier code review

Signed-off-by: Richard Chapman <[email protected]>

Not sure what this does.

Update FileReader usage to comply with version 12.0.0 standards.

Add overlay for pulling down forked arrow directory.

Add function to close file that resets unique_ptr when the plugin is done streaming rows to the engine.

Add Arrow to vcpkg.json.in

Add correct hash for current arrow commit.
Add PARQUETEMBED to plugins.cmake.

Update arrow fork SHA512.

Remove trailing whitespace

Signed-off-by: Dan S. Camper <[email protected]>

Explicit default destructors; make one ParquetHelper pointer private

Signed-off-by: Dan S. Camper <[email protected]>

ParquetHelper: Use single MemoryPool; explicitly release pool memory on destruction

Signed-off-by: Dan S. Camper <[email protected]>

Add vcpkg overlay for building with Arrow 12.0.0 release version.

Add function for delegating row groups to any number of thor workers.

Fix error in dividing up rows between workers and add some additional comments.

Change writing parquet to use the same arrow::MemoryPool*

Add code for reading from arrow dataset (partitioned files).
Fix error in binding string params to rows when writing to parquet.
Add better way of setting single vs multiple reads/writes.

Implemented logic for scanning chunks of an arrow dataset.

Format and cleanup source code.

Add Unimplemented Tag to Scalar parameters for EmbedFunctionContext Bind functions.

Change check in ParquetHelper::next() to correct value.
Use PARQUET_ASSIGN_OR_THROW whenever an arrow::Result object is returned.

Create example for creating and reading a partitioned dataset.

Writing Datasets using HivePartitioning works.
Currently the partitioning schema is hardcoded to the language field in the github dataset.
Streaming larger than memory datasets to a partitioned dataset is not working.

Write each batch out to a different file to avoid collisions of file names.

Clean up file name

Change partitioning behaviour to delete the directory contents before creating a new partition.

Implement thor functionality for writing parquet files.
Directory gets created and emptied if it already existed on write.
Increased default batch size for reading and writing to 10000.
Added critical section to all uses of scanner file operations.

Add WritePartition Macro call

Make a better example.

Start README for parquet plugin.

Added support for Data and Unicode datatypes when writing to parquet.
Removed unnecessary variable.

Change rapidjson allocator to thread_local storage specification to allow for multi threaded writing.

Added arrow conversion methods for arrow binary type.

Changed default RowGroup size to 20000

Added example ecl for reading and writing BLOB data

Be more careful passing around length of the strings.
Add additional conversion to incoming DATA types to UTF8.

Update blob test

Use getProp rather than queryProp to preserve the length of a string result being returned.

Reimplemented ParquetRowBuilder to build rows from rapidjson::Value objects rather than an IPropertyTree.
Fixed Implementation of set datatype.

Update test files.

update create partition example.

Update Example files.

Fix reading rapidjson::Array and add Clear calls to the rapidjson allocator.

Remove calls to GetAllocator and instead pass jsonAlloc in.
Add function for queryingRows from the dataset.

Add function for querying rows from dataset. Gets a RecordBatchReader and a RecordBatchReaderIterator to iterate through the stream of record batches from the dataset.
There is currently a significant limitation in Arrow: dataset iterators do not support random reads, so they must iterate from the beginning to the starting point and cannot end early; on destruction they are iterated to the end of the stream.

Removed rapidjson conversion from RowBuilder implementation.
RowBuilder now builds the fields directly from the arrow::Table that is read from the arrow::FileReader object.
Implemented a ParquetVisitor class for getting the correct datatype from each scalar.
Tested on all test ecl files.

format

Fix issue with Real datatype not being returned properly.

Fixed some performance issues with Arrow FileReader interface.

Cache chunks rather than read them in every time we build a field.

Update test/example files.

Update file reading structure to expect each thor worker to have its own parquet file.

Fix compiler warnings.

Clean up...

Clean up...

Clean up...

Clean up ...

Format source code.

Refactor

Fix Utf8 conversion when returning String Results.

Change open file function call.
Update default row and batch sizes.

Bump vcpkg version of Arrow to 13.0.0

Fix dependency installs.

Fix decimal type.

Remove PathTracker copy constructor.

Change initialization of PathTracker members.

Create ParquetArrayType enum.

Move ParquetDatasetBinder methods to cpp file. Static jsonAlloc can now be moved out of the header file.

Minor change for clarity.

Remove default initializations from constructors.

Fix partition condition for user input.

Fix decimal datatype.
Fix nested structures for all ECL types.

Utf8 type no longer gets translated to string.

Add utf8 characters to example file.

Encapsulate children processing check.

Create function currArrayIndex() from common code across source file.

Change write() to queryWriter() to be more descriptive.
Change return type to pointer to the FileWriter object.

Return references rather than pointers where the object cannot be a nullptr.

Use consistent types, especially in comparisons.

Remove countFields because it is a duplicate.

Remove floating point operation from divide_row_groups.

Thor nodes that don't receive any rows will not open a file.

Remove Critical section when writing.
Each node writes to a unique file.

Add override qualifier to virtual functions in ParquetRowBuilder and ParquetRowStream.

Add default initializers and clean up member variables.

Allow openReadFile to open any files matching the filename chosen by the user.
The files will all be opened and the row counts will be recorded. This will allow for even division of work, and will keep order intact.

Style: Make function names clearer and improve clarity.

Revert plugin collision change.

Fix non null terminated unicode parameters.

Fix null characters in string types.

Data datatype no longer gets converted to utf-8.

Remove extra rtlStrToDataX call in processData.

Add static qualifier to addMember function.

Remove commented lines.

Fix references in Next and DocValuesIterator

Fix constructor argument names.

Remove && for rows in DocValuesIterator constructor.

Use UTF-8 size instead of code-points.
jackdelv committed Sep 28, 2023
1 parent b4167eb commit 0850494
Showing 29 changed files with 3,884 additions and 0 deletions.
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -169,6 +169,7 @@ if ( PLUGIN )
HPCC_ADD_SUBDIRECTORY (plugins/h3 "H3")
HPCC_ADD_SUBDIRECTORY (plugins/nlp "NLP")
HPCC_ADD_SUBDIRECTORY (plugins/mongodb "MONGODBEMBED")
HPCC_ADD_SUBDIRECTORY (plugins/parquet "PARQUETEMBED")
elseif ( NOT MAKE_DOCS_ONLY )
HPCC_ADD_SUBDIRECTORY (system)
HPCC_ADD_SUBDIRECTORY (initfiles)
1 change: 1 addition & 0 deletions cmake_modules/plugins.cmake
@@ -36,6 +36,7 @@ set(PLUGINS_LIST
MONGODBEMBED
MYSQLEMBED
NLP
PARQUETEMBED
REDIS
REMBED
SQLITE3EMBED
1 change: 1 addition & 0 deletions plugins/CMakeLists.txt
@@ -42,6 +42,7 @@ add_subdirectory (exampleplugin)
add_subdirectory (couchbase)
add_subdirectory (sqs)
add_subdirectory (mongodb)
add_subdirectory (parquet)
IF ( INCLUDE_EE_PLUGINS )
add_subdirectory (eeproxies)
ENDIF()
120 changes: 120 additions & 0 deletions plugins/parquet/CMakeLists.txt
@@ -0,0 +1,120 @@
##############################################################################

# HPCC SYSTEMS software Copyright (C) 2022 HPCC Systems®.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##############################################################################

# Component: parquetembed

#############################################################
# Description:
# -----------
# Cmake Input File for parquetembed
#############################################################

project(parquetembed)
message("CMAKE Version: ${CMAKE_VERSION}")

if(PARQUETEMBED)
ADD_PLUGIN(parquetembed)
if(MAKE_PARQUETEMBED)
find_package(Arrow CONFIG REQUIRED)
find_package(Parquet CONFIG REQUIRED)
find_package(ArrowDataset CONFIG REQUIRED)

set(
SRCS
parquetembed.cpp
)

INCLUDE_DIRECTORIES(
${HPCC_SOURCE_DIR}/esp/platform
${HPCC_SOURCE_DIR}/system/include
${HPCC_SOURCE_DIR}/rtl/eclrtl
${HPCC_SOURCE_DIR}/rtl/include
${HPCC_SOURCE_DIR}/rtl/nbcd
${HPCC_SOURCE_DIR}/common/deftype
${HPCC_SOURCE_DIR}/system/jlib
${HPCC_SOURCE_DIR}/roxie/roxiemem
)

HPCC_ADD_LIBRARY(parquetembed SHARED ${SRCS})

install(
TARGETS parquetembed
DESTINATION plugins CALC_DEPS
)

install(
FILES ${LIBARROW_LIB_REAL}
DESTINATION ${LIB_DIR} CALC_DEPS
PERMISSIONS OWNER_WRITE OWNER_READ OWNER_EXECUTE GROUP_READ GROUP_EXECUTE WORLD_READ WORLD_EXECUTE
COMPONENT Runtime)

install(
FILES ${LIBARROW_LIB} ${LIBARROW_LIB_ABI}
DESTINATION ${LIB_DIR} CALC_DEPS
COMPONENT Runtime)

install(
FILES ${LIBPARQUET_LIB_REAL}
DESTINATION ${LIB_DIR} CALC_DEPS
PERMISSIONS OWNER_WRITE OWNER_READ OWNER_EXECUTE GROUP_READ GROUP_EXECUTE WORLD_READ WORLD_EXECUTE
COMPONENT Runtime)

install(
FILES ${LIBPARQUET_LIB} ${LIBPARQUET_LIB_ABI}
DESTINATION ${LIB_DIR} CALC_DEPS
COMPONENT Runtime)

install(
FILES ${LIBARRORACERO_LIB_REAL}
DESTINATION ${LIB_DIR} CALC_DEPS
PERMISSIONS OWNER_WRITE OWNER_READ OWNER_EXECUTE GROUP_READ GROUP_EXECUTE WORLD_READ WORLD_EXECUTE
COMPONENT Runtime)

install(
FILES ${LIBARRORACERO_LIB} ${LIBARROWDATASET_LIB_ABI}
DESTINATION ${LIB_DIR} CALC_DEPS
COMPONENT Runtime)

install(
FILES ${LIBARROWDATASET_LIB_REAL}
DESTINATION ${LIB_DIR} CALC_DEPS
PERMISSIONS OWNER_WRITE OWNER_READ OWNER_EXECUTE GROUP_READ GROUP_EXECUTE WORLD_READ WORLD_EXECUTE
COMPONENT Runtime)

install(
FILES ${LIBARROWDATASET_LIB} ${LIBARROWDATASET_LIB_ABI}
DESTINATION ${LIB_DIR} CALC_DEPS
COMPONENT Runtime)

target_link_libraries(
parquetembed
eclrtl
jlib
Arrow::arrow_shared
Parquet::parquet_shared
ArrowDataset::arrow_dataset_shared
)
endif()
endif()

if(PLATFORM OR CLIENTTOOLS_ONLY)
install(
FILES ${CMAKE_CURRENT_SOURCE_DIR}/parquet.ecllib
DESTINATION plugins
COMPONENT Runtime
)
endif()
59 changes: 59 additions & 0 deletions plugins/parquet/README.md
@@ -0,0 +1,59 @@
# Parquet Plugin for HPCC-Systems

The Parquet Plugin for HPCC-Systems is a powerful tool designed to facilitate the fast transfer of data stored in a columnar format to the ECL (Enterprise Control Language) data format. This plugin provides seamless integration between Parquet files and HPCC-Systems, enabling efficient data processing and analysis.

## Installation

The plugin uses vcpkg and can be installed by creating a separate build directory from the platform and running the following commands:
```
cd ./parquet-build
cmake -DPARQUETEMBED=ON ../HPCC-Platform
make -j4 package
sudo dpkg -i ./hpccsystems-plugin-parquetembed_<version>.deb
```

## Documentation

[Doxygen](https://www.doxygen.nl/index.html) can be used to create nice HTML documentation for the code. Call/caller graphs are also generated for functions if you have [dot](https://www.graphviz.org/download/) installed and available on your path.

Assuming `doxygen` is on your path, you can build the documentation via:
```
cd plugins/parquet
doxygen Doxyfile
```

## Features

The Parquet Plugin offers the following main functions:

### Regular Files

#### 1. Reading Parquet Files

The Read function allows ECL programmers to create an ECL dataset from both regular and partitioned Parquet files. It leverages the Apache Arrow interface for Parquet to efficiently stream data from Parquet files into ECL, ensuring optimized data transfer.

```
dataset := Read(layout, '/source/directory/data.parquet');
```

#### 2. Writing Parquet Files

The Write function empowers ECL programmers to write ECL datasets to Parquet files. By leveraging the Parquet format's columnar storage capabilities, this function provides efficient compression and optimized storage for data.

```
Write(inDataset, '/output/directory/data.parquet');
```

### Partitioned Files (Tabular Datasets)

#### 1. Reading Partitioned Files

The ReadPartition function extends the Read functionality by enabling ECL programmers to read from partitioned Parquet files.

```
github_dataset := ReadPartition(layout, '/source/directory/partitioned_dataset');
```

#### 2. Writing Partitioned Files

To write partitioned Parquet files, run the Write function on Thor rather than hThor; each worker will then create its own Parquet file.
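
A minimal sketch of a partitioned write (the record layout and paths are illustrative; it mirrors the `create_partition.ecl` example included in this commit):

```
IMPORT Parquet;

layout := RECORD
    STRING actor_login;
    INTEGER actor_id;
    STRING language;
END;

github_dataset := Read(layout, '/source/directory/data.parquet');

// Submit this job to Thor; each worker then writes its own file under the target directory.
Write(DISTRIBUTE(github_dataset, SKEW(.05)), '/output/directory/partitioned_dataset/data.parquet');
```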
20 changes: 20 additions & 0 deletions plugins/parquet/examples/blob_test.ecl
@@ -0,0 +1,20 @@
IMPORT STD;
IMPORT PARQUET;

imageRecord := RECORD
STRING filename;
DATA image;
UNSIGNED8 RecPos{virtual(fileposition)};
END;

#IF(0)
in_image_data := DATASET('~parquet::image', imageRecord, FLAT);
OUTPUT(in_image_data, NAMED('IN_IMAGE_DATA'));
PARQUET.Write(in_image_data, '/datadrive/dev/test_data/test_image.parquet');

#END;

#IF(1)
out_image_data := Read({DATA image}, '/datadrive/dev/test_data/test_image.parquet');
OUTPUT(out_image_data, NAMED('OUT_IMAGE_DATA'));
#END
29 changes: 29 additions & 0 deletions plugins/parquet/examples/create_partition.ecl
@@ -0,0 +1,29 @@
IMPORT STD;
IMPORT Parquet;

#OPTION('outputLimit', 2000);
#OPTION('pickBestEngine', FALSE);

layout := RECORD
STRING actor_login;
INTEGER actor_id;
INTEGER comment_id;
STRING comment;
STRING repo;
STRING language;
STRING author_login;
INTEGER author_id;
INTEGER pr_id;
INTEGER c_id;
INTEGER commit_date;
END;

#IF(0)
github_dataset := Read(layout, '/datadrive/dev/test_data/ghtorrent-2019-01-07.parquet');
Write(DISTRIBUTE(github_dataset, SKEW(.05)), '/datadrive/dev/test_data/hpcc_gh_partition/data.parquet');
#END

#IF(1)
github_dataset := ReadPartition(layout, '/datadrive/dev/test_data/hpcc_gh_partition');
OUTPUT(COUNT(github_dataset), NAMED('GITHUB_PARTITION'));
#END
17 changes: 17 additions & 0 deletions plugins/parquet/examples/decimal_test.ecl
@@ -0,0 +1,17 @@
IMPORT STD;
IMPORT PARQUET;


layout := RECORD
DECIMAL5_2 height;
END;

decimal_data := DATASET([{152.25}, {125.56}], layout);

#IF(1)
Write(decimal_data, '/datadrive/dev/test_data/decimal.parquet');
#END

#IF(1)
Read(layout, '/datadrive/dev/test_data/decimal.parquet');
#END
29 changes: 29 additions & 0 deletions plugins/parquet/examples/large_io.ecl
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
IMPORT STD;
IMPORT Parquet;

#OPTION('outputLimit', 2000);
#OPTION('pickBestEngine', FALSE);

layout := RECORD
STRING actor_login;
INTEGER actor_id;
INTEGER comment_id;
STRING comment;
STRING repo;
STRING language;
STRING author_login;
INTEGER author_id;
INTEGER pr_id;
INTEGER c_id;
INTEGER commit_date;
END;

#IF(0)
csv_data := DATASET('~parquet::large::ghtorrent-2019-02-04.csv', layout, CSV(HEADING(1)));
Write(csv_data, '/datadrive/dev/test_data/ghtorrent-2019-02-04.parquet');
#END

#IF(1)
parquet_data := Read(layout, '/datadrive/dev/test_data/hpcc_gh_partition/data.parquet');
OUTPUT(COUNT(parquet_data), NAMED('ghtorrent_2019_01_07'));
#END
30 changes: 30 additions & 0 deletions plugins/parquet/examples/nested_io.ecl
@@ -0,0 +1,30 @@
IMPORT Parquet;

friendsRec :=RECORD
UNSIGNED4 age;
INTEGER2 friends;
SET OF STRING friendsList;
END;

childRec := RECORD
friendsRec friends;
REAL height;
REAL weight;
END;

parentRec := RECORD
UTF8_de firstname;
UTF8_de lastname;
childRec details;
END;
nested_dataset := DATASET([{U'J\353ck', U'\353ackson', { {22, 2, ['James', 'Jonathon']}, 5.9, 600}}, {'John', 'Johnson', { {17, 0, []}, 6.3, 18}},
{'Amy', U'Amy\353on', { {59, 1, ['Andy']}, 3.9, 59}}, {'Grace', U'Graceso\353', { {11, 3, ['Grayson', 'Gina', 'George']}, 7.9, 100}}], parentRec);

#IF(1)
Write(nested_dataset, '/datadrive/dev/test_data/nested.parquet');
#END

#IF(1)
read_in := Read(parentRec, '/datadrive/dev/test_data/nested.parquet');
OUTPUT(read_in, NAMED('NESTED_PARQUET_IO'));
#END