Commit
- Initial commit. Add basic EmbedFunctionContext in the cpp files. Add Arrow includes. Fix header conflicts before rebase. Changes after rebase (not sure what this does).
- Added an example file to show reading from and writing to a parquet file. Add function calls to the ECL file.
- Added some basic options for embedding the parquet functions; these may not stay the same.
- Added install calls so the shared library gets installed. Fixed typo.
- Added function instantiations for the header functions, some useful function definitions, and definitions for some of the ParquetEmbedFunctionContext and ParquetRecordBinder functions.
- Added a ParquetHelper class for holding the user inputs and opening the stream reader and writer. Added a dataset argument to the write function to test writing ECL datasets.
- Add a function to ParquetDatasetBinder that gets datatypes from the RtlFieldInfo object and builds the schema. Fix errors in the getFieldTypes and addField methods. The schema now gets built from the parameter dataset; writing to parquet files works.
- Set up ParquetRowStream to read parquet and output it to an ECL record. Can read a parquet table from a parquet file.
- Add RowBuilder methods for building result rows, and a RowBatchBuilder for converting parquet tables to rows. There seems to be a namespace conflict with rapidjson::Value.
- Add #define RAPIDJSON_HAS_STDSTRING 1, which allows the use of some helpful GenericValue functions (see the sketch below).
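A minimal sketch of what that define enables: with RAPIDJSON_HAS_STDSTRING set to 1 before the headers are included, GenericValue gains std::string overloads. This is illustrative, not the plugin's code.

```cpp
#define RAPIDJSON_HAS_STDSTRING 1
#include "rapidjson/document.h"

#include <string>

int main()
{
    rapidjson::Document doc;
    doc.SetObject();
    auto &alloc = doc.GetAllocator();

    std::string key = "language";
    std::string val = "ECL";

    // These GenericValue constructors taking std::string exist only when
    // RAPIDJSON_HAS_STDSTRING is defined as 1.
    doc.AddMember(rapidjson::Value(key, alloc), rapidjson::Value(val, alloc), alloc);
    return 0;
}
```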
- Implemented ParquetRowStream::nextRow, which gets each result row from the parquet iterator and converts it to JSON using rapidjson. Fixed an error where one extra row was read from the file, causing a failure.
- Remove credentials.
- Fix an issue where UTF-8 encoded parameters were not properly converted to strings, and further issues reading and writing UTF-8 strings.
- Add code for getting the schema for nested types. Add an additional feature from Apache Arrow: code for reading partitioned files. Comment out the partition code for now. Edit the ECL files.
- The linker was not properly linking the dataset shared library; manually link the Arrow Dataset shared library.
- Added part of the code for writing to partitioned files.
- Replace the StreamWriter API with the rapidjson API. Add classes for converting rapidjson::Documents to arrow::RecordBatches for writing. Moved the allocator for rapidjson documents to the top level of the parquetembed namespace.
- The schema needs to be correctly passed through. Fix an issue with writing to the vector of rapidjson documents.
- Add a visit method for the HalfFloat datatype.
- Changed FileWriter creation to write in the parquet format, and moved it from ExecuteAll to openWriteFile so it only gets called once. Changed write() to return the FileWriter instead of the FileOutputStream.
- Writing in batches now works, but further testing on edge cases and vector reuse is needed. Removed old code; currently in a working state, though not tested very much.
- Read files one RowGroup at a time, then iterate through each row (see the sketch below).
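A minimal sketch of reading one RowGroup at a time with parquet::arrow::FileReader, the Arrow API the log names; the function and path handling are illustrative, not the plugin's actual code.

```cpp
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

#include <iostream>
#include <memory>
#include <string>

arrow::Status readByRowGroup(const std::string &path)
{
    ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

    // Pull in one RowGroup at a time instead of the whole file, then walk its rows.
    for (int rg = 0; rg < reader->num_row_groups(); rg++)
    {
        std::shared_ptr<arrow::Table> table;
        ARROW_RETURN_NOT_OK(reader->ReadRowGroup(rg, &table));
        std::cout << "RowGroup " << rg << ": " << table->num_rows() << " rows\n";
    }
    return arrow::Status::OK();
}
```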
- Added a file for testing reading and writing large parquet files. Minor changes to the simple example.
- Tidy up a bit. Created a file for testing nested data; added support for nested objects.
- Refactored code and moved function implementations to the cpp file. Added comments and documentation.
- The rapidjson MemoryPoolAllocator is now cleared after writing each RowGroup (see the sketch below).
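A sketch of that allocator-reuse pattern: one pool backing the batch's documents, cleared once the RowGroup has been flushed so memory use stays flat across a long write. The function name and surrounding flow are assumptions.

```cpp
#include "rapidjson/document.h"

#include <vector>

// Allocator at the top level of the namespace, as the log describes;
// the plugin's real declaration may differ.
static rapidjson::MemoryPoolAllocator<> jsonAlloc;

void flushRowGroup(std::vector<rapidjson::Document> &rows)
{
    // ... convert `rows` into an arrow::RecordBatch and hand it to the FileWriter ...
    rows.clear();
    jsonAlloc.Clear(); // release everything the pool handed out for this RowGroup
}
```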
- Rebase onto 8.12.x complete and tested.
- HPCC-28689 Refactor jhtree class structure to support additional node compression methods. Sort out concerns about usage and semantics of isLeaf(); also a few trivial changes suggested by earlier code review. Signed-off-by: Richard Chapman <[email protected]>
- Fixup build and install. Signed-off-by: Dan S. Camper <[email protected]>
- Change type mapping: unsigned and signed integers are now mapped explicitly to the 64- and 32-bit versions of each. Add different byte sizes for testing the type mapping. Refactor JsonValueConverter and add methods for UInt64/UInt32 and Int32 for the explicit type mapping (see the sketch below).
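A hedged sketch of what that explicit mapping might look like on the schema side, routing integers to 32- or 64-bit Arrow types by byte size instead of a single default width; the helper name and signature are illustrative, not the plugin's.

```cpp
#include <arrow/api.h>

#include <memory>

// Map an integer field to an explicit Arrow width by its byte size.
std::shared_ptr<arrow::DataType> mapIntegerType(unsigned size, bool isSigned)
{
    if (size > 4)
        return isSigned ? arrow::int64() : arrow::uint64();
    return isSigned ? arrow::int32() : arrow::uint32();
}
```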
- Change the interface for calling the plugin.
- Added information about which Thor worker we are running on; divides the row groups (poorly) by the number of workers. Updated test files.
- Change which API is used for opening files (WIP); minor changes to opening the parquet file.
- Update FileReader usage to comply with version 12.0.0 standards.
- Add an overlay for pulling down the forked arrow directory. Add Arrow to vcpkg.json.in with the correct hash for the current arrow commit; add PARQUETEMBED to plugins.cmake; update the arrow fork SHA512. Add a vcpkg overlay for building with the Arrow 12.0.0 release version.
- Add a function to close the file, resetting the unique_ptr once the plugin is done streaming rows to the engine.
- Remove trailing whitespace. Explicit default destructors; make one ParquetHelper pointer private. ParquetHelper: use a single MemoryPool and explicitly release pool memory on destruction. Signed-off-by: Dan S. Camper <[email protected]>
- Add a function for delegating row groups to any number of Thor workers. Fix an error in dividing up rows between workers and add some additional comments (see the sketch below).
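A sketch of how such a division can be done; a later entry removes the floating-point operation from divide_row_groups, so this version uses integer arithmetic only, giving the first `extra` workers one additional group. Only the function name comes from the log; the signature is an assumption.

```cpp
#include <cstdint>

// Assign each worker a contiguous, near-even share of the row groups.
void divide_row_groups(uint64_t worker, uint64_t numWorkers, uint64_t numRowGroups,
                       uint64_t &startRowGroup, uint64_t &rowGroupCount)
{
    uint64_t base = numRowGroups / numWorkers;  // groups every worker gets
    uint64_t extra = numRowGroups % numWorkers; // leftover groups to spread
    rowGroupCount = base + (worker < extra ? 1 : 0);
    startRowGroup = worker * base + (worker < extra ? worker : extra);
    // A worker whose count is zero never opens the file.
}
```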
- Change writing parquet to use the same arrow::MemoryPool*. Add code for reading from an arrow dataset (partitioned files).
- Fix an error in binding string parameters to rows when writing to parquet.
- Add a better way of setting single vs. multiple reads/writes. Implemented the logic for scanning chunks of an arrow dataset. Format and clean up the source code.
- Add an Unimplemented tag to scalar parameters for the EmbedFunctionContext bind functions.
- Change a check in ParquetHelper::next() to the correct value.
- Use PARQUET_ASSIGN_OR_THROW whenever an arrow::Result object is returned (see the sketch below).
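A minimal sketch of that pattern: PARQUET_ASSIGN_OR_THROW unwraps an arrow::Result<T> into the left-hand side, throwing parquet::ParquetException on error instead of requiring a status check at every call site. The wrapper function is illustrative.

```cpp
#include <arrow/io/file.h>
#include <parquet/exception.h>

#include <memory>
#include <string>

std::shared_ptr<arrow::io::ReadableFile> openOrThrow(const std::string &path)
{
    std::shared_ptr<arrow::io::ReadableFile> infile;
    // Unwrap the arrow::Result; throws parquet::ParquetException on failure.
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(path));
    return infile;
}
```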
- Create an example for creating and reading a partitioned dataset. Writing datasets using HivePartitioning works; currently the partitioning schema is hardcoded to the language field in the github dataset. Streaming larger-than-memory datasets to a partitioned dataset is not working.
- Write each batch out to a different file to avoid file-name collisions; clean up the file names. Change the partitioning behaviour to delete the directory contents before creating a new partition.
- Implement Thor functionality for writing parquet files: the directory gets created on write, and emptied if it already existed.
- Increased the default batch size for reading and writing to 10000. Added a critical section to all uses of scanner file operations.
- Add the WritePartition macro call. Make a better example. Start a README for the parquet plugin.
- Added support for the DATA and UNICODE datatypes when writing to parquet; removed an unnecessary variable.
- Change the rapidjson allocator to the thread_local storage specification to allow multi-threaded writing. Added arrow conversion methods for the arrow binary type. Changed the default RowGroup size to 20000.
- Added example ECL for reading and writing BLOB data. Be more careful passing around the lengths of strings, and add conversion of incoming DATA types to UTF-8. Update the blob test. Use getProp rather than queryProp to preserve the length of a string result being returned.
- Reimplemented ParquetRowBuilder to build rows from rapidjson::Value objects rather than an IPropertyTree. Fixed the implementation of the SET datatype. Updated the test files, the create-partition example, and the example files.
- Fix reading rapidjson::Array and add Clear calls to the rapidjson allocator; remove calls to GetAllocator and pass jsonAlloc in instead.
- Add a function for querying rows from the dataset: it gets a RecordBatchReader and a RecordBatchReaderIterator to iterate through the stream of record batches. There is currently a large limitation in arrow: dataset iterators do not support random reads; they must iterate from the beginning to the starting point, they cannot end early, and on destruction they are iterated to the end of the stream (see the sketch below).
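A hedged sketch of streaming a partitioned dataset as record batches with the Arrow Dataset API the log refers to; the directory discovery and loop shape are assumptions, and the forward-only limitation noted above applies to the reader.

```cpp
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/api.h>

#include <memory>
#include <string>

arrow::Status scanDataset(const std::string &baseDir)
{
    auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
    arrow::fs::FileSelector selector;
    selector.base_dir = baseDir;
    selector.recursive = true; // walk the partition directories

    ARROW_ASSIGN_OR_RAISE(auto factory,
        arrow::dataset::FileSystemDatasetFactory::Make(
            fs, selector, std::make_shared<arrow::dataset::ParquetFileFormat>(),
            arrow::dataset::FileSystemFactoryOptions()));
    ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

    ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
    ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

    // Stream record batches; the reader cannot seek, only move forward.
    ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());
    std::shared_ptr<arrow::RecordBatch> batch;
    while (true)
    {
        ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
        if (!batch)
            break; // end of stream
        // ... build result rows from `batch` ...
    }
    return arrow::Status::OK();
}
```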
- Removed the rapidjson conversion from the RowBuilder implementation; RowBuilder now builds the fields directly from the arrow::Table read from the arrow::FileReader object. Implemented a ParquetVisitor class for getting the correct datatype from each scalar. Tested on all of the test ECL files.
- Fix an issue with the REAL datatype not being returned properly.
- Fixed some performance issues with the Arrow FileReader interface: cache chunks rather than reading them in every time we build a field (a sketch follows this list).
- Update the test/example files. Update the file-reading structure to expect each Thor worker to have its own parquet file. Fix compiler warnings. Clean up, format, and refactor the source code. Fix UTF-8 conversion when returning string results. Change the open-file function call.
- Update the default row and batch sizes. Bump the vcpkg version of Arrow to 13.0.0 and fix the dependency installs.
- Fix the decimal type. Remove the PathTracker copy constructor and change the initialization of PathTracker members. Create a ParquetArrayType enum.
- Move the ParquetDatasetBinder methods to the cpp file; the static jsonAlloc can now be moved out of the header file. A minor change for clarity. Remove default initializations from constructors.
- Fix the partition condition for user input. Fix the decimal datatype.
- Fix nested structures for all ECL types. The Utf8 type no longer gets translated to string; add UTF-8 characters to the example file.
- Encapsulate the children-processing check. Create the function currArrayIndex() from common code across the source file.
- Change write() to queryWriter() to be more descriptive, and change its return type to a pointer to the FileWriter object. Return references rather than pointers where the object cannot be a nullptr. Use consistent types, especially in comparisons. Remove countFields because it is a duplicate.
- Remove the floating-point operation from divide_row_groups. Thor nodes that don't receive any rows will not open a file. Remove the critical section when writing; each node writes to a unique file.
- Add the override qualifier to virtual functions in ParquetRowBuilder and ParquetRowStream. Add default initializers and clean up member variables.
- Allow openReadFile to open any files matching the filename chosen by the user; all matching files are opened and their row counts recorded, which allows an even division of work while keeping order intact.
- Style: make function names clearer and improve clarity. Revert the plugin collision change.
- Fix non-null-terminated unicode parameters. Fix null characters in string types. The DATA datatype no longer gets converted to UTF-8; remove an extra rtlStrToDataX call in processData.
- Add the static qualifier to the addMember function. Remove commented-out lines.
- Fix the references in Next and DocValuesIterator, fix the constructor argument names, and remove && for rows in the DocValuesIterator constructor.
- Use the UTF-8 size instead of code points.
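A hedged sketch of the chunk caching mentioned above: rather than searching for the containing chunk on every field read, remember the current chunk and its starting row, advancing only when the requested row passes its end. The class and member names are illustrative, not the plugin's.

```cpp
#include <arrow/api.h>

#include <cstdint>
#include <memory>
#include <utility>

class ChunkCache
{
public:
    explicit ChunkCache(std::shared_ptr<arrow::ChunkedArray> column)
        : column_(std::move(column)) {}

    // Return the chunk holding `row` (rows assumed to be visited in order),
    // rewriting `row` to an offset within that chunk.
    std::shared_ptr<arrow::Array> chunkFor(int64_t &row)
    {
        while (row >= chunkStart_ + column_->chunk(chunkIdx_)->length())
        {
            chunkStart_ += column_->chunk(chunkIdx_)->length();
            chunkIdx_++; // advance the cached position; never rescan from zero
        }
        row -= chunkStart_;
        return column_->chunk(chunkIdx_);
    }

private:
    std::shared_ptr<arrow::ChunkedArray> column_;
    int chunkIdx_ = 0;       // chunk currently cached
    int64_t chunkStart_ = 0; // absolute row index where that chunk begins
};
```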