Intel Big Data Analytic Toolkit (abbrev. BDTK) is a set of acceleration libraries aimed to optimize big data analytic frameworks.
By using this library, frontend SQL engines like Prestodb/Spark query performance will be significant improved.
For big data analytic framework users, it becomes more and more significant needs for better performance. And most of existing big data analytic frameworks are built via Java and it's designed mostly for CPU only computation. To unblock performance further to bare metal hardware, native implementation and leveraging state-of-art hardwares are employed in this toolkit.
Furthermore, assembling & building becomes a new trend for data analytic solution providers. More and more SQL based solutions were built based on some primitive building blocks over the last five years. Having some performt OOB building blocks (as libraries) can significantly reduce time-to-value for building everything from scratch. With such general-purpose toolkit, it can significantly reduce time-to-value for analytic solution developers.
BDTK focuses on following areas:
- End-users of big data analytic frameworks who're looking for performance acceleration
- Data engineers who want some Intel architecture-based optimizations
- Database developers who're seeking for reusable building blocks
- Data Scientist who looks for heterogenous execution
Users can reuse implemented operators/functions to build a full-featured SQL engine. Currently this library offers a highly optimized compiler to JITed function for execution.
Building blocks utilizing compression codec (based on IAA, QAT) can be used directly to Hadoop/Spark for compression acceleration.
Below comes the view from personas for this project.
The following diagram shows the design architecture. Currently, it offers a few building blocks including a lightweight LLVM based SQL compiler(Cider) on top of Arrow data format, ICL - a compression codec leveraging Intel IAA accelerator, QATCodec - compression codec wrapper based on Intel QAT accelerator.
-
BDTK could provide a Presto End-to-End accelaration solution via Velox and Velox-Plugin.
-
Velox-plugin is a bridge to enable Big Data Analytic Toolkit onto Velox. It introduces hybrid execution mode for both compilation and vectorization (existed in Velox). It works as a plugin to Velox seamlessly without changing Velox code.
-
A modularized and general-purposed Just-In-Time (JIT) compiler for data analytic query engine. It employs Substrait as a protocol allowing to support multiple front-end engines. Currently it provides a LLVM based implementation based on HeavyDB.
-
-
Analytic Cache targets to improve data source side performance for multiple bigdata analytic framework such as Apache Spark and Apache Flink. Compare to other row based execution engine, Analytic Cache could utilize column format and do batch computation, which will boost performance in Ad-hoc queries. Meanwhile, Analytic Cache provide QAT codec accelaration and IAA predicition pushing down.
BDTK provides several functional modules for user to use or integrate into their product. Here are breif description for each module. Details can be found on Module Page
Intel Codec Library module provides compression and decompression library for Apache Hadoop/Spark to make use of the acceleration hardware for compression/decompression. It not only can leverage QAT/IAA hardware to accelerate deflate-compatible data compression algorithms but also supports the use of Intel software optimized solutions such as Intel ISA-L(Intel Intelligent Storage Acceleration Library and IPP(Intel Integrated Performance Primitives Library) to accelerate the data compression.
Hash table performance is critical to a SQL engine. Operators like hash join, hash aggregation count on an efficient hash table implementation.
Hash table module will provide a bunch of hash table implementations, which are easy to use, leverage state of art hardware technology like AVX-512, will be optimized for query-specific scenarios.
JIT Lib module provides unified JIT interfaces like Value, Ptr, control flow and .etc to isolate operator logic and IR generation
Expression evaluation module could do Projection/Filter computation effectively. It provides a runtime expression evaluation API which accept Substrait based expression representation and Apache Arrow based column format data representation. It only handles projection and filters currently.
BDTK implements typical SQL operators based on JitLib, provide a batch-at-a-time execution model. Each operator support plug and play, could easily integrated into other existing sql-engines. Operators BDTK target to supported includes: HashAggregation, HashJoin(HashBuild and HashProbe), etc.
Current supported features are available on Project Page. Newly supported feature in release 0.9 is available at release page.
git clone --recursive https://github.com/intel/BDTK.git
cd BDTK
# if you are updating an existing checkout
git submodule sync --recursive
git submodule update --init --recursive
We provide Dockerfile to help developers setup and install BDTK dependencies.
- Build an image from a Dockerfile
$ cd ${path_to_source_of_bdtk}/ci/docker
$ docker build -t ${image_name} .
- Start a docker container for development
$ docker run -d --name ${container_name} --privileged=true -v ${path_to_source_of_bdtk}:/workspace/bdtk ${image_name} /usr/sbin/init
Once you have setup the Docker build environment for BDTK and get the source, you can enter the BDTK container and build like:
Run make
in the root directory to compile the sources. For development, use
make debug
to build a non-optimized debug version, or make release
to build
an optimized version. Use make test-debug
or make test-release
to run tests.
To use it with Prestodb, Intel version Prestodb is required together with Intel version Velox. Detailed steps are available at installation guide.
In the next coming release, following working items were prioritized.
- Better test coverage for entire library
- Better robustness and enable more implemented features in Prestodb as pilot SQL engine, by improving offloading framework
- Better extensibility at multi-levels (incl. relational algebra operator, expression function, data format), by adopting state-of-art compiler design (multi-levels)
- Complete Arrow format migration
- Next-gen codegen framework
- Support large volume data processing
- Advanced features development
Big Data Analytic Toolkit's Code of Conduct can be found here.
You can find the all the Big Data Analytic Toolkit documents on the project web page.
Big Data Analytic Toolkit is licensed under the Apache 2.0 License. A copy of the license can be found here.