Distributable UCC Discovery Algorithm based on Akka - Implementation for my ongoing master thesis
Nowadays, tables in databases contain a lot of data. They do not only have many rows but also many columns. Tables are so large that the data becomes difficult to understand. Most often, structural information or documentation of the data is missing. It is of utmost importance to understand the data in order to work with them and query the data. Unique column combinations (uniques) are sets of columns with no duplicated rows. If at least one row of a combination contains the same values as another, we speak of a non-unique column combination. Unique column combinations are important when it comes to identifying keys in a data set. They are also used for other tasks such as query optimizations or data cleansing.
Existing state-of-the-art algorithms like HyUCC mainly work on one CPU and are therefore limited to the computing power of this machine. The DUCC algorithm is an exception and has been implemented in a distributed way. In some cases, the discovery of unique column combinations takes so long that the runtime is no longer affordable.
A distributed solution is needed to solve the problem for larger tables.
Thomas Bläsius et. al have introduced a two-step, single threaded algorithm [ website | paper ] that finds all unique column combinations, whose search space is not represented by a lattice (like traditional algorithms and thus better for distribution). In the first step, all minimal difference sets are formed from the data. A difference set indicates at which positions (columns) the two rows have different values. For this step, all rows are compared pairwise to form difference sets and the resulting sets are then compared in pairs to find all minimal sets. In the second step, unique column combinations are formed from the minimal difference sets by tree search based on the difference sets. The slowest step in this algorithm is the discovery of the difference sets and, therefore, the one with the highest potential for parallelization.
Xu Chu et al. introduced a way to distribute data for an algorithm that compares the columns pairwise [ publication ]. They called it the Triangle Distribution Strategy. Since the above presented algorithm also compares the columns pairwise in the first step and the actual comparison operation is interchangeable, we will use this strategy for the first step.
We have decided not to build a master-slave architecture. This is easier to develop, but brings performance and scaling problems with it. The result is a peer-to-peer network that regulates itself. There is no master that takes over the communication and therefore has too little load or can function as a single point of failure.
There are exactly 2 synchronization steps in the entire algorithm. A node reads the data and distributes it to all other nodes and itself. From then on, each node processes its tasks independently. This is most of the algorithm timewise. When the nodes are finished, there is a large synchronization point. Once this point is reached it goes into step two. Here each node has all necessary data to work autonomously or to communicate with other nodes. Finally there is the last synchronization to finish the algorithm and distribute the result to all.
In order to build and execute the code, you will need Java 8 and Maven.
To make sure that your project is set up correctly in an IDE, you can run the tests in the hitucc/src/test/java
folder. If you are operating from a command line instead, run mvn test
in the folder with pom.xml
file.
docker:stage
- create dockerfiledocker:publishLocal
- create docker container
You can start the algorithm via docker-compose up. Do not forget to change the mount path in the docker-compose.yml first.