WIP: FastADC implementation #470

ol-imorozko · 2024-10-05T17:41:11Z

No description provided.

This class represents an operator used in predicates for Denial Constraint (DC) representation. The operators used there are less, greater, eq, neq, geq and leq. This C++ implementation is 100% bad: a whole bunch of objects is being created, but conceptually all of them are the same. An object "Operator '+'" and another object "Operator '+'" represent the same thing.

This is just a copy of the Java code from https://github.com/RangerShaw/FastADC. I will refactor and think about a better implementation later. I'll just be copying the Java code to get a working algorithm ASAP, and after that I'll start thinking about a good implementation.

This commit adds the test_dc_structures.cpp file, which will be used to test the various data structures required for DC representation (there are a lot).

This class represents a column operand within a predicate for FastADC. FastADC processes Denial Constraints (DCs) that involve comparisons between pairs of rows within a dataset. A typical DC example, derived from a Functional Dependency (FD) such as A -> B, is expressed as: ∀𝑡, 𝑠 ∈ 𝑟, ¬(𝑡.𝐴 = 𝑠.𝐴 ∧ 𝑡.𝐵 ≠ 𝑠.𝐵). This denotes that for any pair of rows in the relation, it should not be the case that while the values in column "A" are equal, the values in column "B" are unequal. A predicate in this context (e.g., 𝑡.𝐴 = 𝑠.𝐴) comprises three elements to be fully represented: the column operand from the first tuple ("t.A"), the comparison operator ("="), and the column operand from the second tuple ("s.A"). The `ColumnOperand` class encapsulates the column operand part of a predicate, such as "t.A" or "s.A".

The first step in the FastADC algorithm is to build the so-called "Predicate Space". This is a long process during which many places in the code want to get a Predicate, but each predicate is stored in a global storage (a map). In the Java code this class (and other similar "provider" classes) are singletons. BaseProvider is the class from which every *Provider class should be derived. It ensures that only the PredicateBuilder class can initialize and free these singletons. I'm sure there is a better approach, where we store Provider classes in some fields to bind their lifetime more explicitly, but this is how it's done in Java, and I don't have much time to devise a perfect architecture.

This class acts as a centralized storage to manage and provide access to Predicate objects. A Predicate is defined as "t1.A_i op t2.A_j", where t1 and t2 represent different rows, and A_i and A_j are columns (which may be the same or different). The FastADC algorithm will first build a so-called "Predicate Space", which is the set of all predicates that are allowed on R (a set of rows, basically a table). In order to create and store predicates, this commit implements a singleton class with a hashmap storage.
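A minimal sketch of the "centralized storage" idea, for illustration only. The names `OpKind`, `PredicateSketch` and `PredicateProviderSketch` are hypothetical and are not the classes from this PR; a `std::map` is used instead of a hashmap only to avoid writing a hash function for the key.

```cpp
#include <cstddef>
#include <map>
#include <tuple>

// Hypothetical operator enum; the real Operator class in the PR may differ.
enum class OpKind { kEqual, kUnequal, kLess, kGreater, kLessEqual, kGreaterEqual };

// Hypothetical predicate "t1.A_left op t2.A_right", identified by column indexes.
struct PredicateSketch {
    OpKind op;
    std::size_t left_column;
    std::size_t right_column;
};

// Stores each distinct predicate exactly once and hands out stable pointers.
class PredicateProviderSketch {
public:
    // Returns the stored predicate, creating it on the first request.
    PredicateSketch const* GetPredicate(OpKind op, std::size_t left, std::size_t right) {
        Key const key{op, left, right};
        auto it = storage_.try_emplace(key, PredicateSketch{op, left, right}).first;
        return &it->second;  // std::map never relocates its elements
    }

    std::size_t Size() const {
        return storage_.size();
    }

private:
    using Key = std::tuple<OpKind, std::size_t, std::size_t>;
    std::map<Key, PredicateSketch> storage_;
};
```

Requesting the same operator and column pair twice returns the same pointer, so predicates can be compared by address.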

FastADC processes Denial Constraints (DCs) that involve comparisons between pairs of rows within a dataset. A typical DC example, derived from a Functional Dependency such as A -> B, is expressed as: `forall t, s in r, not (t.A = s.A and t.B != s.B)`. This denotes that for any pair of rows in the relation, it should not be the case that while the values in column "A" are equal, the values in column "B" are unequal. A predicate in this context (e.g., t.A = s.A) comprises three elements to be fully represented: the column operand from the first tuple ("t.A"), the comparison operator ("="), and the column operand from the second tuple ("s.A").
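The DC semantics described above can be sketched as a brute-force check over all ordered pairs of distinct rows. This is only an illustration on an integer table; `Op`, `SimplePredicate` and `DcHolds` are hypothetical names and not part of this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified types for illustration only.
enum class Op { kEqual, kUnequal, kLess, kGreater, kLessEqual, kGreaterEqual };

using Row = std::vector<std::int64_t>;
using Table = std::vector<Row>;

// A predicate "t.A_left op s.A_right": a column of the first tuple,
// a comparison operator, and a column of the second tuple.
struct SimplePredicate {
    std::size_t left_column;
    Op op;
    std::size_t right_column;

    bool Holds(Row const& t, Row const& s) const {
        std::int64_t const a = t[left_column];
        std::int64_t const b = s[right_column];
        switch (op) {
            case Op::kEqual: return a == b;
            case Op::kUnequal: return a != b;
            case Op::kLess: return a < b;
            case Op::kGreater: return a > b;
            case Op::kLessEqual: return a <= b;
            case Op::kGreaterEqual: return a >= b;
        }
        return false;
    }
};

// A DC forbids all of its predicates from holding at once for any ordered
// pair of distinct rows. Returns true if no pair of rows violates it.
bool DcHolds(Table const& table, std::vector<SimplePredicate> const& dc) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        for (std::size_t j = 0; j < table.size(); ++j) {
            if (i == j) continue;
            bool all_hold = true;
            for (SimplePredicate const& predicate : dc) {
                if (!predicate.Holds(table[i], table[j])) {
                    all_hold = false;
                    break;
                }
            }
            if (all_hold) return false;  // this pair of rows violates the DC
        }
    }
    return true;
}
```

With columns A (index 0) and B (index 1), the FD A -> B corresponds to the predicate list `{{0, Op::kEqual, 0}, {1, Op::kUnequal, 1}}`.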

This simple test creates two predicates on a 2x2 table and evaluates them. We're checking the ability of the mo::GetPredicate function to correctly create a predicate.

In the original FastADC pull request this class manages creation of predicates, so it initializes PredicateProvider. But in this PR this class is not required for DC verification. Hence adding a temporary class just to make the tests work.

The TypedColumnData kInt type is int64_t, and the FastADC algorithm uses 64-bit long types.

FastADC algorithm for mining approximate Denial Constraints will be implemented here.

IndexProvider assigns a unique index to each distinct object of type T added to it. It will be used later for two main operations:
1. Map all predicates to numbers, so that dynamic bitsets can be used for quick intersection, etc.
2. Hash all values in the table while keeping their relative order (ignoring columns of DC-unsupported types; only ints, doubles and strings are allowed). That is, equal values are substituted by the same integer, and higher values are replaced by larger integers.
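A minimal sketch of such an index provider, assuming `T` is hashable and ordered. `SimpleIndexProvider` and its method names are hypothetical; the class in this PR may look different.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch, not the IndexProvider from the PR.
template <typename T>
class SimpleIndexProvider {
public:
    // Returns the index of `object`, assigning the next free index on first sight.
    std::size_t GetIndex(T const& object) {
        auto [it, inserted] = indexes_.try_emplace(object, objects_.size());
        if (inserted) objects_.push_back(object);
        return it->second;
    }

    T const& GetObject(std::size_t index) const {
        return objects_.at(index);
    }

    std::size_t Size() const {
        return objects_.size();
    }

    // Reassigns indexes so that larger objects get larger indexes.
    // Indexes handed out before this call are no longer valid.
    void Sort() {
        std::sort(objects_.begin(), objects_.end());
        for (std::size_t i = 0; i < objects_.size(); ++i) {
            indexes_[objects_[i]] = i;
        }
    }

private:
    std::vector<T> objects_;
    std::unordered_map<T, std::size_t> indexes_;
};
```

After all values of a column have been added and `Sort()` has been called, `GetIndex` acts as an order-preserving hash: equal values map to equal integers and larger values map to larger integers.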

The FastADC algorithm decides which column pairs to use when creating predicates by checking whether they are comparable with `==, !=, <, >, >=, <=` or only with `==, !=`. In both cases, when the columns are of an expected type (string, int or double) but different, we need to assert some kind of similarity between them. Otherwise the predicate space will be too big and not really interesting from the DC-finding standpoint, since there will be predicates like `!=` between two completely different data attributes. These metrics are:
- "shared percentage": measures the overlap between two columns by considering the frequency of each unique element. It calculates the frequency of each unique value in both columns and determines the ratio of the shared values to the total values.
- "average ratio": computes the average value of each column and then returns the ratio of the smaller average to the larger average.
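A rough sketch of both metrics on plain vectors. The exact formula for "shared percentage" is an assumption here (sum of per-value minimum frequencies divided by sum of per-value maximum frequencies); the function names are hypothetical, and the averages are assumed to be non-negative.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <unordered_map>
#include <vector>

// Assumed definition: for every distinct value, count min(freq1, freq2) as
// shared and max(freq1, freq2) as total, then return shared / total.
template <typename T>
double SharedPercentage(std::vector<T> const& c1, std::vector<T> const& c2) {
    std::unordered_map<T, std::size_t> freq1;
    std::unordered_map<T, std::size_t> freq2;
    for (T const& value : c1) ++freq1[value];
    for (T const& value : c2) ++freq2[value];

    std::size_t shared = 0;
    std::size_t total = 0;
    for (auto const& [value, count1] : freq1) {
        auto it = freq2.find(value);
        std::size_t const count2 = (it == freq2.end()) ? 0 : it->second;
        shared += std::min(count1, count2);
        total += std::max(count1, count2);
    }
    // Values that occur only in the second column contribute to the total only.
    for (auto const& [value, count2] : freq2) {
        if (freq1.find(value) == freq1.end()) total += count2;
    }
    return total == 0 ? 0.0 : static_cast<double>(shared) / static_cast<double>(total);
}

// Ratio of the smaller column average to the larger one.
double AverageRatio(std::vector<double> const& c1, std::vector<double> const& c2) {
    if (c1.empty() || c2.empty()) return 0.0;
    double const avg1 = std::accumulate(c1.begin(), c1.end(), 0.0) / static_cast<double>(c1.size());
    double const avg2 = std::accumulate(c2.begin(), c2.end(), 0.0) / static_cast<double>(c2.size());
    double const larger = std::max(avg1, avg2);
    double const smaller = std::min(avg1, avg2);
    return larger == 0.0 ? 0.0 : smaller / larger;
}
```

A column pair would then be accepted for predicate generation only if the relevant metric is above some threshold; the threshold values are not stated in this commit message.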

Generates and categorizes predicates for the future evidence set construction

This test builds the predicate space from the provided data (a CSV file) and compares the list of predicates that will be used later for DC discovery with the expected one. The expected list of predicates was built manually by running the FastADC Java implementation. The next test checks that the inverse and mutex maps are built correctly.

This commit introduces building of Position List Indexes (PLIs). It works with hashed column data, such that equal values are represented by identical keys and the keys follow the natural order of the values. We also build so-called PliShards, which are just PLIs for a specific segment of the dataset, splitting the whole dataset into a bunch of shards. This will allow us to be more efficient later.
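A simplified sketch of building a PLI per shard from an already hashed column. `SimplePli`, `BuildPli` and `BuildPliShards` are hypothetical names, and the layout (parallel vectors of keys and clusters) is an assumption rather than the layout used in the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical simplified PLI: one cluster of row indexes per distinct key.
struct SimplePli {
    std::vector<std::int64_t> keys;                  // distinct keys, ascending
    std::vector<std::vector<std::size_t>> clusters;  // clusters[i]: rows holding keys[i]
};

// Builds a PLI for rows [begin, end) of a hashed column.
SimplePli BuildPli(std::vector<std::int64_t> const& column, std::size_t begin, std::size_t end) {
    // std::map iterates in ascending key order, which keeps the clusters sorted by key.
    std::map<std::int64_t, std::vector<std::size_t>> grouped;
    for (std::size_t row = begin; row < end; ++row) {
        grouped[column[row]].push_back(row);
    }

    SimplePli pli;
    for (auto& [key, rows] : grouped) {
        pli.keys.push_back(key);
        pli.clusters.push_back(std::move(rows));
    }
    return pli;
}

// Splits the column into shards of at most `shard_length` rows, one PLI per shard.
std::vector<SimplePli> BuildPliShards(std::vector<std::int64_t> const& column,
                                      std::size_t shard_length) {
    std::vector<SimplePli> shards;
    if (shard_length == 0) return shards;
    for (std::size_t begin = 0; begin < column.size(); begin += shard_length) {
        std::size_t const end = std::min(begin + shard_length, column.size());
        shards.push_back(BuildPli(column, begin, end));
    }
    return shards;
}
```

Row indexes inside each shard stay global in this sketch, so clusters from different shards can be compared directly.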

This class organizes predicates into packs and creates a correction map, which will be used for optimizing predicate comparisons in derived classes that will actually build clues from PLIs.

Inherits from CommonClueSetBuilder and builds clues based on one PLI shard.

Inherits from CommonClueSetBuilder and builds clues based on two PLI shards.

Validates the number of bits in the clue, the structure of the predicate packs, and the correction map, which stores predicate-to-bitset mappings.

This is a class for constructing clues from PliShards.

The expected values are, once again, taken from the Java implementation.

For now this class builds necessary structures to build Evidences later. The structures are clue set, correction map and cardinality mask.

This is a class that maps one-to-one to a Clue. The ApproximateEvidenceInversion (AEI) algorithm, which will build approximate denial constraints, uses Evidences as its input.

EvidenceSet is basically just a vector of evidences. The only thing it adds is a method to get the total count (I can probably publicly inherit from std::vector<Evidence>...?)

Add the building of evidences

The Java code sometimes uses LongBitSet to store predicates, which is like boost::dynamic_bitset, but the Java implementation restricts the number of bits in the clue to 64. We need to investigate further whether the Java algorithm could work with a predicate space larger than 64. But for now we use 64 as the maximum number of predicates.

This class reorders predicates by evidence coverage to speed up the trie later.

This class allows bitsets to be stored efficiently and checks whether one bitset is a subset of the stored bitsets.
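One possible way to organize such a structure is a trie keyed by the positions of set bits. The sketch below reads the description as "does any stored bitset form a subset of the query?", which is an assumption; `SubsetTrie`, `Bits` and the 64-bit width are hypothetical and this is not the NTreeSearch code from the PR.

```cpp
#include <bitset>
#include <cstddef>
#include <map>
#include <memory>

constexpr std::size_t kBitCount = 64;
using Bits = std::bitset<kBitCount>;

// Hypothetical sketch: each stored bitset is a path of its set-bit
// positions in ascending order.
class SubsetTrie {
public:
    void Insert(Bits const& bits) {
        Node* node = &root_;
        for (std::size_t i = 0; i < kBitCount; ++i) {
            if (!bits.test(i)) continue;
            std::unique_ptr<Node>& child = node->children[i];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->is_end = true;
    }

    // Returns true if at least one stored bitset is a subset of `query`.
    bool ContainsSubsetOf(Bits const& query) const {
        return Search(root_, query);
    }

private:
    struct Node {
        std::map<std::size_t, std::unique_ptr<Node>> children;
        bool is_end = false;
    };

    static bool Search(Node const& node, Bits const& query) {
        if (node.is_end) return true;
        // Only follow edges whose bit is also set in the query.
        for (auto const& [bit, child] : node.children) {
            if (query.test(bit) && Search(*child, query)) return true;
        }
        return false;
    }

    Node root_;
};
```

The search prunes every branch that requires a bit the query does not have, which is what makes it cheaper than scanning all stored bitsets.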

I've used `namespace model` as a placeholder before; we should use a proper one.

- "util" for structures with some complex logic
- "model" for representation of concepts needed for denial constraints
- "misc" for miscellaneous functions

Previously providers were defined as singleton classes with static storage duration. That led to state persisting between googletest runs, and it would also make running two instances of FastADC in parallel impossible. Made them normal classes that are created when the FastADC algorithm is created and cleaned up afterwards, and we now pass pointers to them to the related structures.

Now there is a lot of common initialization code, so it was moved to a googletest fixture.

This way we can get rid of the strange inheritance in the clue builder that was in Java. Now a separate class manages creation of auxiliary structures that we can just pass to the clue/evidence set builders.

This structure was named PackAndCorrectionMapBuilder, but it now builds not only the packs and the correction map, but also the cardinality mask. So it is renamed to EvidenceAuxStructuresBuilder.

The zero clue appears a lot in the clues vector.

github-actions

clang-tidy made some suggestions

github-actions · 2024-10-05T18:24:45Z

src/tests/test_dc_structures_correct_results.h

+    return bset;
+}
+
+std::unordered_map<uint64_t, size_t> expected_clue_set = {


warning: use of undeclared identifier 'uint64_t' [clang-diagnostic-error]

std::unordered_map<uint64_t, size_t> expected_clue_set = { ^

polyntsov · 2024-11-15T14:35:45Z

src/core/algorithms/dc/FastADC/misc/misc.h

+namespace {
+// Helper to trigger a compile-time error for unsupported types
+template <typename T>
+struct DependentFalse : std::false_type {};
+}  // namespace


Don't use anonymous namespaces in headers, this is meaningless. Helper classes and variables should be placed inside namespace details or something similar
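A sketch of what this suggestion could look like. The namespace name `detail` and the `TypeName` function are illustrative assumptions; only `DependentFalse` comes from the diff above.

```cpp
#include <string>
#include <type_traits>

namespace algos::fastadc::detail {
// Always false, but dependent on T, so the static_assert below fires
// only when the unsupported branch is actually instantiated.
template <typename T>
struct DependentFalse : std::false_type {};
}  // namespace algos::fastadc::detail

namespace algos::fastadc {
// Hypothetical function that only supports a fixed set of types.
template <typename T>
char const* TypeName() {
    if constexpr (std::is_same_v<T, int>) {
        return "int";
    } else if constexpr (std::is_same_v<T, std::string>) {
        return "string";
    } else {
        static_assert(detail::DependentFalse<T>::value, "Unsupported type");
    }
}
}  // namespace algos::fastadc
```

Unlike an anonymous namespace in a header, a named helper namespace gives every translation unit the same entity instead of a private copy per including file.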

polyntsov · 2024-11-15T14:37:49Z

src/core/algorithms/dc/FastADC/misc/misc.h

+}  // namespace
+
+template <typename T>
+[[nodiscard]] inline T const& GetValue(model::TypedColumnData const& column, size_t row) {


template functions are inline by default

polyntsov · 2024-11-15T14:40:18Z

src/core/algorithms/dc/FastADC/misc/misc.h

+     * }
+     */
+    if constexpr (std::is_same_v<T, std::string>) {
+        static std::string const kEmptyStr = "";


Not sure why we need these static variables at all here, we can just return the value right away. However, if we really want to have intermediate variables for this, I'd use constexpr instead of static

polyntsov · 2024-11-15T14:42:39Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+double GetAverageRatio(model::TypedColumnData const& c1, model::TypedColumnData const& c2) {
+    if (c1.GetColumn() == c2.GetColumn()) return 1.;
+
+    double avg1 = 0.0, avg2 = 0.0;


Don't define several variables on the same line

polyntsov · 2024-11-15T14:46:12Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+        default:
+            LOG(DEBUG) << "Column type  " << c1.GetType().ToString() << " is not numeric";
+            return -1;


Shouldn't we throw an exception here? This branch handles a case which breaks function contract, if we reach this place in the code, the caller side has a bug?

polyntsov · 2024-11-15T21:04:40Z

src/core/algorithms/dc/FastADC/util/evidence_aux_structures_builder.cpp

+        static std::initializer_list<OperatorType> eq_list = {OperatorType::kEqual,
+                                                              OperatorType::kUnequal};
+        static std::initializer_list<OperatorType> cardinality = {OperatorType::kUnequal};
+


let's not store std initializer list and especially static initializer lists..

polyntsov · 2024-11-15T21:07:07Z

src/core/algorithms/dc/FastADC/util/evidence_aux_structures_builder.h

+    void BuildAll();
+
+private:
+    PredicateBitset BuildMask(PredicatesSpan group, std::initializer_list<OperatorType>& types);


Let's not use std::initializer_list to store anything, it's the type mainly for compiler, not for programmer
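One possible alternative, sketched under assumptions: named `constexpr std::array` constants instead of stored `std::initializer_list` objects. The enum values, the 6-bit mask and the simplified `BuildMask` signature are hypothetical stand-ins for the real code.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

// Hypothetical stand-in for the OperatorType enum from the PR.
enum class OperatorType { kEqual, kUnequal, kGreater, kLess, kGreaterEqual, kLessEqual };

// Named constants with static storage duration and an explicit size.
inline constexpr std::array<OperatorType, 2> kEqList = {OperatorType::kEqual,
                                                        OperatorType::kUnequal};
inline constexpr std::array<OperatorType, 1> kCardinality = {OperatorType::kUnequal};

// Simplified stand-in for BuildMask: sets one bit per operator in `types`.
template <std::size_t N>
std::bitset<6> BuildMask(std::array<OperatorType, N> const& types) {
    std::bitset<6> mask;
    for (OperatorType type : types) {
        mask.set(static_cast<std::size_t>(type));
    }
    return mask;
}
```

With C++20 available, the parameter could also be a `std::span<OperatorType const>`, which would avoid the template.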

polyntsov · 2024-11-15T21:13:59Z

src/core/algorithms/dc/FastADC/util/ntree_search.h

+namespace algos::fastadc {
+class NTreeSearch {
+public:
+    NTreeSearch() = default;


polyntsov · 2024-11-15T21:15:39Z

src/core/algorithms/dc/FastADC/util/predicate_builder.h

+    std::vector<PredicateBitset>&& GetMutexMap() noexcept {
+        return std::move(mutex_map_);
+    }
+
+    std::vector<size_t>&& GetInverseMap() noexcept {
+        return std::move(inverse_map_);
+    }


Either rename these methods or don't move fields
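Both options from this comment, sketched on a reduced class. `BuilderSketch` and the `Take*` name are hypothetical; the point is only that the name should tell the caller whether the field is left intact.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical reduced builder; only the accessor pattern matters here.
class BuilderSketch {
public:
    explicit BuilderSketch(std::vector<std::size_t> inverse_map)
        : inverse_map_(std::move(inverse_map)) {}

    // Option 1: a plain getter that leaves the field untouched.
    std::vector<std::size_t> const& GetInverseMap() const noexcept {
        return inverse_map_;
    }

    // Option 2: the name makes it explicit that the field is moved out.
    std::vector<std::size_t> TakeInverseMap() noexcept {
        return std::move(inverse_map_);
    }

private:
    std::vector<std::size_t> inverse_map_;
};
```

Returning by value from `TakeInverseMap` move-constructs the result, so the builder's vector is empty afterwards and the caller never holds a reference into the builder.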

polyntsov · 2024-11-15T21:16:35Z

src/core/algorithms/dc/FastADC/util/predicate_organizer.h

+        std::iota(indexes.begin(), indexes.end(), 0);
+
+        std::stable_sort(indexes.begin(), indexes.end(),
+                         [&](int i, int j) { return coverages[i] < coverages[j]; });


don't capture everything by reference
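For illustration, the same sort with an explicit capture list. `OrderByCoverage` is a hypothetical free function wrapping the snippet above, and the element type of `coverages` is assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns predicate indexes ordered by ascending coverage; ties keep
// their original relative order because the sort is stable.
std::vector<std::size_t> OrderByCoverage(std::vector<int> const& coverages) {
    std::vector<std::size_t> indexes(coverages.size());
    std::iota(indexes.begin(), indexes.end(), std::size_t{0});

    // Only `coverages` is needed inside the lambda, so only it is captured.
    std::stable_sort(indexes.begin(), indexes.end(),
                     [&coverages](std::size_t i, std::size_t j) {
                         return coverages[i] < coverages[j];
                     });
    return indexes;
}
```

An explicit capture list documents what the lambda depends on and prevents accidental use of other locals.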

ol-imorozko force-pushed the FastADC branch 2 times, most recently from 6395e4a to 394229a on October 5, 2024 17:43

ol-imorozko added 28 commits October 5, 2024 21:45

Implement tests for Operator class

817ab07

Add test that checks that PredicateProvider works

b4b929c

Create temporary PredicateBuilder class

276137a

Replace int with int64_t in Predicate class

0695e58

Initial commit that adds dc folder and placeholder for dc.h

e014ea5

Implement method to get value from TypedColumnData

d7bec72

Implement PrediateBuilder class

6de1e00

Implement CommonClueSetBuilder

f91e8d8

Implement SingleClueSetBuilder

b7daeac

Implement CrossClueSetBuilder

7706750

Add test that checks static fields of CommonClueSetBuilder

1984614

Implement ClueSetBuilder

7cd11f8

Add test that checks ClueSet building

13c4f05

FIXME: Add an ability to force kString type on TypedColumnData instead of kMixed

ee5a55b

Add initial EvidenceSetBuilder class that builds cardinality mask

b4698a8

Add test that verifies CardinalityMask

d652305

Implement Evidence

cf74991

Implement EvidenceSet

4344fad

Implement EvidenceSetBuilder

ff7b99f

ol-imorozko added 26 commits October 5, 2024 21:45

Add test to verify evidence set

ced0db0

Fix wrong creating of inverted predicate, operands were swapped

3c791fc

Implement PredicateOrganizer class

29a2a56

Add test that validates predicate organizer

4acb27f

Implement DCCandidateTrie class

9384560

Implement PredicateSet class

1d3020f

Implement DenialConstraint class

8262756

Return reference from GetImplications Predicate method

bbcdde7

Implement Closure class

a34467a

Implement NTreeSearch class

c713f26

Implement DenialConstraintSet

fadeadf

Implement ApproximateEvidenceInverter class

72bc916

Implement test for approximate denial constraints

426d115

Change namespace model to namespace algos::fastadc for FastADC files

f701159

Split FastADC files into subfolders

16c2c16

Correct includes paths after renaming and moving FastADC files

3db0fcc

Adjust unittests after providers refactoring

6cc5b6e

Extract predicate packs and correction map building to a separate class

ca98a4a

Move cardinality mask building from Evidence set to a new structure

fcafce1

Remove unused clue field from Evidence class

67b5540

Remove unused N field from SearchNode class

39fa0ac

Optimize AccumulateClues by hashing clue with zero value

fccf6ae

Increase performance of AccumulateClues by preallocating and utilizing try_emplace

3b2b34f

Optimize clues by moving allocations out of Build* methods

3cc3060

ol-imorozko force-pushed the FastADC branch 2 times, most recently from f7a5ca2 to 3cc3060 on October 5, 2024 17:46

github-actions bot reviewed Oct 5, 2024

polyntsov requested changes Nov 15, 2024
