WIP: FastADC implementation #470

Open
wants to merge 54 commits into
base: main
from

Conversation

ol-imorozko
Collaborator

No description provided.

@ol-imorozko ol-imorozko force-pushed the FastADC branch 2 times, most recently from 6395e4a to 394229a Compare October 5, 2024 17:43
This class represents an operator used in predicates for
Denial Constraint (DC) representation. The operators used there
are less, greater, eq, neq, geq and leq.

This C++ implementation is 100% bad: a whole bunch of objects
is created, but conceptually all of them are the same.
An object "Operator '+'" and another object "Operator '+'" represent
the same thing.

----------------------------------------------------------------------------
This is just a copy of the Java code from https://github.com/RangerShaw/FastADC.

I will refactor and think about a better implementation later.
For now I'm just copying the Java code to get a working algorithm ASAP,
and after that I'll start thinking about a good implementation.
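
One possible way to avoid many identical operator objects, which the Operator commit message above complains about, is a plain enum plus a free evaluation function. This is only a minimal sketch, not the code in this PR: `OperatorType::kEqual` and `OperatorType::kUnequal` appear in the reviewed code, while the other enumerators and the `Eval` function are assumed names.

```cpp
#include <cstdint>

// Hypothetical value-type operator. Only kEqual and kUnequal are names taken
// from this PR; the remaining enumerators are assumed for illustration.
enum class OperatorType : std::uint8_t {
    kEqual, kUnequal, kGreater, kLess, kGreaterEqual, kLessEqual
};

// Evaluates `lhs op rhs`. Two equal enum values denote the same operator,
// so no separate operator objects have to be created or compared.
template <typename T>
constexpr bool Eval(OperatorType op, T const& lhs, T const& rhs) {
    switch (op) {
        case OperatorType::kEqual: return lhs == rhs;
        case OperatorType::kUnequal: return lhs != rhs;
        case OperatorType::kGreater: return lhs > rhs;
        case OperatorType::kLess: return lhs < rhs;
        case OperatorType::kGreaterEqual: return lhs >= rhs;
        case OperatorType::kLessEqual: return lhs <= rhs;
    }
    return false;  // unreachable for valid enumerators
}
```

With this shape an operator is copied and compared like an integer, so the question of object identity does not arise.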
This commit adds the test_dc_structures.cpp file, which will be used
to test the various data structures required for
DC representation (there are a lot).
This class represents a column operand within a predicate for FastADC.

FastADC processes Denial Constraints (DCs) that involve comparisons between
pairs of rows within a dataset. A typical DC example, derived from a Functional
Dependency (FD) such as A -> B, is expressed as: ∀𝑡, 𝑠 ∈ 𝑟, ¬(𝑡.𝐴 = 𝑠.𝐴 ∧ 𝑡.𝐵 ≠ 𝑠.𝐵).
This denotes that for any pair of rows in the relation, it should not be the case
that while the values in column "A" are equal, the values in column "B" are unequal.

A predicate in this context (e.g., 𝑡.𝐴 = 𝑠.𝐴) comprises three elements to be fully
represented: the column operand from the first tuple ("t.A"), the comparison operator
("="), and the column operand from the second tuple ("s.A"). The `ColumnOperand` class
encapsulates the column operand part of a predicate, such as "t.A" or "s.A".
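
To make the DC semantics above concrete, here is a minimal sketch that checks the DC derived from A -> B by brute force over all ordered row pairs. The `Row` struct and the `HoldsFdAsDc` function are hypothetical names for illustration; the PR itself works with `ColumnOperand` and predicate objects instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical two-column row; the names are for illustration only.
struct Row {
    int a;
    std::string b;
};

// Returns true if no ordered pair of distinct rows (t, s) satisfies
// t.A == s.A and t.B != s.B, i.e. the DC derived from A -> B holds.
bool HoldsFdAsDc(std::vector<Row> const& rows) {
    for (std::size_t t = 0; t < rows.size(); ++t) {
        for (std::size_t s = 0; s < rows.size(); ++s) {
            if (t == s) continue;
            bool const a_equal = rows[t].a == rows[s].a;    // predicate t.A = s.A
            bool const b_unequal = rows[t].b != rows[s].b;  // predicate t.B != s.B
            if (a_equal && b_unequal) return false;         // the pair violates the DC
        }
    }
    return true;
}
```

The quadratic pair scan is exactly what the evidence-based machinery later in the PR is meant to avoid; it is shown here only to pin down what a violation is.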
The first step in the FastADC algorithm is to build the so-called "Predicate Space".
This is a long process during which many places in the code want to get
a Predicate. But each predicate is stored in a global storage, a map.
In the Java code this class (and other similar "provider" classes) is
a singleton.

BaseProvider is the class from which every *Provider class should be
derived. It ensures that only the PredicateBuilder class can initialize
and free these singletons.

I'm sure there exists a better approach, where we store Provider
classes in some fields to bind their lifetime more explicitly, but
this is how it's done in Java, and I don't have much time to devise
a perfect architecture.
This class acts as a centralized storage to manage
and provide access to Predicate objects.

A Predicate is defined as "t1.A_i op t2.A_j", where t1 and t2 represent
different rows, and A_i and A_j are columns (which may be the same or different).

The FastADC algorithm first builds a so-called "Predicate Space",
which is the set of all predicates that are allowed on R (a set of rows,
basically a table). In order to create and store predicates, this commit
implements a singleton class with a hashmap storage.
FastADC processes Denial Constraints (DCs) that involve comparisons between
pairs of rows within a dataset. A typical DC example, derived from a Functional
Dependency such as A -> B, is expressed as:
`forall t, s in r, not (t.A = s.A and t.B != s.B)`
This denotes that for any pair of rows in the relation, it should not be the case
that while the values in column "A" are equal, the values in column "B" are unequal.

A predicate in this context (e.g., t.A = s.A) comprises three elements to be fully
represented: the column operand from the first tuple ("t.A"), the comparison operator
("="), and the column operand from the second tuple ("s.A").
This simple test creates two predicates on a 2x2 table and evaluates them.
We're checking the ability of the mo::GetPredicate function to correctly
create a predicate.
In the original FastADC pull request this class manages the creation of
predicates, so it initializes PredicateProvider. But in this PR this
class is not required for DC verification. Hence adding a temporary
class just to make the tests work.
The TypedColumnData kInt type is int64_t, and the FastADC algorithm uses
64-bit long types.
FastADC algorithm for mining approximate Denial Constraints will
be implemented here.
IndexProvider assigns a unique index to each distinct object of
type T added to it. It will be used later for two main operations:
1. Map all predicates to numbers so that dynamic bitsets can be used for quick
intersection etc.
2. Hash all values in the table while keeping their relative order
(ignoring columns of DC-unsupported types; only ints, doubles and
strings are allowed). That is, equal values are substituted by
the same integer, and higher values are replaced by larger integers.
The FastADC algorithm decides which column pairs to create predicates
from by checking whether they are comparable with `==, !=, <, >, >=, <=`
or only with `==, !=`. In both cases, when the columns are of an expected type
(string, int or double) but are two different columns, we need to assert some kind of
similarity between them. Otherwise the predicate space will be too big
and not really interesting from the DC finding standpoint, since there
will be predicates like `!=` between two completely unrelated
data attributes.

These metrics are:
- "shared percentage"
Measures the overlap between two columns by considering the frequency of
each unique element. It calculates the frequency of each unique value in
both columns and determines the ratio of the shared values to the total values.

- "average ratio"
Computes the average value of each column and then returns the ratio of the
smaller average to the larger average.
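
A rough sketch of the two metrics follows. The exact formula for "shared percentage" is not spelled out above, so the version here (sum of per-value minimum frequencies divided by the sum of per-value maximum frequencies) is an assumption, as are the function names and the restriction to `std::int64_t` and `double` columns.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <unordered_map>
#include <vector>

using Frequencies = std::unordered_map<std::int64_t, std::size_t>;

Frequencies CountValues(std::vector<std::int64_t> const& column) {
    Frequencies freq;
    for (std::int64_t value : column) ++freq[value];
    return freq;
}

// Assumed formula: sum over all values of min(freq1, freq2), divided by the
// sum over all values of max(freq1, freq2).
double SharedPercentage(std::vector<std::int64_t> const& c1,
                        std::vector<std::int64_t> const& c2) {
    Frequencies const f1 = CountValues(c1);
    Frequencies const f2 = CountValues(c2);
    std::size_t shared = 0;
    std::size_t total = 0;
    for (auto const& [value, count1] : f1) {
        auto const it = f2.find(value);
        std::size_t const count2 = (it == f2.end()) ? 0 : it->second;
        shared += std::min(count1, count2);
        total += std::max(count1, count2);
    }
    for (auto const& [value, count2] : f2) {
        if (f1.find(value) == f1.end()) total += count2;  // values only in c2
    }
    return total == 0 ? 0.0 : static_cast<double>(shared) / static_cast<double>(total);
}

// Ratio of the smaller column average to the larger one.
// Assumes non-empty columns with positive averages.
double AverageRatio(std::vector<double> const& c1, std::vector<double> const& c2) {
    double const avg1 = std::accumulate(c1.begin(), c1.end(), 0.0) / static_cast<double>(c1.size());
    double const avg2 = std::accumulate(c2.begin(), c2.end(), 0.0) / static_cast<double>(c2.size());
    return std::min(avg1, avg2) / std::max(avg1, avg2);
}
```

Both functions return a value in [0, 1] under the stated assumptions, which makes it natural to compare them against a threshold when deciding whether a column pair is worth a predicate.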
Generates and categorizes predicates for the future evidence set construction
This test builds predicate space from the provided data (CSV file)
and compares the list of predicates that will be used later for DC
discovery with the expected one.

The expected list of predicates was built manually from running
the FastADC Java implementation.

The next test checks that the inverse and mutex maps are built
correctly.
This commit introduces the building of Position List Indexes (PLIs).
It works with hashed column data, such that equal values are
represented by identical keys, and values are sorted by their natural order.

We also build so-called PliShards, which are just PLIs for a specific
segment of the dataset; the whole dataset is split into a bunch of shards.

This will allow us to be more efficient later.
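
As an illustration of the idea, a PLI over one hashed column and its split into shards could look roughly like this. The `Pli` layout, the ascending key order and the function names are assumptions made for this sketch; the PR's actual classes may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical PLI layout: clusters[i] holds the row ids whose hashed value is keys[i].
struct Pli {
    std::vector<std::size_t> keys;                   // distinct hashed values, ascending
    std::vector<std::vector<std::size_t>> clusters;  // row ids per key, in row order
};

// Builds a PLI for rows [begin, end) of a hashed column.
Pli BuildPli(std::vector<std::size_t> const& column, std::size_t begin, std::size_t end) {
    std::map<std::size_t, std::vector<std::size_t>> groups;  // sorted by key
    for (std::size_t row = begin; row < end; ++row) groups[column[row]].push_back(row);

    Pli pli;
    for (auto& [key, rows] : groups) {
        pli.keys.push_back(key);
        pli.clusters.push_back(std::move(rows));
    }
    return pli;
}

// Splits the column into consecutive segments of shard_length rows (assumed > 0)
// and builds one PLI per segment.
std::vector<Pli> BuildPliShards(std::vector<std::size_t> const& column,
                                std::size_t shard_length) {
    std::vector<Pli> shards;
    for (std::size_t begin = 0; begin < column.size(); begin += shard_length) {
        std::size_t const end = std::min(begin + shard_length, column.size());
        shards.push_back(BuildPli(column, begin, end));
    }
    return shards;
}
```

Row ids stay global across shards in this sketch, so clusters from two different shards can be combined without translating indices.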
This class organizes predicates into packs and creates a correction map,
which will be used for optimizing predicate comparisons in derived
classes that will actually build clues from PLIs.
Inherits from CommonClueSetBuilder and builds clues from
one PLI shard.
Inherits from CommonClueSetBuilder and builds clues from
two PLI shards.
Validates the number of bits in the clue, the structure of the predicate packs,
and the correction map, which stores predicate-to-bitset mappings.
This is a class for constructing clues from PliShards.
The expected values are, once again, taken from the Java implementation.
For now this class builds the structures necessary to build Evidences later.
The structures are the clue set, the correction map and the cardinality mask.
This is a class that maps one-to-one to Clue. The ApproximateEvidenceInversion algorithm
(AEI), which will build approximate denial constraints, uses Evidences
as its input.
EvidenceSet is basically just a vector of evidences. The only thing
it adds is a method to get the total count.

(I probably can publicly inherit from std::vector<Evidence>...?)
Add the building of evidences
The Java code sometimes uses LongBitSet to store predicates, which is like
boost::dynamic_bitset, but the Java implementation restricts the number of bits
in the clue to 64.

We need to investigate further whether the Java algorithm could work
with a predicate space larger than 64. But for now we use 64 as the maximum
number of predicates.
This class reorders predicates by evidence coverage to speed up
trie operations later.
This class allows storing bitsets efficiently and finding whether
one bitset is a subset of the stored bitsets.
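
A subset search of this kind can be illustrated with a small trie keyed by set-bit positions. This is a generic sketch and not the PR's NTreeSearch: the class name, the 64-bit `std::uint64_t` representation and the query direction (does any stored bitset fit inside the query?) are all assumptions; the opposite direction would need a different traversal.

```cpp
#include <cstdint>
#include <map>
#include <memory>

// Each stored bitset becomes a path of its set-bit positions in ascending order.
class SubsetTrie {
public:
    void Insert(std::uint64_t bits) {
        Node* node = &root_;
        for (int bit = 0; bit < 64; ++bit) {
            if ((bits >> bit) & 1u) {
                std::unique_ptr<Node>& child = node->children[bit];
                if (!child) child = std::make_unique<Node>();
                node = child.get();
            }
        }
        node->terminal = true;
    }

    // True if some stored bitset is a subset of `query`.
    bool ContainsSubsetOf(std::uint64_t query) const { return Search(root_, query); }

private:
    struct Node {
        bool terminal = false;
        std::map<int, std::unique_ptr<Node>> children;
    };

    static bool Search(Node const& node, std::uint64_t query) {
        if (node.terminal) return true;
        for (auto const& [bit, child] : node.children) {
            // Only follow edges whose bit is also set in the query.
            if (((query >> bit) & 1u) && Search(*child, query)) return true;
        }
        return false;
    }

    Node root_;
};
```

The search prunes every branch whose first bit is missing from the query, which is where the gain over scanning a flat list of bitsets comes from.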
I've used `namespace model` as a placeholder before; we should use
proper ones:
- "util" for structures with some complex logic
- "model" for representation of concepts needed for denial constraints
- "misc" for miscellaneous functions
Previously providers were defined as singleton classes with static
storage duration. That led to state persisting between googletest runs,
and it would also make running two instances of FastADC in parallel
impossible.

Made them normal classes that are created when the FastADC algorithm is created
and cleaned up afterwards, and we pass pointers to them to related structures.
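
The ownership change described above might look like the following in miniature. All class names here are hypothetical stand-ins; the point is only that each algorithm instance owns its provider and hands out a non-owning pointer, so no state is shared between instances or test runs.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical provider: a plain class with no static state.
class PredicateIndexProvider {
public:
    std::size_t GetIndex(std::string const& predicate) {
        return indexes_.try_emplace(predicate, indexes_.size()).first->second;
    }

private:
    std::unordered_map<std::string, std::size_t> indexes_;
};

// A helper structure that receives a non-owning pointer to the provider.
class PredicateBuilderSketch {
public:
    explicit PredicateBuilderSketch(PredicateIndexProvider* provider) : provider_(provider) {}

    std::size_t Register(std::string const& predicate) { return provider_->GetIndex(predicate); }

private:
    PredicateIndexProvider* provider_;  // non-owning; outlived by the owner below
};

// The algorithm object owns the provider, so its lifetime is bound to the algorithm.
class FastAdcSketch {
public:
    FastAdcSketch() = default;
    // Copying would leave builder_ pointing at the other object's provider.
    FastAdcSketch(FastAdcSketch const&) = delete;
    FastAdcSketch& operator=(FastAdcSketch const&) = delete;

    std::size_t Register(std::string const& predicate) { return builder_.Register(predicate); }

private:
    PredicateIndexProvider provider_;             // declared first, so constructed first
    PredicateBuilderSketch builder_{&provider_};
};
```

Two `FastAdcSketch` objects number their predicates independently, which is the property the singleton version could not offer.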
Now there is a lot of common initialization code, so it was moved
to a googletest fixture.
This way we can get rid of the strange inheritance in the clue builder that was in Java.
Now a separate class manages the creation of auxiliary structures that
we can just pass to the clue/evidence set builders.
This structure was named PackAndCorrectionMapBuilder, but it now
builds not only the packs and the correction map but also further auxiliary structures.
So it is renamed to EvidenceAuxStructuresBuilder.
The zero clue appeared many times in the clues vector.
@ol-imorozko ol-imorozko force-pushed the FastADC branch 2 times, most recently from f7a5ca2 to 3cc3060 Compare October 5, 2024 17:46

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

return bset;
}

std::unordered_map<uint64_t, size_t> expected_clue_set = {

warning: use of undeclared identifier 'uint64_t' [clang-diagnostic-error]

std::unordered_map<uint64_t, size_t> expected_clue_set = {
                   ^

Comment on lines +7 to +11
namespace {
// Helper to trigger a compile-time error for unsupported types
template <typename T>
struct DependentFalse : std::false_type {};
} // namespace
Collaborator

Don't use anonymous namespaces in headers, this is meaningless. Helper classes and variables should be placed inside namespace details or something similar
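
A sketch of what this suggestion could look like: the helper moves into a named `detail` namespace and is used in the fallback branch of an `if constexpr` chain. The `TypeName` function is a hypothetical example of a caller, not code from this PR.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

namespace detail {
// Helper to trigger a compile-time error for unsupported types.
// Depends on T, so the static_assert below fires only when instantiated.
template <typename T>
struct DependentFalse : std::false_type {};
}  // namespace detail

// Hypothetical caller: maps a supported value type to a readable name.
template <typename T>
constexpr char const* TypeName() {
    if constexpr (std::is_same_v<T, std::int64_t>) {
        return "int";
    } else if constexpr (std::is_same_v<T, double>) {
        return "double";
    } else if constexpr (std::is_same_v<T, std::string>) {
        return "string";
    } else {
        static_assert(detail::DependentFalse<T>::value, "Unsupported type");
    }
}
```

Unlike an unnamed namespace, `detail` names the same entity in every translation unit that includes the header, so templates that use the helper have one consistent definition.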

} // namespace

template <typename T>
[[nodiscard]] inline T const& GetValue(model::TypedColumnData const& column, size_t row) {
Collaborator

template functions are inline by default

* }
*/
if constexpr (std::is_same_v<T, std::string>) {
static std::string const kEmptyStr = "";
Collaborator

Not sure why we need these static variables at all here, we can just return the value right away. However, if we really want to have intermediate variables for this, I'd use constexpr instead of static

double GetAverageRatio(model::TypedColumnData const& c1, model::TypedColumnData const& c2) {
if (c1.GetColumn() == c2.GetColumn()) return 1.;

double avg1 = 0.0, avg2 = 0.0;
Collaborator

Don't define several variables on the same line

Comment on lines +82 to +84
default:
LOG(DEBUG) << "Column type " << c1.GetType().ToString() << " is not numeric";
return -1;
Collaborator

Shouldn't we throw an exception here? This branch handles a case which breaks function contract, if we reach this place in the code, the caller side has a bug?
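
For illustration, the default branch could throw instead of returning a sentinel. This is a simplified sketch with a hypothetical `ColumnKind` enum and `Average` function standing in for the real column type API; the exception type is also just one reasonable choice.

```cpp
#include <numeric>
#include <stdexcept>
#include <vector>

enum class ColumnKind { kInt, kDouble, kString };  // hypothetical stand-in

// Computes the average of a numeric column. Calling it for a non-numeric
// column is a contract violation, so it throws rather than returning -1.
double Average(ColumnKind kind, std::vector<double> const& values) {
    switch (kind) {
        case ColumnKind::kInt:
        case ColumnKind::kDouble:
            if (values.empty()) return 0.0;
            return std::accumulate(values.begin(), values.end(), 0.0) /
                   static_cast<double>(values.size());
        default:
            throw std::invalid_argument("Average: column type is not numeric");
    }
}
```

A thrown exception cannot be mistaken for a valid ratio, whereas a -1 return value can silently flow into later arithmetic.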

Comment on lines +74 to +77
static std::initializer_list<OperatorType> eq_list = {OperatorType::kEqual,
OperatorType::kUnequal};
static std::initializer_list<OperatorType> cardinality = {OperatorType::kUnequal};

Collaborator

Let's not store std::initializer_list objects, and especially not static ones.

void BuildAll();

private:
PredicateBitset BuildMask(PredicatesSpan group, std::initializer_list<OperatorType>& types);
Collaborator

Let's not use std::initializer_list to store anything, it's the type mainly for compiler, not for programmer
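
One possible replacement is a pair of `constexpr std::array` constants passed by const reference. The sketch below is an assumption about how that could look: the enumerators other than `kEqual` and `kUnequal`, the constant names, and the simplified `BuildMask` over a `std::uint64_t` are illustrative only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum class OperatorType { kEqual, kUnequal, kGreater, kLess };  // partly assumed

// Owning, compile-time constants instead of stored std::initializer_list objects.
inline constexpr std::array<OperatorType, 2> kEqList = {OperatorType::kEqual,
                                                        OperatorType::kUnequal};
inline constexpr std::array<OperatorType, 1> kCardinality = {OperatorType::kUnequal};

// Sets bit i of the mask if the operator of the i-th predicate in `group`
// is one of `types`.
template <std::size_t N, std::size_t M>
constexpr std::uint64_t BuildMask(std::array<OperatorType, N> const& group,
                                  std::array<OperatorType, M> const& types) {
    static_assert(N <= 64, "mask holds at most 64 predicates");
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < N; ++i) {
        for (OperatorType type : types) {
            if (group[i] == type) mask |= std::uint64_t{1} << i;
        }
    }
    return mask;
}
```

A `std::array` owns its elements, so there is no question about the lifetime of a backing array as there is with a stored `std::initializer_list`.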

namespace algos::fastadc {
class NTreeSearch {
public:
NTreeSearch() = default;
Collaborator

Not needed

Comment on lines +116 to +122
std::vector<PredicateBitset>&& GetMutexMap() noexcept {
return std::move(mutex_map_);
}

std::vector<size_t>&& GetInverseMap() noexcept {
return std::move(inverse_map_);
}
Collaborator

Either rename these methods or don't move fields
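
Both options from this comment can be sketched side by side. The class name and the `Take` prefix are hypothetical; the PR may settle on different names.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

class PredicateSpaceSketch {
public:
    explicit PredicateSpaceSketch(std::vector<std::size_t> inverse_map)
        : inverse_map_(std::move(inverse_map)) {}

    // Option 1: a getter that does not move; the object stays usable.
    std::vector<std::size_t> const& GetInverseMap() const noexcept { return inverse_map_; }

    // Option 2: a name that makes the ownership transfer explicit.
    // After the call the member is left in a valid but unspecified state.
    std::vector<std::size_t> TakeInverseMap() noexcept { return std::move(inverse_map_); }

private:
    std::vector<std::size_t> inverse_map_;
};
```

Returning by value from the `Take` variant also avoids handing out an rvalue reference to a member, which is easy to misuse at the call site.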

std::iota(indexes.begin(), indexes.end(), 0);

std::stable_sort(indexes.begin(), indexes.end(),
[&](int i, int j) { return coverages[i] < coverages[j]; });
Collaborator

don't capture everything by reference
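
A version with an explicit capture list might look like this. The wrapper function `OrderByCoverage`, the `double` element type of `coverages` and the switch from `int` to `std::size_t` lambda parameters are assumptions for the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns predicate indexes ordered by ascending coverage; ties keep their
// original relative order because the sort is stable.
std::vector<std::size_t> OrderByCoverage(std::vector<double> const& coverages) {
    std::vector<std::size_t> indexes(coverages.size());
    std::iota(indexes.begin(), indexes.end(), std::size_t{0});

    // Capture only what the comparator needs instead of a blanket [&].
    std::stable_sort(indexes.begin(), indexes.end(),
                     [&coverages](std::size_t i, std::size_t j) {
                         return coverages[i] < coverages[j];
                     });
    return indexes;
}
```

Naming the capture documents the comparator's single dependency and prevents accidental use of other locals.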

2 participants