Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DES #490

Open
wants to merge 24 commits into
base: main
Choose a base branch
from
Open

Conversation

VanyaVolgushev
Copy link
Contributor

Add differential evolution solver algorithm for mining numerical association rules.

@VanyaVolgushev VanyaVolgushev force-pushed the add-des-release branch 2 times, most recently from 66d00fc to dbbf329 Compare November 4, 2024 18:57
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/core/algorithms/nar/des/des.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/des.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/des.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/encoded_nar.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/encoded_nar.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/rng.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/rng.h Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/rng.h Outdated Show resolved Hide resolved
src/core/algorithms/nar/value_range.h Show resolved Hide resolved
src/core/algorithms/nar/value_range.h Show resolved Hide resolved
@VanyaVolgushev VanyaVolgushev force-pushed the add-des-release branch 2 times, most recently from 578470b to 4edd689 Compare November 9, 2024 18:24
Copy link
Collaborator

@alexandrsmirn alexandrsmirn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part of the review. I will check the rest of the code later.

src/core/algorithms/nar/nar.h Outdated Show resolved Hide resolved
src/core/algorithms/nar/nar.h Outdated Show resolved Hide resolved
src/core/algorithms/nar/nar_algorithm.h Show resolved Hide resolved
src/core/algorithms/nar/nar_algorithm.h Outdated Show resolved Hide resolved
src/core/algorithms/nar/nar_algorithm.cpp Outdated Show resolved Hide resolved
}

void NARAlgorithm::LoadDataInternal() {
typed_relation_ = model::ColumnLayoutTypedRelationData::CreateFrom(*input_table_, true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the second parameter of CreateFrom function is is_null_equal_null. this parameter can be introduced as an algorithm parameter, like this way.

or, maybe, the authors of the algorithm have discussed this case in the paper?

src/core/config/descriptions.h Outdated Show resolved Hide resolved
src/core/algorithms/nar/value_range.h Outdated Show resolved Hide resolved
src/core/algorithms/nar/value_range.h Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/des.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/des.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/des.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/des.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/des.cpp Outdated Show resolved Hide resolved
src/core/algorithms/nar/des/encoded_nar.cpp Outdated Show resolved Hide resolved
Comment on lines +75 to +74
auto domain = domains[feature_index];
auto decoded = encoded_value_ranges_[feature_index].Decode(domain);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, you copying the shared pointer when using auto domain = domains[feature_index]. Use const reference, no need to increase ref count of shared ptr for no reason

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, no need to create an additional copy of decoded.
I'm aware that InsertInCons wants a shared pointer, not a const ref, but when passing const ref to it it will create a copy automatically, so no problems here.

Comment on lines 86 to 89
EncodedNAR::EncodedNAR(FeatureDomains domains, TypedRelation const* typed_relation) {
size_t feature_count = domains.size();
for (size_t feature_index = 0; feature_index < feature_count; feature_index++) {
encoded_value_ranges_.emplace_back(EncodedValueRange());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a constructor, so we can just do

Suggested change
EncodedNAR::EncodedNAR(FeatureDomains domains, TypedRelation const* typed_relation) {
size_t feature_count = domains.size();
for (size_t feature_index = 0; feature_index < feature_count; feature_index++) {
encoded_value_ranges_.emplace_back(EncodedValueRange());
}
EncodedNAR::EncodedNAR(const FeatureDomains& domains, const TypedRelation* typed_relation)
: encoded_value_ranges_(domains.size(), EncodedValueRange()) {

Comment on lines +91 to +99
implication_sign_pos_ = RNG().Next();
SetQualities(domains, typed_relation);
}

EncodedNAR::EncodedNAR(size_t feature_count) {
for (size_t feature_index = 0; feature_index < feature_count; feature_index++) {
encoded_value_ranges_.emplace_back(EncodedValueRange());
}
implication_sign_pos_ = RNG().Next();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing with RNG, just create a std::random_device as a field in EncodedNAR class

Comment on lines +95 to +97
EncodedNAR::EncodedNAR(size_t feature_count) {
for (size_t feature_index = 0; feature_index < feature_count; feature_index++) {
encoded_value_ranges_.emplace_back(EncodedValueRange());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
EncodedNAR::EncodedNAR(size_t feature_count) {
for (size_t feature_index = 0; feature_index < feature_count; feature_index++) {
encoded_value_ranges_.emplace_back(EncodedValueRange());
}
EncodedNAR::EncodedNAR(size_t feature_count)
: encoded_value_ranges_(feature_count, EncodedValueRange()) {

using model::ValueRange;

DES::DES() : NARAlgorithm({}) {
using namespace config::names;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using namespace config::names;

@@ -0,0 +1,89 @@
#include "des.h"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include-what-you-use is planned to be introduced in the CI, so it would be better to add these lines:

Suggested change
#include <algorithm>
#include <cstddef>
#include <vector>
#include <memory>

This also applies to other files


static FeatureDomains FindFeatureDomains(TypedRelation const* typed_relation);
std::vector<EncodedNAR> GetRandomPopulationInDomains(FeatureDomains domains) const;
void EvolvePopulation(std::vector<EncodedNAR>& population) const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find definition of DES::EvolvePopulation method. Shouldn't it's declaration be removed?


private:
double implication_sign_pos_ = -1;
std::vector<EncodedValueRange> encoded_value_ranges_ = std::vector<EncodedValueRange>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<EncodedValueRange> encoded_value_ranges_ = std::vector<EncodedValueRange>();
std::vector<EncodedValueRange> encoded_value_ranges_{};

switch (index) {
case 0:
return permutation;
break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These break's are unreachable

Comment on lines +56 to +49
std::shared_ptr<model::ValueRange> Decode(std::shared_ptr<model::ValueRange> domain) const {
switch (domain->GetTypeId()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, it's better to add

Suggested change
std::shared_ptr<model::ValueRange> Decode(std::shared_ptr<model::ValueRange> domain) const {
switch (domain->GetTypeId()) {
std::shared_ptr<model::ValueRange> Decode(std::shared_ptr<model::ValueRange> domain) const {
using namespace model;
switch (domain->GetTypeId()) {

and remove model:: further in this method

Comment on lines +58 to +73
case model::TypeId::kInt: {
std::shared_ptr<model::IntValueRange> int_domain =
std::static_pointer_cast<model::IntValueRange>(domain);
model::Int span = int_domain->upper_bound - int_domain->lower_bound;
model::Int resulting_lower = int_domain->lower_bound + span * this->bound1;
model::Int resulting_upper = int_domain->lower_bound + span * this->bound2;
if (resulting_lower > resulting_upper) {
std::swap(resulting_lower, resulting_upper);
}
return std::make_shared<model::IntValueRange>(
model::IntValueRange(resulting_lower, resulting_upper));
}
case model::TypeId::kDouble: {
std::shared_ptr<model::DoubleValueRange> double_domain =
std::static_pointer_cast<model::DoubleValueRange>(domain);
model::Double span = double_domain->upper_bound - double_domain->lower_bound;
model::Double resulting_lower = double_domain->lower_bound + span * this->bound1;
model::Double resulting_upper = double_domain->lower_bound + span * this->bound2;
if (resulting_lower > resulting_upper) {
std::swap(resulting_lower, resulting_upper);
}
return std::make_shared<model::DoubleValueRange>(
model::DoubleValueRange(resulting_lower, resulting_upper));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to use private template method here?

template <typename T, typename RangeT>
    std::shared_ptr<RangeT> DecodeTypedValueRange(std::shared_ptr<model::ValueRange> const& domain) const {
        std::shared_ptr<RangeT> int_domain = std::static_pointer_cast<RangeT>(domain);
        T span = int_domain->upper_bound - int_domain->lower_bound;
        T resulting_lower = int_domain->lower_bound + span * this->bound1;
        T resulting_upper = int_domain->lower_bound + span * this->bound2;
        if (resulting_lower > resulting_upper) {
            std::swap(resulting_lower, resulting_upper);
        }
        return std::make_shared<RangeT>(resulting_lower, resulting_upper);
    }

and call it here:

Suggested change
case model::TypeId::kInt: {
std::shared_ptr<model::IntValueRange> int_domain =
std::static_pointer_cast<model::IntValueRange>(domain);
model::Int span = int_domain->upper_bound - int_domain->lower_bound;
model::Int resulting_lower = int_domain->lower_bound + span * this->bound1;
model::Int resulting_upper = int_domain->lower_bound + span * this->bound2;
if (resulting_lower > resulting_upper) {
std::swap(resulting_lower, resulting_upper);
}
return std::make_shared<model::IntValueRange>(
model::IntValueRange(resulting_lower, resulting_upper));
}
case model::TypeId::kDouble: {
std::shared_ptr<model::DoubleValueRange> double_domain =
std::static_pointer_cast<model::DoubleValueRange>(domain);
model::Double span = double_domain->upper_bound - double_domain->lower_bound;
model::Double resulting_lower = double_domain->lower_bound + span * this->bound1;
model::Double resulting_upper = double_domain->lower_bound + span * this->bound2;
if (resulting_lower > resulting_upper) {
std::swap(resulting_lower, resulting_upper);
}
return std::make_shared<model::DoubleValueRange>(
model::DoubleValueRange(resulting_lower, resulting_upper));
}
case TypeId::kInt: {
return DecodeTypedValueRange<Int, IntValueRange>(domain);
}
case TypeId::kDouble: {
return DecodeTypedValueRange<Double, DoubleValueRange>(domain);
}

Also you can define Mixed case as template specialization:

template <>
std::shared_ptr<model::StringValueRange>
EncodedValueRange::DecodeTypedValueRange<model::String, model::StringValueRange>
(std::shared_ptr<model::ValueRange> const& domain) const {
    using namespace model;
    std::shared_ptr<StringValueRange> string_domain =
            std::static_pointer_cast<StringValueRange>(domain);
    std::vector<String> string_vector = string_domain->domain;
    size_t span = string_vector.size();
    // upper_bound is not used, resulting NARs bind categorical values with a single
    // value.
    String result;
    if (bound1 == 1.0) {
        result = string_vector.back();
    } else {
        result = string_vector[(size_t)(span * this->bound1)];
    }
    return std::make_shared<StringValueRange>(StringValueRange(result));
}

Note that specialization should appear after template itself and before first instantiation. The easier way to do this is to place template specialization and EncodeValueRange::Decode definition in .cpp file -- in that order

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also we should avoid using c-style casts: "(size_t)"

Comment on lines +58 to +73
case model::TypeId::kInt: {
std::shared_ptr<model::IntValueRange> int_domain =
std::static_pointer_cast<model::IntValueRange>(domain);
model::Int span = int_domain->upper_bound - int_domain->lower_bound;
model::Int resulting_lower = int_domain->lower_bound + span * this->bound1;
model::Int resulting_upper = int_domain->lower_bound + span * this->bound2;
if (resulting_lower > resulting_upper) {
std::swap(resulting_lower, resulting_upper);
}
return std::make_shared<model::IntValueRange>(
model::IntValueRange(resulting_lower, resulting_upper));
}
case model::TypeId::kDouble: {
std::shared_ptr<model::DoubleValueRange> double_domain =
std::static_pointer_cast<model::DoubleValueRange>(domain);
model::Double span = double_domain->upper_bound - double_domain->lower_bound;
model::Double resulting_lower = double_domain->lower_bound + span * this->bound1;
model::Double resulting_upper = double_domain->lower_bound + span * this->bound2;
if (resulting_lower > resulting_upper) {
std::swap(resulting_lower, resulting_upper);
}
return std::make_shared<model::DoubleValueRange>(
model::DoubleValueRange(resulting_lower, resulting_upper));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, EncodedValueRange::Decode should be placed in a .cpp file

Comment on lines +67 to +60
return std::make_shared<model::IntValueRange>(
model::IntValueRange(resulting_lower, resulting_upper));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return std::make_shared<model::IntValueRange>(
model::IntValueRange(resulting_lower, resulting_upper));
return std::make_shared<model::IntValueRange>(resulting_lower, resulting_upper);

The purpose of make_shared is to forward it's parameters to IntValueRange's constructor and create IntValueRange in-place. In your code temporary IntValueRange is created and being copied (or moved) to shared_ptr. Also it's not a good idea to repeat type name.
This also applies to Double and Mixed cases

Comment on lines +6 to +7
std::mt19937 RNG::rng_ = std::mt19937(kSeed);
std::uniform_real_distribution<double> RNG::uni_ = std::uniform_real_distribution(0.0, 1.0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::mt19937 RNG::rng_ = std::mt19937(kSeed);
std::uniform_real_distribution<double> RNG::uni_ = std::uniform_real_distribution(0.0, 1.0);
std::mt19937 RNG::rng_{kSeed};
std::uniform_real_distribution<double> RNG::uni_{0.0, 1.0};

@@ -0,0 +1 @@
#include "des/des.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "des/des.h"
#pragma once
#include "des/des.h"

Comment on lines +8 to +17
size_t antecounter = 0;
for (auto const& [key, value] : ante_) {
if (antecounter > 0) {
result += ", ";
}
result += std::to_string(key);
result += ": ";
result += value->ToString();
antecounter++;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t antecounter = 0;
for (auto const& [key, value] : ante_) {
if (antecounter > 0) {
result += ", ";
}
result += std::to_string(key);
result += ": ";
result += value->ToString();
antecounter++;
}
for (auto it{ante_.begin()}; it != ante_.end(); ++it) {
if (it != ante_.begin()) {
result += ", ";
}
result += std::to_string(it->first);
result += ": ";
result += it->second->ToString();
}

Comment on lines +19 to +28
size_t conscounter = 0;
for (auto const& [key, value] : cons_) {
if (conscounter > 0) {
result += ", ";
}
result += std::to_string(key);
result += ": ";
result += value->ToString();
conscounter++;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t conscounter = 0;
for (auto const& [key, value] : cons_) {
if (conscounter > 0) {
result += ", ";
}
result += std::to_string(key);
result += ": ";
result += value->ToString();
conscounter++;
}
for (auto it{cons_.begin()}; it != cons_.end(); ++it) {
if (it != cons_.begin()) {
result += ", ";
}
result += std::to_string(it->first);
result += ": ";
result += it->second->ToString();
}

Comment on lines +43 to +45
} else {
result.confidence = num_rows_fit_ante_and_cons / (double)num_rows_fit_ante;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else here makes no sense -- if branch has return at the end.

Suggested change
} else {
result.confidence = num_rows_fit_ante_and_cons / (double)num_rows_fit_ante;
}
result.confidence = num_rows_fit_ante_and_cons / (double)num_rows_fit_ante;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also (double) is a c-style cast. Prefer using static_cast

Comment on lines +66 to +67
value); // is it okay to use a non-long double here? clion says GetValue returns
// just double.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's ok. Double is just double (it's defined exactly as using model::Double = double;), and GetValue<T> returns T

Comment on lines +61 to +81
DoubleValueRange::DoubleValueRange(model::TypedColumnData const& column) {
bool initialized = false;
for (size_t row_index = 0; row_index < column.GetNumRows(); row_index++) {
std::byte const* value = column.GetValue(row_index);
double double_value = model::Type::GetValue<model::Double>(
value); // is it okay to use a non-long double here? clion says GetValue returns
// just double.
if (!initialized) {
lower_bound = double_value;
upper_bound = double_value;
initialized = true;
continue;
}
if (double_value > upper_bound) {
upper_bound = double_value;
}
if (double_value < lower_bound) {
lower_bound = double_value;
}
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method repeats IntValueRange::IntValueRange. The only difference is int_value/double_value type. I think, you can use template function like

template <typename T>
CreateValueRange(...) {
    ...
    T typed_value = model::Type::GetValue<T>(value);
    ...

and in constructor will be CreateValueRange<model::Int>(...) and CreateValueRange<model::Double>(...).

bool initialized = false;
for (size_t row_index = 0; row_index < column.GetNumRows(); row_index++) {
std::byte const* value = column.GetValue(row_index);
long int int_value = model::Type::GetValue<model::Int>(value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You assign value of type model::Int to int_value of type long. Then you assign it to lower_bound and upper_bound, both of type model::Int. Why can't int_value be of type model::Int?
Moreover, model::Int is not the same type as long -- it's alias for int_64_t, which is twice as long on some machines (see https://en.cppreference.com/w/cpp/language/types).

Suggested change
long int int_value = model::Type::GetValue<model::Int>(value);
model::Int int_value = model::Type::GetValue<model::Int>(value);

The same for DoubleValueRange

virtual bool Includes(std::byte const* value) const = 0;
virtual std::string ToString() const = 0;

protected:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style guide recommends that protected section precede public

algorithm->Execute();
SUCCEED();
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests, checking that algorithm works fine:

  • test that NARs that should hold are found
  • test that their qualities are calculated correctly (fitness, support, confidence)

return bound2;
break;
default:
throw std::out_of_range("Index out of range for value range.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove code duplication by creating a static array of four elements in private section:

    static constexpr double EncodedValueRange::* member_pointers_[kFieldCount] = {
        &EncodedValueRange::permutation,
        &EncodedValueRange::threshold,
        &EncodedValueRange::bound1,
        &EncodedValueRange::bound2
    };

Those are pointers to the member values

And use it like this:

    double& operator[](size_t index) {
        if (index >= kFieldCount) {
            throw std::out_of_range("Index out of range for EncodedValueRange.");
        }
        return this->*(member_pointers_[index]);
    }

    const double& operator[](size_t index) const {
        if (index >= kFieldCount) {
            throw std::out_of_range("Index out of range for EncodedValueRange.");
        }
        return this->*(member_pointers_[index]);
    }

But that looks kinda ugly IMO. Maybe your approach is readable enough, decide for yourself

Comment on lines +59 to +52
std::shared_ptr<model::IntValueRange> int_domain =
std::static_pointer_cast<model::IntValueRange>(domain);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use auto when specifying type of a variable when you are performing a cast. The reason is that the type can be easily deduced by both a compiler and a person reading the code:

Suggested change
std::shared_ptr<model::IntValueRange> int_domain =
std::static_pointer_cast<model::IntValueRange>(domain);
auto int_domain = std::static_pointer_cast<model::IntValueRange>(domain);

Comment on lines +71 to +64
std::shared_ptr<model::DoubleValueRange> double_domain =
std::static_pointer_cast<model::DoubleValueRange>(domain);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here with an auto

Comment on lines +58 to +73
case model::TypeId::kInt: {
std::shared_ptr<model::IntValueRange> int_domain =
std::static_pointer_cast<model::IntValueRange>(domain);
model::Int span = int_domain->upper_bound - int_domain->lower_bound;
model::Int resulting_lower = int_domain->lower_bound + span * this->bound1;
model::Int resulting_upper = int_domain->lower_bound + span * this->bound2;
if (resulting_lower > resulting_upper) {
std::swap(resulting_lower, resulting_upper);
}
return std::make_shared<model::IntValueRange>(
model::IntValueRange(resulting_lower, resulting_upper));
}
case model::TypeId::kDouble: {
std::shared_ptr<model::DoubleValueRange> double_domain =
std::static_pointer_cast<model::DoubleValueRange>(domain);
model::Double span = double_domain->upper_bound - double_domain->lower_bound;
model::Double resulting_lower = double_domain->lower_bound + span * this->bound1;
model::Double resulting_upper = double_domain->lower_bound + span * this->bound2;
if (resulting_lower > resulting_upper) {
std::swap(resulting_lower, resulting_upper);
}
return std::make_shared<model::DoubleValueRange>(
model::DoubleValueRange(resulting_lower, resulting_upper));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also we should avoid using c-style casts: "(size_t)"

if (bound1 == 1.0) {
result = string_vector.back();
} else {
result = string_vector[(size_t)(span * this->bound1)];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this->bound1? just bound1 is enough

Comment on lines +43 to +45
} else {
result.confidence = num_rows_fit_ante_and_cons / (double)num_rows_fit_ante;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also (double) is a c-style cast. Prefer using static_cast

NARQualities result;
if (num_rows_fit_ante == 0) {
result.fitness = 0.0;
result.confidence = 0.0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also think that {0.0, 0.0, 0.0} will be better. In fact, we can do that for all cases:

    if (num_rows_fit_ante == 0) {
        return {0.0, 0.0, 0.0};
    }

    double confidence = static_cast<double>(num_rows_fit_ante_and_cons) / num_rows_fit_ante;
    double support = static_cast<double>(num_rows_fit_ante_and_cons) / num_rows;

    if (support == 0.0) {
        return {0.0, 0.0, 0.0};
    }

    double inclusion = static_cast<double>(included_features) / feature_count;
    double fitness = (confidence + support + inclusion) / 3.0;

    return {fitness, confidence, support};

// TODO: this function is way too big and cluttered
void NAR::SetQualities(TypedRelation const* typed_relation) {
if (ante_.size() == 0 || cons_.size() == 0) {
qualities_.fitness = 0.0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about other fields?

        qualities_ = {0.0, 0.0, 0.0};

Comment on lines +66 to +68
size_t num_rows_fit_ante = 0;
size_t num_rows_fit_ante_and_cons = 0;
for (size_t rowi = 0; rowi < typed_relation->GetNumRows(); rowi++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract typed_relation->GetNumRows() to a variable since you will be using it later

for (size_t rowi = 0; rowi < typed_relation->GetNumRows(); rowi++) {
bool row_fits_ante = true;
bool row_fits_cons = true;
for (size_t coli = 0; coli < typed_relation->GetNumColumns(); coli++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with GetNumColumns()

Comment on lines +72 to +82
model::TypedColumnData const& column = typed_relation->GetColumnData(coli);
auto value = column.GetValue(rowi);
row_fits_ante &= AnteFitsValue(coli, value);
if (!row_fits_ante) {
break;
}
row_fits_cons &= ConsFitsValue(coli, value);
}
num_rows_fit_ante += row_fits_ante;
num_rows_fit_ante_and_cons += (row_fits_ante && row_fits_cons);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe smth like this is easier:

            if (row_fits_ante) {
                row_fits_ante = AnteFitsValue(coli, value);
            } else {
                break;
            }

            if (row_fits_ante) {
                row_fits_cons &= ConsFitsValue(coli, value);
            }
        }

        if (row_fits_ante) {
            ++num_rows_fit_ante;
            if (row_fits_cons) {
                ++num_rows_fit_ante_and_cons;
            }
        }

Comment on lines +109 to +116
bool map_binds_feature = map.find(feature_index) != map.end();
if (!map_binds_feature) {
return true;
} else if (map.at(feature_index)->Includes(value)) {
return true;
} else {
return false;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess find here saves one extra lookup, so maybe smth like this is acceptable too

    auto it = map.find(feature_index);
    if (it == map.end()) {
        return ture;
    }
    return it->second->Includes(value);
}

namespace model {

StringValueRange::StringValueRange(model::TypedColumnData const& column) {
domain = std::vector<std::string>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
domain = std::vector<std::string>();
domain.reserve(column.GetNumRows());

domain = std::vector<std::string>();
for (size_t row_index = 0; row_index < column.GetNumRows(); row_index++) {
std::byte const* value = column.GetValue(row_index);
std::string string_value = model::Type::GetValue<std::string>(value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model not needed here

Comment on lines +31 to +45
std::string IntValueRange::ToString() const {
std::string result;
result += "[";
result += std::to_string(lower_bound);
result += " - ";
result += std::to_string(upper_bound);
result += "]";
return result;
}

IntValueRange::IntValueRange(model::TypedColumnData const& column) {
bool initialized = false;
for (size_t row_index = 0; row_index < column.GetNumRows(); row_index++) {
std::byte const* value = column.GetValue(row_index);
long int int_value = model::Type::GetValue<model::Int>(value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe create a templated class NumericalValueRange and use smth like:

using IntValueRange = NumericValueRange<Int>;
using DoubleValueRange = NumericValueRange<Double>;

like

template <typename T>
class NumericValueRange : public ValueRange {
public:
    explicit NumericValueRange(const TypedColumnData& column) {
        bool initialized = false;
        for (size_t row_index = 0; row_index < column.GetNumRows(); ++row_index) {
            const std::byte* value = column.GetValue(row_index);
            T numeric_value = Type::GetValue<T>(value);
            if (!initialized) {
                lower_bound_ = numeric_value;
                upper_bound_ = numeric_value;
                initialized = true;
            } else {
                if (numeric_value < lower_bound_) {
                    lower_bound_ = numeric_value;
                }
                if (numeric_value > upper_bound_) {
                    upper_bound_ = numeric_value;
                }
            }
        }
    }

    [[nodiscard]] std::string ToString() const override {
        return "[" + std::to_string(lower_bound_) + " - " + std::to_string(upper_bound_) + "]";
    }

    [[nodiscard]] bool Includes(const std::byte* value) const override {
        T numeric_value = Type::GetValue<T>(value);
        return numeric_value >= lower_bound_ && numeric_value <= upper_bound_;
    }

private:
    T lower_bound_{};
    T upper_bound_{};
};

Comment on lines +18 to +29
std::string StringValueRange::ToString() const {
std::string result;
result += "[";
if (domain.size() > 0) {
result += domain[0];
}
for (size_t i = 1; i < domain.size(); i++) {
result += (", " + domain[i]);
}
result += "]";
return result;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        if (domain_.empty()) {
            return "[]";
        }
        std::string result = "[" + domain_.front();
        for (size_t i = 1; i < domain_.size(); ++i) {
            result += ", " + domain_[i];
        }
        result += "]";
        return result;

Comment on lines +99 to +102
case TypeId::kString:
return std::make_shared<StringValueRange>(column);
case TypeId::kMixed:
return std::make_shared<StringValueRange>(column);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use fallthrough:

        case TypeId::kString:
        case TypeId::kMixed:
            return std::make_shared<StringValueRange>(column);

Comment on lines +18 to +21
std::vector<size_t> ind_vec;
ind_vec.reserve(number_of_indices);
ind_vec.insert(ind_vec.end(), indices.begin(), indices.end());
return ind_vec;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<size_t> ind_vec;
ind_vec.reserve(number_of_indices);
ind_vec.insert(ind_vec.end(), indices.begin(), indices.end());
return ind_vec;
std::vector<size_t> ind_vec {indices.begin(), indices.end()};
return ind_vec;

Comment on lines +33 to +38
} else {
index--;
size_t feature = index / EncodedValueRange().kFieldCount;
size_t feature_field = index % EncodedValueRange().kFieldCount;
return encoded_value_ranges_[feature][feature_field];
}
Copy link
Collaborator

@Petua41 Petua41 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {
index--;
size_t feature = index / EncodedValueRange().kFieldCount;
size_t feature_field = index % EncodedValueRange().kFieldCount;
return encoded_value_ranges_[feature][feature_field];
}
index--;
size_t feature = index / EncodedValueRange::kFieldCount;
size_t feature_field = index % EncodedValueRange::kFieldCount;
return encoded_value_ranges_[feature][feature_field];

return qualities_;
}

NAR EncodedNAR::Decode(FeatureDomains& domains) const {
Copy link
Collaborator

@Petua41 Petua41 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NAR EncodedNAR::Decode(FeatureDomains& domains) const {
NAR EncodedNAR::Decode(FeatureDomains const& domains) const {

If FeatureDomains doesn't have const operator[] (some of the stl containers may have this problem), use at -- it must have const version

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants