
Refactor data storage #15

Merged

MorrisNein merged 12 commits into docker_and_experiments from refactor_datasets_storage on Jun 30, 2023

Conversation


@MorrisNein MorrisNein commented Jun 1, 2023

Major changes:

  • Changed roles of the classes: DatasetCache -> Dataset, Dataset -> DatasetData
  • Made separate dataset classes by sources. E.g. CustomDataset, OpenMLDataset, etc.
  • Datasets are now addressed primarily by ids; names are optional. Use OpenMLDataset.from_search(...) to get an OpenMLDataset explicitly by name (plus other search arguments)
  • A dataset class is now responsible for caching/loading
  • OpenML dataset caching is entrusted to the openml library
  • OpenML datasets are now identified strictly by dataset id instead of possibly non-unique names
  • Unified cache storage. The cache of datasets and meta-features is now stored in the directories data/datasets/*source_name* and data/metafeatures/*source_name*, respectively
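The id-first flow above can be sketched roughly as follows. Only OpenMLDataset.from_search appears in the PR description; the constructor signature, attributes, and lookup logic below are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

DatasetIDType = int  # the id type introduced by this PR (assumed to be int here)


@dataclass
class OpenMLDataset:
    id_: DatasetIDType          # primary identifier -- always required
    name: Optional[str] = None  # names are optional and may be non-unique

    @classmethod
    def from_search(cls, name: str, **search_kwargs) -> "OpenMLDataset":
        # In the real class this would query OpenML by name (plus other
        # search arguments) and resolve it to a unique dataset id.
        resolved_id = hash(name) % 100_000  # placeholder for the real lookup
        return cls(id_=resolved_id, name=name)


dataset = OpenMLDataset(id_=61)            # direct, id-based access
found = OpenMLDataset.from_search('iris')  # explicit name-based search
```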

Minor changes:

  • Improved typing: added DatasetIDType, PredictorType
  • Deleted DataManager, which contained only static methods; split its content into file_system.py and cache.py
  • Introduced CacheProperties, which allows using templates for cache paths of different types
  • Separated business logic from low-level caching: classes no longer store information about the cache; the module cache.py manages the cache instead
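The CacheProperties idea above can be sketched as follows; the field and method names are assumptions, since the PR only states that cache paths are built from templates for different cache types:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CacheProperties:
    # Template with named placeholders, e.g. 'data/datasets/{source_name}/{id_}'
    path_template: str

    def resolve(self, **fields) -> Path:
        # Fill the template to get a concrete cache path.
        return Path(self.path_template.format(**fields))


datasets_cache = CacheProperties('data/datasets/{source_name}/{id_}')
path = datasets_cache.resolve(source_name='openml', id_=61)
# path.as_posix() -> 'data/datasets/openml/61'
```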

@MorrisNein MorrisNein changed the base branch from main to docker_and_experiments June 1, 2023 18:04
@MorrisNein MorrisNein linked an issue Jun 1, 2023 that may be closed by this pull request
@MorrisNein MorrisNein changed the title Refactor datasets storage Refactor data storage Jun 1, 2023
@ChernyakKonstantin (Collaborator) left a comment

I don't see anything wrong. However, I haven't used this code enough to be sure.
Some documentation is needed to understand what's going on here.

@AxiomAlive (Collaborator) left a comment

Not all of these comments are critical; the main thing worth raising is the architectural question. Right now we have no implementation scheme that we rely on. Whether one is needed at this stage is not for me to decide. Perhaps all contributors should get together for a meeting and discuss it. It's also worth writing tests and, as the previous reviewer already noted, docstrings are missing.

Collaborator left a comment

Suggest renaming it to something like dataset_similarity_based_model_advisor.

@MorrisNein (Collaborator, Author) commented Jun 13, 2023

All the examples currently have names of this kind. They could all be renamed, but not within this PR.

Collaborator left a comment

This file needs a separate review pass. I've already done some refactoring in it for my own needs, but haven't pushed that code yet.

@MorrisNein (Collaborator, Author) commented Jun 13, 2023

For now this file serves as an "experiment script", and its fate is to be changed many more times. Please refrain from refactoring it; that would be premature.

Upd: for your own purposes, it's better to create a copy of the file.

pass


class FileDataset(DatasetBase):
Collaborator left a comment

DatasetFile

@MorrisNein (Collaborator, Author) commented Jun 13, 2023

It's still a file-based dataset, not a dataset file.

Collaborator Author left a comment

In the end I named it CustomDataset. That seems clearer and better.

def get_project_root(cls) -> Path:
    """Returns project root folder."""
    return Path(__file__).parents[2]

def get_dataset_cache_path(cls, dataset_id: Any, source_name: str) -> Path:
Collaborator left a comment

The suggestion above would also avoid duplication in this case. It would become possible to extract a single function, e.g. get_cache_path.

Collaborator Author left a comment

Yes, indeed, a rudimentary leftover of the different storage formats for metadata and datasets. Agreed, they can be merged.
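A unified helper along the lines discussed here might look roughly like this; the name get_cache_path comes from the review suggestion, while the signature, suffix, and data layout are assumptions:

```python
from pathlib import Path
from typing import Any

DATA_DIR = Path('data')  # assumed root, mirroring the unified layout
                         # data/datasets/<source> and data/metafeatures/<source>


def get_cache_path(cache_type: str, source_name: str, item_id: Any,
                   suffix: str = '.pkl') -> Path:
    # One function instead of separate get_dataset_cache_path /
    # get_metafeatures_cache_path variants.
    return DATA_DIR / cache_type / source_name / f'{item_id}{suffix}'


dataset_path = get_cache_path('datasets', 'openml', 61)
meta_path = get_cache_path('metafeatures', 'openml', 61, suffix='.csv')
```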

from meta_automl.data_preparation.dataset import DatasetCache
from meta_automl.data_preparation.model import Model
from meta_automl.data_preparation.models_loaders import ModelsLoader

- DEFAULT_KNOWLEDGE_BASE_PATH = DataManager.get_data_dir().joinpath('knowledge_base_0')
+ DEFAULT_KNOWLEDGE_BASE_PATH = FileSystemManager.get_data_dir().joinpath('knowledge_base_0')
Collaborator left a comment

Better to store this in a configuration file.

Collaborator Author left a comment

A configuration file of what scope do you mean? The whole framework's?

from meta_automl.data_preparation.dataset import DatasetCache
from meta_automl.data_preparation.model import Model
from meta_automl.data_preparation.models_loaders import ModelsLoader

- DEFAULT_KNOWLEDGE_BASE_PATH = DataManager.get_data_dir().joinpath('knowledge_base_0')
+ DEFAULT_KNOWLEDGE_BASE_PATH = FileSystemManager.get_data_dir().joinpath('knowledge_base_0')


class KnowledgeBaseModelsLoader(ModelsLoader):
Collaborator left a comment

ModelLoader.


import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

from meta_automl.data_preparation.dataset import DatasetIDType
from meta_automl.meta_algorithm.datasets_similarity_assessors.datasets_similarity_assessor import \
DatasetsSimilarityAssessor
Collaborator left a comment

DatasetSimilarityAssessor.



class KNeighborsBasedSimilarityAssessor(ModelBasedSimilarityAssessor):
    def __init__(self, n_neighbors: int = 1, **model_params):
        model = NearestNeighbors(n_neighbors=n_neighbors, **model_params)
        super().__init__(model, n_neighbors)

-   def fit(self, meta_features: pd.DataFrame, datasets: Iterable[str]):
+   def fit(self, meta_features: pd.DataFrame, datasets: Iterable[DatasetIDType]):
        meta_features = self.preprocess_meta_features(meta_features)
Collaborator left a comment

Calling a static method through self isn't great.

Collaborator Author left a comment

Why not? What would be better?
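For context, both call styles are valid Python; a minimal sketch of the distinction the reviewer is pointing at (class and data are illustrative):

```python
class Assessor:
    @staticmethod
    def preprocess_meta_features(values):
        # Drop missing entries; no instance state is used.
        return [v for v in values if v is not None]


a = Assessor()
# Both calls behave identically; calling via the class name makes it
# explicit that the method does not depend on the instance.
via_self = a.preprocess_meta_features([1, None, 2])
via_class = Assessor.preprocess_meta_features([1, None, 2])
# via_self == via_class == [1, 2]
```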

@@ -30,7 +31,7 @@ def fit(self, meta_features: pd.DataFrame, datasets: Iterable[str]):
    def preprocess_meta_features(meta_features: pd.DataFrame) -> pd.DataFrame:
        return meta_features.dropna(axis=1, how='any')

-   def predict(self, meta_features: pd.DataFrame, return_distance: bool = False) -> Iterable[Iterable[str]]:
+   def predict(self, meta_features: pd.DataFrame, return_distance: bool = False) -> Iterable[Iterable[DatasetIDType]]:
Collaborator left a comment

Maybe pass datasets here rather than in fit?

Collaborator Author left a comment

Possibly. Why?
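For reference, the fit/predict contract in the diff above can be sketched with a tiny pure-Python 1-nearest-neighbour stand-in for sklearn's NearestNeighbors; the class name and the example ids below are illustrative, not the project's API:

```python
from math import dist


class TinySimilarityAssessor:
    def fit(self, meta_features, dataset_ids):
        # meta_features: feature vectors aligned with dataset_ids,
        # stored at fit time -- as in the current design under discussion.
        self._features = list(meta_features)
        self._ids = list(dataset_ids)

    def predict(self, meta_features):
        # For each query vector, return the id of the closest fitted dataset.
        return [
            min(zip(self._features, self._ids),
                key=lambda pair: dist(pair[0], query))[1]
            for query in meta_features
        ]


assessor = TinySimilarityAssessor()
assessor.fit([(0.0, 0.0), (1.0, 1.0)], dataset_ids=[40983, 61])
nearest = assessor.predict([(0.9, 1.1)])  # -> [61]
```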

@nicl-nno (Contributor) commented:

> Not all of these comments are critical; the main thing worth raising is the architectural question. Right now we have no implementation scheme that we rely on.

Agreed; first of all we need to settle on the planned functionality here, to maximize the value of this library. We can open an issue for that and put a proposal up for discussion soon.

@MorrisNein MorrisNein force-pushed the refactor_datasets_storage branch 2 times, most recently from 1ee2a1c to c1ff6a6 Compare June 30, 2023 12:04
@MorrisNein MorrisNein (Collaborator, Author) left a comment

Everything related to this PR has been fixed. For the remaining suggestions I've opened an issue.

I also invite everyone to further discuss the framework architecture in #21.

Merging in the current state into the other branch with the experiment. An additional review can be done there before merging into master.

@MorrisNein MorrisNein merged commit cb11a3c into docker_and_experiments Jun 30, 2023
@MorrisNein MorrisNein deleted the refactor_datasets_storage branch June 30, 2023 15:35
MorrisNein added a commit that referenced this pull request Jul 19, 2023
* refactor dataset classes, use openml cache

* fix example select_similar_datasets_by_knn.py

* create DatasetIDType

* create PredictorType

* remove DataManager, refactor cache

* update tests & test data

* allow explicit OpenMLDataset creation from name/search

* adapt examples to the last changes
MorrisNein added a commit that referenced this pull request Jul 20, 2023
MorrisNein added a commit that referenced this pull request Jul 20, 2023
MorrisNein added a commit that referenced this pull request Jul 21, 2023
* fix similarity assessors

* allow PymfeExtractor fill values with median

* optional cache usage for MFE extractor

* allow to advise only the n best models

* move to FEDOT 0.7.1

* add logging in PymfeExtractor

* add datasets train/test split

* Refactor data storage (#15)

* refactor dataset classes, use openml cache

* remove DataManager, refactor cache

* update tests & test data

* separate framework cache from other data
rostslove pushed a commit that referenced this pull request Aug 2, 2023
Successfully merging this pull request may close these issues.

Restrict identifying OpenML datasets by name