
Refactor data storage #15

Merged

MorrisNein merged 12 commits into docker_and_experiments from refactor_datasets_storage on Jun 30, 2023

Conversation


@MorrisNein MorrisNein commented Jun 1, 2023

Major changes:

  • Changed roles of the classes: DatasetCache -> Dataset, Dataset -> DatasetData
  • Made separate dataset classes by sources. E.g. CustomDataset, OpenMLDataset, etc.
  • Datasets are now addressed primarily by ids; names are optional. Use OpenMLDataset.from_search(...) to get an OpenMLDataset explicitly by name (plus other search arguments)
  • A dataset class is now responsible for caching/loading
  • OpenML dataset caching is entrusted to the openml library
  • OpenML datasets are now identified strictly by dataset id instead of possibly non-unique names
  • Unified cache storage. The cache of datasets and meta-features is now stored in the directories data/datasets/*source_name* and data/metafeatures/*source_name*, respectively
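The id-first flow above can be sketched roughly as follows. Only OpenMLDataset.from_search appears in the PR description; the constructor signature, attributes, and lookup logic below are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

DatasetIDType = int  # the id type introduced by this PR (assumed to be int here)


@dataclass
class OpenMLDataset:
    id_: DatasetIDType          # primary identifier -- always required
    name: Optional[str] = None  # names are optional and may be non-unique

    @classmethod
    def from_search(cls, name: str, **search_kwargs) -> "OpenMLDataset":
        # In the real class this would query OpenML by name (plus other
        # search arguments) and resolve it to a unique dataset id.
        resolved_id = hash(name) % 100_000  # placeholder for the real lookup
        return cls(id_=resolved_id, name=name)


dataset = OpenMLDataset(id_=61)            # direct, id-based access
found = OpenMLDataset.from_search('iris')  # explicit name-based search
```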

Minor changes:

  • Improved typing: added DatasetIDType, PredictorType
  • Deleted DataManager, which contained only static methods; split its content into file_system.py and cache.py
  • Introduced CacheProperties, which allows using templates for cache paths of different types
  • Separated business logic from low-level caching: classes no longer store information about the cache; the module cache.py manages the cache instead
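The CacheProperties idea above can be sketched as follows; the field and method names are assumptions, since the PR only states that cache paths are built from templates for different cache types:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CacheProperties:
    # Template with named placeholders, e.g. 'data/datasets/{source_name}/{id_}'
    path_template: str

    def resolve(self, **fields) -> Path:
        # Fill the template to get a concrete cache path.
        return Path(self.path_template.format(**fields))


datasets_cache = CacheProperties('data/datasets/{source_name}/{id_}')
path = datasets_cache.resolve(source_name='openml', id_=61)
# path.as_posix() -> 'data/datasets/openml/61'
```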

@MorrisNein MorrisNein changed the base branch from main to docker_and_experiments June 1, 2023 18:04
@MorrisNein MorrisNein linked an issue Jun 1, 2023 that may be closed by this pull request
@MorrisNein MorrisNein changed the title Refactor datasets storage Refactor data storage Jun 1, 2023
@ChernyakKonstantin (Collaborator) left a comment

I don't see anything wrong. However, I haven't used this code enough to be sure.
Some documentation is needed to understand what's going on here.

@AxiomAlive (Collaborator) left a comment

Not all of these comments are critical; the main thing worth raising is the architectural question. Right now we have no implementation scheme that we rely on. Whether one is needed at this stage is not for me to decide. Perhaps all contributors should get together for a meeting and discuss it. It's also worth writing tests and, as the previous reviewer already noted, docstrings are missing.

Collaborator left a comment

Suggest renaming it to something like dataset_similarity_based_model_advisor.

@MorrisNein (Collaborator, Author) commented Jun 13, 2023

All the examples currently have names of this kind. They could all be renamed, but not within this PR.

Collaborator left a comment

This file needs a separate review pass. I've already done some refactoring in it for my own needs, but haven't pushed that code yet.

@MorrisNein (Collaborator, Author) commented Jun 13, 2023

For now this file serves as an "experiment script", and its fate is to be changed many more times. Please refrain from refactoring it; that would be premature.

Upd: for your own purposes, it's better to create a copy of the file.

pass


class FileDataset(DatasetBase):
Collaborator left a comment

DatasetFile

@MorrisNein (Collaborator, Author) commented Jun 13, 2023

It's still a file-based dataset, not a dataset file.

Collaborator Author left a comment

In the end I named it CustomDataset. That seems clearer and better.

def get_project_root(cls) -> Path:
    """Returns project root folder."""
    return Path(__file__).parents[2]

def get_dataset_cache_path(cls, dataset_id: Any, source_name: str) -> Path:
Collaborator left a comment

The suggestion above would also avoid duplication in this case. It would become possible to extract a single function, e.g. get_cache_path.

Collaborator Author left a comment

Yes, indeed, a rudimentary leftover of the different storage formats for metadata and datasets. Agreed, they can be merged.
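A unified helper along the lines discussed here might look roughly like this; the name get_cache_path comes from the review suggestion, while the signature, suffix, and data layout are assumptions:

```python
from pathlib import Path
from typing import Any

DATA_DIR = Path('data')  # assumed root, mirroring the unified layout
                         # data/datasets/<source> and data/metafeatures/<source>


def get_cache_path(cache_type: str, source_name: str, item_id: Any,
                   suffix: str = '.pkl') -> Path:
    # One function instead of separate get_dataset_cache_path /
    # get_metafeatures_cache_path variants.
    return DATA_DIR / cache_type / source_name / f'{item_id}{suffix}'


dataset_path = get_cache_path('datasets', 'openml', 61)
meta_path = get_cache_path('metafeatures', 'openml', 61, suffix='.csv')
```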

from meta_automl.data_preparation.dataset import DatasetCache
from meta_automl.data_preparation.model import Model
from meta_automl.data_preparation.models_loaders import ModelsLoader

- DEFAULT_KNOWLEDGE_BASE_PATH = DataManager.get_data_dir().joinpath('knowledge_base_0')
+ DEFAULT_KNOWLEDGE_BASE_PATH = FileSystemManager.get_data_dir().joinpath('knowledge_base_0')
Collaborator left a comment

Better to store this in a configuration file.

Collaborator Author left a comment

A configuration file of what scope do you mean? The whole framework's?

from meta_automl.data_preparation.dataset import DatasetCache
from meta_automl.data_preparation.model import Model
from meta_automl.data_preparation.models_loaders import ModelsLoader

- DEFAULT_KNOWLEDGE_BASE_PATH = DataManager.get_data_dir().joinpath('knowledge_base_0')
+ DEFAULT_KNOWLEDGE_BASE_PATH = FileSystemManager.get_data_dir().joinpath('knowledge_base_0')


class KnowledgeBaseModelsLoader(ModelsLoader):
Collaborator left a comment

ModelLoader.


import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

from meta_automl.data_preparation.dataset import DatasetIDType
from meta_automl.meta_algorithm.datasets_similarity_assessors.datasets_similarity_assessor import \
DatasetsSimilarityAssessor
Collaborator left a comment

DatasetSimilarityAssessor.



class KNeighborsBasedSimilarityAssessor(ModelBasedSimilarityAssessor):
    def __init__(self, n_neighbors: int = 1, **model_params):
        model = NearestNeighbors(n_neighbors=n_neighbors, **model_params)
        super().__init__(model, n_neighbors)

-   def fit(self, meta_features: pd.DataFrame, datasets: Iterable[str]):
+   def fit(self, meta_features: pd.DataFrame, datasets: Iterable[DatasetIDType]):
        meta_features = self.preprocess_meta_features(meta_features)
Collaborator left a comment

Calling a static method through self isn't great.

Collaborator Author left a comment

Why not? What would be better?
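For context, both call styles are valid Python; a minimal sketch of the distinction the reviewer is pointing at (class and data are illustrative):

```python
class Assessor:
    @staticmethod
    def preprocess_meta_features(values):
        # Drop missing entries; no instance state is used.
        return [v for v in values if v is not None]


a = Assessor()
# Both calls behave identically; calling via the class name makes it
# explicit that the method does not depend on the instance.
via_self = a.preprocess_meta_features([1, None, 2])
via_class = Assessor.preprocess_meta_features([1, None, 2])
# via_self == via_class == [1, 2]
```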

@@ -30,7 +31,7 @@ def fit(self, meta_features: pd.DataFrame, datasets: Iterable[str]):
    def preprocess_meta_features(meta_features: pd.DataFrame) -> pd.DataFrame:
        return meta_features.dropna(axis=1, how='any')

-   def predict(self, meta_features: pd.DataFrame, return_distance: bool = False) -> Iterable[Iterable[str]]:
+   def predict(self, meta_features: pd.DataFrame, return_distance: bool = False) -> Iterable[Iterable[DatasetIDType]]:
Collaborator left a comment

Maybe pass datasets here rather than in fit?

Collaborator Author left a comment

Possibly. Why?
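For reference, the fit/predict contract in the diff above can be sketched with a tiny pure-Python 1-nearest-neighbour stand-in for sklearn's NearestNeighbors; the class name and the example ids below are illustrative, not the project's API:

```python
from math import dist


class TinySimilarityAssessor:
    def fit(self, meta_features, dataset_ids):
        # meta_features: feature vectors aligned with dataset_ids,
        # stored at fit time -- as in the current design under discussion.
        self._features = list(meta_features)
        self._ids = list(dataset_ids)

    def predict(self, meta_features):
        # For each query vector, return the id of the closest fitted dataset.
        return [
            min(zip(self._features, self._ids),
                key=lambda pair: dist(pair[0], query))[1]
            for query in meta_features
        ]


assessor = TinySimilarityAssessor()
assessor.fit([(0.0, 0.0), (1.0, 1.0)], dataset_ids=[40983, 61])
nearest = assessor.predict([(0.9, 1.1)])  # -> [61]
```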

@nicl-nno (Contributor) commented:

> Not all of these comments are critical; the main thing worth raising is the architectural question. Right now we have no implementation scheme that we rely on.

Agreed; first of all we need to settle on the planned functionality here, to maximize the value of this library. We can open an issue for that and put a proposal up for discussion soon.

@MorrisNein MorrisNein force-pushed the refactor_datasets_storage branch 2 times, most recently from 1ee2a1c to c1ff6a6 Compare June 30, 2023 12:04
@MorrisNein MorrisNein (Collaborator, Author) left a comment

Everything related to this PR has been fixed. For the remaining suggestions I've opened an issue.

I also invite everyone to further discuss the framework architecture in #21.

Merging in the current state into the other branch with the experiment. An additional review can be done there before merging into master.

@MorrisNein MorrisNein merged commit cb11a3c into docker_and_experiments Jun 30, 2023
@MorrisNein MorrisNein deleted the refactor_datasets_storage branch June 30, 2023 15:35
MorrisNein added a commit that referenced this pull request Jul 19, 2023
* refactor dataset classes, use openml cache

* fix example select_similar_datasets_by_knn.py

* create DatasetIDType

* create PredictorType

* remove DataManager, refactor cache

* update tests & test data

* allow explicit OpenMLDataset creation from name/search

* adapt examples to the last changes
MorrisNein added a commit that referenced this pull request Jul 20, 2023
MorrisNein added a commit that referenced this pull request Jul 20, 2023
MorrisNein added a commit that referenced this pull request Jul 21, 2023
* fix similarity assessors

* allow PymfeExtractor fill values with median

* optional cache usage for MFE extractor

* allow to advise only the n best models

* move to FEDOT 0.7.1

* add logging in PymfeExtractor

* add datasets train/test split

* Refactor data storage (#15)

* refactor dataset classes, use openml cache

* remove DataManager, refactor cache

* update tests & test data

* separate framework cache from other data
rostslove pushed a commit that referenced this pull request Aug 2, 2023
Successfully merging this pull request may close these issues.

Restrict identifying OpenML datasets by name