Refactor data storage #15
Conversation
Force-pushed from 404a5a5 to 765ec8b
I don't see anything wrong, though I haven't used this code enough to be sure. Some documentation is needed to understand what's going on here.
Not all of the comments are critical; the main point is the architectural question. So far we have no implementation scheme that we rely on. Whether one is needed at this stage is not for me to decide; perhaps all contributors should get together for a meeting and discuss it. It is also worth writing tests, and, as the previous reviewer already noted, docstrings are missing.
I suggest renaming this to something like dataset_similarity_based_model_advisor.
Right now all the examples are named this way. They could all be renamed, but not within this PR.
experiments/fedot_warm_start/run.py (outdated)
This file needs a separate review pass. I have already done some refactoring in it for my own needs, but I haven't pushed that code yet.
For now this file serves as an "experiment script", and its fate is to be changed many more times. Please refrain from refactoring it; that would be premature.
Upd: for your own purposes, it is better to create a copy of the file.
pass


class FileDataset(DatasetBase):
DatasetFile
It is a file-based dataset after all, not a file of a dataset.
In the end I named it CustomDataset. That seems clearer and better.
def get_project_root(cls) -> Path:
    """Returns project root folder."""
    return Path(__file__).parents[2]

def get_dataset_cache_path(cls, dataset_id: Any, source_name: str) -> Path:
The suggestion above would also help avoid duplication here: it would become possible to extract a single function, e.g. get_cache_path.
Yes, indeed — a rudimentary leftover from the different storage formats for metadata and datasets. Agreed, they can be merged.
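A minimal sketch of the unified helper discussed here. The layout `<root>/<kind>/<source_name>/<id>.pkl` and the parameter names are assumptions for illustration, not the project's actual API:

```python
from pathlib import Path
from typing import Any


def get_cache_path(data_dir: Path, cache_kind: str, source_name: str, item_id: Any) -> Path:
    """Build a cache path like <data_dir>/<cache_kind>/<source_name>/<item_id>.pkl.

    One helper covers both datasets and metafeatures, so the near-duplicate
    per-kind path functions mentioned in the review are no longer needed.
    """
    return data_dir / cache_kind / source_name / f'{item_id}.pkl'
```

For example, `get_cache_path(Path('data'), 'datasets', 'openml', 42)` yields `data/datasets/openml/42.pkl`, while the same function serves metafeatures with `cache_kind='metafeatures'`.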
from meta_automl.data_preparation.dataset import DatasetCache
from meta_automl.data_preparation.model import Model
from meta_automl.data_preparation.models_loaders import ModelsLoader

- DEFAULT_KNOWLEDGE_BASE_PATH = DataManager.get_data_dir().joinpath('knowledge_base_0')
+ DEFAULT_KNOWLEDGE_BASE_PATH = FileSystemManager.get_data_dir().joinpath('knowledge_base_0')
Better to store this in a configuration file.
A configuration file of what scope? For the whole framework?
from meta_automl.data_preparation.dataset import DatasetCache
from meta_automl.data_preparation.model import Model
from meta_automl.data_preparation.models_loaders import ModelsLoader

- DEFAULT_KNOWLEDGE_BASE_PATH = DataManager.get_data_dir().joinpath('knowledge_base_0')
+ DEFAULT_KNOWLEDGE_BASE_PATH = FileSystemManager.get_data_dir().joinpath('knowledge_base_0')


class KnowledgeBaseModelsLoader(ModelsLoader):
ModelLoader.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

from meta_automl.data_preparation.dataset import DatasetIDType
from meta_automl.meta_algorithm.datasets_similarity_assessors.datasets_similarity_assessor import \
    DatasetsSimilarityAssessor
DatasetSimilarityAssessor.
class KNeighborsBasedSimilarityAssessor(ModelBasedSimilarityAssessor):
    def __init__(self, n_neighbors: int = 1, **model_params):
        model = NearestNeighbors(n_neighbors=n_neighbors, **model_params)
        super().__init__(model, n_neighbors)

-   def fit(self, meta_features: pd.DataFrame, datasets: Iterable[str]):
+   def fit(self, meta_features: pd.DataFrame, datasets: Iterable[DatasetIDType]):
        meta_features = self.preprocess_meta_features(meta_features)
Calling a static method through self is not great.
Why not? What would be better?
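For context, both call styles behave the same for a `@staticmethod` in Python (the `self.` lookup resolves through the class anyway); the reviewer's point is readability. Referencing the class name signals that no instance state is involved. A minimal illustration with simplified, hypothetical names:

```python
import pandas as pd


class Assessor:
    @staticmethod
    def preprocess_meta_features(meta_features: pd.DataFrame) -> pd.DataFrame:
        # Drop any column containing NaNs, as in the PR's method.
        return meta_features.dropna(axis=1, how='any')

    def fit_via_self(self, meta_features: pd.DataFrame) -> pd.DataFrame:
        # Works, but reads as if instance state might be involved.
        return self.preprocess_meta_features(meta_features)

    def fit_via_class(self, meta_features: pd.DataFrame) -> pd.DataFrame:
        # Same result; the class reference makes the static call explicit.
        # Use type(self) instead if subclass overrides should be honored.
        return Assessor.preprocess_meta_features(meta_features)
```

Both variants return identical results; the choice is purely stylistic unless subclasses override the static method.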
@@ -30,7 +31,7 @@ def fit(self, meta_features: pd.DataFrame, datasets: Iterable[str]):
    def preprocess_meta_features(meta_features: pd.DataFrame) -> pd.DataFrame:
        return meta_features.dropna(axis=1, how='any')

-   def predict(self, meta_features: pd.DataFrame, return_distance: bool = False) -> Iterable[Iterable[str]]:
+   def predict(self, meta_features: pd.DataFrame, return_distance: bool = False) -> Iterable[Iterable[DatasetIDType]]:
Maybe pass datasets here rather than in fit?
Possibly. Why?
Agreed; first of all we need to settle on the planned functionality in order to maximize the value of this library. We can open an issue for that and bring a proposal up for discussion shortly.
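To make the fit/predict split under discussion concrete, here is a self-contained sketch of the flow: fit stores the dataset ids alongside the fitted NearestNeighbors index, and predict maps neighbor row indices back to those ids. The class and parameter names are simplified stand-ins, not the PR's actual API:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


class KNeighborsSimilaritySketch:
    def __init__(self, n_neighbors: int = 1):
        self._model = NearestNeighbors(n_neighbors=n_neighbors)
        self._dataset_ids = None

    @staticmethod
    def preprocess_meta_features(meta_features: pd.DataFrame) -> pd.DataFrame:
        # Drop columns with any NaNs, mirroring the PR's preprocess step.
        return meta_features.dropna(axis=1, how='any')

    def fit(self, meta_features: pd.DataFrame, dataset_ids) -> 'KNeighborsSimilaritySketch':
        meta_features = self.preprocess_meta_features(meta_features)
        self._dataset_ids = np.asarray(list(dataset_ids))
        self._model.fit(meta_features.to_numpy())
        return self

    def predict(self, meta_features: pd.DataFrame):
        # NOTE: a real implementation must apply the same preprocessing
        # (same column set) here as in fit; omitted in this sketch.
        indices = self._model.kneighbors(meta_features.to_numpy(), return_distance=False)
        return [self._dataset_ids[row] for row in indices]
```

This makes the trade-off visible: passing `datasets` in `fit` ties each id to a training row once, whereas passing them in `predict` would force every caller to keep the ids aligned with the fitted index themselves.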
Force-pushed from 1ee2a1c to c1ff6a6
Everything that belonged to this PR has been fixed. For the remaining comments I have opened an issue.
I also invite everyone to further discuss the framework architecture in #21.
Merging in the current state into another branch with the experiment; an additional review can be done there before merging into master.
Force-pushed from c1ff6a6 to 0b1beb9
Force-pushed from 0b1beb9 to 6e68ac8
* refactor dataset classes, use openml cache
* fix example select_similar_datasets_by_knn.py
* create DatasetIDType
* create PredictorType
* remove DataManager, refactor cache
* update tests & test data
* allow explicit OpenMLDataset creation from name/search
* adapt examples to the last changes
* fix similarity assessors
* allow PymfeExtractor fill values with median
* optional cache usage for MFE extractor
* allow to advise only the n best models
* move to FEDOT 0.7.1
* add logging in PymfeExtractor
* add datasets train/test split
* Refactor data storage (#15)
* refactor dataset classes, use openml cache
* remove DataManager, refactor cache
* update tests & test data
* separate framework cache from other data
Major changes:
* DatasetCache -> Dataset, Dataset -> DatasetData
* CustomDataset, OpenMLDataset, etc.
* OpenMLDataset.from_search(...) to get OpenMLDataset explicitly by name (+ other search arguments)
* data is stored in data/datasets/*source_name* and data/metafeatures/*source_name* correspondingly

Minor changes:
* removed DataManager, which contained only static methods; its content was separated into file_system.py and cache.py
* added CacheProperties that allow using templates for cache paths of different types; cache.py manages the cache instead
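The CacheProperties idea from the minor changes can be sketched as a small template object describing where items of one cache type live. The field and method names below are assumptions for illustration; the real class in cache.py may differ:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class CachePropertiesSketch:
    # Template for the cache directory, e.g. 'data/datasets/{source_name}'.
    dir_template: str
    # Template for the cached file name, e.g. '{id}.pkl'.
    file_template: str

    def path_for(self, **kwargs: Any) -> Path:
        # Unused template keys are simply ignored by str.format.
        return (Path(self.dir_template.format(**kwargs))
                / self.file_template.format(**kwargs))
```

With one such object per cache type (datasets, metafeatures, ...), the cache manager can resolve any path from a template instead of hard-coding a separate path function per type.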