New loaders #47

Merged on Nov 7, 2023 (216 commits)
Changes from 18 commits
c847282
privatizes many methods of PipelineStep and allows .process_resource(…
johentsch Jun 23, 2023
06f5443
pulls apart FeatureStep(PipelineStep), which all PipelineSteps curren…
johentsch Jun 23, 2023
84e8ad9
renames PipelineStep._dispatch() => _make_new_resource()
johentsch Jun 23, 2023
f0a7b37
early version of the MuseScoreLoader
johentsch Jun 25, 2023
3cb34a1
makes loaders a package; separates ScoreLoader from Loader; adds skel…
johentsch Jun 26, 2023
babfc22
first naive version performing a raw music21 parse (events shadow str…
johentsch Jun 26, 2023
f723d8e
advances Music21Loader to parse a single resource
johentsch Jun 26, 2023
c1a850f
enables Music21Loader to recursively scan directory; computation of I…
johentsch Jun 26, 2023
1e955bf
adds basepath argument to Dataset.__init__()
johentsch Jun 26, 2023
28da592
adds default_output_dir = "~/dimcat_data" to settings.ini
johentsch Jun 26, 2023
94d1956
enables ScoreLoader.process(Dataset)
johentsch Jun 27, 2023
51653ef
moves PathFactory to loaders.utils
johentsch Jun 27, 2023
4e7c845
moves class variables and paths property up
johentsch Jun 27, 2023
d695194
pulls up Loader._process_dataset() harmonizing it for all loaders thr…
johentsch Jun 27, 2023
54b911f
implements PackageLoader (closes #40)
johentsch Jun 27, 2023
c0c69f9
updates contributors' guide
johentsch Jun 27, 2023
e45decb
corrects dependency syntax
johentsch Jun 27, 2023
f6cf868
renames default_output_dir => default_basepath and makes tests pass
johentsch Jun 27, 2023
f769d0c
drop empty columns when creating dataframe from m21
johentsch Jul 2, 2023
9238890
cleans up outputs and replaces print() with logger.info()
johentsch Jul 2, 2023
b81f26c
renames subdirs to return_tuples
johentsch Jul 2, 2023
9f65bcf
log message
johentsch Jul 2, 2023
ad39517
renames FeatureStep => FeatureProcessingStep and adds some docs
johentsch Jul 2, 2023
567bc8a
better explanation of the FeatureProcessingStep.is_transformation pro…
johentsch Jul 2, 2023
a8683c6
makes Pipeline a subclass of PipelineStep (not FeatureProcessingStep)
johentsch Jul 2, 2023
33dfed5
adds Loader test case to test_base.py; + a few cosmetics
johentsch Jul 2, 2023
763a303
makes utils.resolve_path() typesafe
johentsch Jul 2, 2023
f21d12d
improves PackageLoader by not requiring a package_name and discoverin…
johentsch Jul 2, 2023
1d9300a
implements Dataset.from_loader() (closes #42 :tada:)
johentsch Jul 2, 2023
33908c6
implements Loader.create_dataset() (closes #41)
johentsch Jul 2, 2023
e4d22aa
adds DimcatPackage.get_piece_index()
johentsch Jul 2, 2023
a67fe1e
adds DimcatPackage.get_boolean_resource_table() and tests that its in…
johentsch Jul 2, 2023
8089c75
makes DimcatResource a subclass of the new Resource base class
johentsch Jul 2, 2023
e411b31
adds a new Resource superclass and pulls basepath attribute up to the…
johentsch Jul 2, 2023
ff163ee
introduces new setting "default_resource_name"
johentsch Jul 3, 2023
cd0af40
cleans up and facilitates DimcatResource.__init__() and adapts Data o…
johentsch Jul 3, 2023
e05da81
Merge branch 'development' into loader
johentsch Jul 3, 2023
f437c24
Merge branch 'development' into loader
johentsch Jul 4, 2023
0760ad5
moves DimcatResource.Schema.init_object() up to Resource.Schema
johentsch Jul 4, 2023
0cd83d8
adds base test case for Resource
johentsch Jul 4, 2023
6b84707
moves DimcatPackage unittests to test_package.py
johentsch Jul 4, 2023
0edfb57
moves get_score_paths() to conftest
johentsch Jul 4, 2023
7ba82f4
generalizes get_score_paths(), specifies get_m21_score_paths()
johentsch Jul 4, 2023
5175df4
updates TestBaseResource to run on multiple, mixed score paths
johentsch Jul 4, 2023
d584964
adds DimcatPackage.from_resources() and .from_filepaths(), and allows…
johentsch Jul 4, 2023
9789f21
introduces global constant TEST_N_SCORES to allow for speeding up tes…
johentsch Jul 4, 2023
90ffbcd
renames DimcatPackage.make_new_resource() to .add_new_dimcat_resource…
johentsch Jul 4, 2023
9dd3b03
renames Loader.add_piece_facet() => .add_piece_facet_dataframe() to a…
johentsch Jul 4, 2023
615fbd7
factors out dataset.base.PackageSchema in order to introduce the dist…
johentsch Jul 5, 2023
1a505bf
pulls up Data.to_dict(pickle=False) and Data.pickle_schema and Data.g…
johentsch Jul 5, 2023
5720cba
more robust handling of descriptor_filepath
johentsch Jul 5, 2023
99896c4
adds property ID to Resource objects and allows for storing a corpus_…
johentsch Jul 5, 2023
047ae68
updates resolve_dir(), stripping terminal separators so that the base…
johentsch Jul 5, 2023
4752f97
enables specifying resource_names and corpus_names (factories) when c…
johentsch Jul 5, 2023
404ce24
tidies up Resource API removing unmanageable side-effects. DimcatReso…
johentsch Jul 5, 2023
bde5bc0
tidies up DimcatResource API according to the previous clean-up of Re…
johentsch Jul 6, 2023
ffa5ee2
removes now superfluous methods _set_descriptor_path and _set_file_path
johentsch Jul 6, 2023
3694a85
Resource.from_descriptor() now deserializes as the correct subclass i…
johentsch Jul 7, 2023
c2b6f85
makes all internal imports more precise
johentsch Jul 7, 2023
bec25d4
creates Package superclass (WIP)
johentsch Jul 7, 2023
fa07c07
factors out catalog and package into their own Python packages
johentsch Jul 7, 2023
06b8bcf
overall uniform packaging structure with subpackages of data and step…
johentsch Jul 7, 2023
6e45a5c
adds dedicated PathResource; harmonizes ResourceStatus and how Resour…
johentsch Jul 8, 2023
a0b1670
docs on ResourceStatus
johentsch Jul 8, 2023
18491b7
first step towards Packages correctly handling resources including pa…
johentsch Jul 10, 2023
545e94e
updates project requirements
johentsch Jul 18, 2023
16dc2eb
replaces manual.rst with a subfolder containing the Jupytext notebook…
johentsch Jul 18, 2023
5eb74d0
commenting out modin[ray] for now
johentsch Jul 18, 2023
a3fb373
moves the Minimal Working Example 'mwe' from the top level into docs,…
johentsch Jul 18, 2023
47b22a8
updates unittest_metacorpus commit
johentsch Jul 18, 2023
dc8dfb0
overall uniform packaging structure with subpackages of data and step…
johentsch Jul 18, 2023
47ce991
DimcatPackage to store resources in ZIP archive by default
johentsch Jul 18, 2023
97ff9cb
enables 'from dimcat import Pipeline'
johentsch Jul 18, 2023
4e5cf09
loaders underway to being adapted
johentsch Jul 18, 2023
1e0a71a
manual/data work-in-progress
johentsch Jul 18, 2023
af01cee
adds Python package for slicer
johentsch Jul 19, 2023
55ae3c8
towards fixing the loader tests
johentsch Jul 19, 2023
943035f
fixes for test_base.py
johentsch Jul 19, 2023
22836bd
makes all tests in test_package.py pass
johentsch Jul 19, 2023
88dba35
excludes RECONCILE modes from test_package.py to prevent copying the r…
johentsch Jul 19, 2023
ae555ff
prevents PathPackage from storing its descriptor after adding a resource
johentsch Jul 19, 2023
0a1c984
adapts Catalog to use Package (instead of DimcatPackage exclusively)
johentsch Jul 19, 2023
d59bcfd
progress on the loaders
johentsch Jul 19, 2023
f592bcf
Merge branch 'loader' into slicer
johentsch Jul 19, 2023
28f515a
minor bug fixes
johentsch Jul 19, 2023
2637925
behaviour closer to how it should be makes more tests fail
johentsch Jul 19, 2023
abc7b1a
Dependency fixes
Elizafox Jul 22, 2023
ad73863
Merge
Elizafox Jul 22, 2023
9a22e76
Merge pull request #1 from Elizafox/loader
johentsch Jul 27, 2023
7e0c2a2
Revert "Merge"
johentsch Jul 27, 2023
ba7d2e7
enable dc.PackageLoader convenience
johentsch Sep 9, 2023
18c0c53
updates package requirements
johentsch Sep 9, 2023
7cb2dc2
Resource.from_descriptor() dispatches to DimcatResource for all fl.Pa…
johentsch Sep 9, 2023
63c7380
Resource.from_descriptor() dispatches to PathResource if fl.Package.t…
johentsch Sep 9, 2023
922c5b7
adds Dataset.from_package() constructor
johentsch Sep 9, 2023
b7dc50a
updates pre-commit hook versions
johentsch Sep 9, 2023
bf4869e
renames .get_resource() => get_resource_by_name(); maintains custom m…
johentsch Sep 9, 2023
c217591
adds and executes tox lint
johentsch Sep 9, 2023
e7745d6
introduces MuseScorePackage and MuseScoreFacet. Initializing from a p…
johentsch Sep 9, 2023
f0c364a
DimcatCatalog.summary_dict() displays resource types by default
johentsch Sep 9, 2023
d444f24
MuseScoreFacet dispatches to subclass based on resource name
johentsch Sep 9, 2023
f21d531
introduces Facet base class with 'extractable_features' class variable
johentsch Sep 9, 2023
3470963
adds properties and methods to Package, updates .extract_feature()
johentsch Sep 9, 2023
7d39eef
correct compilation of descriptor dict
johentsch Sep 17, 2023
55b3862
updates MWE datapackage with dcml_corpora@a2afd8b via ms3 v2.2.1
johentsch Sep 17, 2023
142664c
adds Package.get_resources_by_regex() and .get_resources_by_type(), a…
johentsch Sep 17, 2023
28b1778
copies HarmonyLabels.__init__() from DimcatResource
johentsch Sep 17, 2023
865e163
renames package analyzer => analyzers
johentsch Sep 17, 2023
06f642d
renames package extractor => extractors
johentsch Sep 17, 2023
7158017
renames package grouper => groupers
johentsch Sep 17, 2023
7f423fa
renames package pipeline => pipelines
johentsch Sep 17, 2023
79afa6d
renames package slicer => slicers
johentsch Sep 17, 2023
261ec7b
renames package catalog => catalogs
johentsch Sep 17, 2023
dd8d1a2
renames package dataset => datasets
johentsch Sep 17, 2023
e534ed8
renames package package => packages
johentsch Sep 17, 2023
3020ad8
renames package resource => resources
johentsch Sep 17, 2023
12db382
adapts MWE notebook imports
johentsch Sep 17, 2023
7e9b23e
code cells with ipython3
johentsch Sep 17, 2023
e8d2dd0
updates status after creating resource from dataframe
johentsch Sep 17, 2023
dcfb6c5
small bugfix important for Resource._get_current_status()
johentsch Sep 17, 2023
21fa7bb
docstrings
johentsch Sep 17, 2023
2985ba9
updates unittest_metarepo commit and adapts filepaths
johentsch Sep 17, 2023
17e11ab
elaborates on resources.utils.infer_schema_from_df() and uses it in D…
johentsch Sep 18, 2023
1c6a4a3
adds new 'context_columns' setting
johentsch Sep 18, 2023
7f46539
adds mechanism that, fundamentally, initializes a Feature as a subset…
johentsch Sep 18, 2023
14f1d81
adds options_class property to DimcatConfig
johentsch Sep 26, 2023
253356e
pulls ClassVar extractable_features up to DimcatResource; adds ClassVar …
johentsch Sep 26, 2023
7806adc
enables FeatureExtractor to work on a single DimcatResource
johentsch Sep 26, 2023
2a2a7d7
corrects circular import
johentsch Sep 26, 2023
1c8dcdc
pulls the mechanism __repr__()/__str__() -> info() -> summary_dict() …
johentsch Sep 26, 2023
d55f1c0
accessing ClassVar DimcatResource._extractable_features via property
johentsch Sep 26, 2023
334842d
ignores exceptions when sending an extracted feature through the Data…
johentsch Sep 26, 2023
56be4b6
adds debugging.py
johentsch Sep 26, 2023
d89f4ea
enables loading boolean columns even if they come as floats
johentsch Oct 2, 2023
4372db3
adds Measure feature
johentsch Oct 2, 2023
98e11f5
uses ms3 string parsers when loading columns that come with the corre…
johentsch Oct 2, 2023
259492e
relaxes requirements
johentsch Oct 2, 2023
762386a
better conversion to bool when loading tables
johentsch Oct 3, 2023
afb854c
adds .krn to music21-parseable formats
johentsch Oct 12, 2023
6624b79
adds Feature._transform_resource_df() and lets HarmonyLabels compute …
johentsch Oct 12, 2023
3960f74
adds rudimentary ModeGrouper
johentsch Oct 12, 2023
fa5507f
updates MWE datapackage
johentsch Oct 12, 2023
67546eb
prevents FeatureExtractor from copying Features that have been freshl…
johentsch Oct 12, 2023
254636f
cleans up PipelineStep._pre_process_resource() by moving ensure_level…
johentsch Oct 12, 2023
5d62725
properly disentangles DimcatResource.get_default_groupby() (which wro…
johentsch Oct 12, 2023
3e419a8
appends Feature name to resource_name in Package.extract_feature()
johentsch Oct 13, 2023
29194fb
factors out computation of (renamed) globalkey_mode and localkey_mode
johentsch Oct 13, 2023
c9a535f
adds resources.utils.make_adjacency_groups() and its helper make_adja…
johentsch Oct 13, 2023
fe8dbc4
skips FeatureExtractors when Dataset sends extracted Feature through …
johentsch Oct 13, 2023
3f0005b
Feature.df calls self._make_feature_df() only for freshly loaded reso…
johentsch Oct 13, 2023
63d49a9
correctly condenses KeyAnnotations (currently it is assumed they are …
johentsch Oct 13, 2023
402ae9e
has Groupers skip resources that cannot be grouped and warn about them
johentsch Oct 13, 2023
6395d54
adds CorpusGrouper and has DimcatResource.update_default_groupby() de…
johentsch Oct 13, 2023
ebe02b4
adds DimcatCatalog.get_resource methods parallelling those of Package
johentsch Oct 13, 2023
486ac25
allows resource.utils.condense_dataframe_by_groups() to drop rows wit…
johentsch Oct 13, 2023
dbd9d0c
make deserialization functions available for import
johentsch Oct 14, 2023
c95b99c
replaces typing_extensions.Self with Dataset
johentsch Oct 23, 2023
79feb73
updates dcml_corpora datapackage with ms3 2.4.0
johentsch Oct 23, 2023
428d514
more consistent package names when using MuseScoreLoader.from_ms3()
johentsch Oct 23, 2023
c6d4dcb
adds PipelineStep.uml draft diagram
johentsch Oct 23, 2023
356702f
introduces the high-level facet types Events, Controls, Annotations, …
johentsch Nov 1, 2023
d10d357
adds dimcat.base.is_instance_of()
johentsch Nov 1, 2023
a2e7451
allows a second argument for FeatureUnavailableError
johentsch Nov 1, 2023
5c832be
moves feature extraction from Package to DimcatResource, allowing for…
johentsch Nov 1, 2023
1de3a97
factors out make_bar_plot() from resources/results.py to dimcat.plott…
johentsch Nov 1, 2023
fa85bb3
adds ClassVar DimcatResource.default_value_column and property value_…
johentsch Nov 1, 2023
e77f00e
moves notebooks.utils.plot_pitch_class_distribution() and tpc_bubbles…
johentsch Nov 1, 2023
31025f2
adapts Durations.make_bubble_plot()
johentsch Nov 1, 2023
9ca3b94
PitchClassVectors analyzer is a special case of the Proportions analyzer
johentsch Nov 1, 2023
38cf3f2
makes plot_fifths_distribution() (previously fifths_bar_plot()) a spe…
johentsch Nov 1, 2023
5b0eec5
adds TypeAlias StepSpecs for 'steps' argument and function step_sepcs…
johentsch Nov 1, 2023
cbd9744
adds methods apply_steps(), and plot() to DimcatResource, the latter …
johentsch Nov 1, 2023
ea967c9
factors out make_plot_settings() and update_figure_layout() from make…
johentsch Nov 1, 2023
189c753
refactors make_tpc_bubble_plot() (previously tpc_bubbles()) as a spec…
johentsch Nov 1, 2023
25551b2
new Result type PitchClassDuration as special case of Duration create…
johentsch Nov 1, 2023
b2eb5f8
resolve basepath derived from path argument
johentsch Nov 2, 2023
b711d74
adds 'piece' index level when loading an individual resource
johentsch Nov 2, 2023
ae1e109
analyzers pass on their default_groupby and value_column
johentsch Nov 2, 2023
b1444fd
systematically lets .plot() return a bubbles plot and .plot_grouped()…
johentsch Nov 2, 2023
38e26ca
equalizes docstrings for 'level_names' argument
johentsch Nov 2, 2023
6844921
renames make_tpc_bubble_plot() => make_lof_bubble_plot() and plot_fif…
johentsch Nov 2, 2023
202e675
moves get_middle_composition_year() to dimcat.utils
johentsch Nov 2, 2023
86e7f30
in DimcatResource.from_dataframe(), the default_groupby can be set on…
johentsch Nov 2, 2023
51850b4
adds YearGrouper() and allows CustomPieceGrouper() to be initialized …
johentsch Nov 2, 2023
a01e6ef
allows importing the modules directly "from dimcat" (rather than havi…
johentsch Nov 2, 2023
38f4d27
when Grouper has been applied, Result.plot() results in bubble plot, …
johentsch Nov 2, 2023
c365af5
moves write_image() to dimcat.plotting
johentsch Nov 2, 2023
217bbfd
sets default figure font size to 20
johentsch Nov 3, 2023
7ef34c8
reorganizes grouper modules, adds general ColumnGrouper, makes ModeGr…
johentsch Nov 3, 2023
ff438d5
improves plotting to the point that the line_of_fifths.md notebook wo…
johentsch Nov 3, 2023
1a57c20
returns metadata facet, not bare dataframe
johentsch Nov 4, 2023
1e0bc1c
for Features, default_value_column defaults to _feature_columns[0] if…
johentsch Nov 4, 2023
a0f5a4e
adds Result.combine() and beginning of HarmonyLabelsFormat
johentsch Nov 5, 2023
394d177
KeyAnnotations are also displayed with the corresponding mode
johentsch Nov 5, 2023
74c2410
omits superfluous localkey_resolved_mode column because it's identica…
johentsch Nov 5, 2023
f9e4ce0
KeyAnnotations are also displayed with the corresponding mode
johentsch Nov 5, 2023
1e5adb7
adds NgramAnalyzer and BigramAnalyzer
johentsch Nov 5, 2023
09127a0
adds result getters to AnalyzedDataset, analogous to resource getters…
johentsch Nov 5, 2023
6c3d7b1
makes original resource available to PipelineStep._post_process_result()
johentsch Nov 5, 2023
ac3ede0
adds NgramTable.make_ngram_tuples()
johentsch Nov 5, 2023
06a7eda
more precise NgramTableFormat
johentsch Nov 5, 2023
5eb4b64
adapts column schema after extending feature df
johentsch Nov 5, 2023
d592330
moves transition_matrix() to dimcat.utils and make_transition_heatmap…
johentsch Nov 6, 2023
b96166a
adds make_bigram_tuples(), get_transitions(), and plot_grouped() for …
johentsch Nov 6, 2023
3ea4ee6
updates dependencies
johentsch Nov 6, 2023
af1bc9c
renames DEGREE => SCALE_DEGREE
johentsch Nov 6, 2023
d7c534d
pulls Grouper._iter_resources() up
johentsch Nov 6, 2023
3e2c68e
DimcatResource gets the methods get_time_spans(), get_slice_intervals…
johentsch Nov 6, 2023
69bce05
introduces the base Slicer
johentsch Nov 6, 2023
7d416de
more robust creation and loading of feature dfs
johentsch Nov 7, 2023
effb304
factors out ResourceTransformation and makes Grouper and Slicer a sub…
johentsch Nov 7, 2023
513fc58
adds Dataset.load() and get_feature() without argument (to get the la…
johentsch Nov 7, 2023
e0451ad
more consistent auxiliary columns
johentsch Nov 7, 2023
cfa64a8
enables initializing DimcatConfig with just a string as positional ar…
johentsch Nov 7, 2023
d4cfd6f
adds AdjacencySlicer and KeySlicer
johentsch Nov 7, 2023
5fa0ff2
Merge pull request #2 from johentsch/plotting
johentsch Nov 7, 2023
2 changes: 1 addition & 1 deletion .isort.cfg
@@ -1,3 +1,3 @@
[settings]
-known_third_party = _pytest,dimcat,frictionless,git,importlib_metadata,marshmallow,ms3,pandas,plotly,pytest,setuptools,typing_extensions
+known_third_party = _pytest,dimcat,frictionless,git,importlib_metadata,marshmallow,ms3,music21,pandas,plotly,pytest,setuptools,tqdm,typing_extensions
profile = black
290 changes: 136 additions & 154 deletions CONTRIBUTING.rst

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion setup.cfg
Expand Up @@ -51,7 +51,8 @@ install_requires =
frictionless[zenodo,pandas]==5.13.1
importlib-metadata~=6.0.0
marshmallow==3.19.0
-ms3>=1.1.1
+ms3 @ git+https://github.com/johentsch/ms3.git@schema
+music21==9.1.0
plotly==5.13.0
seaborn~=0.12.2
setuptools~=65.6.3
Expand Down
6 changes: 4 additions & 2 deletions src/dimcat/base.py
Expand Up @@ -614,6 +614,7 @@ def deserialize_json_file(json_file) -> DimcatObject:
class DimcatSettings(DimcatObject):
"""Settings for the dimcat library."""

+default_basepath: str = "~/dimcat_data"
never_store_unvalidated_data: bool = True
"""setting this to False allows for skipping mandatory validations; set to True for production"""
recognized_piece_columns: List[str] = dataclass_field(
Expand All @@ -622,6 +623,7 @@ class DimcatSettings(DimcatObject):
"""column names that are recognized as piece identifiers and automatically renamed to 'piece' when needed"""

class Schema(DimcatObject.Schema):
+default_basepath = mm.fields.String(required=True)
never_store_unvalidated_data = mm.fields.Boolean(required=True)
recognized_piece_columns = mm.fields.List(mm.fields.String(), required=True)

Expand Down Expand Up @@ -671,14 +673,14 @@ def make_settings_from_config_file(config_filepath: str) -> DimcatConfig:
try:
config = parse_config_file(config_filepath)
except FileNotFoundError:
-logger.warning(
+logger.error(
f"Config file '{config_filepath}' not found. Falling back to default settings."
)
return make_default_settings()
try:
return make_settings_from_config_parser(config)
except Exception as e:
-logger.warning(
+logger.error(
f"Error while parsing config file '{config_filepath}': {e}. Falling back to default settings."
)
return make_default_settings()
Expand Down
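The fallback behaviour in this hunk (log an error, then return default settings) can be illustrated with a minimal, self-contained sketch. It does not use dimcat: `load_settings` and `DEFAULT_SETTINGS` are hypothetical names, and the section and key names are assumed from the `settings.ini` excerpt shown in this PR.

```python
import configparser
import logging

logger = logging.getLogger(__name__)

# Hypothetical defaults, mirroring the class defaults visible in the diff.
DEFAULT_SETTINGS = {
    "default_basepath": "~/dimcat_data",
    "never_store_unvalidated_data": True,
}


def load_settings(config_filepath: str) -> dict:
    """Return settings read from an INI file, or a copy of the defaults on failure."""
    try:
        # open() raises FileNotFoundError, whereas ConfigParser.read()
        # would silently skip a missing file.
        with open(config_filepath, encoding="utf-8") as handle:
            text = handle.read()
    except FileNotFoundError:
        logger.error(
            f"Config file '{config_filepath}' not found. Falling back to default settings."
        )
        return dict(DEFAULT_SETTINGS)
    try:
        parser = configparser.ConfigParser()
        parser.read_string(text)
        return {
            "default_basepath": parser.get("DEFAULT", "default_basepath"),
            "never_store_unvalidated_data": parser.getboolean(
                "EXCEPTIONS", "never_store_unvalidated_data"
            ),
        }
    except Exception as e:
        logger.error(
            f"Error while parsing config file '{config_filepath}': {e}. "
            f"Falling back to default settings."
        )
        return dict(DEFAULT_SETTINGS)
```

Logging at error level rather than warning level makes the silent fallback easier to notice, which appears to be the intent of the change above.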
2 changes: 2 additions & 0 deletions src/dimcat/data/__init__.py
Expand Up @@ -30,6 +30,7 @@
Metadata,
Notes,
PieceIndex,
+ResourceSpecs,
ResourceStatus,
Result,
ResultName,
Expand All @@ -51,6 +52,7 @@
resolve_columns_argument,
resolve_levels_argument,
resolve_recognized_piece_columns_argument,
+resource_specs2resource,
)

logger = logging.getLogger(__name__)
38 changes: 26 additions & 12 deletions src/dimcat/data/dataset/base.py
Expand Up @@ -33,7 +33,6 @@

import frictionless as fl
import marshmallow as mm
-import ms3
from dimcat.base import DimcatConfig, DimcatObjectField, FriendlyEnum, get_class
from dimcat.data.base import Data
from dimcat.data.resources.base import D, DimcatResource, ResourceStatus, SomeDataframe
Expand All @@ -53,12 +52,12 @@
PackageNotFoundError,
ResourceNotFoundError,
)
-from dimcat.utils import check_file_path, check_name, get_default_basepath
+from dimcat.utils import check_file_path, check_name, get_default_basepath, resolve_path
from typing_extensions import Self

if TYPE_CHECKING:
from dimcat.data.resources.results import Result
-from dimcat.steps.base import PipelineStep
+from dimcat.steps.base import FeatureStep
from dimcat.steps.pipelines import Pipeline

logger = logging.getLogger(__name__)
Expand Down Expand Up @@ -196,7 +195,7 @@ def __init__(
if package_name is not None:
self.package_name = package_name
if basepath is not None:
-basepath = ms3.resolve_dir(basepath)
+basepath = resolve_path(basepath)
self.basepath = basepath
if descriptor_filepath is not None:
self.descriptor_filepath = descriptor_filepath
Expand Down Expand Up @@ -247,7 +246,7 @@ def basepath(self) -> str:

@basepath.setter
def basepath(self, basepath: str) -> None:
-basepath = ms3.resolve_dir(basepath)
+basepath = resolve_path(basepath)
if self.status > PackageStatus.NOT_SERIALIZED:
if basepath == self.basepath:
return
Expand Down Expand Up @@ -812,6 +811,9 @@ def __getitem__(self, item: str) -> DimcatPackage:
def __iter__(self) -> Iterator[DimcatPackage]:
yield from self._packages

+def __len__(self) -> int:
+return len(self._packages)
+
def __repr__(self):
return pformat(self.summary_dict(), sort_dicts=False)

Expand Down Expand Up @@ -971,7 +973,7 @@ def set_basepath(
set_packages: bool = True,
) -> None:
"""Sets the basepath for all packages in the catalog (if set_packages=True)."""
-basepath_arg = ms3.resolve_dir(basepath)
+basepath_arg = resolve_path(basepath)
if not os.path.isdir(basepath_arg):
raise ValueError(f"basepath {basepath_arg!r} is not an existing directory.")
self._basepath = basepath_arg
Expand Down Expand Up @@ -1047,6 +1049,8 @@ def from_catalogs(
new_dataset = cls(**kwargs)
if pipeline is not None:
new_dataset._pipeline = pipeline
+new_dataset.inputs.basepath = inputs.basepath
+new_dataset.outputs.basepath = outputs.basepath
new_dataset.inputs.extend(inputs)
new_dataset.outputs.extend(outputs)
return new_dataset
Expand Down Expand Up @@ -1081,18 +1085,28 @@ def init_object(self, data, **kwargs) -> Dataset:

def __init__(
self,
+basepath: Optional[str] = None,
**kwargs,
):
"""The central type of object that all :obj:`PipelineSteps <.PipelineStep>` process and return a copy of.

Args:
**kwargs: Dataset is cooperative and calls super().__init__(data=dataset, **kwargs)
"""
-self._inputs = InputsCatalog()
-self._outputs = OutputsCatalog()
+if basepath is None:
+self._inputs = InputsCatalog()
+self._outputs = OutputsCatalog()
+else:
+basepath_arg = resolve_path(basepath)
+if not os.path.isdir(basepath_arg):
+raise NotADirectoryError(
+f"basepath {basepath_arg!r} is not an existing directory."
+)
+self._inputs = InputsCatalog(basepath=basepath_arg)
+self._outputs = OutputsCatalog(basepath=basepath_arg)
self._pipeline = None
self.reset_pipeline()
-super().__init__(**kwargs)
+super().__init__(**kwargs) # calls the Mixin's __init__

def __repr__(self):
return self.info(return_str=True)
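The basepath handling added to `Dataset.__init__()` follows a resolve-then-validate pattern. A minimal sketch of that pattern, without dimcat, might look as follows. The body of `dimcat.utils.resolve_path()` is not part of this diff, so `resolve_path_sketch` is a hypothetical stand-in that merely expands `~` and makes the path absolute; `checked_basepath` is likewise an illustrative name.

```python
import os
from typing import Optional


def resolve_path_sketch(path: str) -> str:
    """Hypothetical stand-in for dimcat.utils.resolve_path() (assumed behaviour)."""
    return os.path.abspath(os.path.expanduser(path))


def checked_basepath(basepath: Optional[str]) -> Optional[str]:
    """Resolve a basepath and make sure it points to an existing directory."""
    if basepath is None:
        # No basepath given: the caller creates its catalogs without one.
        return None
    resolved = resolve_path_sketch(basepath)
    if not os.path.isdir(resolved):
        raise NotADirectoryError(
            f"basepath {resolved!r} is not an existing directory."
        )
    return resolved
```

Validating once in the constructor means both catalogs receive the same, already resolved directory.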
Expand Down Expand Up @@ -1160,7 +1174,7 @@ def add_output(

def apply(
self,
-step: PipelineStep,
+step: FeatureStep,
) -> Self:
"""Applies a pipeline step to the features it is configured for or, if None, to all active features."""
return step.process_dataset(self)
Expand Down Expand Up @@ -1279,7 +1293,7 @@ def get_metadata(self) -> SomeDataframe:

def load_package(
self,
-package: Union[fl.Package, str],
+package: PackageSpecs,
package_name: Optional[str] = None,
**options,
):
Expand Down Expand Up @@ -1309,7 +1323,7 @@ def load_package(
f"with basepath {self.inputs.basepath}."
)

-def load_feature(self, feature: Union[FeatureName, str, DimcatConfig]) -> Feature:
+def load_feature(self, feature: FeatureSpecs) -> Feature:
"""ToDo: Harmonize with FeatureExtractor"""
feature = self.get_feature(feature)
feature.load()
Expand Down
2 changes: 2 additions & 0 deletions src/dimcat/data/resources/__init__.py
Expand Up @@ -5,8 +5,10 @@
DimcatResource,
IndexField,
PieceIndex,
+ResourceSpecs,
ResourceStatus,
get_pickle_schema,
+resource_specs2resource,
)
from .features import (
Annotations,
Expand Down
37 changes: 32 additions & 5 deletions src/dimcat/data/resources/base.py
Expand Up @@ -5,6 +5,7 @@
import zipfile
from enum import IntEnum, auto
from functools import cache
+from pathlib import Path
from pprint import pformat
from typing import (
Dict,
Expand All @@ -23,7 +24,13 @@
import pandas as pd
from dimcat.base import DimcatConfig, get_class, get_setting
from dimcat.data.base import Data
-from dimcat.utils import check_file_path, check_name, get_default_basepath, replace_ext
+from dimcat.utils import (
+check_file_path,
+check_name,
+get_default_basepath,
+replace_ext,
+resolve_path,
+)
from frictionless import FrictionlessException
from marshmallow import fields, post_load
from typing_extensions import Self
Expand Down Expand Up @@ -675,10 +682,10 @@ def __init__(
self.default_groupby = default_groupby

if basepath is not None:
-basepath = ms3.resolve_dir(basepath)
+basepath = resolve_path(basepath)

if resource is not None:
-if isinstance(resource, str):
+if isinstance(resource, (str, Path)):
descriptor_path = check_file_path(
resource, extensions=("resource.json", "resource.yaml")
)
Expand Down Expand Up @@ -760,7 +767,7 @@ def basepath(self) -> str:

@basepath.setter
def basepath(self, basepath: str):
-basepath = ms3.resolve_dir(basepath)
+basepath = resolve_path(basepath)
if self.is_frozen:
if basepath == self.basepath:
return
Expand Down Expand Up @@ -1066,7 +1073,7 @@ def _get_current_status(self) -> ResourceStatus:
return ResourceStatus.EMPTY

@cache
-def get_dataframe(self) -> Union[D]:
+def get_dataframe(self) -> D:
"""
Load the dataframe from disk based on the descriptor's normpath.

Expand Down Expand Up @@ -1297,6 +1304,26 @@ def validate(

# endregion DimcatResource

+ResourceSpecs: TypeAlias = Union[DimcatResource, str, Path]
+
+
+def resource_specs2resource(resource: ResourceSpecs) -> DimcatResource:
+"""Converts a resource specification to a resource.
+
+Args:
+resource: A resource specification.
+
+Returns:
+A resource.
+"""
+if isinstance(resource, DimcatResource):
+return resource
+if isinstance(resource, (str, Path)):
+return DimcatResource(resource)
+raise TypeError(
+f"Expected a DimcatResource, str, or Path. Got {type(resource).__name__!r}."
+)
+

@cache
def get_pickle_schema(name, init=True):
Expand Down
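The new `ResourceSpecs` alias and `resource_specs2resource()` implement a "specs to object" normalization: callers may pass either a ready-made resource or a path, and get a resource back. The sketch below reproduces the pattern in isolation. `StandInResource` is a hypothetical placeholder for `DimcatResource` (which needs the dimcat package and a real descriptor file), and the other names are illustrative only.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Union


@dataclass
class StandInResource:
    """Hypothetical placeholder for DimcatResource; it only remembers a path."""

    descriptor_path: Path


# Mirrors the shape of the ResourceSpecs alias in the diff.
ResourceSpecsSketch = Union[StandInResource, str, Path]


def specs_to_resource(resource: ResourceSpecsSketch) -> StandInResource:
    """Pass resources through unchanged; wrap str/Path values in a resource."""
    if isinstance(resource, StandInResource):
        return resource
    if isinstance(resource, (str, Path)):
        return StandInResource(Path(resource))
    raise TypeError(
        f"Expected a StandInResource, str, or Path. Got {type(resource).__name__!r}."
    )
```

With such a helper, a function can declare a single `ResourceSpecs`-typed parameter and call the converter first, instead of repeating the `isinstance` checks in every method.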
4 changes: 2 additions & 2 deletions src/dimcat/data/resources/features.py
Expand Up @@ -8,7 +8,7 @@
import marshmallow as mm
from dimcat.base import DimcatConfig, ObjectEnum, is_subclass_of
from dimcat.data.resources.base import DimcatResource
-from dimcat.exceptions import FeatureNotProcessableError
+from dimcat.exceptions import ResourceNotProcessableError

logger = logging.getLogger(__name__)

Expand Down Expand Up @@ -148,5 +148,5 @@ def features_argument2config_list(
allowed_features = [FeatureName(f) for f in allowed_features]
for config in configs:
if config.options_dtype not in allowed_features:
-raise FeatureNotProcessableError(config.options_dtype)
+raise ResourceNotProcessableError(config.options_dtype)
return configs
4 changes: 2 additions & 2 deletions src/dimcat/data/resources/utils.py
Expand Up @@ -4,7 +4,7 @@
import os
from collections import Counter
from operator import itemgetter
-from typing import TYPE_CHECKING, Dict, Iterable, List, Optional, Set, Tuple, Union
+from typing import TYPE_CHECKING, Dict, Iterable, List, Optional, Set, Tuple
from zipfile import ZipFile

import frictionless as fl
Expand Down Expand Up @@ -189,7 +189,7 @@ def infer_schema_from_df(df: SomeDataframe) -> fl.Schema:
def load_fl_resource(
fl_resource: fl.Resource,
index_col: Optional[int | str | List[int | str]] = None,
-usecols: Optional[Union[int, str, List[int | str]]] = None,
+usecols: Optional[int | str | List[int | str]] = None,
) -> SomeDataframe:
"""Load a dataframe from a :obj:`frictionless.Resource`.

Expand Down
35 changes: 34 additions & 1 deletion src/dimcat/exceptions.py
Expand Up @@ -41,6 +41,16 @@ class BasePathNotDefinedError(DimcatError):
}


+class DuplicateIDError(DimcatError):
+"""optional args: (id, facet)"""
+
+nargs2message = {
+0: "An ID was already in use.",
+1: lambda id: f"The ID {id!r} is already in use.",
+2: lambda id, facet: f"The ID {id!r} is already in use for facet {facet!r}.",
+}
+
+
class EmptyCatalogError(DimcatError):
nargs2message = {
0: "The catalog is empty.",
Expand Down Expand Up @@ -74,7 +84,30 @@ class EmptyResourceError(DimcatError):
}


-class FeatureNotProcessableError(DimcatError):
+class ExcludedFileExtensionError(DimcatError):
+"""optional args: (extension, permissible_extensions)"""
+
+nargs2message = {
+0: "A file extension is excluded.",
+1: lambda extension: f"File extension {extension!r} is excluded.",
+2: lambda extension, permissible_extensions: f"File extension {extension!r} is excluded. "
+f"Pass one of {permissible_extensions!r}.",
+}
+
+
+class NoMuseScoreExecutableSpecifiedError(DimcatError):
+nargs2message = {
+0: "No MuseScore executable specified.",
+}
+
+
+class NoPathsSpecifiedError(DimcatError):
+nargs2message = {
+0: "No paths have been specified.",
+}
+
+
+class ResourceNotProcessableError(DimcatError):
"""optional args: (feature_name,)"""

nargs2message = {
Expand Down
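The new exception classes only declare an `nargs2message` mapping; how `DimcatError` consumes it is not visible in this diff. One plausible reading is that the message is chosen by the number of arguments the exception was raised with. The sketch below implements that assumption with a hypothetical base class `MessageByArgCountError`; it is not the actual `DimcatError`.

```python
class MessageByArgCountError(Exception):
    """Hypothetical base class; the real DimcatError is not shown in this diff."""

    # Maps the number of constructor arguments to a string or a callable.
    nargs2message = {0: "An error occurred."}

    def __str__(self) -> str:
        template = self.nargs2message.get(len(self.args))
        if template is None:
            # No entry for this number of arguments: default Exception text.
            return super().__str__()
        if callable(template):
            return template(*self.args)
        return template


class DuplicateIDErrorSketch(MessageByArgCountError):
    """optional args: (id, facet)"""

    nargs2message = {
        0: "An ID was already in use.",
        1: lambda id: f"The ID {id!r} is already in use.",
        2: lambda id, facet: f"The ID {id!r} is already in use for facet {facet!r}.",
    }
```

Under this assumption, `raise DuplicateIDErrorSketch("K279", "notes")` would produce the two-argument message, while raising without arguments would produce the generic one.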
3 changes: 3 additions & 0 deletions src/dimcat/settings.ini
@@ -1,3 +1,6 @@
+[DEFAULT]
+default_basepath = ~/dimcat_data
+
[EXCEPTIONS]
# setting this to False allows for skipping mandatory validations; set to True for production
never_store_unvalidated_data = False
Expand Down
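Placing `default_basepath` under `[DEFAULT]` has a practical consequence when the file is read with Python's standard `configparser`: values in the default section are visible from every other section. The sketch below parses only the excerpt shown above (the remainder of `settings.ini` is not part of this diff) and assumes the file is read with `configparser`, which the diff itself does not confirm.

```python
import configparser

# Only the lines of settings.ini visible in this diff.
SETTINGS_INI = """\
[DEFAULT]
default_basepath = ~/dimcat_data

[EXCEPTIONS]
# setting this to False allows for skipping mandatory validations; set to True for production
never_store_unvalidated_data = False
"""

parser = configparser.ConfigParser()
parser.read_string(SETTINGS_INI)

# Direct access to the default section.
basepath = parser["DEFAULT"]["default_basepath"]
# The same key is also reachable through any other section.
inherited = parser["EXCEPTIONS"]["default_basepath"]
# getboolean() converts "False" into a real bool.
skip_flag = parser.getboolean("EXCEPTIONS", "never_store_unvalidated_data")
```

Code that iterates over the keys of `[EXCEPTIONS]` will therefore also see `default_basepath`, which may need filtering.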
3 changes: 2 additions & 1 deletion src/dimcat/steps/__init__.py
Expand Up @@ -9,9 +9,10 @@
PitchClassVectors,
UnitOfAnalysis,
)
-from .base import PipelineStep
+from .base import FeatureStep
from .extractors import FeatureExtractor
from .groupers import CustomPieceGrouper, Grouper
+from .loaders import MuseScoreLoader
from .pipelines import Pipeline

logger = logging.getLogger(__name__)