New loaders #47

Merged
merged 216 commits into from
Nov 7, 2023
c847282
privatizes many methods of PipelineStep and allows .process_resource(…
johentsch Jun 23, 2023
06f5443
pulls apart FeatureStep(PipelineStep), which all PipelineSteps curren…
johentsch Jun 23, 2023
84e8ad9
renames PipelineStep._dispatch() => _make_new_resource()
johentsch Jun 23, 2023
f0a7b37
early version of the MuseScoreLoader
johentsch Jun 25, 2023
3cb34a1
makes loaders a package; separates ScoreLoader from Loader; adds skel…
johentsch Jun 26, 2023
babfc22
first naive version performing a raw music21 parse (events shadow str…
johentsch Jun 26, 2023
f723d8e
advances Music21Loader to parse a single resource
johentsch Jun 26, 2023
c1a850f
enables Music21Loader to recursively scan directory; computation of I…
johentsch Jun 26, 2023
1e955bf
adds basepath argument to Dataset.__init__()
johentsch Jun 26, 2023
28da592
adds default_output_dir = "~/dimcat_data" to settings.ini
johentsch Jun 26, 2023
94d1956
enables ScoreLoader.process(Dataset)
johentsch Jun 27, 2023
51653ef
moves PathFactory to loaders.utils
johentsch Jun 27, 2023
4e7c845
moves class variables and paths property up
johentsch Jun 27, 2023
d695194
pulls up Loader._process_dataset() harmonizing it for all loaders thr…
johentsch Jun 27, 2023
54b911f
implements PackageLoader (closes #40)
johentsch Jun 27, 2023
c0c69f9
updates contributors' guide
johentsch Jun 27, 2023
e45decb
corrects dependency syntax
johentsch Jun 27, 2023
f6cf868
renames default_output_dir => default_basepath and makes tests pass
johentsch Jun 27, 2023
f769d0c
drop empty columns when creating dataframe from m21
johentsch Jul 2, 2023
9238890
cleans up outputs and replaces print() with logger.info()
johentsch Jul 2, 2023
b81f26c
renames subdirs to return_tuples
johentsch Jul 2, 2023
9f65bcf
log message
johentsch Jul 2, 2023
ad39517
renames FeatureStep => FeatureProcessingStep and adds some docs
johentsch Jul 2, 2023
567bc8a
better explanation of the FeatureProcessingStep.is_transformation pro…
johentsch Jul 2, 2023
a8683c6
makes Pipeline a subclass of PipelineStp (not FeatureProcessingStep)
johentsch Jul 2, 2023
33dfed5
adds Loader test case to test_base.py; + a few cosmetics
johentsch Jul 2, 2023
763a303
makes utils.resolve_path() typesafe
johentsch Jul 2, 2023
f21d12d
improves PackageLoader by not requiring a package_name and discoverin…
johentsch Jul 2, 2023
1d9300a
implements Dataset.from_loader() (closes #42 :tada:)
johentsch Jul 2, 2023
33908c6
implements Loader.create_dataset() (closes #41)
johentsch Jul 2, 2023
e4d22aa
adds DimcatPackage.get_piece_index()
johentsch Jul 2, 2023
a67fe1e
adds DimcatPackage.get_boolean_resource_table() and tests that its in…
johentsch Jul 2, 2023
8089c75
makes DimcatResource a subclass of the new Resource base class
johentsch Jul 2, 2023
e411b31
adds a new Resource superclass and pulls basepath attribute up to the…
johentsch Jul 2, 2023
ff163ee
introduces new setting "default_resource_name"
johentsch Jul 3, 2023
cd0af40
cleans up and facilitates DimcatResource.__init__() and adapts Data o…
johentsch Jul 3, 2023
e05da81
Merge branch 'development' into loader
johentsch Jul 3, 2023
f437c24
Merge branch 'development' into loader
johentsch Jul 4, 2023
0760ad5
moves DimcatResource.Schema.init_object() up to Resource.Schema
johentsch Jul 4, 2023
0cd83d8
adds base test case for Resource
johentsch Jul 4, 2023
6b84707
moves DimcatPackage unittests to test_package.py
johentsch Jul 4, 2023
0edfb57
moves get_score_paths() to conftest
johentsch Jul 4, 2023
7ba82f4
generalizes get_score_paths(), specifies get_m21_score_paths()
johentsch Jul 4, 2023
5175df4
updates TestBaseResource to run on multiple, mixed score paths
johentsch Jul 4, 2023
d584964
adds DimcatPackage.from_resources() and .from_filepaths(), and allows…
johentsch Jul 4, 2023
9789f21
introduces global constant TEST_N_SCORES to allow for speeding up tes…
johentsch Jul 4, 2023
90ffbcd
renames DimcatPackage.make_new_resource() to .add_new_dimcat_resource…
johentsch Jul 4, 2023
9dd3b03
renames Loader.add_piece_facet() => .add_piece_facet_dataframe() to a…
johentsch Jul 4, 2023
615fbd7
factors out dataset.base.PackageSchema in order to introduce the dist…
johentsch Jul 5, 2023
1a505bf
pulls up Data.to_dict(pickle=False) and Data.pickle_schema and Data.g…
johentsch Jul 5, 2023
5720cba
more robust handling of descriptor_filepath
johentsch Jul 5, 2023
99896c4
adds property ID to Resource objects and allows for storing a corpus_…
johentsch Jul 5, 2023
047ae68
updates resolve_dir(), stripping terminal separators so that the base…
johentsch Jul 5, 2023
4752f97
enables specifying resource_names and corpus_names (factories) when c…
johentsch Jul 5, 2023
404ce24
tidies up Resource API removing unmanageable side-effects. DimcatReso…
johentsch Jul 5, 2023
bde5bc0
tidies up DimcatResource API according to the previous clean-up of Re…
johentsch Jul 6, 2023
ffa5ee2
removes now superfluous methods _set_descriptor_path and _set_file_path
johentsch Jul 6, 2023
3694a85
Resource.from_descriptor() now deserializes as the correct subclass i…
johentsch Jul 7, 2023
c2b6f85
precicises all internal imports
johentsch Jul 7, 2023
bec25d4
creates Package superclass (WIP)
johentsch Jul 7, 2023
fa07c07
factors out catalog and package into their own Python packages
johentsch Jul 7, 2023
06b8bcf
overall uniform packaging structure with subpackages of data and step…
johentsch Jul 7, 2023
6e45a5c
adds dedicated PathResource; harmonizes ResourceStatus and how Resour…
johentsch Jul 8, 2023
a0b1670
docs on ResourceStatus
johentsch Jul 8, 2023
18491b7
first step towards Packages correctly handling resources including pa…
johentsch Jul 10, 2023
545e94e
updates project requirements
johentsch Jul 18, 2023
16dc2eb
replaces manual.rst with a subfolder containing the Jupytext notebook…
johentsch Jul 18, 2023
5eb74d0
commenting out modin[ray] for now
johentsch Jul 18, 2023
a3fb373
moves the Minimal Working Example 'mwe' from the top level into docs,…
johentsch Jul 18, 2023
47b22a8
updates unittest_metacorpus commit
johentsch Jul 18, 2023
dc8dfb0
overall uniform packaging structure with subpackages of data and step…
johentsch Jul 18, 2023
47ce991
DimcatPackage to store resources in ZIP archive by default
johentsch Jul 18, 2023
97ff9cb
enables 'from dimcat import Pipeline'
johentsch Jul 18, 2023
4e5cf09
loaders underway to being adapted
johentsch Jul 18, 2023
1e0a71a
manual/data work-in-progress
johentsch Jul 18, 2023
af01cee
adds Python package for slicer
johentsch Jul 19, 2023
55ae3c8
towards fixing the loader tests
johentsch Jul 19, 2023
943035f
fixes for test_base.py
johentsch Jul 19, 2023
22836bd
makes all tests in test_package.py pass
johentsch Jul 19, 2023
88dba35
exludes RECONCILE modes from test_package.py to prevent copying the r…
johentsch Jul 19, 2023
ae555ff
prevents PathPackage from storing its descriptor after adding a resource
johentsch Jul 19, 2023
0a1c984
adapts Catalog to use Package (instead of DimcatPackage exclusively)
johentsch Jul 19, 2023
d59bcfd
progress on the loaders
johentsch Jul 19, 2023
f592bcf
Merge branch 'loader' into slicer
johentsch Jul 19, 2023
28f515a
minor bug fixes
johentsch Jul 19, 2023
2637925
behaviour closer to how it should be makes more tests fail
johentsch Jul 19, 2023
abc7b1a
Dependency fixes
Elizafox Jul 22, 2023
ad73863
Merge
Elizafox Jul 22, 2023
9a22e76
Merge pull request #1 from Elizafox/loader
johentsch Jul 27, 2023
7e0c2a2
Revert "Merge"
johentsch Jul 27, 2023
ba7d2e7
enable dc.PackageLoader convenience
johentsch Sep 9, 2023
18c0c53
updates package requirements
johentsch Sep 9, 2023
7cb2dc2
Resource.from_descriptor() dispatches to DimcatResource for all fl.Pa…
johentsch Sep 9, 2023
63c7380
Resource.from_descriptor() dispatches to PathResource if fl.Package.t…
johentsch Sep 9, 2023
922c5b7
adds Dataset.from_package() constructor
johentsch Sep 9, 2023
b7dc50a
updates pre-commit hook versions
johentsch Sep 9, 2023
bf4869e
renames .get_resource() => get_resource_by_name(); maintains custom m…
johentsch Sep 9, 2023
c217591
adds and executes tox lint
johentsch Sep 9, 2023
e7745d6
introduces MuseScorePackage and MuseScoreFacet. Initializing from a p…
johentsch Sep 9, 2023
f0c364a
DimcatCatalog.summary_dict() displays resource types by default
johentsch Sep 9, 2023
d444f24
MuseScoreFacet dispatches to subclass based on resource name
johentsch Sep 9, 2023
f21d531
introduces Facet base class with 'extractable_features' class variable
johentsch Sep 9, 2023
3470963
adds properties and methods to Package, updates .extract_feature()
johentsch Sep 9, 2023
7d39eef
correct compilation of descriptor dict
johentsch Sep 17, 2023
55b3862
updates MWE datapackage with dcml_corpora@a2afd8b via ms3 v2.2.1
johentsch Sep 17, 2023
142664c
adds Package.get_resources_by_regex() and .get_resources_by_type(), a…
johentsch Sep 17, 2023
28b1778
copies HarmonyLabels.__init__() from DimcatResource
johentsch Sep 17, 2023
865e163
renames package analyzer => analyzers
johentsch Sep 17, 2023
06f642d
renames package extractor => extractors
johentsch Sep 17, 2023
7158017
renames package grouper => groupers
johentsch Sep 17, 2023
7f423fa
renames package pipeline => pipelines
johentsch Sep 17, 2023
79afa6d
renames package slicer => slicers
johentsch Sep 17, 2023
261ec7b
renames package catalog => catalogs
johentsch Sep 17, 2023
dd8d1a2
renames package dataset => datasets
johentsch Sep 17, 2023
e534ed8
renames package package => packages
johentsch Sep 17, 2023
3020ad8
renames package resource => resources
johentsch Sep 17, 2023
12db382
adapts MWE notebook imports
johentsch Sep 17, 2023
7e9b23e
code cells with ipython3
johentsch Sep 17, 2023
e8d2dd0
updates status after creating resource from dataframe
johentsch Sep 17, 2023
dcfb6c5
small bugfix important for Resource._get_current_status()
johentsch Sep 17, 2023
21fa7bb
docstrings
johentsch Sep 17, 2023
2985ba9
update unittest_metarepo commit and adapts filepaths
johentsch Sep 17, 2023
17e11ab
elaborates on resources.utils.infer_schema_from_df() and uses it in D…
johentsch Sep 18, 2023
1c6a4a3
adds new 'context_columns' setting
johentsch Sep 18, 2023
7f46539
adds mechanism that, fundamentally, initializes a Feature as a subset…
johentsch Sep 18, 2023
14f1d81
adds options_class property to DimcatConfig
johentsch Sep 26, 2023
253356e
pulls ClassVar extrable_features up to DimcatResource; adds ClassVar …
johentsch Sep 26, 2023
7806adc
enables FeatureExtractor to work on a single DimcatResource
johentsch Sep 26, 2023
2a2a7d7
corrects circular import
johentsch Sep 26, 2023
1c8dcdc
pulls the mechanism __repr__()/__str__() -> info() -> summary_dict() …
johentsch Sep 26, 2023
d55f1c0
accessing ClassVar DimcatResource._extractable_features via property
johentsch Sep 26, 2023
334842d
ignores exceptions when sending an extracted feature through the Data…
johentsch Sep 26, 2023
56be4b6
adds debugging.py
johentsch Sep 26, 2023
d89f4ea
enables loading boolean columns even if they come as floats
johentsch Oct 2, 2023
4372db3
adds Measure feature
johentsch Oct 2, 2023
98e11f5
uses ms3 string parsers when loading columns that come with the corre…
johentsch Oct 2, 2023
259492e
relaxes requirements
johentsch Oct 2, 2023
762386a
better conversion to bool when loading tables
johentsch Oct 3, 2023
afb854c
adds .krn to music21-parseable formats
johentsch Oct 12, 2023
6624b79
adds Feature._transform_resource_df() and lets HarmonyLabels compute …
johentsch Oct 12, 2023
3960f74
adds rudimentary ModeGrouper
johentsch Oct 12, 2023
fa5507f
updates MWE datapackage
johentsch Oct 12, 2023
67546eb
prevents FeatureExtractor from copying Features that have been freshl…
johentsch Oct 12, 2023
254636f
cleans up PipelineStep._pre_process_resource() by moving ensure_level…
johentsch Oct 12, 2023
5d62725
properly disentangles DimcatResource.get_default_groupby() (which wro…
johentsch Oct 12, 2023
3e419a8
appends Feature name to resource_name in Package.extract_feature()
johentsch Oct 13, 2023
29194fb
factors out computation of (renamed) globalkey_mode and localkey_mode
johentsch Oct 13, 2023
c9a535f
adds resources.utils.make_adjacency_groups() and its helper make_adja…
johentsch Oct 13, 2023
fe8dbc4
skips FeatureExtractors when Dataset sends extracted Feature through …
johentsch Oct 13, 2023
3f0005b
Feature.df calls self._make_feature_df() only for freshly loaded reso…
johentsch Oct 13, 2023
63d49a9
correctly condenses KeyAnnotations (currently it is assumed they are …
johentsch Oct 13, 2023
402ae9e
has Groupers skip resources that cannot be grouped and warn about them
johentsch Oct 13, 2023
6395d54
adds CorpusGrouper and has DimcatResource.update_default_groupby() de…
johentsch Oct 13, 2023
ebe02b4
adds DimcatCatalog.get_resource methods parallelling those of Package
johentsch Oct 13, 2023
486ac25
allows resource.utils.condense_dataframe_by_groups() to drop rows wit…
johentsch Oct 13, 2023
dbd9d0c
make deserialization functions available for import
johentsch Oct 14, 2023
c95b99c
replaces typing_extensions.Self with Dataset
johentsch Oct 23, 2023
79feb73
updates dcml_corpora datapackage with ms3 2.4.0
johentsch Oct 23, 2023
428d514
more consistent package names when using MuseScoreLoader.from_ms3()
johentsch Oct 23, 2023
c6d4dcb
adds PipelineStep.uml draft diagram
johentsch Oct 23, 2023
356702f
introduces the high-level facet types Events, Controls, Annotations, …
johentsch Nov 1, 2023
d10d357
adds dimcat.base.is_instance_of()
johentsch Nov 1, 2023
a2e7451
allows a second argument for FeatureUnavailableError
johentsch Nov 1, 2023
5c832be
moves feature extraction from Package to DimcatResource, allowing for…
johentsch Nov 1, 2023
1de3a97
factors out make_bar_plot() from resources/results.py to dimcat.plott…
johentsch Nov 1, 2023
fa85bb3
adds ClassVar DimcatResource.default_value_column and property value_…
johentsch Nov 1, 2023
e77f00e
moves notebooks.utils.plot_pitch_class_distribution() and tpc_bubbles…
johentsch Nov 1, 2023
31025f2
adapts Durations.make_bubble_plot()
johentsch Nov 1, 2023
9ca3b94
PitchClassVectors analyzer is a special case of the Proportions analyzer
johentsch Nov 1, 2023
38cf3f2
makes plot_fifths_distribution() (previously fifths_bar_plot()) a spe…
johentsch Nov 1, 2023
5b0eec5
adds TypeAlias StepSpecs for 'steps' argument and function step_sepcs…
johentsch Nov 1, 2023
cbd9744
adds methods apply_steps(), and plot() to DimcatResource, the latter …
johentsch Nov 1, 2023
ea967c9
factors out make_plot_settings() and update_figure_layout() from make…
johentsch Nov 1, 2023
189c753
refactors make_tpc_bubble_plot() (previously tpc_bubbles()) as a spec…
johentsch Nov 1, 2023
25551b2
new Result type PitchClassDuration as special case of Duration create…
johentsch Nov 1, 2023
b2eb5f8
resolve basepath derived from path argument
johentsch Nov 2, 2023
b711d74
adds 'piece' index level when loading an individual resource
johentsch Nov 2, 2023
ae1e109
analyzers pass on their default_groupby and value_column
johentsch Nov 2, 2023
b1444fd
systematically lets .plot() return a bubbles plot and .plot_grouped()…
johentsch Nov 2, 2023
38e26ca
euqalizes docstrings for 'level_names' argument
johentsch Nov 2, 2023
6844921
renames make_tpc_bubble_plot() => make_lof_bubble_plot() and plot_fif…
johentsch Nov 2, 2023
202e675
moves get_middle_composition_year() to dimcat.utils
johentsch Nov 2, 2023
86e7f30
in DimcatResource.from_dataframe(), the default_groupby can be set on…
johentsch Nov 2, 2023
51850b4
adds YearGrouper() and allows CustomPieceGrouper() to be initialized …
johentsch Nov 2, 2023
a01e6ef
allows importing the modules directly "from dimcat" (rather than havi…
johentsch Nov 2, 2023
38f4d27
when Grouper has been applied, Result.plot() results in bubble plot, …
johentsch Nov 2, 2023
c365af5
moves write_image() to dimcat.plotting
johentsch Nov 2, 2023
217bbfd
sets default figure font size to 20
johentsch Nov 3, 2023
7ef34c8
reorganizes grouper modules, adds general ColumnGrouper, makes ModeGr…
johentsch Nov 3, 2023
ff438d5
improves plotting to the point that the line_of_fifths.md notebook wo…
johentsch Nov 3, 2023
1a57c20
returns metadata facet, not bare dataframe
johentsch Nov 4, 2023
1e0bc1c
for Features, default_value_column defaults to _feature_columns[0] if…
johentsch Nov 4, 2023
a0f5a4e
adds Result.combine() and beginning of HarmonyLabelsFormat
johentsch Nov 5, 2023
394d177
KeyAnnotations are also displayed with the corresponding mode
johentsch Nov 5, 2023
74c2410
omits superfluous localkey_resolved_mode column because it's identica…
johentsch Nov 5, 2023
f9e4ce0
KeyAnnotations are also displayed with the corresponding mode
johentsch Nov 5, 2023
1e5adb7
adds NgramAnalyzer and BigramAnalyzer
johentsch Nov 5, 2023
09127a0
adds result getters to AnalyzedDataset, analogous to resource getters…
johentsch Nov 5, 2023
6c3d7b1
makes original resource available to PipelineStep._post_process_result()
johentsch Nov 5, 2023
ac3ede0
adds NgramTable.make_ngram_tuples()
johentsch Nov 5, 2023
06a7eda
more precise NgramTableFormat
johentsch Nov 5, 2023
5eb4b64
adapts column schema after extending feature df
johentsch Nov 5, 2023
d592330
moves transition_matrix() to dimcat.utils and make_transition_heatmap…
johentsch Nov 6, 2023
b96166a
adds make_bigram_tuples(), get_transitions(), and plot_grouped() for …
johentsch Nov 6, 2023
3ea4ee6
updates dependencies
johentsch Nov 6, 2023
af1bc9c
renames DEGREE => SCALE_DEGREE
johentsch Nov 6, 2023
d7c534d
pulls Grouper._iter_resources() up
johentsch Nov 6, 2023
3e2c68e
DimcatResource gets the methods get_time_spans(), get_slice_intervals…
johentsch Nov 6, 2023
69bce05
introduces the base Slicer
johentsch Nov 6, 2023
7d416de
more robust creation and loading of feature dfs
johentsch Nov 7, 2023
effb304
factors out ResourceTransformation and makes Grouper and Slicer a sub…
johentsch Nov 7, 2023
513fc58
adds Dataset.load() and get_feature() without argument (to get the la…
johentsch Nov 7, 2023
e0451ad
more consistent auxiliary columns
johentsch Nov 7, 2023
cfa64a8
enables initializing DimcatConfig with just a string as positional ar…
johentsch Nov 7, 2023
d4cfd6f
adds AdjacencySlicer and KeySlicer
johentsch Nov 7, 2023
5fa0ff2
Merge pull request #2 from johentsch/plotting
johentsch Nov 7, 2023
6 changes: 5 additions & 1 deletion .gitignore
@@ -11,9 +11,13 @@
__pycache__/*
.cache/*
.*.swp
.ipynb_checkpoints/*
.DS_Store

# Jupyter Notebook
.ipynb_checkpoints
*.ipynb
.jupyter_cache/

# Project files
.ropeproject
.project
2 changes: 1 addition & 1 deletion .isort.cfg
@@ -1,3 +1,3 @@
[settings]
-known_third_party = _pytest,dimcat,frictionless,git,importlib_metadata,marshmallow,ms3,pandas,plotly,pytest,setuptools,typing_extensions
+known_third_party = _pytest,dimcat,frictionless,git,kaleido,marshmallow,matplotlib,ms3,music21,numpy,pandas,plotly,pytest,scipy,seaborn,setuptools,tqdm,typing_extensions,yaml
profile = black
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -37,7 +37,7 @@ repos:
- id: isort

- repo: https://github.com/ambv/black
-rev: 23.1.0
+rev: 23.9.0
hooks:
- id: black
language_version: python3.10
@@ -50,7 +50,7 @@ repos:
# additional_dependencies: [black]

- repo: https://github.com/PyCQA/flake8
-rev: 6.0.0
+rev: 6.1.0
hooks:
- id: flake8
args:
2 changes: 1 addition & 1 deletion .readthedocs.yml
@@ -22,4 +22,4 @@ sphinx:
python:
install:
- requirements: docs/requirements.txt
- {path: ., method: pip}
- {path: ., method: pip}
290 changes: 136 additions & 154 deletions CONTRIBUTING.rst

Large diffs are not rendered by default.

40 changes: 40 additions & 0 deletions docs/diagrams/PipelineStep.uml
@@ -0,0 +1,40 @@
<?xml version="1.0" encoding="UTF-8"?>
<Diagram>
<ID>Python</ID>
<OriginalElement>dimcat.steps.base.PipelineStep</OriginalElement>
<nodes>
<node x="446.0" y="902.0">dimcat.steps.analyzers.base.Analyzer</node>
<node x="79.0" y="568.0">dimcat.steps.base.FeatureProcessingStep</node>
<node x="96.0" y="-16.0">dimcat.steps.base.PipelineStep</node>
<node x="-213.0" y="964.5">dimcat.steps.groupers.base.Grouper</node>
<node x="158.0" y="1066.0">dimcat.steps.slicers.base.Slicer</node>
</nodes>
<notes />
<edges>
<edge source="dimcat.steps.analyzers.base.Analyzer" target="dimcat.steps.base.FeatureProcessingStep" relationship="REALIZATION">
<point x="0.0" y="-179.5" />
<point x="654.5" y="877.0" />
<point x="292.0" y="877.0" />
<point x="0.0" y="142.0" />
</edge>
<edge source="dimcat.steps.groupers.base.Grouper" target="dimcat.steps.base.FeatureProcessingStep" relationship="REALIZATION">
<point x="0.0" y="-117.0" />
<point x="-37.5" y="877.0" />
<point x="292.0" y="877.0" />
<point x="0.0" y="142.0" />
</edge>
<edge source="dimcat.steps.base.FeatureProcessingStep" target="dimcat.steps.base.PipelineStep" relationship="REALIZATION">
<point x="0.0" y="-142.0" />
<point x="0.0" y="267.0" />
</edge>
<edge source="dimcat.steps.slicers.base.Slicer" target="dimcat.steps.base.FeatureProcessingStep" relationship="REALIZATION">
<point x="0.0" y="-15.5" />
<point x="0.0" y="142.0" />
</edge>
</edges>
<settings layout="Hierarchic" zoom="0.5960464477539062" showDependencies="false" x="-4.0" y="690.3207936" />
<SelectedNodes />
<Categories>
<Category>Methods</Category>
</Categories>
</Diagram>
4 changes: 2 additions & 2 deletions docs/index.rst
@@ -2,7 +2,7 @@
dimcat
======

-This is the documentation of **dimcat**. Mostly, it hasn't been written yet.
+This is the documentation of **DiMCAT**, the **Di**gital **M**usicology **C**orpus **A**nalysis **T**oolkit.

Contents
========
@@ -11,7 +11,7 @@ Contents
:maxdepth: 3

Overview <readme>
-manual
+manual/index
Module Reference <api/modules>
Contributions & Help <contributing>
License <LICENSE>
21 changes: 0 additions & 21 deletions docs/manual.rst

This file was deleted.

243 changes: 243 additions & 0 deletions docs/manual/data.md
@@ -0,0 +1,243 @@
---
jupytext:
  formats: ipynb,md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.6
kernelspec:
  display_name: dimcat
  language: python
  name: dimcat
---

```{code-cell} ipython3
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
import os
import frictionless as fl
from dimcat.base import deserialize_json_file
CORPUS_PATH = os.path.abspath(os.path.join("..", "..", "unittest_metacorpus"))
assert os.path.isdir(CORPUS_PATH)
sweelinck_dir = os.path.join(CORPUS_PATH, "sweelinck_keyboard")
```

# Data

## Resource

A resource is a combination of a file and its descriptor.
It allows for interacting with the file without having to "touch" it by interacting with its descriptor only.
The descriptor comes in the form of a dictionary and is typically stored next to the file in JSON or YAML format.
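
As a rough illustration of what such a descriptor might look like, the following sketch writes a minimal dictionary to JSON and reads it back without ever opening the described file. The keys `name` and `path` follow the Frictionless Data Resource convention; the file names and the helper `write_descriptor()` are made up for this example and are not part of DiMCAT.

```python
import json
import os
import tempfile


def write_descriptor(descriptor: dict, directory: str) -> str:
    """Store the descriptor in the directory where the described file would live."""
    descriptor_path = os.path.join(directory, f"{descriptor['name']}.resource.json")
    with open(descriptor_path, "w", encoding="utf-8") as f:
        json.dump(descriptor, f, indent=2)
    return descriptor_path


# Hypothetical descriptor; the described file does not need to exist for this sketch.
descriptor = {"name": "my_score", "path": "MS3/my_score.mscx"}

with tempfile.TemporaryDirectory() as tmp_dir:
    stored_at = write_descriptor(descriptor, tmp_dir)
    # Everything we learn here comes from the descriptor, not from the score file.
    with open(stored_at, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded["path"])
```

The point of the sketch is only that the descriptor carries enough information to locate and identify the file; actual DiMCAT descriptors contain more fields.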

DiMCAT follows the [Frictionless specification](https://specs.frictionlessdata.io/) for describing resources.
There are two types of resources:

* [PathResource](PathResource): Stands for a resource on local disk or on the web.
* [DimcatResource](DimcatResource): A [Frictionless Tabular Data Resource](https://specs.frictionlessdata.io/tabular-data-resource/).

They can be instantiated from a single filepath using one of two constructors:

* `.from_resource_path()`, which takes the path to the resource file to be described
* `.from_descriptor_path()`, which takes the path to a JSON or YAML file containing a resource descriptor

Let's look at an example of each, starting with the PathResource.

### PathResource

The `sweelinck_keyboard` repository contains a single MuseScore file (in the folder "MS3") and several TSV files extracted from it.
Let's load it:

```{code-cell} ipython3
from dimcat.data.resource import DimcatResource, PathResource
```

```{code-cell} ipython3
score_resource = os.path.join(sweelinck_dir, "MS3", "SwWV258_fantasia_cromatica.mscx")
score_resource = PathResource.from_resource_path(score_resource)
score_resource.get_path_dict()
```

The dictionary returned by `.get_path_dict()` tells us everything we need to know to handle the resource physically:

* `basepath` is an absolute directory
* `filepath` is the filepath (which can include subfolders), relative to the `basepath`
* `normpath` is the full path to the resource and defined as `basepath/filepath` (both need to be specified)
* `innerpath`: when `normpath` points to a .zip file, innerpath is the relative filepath of the resource within the ZIP archive
* `descriptor_filename` stores the name of a descriptor when it deviates from the default `<resource_name>.resource.json`. It cannot include subfolders since the descriptor is expected to be stored in `basepath` (otherwise, the relative `filepath` stored in the descriptor would resolve incorrectly)
* `descriptor_path`: defined by `basepath/descriptor_filename`
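
The relations between these components can be sketched with plain `os.path` calls. This is not DiMCAT's implementation: `make_path_dict()` is a hypothetical helper, the way it derives the resource name from the file name is an assumption, and, unlike the real dictionary, it always includes `descriptor_filename` for simplicity.

```python
import os
from typing import Optional


def make_path_dict(basepath: str, filepath: str, descriptor_filename: Optional[str] = None) -> dict:
    """Derive the dependent path components from basepath and filepath (illustrative only)."""
    # Assumption: the resource name is the file name without its last extension.
    resource_name = os.path.splitext(os.path.basename(filepath))[0]
    if descriptor_filename is None:
        # Default naming scheme mentioned in the list above.
        descriptor_filename = f"{resource_name}.resource.json"
    if os.path.dirname(descriptor_filename):
        raise ValueError("descriptor_filename must not include subfolders")
    return {
        "basepath": basepath,
        "filepath": filepath,
        "normpath": os.path.join(basepath, filepath),
        "descriptor_filename": descriptor_filename,
        "descriptor_path": os.path.join(basepath, descriptor_filename),
    }


paths = make_path_dict("/data/corpus", os.path.join("MS3", "piece.mscx"))
print(paths["normpath"])
print(paths["descriptor_path"])
```

Because the descriptor lives directly in `basepath`, the relative `filepath` it stores resolves to the same `normpath` wherever the whole folder is moved.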

Here, the `descriptor_path` corresponds to the default, which does not yet point to an existing file:

```{code-cell} ipython3
score_resource.descriptor_exists
```

It can be created using `.store_descriptor()`:

```{code-cell} ipython3
score_descriptor_path = score_resource.store_descriptor()
score_resource.descriptor_exists
```

To underline the functionality of the path resource, even the new descriptor can be treated as a resource:

```{code-cell} ipython3
PathResource.from_resource_path(score_descriptor_path)
```

This is different from re-creating the original PathResource from the stored descriptor:

```{code-cell} ipython3
PathResource.from_descriptor_path(score_descriptor_path)
```

Note that `descriptor_filename` is now set, keeping track of the existing descriptor the resource originates from.

By the way, the descriptors written to disk qualify as "normal" DimcatConfigs (see ???)...

```{code-cell} ipython3
deserialize_json_file(score_descriptor_path)
```

... and at the same time as valid Frictionless descriptors that can be validated using the Frictionless command-line tool or Python library:

```{code-cell} ipython3
fl.validate(score_descriptor_path)
```

This is also what the property `is_valid` uses under the hood:

```{code-cell} ipython3
score_resource.is_valid
```

The status of a PathResource is always and unchangeably `PATH_ONLY`, with a value one above `EMPTY`:

```{code-cell} ipython3
score_resource.status
```
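
To make the phrase "a value one above `EMPTY`" concrete, here is a minimal sketch of an ordered status enumeration. The class `ToyStatus`, its numeric values, and the relative order of the two `STANDALONE_*` members are assumptions for illustration; the actual `ResourceStatus` may define more members and different values.

```python
from enum import IntEnum


class ToyStatus(IntEnum):
    """Illustrative stand-in, not the real ResourceStatus."""

    EMPTY = 0
    PATH_ONLY = 1
    STANDALONE_NOT_LOADED = 2  # assumed position
    STANDALONE_LOADED = 3  # assumed position


status = ToyStatus.PATH_ONLY
# Integer-valued members can be compared, e.g. to check "at least this far along".
print(status > ToyStatus.EMPTY)
print(status == ToyStatus.EMPTY + 1)
```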

The path components cannot be modified because doing so would invalidate their relations with the other path components:

```{code-cell} ipython3
:tags: [raises-exception]

base_path_level_up = os.path.dirname(score_resource.basepath)
score_resource.basepath = base_path_level_up
```
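
One common way to obtain this kind of protection in Python is a property whose setter refuses the assignment. The sketch below shows the pattern only; `ToyPathResource`, the error type, and the message are hypothetical and may differ from what DiMCAT actually raises.

```python
class ToyPathResource:
    """Illustrative stand-in for a resource whose path components are fixed at creation."""

    def __init__(self, basepath: str, filepath: str):
        self._basepath = basepath
        self._filepath = filepath

    @property
    def basepath(self) -> str:
        return self._basepath

    @basepath.setter
    def basepath(self, value: str) -> None:
        # Hypothetical error type and message, chosen for this sketch only.
        raise RuntimeError("basepath cannot be changed; create a new resource instead.")


resource = ToyPathResource("/data/corpus", "MS3/piece.mscx")
try:
    resource.basepath = "/data"
except RuntimeError as e:
    print(f"Refused: {e}")
```

After the refused assignment, the object is unchanged, so `filepath` still resolves against the original `basepath`.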

### DimcatResource

A DimcatResource is both a Resource in the above sense and a wrapped dataframe.
Let's create one from a TSV resource descriptor:

```{code-cell} ipython3
notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json")
notes_resource = DimcatResource.from_descriptor_path(notes_descriptor_path)
notes_resource
```

As the output shows, the status of the resource is `STANDALONE_NOT_LOADED`.
The resource is considered standalone, as opposed to packaged, because it has its own resource descriptor file.
And it is considered "not loaded" because the actual tabular data has not been loaded from the described TSV file into memory.
The latter is achieved through the property `df` (short for dataframe):

```{code-cell} ipython3
notes_resource.df
```

... which changes the status to `STANDALONE_LOADED`:

```{code-cell} ipython3
notes_resource.status
```

```{code-cell} ipython3
type(notes_resource)
```
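
The "not loaded" to "loaded" transition described above is an instance of lazy loading. A minimal, DiMCAT-independent sketch of the idea follows; `LazyTable` and `LoadStatus` are invented names, the table is parsed with the standard `csv` module rather than into a dataframe, and an in-memory string stands in for the TSV file.

```python
import csv
import io
from enum import Enum
from typing import List, Optional


class LoadStatus(Enum):
    NOT_LOADED = "NOT_LOADED"
    LOADED = "LOADED"


class LazyTable:
    """Illustrative stand-in: parses its TSV source only when .rows is first accessed."""

    def __init__(self, tsv_text: str):
        self._tsv_text = tsv_text  # stands in for a file on disk
        self._rows: Optional[List[dict]] = None

    @property
    def status(self) -> LoadStatus:
        return LoadStatus.NOT_LOADED if self._rows is None else LoadStatus.LOADED

    @property
    def rows(self) -> List[dict]:
        if self._rows is None:
            # First access: parse the tab-separated text and cache the result.
            reader = csv.DictReader(io.StringIO(self._tsv_text), delimiter="\t")
            self._rows = list(reader)
        return self._rows


table = LazyTable("mc\tmidi\n1\t60\n1\t64\n")
print(table.status)
n_notes = len(table.rows)  # first access triggers parsing
print(table.status, n_notes)
```

Deferring the parse keeps object creation cheap, which matters when many resources are described but only a few are actually inspected.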

## Package

A package, or DataPackage, is a collection of resources. Analogously to resources, there are two main types:

* [PathPackage](PathPackage) for collecting [PathResources](PathResource), and
* [DimcatPackage](DimcatPackage) for collecting [DimcatResources](DimcatResource).

Just like resources, packages have a basepath and may be stored as a Frictionless package descriptor.

For starters, let's assemble a package from scratch:

```{code-cell} ipython3
from dimcat.data.package import PathPackage, DimcatPackage
```

```{code-cell} ipython3
path_package = PathPackage(package_name="scratch")
path_package
```

The fields are mostly familiar from above:

* `basepath`: Absolute path on disk where the descriptor and the ZIP file would be stored.
* `resources`: Currently an empty list. Typically, all `resources` need to have the same `basepath` (if not, the package is 'misaligned').
* `name`: As per the [Frictionless specification](https://specs.frictionlessdata.io/) every package needs a name. In DiMCAT, the relevant property is called `package_name`.
* `descriptor_filename`: The name of the descriptor file if it deviates from the default `<package_name>.datapackage.json`.
* `auto_validate`: If True, the package is automatically validated after it is stored to disk.
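
Two of these fields lend themselves to a small sketch: the default descriptor name and the notion of a 'misaligned' package. The helpers `default_descriptor_filename()` and `is_aligned()` below are hypothetical and only mirror the description in the list; they are not DiMCAT functions.

```python
import os
from typing import List


def default_descriptor_filename(package_name: str) -> str:
    """Default naming scheme described above."""
    return f"{package_name}.datapackage.json"


def is_aligned(package_basepath: str, resource_basepaths: List[str]) -> bool:
    """True if every resource shares the package's basepath (illustrative check only)."""
    package_basepath = os.path.normpath(package_basepath)
    return all(os.path.normpath(bp) == package_basepath for bp in resource_basepaths)


# A package with a name and no resources yet, as in the cell above.
package = {"name": "scratch", "resources": []}
print(default_descriptor_filename(package["name"]))
print(is_aligned("/data/corpus", ["/data/corpus", "/data/corpus/"]))
print(is_aligned("/data/corpus", ["/data/other"]))
```

Normalizing the paths before comparing them means that a trailing separator alone does not make a package count as misaligned in this sketch.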

Now let's add the path resource we have created above:

```{code-cell} ipython3
path_package.add_resource(score_resource)
path_package
```

```{code-cell} ipython3
path_package.store_descriptor()
```

We can also create a package directly from a list of resources:

```{code-cell} ipython3
dimcat_package = DimcatPackage.from_resources([notes_resource], package_name="pack")
dimcat_package
```

```{code-cell} ipython3
score_resource.is_serialized
```

```{code-cell} ipython3
score_resource.status
```

```{code-cell} ipython3
score_resource.to_dict()
```

```{code-cell} ipython3
score_resource.to_dict(pickle=True)
```

```{code-cell} ipython3
score_resource.to_config().create()
```

```{code-cell} ipython3
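from dimcat.data.resource import Resource  # assumption: Resource is importable from the same module as PathResource

notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json")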
notes_path_resource = Resource.from_descriptor_path(notes_descriptor_path)
notes_path_resource = PathResource.from_descriptor_path(notes_descriptor_path)
notes_path_resource
```

```{code-cell} ipython3
notes_resource = Resource.from_descriptor_path(notes_descriptor_path)
notes_resource
```

```{code-cell} ipython3

```
8 changes: 8 additions & 0 deletions docs/manual/index.rst
@@ -0,0 +1,8 @@
======
Manual
======

.. toctree::
:maxdepth: 3

data
File renamed without changes.