Merge pull request #47 from johentsch/loader
New loaders
Showing 88 changed files with 12,974 additions and 22,003 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
[settings] | ||
known_third_party = _pytest,dimcat,frictionless,git,importlib_metadata,marshmallow,ms3,pandas,plotly,pytest,setuptools,typing_extensions | ||
known_third_party = _pytest,dimcat,frictionless,git,kaleido,marshmallow,matplotlib,ms3,music21,numpy,pandas,plotly,pytest,scipy,seaborn,setuptools,tqdm,typing_extensions,yaml | ||
profile = black |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<Diagram> | ||
<ID>Python</ID> | ||
<OriginalElement>dimcat.steps.base.PipelineStep</OriginalElement> | ||
<nodes> | ||
<node x="446.0" y="902.0">dimcat.steps.analyzers.base.Analyzer</node> | ||
<node x="79.0" y="568.0">dimcat.steps.base.FeatureProcessingStep</node> | ||
<node x="96.0" y="-16.0">dimcat.steps.base.PipelineStep</node> | ||
<node x="-213.0" y="964.5">dimcat.steps.groupers.base.Grouper</node> | ||
<node x="158.0" y="1066.0">dimcat.steps.slicers.base.Slicer</node> | ||
</nodes> | ||
<notes /> | ||
<edges> | ||
<edge source="dimcat.steps.analyzers.base.Analyzer" target="dimcat.steps.base.FeatureProcessingStep" relationship="REALIZATION"> | ||
<point x="0.0" y="-179.5" /> | ||
<point x="654.5" y="877.0" /> | ||
<point x="292.0" y="877.0" /> | ||
<point x="0.0" y="142.0" /> | ||
</edge> | ||
<edge source="dimcat.steps.groupers.base.Grouper" target="dimcat.steps.base.FeatureProcessingStep" relationship="REALIZATION"> | ||
<point x="0.0" y="-117.0" /> | ||
<point x="-37.5" y="877.0" /> | ||
<point x="292.0" y="877.0" /> | ||
<point x="0.0" y="142.0" /> | ||
</edge> | ||
<edge source="dimcat.steps.base.FeatureProcessingStep" target="dimcat.steps.base.PipelineStep" relationship="REALIZATION"> | ||
<point x="0.0" y="-142.0" /> | ||
<point x="0.0" y="267.0" /> | ||
</edge> | ||
<edge source="dimcat.steps.slicers.base.Slicer" target="dimcat.steps.base.FeatureProcessingStep" relationship="REALIZATION"> | ||
<point x="0.0" y="-15.5" /> | ||
<point x="0.0" y="142.0" /> | ||
</edge> | ||
</edges> | ||
<settings layout="Hierarchic" zoom="0.5960464477539062" showDependencies="false" x="-4.0" y="690.3207936" /> | ||
<SelectedNodes /> | ||
<Categories> | ||
<Category>Methods</Category> | ||
</Categories> | ||
</Diagram> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,243 @@ | ||
--- | ||
jupytext: | ||
  formats: ipynb,md:myst | ||
  text_representation: | ||
    extension: .md | ||
    format_name: myst | ||
    format_version: 0.13 | ||
    jupytext_version: 1.14.6 | ||
kernelspec: | ||
  display_name: dimcat | ||
  language: python | ||
  name: dimcat | ||
--- | ||
|
||
```{code-cell} ipython3 | ||
import sys | ||
if not sys.warnoptions: | ||
    import warnings | ||
    warnings.simplefilter("ignore") | ||
import os | ||
import frictionless as fl | ||
from dimcat.base import deserialize_json_file | ||
CORPUS_PATH = os.path.abspath(os.path.join("..", "..", "unittest_metacorpus")) | ||
assert os.path.isdir(CORPUS_PATH) | ||
sweelinck_dir = os.path.join(CORPUS_PATH, "sweelinck_keyboard") | ||
``` | ||
|
||
# Data | ||
|
||
## Resource | ||
|
||
A resource is a combination of a file and its descriptor. | ||
It allows for interacting with the file without having to "touch" it, by interacting with its descriptor only. | ||
The descriptor comes in the form of a dictionary and is typically stored next to the file in JSON or YAML format. | ||
|
||
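As a minimal sketch of the file-plus-descriptor idea, the following writes a Frictionless-style descriptor next to a stand-in data file using only the standard library, then resolves the data file from the descriptor alone. It does not use DiMCAT; the file names and the descriptor content (`name` and `path` only) are assumptions chosen for illustration, and real DiMCAT descriptors contain more fields.

```python
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as basepath:
    # A stand-in data file; its content is irrelevant for the descriptor.
    filepath = "example.tsv"
    with open(os.path.join(basepath, filepath), "w", encoding="utf-8") as f:
        f.write("a\tb\n1\t2\n")

    # The descriptor is a plain dictionary stored next to the file.
    # The path it records is relative to the descriptor's own directory.
    descriptor = {"name": "example", "path": filepath}
    descriptor_path = os.path.join(basepath, "example.resource.json")
    with open(descriptor_path, "w", encoding="utf-8") as f:
        json.dump(descriptor, f, indent=2)

    # Later, the descriptor alone is enough to locate the data file.
    with open(descriptor_path, encoding="utf-8") as f:
        loaded = json.load(f)
    resolved = os.path.join(os.path.dirname(descriptor_path), loaded["path"])
    file_exists = os.path.isfile(resolved)
```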
DiMCAT follows the [Frictionless specification](https://specs.frictionlessdata.io/) for describing resources. | ||
There are two types of resources: | ||
|
||
* [PathResource](PathResource): Stands for a resource on local disk or on the web. | ||
* [DimcatResource](DimcatResource): A [Frictionless Tabular Data Resource](https://specs.frictionlessdata.io/tabular-data-resource/). | ||
|
||
They can be instantiated from a single filepath using the constructors | ||
|
||
* `.from_resource_path()` which takes the path to the resource file to be described | ||
* `.from_descriptor_path()` which takes a path pointing to a JSON or YAML file containing a resource descriptor | ||
|
||
Let's look at an example of each. | ||
|
||
### PathResource | ||
|
||
The `sweelinck_keyboard` repository contains a single MuseScore file (in the folder "MS3") and several TSV files extracted from it. | ||
Let's load it: | ||
|
||
```{code-cell} ipython3 | ||
from dimcat.data.resource import DimcatResource, PathResource | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
score_resource = os.path.join(sweelinck_dir, "MS3", "SwWV258_fantasia_cromatica.mscx") | ||
score_resource = PathResource.from_resource_path(score_resource) | ||
score_resource.get_path_dict() | ||
``` | ||
|
||
The dictionary returned by `.get_path_dict()` tells us everything we need to know to handle the resource physically: | ||
|
||
* `basepath` is an absolute directory | ||
* `filepath` is the filepath (which can include subfolders), relative to the `basepath` | ||
* `normpath` is the full path to the resource and defined as `basepath/filepath` (both need to be specified) | ||
* `innerpath`: when `normpath` points to a ZIP file, `innerpath` is the relative filepath of the resource within the ZIP archive | ||
* `descriptor_filename` stores the name of a descriptor when it deviates from the default `<resource_name>.resource.json`. Cannot include subfolders since it is expected to be stored in `basepath` (otherwise, the relative `filepath` stored in the descriptor would resolve incorrectly) | ||
* `descriptor_path`: defined by `basepath/descriptor_filename` | ||
|
||
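The relations between these components can be sketched with `os.path` alone. The helper `describe_paths` below is hypothetical and not part of DiMCAT; in particular, deriving the resource name from the file name without its extension is an assumption made for illustration, and `innerpath` is left out.

```python
import os


def describe_paths(basepath, filepath, descriptor_filename=None):
    """Hypothetical helper (not DiMCAT's API): derive the dependent path components."""
    # Assumption: the resource name is the file name without its extension.
    resource_name = os.path.splitext(os.path.basename(filepath))[0]
    if descriptor_filename is None:
        descriptor_filename = f"{resource_name}.resource.json"
    # The descriptor lives in basepath, so its name must not contain subfolders.
    if os.path.dirname(descriptor_filename):
        raise ValueError("descriptor_filename must not include subfolders")
    return dict(
        basepath=basepath,
        filepath=filepath,
        normpath=os.path.normpath(os.path.join(basepath, filepath)),
        descriptor_filename=descriptor_filename,
        descriptor_path=os.path.join(basepath, descriptor_filename),
    )


paths = describe_paths(
    os.path.join(os.sep, "corpus"),
    os.path.join("MS3", "score.mscx"),
)
```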
Here, the `descriptor_path` corresponds to the default, which does not currently point to an existing file: | ||
|
||
```{code-cell} ipython3 | ||
score_resource.descriptor_exists | ||
``` | ||
|
||
It can be created using `.store_descriptor()`: | ||
|
||
```{code-cell} ipython3 | ||
score_descriptor_path = score_resource.store_descriptor() | ||
score_resource.descriptor_exists | ||
``` | ||
|
||
To underline the functionality of the path resource, even the new descriptor can be treated as a resource: | ||
|
||
```{code-cell} ipython3 | ||
PathResource.from_resource_path(score_descriptor_path) | ||
``` | ||
|
||
This is different from re-creating the original PathResource from the stored descriptor: | ||
|
||
```{code-cell} ipython3 | ||
PathResource.from_descriptor_path(score_descriptor_path) | ||
``` | ||
|
||
Note that `descriptor_filename` is now set, which keeps track of the existing descriptor file that the resource originates from. | ||
|
||
By the way, the descriptors written to disk qualify as "normal" DimcatConfigs (see ???)... | ||
|
||
```{code-cell} ipython3 | ||
deserialize_json_file(score_descriptor_path) | ||
``` | ||
|
||
... and at the same time as valid Frictionless descriptors that can be validated using the Frictionless command-line tool or Python library: | ||
|
||
```{code-cell} ipython3 | ||
fl.validate(score_descriptor_path) | ||
``` | ||
|
||
This is also what the property `is_valid` uses under the hood: | ||
|
||
```{code-cell} ipython3 | ||
score_resource.is_valid | ||
``` | ||
|
||
The status of a PathResource is always and unchangeably `PATH_ONLY`, with a value one above `EMPTY`: | ||
|
||
```{code-cell} ipython3 | ||
score_resource.status | ||
``` | ||
|
||
The path components cannot be modified because doing so would invalidate their relations with the other path components: | ||
|
||
```{code-cell} ipython3 | ||
:tags: [raises-exception] | ||
base_path_level_up = os.path.dirname(score_resource.basepath) | ||
score_resource.basepath = base_path_level_up | ||
``` | ||
|
||
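One common way to obtain this kind of protection in Python is a property without a setter. The class `FrozenPaths` below is a hypothetical illustration, not DiMCAT's implementation, and DiMCAT may well raise a different exception type than the `AttributeError` shown here.

```python
import os


class FrozenPaths:
    """Hypothetical sketch: path components that can be read but not reassigned."""

    def __init__(self, basepath, filepath):
        self._basepath = basepath
        self._filepath = filepath

    @property
    def basepath(self):
        return self._basepath

    @property
    def filepath(self):
        return self._filepath

    @property
    def normpath(self):
        # Derived from the other two, so it can never get out of sync.
        return os.path.join(self._basepath, self._filepath)


frozen = FrozenPaths(os.path.join(os.sep, "corpus"), "score.mscx")
try:
    frozen.basepath = os.sep  # no setter defined, so this assignment fails
    error = None
except AttributeError as exc:
    error = exc
```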
### DimcatResource | ||
|
||
A DimcatResource is both a Resource in the above sense and a wrapped dataframe. | ||
Let's create one from a TSV resource descriptor: | ||
|
||
```{code-cell} ipython3 | ||
notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json") | ||
notes_resource = DimcatResource.from_descriptor_path(notes_descriptor_path) | ||
notes_resource | ||
``` | ||
|
||
As the output shows, the status of the resource is `STANDALONE_NOT_LOADED`. | ||
The resource is considered standalone, as opposed to packaged, because it has its own resource descriptor file. | ||
And it is considered "not loaded" because the actual tabular data has not been loaded from the described TSV file into memory. | ||
The latter is achieved through the property `df` (short for dataframe): | ||
|
||
```{code-cell} ipython3 | ||
notes_resource.df | ||
``` | ||
|
||
... which changes the status to `STANDALONE_LOADED`: | ||
|
||
```{code-cell} ipython3 | ||
notes_resource.status | ||
``` | ||
|
||
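The status change on first access can be pictured as lazy loading behind a property. The following is a minimal sketch under stated assumptions: `Status` and `LazyTable` are hypothetical names, the two status values only loosely mirror `STANDALONE_NOT_LOADED` and `STANDALONE_LOADED`, and a list of tuples stands in for the dataframe.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical two-state status, not DiMCAT's ResourceStatus."""
    NOT_LOADED = 0
    LOADED = 1


class LazyTable:
    """Hypothetical sketch: the data is only read when first requested."""

    def __init__(self, loader):
        self._loader = loader  # callable that reads the data from disk
        self._rows = None

    @property
    def status(self):
        return Status.NOT_LOADED if self._rows is None else Status.LOADED

    @property
    def rows(self):
        if self._rows is None:
            self._rows = self._loader()  # happens only once
        return self._rows


calls = []


def load():
    calls.append(1)  # record each call to show that loading happens once
    return [("C4", 1.0), ("D4", 0.5)]


table = LazyTable(load)
before = table.status
first = table.rows
second = table.rows
after = table.status
```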
```{code-cell} ipython3 | ||
type(notes_resource) | ||
``` | ||
|
||
## Package | ||
|
||
A package, or DataPackage, is a collection of resources. Analogously, there are two main types: | ||
|
||
* [PathPackage](PathPackage) for collecting [PathResources](PathResource), and | ||
* [DimcatPackage](DimcatPackage) for collecting [DimcatResources](DimcatResource). | ||
|
||
Just like resources, packages have a basepath and may be stored as a Frictionless package descriptor. | ||
|
||
For starters, let's assemble a package from scratch: | ||
|
||
```{code-cell} ipython3 | ||
from dimcat.data.package import PathPackage, DimcatPackage | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
path_package = PathPackage(package_name="scratch") | ||
path_package | ||
``` | ||
|
||
The fields are mostly familiar from above: | ||
|
||
* `basepath`: Absolute path on disk where the descriptor and the ZIP file would be stored. | ||
* `resources`: Currently an empty list. Typically, all `resources` need to have the same `basepath` (if not, the package is 'misaligned'). | ||
* `name`: As per the [Frictionless specification](https://specs.frictionlessdata.io/) every package needs a name. In DiMCAT, the relevant property is called `package_name`. | ||
* `descriptor_filename`: The name of the descriptor file if it deviates from the default `<package_name>.datapackage.json`. | ||
* `auto_validate`: If True, the package is automatically validated after it is stored to disk. | ||
|
||
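What 'misaligned' means for `resources` can be sketched as a simple comparison of base paths. The function `is_aligned` is hypothetical and only illustrates the idea; how DiMCAT actually detects or handles misalignment is not shown in this document.

```python
import os


def is_aligned(package_basepath, resource_basepaths):
    """Hypothetical check: do all resources share the package's basepath?"""
    target = os.path.normpath(package_basepath)
    return all(os.path.normpath(p) == target for p in resource_basepaths)


corpus = os.path.join(os.sep, "data", "corpus")
other = os.path.join(os.sep, "data", "other")

# A trailing separator does not matter after normalization.
aligned = is_aligned(corpus, [corpus, corpus + os.sep])
misaligned = is_aligned(corpus, [corpus, other])
# An empty package is trivially aligned.
empty = is_aligned(corpus, [])
```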
Now let's add the path resource we have created above: | ||
|
||
```{code-cell} ipython3 | ||
path_package.add_resource(score_resource) | ||
path_package | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
path_package.store_descriptor() | ||
``` | ||
|
||
We can also create a package directly from a list of resources: | ||
|
||
```{code-cell} ipython3 | ||
dimcat_package = DimcatPackage.from_resources([notes_resource], package_name="pack") | ||
dimcat_package | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
score_resource.is_serialized | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
score_resource.status | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
score_resource.to_dict() | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
score_resource.to_dict(pickle=True) | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
score_resource.to_config().create() | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json") | ||
# Treat the described TSV file as a plain path resource, without loading its data. | ||
notes_path_resource = PathResource.from_descriptor_path(notes_descriptor_path) | ||
notes_path_resource | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
# Re-create the DimcatResource from its descriptor, as in the DimcatResource section. | ||
notes_resource = DimcatResource.from_descriptor_path(notes_descriptor_path) | ||
notes_resource | ||
``` | ||
|
||
```{code-cell} ipython3 | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
====== | ||
Manual | ||
====== | ||
|
||
.. toctree:: | ||
   :maxdepth: 3 | ||
 |
||
   data |