Merge pull request #47 from johentsch/loader
New loaders
johentsch authored Nov 7, 2023
2 parents 9ce1b5b + 5fa0ff2 commit 396eabe
Showing 88 changed files with 12,974 additions and 22,003 deletions.
6 changes: 5 additions & 1 deletion .gitignore
@@ -11,9 +11,13 @@
__pycache__/*
.cache/*
.*.swp
.ipynb_checkpoints/*
.DS_Store

# Jupyter Notebook
.ipynb_checkpoints
*.ipynb
.jupyter_cache/

# Project files
.ropeproject
.project
2 changes: 1 addition & 1 deletion .isort.cfg
@@ -1,3 +1,3 @@
[settings]
known_third_party = _pytest,dimcat,frictionless,git,importlib_metadata,marshmallow,ms3,pandas,plotly,pytest,setuptools,typing_extensions
known_third_party = _pytest,dimcat,frictionless,git,kaleido,marshmallow,matplotlib,ms3,music21,numpy,pandas,plotly,pytest,scipy,seaborn,setuptools,tqdm,typing_extensions,yaml
profile = black
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -37,7 +37,7 @@ repos:
- id: isort

- repo: https://github.com/ambv/black
rev: 23.1.0
rev: 23.9.0
hooks:
- id: black
language_version: python3.10
@@ -50,7 +50,7 @@ repos:
# additional_dependencies: [black]

- repo: https://github.com/PyCQA/flake8
rev: 6.0.0
rev: 6.1.0
hooks:
- id: flake8
args:
2 changes: 1 addition & 1 deletion .readthedocs.yml
@@ -22,4 +22,4 @@ sphinx:
python:
install:
- requirements: docs/requirements.txt
- {path: ., method: pip}
- {path: ., method: pip}
290 changes: 136 additions & 154 deletions CONTRIBUTING.rst

Large diffs are not rendered by default.

40 changes: 40 additions & 0 deletions docs/diagrams/PipelineStep.uml
@@ -0,0 +1,40 @@
<?xml version="1.0" encoding="UTF-8"?>
<Diagram>
<ID>Python</ID>
<OriginalElement>dimcat.steps.base.PipelineStep</OriginalElement>
<nodes>
<node x="446.0" y="902.0">dimcat.steps.analyzers.base.Analyzer</node>
<node x="79.0" y="568.0">dimcat.steps.base.FeatureProcessingStep</node>
<node x="96.0" y="-16.0">dimcat.steps.base.PipelineStep</node>
<node x="-213.0" y="964.5">dimcat.steps.groupers.base.Grouper</node>
<node x="158.0" y="1066.0">dimcat.steps.slicers.base.Slicer</node>
</nodes>
<notes />
<edges>
<edge source="dimcat.steps.analyzers.base.Analyzer" target="dimcat.steps.base.FeatureProcessingStep" relationship="REALIZATION">
<point x="0.0" y="-179.5" />
<point x="654.5" y="877.0" />
<point x="292.0" y="877.0" />
<point x="0.0" y="142.0" />
</edge>
<edge source="dimcat.steps.groupers.base.Grouper" target="dimcat.steps.base.FeatureProcessingStep" relationship="REALIZATION">
<point x="0.0" y="-117.0" />
<point x="-37.5" y="877.0" />
<point x="292.0" y="877.0" />
<point x="0.0" y="142.0" />
</edge>
<edge source="dimcat.steps.base.FeatureProcessingStep" target="dimcat.steps.base.PipelineStep" relationship="REALIZATION">
<point x="0.0" y="-142.0" />
<point x="0.0" y="267.0" />
</edge>
<edge source="dimcat.steps.slicers.base.Slicer" target="dimcat.steps.base.FeatureProcessingStep" relationship="REALIZATION">
<point x="0.0" y="-15.5" />
<point x="0.0" y="142.0" />
</edge>
</edges>
<settings layout="Hierarchic" zoom="0.5960464477539062" showDependencies="false" x="-4.0" y="690.3207936" />
<SelectedNodes />
<Categories>
<Category>Methods</Category>
</Categories>
</Diagram>
4 changes: 2 additions & 2 deletions docs/index.rst
@@ -2,7 +2,7 @@
dimcat
======

This is the documentation of **dimcat**. Mostly, it hasn't been written yet.
This is the documentation of **DiMCAT**, the **Di**gital **M**usicology **C**orpus **A**nalysis **T**oolkit.
Contents
========
@@ -11,7 +11,7 @@ Contents
   :maxdepth: 3

   Overview <readme>
   manual
   manual/index
   Module Reference <api/modules>
   Contributions & Help <contributing>
   License <LICENSE>
21 changes: 0 additions & 21 deletions docs/manual.rst

This file was deleted.

243 changes: 243 additions & 0 deletions docs/manual/data.md
@@ -0,0 +1,243 @@
---
jupytext:
  formats: ipynb,md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.6
kernelspec:
  display_name: dimcat
  language: python
  name: dimcat
---

```{code-cell} ipython3
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
import os
import frictionless as fl
from dimcat.base import deserialize_json_file
CORPUS_PATH = os.path.abspath(os.path.join("..", "..", "unittest_metacorpus"))
assert os.path.isdir(CORPUS_PATH)
sweelinck_dir = os.path.join(CORPUS_PATH, "sweelinck_keyboard")
```

# Data

## Resource

A resource is a combination of a file and its descriptor.
It allows for interacting with the file without having to "touch" it by interacting with its descriptor only.
The descriptor takes the form of a dictionary and is typically stored next to the file in JSON or YAML format.
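
As a rough illustration of the idea (not DiMCAT's implementation), the following sketch stores a descriptor next to a file and then locates the file through the descriptor alone. The helper names and the example file are hypothetical; `name` and `path` are keys defined by the Frictionless data-resource specification, and deriving the name from the lowercased file stem is an assumption made for this example.

```python
import json
import os
import tempfile


def write_descriptor(resource_path: str) -> str:
    """Hypothetical helper: store a minimal descriptor next to the described file."""
    directory, filename = os.path.split(resource_path)
    # Assumption: the resource name is the lowercased file stem.
    resource_name = os.path.splitext(filename)[0].lower()
    # The path is stored relative to the descriptor's own location.
    descriptor = {"name": resource_name, "path": filename}
    descriptor_path = os.path.join(directory, f"{resource_name}.resource.json")
    with open(descriptor_path, "w", encoding="utf-8") as f:
        json.dump(descriptor, f, indent=2)
    return descriptor_path


def read_descriptor(descriptor_path: str) -> dict:
    """Hypothetical helper: load the descriptor dictionary from disk."""
    with open(descriptor_path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "notes.tsv")
    with open(data_path, "w", encoding="utf-8") as f:
        f.write("mc\tmidi\n1\t60\n")
    descriptor_path = write_descriptor(data_path)
    descriptor = read_descriptor(descriptor_path)
    # Resolve the relative path against the descriptor's directory.
    resolved_path = os.path.join(os.path.dirname(descriptor_path), descriptor["path"])
    resolved_matches = os.path.samefile(resolved_path, data_path)
```

The descriptor carries everything needed to find the file, so code can pass the small dictionary around and open the file only when its content is actually required.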

DiMCAT follows the [Frictionless specification](https://specs.frictionlessdata.io/) for describing resources.
There are two types of resources:

* [PathResource](PathResource): Stands for a resource on local disk or on the web.
* [DimcatResource](DimcatResource): A [Frictionless Tabular Data Resource](https://specs.frictionlessdata.io/tabular-data-resource/).

They can be instantiated from a single filepath using the constructors

* `.from_resource_path()` which takes the path to the resource file to be described
* `.from_descriptor_path()` which takes a filepath pointing to a JSON or YAML file containing a resource descriptor

Let's look at an example of each resource type in turn.

### PathResource

The `sweelinck_keyboard` repository contains a single MuseScore file (in the folder "MS3") and several TSV files extracted from it.
Let's load the score:

```{code-cell} ipython3
from dimcat.data.resource import DimcatResource, PathResource
```

```{code-cell} ipython3
score_resource = os.path.join(sweelinck_dir, "MS3", "SwWV258_fantasia_cromatica.mscx")
score_resource = PathResource.from_resource_path(score_resource)
score_resource.get_path_dict()
```

The dictionary returned by `.get_path_dict()` tells us everything we need to know to handle the resource physically:

* `basepath` is an absolute directory
* `filepath` is the filepath (which can include subfolders), relative to the `basepath`
* `normpath` is the full path to the resource and defined as `basepath/filepath` (both need to be specified)
* `innerpath`: when `normpath` points to a .zip file, `innerpath` is the relative filepath of the resource within the ZIP archive
* `descriptor_filename` stores the name of a descriptor when it deviates from the default `<resource_name>.resource.json`. It cannot include subfolders since the descriptor is expected to be stored in `basepath` (otherwise, the relative `filepath` stored in the descriptor would resolve incorrectly)
* `descriptor_path` is defined as `basepath/descriptor_filename`
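
The relations between these components can be sketched with plain `os.path` calls. The function below is a hypothetical helper, not part of DiMCAT, and the `basepath` and `resource_name` values are made up for illustration. It assumes that `descriptor_filename` stays `None` while the default name is in use, which the text above suggests but does not state outright.

```python
import os
from typing import Optional


def sketch_path_dict(
    basepath: str,
    filepath: str,
    resource_name: str,
    descriptor_filename: Optional[str] = None,
    innerpath: Optional[str] = None,
) -> dict:
    """Hypothetical helper mirroring the relations listed above."""
    if not os.path.isabs(basepath):
        raise ValueError("basepath needs to be an absolute directory")
    if (
        descriptor_filename is not None
        and os.path.basename(descriptor_filename) != descriptor_filename
    ):
        raise ValueError("descriptor_filename cannot include subfolders")
    # Fall back to the default name when no deviating name has been stored.
    effective_name = descriptor_filename or f"{resource_name}.resource.json"
    return {
        "basepath": basepath,
        "filepath": filepath,
        "normpath": os.path.join(basepath, filepath),
        "innerpath": innerpath,
        "descriptor_filename": descriptor_filename,
        "descriptor_path": os.path.join(basepath, effective_name),
    }


# Illustrative values only; the directory does not need to exist.
basepath = os.path.abspath(os.path.join(os.sep, "corpora", "sweelinck_keyboard"))
paths = sketch_path_dict(
    basepath,
    os.path.join("MS3", "SwWV258_fantasia_cromatica.mscx"),
    resource_name="SwWV258_fantasia_cromatica",
)
```

Because `filepath` is relative to `basepath` and the descriptor lives directly in `basepath`, the relative path written into the descriptor resolves to the same file wherever the folder is moved.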

Here, the `descriptor_path` corresponds to the default, which does not currently point to an existing file:

```{code-cell} ipython3
score_resource.descriptor_exists
```

It can be created using `.store_descriptor()`:

```{code-cell} ipython3
score_descriptor_path = score_resource.store_descriptor()
score_resource.descriptor_exists
```

To underline the functionality of the path resource, even the new descriptor can be treated as a resource:

```{code-cell} ipython3
PathResource.from_resource_path(score_descriptor_path)
```

This is different from recreating the original PathResource from the stored descriptor:

```{code-cell} ipython3
PathResource.from_descriptor_path(score_descriptor_path)
```

Note that `descriptor_filename` is now set, keeping track of the existing descriptor file that the resource originates from.

By the way, the descriptors written to disk qualify as "normal" DimcatConfigs (see ???)...

```{code-cell} ipython3
deserialize_json_file(score_descriptor_path)
```

... and at the same time as valid Frictionless descriptors that can be validated using the Frictionless command-line tool or Python library:

```{code-cell} ipython3
fl.validate(score_descriptor_path)
```

This is also what the property `is_valid` uses under the hood:

```{code-cell} ipython3
score_resource.is_valid
```

The status of a PathResource is always and unchangeably `PATH_ONLY`, with a value one above `EMPTY`:

```{code-cell} ipython3
score_resource.status
```

The path components cannot be modified because doing so would invalidate their relations with the other path components:

```{code-cell} ipython3
:tags: [raises-exception]
base_path_level_up = os.path.dirname(score_resource.basepath)
score_resource.basepath = base_path_level_up
```
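
One common way to obtain this kind of protection in Python is to expose the components as read-only properties. The class below is a hypothetical stand-in that shows the pattern only; how DiMCAT enforces the restriction, and which exception it raises, is not stated here.

```python
import os


class FrozenPaths:
    """Hypothetical stand-in showing read-only path components."""

    def __init__(self, basepath: str, filepath: str):
        self._basepath = basepath
        self._filepath = filepath

    @property
    def basepath(self) -> str:
        return self._basepath

    @property
    def filepath(self) -> str:
        return self._filepath

    @property
    def normpath(self) -> str:
        # Derived on access, so it can never disagree with its components.
        return os.path.join(self._basepath, self._filepath)


paths = FrozenPaths(os.path.join(os.sep, "corpus"), "score.mscx")
try:
    # Properties without a setter reject assignment with an AttributeError.
    paths.basepath = os.sep
    assignment_failed = False
except AttributeError:
    assignment_failed = True
```

Deriving `normpath` from the stored components instead of storing it separately means there is only one source of truth that needs protecting.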

### DimcatResource

A DimcatResource is both a Resource in the above sense and a wrapped dataframe.
Let's create one from a TSV resource descriptor:

```{code-cell} ipython3
notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json")
notes_resource = DimcatResource.from_descriptor_path(notes_descriptor_path)
notes_resource
```

As the output shows, the status of the resource is `STANDALONE_NOT_LOADED`.
The resource is considered standalone, as opposed to packaged, because it has its own resource descriptor file.
And it is considered "not loaded" because the actual tabular data has not been loaded from the described TSV file into memory.
The latter is achieved through the property `df` (short for dataframe):

```{code-cell} ipython3
notes_resource.df
```

... which changes the status to `STANDALONE_LOADED`:

```{code-cell} ipython3
notes_resource.status
```
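
The interplay between lazy loading and the status can be sketched without DiMCAT or pandas. Everything in the block below is hypothetical: the class names are invented, the rows are read with the standard-library `csv` module instead of into a dataframe, and the numeric status values are assumptions, except that `PATH_ONLY` is one above `EMPTY` as stated earlier.

```python
import csv
import io
from enum import IntEnum


class SketchStatus(IntEnum):
    # Only "PATH_ONLY is one above EMPTY" comes from the text;
    # the remaining values are assumed for illustration.
    EMPTY = 0
    PATH_ONLY = 1
    STANDALONE_NOT_LOADED = 2
    STANDALONE_LOADED = 3


class LazyTable:
    """Hypothetical resource that parses its TSV data on first access."""

    def __init__(self, tsv_text: str):
        self._tsv_text = tsv_text  # stands in for the TSV file on disk
        self._rows = None
        self.status = SketchStatus.STANDALONE_NOT_LOADED

    @property
    def df(self):
        if self._rows is None:
            reader = csv.DictReader(io.StringIO(self._tsv_text), delimiter="\t")
            self._rows = list(reader)
            # Loading the data is what advances the status.
            self.status = SketchStatus.STANDALONE_LOADED
        return self._rows


table = LazyTable("mc\tmidi\n1\t60\n1\t64\n")
status_before = table.status
rows = table.df
status_after = table.status
```

Deferring the parsing until `df` is accessed keeps the creation of many resources cheap, since only those whose data is actually requested are read.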

```{code-cell} ipython3
type(notes_resource)
```

## Package

A package, or DataPackage, is a collection of resources. Analogously to resources, there are two main types:

* [PathPackage](PathPackage) for collecting [PathResources](PathResource), and
* [DimcatPackage](DimcatPackage) for collecting [DimcatResources](DimcatResource).

Just like resources, packages have a basepath and may be stored as a frictionless package descriptor.

For starters, let's assemble a package from scratch:

```{code-cell} ipython3
from dimcat.data.package import PathPackage, DimcatPackage
```

```{code-cell} ipython3
path_package = PathPackage(package_name="scratch")
path_package
```

The fields are mostly familiar from above:

* `basepath`: Absolute path on disk where the descriptor and the ZIP file would be stored.
* `resources`: Currently an empty list. Typically, all `resources` need to have the same `basepath` (if not, the package is 'misaligned').
* `name`: As per the [Frictionless specification](https://specs.frictionlessdata.io/) every package needs a name. In DiMCAT, the relevant property is called `package_name`.
* `descriptor_filename`: The name of the descriptor file if it deviates from the default `<package_name>.datapackage.json`.
* `auto_validate`: If True, the package is automatically validated after it is stored to disk.
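
To make the notions of alignment and of the default descriptor name concrete, here is a minimal sketch using dataclasses. The classes are hypothetical and much simpler than DiMCAT's. In particular, the sketch assumes that "aligned" means every resource shares the package's own `basepath`.

```python
import os
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchResource:
    resource_name: str
    basepath: str


@dataclass
class SketchPackage:
    package_name: str
    basepath: str
    resources: List[SketchResource] = field(default_factory=list)
    descriptor_filename: Optional[str] = None

    @property
    def descriptor_path(self) -> str:
        # Use the default name unless a deviating one has been set.
        filename = self.descriptor_filename or f"{self.package_name}.datapackage.json"
        return os.path.join(self.basepath, filename)

    @property
    def is_aligned(self) -> bool:
        # Assumption: aligned means all resources share the package's basepath.
        return all(r.basepath == self.basepath for r in self.resources)


corpus_dir = os.path.join(os.sep, "corpus")
package = SketchPackage(package_name="scratch", basepath=corpus_dir)
package.resources.append(SketchResource("score", corpus_dir))
aligned_before = package.is_aligned
package.resources.append(SketchResource("notes", os.path.join(os.sep, "elsewhere")))
aligned_after = package.is_aligned
```

An empty package counts as aligned in this sketch because `all()` over an empty list is `True`.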

Now let's add the path resource we have created above:

```{code-cell} ipython3
path_package.add_resource(score_resource)
path_package
```

```{code-cell} ipython3
path_package.store_descriptor()
```

We can also create a package directly from a resource:

```{code-cell} ipython3
dimcat_package = DimcatPackage.from_resources([notes_resource], package_name="pack")
dimcat_package
```

```{code-cell} ipython3
score_resource.is_serialized
```

```{code-cell} ipython3
score_resource.status
```

```{code-cell} ipython3
score_resource.to_dict()
```

```{code-cell} ipython3
score_resource.to_dict(pickle=True)
```

```{code-cell} ipython3
score_resource.to_config().create()
```

```{code-cell} ipython3
notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json")
notes_path_resource = Resource.from_descriptor_path(notes_descriptor_path)
notes_path_resource = PathResource.from_descriptor_path(notes_descriptor_path)
notes_path_resource
```

```{code-cell} ipython3
notes_resource = Resource.from_descriptor_path(notes_descriptor_path)
notes_resource
```

```{code-cell} ipython3
```
8 changes: 8 additions & 0 deletions docs/manual/index.rst
@@ -0,0 +1,8 @@
======
Manual
======

.. toctree::
   :maxdepth: 3

   data
File renamed without changes.