Update docs, other changes for cookiecutter usage #133

Merged: 30 commits, May 13, 2024
Commits (30)
- `4f30a65` Updating for usage in ingest cookiecutter (glass-ships, Apr 6, 2024)
- `72bcd3a` update documentation some more (glass-ships, Apr 19, 2024)
- `8bfd87f` update docs, examples/tests, and biolink and typer versions (glass-ships, Apr 22, 2024)
- `aafb61e` update config doc, allow using all extracted files (glass-ships, Apr 22, 2024)
- `004332a` further documentation changes (glass-ships, Apr 23, 2024)
- `e2fb994` add testing util to mock koza, rename cli runner to cli utils (glass-ships, Apr 24, 2024)
- `7dade07` fix imports in tests (glass-ships, Apr 24, 2024)
- `8fdae59` update deps (glass-ships, Apr 24, 2024)
- `9c4ac78` try class instead of function (glass-ships, Apr 24, 2024)
- `8f3a3eb` typo (glass-ships, Apr 24, 2024)
- `c835bf3` remove args from make_mock_koza_app (glass-ships, Apr 24, 2024)
- `791d654` more typos (glass-ships, Apr 24, 2024)
- `2a1545d` add underscore back to entities, maybe that matters? (glass-ships, Apr 24, 2024)
- `7e9a5d1` try making it a fixture (glass-ships, Apr 24, 2024)
- `bc0c502` ok fixture works. removing class (glass-ships, Apr 24, 2024)
- `aa99993` add testing documentation, create iterator in mock koza directly (glass-ships, Apr 25, 2024)
- `1e82b6c` update testing documentation (glass-ships, Apr 26, 2024)
- `e1a454a` dynamic version by release (glass-ships, Apr 29, 2024)
- `c72a656` fix pyproject.toml (glass-ships, Apr 30, 2024)
- `71ef3a3` print outfiles (glass-ships, Apr 30, 2024)
- `d72cd81` add outfiles (glass-ships, Apr 30, 2024)
- `48e6a53` try checking min number of rows (glass-ships, May 3, 2024)
- `f805ba9` add warning for 70% less, error for below that (glass-ships, May 3, 2024)
- `12d9473` add warning for 70% less, error for below that (glass-ships, May 3, 2024)
- `7dcfa80` try log with empty valueerror (glass-ships, May 3, 2024)
- `4f7d652` ok that looks horrible (glass-ships, May 3, 2024)
- `2d2e931` comment out node/edge report columns for now (glass-ships, May 3, 2024)
- `9b6246a` fix test (glass-ships, May 3, 2024)
- `1e41280` try adding node/edge type to cli args (glass-ships, May 7, 2024)
- `2b6c014` try fixing row count check (glass-ships, May 13, 2024)
10 changes: 10 additions & 0 deletions .github/dependabot.yaml
```yaml
# Set update schedule for GitHub Actions

version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      # Check for updates to GitHub Actions every week
      interval: "weekly"
```
47 changes: 24 additions & 23 deletions .github/workflows/publish.yaml
Aside from whitespace, the only content change is the new `poetry version $(git describe --tags --abbrev=0)` line in the Build step, which derives the package version from the latest git tag. The resulting workflow:

```yaml
name: publish on pypi

on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout sources
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          pip install poetry && poetry install

      - name: Build
        run: |
          poetry version $(git describe --tags --abbrev=0)
          poetry build

      - name: Publish to PyPi
        env:
          PYPI_API_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
        run: |
          poetry config http-basic.pypi "__token__" "${PYPI_API_TOKEN}"
          poetry publish
```
2 changes: 1 addition & 1 deletion README.md
```diff
@@ -2,7 +2,7 @@
 [![Pyversions](https://img.shields.io/pypi/pyversions/koza.svg)](https://pypi.python.org/pypi/koza)
 [![PyPi](https://img.shields.io/pypi/v/koza.svg)](https://pypi.python.org/pypi/koza)
-![Github Action](https://github.com/monarch-initiative/koza/actions/workflows/build.yml/badge.svg)
+![Github Action](https://github.com/monarch-initiative/koza/actions/workflows/test.yaml/badge.svg)
 
 ![pupa](docs/img/pupa.png)
```
12 changes: 12 additions & 0 deletions docs/Ingests/index.md
<sub>
(For CLI usage, see the [CLI commands](../Usage/CLI.md) page.)
</sub>

Koza is designed to process and transform existing data into a target CSV/JSON/JSONL format.

This process is internally known as an **ingest**. Ingests are defined by:

1. [Source config yaml](./source_config.md): Ingest configuration, including:
- metadata, formats, required columns, any SSSOM files, etc.
1. [Map config yaml](./mapping.md): (Optional) configures creation of mapping dictionary
1. [Transform code](./transform.md): a Python script, with specific transform instructions
62 changes: 62 additions & 0 deletions docs/Ingests/mapping.md

Mapping with Koza is optional, and can be done in two ways:

- Automated mapping with SSSOM files
- Manual mapping with a map config yaml

### SSSOM Mapping

Koza supports mapping with SSSOM (Simple Standard for Sharing Ontological Mappings) files.
Simply add the path to the SSSOM file to your source config, along with the desired target prefixes
and any prefixes you want to use to filter the SSSOM file.
Koza will then build a mapping lookup table and automatically attempt to map any value
in the source file to an ID with a target prefix.

```yaml
sssom_config:
sssom_file: './path/to/your_mapping_file.sssom.tsv'
filter_prefixes:
- 'SOMEPREFIX'
- 'OTHERPREFIX'
target_prefixes:
- 'OTHERPREFIX'
use_match:
- 'exact'
```

**Note:** Currently, only the `exact` match type is supported (`narrow` and `broad` match types will be added in the future).

### Manual Mapping / Additional Data

If you don't have an SSSOM file, or you want to map some values manually, you can use a map config yaml.
This lets you include data from other sources in your ingests, even when those sources have different columns or formats.
Add the map to your source config yaml via the `depends_on` property.

Koza will then create a nested dictionary with the specified key and values.
For example, the following map config yaml maps values from the `STRING` column to the `entrez` and `NCBI taxid` columns.

```yaml
# koza/examples/maps/entrez-2-string.yaml
name: ...
files: ...

columns:
- 'NCBI taxid'
- 'entrez'
- 'STRING'

key: 'STRING'

values:
- 'entrez'
- 'NCBI taxid'
```
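Given that config, Koza keys the lookup on each row's `STRING` value and nests the listed `values` columns beneath it. A sketch of the resulting structure, with hypothetical IDs (the real contents depend on your data file):

```python
# Hypothetical illustration of the nested dict Koza builds from the map config above:
# outer key = the row's value in the `key` column ('STRING'),
# inner keys = the columns listed under `values`.
entrez_2_string = {
    "9606.ENSP00000269305": {
        "entrez": "7157",
        "NCBI taxid": "9606",
    },
    "10090.ENSMUSP00000104298": {
        "entrez": "22059",
        "NCBI taxid": "10090",
    },
}

# In a transform, a row's STRING ID then resolves to its Entrez gene ID:
row = {"protein1": "9606.ENSP00000269305"}
entrez_id = entrez_2_string[row["protein1"]]["entrez"]
print(entrez_id)  # 7157
```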


The mapping dict will be available in your transform script from the `koza_app` object (see [Transform Code](./transform.md)).

---

**Next Steps: [Transform Code](./transform.md)**
79 changes: 79 additions & 0 deletions docs/Ingests/source_config.md
This YAML file sets properties for the ingest of a single file type from within a Source.

!!! tip "Paths are relative to the directory from which you execute Koza."

## Source Configuration Properties

| **Required properties** | |
| --------------------------- | ------------------------------------------------------------------------------------------------------ |
| `name` | Name of the data ingest, as `<data source>_<type_of_ingest>`, <br/>ex. `hpoa_gene_to_disease` |
| `files` | List of files to process |
| | |
| `node_properties` | List of node properties to include in output |
| `edge_properties` | List of edge properties to include in output |
| **Note** | Either node or edge properties (or both) must be defined in the primary config yaml for your transform |
| | |
| **Optional properties** | |
| `file_archive` | Path to a file archive containing the file(s) to process <br/> Supported archive formats: zip, gzip |
| `format` | Format of the data file(s) (CSV or JSON) |
| `sssom_config` | Configures usage of SSSOM mapping files |
| `depends_on` | List of map config files to use |
| `metadata` | Metadata for the source, either a list of properties,<br/>or path to a `metadata.yaml` |
| `transform_code` | Path to a python file to transform the data |
| `transform_mode` | How to process the transform file |
| `global_table` | Path to a global translation table file |
| `local_table` | Path to a local translation table file |
| `field_type_map` | Dict of field names and their type (using the FieldType enum) |
| `filters` | List of filters to apply |
| `json_path` | Path within JSON object containing data to process |
| `required_properties` | List of properties that must be present in output (JSON only) |
| | |
| **CSV-Specific Properties** | |
| `delimiter` | Delimiter for csv files (**Required for CSV format**) |
| **Optional CSV Properties** | |
| `columns` | List of columns to include in output (CSV only) |
| `header` | Header row index for csv files |
| `header_delimiter` | Delimiter for header in csv files |
| `header_prefix` | Prefix for header in csv files |
| `comment_char` | Comment character for csv files |
| `skip_blank_lines` | Skip blank lines in csv files |
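To tie these properties together, a minimal CSV ingest config might look like the following sketch. All names, paths, and column values here are hypothetical placeholders, not part of the Koza distribution:

```yaml
# Hypothetical example: src/example_source/gene_to_disease.yaml
name: 'example-source_gene_to_disease'

files:
  - './data/example_source.tsv'

format: 'csv'
delimiter: '\t'

columns:
  - 'gene_id'
  - 'disease_id'

edge_properties:
  - 'id'
  - 'subject'
  - 'predicate'
  - 'object'

transform_code: './src/example_source/transform.py'
```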

## Metadata Properties

Metadata is optional, and can be defined as a list of properties and values, or as a path to a `metadata.yaml` file,
for example - `metadata: "./path/to/metadata.yaml"`.
Remember that the path is relative to the directory from which you execute Koza.

| **Metadata Properties** | |
| ----------------------- | ---------------------------------------------------------------------------------------- |
| name | Name of data source, ex. "FlyBase" |
| description | Description of data/ingest |
| ingest_title            | \*Title of the data source; maps to biolink name                                              |
| ingest_url              | \*URL of the data source; maps to biolink iri                                                 |
| provided_by             | `<data source>_<type_of_ingest>`, ex. `hpoa_gene_to_disease` (see the config property `name`) |
| rights | Link to license information for the data source |

**\*Note**: For more information on `ingest_title` and `ingest_url`, see the [infores catalog](https://biolink.github.io/information-resource-registry/infores_catalog.yaml)

## Composing Configuration from Multiple Yaml Files

Koza's custom YAML Loader supports importing/including other yaml files with an `!include` tag.

For example, if you had a file named `standard-columns.yaml`:

```yaml
- "column_1"
- "column_2"
- "column_3"
- "column_4": "int"
```

Then in any ingests you wish to use these columns, you can simply `!include` them:

```yaml
columns: !include "./path/to/standard-columns.yaml"
```

---

**Next Steps: [Mapping and Additional Data](./mapping.md)**
87 changes: 87 additions & 0 deletions docs/Ingests/testing.md
Koza includes a `mock_koza` fixture (see `src/koza/utils/testing_utils`) that can be used to test your ingest configuration. This fixture accepts the following arguments:

| Argument               | Type                      | Description                           |
| ---------------------- | ------------------------- | ------------------------------------- |
| **Required arguments** |                           |                                       |
| `name`                 | `str`                     | The name of the ingest                |
| `data`                 | `Union[Dict, List[Dict]]` | The data to be ingested               |
| `transform_code`       | `str`                     | Path to the transform code to be used |
| **Optional arguments** |                           |                                       |
| `map_cache`            | `Dict`                    | Map cache to be used                  |
| `filters`              | `List[str]`               | List of filters to apply to the data  |
| `global_table`         | `str`                     | Path to the global table              |
| `local_table`          | `str`                     | Path to the local table               |

The `mock_koza` fixture returns a list of entities that would be generated by the ingest configuration.
This list can then be used to test the output based on the transform script.

Here is an example of how to use the `mock_koza` fixture to test an ingest configuration:

```python
import pytest

from koza.utils.testing_utils import mock_koza

# Define the source name and transform script path
INGEST_NAME = "your_ingest_name"
TRANSFORM_SCRIPT = "./src/{{cookiecutter.__project_slug}}/transform.py"

# Define an example row to test (as a dictionary)
@pytest.fixture
def example_row():
return {
"example_column_1": "entity_1",
"example_column_2": "entity_6",
"example_column_3": "biolink:related_to",
}

# Or a list of rows
@pytest.fixture
def example_list_of_rows():
return [
{
"example_column_1": "entity_1",
"example_column_2": "entity_6",
"example_column_3": "biolink:related_to",
},
{
"example_column_1": "entity_2",
"example_column_2": "entity_7",
"example_column_3": "biolink:related_to",
},
]

# Define the mock koza transform
@pytest.fixture
def mock_transform(mock_koza, example_row):
return mock_koza(
INGEST_NAME,
example_row,
TRANSFORM_SCRIPT,
)

# Or for multiple rows
@pytest.fixture
def mock_transform_multiple_rows(mock_koza, example_list_of_rows):
return mock_koza(
INGEST_NAME,
example_list_of_rows,
TRANSFORM_SCRIPT,
)

# Test the output of the transform

def test_single_row(mock_transform):
assert len(mock_transform) == 1
entity = mock_transform[0]
assert entity
assert entity.subject == "entity_1"


def test_multiple_rows(mock_transform_multiple_rows):
assert len(mock_transform_multiple_rows) == 2
entity_1 = mock_transform_multiple_rows[0]
entity_2 = mock_transform_multiple_rows[1]
assert entity_1.subject == "entity_1"
assert entity_2.subject == "entity_2"
```
67 changes: 67 additions & 0 deletions docs/Ingests/transform.md
This Python script is where you'll define the specific steps of your data transformation.
Koza will load this script and execute it for each row of data in your source file,
applying any filters and mapping as defined in your source config yaml,
and outputting the transformed data to the target csv/json/jsonl file.

When Koza is called, either by command-line or as a library using `transform_source()`,
it creates a `KozaApp` object for the specified ingest.
This KozaApp will be your entry point to Koza:

```python
from koza.cli_utils import get_koza_app
koza_app = get_koza_app('your-source-name')
```

The KozaApp object has the following methods which can be used in your transform code:

| Method | Description |
| ------------------- | ------------------------------------------------- |
| `get_row()` | Returns the next row of data from the source file |
| `next_row()` | Skip to the next row in the data file |
| `get_map(map_name)` | Returns the mapping dict for the specified map |
| `process_sources()` | TBD |
| `process_maps()` | Initializes the KozaApp's map cache |
| `write(*args)` | Writes the transformed data to the target file |

Once you have processed a row of data, and created a biolink entity node or edge object (or both),
you can pass these to `koza_app.write()` to output the transformed data to the target file.

??? tldr "Example Python Transform Script"

```python
# other imports, e.g. uuid, pydantic, etc.
import uuid
from biolink_model.datamodel.pydanticmodel_v2 import Gene, PairwiseGeneToGeneInteraction

# Koza imports
from koza.cli_utils import get_koza_app

# This is the name of the ingest you want to run
source_name = 'map-protein-links-detailed'
koza_app = get_koza_app(source_name)

# If your ingest depends_on a mapping file, you can access it like this:
map_name = 'entrez-2-string'
koza_map = koza_app.get_map(map_name)

# This grabs the first/next row from the source data
# Koza will reload this script and return the next row until it reaches EOF or row-limit
while (row := koza_app.get_row()) is not None:
# Now you can lay out your actual transformations, and define your output:

gene_a = Gene(id='NCBIGene:' + koza_map[row['protein1']]['entrez'])
gene_b = Gene(id='NCBIGene:' + koza_map[row['protein2']]['entrez'])

pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
id="uuid:" + str(uuid.uuid1()),
subject=gene_a.id,
object=gene_b.id,
predicate="biolink:interacts_with"
)

# Finally, write the transformed row to the target file
koza_app.write(gene_a, gene_b, pairwise_gene_to_gene_interaction)
```

If you pass nodes, as well as edges, to `koza_app.write()`, Koza will automatically create a node file and an edge file.
If you pass only nodes, Koza will create only a node file, and if you pass only edges, Koza will create only an edge file.
1 change: 0 additions & 1 deletion docs/Usage/API.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/Usage/Module.md
::: src.koza.cli_utils