🚸 Enable versioning artifacts based on key akin to the AWS S3 behavior (#1839)
falexwolf authored Aug 23, 2024
1 parent 67970b8 commit f353217
Showing 19 changed files with 455 additions and 352 deletions.
10 changes: 5 additions & 5 deletions .github/workflows/build.yml
@@ -21,7 +21,7 @@ jobs:
- "faq"
- "storage"
- "cli"
timeout-minutes: 10
timeout-minutes: 11

steps:
- uses: actions/checkout@v4
@@ -37,14 +37,14 @@ jobs:
with:
python-version: "3.11"
- name: cache pre-commit
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/.cache/pre-commit
key: pre-commit-${{ runner.os }}-${{ hashFiles('.pre-commit-config.yaml') }}
- name: cache postgres
if: ${{ matrix.group == 'faq' || matrix.group == 'unit' }}
id: cache-postgres
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/postgres.tar
key: cache-postgres-0
@@ -64,7 +64,7 @@ jobs:
- run: nox -s lint
if: ${{ matrix.group == 'tutorial' }} # choose a fast-running group
- run: nox -s "install_ci(group='${{ matrix.group }}')"
- uses: aws-actions/configure-aws-credentials@v1
- uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
@@ -97,7 +97,7 @@ jobs:
ssh-key: ${{ secrets.READ_LNDOCS }}
path: lndocs
ref: main
- uses: aws-actions/configure-aws-credentials@v1
- uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
129 changes: 92 additions & 37 deletions docs/introduction.ipynb
@@ -116,9 +116,9 @@
"\n",
"git, by comparison, identifies code by its content hash & file name. If you rename a notebook or script file and change the content, you lose the identity of the file. Notebook platforms like Google Colab and DeepNote support renaming and changing content of a given notebook, but they do not support versioning in a simple queryable way: every notebook version comes with the same [notebook id](https://lamin.ai/blog/nbproject#metadata-tracking).\n",
"\n",
"To enable versioning, LaminDB auto-generates `uid` values so that different versions of a transform are grouped by a random \"stem uid\" `suid`, consisting in the same first 12 characters of the `uid`. The remaining 4 characters encode a revision in a `ruid`, hence, `uid = f\"{suid}{ruid}\"`. You can optionally label any given version with a semantic tag via the `transform.version` field.\n",
"To enable versioning, LaminDB auto-generates `uid = f\"{suid}{vuid}\"` so that different versions of a transform are grouped by a random \"stem uid\" `suid` (the first part of the `uid`) while the last **four** characters encode a version in a `vuid` (an auto-incrementing base62 number). You can optionally tag a version using the `.version` field.\n",
"\n",
"Datasets and all other versioned entities in lamindb are versioned in the same way.\n",
"All versioned entities in LaminDB are versioned in this way, including artifacts and collections.\n",
"\n",
":::"
]
@@ -151,25 +151,70 @@
"\n",
"# a sample dataset\n",
"df = pd.DataFrame(\n",
" {\"CD8A\": [1, 2, 3], \"CD4\": [3, 4, 5], \"CD14\": [5, 6, 7], \"perturbation\": [\"DMSO\", \"IFNG\", \"DMSO\"]},\n",
" index=[\"observation1\", \"observation2\", \"observation3\"],\n",
" {\"CD8A\": [1, 2, 3], \"CD4\": [3, 4, 5], \"CD14\": [5, 6, 7], \"perturbation\": [\"DMSO\", \"IL-12\", \"DMSO\"]},\n",
" index=[\"sample1\", \"sample2\", \"sample3\"],\n",
")\n",
"\n",
"# create an artifact from a DataFrame\n",
"artifact = ln.Artifact.from_df(df, description=\"my RNA-seq\", version=\"1\")\n",
"# create & save an artifact from a DataFrame\n",
"artifact = ln.Artifact.from_df(df, description=\"my RNA-seq\").save()\n",
"\n",
"# artifacts come with typed, relational metadata\n",
"artifact.describe()\n",
"artifact.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the dataset into memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [],
"source": [
"# returns a dataframe\n",
"artifact.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new version of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [],
"source": [
"# update the dataframe\n",
"df.loc[\"sample2\", \"perturbation\"] = \"IFNG\"\n",
"\n",
"# save data & metadata in one operation\n",
"artifact.save()"
"# create a new version of `artifact`\n",
"artifact_v2 = ln.Artifact.from_df(df, revises=artifact).save()\n",
"\n",
"# see all versions of an artifact\n",
"artifact_v2.versions.df()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"View data lineage:"
"Similar to tagging a git commit, you can label a version."
]
},
{
@@ -182,14 +227,16 @@
},
"outputs": [],
"source": [
"artifact.view_lineage()"
"artifact_v2.version = \"1.0\"\n",
"artifact_v2.save()\n",
"artifact_v2.versions.df()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load an artifact:"
"View data lineage."
]
},
{
@@ -202,7 +249,28 @@
},
"outputs": [],
"source": [
"artifact.load()"
"artifact_v2.view_lineage()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{dropdown} I'd rather control versioning through a key or file path like on S3.\n",
"\n",
"That works, too, and you won't need to pass an old version via `revises`:\n",
"\n",
"```python\n",
"artifact_v1 = ln.Artifact.from_df(df, key=\"my_datasets/my_study1.parquet\").save()\n",
"# below automatically creates a new version of artifact_v1 because the `key` matches\n",
"artifact_v2 = ln.Artifact.from_df(df_updated, key=\"my_datasets/my_study1.parquet\").save()\n",
"```\n",
"\n",
"<br>\n",
"\n",
"The advantage of passing `revises: Artifact` is that it also works for entities that don't come with a file path, and you don't need to come up with naming conventions for paths. You'll see that LaminDB makes it easy to organize data by entities rather than file paths.\n",
"\n",
":::"
]
},
{
@@ -211,20 +279,23 @@
"source": [
":::{dropdown} How does this look for a file or folder?\n",
"\n",
"Local:\n",
"Source path is local:\n",
"\n",
"```python\n",
"ln.Artifact(\"./my_data.fcs\", description=\"my flow cytometry file\")\n",
"ln.Artifact(\"./my_images/\", description=\"my folder of images\")\n",
"```\n",
"<br>\n",
"\n",
"Remote:\n",
"Upon `artifact.save()`, the source path will be copied (uploaded) into your default storage.\n",
"\n",
"If the source path is remote, `artifact.save()` won't duplicate the data but will register the existing path.\n",
"\n",
"```python\n",
"ln.Artifact(\"s3://my-bucket/my_data.fcs\", description=\"my flow cytometry file\")\n",
"ln.Artifact(\"s3://my-bucket/my_images/\", description=\"my folder of images\")\n",
"```\n",
"\n",
"<br>\n",
"You can also use other remote file systems supported by `fsspec`.\n",
"\n",
":::\n",
@@ -267,22 +338,6 @@
"\n",
"You can open large artifacts for slicing from the cloud or load small artifacts directly into memory.\n",
"\n",
":::\n",
"\n",
":::{dropdown} How to version artifacts?\n",
"\n",
"Every artifact is auto-versioned by its `hash` and the last four characters of the `uid`.\n",
"\n",
"You can optionally pass a human-readable `version` field when you create new versions via:\n",
"\n",
"```python\n",
"artifact_v2 = ln.Artifact(\"my_path\", is_new_version_of=artifact_v1)\n",
"```\n",
"\n",
"Artifacts of the same version family share the same uid stem (the first 16 characters of the `uid`).\n",
"\n",
"You can see all versions of an artifact via `artifact.versions`.\n",
"\n",
":::"
]
},
@@ -644,7 +699,7 @@
"curate.validate()\n",
"\n",
"# save curated artifact\n",
"artifact = curate.save_artifact(description=\"my RNA-seq\", version=\"1\")\n",
"artifact = curate.save_artifact(description=\"my RNA-seq\")\n",
"artifact.describe()"
]
},
@@ -755,7 +810,7 @@
"curate.validate()\n",
"\n",
"# save curated artifact\n",
"artifact = curate.save_artifact(description=\"my RNA-seq\", version=\"1\")\n",
"artifact = curate.save_artifact(description=\"my RNA-seq\")\n",
"artifact.describe()"
]
},
@@ -857,7 +912,7 @@
" \"CD38\": [4, 2, 3],\n",
" \"perturbation\": [\"DMSO\", \"IFNG\", \"IFNG\"]\n",
" },\n",
" index=[\"observation4\", \"observation5\", \"observation6\"],\n",
" index=[\"sample4\", \"sample5\", \"sample6\"],\n",
")\n",
"adata = ad.AnnData(df[[\"CD8A\", \"CD4\", \"CD38\"]], obs=df[[\"perturbation\"]])\n",
"\n",
@@ -896,7 +951,7 @@
},
"outputs": [],
"source": [
"collection = ln.Collection([artifact, artifact2], name=\"my RNA-seq collection\", version=\"1\")\n",
"collection = ln.Collection([artifact, artifact2], name=\"my RNA-seq collection\")\n",
"collection.save()\n",
"collection.describe()\n",
"collection.view_lineage()"
@@ -974,7 +1029,7 @@
"metadata": {},
"source": [
"```\n",
"ln.finish()\n",
"ln.context.finish()\n",
"```"
]
},
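
The `uid = f"{suid}{vuid}"` scheme described in the introduction notebook above can be sketched in plain Python. This is only an illustration under stated assumptions, not LaminDB's implementation: the base62 alphabet order (digits, then uppercase, then lowercase) and the helper names `encode_base62`, `decode_base62`, `split_uid`, and `next_version_uid` are hypothetical. The only facts taken from the docs are that the last four characters form an auto-incrementing base62 `vuid` and the leading part is the stem `suid`.

```python
import string

# Assumption: digits, then uppercase, then lowercase. LaminDB's real alphabet may differ.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase
VUID_LEN = 4  # per the docs: the last four characters of the uid encode the version


def encode_base62(number: int, width: int = VUID_LEN) -> str:
    """Encode a non-negative integer as a zero-padded base62 string."""
    if number < 0 or number >= 62**width:
        raise ValueError(f"{number} does not fit into {width} base62 characters")
    chars = []
    for _ in range(width):
        number, remainder = divmod(number, 62)
        chars.append(BASE62[remainder])
    return "".join(reversed(chars))


def decode_base62(text: str) -> int:
    """Decode a base62 string back into an integer."""
    value = 0
    for char in text:
        value = value * 62 + BASE62.index(char)
    return value


def split_uid(uid: str) -> tuple:
    """Split a uid into its stem (suid) and version part (vuid)."""
    return uid[:-VUID_LEN], uid[-VUID_LEN:]


def next_version_uid(uid: str) -> str:
    """Keep the stem and increment the version part by one."""
    suid, vuid = split_uid(uid)
    return suid + encode_base62(decode_base62(vuid) + 1)


# Hypothetical uid with a 12-character stem, chosen only for illustration
print(next_version_uid("abcdefghijkl0000"))  # abcdefghijkl0001
```

Because all versions share the stem, a prefix match on `suid` would be enough to group a version family in this toy model.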
15 changes: 8 additions & 7 deletions docs/tutorial.ipynb
@@ -166,7 +166,8 @@
":::{dropdown} How do I track a pipeline instead of a notebook?\n",
"\n",
"```python\n",
"transform = ln.Transform(name=\"My pipeline\", version=\"1.2.0\")\n",
"transform = ln.Transform(name=\"My pipeline\")\n",
"transform.version = \"1.2.0\" # tag the version\n",
"ln.context.track(transform)\n",
"```\n",
"\n",
@@ -710,12 +711,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd like to version an artifact or transform, either provide the `version` parameter when creating it or create new versions through `is_new_version_of`.\n",
"You can create new versions of artifacts, collections & transforms when you pass an older version to `revises`.\n",
"\n",
"For instance:\n",
"```\n",
"new_artifact = ln.Artifact(data, is_new_version_of=old_artifact)\n",
"```"
"new_artifact = ln.Artifact(data, revises=old_artifact)\n",
"```\n",
"\n",
"Alternatively, you can set a `key` to append to a version family in the same way you'd do it on AWS S3."
]
},
{
@@ -754,7 +756,6 @@
"collection = ln.Collection(\n",
" artifact,\n",
" name=\"Iris collection\",\n",
" version=\"1\",\n",
" description=\"Iris study 0\",\n",
")\n",
"collection.save()"
Expand Down Expand Up @@ -786,7 +787,7 @@
" artifacts.append(new_artifact)\n",
" # create a new version of the collection\n",
" collection = ln.Collection(\n",
" artifacts, is_new_version_of=collection, description=f\"Now includes {folder_name}\"\n",
" artifacts, revises=collection, description=f\"Now includes {folder_name}\"\n",
" )\n",
" collection.save()"
]
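
The two ways of appending to a version family that the docs changes above describe (passing an older version to `revises`, or reusing a `key` as on AWS S3) can be illustrated with a toy in-memory model. This sketch does not use LaminDB and is not how LaminDB implements versioning: `ToyArtifact`, `ToyRegistry`, and their fields are hypothetical names introduced only to show the lookup logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyArtifact:
    """Hypothetical stand-in for a versioned record."""
    content: str
    key: Optional[str] = None
    family: int = 0  # index of the version family this record belongs to
    number: int = 1  # position within the family, starting at 1


class ToyRegistry:
    """Toy model: new versions are created via `revises` or via a matching `key`."""

    def __init__(self) -> None:
        self._families: List[List[ToyArtifact]] = []
        self._family_by_key: Dict[str, int] = {}

    def save(
        self,
        content: str,
        key: Optional[str] = None,
        revises: Optional[ToyArtifact] = None,
    ) -> ToyArtifact:
        if revises is not None:
            # Explicit: append to the family of the record being revised
            family = revises.family
            key = key if key is not None else revises.key
        elif key is not None and key in self._family_by_key:
            # S3-like: the same key means a new version of the same object
            family = self._family_by_key[key]
        else:
            # Otherwise start a new version family
            family = len(self._families)
            self._families.append([])
        versions = self._families[family]
        artifact = ToyArtifact(content, key, family, len(versions) + 1)
        versions.append(artifact)
        if key is not None:
            self._family_by_key[key] = family
        return artifact

    def versions(self, artifact: ToyArtifact) -> List[ToyArtifact]:
        """Return all versions in the family of the given record."""
        return list(self._families[artifact.family])


registry = ToyRegistry()
v1 = registry.save("first draft", key="my_datasets/my_study1.parquet")
v2 = registry.save("updated", key="my_datasets/my_study1.parquet")
print(v2.number, len(registry.versions(v1)))  # 2 2
```

In this model `revises` takes precedence over `key`, which mirrors the point made in the docs that `revises` also works for entities without a file path; whether LaminDB resolves the two in that order is an assumption here.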