diff --git a/colabs/wandb_registry/zoo_wandb.ipynb b/colabs/wandb_registry/zoo_wandb.ipynb index c298cfca..d4a226aa 100644 --- a/colabs/wandb_registry/zoo_wandb.ipynb +++ b/colabs/wandb_registry/zoo_wandb.ipynb @@ -5,14 +5,14 @@ "id": "381c6d14-e10d-4996-bd01-e743208644d6", "metadata": {}, "source": [ - "The goal of this notebook is to show you how you can use W&B Registry to track, share, and use dataset and model artifacts in your machine learning workflow by you and other members of your organization. By the end of this notebook, you will know how to use W&B to:\n", + "The goal of this notebook is to demonstrate how you and other members of your organization can use W&B Registry to track, share, and use dataset and model artifacts in your machine learning workflows. By the end of this notebook, you will know how to use W&B to:\n", "\n", - "1. Create a [custom registry](https://docs.wandb.ai/guides/registry/create_registry)\n", - "2. Create [collections](https://docs.wandb.ai/guides/registry/create_collection) within our registry\n", - "3. Make our dataset and model artifacts available to other members of our organization. \n", - "4. See how to download artifacts from the registry for inference\n", + "1. Use a [core registry](https://docs.wandb.ai/guides/registry/registry_types#core-registry)\n", + "2. Create [collections](https://docs.wandb.ai/guides/registry/create_collection) within the default **Dataset** and **Model** registry\n", + "3. Make dataset and model artifacts available to other members of your organization, and\n", + "4. Download artifacts from the registry for inference\n", "\n", - "To do this, we will create a basic neural network to classify the biological class of animals." + "To achieve this, we will train a neural network to identify animal classes, as an example." ] }, { @@ -57,7 +57,7 @@ "We will use the open source [Zoo dataset](https://archive.ics.uci.edu/dataset/111/zoo) from the UCI Machine Learning Repository.\n", "\n", "### Retrieve dataset\n", - "We can either manually download the dataset or use the [`ucimlrepo` package](https://github.com/uci-ml-repo/ucimlrepo) to import the dataset directly into our notebook. For this example, we will go with the latter and import the dataset directly into this notebook:" + "We can either manually download the dataset or use the [`ucimlrepo` package](https://github.com/uci-ml-repo/ucimlrepo) to import the dataset directly into our notebook. For this example, we will import the dataset directly into this notebook:" ] }, { @@ -80,7 +80,9 @@ "id": "9137a521-ced5-4e35-88d0-b6245527cb90", "metadata": {}, "source": [ - "### Explore the data" + "### Explore the data\n", + "\n", + "Let's take a quick look at the shape and data type of the dataset:" ] }, { @@ -111,7 +113,7 @@ "source": [ "### Process data\n", "\n", - "Most of the major processing was already done for us (no missing values, normalized, etc.). 
For training we are going to convert our dataset from pandas DataFrames to tensores, convert the data type of our input tensotre to match the data type of the nn.Linear module, and convert our labels tensor to index from 0-6:" + "For training let's convert our dataset from a [pandas `DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to [a tensor with PyTorch](https://pytorch.org/docs/stable/generated/torch.tensor.html#torch.tensor), convert the data type of our input tensor (float64 to float32) to match the data type of the `nn.Linear` module, and convert our label tensor to index from 0-6:" ] }, { @@ -152,37 +154,53 @@ }, { "cell_type": "markdown", - "id": "bb001f4b-b9a2-4fb4-9b8b-d943536d3e08", + "id": "44c2b1fb-a1af-4311-aae9-86497d00ffda", "metadata": {}, "source": [ - "## Create a registry for our dataset and models\n", + "## Track and publish dataset \n", + "\n", + "Within the Dataset registry we will create a collection called \"zoo-dataset-tensors\". A collection is a set of linked artifact versions in a registry. \n", + "\n", + "To create a collection we need to do two things:\n", + "1. Specify the collection and registry we want to link our artifact version to. To do this, we specify a \"target path\" for our artifact version.\n", + "2. Use the `run.link_artifact` method and pass our artifact object and the target path.\n", + "\n", + "The target path consists of the name of the organization your team belongs to, the name of the registry, and the name of the collection. There are two ways to get the target path: interactively with the W&B App UI or programmatically with the W&B Python SDK.\n", + "\n", + "#### Interactively get target path of a collection\n", "\n", - "Let's create a registry to organize both our dataset artifacts and (at a later step) our model artifacts. To do this, navigate to the Registry App in the W&B App UI:\n", + "1. Navigate to the Registry app at https://wandb.ai/registry/\n", + "2. Click on the registry you want to link your artifact version to.\n", + "3. At the top of the page, you will see an autogenerated code snippet. Copy the string next to the `target_path` parameter in `run.link_artifact()`.\n", "\n", - "2. Within Custom registry, click on the **Create registry** button.\n", - "3. Provide a name for your registry in the **Name** field. For this example, we will name our registry \"Zoo_Classifier\".\n", - "4. Optionally provide a description about the registry.\n", - "5. From the [**Registry visibility**](https://docs.wandb.ai/guides/registry/configure_registry#registry-visibility-types) dropdown, click select \"Organization\".\n", - "6. Select \"All types\" from the **Accepted artifacts** type dropdown.\n", - "7. Click on the **Create registry** button.\n", "\n", + "\n", + "#### Programmatically make the collection target path \n", "\n", - "Note: You do not need to use one registry for organizing and tracking different types of artifacts. Another popular choice is to create a regsitry specifically for datasets, a registry specifically for models, and so forth." 
+ "The target path of a collectin consists of three parts:\n", + "* The name of the organization\n", + "* The name of the registry\n", + "* The name of the collection within the registry\n", + "\n", + "If you know these three fields, you can create the full name yourself with string concatanation, f-strings, and so forth:\n", + "```python\n", + "target_path = f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}\"\n", + "```" ] }, { "cell_type": "markdown", - "id": "ae6d6427-4244-4504-a40d-bdb6bdbc8758", + "id": "5f2217d9-39df-4e9a-a715-7067e9e5c731", "metadata": {}, "source": [ - "## Track and publish dataset \n", + "### Publish tensor dataset\n", "\n", - "Within our \"Zoo_Classifier\" we will create a collection called \"Datasets\". A collection is a set of linked artifact versions in a registry. In this example we will create two collections: one for our datasets and one for our models. First, let's create a collection for our datasets. To create a collection we need to do two things:\n", + "Let's publish our dataset to the Dataset registry in a collection called \"zoo-dataset-tensors\". To do this, we will \n", "\n", - "1. Specify the full path name where we want to store our artifact. \n", - " * The full paht name consists of: `{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}`\n", - "2. Use the `run.link_artifact` method and pass our artifact object and full path name\n", - "\n" + "1. Get or create the target path. For this notebook we will programmatically create the target path\n", + "1. Initialize a run\n", + "1. Create an Artifact object\n", + "2. Add each split dataset as individual files to the artifact object\n", + "3. Link the artifact object to the collection with `run.link_artifact()`. Here we specify the target path and the artifact we want to link." ] }, { @@ -192,14 +210,12 @@ "metadata": {}, "outputs": [], "source": [ - "PROJECT = \"zoo_experiment\"\n", - "TEAM_ENTITY = \"smle-reg-team-2\"\n", "ORG_NAME = \"smle-registries-bug-bash\"\n", - "REGISTRY_NAME = \"Zoo_Classifier\"\n", - "COLLECTION_NAME = \"Datasets\"\n", + "REGISTRY_NAME = \"Dataset\"\n", + "COLLECTION_NAME = \"zoo-dataset-tensors\"\n", "\n", - "target_path=f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}\"\n", - "print(target_path)" + "# Path to link the artifact to a collection\n", + "dataset_target_path = f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}\"" ] }, { @@ -209,21 +225,25 @@ "metadata": {}, "outputs": [], "source": [ + "PROJECT = \"zoo_experiment\"\n", + "TEAM_ENTITY = \"smle-reg-team-2\"\n", + "\n", "run = wandb.init(\n", " entity=TEAM_ENTITY,\n", " project=PROJECT,\n", - " job_type=\"upload_dataset\"\n", + " job_type=\"publish_dataset\"\n", ")\n", "\n", "artifact = wandb.Artifact(\n", " name=\"zoo_dataset\",\n", - " type=\"dataset\"\n", + " type=\"dataset\", \n", + " description=\"Processed dataset and labels.\"\n", ")\n", "\n", "artifact.add_file(local_path=\"zoo_dataset.pt\", name=\"zoo_dataset\")\n", "artifact.add_file(local_path=\"zoo_labels.pt\", name=\"zoo_labels\")\n", "\n", - "run.link_artifact(artifact=artifact, target_path=target_path)\n", + "run.link_artifact(artifact=artifact, target_path=dataset_target_path)\n", "\n", "run.finish()" ] @@ -233,8 +253,23 @@ "id": "383bc963-55f9-498c-b42d-a2a0bf86f3b7", "metadata": {}, "source": [ - "### Split data\n", - "Split the data into a training and test set." + "### Split data and publish split dataset\n", + "Split the data into a training and test set. 
Splitting the dataset and tracking the splits as separate files will make it easier for a different user to use the correct datasets when replicating our results later." ] }, { "cell_type": "code", "execution_count": null, "id": "6432936b-31bf-477b-a4aa-a3a85345e5aa", "metadata": {}, "outputs": [], "source": [ + "# Describe how we split the dataset, for future reference and reproducibility.\n", + "config = {\n", + " \"random_state\" : 42,\n", + " \"test_size\" : 0.25,\n", + " \"shuffle\" : True\n", + "}" ] }, { @@ -244,8 +279,81 @@ "metadata": {}, "outputs": [], "source": [ - "# using the train test split function\n", - "X_train, X_test, y_train, y_test = train_test_split(dataset,labels, random_state=42,test_size=0.25, shuffle=True)" + "# Split dataset into training and test set\n", + "X_train, X_test, y_train, y_test = train_test_split(\n", + " dataset, labels, \n", + " random_state=config[\"random_state\"],\n", + " test_size=config[\"test_size\"], \n", + " shuffle=config[\"shuffle\"]\n", + ")\n", + "\n", + "# Save the files locally\n", + "torch.save(X_train, \"zoo_dataset_X_train.pt\")\n", + "torch.save(y_train, \"zoo_labels_y_train.pt\")\n", + "\n", + "torch.save(X_test, \"zoo_dataset_X_test.pt\")\n", + "torch.save(y_test, \"zoo_labels_y_test.pt\")" ] }, { "cell_type": "markdown", "id": "028d0992-15a9-4687-8c3b-4218bf70adb9", "metadata": {}, "source": [ + "Next, let's publish this dataset into a different collection within the Dataset registry called \"zoo-dataset-tensors-split\":" ] }, { "cell_type": "code", "execution_count": null, "id": "a64bacb4-5ca7-4a74-abc8-e890c655fe17", "metadata": {}, "outputs": [], "source": [ + "run = wandb.init(\n", + " entity=TEAM_ENTITY,\n", + " project=PROJECT,\n", + " job_type=\"publish_split_dataset\", \n", + " config=config\n", + ")\n", + "\n", + "# Let's add a description to let others know which file to use in future experiments\n", + "artifact = wandb.Artifact(\n", + " name=\"split_zoo_dataset\",\n", + " type=\"dataset\", \n", + " description=\"Artifact contains `zoo_dataset` split into 4 datasets. \\\n", + " For training, use `zoo_dataset_X_train` and `zoo_labels_y_train`. \\\n", + " For testing, use `zoo_dataset_X_test` and `zoo_labels_y_test`.\"\n", + ")\n", + "\n", + "artifact.add_file(local_path=\"zoo_dataset_X_train.pt\", name=\"zoo_dataset_X_train\")\n", + "artifact.add_file(local_path=\"zoo_labels_y_train.pt\", name=\"zoo_labels_y_train\")\n", + "artifact.add_file(local_path=\"zoo_dataset_X_test.pt\", name=\"zoo_dataset_X_test\")\n", + "artifact.add_file(local_path=\"zoo_labels_y_test.pt\", name=\"zoo_labels_y_test\")\n", + "\n", + "REGISTRY_NAME = \"Dataset\"\n", + "COLLECTION_NAME = \"zoo-dataset-tensors-split\"\n", + "target_dataset_path=f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}\"\n", + "\n", + "run.link_artifact(artifact=artifact, target_path=target_dataset_path)\n", + "\n", + "run.finish()" ] }, { "cell_type": "markdown", "id": "e971ed7f-f1d3-4d62-b2ff-26135f45045f", "metadata": {}, "source": [ + "We can verify we correctly linked our artifact to our desired collection and registry with the W&B App UI: \n", + "\n", + "1. Navigate to the Registry App\n", + "2. Select the Dataset registry\n", + "3. Click **View details** next to the \"zoo-dataset-tensors-split\" collection\n", + "4. Click the **View** button next to the artifact version\n", + "5. 
Select the **Files** tab\n", + "\n", + "You should see four files: \"zoo_dataset_X_test\", \"zoo_dataset_X_train\", \"zoo_labels_y_test\", and \"zoo_labels_y_train\"." ] }, { @@ -253,7 +361,9 @@ "id": "a3492dc4-f465-4843-b573-0b2ca752b0ac", "metadata": {}, "source": [ - "## Define model" + "## Define a model\n", + "\n", + "The following cells show how to create a simple neural network classifier with PyTorch. There is nothing unique about this model, so we'll gloss over this section." ] }, { @@ -320,7 +430,22 @@ "source": [ "## Train model\n", "\n", - "Train model, save model, store model as an artifact in W&B" + "Next, let's train the model, save it, and store it as an artifact in W&B. \n", + "\n", + "We'll train the model using the training data we published to the Dataset registry. To use an artifact from a registry, we need to provide the name of the artifact. The name of the artifact looks similar to a filepath. In fact, this filepath is almost identical to the target path we used in a previous step to publish our artifact, except that we must specify the artifact version we want to use after the name of the collection: \n", + "\n", + "```python\n", + "# Target path for publishing an artifact version to a registry\n", + "f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}\"\n", + "```\n", + "\n", + "```python\n", + "# Artifact name/filepath for downloading and using an artifact published in a registry\n", + "f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:v{VERSION}\"\n", + "```\n", + "\n", + "Since we only linked one artifact version, the version we'll use is `v0`. (W&B uses 0-indexing.)" ] }, { @@ -334,8 +459,28 @@ "source": [ "run = wandb.init(entity = TEAM_ENTITY, project = PROJECT, job_type = \"training\", config = hyperparameter_config)\n", "\n", + "# Get dataset artifacts from registry\n", + "VERSION = 0\n", + "artifact_name = f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME.lower()}/{COLLECTION_NAME}:v{VERSION}\"\n", + "dataset_artifact = run.use_artifact(artifact_or_name=artifact_name)\n", + "\n", + "# Download only the training data\n", + "X_train_path = dataset_artifact.download(path_prefix=\"zoo_dataset_X_train\")\n", + "y_train_path = dataset_artifact.download(path_prefix=\"zoo_labels_y_train\")\n", + "\n", + "# Load tensors \n", + "X_train = torch.load(f=X_train_path+\"/zoo_dataset_X_train\")\n", + "y_train = torch.load(f=y_train_path+\"/zoo_labels_y_train\")\n", + "\n", + "# Set initial dummy loss value to compare to in training loop\n", + "prev_best_loss = 1e10 \n", + "\n", + "# Keep track of which model version to use at a later step\n", + "# Set to -1 for 0 indexing\n", + "model_version = -1\n", + "\n", "# Training loop\n", - "for e in range(hyperparameter_config[\"epochs\"]):\n", + "for e in range(hyperparameter_config[\"epochs\"] + 1):\n", " pred = model(X_train)\n", " loss = loss_fn(pred, y_train.squeeze(1))\n", " \n", @@ -348,68 +493,134 @@ " \"train/train_loss\": loss\n", " })\n", "\n", - " # Evaluate model\n", - "\n", - " # Checkpoint model\n", - " if e % 99 == 1:\n", - " print(\"epoch: \", e,\"loss:\", loss.item())\n", + " # Checkpoint/save model if loss improves\n", + " if (e % 100 == 0) and (loss <= prev_best_loss):\n", + " print(\"epoch: \", e, \"loss:\", loss.item())\n", " \n", - " ## Checkpoint model\n", " PATH = 'zoo_wandb.pth' \n", " torch.save(model.state_dict(), PATH)\n", - " \n", + "\n", + " model_artifact_name = f\"zoo-{wandb.run.id}\"\n", " artifact = wandb.Artifact(\n", - " name=f\"zoo-{wandb.run.id}\",\n", + 
" name=model_artifact_name,\n", " type=\"model\",\n", " metadata={\n", " \"num_classes\": 7,\n", " \"model_type\": wandb.config[\"model_type\"]\n", " }\n", " )\n", + " \n", " # Add artifact file\n", " artifact.add_file(PATH)\n", " artifact.save()\n", "\n", - "run.finish()" + " # Store new best loss\n", + " prev_best_loss = loss\n", + " model_version += 1\n", + "\n", + "run.finish()\n", + "\n", + "print(f\"Training complete. Model artifact {model_artifact_name}:v{model_version} contains the model with the lowest recorded loss.\")" ] }, { "cell_type": "markdown", - "id": "7d85c431-fcc6-4dcd-919b-56568cc5db2b", + "id": "2666d37e-1232-4609-8e2c-78af670585ab", + "metadata": {}, + "source": [ + "The preceeding cell might look intimidating. Let's break it down:\n", + "\n", + "* First, we download the dataset from the Dataset registry and load it as a tensor\n", + "* Next, we create a simple training loop\n", + " * Within the training loop we log the loss for each step\n", + " * We checkpoint(save) the model every time the remainder of the epoch divided by 100 is 0 and the loss is lower than the previously recorded loss.\n", + " * We then add the saved PyTorch model to the Artifact. \n", + "\n", + "A couple of things to note:\n", + "1. The preceeding cell stores 10 versions of the model. You can confirm this by navigating to your project workspace, select Artifacts in the left navigation, and expand the model artifact.\n", + "2. At this point, we have only tracked the model artifact within our team's project. Anyone outside of our team does not have access to the model we created. To make this model accessible to members outside of our team, we will need to publish our model to the registry. " + ] + }, + { + "cell_type": "markdown", + "id": "05223ca1-3879-4f51-b843-76f1c25cf1f7", "metadata": {}, "source": [ "## Publish model to the registry\n", - "Let's make this model artifact available to other users in our organization. To do this, we will create another collection within our Zoo_Classifier registry.\n", + "Let's make this model artifact available to other users in our organization. To do this, we will create a collection within the Model registry.\n", + "\n", + "\n", + "To create a collection within a registry, we will need to get the full name (or path) of our model artifact. There are two ways to do this, interatively with the W&B App UI or programmatically with the W&B Python SDK.\n", "\n", - "To create a collection within our registry, we will need to get the full name (or path) of our model artifact. Go to the W&B App UI and find the full name of the model artifact you want to link to the registry:\n", + "### Interactively get the path of an artifact\n", "\n", - "1. Click on the **Artifacts** tab\n", - "2. Select the name of the artifact within the left navbar\n", - "3. Click on the **Version** tab\n", - "4. Within the **Version overview**, you will find the full name of your artifact. Make note of the name." + "1. Navigate to the project where you logged your model artifact on the W&B App.\n", + "1. Click on the **Artifacts** tab.\n", + "2. Select the name of the artifact within the left navigation.\n", + "3. Click on the **Version** tab.\n", + "4. 
Within the **Version overview**, copy and paste the path next to the **Full Name** field.\n", + "\n", + "### Programmatically make the path of an artifact\n", + "\n", + "The full name (path) of an artifact consists of four components:\n", + "* Team entity\n", + "* Project name\n", + "* The name of the artifact (the string you passed when you created the artifact object with `wandb.Artifact()`)\n", + "* The artifact version\n", + "\n", + "If you know these four fields, you can create the full name yourself with string concatenation, f-strings, and so forth:\n", + "```python\n", + "# Artifact path\n", + "f'{TEAM_ENTITY}/{PROJECT}/{artifact_name}:v{version}'\n", + "```" ] }, { "cell_type": "markdown", "id": "edd97b93-d708-4260-ba69-d04b085a009c", "metadata": {}, "source": [ + "In this example, we'll programmatically create the path since we have these values loaded into memory:" ] }, { "cell_type": "code", "execution_count": null, - "id": "fb0524f8-6c37-4f79-bb1e-f9e0418b4dca", + "id": "bc31ddfa-53e0-4b00-b010-253eef47066b", "metadata": {}, "outputs": [], "source": [ - "ORG_NAME = \"smle-registries-bug-bash\"\n", - "REGISTRY_NAME = \"Zoo_Classifier\"\n", - "COLLECTION_NAME = \"Trained_models\"" + "artifact_path = f'{TEAM_ENTITY}/{PROJECT}/{model_artifact_name}:v{model_version}'" ] }, { "cell_type": "markdown", "id": "6aaa9221-f6ac-454b-9a48-47597f47a572", "metadata": {}, "source": [ + "Similar to how we created a target path when we published our dataset artifact to the Dataset registry, let's create a target path for our model artifact. This target path will tell W&B which collection and which registry to link our artifact version to. The target path consists of:\n", + "\n", + "```python\n", + "# Target path of collection\n", + "f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}\"\n", + "```\n", + "\n", + "To do this, we will need the name of our organization, the name of the registry, and the name of the collection. If the collection does not already exist, W&B will create one for us." ] }, { "cell_type": "code", "execution_count": null, - "id": "bec22dd2-83a5-4ba1-8c2d-130263b51fd4", + "id": "0e734592-7ab1-4baa-af04-71956a92f9d2", "metadata": {}, "outputs": [], "source": [ - "target_path=f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}\"\n", - "print(target_path)" + "ORG_NAME = \"smle-registries-bug-bash\"\n", + "REGISTRY_NAME = \"Model\"\n", + "COLLECTION_NAME = \"Zoo_Classifier_Models\"\n", + "\n", + "collection_target_path=f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}\"\n", + "print(\"Linking model artifact to: \", collection_target_path)" ] }, { @@ -420,9 +631,8 @@ "outputs": [], "source": [ "run = wandb.init(entity=TEAM_ENTITY, project=PROJECT)\n", - "name=\"smle-reg-team-2/zoo_experiment/zoo-nhqnys3o:v10\"\n", - "model_artifact = run.use_artifact(artifact_or_name=name, type=\"model\")\n", - "run.link_artifact(artifact=model_artifact, target_path=target_path)\n", + "model_artifact = run.use_artifact(artifact_or_name=artifact_path, type=\"model\")\n", + "run.link_artifact(artifact=model_artifact, target_path=collection_target_path)\n", "run.finish()" ] }, { @@ -433,12 +643,25 @@ "source": [ "## Download artifacts from registry for inference\n", "\n", - "For this last section, suppose you are a different user. This user wants to take take the model and dataset that you pushed to the registry and make predictions on a new test set. 
Also suppose that this user has [member role permissions](https://docs.wandb.ai/guides/registry/configure_registry#registry-roles-permissions) which means they can view and download artifacts from our registry.\n", + "For this last section, suppose you are on a different team than the user who uploaded the artifact. You want to retrieve the model and dataset artifacts that were pushed to the registry and make predictions with a new test set. Your team is called \"smle-reg-team-2\" (previously the team was \"smle-reg-team-1\") and your team is working on analyzing zoo models in a project called \"Compare_Zoo_Models\".\n", + "\n", + "\n", + "Also suppose that you have [member role permissions](https://docs.wandb.ai/guides/registry/configure_registry#registry-roles-permissions), which means you can view and download artifacts from the registry.\n", "\n", - "How can this person get your artifacts that you published to the registry? Simple:\n", + "How can you retrieve artifact versions that were published by another user in another team? Simple:\n", "\n", - "1. Know the path of the artifact in the registry\n", - "2. Use the W&B Python SDK to download the artifacts" + "1. Get the path of the artifact version from the registry UI\n", + "2. Use the W&B Python SDK to download the artifacts\n", + "\n", + "#### Interactively get the path of the model and dataset artifacts \n", + "1. Go to the W&B Registry app at https://wandb.ai/registry/.\n", + "2. Select the registry that your artifact is linked to.\n", + "3. Click the **View details** button next to the name of the collection with your linked artifact. \n", + "4. Click on the **View** button next to the artifact version. \n", + "5. Within the **Version** tab, copy and paste the path listed next to **Full Name**.\n", + "\n", + "\n", + "In this notebook example, we linked our artifact version to a collection called \"Zoo_Classifier_Models\" within the **Model** registry. 
" ] }, { @@ -446,7 +669,32 @@ "id": "32bc7cfe-eef0-478d-9898-95067e37d572", "metadata": {}, "source": [ - "### Download model" + "### Download trained model" + ] + }, + { + "cell_type": "code", + "execution_count": 128, + "id": "ef0828dd-4986-4077-b469-54a60b2db2aa", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model artifact name: smle-registries-bug-bash/wandb-registry-model/Zoo_Classifier_Models:v0\n" + ] + } + ], + "source": [ + "# Create model artifact name\n", + "ORG_NAME = \"smle-registries-bug-bash\"\n", + "REGISTRY_NAME = \"model\"\n", + "COLLECTION_NAME = \"Zoo_Classifier_Models\"\n", + "VERSION = 0\n", + "\n", + "model_artifact_name = f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:v{VERSION}\"\n", + "print(f\"Model artifact name: {model_artifact_name}\")" ] }, { @@ -456,9 +704,11 @@ "metadata": {}, "outputs": [], "source": [ - "run = wandb.init(entity=TEAM_ENTITY, project=PROJECT)\n", - "name=\"smle-registries-bug-bash/wandb-registry-Zoo_Classifier/Trained_models:v0\"\n", - "registry_model = run.use_artifact(artifact_or_name=name)\n", + "DIFFERENT_TEAM_ENTITY = \"smle-reg-team-2\"\n", + "DIFFERNT_PROJECT = \"Compare_Zoo_Models\"\n", + "\n", + "run = wandb.init(entity=DIFFERENT_TEAM_ENTITY, project=DIFFERNT_PROJECT)\n", + "registry_model = run.use_artifact(artifact_or_name=model_artifact_name)\n", "local_model_path = registry_model.download()" ] }, @@ -486,19 +736,34 @@ "id": "93895213-e75e-46e6-b3a8-8017ccaff9e2", "metadata": {}, "source": [ - "### Get dataset from registry\n", + "### Get test dataset from registry\n", "\n", - "Let's get the dataset from our registry. For this example, we will download the dataset and use the same random seed to get our test set and labels." + "Let's get the test dataset from our registry. Similar to before when we downloaded the training dataset from the Dataset registry, we will use a slightly modified target path to retrieve our training dataset. (Recall that the name of the artifact looks like the target path, but has the version appended to it)." 
] }, { "cell_type": "code", "execution_count": 126, "id": "55f3fa24-4865-480f-9f56-c1dde25b0141", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset artifact name: smle-registries-bug-bash/wandb-registry-dataset/zoo-dataset-tensors-split:v0\n" ] } ], "source": [ + "# Create dataset artifact name\n", + "ORG_NAME = \"smle-registries-bug-bash\"\n", + "REGISTRY_NAME = \"dataset\"\n", + "COLLECTION_NAME = \"zoo-dataset-tensors-split\"\n", + "VERSION = 0\n", + "\n", + "data_artifact_name = f\"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:v{VERSION}\"\n", + "print(f\"Dataset artifact name: {data_artifact_name}\")" ] }, { @@ -508,8 +773,8 @@ "metadata": {}, "outputs": [], "source": [ - "run = wandb.init(entity=TEAM_ENTITY, project=PROJECT)\n", - "dataset_artifact = run.use_artifact(artifact_or_name=name, type=\"dataset\")\n", + "run = wandb.init(entity=DIFFERENT_TEAM_ENTITY, project=DIFFERENT_PROJECT)\n", + "dataset_artifact = run.use_artifact(artifact_or_name=data_artifact_name, type=\"dataset\")\n", "local_dataset_path = dataset_artifact.download()" ] }, { @@ -520,12 +785,14 @@ "metadata": {}, "outputs": [], "source": [ + "# Test data and label filenames\n", + "test_data_filename = \"zoo_dataset_X_test\"\n", + "test_labels_filename = \"zoo_labels_y_test\" \n", + "\n", "# Load dataset and labels into notebook\n", - "loaded_data = torch.load(f=local_dataset_path+ \"/zoo_dataset\")\n", - "loaded_labels = torch.load(f=local_dataset_path + \"/zoo_labels\")\n", + "loaded_data = torch.load(f\"{local_dataset_path}/{test_data_filename}\")\n", + "loaded_labels = torch.load(f\"{local_dataset_path}/{test_labels_filename}\")\n", "\n", - "# using the train test split function using the same random state seed\n", - "X_train, X_test, y_train, y_test = train_test_split(loaded_data,loaded_labels, random_state=42,test_size=0.25, shuffle=True)\n", "run.finish()" ] }, { @@ -534,69 +801,343 @@ "id": "e522de6d-9a30-4534-a85d-6d0faf56a2f0", "metadata": {}, "source": [ - "### Make predictions with loaded model\n", - "\n", - "(Noah to do, track this w/ W&B)" + "### Make predictions with loaded model" ] }, { "cell_type": "code", "execution_count": null, "id": "d55c9556-726a-4c65-859d-91ebf7933e77", "metadata": {}, "outputs": [], "source": [ + "class_labels = {\n", + " 0: \"Aves\",\n", + " 1: \"Mammalia\",\n", + " 2: \"Reptilia\",\n", + " 3: \"Actinopterygii\",\n", + " 4: \"Amphibia\",\n", + " 5: \"Insecta\",\n", + " 6: \"Crustacea\",\n", + "}" ] }, { "cell_type": "code", "execution_count": 160, "id": "d264c840-5e69-4a2e-960b-46f09ef878ce", "metadata": {}, "outputs": [], "source": [ - "outputs = loaded_model(X_test)" + "outputs = loaded_model(loaded_data)\n", + "__, predicted = torch.max(outputs, 1)" ] }, { "cell_type": "markdown", "id": "dcb6e86b-eafc-45b9-80d7-01a7a36ce793", "metadata": {}, "source": [ + "These integers don't mean much on their own; let's convert them to animal class names and store them in a pandas DataFrame so we can compare the predicted and true values:" ] }, { "cell_type": "code", - "execution_count": null, - "id": "01599d26-b719-4e43-b395-1e70a77aca9f", + "execution_count": 169, + "id": 
"1173dc84-0395-4b23-be84-d43a0cacd38e", "metadata": {}, "outputs": [], "source": [ - "__, predicted = torch.max(outputs, 1)\n", - "print(predicted[:10])" + "results = list(map(lambda x: class_labels.get(x), predicted.numpy()))\n", + "true_values = list(map(lambda x: class_labels.get(x), loaded_labels.squeeze().numpy()))\n", + "\n", + "# Create pandas DataFrame\n", + "df = pd.DataFrame(\n", + " {\n", + " 'Predicted': results,\n", + " 'True values': true_values\n", + " }\n", + ")\n", + "\n", + "# Create new column where we compare the predicted vs true\n", + "df[\"Predicted correctly\"] = df[\"Predicted\"] == df[\"True values\"]" ] }, { "cell_type": "code", - "execution_count": null, - "id": "d55c9556-726a-4c65-859d-91ebf7933e77", + "execution_count": 176, + "id": "30818967-2a5a-4302-8de9-7c4ac61377bb", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PredictedTrue valuesPredicted correctly
0AvesAvesTrue
1AvesAvesTrue
2AvesAvesTrue
3AvesAvesTrue
4AvesAvesTrue
5InsectaInsectaTrue
6AvesAvesTrue
7AvesAvesTrue
8AvesAvesTrue
9AvesAvesTrue
10ActinopterygiiActinopterygiiTrue
11InsectaInsectaTrue
12InsectaInsectaTrue
13MammaliaMammaliaTrue
14CrustaceaCrustaceaTrue
15AvesAvesTrue
16AvesAvesTrue
17MammaliaMammaliaTrue
18ActinopterygiiActinopterygiiTrue
19AvesAvesTrue
20ActinopterygiiReptiliaFalse
21AmphibiaAmphibiaTrue
22AmphibiaAmphibiaTrue
23InsectaInsectaTrue
24AvesAvesTrue
25CrustaceaCrustaceaTrue
\n", + "
" + ], + "text/plain": [ + " Predicted True values Predicted correctly\n", + "0 Aves Aves True\n", + "1 Aves Aves True\n", + "2 Aves Aves True\n", + "3 Aves Aves True\n", + "4 Aves Aves True\n", + "5 Insecta Insecta True\n", + "6 Aves Aves True\n", + "7 Aves Aves True\n", + "8 Aves Aves True\n", + "9 Aves Aves True\n", + "10 Actinopterygii Actinopterygii True\n", + "11 Insecta Insecta True\n", + "12 Insecta Insecta True\n", + "13 Mammalia Mammalia True\n", + "14 Crustacea Crustacea True\n", + "15 Aves Aves True\n", + "16 Aves Aves True\n", + "17 Mammalia Mammalia True\n", + "18 Actinopterygii Actinopterygii True\n", + "19 Aves Aves True\n", + "20 Actinopterygii Reptilia False\n", + "21 Amphibia Amphibia True\n", + "22 Amphibia Amphibia True\n", + "23 Insecta Insecta True\n", + "24 Aves Aves True\n", + "25 Crustacea Crustacea True" + ] + }, + "execution_count": 176, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "class_labels = {\n", - " 0: \"Aves\",\n", - " 1: \"Mammalia\",\n", - " 2: \"Reptilia\",\n", - " 3: \"Actinopterygii\",\n", - " 4: \"Amphibia\",\n", - " 5: \"Insecta\",\n", - " 6: \"Crustacea\",\n", - "}" + "df" ] }, { "cell_type": "code", - "execution_count": null, - "id": "bb30698c-5231-458c-9862-dd919cbc7b20", + "execution_count": 179, + "id": "d4f2e9a5-3513-4aa8-bc38-58da3f031d51", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "Predicted correctly\n", + "True 25\n", + "False 1\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 179, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "results = list(map(lambda x: class_labels.get(x), predicted.numpy()))\n", - "results[:10]" + "# Count how many predictions were wrong\n", + "df['Predicted correctly'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 178, + "id": "f0eaea82-54a6-42f9-be89-e492849ee482", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The number of incorrect guesses on test set is: 1\n" + ] + } + ], + "source": [ + "# Count how many predictions were wrong\n", + "count_false = (~df['Predicted correctly']).sum()\n", + "print(\"The number of incorrect guesses on test set is:\", count_false) " ] }, { @@ -606,7 +1147,9 @@ "metadata": {}, "outputs": [], "source": [ - "run.finish()" + "# ?? Track these predictions into a project???\n", + "# run = wandb.init(entity=DIFFERENT_TEAM_ENTITY, project=DIFFERNT_PROJECT)\n", + "# run.finish()" ] } ],