From cbc32aaa89c7ee2f028474bb3b516da51aeced9a Mon Sep 17 00:00:00 2001 From: <> Date: Thu, 21 Nov 2024 09:49:47 +0000 Subject: [PATCH] Deployed e9259bf with MkDocs version: 1.6.1 --- s7_deployment/ml_deployment/index.html | 4 ++-- search/search_index.json | 2 +- sitemap.xml.gz | Bin 127 -> 127 bytes 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/s7_deployment/ml_deployment/index.html b/s7_deployment/ml_deployment/index.html index c45981f1f..7772f86c5 100644 --- a/s7_deployment/ml_deployment/index.html +++ b/s7_deployment/ml_deployment/index.html @@ -1984,7 +1984,7 @@

Deployment of Machine Learning Mo ✅ ✅ 🔗 Link -8.3k +8.4k OpenVINO @@ -2026,7 +2026,7 @@

Deployment of Machine Learning Mo ❌ ❌ 🔗 Link -30.5k +30.6k diff --git a/search/search_index.json b/search/search_index.json index f3ebd48f5..f1b01136c 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"

Machine Learning Operations

Repository for course 02476 at DTU.

Check out the homepage!

"},{"location":"#i-course-information","title":"\u2139\ufe0f Course information","text":""},{"location":"#course-setup","title":"\ud83d\udcbb Course setup","text":"

Start by cloning or downloading this repository

git clone https://github.com/SkafteNicki/dtu_mlops\n

If you do not have git installed (yet), we will touch upon it in the course. The folder contains all the exercise material and lectures for this course. Additionally, you should join our Slack channel, which we use for communication. If the link has expired, write to me.

"},{"location":"#course-organization","title":"\ud83d\udcc2 Course organization","text":"

We highly recommend that you go through the material using the homepage, which is the corresponding GitHub Pages version of this repository. It is more nicely rendered and also includes some special HTML magic provided by Material for MkDocs.

The course is divided into sessions, denoted by capital S, and modules, denoted by capital M. A session corresponds to a full day of work if you are following the course, meaning approximately 9 hours of work. Each session (S) corresponds to a topic within MLOps and consists of multiple modules (M) that each cover a specific topic.

Importantly, we distinguish between core modules and optional modules. Core modules will be marked by

Core Module

at the top of their corresponding page. Going through the core modules is important to be able to pass the course. You are still highly recommended to do the optional modules as well.

Additionally, be aware of the following icons throughout the course material:

"},{"location":"#mlops-what-is-it","title":"\ud83c\udd92 MLOps: What is it?","text":"

Machine Learning Operations (MLOps) is a rather new field that has seen its rise as machine learning, and particularly deep learning, has become a widely available technology. The term itself is a compound of \"machine learning\" and \"operations\" and covers everything that has to do with the management of the production ML lifecycle.

The lifecycle of production ML can largely be divided into three phases:

  1. Design: The initial phase starts with an investigation of the problem. Based on this analysis, several requirements can be prioritized for what we want our future model to do. Since machine learning models require data to be trained on, we also investigate in this step what data we have and whether we need to source it in some other way.

  2. Model development: Based on the design phase we can begin to conjure some machine learning algorithms to solve our problems. As always, the initial step often involves doing some data analysis to make sure that our model is learning the signal that we want it to learn. Second comes the machine learning engineering phase, where the particular model architecture is chosen. Finally, we also need to do validation and testing to make sure that our model generalizes well.

  3. Operations: Based on the model development phase, we now have a model that we want to use. Operations is where we create an automatic pipeline that makes sure that whenever we make changes to our codebase, they get automatically incorporated into our model, such that we do not slow down production. Equally important is the ongoing monitoring of already deployed models to make sure that they behave exactly as we specified.

It is important to note that the three steps form a cycle, meaning that successfully deploying a machine learning model is not the end of it. Your initial requirements may change, forcing you to revisit the design phase. Some new algorithms may show promising results, so you revisit the model development phase to implement them. Finally, you may try to cut the cost of running your model in production, making you revisit the operations phase and try to optimize some steps.

The focus in this course is particularly on the Operations part of MLOps as this is what many data scientists are missing in their toolbox to take all the knowledge they have about data processing and model development into a production setting.

"},{"location":"#learning-objectives","title":"\u2754 Learning objectives","text":"

General course objective

Introduce the student to a number of coding practices that will help them organize, scale, monitor and deploy machine learning models either in a research or a production setting. To provide hands-on experience with a number of frameworks, both local and in the cloud, for working with large-scale machine learning models.

This includes:

"},{"location":"#references","title":"\ud83d\udcd3 References","text":"

Additional reading resources (in no particular order):

Other courses with content similar to this:

"},{"location":"#contributing","title":"\ud83d\udc68\u200d\ud83c\udfeb Contributing","text":"

If you want to contribute to the course, we are happy to have you! Anything from fixing typos to adding new content is welcome. For building the course material locally, it is a simple two-step process:

pip install -r requirements.txt\nmkdocs serve\n

This will start a local server that you can access at http://127.0.0.1:8000, which automatically updates when you make changes to the course material. When you have something that you want to contribute, please make a pull request.

"},{"location":"#license","title":"\u2755 License","text":"

I highly value open source, and the content of this course is therefore free to use under the Apache 2.0 license. If you use parts of this course in your work, please cite using:

@misc{skafte_mlops,\n    author       = {Nicki Skafte Detlefsen},\n    title        = {Machine Learning Operations},\n    howpublished = {\\url{https://github.com/SkafteNicki/dtu_mlops}},\n    year         = {2024}\n}\n
"},{"location":"pages/faq/","title":"Frequently asked questions","text":"

For further questions, please contact Nicki.

"},{"location":"pages/faq/#when-is-the-next-time-the-course-is-running","title":"When is the next time the course is running \u2754","text":"

The course always runs in January, during the 3-week period at DTU. The exact dates can be found in the academic calendar.

"},{"location":"pages/faq/#is-it-possible-to-attend-the-course-fully-online","title":"Is it possible to attend the course fully online \u2754","text":"

Mostly yes. All exercises are provided online and lectures will be recorded and streamed. However, do note that

Overall we try to support flexible learning as much as possible with some limitations.

"},{"location":"pages/faq/#what-are-the-prerequisites-for-taking-this-course","title":"What are the prerequisites for taking this course \u2754","text":"

We recommend that you have a basic understanding of machine learning concepts such as what a dataset is, what probabilities are, what a classifier is, what overfitting means etc. This corresponds to the curriculum covered in course 02450. The actual focus of the course is not on machine learning models, but we will be using these basic concepts throughout the exercises.

Additionally, we recommend basic knowledge about deep learning and how to code in PyTorch, corresponding to the curriculum covered in 02456. From prior experience, we know that not all students have gained knowledge about deep learning models before this course, and we will be covering the basics of how to code in PyTorch in one of the first modules of the course to get everyone up to speed.

"},{"location":"pages/faq/#i-will-be-missing-x-days-of-the-course-will-that-be-a-problem","title":"I will be missing X days of the course, will that be a problem \u2754","text":"

Depends. The course is fairly intensive, with most students working from 9-17 every day. If you already know that you will be missing X days of the course, then I highly recommend that you go through some of the first sessions before the course starts to give yourself a bit of breathing room. If you are not able to do so, please be aware that an additional effort may be needed from you to keep up with your fellow students.

"},{"location":"pages/faq/#how-many-should-we-be-in-a-group-for-the-projects","title":"How many should we be in a group for the projects \u2754","text":"

Between 3 and 5. The projects are designed to be done in groups meaning that we intentionally make them too big for one person to do alone. Luckily, a lot of the work that you need to do can be done in parallel, so it is not as bad as it sounds.

"},{"location":"pages/faq/#when-will-the-exam-take-place","title":"When will the exam take place \u2754","text":"

From 2025 and onwards, the exam only consists of a project report. The report should be handed in at midnight on the final day of the course. For January 2025, this means the 24th.

"},{"location":"pages/faq/#where-can-i-find-information-regarding-the-exam","title":"Where can I find information regarding the exam \u2754","text":"

Look at the bottom of this page. Details will be updated as we get closer to the exam date.

"},{"location":"pages/faq/#can-i-use-chatgpt-or-similar-tools-for-the-exercises-project-exam-report-coding-writing","title":"Can I use ChatGPT or similar tools for the exercises, project, exam report (coding + writing) \u2754","text":"

Yes, yes, and yes, but remember that it's a tool and you need to validate the output before using it. For the exam report, we would prefer that you formulate the answers in your own words, because it is intended for you to describe what you have been doing in your project. The I in LLM stands for intelligence.

"},{"location":"pages/faq/#i-am-a-phd-student-not-enrolled-at-dtu-can-i-take-the-course","title":"I am a PhD student not enrolled at DTU, can I take the course \u2754","text":"

Yes, PhD students from other universities can attend the course. You can check out this page or in general contact phdcourses@dtu.dk for more information. Do note that the registration deadline is usually at the beginning of December.

"},{"location":"pages/faq/#i-am-a-foreign-student-and-my-home-university-doesnt-accept-passnot-pass-what-can-i-do","title":"I am a foreign student and my home university doesn't accept pass/not pass, what can I do \u2754","text":"

We can give a grade on the Danish 7-point grading scale for foreign students who need it because their home university does not accept pass/not-pass. You need to contact the course responsible, Nicki, within the first week of the course to request this. Additionally, we may need to further validate your work, so please be prepared to do a short oral exam on one of the last days of the course.

"},{"location":"pages/faq/#i-am-a-euroteq-student-any-special-rules-for-me","title":"I am a EuroTEQ student, any special rules for me \u2754","text":"

Not really, you will attend the course like any other student. However, we will provide a special Slack channel for you to try to make sure that you can get the same help as DTU students who attend the course on campus.

"},{"location":"pages/overview/","title":"Summary of course content","text":"

There are a lot of moving parts in this course, so it may be hard to understand how it all fits together. This page provides a summary of the frameworks in this course, i.e. the stack of tools used. The figure below provides an overview of how the different tools of the course interact with each other. The table after the figure provides a short description of each of the parts.

The MLOps stack in the course. This is just an example of one stack, and depending on your use case you may want to use a different stack of tools that better fits your needs. Regardless of the stack, the principles of MLOps are the same. Framework Description PyTorch is the backbone of our code; it provides the computational engine and the data structures that we need to define our models. PyTorch Lightning is a framework that provides a high-level interface to PyTorch. It provides a lot of the functionality that we need to train our models, such as logging, checkpointing, early stopping, etc., such that we do not have to implement it ourselves. It also allows us to scale our models to multiple GPUs and multiple nodes. We control the dependencies and the Python interpreter using Conda, which enables us to construct reproducible virtual environments. For configuring our experiments we use Hydra, which allows us to define a hierarchical configuration structure in config files. Using Weights and Biases allows us to track and log any values and hyperparameters for our experiments. Whenever we run into performance bottlenecks with our code we can use the Profiler to find the cause of the bottleneck. When we run into bugs in our code we can use the Debugger to find the cause of the bug. For organizing our code and creating templates we can use Cookiecutter. Docker is a tool that allows us to create a container that contains all the dependencies and code that we need to run our code. For controlling the versions of our data and synchronization between local and remote data storage, we can use DVC, which makes this process easy. For version control of our code we use Git (in combination with GitHub), which allows multiple developers to work together on a shared codebase. We can use Pytest to write unit tests for our code, to make sure that new changes to the code do not break the code base. For linting our code and keeping a consistent coding style we can use tools such as Pylint and Flake8, which check our code for common mistakes and style issues. For running our unit tests and other checks on our code in a continuous manner, e.g. after we commit and push our code, we can use GitHub Actions, which automate this process. Using Cloud Build we can automate the process of building our docker images and pushing them to our artifact registry. Artifact Registry is a service that allows us to store our docker images for later use by other services. For storing our data and trained models we can use Cloud Storage, which provides a scalable and secure storage solution. For general compute tasks we can use Compute Engine, which provides a scalable and secure compute solution. For training our experiments in an easy and scalable manner we can use Vertex AI. For creating a REST API for our model we can use FastAPI, which provides a high-level interface for creating APIs. For simple deployments of our code we can use Cloud Functions, which allow us to run our code in response to events through simple Python functions. For more complex deployments of our code we can use Cloud Run, which allows us to run our code in response to events through docker containers. Cloud Monitoring gives us the tools to keep track of important logs and errors from the other cloud services. For monitoring whether our deployed model is experiencing any drift we can use Evidently AI, which provides a framework and dashboard for monitoring drift. For monitoring the telemetry of our deployed model we can use OpenTelemetry, which provides a standard for collecting and exporting telemetry data."},{"location":"pages/projects/","title":"Project work","text":"

Slides

Approximately 1/3 of the course time is dedicated to doing project work. The projects will serve as the basis of your exam. In the project, you will essentially re-apply everything that you learn throughout the course to a self-chosen project. The overall goals of the project are:

In the projects you are free to work on whatever problem you want. That said, we have one specific requirement: you need to incorporate some third-party framework into your project. If you want inspiration for projects, here are some examples:

  1. Classification of tweets

  2. Translating from English to German

  3. Classification of scientific papers

  4. Classification of rice types from images

We hope most students will be able to form groups by themselves. Expected group size is between 3 and 5. If you are not able to form a group, please make sure to post in the #looking-for-group channel on Slack or make sure to be present on the 4th day of the course (the day before the project work starts) where we will help students that have not found a group yet.

"},{"location":"pages/projects/#open-source-tools","title":"Open-source tools","text":"

We strive to keep the tools taught in this course as open-source as possible. The great thing about the open-source community is that whatever problem you are working on, there is probably some package out there that can get you at least 10% of the way. For the project, we want to enforce this point, and you are required to include some third-party package, that is neither PyTorch nor one of the tools already covered in the course, into your project.

If you have no idea what framework to include, the PyTorch ecosystem is a great place for finding open-source frameworks that can help you accelerate your own projects, with PyTorch as the backbone engine. All tools in the ecosystem should work well together with PyTorch. However, it is important to note that the ecosystem is not a complete list of all the awesome packages that exist to extend the functionality of PyTorch. If you are still missing inspiration for frameworks to use, we highly recommend these three that have been used in previous years of the course:
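
As a purely illustrative sketch (assuming the Hugging Face transformers package, which is just one possible third-party choice and not a course requirement), incorporating such a framework can be as simple as wrapping one of its pretrained models in your own code:

from transformers import pipeline  # hypothetical third-party framework built on top of PyTorch\n\n# illustrative starting point for e.g. a tweet-classification project\nclassifier = pipeline(\"sentiment-analysis\")\nprint(classifier(\"MLOps makes deploying models so much easier!\"))\n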

"},{"location":"pages/projects/#project-days","title":"Project days","text":"

Each project day is fully dedicated to project work, except for maybe external inspirational lectures in the morning. The group decides exactly where they want to work on the project, how they want to work on it, how to distribute the workload, etc. We strongly encourage you to parallelize work during the project, because there are a lot of tasks to do, but it is important that all group members have at least some understanding of the whole project.

Remember that the focus of the project work is not to demonstrate that you can work with the biggest and baddest deep learning model, but instead that you show that you can incorporate the tools that are taught throughout the course in a meaningful way.

Also note that the project is not expected to be very large in scope. It may simply be that you want to train X model on Y data. You will be given approximately 4 full days to work on the project. It is better to start out with a smaller project and then add complexity along the way if you have time.

"},{"location":"pages/projects/#day-1","title":"Day 1","text":"

The first project day is all about getting started on the projects and formulating exactly what you want to work on as a group.

  1. Start by brainstorming projects! Try to figure out exactly what you want to work with and begin to investigate what third-party package can support the project.

  2. When you have come up with an idea, write a project description. The description is the delivery for today and should be at least 300 words. Try to answer the following questions in the description:

    • Overall goal of the project
    • What framework are you going to use, and how do you intend to include the framework in your project?
    • What data are you going to run on (initially, may change)
    • What models do you expect to use
  3. (Optional) If you want to think more about the product design of your project, feel free to fill out the ML canvas (or part of it). You can read more about the different fields of the canvas here.

  4. After having written the project description, you can start on the actual coding of the project. In the next section, a to-do list is attached that summarizes what we are doing in the course. You are NOT expected to fulfill all bullet points from week 1 today.

The project description will serve as a guideline for us at the exam to check that you have somewhat reached the goals that you set out to achieve. By the end of the day, you should commit your project description to the README.md file belonging to your project repository. If you filled out the ML canvas, feel free to include that as part of the README.md file. Also remember to commit whatever you have done on the project so far. When you have done this, go to DTU Learn and hand in (as a group) the link to your GitHub repository as an assignment.

We will briefly (before next Monday) look over your GitHub repository and project description to check that everything is fine. If we have any questions/concerns we will contact you.

"},{"location":"pages/projects/#day-2","title":"Day 2","text":"

The goal for today is simply to continue working on your project. Start with bullet points in the checklist from week 1 and continue with bullet points for week 2.

"},{"location":"pages/projects/#day-3","title":"Day 3","text":"

Continue working on your project; today you should hopefully focus on the bullet points in the checklist from week 2. There is no delivery for this week, but make sure that you have committed all your progress at the end of the day. We will again briefly look over the repositories and will reach out to your group if we are worried about the progression of your project.

"},{"location":"pages/projects/#day-4","title":"Day 4","text":"

We have now entered the final week of the course and the second-to-last project day. You are most likely continuing with bullet points from week 2, but should hopefully begin to look at the bullet points from week 3 today. These are in general much more complex, so we recommend waiting with them until you have completed most of the points from week 2. We also recommend that you begin to fill out the report template.

"},{"location":"pages/projects/#day-5","title":"Day 5","text":"

Today you are finishing your project. We recommend that you start by creating an architectural overview of your project similar to this figure. I recommend using draw.io for creating this kind of diagram, but feel free to use any tool you like. Otherwise, you should just continue working on your project, checking off as many bullet points as possible. Finally, you should also prepare yourself for the exam tomorrow.

"},{"location":"pages/projects/#project-hints","title":"Project hints","text":"

Below are some hints to prevent you from getting stuck during the project work on problems that previous groups have encountered.

Data

Modelling

Deployment

"},{"location":"pages/projects/#project-checklist","title":"Project checklist","text":"

Please note that the lists are exhaustive, meaning that I do not expect you to have completed every point on the checklist for the exam.

"},{"location":"pages/projects/#week-1","title":"Week 1","text":""},{"location":"pages/projects/#week-2","title":"Week 2","text":""},{"location":"pages/projects/#week-3","title":"Week 3","text":""},{"location":"pages/projects/#additional","title":"Additional","text":""},{"location":"pages/projects/#exam","title":"Exam","text":"

From January 2025 the exam only consists of a project report. The report should be handed in at midnight on the final day of the course. For January 2025, this means the 24th. We provide a template folder called reports. As the first task you should copy the folder and all its content to your project repository. Then, your job is to fill out the README.md file which contains the report template. The file itself contains instructions on how to fill it out and on using the included report.py file for validating your work. You hand in the template by simply including it in your project repository. By midnight on the final day of the course, we will automatically scrape the report and use it as the basis for grading you. Therefore, changes after this point are not registered.

"},{"location":"pages/timeplan/","title":"Timeplan","text":"

Slides

The course is organised into exercise days (2/3 of the course) and project days (1/3 of the course).

Exercise days start at 9:00 in the morning with a lecture (usually 30-45 min) that gives some context about at least one of the topics of that day. Additionally, the previous day's exercises may briefly be touched upon. The remainder of the day will be spent solving exercises, either individually or in small groups. For some people the exercises may be fast to do and for others they will take the whole day. We will provide help throughout the day. We will try to answer questions on Slack, but help will be prioritized for students physically on campus.

Project days are intended for project work, and you are therefore responsible for making an agreement with your group on when and where you are going to work. On the first project day there will be a lecture at 9:00 with project information. On other project days we may also start the day with an external lecture, which we highly recommend that you participate in. During each project day we will have office hours where you can ask questions about the project.

Below is an overall timeplan for each day, including the presentation topic of the day and the frameworks that you will be using in the exercises.

Recordings (link to drive folder with mp4 files):

"},{"location":"pages/timeplan/#week-1","title":"Week 1","text":"

In the first week you will be introduced to a number of development practices for organizing and developing code, especially with a focus on making everything reproducible.

Date Day Presentation topic Frameworks Format 6/1/25 Monday Deep learning software\ud83d\udcdd Terminal, Conda, IDE, PyTorch Exercises 7/1/25 Tuesday MLOps: what is it?\ud83d\udcdd Git, CookieCutter, Pep8, DVC Exercises 8/1/25 Wednesday Reproducibility\ud83d\udcdd Docker, Hydra Exercises 9/1/25 Thursday Debugging\ud83d\udcdd Debugger, Profiler, Wandb, Lightning Exercises 10/1/25 Friday Project work\ud83d\udcdd - Projects"},{"location":"pages/timeplan/#week-2","title":"Week 2","text":"

The second week is about automation and the cloud. Automation will help us make sure that our code does not break when we make changes to it. The cloud will help us scale up our applications, and we learn how to use different services to help develop a full machine learning pipeline.

Date Day Presentation topic Frameworks Format 13/1/25 Monday Continuous Integration\ud83d\udcdd Pytest, Github actions, Pre-commit, CML Exercises 14/1/25 Tuesday The Cloud\ud83d\udcdd GCP Engine, Bucket, Artifact registry, Vertex AI Exercises 15/1/25 Wednesday Deployment\ud83d\udcdd FastAPI, Torchserve, GCP Functions, GCP Run Exercises 16/1/25 Thursday No lecture - Projects 17/1/25 Friday Company presentation (TBA) - Projects"},{"location":"pages/timeplan/#week-3","title":"Week 3","text":"

For the final week we look into advanced topics such as monitoring and scaling of applications. Monitoring is especially important for the longevity of the applications that we develop, so that we can deploy them either locally or in the cloud and have the tools to monitor how they behave over time. Scaling of applications is an important topic if we ever want our applications to be used by many people at the same time.

Date Day Presentation topic Frameworks Format 20/1/25 Monday Monitoring\ud83d\udcdd Evidently AI, Prometheus, GCP Monitoring Exercises 21/1/25 Tuesday Scalable applications\ud83d\udcdd PyTorch, Lightning Exercises 22/1/25 Wednesday Company presentation (TBA) - Projects 23/1/25 Thursday No lecture - Projects 24/1/25 Friday No lecture - Projects"},{"location":"reports/","title":"Exam template for 02476 Machine Learning Operations","text":"

This is the report template for the exam. Please only remove the text formatted with three dashes in front and behind, like:

--- question 1 fill here ---

where you instead should add your answers. Any other changes may have unwanted consequences when your report is auto-generated at the end of the course. For questions where you are asked to include images, start by adding the image to the figures subfolder (please only use .png, .jpg or .jpeg) and then add the following code in your answer:

![my_image](figures/<image>.<extension>)\n

In addition to this markdown file, we also provide the report.py script that provides two utility functions:

Running:

python report.py html\n

will generate a .html page of your report. After the deadline for answering this template, we will auto-scrape everything in this reports folder and then use this utility to generate an .html page that will serve as your final hand-in.

Running

python report.py check\n

will check your answers in this template against the constraints listed for each question e.g. is your answer too short, too long, or have you included an image when asked to.

For both functions to work, you must not rename anything. The script has two dependencies that can be installed with

pip install click markdown\n
"},{"location":"reports/#overall-project-checklist","title":"Overall project checklist","text":"

The checklist is exhaustive, which means that it includes everything that you could do on the project that is included in the curriculum of this course. Therefore, we do not at all expect that you have checked all boxes at the end of the project.

"},{"location":"reports/#week-1","title":"Week 1","text":""},{"location":"reports/#week-2","title":"Week 2","text":""},{"location":"reports/#week-3","title":"Week 3","text":""},{"location":"reports/#additional","title":"Additional","text":""},{"location":"reports/#group-information","title":"Group information","text":""},{"location":"reports/#question-1","title":"Question 1","text":"

Enter the group number you signed up on

Answer:

--- question 1 fill here ---

"},{"location":"reports/#question-2","title":"Question 2","text":"

Enter the study number for each member in the group

Example:

sXXXXXX, sXXXXXX, sXXXXXX

Answer:

--- question 2 fill here ---

"},{"location":"reports/#question-3","title":"Question 3","text":"

What framework did you choose to work with and did it help you complete the project?

Recommended answer length: 100-200 words.

Example: We used the third-party framework ... in our project. We used functionality ... and functionality ... from the package to do ... and ... in our project.

Answer:

--- question 3 fill here ---

"},{"location":"reports/#coding-environment","title":"Coding environment","text":"

In the following section we are interested in learning more about your local development environment.

"},{"location":"reports/#question-4","title":"Question 4","text":"

Explain how you managed dependencies in your project. Explain the process a new team member would have to go through to get an exact copy of your environment.

Recommended answer length: 100-200 words

Example: We used ... for managing our dependencies. The list of dependencies was auto-generated using ... . To get a complete copy of our development environment, one would have to run the following commands

Answer:

--- question 4 fill here ---

"},{"location":"reports/#question-5","title":"Question 5","text":"

We expect that you initialized your project using the cookiecutter template. Explain the overall structure of your code. Did you fill out every folder or only a subset?

Recommended answer length: 100-200 words

Example: From the cookiecutter template we have filled out the ... , ... and ... folder. We have removed the ... folder because we did not use any ... in our project. We have added an ... folder that contains ... for running our experiments. Answer:

--- question 5 fill here ---

"},{"location":"reports/#question-6","title":"Question 6","text":"

Did you implement any rules for code quality and format? Additionally, explain in your own words why these concepts matter in larger projects.

Recommended answer length: 50-100 words.

Answer:

--- question 6 fill here ---

"},{"location":"reports/#version-control","title":"Version control","text":"

In the following section we are interested in how version control was used in your project during development to collaborate and increase the quality of your code.

"},{"location":"reports/#question-7","title":"Question 7","text":"

How many tests did you implement and what are they testing in your code?

Recommended answer length: 50-100 words.

Example: In total we have implemented X tests. Primarily we are testing ... and ... as these are the most critical parts of our application, but also ... .

Answer:

--- question 7 fill here ---

"},{"location":"reports/#question-8","title":"Question 8","text":"

What is the total code coverage (in percentage) of your code? If your code had a code coverage of 100% (or close to it), would you still trust it to be error free? Explain your reasoning.

Recommended answer length: 100-200 words.

Example: The total code coverage of our code is X%, which includes all our source code. We are far from 100% coverage of our code and even if we were then...

Answer:

--- question 8 fill here ---

"},{"location":"reports/#question-9","title":"Question 9","text":"

Did your workflow include using branches and pull requests? If yes, explain how. If not, explain how branches and pull requests can help improve version control.

Recommended answer length: 100-200 words.

Example: We made use of both branches and PRs in our project. In our group, each member had a branch that they worked on in addition to the main branch. To merge code we ...

Answer:

--- question 9 fill here ---

"},{"location":"reports/#question-10","title":"Question 10","text":"

Did you use DVC for managing data in your project? If yes, then how did it improve your project to have version control of your data. If no, explain a case where it would be beneficial to have version control of your data.

Recommended answer length: 100-200 words.

Example: We did make use of DVC in the following way: ... . In the end it helped us in ... for controlling ... part of our pipeline

Answer:

--- question 10 fill here ---

"},{"location":"reports/#question-11","title":"Question 11","text":"

Discuss your continuous integration setup. What kind of continuous integration are you running (unit testing, linting, etc.)? Do you test multiple operating systems, Python versions, etc.? Do you make use of caching? Feel free to insert a link to one of your GitHub Actions workflows.

Recommended answer length: 200-300 words.

Example: We have organized our continuous integration into 3 separate files: one for doing ..., one for running ... testing and one for running ... . In particular for our ..., we used ... . An example of a triggered workflow can be seen here:

Answer:

--- question 11 fill here ---

"},{"location":"reports/#running-code-and-tracking-experiments","title":"Running code and tracking experiments","text":"

In the following section we are interested in learning more about the experimental setup for running your code and especially the reproducibility of your experiments.

"},{"location":"reports/#question-12","title":"Question 12","text":"

How did you configure experiments? Did you make use of config files? Explain with coding examples how you would run an experiment.

Recommended answer length: 50-100 words.

Example: We used a simple argparser that worked in the following way: python my_script.py --lr 1e-3 --batch_size 25

Answer:

--- question 12 fill here ---

"},{"location":"reports/#question-13","title":"Question 13","text":"

Reproducibility of experiments is important. Related to the last question, how did you make sure that no information is lost when running experiments and that your experiments are reproducible?

Recommended answer length: 100-200 words.

Example: We made use of config files. Whenever an experiment is run the following happens: ... . To reproduce an experiment one would have to do ...

Answer:

--- question 13 fill here ---

"},{"location":"reports/#question-14","title":"Question 14","text":"

Upload 1 to 3 screenshots that show the experiments that you have done in W&B (or another experiment tracking service of your choice). This may include loss graphs, logged images, hyperparameter sweeps etc. You can take inspiration from this figure. Explain what metrics you are tracking and why they are important.

Recommended answer length: 200-300 words + 1 to 3 screenshots.

Example: As seen in the first image we have tracked ... and ..., which both inform us about ... in our experiments. As seen in the second image we are also tracking ... and ...

Answer:

--- question 14 fill here ---

"},{"location":"reports/#question-15","title":"Question 15","text":"

Docker is an important tool for creating containerized applications. Explain how you used docker in your experiments. Include how you would run your docker images and include a link to one of your docker files.

Recommended answer length: 100-200 words.

Example: For our project we developed several images: one for training, inference and deployment. For example to run the training docker image: docker run trainer:latest lr=1e-3 batch_size=64. Link to docker file:

Answer:

--- question 15 fill here ---

"},{"location":"reports/#question-16","title":"Question 16","text":"

When running into bugs while trying to run your experiments, how did you perform debugging? Additionally, did you try to profile your code or do you think it is already perfect?

Recommended answer length: 100-200 words.

Example: Debugging method was dependent on group member. Some just used ... and others used ... . We did a single profiling run of our main code at some point that showed ...

Answer:

--- question 16 fill here ---

"},{"location":"reports/#working-in-the-cloud","title":"Working in the cloud","text":"

In the following section we would like to know more about your experience when developing in the cloud.

"},{"location":"reports/#question-17","title":"Question 17","text":"

List all the GCP services that you made use of in your project and briefly explain what each service does.

Recommended answer length: 50-200 words.

Example: We used the following two services: Engine and Bucket. Engine is used for... and Bucket is used for...

Answer:

--- question 17 fill here ---

"},{"location":"reports/#question-18","title":"Question 18","text":"

The backbone of GCP is the Compute Engine. Explain how you made use of this service and what type of VMs you used.

Recommended answer length: 100-200 words.

Example: We used the compute engine to run our ... . We used instances with the following hardware: ... and we started them using a custom container: ...

Answer:

--- question 18 fill here ---

"},{"location":"reports/#question-19","title":"Question 19","text":"

Insert 1-2 images of your GCP bucket, such that we can see what data you have stored in it. You can take inspiration from this figure.

Answer:

--- question 19 fill here ---

"},{"location":"reports/#question-20","title":"Question 20","text":"

Upload one image of your GCP artifact registry, such that we can see the different images that you have stored. You can take inspiration from this figure.

Answer:

--- question 20 fill here ---

"},{"location":"reports/#question-21","title":"Question 21","text":"

Upload one image of your GCP cloud build history, so we can see the history of the images that have been built in your project. You can take inspiration from this figure.

Answer:

--- question 21 fill here ---

"},{"location":"reports/#question-22","title":"Question 22","text":"

Did you manage to deploy your model, either locally or in the cloud? If not, describe why not. If yes, describe how, and preferably how you invoke your deployed service.

Recommended answer length: 100-200 words.

Example: For deployment we wrapped our model into an application using ... . We first tried serving the model locally, which worked. Afterwards we deployed it in the cloud, using ... . To invoke the service a user would call curl -X POST -F \"file=@file.json\" <weburl>

Answer:

--- question 22 fill here ---

"},{"location":"reports/#question-23","title":"Question 23","text":"

Did you manage to implement monitoring of your deployed model? If yes, explain how it works. If not, explain how monitoring would help the longevity of your application.

Recommended answer length: 100-200 words.

Example: We did not manage to implement monitoring. We would like to have monitoring implemented such that over time we could measure ... and ... that would inform us about this ... behaviour of our application.

Answer:

--- question 23 fill here ---

"},{"location":"reports/#question-24","title":"Question 24","text":"

How many credits did you end up using during the project and what service was most expensive?

Recommended answer length: 25-100 words.

Example: Group member 1 used ..., Group member 2 used ..., in total ... credits were spent during development. The service costing the most was ... due to ...

Answer:

--- question 24 fill here ---

"},{"location":"reports/#overall-discussion-of-project","title":"Overall discussion of project","text":"

In the following section we would like you to think about the general structure of your project.

"},{"location":"reports/#question-25","title":"Question 25","text":"

Include a figure that describes the overall architecture of your system and what services you make use of. You can take inspiration from this figure. Additionally, in your own words, explain the overall steps in the figure.

Recommended answer length: 200-400 words

Example:

The starting point of the diagram is our local setup, where we integrated ... and ... and ... into our code. Whenever we commit code and push to github, it auto triggers ... and ... . From there the diagram shows ...

Answer:

--- question 25 fill here ---

"},{"location":"reports/#question-26","title":"Question 26","text":"

Discuss the overall struggles of the project. Where did you spend most time and what did you do to overcome these challenges?

Recommended answer length: 200-400 words.

Example: The biggest challenge in the project was using the ... tool to do ... . The reason for this was ...

Answer:

--- question 26 fill here ---

"},{"location":"reports/#question-27","title":"Question 27","text":"

State the individual contributions of each team member. This is required information from DTU, because we need to make sure all members contributed actively to the project

Recommended answer length: 50-200 words.

Example: Student sXXXXXX was in charge of setting up the initial cookiecutter project and developing the docker containers for training our applications. Student sXXXXXX was in charge of training our models in the cloud and deploying them afterwards. All members contributed to code by...

Answer:

--- question 27 fill here ---

"},{"location":"s10_extra/","title":"Extra learning modules","text":"

The modules listed here are not part of the core course but expand on some of the other topics. Some of them may still be under construction and may in the future be moved into other sessions.

"},{"location":"s10_extra/calibration/","title":"Calibration of ML models","text":"

Danger

Module is still under development

"},{"location":"s10_extra/calibration/#methods","title":"Methods","text":""},{"location":"s10_extra/calibration/#exercises","title":"\u2754 Exercises","text":"
  1. Implement a script

  2. Implement temperature scaling (a minimal sketch is shown after this list)

  3. Implement label smoothing

    # y_true is assumed to hold one-hot encoded targets and num_classes the number of classes\nalpha = 0.1  # smoothing factor\nfor i in range(len(y_true)):\n    y_true[i] = (1 - alpha) * y_true[i] + alpha / num_classes\n
  4. Implement mixup

  5. Implement cutmix

  6. Implement the Focal Loss

  7. Implement it in a continuous integration setup
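
As a minimal, hedged sketch of exercise 2 (temperature scaling, not a reference solution), one common approach is to fit a single temperature parameter on held-out validation logits and afterwards divide all new logits by it before the softmax; the snippet below assumes you already have the validation logits and labels as tensors:

import torch\n\ndef fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:\n    # logits and labels are assumed to come from a validation set the model was not trained on\n    temperature = torch.nn.Parameter(torch.ones(1))\n    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)\n\n    def closure():\n        optimizer.zero_grad()\n        loss = torch.nn.functional.cross_entropy(logits / temperature, labels)\n        loss.backward()\n        return loss\n\n    optimizer.step(closure)\n    return temperature.detach()  # divide new logits by this value before applying softmax\n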

"},{"location":"s10_extra/calibration/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":""},{"location":"s10_extra/design/","title":"Designing MLOps pipelines","text":"

Danger

Module is still under development

\"Machine learning engineering is 10% machine learning and 90% engineering.\" - Chip Huyen

We highly recommend that you read the book Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications by Chip Huyen, which gives a fantastic overview of the thought processes that go into designing modern machine learning systems.

"},{"location":"s10_extra/design/#the-stack","title":"The stack","text":"

Have you ever encountered the concept of a full-stack developer? A full-stack developer is a developer who can develop both client and server software or, in more general terms, a developer who can take care of the complete development pipeline.

Below is an image of the massive number of tools that exist under the MLOps umbrella.

"},{"location":"s10_extra/design/#visualizing-the-design","title":"Visualizing the design","text":""},{"location":"s10_extra/documentation/","title":"M32 - Documentation","text":""},{"location":"s10_extra/documentation/#documentation","title":"Documentation","text":"

In today's rapidly evolving software development landscape, effective documentation is a crucial component of any project. The ability to create clear, concise, and user-friendly technical documentation can make a significant difference in the success of your codebase. We have all probably encountered code that we wanted to use, only to abandon it because it was missing the documentation needed to get started with it.

Technical documentation or code documentation can be many things:

and many more. In this module we are going to focus on setting up a very basic documentation system that will automatically help you document the API of your code. For this reason, we recommend that before continuing with this module you have completed module M7 on good coding practices or have similar experience with writing docstrings for Python functions and classes.

There are different systems for writing documentation. In fact there is a lot to choose from:

It is important to note that all of these are static site generators. The word static here refers to the fact that when the content is generated and served on a website, the underlying HTML code does not change. It may contain HTML elements that are dynamic (like video), but the site itself does not change (1).

  1. Good examples of dynamic sites are any social media or news media where new posts, pages etc. are constantly added over time. Good examples of static sites are documentation, blogposts etc.

In this module we are going to look at Mkdocs, which (in my opinion) is one of the easiest systems to get started with because all documentation is written in markdown and the build system is written in Python. As an alternative, you can consider doing the exercises in Sphinx, which is probably the most used documentation system for Python code. Sphinx offers more customization than Mkdocs, so it is generally preferred for larger projects with complex documentation, but for smaller projects Mkdocs is easier to get started with and is sufficient.

Mkdocs by default does not include many features, and for that reason we are going to dive directly into using the Material for MkDocs theme, which provides a lot of nice customization for creating professional static sites. In fact, this whole course is written in mkdocs using the material theme.

"},{"location":"s10_extra/documentation/#mkdocs","title":"Mkdocs","text":"

The core file when using mkdocs is the mkdocs.yaml file, which is the configuration file for the project:

site_name: Documentation of my project\nsite_author: Jane Doe\ndocs_dir: source # (1)!\n\ntheme:\n    language: en\n    name: material # (2)!\n    features: # (3)!\n    - content.code.copy\n    - content.code.annotate\n\nplugins: # (4)!\n    - search\n    - mkdocstrings\n\nnav: # (5)!\n  - Home: index.md\n
  1. This indicates the source directory of our documentation. If the layout of your documentation is a bit different from what is described above, you may need to change this.

  2. The overall theme of your documentation. We recommend the material theme but there are many more to choose from and you can also create your own.

  3. The features section is where features that are supported by your given theme can be enabled. In this example we have enabled the content.code.copy feature, which adds a small copy button to all code blocks, and the content.code.annotate feature, which allows you to add annotations like this box to code blocks.

  4. Plugins add new functionality to your documentation. In this case we have added two plugins that add functionality for searching through our documentation and for automatically adding documentation from docstrings. Remember that some plugins require you to install additional Python packages, so remember to add them to your requirements.txt file.

  5. The nav section is where you define the navigation structure of your documentation. When you add new .md files to the source folder you then need to add them to the nav section.

And that is more or less what you need to get started. In general, if you need help with configuration of your documentation in mkdocs I recommend looking at this page and this page.

"},{"location":"s10_extra/documentation/#exercises","title":"Exercises","text":"

In this set of exercises we assume that you have completed module M6 on code structure and therefore have a repository that at least contains the following:

\u251c\u2500\u2500 pyproject.toml     <- Project configuration file with package metadata\n\u2502\n\u251c\u2500\u2500 docs               <- Documentation folder\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 index.md       <- Homepage for your documentation\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 mkdocs.yaml     <- Configuration file for mkdocs\n\u2502   \u2502\n\u2502   \u2514\u2500\u2500 source/        <- Source directory for documentation files\n\u2502\n\u2514\u2500\u2500 src                <- Source code for use in this project.\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 __init__.py    <- Makes src a Python module\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 models         <- model implementations, training script\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u251c\u2500\u2500 model.py\n\u2502   \u2502   \u251c\u2500\u2500 train_model.py\n...\n

It is not important exactly what is in the src folder for the exercises, but we are going to refer to the above structure in the exercises, so adjust accordingly if you deviate from this. Additionally, we are going to assume that your project code is installed in your environment such that it can be imported as normal Python code.

  1. We are going to need two Python packages to get started: mkdocs and material for mkdocs. Install with

    pip install \"mkdocs-material >= 4.8.0\" # (1)!\n
    1. Since mkdocs is a dependency of mkdocs-material we only need to install the latter.
  2. Run in your terminal (from the docs folder):

    mkdocs serve # (1)!\n
    1. mkdocs serve will automatically rebuild the whole site whenever you save a file inside the docs folder. This is not a problem if you have a fairly small site with not that many pages (or elements), but can take a long time for large sites. Consider running with the --dirty option for only re-building the site for files that have been changed.

    which should render the index.md file as the homepage. You can leave the documentation server running during the remaining exercises.

  3. We are now ready to document the API of our code:

    1. Make sure you have at least one function and one class inside your src module. If you do not, you can for simplicity copy the following module to the src/models/model.py file

      import torch\n\nclass MyNeuralNet(torch.nn.Module):\n    \"\"\"Basic neural network class.\n\n    Args:\n        in_features: number of input features\n        out_features: number of output features\n\n    \"\"\"\n    def __init__(self, in_features: int, out_features: int) -> None:\n        super().__init__()  # initialize the parent torch.nn.Module before registering submodules\n        self.l1 = torch.nn.Linear(in_features, 500)\n        self.l2 = torch.nn.Linear(500, out_features)\n        self.r = torch.nn.ReLU()\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass of the model.\n\n        Args:\n            x: input tensor expected to be of shape [N,in_features]\n\n        Returns:\n            Output tensor with shape [N,out_features]\n\n        \"\"\"\n        return self.l2(self.r(self.l1(x)))\n

      and add the following function to the src/predict_model.py file:

      import torch\n\ndef predict(\n    model: torch.nn.Module,\n    dataloader: torch.utils.data.DataLoader,\n) -> torch.Tensor:\n    \"\"\"Run prediction for a given model and dataloader.\n\n    Args:\n        model: model to use for prediction\n        dataloader: dataloader with batches\n\n    Returns:\n        Tensor of shape [N, d] where N is the number of samples and d is the output dimension of the model\n\n    \"\"\"\n    return torch.cat([model(batch) for batch in dataloader])\n
    2. Add a markdown file to the docs/source folder called my_api.md and add that file to the nav: section in the mkdocs.yaml file.

    3. To that file add the following code:

      # My API\n\n::: src.models.model.MyNeuralNet\n\n::: src.predict_model.predict\n

      The ::: indicator tells mkdocs that it should look for the corresponding function/module and then render it on the given page. Thus, if you have a function/module located in another location change the paths accordingly.

    4. Make sure that the documentation correctly includes your function and module on the given page.

    5. (Optional) Include more functions/modules in your documentation.

  4. (Optional) Look through the documentation for mkdocstrings and try to improve the layout a bit. Especially, the headings, docstrings and signatures could be of interest to adjust.

  5. Finally, try to build a final version of your documentation

    mkdocs build\n

    this should result in a site folder that contains the actual HTML code for documentation.

"},{"location":"s10_extra/documentation/#publish-your-documentation","title":"Publish your documentation","text":"

To publish your documentation you need a place to host your built documentation, i.e. the content of the site folder you built in the last exercise. There are many places to host your documentation, but if you only need a static site and are already hosting your code through GitHub, then a good option is GitHub Pages. GitHub Pages is free to use for your public projects.

Before getting started with this set of exercises you should have completed module M16 on GitHub actions so you already know about workflow files.

"},{"location":"s10_extra/documentation/#exercises_1","title":"Exercises","text":"
  1. Start by adding a new file called deploy_docs.yaml to the .github/workflows folder. Add the following code to that file and save it.

    name: Deploy docs\n\non:\n    push:\n        branches:\n            - main\n\npermissions:\n    contents: write # (1)!\n\njobs:\n  deploy:\n    name: Deploy docs\n    runs-on: ubuntu-latest\n    steps:\n    - name: Checkout code\n      uses: actions/checkout@v4\n      with:\n        fetch-depth: 0\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n        cache: 'pip'\n        cache-dependency-path: setup.py\n\n    - name: Install dependencies\n      run: pip install -r requirements.txt\n\n    - name: Deploy docs\n      run: mkdocs gh-deploy --force\n
    1. It is important to give write permissions to this action because it is not only reading your code but will also push code.

    Before continuing, make sure you understand what the different steps of the workflow do. In particular, we recommend looking at the documentation of the mkdocs gh-deploy command.

  2. Commit and push the file. Check that the action is executed and, if it succeeds, that your built project is pushed to a branch called gh-pages. If the action does not succeed, then figure out what is wrong and fix it!

  3. After confirming that the action is working, you need to configure GitHub to publish the content being built by GitHub Actions. Do the following:

    • Go to the Settings tab and then the Pages subsection
    • In the Source setting choose the Deploy from a branch option
    • In the Branch setting choose the gh-pages branch and /(root) folder and save

    This should start deploying your site to https://<your-username>.github.io/<your-reponame>/. If it does not, you may need to recommit and trigger the GitHub Actions build again.

  4. Make sure your documentation is published and looks as it should.

This ends the module on technical documentation. We cannot stress enough how important it is to write proper documentation for larger projects that need to be maintained over a long time. It is an iterative process, but it is usually best to write documentation alongside the code.

"},{"location":"s10_extra/high_performance_clusters/","title":"M34 - High Performance Clusters","text":""},{"location":"s10_extra/high_performance_clusters/#high-performance-clusters","title":"High Performance Clusters","text":"

As discussed in the intro session on the cloud, cloud providers offer near-infinite compute resources. However, using these resources often comes at a hefty price, and it is therefore important to be aware of another resource many have access to: High Performance Clusters or HPC. HPCs exist all over the world, and many times you already have access to one or can easily get access to one. If you are a university student you most likely have a local HPC that you can access through your institution. Otherwise, there exist public HPC resources that everybody (with a project) can apply for. As an example, in the EU we have the EuroHPC initiative, which currently has 8 different supercomputers and a centralized location for applying for resources that are open to both research projects and start-ups.

Depending on your application, you may have different needs, and it is therefore important to also be aware of the different tiers of HPC. In Europe, HPC centers are often categorized such that Tier 0 are European centers with petaflop or exascale machines, Tier 1 are national supercomputing centers, and Tier 2 are regional centers. The lower the tier, the larger the applications that can be run.

"},{"location":"s10_extra/high_performance_clusters/#cluster-architectures","title":"Cluster architectures","text":"

In very general terms, clusters come as two different kinds of systems: supercomputers and LSF (Load Sharing Facility) systems. A supercomputer (as shown below) is organized into different modules that are separated by network links. When you log in to a supercomputer you will meet the front end, which contains all the software needed to run computations. When you submit a job it will get sent to the backend modules, which in most cases include: general compute modules (CPU), acceleration modules (GPU), a memory module (RAM) and finally a storage module (HDD). Depending on your application you may need one module more than another. For example, in deep learning the acceleration module is important, but in physics simulations the general compute module / storage module is probably more important.

Overview of the Meluxina supercomputer that's part of EuroHPC. Image credit

Alternatively, an LSF system is a network of computers where each computer has its own CPU, GPU, RAM etc., and the individual computers (or nodes) are then connected by a network. The important difference between a supercomputer and an LSF system is how the resources are organized. When comparing the two, it is generally better to run on an LSF system if you are only requesting resources that can be handled by a single node, whereas it is better to run on a supercomputer if you have a resource-intensive application that requires many devices to communicate with each other.

Regardless of the cluster architecture, on the software side of HPC the most important part is what's called the HPC scheduler. Without an HPC scheduler, an HPC cluster would just be a bunch of servers with different jobs interfering with each other. The problem is that when you have a large collection of resources and a large collection of users, you cannot rely on the users running their applications without interfering with each other. An HPC scheduler makes sure that whenever a user requests to run an application, the request is put in a queue, and whenever the resources the application asks for become available, the application gets run.

The biggest batch control systems for doing scheduling on HPC are:

We are going to take a look at how PBS works, as that is what is installed on our local university cluster.

"},{"location":"s10_extra/high_performance_clusters/#exercises","title":"\u2754 Exercises","text":"

Exercise files

The following exercises are focused on local students at DTU who want to use our local HPC resources. That said, the steps in the exercises are fairly general and apply to other types of clusters. For the purpose of this exercise we are going to see how we can run this image classifier script, but feel free to work with whatever application you want.

  1. Start by accessing the cluster. This can either be through ssh in a terminal or, if you want a graphical interface, by installing thinlinc. In general we recommend that DTU students follow the steps here, as the setup depends on whether you are on campus or not.

  2. When you have access to the cluster we are going to start with the setup phase, where we set up the environment necessary for our computations. If you have accessed the cluster through the graphical interface, start by opening a terminal.

    1. Let's start by setting up conda for controlling our dependencies. If you have not already worked with conda, please check out module M2 on package managers and virtual environments. In general you should be able to set up (mini)conda through these two commands:

      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nsh Miniconda3-latest-Linux-x86_64.sh\n
    2. Close the terminal and open a new one for the installation to complete. Type conda in the terminal to check that everything is fine. Go ahead and create a new environment that we can install dependencies in

      conda create -n \"hpc_env\" python=3.10 --no-default-packages\n

      and activate it.

    3. Copy over any files you need. For the image classifier script you need the requirements file and the actual application.

    4. Next, install all the requirements you need. If you want to run the image classifier script you can run this command in the terminal

      pip install -r image_classifier_requirements.txt\n

      using this requirements file.

  3. That's all the setup needed. You will need to go through the creation of an environment and installation of requirements whenever you start a new project (no need to reinstall conda). For the next step we need to look at how to submit jobs on the cluster. We are now ready to submit our first job to the cluster:

    1. Start by checking the statistics for the different clusters. Try the qstat command, which should give an overview of the different clusters, the number of running jobs and the number of pending jobs. On many systems you can also try the much more user-friendly classstat command.

    2. Figure out which queue you want to use. For the sake of the exercises it needs to be one with GPU support. For DTU students, any queue that starts with gpu is GPU accelerated.

    3. Now we are going to develop a bash script for submitting our job. We have provided an example of such a script. Take a careful look, go through each line and make sure you understand it. Afterwards, change it to your needs (queue and student email).

    4. Try to submit the script:

      bsub < jobscript.sh\n

      You can check the status of your job by running the bstat command. Hopefully, the job should go through really quickly. Take a look at the output file; it should be called something like gpu_*.out. Also take a look at the gpu_*.err file. Do both files look as they should?

  4. Let's now try to run our application on the cluster. To do that we need to take care of two things:

    1. First we need to load the correct version of CUDA. A cluster system often contains multiple versions of specific software to suit the needs of all its users, and it is the users who are in charge of loading the correct software during job submission. The only extra software that needs to be loaded for most PyTorch applications is a CUDA module. You can check which modules are available on the cluster with

      module avail\n

      Afterwards, add the correct CUDA version you need to the jobscript.sh file. If you are trying to run the provided image classifier script then the correct version is CUDA/11.7 (can be seen in the requirements file).

      # add to the bottom of the file\nmodule load cuda/11.7\n
    2. We are now ready to add our application. The only thing we need to take care of is telling the system to run it using the Python interpreter that is connected to the hpc_env environment we created in the beginning. Try typing:

      which python\n

      which should give you the full path. Then add to the bottom of the jobscript file:

      ~/miniconda3/envs/hpc_env/bin/python \\\n    image_classifier.py \\\n    --trainer.accelerator 'gpu' --trainer.devices 1  --trainer.max_epochs 5\n

      which will run the image classifier script (change it if you are running something else).

    3. Finally submit the job:

      bsub < jobscript.sh\n

      and check when it is done that it has produced what you expected.

    4. (Optional) If your application supports multiple GPUs, also try that out. You would first need to change the jobscript to request multiple GPUs and additionally you would need to tell your application to run on multiple GPUs. For the image classifier script this can be done by changing the --trainer.devices flag to 2 (or higher).

This ends the module on using HPC systems.

"},{"location":"s10_extra/hyperparameters/","title":"M33 - Hyperparameter optimization","text":""},{"location":"s10_extra/hyperparameters/#hyperparameter-optimization","title":"Hyperparameter optimization","text":"

Outdated module

This module has not been updated for a long time and therefore some functionality of Optuna, which is used in these exercises, may not be covered. If you have completed the module on Weights & Biases then we highly recommend using their sweep functionality instead.

Hyperparameter optimization is not a new idea within machine learning but has seen somewhat of a renaissance with the rise of deep learning. This can mainly be attributed to the following:

However, the problem with doing hyperparameter optimization of deep learning models is that it can take over a week to train a single model. In most cases we therefore cannot do a full grid search over all hyperparameter combinations to get the best model. Instead we have to use some tricks that will help us speed up our search. In these exercises we are going to integrate Optuna into our different models, which will provide the tools for speeding up our search.

It should be noted that for a lot of deep learning models we do not optimize every hyperparameter that is included in the model but instead rely on heuristic guidelines (\"rules of thumb\") based on what seems to work in general, e.g. a learning rate of 0.01 seems to work great with the Adam optimizer. That said, these rules probably only apply to 80% of deep learning models, whereas for the last 20% the recommendations may be suboptimal. Here is a great site that has collected an extensive list of these recommendations, taken from the excellent deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

In practice, I recommend trying to identify (through experimentation) which hyperparameters are important for the performance of your model and then spending your computational budget on optimizing them while setting the rest to a \"recommended value\".

"},{"location":"s10_extra/hyperparameters/#exercises","title":"\u2754 Exercises","text":"

Exercise files

  1. Start by installing optuna: pip install optuna

  2. Initially we will look at the cross_validate.py file. It implements simple K-fold cross-validation of a random forest on the sklearn digits dataset (a subset of MNIST). Look over the script and try to run it.

  3. We will now try to write the same code using Optuna. Please note that the script has a variable OPTUNA=False that you can use to change which part of the code should run. The three main concepts of Optuna are

    • A trial: a single experiment

    • A study: a collection of trials

    • The objective: function to determine how \"good\" a trial is

    Let's start by writing the objective function, which we have already started in the script. For now you do not need to care about the trial argument; just assume that it contains the hyperparameters needed to define your random forest model. The output of the objective function should be a single number that we want to optimize. (HINT: did you remember to do K-fold cross-validation inside your objective function?)

  4. Next, let's focus on the trial. Inside the objective function the trial should be used to suggest which parameters to use next. Take a look at the documentation for trial or at the code examples and figure out how to define the hyperparameters of the model. A rough sketch of how this could look is shown below.
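
    As a reference, here is a minimal sketch of how an objective function with suggested hyperparameters could look for the random forest example. The hyperparameter names and ranges are only assumptions and should be adapted to your own script:

    from sklearn.datasets import load_digits\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import cross_val_score\n\nimport optuna\n\n\ndef objective(trial: optuna.Trial) -> float:\n    \"\"\"Return the mean 5-fold cross-validation accuracy for a sampled random forest.\"\"\"\n    n_estimators = trial.suggest_int(\"n_estimators\", 10, 200)\n    max_depth = trial.suggest_int(\"max_depth\", 2, 32, log=True)\n    x, y = load_digits(return_X_y=True)\n    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)\n    return cross_val_score(model, x, y, cv=5).mean()\n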

  5. Finally lets launch a study. It can be as simple as

    study = optuna.create_study()\nstudy.optimize(objective, n_trials=100)\n

    but let's play around a bit with it:

    1. By default the .optimize method will minimize the objective (by definition, the optimum of an objective function is at its minimum). Is the score your objective function returns something that should be minimized? If not, a simple solution is to put a - in front of the metric. However, look through the documentation for how to change the direction of the optimization. A small example is shown below.
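
      For example, assuming your objective returns an accuracy that should be maximized, the direction can be set when creating the study:

      study = optuna.create_study(direction=\"maximize\")\n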

    2. Optuna will by default do Bayesian optimization when sampling the hyperparameters (using the Tree-structured Parzen Estimator sampler for suggesting new trials). However, since this example is quite simple, we can actually perform a full grid search. How would you do this in Optuna? A possible sketch is shown below.
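
      One possible sketch, assuming the n_estimators and max_depth hyperparameters from the sketch above, is to pass a GridSampler with an explicit search space:

      search_space = {\"n_estimators\": [10, 50, 100, 200], \"max_depth\": [2, 4, 8, 16, 32]}\nstudy = optuna.create_study(direction=\"maximize\", sampler=optuna.samplers.GridSampler(search_space))\nstudy.optimize(objective, n_trials=20)  # 4 x 5 grid points\n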

    3. Compare the performance of a single Optuna run using Bayesian optimization with n_trials=10 against an exhaustive grid search over all hyperparameter combinations. What is the performance/time trade-off between these two solutions?

  6. In addition to doing Bayesian optimization, the other great part about Optuna is that it has native support for pruning unpromising trials. Pruning refers to stopping trials for hyperparameter combinations that do not seem to lead anywhere. You may have a learning rate that is so high that training diverges, or a neural network with so many parameters that it is just overfitting to the training data. This however begs the question: what constitutes an unpromising trial? This is up to you to define based on prior experimentation.

    1. Start by looking at the fashion_trainer.py script. It's a simple classification network for classifying images in the FashionMNIST dataset. Run the script with the default hyperparameters to get a feeling for how the training should progress. Note down the performance on the test set.

    2. Start by defining a validation set and a validation dataloader that we can use for hyperparameter optimization (HINT: use 5-10% of your training data).

    3. Now, adjust the script to use Optuna. The 5 hyperparameters listed in the table below should at least be included in the hyperparameter search. For some we have already defined the search space, but for the remaining ones you need to come up with a good range of values to investigate. When done integrating Optuna, run a small study (n_trials=3) to check that the code is working.

      Hyperparameter and search space:
      • Learning rate: 1e-6 to 1e0
      • Number of output features in the second last layer: ???
      • The amount of dropout to apply: ???
      • Batch size: ???
      • Use batch normalization or not: {True, False}
      • (Optional) Different activation functions: {nn.ReLU, nn.Tanh, nn.RReLU, nn.LeakyReLU, nn.ELU}
    4. If implemented correctly, the number of hyperparameter combinations should be at least 1000, meaning that we not only need Bayesian optimization but probably also need pruning to succeed. Check out the page for built-in pruners in Optuna and implement pruning in the script. I recommend using either the MedianPruner or the PercentilePruner. A rough sketch of how pruning is wired into a training loop is shown below.
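
      As a reference, here is a minimal sketch of how pruning could be hooked into a training loop with the MedianPruner. The train_one_epoch function and the learning rate range are placeholders for your own code and search space:

      import random\n\nimport optuna\n\n\ndef train_one_epoch(lr: float) -> float:\n    \"\"\"Placeholder for your own training + validation code; returns a dummy validation accuracy.\"\"\"\n    return random.random()\n\n\ndef objective(trial: optuna.Trial) -> float:\n    \"\"\"Train for a number of epochs, reporting intermediate results so unpromising trials can be pruned.\"\"\"\n    lr = trial.suggest_float(\"lr\", 1e-6, 1e0, log=True)\n    validation_accuracy = 0.0\n    for epoch in range(10):\n        validation_accuracy = train_one_epoch(lr)\n        trial.report(validation_accuracy, step=epoch)\n        if trial.should_prune():  # the pruner compares against intermediate values of other trials\n            raise optuna.TrialPruned()\n    return validation_accuracy\n\n\nstudy = optuna.create_study(direction=\"maximize\", pruner=optuna.pruners.MedianPruner(n_warmup_steps=2))\nstudy.optimize(objective, n_trials=50)\n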

    5. Re-run the study using pruning with a large number of trials (n_trials>50)

    6. Take a look at this visualization page for ideas on how to visualize the study you just did. Make at least two visualizations of the study and make sure that you understand them.

    7. Pruning is great for spending your computational budget more wisely; however, it comes with a trade-off. What is it, and which hyperparameter should one be especially careful about when using pruning?

    8. Finally, which parameter combination achieved the best performance? What is the test set performance for this set of parameters? Did you improve over the initial set of hyperparameters?

  7. The exercises until now have focused on doing the hyperparameter search sequentially, meaning that we test one set of parameters at a time. It is a fine approach because you can easily let it run for a week without any interaction. However, assuming that you have the computational resources to run in parallel, how do you do that?

    1. To run the hyperparameter search in parallel we need a common database that all experiments can read from and write to. We are going to use the recommended MySQL. You do not have to understand what SQL is to complete this exercise, but it is basically a language (like Python) for managing databases. Install MySQL.

    2. Next we are going to initialize a database that we can read from and write to. For this exercise we are going to focus on a locally stored database, but it could of course also be located in the cloud.

      mysql -u root -e \"CREATE DATABASE IF NOT EXISTS example\"\n

      You can also do this directly in Python when calling the create_study command by also setting the storage and load_if_exists=True arguments, as sketched below.
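
      A minimal sketch in Python, assuming the same database and study name as used in these exercises:

      study = optuna.create_study(\n    study_name=\"distributed-example\", storage=\"mysql://root@localhost/example\", load_if_exists=True\n)\n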

    3. Now we are going to create an Optuna study in our database

      optuna create-study --study-name \"distributed-example\" --storage \"mysql://root@localhost/example\"\n
    4. Change how you initialize the study to read and write to the database. Therefore, instead of doing

      study = optuna.create_study()\n

      then do

      study = optuna.load_study(\n    study_name=\"distributed-example\", storage=\"mysql://root@localhost/example\"\n)\n

      where the study_name and storage should match how the study was created.

    5. For running in parallel, you can either open up an extra terminal and simply launch your script once per open terminal, or you can use the provided parallel_lancher.py script that will launch multiple executions of your script. It should be used as:

      python parallel_lancher.py myscript.py --num_parallel 2\n
    6. Finally, make sure that you can access the results

That's all on how to do hyperparameter optimization in a scalable way. If you feel like it, you can try to apply these techniques to the ongoing corrupted MNIST example, where you are free to choose which hyperparameters you want to use.

"},{"location":"s10_extra/infrastructure_as_code/","title":"Infrastructure as code","text":"

Danger

Module is still under development

"},{"location":"s10_extra/infrastructure_as_code/#infrastructure-as-code-iac","title":"Infrastructure as Code (IaC)","text":"

Infrastructure as Code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. The IT infrastructure managed by this comprises both physical equipment such as bare-metal servers as well as virtual machines and associated configuration resources. The definitions are written in a high-level programming language and can be versioned, and the code can be tested and validated.

"},{"location":"s10_extra/infrastructure_as_code/#terraform","title":"Terraform","text":"

Terraform is an open-source infrastructure-as-code software tool created by HashiCorp. It enables users to define and provision data center infrastructure using a high-level, human-readable configuration language known as HashiCorp Configuration Language (HCL), or optionally JSON. It supports a multitude of cloud providers, including AWS, Azure, Google Cloud, and many others.

"},{"location":"s10_extra/infrastructure_as_code/#installation","title":"Installation","text":"

To install Terraform, download the appropriate package for your operating system from the official Terraform website. Once downloaded, unzip the package and move the binary to a directory included in your system's PATH.

"},{"location":"s10_extra/infrastructure_as_code/#getting-started","title":"Getting started","text":"

To get started with Terraform, you need to create a configuration file. This is a human-readable file that describes the infrastructure and the set of resources to be created. The file is saved with a .tf extension. Here is an example of a simple Terraform configuration file that creates an AWS EC2 instance:

provider \"aws\" {\n  region = \"us-west-2\"\n}\n\nresource \"aws_instance\" \"example\" {\n  ami           = \"ami-0c55b159cbfafe1f0\"\n  instance_type = \"t2.micro\"\n}\n

To create the infrastructure described in the configuration file, navigate to the directory containing the file and run the following commands:

terraform init\nterraform apply\n

The terraform init command is used to initialize a working directory containing Terraform configuration files. This is the first command that should be run after writing a new Terraform configuration or cloning an existing one from version control. The terraform apply command is used to apply the changes required to reach the desired state of the configuration, or the predetermined set of actions from an execution plan generated by terraform plan.

"},{"location":"s10_extra/infrastructure_as_code/#resources","title":"Resources","text":""},{"location":"s10_extra/kubernetes/","title":"Kubernetes","text":"

Danger

Module is still under development

"},{"location":"s10_extra/kubernetes/#kubernetes","title":"Kubernetes","text":"

Kubernetes, also known as K8s, is an open-source platform designed to automate the deployment, scaling, and operation of application containers across clusters of hosts. It provides the framework to run distributed systems resiliently, handling scaling and failover for your applications, providing deployment patterns, and more.

"},{"location":"s10_extra/kubernetes/#what-is-kubernetes","title":"What is Kubernetes?","text":""},{"location":"s10_extra/kubernetes/#brief-history","title":"Brief History","text":"

Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation.

"},{"location":"s10_extra/kubernetes/#core-functions","title":"Core Functions","text":"

Kubernetes makes it easier to deploy and manage containerized applications at scale.

"},{"location":"s10_extra/kubernetes/#key-concepts","title":"Key Concepts","text":""},{"location":"s10_extra/kubernetes/#kubernetes-architecture","title":"Kubernetes Architecture","text":"

Kubernetes follows a client-server architecture. At a high level, it consists of a Control Plane (master) and Nodes (workers).

Overview of Kubernetes Architecture. Image Credit: Kubernetes Official Documentation"},{"location":"s10_extra/kubernetes/#control-plane-components","title":"Control Plane Components","text":""},{"location":"s10_extra/kubernetes/#node-components","title":"Node Components","text":""},{"location":"s10_extra/kubernetes/#minikube-local-kubernetes-environment","title":"Minikube: Local Kubernetes Environment","text":"

Minikube is a tool that allows you to run Kubernetes locally. It runs a single-node Kubernetes cluster inside a VM on your laptop for users looking to try out Kubernetes or develop with it day-to-day.

"},{"location":"s10_extra/kubernetes/#installing-minikube","title":"Installing Minikube","text":"
  1. System Requirements: Ensure your system meets the minimum requirements.
  2. Download and Install: Visit Minikube's official installation guide.
  3. Start Minikube: Run minikube start.
"},{"location":"s10_extra/kubernetes/#exercises","title":"\u2754 Exercises","text":"
  1. Install Minikube following the steps above.
  2. Validate the installation by typing minikube in a terminal.
  3. Ensure that kubectl, the command-line tool for Kubernetes, is correctly installed by typing kubectl in a terminal.
"},{"location":"s10_extra/kubernetes/#yatai-model-serving-platform-for-kubernetes","title":"Yatai: Model Serving Platform for Kubernetes","text":"

Yatai is a model serving platform, making it easier to deploy machine learning models in Kubernetes environments.

"},{"location":"s10_extra/kubernetes/#what-is-yatai","title":"What is Yatai?","text":"

Yatai simplifies the deployment, management, and scaling of machine learning models in Kubernetes.

"},{"location":"s10_extra/kubernetes/#getting-started-with-yatai","title":"Getting Started with Yatai","text":"
  1. Installation: Steps to install Yatai in your Kubernetes cluster.
  2. Basic Usage: How to deploy your first model using Yatai.
"},{"location":"s10_extra/kubernetes/#additional-resources","title":"Additional Resources","text":""},{"location":"s10_extra/orchestration/","title":"Orchestration","text":"

Danger

Module is still under development

"},{"location":"s10_extra/orchestration/#workflow-orchestration","title":"Workflow orchestration","text":""},{"location":"s10_extra/orchestration/#prefect","title":"Prefect","text":"

If you give an MLOps engineer a job

pip install prefect\n
from prefect import flow, task\n
"},{"location":"s10_extra/orchestration/#exercises","title":"\u2754 Exercises","text":"
  1. Start by installing prefect:

    pip install prefect\n
  2. Start a local Prefect server instance in your virtual environment.

    prefect server start\n
  3. The great thing about Prefect is that orchestration tasks and flows are written in pure Python. A small sketch of what this looks like is shown below.
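
    As a small illustration (a sketch using the Prefect 2-style decorators; the task and flow names are just placeholders), a flow chaining two tasks could look like this:

    from prefect import flow, task\n\n\n@task\ndef get_data() -> list[int]:\n    \"\"\"Produce some dummy data.\"\"\"\n    return [1, 2, 3]\n\n\n@task\ndef add_up(data: list[int]) -> int:\n    \"\"\"Sum the data.\"\"\"\n    return sum(data)\n\n\n@flow\ndef my_pipeline() -> int:\n    \"\"\"A tiny flow that chains the two tasks.\"\"\"\n    data = get_data()\n    return add_up(data)\n\n\nif __name__ == \"__main__\":\n    print(my_pipeline())\n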

"},{"location":"s10_extra/quantization/","title":"Quantization","text":""},{"location":"s10_extra/quantization/#quantization","title":"Quantization","text":"

Danger

Module is still under development

"},{"location":"s10_extra/quantization/#exercises","title":"\u2754 Exercises","text":"

In these exercises we are going to look at two different kinds of quantization strategies: quantization-aware training and post-training quantization. As the names suggest, the quantization is either applied during training or after training. There are good reasons for doing both:
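
As a small illustration of the post-training flavour, here is a minimal sketch using PyTorch's dynamic quantization on a toy model (the model itself is just a placeholder):

import torch\nfrom torch import nn\n\n# a toy model standing in for whatever model you want to quantize\nmodel = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))\n\n# dynamic post-training quantization: weights of the listed layer types are converted to int8\nquantized_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)\n\nprint(quantized_model)\n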

"},{"location":"s10_extra/quantization/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":""},{"location":"s1_development_environment/","title":"Setting up a development environment","text":"

Slides

Today we start our journey into the world of machine learning operations (MLOps). However, before we can get started, we need to make sure that you have a basic understanding of a couple of topics, as we will be using these throughout the course. In particular, today is all about getting set up with a proper development environment that can support your journey. Most of you probably already have experience with these topics, and it will be mostly repetition.

The reason we are starting here is that many students are missing very basic skills that are never taught but are just expected to be picked up on your own. This session will only cover the most basic skills to get you started on your development journey. If you wish to learn more about basic computer science skills in general, we highly recommend that you check out The Missing Semester of Your CS Education course from MIT.

Learning objectives

The learning objectives of this session are:

"},{"location":"s1_development_environment/command_line/","title":"M1 - The command line","text":""},{"location":"s1_development_environment/command_line/#the-command-line","title":"The command line","text":"

Core Module

Image credit

Contrary to popular belief, the command line (also commonly known as the terminal) is not a mythical being that has existed since the dawn of time. Instead, it was created at a time when it was not given that your computer had a graphical interface that you could interact with. Think of it as a text interface to your computer.

The terminal is a well-known concept to users of Linux; however, Mac and (especially) Windows users often do not need it and therefore never encounter it. Having a basic understanding of how to use a command line can help improve your workflow. The reason the command line is an important tool to get to know is that doing any kind of MLOps requires us to interact with many different tools, many of which do not have a graphical interface. Additionally, when we get to working in the cloud later in the course, you will be forced to interact with the command line.

Note if you already are a terminal wizard then feel free to skip the exercises below. They are very elementary.

"},{"location":"s1_development_environment/command_line/#the-anatomy-of-the-command-line","title":"The anatomy of the command line","text":"

Regardless of the operating system, all command lines look more or less the same:

As already stated, it is essentially just a big text interface to interact with your computer. As the image illustrates, when trying to execute a command, there are several parts to it:

  1. The prompt is the part where you type your commands. It usually contains the name of the current directory you are in, followed by some kind of sign: $, >, : are the usual ones. It can also contain other information, such as in the case of the above image which also shows the current conda environment.
  2. The command is the actual command you want to execute. For example, ls or cd.
  3. The options are additional arguments that you can pass to the command. For example, ls -l or cd ...
  4. The arguments are the actual arguments that you pass to the command. For example, ls -l figures or cd ...

The core difference between options and arguments is that options are optional, while arguments are not.

Image credit"},{"location":"s1_development_environment/command_line/#exercises","title":"\u2754 Exercises","text":"

We have put a cheat sheet in the exercise files folder belonging to this session, that gives a quick overview of the different commands that can be executed in the command line.

Windows users

We highly recommend that you install Windows Subsystem for Linux (WSL). This will install a full Linux system on your Windows machine. Please follow this guide. Remember to run commands from an elevated (as administrator) Windows Command Prompt. You can in general complete all exercises in the course from a normal Windows Command Prompt, but some are easier to do if you run from WSL.

If you decide to run in WSL, you need to remember that you now have two different systems, and installing a package on one system does not mean that it is installed on the other. For example, if you install pip in WSL, you need to install it again in Windows if you want to use it there.

If you decide to not run in WSL, please always work in a Windows Command Prompt and not Powershell.

  1. Start by opening a terminal.

  2. To navigate inside a terminal, we rely on the cd command and pwd command. Make sure you know how to go back and forth in your file system. (1)

    1. Your terminal should support tab-completion which can help finish commands for you!
  3. The ls command is important when we want to know the content of a folder. Try to use the command, and also try it with the additional option -l. What does it show?

  4. Make sure to familiarize yourself with the which, echo, cat, wget, less, and top commands. Also, familiarize yourself with the > operator. You are probably going to use some of them throughout the course or in your future career. For Windows users, these commands may be named something else, e.g., where command on Windows corresponds to which.

  5. It is also important that you know how to edit a file through the terminal. Most systems should have the nano editor installed; otherwise, try to figure out which editor is installed on your system.

    1. Type nano in the terminal.

    2. Write the following text in the script

      if __name__ == \"__main__\":\n    print(\"Hello world!\")\n
    3. Save the script and try to execute it.

    4. Afterward, try to edit the file through the terminal (change Hello world to something else).

  6. All terminals come with a scripting language. The most common one is called bash, and being able to write simple bash programs can come in handy. For example, if you want to execute multiple Python programs sequentially, this can be done through a bash script.

    Windows users

    Bash is not part of Windows, so you need to run this part through WSL. If you did not install WSL, you can skip this part or as an alternative do the exercises in Powershell which is the native Windows scripting language (not recommended).

    1. Write a bash script (in nano) and try executing it:

      #!/bin/bash\n# A sample Bash script, by Ryan\necho Hello World!\n
    2. Change the bash script to call the Python program you just wrote.

    3. Try to Google how to write a simple for-loop that executes the Python script 10 times in a row.

  7. A trick you may need throughout this course is setting environment variables. An environment variable is just a dynamically named value that may alter the way running processes behave on a computer. The syntax for setting an environment variable depends on your operating system:

    WindowsLinux/Mac
    set MY_VAR=hello\necho %MY_VAR%\n
    export MY_VAR=hello\necho $MY_VAR\n
    1. Try to set an environment variable and print it out.

    2. To use an environment variable in a Python program, you can use the os.environ mapping from the os module. Write a Python program that prints out the environment variable you just set; a minimal example is shown below.
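
      For example, a minimal script could look like this (assuming you set MY_VAR as above):

      import os\n\nprint(os.environ[\"MY_VAR\"])  # raises a KeyError if the variable is not set\nprint(os.environ.get(\"MY_VAR\", \"not set\"))  # alternatively, provide a default value\n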

    3. If you have a collection of environment variables, these can be stored in a file called .env. The file is formatted as follows:

      MY_VAR=hello\nMY_OTHER_VAR=world\n

      To load the environment variables from the file, you can use the python-dotenv package. Install it with pip install python-dotenv and then try to load the environment variables from the file and print them out.

      from dotenv import load_dotenv\nload_dotenv()\nimport os\nprint(os.environ[\"MY_VAR\"])\n
"},{"location":"s1_development_environment/command_line/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
  1. Here is one command from later in the course when we are going to work in the cloud

    gcloud compute instances create-with-container instance-1 \\\n    --container-image=gcr.io/<project-id>/gcp_vm_tester \\\n    --zone=europe-west1-b\n

    Identify the command, options, and arguments.

    Solution
    • The command is gcloud compute instances create-with-container.
    • The options are --container-image=gcr.io/<project-id>/gcp_vm_tester and --zone=europe-west1-b.
    • The arguments are instance-1.

    The tricky part of this example is that commands can have subcommands, which are also commands. In this case, compute is a subcommand to gcloud, instances is a subcommand to compute, and create-with-container is a subcommand to instances.

  2. Two common options that nearly all commands have are -h and -V. What does each of them do?

    Solution

    The -h (or --help) option prints the help message for the command, including subcommands and arguments. Try it out by executing python -h. The -V (or --version) option prints the version of the installed program. Try it out by executing python --version.

This ends the module on the command line. If you are still not comfortable working with the command line, fear not as we are going to use it extensively throughout the course. If you want to spend additional time on this topic, we highly recommend that you watch this video on how to use the command line.

If you are interested in personalizing your command line, you can check out the starship project, which allows you to customize your command line with a lot of different options.

"},{"location":"s1_development_environment/deep_learning_software/","title":"M4 - Deep Learning Software","text":""},{"location":"s1_development_environment/deep_learning_software/#deep-learning-software","title":"Deep Learning Software","text":"

Core Module

Deep learning has, since its revolution back in 2012, transformed our lives. From Google Translate to driverless cars to personal assistants to protein engineering, deep learning is transforming nearly every sector of our economy and our lives. However, it did not take long before people realized that deep learning is not a simple beast to tame and it comes with its own kinds of problems, especially if you want to use it in a production setting. In particular, the concept of technical debt was invented to indicate the significant maintenance costs at a system level that it takes to run machine learning in production. MLOps should very much be seen as the response to the concept of technical debt, namely that we should develop methods, processes, and tools (with inspiration from classical DevOps) to counter the problems we run into when working with deep learning models.

It is important to note that all the concepts and tools that have been developed for MLOps can be used together with more classical machine learning models (think K-nearest neighbor, Random forest, etc.), however, deep learning comes with its own set of problems which mostly have to do with the sheer size of the data and models we are working with. For these reasons, we are focusing on working with deep learning models in this course.

"},{"location":"s1_development_environment/deep_learning_software/#software-landscape-for-deep-learning","title":"Software Landscape for Deep Learning","text":"

Regarding software for Deep Learning, the landscape is currently dominated by three software frameworks (listed in order of when they were published):

We won't go into a longer discussion on which framework is best, as it is pointless. PyTorch and TensorFlow have been around for the longest and therefore have bigger communities and feature sets at this point in time. They are both very similar in the sense that they both have features directed at research and production. JAX is kind of the new kid on the block, which in many ways improves on PyTorch and TensorFlow, but is still not as mature as the other frameworks. As the frameworks use different kinds of programming principles (object-oriented vs. functional programming), comparing them is essentially meaningless.

In this course, we have chosen to work with PyTorch because we find it a bit more intuitive and it is the framework that we use for our day-to-day research life. Additionally, as of right now, it is absolutely the dominating framework for published models, research papers, and competition winners.

The intention behind this set of exercises is to bring everyone's PyTorch skills up to date. If you already are a PyTorch-Jedi, feel free to pass the first set of exercises, but I recommend that you still complete it. The exercises are, in large part, taken directly from the deep learning course at Udacity. Note that these exercises are given as notebooks, which is the last time we are going to use them actively in the course. Instead, after this set of exercises, we are going to focus on writing code in Python scripts.

The notebooks contain a lot of explanatory text. The exercises that you are supposed to fill out are inlined in the text in small \"exercise\" blocks:

If you need a refresher on any deep learning topic in general throughout the course, we recommend finding the relevant chapter in the deep learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (which can also be found in the literature folder). It is not necessary to be good at deep learning to pass this course as the focus is on all the software needed to get deep learning models into production. However, it's important to have a basic understanding of the concepts.

"},{"location":"s1_development_environment/deep_learning_software/#exercises","title":"\u2754 Exercises","text":"

Exercise files

  1. Start a Jupyter Notebook session in your terminal (assuming you are standing at the root of the course material). Alternatively, you should be able to open the notebooks directly in your code editor. For VS code users you can read more about how to work with Jupyter Notebooks in VS code here

  2. Complete the Tensors in PyTorch notebook. It focuses on the basic manipulation of PyTorch tensors. You can pass this notebook if you are comfortable doing this.

  3. Complete the Neural Networks in PyTorch notebook. It focuses on building a very simple neural network using the PyTorch nn.Module interface.

  4. Complete the Training Neural Networks notebook. It focuses on how to write a simple training loop for training a neural network.

  5. Complete the Fashion MNIST notebook, which summarizes concepts learned in notebooks 2 and 3 on building a neural network for classifying the Fashion MNIST dataset.

  6. Complete the Inference and Validation notebook. This notebook adds important concepts on how to do inference and validation on our neural network.

  7. Complete the Saving_and_Loading_Models notebook. This notebook addresses how to save and load model weights. This is important if you want to share a model with someone else.

"},{"location":"s1_development_environment/deep_learning_software/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
  1. If tensor a has shape [N, d] and tensor b has shape [M, d] how can we calculate the pairwise distance between rows in a and b without using a for loop?

    Solution

    We can take advantage of broadcasting to do this

    a = torch.randn(N, d)\nb = torch.randn(M, d)\ndist = torch.sum((a.unsqueeze(1) - b.unsqueeze(0))**2, dim=2)  # shape [N, M]\n
  2. What should be the size of S for an input image of size 1x28x28, and how many parameters does the neural network then have?

    from torch import nn\nneural_net = nn.Sequential(\n    nn.Conv2d(1, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3), nn.ReLU(), nn.Flatten(), nn.Linear(S, 10)\n)\n
    Solution

    Since both convolutions have a kernel size of 3, stride 1 (default value) and no padding that means that we lose 2 pixels in each dimension, because the kernel can not be centered on the edge pixels. Therefore, the output of the first convolution would be 32x26x26. The output of the second convolution would be 64x24x24. The size of S must therefore be 64 * 24 * 24 = 36864. The number of parameters in a convolutional layer is kernel_size * kernel_size * in_channels * out_channels + out_channels (last term is the bias) and the number of parameters in a linear layer is in_features * out_features + out_features (last term is the bias). Therefore, the total number of parameters in the network is 3*3*1*32 + 32 + 3*3*32*64 + 64 + 36864*10 + 10 = 387,466, which could be calculated by running:

    sum([prod(p.shape) for p in neural_net.parameters()])\n
  3. A working training loop in PyTorch should have these three function calls: optimizer.zero_grad(), loss.backward(), optimizer.step(). Explain what would happen in the training loop (or implement it) if you forgot each of the function calls.

    Solution

    optimizer.zero_grad() is in charge of zeroing the gradients. If this is not done, gradients would accumulate over the steps, leading to exploding gradients. loss.backward() is in charge of calculating the gradients. If this is not done, the gradients will not be calculated and the optimizer will not be able to update the weights. optimizer.step() is in charge of updating the weights. If this is not done, the weights will not be updated and the model will not learn anything.

"},{"location":"s1_development_environment/deep_learning_software/#final-exercise","title":"Final exercise","text":"

As the final exercise, we will develop a simple baseline model that we will continue to develop during the course. For this exercise, we provide the data in the data/corruptmnist folder. Do NOT use the data in the corruptmnist_v2 folder as that is intended for another exercise. As the name suggests, this is a (subsampled) corrupted version of the regular MNIST. Your overall task is the following:

Implement an MNIST neural network that achieves at least 85% accuracy on the test set.

Before any training can start, you should identify the corruption that we have applied to the MNIST dataset to create the corrupted version. This can help you identify what kind of neural network to use to get good performance, but any network should be able to achieve this.

One key point of this course is trying to stay organized. Spending time now organizing your code will save time in the future as you start to add more and more features. As subgoals, please fulfill the following exercises:

  1. Implement your model in a script called model.py.

    Starting point for model.py model.py
    from torch import nn\n\n\nclass MyAwesomeModel(nn.Module):\n    \"\"\"My awesome model.\"\"\"\n\n    def __init__(self) -> None:\n        super().__init__()\n        self.fc1 = nn.Linear(784, 128)\n
    Solution

    The provided solution implements a convolutional neural network with 3 convolutional layers and a single fully connected layer. Because the MNIST dataset consists of images, we want an architecture that can take advantage of the spatial information in the images.

    model.py
    import torch\nfrom torch import nn\n\n\nclass MyAwesomeModel(nn.Module):\n    \"\"\"My awesome model.\"\"\"\n\n    def __init__(self) -> None:\n        super().__init__()\n        self.conv1 = nn.Conv2d(1, 32, 3, 1)\n        self.conv2 = nn.Conv2d(32, 64, 3, 1)\n        self.conv3 = nn.Conv2d(64, 128, 3, 1)\n        self.dropout = nn.Dropout(0.5)\n        self.fc1 = nn.Linear(128, 10)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass.\"\"\"\n        x = torch.relu(self.conv1(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.relu(self.conv2(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.relu(self.conv3(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.flatten(x, 1)\n        x = self.dropout(x)\n        return self.fc1(x)\n\n\nif __name__ == \"__main__\":\n    model = MyAwesomeModel()\n    print(f\"Model architecture: {model}\")\n    print(f\"Number of parameters: {sum(p.numel() for p in model.parameters())}\")\n\n    dummy_input = torch.randn(1, 1, 28, 28)\n    output = model(dummy_input)\n    print(f\"Output shape: {output.shape}\")\n
  2. Implement your data setup in a script called data.py. The data was saved using torch.save, so to load it you should use torch.load.

    Saving the model

    When saving the model, you should use torch.save(model.state_dict(), \"model.pt\"), and when loading the model, you should use model.load_state_dict(torch.load(\"model.pt\")). If you do torch.save(model, \"model.pt\"), this can lead to problems when loading the model later on, as it will try to not only save the model weights but also the model definition. This can lead to problems if you change the model definition later on (which you most likely are going to do).

    Starting point for data.py data.py
    import torch\n\n\ndef corrupt_mnist():\n    \"\"\"Return train and test dataloaders for corrupt MNIST.\"\"\"\n    # exchange with the corrupted mnist dataset\n    train = torch.randn(50000, 784)\n    test = torch.randn(10000, 784)\n    return train, test\n
    Solution

    Data is stored in .pt files which can be loaded using torch.load (1). We iterate over the files, load them and concatenate them into a single tensor. In particular, we have highlighted the use of .unsqueeze function. Convolutional neural networks (which we propose as a solution) need the data to be in the shape [N, C, H, W] where N is the number of samples, C is the number of channels, H is the height of the image and W is the width of the image. The dataset is stored in the shape [N, H, W] and therefore we need to add a channel.

    1. The .pt files are nothing else than a .pickle file in disguise. The torch.save/torch.load function is essentially a wrapper around the pickle module in Python, which produces serialized files. However, it is convention to use .pt to indicate that the file contains PyTorch tensors.

    We have additionally in the solution added functionality for plotting the images together with the labels for inspection. Remember: all good machine learning starts with a good understanding of the data.

    data.py
    from __future__ import annotations\n\nimport matplotlib.pyplot as plt  # only needed for plotting\nimport torch\nfrom mpl_toolkits.axes_grid1 import ImageGrid  # only needed for plotting\n\nDATA_PATH = \"data/corruptmnist\"\n\n\ndef corrupt_mnist() -> tuple[torch.utils.data.Dataset, torch.utils.data.Dataset]:\n    \"\"\"Return train and test dataloaders for corrupt MNIST.\"\"\"\n    train_images, train_target = [], []\n    for i in range(5):\n        train_images.append(torch.load(f\"{DATA_PATH}/train_images_{i}.pt\"))\n        train_target.append(torch.load(f\"{DATA_PATH}/train_target_{i}.pt\"))\n    train_images = torch.cat(train_images)\n    train_target = torch.cat(train_target)\n\n    test_images = torch.load(f\"{DATA_PATH}/test_images.pt\")\n    test_target = torch.load(f\"{DATA_PATH}/test_target.pt\")\n\n    train_images = train_images.unsqueeze(1).float()\n    test_images = test_images.unsqueeze(1).float()\n    train_target = train_target.long()\n    test_target = test_target.long()\n\n    train_set = torch.utils.data.TensorDataset(train_images, train_target)\n    test_set = torch.utils.data.TensorDataset(test_images, test_target)\n\n    return train_set, test_set\n\n\ndef show_image_and_target(images: torch.Tensor, target: torch.Tensor) -> None:\n    \"\"\"Plot images and their labels in a grid.\"\"\"\n    row_col = int(len(images) ** 0.5)\n    fig = plt.figure(figsize=(10.0, 10.0))\n    grid = ImageGrid(fig, 111, nrows_ncols=(row_col, row_col), axes_pad=0.3)\n    for ax, im, label in zip(grid, images, target):\n        ax.imshow(im.squeeze(), cmap=\"gray\")\n        ax.set_title(f\"Label: {label.item()}\")\n        ax.axis(\"off\")\n    plt.show()\n\n\nif __name__ == \"__main__\":\n    train_set, test_set = corrupt_mnist()\n    print(f\"Size of training set: {len(train_set)}\")\n    print(f\"Size of test set: {len(test_set)}\")\n    print(f\"Shape of a training point {(train_set[0][0].shape, train_set[0][1].shape)}\")\n    print(f\"Shape of a test point {(test_set[0][0].shape, test_set[0][1].shape)}\")\n    show_image_and_target(train_set.tensors[0][:25], train_set.tensors[1][:25])\n
  3. Implement training and evaluation of your model in main.py script. The main.py script should be able to take additional subcommands indicating if the model should be trained or evaluated. It will look something like this:

    python main.py train --lr 1e-4\npython main.py evaluate trained_model.pt\n

    which can be implemented in various ways. We provide you with a starting script that uses the click library to define a command line interface (CLI), which you can learn more about in this module.

    VS code and command line arguments

    If you try to execute the above code in VS code using the debugger (F5) or the build run functionality in the upper right corner:

    you will get an error message saying that you need to select a command to run e.g. main.py either needs the train or evaluate command. This can be fixed by adding a launch.json to a specialized .vscode folder in the root of the project. The launch.json file should look something like this:

    {\n    \"version\": \"0.2.0\",\n    \"configurations\": [\n        {\n            \"name\": \"Python: Current File\",\n            \"type\": \"python\",\n            \"request\": \"launch\",\n            \"program\": \"${file}\",\n            \"args\": [\n                \"train\",\n                \"--lr\",\n                \"1e-4\"\n            ],\n            \"console\": \"integratedTerminal\",\n            \"justMyCode\": true\n        }\n    ]\n}\n

    This will inform VS Code that when we execute the current file (in this case main.py), we want to run it with the train command and additionally pass the --lr argument with the value 1e-4. You can read more about creating a launch.json file here. If you want to have multiple configurations, you can add them to the configurations list as additional dictionaries.

    Starting point for main.py main.py
    import click\nimport torch\nfrom data_solution import corrupt_mnist\nfrom model import MyAwesomeModel\n\n\n@click.group()\ndef cli() -> None:\n    \"\"\"Command line interface.\"\"\"\n\n\n@click.command()\n@click.option(\"--lr\", default=1e-3, help=\"learning rate to use for training\")\ndef train(lr) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(lr)\n\n    # TODO: Implement training loop here\n    model = MyAwesomeModel()\n    train_set, _ = corrupt_mnist()\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\ndef evaluate(model_checkpoint) -> None:\n    \"\"\"Evaluate a trained model.\"\"\"\n    print(\"Evaluating like my life depends on it\")\n    print(model_checkpoint)\n\n    # TODO: Implement evaluation logic here\n    model = torch.load(model_checkpoint)\n    _, test_set = corrupt_mnist()\n\n\ncli.add_command(train)\ncli.add_command(evaluate)\n\n\nif __name__ == \"__main__\":\n    cli()\n
    Solution

    The solution implements a simple training loop and evaluation loop. Furthermore, we have added additional hyperparameters that can be passed to the training loop. Highlighted in the solution are the different lines where we take care that our model and data are moved to GPU (or Apple MPS accelerator if you have a newer Mac) if available.

    main.py
    import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom model import MyAwesomeModel\n\nfrom data import corrupt_mnist\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.group()\ndef cli() -> None:\n    \"\"\"Command line interface.\"\"\"\n\n\n@click.command()\n@click.option(\"--lr\", default=1e-3, help=\"learning rate to use for training\")\n@click.option(\"--batch_size\", default=32, help=\"batch size to use for training\")\n@click.option(\"--epochs\", default=10, help=\"number of epochs to train for\")\ndef train(lr, batch_size, epochs) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(f\"{lr=}, {batch_size=}, {epochs=}\")\n\n    model = MyAwesomeModel().to(DEVICE)\n    train_set, _ = corrupt_mnist()\n\n    train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n    loss_fn = torch.nn.CrossEntropyLoss()\n    optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n    statistics = {\"train_loss\": [], \"train_accuracy\": []}\n    for epoch in range(epochs):\n        model.train()\n        for i, (img, target) in enumerate(train_dataloader):\n            img, target = img.to(DEVICE), target.to(DEVICE)\n            optimizer.zero_grad()\n            y_pred = model(img)\n            loss = loss_fn(y_pred, target)\n            loss.backward()\n            optimizer.step()\n            statistics[\"train_loss\"].append(loss.item())\n\n            accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n            statistics[\"train_accuracy\"].append(accuracy)\n\n            if i % 100 == 0:\n                print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n    print(\"Training complete\")\n    torch.save(model.state_dict(), \"model.pth\")\n    fig, axs = plt.subplots(1, 2, figsize=(15, 5))\n    axs[0].plot(statistics[\"train_loss\"])\n    axs[0].set_title(\"Train loss\")\n    axs[1].plot(statistics[\"train_accuracy\"])\n    axs[1].set_title(\"Train accuracy\")\n    fig.savefig(\"training_statistics.png\")\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\ndef evaluate(model_checkpoint) -> None:\n    \"\"\"Evaluate a trained model.\"\"\"\n    print(\"Evaluating like my life depended on it\")\n    print(model_checkpoint)\n\n    model = MyAwesomeModel().to(DEVICE)\n    model.load_state_dict(torch.load(model_checkpoint))\n\n    _, test_set = corrupt_mnist()\n    test_dataloader = torch.utils.data.DataLoader(test_set, batch_size=32)\n\n    model.eval()\n    correct, total = 0, 0\n    for img, target in test_dataloader:\n        img, target = img.to(DEVICE), target.to(DEVICE)\n        y_pred = model(img)\n        correct += (y_pred.argmax(dim=1) == target).float().sum().item()\n        total += target.size(0)\n    print(f\"Test accuracy: {correct / total}\")\n\n\ncli.add_command(train)\ncli.add_command(evaluate)\n\n\nif __name__ == \"__main__\":\n    cli()\n
  4. As documentation that your model is working when running the train command, the script needs to produce a single plot with the training curve (training step vs training loss). When the evaluate command is run, it should write the test set accuracy to the terminal.

It is part of the exercise not to implement the code in notebooks, as real-world code development happens in scripts. Since the model is simple to run (for now), you should be able to complete the exercise on your laptop, even if you are only training on CPU. That said, you are allowed to upload your scripts to your own Google Drive and call them from a Google Colab notebook, as shown in the image below, where all code is placed in the fashion_trainer.py script and the Colab notebook is only used to execute it.
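If you go the Colab route, a minimal notebook cell could look like the sketch below; the Drive path and the script name fashion_trainer.py are just assumptions about where you placed your code.

from google.colab import drive\n\n# mount your Google Drive so the notebook can see the uploaded script\ndrive.mount(\"/content/drive\")\n\n# run the script; adjust the path to wherever you placed fashion_trainer.py\n!python /content/drive/MyDrive/fashion_trainer.py\n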

Be sure to have completed the final exercise before the next session, as we will be building on top of the model you have created.

"},{"location":"s1_development_environment/editor/","title":"M3 - Editor","text":""},{"location":"s1_development_environment/editor/#editoride","title":"Editor/IDE","text":"

Core Module

Notebooks can be great for testing out ideas, developing simple code, and explaining and visualizing certain aspects of a codebase. Remember that Jupyter Notebook was created to \"...allow you to create and share documents that contain live code, equations, visualizations, and narrative text.\" However, any larger machine learning project will require you to work in multiple .py files, and here notebooks provide a suboptimal workflow. Therefore, to truly get \"work done,\" you will need a good editor/IDE.

Many opinions exist on this matter, but for simplicity, we recommend getting started with one of the following 3:

Editor | Webpage | Comment (Biased opinion)\nSpyder | https://www.spyder-ide.org/ | A Matlab-like environment that is easy to get started with\nVisual Studio Code | https://code.visualstudio.com/ | Support for multiple languages with fairly easy setup\nPyCharm | https://www.jetbrains.com/pycharm/ | An IDE for Python professionals. Will take a bit of time getting used to

We highly recommend Visual Studio (VS) Code if you do not already have an editor installed (or just want to try something new). We, therefore, put additional effort into explaining VS Code.

Below, you see an overview of the VS Code interface

Image credit

The main components of VS Code are:

"},{"location":"s1_development_environment/editor/#exercises","title":"\u2754 Exercises","text":"

The overall goal of the exercises is that you should start familiarizing yourself with the editor you have chosen. If you are already an expert in one of them, feel free to skip the rest. You should at least be able to:

The instructions below are specific to Visual Studio Code, but we recommend that you try to answer the questions if using another editor. In the exercise_files folder belonging to this session, we have put cheat sheets for VS Code (one for Windows and one for Mac/Linux) that can give you an easy overview of the different keyboard shortcuts in VS Code. The following exercises are just to get you started, but you can find many more tutorials here.

  1. VS Code is a general editor for many languages, and to get proper Python support, we need to install some extensions. In the action bar, go to the extension tab and search for python in the marketplace. From here, we highly recommend installing the following packages:

    • Python: general Python support for VS Code
    • Pylance: language server for Python that provides better code completion and type-checking
    • Jupyter: support for Jupyter notebooks directly in VS Code
    • Python Environment Manager: allows for easy management of virtual environments
  2. If you install the Python extension, you should see something like this in your status bar:

    which indicates that you are using the stock Python installation instead of the one you have created using conda. Click it and change the Python environment to the one you want to use.

  3. One of the most useful tools in VS Code is the ability to navigate the whole project using the built-in Explorer. To take advantage of VS Code, you need to make sure what you are working on is a project. Create a folder called hello (somewhere on your laptop) and open it in VS Code (Click File in the menu and then select Open Folder). You should end up with a completely clean workspace (as shown below). Click the New file button and create a file called hello.py.

    Image credit

  4. Finally, let's run some code. Add something simple to the hello.py file (a minimal sketch is shown after this list) like:

    Image credit

    and click the run button as shown in the image. It should create a new terminal, activate the environment that you have chosen, and finally run your script. In addition to clicking the run button, you can also:

    • Select some code and press Shift+Enter to run it in the terminal
    • Select some code and right-click, choosing to run in an interactive window (where you can interact with the results like in a Jupyter Notebook)
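    If you need something concrete to put in hello.py (see step 4 above), a minimal sketch that also confirms which Python interpreter VS Code picked up could be:

    import sys\n\nprint(\"Hello from VS Code!\")\nprint(f\"Running {sys.executable}\")  # shows which Python environment is active\n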

That's the basics of using VS Code. We highly recommend that you revisit this tutorial during the course when we get to topics such as debugging and version control, which VS Code can help with. We can also recommend this blog post that goes over some good extensions for AI/ML development in VS Code.

"},{"location":"s1_development_environment/editor/#a-note-on-jupyter-notebooks-in-production-environments","title":"A note on Jupyter notebooks in production environments","text":"

As already stated, Jupyter Notebooks are great for development as they allow developers to easily test out new ideas. However, they often lead to pain points when models need to be deployed. We highly recommend reading section 5.1.1 of this paper by Shankar et al. which in more detail discusses the strong opinions on Jupyter notebooks that exist within the developer community.

All this said, there exists one simple tool to make notebooks work better in a production setting. It's called nbconvert and can be installed with

pip install nbconvert\n

You may need some further dependencies such as Pandoc, TeX and Pyppeteer for it to work (see install instructions here). After this, converting a notebook to a .py script is as simple as:

jupyter nbconvert --to=script my_notebook.ipynb\n

which will produce a similarly named script called my_notebook.py. We highly recommend that you stick to developing scripts directly during the course to get experience with doing so, but nbconvert can be a fantastic tool to have in your toolbox.

"},{"location":"s1_development_environment/editor/#ai-assistance","title":"AI assistance","text":"

You are probably all familiar with using AI tools to solve different tasks in your daily life, and you have most likely also used tools like ChatGPT for programming. However, most of these tools are not directly integrated into your editor, which can lead to a lot of context switching and, in general, lower productivity.

In this section we are therefore going to look at GitHub Copilot, an AI tool that integrates directly into your editor, eliminating the need to switch between browser tabs or external tools. In addition, the strength of having AI directly in your editor is that it can provide suggestions based on the code you are currently writing, and in general it simply has access to a larger context than a standalone tool.

"},{"location":"s1_development_environment/editor/#exercises_1","title":"\u2754 Exercises","text":"
  1. As of writing this, GitHub Copilot is free for all students, teachers and maintainers of popular open-source projects. As a student, sign up for the Student Developer Pack.

  2. Install the GitHub Copilot extension in your editor

  3. GitHub Copilot has many different features, but the most important one is the ability to provide suggestions based on the code you are currently writing. Try to write some code in a new Python file and see if you can get some suggestions from GitHub Copilot on how to complete the code. If you have no idea what to try out, here is a simple example of starting to code a neural network in PyTorch:

    import torch\nfrom torch import nn\nclass Net(nn.Module):\n

    GitHub Copilot will most likely suggest completing the code using linear layers with an input dimension of 28*28. Can you explain why it suggests this and where this bias comes from?

  4. The second feature that can be very useful is the ability to directly chat or ask questions regarding your code. Try highlighting (in your code editor) the code from the previous exercise and press Ctrl+i which should open a chat window. Ask it to complete it with a convolutional neural network instead of a linear one.

  5. Finally, let's try the built-in chat feature. You can get to it by clicking the Chat icon in the Activity bar and then asking questions similar to how you would ask ChatGPT. However, we also have the option to provide context, either from the code editor or the terminal. Try saving the following code in a Python script copilot.py:

    import torch\nfrom torch import nn\nclass Net(nn.Module):\n    def __init__(self):\n        super(Net, self).__init__()\n        self.fc1 = nn.Linear(28*28, 128)\n        self.fc2 = nn.Linear(128, 64)\n        self.fc3 = nn.Linear(64, 10)\n    def forward(self, x):\n        x = x.view(-1, 28*28)\n        x = torch.relu(self.fc1(x))\n        x = torch.relu(self.fc2(x))\n        x = self.fc3(x)\n        return x\n\nmodel = Net()\nprint(model(torch.randn(1, 1, 14, 14)))\n

    and run it in the terminal: python copilot.py. It will naturally give you an error, but you can now ask GitHub Copilot for help. The easiest way to do this is by highlighting the output in the terminal and then running the GitHub Copilot: Explain This (Terminal) command (see the image below; use Ctrl+Shift+P to open the command palette and search for the command). Does the explanation make sense, i.e., can you figure out what to change to get the code running?
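    If you want to sanity-check your answer, here is a small self-contained sketch of the shape reasoning; it shows one possible fix among several, not the only one.

    import torch\n\n# The model flattens its input to 28*28 = 784 features, but a 1x1x14x14 tensor\n# only contains 196 values, which is what triggers the runtime error.\nx = torch.randn(1, 1, 28, 28)      # one fix: give the model the input size it expects\nprint(x.view(-1, 28 * 28).shape)   # torch.Size([1, 784]) -> the view now succeeds\n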

  6. (Optional) Just to investigate the difference between using GitHub Copilot and ChatGPT, try to redo the previous exercises using ChatGPT. What are the main differences between the two tools? (1)

    1. Remember that ChatGPT is a general AI model, meaning that it was trained to be good at many different tasks, whereas GitHub Copilot (which uses OpenAI's Codex model under the hood) was specifically trained to be good at coding.

That was a small introduction to GitHub Copilot. We highly recommend that you try to use it during the course to see how it can help you solve both the exercises and the final project. However, when using AI tools it is always important to remember that they are not perfect and that you need to critically evaluate the suggestions they provide. In the end, you are the one responsible for the code you write, not the AI tool.

"},{"location":"s1_development_environment/package_manager/","title":"M2 - Package Manager","text":""},{"location":"s1_development_environment/package_manager/#package-managers-and-virtual-environments","title":"Package managers and virtual environments","text":"

Core Module

Python is a great programming language, and this is mostly due to its vast ecosystem of packages. No matter what you want to do, there is probably a package that can get you started. Just try to remember the last time you wrote a program using only the Python standard library. Probably never. For this reason, we need a way to install third-party packages, and this is where package managers come into play.

You have probably used pip, the default package manager for Python, for the longest time. pip is great for beginners, but it is missing one essential feature that you will need as a developer or data scientist: virtual environments. Virtual environments are an essential way to make sure that the dependencies of different projects do not cross-contaminate each other. As a naive example, consider project A, which requires torch==1.3.0, and project B, which requires torch==2.0. Then doing

cd project_A  # move to project A\npip install torch==1.3.0  # install old torch version\ncd ../project_B  # move to project B\npip install torch==2.0  # install new torch version\ncd ../project_A  # move back to project A\npython main.py  # try executing main script from project A\n

will mean that even though we execute the main script from project A's folder, it will use torch==2.0 instead of torch==1.3.0, simply because that is the last version we installed: in both cases pip installs the package into the same environment, in this case the global one. Instead, if we did something like:

Unix/macOS (first block) / Windows (second block)
cd project_A  # move to project A\npython -m venv env  # create a virtual environment in project A\nsource env/bin/activate  # activate that virtual environment\npip install torch==1.3.0  # Install the old torch version into the virtual environment belonging to project A\ncd ../project_B  # move to project B\npython -m venv env  # create a virtual environment in project B\nsource env/bin/activate  # activate that virtual environment\npip install torch==2.0  # Install new torch version into the virtual environment belonging to project B\ncd ../project_A  # Move back to project A\nsource env/bin/activate  # Activate the virtual environment belonging to project A\npython main.py  # Succeed in executing the main script from project A\n
cd project_A  # Move to project A\npython -m venv env  # Create a virtual environment in project A\n.\\env\\Scripts\\activate  # Activate that virtual environment\npip install torch==1.3.0  # Install the old torch version into the virtual environment belonging to project A\ncd ../project_B  # Move to project B\npython -m venv env  # Create a virtual environment in project B\n.\\env\\Scripts\\activate  # Activate that virtual environment\npip install torch==2.0  # Install new torch version into the virtual environment belonging to project B\ncd ../project_A  # Move back to project A\n.\\env\\Scripts\\activate  # Activate the virtual environment belonging to project A\npython main.py  # Succeed in executing the main script from project A\n

then we would be sure that torch==1.3.0 is used when executing main.py in project A because we are using two different virtual environments. In the above case, we used the venv module which is the built-in Python module for creating virtual environments. venv+pip is arguably a good combination but when working on multiple projects it can quickly become a hassle to manage all the different virtual environments yourself, remembering which Python version to use, which packages to install and so on.

For this reason, a number of package managers have been created that can help you manage your virtual environments and dependencies, with some of the most popular being:

with more being created every year (rye is looking like an interesting project). This is considered a problem in the Python community, because it means that there is no standard way of managing dependencies, unlike other languages that have npm for Node.js or cargo for Rust.

Image credit

In the course, we do not care about which package manager you use, but we do care that you use one. If you are already familiar with one package manager, then skip this exercise and continue to use that. The best recommendation that I can give regarding package managers, in general, is to find one you like and then stick with it. A lot of time can be wasted on trying to find the perfect package manager, but in the end, they all do the same thing with some minor differences. Check out this blog post if you want a fairly up-to-date evaluation of the different environment management and packaging tools that exist in the Python ecosystem.

If you are not familiar with any package managers, then we recommend that you use conda and pip for this course. You probably already have conda installed on your laptop, which is great. What conda does especially well is allow you to create virtual environments with different Python versions, which can be really useful if you encounter dependencies that have not been updated in a long time. In this course specifically, we recommend the following workflow: use conda to create virtual environments and pip to install packages inside them (a sketch is shown after the note below).

Installing packages with pip inside conda environments has been considered a bad practice for a long time, but since conda>=4.6 it is considered safe to do so. The reason for this is that conda now has a built-in compatibility layer that makes sure that pip installed packages are compatible with the other packages installed in the environment.
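A rough sketch of that workflow (the environment name, Python version and requirements file are just examples) could look like this:

conda create --name my_environment python=3.11  # conda creates the isolated environment\nconda activate my_environment                   # activate it\npip install -r requirements.txt                 # pip installs the project dependencies inside it\npip install -e .                                # and your own project in editable mode\n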

"},{"location":"s1_development_environment/package_manager/#python-dependencies","title":"Python dependencies","text":"

Before we get started with the exercises, let's first talk a bit about Python dependencies. One of the most common ways to specify dependencies in the Python community is through a requirements.txt file, which is a simple text file that contains a list of all the packages that you want to install. The format allows you to specify the package name and version number you want, with 7 different operators:

package1           # any version\npackage2 == x.y.z  # exact version\npackage3 >= x.y.z  # at least version x.y.z\npackage4 >  x.y.z  # newer than version x.y.z\npackage5 <= x.y.z  # at most version x.y.z\npackage6 <  x.y.z  # older than version x.y.z\npackage7 ~= x.y.z  # at least version x.y.z, but staying below version x.(y+1).0\n

In general, all packages (should) follow the semantic versioning standard, which means that the version number is split into three parts: x.y.z where x is the major version, y is the minor version and z is the patch version.

The reason that we need to specify the version number is that we want to make sure that we can reproduce our code at a later point. If we do not specify the version number, then we are at the mercy of the package maintainer to not change the API of the package. This is especially important when working with machine learning models, as we want to make sure that we can reproduce the exact same model at a later point.
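As an illustration, a fully pinned requirements.txt for a small project could look like this (the packages and versions are just examples):

torch==2.1.0\nnumpy==1.26.4\nmatplotlib==3.8.1\n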

Finally, we also need to discuss dependency resolution, which is the process of figuring out which packages are compatible. This is a non-trivial problem, and there exist a lot of different algorithms for doing this. If you have ever thought that pip and conda were taking a long time to install something, then it is probably because they were trying to figure out which packages are compatible with each other. For example, if you try to install

pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.19\" --dry-run\n

then it would simply fail because there are no versions of matplotlib and numpy under the given constraints that are compatible with each other. In this case, we would need to relax the constraints to something like

pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.21\" --dry-run\n

to make it work.

"},{"location":"s1_development_environment/package_manager/#exercises","title":"\u2754 Exercises","text":"

For hints regarding how to use conda you can check out the cheat sheet in the exercise folder.

  1. Download and install conda. You are free to either install full conda or the much simpler miniconda. The core difference between the two is that full conda already comes with a lot of packages preinstalled, which you would have to install yourself when using miniconda. The downside is that full conda is a much larger installation, which can be a disadvantage on smaller devices. Make sure that your installation is working by writing conda help in a terminal; it should show you the help message for conda. If this does not work, you probably need to set a system variable to point to the conda installation.

  2. If you have successfully installed conda, then you should be able to execute the conda command in a terminal.

    Conda will always tell you what environment you are currently in, indicated by the (env_name) in the prompt. By default, it will always start in the (base) environment.

  3. Try creating a new virtual environment. Make sure that it is called my_environment and that it installs version 3.11 of Python. What command should you execute to do this?

    Use Python 3.8 or higher

    We highly recommend that you use Python 3.8 or higher for this course. In general, we recommend that you use the second latest version of Python that is available (currently Python 3.11 as of writing this). This is because the latest version of Python is often not supported by all dependencies. You can always check the status of different Python version support here.

    Solution
    conda create --name my_environment python=3.11\n
  4. Which conda command gives you a list of all the environments that you have created?

    Solution
    conda env list\n
  5. Which conda command gives you a list of the packages installed in the current environment?

    Solution
    conda list\n
    1. How do you easily export this list to a text file? Do this, and make sure you export it to a file called environment.yaml, as conda uses a different format than pip by default.

      Solution
      conda env export > environment.yaml\n
    2. Inspect the file to see what is in it.

    3. The environment.yaml file you have created is one way to secure reproducibility between users because anyone should be able to get an exact copy of your environment if they have your environment.yaml file. Try creating a new environment directly from your environment.yaml file and check that the packages being installed exactly match what you originally had.

      Solution
      conda env create --file environment.yaml\n
  6. As the introduction states, it is fairly safe to use pip inside conda today. What is the corresponding pip command that gives you a list of all pip installed packages? And how do you export this to requirements.txt file?

    Solution
    pip list # List all installed packages\npip freeze > requirements.txt # Export all installed packages to a requirements.txt file\n
  7. If you look through the requirements that both pip and conda produce, then you will see that they are often filled with a lot more packages than what you are using in your project. What you are interested in are the packages that you import in your code: from package import module. One way to get around this is to use the package pipreqs, which will automatically scan your project and create a requirements file specific to it. Let's try it out:

    1. Install pipreqs:

      pip install pipreqs\n
    2. Either try out pipreqs on one of your own projects or try it out on some other online project. How does the requirements.txt file that pipreqs produces compare to the files produced by either pip or conda?

"},{"location":"s1_development_environment/package_manager/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
  1. Try executing the command

    pip install \"pytest < 4.6\" pytest-cov==2.12.1\n

    based on the error message you get, what would be a compatible way to install these?

    Solution

    As pytest-cov==2.12.1 requires pytest version 4.6 or newer, we can simply change the command to:

    pip install \"pytest >= 4.6\" pytest-cov==2.12.1\n

    but there of course exist other solutions as well.

This ends the module on setting up virtual environments. While the methods mentioned in the exercises are great ways to construct requirements files automatically, sometimes it is just easier to sit down and write the files manually, as that way you ensure that only the most necessary requirements are installed when creating a new environment.

"},{"location":"s2_organisation_and_version_control/","title":"Organization and version control","text":"

Slides

Today we take our first steps into the world of MLOps. The set of modules in this session focuses on getting organized and making sure that you are familiar with good development practices. While many of the practices you will learn about in these modules may not seem that important when you are a single person working on a project, it is crucial when working in large groups that the differences in how different people organize and write their code are minimized. The topics in this session will focus on:

Image credit

Some exercises in this course are very loosely stated (including the exercises today). You are expected to seek out information before you ask for help (Google is your friend!), as you will learn more from trying to solve the problems yourself, and it is closer to how the \"real world\" works.

Learning objectives

The learning objectives of this session are:

"},{"location":"s2_organisation_and_version_control/cli/","title":"M9 - Command Line Interfaces","text":""},{"location":"s2_organisation_and_version_control/cli/#command-line-interfaces","title":"Command line interfaces","text":"

As we already laid out in the very first module, the command line is a powerful tool for interacting with your computer. You should by now be familiar with running basic Python commands in the terminal:

python my_script.py\n

However, as your projects grow in size and complexity, you will often find yourself in need of more advanced ways of interacting with your code. This is where command line interfaces (CLIs) come into play. A CLI can be seen as a way for you to define the user interface of your application directly in the terminal. Thus, there is no right or wrong way of creating a CLI; it is all about what makes sense for your application.

In this module we are going to look at three different ways of creating a CLI for your machine learning projects. They each serve slightly different purposes and can therefore be combined in the same project. However, you will most likely also find that they overlap in some areas. That is completely fine, and it is up to you to decide which one to use in which situation.

"},{"location":"s2_organisation_and_version_control/cli/#project-scripts","title":"Project scripts","text":"

You might already be familiar with the concept of executable scripts. An executable script is a Python script that can be run directly from the terminal without having to call the Python interpreter. This has been possible for a long time in Python, by the inclusion of a so-called shebang line at the top of the script. However, we are going to look at a specific way of defining executable scripts using the standard pyproject.toml file, which you should have learned about in this module.
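For reference, the shebang approach (which we will not rely on here) is just a special first line plus an executable flag; a minimal sketch, where the file name my_script.py is just an example:

#!/usr/bin/env python3\n\"\"\"Minimal executable script; enable it with `chmod +x my_script.py` and run it as `./my_script.py`.\"\"\"\nprint(\"Hello from an executable script!\")\n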

"},{"location":"s2_organisation_and_version_control/cli/#exercises","title":"\u2754 Exercises","text":"
  1. We are going to assume that you have a training script in your project that you would like to be able to run from the terminal directly without having to call the Python interpreter. Let's assume it is located like this:

    src/\n\u251c\u2500\u2500 my_project/\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 train.py\npyproject.toml\n

    In your pyproject.toml file add the following lines. You will need to alter the paths to match your project.

    [project.scripts]\ntrain = \"my_project.train:main\"\n

    What do you think the train = \"my_project.train:main\" line does?

    Solution

    The line tells Python that we want to create an executable script called train that should run the main function in the train.py file located in the my_project package.

  2. Now, all that is left to do is install the project again in editable mode

    pip install -e .\n

    and you should now be able to run the following command in the terminal

    train\n

    Try it out and see if it works.

  3. Add additional scripts to your pyproject.toml file that allows you to run other scripts in your project from the terminal.

    Solution

    We assume that you also have a script called evaluate.py in the my_project package.

    [project.scripts]\ntrain = \"my_project.train:main\"\nevaluate = \"my_project.evaluate:main\"\n

That is all there really is to it. You can now run your scripts directly from the terminal without having to call the Python interpreter. Some good examples of Python packages that use this approach are numpy, pylint and kedro.

"},{"location":"s2_organisation_and_version_control/cli/#command-line-arguments","title":"Command line arguments","text":"

If you have worked with Python for some time you are probably familiar with the argparse package, which allows you to directly pass in additional arguments to your script in the terminal

python my_script.py --arg1 val1 --arg2 val2\n

argparse is a very simple way of constructing what is called a command line interface. However, one limitation of argparse is that it is not easy to define a CLI with subcommands. If we take git as an example, git is the main command, but it has multiple subcommands: push, pull, commit etc., which can all take their own arguments. This kind of nested CLI with subcommands is somewhat possible to do using only argparse, but it requires a bit of hacking.
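For comparison, a minimal argparse version of the command shown above might look something like this (the argument names are just placeholders):

import argparse\n\n\ndef main():\n    parser = argparse.ArgumentParser(description=\"Example script\")\n    parser.add_argument(\"--arg1\", type=str, required=True, help=\"first argument\")\n    parser.add_argument(\"--arg2\", type=str, required=True, help=\"second argument\")\n    args = parser.parse_args()\n    print(args.arg1, args.arg2)\n\n\nif __name__ == \"__main__\":\n    main()\n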

You could of course ask why we would want to define such a CLI at all. The main argument is to give users of our code a single entrypoint for interacting with our application instead of having multiple scripts. As long as all subcommands are properly documented, our interface should be simple to interact with (again, think of git, where each subcommand can be given the -h argument to get specific help).

Instead of using argparse we are here going to look at the typer package. typer extends the functionality of argparse to allow for easy definition of subcommands and many other things, which we are not going to touch upon in this module. For completeness we should also mention that typer is not the only package for doing this; another excellent framework for easily creating command line interfaces is click.

"},{"location":"s2_organisation_and_version_control/cli/#exercises_1","title":"\u2754 Exercises","text":"
  1. Start by installing the typer package

    pip install typer\n

    remember to add the package to your requirements.txt file.

  2. To get you started with typer, let's just create a simple hello world type of script. Create a new Python file called greetings.py and use the typer package to create a command line interface such that running the following lines

    python greetings.py            # should print \"Hello World!\"\npython greetings.py --count=3  # should print \"Hello World!\" three times\npython greetings.py --help     # should print the help message, informing the user of the possible arguments\n

    executes and gives the expected output. Relevant documentation.

    Solution

    Importantly, typer requires you to provide type hints for the arguments, because it relies on them to work properly.

    import typer\napp = typer.Typer()\n\n@app.command()\ndef hello(count: int = 1, name: str = \"World\"):\n    for x in range(count):\n        typer.echo(f\"Hello {name}!\")\n\nif __name__ == \"__main__\":\n    app()\n
  3. Next, let's try a slightly harder example. Below is a simple script that trains a support vector machine on the breast cancer dataset from sklearn.

    iris_classifier.py

    iris_classifier.py
    from sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\n\ndef train():\n    \"\"\"Train and evaluate the model.\"\"\"\n    # Load the dataset\n    data = load_breast_cancer()\n    x = data.data\n    y = data.target\n\n    # Split the dataset into training and testing sets\n    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n    # Standardize the features\n    scaler = StandardScaler()\n    x_train = scaler.fit_transform(x_train)\n    x_test = scaler.transform(x_test)\n\n    # Train a Support Vector Machine (SVM) model\n    model = SVC(kernel=\"linear\", random_state=42)\n    model.fit(x_train, y_train)\n\n    # Make predictions on the test set\n    y_pred = model.predict(x_test)\n\n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    report = classification_report(y_test, y_pred)\n\n    print(f\"Accuracy: {accuracy:.2f}\")\n    print(\"Classification Report:\")\n    print(report)\n    return accuracy, report\n\n\nif __name__ == \"__main__\":\n    train()\n

    Implement a CLI for the script such that the following commands can be run

    python iris_classifier.py train --output 'model.ckpt'  # should train the model and save it to 'model.ckpt'\npython iris_classifier.py train -o 'model.ckpt'  # should be the same as above\n
    Solution

    We are here making use of the short-name option in typer to give a shorter alias to the --output option.

    iris_classifier.py
    import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\n\n\n@app.command()\ndef train(output: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\"):\n    \"\"\"Train and evaluate the model.\"\"\"\n    # Load the dataset\n    data = load_breast_cancer()\n    x = data.data\n    y = data.target\n\n    # Split the dataset into training and testing sets\n    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n    # Standardize the features\n    scaler = StandardScaler()\n    x_train = scaler.fit_transform(x_train)\n    x_test = scaler.transform(x_test)\n\n    # Train a Support Vector Machine (SVM) model\n    model = SVC(kernel=\"linear\", random_state=42)\n    model.fit(x_train, y_train)\n\n    # Make predictions on the test set\n    y_pred = model.predict(x_test)\n\n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    report = classification_report(y_test, y_pred)\n\n    with open(output, \"wb\") as f:\n        pickle.dump(model, f)\n\n    print(f\"Accuracy: {accuracy:.2f}\")\n    print(\"Classification Report:\")\n    print(report)\n    return accuracy, report\n\n\nif __name__ == \"__main__\":\n    app()\n
  4. Next, let's create a CLI that has more than a single command. Continue working in the basic machine learning application from the previous exercise, but this time we want to define two separate commands

    python iris_classifier.py train --output 'model.ckpt'\npython iris_classifier.py evaluate 'model.ckpt'\n
    Solution

    The only key difference between the two is that in the train command we define the output argument to be an optional parameter, i.e., we provide a default, whereas for the evaluate command it is a required parameter.

    iris_classifier.py
    import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\n\n# Load the dataset\ndata = load_breast_cancer()\nx = data.data\ny = data.target\n\n# Split the dataset into training and testing sets\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n# Standardize the features\nscaler = StandardScaler()\nx_train = scaler.fit_transform(x_train)\nx_test = scaler.transform(x_test)\n\n\n@app.command()\ndef train(output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n    \"\"\"Train the model.\"\"\"\n    # Train a Support Vector Machine (SVM) model\n    model = SVC(kernel=\"linear\", random_state=42)\n    model.fit(x_train, y_train)\n\n    with open(output_file, \"wb\") as f:\n        pickle.dump(model, f)\n\n\n@app.command()\ndef evaluate(model_file):\n    \"\"\"Evaluate the model.\"\"\"\n    with open(model_file, \"rb\") as f:\n        model = pickle.load(f)\n\n    # Make predictions on the test set\n    y_pred = model.predict(x_test)\n\n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    report = classification_report(y_test, y_pred)\n\n    print(f\"Accuracy: {accuracy:.2f}\")\n    print(\"Classification Report:\")\n    print(report)\n    return accuracy, report\n\n\nif __name__ == \"__main__\":\n    app()\n
  5. Finally, let's try to define subcommands for our commands, e.g., something similar to how git has the subcommand remote, which itself has multiple subcommands like add, rename etc. Continue with the simple machine learning application from the previous exercises, but this time define a CLI such that

    python iris_classifier.py train svm --kernel 'linear'\npython iris_classifier.py train knn -k 5\n

    i.e., the train command now has two subcommands for training different machine learning models (in this case SVM and KNN), each of which takes arguments that are unique to that model. Relevant documentation.

    Solution iris_classifier.py
    import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\ntrain_app = typer.Typer()\napp.add_typer(train_app, name=\"train\")\n\n# Load the dataset\ndata = load_breast_cancer()\nx = data.data\ny = data.target\n\n# Split the dataset into training and testing sets\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n# Standardize the features\nscaler = StandardScaler()\nx_train = scaler.fit_transform(x_train)\nx_test = scaler.transform(x_test)\n\n\n@train_app.command()\ndef svm(kernel: str = \"linear\", output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n    \"\"\"Train a SVM model.\"\"\"\n    model = SVC(kernel=kernel, random_state=42)\n    model.fit(x_train, y_train)\n\n    with open(output_file, \"wb\") as f:\n        pickle.dump(model, f)\n\n\n@train_app.command()\ndef knn(k: int = 5, output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n    \"\"\"Train a KNN model.\"\"\"\n    model = KNeighborsClassifier(n_neighbors=k)\n    model.fit(x_train, y_train)\n\n    with open(output_file, \"wb\") as f:\n        pickle.dump(model, f)\n\n\n@app.command()\ndef evaluate(model_file):\n    \"\"\"Evaluate the model.\"\"\"\n    with open(model_file, \"rb\") as f:\n        model = pickle.load(f)\n\n    # Make predictions on the test set\n    y_pred = model.predict(x_test)\n\n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    report = classification_report(y_test, y_pred)\n\n    print(f\"Accuracy: {accuracy:.2f}\")\n    print(\"Classification Report:\")\n    print(report)\n    return accuracy, report\n\n\nif __name__ == \"__main__\":\n    app()\n
  6. (Optional) Let's try to combine what we have learned until now. Try to make your typer CLI into an executable script using the pyproject.toml file and try it out!

    Solution

    Assuming that our iris_classifier.py script from before is placed in src/my_project folder, we should just add

    [project.scripts]\niris_classifier = \"my_project.iris_classifier:app\"\n

    and remember to install the project in editable mode

    pip install -e .\n

    and you should now be able to run the following command in the terminal

    iris_classifier train knn\n

This covers the basics of typer, but feel free to dive deeper into how the package can help you customize your CLIs. Check out this page on adding colors to your CLI or this page on validating the inputs to your CLI.

"},{"location":"s2_organisation_and_version_control/cli/#non-python-code","title":"Non-Python code","text":"

The two sections above have shown you how to create a simple CLI for your Python scripts. However, when doing machine learning projects, you often have a lot of non-Python code that you would like to run from the terminal. Based on the learning modules you have completed so far, you have already encountered a couple of CLI tools that are used in our projects:

As we begin to move into the next couple of learning modules, we are going to encounter even more CLI tools that we need to interact with. Here is an example of a long command that you might need to run in your project in the future:

docker run -v $(pwd):/app -w /app --gpus all --rm -it my_image:latest python my_script.py --arg1 val1 --arg2 val2\n

This can be a lot to remember, and it can be easy to make mistakes. Instead it would be nice if we could just do

run my_command --arg1=val1 --arg2=val2\n

i.e., easier to remember because we have removed a lot of the hard-to-remember details, while we are still able to configure it to our liking. To help with this, we are going to look at the invoke package. invoke is a Python package that allows you to define tasks that can be run from the terminal. It is a bit like a more advanced version of the Makefile that you might have encountered in other programming languages. Some good alternatives to invoke are just and task, but we have chosen to focus on invoke in this module because it can be installed as a Python package, making installation across different systems easier.
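As a sketch of where we are heading, the long docker command from before could be wrapped in an invoke task roughly like this (the image name, script and arguments are made up for the example):

from invoke import task\n\n\n@task\ndef run(ctx, arg1=\"val1\", arg2=\"val2\"):\n    \"\"\"Run the training container; image name, script and arguments are placeholders.\"\"\"\n    ctx.run(\n        \"docker run -v $(pwd):/app -w /app --gpus all --rm -it \"\n        f\"my_image:latest python my_script.py --arg1 {arg1} --arg2 {arg2}\",\n        echo=True,\n        pty=True,\n    )\n

With this in a tasks.py file, running invoke run --arg1=val1 --arg2=val2 gives you the short command from above.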

"},{"location":"s2_organisation_and_version_control/cli/#exercises_2","title":"\u2754 Exercises","text":"
  1. Start by installing invoke

    pip install invoke\n

    remember to add the package to your requirements.txt file.

  2. Add a tasks.py file to your repository and try to just run

    invoke --list\n

    which should work but inform you that no tasks are added yet.

  3. Let's now try to add a task to the tasks.py file. The way to do this with invoke is to import the task decorator from invoke and then decorate a function with it:

    from invoke import task\nimport os\n\n@task\ndef python(ctx):\n    \"\"\"Print the path to the current Python interpreter.\"\"\"\n    ctx.run(\"which python\" if os.name != \"nt\" else \"where python\")\n

    The first argument of any task-decorated function is the ctx context argument, which implements the run method for running any command just as we would run it in the terminal. In this case we have simply implemented a task that prints the path to the current Python interpreter, and it works on all operating systems. Check that it works by running:

    invoke python\n
  4. Let's try to create a task that simplifies the process of git add, git commit, git push. Create a task such that the following command can be run

    invoke git --message \"My commit message\"\n

    Implement it and use the command to commit the taskfile you just created!

    Solution
    @task\ndef git(ctx, message):\n    ctx.run(\"git add .\")\n    ctx.run(f\"git commit -m '{message}'\")\n    ctx.run(\"git push\")\n
  5. As you have hopefully realized by now, the most important method in invoke is the ctx.run method, which actually runs the commands you want to run in the terminal. This method takes multiple additional arguments. Try out the arguments warn, pty and echo, and explain in your own words what they do.

    Solution
    • warn: If set to True the command will not raise an exception if the command fails. This can be useful if you want to run multiple commands and you do not want the whole process to stop if one of the commands fail.
    • pty: If set to True the command will be run in a pseudo-terminal. If you want to enable this or not, depends on the command you are running. Here is a good explanation of when/why you should use it.
    • echo: If set to True the command will be printed to the terminal before it is run.
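    A small sketch combining the three flags (the pytest command is only an example):

    from invoke import task\n\n\n@task\ndef tests(ctx):\n    \"\"\"Run the test suite; the flags are shown for illustration.\"\"\"\n    ctx.run(\n        \"pytest tests/\",\n        warn=True,  # do not raise if pytest returns a non-zero exit code\n        echo=True,  # print the command before running it\n        pty=True,   # run in a pseudo-terminal, e.g. to keep colored output\n    )\n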
  6. Create a command that simplifies the process of bootstrapping a conda environment and installing the relevant dependencies of your project.

    Solution
    @task\ndef conda(ctx, name: str = \"dtu_mlops\"):\n    ctx.run(\"conda env create -f environment.yml\", echo=True)\n    ctx.run(f\"conda activate {name}\", echo=True)\n    ctx.run(\"pip install -e .\", echo=True)\n

    and try to run the following command

    invoke conda\n
  7. Assuming you have completed the exercises on using dvc for version control of data, let's also try to add a task that simplifies the process of adding new data. This is the list of commands that need to be run to add new data to a dvc repository: dvc add, git add, git commit, git push, dvc push. Try to implement a task that simplifies this process. It needs to take two arguments for defining the folder to add and the commit message.

    Solution
    @task\ndef dvc(ctx, folder=\"data\", message=\"Add new data\"):\n    ctx.run(f\"dvc add {folder}\")\n    ctx.run(f\"git add {folder}.dvc .gitignore\")\n    ctx.run(f\"git commit -m '{message}'\")\n    ctx.run(\"git push\")\n    ctx.run(\"dvc push\")\n

    and try to run the following command

    invoke dvc --folder 'data' --message 'Add new data'\n
  8. As the final exercise, let's try to combine every way of defining CLIs we have learned about in this module. Define a task that does the following

    • calls dvc pull to download the data
    • calls an entrypoint my_cli with the subcommand train and the argument --output 'model.ckpt'
    Solution
    from invoke import task\n\n@task\ndef pull_data(ctx):\n    ctx.run(\"dvc pull\")\n\n@task(pull_data)\ndef train(ctx):\n    ctx.run(\"my_cli train --output 'model.ckpt'\")\n

That is all there is to it. You should now be able to define tasks that can be run from the terminal to simplify the process of running your code. We recommend that, as you go through the learning modules in this course, you slowly add tasks to your tasks.py file that simplify the process of running the code you are writing.

"},{"location":"s2_organisation_and_version_control/cli/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
  1. What is the purpose of a command line interface?

    Solution

    A command line interface is a way for you to define the user interface of your application directly in the terminal. It allows you to interact with your code in a more advanced way than just running Python scripts.

"},{"location":"s2_organisation_and_version_control/code_structure/","title":"M6 - Code structure","text":""},{"location":"s2_organisation_and_version_control/code_structure/#code-organization","title":"Code organization","text":"

Core Module

With a basic understanding of version control, it is now time to really begin filling up our code repository. However, the question remains: how should we organize our code? As developers we tend not to think about code organization that much; instead it is something that is created dynamically as we need it. However, maybe we should spend some time getting organized initially, with the chance of this making our code easier to develop and maintain in the long run. If we do not spend time organizing our code, we may end up with a mess of code that is hard to understand or maintain.

Big ball of Mud

A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems. Brian Foote and Joseph Yoder, Big Ball of Mud. Fourth Conference on Patterns Languages of Programs (PLoP '97/EuroPLoP '97) Monticello, Illinois, September 1997

We are here going to focus on the organization of data science projects and machine learning projects. The core difference this kind of project introduces compared to more traditional systems is data. The key to modern machine learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that data should influence our choice of code structure. If we had another kind of application, then the layout of our codebase should probably be different.

"},{"location":"s2_organisation_and_version_control/code_structure/#cookiecutter","title":"Cookiecutter","text":"

We are in this course going to use the tool cookiecutter, which is a tool for creating projects from project templates. A project template is, in short, just an overall structure for how you want your folders, files etc. to be organized from the beginning. For this course we are going to be using a custom MLOps template. The template is essentially a fork of the cookiecutter data science template that has been used for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.

We are not going to argue that this template is better than every other template; we are just pointing out that it is a standardized way of creating project structures for machine learning projects. By standardized we mean that if two people both use cookiecutter with the same template, the layout of their code follows the same specific rules, enabling one to understand the other person's code faster. Code organization is therefore not only about making the code easier for you to maintain, but also easier for others to read and understand.

Shown below is the default code structure of cookiecutter for data science projects.

What is important to keep in mind when using a template is that it is exactly that: a template. By definition, a template is a guide for making something. Therefore, not all parts of a template may be important for your project at hand. Your job is to pick the parts of the template that are useful for organizing your machine learning project and add the parts that are missing.

"},{"location":"s2_organisation_and_version_control/code_structure/#python-projects","title":"Python projects","text":"

While the same template in principle could be used regardless of what language we were using for our machine learning or data science application, there are certain considerations to take into account based on the language we are using. Python is currently the dominant language for machine learning and data science, which is why in this section we focus on some of the special files you will need for your Python projects.

The first file you may or may not know is the __init__.py file. In Python the __init__.py file is used to mark a directory as a Python package. Therefore as a bare minimum, any Python package should look something like this:

\u251c\u2500\u2500 src/\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 file1.py\n\u2502   \u251c\u2500\u2500 file2.py\n\u251c\u2500\u2500 pyproject.toml\n

The second file to focus on is the pyproject.toml. This file is important for actually turning your code into a Python project. Essentially, whenever you run pip install, pip is in charge of both downloading the package you want and installing it. For pip to be able to install a package it needs instructions on what part of the code it should install and how to install it. This is the job of the pyproject.toml file.

Below we describe both the structure of the pyproject.toml file and setup.py + setup.cfg, which is the \"old\" way of providing project instructions for a Python project. You may still encounter a lot of projects using setup.py + setup.cfg, so it is good to at least know about them.

pyproject.toml / setup.py + setup.cfg

pyproject.toml is the new standardized way of describing project metadata in a declarative way, introduced in PEP 621. It is written in toml format, which is easy to read. At the very least your pyproject.toml file should include the [build-system] and [project] sections:

[build-system]\nrequires = [\"setuptools\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"my-package-name\"\nversion = \"0.1.0\"\nauthors = [{name = \"EM\", email = \"me@em.com\"}]\ndescription = \"Something cool here.\"\nrequires-python = \">=3.8\"\ndynamic = [\"dependencies\"]\n\n[tool.setuptools.dynamic]\ndependencies = {file = [\"requirements.txt\"]}\n

The [build-system] section informs pip/Python that to build this Python project it needs the two packages setuptools and wheel, and that it should call the setuptools.build_meta backend to actually build the project. The [project] section essentially contains metadata regarding the package, what it's called etc., should we ever want to publish it to PyPI.

For specifying the dependencies of your project you have two options. Either you specify them in a requirements.txt file and add it as a dynamic field in pyproject.toml, as shown above. Alternatively, you can add a dependencies field under the [project] header like this:

[project]\ndependencies = [\n    'torch==2.1.0',\n    'matplotlib>=3.8.1'\n]\n

The improvement over setup.py + setup.cfg is that pyproject.toml also allows metadata from other tools to be specified in it, essentially making sure you only need a single file for your project. For example, in module M7 on good coding practices you will learn about the tool ruff and how it can help format your code. If we want to configure ruff for our project, we can do that directly in pyproject.toml by adding additional headers:

[tool.ruff]\nline-length = 120  # example option\n

To read more about how to write a pyproject.toml file, this page is a good place to start.

setup.py is the original way of describing how a Python package should be built. A basic setup.py file will look something like this:

from setuptools import setup\n\n# read the requirements file directly; pip.req.parse_requirements no longer exists in modern pip\nwith open(\"requirements.txt\") as f:\n    requirements = [line.strip() for line in f if line.strip() and not line.startswith(\"#\")]\n\nsetup(\n    name=\"my-package-name\",\n    version=\"0.1.0\",\n    author=\"EM\",\n    description=\"Something cool here.\",\n    install_requires=requirements,\n)\n

Essentially, it is the exact same meta information as in pyproject.toml, just written directly in Python syntax instead of toml. Because there was a wish to separate this meta information into its own file, the setup.cfg file was created, which can contain the exact same information as setup.py, just as a declarative config.

[metadata]\nname = my-package-name\nversion = 0.1.0\nauthor = EM\ndescription = \"Something cool here.\"\n# ...\n

This non-standardized way of providing meta information regarding a package was essentially what led to the creation of pyproject.toml.

Regardless of which way a project is configured, after creating the above files the way to install it is the same:

pip install .\n# or in developer mode\npip install -e . # (1)!\n
  1. The -e is short for --editable mode, also called developer mode. Since we will be continuously iterating on our package, this is the preferred way to install it, because it means that we do not have to run pip install every time we make a change. Essentially, in developer mode changes in the Python source code immediately take effect without requiring a new installation.

After running this, your code should be available to import as from project_name import ... like any other Python package you use. This is the most essential information you need to know about creating Python packages.

"},{"location":"s2_organisation_and_version_control/code_structure/#exercises","title":"\u2754 Exercises","text":"

After having installed cookiecutter (exercises 1 and 2), the remaining exercises are about taking the simple CNN MNIST classifier from yesterday's exercise and forcing it into this structure. You are not required to fill out every folder and file in the project structure, but try to at least follow the steps in the exercises. Whenever you need to run a file, I recommend always doing this from the root directory, e.g.

python <project_name>/data/make_dataset.py data/raw data/processed\npython <project_name>/models/train_model.py <arguments>\netc...\n

in this way paths (for saving and loading files) are always relative to the root.

  1. Install cookiecutter framework

    pip install cookiecutter\n
  2. Start a new project using this template, which is specialized for this course (1).

    1. If you feel like the template can be improved in some way, feel free to either open an issue with the proposed improvement or directly send a pull request to the repository \ud83d\ude04.

    You do this by running the cookiecutter command using the template url:

    cookiecutter <url-to-template>\n

    Valid project names

    When asked for a project name you should follow the PEP8 guidelines for naming packages. This means that the name should be all lowercase and if you want to separate words, you should use underscores. For example my_project is a valid name, while MyProject is not. Additionally, the package name cannot start with a number.

    Flat-layout vs src-layout

    There are two common choices for how to lay out your source directory. The first is called src-layout, where the source code is always placed in a src/<project_name> folder, and the second is called flat-layout, where the source code is just placed in a <project_name> folder. The template we are using in this course uses the flat-layout, but there are pros and cons for both, as sketched below.
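    As a rough sketch (the folder names here are just placeholders), the two layouts differ in whether an extra src/ level wraps the package:

    src-layout\n├── pyproject.toml\n└── src/\n    └── <project_name>/\n        ├── __init__.py\n        └── ...\n\nflat-layout\n├── pyproject.toml\n└── <project_name>/\n    ├── __init__.py\n    └── ...\n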

  3. After having created your new project, the first step is to also create a corresponding virtual environment and install any needed requirements. If you have a virtual environment from yesterday feel free to use that, otherwise create a new one. Then install the project in that environment

    pip install -e .\n
  4. Start by filling out the <project_name>/data/make_dataset.py file. When this file runs, it should take the raw data e.g. the corrupted MNIST files from yesterday (../data/corruptmnist) which now should be located in a data/raw folder and process them into a single tensor, normalize the tensor and save this intermediate representation to the data/processed folder. By normalization here we refer to making sure the images have mean 0 and standard deviation 1.

    Solution make_dataset.py
    import click\nimport torch\n\n\ndef normalize(images: torch.Tensor) -> torch.Tensor:\n    \"\"\"Normalize images.\"\"\"\n    return (images - images.mean()) / images.std()\n\n\n@click.command()\n@click.option(\"--raw_dir\", default=\"data/raw\", help=\"Path to raw data directory\")\n@click.option(\"--processed_dir\", default=\"data/processed\", help=\"Path to processed data directory\")\ndef make_data(raw_dir: str, processed_dir: str) -> None:\n    \"\"\"Process raw data and save it to processed directory.\"\"\"\n    train_images, train_target = [], []\n    for i in range(5):\n        train_images.append(torch.load(f\"{raw_dir}/train_images_{i}.pt\"))\n        train_target.append(torch.load(f\"{raw_dir}/train_target_{i}.pt\"))\n    train_images = torch.cat(train_images)\n    train_target = torch.cat(train_target)\n\n    test_images: torch.Tensor = torch.load(f\"{raw_dir}/test_images.pt\")\n    test_target: torch.Tensor = torch.load(f\"{raw_dir}/test_target.pt\")\n\n    train_images = train_images.unsqueeze(1).float()\n    test_images = test_images.unsqueeze(1).float()\n    train_target = train_target.long()\n    test_target = test_target.long()\n\n    train_images = normalize(train_images)\n    test_images = normalize(test_images)\n\n    torch.save(train_images, f\"{processed_dir}/train_images.pt\")\n    torch.save(train_target, f\"{processed_dir}/train_target.pt\")\n    torch.save(test_images, f\"{processed_dir}/test_images.pt\")\n    torch.save(test_target, f\"{processed_dir}/test_target.pt\")\n\n\nif __name__ == \"__main__\":\n    make_data()\n
  5. This template comes with a Makefile that can be used to easily define common operations in a project. You do not have to understand the complete file but try taking a look at it. In particular the following commands may come in handy

    make data  # runs the make_dataset.py file, try it!\nmake clean  # clean __pycache__ files\nmake requirements  # install everything in the requirements.txt file\n
    Windows users

    make is a GNU build tool that is by default not available on Windows. There are two recommended ways to get it running on Windows. The first is leveraging the Windows Subsystem for Linux, which you may have already installed. The second option is utilizing the chocolatey package manager, which enables Windows users to install packages similarly to how they would on a Linux system.

    In general we recommend that you add commands to the Makefile as you move along in the course. If you want to know more about how to write Makefiles then this is an excellent video.

  6. Put your model file (model.py) into the <project_name>/models folder and insert the relevant code from the main.py file into the train_model.py file. Make sure that whenever a model is trained, it gets saved to the models folder (preferably in sub-folders).

  7. When you run train_model.py, make sure that some statistics/visualizations from the trained models get saved to the reports/figures/ folder. This could be a simple .png of the training curve, as in the sketch below.
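    As a minimal sketch, assuming you collect the per-step losses in a list called train_losses during training, saving such a figure could look like this:

    import matplotlib.pyplot as plt\n\ntrain_losses = [2.3, 1.7, 1.2, 0.9, 0.7]  # hypothetical loss values collected during training\n\nplt.figure()\nplt.plot(train_losses)\nplt.xlabel(\"Step\")\nplt.ylabel(\"Loss\")\nplt.title(\"Training curve\")\nplt.savefig(\"reports/figures/training_curve.png\")\n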

  8. (Optional) Can you figure out a way to add a train command to the Makefile such that training can be started using

    make train\n
    Solution
    train:\n    python <project_name>/models/train_model.py\n
  9. Fill out the newly created <project_name>/models/predict_model.py file, such that it takes a pre-trained model file and creates predictions for some data. The recommended interface is that users can give this file either a folder with raw images that get loaded in, or a numpy or pickle file with already loaded images, e.g. something like this (a minimal sketch follows the command below)

    python <project_name>/models/predict_model.py \\\n    models/my_trained_model.pt \\  # file containing a pretrained model\n    data/example_images.npy  # file containing just 10 images for prediction\n
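    A minimal sketch of such a predict_model.py could look like the following. Note that the model class name, module path and output handling are assumptions that should be adapted to your own project:

    import click\nimport numpy as np\nimport torch\n\nfrom my_project_name.model import MyAwesomeModel  # assumed package and model names\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\n@click.argument(\"data_path\")\ndef predict(model_checkpoint: str, data_path: str) -> None:\n    \"\"\"Predict classes for the images in a .npy file using a pre-trained model.\"\"\"\n    model = MyAwesomeModel()\n    model.load_state_dict(torch.load(model_checkpoint))\n    model.eval()\n\n    images = torch.from_numpy(np.load(data_path)).float()\n    if images.ndim == 3:  # add a channel dimension if it is missing\n        images = images.unsqueeze(1)\n\n    with torch.inference_mode():\n        predictions = model(images).argmax(dim=-1)\n    print(predictions)\n\n\nif __name__ == \"__main__\":\n    predict()\n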
  10. Fill out the file <project_name>/visualization/visualize.py with this (as a minimum, feel free to add more visualizations)

    • Loads a pre-trained network
    • Extracts some intermediate representation of the data (your training set) from your cnn. This could be the features just before the final classification layer
    • Visualize features in a 2D space using t-SNE to do the dimensionality reduction.
    • Save the visualization to a file in the reports/figures/ folder.
    Solution

    The solution here depends a bit on the choice of model. However, in most cases your last layer in the model will be a fully connected layer, which we assume is named fc. The easiest way to get the features before this layer is to replace the layer with torch.nn.Identity which essentially does nothing (see highlighted line below). Alternatively, if you implemented everything in a torch.nn.Sequential you can just remove the last layer from the Sequential object: model = model[:-1].
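    As a small illustration of the Sequential variant (the layer sizes here are just an example):

    import torch\nfrom torch import nn\n\nmodel = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))\nfeature_extractor = model[:-1]  # drop the final classification layer\nfeatures = feature_extractor(torch.randn(32, 784))  # shape [32, 128]\n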

    visualize.py
    import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom my_project_name.model import MyAwesomeModel\nfrom sklearn.decomposition import PCA\nfrom sklearn.manifold import TSNE\n\n\n@click.command()\n@click.option(\"--model_checkpoint\", default=\"model.pth\", help=\"Path to model checkpoint\")\n@click.option(\"--processed_dir\", default=\"data/processed\", help=\"Path to processed data directory\")\n@click.option(\"--figure_dir\", default=\"reports/figures\", help=\"Path to save figures\")\n@click.option(\"--figure_name\", default=\"embeddings.png\", help=\"Name of the figure\")\ndef visualize(model_checkpoint: str, processed_dir: str, figure_dir: str, figure_name: str) -> None:\n    \"\"\"Visualize model predictions.\"\"\"\n    model = MyAwesomeModel()\n    model.load_state_dict(torch.load(model_checkpoint))\n    model.eval()\n    model.fc = torch.nn.Identity()\n\n    test_images = torch.load(f\"{processed_dir}/test_images.pt\")\n    test_target = torch.load(f\"{processed_dir}/test_target.pt\")\n    test_dataset = torch.utils.data.TensorDataset(test_images, test_target)\n\n    embeddings, targets = [], []\n    with torch.inference_mode():\n        for batch in torch.utils.data.DataLoader(test_dataset, batch_size=32):\n            images, target = batch\n            predictions = model(images)\n            embeddings.append(predictions)\n            targets.append(target)\n        embeddings = torch.cat(embeddings).numpy()\n        targets = torch.cat(targets).numpy()\n\n    if embeddings.shape[1] > 500:  # Reduce dimensionality for large embeddings\n        pca = PCA(n_components=100)\n        embeddings = pca.fit_transform(embeddings)\n    tsne = TSNE(n_components=2)\n    embeddings = tsne.fit_transform(embeddings)\n\n    plt.figure(figsize=(10, 10))\n    for i in range(10):\n        mask = targets == i\n        plt.scatter(embeddings[mask, 0], embeddings[mask, 1], label=str(i))\n    plt.legend()\n    plt.savefig(f\"{figure_dir}/{figure_name}\")\n
  11. (Optional) Feel free to create more files/visualizations (what about investigating/exploring the data distribution?)

  12. Make sure to update the README.md file with a short description of how your scripts should be run

  13. Finally make sure to update the requirements.txt file with any packages that are necessary for running your code (see this set of exercises for help)

  14. (Optional) Let's say that you are not satisfied with the template I have recommended that you use, which is completely fine. What should you then do? You should of course create your own template! This is actually not that hard to do.

    1. As a starting point I would recommend that you fork either the mlops template which you have already been using or alternatively fork the data science template.

    2. After forking the template, clone it down locally and let's start modifying it. The first step is changing the cookiecutter.json file. For the mlops template it looks like this:

      {\n    \"project_name\": \"project_name\",\n    \"repo_name\": \"{{ cookiecutter.project_name.lower().replace(' ', '_') }}\",\n    \"author_name\": \"Your name (or your organization/company/team)\",\n    \"description\": \"A short description of the project.\",\n    \"python_version_number\": \"3.10\",\n    \"open_source_license\": [\"No license file\", \"MIT\", \"BSD-3-Clause\"]\n}\n

      simply add a new line to the json file with the name of the variable you want to add and the default value you want it to have.

    3. The actual template is located in the {{ cookiecutter.project_name }} folder. cookiecutter works by replacing everywhere that it sees {{ cookiecutter.<variable_name> }} with the value of the variable. Therefore, if you want to add a new file to the template, just add it to the {{ cookiecutter.project_name }} folder and make sure to add the {{ cookiecutter.<variable_name> }} where you want the variable to be replaced.

    4. After you have made the changes you want to the template, you should test it locally. Just run

      cookiecutter . -f --no-input\n

      and it should create a new folder using the default values of the cookiecutter.json file.

    5. Finally, make sure to push any changes you made to the template to GitHub, such that you in the future can use it by simply running

      cookiecutter https://github.com/<username>/<my_template_repo>\n
"},{"location":"s2_organisation_and_version_control/code_structure/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
  1. Starting from complete scratch, what are the steps needed to create a new GitHub repository and push a specific template to it as the very first commit?

    Solution
    1. Create a completely barebone repository, either using the GitHub UI or if you have the GitHub cli installed (not git) you can run

      gh repo create <repo_name> --public --confirm\n
    2. Run cookiecutter with the template you want to use

      cookiecutter <template>\n

      The name of the folder created by cookiecutter should be the same as the name of the repository you just created.

    3. Run the following sequence of commands

      cd <project_name>\ngit init\ngit add .\ngit commit -m \"Initial commit\"\ngit remote add origin https://github.com/<username>/<repo_name>\ngit push origin master\n
    4. That's it. The template should now have been pushed to the repository as the first commit.

      That ends the module on code structure and cookiecutter. We again want to stress that the point of using cookiecutter is not about following one specific template, but just about using some template for organizing your code. What often happens in a team is that multiple templates are needed in different stages of the development phase or for different product types, because they share a common structure while still having some specifics. Keeping templates up-to-date then becomes critical such that no team member is using an outdated template. If you ever end up in this situation, we highly recommend checking out cruft, which works alongside cookiecutter to not only create projects but also update existing ones as the template evolves. Cruft additionally has template validation capabilities to ensure projects match the latest version of a template.

      "},{"location":"s2_organisation_and_version_control/dvc/","title":"M8 - Data version control","text":""},{"location":"s2_organisation_and_version_control/dvc/#data-version-control","title":"Data Version Control","text":"

      Core Module

      In this module, we are going to return to version control. However, this time we are going to focus on version control of data. The reason we need to distinguish between standard version control and data version control comes down to one problem: size.

      Classic version control was developed to keep track of code files, which are all simple text files. Even a codebase that contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better the more data you feed them, we are seeing models today that are being trained on petabytes of data (1,000,000 GB).

      Because this is an important concept there exist a couple of frameworks that have specialized in versioning data, such as DVC, DAGsHub, Hub, Modelstore and ModelDB. Regardless of the framework, they all implement somewhat the same concept: instead of storing the actual data files, or in general any large artifact files, we store a pointer to these large files. We then version control the pointer instead of the artifact.

      Image credit

      We are in this course going to use DVC provided by iterative.ai as they also provide tools for automating machine learning, which we are going to focus on later.

      "},{"location":"s2_organisation_and_version_control/dvc/#dvc-what-is-it","title":"DVC: What is it?","text":"

      DVC (Data Version Control) is simply an extension of git that versions not only data but also models and experiments in general. But how does it deal with these large data files? Essentially, DVC will just keep track of a small metafile that will then point to some remote location where your original data is stored. Metafiles essentially work as placeholders for your data files. Your large data files are then stored in some remote location such as Google Drive or an S3 bucket from Amazon.

      Image credit

      As the figure shows, we now have two remote locations: one for code and one for data. We use git pull/push for the code and dvc pull/push for the data. The key concept is the connection between the data file model.pkl which is fairly large and its respective metafile model.pkl.dvc which is very small. The large file is stored in the data remote and the metafile is stored in the code remote.

      "},{"location":"s2_organisation_and_version_control/dvc/#exercises","title":"\u2754 Exercises","text":"

      If in doubt about some of the exercises, we recommend checking out the documentation for DVC as it contains excellent tutorials.

      1. For these exercises, we are going to use Google drive as a remote storage solution for our data. If you do not already have a Google account, please create one (we are going to use it again in later exercises). Please make sure that you at least have 1GB of free space.

      2. Next, install DVC and the Google Drive extension

        pip install dvc\npip install dvc-gdrive\n

        If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all. If you encounter that the installation fails, we recommend that you start by updating pip and then trying to update dvc:

        pip install -U pip\npip install -U dvc-gdrive\n

        If this does not work for you, it is most likely due to a problem with pygit2 and in that case we recommend that you follow the instructions here.

      3. In your MNIST repository run the following command from the terminal

        dvc init\n

        this will set up dvc for this repository (similar to how git init will initialize a git repository). These files should be committed using standard git to your repository.

      4. Go to your Google Drive and create a new folder called dtu_mlops_data. Then copy the unique identifier belonging to that folder as shown in the figure below

        Using this identifier, add it as a remote storage

        dvc remote add -d storage gdrive://<your_identifier>\n
      5. Check the content of the file .dvc/config. Does it contain a pointer to your remote storage? Afterwards, make sure to add this file to the next commit we are going to make:

        git add .dvc/config\n
      6. Call the dvc add command on your data files exactly like you would add a file with git (you do not need to add every file by itself as you can directly add the data/ folder). Doing this should create a human-readable file with the extension .dvc. This is the metafile as explained earlier that will serve as a placeholder for your data. If you are on Windows and this step fails you may need to install pywin32. At the same time, the data folder should have been added to the .gitignore file that marks which files should not be tracked by git. Confirm that this is correct.

      7. Now we are going to add, commit and tag the metafiles so we can restore to this stage later on. Commit and tag the files, which should look something like this:

        git add data.dvc .gitignore\ngit commit -m \"First datasets, containing 25000 images\"\ngit tag -a \"v1.0\" -m \"data v1.0\"\n
      8. Finally, push your data to the remote storage using dvc push. You will be asked to authenticate, which involves following the link prompted and copy-pasting the resulting code. Check out your Google Drive folder. You will see that the data is not in a recognizable format anymore due to the way that dvc packs and tracks the data. The boring detail is that dvc converts the data into content-addressable storage, which makes data much faster to get. Finally, make sure that your data is not stored in your GitHub repository.

        After authenticating the first time, DVC should be set up without having to authenticate again. If you for some reason encounter that dvc fails to authenticate, you can try to reset the authentication. Locate the file $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json where $CACHE_HOME depends on your operating system:

        macOS: ~/Library/Caches

        Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)

        Windows: {user}/AppData/Local

        Delete the complete {gdrive_client_id} folder and retry authenticating with dvc push.

      9. After completing the above steps, it is very easy for others (or yourself) to get setup with both code and data by simply running

        git clone <my_repository>\ncd <my_repository>\ndvc pull\n

        (assuming that you give them access rights to the folder in your drive). Try doing this (in some other location than your standard code) to make sure that the two commands indeed download both your code and data.

      10. Let's look at the process of updating our data. Remember, the important aspect of version control is that we do not need to store explicit files called data_v1.pt, data_v2.pt etc., but can just have a single data.pt where we can always check out earlier versions. Start by copying the data/corruptmnist_v2 folder from this repository to your MNIST code. It contains 3 extra datafiles with 15000 additional observations. Rerun your data pipeline so these get incorporated into the files in your processed folder.

      11. Redo the above steps, adding the new data using dvc, committing and tagging the metafiles e.g. the following commands should be executed (with appropriate input):

        dvc add -> git add -> git commit -> git tag -> dvc push -> git push.

      12. Let's say that you wanted to go back to the state of your data in v1.0. If the above steps have been done correctly, you should be able to do this using:

        git checkout v1.0\ndvc checkout\n

        confirm that you have reverted to the original data.

      13. (Optional) Finally, it is important to note that dvc is not only intended to be used to store data files but also any other large files such as trained model weights (with billions of parameters these can be quite large). For example, if we always store our best-performing model in a file called best_model.ckpt then we can use dvc to version control it, store it online and make it easy for others to download. Feel free to experiment with this using your model checkpoints.

      In general dvc is a great framework for version-controlling data and models. However, it is important to note that it does have some performance issues when dealing with datasets that consist of many files. Therefore, if you are ever working with a dataset that consists of many small files, it can be a good idea to:

      • zip files into a single archive and then version control the archive. The zip archive should be placed in a data/raw folder and then unzipped in the data/processed folder.

      • If possible, turn your data into 1D arrays; then it can be stored in a single file format such as .parquet or .csv. This is especially useful for tabular data. You can then version control the single file instead of the many small files, as in the sketch below.
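      As a minimal sketch of the second point (assuming pandas and a parquet engine such as pyarrow are installed, and that the target folder exists):

      import numpy as np\nimport pandas as pd\n\n# imagine 10000 small samples, each a 1D array with 64 features\nsamples = [np.random.rand(64) for _ in range(10000)]\n\n# store everything in a single parquet file instead of 10000 small files\ndf = pd.DataFrame(samples, columns=[f\"feature_{i}\" for i in range(64)])\ndf.to_parquet(\"data/processed/samples.parquet\")\n\n# the single file can then be version controlled and loaded back in one go\ndf_loaded = pd.read_parquet(\"data/processed/samples.parquet\")\n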

      "},{"location":"s2_organisation_and_version_control/dvc/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
      1. How do you know that a repository is using dvc?

        Solution

        Similar to a git repository having a .git directory, a repository using dvc needs to have a .dvc folder. Alternatively you can use the dvc status command.

      2. Assume you just added a folder called data/ that you want to track with dvc. What is the sequence of 5 commands to successfully version control the folder? (assuming you already set up a remote)

        Solution
        dvc add data/\ngit add .\ngit commit -m \"added raw data\"\ngit push\ndvc push\n

      That's all for today. With the combined power of git and dvc we should be able to version control everything in our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that dvc offers more than just data version control, so if you want to deep dive into dvc we recommend their pipeline feature and how this can be used to setup version controlled experiments. Note that we are going to revisit dvc later for a more permanent (and large-scale) storage solution.

      "},{"location":"s2_organisation_and_version_control/git/","title":"M5 - Git","text":""},{"location":"s2_organisation_and_version_control/git/#git","title":"Git","text":"

      Core Module

      Proper collaboration with other people will require that you can work on the same codebase in an organized manner. This is the reason that version control exists. Simply stated, it is a way to keep track of:

      • Who made changes to the code
      • When did the change happen
      • What changes were made

      For a full explanation please see this page

      Secondly, it is important to note that GitHub is not git! GitHub is the dominating player when it comes to hosting repositories, but that does not mean that it is the only provider of free repository hosting (see Bitbucket or GitLab for some other examples).

      That said we will be using git and GitHub throughout this course. It is a requirement for passing this course that you create a public repository with your code and use git to upload any code changes. How much you choose to integrate this into your own projects depends, but you are at least expected to be familiar with git+GitHub.

      Image credit"},{"location":"s2_organisation_and_version_control/git/#initial-config","title":"Initial config","text":"

      What does Git stand for?

      The name \"git\" was given by Linus Torvalds when he wrote the very first version. He described the tool as \"the stupid content tracker\" and the name as (depending on your mood):

      • Random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of \"get\" may or may not be relevant.
      • Stupid. Contemptible and Despicable. simple. Take your pick from the dictionary of slang.
      • \"Global information tracker\": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
      • \"Goddamn idiotic truckload of sh*t\": when it breaks
      1. Install git on your computer and make sure that your installation is working by writing git help in a terminal and it should show you the help message for git.

      2. Create a GitHub account if you do not already have one.

      3. To make sure that we do not have to type in our GitHub username every time that we want to do some changes, we can once and for all set them on our local machine

        # type in a terminal\ngit config credential.helper store\ngit config --global user.name \"<username>\"\ngit config --global user.email <email>\n
      "},{"location":"s2_organisation_and_version_control/git/#git-overview","title":"Git overview","text":"

      The simplest way to think of version control is that it is just nodes with lines connecting them

      Each node, which we call a commit, is uniquely identified by a hash string. Each node stores what our code looked like at the point in time when we made the commit, and using the hash codes we can easily revert to a specific point in time.

      The commits are made up of local changes that we make to our code. A basic workflow for adding commits is seen below

      Assuming that we have made some changes to our local working directory and that we want to get these updates to be online in the remote repository we have to do the following steps:

      • First we run the command git add. This will move our changes to the staging area. While changes are in the staging area we can very easily revert them (using git restore). A unique hash has therefore not been assigned to the code yet, and we can still overwrite it.

      • To take our code from the staging area and make it into a commit, we simply run git commit which will locally add a note to the graph. It is important again, that we have not pushed the commit to the online repository yet.

      • Finally, we want others to be able to use the changes that we made. We do a simple git push and our commit gets online

      Of course, the real power of version control is the ability to make branches, as in the image below

      Image credit

      Each branch can contain code that is not present on other branches. This is useful when many developers are working together on the same project.

      "},{"location":"s2_organisation_and_version_control/git/#exercises","title":"\u2754 Exercises","text":"
      1. In your GitHub account create a repository, where the intention is that you upload the code from the final exercise from yesterday

        1. After creating the repository, clone it to your computer

          git clone https://github.com/my_user_name/my_repository_name.git\n
        2. Move/copy the three files from yesterday into the repository (and any other that you made)

        3. Add the files to a commit by using git add command

        4. Commit the files using git commit command where you use the -m argument to provide a commit message (1).

          1. Writing good commit messages is a skill in itself. A commit message should be short but informative about the work you are trying to commit. Try to practise writing good commit messages throughout the course. You can see this guideline for help.
        5. Finally push the files to your repository using git push. Make sure to check online that the files have been updated in your repository.

        6. You can always use the command git status to check where you are in the process of making a commit.

        7. Also checkout the git log command, which will show you the history of commits that you have made.

      2. Make sure that you understand how to make branches, as this will allow you to try out code changes without messing with your working code. Creating a new branch can be done using:

        # create a new branch\ngit checkout -b <my_branch_name>\n

        Afterwards, you can use git checkout (1) to change between branches (remember to commit your work!) Try adding something (a file, a new line of code etc.) to the newly created branch, commit it and try changing back to master afterwards. You should hopefully see whatever you added on the branch is not present on the main branch.

        1. The git checkout command is used for a lot of different things in git. It can be used to change branches, to revert changes and to create new branches. An alternative is using git switch and git restore which are more modern commands.
      3. If you do not already have a cloned version of this repository belonging to the course, make sure to make one! I am continuously updating/changing some of the material during the course and I therefore recommend that you each day before the lecture do a git pull on your local copy

      4. Git may seem like a waste of time when solutions like Dropbox, Google Drive etc. exist, and that is not completely untrue when only one or two people are working on a project. However, these file management systems fall short when hundreds to thousands of people work together. For this exercise you will go through the steps of sending an open-source contribution:

        1. Go online and find a project you do not own, where you can improve the code. You can either look at this page of good issues to get started with or for simplicity you can just choose the repository belonging to the course. Now fork the project by clicking the Fork button.

          This will create a copy of the repository under your own GitHub account, which you have complete write access to. Note that code updates to the original repository do not automatically update the code in your fork.

        2. Clone your fork of the project using git clone.

        3. By default your local repository will be on the main branch (HINT: you can check this with the git status command). It is good practice to make a new branch when working on some changes. Use the git branch command followed by the git checkout command to create a new branch.

        4. You are now ready to make changes to the repository. Try to find something to improve (any spelling mistakes?). When you have made the changes, do the standard git cycle: add -> commit -> push

        5. Go online to the original repository and go to the Pull requests tab. Find the compare button and choose to compare the master branch of the original repo with the branch that you just created in your own repository. Check the diff on the page to make sure that it contains the changes you have made.

        6. Write a bit about the changes you have made and click Create pull request :)

      5. Forking a repository has the consequence that your fork and the repository that you forked can diverge. To mitigate this we can set what is called a remote upstream. Take a look at this page and set a remote upstream for the repository you just forked.

        Solution
        git remote add upstream <url-to-original-repo>\n
      6. After setting the upstream branch, we need to pull and merge any updates. Take a look at this page and figure out how to do this.

        Solution
        git fetch upstream\ngit checkout main\ngit merge upstream/main\n
      7. As a final exercise we want to simulate a merge conflict, which happens when two users try to commit changes to exactly the same lines of code in the codebase, and git is not able to resolve how the different commits should be integrated.

        1. In your browser, open your favorite repository (it could be the one you just worked on), go to any file of your choosing and click the edit button (see image below) and make some change to the file. For example, if you choose a Python file you can just import some random packages at the top of the file. Commit the change.

        2. Make sure not to pull the change you just made to your local computer. Locally make changes to the same file in the same lines and commit them afterwards.

        3. Now try to git pull the online changes. What should (hopefully) happen is that git will tell you that it found a merge conflict that needs to be resolved. Open the file and you should see something like this

          <<<<<<< HEAD\nthis is some content to mess with\ncontent to append\n=======\ntotally different content to merge later\n>>>>>>> master\n

          this should be interpreted as: everything between <<<<<<< and ======= are the changes made by your local commit, and everything between ======= and >>>>>>> are the changes you are trying to pull. To fix the merge conflict you simply have to make the content in the two \"cells\" work together. When you are done, remove the identifiers <<<<<<<, ======= and >>>>>>>.
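          For the example above, one possible resolution (assuming you want to keep both pieces of content) would be to edit the file to:

          this is some content to mess with\ncontent to append\ntotally different content to merge later\n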

        4. Finally, commit the merge and try to push.

      8. (Optional) The above exercises have focused on how to use git from the terminal, which I highly recommend learning. However, if you are using a proper editor it will also have built-in support for version control. We recommend getting familiar with these features (here is a tutorial for VS Code)

      "},{"location":"s2_organisation_and_version_control/git/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
      1. How do you know if a certain directory is a git repository?

        Solution

        You can check if there is a \".git\" directory. Alternatively, you can use the git status command.

      2. Explain what the file gitignore is used for?

        Solution

        The file gitignore is used to tell git which files to ignore when doing a git add . command. This is useful for files that are not part of the codebase, but are needed for the code to run (e.g. data files) or files that contain sensitive information (e.g. .env files that contain API keys and passwords).

      3. You have two branches - main and devel. What sequence of commands would you need to execute to make sure that devel is in sync with main?

        Solution
        git checkout main\ngit pull\ngit checkout devel\ngit merge main\n
      4. What best practices are you familiar with regarding version control?

        Solution
        • Use a descriptive commit message
        • Make each commit a logical unit
        • Incorporate others' changes frequently
        • Share your changes frequently
        • Coordinate with your co-workers
        • Don't commit generated files

      That covers the basics of git to get you started. In the exercise folder you can find a git cheat sheet with the most useful commands for future reference. Finally, we want to point out another awesome feature of GitHub: the in-browser editor. Sometimes you have a small edit that you want to make, but would still like to do this in an IDE/editor. Or you may be in the situation where you are working from another device than your usual developer machine. GitHub has a built-in editor that can simply be enabled by changing any URL from

      https://github.com/username/repository\n

      to

      https://github.dev/username/repository\n

      Try it out on your newly created repository.

      "},{"location":"s2_organisation_and_version_control/good_coding_practice/","title":"M7 - Good coding practice","text":""},{"location":"s2_organisation_and_version_control/good_coding_practice/#good-coding-practice","title":"Good coding practice","text":"

      Quote

      Code is read more often than it is written. Guido Van Rossum (author of Python)

      It is hard to define exactly what good coding practices are. But the above quote by Guido does hint at what it could be, namely that it has to do with how others observe and perceive your code. In general, good coding practice is about making sure that your code is easy to read and understand, not only by others but also by your future self. The key concept to keep in mind when we are talking about good coding practice is consistency. In many cases it does not matter exactly how you choose to style your code etc., the important part is that you are consistent about it.

      Image credit"},{"location":"s2_organisation_and_version_control/good_coding_practice/#documentation","title":"Documentation","text":"

      Most programmers have a love-hate relationship with documentation: we absolutely hate writing it ourselves, but love when someone else has actually taken the time to add it to their code. There is no doubt that well-documented code is much easier to maintain, as you do not need to remember all the details about the code to still maintain it. It is key to remember that good documentation saves more time than it takes to write.

      The problem with documentation is that there is no right or wrong way to do it. You can end up doing:

      • Under documentation: You document information that is clearly visible from the code and not the complex parts that are actually hard to understand.

      • Over documentation: Writing too much documentation will have the opposite effect on most people than what you want: there is too much to read, so people will skip it.

      Writing good documentation is a skill that takes time to train, so let's try to do it.

      Quote

      Code tells you how; Comments tell you why. Jeff Atwood

      "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises","title":"\u2754 Exercises","text":"
      1. Go over the most complicated file in your project. Be critical and add comments where the logic behind the code is not easily understandable. (1)

        1. In deep learning we often work with multi-dimensional tensors that constantly change shape after each operation. It is good practice to annotate with comments when tensors undergo some reshaping. In the following example we compute the pairwise euclidean distance between two tensors using broadcasting, which results in multiple shape operations.

          import torch\n\nx = torch.randn(5, 10)  # N x D\ny = torch.randn(7, 10)  # M x D\nxy = x.unsqueeze(1) - y.unsqueeze(0)  # (N x 1 x D) - (1 x M x D) = (N x M x D)\npairwise_euc_dist = xy.pow(2.0).sum(dim=-1).sqrt()  # N x M\n
      2. Add docstrings to at least two Python functions/methods. You can see here (example 5) a good example of how to use identifiable keywords such as Parameters, Args, Returns, which standardize the way of writing docstrings.
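      As a small sketch of what such a docstring could look like (the function itself is just a made-up example):

      import torch\n\n\ndef accuracy(preds: torch.Tensor, target: torch.Tensor) -> float:\n    \"\"\"Compute the accuracy of a batch of predictions.\n\n    Args:\n        preds: Tensor of shape [N, C] with class scores.\n        target: Tensor of shape [N] with ground-truth class indices.\n\n    Returns:\n        The fraction of correctly classified samples.\n    \"\"\"\n    return (preds.argmax(dim=-1) == target).float().mean().item()\n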

      "},{"location":"s2_organisation_and_version_control/good_coding_practice/#styling","title":"Styling","text":"

      While Python already enforces some styling (e.g. code should be indented in a specific way), this is not enough to ensure that code from different users actually looks alike. Maybe even more troubling is that you will often see your own style of coding change as you become more and more experienced. This kind of difference in coding style is not that important to take care of when you are working on a personal project, but when multiple people are working together on the same project it is important to consider.

      The question then remains what styling you should use. This is where PEP8 comes into play, which is the official style guide for Python. It essentially contains what is considered \"good practice\" and \"bad practice\" when coding Python.

      For many years the most commonly used tool to check if your code is PEP8 compliant has been flake8. However, in this course we are going to be using ruff, which is quickly gaining popularity due to how fast it is and how quickly the developers are adding new features. (1)

      1. both flake8 and ruff are what is called linters or lint tools, which are static code analysis programs used to flag programming errors, bugs, and styling errors.
      "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises_1","title":"\u2754 Exercises","text":"
      1. Install ruff

        pip install ruff\n
      2. Run ruff on your project or part of your project

        ruff check .  # Lint all files in the current directory (and any subdirectories)\nruff check path/to/code/  # Lint all files in `/path/to/code` (and any subdirectories).\n

        are you PEP8 compliant or are you a normal mortal?

      You could go and fix all the small errors that ruff is giving you. However, in practice large projects instead rely on some kind of code formatter that will automatically format your code to be PEP8 compliant. Some of the biggest formatters in Python have for the longest time been black and yapf, but we are going to use ruff, which also has a built-in formatter that should be a drop-in replacement for black.

      1. Try to use ruff format to format your code

        ruff format .  # Format all files in the current directory.\nruff format /path/to/file.py  # Format a single file.\n

      By default ruff will apply a selection of rules when we are either checking or formatting our code. However, many more rules can be activated through configuration. If you have completed module M6 on code structure you will have encountered the pyproject.toml file, which can store both build instructions for our package and configuration of developer tools. Let's try to configure ruff using the pyproject.toml file.

      1. One aspect that is not covered by PEP8 is how import statements in Python should be organized. If you are like most people, you place your import statements at the top of the file and they are ordered simply by when you needed them. A better practice is to introduce some clear structure in our imports. In older versions of this course we have used isort to do the job, but we are here going to configure ruff to do it. In your pyproject.toml file add the following lines

        [tool.ruff]\nselect = [\"I\"]\n

        and try re-running ruff check and ruff format. Hopefully this should reorganize your imports to follow common practice. (1)

        1. the common practise is to first list built-in Python packages (like os) in one block, followed by third-party dependencies (like torch) in a second block and finally imports from your own package in a third block. Each block is then put in alphabetical order.
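        As a small sketch of what a correctly ordered import section could look like (the package name is just a placeholder):

        # standard library imports first\nimport os\nfrom pathlib import Path\n\n# third-party dependencies second\nimport torch\nfrom torch import nn\n\n# imports from your own package last\nfrom my_project_name.model import MyAwesomeModel\n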
      2. One PEP8 styling rule that is often diverged from is the recommended line length of 79 characters, which by many (including myself) is considered very restrictive. If your code consists of multiple levels of indentation, you can quickly run into 79 characters being limiting. For this reason many projects increase it, often to 120 characters, which seems to be the sweet spot for how many characters fit in a coding window on a laptop. Add the line

        line-length=120\n

        under the [tool.ruff] section in the pyproject.toml file and rerun ruff check and ruff format on your code.

      3. Experiment yourself with further configuration of ruff. In particular we recommend adding more rules and looking at the [tool.ruff.pydocstyle] configuration to indicate how you have styled your documentation.

      "},{"location":"s2_organisation_and_version_control/good_coding_practice/#typing","title":"Typing","text":"

      In addition to writing documentation and following a specific styling, in Python we have a third way of improving the quality of our code: through typing. Typing goes back to earlier programming languages like C, C++ etc. where data types needed to be explicitly stated for variables:

      #include <iostream>\n\nint main() {\n    int x = 5 + 6;\n    float y = 0.5;\n    std::cout << \"Hello World! \" << x << \" \" << y << std::endl;\n    return 0;\n}\n

      This is not required by Python, but it can really improve the readability of code, since you can directly read from the code what the expected types of input arguments and return values are. In Python the : character has been reserved for type hints. Here is one example of adding typing to a function:

      def add2(x: int, y: int) -> int:\n    return x+y\n

      here we mark that both x and y are integers and using the arrow notation -> we mark that the output type is also an integer. Assuming that we are also going to use the function for floats and torch.Tensors we could improve the typing by specifying a union of types. Depending on the version of Python you are using the syntax for this can be different.

      Python <3.10:
      from torch import Tensor  # note it is Tensor with upper case T. This is the base class of all tensors\nfrom typing import Union\ndef add2(x: Union[int, float, Tensor], y: Union[int, float, Tensor]) -> Union[int, float, Tensor]:\n    return x+y\n
      Python >=3.10:
      from torch import Tensor  # note it is Tensor with upper case T. This is the base class of all tensors\ndef add2(x: int | float | Tensor, y: int | float | Tensor) -> int | float | Tensor:\n    return x+y\n

      Finally, since this is a very generic function that also works on numpy arrays etc., we can always default to the Any type if we are not sure about all the specific types that a function can take

      from typing import Any\ndef add2(x: Any, y: Any) -> Any:\n    return x+y\n

      However, in this case we are basically in the same situation as if our function were not typed, as the type hints do not help us at all. Therefore, use Any only when necessary.

      "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises_2","title":"\u2754 Exercises","text":"

      Exercise files

      1. We provide a file called typing_exercise.py. Add typing everywhere in the file. Please note that you will need the following import:

        from typing import Callable, Optional, Tuple, Union, List  # you will need all of them in your code\n

        for it to work. This cheat sheet is a good resource on typing. We also provide typing_exercise_solution.py, but try to solve the exercise yourself.

        typing_exercise.py
        import torch\nfrom torch import nn\n\n\nclass Network(nn.Module):\n    \"\"\"Builds a feedforward network with arbitrary hidden layers.\n\n    Arguments:\n        input_size: integer, size of the input layer\n        output_size: integer, size of the output layer\n        hidden_layers: list of integers, the sizes of the hidden layers\n\n    \"\"\"\n\n    def __init__(self, input_size, output_size, hidden_layers, drop_p=0.5) -> None:\n        super().__init__()\n        # Input to a hidden layer\n        self.hidden_layers = nn.ModuleList([nn.Linear(input_size, hidden_layers[0])])\n\n        # Add a variable number of more hidden layers\n        layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])\n        self.hidden_layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes])\n\n        self.output = nn.Linear(hidden_layers[-1], output_size)\n\n        self.dropout = nn.Dropout(p=drop_p)\n\n    def forward(self, x):\n        \"\"\"Forward pass through the network, returns the output logits.\"\"\"\n        for each in self.hidden_layers:\n            x = nn.functional.relu(each(x))\n            x = self.dropout(x)\n        x = self.output(x)\n\n        return nn.functional.log_softmax(x, dim=1)\n\n\ndef validation(model, testloader, criterion):\n    \"\"\"Validation pass through the dataset.\"\"\"\n    accuracy = 0\n    test_loss = 0\n    for images, labels in testloader:\n        images = images.resize_(images.size()[0], 784)\n\n        output = model.forward(images)\n        test_loss += criterion(output, labels).item()\n\n        ## Calculating the accuracy\n        # Model's output is log-softmax, take exponential to get the probabilities\n        ps = torch.exp(output)\n        # Class with highest probability is our predicted class, compare with true label\n        equality = labels.data == ps.max(1)[1]\n        # Accuracy is number of correct predictions divided by all predictions, just take the mean\n        accuracy += equality.type_as(torch.FloatTensor()).mean()\n\n    return test_loss, accuracy\n\n\ndef train(model, trainloader, testloader, criterion, optimizer=None, epochs=5, print_every=40) -> None:\n    \"\"\"Train a PyTorch Model.\"\"\"\n    if optimizer is None:\n        optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\n    steps = 0\n    running_loss = 0\n    for e in range(epochs):\n        # Model in training mode, dropout is on\n        model.train()\n        for images, labels in trainloader:\n            steps += 1\n\n            # Flatten images into a 784 long vector\n            images.resize_(images.size()[0], 784)\n\n            optimizer.zero_grad()\n\n            output = model.forward(images)\n            loss = criterion(output, labels)\n            loss.backward()\n            optimizer.step()\n\n            running_loss += loss.item()\n\n            if steps % print_every == 0:\n                # Model in inference mode, dropout is off\n                model.eval()\n\n                # Turn off gradients for validation, will speed up inference\n                with torch.no_grad():\n                    test_loss, accuracy = validation(model, testloader, criterion)\n\n                print(\n                    f\"Epoch: {e + 1}/{epochs}.. \",\n                    f\"Training Loss: {running_loss / print_every:.3f}.. \",\n                    f\"Test Loss: {test_loss / len(testloader):.3f}.. 
\",\n                    f\"Test Accuracy: {accuracy / len(testloader):.3f}\",\n                )\n\n                running_loss = 0\n\n                # Make sure dropout and grads are on for training\n                model.train()\n
        Solution typing_exercise_solution.py
        from __future__ import annotations\n\nfrom collections.abc import Callable\n\nimport torch\nfrom torch import nn\n\n\nclass Network(nn.Module):\n    \"\"\"Builds a feedforward network with arbitrary hidden layers.\n\n    Arguments:\n        input_size: integer, size of the input layer\n        output_size: integer, size of the output layer\n        hidden_layers: list of integers, the sizes of the hidden layers\n\n    \"\"\"\n\n    def __init__(\n        self,\n        input_size: int,\n        output_size: int,\n        hidden_layers: list[int],\n        drop_p: float = 0.5,\n    ) -> None:\n        super().__init__()\n        # Input to a hidden layer\n        self.hidden_layers = nn.ModuleList([nn.Linear(input_size, hidden_layers[0])])\n\n        # Add a variable number of more hidden layers\n        layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])\n        self.hidden_layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes])\n\n        self.output = nn.Linear(hidden_layers[-1], output_size)\n\n        self.dropout = nn.Dropout(p=drop_p)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass through the network, returns the output logits.\"\"\"\n        for each in self.hidden_layers:\n            x = nn.functional.relu(each(x))\n            x = self.dropout(x)\n        x = self.output(x)\n\n        return nn.functional.log_softmax(x, dim=1)\n\n\ndef validation(\n    model: nn.Module,\n    testloader: torch.utils.data.DataLoader,\n    criterion: Callable | nn.Module,\n) -> tuple[float, float]:\n    \"\"\"Validation pass through the dataset.\"\"\"\n    accuracy = 0\n    test_loss = 0\n    for images, labels in testloader:\n        images = images.resize_(images.size()[0], 784)\n\n        output = model.forward(images)\n        test_loss += criterion(output, labels).item()\n\n        ## Calculating the accuracy\n        # Model's output is log-softmax, take exponential to get the probabilities\n        ps = torch.exp(output)\n        # Class with highest probability is our predicted class, compare with true label\n        equality = labels.data == ps.max(1)[1]\n        # Accuracy is number of correct predictions divided by all predictions, just take the mean\n        accuracy += equality.type_as(torch.FloatTensor()).mean().item()\n\n    return test_loss, accuracy\n\n\ndef train(\n    model: nn.Module,\n    trainloader: torch.utils.data.DataLoader,\n    testloader: torch.utils.data.DataLoader,\n    criterion: Callable | nn.Module,\n    optimizer: None | torch.optim.Optimizer = None,\n    epochs: int = 5,\n    print_every: int = 40,\n) -> None:\n    \"\"\"Train a PyTorch Model.\"\"\"\n    if optimizer is None:\n        optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\n    steps = 0\n    running_loss = 0\n    for e in range(epochs):\n        # Model in training mode, dropout is on\n        model.train()\n        for images, labels in trainloader:\n            steps += 1\n\n            # Flatten images into a 784 long vector\n            images.resize_(images.size()[0], 784)\n\n            optimizer.zero_grad()\n\n            output = model.forward(images)\n            loss = criterion(output, labels)\n            loss.backward()\n            optimizer.step()\n\n            running_loss += loss.item()\n\n            if steps % print_every == 0:\n                # Model in inference mode, dropout is off\n                model.eval()\n\n                # Turn off gradients for validation, will speed up inference\n                with 
torch.no_grad():\n                    test_loss, accuracy = validation(model, testloader, criterion)\n\n                print(\n                    f\"Epoch: {e + 1}/{epochs}.. \",\n                    f\"Training Loss: {running_loss / print_every:.3f}.. \",\n                    f\"Test Loss: {test_loss / len(testloader):.3f}.. \",\n                    f\"Test Accuracy: {accuracy / len(testloader):.3f}\",\n                )\n\n                running_loss = 0\n\n                # Make sure dropout and grads are on for training\n                model.train()\n
      2. mypy is what is called a static type checker. If you are using typing in your code, then a static type checker can help you find common mistakes. mypy does not run your code, but it scans it and checks that the types you have given are compatible. Install mypy

        pip install mypy\n
      3. Try to run mypy on the typing_exercise.py file

        mypy typing_exercise.py\n

        If you have solved exercise 1 correctly then you should get no errors. If not, mypy should tell you where your types are incompatible.
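        As a small made-up example of the kind of mistake a static type checker catches:

        def add2(x: int, y: int) -> int:\n    return x + y\n\n\nresult = add2(1, \"2\")  # mypy flags this call: the second argument is a str but an int is expected\n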

      "},{"location":"s2_organisation_and_version_control/good_coding_practice/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
      1. According to PEP8 what is wrong with the following code?

        class myclass(nn.Module):\n    def TrainNetwork(self, X, y):\n        ...\n
        Solution

        According to PEP8 classes should follow the CapWords convention, meaning that the first letter in each word of the class name should be capitalized. Thus myclass should therefore be MyClass. On the other hand, functions and methods should be full lowercase with words separated by underscore. Thus TrainNetwork should be train_network.

      2. What would be the type of the argument x for a function def f(x): if it should support the following input

        x1 = [1, 2, 3, 4]\nx2 = (1, 2, 3, 4)\nx3 = None\nx4 = {1: \"1\", 2: \"2\", 3: \"3\", 4: \"4\"}\n
        Solution

        The easy solution would be to do def f(x : Any). But instead we could also go with:

        def f(x: None | Tuple[int, ...] | List[int] | Dict[int, str]):\n

        alternatively, we could also do

        def f(x: None | Iterable[int]):\n

        because both list, tuple and dict are iterables and therefore can be covered by one type (in this specific case).

      This ends the module on coding style. We again want to emphasize that good coding style is more about having a consistent style than strictly following PEP8. A good example of this is Google, which has its own style guide for Python. This guide does not match PEP8 exactly, but it makes sure that different teams within Google that are working on different projects still follow, to a large degree, the same style, and therefore if a project is handed from one team to another, at least that will not be a problem.

      "},{"location":"s3_reproducibility/","title":"Reproducibility","text":"

      Slides

      • Learn how to create reproducible computing environments using docker and how to use them to run your code.

        M10: Docker

      • Learn how to use hydra to manage configuration files and how to integrate it with your code.

        M11: Config Files

      Today is all about reproducibility - one of those concepts that everyone agrees is very important and that something should be done about, but in reality it is very hard to achieve full reproducibility. The last sessions have already touched a bit on how tools like conda and code organization can help make code more reproducible. Today we are going all the way to ensure that our scripts and our computing environment are fully reproducible.

      "},{"location":"s3_reproducibility/#why-does-reproducibility-matter","title":"Why does reproducibility matter","text":"

      Reproducibility is closely related to the scientific method:

      Observe -> Question -> Hypotheses -> Experiment -> Conclude -> Result -> Observe -> ...

      Not having reproducible experiments essentially breaks the cycle between doing experiments and making conclusions. If experiments are not reproducible, then we cannot expect that others will arrive at the same conclusions as we do. As machine learning experiments are fundamentally no different from doing chemical experiments in a laboratory, we should be equally careful in making sure our environments are reproducible (think of your laptop as your laboratory).

      Secondly, if we focus on why reproducibility matters especially in machine learning, it is part of the bigger challenge of making sure that machine learning is trustworthy.

      Many different aspects are needed if trustworthy machine learning is ever going to be a reality. We need robustness of our pipelines so we can trust that they do not fail under heavy load. We need integrity to make sure that pipelines are only deployed if they are of high quality. We need explainability to make sure that we understand what our machine learning models are doing, so they are not just black boxes. We need reproducibility to make sure that the results of our models can be reproduced over and over again. Finally, we need fairness to make sure that our models are not biased toward specific populations. Figure inspired by this paper.

      Trustworthy ML is the idea that machine learning agents can be trusted. Take the example of a machine learning agent being responsible for medical diagnoses. It is very clear that we need to be able to trust that the agent gives us the correct diagnosis for the system to work in practice. Reproducibility plays a big role here, because without it we cannot be sure that the same agent deployed at two different hospitals will give the same diagnosis (given the same input).

      Learning objectives

      The learning objectives of this session are:

      • To understand the importance of reproducibility in computer science
      • To be able to use docker to create a reproducible container, including how to build them from scratch
      • Understand different ways of configuring your code and how to use hydra to integrate with config files
      "},{"location":"s3_reproducibility/config_files/","title":"M11 - Config Files","text":""},{"location":"s3_reproducibility/config_files/#config-files","title":"Config files","text":"

      With docker we can make sure that our compute environment is reproducible, but that does not mean that all our experiments magically become reproducible. There are other factors that are important for creating reproducible experiments.

      In this paper (highly recommended read) the authors tried to reproduce the results of 255 papers and tried to figure out which factors were significant for succeeding. One of those factors was \"Hyperparameters Specified\", i.e. whether or not the authors of the paper had precisely specified the hyperparameters that were used to run the experiments. It should come as no surprise that this can be a determining factor for reproducibility; however, it is not a given that hyperparameters are always well specified.

      "},{"location":"s3_reproducibility/config_files/#configure-experiments","title":"Configure experiments","text":"

      There is really no way around it: deep learning contains a lot of hyperparameters. In general, a hyperparameter is any parameter that affects the learning process (e.g. the weights of a neural network are not hyperparameters because they are a consequence of the learning process). The problem with having many hyperparameters to control in your code is that, if you do not structure them carefully, it may be hard after running an experiment to figure out which hyperparameters were actually used. Lack of proper configuration management can cause serious problems with reliability, uptime, and the ability to scale a system.

      One of the most basic ways of structuring hyperparameters is just to put them directly into your train.py script in some object:

      class my_hp:\n    batch_size = 64\n    lr = 128\n    other_hp = 12345\n\n# easy access to them\ndl = DataLoader(dataset, batch_size=my_hp.batch_size)\n

      The problem here is that configuration is not easy. Each time you want to run a new experiment, you basically have to change the script. If you run the code multiple times without committing the changes in between, the exact hyperparameter configuration for some experiments may be lost. Alright, with this in mind you change strategy and use an argument parser, e.g. run experiments like this

      python train.py --batch_size 256 --learning_rate 1e-4 --other_hp 12345\n
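
      For reference, a minimal argument parser matching the command above could look like the following sketch (the default values are assumptions, not values from any provided script):

      import argparse\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"--batch_size\", type=int, default=64)\nparser.add_argument(\"--learning_rate\", type=float, default=1e-4)\nparser.add_argument(\"--other_hp\", type=int, default=12345)\nargs = parser.parse_args()\n\n# args.batch_size, args.learning_rate and args.other_hp are now available to the training code\n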

      This at least solves the problem of configurability. However, we can again end up losing track of which configuration a given experiment was run with if we are not careful.

      What we really want is some way to easily configure our experiments where the hyperparameters are systematically saved with the experiment. For this we turn our attention to Hydra, a configuration tool that is based around writing config files to keep track of hyperparameters. Hydra operates on top of OmegaConf which is a yaml based hierarchical configuration system.

      A simple yaml configuration file could look like

      #config.yaml\nhyperparameters:\n  batch_size: 64\n  learning_rate: 1e-4\n

      with the corresponding Python code for loading the file

      from omegaconf import OmegaConf\n# loading\nconfig = OmegaConf.load('config.yaml')\n\n# accessing in two different ways\ndl = DataLoader(dataset, batch_size=config.hyperparameters.batch_size)\noptimizer = torch.optim.Adam(model.parameters(), lr=config['hyperparameters']['learning_rate'])\n

      or using hydra for loading the configuration

      import hydra\n\n@hydra.main(config_name=\"basic.yaml\")\ndef main(cfg):\n    print(cfg.hyperparameters.batch_size, cfg.hyperparameters.learning_rate)\n\nif __name__ == \"__main__\":\n    main()\n

      The idea behind refactoring our hyperparameters into .yaml files is that we disentangle the model configuration from the model. In this way it is easier to do version control of the configuration because we have it in a separate file.

      "},{"location":"s3_reproducibility/config_files/#exercises","title":"\u2754 Exercises","text":"

      Exercise files

      The main idea behind the exercises is to take a single script (that we provide) and use Hydra to make sure that everything gets correctly logged such that you would be able to exactly report to others how each experiment was configured. In the provided script, the hyperparameters are hardcoded into the code and your job will be to separate them out into a configuration file.

      Note that we provide a solution (in the vae_solution folder) that can help you get through the exercise, but try to look online for your answers before looking at the solution. Remember: it's not about the result, it's about the journey.

      1. Start by installing hydra:

        pip install hydra-core\n

        Remember to add it to your requirements.txt file.

      2. Next, take a look at the vae_mnist.py and model.py files and understand what is going on. It is a model we will revisit during the course.

      3. Identify the key hyperparameters of the script. Some of them should be easy to find, but at least 3 have made it into the core part of the code. One essential hyperparameter is also not included in the script but is needed for the script to be completely reproducible (HINT: the weights of any neural network are initialized at random).

        Solution

        From the top of the file batch_size, x_dim, hidden_dim can be found as hyperparameters. Looking through the code it can be seen that the latent_dim of the encoder and decoder, the lr of the optimizer and the epochs in the training loop are also hyperparameters. Finally, the seed is not included in the script but is needed to make the script fully reproducible, e.g. torch.manual_seed(seed).
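
        As a minimal sketch (assuming torch is the only source of randomness in the script), the seed could be set at the top of the script like this:

        import torch\n\nseed = 42  # would come from the configuration file\ntorch.manual_seed(seed)\n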

      4. Write a configuration file config.yaml where you write down the hyperparameters that you have found

      5. Get the script running by loading the configuration file inside your script (using hydra) so that the hyperparameters are taken from the configuration file instead of being hardcoded. Note: you should only edit the vae_mnist.py file and not the model.py file.
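
        A minimal sketch of how the top of vae_mnist.py could end up looking is shown here (the exact hyperparameter names depend on what you put in your config.yaml):

        import hydra\nimport torch\n\n\n@hydra.main(config_name=\"config.yaml\")\ndef main(cfg):\n    torch.manual_seed(cfg.hyperparameters.seed)\n    batch_size = cfg.hyperparameters.batch_size\n    lr = cfg.hyperparameters.lr\n    # ... the rest of the original training script, now using the values from cfg\n\n\nif __name__ == \"__main__\":\n    main()\n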

      6. Run the script

      7. By default hydra will write the results to an outputs folder, with a sub-folder for the day the experiment was run and further the time it was started. Inspect your run by going over each file that hydra has generated and checking that the information has been logged. Can you find the hyperparameters?

      8. Hydra also allows for dynamically changing and adding parameters on the fly from the command-line:

        1. Try changing one parameter from the command-line

          python vae_mnist.py hyperparameters.seed=1234\n
        2. Try adding one parameter from the command-line

          python vae_mnist.py +experiment.stuff_that_i_want_to_add=42\n
      9. By default the file vae_mnist.log should be empty, meaning that whatever you printed to the terminal did not get picked up by Hydra. This is due to Hydra under the hood making use of the native python logging package. This means that to also save all printed output from the script we need to replace all calls to print with calls to log.info

        1. Create a logger in the script:

          import logging\nlog = logging.getLogger(__name__)\n
        2. Exchange all calls to print with calls to log.info
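
          As a concrete (hypothetical) example of such a conversion, where epoch and loss are placeholder variables:

          import logging\n\nlog = logging.getLogger(__name__)\nepoch, loss = 1, 0.25  # placeholder values\n\n# before: print(f\"Epoch {epoch} - training loss: {loss}\")\nlog.info(f\"Epoch {epoch} - training loss: {loss}\")\n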

        3. Try re-running the script and make sure that the output printed to the terminal also gets saved to the vae_mnist.log file

      10. Make sure that your script is fully reproducible. To check this you will need two runs of the script to compare. Then run the reproducibility_tester.py script as

        python reproducibility_tester.py path/to/run/1 path/to/run/2\n

        the script will go over the trained weights to see if they match and check that the hyperparameters were the same. Note: for the script to work, the weights should be saved to a file called trained_model.pt (this is the default of the vae_mnist.py script, so only relevant if you have changed the saving of the weights)

      11. Make a new experiment using a new configuration file where you have changed a hyperparameter of your own choice. You are not allowed to change the configuration file in the script but should instead be able to provide it as an argument when launching the script e.g. something like

        python vae_mnist.py experiment=exp2\n

        We recommend that you use a file structure like this

        |--conf\n|  |--config.yaml\n|  |--experiments\n|     |--exp1.yaml\n|     |--exp2.yaml\n|--my_app.py\n
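
        With a layout like the one above, the decorator in the script would point at the conf folder, for example (a sketch, assuming the folder names shown):

        import hydra\n\n\n@hydra.main(config_path=\"conf\", config_name=\"config\")\ndef main(cfg):\n    ...\n\n\nif __name__ == \"__main__\":\n    main()\n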
      12. Finally, an awesome feature of hydra is the instantiate feature. This allows you to define a configuration file that can be used to directly instantiate objects in Python. Try to create a configuration file that can be used to instantiate the Adam optimizer in the vae_mnist.py script.

        Solution

        The configuration file could look like this

        optimizer:\n  _target_: torch.optim.Adam\n  lr: 1e-3\n  betas: [0.9, 0.999]\n  eps: 1e-8\n  weight_decay: 0\n

        and the python code to load the configuration file and instantiate the optimizer could look like this

        import hydra\nimport torch\n\n\n@hydra.main(config_name=\"adam.yaml\")\ndef main(cfg):\n    model = torch.nn.Linear(10, 10)  # dummy model, the optimizer needs parameters to optimize\n    optimizer = hydra.utils.instantiate(cfg.optimizer, params=model.parameters())\n    print(optimizer)\n\n\nif __name__ == \"__main__\":\n    main()\n

        This will print the optimizer object that is created from the configuration file.

      "},{"location":"s3_reproducibility/config_files/#final-exercise","title":"Final exercise","text":"

      Make your MNIST code reproducible! Apply what you have just done to the simple script to your MNIST code. The only requirement is that you this time use multiple configuration files, meaning that you should have at least one model_conf.yaml file and a training_conf.yaml file that separate out the hyperparameters that have to do with the model definition from those that have to do with the training. You can also choose to work with even more complex config setups: in the image below the configuration has two layers such that we can individually specify hyperparameters belonging to a specific model architecture and hyperparameters for each individual optimizer that we may try.

      Image credit"},{"location":"s3_reproducibility/docker/","title":"M10 - Docker","text":""},{"location":"s3_reproducibility/docker/#docker","title":"Docker","text":"

      Core Module

      Image credit

      While the above picture may seem silly at first, it is actually pretty close to how Docker came into existence. A big part of creating an MLOps pipeline is being able to reproduce it. Reproducibility goes beyond versioning our code with git and using conda environments to keep track of our Python installations. To truly achieve reproducibility, we need to capture system-level components such as:

      • Operating system
      • Software dependencies (other than Python packages)

      Docker provides this kind of system-level reproducibility by packaging an application together with all of its dependencies into isolated units. In addition to providing reproducibility, one of the key features of Docker is scalability, which is important when we later discuss deployment. Because Docker ensures system-level reproducibility, it does not (conceptually) matter whether we try to start our program on a single machine or on 1000 machines at once.

      "},{"location":"s3_reproducibility/docker/#docker-overview","title":"Docker Overview","text":"

      Docker has three main concepts: Dockerfile, Docker image, and Docker container:

      • A Dockerfile is a basic text document that contains all the commands a user could call on the command line to run an application. This includes installing dependencies, pulling data from online storage, setting up code, and specifying commands to run (e.g., python train.py).

      • Running, or more correctly, building a Dockerfile will create a Docker image. An image is a lightweight, standalone/containerized, executable package of software that includes everything (application code, libraries, tools, dependencies, etc.) necessary to make an application run.

      • Actually running an image will create a Docker container. This means that the same image can be launched multiple times, creating multiple containers.

      The exercises today will focus on how to construct the actual Dockerfile, as this is the first step to constructing your own container.

      "},{"location":"s3_reproducibility/docker/#docker-sharing","title":"Docker Sharing","text":"

      The whole point of using Docker is that sharing applications becomes much easier. In general, we have two options:

      • After creating the Dockerfile, we can simply commit it to GitHub (it's just a text file) and then ask other users to simply build the image themselves.

      • After building the image ourselves, we can choose to upload it to an image registry such as Docker Hub, where others can get our image by simply running docker pull, making them able to instantaneously run it as a container, as shown in the figure below:

      Image credit"},{"location":"s3_reproducibility/docker/#exercises","title":"\u2754 Exercises","text":"

      In the following exercises, we guide you on how to build a docker file for your MNIST repository that will make the training and prediction a self-contained application. Please make sure that you somewhat understand each step and do not just copy the exercise. Also, note that you probably need to execute the exercise from an elevated terminal e.g. with administrative privilege.

      The exercises today are only an introduction to docker and some of the steps are going to be unoptimized from a production point of view. For example, we often want to keep the size of the docker image as small as possible, which we are not focusing on for these exercises.

      If you are using VScode then we recommend installing the VScode docker extension to easily get an overview of which images have been built and which containers are running. Additionally, the extension named Dev Containers may also be beneficial for you to download.

      1. Start by installing docker. How much trouble you need to go through depends on your operating system. For Windows and Mac, we recommend installing Docker Desktop, which comes with a graphical user interface (GUI) for quickly viewing docker images and docker containers currently built/in use. Windows users who have not installed WSL yet are going to have to do it now (as docker needs it as a backend for starting virtual machines), but you do not need to install docker inside WSL. After installing docker we recommend that you restart your laptop.

      2. Try running the following to confirm that your installation is working:

        docker run hello-world\n

        which should give the message

        Hello from Docker!\nThis message shows that your installation appears to be working correctly.\n
      3. Next, let's try to download an image from Docker Hub. Download the busybox image:

        docker pull busybox\n

        which is a very small (1-5Mb) containerized application that contains the most essential GNU file utilities, shell utilities, etc.

      4. After pulling the image, write

        docker images\n

        which should show you all available images. You should see the busybox image that we just downloaded.

      5. Let's try to run this image

        docker run busybox\n

        You will see that nothing happens! The reason for that is we did not provide any commands to docker run. We essentially just ask it to start the busybox virtual machine, do nothing, and then close it again. Now, try again, this time with

        docker run busybox echo \"hello from busybox\"\n

        Note how fast this process is. In just a few seconds, Docker is able to start a virtual machine, execute a command, and kill it afterward.

      6. Try running

        docker ps\n

        What does this command do? What if you add -a to the end?

      7. If we want to run multiple commands within the virtual machine, we can start it in interactive mode

        docker run -it busybox\n

        This can be a great way to investigate what the filesystem of our virtual machine looks like.

      8. As you may have already noticed by now, each time we execute docker run, we can still see small remnants of the containers using docker ps -a. These stray containers can end up taking up a lot of disk space. To remove them, use docker rm where you provide the container ID that you want to delete

        docker rm <container_id>\n
      9. Let's now move on to trying to construct a Dockerfile ourselves for our MNIST project. Create a file called trainer.dockerfile. The intention is that we want to develop one Dockerfile for running our training script and one for doing predictions.

      10. Instead of starting from scratch, we nearly always want to start from some base image. For this exercise, we are going to start from a simple python image. Add the following to your Dockerfile

        # Base image\nFROM python:3.9-slim\n
      11. Next, we are going to install some essentials in our image. The essentials here are build tools (a C compiler) that some Python packages need when they are installed. These instructions may seem familiar if you are using Linux:

        # Install build dependencies\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n
      12. The previous two steps are common for any Docker application where you want to run Python. All the remaining steps are application-specific (to some degree):

        1. Let's copy over our application (the essential parts) from our computer to the container:

          COPY requirements.txt requirements.txt\nCOPY pyproject.toml pyproject.toml\nCOPY <project-name>/ <project-name>/\nCOPY data/ data/\n

          Remember that we only want the essential parts to keep our Docker image as small as possible. Why do we need each of these files/folders to run training in our Docker container?

        2. Let's set the working directory in our container and add commands that install the dependencies (1):

          1. We split the installation into two steps so that Docker can cache our project dependencies separately from our application code. This means that if we change our application code, we do not need to reinstall all the dependencies. This is a common strategy for Docker images.

            As an alternative, you can use RUN make requirements if you have a Makefile that installs the dependencies. Just remember to also copy over the Makefile into the Docker image.

          WORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\nRUN pip install . --no-deps --no-cache-dir\n

          The --no-cache-dir is quite important. Can you explain what it does and why it is important in relation to Docker?

        3. Finally, we are going to name our training script as the entrypoint for our Docker image. The entrypoint is the application that we want to run when the image is being executed:

          ENTRYPOINT [\"python\", \"-u\", \"<project_name>/train_model.py\"]\n

          The -u flag here makes sure that any output from our script, e.g., any print(...) statements, gets redirected to our terminal. If it is not included, you would need to use docker logs to inspect your run.

      13. We are now ready to build our Dockerfile into a Docker image.

        docker build -f trainer.dockerfile . -t trainer:latest\n
        MAC M1/M2 users

        In general, Docker images are built for a specific platform. For example, if you are using a Mac with an M1/M2 chip, then you are running on an ARM architecture. If you are using a Windows or Linux machine, then you are most likely running on an AMD64 architecture. This is important to know when building Docker images. Thus, Docker images you build may not work on other platforms than the ones you build on. You can specify which platform you want to build for by adding the --platform argument to the docker build command:

        docker build --platform linux/amd64 -f trainer.dockerfile . -t trainer:latest\n

        and also when running the image:

        docker run --platform linux/amd64 trainer:latest\n

        Note that this will significantly increase the build and run time of your Docker image when running locally, because Docker will need to emulate the other platform. In general, for the exercises today, you should not need to specify the platform, but be aware of this if you are building Docker images on your own.

        Please note that here we are providing two extra arguments to docker build. The -f trainer.dockerfile . (the dot is important to remember) indicates which Dockerfile we want to run (except if you named it just Dockerfile) and the -t trainer:latest is the respective name and tag that we see afterward when running docker images (see image below). Please note that building a Docker image can take a couple of minutes.

        Docker images and space

        Docker images can take up a lot of space on your computer, especially the Docker images we are trying to build because PyTorch is a huge dependency. If you are running low on space, you can try to

        docker system prune\n

        Alternatively, you can manually delete images using docker rmi {image_name}:{image_tag}.

      14. Try running docker images and confirm that you get output similar to the one above. If you succeed with this, then try running the docker image

        docker run --name experiment1 trainer:latest\n

        you should hopefully see your training starting. Please note that we can start as many containers as we want at the same time by giving them all different names using the --name tag.

        1. You are most likely going to rebuild your Docker image multiple times, either due to an implementation error or the addition of new functionality. Therefore, instead of watching pip suffer through downloading torch for the 20th time, you can reuse the cache from the last time the Docker image was built. To do this, replace the line in your Dockerfile that installs your requirements with:

          RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt\n

          which mounts your local pip cache to the Docker image. For building the image, you need to have enabled the BuildKit feature. If you have Docker version v23.0 or later (you can check this by running docker version), then this is enabled by default. Otherwise, you need to enable it by setting the environment variable DOCKER_BUILDKIT=1 before building the image.

          Try changing your Dockerfile and rebuilding the image. You should see that the build process is much faster.

      15. Remember, if you ever are in doubt about how files are organized inside a Docker image, you always have the option to start the image in interactive mode:

        docker run -it --entrypoint sh {image_name}:{image_tag}\n
      16. When your training has completed you will notice that any files that are created when running your training script are not present on your laptop (for example if your script is saving the trained model to a file). This is because the files were created inside your container (which is a separate little machine). To get the files you have two options:

        1. If you already have a completed run then you can use

          docker cp\n

          to copy the files between your container and laptop. For example to copy a file called trained_model.pt from a folder you would do:

          docker cp {container_name}:{dir_path}/{file_name} {local_dir_path}/{local_file_name}\n

          Try this out.

        2. A much more efficient strategy is to mount a volume that is shared between the host (your laptop) and the container. This can be done with the -v option for the docker run command. For example, if we want to automatically get the trained_model.pt file after running our training script we could simply execute the container as

          docker run --name {container_name} -v %cd%/models:/models/ trainer:latest\n

          this command mounts our local models folder as a corresponding models folder in the container. Any file saved by the container to this folder will be synchronized back to our host machine. Try this out! Note that if you have multiple files/folders that you want to mount, you simply add multiple -v arguments (if in doubt about the file organization in the container, try to do the next exercise first). Also note that the %cd% needs to change depending on your OS, see this page for help.

      17. With training done we also need to write an application for prediction. Create a new docker image called predict.dockerfile. This file should call your <project_name>/models/predict_model.py script instead. This image will need some trained model weights to work. Feel free to either include these during the build process or mount them afterwards. When you create the file try to build and run it to confirm that it works. Hint: if you are passing in the model checkpoint and prediction data as arguments to your script, your docker run probably needs to look something like

        docker run --name predict --rm \\\n    -v %cd%/trained_model.pt:/models/trained_model.pt \\  # mount trained model file\n    -v %cd%/data/example_images.npy:/example_images.npy \\  # mount data we want to predict on\n    predict:latest \\\n    ../../models/trained_model.pt \\  # argument to script, path relative to script location in container\n    ../../example_images.npy\n
      18. (Optional, requires GPU support) By default, a virtual machine created by docker only has access to your cpu and not your gpu. While you do not necessarily have a laptop with a GPU that supports the training of neural networks (e.g. one from Nvidia), it is beneficial that you understand how to construct a docker image that can take advantage of a GPU, in case you were to run this in the future on a machine that has a GPU (e.g. in the cloud). It does take a bit more work, but many of the steps will be similar to building a normal docker image.

        1. There are three prerequisites for working with Nvidia GPU-accelerated docker containers. First, you need to have the Docker Engine installed (already taken care of), have an Nvidia GPU with updated GPU drivers and finally have the Nvidia container toolkit installed. The last part you most likely have not installed yet and need to do. Some distros of Linux have known problems with the installation process, so you may have to search through known issues in the nvidia-docker repository to find a solution

        2. To test that everything is working start by pulling a relevant Nvidia docker image. In my case this is the correct image:

          docker pull nvidia/cuda:11.0.3-base-ubuntu20.04\n

          but it may differ based on what CUDA version you have. You can find all the different official Nvidia images here. After pulling the image, try running the nvidia-smi command inside a container based on the image you just pulled. It should look something like this:

          docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi\n

          and should show an image like below:

          If it does not work, try redoing the steps.

        3. We should hopefully have a working setup now for running Nvidia accelerated docker containers. The next step is to get PyTorch inside our container, such that our PyTorch implementation also correctly identifies the GPU. Luckily for us, Nvidia provides a set of docker images for GPU-optimized software for AI, HPC and visualizations through their NGC Catalog. The containers that have to do with PyTorch can be seen here. Try pulling the latest:

          docker pull nvcr.io/nvidia/pytorch:22.07-py3\n

          It may take some time because the NGC images include a lot of other software for optimizing PyTorch applications. It may be possible for you to find other images for running GPU-accelerated applications that have a smaller memory footprint, but NGC is the recommended and supported way.

        4. Let's test that this container works:

          docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.07-py3\n

          this should run the container in interactive mode attached to your current terminal. Try opening python in the container and try writing:

          import torch\nprint(torch.cuda.is_available())\n

          which hopefully should return True.

        5. Finally, we need to incorporate all this into our already developed docker files for our application. This is also fairly easy as we just need to change our FROM statement at the beginning of our docker file:

          FROM python:3.7-slim\n

          change to

          FROM  nvcr.io/nvidia/pytorch:22.07-py3\n

          try doing this to one of your docker files, build the image and run the container. Remember to check that your application is using GPU by printing torch.cuda.is_available().

      19. (Optional) Another way you can use Dockerfiles in your day-to-day work is for Dev-containers. Developer containers allow you to develop code directly inside a container, making sure that your code is running in the same environment as it will when deployed. This is especially useful if you are working on a project that has a lot of dependencies that are hard to install on your local machine. Setup instructions for VS Code and PyCharm can be found here (should be simple since we have already installed Docker):

        • VS Code
        • PyCharm

        We will focus on the VS Code setup here.

        1. First, install the Remote - Containers extension.

        2. Create a .devcontainer folder in your project root and create a Dockerfile inside it. We will keep this file very barebones for now, so let's just define a base installation of Python:

          FROM python:3.11-slim-buster\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n
        3. Create a devcontainer.json file in the .devcontainer folder. This file should look something like this:

          {\n    \"name\": \"my_working_env\",\n    \"dockerFile\": \"Dockerfile\",\n    \"postCreateCommand\": \"pip install -r requirements.txt\"\n}\n

          This file tells VS Code that we want to use the Dockerfile that we just created and that we want to install our Python dependencies after the container has been created.

        4. After creating these files, you should be able to open the command palette in VS Code (F1) and search for the option Remote-Containers: Reopen in Container or Remote-Containers: Rebuild and Reopen in Container. Choose either of these options.

          This will start a new VS Code instance inside a Docker container. You should be able to see this in the bottom left corner of your VS Code window. You should also be able to see that the Python interpreter has changed to the one inside the container.

          You are now ready to start developing inside the container. Try opening a terminal and run python and import torch to confirm that everything is working.

      20. (Optional) In M8 on Data version control you learned about the framework dvc for version controlling data. A natural question at this point would then be how to incorporate dvc into our docker image. We need to do two things:

        • Make sure that dvc has all the correct files to pull data from our remote storage
        • Make sure that dvc has the correct credentials to pull data from our remote storage

        We are going to assume that dvc (and any dvc extension needed) is part of your requirements.txt file and that it is already being installed in a RUN pip install -r requirements.txt command in your Dockerfile. If not, then you need to add it.

        1. Add the following lines to your Dockerfile

          RUN dvc init --no-scm\nCOPY .dvc/config .dvc/config\nCOPY *.dvc .dvc/\nRUN dvc config core.no_scm true\nRUN dvc pull\n

          The first line initializes dvc in the Docker image. The --no-scm option is needed because normally dvc can only be initialized inside a git repository, but this option allows initializing dvc without being in one. The second and third lines copy over the dvc config file and the dvc metadata files that are needed to pull data from your remote storage. The last line pulls the data.

        2. If your data is not public, we need to provide credentials in some way to pull the data. We are for now going to do it in a not-so-secure way. When dvc first connected to your drive, a credential file was created. This file is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json, where $CACHE_HOME depends on your operating system:

          • macOS: ~/Library/Caches
          • Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)
          • Windows: {user}/AppData/Local

          Find the file. The content should look similar to this (only some fields are shown):

          {\n    \"access_token\": ...,\n    \"client_id\": ...,\n    \"client_secret\": ...,\n    \"refresh_token\": ...,\n    ...\n}\n

          We are going to copy the file into our Docker image. This, of course, is not a secure way of doing it, but it is the easiest way to get started. As long as you are not sharing your Docker image with anyone else, then it is fine. Add the following lines to your Dockerfile before the RUN dvc pull command:

          COPY <path_to_default.json> default.json\nRUN dvc remote modify myremote --local gdrive_service_account_json_file_path default.json\n

          where <path_to_default.json> is the path to the default.json file that you just found. The last line tells dvc to use the default.json file as the credentials for pulling data from your remote storage. You can confirm that this works by running dvc pull in your Docker image.

      "},{"location":"s3_reproducibility/docker/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
      1. What is the difference between a docker image and a docker container?

        Solution

        A Docker image is a template for a Docker container. A Docker container is a running instance of a Docker image. A Docker image is a static file, while a Docker container is a running process.

      2. What are the 3 steps involved in containerizing an application?

        Solution
        1. Write a Dockerfile that includes your app (including the commands to run it) and its dependencies.
        2. Build the image using the Dockerfile you wrote.
        3. Run the container using the image you've built.
      3. What advantage is there to running your application inside a Docker container instead of running the application directly on your machine?

        Solution

        Running inside a Docker container gives you a consistent and independent environment for your application. This means that you can be sure that your application will run the same way on your machine as it will on another machine. Thus, Docker gives the ability to abstract away the differences between different machines.

      4. A Docker container is built from a series of layers that are stacked on top of each other. This should be clear if you look at the output when building a Docker image. What is the advantage of this?

        Solution

        The advantage is efficiency and reusability. When a change is made to a Docker image, only the layer(s) that are changed need to be updated. For example, if you update the application code in your Docker image, which usually is the last layer, then only that layer needs to be rebuilt, making the process much faster. Additionally, if you have multiple Docker images that share the same base image, then the base image only needs to be downloaded once.

      This covers the absolute minimum you should know about Docker to get a working image and container. If you want to really deep dive into this topic, you can find a copy of the Docker Cookbook by S\u00e9bastien Goasguen in the literature folder.

      If you are actively going to be using Docker in the future, one thing to consider is the image size. Even these simple images that we have built still take up GB in size. Several optimization steps can be taken to reduce the image size for you or your end user. If you have time, you can read this article on different approaches to reducing image size. Additionally, you can take a look at the dive-in extension for Docker Desktop that lets you explore in depth your Docker images.

      "},{"location":"s4_debugging_and_logging/","title":"Debugging, Profiling, Logging and Boilerplate","text":"

      Slides

      • Learn how to use the debugger in your editor to find bugs in your code.

        M12: Debugging

      • Learn how to use a profiler to identify bottlenecks in your code and from those profiles optimize the runtime of your programs.

        M13: Profiling

      • Learn how to systematically log experiments and hyperparameters to make your code reproducible.

        M14: Logging

      • Learn how to use pytorch-lightning framework to minimize boilerplate code and structure deep learning models.

        M15: Boilerplate

      Today we are initially going to go over three different topics that are all fundamentally necessary for any data scientist or DevOps engineer:

      • Debugging
      • Profiling
      • Logging

      All three topics can be characterized by something you probably already are familiar with. Since you started programming, you have done debugging, as nobody can write perfect code on the first try. Similarly, while you may not have directly profiled your code, I bet that you at some point have had some very slow code and optimized it to run faster. Identifying bottlenecks and improving the code are the fundamentals of profiling. Finally, logging is a very broad term and refers to any kind of output from your applications that helps you at a later point identify the \"performance\" of your application.

      However, while we expect you to already be familiar with these topics, we do not expect all of you to be experts, as these topics are rarely given dedicated focus. Today we are going to introduce some best practices and tools to help you with each of these three important topics. As the final topic for today, we are going to learn how we can minimize boilerplate and focus on coding what matters for our project instead of all the boilerplate needed to get it working.

      Learning objectives

      The learning objectives of this session are:

      • Understand the basics of debugging and how to use a debugger to find bugs in your code
      • Can use a profiler to identify bottlenecks in your code and from those profiles optimize the runtime of your programs
      • Familiar with an experiment logging framework for tracking experiments and hyperparameters of your code to make it reproducible
      • Be able to use pytorch-lightning framework to minimize boilerplate code and structure deep learning models
      "},{"location":"s4_debugging_and_logging/boilerplate/","title":"M15 - Boilerplate","text":""},{"location":"s4_debugging_and_logging/boilerplate/#minimizing-boilerplate","title":"Minimizing boilerplate","text":"

      Boilerplate is a general term that describes any standardized text, copy, documents, methods, or procedures that may be used over again without making major changes to the original. But how does this relate to doing machine learning projects? If you have already tried doing a couple of projects within machine learning you will probably have seen a pattern: every project usually consists of these three aspects of code:

      • a model implementation
      • some training code
      • a collection of utilities for saving models, logging images etc.

      While the latter two certainly seem important, in most cases the actual development or research revolves around defining the model. In this sense, both the training code and the utilities become boilerplate that should just carry over from one project to another. But the problem usually is that we have not generalized our training code to take care of the small adjustments that may be required in future projects, and we therefore end up implementing it over and over again every time we start a new project. This is of course a waste of our time that we should try to find a solution to.

      This is where high-level frameworks come into play. High-level frameworks are built on top of another framework (PyTorch in this case) and try to abstract/standardize how to do particular tasks such as training. At first it may seem irritating that you need to comply with someone else's code structure, but there is a good reason for it. The idea is that you can focus on what really matters (your task, model architecture etc.) and do not have to worry about the actual boilerplate that comes with it.

      The most popular high-level (training) frameworks within the PyTorch ecosystem are:

      • fast.ai
      • Ignite
      • skorch
      • Catalyst
      • Composer
      • PyTorch Lightning

      They all offer many of the same features, so choosing one over the other for most projects should not matter that much. We are here going to use PyTorch Lightning, as it offers all the functionality that we are going to need later in the course.

      "},{"location":"s4_debugging_and_logging/boilerplate/#pytorch-lightning","title":"PyTorch Lightning","text":"

      In general we refer to the documentation from PyTorch lightning if in doubt about how to format your code for doing specific tasks. We are here going to explain the key concepts of the API that you need to understand to use the framework, starting with the LightningModule and the Trainer.

      "},{"location":"s4_debugging_and_logging/boilerplate/#lightningmodule","title":"LightningModule","text":"

      The LightningModule is a subclass of a standard nn.Module that basically adds additional structure. In addition to the standard __init__ and forward methods that need to be implemented in an nn.Module, a LightningModule requires two more methods to be implemented:

      • training_step: should contain your actual training code e.g. given a batch of data this should return the loss that you want to optimize

      • configure_optimizers: should return the optimizer that you want to use

      Below, these two methods are shown added to a standard MNIST classifier.
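
      The following is a minimal sketch; the architecture is only illustrative and may differ from the classifier used elsewhere in the course material:

      import pytorch_lightning as pl\nimport torch\nfrom torch import nn\n\n\nclass MNISTClassifier(pl.LightningModule):\n    def __init__(self):\n        super().__init__()\n        self.layers = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))\n        self.loss_fn = nn.CrossEntropyLoss()\n\n    def forward(self, x):\n        return self.layers(x)\n\n    def training_step(self, batch, batch_idx):\n        # given a batch of data, compute and return the loss that should be optimized\n        img, target = batch\n        return self.loss_fn(self(img), target)\n\n    def configure_optimizers(self):\n        # return the optimizer used to update the model parameters\n        return torch.optim.Adam(self.parameters(), lr=1e-3)\n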

      Compared to a standard nn.Module, the additional methods in the LightningModule basically specify exactly how you want to optimize your model.

      "},{"location":"s4_debugging_and_logging/boilerplate/#trainer","title":"Trainer","text":"

      The second component of lightning is the Trainer object. As the name suggests, the Trainer object takes care of the actual training, automating everything that you do not want to worry about.

      from pytorch_lightning import Trainer\nmodel = MyAwesomeModel()  # this is our LightningModule\ntrainer = Trainer()\ntrainer.fit(model)\n

      That is essentially all you need to specify in lightning to have a working model. The trainer object does not have methods that you need to implement yourself, but it has a bunch of arguments that can be used to control how many epochs you want to train for, whether you want to run on GPU, etc. To get the training of our model to work we just need to specify how our data should be fed into the lightning framework.

      "},{"location":"s4_debugging_and_logging/boilerplate/#data","title":"Data","text":"

      For organizing our code that has to do with data in Lightning we essentially have three different options. However, all three assume that we are using torch.utils.data.DataLoader for the dataloading.

      1. If we already have a train_dataloader and possibly also a val_dataloader and test_dataloader defined, we can simply add them to our LightningModule using the similarly named methods:

        def train_dataloader(self):\n    return DataLoader(...)\n\ndef val_dataloader(self):\n    return DataLoader(...)\n\ndef test_dataloader(self):\n    return DataLoader(...)\n
      2. Maybe even simpler, we can directly feed such dataloaders in the fit method of the Trainer object:

        trainer.fit(model, train_dataloader, val_dataloader)\ntrainer.test(model, test_dataloader)\n
      3. Finally, Lightning also has the LightningDataModule that organizes data loading into a single structure, see this page for more info. Putting data loading into a DataModule makes sense as it can then be reused between projects.
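
        A bare-bones sketch of such a DataModule could look like the following (it assumes torchvision is available; the dataset, paths and split sizes are placeholders):

        import pytorch_lightning as pl\nfrom torch.utils.data import DataLoader, random_split\nfrom torchvision import datasets, transforms\n\n\nclass MNISTDataModule(pl.LightningDataModule):\n    def __init__(self, data_dir: str = \"data\", batch_size: int = 64):\n        super().__init__()\n        self.data_dir = data_dir\n        self.batch_size = batch_size\n\n    def setup(self, stage=None):\n        # split the full training set into a train and a validation part\n        full = datasets.MNIST(self.data_dir, train=True, download=True, transform=transforms.ToTensor())\n        self.train_set, self.val_set = random_split(full, [55000, 5000])\n\n    def train_dataloader(self):\n        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)\n\n    def val_dataloader(self):\n        return DataLoader(self.val_set, batch_size=self.batch_size)\n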

      "},{"location":"s4_debugging_and_logging/boilerplate/#callbacks","title":"Callbacks","text":"

      Callbacks are one way to add additional functionality to your model that, strictly speaking, is not already part of your model. Callbacks should therefore be seen as self-contained features that can be reused between projects. You have the option to implement callbacks yourself (by inheriting from the pytorch_lightning.callbacks.Callback base class) or use one of the built-in callbacks. Of particular interest are the ModelCheckpoint and EarlyStopping callbacks:

      • The ModelCheckpoint makes sure to save checkpoints of your model. This is in principle not hard to do yourself, but the ModelCheckpoint callback offers additional functionality such as saving checkpoints only when some metric improves, or only saving the best K performing models etc.

        from pytorch_lightning.callbacks import ModelCheckpoint\n\nmodel = MyModel()\ncheckpoint_callback = ModelCheckpoint(\n    dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[checkpoint_callback])\ntrainer.fit(model)\n
      • The EarlyStopping callback can help you prevent overfitting by automatically stopping the training if a certain value is not improving anymore:

        from pytorch_lightning.callbacks import EarlyStopping\n\nmodel = MyModel()\nearly_stopping_callback = EarlyStopping(\n    monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback])\ntrainer.fit(model)\n

      Multiple callbacks can be used by passing them all in a list e.g.

      trainer = Trainer(callbacks=[checkpoint_callback, early_stopping_callback])\n
      "},{"location":"s4_debugging_and_logging/boilerplate/#exercises","title":"\u2754 Exercises","text":"

      Please note that in the following exercises we will basically ask you to reformat all your MNIST code to follow the lightning standard, such that we can take advantage of all the tricks the framework has to offer. The reason we did not implement our model in lightning to begin with is that to truly understand why it is beneficial to use a high-level framework to do some of the heavy lifting, you need to have gone through some of the implementation troubles yourself.

      1. Install pytorch lightning:

        pip install pytorch-lightning # (1)!\n
        1. You may also install it as pip install lightning which includes more than just the PyTorch Lightning package. This also includes Lightning Fabric and Lightning Apps which you can read more about here and here.
      2. Convert your corrupted MNIST model into a LightningModule. You can either choose to completely override your old model or implement it in a new file. The bare minimum that you need to add while converting to get it working with the rest of lightning:

        • The training_step method. This function should contain essentially what goes into a single training step and should return the loss at the end

        • The configure_optimizers method

        Please read the documentation for more info.

        Solution lightning.py
        import pytorch_lightning as pl\nimport torch\nfrom torch import nn\n\n\nclass MyAwesomeModel(pl.LightningModule):\n    \"\"\"My awesome model.\"\"\"\n\n    def __init__(self) -> None:\n        super().__init__()\n        self.conv1 = nn.Conv2d(1, 32, 3, 1)\n        self.conv2 = nn.Conv2d(32, 64, 3, 1)\n        self.conv3 = nn.Conv2d(64, 128, 3, 1)\n        self.dropout = nn.Dropout(0.5)\n        self.fc1 = nn.Linear(128, 10)\n\n        self.loss_fn = nn.CrossEntropyLoss()\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass.\"\"\"\n        x = torch.relu(self.conv1(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.relu(self.conv2(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.relu(self.conv3(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.flatten(x, 1)\n        x = self.dropout(x)\n        return self.fc1(x)\n\n    def training_step(self, batch):\n        \"\"\"Training step.\"\"\"\n        img, target = batch\n        y_pred = self(img)\n        return self.loss_fn(y_pred, target)\n\n    def configure_optimizers(self):\n        \"\"\"Configure optimizer.\"\"\"\n        return torch.optim.Adam(self.parameters(), lr=1e-3)\n\n\nif __name__ == \"__main__\":\n    model = MyAwesomeModel()\n    print(f\"Model architecture: {model}\")\n    print(f\"Number of parameters: {sum(p.numel() for p in model.parameters())}\")\n\n    dummy_input = torch.randn(1, 1, 28, 28)\n    output = model(dummy_input)\n    print(f\"Output shape: {output.shape}\")\n
      3. Make sure your data is formatted such that it can be loaded using the torch.utils.data.DataLoader object.
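
        For example, if your processed corrupted MNIST data is stored as tensors, a minimal sketch (the file paths here are hypothetical) could be:

        import torch\nfrom torch.utils.data import DataLoader, TensorDataset\n\nimages = torch.load(\"data/processed/train_images.pt\")  # hypothetical path\ntargets = torch.load(\"data/processed/train_targets.pt\")  # hypothetical path\ntrain_set = TensorDataset(images, targets)\ntrain_dataloader = DataLoader(train_set, batch_size=64, shuffle=True)\n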

      4. Instantiate a Trainer object. It is recommended to take a look at the trainer arguments (there are many of them) and maybe adjust some of them:

        1. Investigate what the default_root_dir flag does

        2. As default lightning will run for 1000 epochs. This may be too much (for now). Change this by changing the appropriate flag. Additionally, there also exists a flag to set the maximum number of steps that we should train for.

          Solution

          Setting the max_epochs will accomplish this.

          trainer = Trainer(max_epochs=10)\n

          Additionally, you may consider instead setting the max_steps flag to limit based on the number of steps or max_time to limit based on time. Similarly, the flags min_epochs, min_steps and min_time can be used to set the minimum number of epochs, steps or time.

        3. To start with we also want to limit the amount of training data to 20% of its original size. Which trainer flag do you need to set for this to work?

          Solution

          Setting the limit_train_batches flag will accomplish this.

          trainer = Trainer(limit_train_batches=0.2)\n

          Similarly, you can also set the limit_val_batches and limit_test_batches flags to limit the validation and test data.

      5. Try fitting your model: trainer.fit(model)

      6. Now try adding some callbacks to your trainer.

        Solution
        early_stopping_callback = EarlyStopping(\n    monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ncheckpoint_callback = ModelCheckpoint(\n    dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback, checkpoint_callback])\n
      7. The previous module was all about logging in wandb, so the natural question is how lightning supports this. Lightning does not only support wandb, but also many other loggers. Common to all of them is that logging just needs to happen through the self.log method in your LightningModule:

        1. Add self.log to your LightningModule. It should look something like this:

          def training_step(self, batch, batch_idx):\n    data, target = batch\n    preds = self(data)\n    loss = self.criterion(preds, target)\n    acc = (target == preds.argmax(dim=-1)).float().mean()\n    self.log('train_loss', loss)\n    self.log('train_acc', acc)\n    return loss\n
        2. Add the wandb logger to your trainer

          trainer = Trainer(logger=pl.loggers.WandbLogger(project=\"dtu_mlops\"))\n

          and try to train the model. Confirm that you are seeing the scalars appearing in your wandb portal.

        3. self.log sadly only supports logging scalar tensors. Luckily, for logging other quantities we can still access the standard wandb.log through our model

          def training_step(self, batch, batch_idx):\n    ...\n    # self.logger.experiment is the same as wandb.log\n    self.logger.experiment.log({'logits': wandb.Histogram(preds)})\n

          try doing this by logging something other than scalar tensors.

      8. Finally, we maybe also want to do some validation or testing. In lightning we just need to add the validation_step and test_step methods to our lightning module and supply the respective data in the form of a separate dataloader. Try to at least implement one of them.

        Solution

        Both validation and test steps can be implemented in the same way as the training step:

        def validation_step(self, batch) -> None:\n    data, target = batch\n    preds = self(data)\n    loss = self.criterion(preds, target)\n    acc = (target == preds.argmax(dim=-1)).float().mean()\n    self.log('val_loss', loss, on_epoch=True)\n    self.log('val_acc', acc, on_epoch=True)\n

        Two things to take note of here: we are setting the on_epoch flag to True in the self.log method because we want to log the validation loss and accuracy only once per epoch, and we are not returning anything from the validation_step method because we do not optimize over this loss.

      9. (Optional, requires GPU) One of the big advantages of using lightning is that you no longer need to deal with device placement, e.g. calling .to('cuda') everywhere. If you have a GPU, try setting the accelerator and devices flags in the trainer. If you do not have one, do not worry, we are going to return to this when we are going to run training in the cloud.

        Solution

        The two arguments accelerator and devices can be used to specify which devices to run on and how many to run on. For example, to run on a single GPU you can do

        trainer = Trainer(accelerator=\"gpu\", devices=1)\n

        As an alternative, the accelerator argument can just be set to accelerator=\"auto\" to automatically detect the best available device.

      10. (Optional) By default PyTorch uses float32 for representing floating point numbers. However, research has shown that neural network training is very robust towards a decrease in precision. The great benefit of going from float32 to float16 is that we get approximately half the memory consumption. Try out half-precision training in PyTorch lightning. You can enable this by setting the precision flag in the Trainer.

        Solution

        Lightning supports two types of mixed precision training (16-bit and 16-bit bfloat) and two types of true half precision training:

        # 16-bit mixed precision (model weights remain in torch.float32)\ntrainer = Trainer(precision=\"16-mixed\", devices=1)\n\n# 16-bit bfloat mixed precision (model weights remain in torch.float32)\ntrainer = Trainer(precision=\"bf16-mixed\", devices=1)\n\n# 16-bit precision (model weights get cast to torch.float16)\ntrainer = Trainer(precision=\"16-true\", devices=1)\n\n# 16-bit bfloat precision (model weights get cast to torch.bfloat16)\ntrainer = Trainer(precision=\"bf16-true\", devices=1)\n
      11. (Optional) Lightning also has built-in support for profiling. Check out how to do this using the profiler argument in the Trainer object.
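
        A minimal sketch, assuming you want the built-in simple profiler (other options include the advanced and pytorch profilers):

        trainer = Trainer(profiler=\"simple\")  # prints a summary of time spent in each hook after trainer.fit\n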

      12. (Optional) Another great feature of Lightning is that it allows for easily defining command line interfaces through the Lightning CLI feature. The Lightning CLI is essentially a drop-in replacement for defining command line interfaces (covered in this module) and can also replace the need for config files (covered in this module) for securing reproducibility when working inside the Lightning framework. We highly recommend checking out the feature and trying to refactor your code such that you do not need to call trainer.fit anymore, but it is instead controlled directly from the Lightning CLI. A rough sketch of what such a refactor could look like is shown below.
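
        A minimal sketch, assuming a LightningModule called MyLightningModule and a datamodule called MyDataModule (both names and import paths are placeholders):

        from pytorch_lightning.cli import LightningCLI\n\nfrom my_project.model import MyLightningModule  # placeholder: your LightningModule\nfrom my_project.data import MyDataModule  # placeholder: your LightningDataModule\n\nif __name__ == \"__main__\":\n    LightningCLI(MyLightningModule, MyDataModule)  # replaces the manual trainer.fit call\n

        The script could then hypothetically be run with something like python train.py fit --trainer.max_epochs=10, with all arguments controlled from the command line or a YAML config file.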

      13. Free exercise: Experiment with what the lightning framework is capable of. Either try out more of the trainer flags, some of the other callbacks, or maybe look into some of the other methods that can be implemented in your lightning module. Only your imagination is the limit!

      That covers everything for today. It has been a mix of topics that all should help you write \"better\" code (by some objective measure). If you want to dive deeper into the PyTorch lightning framework, we highly recommend looking at the different tutorials in the documentation that cover more advanced models and training cases. Additionally, we also want to highlight other frameworks in the lightning ecosystem:

      • Torchmetrics: collection of machine learning metrics written in PyTorch
      • lightning flash: High-level framework for fast prototyping, baselining and finetuning with an even simpler interface than lightning
      • lightning-bolts: Collection of SOTA pretrained models, model components, callbacks, losses and datasets for testing out ideas as fast as possible
      "},{"location":"s4_debugging_and_logging/debugging/","title":"M12 - Debugging","text":""},{"location":"s4_debugging_and_logging/debugging/#debugging","title":"Debugging","text":"

      Debugging is very hard to teach and is one of the skills that just comes with experience. That said, there are good and bad ways to debug a program. We are all probably familiar with just inserting print(...) statements everywhere in our code. It is easy and can often help narrow down where the problem happens. That said, this is not a great way of debugging when dealing with a very large codebase. You should therefore familiarize yourself with the built-in Python debugger as it may come in handy during the course.

      To invoke the built-in Python debugger you can either:

      • Set a trace directly with the Python debugger by calling

        import pdb\npdb.set_trace()\n

        anywhere you want to stop the code. Then you can use different commands (see the python_debugger_cheatsheet.pdf) to step through the code.
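
        On Python 3.7 and newer, the built-in breakpoint() function is a convenient shorthand for the same thing:

        breakpoint()  # by default drops into pdb, equivalent to import pdb; pdb.set_trace()\n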

      • If you are using an editor, then you can insert inline breakpoints (in VS Code this can be done by pressing F9) and then execute the script in debug mode (inline breakpoints can often be seen as small red dots to the left of your code). The editor should then offer some interface to allow you to step through your code. Here is a guide to using the built-in debugger in VS Code.

      • Additionally, if your program is stopping on an error and you automatically want to start the debugger where it happens, then you can simply launch the program like this from the terminal

        python -m pdb -c continue my_script.py\n
      "},{"location":"s4_debugging_and_logging/debugging/#exercises","title":"\u2754 Exercises","text":"

      Exercise files

      We here provide a script vae_mnist_bugs.py which contains a number of bugs that you need to fix to get it running. Start by going over the script and try to understand what is going on. Hereafter, try to get it running by solving the bugs. The following bugs exist in the script:

      • One device bug (will only show if running on gpu, but try to find it anyway)
      • One shape bug
      • One math bug
      • One training bug

      Some of the bugs prevent the script from even running, while others influence the training dynamics. Try to find them all. We also provide a working version called vae_mnist_working.py (but please try to find the bugs before looking at the script). Successfully debugging and running the script should produce three files:

      • orig_data.png containing images from the standard MNIST training set
      • reconstructions.png reconstructions from the model
      • generated_samples.png samples from the model

      Again, we cannot stress enough that the exercise is actually not about finding the bugs but using a proper debugger to find them.

      "},{"location":"s4_debugging_and_logging/logging/","title":"M14 - Logging","text":""},{"location":"s4_debugging_and_logging/logging/#logging","title":"Logging","text":"

      Core Module

      Logging in general refers to the practice of recording events and activities over time. Having proper logging in your applications can be extremely beneficial for a few reasons:

      • Debugging becomes easier because we can output information about the state of our program, variables, values etc. in a more structured way, helping to identify and fix bugs or unexpected behavior.

      • When we move into a more production environment, proper logging is essential for monitoring the health and performance of our application.

      • It can help in auditing, as logging info about specific activities etc. helps keep a record of who did what and when.

      • Having proper logging means that information is saved for later, which can be analysed to gain insight into the behavior of our application, such as trends.

      We are in this course going to divide the kinds of logging we can do into two categories: application logging and experiment logging. In general application logging is important regardless of the kind of application you are developing, whereas experiment logging is important for machine learning based projects where we are running experiments.

      "},{"location":"s4_debugging_and_logging/logging/#application-logging","title":"Application logging","text":"

      The most basic form of logging in Python applications is the good old print statement:

      for batch_idx, batch in enumerate(dataloader):\n    print(f\"Processing batch {batch_idx} out of {len(dataloader)}\")\n    ...\n

      This will keep a \"record\" of the events happening in our script, in this case how far we have progressed. We could even change the print to include something like batch.shape to also have information about the current data being processed.

      Using print statements is fine for small applications, but to have proper logging we need a bit more functionality than what print can offer. Python actually comes with a great logging module that defines functions for flexible logging. It is exactly this we are going to look at in this module.

      The four main components to the Python logging module are:

      1. Logger: The main entry point for using the logging system. You create instances of the Logger class to emit log messages.

      2. Handler: Defines where the log messages go. Handlers send the log messages to specific destinations, such as the console or a file.

      3. Formatter: Specifies the layout of the log messages. Formatters determine the structure of the log records, including details like timestamps and log message content.

      4. Level: Specifies the severity of a log message.

      The last point is especially important to understand. Levels essentially allow us to get rid of statements like this:

      if debug:\n    print(x.shape)\n

      where the logging is conditional on the variable debug, which we can set at runtime. Thus, it is something we can disable for users of our application (debug=False) but have enabled when we develop the application (debug=True). And it makes sense that not everything that is logged should be available to all stakeholders of a codebase. We as developers probably always want the highest level of logging, whereas users of our code need less info, and we may want to differentiate this based on the user.

      It is also important to understand the difference between logging and error handling. Error handling in Python is done using raise statements and try/except blocks like:

      def f(x: int):\n    if not isinstance(x, int):\n        raise ValueError(\"Expected an integer\")\n    return 2 * x\n\ntry:\n    f(5)\nexcept ValueError:\n    print(\"I failed to do a thing, but continuing.\")\n

      Why would we ever need to log warning, error or critical levels of information, if we are just going to handle the error anyway? The reason is that raising exceptions is meant to change the program flow at runtime, e.g. things we do not want the user to do, but that we can deal with in some way. Logging is for after a program has run, so we can inspect what went wrong. Sometimes you need one, sometimes the other, sometimes both.
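
      A minimal sketch of how the two can be combined, logging the event while still handling it (the message and fallback are illustrative):

      import logging\n\nlogger = logging.getLogger(__name__)\n\n\ndef f(x: int) -> int:\n    if not isinstance(x, int):\n        raise ValueError(\"Expected an integer\")\n    return 2 * x\n\n\ntry:\n    f(\"not an int\")\nexcept ValueError:\n    # the program flow is handled here, but we still log the event for later inspection\n    logger.warning(\"f failed on bad input, falling back to default\", exc_info=True)\n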

      "},{"location":"s4_debugging_and_logging/logging/#exercises","title":"\u2754 Exercises","text":"

      Exercises are inspired by this made with ml module on the same topic. If you need help for the exercises you can find a simple solution script here.

      1. As logging is a built-in module in Python, nothing needs to be installed. Instead start a new file called my_logger.py and start out with the following code:

        import logging\nimport sys\n\n# Create super basic logger\nlogging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\nlogger = logging.getLogger(__name__) # (1)\n\n# Logging levels (from lowest to highest priority)\nlogger.debug(\"Used for debugging your code.\")\nlogger.info(\"Informative messages from your code.\")\nlogger.warning(\"Everything works but there is something to be aware of.\")\nlogger.error(\"There's been a mistake with the process.\")\nlogger.critical(\"There is something terribly wrong and process may terminate.\")\n
        1. The built-in variable __name__ always contains the name of the script or module that is currently being run. Therefore, if we initialize our logger base using this variable, it will always be unique to our application and not conflict with loggers set up by any third-party package.

        Try running the code. Then try changing the argument level when creating the logger. What happens when you do that?

      2. Instead of sending logs to the terminal, we may also want to send them to a file. This can be beneficial, such that only warning level logs and higher are shown to the user, while debug and info logs are still saved to disk when the application is running.

        1. Try adding the following dict to your my_logger.py file:

          logging_config = {\n    \"version\": 1,\n    \"formatters\": { # (1)\n        \"minimal\": {\"format\": \"%(message)s\"},\n        \"detailed\": {\n            \"format\": \"%(levelname)s %(asctime)s [%(name)s:%(filename)s:%(funcName)s:%(lineno)d]\\n%(message)s\\n\"\n        },\n    },\n    \"handlers\": { # (2)\n        \"console\": {\n            \"class\": \"logging.StreamHandler\",\n            \"stream\": sys.stdout,\n            \"formatter\": \"minimal\",\n            \"level\": logging.DEBUG,\n        },\n        \"info\": {\n            \"class\": \"logging.handlers.RotatingFileHandler\",\n            \"filename\": Path(LOGS_DIR, \"info.log\"),\n            \"maxBytes\": 10485760,  # 10 MB\n            \"backupCount\": 10,\n            \"formatter\": \"detailed\",\n            \"level\": logging.INFO,\n        },\n        \"error\": {\n            \"class\": \"logging.handlers.RotatingFileHandler\",\n            \"filename\": Path(LOGS_DIR, \"error.log\"),\n            \"maxBytes\": 10485760,  # 10 MB\n            \"backupCount\": 10,\n            \"formatter\": \"detailed\",\n            \"level\": logging.ERROR,\n        },\n    },\n    \"root\": {\n        \"handlers\": [\"console\", \"info\", \"error\"],\n        \"level\": logging.INFO,\n        \"propagate\": True,\n    },\n}\n
          1. The formatter section determines how logs should be formatted. Here we define two separate formatters, called minimal and detailed which we can use in the next part of the code.

          2. The handlers are in charge of what should happen to the different levels of logging. console uses the minimal format we defined and sends logs to the stdout stream for messages of level DEBUG and higher. The info handler uses the detailed format and sends messages of level INFO and higher to a separate info.log file. The error handler does the same for messages of level ERROR and higher to a file called error.log.

          You will need to set the LOGS_DIR variable and also figure out how to apply this logging_config to your logger using the logging.config submodule (a minimal sketch is shown after the next exercise).

        2. When the code successfully runs, check the LOGS_DIR folder and make sure that an info.log and an error.log file were created with the appropriate content.
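
        A minimal sketch of how the configuration might be applied, assuming the logging_config dict from the previous exercise is defined in the same file (the log folder is a placeholder):

        import logging\nimport logging.config\nimport sys\nfrom pathlib import Path\n\nLOGS_DIR = Path(\"logs\")  # placeholder, must be defined before the logging_config dict\nLOGS_DIR.mkdir(parents=True, exist_ok=True)\n\n# ... logging_config dict from the previous exercise goes here ...\n\nlogging.config.dictConfig(logging_config)  # apply the configuration\nlogger = logging.getLogger(__name__)\nlogger.info(\"This message should end up both in the console and in info.log\")\n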

      3. Finally, let's try to add a little bit of style and color to our logging. For this we can use rich, which is a great package for rich text and beautiful formatting in terminals. Install rich and add the following lines to your my_logger.py script:

        from rich.logging import RichHandler\n\nlogger.root.handlers[0] = RichHandler(markup=True)  # set rich handler\n

        and try re-running the script. Hopefully you should see something beautiful in your terminal like this:

      4. (Optional) We already briefly touched on logging during the module on config files using hydra. If you want to configure hydra to use a custom logging scheme like the one we set up in the last two exercises, you can take a look at this page. In hydra you will need to provide the configuration of the logger as a config file. You can find examples of such config files here.

      "},{"location":"s4_debugging_and_logging/logging/#experiment-logging","title":"Experiment logging","text":"

      When most people think of machine learning, they think about the training phase. Being able to track and log experiments is an important part of understanding what is going on with your model while you are training. It can help you debug your model and help tweak your models to perfection. Without proper logging of experiments, it can be really hard to iterate on the model because you do not know what changes led to an increase or decrease in performance.

      The most basic logging we can do when running experiments is writing the metrics that our model is producing e.g. the loss or the accuracy to the terminal or a file for later inspection. We can then also use tools such as matplotlib for plotting the progression of our metrics over time. This kind of workflow may be enough when doing smaller experiments or working alone on a project, but there is no way around using a proper experiment tracker and visualizer when doing large scale experiments in collaboration with others. It especially becomes important when you want to compare performance between different runs.

      There exist many tools for logging your experiments, with some of them being:

      • Tensorboard
      • Comet
      • MLFlow
      • Neptune
      • Weights and Bias

      All of the frameworks offer many of the same functionalities; you can see a (biased) review here. We are going to use Weights and Bias (wandb), as it supports everything we need in this course. Additionally, it is an excellent tool for collaboration and sharing of results.

      Using the Weights and Bias (wandb) dashboard we can quickly get an overview and compare many runs over different metrics. This allows for better iteration of models and training procedure."},{"location":"s4_debugging_and_logging/logging/#exercises_1","title":"\u2754 Exercises","text":"
      1. Start by creating an account at wandb. I recommend using your GitHub account but feel free to choose what you want. When you are logged in you should get an API key of length 40. Copy this for later use (HINT: if you forgot to copy the API key, you can find it under settings), but make sure that you do not share it with anyone or leak it in any way.

        .env file

        A good place to store not only your wandb API key but also other sensitive information is in a .env file. This file should be added to your .gitignore file to make sure that it is not uploaded to your repository. You can then load the variables in the .env file using the python-dotenv package. For more information see this page.

        .env

        WANDB_API_KEY=your-api-key\nWANDB_PROJECT=my_project\nWANDB_ENTITY=my_entity\n...\n
        load_from_env_file.py
        from dotenv import load_dotenv\nload_dotenv()\nimport os\napi_key = os.getenv(\"WANDB_API_KEY\")\n

      2. Next install wandb on your laptop

        pip install wandb\n
      3. Now connect to your wandb account

        wandb login\n

        you will be asked to provide the 40-character API key. The connection to the wandb server should remain open even when you close the terminal, such that you do not have to login each time. If using wandb in a notebook you need to manually close the connection using wandb.finish().

      4. We are now ready for incorporating wandb into our code. We are going to continue development on our corrupt MNIST codebase from the previous sessions. For help, we recommend looking at this quickstart and this guide for PyTorch applications. Your first job is to alter your training script to include wandb logging, at least for the training loss.

        Solution train.py
        import click\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(f\"{lr=}, {batch_size=}, {epochs=}\")\n    wandb.init(\n        project=\"corrupt_mnist\",\n        config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n    )\n\n    model = MyAwesomeModel().to(DEVICE)\n    train_set, _ = corrupt_mnist()\n\n    train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n    loss_fn = torch.nn.CrossEntropyLoss()\n    optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n    for epoch in range(epochs):\n        model.train()\n        for i, (img, target) in enumerate(train_dataloader):\n            img, target = img.to(DEVICE), target.to(DEVICE)\n            optimizer.zero_grad()\n            y_pred = model(img)\n            loss = loss_fn(y_pred, target)\n            loss.backward()\n            optimizer.step()\n            accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n            wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n            if i % 100 == 0:\n                print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n\nif __name__ == \"__main__\":\n    train()\n
        1. After running your model, check out the webpage. Hopefully you should be able to see at least one run with something logged.

        2. Now log something other than scalar values. This could be an image, a histogram or a matplotlib figure. In all cases the logging is still going to use wandb.log, but you need extra calls to wandb.Image etc. depending on what you choose to log.

          Solution

          In this solution we log the input images to the model every 100 steps. Additionally, we also log a histogram of the gradients to inspect if the model is converging. Finally, we create a ROC curve, which is a matplotlib figure, and log that as well.

          train.py
          import click\nimport matplotlib.pyplot as plt\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\nfrom sklearn.metrics import RocCurveDisplay\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(f\"{lr=}, {batch_size=}, {epochs=}\")\n    wandb.init(\n        project=\"corrupt_mnist\",\n        config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n    )\n\n    model = MyAwesomeModel().to(DEVICE)\n    train_set, _ = corrupt_mnist()\n\n    train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n    loss_fn = torch.nn.CrossEntropyLoss()\n    optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n    for epoch in range(epochs):\n        model.train()\n\n        preds, targets = [], []\n        for i, (img, target) in enumerate(train_dataloader):\n            img, target = img.to(DEVICE), target.to(DEVICE)\n            optimizer.zero_grad()\n            y_pred = model(img)\n            loss = loss_fn(y_pred, target)\n            loss.backward()\n            optimizer.step()\n            accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n            wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n            preds.append(y_pred.detach().cpu())\n            targets.append(target.detach().cpu())\n\n            if i % 100 == 0:\n                print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n                # add a plot of the input images\n                images = wandb.Image(img[:5].detach().cpu(), caption=\"Input images\")\n                wandb.log({\"images\": images})\n\n                # add a plot of histogram of the gradients\n                grads = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None], 0)\n                wandb.log({\"gradients\": wandb.Histogram(grads)})\n\n        # add a custom matplotlib plot of the ROC curves\n        preds = torch.cat(preds, 0)\n        targets = torch.cat(targets, 0)\n\n        for class_id in range(10):\n            one_hot = torch.zeros_like(targets)\n            one_hot[targets == class_id] = 1\n            _ = RocCurveDisplay.from_predictions(\n                one_hot,\n                preds[:, class_id],\n                name=f\"ROC curve for {class_id}\",\n                plot_chance_level=(class_id == 2),\n            )\n\n        wandb.plot({\"roc\": plt})\n        # alternatively the wandb.plot.roc_curve function can be used\n\n\nif __name__ == \"__main__\":\n    train()\n
        3. Finally, we want to log the model itself. This is done by saving the model as an artifact and then logging the artifact. You can read much more about what artifacts are here but they are essentially one or more files logged together with runs that can be versioned and equipped with metadata. Log the model after training and see if you can find it in the wandb dashboard.

          Solution

          In this solution we have added the calculation of final training metrics, and when we log the model we add these as metadata to the artifact.

          train.py
          import click\nimport matplotlib.pyplot as plt\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\nfrom sklearn.metrics import RocCurveDisplay, accuracy_score, f1_score, precision_score, recall_score\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(f\"{lr=}, {batch_size=}, {epochs=}\")\n    run = wandb.init(\n        project=\"corrupt_mnist\",\n        config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n    )\n\n    model = MyAwesomeModel().to(DEVICE)\n    train_set, _ = corrupt_mnist()\n\n    train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n    loss_fn = torch.nn.CrossEntropyLoss()\n    optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n    for epoch in range(epochs):\n        model.train()\n\n        preds, targets = [], []\n        for i, (img, target) in enumerate(train_dataloader):\n            img, target = img.to(DEVICE), target.to(DEVICE)\n            optimizer.zero_grad()\n            y_pred = model(img)\n            loss = loss_fn(y_pred, target)\n            loss.backward()\n            optimizer.step()\n            accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n            wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n            preds.append(y_pred.detach().cpu())\n            targets.append(target.detach().cpu())\n\n            if i % 100 == 0:\n                print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n                # add a plot of the input images\n                images = wandb.Image(img[:5].detach().cpu(), caption=\"Input images\")\n                wandb.log({\"images\": images})\n\n                # add a plot of histogram of the gradients\n                grads = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None], 0)\n                wandb.log({\"gradients\": wandb.Histogram(grads)})\n\n        # add a custom matplotlib plot of the ROC curves\n        preds = torch.cat(preds, 0)\n        targets = torch.cat(targets, 0)\n\n        for class_id in range(10):\n            one_hot = torch.zeros_like(targets)\n            one_hot[targets == class_id] = 1\n            _ = RocCurveDisplay.from_predictions(\n                one_hot,\n                preds[:, class_id],\n                name=f\"ROC curve for {class_id}\",\n                plot_chance_level=(class_id == 2),\n            )\n\n        wandb.plot({\"roc\": plt})\n        # alternatively the wandb.plot.roc_curve function can be used\n\n    final_accuracy = accuracy_score(targets, preds.argmax(dim=1))\n    final_precision = precision_score(targets, preds.argmax(dim=1), average=\"weighted\")\n    final_recall = recall_score(targets, preds.argmax(dim=1), average=\"weighted\")\n    final_f1 = f1_score(targets, preds.argmax(dim=1), average=\"weighted\")\n\n    # first we save the model to a file then log it as an artifact\n    torch.save(model.state_dict(), \"model.pth\")\n    artifact = wandb.Artifact(\n        
name=\"corrupt_mnist_model\",\n        type=\"model\",\n        description=\"A model trained to classify corrupt MNIST images\",\n        metadata={\"accuracy\": final_accuracy, \"precision\": final_precision, \"recall\": final_recall, \"f1\": final_f1},\n    )\n    artifact.add_file(\"model.pth\")\n    run.log_artifact(artifact)\n\n\nif __name__ == \"__main__\":\n    train()\n

          After running the script you should be able to see the logged artifact in the wandb dashboard.

      5. Weights and Bias was created with collaboration in mind, so let's share our results with others.

        1. Let's create a report that you can share. Click the Create report button (upper right corner when you are in a project workspace) and include some of the graphs/plots/images that you have generated in the report.

        2. Make the report shareable by clicking the Share button and creating a view-only link. Send a link to your report to a group member, fellow student or a friend. If you have no one else to share with, you can send a link to my email nsde@dtu.dk, so I can check out your awesome work \ud83d\ude03

      6. When calling wandb.init you can provide many additional arguments. Some of the most important are

        • project
        • entity
        • job_type

        Make sure you understand what these arguments do and try them out. It will come in handy for your group work as they essentially allow multiple users to upload their own runs to the same project in wandb.

        Solution

        Relevant documentation can be found here. The project indicates what project all experiments and artifacts are logged to. We want to keep this the same for all group members. The entity is the username of the person or team who owns the project, which should also be the same for all group members. The job type is important if you have different jobs that log to the same project. A common example is one script that trains a model and another that evaluates it. By setting the job type you can easily filter the runs in the wandb dashboard.
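
        A minimal sketch of how the three arguments might be combined (the project, entity and job type names are placeholders):

        import wandb\n\nrun = wandb.init(\n    project=\"corrupt_mnist\",  # shared project for the whole group\n    entity=\"my_team\",  # placeholder team/user that owns the project\n    job_type=\"train\",  # e.g. \"train\" vs \"evaluate\" for easy filtering in the dashboard\n)\n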

      7. Wandb also comes with a built-in feature for doing hyperparameter sweeps, which can be beneficial to get a better working model. Look through the documentation on how to do a hyperparameter sweep in Wandb. You at least need to create a new file called sweep.yaml and make sure that you call wandb.log in your code on an appropriate value.

        1. Start by creating a sweep.yaml file. Relevant documentation can be found here. We recommend placing the file in a configs folder in your project.

          Solution

          The sweep.yaml file will depend on the kind of hyperparameters your model accepts as arguments and how they are passed to the model. For this solution we assume that the model accepts the hyperparameters lr, batch_size and epochs and that they are passed as --args (with hyphens) (1), e.g. this would be how we run the script:

          1. If the script you want to run hyperparameter sweeping is configured using hydra then you will need to change the default command config in your sweep.yaml file. This is because wandb uses --args to pass hyperparameters to the script, whereas hydra uses args (without the hyphen). See this page for more information.
          python train.py --lr=0.01 --batch_size=32 --epochs=10\n

          The sweep.yaml could then look like this:

          program: train.py\nname: sweepdemo\nproject: my_project  # change this\nentity: my_entity  # change this\nmetric:\n    goal: minimize\n    name: validation_loss\nparameters:\n    lr:\n        min: 0.0001\n        max: 0.1\n        distribution: log_uniform_values\n    batch_size:\n        values: [16, 32, 64]\n    epochs:\n        values: [5, 10, 15]\nrun_cap: 10\n
        2. Afterwards, you need to create a sweep using the wandb sweep command:

          wandb sweep configs/sweep.yaml\n

          this will output a sweep id that you need to use in the next step.

        3. Finally, you need to run the sweep using the wandb agent command:

          wandb agent <sweep_id>\n

          where <sweep_id> is the id of the sweep you just created. You can find the id in the output of the wandb sweep command. The reason that we first launch the sweep and then the agent is that we can have multiple agents running at the same time, parallelizing the search for the best hyperparameters. Try this out by opening a new terminal and running the wandb agent command again (with the same <sweep_id>).

        4. Inspect the sweep results in the wandb dashboard. You should see multiple new runs under the project you are logging the sweep to, corresponding to the different hyperparameters you tried. Make sure you understand the results and can answer what hyperparameters gave the best results and what hyperparameters had the largest impact on the results.

          Solution

          In the sweep dashboard you should see something like this:

          Importantly you can:

          1. Sort the runs based on what metric you are interested in, thereby quickly finding the best runs.
          2. Look at the parallel coordinates plot to see if there are any tendencies in the hyperparameters that give the best results.
          3. Look at the importance/correlation plot to see what hyperparameters have the largest impact on the results.
      8. Next we need to understand the model registry, which will be very important later on when we get to the deployment of our models. The model registry is a centralized place for storing and versioning models. Importantly, any model in the registry is immutable, meaning that once a model is uploaded it cannot be changed. This is important for reproducibility and traceability of models.

        The model registry is in general a repository of a team's trained models where ML practitioners publish candidates for production and share them with others. Figure from wandb.

        1. The model registry builds on the artifact registry in wandb. Any model that is uploaded to the model registry is stored as an artifact. This means that we first need to log our trained models as artifacts before we can register them in the model registry. Make sure you have logged at least one model as an artifact before continuing.

        2. Next, let's create a registry. Go to the model registry tab (left pane, visible from your homepage) and then click the New Registered Model button. Fill out the form and create the registry.

        3. We then need to link our artifact to the model registry we just created. We can do this in two ways: either through the web interface or through the wandb API. In the web interface, go to the artifact you want to link to the model registry and click the Link to registry button (upper right corner). If you want to use the API you need to call the link method on an artifact object.

          Solution

          To use the API, create a new script called link_to_registry.py and add the following code:

          link_to_registry.py
          import wandb\napi = wandb.Api()\nartifact_path = \"<entity>/<project>/<artifact_name>:<version>\"\nartifact = api.artifact(artifact_path)\nartifact.link(target_path=\"<entity>/model-registry/<my_registry_name>\")\nartifact.save()\n

          In the code <entity>, <project>, <artifact_name>, <version> and <my_registry_name> should be replaced with the appropriate values.

        4. We are now ready to consume our model, which can be done by downloading the artifact from the model registry. In this case we use the wandb API to download the artifact.

          import wandb\nrun = wandb.init()\nartifact = run.use_artifact('<entity>/model-registry/<my_registry_name>:<version>', type='model')\nartifact_dir = artifact.download(\"<artifact_dir>\")\nmodel = MyModel()\nmodel.load_state_dict(torch.load(\"<artifact_dir>/model.ckpt\"))\n

          Try running this code with the appropriate values for <entity>, <my_registry_name>, <version> and <artifact_dir>. Make sure that you can load the model and that it is the same as the one you trained.

        5. Each model in the registry has at least one alias, which is the version of the model. The most recently added model also receives the alias latest. Aliases are great for indicating where in the workflow a model is, e.g. whether it is a candidate for production or a model that is still being developed. Try adding an alias to one of your models in the registry, for example as in the small sketch below.
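
          A minimal sketch of adding an alias through the API (the artifact path and alias name are placeholders):

          import wandb\n\napi = wandb.Api()\nartifact = api.artifact(\"<entity>/model-registry/<my_registry_name>:latest\")\nartifact.aliases.append(\"staging\")  # placeholder alias indicating where in the workflow the model is\nartifact.save()\n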

        6. (Optional) A model always corresponds to an artifact, and artifacts can contain metadata that we can use to automate the process of registering models. We could for example imagine that at the end of each week we run a script that registers the best model from that week. Try creating a small script using the wandb API that goes over a collection of artifacts and registers the best one.

          Solution auto_register_best_model.py
          import logging\nimport operator\nimport os\n\nimport click\nimport wandb\nfrom dotenv import load_dotenv\n\nlogger = logging.getLogger(__name__)\nload_dotenv()\n\n\n@click.command()\n@click.argument(\"model-name\")\n@click.option(\"--metric_name\", default=\"accuracy\", help=\"Name of the metric to choose the best model from.\")\n@click.option(\"--higher-is-better\", default=True, help=\"Whether higher metric values are better.\")\ndef stage_best_model_to_registry(model_name, metric_name, higher_is_better) -> None:\n    \"\"\"\n    Stage the best model to the model registry.\n\n    Args:\n        model_name: Name of the model to be registered.\n        metric_name: Name of the metric to choose the best model from.\n        higher_is_better: Whether higher metric values are better.\n\n    \"\"\"\n    api = wandb.Api(\n        api_key=os.getenv(\"WANDB_API_KEY\"),\n        overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n    )\n    artifact_collection = api.artifact_collection(type_name=\"model\", name=model_name)\n\n    best_metric = float(\"-inf\") if higher_is_better else float(\"inf\")\n    compare_op = operator.gt if higher_is_better else operator.lt\n    best_artifact = None\n    for artifact in list(artifact_collection.artifacts()):\n        if metric_name in artifact.metadata and compare_op(artifact.metadata[metric_name], best_metric):\n            best_metric = artifact.metadata[metric_name]\n            best_artifact = artifact\n\n    if best_artifact is None:\n        logging.error(\"No model found in registry.\")\n        return\n\n    logger.info(f\"Best model found in registry: {best_artifact.name} with {metric_name}={best_metric}\")\n    best_artifact.link(\n        target_path=f\"{os.getenv('WANDB_ENTITY')}/model-registry/{model_name}\",\n        aliases=[\"best\", \"staging\"],\n    )\n    best_artifact.save()\n    logger.info(\"Model staged to registry.\")\n\n\nif __name__ == \"__main__\":\n    stage_best_model_to_registry()\n
      9. In the future it will be important for us to be able to run Wandb inside a docker container (together with whatever training or inference we specify). The problem here is that we cannot authenticate Wandb in the same way as in the previous exercise; it needs to happen automatically. Let's therefore look into how we can do that.

        1. First we need to generate an authentication key, or more precisely an API key. This is in general the way any service (like a docker container) can authenticate. Start by going to https://wandb.ai/home, click your profile icon in the upper right corner and then go to settings. Scroll down to the danger zone, generate a new API key and finally copy it.

        2. Next create a new docker file called wandb.docker and add the following code

          FROM python:3.10-slim\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\nRUN pip install wandb\nCOPY s4_debugging_and_logging/exercise_files/wandb_tester.py wandb_tester.py\nENTRYPOINT [\"python\", \"-u\", \"wandb_tester.py\"]\n

          please take a look at the script being copied into the image and afterwards build the docker image.

        3. When we want to run the image, we need to include an environment variable that contains the API key we generated. This will then authenticate the docker container with the wandb server:

          docker run -e WANDB_API_KEY=<your-api-key> wandb:latest\n

          Try running it and confirm that the results are uploaded to the wandb server (1).

          1. If you have stored the API key in a .env file you can use the --env-file flag instead of -e to load the environment variables from the file e.g. docker run --env-file .env wandb:latest.
      10. Feel free to experiment more with wandb as it is a great tool for logging, organizing and sharing experiments.

      11. That is the module on logging. Please note that at this point in the course you will begin to see some overlap between the different frameworks. While we mainly used hydra for configuring our Python scripts, it can also be used to save metrics and hyperparameters similar to how wandb can. A similar argument holds for dvc, which can also be used to log metrics. In our opinion wandb just offers a better experience when interacting with the results after logging. We want to stress that the combination of tools presented in this course may not be the best for all your future projects, and we recommend finding a setup that fits you. That said, each framework provides specific features that the others do not.

        Finally, we want to note that during the course we really try to showcase a lot of open source frameworks, but Wandb is not one of them. It is free to use for personal usage (with a few restrictions), but for enterprise use it does require a license. If you are eager to only work with open-source tools we highly recommend trying out MLFlow, which offers the same overall functionalities as Wandb.

        "},{"location":"s4_debugging_and_logging/profiling/","title":"M13 - Profiling","text":""},{"location":"s4_debugging_and_logging/profiling/#profilers","title":"Profilers","text":"

        Core Module

        "},{"location":"s4_debugging_and_logging/profiling/#profilers_1","title":"Profilers","text":"

        In general, profiling code is about improving the performance of your code. In this session we are going to take a somewhat narrow view of what \"performance\" means: runtime, meaning the time it takes to execute your program.

        At the bare minimum, the two questions a proper profiling of your program should be able to answer are:

        • \u201cHow many times is each method in my code called?\u201d
        • \u201cHow long does each of these methods take?\u201d

        The first question is important for prioritizing optimization. If two methods A and B have approximately the same runtime, but A is called 1000 times more often than B, we should probably spend time optimizing A over B if we want to speed up our code. The second question more or less answers itself, directly telling us which methods are expensive to call.

        Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, the first one being cProfile. cProfile is Python's built-in profiler that can give you an overview of the runtime of all the functions and methods involved in your program.
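
        Besides the command line interface used in the exercises below, cProfile can also be invoked directly from Python together with the pstats module. A minimal sketch (the profiled function and file name are placeholders):

        import cProfile\nimport pstats\n\n\ndef main():  # placeholder for the code you want to profile\n    sum(i * i for i in range(10_000))\n\n\ncProfile.run(\"main()\", \"output.prof\")  # save profiling statistics to a file\nstats = pstats.Stats(\"output.prof\")\nstats.sort_stats(\"cumulative\").print_stats(10)  # print the 10 most expensive calls\n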

        "},{"location":"s4_debugging_and_logging/profiling/#exercises","title":"\u2754 Exercises","text":"
        1. Run cProfile on the vae_mnist_working.py script. Hint: you can directly call the profiler on a script using the -m arg:

          python -m cProfile -o <output_file> -s <sort_order> myscript.py\n
        2. Try looking at the output of the profiling. Can you figure out which function took the longest to run?

        3. Can you explain the difference between tottime and cumtime? Under what circumstances do these differ and when are they equal?

        4. To get a better feeling for the profiled result we can try to visualize it. Python does not provide a native solution, but open-source solutions such as snakeviz exist. Try installing snakeviz and load a profiled run into it (HINT: snakeviz expects the run to have the file format .prof).

        5. Try optimizing the run! (Hint: The data is not stored as torch tensor). After optimizing the code make sure (using cProfile and snakeviz) that the code actually runs faster.

        "},{"location":"s4_debugging_and_logging/profiling/#pytorch-profiling","title":"PyTorch profiling","text":"

        Profiling machine learning code can become much more complex because we are suddenly beginning to mix different devices (CPU+GPU), which can (and should) overlap some of their computations. When profiling this kind of machine learning code we are often looking for bottlenecks. A bottleneck is simply the place in your code that is preventing other processes from performing their best. This is the reason that all major deep learning frameworks also include their own profilers that can help profile more complex applications.

        The image below shows a typical report using the built-in profiler in PyTorch. As the image shows, the profiler looks both at the kernel time (the time spent doing actual computations) and at transfer times such as memcpy (where we are copying data between devices). It can even analyze your code and give recommendations.

        Using the profiler can be as simple as wrapping the code that you want to profile with the torch.profiler.profile context manager:

        with torch.profiler.profile(...) as prof:\n    # code that I want to profile\n    output = model(data)\n
        "},{"location":"s4_debugging_and_logging/profiling/#exercises_1","title":"\u2754 Exercises","text":"

        Exercise files

        In these exercises we investigate the profiler that is already built into PyTorch. Note that these exercises require that you have PyTorch v1.8.1 or higher installed. You can always check which version you currently have installed by writing (in a Python interpreter):

        import torch\nprint(torch.__version__)\n

        But we always recommend updating to the latest PyTorch version for the best experience. Additionally, to display the results nicely (like snakeviz for cProfile), we are also going to use the tensorboard profiler extension:

        pip install torch_tb_profiler\n
        1. A good starting point is to look at the API for the profiler. Here the important class to look at is the torch.profiler.profile class.

        2. Let's try out a simple example (taken from here):

          1. Try to run the following code

            import torch\nimport torchvision.models as models\nfrom torch.profiler import profile, ProfilerActivity\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\nwith profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:\n    model(inputs)\n

            this will profile the forward pass of the ResNet-18 model.

          2. Running this code will produce a prof object that contains all the relevant information about the profiling. Try writing the following code:

            print(prof.key_averages().table(sort_by=\"cpu_time_total\", row_limit=10))\n

            what operation is taking most of the cpu?

          3. Try running

            print(prof.key_averages(group_by_input_shape=True).table(sort_by=\"cpu_time_total\", row_limit=30))\n

            can you see any correlation between the shape of the input and the cost of the operation?

          4. (Optional) If you have a GPU you can also profile the operations on that device:

            with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:\n    model(inputs)\n
          5. (Optional) As an alternative to using profile as an context-manager we can also use its .start and .stop methods:

            prof = profile(...)\nprof.start()\n...  # code I want to profile\nprof.stop()\n

            Try doing this on the above example.

        3. The torch.profiler.profile function takes some additional arguments. What argument would you need to set to also profile the memory usage? (Hint: this page.) Try doing it on the simple example above and make sure to sort the output by self_cpu_memory_usage.

        4. As mentioned we can also get a graphical output for better inspection. After having done a profiling try to export the results with:

          prof.export_chrome_trace(\"trace.json\")\n

          you should be able to visualize the file by going to chrome://tracing in any chromium based web browser. Can you still identify the information printed in the previous exercises from the visualizations?

        5. Running profiling on a single forward step can produce misleading results as it only provides a single sample that may depend on what background processes are running on your computer. Therefore it is recommended to profile multiple iterations of your model. In that case we need to include prof.step() to tell the profiler when we are doing a new iteration:

          with profile(...) as prof:\n    for i in range(10):\n        model(inputs)\n        prof.step()\n

          Try doing this. Is the conclusion the same regarding what operations take up most of the time? Has the percentage changed significantly?

        6. Additionally, we can also visualize the profiling results using the profiling viewer in tensorboard.

          1. Start by initializing the profile class with an additional argument:

            from torch.profiler import profile, tensorboard_trace_handler\nwith profile(..., on_trace_ready=tensorboard_trace_handler(\"./log/resnet18\")) as prof:\n    ...\n

            Try running a profiling (using a couple of iterations) and make sure that a file with the .pt.trace.json extension is produced in the log/resnet18 folder.

          2. Now try launching tensorboard

            tensorboard --logdir=./log\n

            and open the page http://localhost:6006/#pytorch_profiler, where you should hopefully see an image similar to the one below:

            Image credit

            Try poking around in the interface.

          3. Tensorboard has a nice feature for comparing runs under the diff tab. Try redoing a profiling run but use model = models.resnet34() instead. Load up both runs and try to look at the diff between them.

        7. As a final exercise, try to use the profiler on the vae_mnist_working.py file from the previous module on debugging, where you profile a whole training run (not only the forward pass). What is the bottleneck during the training? Is it still the forward pass or is it something else? Can you improve the code somehow based on the information from the profiler?

        This ends the module on profiling. If you want to go into more detail on this topic we can recommend looking into line_profiler and kernprof. A downside of using Python's cProfile is that it can only profile at a function/module level, which is great for identifying hotspots in your code. However, sometimes the cause of a computational hotspot is a single line of code in a function, which will not be caught by cProfile. An example would be a simple indexing operation such as a[idx] = b, which for large arrays and non-sequential indexes is really expensive. For these cases line_profiler and kernprof are excellent tools to have in your toolbox; a small sketch of their workflow is shown below. Additionally, if you do not like cProfile we can also recommend py-spy, which is another open-source profiling tool for Python programs.
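
        As a small sketch of the line_profiler workflow (assuming the package is installed): decorate the function you want line-level statistics for with @profile and run the script through kernprof.

        @profile  # injected into builtins by kernprof at runtime, no import needed\ndef slow_function(a):\n    for idx in range(len(a)):\n        a[idx] = a[idx] * 2  # per-line timings will reveal hotspots like this one\n    return a\n

        The script could then be run with kernprof -l -v my_script.py to print line-by-line timings of the decorated function.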

        "},{"location":"s5_continuous_integration/","title":"Continuous Integration","text":"

        Slides

        • Learn how to write unit tests that cover both data and models in your ML pipeline.

          M16: Unit testing

        • Learn how to implement continuous integration using Github actions such that tests are automatically executed on code changes.

          M17: Github Actions

        • Learn how to use pre-commit to ensure that code that is not up to standard does not get committed.

          M18: Pre-commit

        • Learn how to implement continuous machine learning pipelines in Github actions.

          M19: Continuous Machine Learning

        Continuous integration is a sub-discipline of the general field of continuous X. Continuous X is one of the core elements of modern DevOps, and by extension MLOps. Continuous X assumes that we have a (long) developer pipeline (see image below) where we want to make some changes to our code, e.g:

        • Update our training data or data processing
        • Update our model architecture
        • Something else...

        Basically, any code change we make is expected to have an influence on the final result. The problem with making changes at the start of our pipeline is that we want the change to propagate all the way through to the end of the pipeline.

        Image credit

        This is where continuous X comes into play. The word continuous here refers to the fact that the pipeline should continuously be updated as we make code changes. You can also choose to think of this as the automation of processes. The X then covers the fact that the process we need to go through to automate steps in the pipeline depends on where we are in the pipeline, e.g. the tools needed to do continuous integration are different from the tools needed to do continuous delivery.

        In this session, we are going to focus on continuous integration (CI). As indicated in the image above, continuous integration usually takes care of the first part of the developer pipeline, which has to do with the code base, code building and code testing. This is a paramount step in automation, as we would rather catch bugs at the beginning of our pipeline than at the end.

        Learning objectives

        The learning objectives of this session are:

        • Being able to write unit tests that cover both data and models in your ML pipeline
        • Know how to implement continuous integration using Github actions such that tests are automatically executed on code changes
        • Can use pre-commit to ensure that code that is not up to standard does not get committed
        • Know how to implement continuous integration for continuous building of containers
        • Basic knowledge of how machine learning processes can be implemented in a continuous way
        "},{"location":"s5_continuous_integration/cml/","title":"M19 - Continuous Machine Learning","text":""},{"location":"s5_continuous_integration/cml/#continuous-machine-learning","title":"Continuous Machine Learning","text":"

        The continuous integration we have looked at until now is what we can consider \"classical\" continuous integration, which has its roots in DevOps and not MLOps. While the test that we have written and the containers we have developed in the previous session have been about machine learning, everything we have done translates completely to how it would be done if we had developed any other application that did not include machine learning.

        In this session, we are going to change gears and look at continuous machine learning (CML). As the name may suggest, we are now focusing on automating actual machine learning processes. The reason for doing this is the same as with continuous integration, namely that we often have a bunch of checks that we want our newly trained model to pass before we trust it to be ready for deployment. Writing unit tests ensures that the code that we use for training our model is not broken, but there exist other failure modes of a machine learning pipeline:

        • Did I train on the correct data?
        • Did my model converge at all?
        • Did a metric that I care about improve?
        • Did I overfit?
        • Did I underfit?
        • ...

        All these questions are questions that we can answer by writing tests that are specific to machine learning. In this session, we are going to look at how we can begin to use Github Actions to automate these tests.

        "},{"location":"s5_continuous_integration/cml/#mlops-maturity-model","title":"MLOps maturity model","text":"

        Before getting started with the exercises, let's first take a side step and look at what is called the MLOps maturity model. The reason here is to get a better understanding of when continuous machine learning is relevant. The main idea behind the MLOps maturity model is to help organizations understand where they are in their machine learning operations journey and what the next logical steps are. The model is divided into five stages:

        Image credit

        Level 0

        At this level, organizations are doing machine learning in an ad-hoc manner. There is no standardization, no version control, no testing, and no monitoring.

        Level 1

        At this level, organizations have started to implement DevOps practices in their machine learning workflows. They have started to use version control and may have adopted basic continuous integration practices.

        Level 2

        At this level, organizations have started to standardize the training process and tackle the problem of creating reproducible experiments. Centralization of model artifacts and metadata is common at this level. They have started to implement model versioning and model registry practices.

        Level 3

        At this level, organizations have started to implement continuous integration and continuous deployment practices. They have started to automate the testing of their models and have started to monitor their models in production.

        Level 4

        At this level, organizations have started to implement continuous machine learning practices. They have started to automate the training, evaluation, and deployment of their models. They have started to implement automated retraining and model updates.

        The MLOps maturity model tells us that continuous machine learning is the highest form of maturity in MLOps. It is the stage where we have automated the entire machine learning pipeline and the cases we will be going through in the exercises are therefore some of the last steps in the MLOps maturity model.

        "},{"location":"s5_continuous_integration/cml/#exercises","title":"\u2754 Exercises","text":"

        In the following exercises, we are going to look at two different cases where we can use continuous machine learning. The first one is a simple case where we are automatically going to trigger some workflow (like training of a model) whenever we make changes to our data. This is a very common use case in machine learning where we have a data pipeline that is continuously updating our data. The second case is connected to staging and deploying models. In this case, we are going to look at how we can automatically do further processing of our model whenever we push a new model to our repository.

        1. For the first set of exercises, we are going to rely on the cml framework by iterative.ai, which is a framework that is built on top of GitHub Actions. The figure below describes the overall process using the cml framework. It should be clear that it is the very same process that we go through in the other continuous integration sessions: push code -> trigger GitHub Actions -> do stuff. The new part in this session is that we are only going to trigger the workflow whenever data changes.

          Image credit

          1. If you have not already created a dataset class for the corrupted MNIST data, start by doing that. Essentially, it is a class that should inherit from torch.utils.data.Dataset and should implement a __getitem__ and __len__ method.

            Solution dataset.py
            from __future__ import annotations\n\nimport os\nfrom typing import TYPE_CHECKING\n\nimport torch\nfrom torch import Tensor\nfrom torch.utils.data import Dataset\n\nif TYPE_CHECKING:\n    import torchvision.transforms.v2 as transforms\n\n\nclass MnistDataset(Dataset):\n    \"\"\"MNIST dataset for PyTorch.\n\n    Args:\n        data_folder: Path to the data folder.\n        train: Whether to load training or test data.\n        img_transform: Image transformation to apply.\n        target_transform: Target transformation to apply.\n    \"\"\"\n\n    name: str = \"MNIST\"\n\n    def __init__(\n        self,\n        data_folder: str = \"data\",\n        train: bool = True,\n        img_transform: transforms.Transform | None = None,\n        target_transform: transforms.Transform | None = None,\n    ) -> None:\n        super().__init__()\n        self.data_folder = data_folder\n        self.train = train\n        self.img_transform = img_transform\n        self.target_transform = target_transform\n        self.load_data()\n\n    def load_data(self) -> None:\n        \"\"\"Load images and targets from disk.\"\"\"\n        images, target = [], []\n        if self.train:\n            nb_files = len([f for f in os.listdir(self.data_folder) if f.startswith(\"train_images\")])\n            for i in range(nb_files):\n                images.append(torch.load(f\"{self.data_folder}/train_images_{i}.pt\"))\n                target.append(torch.load(f\"{self.data_folder}/train_target_{i}.pt\"))\n        else:\n            images.append(torch.load(f\"{self.data_folder}/test_images.pt\"))\n            target.append(torch.load(f\"{self.data_folder}/test_target.pt\"))\n        self.images = torch.cat(images, 0)\n        self.target = torch.cat(target, 0)\n\n    def __getitem__(self, idx: int) -> tuple[Tensor, Tensor]:\n        \"\"\"Return image and target tensor.\"\"\"\n        img, target = self.images[idx], self.target[idx]\n        if self.img_transform:\n            img = self.img_transform(img)\n        if self.target_transform:\n            target = self.target_transform(target)\n        return img, target\n\n    def __len__(self) -> int:\n        \"\"\"Return the number of images in the dataset.\"\"\"\n        return self.images.shape[0]\n
          2. Then let's create a function that reports basic statistics, such as the number of training samples and the number of test samples, and generates figures of sample images in the dataset and of the distribution of the classes in the dataset. This function should be called dataset_statistics and should take a path to the dataset as input.

            Solution dataset.py
            import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom mnist_dataset import MnistDataset\nfrom utils import show_image_and_target\n\n\n@click.command()\n@click.option(\"--datadir\", default=\"data\", help=\"Path to the data directory\")\ndef dataset_statistics(datadir: str) -> None:\n    \"\"\"Compute dataset statistics.\"\"\"\n    train_dataset = MnistDataset(data_folder=datadir, train=True)\n    test_dataset = MnistDataset(data_folder=datadir, train=False)\n    print(f\"Train dataset: {train_dataset.name}\")\n    print(f\"Number of images: {len(train_dataset)}\")\n    print(f\"Image shape: {train_dataset[0][0].shape}\")\n    print(\"\\n\")\n    print(f\"Test dataset: {test_dataset.name}\")\n    print(f\"Number of images: {len(test_dataset)}\")\n    print(f\"Image shape: {test_dataset[0][0].shape}\")\n\n    show_image_and_target(train_dataset.images[:25], train_dataset.target[:25], show=False)\n    plt.savefig(\"mnist_images.png\")\n    plt.close()\n\n    train_label_distribution = torch.bincount(train_dataset.target)\n    test_label_distribution = torch.bincount(test_dataset.target)\n\n    plt.bar(torch.arange(10), train_label_distribution)\n    plt.title(\"Train label distribution\")\n    plt.xlabel(\"Label\")\n    plt.ylabel(\"Count\")\n    plt.savefig(\"train_label_distribution.png\")\n    plt.close()\n\n    plt.bar(torch.arange(10), test_label_distribution)\n    plt.title(\"Test label distribution\")\n    plt.xlabel(\"Label\")\n    plt.ylabel(\"Count\")\n    plt.savefig(\"test_label_distribution.png\")\n    plt.close()\n\n\nif __name__ == \"__main__\":\n    dataset_statistics()\n
          3. Next, we are going to implement a GitHub actions workflow that only activates when we make changes to our data. Create a new workflow file (call it cml_data.yaml) and make sure it only activates on push/pull-request events when data/ changes. Relevant documentation

            Solution

            The secret is to use the paths keyword in the workflow file. We here specify that the workflow should only trigger when the .dvc folder or any file with the .dvc extension changes, which is the case when we update our data and call dvc add data/.

            name: DVC Workflow\n\non:\n  pull_request:\n    branches:\n    - main\n    paths:\n    - '**/*.dvc'\n    - '.dvc/**'\n
          4. The next step is to implement steps in our workflow that do something when data changes. This is the reason why we created the dataset_statistics function. Implement a workflow that:

            1. Checks out the code
            2. Sets up Python
            3. Installs dependencies
            4. Downloads the data
            5. Runs the dataset_statistics function on the data
            Solution

            This solution assumes that data is stored in a GCP bucket and that the credentials are stored in a secret called GCP_SA_KEY. If this is not the case for you, you need to adjust the workflow accordingly with the correct way to pull the data.

            jobs:\n  run_data_checker:\n    runs-on: ubuntu-latest\n    steps:\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n        cache: 'pip'\n        cache-dependency-path: setup.py\n\n    - name: Install dependencies\n      run: |\n        make dev_requirements\n        pip list\n\n    - name: Auth with GCP\n      uses: google-github-actions/auth@v2\n      with:\n        credentials_json: ${{ secrets.GCP_SA_KEY }}\n\n    - name: Pull data\n      run: |\n        dvc pull --no-run-cache\n\n    - name: Check data statistics\n      run: |\n        python dataset_statistics.py\n
          5. Let's make sure that the workflow works as expected for now. Create a new branch and either add or remove a file in the data/ folder. Then run

            dvc add data/\ngit add data.dvc\ngit commit -m \"Update data\"\ngit push\n

            to commit the changes to data. Open a pull request with the branch and make sure that the workflow activates and runs as expected.

          6. Let's now add the cml framework such that we can comment the results of the dataset_statistics function in the pull request automatically. Look at the getting started guide for help on how to do this. You will need to write all the output of the dataset_statistics function to a file called report.md and then use the cml comment create command to create a comment in the pull request with the content of the file.

            Solution
            jobs:\n  dataset_statistics:\n    runs-on: ubuntu-latest\n    steps:\n    # ...all the previous steps\n    - name: Check data statistics & generate report\n      run: |\n        python src/example_mlops/data.py > data_statistics.md\n        echo '![](./mnist_images.png \"MNIST images\")' >> data_statistics.md\n        echo '![](./train_label_distribution.png \"Train label distribution\")' >> data_statistics.md\n        echo '![](./test_label_distribution.png \"Test label distribution\")' >> data_statistics.md\n\n    - name: Setup cml\n      uses: iterative/setup-cml@v2\n\n    - name: Comment on PR\n      env:\n        REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n      run: |\n        cml comment create data_statistics.md --watermark-title=\"Data Checker\" # (1)!\n
            1. The --watermark-title flag is used to watermark the comment created by cml. It is to make sure that no new comments are created every time the workflow runs.
          7. Make sure that the workflow works as expected. You should see a comment created by github-actions (bot) like this if you have done everything correctly:

          8. (Optional) Feel free to add more checks to the workflow. For example, you could add a check that runs a small baseline model on the updated data and checks that the model converges. This is a very common sanity check that is done in machine learning pipelines.
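            As a sketch of what such a sanity check could look like (assuming the MnistDataset class from the first exercise and 28x28 images; adjust names and thresholds to your own project), you could train a tiny linear baseline for a few steps and fail the workflow if it does not clearly beat chance level:

            import torch\nfrom torch import nn\nfrom torch.utils.data import DataLoader\n\nfrom mnist_dataset import MnistDataset  # the dataset class from the first exercise\n\n\ndef baseline_sanity_check(datadir: str = \"data\", max_steps: int = 200) -> float:\n    \"\"\"Train a tiny linear baseline for a few steps and return its training accuracy.\"\"\"\n    dataset = MnistDataset(data_folder=datadir, train=True)\n    loader = DataLoader(dataset, batch_size=64, shuffle=True)\n    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))\n    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)\n    criterion = nn.CrossEntropyLoss()\n\n    correct, total = 0, 0\n    for step, (img, target) in enumerate(loader):\n        if step == max_steps:\n            break\n        optimizer.zero_grad()\n        preds = model(img.float())\n        criterion(preds, target).backward()\n        optimizer.step()\n        correct += (preds.argmax(dim=1) == target).sum().item()\n        total += target.numel()\n    return correct / total\n\n\nif __name__ == \"__main__\":\n    acc = baseline_sanity_check()\n    assert acc > 0.5, f\"Baseline accuracy {acc:.2f} is suspiciously low, check the updated data\"\n    print(f\"Baseline accuracy: {acc:.2f}\")\n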

        2. For the second set of exercises, we are going to look at how to automatically run further testing of our models whenever we add them to our model registry. For that reason, do not continue with this set of exercises before you have completed the exercises on the model registry in this module.

          The model registry is in general a repository of a team's trained models where ML practitioners publish candidates for production and share them with others. Figure from wandb.

          1. The first step is to create a team in our Weights and Biases account. Some of these more advanced features are only available for teams, however every user is allowed to create one team for free. Go to your Weights and Biases account and create a team (the option should be on the left side of the UI). Give the team a name and select W&B cloud storage.

          2. Now we need to generate a personal access token that can link our Weights and Biases account to our GitHub account. Go to this page and generate a new token. You can also find the page by clicking your profile icon in the upper right corner of GitHub and selecting Settings, then Developer settings, then Personal access tokens and finally choosing either Tokens (classic) or Fine-grained tokens (the safer option, and also what the link points to).

            Give it a name, set what repositories it should have access to and select the permissions you want it to have. If you choose to create a Fine-grained token it needs the contents:write permission, while a Tokens (classic) token needs the repo permission. After you have created the token, copy it and save it somewhere safe.

          3. Go to the settings of your newly created team: https://wandb.ai/teamname/settings and scroll down to the Team secrets section. Here add the token you just created as a secret with the name GITHUB_ACTIONS_TOKEN. WANDB will now be able to use this token to trigger actions in your repository.

          4. On the same settings page, scroll down to the Webhooks settings. Click the New webhook button and fill in the following information:

            • Name: github_actions_dispatch
            • URL: https://api.github.com/repos/<owner>/<repo>/dispatches
            • Access token: GITHUB_ACTIONS_TOKEN
            • Secret: leave empty

            Here you need to replace <owner> and <repo> with your own information. The /dispatches endpoint is a special endpoint that all GitHub Actions workflows can listen to. Thus, if you ever want to set up a webhook in some other framework that should trigger a GitHub action, you can use this endpoint; a small sketch of how to call the endpoint yourself is shown below.
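            For example, a minimal sketch of hitting the /dispatches endpoint yourself with the requests library (handy for testing the GitHub side of the setup before the W&B webhook is wired up; the GITHUB_ACTIONS_TOKEN environment variable and the payload keys mirroring the automation below are assumptions):

            import os\n\nimport requests\n\n\ndef trigger_staged_model_event(owner: str, repo: str, artifact_version_string: str) -> None:\n    \"\"\"Send a repository_dispatch event that mimics the payload the W&B webhook would send.\"\"\"\n    response = requests.post(\n        f\"https://api.github.com/repos/{owner}/{repo}/dispatches\",\n        headers={\n            \"Accept\": \"application/vnd.github+json\",\n            \"Authorization\": f\"Bearer {os.environ['GITHUB_ACTIONS_TOKEN']}\",\n        },\n        json={\n            \"event_type\": \"staged_model\",\n            \"client_payload\": {\"artifact_version_string\": artifact_version_string},\n        },\n        timeout=10,\n    )\n    response.raise_for_status()  # the API returns 204 No Content on success\n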

          5. Next, navigate to your model registry. It should hopefully contain at least one registry with at least one model registered. If not, go back to the previous module and do that.

          6. When you have a model in your registry, click on the View details button. Then click the New automation button. On the first page, select that you want to trigger the automation when an alias is added to a model version, set that alias to staging and select the action type to be Webhook. On the next page, select the github_actions_dispatch webhook that you just created and add this as the payload:

            {\n    \"event_type\": \"staged_model\",\n    \"client_payload\":\n    {\n        \"event_author\": \"${event_author}\",\n        \"artifact_version\": \"${artifact_version}\",\n        \"artifact_version_string\": \"${artifact_version_string}\",\n        \"artifact_collection_name\": \"${artifact_collection_name}\",\n        \"project_name\": \"${project_name}\",\n        \"entity_name\": \"${entity_name}\"\n    }\n}\n

            Finally, on the next page give the automation a name and click Create automation.

            Make sure you understand overall what is happening here.

            Solution

            The automation is set up to trigger a webhook whenever the alias staging is added to a model version. The webhook is set up to trigger a Github action workflow that listens to the /dispatches endpoint and has the event type staged_model. The payload that is sent to the webhook contains information about the model that was staged.

          7. We are now ready to create the GitHub Actions workflow that listens to the /dispatches endpoint and triggers whenever a model is staged. Create a new workflow file (called stage_model.yaml) and make sure it only activates on the staged_model event. Hint: relevant documentation

            Solution
            name: Check staged model\n\non:\n  repository_dispatch:\n    types: staged_model\n
          8. Next, we need to implement the steps in our workflow that do something when a model is staged. The payload that is sent to the webhook contains information about the model that was staged. Implement a workflow that:

            1. Identifies the model that was staged
            2. Sets an environment variable with the corresponding artifact path
            3. Outputs the model name
            Solution
            jobs:\n  identify_event:\n    runs-on: ubuntu-latest\n    outputs:\n      model_name: ${{ steps.set_output.outputs.model_name }}\n    steps:\n      - name: Check event type\n        run: |\n          echo \"Event type: repository_dispatch\"\n          echo \"Payload Data: ${{ toJson(github.event.client_payload) }}\"\n\n      - name: Setting model environment variable and output\n        id: set_output\n        run: |\n          echo \"model_name=${{ github.event.client_payload.artifact_version_string }}\" >> $GITHUB_OUTPUT\n
          9. We now need to write a script that can be executed on our staged model. In this case, we are going to run some performance tests on it to check that it is fast enough for deployment. Therefore, do the following:

            1. In a tests/performancetests folder, create a new file called test_model.py

            2. Implement a test that loads the model from a wandb artifact path, e.g. entity/project/artifact_name:version, and runs it on a random input. Importantly, the artifact path should be read from an environment variable called MODEL_NAME.

            3. The test should assert that the model can do 100 predictions in less than X amount of time

            4. Solution

              In this solution we assume that 4 environment variables are set: WANDB_API_KEY, WANDB_ENTITY, WANDB_PROJECT and MODEL_NAME.

              test_model.py
              import os\nimport time\n\nimport torch\nimport wandb\n\nfrom my_project.models import MyModel\n\n\ndef load_model(artifact_path: str, logdir: str = \"models\"):\n    \"\"\"Download the artifact from wandb and load the model checkpoint.\"\"\"\n    api = wandb.Api(\n        api_key=os.getenv(\"WANDB_API_KEY\"),\n        overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n    )\n    artifact = api.artifact(artifact_path)\n    artifact.download(root=logdir)\n    file_name = artifact.files()[0].name\n    return MyModel.load_from_checkpoint(f\"{logdir}/{file_name}\")\n\n\ndef test_model_speed():\n    \"\"\"Check that the model can do 100 predictions in less than a second.\"\"\"\n    model = load_model(os.getenv(\"MODEL_NAME\"))\n    start = time.time()\n    for _ in range(100):\n        model(torch.rand(1, 1, 28, 28))\n    end = time.time()\n    assert end - start < 1\n
            5. Let's now add another job that calls the script we just wrote. It needs to:

              • Setup the correct environment variables
              • Checkout the code
              • Setup Python
              • Install dependencies
              • Run the test

              which is very similar to the kind of jobs we have written before.

              Solution
              jobs:\n  identify_event:\n    ...\n  test_model:\n    runs-on: ubuntu-latest\n    needs: identify_event\n    env:\n      WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}\n      WANDB_ENTITY: ${{ secrets.WANDB_ENTITY }}\n      WANDB_PROJECT: ${{ secrets.WANDB_PROJECT }}\n      MODEL_NAME: ${{ needs.identify_event.outputs.model_name }}\n    steps:\n    - name: Echo model name\n      run: |\n        echo \"Model name: $MODEL_NAME\"\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n        cache: 'pip'\n        cache-dependency-path: setup.py\n\n    - name: Install dependencies\n      run: |\n        pip install -r requirements.txt\n        pip list\n\n    - name: Test model\n      run: |\n        pytest tests/performancetests/test_model.py\n
            6. Finally, we are going to assume in this setup that if the model gets this far then it is ready for deployment. We are therefore going to add a final job that will add a new alias to the model called production. Here is some relevant Python code that can be used to add the alias:

              import click\nimport os\nimport wandb\n\n@click.command()\n@click.argument(\"artifact-path\")\n@click.option(\n    \"--aliases\", \"-a\", multiple=True, default=[\"staging\"], help=\"List of aliases to link the artifact with.\"\n)\ndef link_model(artifact_path: str, aliases: list[str]) -> None:\n    \"\"\"\n    Stage a specific model to the model registry.\n\n    Args:\n        artifact_path: Path to the artifact to stage.\n            Should be of the format \"entity/project/artifact_name:version\".\n        aliases: List of aliases to link the artifact with.\n\n    Example:\n        model_management link-model entity/project/artifact_name:version -a staging -a best\n\n    \"\"\"\n    if artifact_path == \"\":\n        click.echo(\"No artifact path provided. Exiting.\")\n        return\n\n    api = wandb.Api(\n        api_key=os.getenv(\"WANDB_API_KEY\"),\n        overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n    )\n    _, _, artifact_name_version = artifact_path.split(\"/\")\n    artifact_name, _ = artifact_name_version.split(\":\")\n\n    artifact = api.artifact(artifact_path)\n    artifact.link(target_path=f\"{os.getenv('WANDB_ENTITY')}/model-registry/{artifact_name}\", aliases=aliases)\n    artifact.save()\n    click.echo(f\"Artifact {artifact_path} linked to {aliases}\")\n

              for example, you can run this script with the following command:

              python link_model.py entity/project/artifact_name:version -a staging -a production\n

              Implement a final job that calls this script and adds the production alias to the model.

              Solution
              jobs:\n  identify_event:\n    ...\n  test_model:\n    ...\n  add_production_alias:\n    runs-on: ubuntu-latest\n    needs: identify_event\n    env:\n      WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}\n      WANDB_ENTITY: ${{ secrets.WANDB_ENTITY }}\n      WANDB_PROJECT: ${{ secrets.WANDB_PROJECT }}\n      MODEL_NAME: ${{ needs.identify_event.outputs.model_name }}\n    steps:\n    - name: Echo model name\n      run: |\n        echo \"Model name: $MODEL_NAME\"\n\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n        cache: 'pip'\n        cache-dependency-path: setup.py\n\n    - name: Install dependencies\n      run: |\n        pip install -r requirements.txt\n        pip list\n\n    - name: Add production alias\n      run: |\n        python link_model.py $MODEL_NAME -a production\n
            7. Finally, make sure the workflow works as expected. To try it out again and again for testing purposes, you can just manually add and then delete the staging alias on any model version in the model registry (or toggle it programmatically as sketched below).
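              A minimal sketch of toggling the alias from Python with the public wandb API (assuming your registry entry lives at <entity>/model-registry/<collection>:v0; adjust the path, and note that alias handling may differ between wandb versions):

              import wandb\n\n# hypothetical registry path, replace with your own entity and collection name\napi = wandb.Api()\nartifact = api.artifact(\"<entity>/model-registry/<collection>:v0\")\n\nif \"staging\" in artifact.aliases:\n    artifact.aliases.remove(\"staging\")  # reset so the automation can be triggered again\nelse:\n    artifact.aliases.append(\"staging\")  # adding the alias triggers the automation\nartifact.save()\n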

            8. (Optional) Consider adding more checks to the workflow. For example, you could add a step that checks if the model is too large for deployment (see the sketch below), runs some further evaluation scripts, or checks if the model is robust to adversarial attacks. Only the imagination sets the limits here.
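              A sketch of such a size check could look like the following (MyModel and the thresholds are assumptions; pick limits that match your deployment target):

              import os\n\nimport torch\n\nfrom my_project.models import MyModel  # hypothetical model class, same as in the performance test\n\n\ndef test_model_size():\n    \"\"\"Check that the model is small enough to deploy.\"\"\"\n    model = MyModel()\n    n_params = sum(p.numel() for p in model.parameters())\n    assert n_params < 10_000_000, f\"Model has {n_params} parameters, more than expected\"\n\n    torch.save(model.state_dict(), \"tmp_model.pt\")\n    size_mb = os.path.getsize(\"tmp_model.pt\") / 1e6\n    os.remove(\"tmp_model.pt\")\n    assert size_mb < 100, f\"Model checkpoint is {size_mb:.1f} MB, too large for deployment\"\n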

            9. (Optional) If you have gotten this far, consider combining principles from the two exercises. Here is an idea: we use the workflow from the second exercise to trigger a workflow that checks a staged model for performance. We then use the cml framework to automatically create a pull request, e.g. use cml pr create instead of cml comment create, with the results of the performance test. If we are happy with the performance, we can then approve that pull request and the production alias is added to the model. This is a better workflow because it allows for human intervention before the model is deployed.

            "},{"location":"s5_continuous_integration/cml/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. What is the difference between continuous integration and continuous machine learning?

                Solution

                There are three key differences between continuous integration and continuous machine learning:

                • Scope: CI focuses on integrating and testing software code, while CML encompasses the entire lifecycle of machine learning models, including data handling, model training, evaluation, deployment, and monitoring.
                • Automation Focus: CI automates code testing and integration, whereas CML automates the training, evaluation, deployment, and monitoring of machine learning models.
                • Feedback Mechanisms: CI primarily uses automated tests to provide feedback on code quality. CML uses performance metrics from deployed models to provide feedback and trigger retraining or model updates.
              2. Imagine you get hired in the pharmaceutical industry and are asked to develop a machine learning pipeline that can automatically sort out which drugs are safe and which are not. What level of the MLOps maturity model would you strive to reach?

                Solution

                There is really no right or wrong answer here, but in most cases we would actually not aim for level 4. The reason is that the consequences of a bad model in this case can be severe. Therefore, we would probably not want automated retraining and model updates, which is what level 4 is about. Instead, we would probably aim for level 3 where we have automated testing and monitoring of our models but there is still human oversight in the process.

              This ends the module on continuous machine learning. As we have hopefully convinced you, it is only the imagination that sets the limits for what you can use Github actions for in your machine learning pipeline. However, we do want to stress that it is important that human oversight is always present in the process. Automation is great, but it should never replace human judgement. This is especially true in machine learning where the consequences of a bad model can be severe if it is used in critical decision making.

              Finally, if you have completed the exercises on using the cloud, consider checking out the cml runner launch command that allows you to run your workflows on cloud resources instead of the GitHub Actions runners.

              "},{"location":"s5_continuous_integration/github_actions/","title":"M17 - Github Actions","text":""},{"location":"s5_continuous_integration/github_actions/#github-actions","title":"GitHub actions","text":"

              Core Module

              With the tests established in the previous module, we are now ready to move on to implementing some continuous integration in our pipeline. As you have probably already realized, testing your code locally can be cumbersome, because

              • You need to run it often to make sure to catch bugs early on
              • If you want to have high code coverage of your code base, you will need many tests that take a long time to run

              For these reasons, we want to automate the testing such that it is done every time we push to our repository. If we combine this with only pushing to branches and then only merging these branches whenever all automated testing has passed, our code should be fairly safe against unwanted bugs (assuming your tests cover your code well).

              "},{"location":"s5_continuous_integration/github_actions/#github-actions_1","title":"GitHub actions","text":"

              GitHub actions are the continuous integration solution that GitHub provides. Each of your repositories gets 2,000 minutes of free testing per month which should be more than enough for the scope of this course (and probably all personal projects you do). Getting GitHub actions set up in a repository may seem complicated at first, but workflow files that you create for one repository can more or less be reused for any other repository that you have.

              Let's take a look at how a GitHub workflow file is organized:

              • Initially, we start by giving the workflow a name
              • Next, we specify on what events the workflow should be triggered. This includes both the action (pull request, push, etc.) and on what branches it should activate
              • Next, we list the jobs that we want to do. Jobs are by default executed in parallel but can also be dependent on each other
              • In the runs-on attribute, we can specify which operating system we want the workflow to run on.
              • Finally, we have the steps. This is where we specify the actual commands that should be run when the workflow is executed.

              Image credit"},{"location":"s5_continuous_integration/github_actions/#exercises","title":"\u2754 Exercises","text":"
              1. Start by creating a .github folder in the root of your repository. Add a sub-folder to that called workflows.

              2. Go over this page that explains how to do automated testing of Python code in GitHub actions. You do not have to understand everything, but at least get a feeling of what a workflow file should look like.

              3. We have provided a workflow file called tests.yaml that should run your tests for you. Place this file in the .github/workflows/ folder. The workflow file consists of three steps

                • First, a Python environment is initiated (in this case Python 3.11)

                • Next all dependencies required to run the test are installed

                • Finally, pytest is called and our tests will be run

                Go over the file and try to understand the overall structure and syntax of the file.

                tests.yaml tests.yaml
                name: \"Run tests\"\n\non:\n  push:\n    branches: [ master, main ]\n  pull_request:\n    branches: [ master, main ]\n\njobs:\n  build:\n\n    runs-on: ubuntu-latest\n\n    steps:\n    - name: Checkout\n      uses: actions/checkout@v4\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n    - name: Install dependencies\n      run: |\n        python -m pip install --upgrade pip\n        pip install -r requirements.txt\n        pip install -r requirements_tests.txt\n    - name: Test with pytest\n      run: |\n        pytest -v\n
              4. For the script to work you need to define the requirements.txt and requirements_tests.txt. The first file should contain all packages required to run your code. The second file contains all additional packages required to run the tests. In your simple case, it may very well be that the second file is empty, however, sometimes additional packages are used for testing that are not strictly required for the scripts to run.

              5. Finally, try pushing the changes to your repository. Hopefully, your tests should just start, and you will after some time see a green check mark next to the hash of the commit. Also, try to inspect the Actions tab where you can see the history of actions run.

              6. Normally we develop code on only one operating system and just hope that it will work on other operating systems. However, continuous integration enables us to automatically test on other systems than the one we are using.

                1. The provided tests.yaml only runs on one operating system. Which one?

                2. Alter the file such that it executes the test on the two other main operating systems that exist. You can find information on available operating systems also called runners here

                  Solution

                  We can \"parametrize\" of script to run on different operating systems by using the strategy attribute. This attribute allows us to define a matrix of values that the workflow will run on. The following code will run the tests on ubuntu-latest, windows-latest, and macos-latest:

                  tests.yaml
                  jobs:\n  build:\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n
                3. Can you also figure out how to run the tests using different Python versions?

                  Solution

                  Just add another line to the strategy attribute that specifies the Python versions and use the value in the setup Python action. The following code will run the tests on Python versions 3.10, 3.11 and 3.12:

                  tests.yaml
                  jobs:\n  build:\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n        python-version: [\"3.10\", \"3.11\", \"3.12\"]\n\n    steps:\n    - uses: actions/checkout@v4\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: ${{ matrix.python-version }}\n
                4. If you push the changes above you will maybe see that whenever one of the tests in the matrix fails, it will automatically cancel the other tests. This is for saving time and resources. However, sometimes you want all the tests to run even if one fails. Can you figure out how to do that?

                  Solution

                  You can set the fail-fast attribute to false under the strategy attribute:

                  tests.yaml
                  jobs:\n  build:\n    runs-on: ${{ matrix.os }}\n    strategy:\n      fail-fast: false\n      matrix:\n        os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n        python-version: [\"3.10\", \"3.11\", \"3.12\"]\n
              7. As the workflow is currently implemented, GitHub actions will destroy every downloaded package when the workflow has been executed. To improve this we can take advantage of caching:

                1. Figure out how to implement caching in your workflow file. You can find a guide here and here.

                  Solution tests.yaml
                  steps:\n- uses: actions/checkout@v4\n- uses: actions/setup-python@v5\n  with:\n    python-version: 3.11\n    cache: 'pip' # caching pip dependencies\n- run: pip install -r requirements.txt\n
                2. When you have implemented a caching system go to Actions->Caches in your repository and make sure that they are correctly added. It should look something like the image below

                3. Measure how long your workflow takes before and after adding caching to your workflow. Did it improve the runtime of your workflow?

              8. (Optional) Code coverage can also be added to the workflow file by uploading it as an artifact after running the coverage. Follow the instructions in this post on how to do it.

              9. With different checks in place, it is a good time to learn about branch protection rules. A branch protection rule is essentially some kind of guarding that prevents you from merging code into a branch before certain conditions are met. In this exercise, we will create a branch protection rule that requires all checks to pass before merging code into the main branch.

                1. Start by going into your Settings -> Rules -> Rulesets and create a new branch ruleset. See the image below.

                2. In the ruleset start by giving it a name and then set the target branches to be Default branch. This means that the ruleset will be applied to your master/main branch. As shown in the image below, two rules may be particularly beneficial when you later start working with other people:

                  • The first rule to consider is Require a pull request before merging. As the name suggests this rule requires that changes that are to be merged into the main branch must be done through a pull request. This is a good practice as it allows for code review and testing before the code is merged into the main branch. Additionally, this opens the option to specify that the code must be reviewed (or at least approved) by a certain number of people.

                  • The second rule to consider is Require status checks to pass. This rule makes sure that our workflows are passing before we can merge code into the main branch. You can select which workflows are required, as some may be nice to have passing but not strictly needed.

                  Finally, if you think the rules are a bit too restrictive, you can always allow the repository admin (i.e. you) to bypass the rules by adding Repository admin to the bypass list. Implement the following rules:

                  • At least one person needs to approve any PR
                  • All your workflows need to pass
                  • All conversations need to be resolved
                3. If you have created the rules correctly you should see something like the image below when you try to merge a pull request. In this case, all three checks are required to pass before the code can be merged. Additionally, a single reviewer is required to approve the code. A bypass rule is also setup for the repository admin.

              10. One problem you may have encountered is running your tests that have to do with your data, with the core problem being that your data is not stored in GitHub (assuming you have done module M8 - DVC) and therefore cannot be tested. However, we can download data while running our continuous integration. Let's try to create that:

                1. The first problem is that we need our continuous integration pipeline to be able to authenticate with our storage solution. We can take advantage of an authentication file that is created the first time we push with DVC. It is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json where $CACHE_HOME depends on your operating system:

                  • macOS: ~/Library/Caches

                  • Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)

                  • Windows: {user}/AppData/Local

                  Find the file. The content should look similar to this (only some fields are shown):

                  {\n    \"access_token\": ...,\n    \"client_id\": ...,\n    \"client_secret\": ...,\n    \"refresh_token\": ...,\n    ...\n}\n
                2. The content of that file should be treated as a password and not shared with the world. The relevant question is therefore how to use this information in a public repository. The answer is GitHub secrets, where we can store information and access it in our workflow files while keeping it private. Navigate to the secrets option (as shown below) and create a secret with the name GDRIVE_CREDENTIALS_DATA that contains the content of the file you found in the previous exercise.

                3. Afterward, add the following code to your workflow file:

                  - uses: iterative/setup-dvc@v1\n- name: Get data\n  run: dvc pull\n  env:\n    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n

                  that runs dvc pull using the secret authentication file. For help you can visit this small repository that implements the same workflow.

                4. Finally, add the changes, commit, push and confirm that everything works as expected. You should now be able to run unit tests that depend on your input data.
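                  As a sketch of what such a data-dependent test could look like (assuming the corrupted MNIST files from the earlier exercises end up in a data/ folder after dvc pull; adjust paths and expected shapes to your own setup):

                  import os\n\nimport pytest\nimport torch\n\nDATA_PATH = \"data\"  # assumed location after dvc pull\n\n\n@pytest.mark.skipif(not os.path.exists(f\"{DATA_PATH}/train_images_0.pt\"), reason=\"Data files not found\")\ndef test_data():\n    \"\"\"Check that the pulled training data looks sane.\"\"\"\n    images = torch.load(f\"{DATA_PATH}/train_images_0.pt\")\n    targets = torch.load(f\"{DATA_PATH}/train_target_0.pt\")\n    assert len(images) == len(targets)  # one label per image\n    assert images.shape[-2:] == (28, 28)  # assuming 28x28 MNIST-style images\n    assert targets.unique().numel() <= 10  # at most 10 classes\n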

              11. In the optional module M6 on good coding practices you were introduced to practices such as being consistent with your coding style, how your Python packages are sorted and making sure that your code follows certain standards. All this was done using the ruff framework. In this set of exercises, we will create GitHub workflows that automatically test for this.

                1. Create a new workflow file called codecheck.yaml, that implements the following three steps

                  • Setup Python environment

                  • Installs ruff

                  • Runs ruff check and ruff format on the repository

                  (HINT: You should be able to just change the last steps of the tests.yaml workflow file)

                  Solution codecheck.yaml
                  name: Code formatting\n\non:\n  push:\n    branches:\n    - main\n  pull_request:\n    branches:\n    - main\n\njobs:\n  format:\n      runs-on: ubuntu-latest\n      steps:\n      - name: Checkout code\n        uses: actions/checkout@v4\n      - name: Set up Python\n        uses: actions/setup-python@v5\n        with:\n          python-version: 3.11\n          cache: 'pip'\n          cache-dependency-path: setup.py\n      - name: Install dependencies\n        run: |\n          pip install ruff\n          pip list\n      - name: Ruff check\n        run: ruff check .\n      - name: Ruff format\n        run: ruff format .\n
                2. In addition to ruff we also used mypy in those sets of exercises for checking if the typing we added to our code was good enough. Add another step to the codecheck.yaml file which runs mypy on your repository.

                3. Try to make sure that all steps pass on your repository. Especially mypy can be hard to get passing, so this exercise formally only requires you to get ruff passing.

              12. (Optional) As you have probably already experienced in module M9 on docker, it can be cumbersome to build docker images, sometimes taking a couple of minutes each time we make changes to our code base. For this reason, we want to automatically build a new image every time we commit our code, because a commit should mark a point where we believe the code to be working. Thus, let's automate the process of building our docker images using GitHub Actions. Do note that a future module will look at how to build containers using cloud providers, and this exercise is therefore very much optional.

                1. Start by making sure you have a dockerfile in your repository. If you do not have one, you can use the following simple dockerfile:

                  FROM busybox\nCMD echo \"Howdy cowboy\"\n
                2. Push the dockerfile to your repository

                3. Next, create a Docker Hub account

                4. Within Docker Hub create an access token by going to Settings -> Security. Click the New Access Token button and give it a name that you recognize.

                5. Copy the newly created access token and head over to your GitHub repository online. Go to Settings -> Secrets -> Actions and click the New repository secret. Copy over the access token and give it the name DOCKER_HUB_TOKEN. Additionally, add two other secrets DOCKER_HUB_USERNAME and DOCKER_HUB_REPOSITORY that contain your docker username and docker repository name respectively.

                6. Next, we are going to construct the actual Github actions workflow file

                  name: Docker Image continuous integration\n\non:\n  push:\n    branches: [ master ]\n\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v4\n    - name: Build the Docker image\n      run: |\n        echo \"${{ secrets.DOCKER_HUB_TOKEN }}\" | docker login \\\n          -u \"${{ secrets.DOCKER_HUB_USERNAME }}\" --password-stdin docker.io\n        docker build . --file Dockerfile \\\n          --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n        docker push \\\n          docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n

                  The first part of the workflow file should look somewhat recognizable. However, the last three lines are where all the magic happens. Carefully go through them and figure out what they do. If you want some help you can look at the help page for docker login, docker build and docker push.

                7. Upload the workflow to your GitHub repository and check that it is being executed. If everything works you should be able to see the built docker image in your container repository on Docker Hub.

                8. Make sure that you can execute docker pull locally to pull down the image that you just continuously built.

                9. (Optional) To test that the container works directly in GitHub you can also try to include an additional step that runs the container.

                      - name: Run container\n      run: |\n        docker run ...\n
              "},{"location":"s5_continuous_integration/github_actions/#dependabot","title":"Dependabot","text":"

              A great feature that GitHub provides is the ability to have bots help you with maintaining your repository. One of the most useful bots is called Dependabot. As the name suggests, Dependabot helps you keep your dependencies up to date. This is important because dependencies often either contain fixes for bugs or security vulnerabilities that you want to have in your code.

              "},{"location":"s5_continuous_integration/github_actions/#exercises_1","title":"\u2754 Exercises","text":"
              1. To get dependabot working in your repository, we need to add a single configuration file to your repository. Create a file called .github/dependabot.yaml. Look through the documentation for how to set up the file such that it updates your Python dependencies on a weekly basis.

                Solution

                The following code will check for updates in the pip ecosystem every week, i.e. it will automatically look for requirements.txt files and update the packages in there.

                version: 2\nupdates:\n  - package-ecosystem: \"pip\"\n    directory: \"/\"\n    schedule:\n      interval: \"weekly\"\n
                1. Push the changes to your repository and check that Dependabot is working by going to the Insights tab and then the Dependency graph tab. From here, under the Dependabot tab, you should be able to see if the bot has correctly identified what files to track and if it has found any updates.

                Click the Recent update jobs to see the history of Dependabot checking for updates. If there are no updates you can try to click the Check for updates button to force Dependabot to check for updates.

              2. At this point Dependabot should hopefully have found some updates and created one or more pull requests. If it has not done so, you most likely need to update your requirements file such that your dependencies are correctly restricted/specified e.g.

                # lets assume pytorch v2.5 is the latest version\n\n# these different specifications will not trigger dependabot because\n# the latest version is included in the specification\ntorch\ntorch == 2.5\ntorch >= 2.5\ntorch ~= 2.5\n\n# these specifications will trigger dependabot because the latest\n# version is not included\ntorch < 2.5\ntorch == 2.4\ntorch <= 2.4\n

                If you have a pull request from Dependabot, check it out and see if it looks good. If it does, you can merge it.

              3. (Optional) Dependabot can also help keep our GitHub Actions pipelines up-to-date. As you may have realized during this module, we write statements like the following in our workflow files:

                ...\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v4\n...\n

                The @v4 specifies that we are using version 4 of the actions/checkout action. This means that if a new version of the action is released, we will not automatically get the new version. Dependabot can help us with this. Try adding to the dependabot.yaml file that Dependabot should also check for updates in the GitHub Actions ecosystem.

                Solution
                version: 2\nupdates:\n  - package-ecosystem: \"pip\"\n    directory: \"/\"\n    schedule:\n      interval: \"weekly\"\n  - package-ecosystem: \"github-actions\"\n    directory: \"/\"\n    schedule:\n      interval: \"weekly\"\n
              "},{"location":"s5_continuous_integration/github_actions/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. When working with GitHub actions you will often encounter the following 4 concepts:

                • Workflow
                • Runner
                • Job
                • Action

                Try to define them in your own words.

                Solution
                • Workflow: A yaml file that defines the instructions to be executed on specific events. Needs to be placed in the .github/workflows folder.
                • Runner: Workflows need to run somewhere. The environment that the workflow is executed on is called the runner. Most commonly the runner is hosted by GitHub, but it can also be self-hosted.
                • Job: A series of steps that are executed on the same runner. A workflow must include at least one job but often contains many.
                • Action: An action is the smallest unit in a workflow. Jobs often consist of multiple actions that are executed sequentially.
              2. The on attribute specifies upon which events the workflow will be triggered. Assume you have set the on attribute to the following:

                on:\n    push:\n      branches: [main]\n    pull_request:\n      branches: [main]\n    schedule:\n      - cron: \"0 0 * * *\"\n    workflow_dispatch: {}\n

                What 4 events would trigger the execution of that action?

                Solution
                1. Direct push to branch main would trigger it
                2. Any pull request opened that will merge into main will trigger it
                3. Every day at midnight (00:00) the workflow would trigger, see cron for more info
                4. The trigger can be executed by manually triggering it through the GitHub UI, for example, shown below

              This ends the module on GitHub workflows. If you are more interested in this topic you can check out module M31 on documentation which first includes locally building some documentation for your project and afterward use GitHub actions for deploying it to GitHub Pages. Additionally, GitHub also has a lot of templates already for running different continuous integration tasks. If you try to create a workflow file directly in GitHub you may encounter the following page

              We highly recommend checking this out if you want to write any other kind of continuous integration pipeline in GitHub actions. We can also recommend this repository that has a list of awesome actions and check out the act repository which is a tool for running your GitHub Actions locally!

              "},{"location":"s5_continuous_integration/pre_commit/","title":"M18 - Pre-commit","text":""},{"location":"s5_continuous_integration/pre_commit/#pre-commit","title":"Pre-commit","text":"

              One of the cornerstones of working with git is remembering to commit your work often. Committing often makes it easier to identify and revert unwanted changes that you have introduced, because the code changes become smaller per commit.

              However, as you have hopefully already seen in the course, there are a lot of mental tasks to do before you actually write git commit in the terminal. The most basic one is of course making sure that you have saved all your changes and are not committing an out-of-date file. However, this also includes tasks such as styling, formatting, making sure all tests succeed etc. All these mental to-do notes do not mix well with the principle of remembering to commit often, because you in principle have to go through them every time.

              The obvious solution to this problem is to automate all or some of these mental tasks every time that we do a commit. This is where pre-commit hooks come into play, as they can help us attach additional tasks that should be run every time that we do a git commit.

              "},{"location":"s5_continuous_integration/pre_commit/#configuration","title":"Configuration","text":"

              Pre-commit simply works by inserting whatever workflow we want to automate in between when we do a git commit and when we afterwards would do a git push.

              Image credit

              The system works by looking for a file called .pre-commit-config.yaml that we can configure. If we execute

              pre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8\n

              you should get a sample file that looks like

              # See https://pre-commit.com for more information\n# See https://pre-commit.com/hooks.html for more hooks\nrepos:\n-   repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v3.2.0\n    hooks:\n    -   id: trailing-whitespace\n    -   id: end-of-file-fixer\n    -   id: check-yaml\n    -   id: check-added-large-files\n

              the file structure is very simple:

              • It starts by listing the repositories where we want to get our pre-commits from, in this case https://github.com/pre-commit/pre-commit-hooks. This repository contains a large collection of pre-commit hooks.
              • Next, we need to define which pre-commit hooks we want to use by specifying the id of the different hooks. The id corresponds to an id in this file: https://github.com/pre-commit/pre-commit-hooks/blob/master/.pre-commit-hooks.yaml

              When we are done defining our .pre-commit-config.yaml we just need to install it

              pre-commit install\n

              this will make sure that the file is automatically executed whenever we run git commit

              "},{"location":"s5_continuous_integration/pre_commit/#exercises","title":"\u2754 Exercises","text":"
              1. Install pre-commit

                pip install pre-commit\n

                Consider adding pre-commit to a requirements_dev.txt file, as it is a development tool.

              2. Next create the sample file

                pre-commit sample-config > .pre-commit-config.yaml\n
              3. The sample file already contains 4 hooks. Make sure you understand what each of them does and whether you need them at all.

              4. pre-commit works by hooking into the git commit command, running whenever that command is run. For this to work, we need to install the hooks into git commit. Run

                pre-commit install\n

                to do this.

              5. Try to commit your recently created .pre-commit-config.yaml file. It will likely not do anything, because pre-commit only checks files that are being committed. Instead try to run

                pre-commit run --all-files\n

                that will check every file in your repository.

              6. Try adding at least another check from the base repository to your .pre-commit-config.yaml file.

                Solution

                In this case we have added the check-json hook to our .pre-commit-config.yaml file, which will automatically check that all JSON files are valid.

                repos:\n-   repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v3.2.0\n    hooks:\n    -   id: trailing-whitespace\n    -   id: end-of-file-fixer\n    -   id: check-yaml\n    -   id: check-added-large-files\n    -   id: check-json\n
              7. If you have completed the optional module M7 on good coding practice you will have learned about the linter ruff. ruff comes with its own pre-commit hook. Try adding that to your .pre-commit-config.yaml file and see what happens when you try to commit files.

                Solution

                This is one way to add the ruff pre-commit hook. We run both the ruff and ruff-format hooks, and we also add the --fix argument to the ruff hook to try to fix what is possible.

                repos:\n-   repo: https://github.com/astral-sh/ruff-pre-commit\n    rev: v0.4.7\n    hooks:\n    # try to fix what is possible\n    -   id: ruff\n        args: [\"--fix\"]\n    # perform formatting updates\n    -   id: ruff-format\n    # validate if all is fine with preview mode\n    -   id: ruff\n

              8. (Optional) Add more hooks to your .pre-commit-config.yaml.

              9. Sometimes you are in a hurry, so make sure that you can also do commits without running pre-commit, e.g.

                git commit -m <message> --no-verify\n
              10. Finally, figure out how to disable pre-commit again (if you get tired of it).

              11. Assuming you have completed the module on GitHub Actions, let's try to add a pre-commit workflow that automatically runs your pre-commit checks every time you push to your repository and then commits any fixes that pre-commit makes back to your repository. We recommend that you make use of

                • this pre-commit action for installing and running pre-commit
                • this commit action to automatically commit the changes that pre-commit makes.

                As an alternative, you can configure the CI tool provided by the creators of pre-commit.

                Solution

                The workflow first uses the pre-commit action to install and run the pre-commit checks. Importantly we run it with continue-on-error: true to make sure that the workflow does not fail if the checks fail. Next, we use git diff to list the changes that pre-commit has made and then we use the git-auto-commit-action to commit those changes.

                .github/workflows/pre_commit.yaml
                name: Pre-commit CI\n\non:\n  pull_request:\n  push:\n    branches: [main]\n\njobs:\n  pre-commit:\n    name: Check pre-commit\n    runs-on: ubuntu-latest\n\n    permissions:\n      contents: write\n\n    steps:\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n\n    - name: Install pre-commit\n      uses: pre-commit/action@v3.0.1\n      continue-on-error: true\n\n    - name: List modified files\n      run: |\n        git diff --name-only\n\n    - name: Commit changes\n      uses: stefanzweifel/git-auto-commit-action@v5\n      with:\n        commit_message: Pre-commit fixes\n        commit_options: '--no-verify'\n

              That was all about how pre-commit can be used to automate tasks. If you want to deep dive more into the topic you can check out this page on how to define your own pre-commit hooks.

              "},{"location":"s5_continuous_integration/unittesting/","title":"M16 - Unit testing","text":""},{"location":"s5_continuous_integration/unittesting/#unit-testing","title":"Unit testing","text":"

              Core Module

              What often comes to mind for many developers when discussing continuous integration (CI) is code testing. Continuous integration should ensure that whenever a codebase is updated, it is automatically tested, such that any bugs introduced into the codebase are caught early on. If you look at the MLOps cycle, continuous integration is one of the cornerstones of the operations part. However, it should be noted that applying continuous integration does not magically ensure that your code does not break. Continuous integration is only as strong as the tests that are automatically executed; it simply structures and automates their execution.

              Quote

              Continuous Integration doesn\u2019t get rid of bugs, but it does make them dramatically easier to find and remove. Martin Fowler, Chief Scientist, ThoughtWorks

              Image credit

              The kind of tests we are going to look at are called unit tests. Unit testing refers to the practice of writing tests that check individual parts of your codebase for correctness. A unit can therefore be a function, a module or, in general, any object. By writing tests in this way it should be very easy to isolate which part of the code broke after an update to the codebase. Another way to test your codebase would be through integration testing, which is equally important, but we are not going to focus on it in this course.

              Unit tests (and integration tests) are not unique to MLOps but are a core concept of DevOps. However, it is important to note that testing machine learning-based systems is much more difficult than testing traditional systems. The reason for this is that machine learning systems depend on data, which influences the state of our system. For this reason, we not only need unit tests and integration tests of our code, we also need data testing, infrastructure testing and more monitoring to check that we stay within the data distribution we are training on (more on this in module M25 on data drifting). This added complexity is illustrated in the figure below.

              "},{"location":"s5_continuous_integration/unittesting/#pytest","title":"Pytest","text":"

              Before we can begin to automate the testing of our codebase, we of course need to write the tests first. It is a hard and tedious task, but arguably the most important aspect of continuous integration. Python offers a couple of different libraries for writing tests. We are going to use pytest.
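
              To get a feel for the workflow before the exercises, below is a minimal sketch of a pytest unit test for a hypothetical helper function (the file name, the add function and the values are made up for illustration; normally the function under test would live in your source code and be imported into the test file):

              # tests/test_add.py (hypothetical example)\ndef add(a: int, b: int) -> int:\n    # a tiny function we want to test; in practice it would be imported from your source code\n    return a + b\n\ndef test_add():\n    # pytest collects this test because both the file name and the function name start with test_\n    assert add(1, 2) == 3, \"add did not return the expected sum\"\n    assert add(-1, 1) == 0, \"add did not handle a negative number\"\n

              Running pytest in the folder containing the file will then collect and execute the test.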

              "},{"location":"s5_continuous_integration/unittesting/#exercises","title":"\u2754 Exercises","text":"

              The following exercises should be applied to your MNIST repository

              1. The first part of doing continuous integration is writing the unit tests. We do not expect you to cover every part of the code you have developed but try to at least write tests that cover two files. Start by creating a tests folder.

              2. Read the getting started guide for pytest which is the testing framework that we are going to use

              3. Install pytest:

                pip install pytest\n
              4. Write some tests. Below are some guidelines on some tests that should be implemented, but you are of course free to implement more tests. You can at any point check if your tests are passing by typing in a terminal

                pytest tests/\n

                When you implement a test you need to follow two conventions for pytest to be able to find your tests. First, any test files created (except __init__.py) should be named test_*.py. Secondly, any test implemented needs to be wrapped in a function whose name also starts with test_:

                # this will be found and executed by pytest\ndef test_something():\n    ...\n\n# this will not be found and executed by pytest\ndef something_to_test():\n    ...\n
                1. Start by creating a tests/__init__.py file and fill in the following:

                  import os\n_TEST_ROOT = os.path.dirname(__file__)  # root of test folder\n_PROJECT_ROOT = os.path.dirname(_TEST_ROOT)  # root of project\n_PATH_DATA = os.path.join(_PROJECT_ROOT, \"data\")  # root of data\n

                  these can help you refer to your data files during testing. For example, in another test file, I could write

                  from tests import _PATH_DATA\n

                  which then contains the root path to my data.

                2. Data testing: In a file called tests/test_data.py implement at least a test that checks that data gets correctly loaded. By this, we mean that you should check

                  def test_data():\n    dataset = MNIST(...)\n    assert len(dataset) == N_train for training and N_test for test\n    assert that each datapoint has shape [1,28,28] or [784] depending on how you choose to format\n    assert that all labels are represented\n

                  where N_train should be either 30,000 or 50,000 depending on whether you are using just the first subset of the corrupted MNIST data or also including the second subset. N_test should be 5,000.

                  Solution
                  import torch\n\nfrom my_project.data import corrupt_mnist\n\ndef test_data():\n    train, test = corrupt_mnist()\n    assert len(train) == 30000\n    assert len(test) == 5000\n    for dataset in [train, test]:\n        for x, y in dataset:\n            assert x.shape == (1, 28, 28)\n            assert y in range(10)\n    train_targets = torch.unique(train.tensors[1])\n    assert (train_targets == torch.arange(0,10)).all()\n    test_targets = torch.unique(test.tensors[1])\n    assert (test_targets == torch.arange(0,10)).all()\n
                3. Model testing: In a file called tests/test_model.py implement at least a test that checks for a given input with shape X that the output of the model has shape Y.

                  Solution
                  import torch\n\nfrom my_project.model import MyAwesomeModel\n\ndef test_model():\n    model = MyAwesomeModel()\n    x = torch.randn(1, 1, 28, 28)\n    y = model(x)\n    assert y.shape == (1, 10)\n
                4. Training testing: In a file called tests/test_training.py implement at least one test that asserts something about your training script. You are given free rein here on what should be tested, but try to test something that risks being broken when developing the code.

                5. Good code raises errors and gives out warnings in appropriate places, often in the case of some invalid combination of inputs to your script. For example, your model could check the size of the input given to it (see code below) to make sure it corresponds to what you are expecting. Not implementing such errors would still result in PyTorch failing at a later point due to shape errors; however, these custom errors will probably make more sense to the end user. Implement at least one raised error or warning somewhere in your code and use either pytest.raises or pytest.warns to check that it is correctly raised/warned. As inspiration, the following raises a ValueError in code belonging to the model:

                  # src/models/model.py\ndef forward(self, x: Tensor):\n    if x.ndim != 4:\n        raise ValueError('Expected input to be a 4D tensor')\n    if x.shape[1] != 1 or x.shape[2] != 28 or x.shape[3] != 28:\n        raise ValueError('Expected each sample to have shape [1, 28, 28]')\n
                  Solution

                  The above example would be captured by a test looking something like this:

                  # tests/test_model.py\nimport pytest\nimport torch\n\nfrom my_project.model import MyAwesomeModel\n\ndef test_error_on_wrong_shape():\n    model = MyAwesomeModel()\n    with pytest.raises(ValueError, match='Expected input to be a 4D tensor'):\n        model(torch.randn(1,2,3))\n    with pytest.raises(ValueError, match='Expected each sample to have shape'):\n        model(torch.randn(1,1,28,29))\n
                6. A test is only as good as the error message it gives, and by default, assert will only report that the check failed. However, we can help ourselves and others by adding strings after assert like

                  assert len(train_dataset) == N_train, \"Dataset did not have the correct number of samples\"\n

                  Add such messages to the assert statements you wrote in the previous exercises.

                7. The tests that involve checking anything that has to do with our data will of course fail if the data is not present. To future-proof our code, we can take advantage of the pytest.mark.skipif decorator. Use this decorator to skip your data tests if the corresponding data files do not exist. It should look something like this

                  import os.path\n@pytest.mark.skipif(not os.path.exists(file_path), reason=\"Data files not found\")\ndef test_something_about_data():\n    ...\n

                  You can read more about skipping tests here

              5. After writing the different tests, make sure that they are passing locally.

              6. We often want to check a function/module for various input arguments. In this case, you could write the same test over and over again for different inputs, but pytest also has built-in support for this with the use of the pytest.mark.parametrize decorator. Implement a parametrized test and make sure that it runs for different inputs.

                Solution
                @pytest.mark.parametrize(\"batch_size\", [32, 64])\ndef test_model(batch_size: int) -> None:\n    model = MyModel()\n    x = torch.randn(batch_size, 1, 28, 28)\n    y = model(x)\n    assert y.shape == (batch_size, 10)\n
              7. There is no direct way of measuring how good the tests you have written are. However, what we can measure is code coverage. Code coverage refers to the percentage of your codebase that gets run when all your tests are executed. Having a high coverage at least means that most of your code is actually executed when your tests run.

                1. Install coverage

                  pip install coverage\n
                2. Instead of running your tests directly with pytest, now do

                  coverage run -m pytest tests/\n
                3. To get a simple coverage report, type

                  coverage report\n

                  which will give you the coverage percentage for each of your files. You can also write

                  coverage report -m\n

                  to get the exact lines that were missed by your tests.

                4. Finally, try to increase the coverage by writing a new test that runs some of the lines in your codebase that are not covered yet.

                5. Often coverage reports code coverage for files that we do not care about, for example the test files themselves. Figure out how to configure coverage to exclude some files.

                  Solution

                  You need to set the omit option. This can either be done when running coverage run or coverage report such as:

                  coverage run --omit=\"tests/*\" -m pytest tests/\n# or\ncoverage report --omit=\"tests/*\"\n

                  As an alternative you can specify this in your pyproject.toml file:

                  [tool.coverage.run]\nomit = [\"tests/*\"]\n
              "},{"location":"s5_continuous_integration/unittesting/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. Assuming you have a code coverage of 100%, would you expect that no bugs are present in your code?

                Solution

                No, code coverage is not a guarantee that your code is bug-free. It is just a measure of how many lines of code are run when your tests are executed. Therefore, there may still be some corner case that is not covered by your tests and will result in a bug. However, having a high code coverage is a good indicator that you have tested your code. A small sketch illustrating this point can be found at the end of this knowledge check.

              2. Consider the following code:

                @pytest.mark.parametrize(\"network_size\", [10, 100, 1000])\n@pytest.mark.parametrize(\"device\", [\"cpu\", \"cuda\"])\nclass MyTestClass:\n    @pytest.mark.parametrize(\"network_type\", [\"alexnet\", \"squeezenet\", \"vgg\", \"resnet\"])\n    @pytest.mark.parametrize(\"precision\", [torch.half, torch.float, torch.double])\n    def test_network1(self, network_size, device, network_type, precision):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass(network_size, network_type).to(device=device, dtype=precision)\n        ...\n\n    @pytest.mark.parametrize(\"add_dropout\", [True, False])\n    def test_network2(self, network_size, device, add_dropout):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass2(network_size, add_dropout).to(device)\n        ...\n

                how many tests are executed when running the above code?

                Solution

                The answer depends on whether or not we are running on a GPU-enabled machine. test_network1 has 4 parameters, network_size, device, network_type and precision, that respectively can take on 3, 2, 4 and 3 values, meaning that in total that test will run 3x2x4x3=72 times with different parameters on a GPU-enabled machine and 36 times on a machine without a GPU. A similar calculation can be done for test_network2, which only has three factors, network_size, device and add_dropout, resulting in 3x2x2=12 tests on a GPU-enabled machine and 6 on a machine without a GPU. In total, that means 84 tests would run on a machine with a GPU and 42 on a machine without a GPU.
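
                To illustrate the point from the first question, here is a small sketch (with a made-up function) of code that reaches 100% line coverage while still containing a bug:

                def divide(a: float, b: float) -> float:\n    # bug: division by zero is not handled\n    return a / b\n\ndef test_divide():\n    # this single test executes every line of divide, so coverage reports 100%,\n    # yet divide(1, 0) would still crash with a ZeroDivisionError\n    assert divide(4, 2) == 2\n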

              That covers the basics of writing unit tests for Python code. We want to note that pytest is of course not the only framework for doing this. Python has a built-in framework called unittest for the same purpose (but pytest offers a few more features). Another open-source framework that you could choose to check out is hypothesis, which can help catch errors in corner cases of your code. In addition to writing unit tests, it is also highly recommended to test the code you include in the docstrings of your functions and modules, to make sure that any code in your documentation is also correct. For such testing, we can highly recommend Python's built-in doctest framework.
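
              As a small sketch of what doctest-able documentation could look like (the square function and the values are made up for illustration):

              def square(x: int) -> int:\n    \"\"\"Return the square of x.\n\n    Example:\n        >>> square(3)\n        9\n    \"\"\"\n    return x * x\n\nif __name__ == \"__main__\":\n    import doctest\n    doctest.testmod()  # runs the examples in the docstrings and fails if the output differs\n

              Running the file directly, or running pytest with the --doctest-modules flag, will execute the examples embedded in the docstrings.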

              "},{"location":"s6_the_cloud/","title":"Cloud computing","text":"

              Slides

              • Learn how to get started with Google Cloud Platform and how to interact with the SDK.

                M20: Cloud Setup

              • Learn how to use different GCP services to support your machine learning pipeline.

                M21: Cloud Services

              Running computations locally is often sufficient when only playing around with code in the initial phase of development. However, to scale your experiments you will need more computing power than what your standard laptop/desktop can offer. You probably already have experience with running on a local cluster or similar but today's topic is about utilizing cloud computing.

              Image credit

              There exists a large number of cloud computing providers, with some of the biggest being:

              • Azure
              • AWS
              • Google Cloud Platform (GCP)
              • Alibaba Cloud

              They all have slight advantages and disadvantages over each other. In this course, we are going to focus on Google Cloud Platform, because they have been kind enough to sponsor $50 of cloud credit for each student. If you happen to run out of credit, you can also get some free credit for a limited amount of time when you sign up with a new account. What is important to note is that these different cloud providers all offer largely the same set of services, and learning how to use the services of one cloud provider in many cases translates to knowing how to use the same services at another cloud provider. The services are called something different and can have a slightly different interface/interaction pattern, but in the end it does not matter much.

              Today's exercises are about getting to know how to work with the cloud. If you are in doubt about anything or want to deep dive into some topics, I can recommend watching this series of videos or going through the general docs.

              Learning objectives

              The learning objectives of this session are:

              • In general, being familiar with how the Google Cloud SDK works
              • Being able to start different compute instances and work with them
              • Know how to do continuous integration workflows for the building of docker images
              • Knowledge about how to store data and containers/artifacts in cloud buckets
              • Being able to train simple deep-learning models using a combination of cloud services
              "},{"location":"s6_the_cloud/cloud_setup/","title":"Cloud setup","text":"

              Core Module

              Google Cloud Platform (GCP) is the cloud service provided by Google. The key concept, or selling point, of any cloud provider, is the idea of near-infinite resources. Without the cloud, it simply is not feasible to do many modern deep learning and machine learning tasks because they cannot be scaled locally.

              The image below shows all the different services that the Google Cloud Platform offers. We are going to be working with around 10 of these services throughout the course. Therefore, if you finish the exercises early, I highly recommend that you dive deeper into the Google Cloud Platform.

              Image credit"},{"location":"s6_the_cloud/cloud_setup/#exercises","title":"\u2754 Exercises","text":"

              As the first step, we are going to get you some Google Cloud credits.

              1. Go to https://learn.inside.dtu.dk. Go to this course. Find the recent message where there should be a download link and instructions on how to claim the $50 cloud credit. Please do not share the link anywhere as there is a limited number of coupons. If you are not officially taking this course at DTU, Google gives $300 cloud credits whenever you sign up with a new account. NOTE that you need to provide a credit card for this, so make sure to closely monitor your credit use so you do not end up spending more than the free credit.

              2. Log in to the homepage of GCP. It should look like this:

              3. Go to billing and make sure that your account is showing $50 of cloud credit

                Make sure to also check out the Reports tab throughout the course. When you start using some of the cloud services, these tabs will update with info about your spending, so you can estimate how long you can keep going before your cloud credit runs out. Make sure that you monitor this page, as you will not be given another coupon.

              4. One way to stay organized within GCP is to create projects.

                Create a new project called dtumlops. When you click create you should get a notification that the project is being created. The notification bell is a good way to keep track of how the processes you are running are doing throughout the course.

              5. Next, it is time for the local setup on your laptop. We are going to install gcloud, which is part of the Google Cloud SDK. gcloud is the command line interface for working with our Google Cloud account. Nearly everything that we can do through the web interface we can also do through the gcloud interface. Follow the installation instructions here for your specific OS.

                1. After installation, try in a terminal to type:

                  gcloud -h\n

                  the command should show the help page. If not, something went wrong in the installation (you may need to restart after installing).

                2. Now login by typing

                  gcloud auth login\n

                  you should be sent to a web page where you link your cloud account to the gcloud interface. Afterward, also run this command:

                  gcloud auth application-default login\n

                  If you at some point want to revoke the authentication you can type:

                  gcloud auth revoke\n
                3. Next, you will need to set the project that we just created as the default project. In your web browser, under project info, you should be able to see the Project ID belonging to your dtumlops project. Copy this and type the following command in a terminal

                  gcloud config set project <project-id>\n

                  You can also get the project info by running

                  gcloud projects list\n
                4. Next, install the Google Cloud Python API:

                  pip install --upgrade google-api-python-client\n

                  Make sure that the Python interface is also installed. In a Python terminal type

                  import googleapiclient\n

                  this should work without any errors.

                5. (Optional) If you are using VSCode you can also download the relevant extension called Cloud Code. After installing it you should see a small Cloud Code button in the action bar.

              6. Finally, we need to activate a couple of developer APIs that are not activated by default. In a terminal write

                gcloud services enable apigateway.googleapis.com\ngcloud services enable servicemanagement.googleapis.com\ngcloud services enable servicecontrol.googleapis.com\n

                you can always check which services are enabled by typing

                gcloud services list\n

              After following these steps your laptop should hopefully be set up for using GCP locally. You are now ready to use their services, both locally on your laptop and in the cloud console.
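
              As a small sketch of what using GCP through code could look like at this point, the snippet below lists the Compute Engine instances in a zone using the Python client you just installed. It assumes that you have run the authentication commands above, that the Compute Engine API is enabled (you will do this in the exercises in the next section) and that the project id and zone are replaced with your own values:

              from googleapiclient import discovery\n\n# uses the application-default credentials set up with gcloud auth application-default login\ncompute = discovery.build(\"compute\", \"v1\")\nresult = compute.instances().list(project=\"<project-id>\", zone=\"europe-west1-b\").execute()\nfor instance in result.get(\"items\", []):  # the items key is only present if the zone has instances\n    print(instance[\"name\"], instance[\"status\"])\n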

              "},{"location":"s6_the_cloud/cloud_setup/#iam-and-quotas","title":"IAM and Quotas","text":"

              A big part of using the cloud in a larger organization has to do with admin and quotas. Admin here in general refers to the different roles that users of GCP can have, and quotas refer to the amount of resources that a given user has access to. For example, one employee, let's say a data scientist, may only be granted access to certain GCP services that have to do with the development and training of machine learning models, with a limited number of GPUs available, to make sure that the employee does not spend too much money. Another employee, a DevOps engineer, probably does not need access to the same services and not necessarily the same resources.

              In this course, we are not going to focus too much on this aspect but it is important to know that it exists. One feature you are going to need for doing the project is how to share a project with other people. This is done through the IAM (Identities and Access Management) page. Simply click the Grant Access button, search for the email of the person you want to share the project with and give them either Viewer, Editor or Owner access, depending on what you want them to be able to do. The figure below shows how to do this.

              What we are going to go through right now is how to increase the quota for how many GPUs you have available for your project. For any free account in GCP (or account using teaching credits), the default quota for GPUs that you can use is either 0 or 1 (their policies sometimes change). We will try to increase it in the exercises below.

              "},{"location":"s6_the_cloud/cloud_setup/#exercises_1","title":"\u2754 Exercises","text":"
              1. Start by enabling the Compute Engine service. Simply search for it in the top search bar. It should bring you to a page where you can enable the service (may take some time). We are going to look more into this service in the next module.

              2. Next go to the IAM & Admin page, again search for it in the top search bar. The remaining steps are illustrated in the figure below.

                1. Go to the quotas page

                2. In the search field search for GPUs (all regions) (needs to match exactly, the search field is case sensitive), such that you get the same quota as in the image.

                3. In the limit, you can see what your current quota for the number of GPUs you can use is. Additionally, to the right of the limit, you can see the current usage. It is worth checking in on if you are ever in doubt if a job is running on GPU or not.

                4. Click the quota and afterward the Edit quotas button.

                5. In the pop-up window, increase your limit to either 1 or 2.

                6. After sending your request you can try clicking the Increase requests tab to see the status of your request

              If you ever run into errors that contain statements about quotas when working in GCP, you can always go to this page to see what you are currently allowed to use and try to increase it. For example, when you get to training machine learning models using Vertex AI in the next module, you would most likely need to ask for a quota increase for that service as well.

              Finally, we want to note that a quota increase is sometimes not allowed within 24 hours of creating an account. If your request gets rejected, we recommend waiting a day and trying again. If this still does not work, you may need to use their services some more to convince Google that you are not a bot that wants to mine crypto.

              "},{"location":"s6_the_cloud/cloud_setup/#service-accounts","title":"Service accounts","text":"

              At some point, you will most likely need to use a service account. A service account is a virtual account that is used to interact with the Google Cloud API. It is intended for non-human users, e.g. other machines, services, etc. For example, if you want to launch a training job from GitHub Actions, you will need to use a service account for authentication between GitHub and GCP. You can read more about how to create a service account here.

              "},{"location":"s6_the_cloud/cloud_setup/#exercises_2","title":"\u2754 Exercises","text":"
              1. Go to the IAM & Admin page and click on Service accounts. Alternatively, you can search for it in the top search bar.

              2. Click the Create Service Account button. On the next page, you can give the service account a name and an id (automatically generated, but you can change it if you want). You can also give it a description. Leave the rest as default and click Create.

              3. Next, let's give the service account some permissions. Click on the service account you just created. In the Permissions tab click Add permissions. Your job now is to give the service account the lowest possible permissions such that it can download files from a bucket. Look at this page and try to find the role that fits the description.

                Solution

                The role you are looking for is Storage Object Viewer. This role allows the service account to list objects in a bucket and download objects, but nothing more. Thus even if someone gets access to the service account they cannot delete objects in the bucket.

              4. To use the service account later we need to create a key for it. Click on the service account and then the Keys tab. Click Add key and then Create new key. Choose the JSON key type and click Create. This will download a JSON file to your computer. This file is the key to the service account and should be kept secret. If you lose it you can always create a new one. A small sketch of how such a key can be used from Python is shown right after these exercises.

              5. Finally, everything we just did from creating the service account, giving it permissions, and creating a key can also be done through the gcloud interface. Try to find the commands to do this in the documentation.

                Solution

                The commands you are looking for are:

                gcloud iam service-accounts create my-sa \\\n    --description=\"My first service account\" --display-name=\"my-sa\"\ngcloud projects add-iam-policy-binding $(GCP_PROJECT_NAME) \\\n    --member=\"serviceAccount:my-sa@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\" \\\n    --role=\"roles/storage.objectViewer\"\ngcloud iam service-accounts keys create service_account_key.json \\\n    --iam-account=my-sa@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\n

                where $(GCP_PROJECT_NAME) is the name of your project. If you then want to delete the service account you can run

                gcloud iam service-accounts delete my-sa@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\n
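
                As mentioned in exercise 4, below is a small sketch of how the downloaded JSON key could be used to authenticate from Python code. It assumes that the google-cloud-storage package is installed (pip install google-cloud-storage) and that the key file, project id and bucket name are replaced with your own values (buckets are covered in the next module):

                from google.cloud import storage\nfrom google.oauth2 import service_account\n\n# load the JSON key that was downloaded for the service account\ncredentials = service_account.Credentials.from_service_account_file(\"service_account_key.json\")\nclient = storage.Client(project=\"<project-id>\", credentials=credentials)\n\n# with the Storage Object Viewer role the account can list and download objects, nothing more\nfor blob in client.list_blobs(\"<bucket-name>\"):\n    print(blob.name)\n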
              "},{"location":"s6_the_cloud/cloud_setup/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. What considerations should you take into account when choosing a GCP region for running a new application?

                Solution

                A series of factors may influence your choice of region, including:

                • Service availability: not all services are available in all regions
                • Resource availability: some regions have more GPUs available than others
                • Reduced latency: if your application is running in the same region as your users, the latency will be lower
                • Compliance: some countries have strict rules that require user info to be stored inside a particular region, e.g. the EU has GDPR rules that require all user data to be stored in the EU
                • Pricing: some regions may have different pricing than others
              2. The 3 major cloud providers all have the same services, but they are called something different depending on the provider. What are the corresponding names of these GCP services in AWS and Azure?

                • Compute Engine
                • Cloud storage
                • Cloud functions
                • Cloud run
                • Cloud build
                • Vertex AI

                It is important to know these correspondences to be able to navigate blog posts etc. about MLOps on the internet.

                Solution

                | GCP | AWS | Azure |
                | --- | --- | --- |
                | Compute Engine | Elastic Compute Cloud (EC2) | Virtual Machines |
                | Cloud storage | Simple Storage Service (S3) | Blob Storage |
                | Cloud functions | Lambda Functions | Serverless Compute |
                | Cloud run | App Runner, Fargate, Lambda | Container Apps, Container Instances |
                | Cloud build | CodeBuild | DevOps |
                | Vertex AI | SageMaker | AI Platform |
              3. Why is it always important to assign the lowest possible permissions to a service account?

                Solution

                The reason is that if someone gets access to the service account they can only do what the service account is allowed to do. If the service account has permission to delete objects in a bucket, an attacker can delete all the objects in the bucket. For this reason, multiple service accounts are used in most cases, each with different permissions. This practice is called the principle of least privilege.

              "},{"location":"s6_the_cloud/using_the_cloud/","title":"M21 - Using the Cloud","text":""},{"location":"s6_the_cloud/using_the_cloud/#using-the-cloud","title":"Using the cloud","text":"

              Core Module

              In this set of exercises, we are going to get more familiar with using some of the resources that GCP offers.

              "},{"location":"s6_the_cloud/using_the_cloud/#compute","title":"Compute","text":"

              The most basic service of any cloud provider is the ability to create and run virtual machines. In GCP this service is called the Compute Engine API. A virtual machine essentially allows you to run an operating system that behaves like a completely separate computer. There are many reasons why one would want to use virtual machines:

              • Virtual machines allow you to scale your operations, essentially giving you access to infinitely many individual computers

              • Virtual machines allow you to use large-scale hardware. For example, if you are developing a deep learning model on your laptop and want to know the inference time for a specific hardware configuration, you can just create a virtual machine with those specs and run your model.

              • Virtual machines allow you to run processes in the \"background\". If you want to train a model for a week or more, you do not want to do this on your laptop as you cannot move it or do anything with it while it is training. Virtual machines allow you to just launch a job and forget about it (at least until you run out of credit).

              "},{"location":"s6_the_cloud/using_the_cloud/#exercises","title":"\u2754 Exercises","text":"

              We are now going to start using the cloud.

              1. Click on the Compute Engine tab in the sidebar on the homepage of GCP.

              2. Click the Create Instance button. You will see the following image below.

                Give the virtual machine a meaningful name and set the location to a location close to where you are (to reduce latency, we recommend europe-west1). Finally, try to adjust the configuration a bit. Can you find at least two settings that alter the price of the virtual machine?

                Solution

                In general, the price of a virtual machine is determined by the class of hardware attached to it. Higher class CPUs and GPUs mean higher prices. Additionally, the amount of memory and disk space also affects the price. Finally, the location of the virtual machine also affects the price.

              3. After figuring this out, create an e2-medium instance (leave the rest configured as default). Before clicking the Create button, make sure to check the Equivalent code button. You should see a very long command that you could have typed in the terminal to create a VM similar to the one you just configured through the UI.

              4. After creating the virtual machine, in a local terminal type:

                gcloud compute instances list\n

                you should hopefully see the instance you have just created.

              5. You can start a terminal directly by typing:

                gcloud compute ssh --zone <zone> <name> --project <project-id>\n

                You can always see the exact command that you need to run to ssh to a VM by selecting the View gcloud command option in the Compute Engine overview (see image below).

              6. While logged into the instance, check if Python and PyTorch are installed. You should see that neither is installed. For the VM we have only specified what compute resources it should have, not what software should be on it. We can fix this by starting VMs based on specific docker images (it's all coming together).

                1. GCP comes with several ready-to-go images for doing deep learning. More info can be found here. Try running this line:

                  gcloud container images list --repository=\"gcr.io/deeplearning-platform-release\"\n

                  what does the output show?

                  Solution

                  The output should show a list of images that are available for you to use. The images are essentially docker images that contain a specific software stack. The software stack is often a specific version of Python, PyTorch, TensorFlow, etc. The images are maintained by Google and are updated regularly.

                2. Next, start (in the terminal) a new instance using a PyTorch image. The command for doing it should look something like this:

                  # the last three arguments are only needed if you want to run on GPU and have the quota to do so\ngcloud compute instances create <instance_name> \\\n    --zone=<zone> \\\n    --image-family=<image-family> \\\n    --image-project=deeplearning-platform-release \\\n    --accelerator=\"type=nvidia-tesla-k80,count=1\" \\\n    --maintenance-policy TERMINATE \\\n    --metadata=\"install-nvidia-driver=True\"\n

                  You can find more info here on what <image-family> should have as value and what extra argument you need to add if you want to run on GPU (if you have access).

                  Solution

                  The command should look something like this:

                  CPUGPU
                  gcloud compute instances create my-instance \\\n    --zone=europe-west1-b \\\n    --image-family=pytorch-latest-cpu \\\n    --image-project=deeplearning-platform-release\n
                  gcloud compute instances create my-instance \\\n    --zone=europe-west1-b \\\n    --image-family=pytorch-latest-gpu \\\n    --image-project=deeplearning-platform-release \\\n    --accelerator=\"type=nvidia-tesla-k80,count=1\" \\\n    --maintenance-policy TERMINATE\n
                3. ssh to the VM as in one of the previous exercises. Confirm that the VM indeed has both a Python installation and PyTorch installed. Hint: you also have the possibility, through the web page, to start a browser session directly on the VMs you create:

              7. Everything that you have done locally can also be achieved through the web terminal, which of course comes pre-installed with the gcloud command etc.

                Try out launching this and run some of the commands from the previous exercises.

              8. Finally, we want to make sure that we do not forget to stop our VMs. VMs are charged by the minute, so even if you are not using them you are still paying for them. Therefore, you must remember to stop your VMs when you are not using them. You can do this by either clicking the Stop button on the VM overview page or by running the following command:

                gcloud compute instances stop <instance-name>\n
              "},{"location":"s6_the_cloud/using_the_cloud/#data-storage","title":"Data storage","text":"

              Another big part of cloud computing is the storage of data. There are many reasons why you would want to store your data in the cloud, including:

              • Easily being able to share
              • Easily expand as you need more
              • Data is stored in multiple locations, making sure that it is not lost in case of an emergency

              Cloud storage is luckily also very cheap. Google Cloud only charges around $0.026 per GB per month. This means that around 1 TB of data would cost you $26 per month, which is more than what the same amount of storage would cost on Google Drive, but the storage in Google Cloud is much more focused on enterprise usage, meaning that you can access the data through code.
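
              As a small sketch of what accessing data through code could look like, the snippet below downloads a single object from a bucket using the google-cloud-storage Python package. It assumes that the package is installed (pip install google-cloud-storage), that the bucket and object names are replaced with your own values and that you are authenticated as in the previous modules:

              from google.cloud import storage\n\nclient = storage.Client()  # picks up your application-default credentials\nbucket = client.bucket(\"<bucket-name>\")\nblob = bucket.blob(\"data/some_file.csv\")  # hypothetical object path inside the bucket\nblob.download_to_filename(\"some_file.csv\")  # saves the object to a local file\n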

              "},{"location":"s6_the_cloud/using_the_cloud/#exercises_1","title":"\u2754 Exercises","text":"

              When we did the exercise on data version control, we made dvc work together with our own Google Drive to store data. However, a big limitation of this is that we need to authenticate each time we try to either push or pull the data. The solution is to use an API instead, which is offered through GCP.

              We are going to follow the instructions from this page

              1. Let's start by creating a data storage. On the GCP start page, in the sidebar, click on the Cloud Storage. On the next page click the Create bucket:

                Give the bucket a unique name, set it to a region close by and, importantly, remember to enable Object versioning under the last tab. Finally, click Create.

              2. After creating the storage, you should be able to see it online and you should be able to see it if you type in your local terminal:

                gsutil ls\n

                gsutil is a command line tool that allows you to create, upload, download, list, move, rename and delete objects in the cloud storage. For example, you can upload a file to the cloud storage by running:

                gsutil cp <file> gs://<bucket-name>\n
              3. Next, we need the Google storage extension for dvc

                pip install dvc-gs\n
              4. Now in your corrupt MNIST repository where you have already configured dvc, we are going to change the storage from our Google Drive to our newly created Google Cloud storage.

                dvc remote add -d remote_storage <output-from-gsutil>\n

                In addition, we are also going to modify the remote to support object versioning (called version_aware in dvc):

                dvc remote modify remote_storage version_aware true\n

                This will change the default way that dvc handles data. Instead of storing the data in dvc's content-addressable format, it will now store the data as it looks in our local repository, which means we are no longer limited to using dvc when downloading our data.

              5. The above command will change the .dvc/config file. git add and git commit the changes to that file. Finally, push data to the cloud

                dvc push --no-run-cache  # (1)!\n
                1. The --no-run-cache flag is used to avoid pushing the cache file to the cloud, which is not supported by the Google Cloud storage.
              6. Finally, make sure that you can pull without having to give your credentials. The easiest way to check this is to delete the .dvc/cache folder on your laptop and afterwards do a

                dvc pull --no-run-cache\n

              This setup should work when trying to access the data from your laptop, which we authenticated in the previous module. However, how can you access the data from a virtual machine, inside a docker container or from a different laptop? We in general recommend two ways:

              • You can make the bucket publicly accessible, i.e. no authentication is needed. That means that anyone with the URL to the data can access it. This is the easiest way to do it, but also the least secure. You can read more about how to make your buckets public here.

              • You can use the service account that you created in the previous module to authenticate the VM. This is the most secure way to do it, but also the most complicated. You first need to give the service account the correct permissions. Then you need to authenticate using the service account. In dvc this is done by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON key file you downloaded for the service account:

                Linux/MacOSWindows
                export GOOGLE_APPLICATION_CREDENTIALS=\"/path/to/your/credentials.json\"\n
                set GOOGLE_APPLICATION_CREDENTIALS=\"C:\\path\\to\\your\\credentials.json\"\n
              "},{"location":"s6_the_cloud/using_the_cloud/#artifact-registry","title":"Artifact registry","text":"

              You should hopefully at this point have seen the strength of using containers to create reproducible environments. They allow us to specify exactly the software that we want to run inside our VMs. However, you should already have run into two problems with containers

              • The building process can take a lot of time
              • Docker images can be large

              For this reason, we want to move both the building process and the storage of images to the cloud. In GCP the two services that we are going to use for this are called Cloud Build for building the containers in the cloud and Artifact registry for storing the images afterward.

              "},{"location":"s6_the_cloud/using_the_cloud/#exercises_2","title":"\u2754 Exercises","text":"

              In these exercises, I recommend that you start with a dummy version of some code to make sure that the building process does not take too long. Below is a simple Python script that does image classification using Sklearn, together with the corresponding requirements.txt file and Dockerfile.

              Python script main.py
              from sklearn import datasets, metrics, svm\nfrom sklearn.model_selection import train_test_split\n\nif __name__ == \"__main__\":\n    digits = datasets.load_digits()\n\n    # flatten the images\n    n_samples = len(digits.images)\n    data = digits.images.reshape((n_samples, -1))\n\n    # Create a classifier: a support vector classifier\n    clf = svm.SVC(gamma=0.001)\n\n    # Split data into 50% train and 50% test subsets\n    X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)\n\n    # Learn the digits on the train subset\n    clf.fit(X_train, y_train)\n\n    # Predict the value of the digit on the test subset\n    predicted = clf.predict(X_test)\n\n    print(f\"Classification report for classifier {clf}:\\n{metrics.classification_report(y_test, predicted)}\\n\")\n
              requirements.txt requirements.txt
              scikit-learn>=1.0\n
              Dockerfile Dockerfile
              FROM python:3.11-slim\n\n# install build dependencies needed by some Python packages\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nCOPY requirements.txt requirements.txt\nCOPY main.py main.py\nWORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\n\nENTRYPOINT [\"python\", \"-u\", \"main.py\"]\n

              The docker images for this application are therefore going to be substantially faster to build and smaller in size than the images we are used to that use PyTorch.

              1. Start by enabling the two services: Google Artifact Registry API and Google Cloud Build API. This can be done through the website (by searching for the services) or from the terminal:

                gcloud services enable artifactregistry.googleapis.com\ngcloud services enable cloudbuild.googleapis.com\n
              2. The first step is creating an artifact repository in the cloud. You can either do this through the UI or using gcloud in the command line.

                UICommand line

                Find the Artifact Registry service (search for it in the search bar) and click on it. From there click on the Create repository button. You should see the following page:

                Give the repository a name, make sure to set the format to Docker and specify the region. At the bottom of the page you can optionally add a cleanup policy. We recommend that you add one to keep costs down. Give the policy a name, choose the Keep most recent versions option and set the keep count to 5. Click Create and you should now see the repository in the list of repositories.

                gcloud artifacts repositories create <registry-name> \\\n    --repository-format=docker \\\n    --location=europe-west1 \\\n    --description=\"My docker registry\"\n

                where you need to replace <registry-name> with a name of your choice. You can read more about the command here. We recommend that after creating the repository you update it with a cleanup policy to keep costs down. You can do this by running:

                gcloud artifacts repositories set-cleanup-policies REPOSITORY \\\n    --project=<project-id> \\\n    --location=<region> \\\n    --policy=policy.yaml\n

                where the policy.yaml file should look something like this:

                [\n    {\n        \"name\": \"keep-minimum-versions\",\n        \"action\": {\"type\": \"Keep\"},\n        \"mostRecentVersions\": {\n            \"keepCount\": 5\n        }\n    }\n]\n
                and you can read more about the command here.

                Whenever we in the future want to push or pull to this artifact repository we can refer to it using this URL:

                <region>-docker.pkg.dev/<project-id>/<registry-name>\n

                for example, europe-west1-docker.pkg.dev/dtumlops-335110/container-registry would be a valid URL (this is the one I created).

              3. We are now ready to build our containers in the cloud. In principle, GCP Cloud Build works out of the box with dockerfiles. However, the recommended way is to add specialized cloudbuild.yaml files. You can think of the cloudbuild.yaml file as GCP's equivalent of the workflow files in GitHub Actions, which you learned about in module M16. It is essentially a file that specifies a list of steps that should be executed to do something, but the syntax is different.

                Look at the documentation on how to write a cloudbuild.yaml file for building and pushing a docker image to the artifact registry. Try to implement such a file in your repository.

                Solution

                For building docker images the syntax is as follows:

                steps:\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Build container image'\n  args: [\n    'build',\n    '.',\n    '-t',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n    '-f',\n    '<path-to-dockerfile>'\n  ]\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Push container image'\n  args: [\n    'push',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n  ]\n

                where you need to replace <registry-name>, <image-name> and <path-to-dockerfile> with your own values. You can hopefully recognize the syntax from the docker exercises. In this example, we are calling the cloud-builders/docker service with both the build and push arguments.

              4. You can now try to trigger the cloudbuild.yaml file from your local machine. What gcloud command would you use to do this?

                Solution

                You can trigger a build by running the following command:

                gcloud builds submit --config=cloudbuild.yaml .\n

                This command will submit a build to the cloud build service using the configuration file cloudbuild.yaml in the current directory.

              5. Instead of relying on manually submitting builds, we can set up the building process as continuous integration such that it is triggered every time we push code to the repository. This is done by setting up a trigger in the GCP console. From the GCP homepage, navigate to the triggers panel:

                Click on the manage repositories.

                1. From there, click Connect Repository, go through the steps of authenticating your GitHub profile with GCP and choose the repository that you want to set up build triggers for. For now, skip the Create a trigger (optional) part by pressing Done at the end.

                2. Navigate back to the Triggers homepage and click Create trigger. Set the following:

                  • Give a name
                  • Event: choose Push to branch
                  • Source: choose the repository you just connected
                  • Branch: choose ^main$
                  • Configuration: choose either Autodetected or Cloud build configuration file

                  Finally, click the Create button and the trigger should show up on the triggers page.

                3. To activate the trigger, push some code to the chosen repository.

                4. Go to the Cloud Build page and you should see the image being built and pushed.

                  Try clicking on the build to check out the build process and build summary. As you can see from the image, if a build is failing you will often find valuable info by looking at the build summary. If your build is failing try to configure it to run in one of these regions: us-central1, us-west2, europe-west1, asia-east1, australia-southeast1, southamerica-east1 as specified in the documentation.

                5. If/when your build is successful, navigate to the Artifact Registry page. You should hopefully find that the image you just built was pushed here. Congrats!

              6. Make sure that you can pull your image down to your laptop

                docker pull <region>-docker.pkg.dev/<project-id>/<registry-name>/<image-name>:<image-tag>\n

                you will need to authenticate docker with GCP first. Instructions can be found here, but the following command should hopefully be enough to make docker and GCP talk to each other:

                gcloud auth configure-docker <region>-docker.pkg.dev\n

                where you need to replace <region> with the region you are using. Do note that you need to have docker actively running in the background, just as any other time you want to use docker.

              7. Automation through the cloud is in general the way to go, but sometimes you may want to manually create images and push them to the registry. Figure out how to push an image to your Artifact Registry. For simplicity, you can just push the busybox image you downloaded during the initial docker exercises. This page should help you with the exercise.

                Solution

                Pushing to a repository is similar to pulling. Assuming that you have already built an image called busybox you can push it to the repository by running:

                docker tag busybox <region>-docker.pkg.dev/<project-id>/<registry-name>/busybox:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<registry-name>/busybox:latest\n

                where you need to replace <region>, <project-id> and <registry-name> with your own values.

              8. (Optional) Instead of using the built-in trigger in GCP, another way to activate the build on code changes is to integrate with GitHub Actions. This has the benefit that we can make the build process depend on other steps in the pipeline. For example, in the image below we have conditioned the build to only run if tests are passing on all operating systems. Let's try to implement this.

                1. Start by adding a new secret to Github with the name GCLOUD_SERVICE_KEY and the value of the service account key that you created in the previous module. This is needed to authenticate the Github action with GCP.

                2. We assume that you already have a workflow file that runs some unit tests:

                  name: Unit tests & build\n\non:\n  push:\n    branches: [main]\n  pull_request:\n    branches: [main]\n\njobs:\n  test:\n

                  we now want to add a job that triggers the build process in GCP. How can you make the build job depend on the test job? Hint: Relevant documentation.

                  Solution

                  You can make the build job depend on the test job by adding the needs keyword to the build job:

                  name: Unit tests & build\n\non:\n  push:\n    branches: [main]\n  pull_request:\n    branches: [main]\n\njobs:\n  test:\n    ...\n  build:\n    needs: test\n    ...\n
                3. Additionally, we probably only want to build the image if the job is running on our main branch e.g. not part of a pull request. How can you make the build job only run on the main branch?

                  Solution
                  name: Unit tests & build\n\non:\n  push:\n    branches: [main]\n  pull_request:\n    branches: [main]\n\njobs:\n  test:\n    ...\n  build:\n    needs: test\n    if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}\n    ...\n
                4. Finally, we need to add the steps to submit the build job to GCP. You need four steps:

                  • Checkout the code
                  • Authenticate with GCP
                  • Setup gcloud
                  • Submit the build

                  How can you do this? Hint: For the first two steps these two Github actions can be useful: auth and setup-gcloud.

                  Solution
                  name: Unit tests & build\n\non:\n  push:\n    branches: [main]\n  pull_request:\n    branches: [main]\n\njobs:\n  test:\n    ...\n  build:\n    needs: test\n    if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}\n    runs-on: ubuntu-latest\n    steps:\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Auth with GCP\n      uses: google-github-actions/auth@v2\n      with:\n        credentials_json: ${{ secrets.GCLOUD_SERVICE_KEY }}\n\n    - name: Set up Cloud SDK\n      uses: google-github-actions/setup-gcloud@v2\n\n    - name: Submit build\n      run: gcloud builds submit --config cloudbuild_containers.yaml\n
              9. (Optional) The cloudbuild specification format allows you to specify so-called substitutions. A substitution is simply a way to replace a variable in the cloudbuild.yaml file with a value that is known only at runtime. This can be useful for using the same cloudbuild.yaml file for multiple builds. Try to implement a substitution in your docker cloud build file such that the image name is a variable.

                Built-in substitutions

                You have probably already encountered substitutions like $PROJECT_ID in the cloudbuild.yaml file. These are substitutions that are automatically replaced by GCP. Other commonly used ones are $BUILD_ID, $PROJECT_NUMBER and $LOCATION. You can find a full list of built-in substitutions here.

                Solution

                We just need to add the substitutions field to the cloudbuild.yaml file. For example, if we want to replace the image name with a variable called _IMAGE_NAME we can do the following:

                steps:\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Build container image'\n  args: [\n    'build',\n    '.',\n    '-t',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/$_IMAGE_NAME',\n    '-f',\n    '<path-to-dockerfile>'\n  ]\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Push container image'\n  args: [\n    'push',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/$_IMAGE_NAME'\n  ]\nsubstitutions:\n  _IMAGE_NAME: 'my_image'\n

                Do note that user substitutions are prefixed with an underscore _ to distinguish them from the built-in ones. You can read more here.

                1. How would you provide the value for the _IMAGE_NAME variable to the gcloud builds submit command?

                  Solution

                  You can provide the value for the _IMAGE_NAME variable by adding the --substitutions flag to the gcloud builds submit command:

                  gcloud builds submit --config=cloudbuild.yaml --substitutions=_IMAGE_NAME=my_image\n

                  If you want to provide more than one substitution you can do so by separating them with a comma.

              "},{"location":"s6_the_cloud/using_the_cloud/#training","title":"Training","text":"

              As the final step in our journey through different GCP services in this module, we are going to look at the training of our models. This is one of the important tasks that GCP can help us with because we can always rent more hardware as long as we have credits, meaning that we can both scale horizontally (run more experiments) and vertically (run longer experiments).

              We are going to check out two ways of running our experiments. First, we are going to return to the Compute Engine service because it gives the most simple form of scaling of experiments. That is: we create a VM with an appropriate docker image, start it, log into the VM and run our experiments. Most people can run a couple of experiments in parallel this way. However, what if there was an abstraction layer that automatically created the VM for us, launched our experiments and then closed the VM afterwards?

              This is where the Vertex AI service comes into play. It is a dedicated service for handling ML models in the cloud on GCP. Vertex AI is in principle an end-to-end service that can take care of everything machine learning related in the cloud. In this course, we primarily focus on just the training of our models and then use other services for the different parts of our pipeline.

              "},{"location":"s6_the_cloud/using_the_cloud/#exercises_3","title":"\u2754 Exercises","text":"
              1. Let's start by going through how we could train a model using PyTorch using the Compute Engine service:

                1. Start by creating an appropriate VM. If you want to start a VM that has PyTorch pre-installed with only CPU support you can run the following command

                  gcloud compute instances create <instance-name> \\\n    --zone europe-west1-b \\\n    --image-family=pytorch-latest-cpu \\\n    --image-project=deeplearning-platform-release\n

                  alternatively, if you have access to GPU in your GCP account you could start a VM in the following way

                  gcloud compute instances create <instance-name> \\\n    --zone europe-west4-a \\\n    --image-family=pytorch-latest-gpu \\\n    --image-project=deeplearning-platform-release \\\n    --accelerator=\"type=nvidia-tesla-v100,count=1\" \\\n    --metadata=\"install-nvidia-driver=True\" \\\n    --maintenance-policy TERMINATE\n
                2. Next, log into your newly created VM. You can either open an ssh terminal in the cloud console or run the following command

                  gcloud beta compute ssh <instance-name>\n
                3. It is recommended to always check that the VM we get is actually what we asked for. In this case, the VM should have PyTorch pre-installed so let's check for that by running

                  python -c \"import torch; print(torch.__version__)\"\n

                  Additionally, if you have a VM with GPU support also try running the nvidia-smi command.

                4. When you have logged in to the VM, it works just like your own machine. Therefore, to run some training code you would need to do the same setup steps you have done on your own machine: clone your GitHub repository, install dependencies, download data, and run the code. Try doing this to make sure you can train a model.

              2. The above exercises should hopefully have convinced you that it can be hard to scale experiments using the Compute Engine service. The reason is that you need to manually start, set up and stop a separate VM for each experiment. Instead, let's try to use the Vertex AI service to train our models.

                1. Start by enabling the service, either by searching for Vertex AI in the cloud console and going to the service page, or by running the following command:

                  gcloud services enable aiplatform.googleapis.com\n
                2. The way we are going to use Vertex AI is to create custom jobs because we have already developed docker containers that contain everything needed to run our code. Thus, the only command that we need is the gcloud ai custom-jobs create command. An example would be:

                  # --command and --args are passed to the container and only needed if you want to change the defaults\ngcloud ai custom-jobs create \\\n    --region=europe-west1 \\\n    --display-name=test-run \\\n    --config=config.yaml \\\n    --command 'python src/my_project/train.py' \\\n    --args '[\"--epochs\", \"10\"]'\n

                  Essentially, this command combines everything into one command: it first creates a VM with the specs specified by a configuration file, then loads a container specified again in the configuration file and finally it runs everything. An example of a config file could be:

                  # config_cpu.yaml\nworkerPoolSpecs:\n    machineSpec:\n        machineType: n1-highmem-2\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n
                  # config_gpu.yaml\nworkerPoolSpecs:\n    machineSpec:\n        machineType: n1-standard-8\n        acceleratorType: NVIDIA_TESLA_T4 #(1)!\n        acceleratorCount: 1\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n
                  1. In this case we are requesting an NVIDIA Tesla T4 GPU. This will only work if you have a quota for allocating this type of GPU in the Vertex AI service. You can check how to request quota in the last exercise of the previous module. Remember that it is not enough to just request a quota for the GPU; the request needs to be approved by Google before you can use it.

                  You can read more about the configuration formatting here and about the different machine types here. Try to execute a job using the gcloud ai custom-jobs create command. For additional documentation, you can look at the documentation for the command as well as this page and this page.

                3. Assuming you manage to launch a job, you should see an output like this:

                  Try executing the commands that are outputted to look at both the status and the progress of your job.

                4. In addition, you can also visit the Custom Jobs tab in the training part of Vertex AI.

                  You will need to select the specific region that you submitted your job to in order to see the job.

                5. During custom training, we do not necessarily need to use dvc for downloading our data. A more efficient way is to use cloud storage as a mounted file system. This allows us to access data directly from the cloud storage without having to download it first. All our training jobs automatically get a gcs folder mounted in the root directory. Try to access the data from your training script:

                  # loading from a bucket using mounted file system\ndata = torch.load('/gcs/<my-bucket-name>/data.pt')\n# writing to a bucket using mounted file system\ntorch.save(data, '/gcs/<my-bucket-name>/data.pt')\n

                  This should speed up the training process a bit.

                6. Your code may depend on environment variables for authenticating, for example with Weights and Biases during training. These can also be specified in the configuration file. How would you do this?

                  Solution

                  You can specify environment variables in the configuration file by adding the env field to the containerSpec field. For example, if you want to specify the WANDB_API_KEY you can do it like this:

                  workerPoolSpecs:\n    machineSpec:\n        machineType: n1-highmem-2\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n        env:\n        - name: WANDB_API_KEY\n          value: <your-wandb-api-key>\n

                  You need to replace <your-wandb-api-key> with your actual key. Also, remember that this file now contains a secret and should be treated as such.

                7. Try to execute multiple jobs with different configurations at the same time, e.g. by changing the --args field in the gcloud ai custom-jobs create command. This should hopefully show you how easy it is to scale experiments using the Vertex AI service.

              "},{"location":"s6_the_cloud/using_the_cloud/#secrets-management","title":"Secrets management","text":"

              Similar to GitHub Actions, GCP also has a secrets store that can be used to keep secrets safe. This is called the Secret Manager in GCP. By using the Secret Manager, we get the option to inject secrets into our code without having to store them in the code itself.

              "},{"location":"s6_the_cloud/using_the_cloud/#exercises_4","title":"\u2754 Exercises","text":"
              1. Let's look at the example from before where we have a config file like this for custom Vertex AI jobs:

                workerPoolSpecs:\n    machineSpec:\n        machineType: n1-highmem-2\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n        env:\n        - name: WANDB_API_KEY\n          value: $WANDB_API_KEY\n

                we do not want to store the WANDB_API_KEY in the config file, rather we would like to store it in the Secret Manager and inject it right before the job starts. Let's figure out how to do that.

                1. Start by enabling the secrets manager API by running the following command:

                  gcloud services enable secretmanager.googleapis.com\n
                2. Next, go to the secrets manager in the cloud console and create a new secret. You just need to give it a name, a value and leave the rest as default. Add one or more secrets like the image below.

                3. We are going to inject the secrets into our training job by using cloudbuild. Create a new cloudbuild file called vertex_ai_train.yaml and add the following content:

                  vertex_ai_train.yaml
                  steps:\n- name: \"alpine\"\n  id: \"Replace values in the training config\"\n  entrypoint: \"sh\"\n  args:\n    - '-c'\n    - |\n      apk add --no-cache gettext\n      envsubst < config.yaml > config.yaml.tmp\n      mv config.yaml.tmp config.yaml\n  secretEnv: ['WANDB_API_KEY']\n\n- name: 'alpine'\n  id: \"Show config\"\n  waitFor: ['Replace values in the training config']\n  entrypoint: \"sh\"\n  args:\n    - '-c'\n    - |\n    cat config.yaml\n\n- name: 'gcr.io/cloud-builders/gcloud'\n  id: 'Train on vertex AI'\n  waitFor: ['Replace values in the training config']\n  args: [\n    'ai',\n    'custom-jobs',\n    'create',\n    '--region',\n    'europe-west1',\n    '--display-name',\n    'example-mlops-job',\n    '--config',\n    '${_VERTEX_TRAIN_CONFIG}',\n  ]\navailableSecrets:\n  secretManager:\n  - versionName: projects/$PROJECT_ID/secrets/WANDB_API_KEY/versions/latest\n    env: 'WANDB_API_KEY'\n

                  Slowly go through the file and try to understand what each step does.

                  Solution

                  There are two parts to using secrets in cloud build. First, there is the availableSecrets field that specifies what secrets from the Secret Manager should be injected into the build. In this case, we are injecting the WANDB_API_KEY and setting it as an environment variable. The second part is the secretEnv field in the first step. This field specifies which secrets should be available in the first step. The steps are then doing:

                  1. The first step calls the envsubst command, which is a general Linux command that replaces environment variables in a file. In this case, it replaces $WANDB_API_KEY with the actual value of the secret. We then save the file as config.yaml.tmp and rename it back to config.yaml.

                  2. The second step is just to show that the replacement was successful. This is mostly for debugging purposes and can be removed.

                  3. The third step is the actual training job. It waits for the first step to finish before running.

                4. Finally, try to trigger the build:

                  gcloud builds submit --config=vertex_ai_train.yaml\n

                  and check that the WANDB_API_KEY is correctly injected into the config.yaml file.

              "},{"location":"s6_the_cloud/using_the_cloud/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. In Compute Engine, we have the option to either stop or suspend a VM. Can you describe what the difference is?

                Solution

                Suspended instances preserve the guest OS memory, device state, and application state. You will not be charged for a suspended VM but will be charged for the storage of the aforementioned states. Stopped instances do not preserve any of these states and you will only be charged for the storage of the disk. However, in both cases, if the VM instances have resources attached to them, such as static IPs and persistent disks, those resources are charged until they are deleted.

              2. As seen in the exercises, a cloudbuild.yaml file often contains multiple steps. How would you make steps dependent on each other e.g. one step can only run if another step has finished? And how would you make steps execute concurrently?

                Solution

                In both cases, the solution is the waitFor field. If you want a step to wait for another step to finish, you need to give the first step an id and then specify that id in the waitFor field of the second step.

                steps:\n- name: 'alpine'\n  id: 'step1'\n  entrypoint: 'sh'\n  args: ['-c', 'echo \"Hello World\"']\n- name: 'alpine'\n  id: 'step2'\n  entrypoint: 'sh'\n  args: ['-c', 'echo \"Hello World 2\"']\n  waitFor: ['step1']\n

                If you want steps to run concurrently you can set the waitFor field to ['-']:

                steps:\n- name: 'alpine'\n  id: 'step1'\n  entrypoint: 'sh'\n  args: ['-c', 'echo \"Hello World\"']\n- name: 'alpine'\n  id: 'step2'\n  entrypoint: 'sh'\n  args: ['-c', 'echo \"Hello World 2\"']\n  waitFor: ['-']\n

              This ends the session on how to use Google Cloud services for now. In a future session, we are going to investigate a bit more of the services offered in GCP, in particular for deploying the models that we have just trained.

              "},{"location":"s7_deployment/","title":"Model deployment","text":"

              Slides

              • Learn how requests work and how to create custom APIs

                M22: Requests and APIs

              • Learn how to deploy custom APIs using serverless functions and serverless containers in the cloud

                M23: Cloud Deployment

              • Learn how to test APIs for functionality and load

                M24: API testing

              • Learn about different ways to improve the deployment of machine learning models

                M25: ML Deployment

              • Learn how to create a frontend for your application using Streamlit

                M26: Frontend

              Let's say that you have spent 1000 GPU hours and trained the most awesome model that you want to share with the world. One way to do this is, of course, to just place all your code in a Github repository, upload a file with the trained model weights to your favorite online storage (assuming it is too big for GitHub to handle) and ask people to just download your code and the weights to run the code by themselves. This is a fine approach in a small research setting, but in production, you need to be able to deploy the model to an environment that is fully contained such that people can just execute it without looking (too hard) at the code.

              Image credit

              In this session we look at methods specialized towards the deployment of models on your local machine and also at how to deploy services in the cloud.

              Learning objectives

              The learning objectives of this session are:

              • Understand the basics of requests and APIs
              • Can create custom APIs using the fastapi framework and run them locally
              • Knowledge about serverless deployments and how to deploy custom APIs using both serverless functions and serverless containers
              • Can create basic continuous deployment pipelines for your models
              • Understand the basics of frontend development and how to create a frontend for your application using Streamlit
              • Know how to use more advanced frameworks like onnx and bentoml to deploy your machine learning models
              "},{"location":"s7_deployment/apis/","title":"M22 - Requests and APIs","text":""},{"location":"s7_deployment/apis/#requests-and-apis","title":"Requests and APIs","text":"

              Core Module

              Before we can get to deploying our models, we need to understand concepts such as APIs and requests. The core reason for this is that we need a new abstraction layer on top of our applications that is not Python-specific. While Python is the de facto language for machine learning, we cannot expect everybody else to use it and, in particular, we cannot expect network protocols (both local and external) to be able to communicate with our Python programs out of the box. For this reason, we need to understand requests, in particular HTTP requests, and how to create APIs that can interact with those requests.

              "},{"location":"s7_deployment/apis/#requests","title":"Requests","text":"

              When we are talking about requests, we are essentially talking about the communication method used in client-server types of architectures. As shown in the image below, in this architecture, the client (user) is going to send requests to a server (our machine learning application) and the server will give a response. For example, the user may send a request to get the class of a specific image, which our application will do and then send back the response in terms of a label.

              Image credit

              The common way of sending requests is HTTP (Hypertext Transfer Protocol). It is essentially a specification of the intermediary transportation method between the client and the server. An HTTP request consists of two parts:

              • A request URL: the location of the server we want to send our request to
              • A request Method: describing what action we want to perform on the server

              The common request methods are (case sensitive):

              • GET: get data from the server
              • POST/PUT: send data to the server
              • DELETE: delete data on the server

              You can read more about the different methods here. For most machine learning applications, GET and POST are the core methods to remember. Additionally, if you want to read more about HTTP in general, we highly recommend that you go over this comic strip about the HTTPS protocol; the TLDR is that HTTPS provides privacy, integrity and identification over the web.

              "},{"location":"s7_deployment/apis/#exercises","title":"\u2754 Exercises","text":"

              We are going to do a couple of exercises on sending requests using requests package to get familiar with the syntax.

              1. Start by installing the `requests` package

                pip install requests\n
              2. Afterwards, create a small script and try to execute the code

                import requests\nresponse = requests.get('https://api.github.com/this-api-should-not-exist')\nprint(response.status_code)\n

                As you can see from the syntax, we are sending a request using the GET method. This code should return status code 404. Take a look at this page that contains a list of status codes. Next, let's call a page that exists

                import requests\nresponse = requests.get('https://api.github.com')\nprint(response.status_code)\n

                What is the status code now and what does it mean? Status codes are important when you have an application that is interacting with a server and wants to make sure that it does not fail, which can be done with simple if statements on the status codes

                if response.status_code == 200:\n    print('Success!')\nelif response.status_code == 404:\n    print('Not Found.')\n
              3. Next, try to call the following

                response=requests.get(\"https://api.github.com/repos/SkafteNicki/dtu_mlops\")\n

                which gives back a payload. Essentially, payload refers to any additional data that is sent from the client to the server or vice-versa. Try looking at the response.content attribute. What is the type of this attribute?

              4. You should hopefully observe that the .content attribute is of type bytes. It is important to note that the standard way of sending payloads is to encode them into byte objects. To get a more human-readable version of the response, we can convert it to JSON format

                response.json()\n

                It is important to remember that a JSON object in Python is just a nested dictionary, which is useful if you ever want to iterate over the object in some way.
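                As a small illustration of this point (a sketch, reusing the https://api.github.com request from above), you can iterate over the returned object just like any other dictionary:

                import requests\nresponse = requests.get('https://api.github.com')\npayload = response.json()  # just a (nested) Python dictionary\nfor key, value in payload.items():\n    print(f'{key}: {value}')\n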

              5. When we use the GET method we can additionally provide a params argument, that specifies what we want the server to send back for a specific request URL:

                response = requests.get(\n    'https://api.github.com/search/repositories',\n    params={'q': 'requests+language:python'},\n)\n

                Before looking at response.json() can you explain what the code does? You can try looking at this page for help.

              6. Sometimes the content of a page cannot be converted into JSON, because as already stated data is sent as bytes. Say that we want to download an image, which we can do in the following way

                import requests\nresponse = requests.get('https://imgs.xkcd.com/comics/making_progress.png')\n

                Try calling response.json(), what happens? Next, try calling response.content. To get the result in this case we would need to convert from bytes to an image:

                with open(r'img.png','wb') as f:\n    f.write(response.content)\n
              7. The get method is the most useful method because it allows us to get data from the server. However, as stated in the beginning multiple request methods exist, for example, the POST method for sending data to the server. Try executing:

                pload = {'username':'Olivia','password':'123'}\nresponse = requests.post('https://httpbin.org/post', data = pload)\n

                Investigate the response (this is an artificial example because we do not control the server).
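                If you are unsure what to look for, the sketch below prints the parts of the response that are usually most interesting; note that the exact fields are decided by the httpbin service, but the posted data should be echoed back under the form key:

                import requests\npload = {'username': 'Olivia', 'password': '123'}\nresponse = requests.post('https://httpbin.org/post', data=pload)\nprint(response.status_code)      # expect 200\nprint(response.json()['form'])   # httpbin echoes the posted form data back\n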

              8. Finally, we should also know that requests can be sent directly from the command line using the curl command. Sometimes it is easier to send a request directly from the terminal and sometimes it is easier to do it from a script.

                1. Make sure you have curl installed, or else find instructions on installing it. To check, call curl --help, which will print the documentation for curl.

                2. To execute requests.get('https://api.github.com') using curl we would simply do

                  curl -X GET \"https://api.github.com\"\ncurl -X GET -I \"https://api.github.com\" # if you want the status code\n

                  Try it yourself.

                3. Try to redo some of the exercises yourself using curl.

              That ends the intro session on requests. Do not worry if you are still not completely comfortable with sending requests, we are going to return to how we do it in practice when we have created our API. If you want to learn more about the requests package you can check out this tutorial and if you want to see more examples of how to use curl you can check out this page

              "},{"location":"s7_deployment/apis/#creating-apis","title":"Creating APIs","text":"

              Requests are all about being on the client side of our client-server architecture. We are now going to move on to the server side where we will be learning about writing the APIs that requests can interact with. An application programming interface (API) is essentially the way the developer (you) tells a user how to use the application that you have created. The API is an abstraction layer that allows the user to interact with our application in the way we want them to interact with it, without the user even having to look at the code.

              We can take the API from GitHub as an example https://api.github.com. This API allows any user to retrieve, integrate and send data to Github without ever having to visit their webpage. The API exposes multiple endpoints that have various functions:

              • https://api.github.com/repos/OWNER/REPO/branches: check out the branches on a given repository
              • https://api.github.com/search/code: search through Github for repositories
              • https://api.github.com/repos/OWNER/REPO/actions/workflows: check the status of workflows for a given repository

              and we could go on. However, there may be functionality that Github is not interested in users having access to and they may therefore choose not to have endpoints for specific features (1).

              1. Many companies provide public APIs to interact with their services/data. For a general list of public APIs you can check out this page. For the Danes out there, you can check out this list of public and private APIs from Danish companies and organizations.

              The particular kind of API we are going to work with is called a REST API (or RESTful API). The REST standard specifies constraints that a particular API needs to fulfill to be considered RESTful. You can read more about the six guiding principles behind REST APIs on this page, but one of the most important to have in mind is that the client-server architecture needs to be stateless. This means that whenever a request is sent to the server it needs to be self-contained (all information included) and the server cannot rely on any previously stored information from previous requests.

              To implement APIs in practice we are going to use FastAPI. FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+ based on standard Python type hints. FastAPI is only one of many frameworks for defining APIs, however, compared to other frameworks such as Flask and Django it offers a sweet spot of being flexible enough to do what you want without having many additional (unnecessary) features.

              "},{"location":"s7_deployment/apis/#exercises_1","title":"\u2754 Exercises","text":"

              The exercises below are a condensed version of this and this tutorial. If you ever need context for the exercises, we can recommend trying to go through these. Additionally, we also provide this solution file that you can look through for help.

              1. Install FastAPI

                pip install fastapi\n

                This contains the functions, modules, and variables we are going to need to define our interface.

              2. Additionally, install uvicorn, which is a package for running low-level server applications.

                pip install uvicorn[standard]\n
              3. Start by defining a small application like this in a file called main.py:

                from fastapi import FastAPI\napp = FastAPI()\n\n@app.get(\"/\")\ndef read_root():\n    return {\"Hello\": \"World\"}\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n    return {\"item_id\": item_id}\n

                Important here is the use of the @app.get decorator. What could this decorator refer to? Explain what the two functions are probably doing.

              4. Next, let's launch our app. Since we called our script main.py and inside the script initialized our API with app = FastAPI(), the application that we want to deploy can be referenced as main:app:

                uvicorn --reload --port 8000 main:app\n

                this will launch a server at this page: http://localhost:8000/. As you will hopefully see, this page will return the content of the root function, like the image below. Remember to also check the output in your terminal as that will give info on when and how your application is being invoked.

                1. What webpage should you open to get the server to return 1?

                2. Also check out the pages http://localhost:8000/docs and http://localhost:8000/redoc. What do these pages show?

                3. The power of the docs and redoc pages is that they allow you to easily test your application with their simple UI. As shown in the image below, simply open the endpoint you want to test, click the Try it out button, input any values and execute it. It will return the corresponding curl command for invoking your endpoint, the corresponding URL and the response of your application. Try it out.

                4. You can also visit http://localhost:8000/openapi.json to check out the schema that is generated, which is essentially a json file containing the overall specification of your program.

                5. Try to access http://localhost:8000/items/foo. What happens in this case? When you specify types in your API, FastAPI will automatically do type validation using pydantic, making sure users can only access your API with the correct types. Therefore, remember to include types in your applications!
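                  You can also observe the validation from the client side. A small sketch (assuming the app is still running on port 8000):

                  import requests\nresponse = requests.get('http://localhost:8000/items/foo')  # 'foo' is not an int\nprint(response.status_code)  # expect 422, a validation error\nprint(response.json())       # the error message produced by pydantic\n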

              5. With the fundamentals in place let's configure it a bit more:

                1. Let's start by changing the root function to include a bit more info. In particular, we are also interested in returning the status code so the end user can easily read it. Standard status codes are included in the built-in http Python package:

                  from http import HTTPStatus\n\n@app.get(\"/\")\ndef root():\n    \"\"\" Health check.\"\"\"\n    response = {\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n    }\n    return response\n

                  try to reload the page and see what is returned now. You should not have to re-launch the app because we started uvicorn with the --reload argument.

                2. When we decorate our functions with @app.get(\"/items/{item_id}\"), item_id is in this case what we call a path parameter because it is a parameter that is directly included in the path of our endpoint. We have already seen how we can restrict a path parameter to a single type, but what if we want to restrict it to specific values? This is often the case if we are working with parameters of type str. In this case we would need to define an enum:

                  from enum import Enum\nclass ItemEnum(Enum):\n    alexnet = \"alexnet\"\n    resnet = \"resnet\"\n    lenet = \"lenet\"\n\n@app.get(\"/restric_items/{item_id}\")\ndef read_item(item_id: ItemEnum):\n    return {\"item_id\": item_id}\n

                  Add this API, reload and execute both a valid parameter and a non-valid parameter.

                3. In contrast to path parameters we have query parameters. In the requests exercises we saw an example of this where we were calling https://api.github.com/search/code with the query 'q': 'requests+language:python'. Any parameter in FastAPI that is not a path parameter, will be considered a query parameter:

                  @app.get(\"/query_items\")\ndef read_item(item_id: int):\n    return {\"item_id\": item_id}\n

                  Add this API, reload and figure out how to pass in a query parameter.
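                  If you get stuck, one way to pass a query parameter is sketched below (assuming the server runs on port 8000); the requests package turns the params dictionary into /query_items?item_id=42 for you:

                  import requests\nresponse = requests.get('http://localhost:8000/query_items', params={'item_id': 42})\nprint(response.json())  # expect {'item_id': 42}\n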

                4. We have until now worked with the .get method, but let's also see an example of the .post method. As already described, the POST request method is used for uploading data to the server. Here is a simple app that saves a username and password in a database (please never implement it like this in real life):

                  database = {'username': [ ], 'password': [ ]}\n\n@app.post(\"/login/\")\ndef login(username: str, password: str):\n    username_db = database['username']\n    password_db = database['password']\n    if username not in username_db and password not in password_db:\n        with open('database.csv', \"a\") as file:\n            file.write(f\"{username}, {password} \\n\")\n        username_db.append(username)\n        password_db.append(password)\n    return \"login saved\"\n

                  Make sure you understand what the function does and then try to execute it a couple of times to see your database updating (one way to invoke it is sketched below). It is important to note that in the following exercises we sometimes use the .get method and sometimes the .post method. For our usage it does not really matter.
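                  A sketch of one way to invoke the endpoint from Python (assuming the app runs locally on port 8000; since username and password are not path parameters, FastAPI treats them as query parameters):

                  import requests\nresponse = requests.post(\n    'http://localhost:8000/login/',\n    params={'username': 'Olivia', 'password': '123'},  # sent as query parameters\n)\nprint(response.json())  # expect 'login saved'\n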

              6. We are now moving on to figuring out how to provide different standard inputs such as text, images and json to our APIs. It is important that you try out each example yourself and in particular that you look at the curl commands that are necessary to invoke each application.

                1. Here is a small application, that takes a single text input

                  @app.get(\"/text_model/\")\ndef contains_email(data: str):\n    regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n    response = {\n        \"input\": data,\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n        \"is_email\": re.fullmatch(regex, data) is not None\n    }\n    return response\n

                  What does the application do? Try it out yourself

                2. Let's say we wanted to extend the application to check for a specific email domain, either gmail or hotmail. Assume that we want to feed this into our application as a json object e.g.

                  {\n    \"email\": \"mlops@gmail.com\",\n    \"domain_match\": \"gmail\"\n}\n

                  Figure out how to alter the data parameter such that it takes in the json object and make sure to extend the application to check if the email and domain also match. Hint: take a look at this page
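                  One possible direction, sketched with a pydantic model (the model and endpoint names below are illustrative and not taken from the course solution):

                  import re\nfrom enum import Enum\nfrom http import HTTPStatus\n\nfrom fastapi import FastAPI\nfrom pydantic import BaseModel\n\napp = FastAPI()\n\nclass DomainEnum(str, Enum):  # illustrative name\n    gmail = 'gmail'\n    hotmail = 'hotmail'\n\nclass EmailInput(BaseModel):  # illustrative name\n    email: str\n    domain_match: DomainEnum\n\n@app.post('/text_model/')\ndef contains_email(data: EmailInput):\n    regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n    is_email = re.fullmatch(regex, data.email) is not None\n    return {\n        'input': data,\n        'message': HTTPStatus.OK.phrase,\n        'status-code': HTTPStatus.OK,\n        'is_email': is_email,\n        'domain_match': is_email and data.email.split('@')[1].startswith(data.domain_match.value),\n    }\n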

                3. Let's move on to an application that requires a file input:

                  from fastapi import UploadFile, File\nfrom typing import Optional\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...)):\n    with open('image.jpg', 'wb') as image:\n        content = await data.read()\n        image.write(content)\n        image.close()\n\n    response = {\n        \"input\": data,\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n    }\n    return response\n

                  A couple of new things are going on here: we use the specialized UploadFile and File bodies in our input definition. Additionally, we added the async/await keywords. Figure out what everything does and try to run the application (you can use any image file you like).

                4. The above application does not actually do anything with the image. Let's add opencv-python as a package and resize the image. It can be done with the following three lines:

                  import cv2\nimg = cv2.imread(\"image.jpg\")\nres = cv2.resize(img, (h, w))\n

                  Figure out where to add them in the application and additionally add h and w as optional parameters, with a default value of 28. Try running the application where you specify everything and one more time where you leave out h and w.

                5. Finally, let's also figure out how to return a file from our application. You will need to add the following lines:

                  from fastapi.responses import FileResponse\ncv2.imwrite('image_resize.jpg', res)\nFileResponse('image_resize.jpg')\n

                  Figure out where to add them to the code and try running the application one more time to see that you get a file back with the resized image. One way the pieces could fit together is sketched below.
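                  Since the last three sub-exercises build on each other, here is a sketch of how the pieces could be combined; treat it as one possible way to wire things together rather than the official solution (file names and defaults are just the ones used above):

                  import cv2\nfrom fastapi import FastAPI, File, UploadFile\nfrom fastapi.responses import FileResponse\n\napp = FastAPI()\n\n@app.post('/cv_model/')\nasync def cv_model(data: UploadFile = File(...), h: int = 28, w: int = 28):\n    # save the uploaded file to disk\n    with open('image.jpg', 'wb') as image:\n        content = await data.read()\n        image.write(content)\n\n    # resize and write the result back to disk\n    img = cv2.imread('image.jpg')\n    res = cv2.resize(img, (h, w))\n    cv2.imwrite('image_resize.jpg', res)\n\n    # return the resized image as a file\n    return FileResponse('image_resize.jpg')\n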

              7. A common pattern in most applications is that we want some code to run on startup and some code to run on shutdown. FastAPI allows us to do this by controlling the lifespan of our application. This is done by implementing the lifespan function. Look at the documentation for lifespan events and implement a small application that prints Hello on startup and Goodbye on shutdown.

                Solution

                Here is a simple example that will print Hello on startup and Goodbye on shutdown.

                from contextlib import asynccontextmanager\nfrom fastapi import FastAPI\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n    print(\"Hello\")\n    yield\n    print(\"Goodbye\")\n\napp = FastAPI(lifespan=lifespan)\n\n@app.get(\"/\")\ndef read_root():\n    return {\"Hello\": \"World\"}\n
              8. Let's try to figure out how to use FastAPI in a machine learning context. Below is a script that downloads a VisionEncoderDecoder model from Huggingface. The model can be used to create captions for a given image. Thus calling

                predict_step(['s7_deployment/exercise_files/my_cat.jpg'])\n

                returns a list of strings like ['a cat laying on a couch with a stuffed animal'] (try this yourself). Create a FastAPI application that can do inference using this model e.g. it should take in an image, preferably some optional hyperparameters (like max_length) and should return a string (or list of strings) containing the generated caption.

                simple ML application

                from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer\nimport torch\nfrom PIL import Image\n\nmodel = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\nfeature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\ntokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\ngen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\ndef predict_step(image_paths):\n    images = []\n    for image_path in image_paths:\n        i_image = Image.open(image_path)\n        if i_image.mode != \"RGB\":\n            i_image = i_image.convert(mode=\"RGB\")\n\n        images.append(i_image)\n    pixel_values = feature_extractor(images=images, return_tensors=\"pt\").pixel_values\n    pixel_values = pixel_values.to(device)\n    output_ids = model.generate(pixel_values, **gen_kwargs)\n    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n    preds = [pred.strip() for pred in preds]\n    return preds\n\nif __name__ == \"__main__\":\n    print(predict_step(['s7_deployment/exercise_files/my_cat.jpg']))\n
                Solution: ml_app.py
                from contextlib import asynccontextmanager\n\nimport torch\nfrom fastapi import FastAPI, File, UploadFile\nfrom PIL import Image\nfrom transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTFeatureExtractor\n\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n    \"\"\"Load and clean up model on startup and shutdown.\"\"\"\n    global model, feature_extractor, tokenizer, device, gen_kwargs\n    print(\"Loading model\")\n    model = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n    feature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n    tokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n    device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n    model.to(device)\n    gen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\n\n    yield\n\n    print(\"Cleaning up\")\n    del model, feature_extractor, tokenizer, device, gen_kwargs\n\n\napp = FastAPI(lifespan=lifespan)\n\n\n@app.post(\"/caption/\")\nasync def caption(data: UploadFile = File(...)):\n    \"\"\"Generate a caption for an image.\"\"\"\n    i_image = Image.open(data.file)\n    if i_image.mode != \"RGB\":\n        i_image = i_image.convert(mode=\"RGB\")\n\n    pixel_values = feature_extractor(images=[i_image], return_tensors=\"pt\").pixel_values\n    pixel_values = pixel_values.to(device)\n    output_ids = model.generate(pixel_values, **gen_kwargs)\n    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n    return [pred.strip() for pred in preds]\n
              9. As the final step, we want to figure out how to include our FastAPI application in a docker container, as it will help us when we want to deploy in the cloud because docker, as always, can take care of the dependencies for our application. For the following set of exercises you can use whichever previous FastAPI application you like as the base application for the container.

                1. Start by creating a requirements.txt file for your application. You will at least need fastapi and uvicorn in the file, and we always recommend that you are specific about the versions you want to use:

                  fastapi>=0.68.0,<0.69.0\nuvicorn>=0.15.0,<0.16.0\n# add anything else your application needs to be able to run\n
                2. Next, create a Dockerfile with the following content

                  FROM python:3.9\nWORKDIR /code\nCOPY ./requirements.txt /code/requirements.txt\n\nRUN pip install --no-cache-dir --upgrade -r /code/requirements.txt\nCOPY ./app /code/app\n\nCMD [\"uvicorn\", \"app.main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"80\"]\n

                  The above assumes that your file structure looks like this

                  .\n\u251c\u2500\u2500 app\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2514\u2500\u2500 main.py\n\u251c\u2500\u2500 Dockerfile\n\u2514\u2500\u2500 requirements.txt\n

                  Hopefully, all these steps should look familiar if you already went through module M9, except for maybe the last line. However, this is just the standard way that we have run our FastAPI applications in the last couple of exercises, this time with some extra arguments regarding the ports we allow.

                3. Next, build the corresponding docker image

                  docker build -t my_fastapi_app .\n
                4. Finally, run the image such that a container is spun up that runs our application. The important part here is to remember to specify the -p argument (p for port) which should be the same number as the port we have specified in the last line of our Dockerfile.

                  docker run --name mycontainer -p 80:80 my_fastapi_app\n
                5. Check that everything is working by going to the corresponding localhost page http://localhost/items/5?q=somequery
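                  You can also do the same check from Python instead of the browser (a sketch, assuming the container is running and port 80 is mapped as above):

                  import requests\nresponse = requests.get('http://localhost/items/5', params={'q': 'somequery'})\nprint(response.status_code, response.json())  # expect 200 and a JSON body containing item_id 5\n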

              This ends the module on APIs. If you want to go further in this direction we highly recommend that you check out bentoml which is an API standard that focuses solely on creating easy-to-understand APIs and services for ml-applications. Additionally, we can also highly recommend checking out Postman which can help design, document and in particular test the API you are writing to make sure that it works as expected.

              "},{"location":"s7_deployment/cloud_deployment/","title":"M23 - Cloud Deployment","text":""},{"location":"s7_deployment/cloud_deployment/#cloud-deployment","title":"Cloud deployment","text":"

              Core Module

              We are now returning to using the cloud. At this point, you should have gone through the steps of having the code in your GitHub repository automatically built into a docker container, storing that container, storing data and pulling it all together to make a training run. After the training is completed you should hopefully have a file stored in the cloud with your trained model weights.

              Today's exercises will be about serving those model weights to an end user. We focus on two different ways of deploying our model: Google cloud functions and Google cloud run. Both services are serverless, meaning that you do not have to manage the server that runs your code.

              GCP in general has 5 core deployment options. We are going to focus on Cloud Functions and Cloud Run, which are two of the serverless options. In contrast to these two, you have the option to deploy to Kubernetes Engine and Compute Engine which are more traditional ways of deploying your code. Here you have to manage the underlying infrastructure."},{"location":"s7_deployment/cloud_deployment/#cloud-functions","title":"Cloud Functions","text":"

              Google Cloud Functions is the simplest way that we can deploy our code to the cloud. As stated above, it is a serverless service, meaning that you do not have to worry about the underlying infrastructure. You just write your code and deploy it. The service is great for small applications that can be encapsulated in a single script.

              "},{"location":"s7_deployment/cloud_deployment/#exercises","title":"\u2754 Exercises","text":"
              1. Go to the start page of Cloud Functions. It can be found in the sidebar on the homepage or you can just search for it. Activate the service in the cloud console or use the following command:

                gcloud services enable cloudfunctions.googleapis.com\n
              2. Click the Create Function button which should take you to a screen like the image below. Make sure it is a 2nd Gen function, give it a name, set the server region to somewhere close by and change the authentication policy to Allow unauthenticated invocations so we can access it directly from a browser. Remember to note down the URL of the function.

              3. On the next page, for Runtime pick the Python 3.11 option (or newer). This will make the inline editor show both a main.py and a requirements.txt file. Look over them and try to understand what they do. In particular, take a look at functions-framework, which is a required dependency of any Python cloud function.

                After you have looked over the files, click the Deploy button.

                Solution

                The functions-framework is a lightweight, open-source framework for turning Python functions into HTTP functions. Any function that you deploy to Cloud Functions must be wrapped in the @functions_framework.http decorator.
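                As a minimal illustration (this is a sketch, not the exact template that GCP generates for you), an HTTP cloud function can look like this:

                import functions_framework\n\n@functions_framework.http\ndef hello_http(request):\n    # 'request' is a flask.Request object\n    name = request.args.get('name', 'World')\n    return f'Hello {name}!'\n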

              4. Afterwards, the function should begin to deploy. When it is done, you should see \u2705. Now let's test it by going to the Testing tab.

              5. If you know what the application does, it should come as no surprise that it can run without any input. We therefore just send an empty request by clicking the Test The Function button. Does the function return the output you expected? Wait for the logs to show up. What do they show?

                1. What should the Triggering event look like in the testing prompt for the program to respond with

                  Hallo General Kenobi!\n

                  Try it out.

                  Solution

                  The default triggering event is a JSON object with a key name and a value. Therefore the triggering event should look like this:

                  {\n    \"name\": \"General Kenobi\"\n}\n
                2. Go to the trigger tab and go to the URL for the application. Execute the API a couple of times. How can you change the URL to make the application respond with the same output as above?

                  Solution

                  You can change the URL to include a query parameter name with the value General Kenobi. For example

                  https://us-central1-my-personal-mlops-project.cloudfunctions.net/function-3?name=General%20Kanobi\n

                  where you would need to replace everything before the ? with your URL.

                3. Click on the metrics tab. You should hopefully see it being populated with a few data points. Identify what each panel is showing.

                  Solution
                  • Invocations/Second: The number of times the function is invoked per second
                  • Execution time (ms): The time it takes for the function to execute in milliseconds
                  • Memory usage (MB): The memory usage of the function in MB
                  • Instance count (instances): The number of instances that are running the function
                4. Check out the logs tab. You should see that your application has already been invoked multiple times. Also, try to execute this command in a terminal:

                  gcloud functions logs read\n
              6. Next, we are going to create our own application that takes some input so we can try to send it requests. We provide a very simple script to get started.

                Simple script

                sklearn_cloud_functions.py
                # Load data\nimport pickle\n\nimport numpy as np\nfrom sklearn import datasets\nfrom sklearn.neighbors import KNeighborsClassifier\n\niris_x, iris_y = datasets.load_iris(return_X_y=True)\n\n# Split iris data in train and test data\n# A random permutation, to split the data randomly\nnp.random.seed(0)\nindices = np.random.permutation(len(iris_x))\niris_x_train = iris_x[indices[:-10]]\niris_y_train = iris_y[indices[:-10]]\niris_x_test = iris_x[indices[-10:]]\niris_y_test = iris_y[indices[-10:]]\n\n# Create and fit a nearest-neighbor classifier\n\nknn = KNeighborsClassifier()\nknn.fit(iris_x_train, iris_y_train)\nknn.predict(iris_x_test)\n\n# save model\n\nwith open(\"model.pkl\", \"wb\") as file:\n    pickle.dump(knn, file)\n
                1. Figure out what the script does and run the script. This should create a file with a trained model.

                  Solution

                  The file trains a simple KNN model on the iris dataset and saves it to a file called model.pkl.
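                  Before moving on, it can be useful to sanity-check the saved model locally, for example with a small sketch like this:

                  import pickle\n\nwith open('model.pkl', 'rb') as file:\n    model = pickle.load(file)\nprint(model.predict([[5.1, 3.5, 1.4, 0.2]]))  # should predict class 0 (setosa) for this sample\n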

                2. Next, create a storage bucket and upload the model file to the bucket. Try to do this using the gsutil command and check afterward that the file is in the bucket.

                  Solution
                  gsutil mb gs://<bucket-name>  # mb stands for make bucket\ngsutil cp <file-name> gs://<bucket-name>  # cp stands for copy\n
                3. Create a new cloud function with the same initial settings as the first one, e.g. Python 3.11 and HTTP. Then implement in the main.py file code that:

                  • Loads the model from the bucket
                  • Takes a request with a list of integers as input
                  • Returns the prediction of the model

                  In addition to writing the main.py file, you also need to fill out the requirements.txt file. You need at least three packages to run the application. Remember to also change the Entry point to the name of your function. If your deployment fails, try to go to the Logs Explorer page in gcp which can help you identify why.

                  Solution

                  The main script should look something like this:

                  main.py
                  import pickle\n\nimport functions_framework\nfrom google.cloud import storage\n\nBUCKET_NAME = \"my_sklearn_model_bucket\"\nMODEL_FILE = \"model.pkl\"\n\nclient = storage.Client()\nbucket = client.get_bucket(BUCKET_NAME)\nblob = bucket.get_blob(MODEL_FILE)\nmy_model = pickle.loads(blob.download_as_string())\n\n\n@functions_framework.http\ndef knn_classifier(request):\n    \"\"\"Simple knn classifier function for iris prediction.\"\"\"\n    request_json = request.get_json()\n    if request_json and \"input_data\" in request_json:\n        input_data = request_json[\"input_data\"]\n        input_data = [float(in_data) for in_data in input_data]\n        input_data = [input_data]\n        prediction = my_model.predict(input_data)\n        return {\"prediction\": prediction.tolist()}\n    return {\"error\": \"No input data provided.\"}\n

                  And, the requirement file should look like this:

                  functions-framework>=3.7.0\ngoogle-cloud-storage>=2.14.0\nscikit-learn>=1.4.0\n

                  Importantly, make sure that you are using the same version of scikit-learn as you used when you trained the model. Otherwise, you will most likely get an error when trying to load the model.

                4. When you have successfully deployed the model, try to make predictions with it. What should the request look like?

                  Solution

                  It depends on how exactly you have chosen to implement the main.py. But for the provided solution, the payload should look like this:

                  {\n    \"data\": [1, 2, 3, 4]\n}\n

                  with the corresponding curl command:

                  curl -X POST \\\n    https://your-cloud-function-url/knn_classifier \\\n    -H \"Content-Type: application/json\" \\\n    -d '{\"input_data\": [5.1, 3.5, 1.4, 0.2]}'\n
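                  The same request can also be sent from Python with the requests package (a sketch; replace the URL with the URL of your own function):

                  import requests\nresponse = requests.post(\n    'https://your-cloud-function-url/knn_classifier',  # replace with your function URL\n    json={'input_data': [5.1, 3.5, 1.4, 0.2]},\n)\nprint(response.json())  # expect something like {'prediction': [0]}\n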
              7. Let's try to figure out how to do the above deployment using gcloud instead of the console UI. The relevant command is gcloud functions deploy. For this to work you will need to put the main.py and requirements.txt files in a separate folder. Try to execute the command to successfully deploy the function.

                Solution
                gcloud functions deploy <func-name> \\\n    --gen2 --runtime python311 --trigger-http --source <folder> --entry-point knn_classifier\n

                where you need to replace <func-name> with the name of your function and <folder> with the path to the folder containing the main.py and requirements.txt files.

              8. (Optional) You can finally try to redo the exercises by deploying a PyTorch application. You will essentially need to go through the same steps as the sklearn example, including uploading a trained model to storage and writing a cloud function that loads it and returns some output. You are free to choose whatever PyTorch model you want.
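
                To get you started, a minimal sketch of what the cloud function for a PyTorch model could look like is shown below. The bucket name, file name and model are hypothetical and depend on what you trained and uploaded yourself, and the requirements file would need at least functions-framework, google-cloud-storage and torch.

                import io\n\nimport functions_framework\nimport torch\nfrom google.cloud import storage\n\nBUCKET_NAME = \"my_pytorch_model_bucket\"  # hypothetical bucket name\nMODEL_FILE = \"model.pt\"  # hypothetical file, saved with torch.save(model, \"model.pt\")\n\nclient = storage.Client()\nblob = client.get_bucket(BUCKET_NAME).get_blob(MODEL_FILE)\nmodel = torch.load(io.BytesIO(blob.download_as_bytes()), map_location=\"cpu\")\nmodel.eval()\n\n\n@functions_framework.http\ndef torch_predictor(request):\n    \"\"\"Run the PyTorch model on a list of floats provided in the request.\"\"\"\n    request_json = request.get_json()\n    if request_json and \"input_data\" in request_json:\n        x = torch.tensor(request_json[\"input_data\"], dtype=torch.float32).unsqueeze(0)\n        with torch.no_grad():\n            prediction = model(x)\n        return {\"prediction\": prediction.tolist()}\n    return {\"error\": \"No input data provided.\"}\n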

              "},{"location":"s7_deployment/cloud_deployment/#cloud-run","title":"Cloud Run","text":"

              Cloud functions are great for simple deployments that can be encapsulated in a single script with only simple requirements. However, they do not scale to more advanced applications that may depend on multiple programming languages. We are already familiar with how we can deal with this through containers, and Cloud Run is the corresponding service in GCP for deploying containers.

              "},{"location":"s7_deployment/cloud_deployment/#exercises_1","title":"\u2754 Exercises","text":"
              1. We are going to start locally by developing a small app that we can deploy. We provide two small examples to choose from: the first is a small FastAPI app consisting of a single Python script and a docker file. The second is a small Streamlit app (which you can learn more about in this module) consisting of a single docker file. You can choose which one you want to work with.

                Simple Fastapi app simple_fastapi_app.py
                from fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/\")\ndef read_root():\n    \"\"\"Root endpoint.\"\"\"\n    return {\"Hello\": \"World\"}\n\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n    \"\"\"Get an item by id.\"\"\"\n    return {\"item_id\": item_id}\n
                simple_fastapi_app.dockerfile
                FROM python:3.9-slim\n\nEXPOSE $PORT\n\nWORKDIR /app\n\nRUN apt-get update && apt-get install -y \\\n    build-essential \\\n    software-properties-common \\\n    git \\\n    && rm -rf /var/lib/apt/lists/*\n\nRUN pip install fastapi\nRUN pip install pydantic\nRUN pip install uvicorn\n\nCOPY simple_fastapi_app.py simple_fastapi_app.py\n\nCMD exec uvicorn simple_fastapi_app:app --port $PORT --host 0.0.0.0 --workers 1\n
                Simple Streamlit app streamlit_app.dockerfile
                FROM python:3.9-slim\n\nEXPOSE $PORT\n\nWORKDIR /app\n\nRUN apt-get update && apt-get install -y \\\n    build-essential \\\n    software-properties-common \\\n    git \\\n    && rm -rf /var/lib/apt/lists/*\n\nRUN git clone https://github.com/streamlit/streamlit-example.git .\n\nRUN pip3 install -r requirements.txt\n\nENTRYPOINT [\"streamlit\", \"run\", \"streamlit_app.py\", \"--server.port=$PORT\", \"--server.address=0.0.0.0\"]\n
                1. Start by going over the files belonging to your choice app and understand what it does.

                2. Next, build the docker image belonging to the app

                  docker build -f <dockerfile> . -t gcp_test_app:latest\n
                3. Next tag and push the image to your artifact registry

                  docker tag gcp_test_app <region>-docker.pkg.dev/<project-id>/<registry-name>/gcp_test_app:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<registry-name>/gcp_test_app:latest\n

                  Afterward, check that your artifact registry contains the pushed image.

              2. Next, go to Cloud Run in the cloud console and enable the service or use the following command:

                gcloud services enable run.googleapis.com\n
              3. Click the Create Service button which should bring you to a page similar to the one below

                Do the following:

                • Click the select button, which will bring up all built containers, and pick the one you want to deploy. In the future, you probably want to choose the Continuously deploy new revisions from a source repository option such that a new version is always deployed when a new container is built.

                • Hereafter, give the service a name and select the region. We recommend choosing a region close to you.

                • Set the authentication method to Allow unauthenticated invocations such that we can call it without providing credentials. In the future, you may want to allow only authenticated invocations.

                • Expand the Container, Connections, Security tab and edit the port such that it matches the port exposed in your chosen application. If your docker file exposes the env variable $PORT you can set the port to anything.

                Finally, click the create button and wait for the service to be deployed (may take some time).

                Common problems

                If you get an error saying The user-provided container failed to start and listen on the port defined by the PORT environment variable, there are two common reasons for this:

                1. You need to add an EXPOSE statement in your docker container:

                  EXPOSE 8080\nCMD exec uvicorn my_application:app --port 8080 --host 0.0.0.0 --workers 1\n

                  and make sure that your application is also listening on that port. If you hard code the port in your application (as in the above code) it is best to set it to 8080, which is the default port for Cloud Run. Alternatively, a better approach is to set it to the $PORT environment variable, which is set by Cloud Run and can be accessed in your application:

                  EXPOSE $PORT\nCMD exec uvicorn my_application:app --port $PORT --host 0.0.0.0 --workers 1\n

                  If you do this and then want to run locally you can run it as:

                  docker run -p 8080:8080 -e PORT=8080 <image-name>:<image-tag>\n
                2. If you are serving a large machine-learning model, it may also be that your deployed container is running out of memory. You can try to increase the memory of the container by going to the Edit container and the Resources tab and increasing the memory.

              4. If you manage to deploy the service you should see an image like this:

                You can now access your application by clicking the URL. This will access the root of your application, so you may need to add / or /<path> to the URL depending on how the app works.

              5. Everything we just did in the console UI we can also do with the gcloud run deploy command. How would you do that?

                Solution

                The command should look something like this

                gcloud run deploy <service-name> \\\n    --image <image-name>:<image-tag> --platform managed --region <region> --allow-unauthenticated\n

                where you need to replace <service-name> with the name of your service, <image-name> with the name of your image and <region> with the region you want to deploy to. The --allow-unauthenticated flag is optional but is needed if you want to access the service without providing credentials.

              6. After deploying using the command line, make sure that the service is up and running by using these two commands

                gcloud run services list\ngcloud run services describe <service-name> --region <region>\n
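
                If you prefer to check from Python, a small sketch using the requests library could look like this, where the URL is hypothetical and should be copied from the output of the describe command:

                import requests\n\n# hypothetical URL, copy the real one from the describe command above\nSERVICE_URL = \"https://<service-name>-<hash>-<region>.a.run.app\"\nresponse = requests.get(SERVICE_URL, timeout=10)\nprint(response.status_code)\nprint(response.text[:200])  # first part of the response body\n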
              7. Instead of deploying our docker container using the UI or command line, which is a manual operation, we can do it continuously by using the cloudbuild.yaml file we learned about in the previous section. This is called continuous deployment, and it is a way to automate the deployment process.

                Image credit

                Let's revise the cloudbuild.yaml file from the artifact registry exercises in this module which will build and push a specified docker image.

                cloudbuild.yaml

                cloudbuild.yaml
                steps:\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Build container image'\n  args: [\n    'build',\n    '.',\n    '-t',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n    '-f',\n    '<path-to-dockerfile>'\n  ]\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Push container image'\n  args: [\n    'push',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n  ]\n

                Add a third step to the cloudbuild.yaml file that deploys the container image to Cloud Run. The relevant service you need to use is called 'gcr.io/cloud-builders/gcloud' and the command is 'gcloud run deploy'. Afterwards, reuse the trigger you created in the previous module or create a new one to build and deploy the container image continuously. Confirm that this works by making a change to your application, pushing it to GitHub and checking that the application is updated automatically.

                Solution

                The full cloudbuild.yaml file should look like this:

                steps:\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Build container image'\n  args: [\n    'build',\n    '.',\n    '-t',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n    '-f',\n    '<path-to-dockerfile>'\n  ]\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Push container image'\n  args: [\n    'push',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n  ]\n- name: 'gcr.io/cloud-builders/gcloud'\n  id: 'Deploy to Cloud Run'\n  args: [\n    'run',\n    'deploy',\n    '<service-name>',\n    '--image',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n    '--region',\n    'europe-west1',\n    '--platform',\n    'managed',\n  ]\n
              "},{"location":"s7_deployment/cloud_deployment/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. In the previous module on using the cloud you learned about the Secrets Manager in GCP. How can you use this service in combination with Cloud Run?

                Solution

                In the cloud console, secrets can be set in the Container(s), Volumes, Networking, Security tab under the Variables & Secrets section, see image below.

                In the gcloud command, you can set the secret by using the --update-secrets flag.

                gcloud run deploy <service-name> \\\n    --image <image-name>:<image-tag> --platform managed \\\n    --region <region> --allow-unauthenticated \\\n    --update-secrets <secret-name>=<secret-version>\n

              That ends the exercises on deployment. The exercises above are just a small taste of what deployment has to offer. In both sections, we have explicitly chosen to work with serverless deployments. But what if you wanted to do the opposite, e.g. be the one in charge of managing the cluster that handles the deployed services? If you are interested in taking deployment to the next level, you should get started with Kubernetes, which is the de facto open-source container orchestration platform used in production environments. If you want to deep dive, we recommend starting here, which describes how to make pipelines that are a necessary component before you start to create your own Kubernetes cluster.

              "},{"location":"s7_deployment/frontend/","title":"M26 - Frontend","text":""},{"location":"s7_deployment/frontend/#frontend","title":"Frontend","text":"

              If you have gone over the deployment module you should be at the point where you have a machine learning model running in the cloud. The model can be interacted with by sending HTTP requests to the API endpoint. In general, we refer to this as the backend of the application. It is the behind-the-scenes part of our application that the user does not see, and it is not really that user-friendly. Instead, we want to create a frontend that the user can interact with in a more user-friendly way. This is what we will be doing in this module.

              Another reason for splitting our application into a frontend and a backend has to do with scalability. If we have a lot of users interacting with our application, we might want to scale only the backend and not the frontend, because that is the part that will be running our heavy machine learning model. In general, dividing an application into smaller pieces is the pattern used in microservice architectures.

              In monolithic applications, everything the user may request of our application is handled by a single process/container. In microservice architectures, the application is split into smaller pieces that can be scaled independently. This also leads to easier maintainability and faster development.

              Frontends have for the longest time been created using HTML, CSS and JavaScript. This is still the case, but there are now a lot of frameworks that can help us create a frontend in Python:

              • Django
              • Reflex
              • Streamlit
              • Bokeh
              • Gradio

              In this module we will be looking at streamlit. streamlit is an easy-to-use framework that allows us to create interactive web applications in Python. It is not nearly as powerful as a framework like Django, but it is very easy to get started with and integrates well with our machine learning models.
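
              To give an idea of how little code is needed, a minimal (hypothetical) hello.py app could look like the sketch below, which is started with streamlit run hello.py:

              import streamlit as st\n\nst.title(\"Hello MLOps\")\nname = st.text_input(\"What is your name?\")\nif name:\n    st.write(f\"Hello, {name}!\")\n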

              "},{"location":"s7_deployment/frontend/#exercises","title":"\u2754 Exercises","text":"

              In these exercises we go through the process of setting up a backend using fastapi and a frontend using streamlit, containerizing both applications and then deploying them to the cloud. We have already created an example of this which can be found in the samples/frontend_backend folder.

              1. Let's start by creating the backend application in a backend.py file. You can use essentially any backend you want, but we will be using a simple imagenet classifier that we have created in the samples/frontend_backend/backend folder.

                1. Create a new file called backend.py and implement a FastAPI interface with a single /predict endpoint that takes an image as input and returns the predicted class (and probabilities) of the image.

                  Solution backend.py
                  import json\nfrom contextlib import asynccontextmanager\n\nimport anyio\nimport torch\nfrom fastapi import FastAPI, File, HTTPException, UploadFile\nfrom PIL import Image\nfrom torchvision import models, transforms\n\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n    \"\"\"Context manager to start and stop the lifespan events of the FastAPI application.\"\"\"\n    global model, transform, imagenet_classes\n    # Load model\n    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)\n    model.eval()\n\n    transform = transforms.Compose(\n        [\n            transforms.Resize((224, 224)),\n            transforms.ToTensor(),\n            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),\n        ],\n    )\n\n    async with await anyio.open_file(\"imagenet-simple-labels.json\") as f:\n        imagenet_classes = json.load(f)\n\n    yield\n\n    # Clean up\n    del model\n    del transform\n    del imagenet_classes\n\n\napp = FastAPI(lifespan=lifespan)\n\n\ndef predict_image(image_path: str) -> str:\n    \"\"\"Predict image class (or classes) given image path and return the result.\"\"\"\n    img = Image.open(image_path).convert(\"RGB\")\n    img = transform(img).unsqueeze(0)\n    with torch.no_grad():\n        output = model(img)\n    _, predicted_idx = torch.max(output, 1)\n    return output.softmax(dim=-1), imagenet_classes[predicted_idx.item()]\n\n\n@app.get(\"/\")\nasync def root():\n    \"\"\"Root endpoint.\"\"\"\n    return {\"message\": \"Hello from the backend!\"}\n\n\n# FastAPI endpoint for image classification\n@app.post(\"/classify/\")\nasync def classify_image(file: UploadFile = File(...)):\n    \"\"\"Classify image endpoint.\"\"\"\n    try:\n        contents = await file.read()\n        async with await anyio.open_file(file.filename, \"wb\") as f:\n            f.write(contents)\n        probabilities, prediction = predict_image(file.filename)\n        return {\"filename\": file.filename, \"prediction\": prediction, \"probabilities\": probabilities.tolist()}\n    except Exception as e:\n        raise HTTPException(status_code=500) from e\n
                2. Run the backend using uvicorn

                  uvicorn backend:app --reload\n
                3. Test the backend by sending a request to the /predict endpoint, preferably using the curl command

                  Solution

                  In this example we are sending a request to the /classify/ endpoint (the name used for the prediction endpoint in the provided solution) with a file called my_cat.jpg. The response should be \"tabby cat\" for the solution we have provided.

                  curl -X 'POST' \\\n    'http://127.0.0.1:8000/classify/' \\\n    -H 'accept: application/json' \\\n    -H 'Content-Type: multipart/form-data' \\\n    -F 'file=@my_cat.jpg;type=image/jpeg'\n
                4. Create a requirements_backend.txt file with the dependencies needed for the backend.

                  Solution requirements_backend.txt
                  fastapi>=0.108.0\nuvicorn>=0.25.0\ntorch>=2.1.2\ntorchvision>=0.16.2\n
                5. Containerize the backend into a file called backend.dockerfile.

                  Solution backend.dockerfile
                  FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc git && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nRUN mkdir /app\n\nWORKDIR /app\n\nCOPY requirements_backend.txt /app/requirements_backend.txt\nCOPY backend.py /app/backend.py\n\nRUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements_backend.txt\n\nEXPOSE $PORT\nCMD exec uvicorn --port $PORT --host 0.0.0.0 backend:app\n
                6. Build the backend image

                  docker build -t backend:latest -f backend.dockerfile .\n
                7. Recheck that the backend works by running the image in a container

                  docker run --rm -p 8000:8000 -e \"PORT=8000\" backend\n

                  and test that it works by sending a request to the /predict endpoint.

                8. Deploy the backend to Cloud Run using the gcloud command

                  Solution

                  Assuming that we have created an artifact registry called frontend-backend we can deploy the backend to Cloud Run using the following commands:

                  docker tag \\\n    backend:latest \\\n    <region>-docker.pkg.dev/<project>/frontend-backend/backend:latest\ndocker push \\\n    <region>-docker.pkg.dev/<project>/frontend-backend/backend:latest\ngcloud run deploy backend \\\n    --image=<region>-docker.pkg.dev/<project>/frontend-backend/backend:latest \\\n    --region=<region> \\\n    --platform=managed\n

                  where <region> and <project> should be replaced with the appropriate values.

                9. Finally, test that the deployed backend works as expected by sending a request to the /predict endpoint

                  Solution

                  In this solution we are first extracting the URL of the deployed backend and then sending a request to the /predict endpoint (if you used the provided backend solution, the endpoint is called /classify/ instead).

                  export MYENDPOINT=$(gcloud run services describe backend --region=<region> --format=\"value(status.url)\")\ncurl -X 'POST' \\\n    $MYENDPOINT/predict \\\n    -H 'accept: application/json' \\\n    -H 'Content-Type: multipart/form-data' \\\n    -F 'file=@my_cat.jpg;type=image/jpeg'\n
              2. With the backend taken care of, let's now write our frontend. Our frontend just needs to be a \"nice\" interface to our backend. Its main functionality will be to send a request to the backend and display the result. See the streamlit documentation for help.

                1. Start by installing streamlit

                  pip install streamlit\n
                2. Now create a file called frontend.py and implement a streamlit application. You can design it as you want, but we recommend that the following can be done in the frontend:

                  1. Have a file uploader that allows the user to upload an image

                  2. Display the image that the user uploaded

                  3. Have a button that sends the image to the backend and displays the result

                  For now just assume that an environment variable called BACKEND is available that contains the URL of the backend. We will in the next step show how to get this URL automatically.

                  Solution frontend.py
                  import os\n\nimport pandas as pd\nimport requests\nimport streamlit as st\nfrom google.cloud import run_v2\n\n\ndef get_backend_url():\n    \"\"\"Get the URL of the backend service.\"\"\"\n    parent = \"projects/my-personal-mlops-project/locations/europe-west1\"\n    client = run_v2.ServicesClient()\n    services = client.list_services(parent=parent)\n    for service in services:\n        if service.name.split(\"/\")[-1] == \"production-model\":\n            return service.uri\n    return os.environ.get(\"BACKEND\", None)\n\n\ndef classify_image(image, backend):\n    \"\"\"Send the image to the backend for classification.\"\"\"\n    predict_url = f\"{backend}/predict\"\n    response = requests.post(predict_url, files={\"image\": image}, timeout=10)\n    if response.status_code == 200:\n        return response.json()\n    return None\n\n\ndef main() -> None:\n    \"\"\"Main function of the Streamlit frontend.\"\"\"\n    backend = get_backend_url()\n    if backend is None:\n        msg = \"Backend service not found\"\n        raise ValueError(msg)\n\n    st.title(\"Image Classification\")\n\n    uploaded_file = st.file_uploader(\"Upload an image\", type=[\"jpg\", \"jpeg\", \"png\"])\n\n    if uploaded_file is not None:\n        image = uploaded_file.read()\n        result = classify_image(image, backend=backend)\n\n        if result is not None:\n            prediction = result[\"prediction\"]\n            probabilities = result[\"probabilities\"]\n\n            # show the image and prediction\n            st.image(image, caption=\"Uploaded Image\")\n            st.write(\"Prediction:\", prediction)\n\n            # make a nice bar chart\n            data = {\"Class\": [f\"Class {i}\" for i in range(10)], \"Probability\": probabilities}\n            df = pd.DataFrame(data)\n            df.set_index(\"Class\", inplace=True)\n            st.bar_chart(df, y=\"Probability\")\n        else:\n            st.write(\"Failed to get prediction\")\n\n\nif __name__ == \"__main__\":\n    main()\n
                3. We need to make sure that the frontend knows where the backend is located, and we want that to happen automatically so we do not have to hardcode the URL into our frontend. We can do this by using the Python SDK for Google Cloud Run. The following code snippet shows how to get the URL of the backend service or fall back to an environment variable if the service is not found.

                  import os\n\nfrom google.cloud import run_v2\nimport streamlit as st\n\n@st.cache_resource  # (1)!\ndef get_backend_url():\n    \"\"\"Get the URL of the backend service.\"\"\"\n    parent = \"projects/<project>/locations/<region>\"\n    client = run_v2.ServicesClient()\n    services = client.list_services(parent=parent)\n    for service in services:\n        if service.name.split(\"/\")[-1] == \"production-model\":\n            return service.uri\n    name = os.environ.get(\"BACKEND\", None)\n    return name\n
                  1. The st.cache_resource is a decorator that tells streamlit to cache the result of the function. This is useful if the function is expensive to run and we want to avoid running it multiple times.

                  Add the above code snippet to the top of your frontend.py file and replace <project> and <region> with the appropriate values. You will need to run pip install google-cloud-run to be able to use the code snippet.

                4. Run the frontend using streamlit

                  streamlit run frontend.py\n
                5. Create a requirements_frontend.txt file with the dependencies needed for the frontend.

                  Solution requirements_frontend.txt
                  streamlit>=1.28.2\npandas>=2.1.3\ngoogle-cloud-run>=0.10.5\n
                6. Containerize the frontend into a file called frontend.dockerfile.

                  Solution frontend.dockerfile
                  FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc git && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nRUN mkdir /app\n\nWORKDIR /app\n\nCOPY requirements_frontend.txt /app/requirements_frontend.txt\nCOPY frontend.py /app/frontend.py\n\nRUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements_frontend.txt\n\nEXPOSE $PORT\n\n# use the shell form so that $PORT is substituted at runtime\nCMD streamlit run frontend.py --server.port $PORT --server.address=0.0.0.0\n
                7. Build the frontend image

                  docker build -t frontend:latest -f frontend.dockerfile .\n
                8. Run the frontend image

                  docker run --rm -p 8001:8001 -e \"PORT=8001\" frontend\n

                  and check in your web browser that the frontend works as expected.

                9. Deploy the frontend to Cloud Run using the gcloud command

                  Solution

                  Assuming that we have created an artifact registry called frontend-backend we can deploy the frontend to Cloud Run using the following commands:

                  docker tag frontend:latest \\\n    <region>-docker.pkg.dev/<project>/frontend-backend/frontend:latest\ndocker push <region>-docker.pkg.dev/<project>/frontend-backend/frontend:latest\ngcloud run deploy frontend \\\n    --image=<region>-docker.pkg.dev/<project>/frontend-backend/frontend:latest \\\n    --region=<region> \\\n    --platform=managed\n
                10. Test that the frontend works as expected by opening the URL of the deployed frontend in your web browser.

              3. (Optional) If you have gotten this far you have successfully created a frontend and a backend and deployed them to the cloud. Finally, it may be worth it to load test your application to see how it performs under load. Write a locust file which is covered in this module and run it against your frontend. Make sure that it can handle the load you expect it to handle.
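
                A minimal locustfile for this could look like the sketch below, assuming the frontend serves its UI at the root URL; a more realistic test would also exercise the backend through the frontend:

                from locust import HttpUser, between, task\n\n\nclass FrontendUser(HttpUser):\n    \"\"\"Small load test that repeatedly requests the frontend page.\"\"\"\n\n    wait_time = between(1, 2)\n\n    @task\n    def load_frontend(self) -> None:\n        \"\"\"Request the root page of the frontend.\"\"\"\n        self.client.get(\"/\")\n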

              4. (Optional) Feel free to experiment further with streamlit and see what you can create. For example, you can try to create an option for the user to upload a video and then display the video with the predicted class overlaid on top of the video.

              "},{"location":"s7_deployment/frontend/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. We have created separate requirements files for the frontend and the backend. Why is this a good idea?

                Solution

                This is a good idea because the frontend and the backend may have different dependencies. By having separate requirements files we can make sure that we only install the dependencies that are needed for the specific application. This also has the positive side effect that we can keep the docker images smaller. For example, the frontend does not need the torch library which is huge and only needed for the backend.

              This ends the exercises for this module.

              "},{"location":"s7_deployment/ml_deployment/","title":"M25 - ML deployment","text":""},{"location":"s7_deployment/ml_deployment/#deployment-of-machine-learning-models","title":"Deployment of Machine Learning Models","text":"

              In one of the previous modules you learned about how to use FastAPI to create an API to interact with your machine learning models. FastAPI is a great framework, but it is a general framework, meaning that it was not developed with machine learning applications in mind. This means that there are features you may find missing when running large-scale machine learning models:

              • Dynamic-batching: if you have a large number of requests coming in, you may want to process them in batches to reduce the overhead of loading the model and running the inference. This is especially true if you are running your model on a GPU, where the overhead of loading the model is significant.

              • Async inference: FastAPI does support async requests, but it has no way to call the model asynchronously. This means that if you have a large number of requests coming in, you will have to wait for the model to finish processing (because the model is not async) before you can start processing the next request.

              • Native GPU support: you can definitely run parts of your application on a GPU within FastAPI if you want to. But again, it was not built with machine learning in mind, so you will have to do some extra work to get it to work.

              It should come as no surprise that multiple frameworks have therefore sprung up that better support the deployment of machine learning algorithms (just listing a few here):

              \ud83c\udf1f Framework | \ud83e\udde9 Backend Agnostic | \ud83e\udde0 Model Agnostic | \ud83d\udcc2 Repository | \u2b50 Github Stars
              --- | --- | --- | --- | ---
              Cortex | \u2705 | \u2705 | \ud83d\udd17 Link | 8.0k
              BentoML | \u2705 | \u2705 | \ud83d\udd17 Link | 7.2k
              Ray Serve | \u2705 | \u2705 | \ud83d\udd17 Link | 34.1k
              Triton Inference Server | \u2705 | \u2705 | \ud83d\udd17 Link | 8.3k
              OpenVINO | \u2705 | \u2705 | \ud83d\udd17 Link | 7.3k
              Seldon-core | \u2705 | \u2705 | \ud83d\udd17 Link | 4.4k
              Litserve | \u2705 | \u2705 | \ud83d\udd17 Link | 2.5k
              Torchserve | \u274c | \u2705 | \ud83d\udd17 Link | 4.2k
              TensorFlow serve | \u274c | \u2705 | \ud83d\udd17 Link | 6.2k
              vLLM | \u274c | \u274c | \ud83d\udd17 Link | 30.5k

              The first 7 frameworks are backend agnostic, meaning that they are intended to work with whatever computational backend your model is implemented in (TensorFlow, PyTorch, Jax, Sklearn etc.), whereas the last 3 are backend specific (PyTorch, TensorFlow and a custom framework). The first 9 frameworks are model agnostic, meaning that they are intended to work with whatever model you have implemented, whereas the last one is model specific, in this case to LLMs. When choosing a framework to deploy your model, you should consider the following:

              • Ease of use. Some frameworks are easier to use and get started with than others, but may have fewer features. As an example from the list above, Litserve is very easy to get started with but is a relatively new framework and may not have all the features you need.

              • Performance. Some frameworks are optimized for performance, but may be harder to use. As an example from the list above, vLLM is a very high performance framework for serving large language models but it cannot be used for other types of models.

              • Community. Some frameworks have a large community, which can be helpful if you run into problems. As an example from the list above, Triton Inference Server is developed by Nvidia and has a large community of users. As a good rule of thumb, the more stars a repository has on Github, the larger the community.

              In this module we are going to be looking at the BentoML framework because it strikes a good balance between ease of use and having a lot of features that can improve the performance of serving your models. However, before we dive into this serving framework, we are going to look at a general way to package our machine learning models that should work with most of the above frameworks.

              "},{"location":"s7_deployment/ml_deployment/#model-packaging","title":"Model Packaging","text":"

              Whenever we want to serve a machine learning model, we in general need 3 things:

              • The computational graph of the model, e.g. how to pass data through the model to get a prediction.
              • The weights of the model, e.g. the parameters that the model has learned during training.
              • A computational backend that can run the model.

              In the previous module on Docker we learned how to package all of these things into a container. This is a great way to package a model, but it is not the only way. The core assumption we currently have made is that the computational backend is the same as the one we trained the model on. However, this does not need to be the case. As long as we can export our model and weights to a common format, we can run the model on any backend that supports this format.

              This is exactly what the Open Neural Network Exchange (ONNX) is designed to do. ONNX is a standardized format for creating and sharing machine learning models. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types. The idea behind ONNX is that a model trained with a specific framework on a specific device, let's say PyTorch on your local computer, can be exported and run with an entirely different framework and hardware easily. Learning how to export your models to ONNX is therefore a great way to increase the longevity of your models and not being locked into a specific framework for serving your models.

              The ONNX format is designed to bridge the gap between development and deployment of machine learning models, by making it easy to export models between different frameworks and hardware. For example, PyTorch is in general considered a developer-friendly framework; however, it has historically been slower to run inference with compared to more specialized inference frameworks. Image credit"},{"location":"s7_deployment/ml_deployment/#exercises","title":"\u2754 Exercises","text":"
              1. Start by installing ONNX, ONNX runtime and ONNX script. This can be done by running the following command

                pip install onnx onnxruntime onnxscript\n

                the first package contains the core ONNX framework, the second contains the runtime for running ONNX models, and the third is a new experimental package designed to make it easier to export models to ONNX.

              2. Let's start out with converting a model to ONNX. The following code snippets show how to export a PyTorch model to ONNX.

                PyTorch >= 2.0 | PyTorch < 2.0 or Windows | PyTorch-lightning
                import torch\nimport torchvision\n\nmodel = torchvision.models.resnet18(weights=None)\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nonnx_model = torch.onnx.dynamo_export(\n    model,\n    dummy_input,\n    export_options=torch.onnx.ExportOptions(dynamic_shapes=True),\n)\nonnx_model.save(\"resnet18.onnx\")\n
                import torch\nimport torchvision\n\nmodel = torchvision.models.resnet18(weights=None)\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\ntorch.onnx.export(\n    model=model,\n    args=(dummy_input,),\n    f=\"resnet18.onnx\",\n    input_names=[\"input\"],\n    output_names=[\"output\"],\n    dynamic_axes={\"input\": {0: \"batch_size\"}, \"output\": {0: \"batch_size\"}}\n)\n
                import torch\nimport torchvision\nimport pytorch_lightning as pl\nimport onnx\nimport onnxruntime\n\nclass LitModel(pl.LightningModule):\n    def __init__(self):\n        super().__init__()\n        self.model = torchvision.models.resnet18(pretrained=True)\n        self.model.eval()\n\n    def forward(self, x):\n        return self.model(x)\n\nmodel = LitModel()\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nmodel.to_onnx(\n    file_path=\"resnet18.onnx\",\n    input_sample=dummy_input,\n    input_names=[\"input\"],\n    output_names=[\"output\"],\n    dynamic_axes={\"input\": {0: \"batch_size\"}, \"output\": {0: \"batch_size\"}}\n)\n

                Export a model of your own choice to ONNX or just try to export the resnet18 model as shown in the examples above, and confirm that the model was exported by checking that the file exists. Can you figure out what is meant by dynamic_axes?

                Solution

                The dynamic_axes argument is used to specify which axes of the input tensor should be considered dynamic. This is useful when the model can accept inputs of different sizes, e.g. when the model is used in a dynamic batching scenario. In the example above we have specified that the first axis of the input tensor should be considered dynamic, meaning that the model can accept inputs of different batch sizes. While it may be tempting to specify all axes as dynamic, this can lead to slower inference times, because the ONNX runtime will not be able to optimize the computational graph as well.
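
                To see the effect in practice, a small sketch like the one below (assuming the resnet18.onnx file exported above with a dynamic batch axis) should run without errors for several different batch sizes:

                import numpy as np\nimport onnxruntime as rt\n\nsession = rt.InferenceSession(\"resnet18.onnx\")\ninput_name = session.get_inputs()[0].name\nfor batch_size in (1, 4, 8):  # different batch sizes work because axis 0 is dynamic\n    batch = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)\n    (output,) = session.run(None, {input_name: batch})\n    print(batch_size, output.shape)\n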

              3. Check that the model was correctly exported by loading it using the onnx package and afterwards checking the graph of the model using the following code:

                import onnx\nmodel = onnx.load(\"resnet18.onnx\")\nonnx.checker.check_model(model)\nprint(onnx.helper.printable_graph(model.graph))\n
              4. To get a better understanding of what is actually exported, let's try to visualize the computational graph of the model. This can be done using the open-source tool netron. You can either try it out directly in the web browser or you can install it locally using pip install netron and then run it using netron resnet18.onnx. Can you figure out what method of the model is exported to ONNX?

                Solution

                When a PyTorch model is exported to ONNX, it is only the forward method of the model that is exported. This means that it is the only method we have access to when we load the model later. Therefore, make sure that the forward method of your model is implemented in a way that it can be used for inference.
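
                As a small illustration of this point, in the (made-up) model below only forward ends up in the exported graph, so inference-time logic such as the softmax should live there, while helper methods are simply ignored:

                import torch\nfrom torch import nn\n\n\nclass Classifier(nn.Module):\n    \"\"\"Tiny illustrative model.\"\"\"\n\n    def __init__(self) -> None:\n        super().__init__()\n        self.net = nn.Linear(10, 3)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        # only this method becomes part of the exported ONNX graph\n        return self.net(x).softmax(dim=-1)\n\n    def predict_class(self, x: torch.Tensor) -> torch.Tensor:\n        # helper methods like this one are not exported\n        return self.forward(x).argmax(dim=-1)\n\n\nmodel = Classifier().eval()\ntorch.onnx.export(model, (torch.randn(1, 10),), \"classifier.onnx\", input_names=[\"input\"], output_names=[\"probs\"])\n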

              5. After converting a model to ONNX format we can use the ONNX Runtime to run it. The benefit of this is that ONNX Runtime is able to optimize the computational graph of the model, which can lead to faster inference times. Let's try to look into that.

                1. Figure out how to run a model using the ONNX Runtime. Relevant documentation.

                  Solution

                  To use the ONNX runtime to run a model, we first need to start an inference session, then extract the input and output names of our model, and finally run the model. The following code snippet shows how to do this.

                  import numpy as np\nimport onnxruntime as rt\n\nort_session = rt.InferenceSession(\"<path-to-model>\")\ninput_names = [i.name for i in ort_session.get_inputs()]\noutput_names = [i.name for i in ort_session.get_outputs()]\nbatch = {input_names[0]: np.random.randn(1, 3, 224, 224).astype(np.float32)}\nout = ort_session.run(output_names, batch)\n
                2. Let's experiment with the performance of ONNX vs. PyTorch. Implement a benchmark that measures the time it takes to run a model using PyTorch and ONNX. Bonus points if you test for multiple input sizes. To get you started we have implemented a timing decorator that you can use to measure the time it takes to run a function.

                  from statistics import mean, stdev\nimport time\ndef timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):\n    \"\"\" Decorator that times the execution of a function. \"\"\"\n    def wrapper(*args, **kwargs):\n        timing_results = []\n        for _ in range(timing_repeat):\n            start_time = time.time()\n            for _ in range(function_repeat):\n                result = func(*args, **kwargs)\n            end_time = time.time()\n            elapsed_time = end_time - start_time\n            timing_results.append(elapsed_time)\n        print(f\"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds\")\n        return result\n    return wrapper\n
                  Solution onnx_benchmark.py
                  import sys\nimport time\nfrom statistics import mean, stdev\n\nimport onnxruntime as ort\nimport torch\nimport torchvision\n\n\ndef timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):\n    \"\"\"Decorator that times the execution of a function.\"\"\"\n\n    def wrapper(*args, **kwargs):\n        timing_results = []\n        for _ in range(timing_repeat):\n            start_time = time.time()\n            for _ in range(function_repeat):\n                result = func(*args, **kwargs)\n            end_time = time.time()\n            elapsed_time = end_time - start_time\n            timing_results.append(elapsed_time)\n        print(f\"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds\")\n        return result\n\n    return wrapper\n\n\nmodel = torchvision.models.resnet18()\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nif sys.platform == \"win32\":\n    # Windows doesn't support the new TorchDynamo-based ONNX Exporter\n    torch.onnx.export(\n        model,\n        dummy_input,\n        \"resnet18.onnx\",\n        input_names=[\"input.1\"],\n        dynamic_axes={\"input.1\": {0: \"batch_size\", 2: \"height\", 3: \"width\"}},\n    )\nelse:\n    torch.onnx.dynamo_export(model, dummy_input).save(\"resnet18.onnx\")\n\nort_session = ort.InferenceSession(\"resnet18.onnx\")\n\n\n@timing_decorator\ndef torch_predict(image) -> None:\n    \"\"\"Predict using PyTorch model.\"\"\"\n    model(image)\n\n\n@timing_decorator\ndef onnx_predict(image) -> None:\n    \"\"\"Predict using ONNX model.\"\"\"\n    ort_session.run(None, {\"input.1\": image.numpy()})\n\n\nif __name__ == \"__main__\":\n    for size in [224, 448, 896]:\n        dummy_input = torch.randn(1, 3, size, size)\n        print(f\"Image size: {size}\")\n        torch_predict(dummy_input)\n        onnx_predict(dummy_input)\n
                3. To get a better understanding of why running the model using the ONNX runtime is usually faster, let's try to see what happens to the computational graph. By default, the ONNX Runtime will apply these optimizations in online mode, meaning that the optimizations are applied when the model is loaded. However, it is also possible to apply the optimizations in offline mode, such that the optimized model is saved to disk. Below is an example of how to do this.

                  import onnxruntime as rt\nsess_options = rt.SessionOptions()\n\n# Set graph optimization level\nsess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED\n\n# To enable model serialization after graph optimization set this\nsess_options.optimized_model_filepath = \"optimized_model.onnx\"\n\nsession = rt.InferenceSession(\"<model_path>\", sess_options)\n

                  Try to apply the optimizations in offline mode and use netron to visualize both the original and optimized model side by side. Can you see any differences?

                  Solution

                  You should hopefully see that the optimized model consists of fewer nodes and edges than the original model. These nodes are often called fused nodes, because they are the result of multiple nodes being fused together. In the image below we have visualized the first part of the computational graph of a resnet18 model, before and after optimization.
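
                  Besides inspecting the graphs in netron, a quick way to quantify the difference is to compare node counts with the onnx package, assuming both the original resnet18.onnx and the offline-optimized optimized_model.onnx files exist:

                  import onnx\n\noriginal = onnx.load(\"resnet18.onnx\")\noptimized = onnx.load(\"optimized_model.onnx\")\nprint(\"nodes before:\", len(original.graph.node))\nprint(\"nodes after:\", len(optimized.graph.node))\n# op types that only appear after optimization are typically fused nodes\nprint({n.op_type for n in optimized.graph.node} - {n.op_type for n in original.graph.node})\n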

              6. As mentioned in the introduction, ONNX is able to run on many different types of hardware and execution engines. You can check all supported providers and the providers available on your machine by running the following code

                import onnxruntime\nprint(onnxruntime.get_all_providers())\nprint(onnxruntime.get_available_providers())\n

                Can you figure out how to set which provider the ONNX runtime should use?

                Solution

                The provider that the ONNX runtime should use can be set by passing the providers argument to the InferenceSession class. A list should be provided, which prioritizes the providers in the order they are listed.

                import onnxruntime as rt\nprovider_list = ['CUDAExecutionProvider', 'CPUExecutionProvider']\nort_session = rt.InferenceSession(\"<path-to-model>\", providers=provider_list)\n

                In this case we will prefer CUDA Execution Provider over CPU Execution Provider if both are available.

              7. As you have probably realised in the exercises on docker, it can take a long time to build the kind of containers we are working with, and they can be quite large. There is a reason for this, namely that PyTorch is a very large framework with a lot of dependencies. ONNX on the other hand is a much smaller framework. This makes sense, because PyTorch was primarily designed for development, e.g. training models, while ONNX is designed for serving models. Let's try to quantify this.

                1. Construct a dockerfile that builds a docker image with PyTorch as a dependency. The dockerfile does not actually need to run anything. Repeat the same process for the ONNX runtime. Bonus points for developing a docker image that takes a build arg at build time that specifies if the image should be built with CUDA support or not.

                  Solution

                  The dockerfile for the PyTorch image could look something like this

                  inference_pytorch.dockerfile
                  FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nARG CUDA\nENV CUDA=${CUDA}\nRUN echo \"CUDA is set to: ${CUDA}\"\n\nRUN echo \"CUDA is set to: ${CUDA}\" && \\\n    if [ -n \"$CUDA\" ]; then \\\n        pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121; \\\n    else \\\n        pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu; \\\n    fi\n

                  and the dockerfile for the ONNX image could look something like this

                  inference_onnx.dockerfile
                  FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\n# the build arg must be declared for --build-arg CUDA=... to have any effect\nARG CUDA\nENV CUDA=${CUDA}\n\nRUN echo \"CUDA is set to: ${CUDA}\" && \\\n    if [ -n \"$CUDA\" ]; then \\\n        pip install onnxruntime-gpu; \\\n    else \\\n        pip install onnxruntime; \\\n    fi\n
                2. Build both containers and measure the time it takes to build them. How much faster is it to build the ONNX container compared to the PyTorch container?

                  Solution

                  On unix/linux you can use the time command to measure the time it takes to build the containers. Building both images, with and without CUDA support, can be done with the following commands

                  time docker build . -t pytorch_inference_cuda:latest -f inference_pytorch.dockerfile \\\n    --no-cache --build-arg CUDA=true\ntime docker build . -t pytorch_inference:latest -f inference_pytorch.dockerfile \\\n    --no-cache --build-arg CUDA=\ntime docker build . -t onnx_inference_cuda:latest -f inference_onnx.dockerfile \\\n    --no-cache --build-arg CUDA=true\ntime docker build . -t onnx_inference:latest -f inference_onnx.dockerfile \\\n    --no-cache --build-arg CUDA=\n

                  the --no-cache flag is used to ensure that the build process is not cached, giving a fair comparison. On my laptop these builds respectively took 5m1s, 1m4s, 0m4s and 0m50s, meaning that the ONNX container was respectively 7x (with CUDA) and 1.28x (no CUDA) faster to build than the PyTorch container.

                3. Find out the size of the two docker images. It can be done in the terminal by running the docker images command. How much smaller is the ONNX model compared to the PyTorch model?

                  Solution

                  As of writing the docker image containing the PyTorch framework was 5.54GB (with CUDA) and 1.25GB (no CUDA). In comparison the ONNX image was 647MB (with CUDA) and 647MB (no CUDA). This means that the ONNX image is respectively 8.5x (with CUDA) and 1.94x (no CUDA) smaller than the PyTorch image.

              8. (Optional) Assuming you have completed the module on FastAPI try creating a small FastAPI application that serves a model using the ONNX runtime.

                Solution

                Here is a simple example of how to create a FastAPI application that serves a model using the ONNX runtime.

                onnx_fastapi.py
                import numpy as np\nimport onnxruntime\nfrom fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/predict\")\ndef predict():\n    \"\"\"Predict using ONNX model.\"\"\"\n    # Load the ONNX model\n    model = onnxruntime.InferenceSession(\"model.onnx\")\n\n    # Prepare the input data\n    input_data = {\"input\": np.random.rand(1, 3).astype(np.float32)}\n\n    # Run the model\n    output = model.run(None, input_data)\n\n    return {\"output\": output[0].tolist()}\n

              This completes the exercises on the ONNX format. Do note that one limitation of the ONNX format is that it is based on Protobuf, which is a binary format. A protobuf file can have a maximum size of 2GB, which means that the .onnx format alone is not enough for very large models. However, through the use of external data it is possible to circumvent this limitation.
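
              A minimal sketch of how a model could be saved with external data is shown below; the file names are hypothetical and the onnx package writes the weights to a separate file next to the .onnx file:

              import onnx\n\nmodel = onnx.load(\"large_model.onnx\")  # hypothetical model close to the 2GB limit\nonnx.save_model(\n    model,\n    \"large_model_external.onnx\",\n    save_as_external_data=True,  # store the weights outside the protobuf file\n    all_tensors_to_one_file=True,\n    location=\"large_model.onnx.data\",\n    size_threshold=1024,\n)\n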

              "},{"location":"s7_deployment/ml_deployment/#bentoml","title":"BentoML","text":"

              BentoML cloud vs BentoML OSS

              We are only going to be looking at the open-source version of BentoML in this module. However, BentoML also has a cloud version that makes it very easy to deploy models that are coded in BentoML to the cloud. If you are interested in this, you can check out the BentoML cloud documentation. This business strategy of having an open-source product and a cloud product is very common in the machine learning space (HuggingFace, LightningAI, Weights and Biases etc.), because it allows companies to make money from the cloud product while still providing a free product to the community.

              BentoML is a framework that is designed to make it easy to serve machine learning models. It is designed to be backend agnostic, meaning that it can be used with any computational backend. It is also model agnostic, meaning that it can be used with any machine learning model.

              Let's consider a simple example of how to serve a model using BentoML. The following code snippet shows how to serve a model that uses the transformers library to summarize text.

              import bentoml\nfrom transformers import pipeline\n\nEXAMPLE_INPUT = (\n    \"Breaking News: In an astonishing turn of events, the small town of Willow Creek has been taken by storm as \"\n    \"local resident Jerry Thompson's cat, Whiskers, performed what witnesses are calling a 'miraculous and gravity-\"\n    \"defying leap.' Eyewitnesses report that Whiskers, an otherwise unremarkable tabby cat, jumped a record-breaking \"\n    \"20 feet into the air to catch a fly. The event, which took place in Thompson's backyard, is now being investigated \"\n    \"by scientists for potential breaches in the laws of physics. Local authorities are considering a town festival to \"\n    \"celebrate what is being hailed as 'The Leap of the Century.'\"\n)\n\n@bentoml.service(resources={\"cpu\": \"2\"}, traffic={\"timeout\": 10})\nclass Summarization:\n    def __init__(self) -> None:\n        self.pipeline = pipeline('summarization')\n\n    @bentoml.api\n    def summarize(self, text: str = EXAMPLE_INPUT) -> str:\n        result = self.pipeline(text)\n        return result[0]['summary_text']\n

              In BentoML we organize our services in classes, where each class is a service that we want to serve. The two important parts of the code snippet are the @bentoml.service and @bentoml.api decorators.

              • The @bentoml.service decorator is used to specify the resources that the service should use and in general how the service should be run. In this case we are specifying that the service should use 2 CPU cores and that the timeout for the service should be 10 seconds.

              • The @bentoml.api decorator is used to specify the API that the service should expose. In this case we are specifying that the service should have an API called summarize that takes a string as input and returns a string as output.

              To serve the model using BentoML we can execute the following command, which is very similar to the command we used to serve the model using FastAPI.

              bentoml serve service:Summarization\n
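
              Once the service is running, it could be called from Python with the built-in client, for example with a sketch like this (assuming the default port 3000):

              import bentoml\n\nwith bentoml.SyncHTTPClient(\"http://localhost:3000\") as client:\n    summary = client.summarize(text=\"Some long text that you would like summarized ...\")\n    print(summary)\n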
              "},{"location":"s7_deployment/ml_deployment/#exercises_1","title":"\u2754 Exercises","text":"

              In general, we advise looking through the docs for BentoML if you need help with any of the exercises. We are going to assume that you have done the exercises on ONNX and we are therefore going to be using BentoML to serve ONNX models. If you have not done this part, you can still follow along but you will need to use a PyTorch model instead of an ONNX model.

              1. Install BentoML

                pip install bentoml\n

                Remember to add the dependency to your requirements.txt file.

              2. You are in principle free to serve any model you like, but we recommend just using a torchvision model as in the ONNX exercises. Write your first service in BentoML that serves a model of your choice. We recommend experimenting with providing input/output as tensors because bentoml supports this natively. Secondly, write a client that can send a request to the service and print the result. Here we recommend using the built-in bentoml.SyncHTTPClient.

                Solution

                The following implements a simple BentoML service that serves an ONNX resnet18 model. The service expects both input and output to be numpy arrays.

                bentoml_service.py
                from __future__ import annotations\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierService:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model.onnx\")\n\n    @bentoml.api\n    def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n

                The service can be served using the following command (we set the port explicitly to match the client below)

                bentoml serve bentoml_service:ImageClassifierService --port 4040\n

                To test that the service works the following client can be used

                bentoml_client.py
                import bentoml\nimport numpy as np\nfrom PIL import Image\n\nif __name__ == \"__main__\":\n    image = Image.open(\"my_cat.jpg\")\n    image = image.resize((224, 224))  # Resize to match the minimum input size of the model\n    image = np.array(image)\n    image = np.transpose(image, (2, 0, 1))  # Change to CHW format\n    image = np.expand_dims(image, axis=0)  # Add batch dimension\n\n    with bentoml.SyncHTTPClient(\"http://localhost:4040\") as client:\n        resp = client.predict(image=image)\n        print(resp)\n
              3. We are now going to look at features where BentoML really sets itself apart from FastAPI. The first is adaptive batching. As you are hopefully aware, modern machine learning models can process multiple samples at the same time, and in doing so increase the throughput of the model. When we train a model we often set a fixed batch size; however, we cannot do that when serving the model, because that would mean we would have to wait for the batch to be full before we can process it. Adaptive batching simply refers to the process where we specify a maximum batch size and also a timeout. When either the batch is full or the timeout is reached, however many samples we have collected are sent to the model for processing. This can be a very powerful feature because it allows us to process samples as soon as they arrive, while still taking advantage of the increased throughput of batching.

                The overall architecture of the adaptive batching feature in BentoML. The feature is implemented on the server side and mainly consists of a dispatcher that is in charge of collecting requests and sending them to the model server when either the batch is full or a timeout is reached. Image credit

                1. Look through the documentation on adaptive batching and add adaptive batching to your service from the previous exercise. Make sure your service works as expected by testing it with the client from the previous exercise.

                  Solution bentoml_service_adaptive_batching.py
                  from __future__ import annotations\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierService:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model.onnx\")\n\n    @bentoml.api(\n        batchable=True,\n        batch_dim=(0, 0),\n        max_batch_size=128,\n        max_latency_ms=1000,\n    )\n    def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n
                2. Try to measure the throughput of your model with and without adaptive batching. Assuming that you have completed the module on testing APIs and therefore are familiar with the locust framework, we recommend that you write a simple locustfile and use the locust command to measure the throughput of your model.

                  Solution

                  The following locust file can be used to measure the throughput of the model with and without adaptive batching enabled.

                  locustfile.py
                  import numpy as np\nfrom locust import HttpUser, between, task\nfrom PIL import Image\n\n\ndef prepare_image():\n    \"\"\"Load and preprocess the image as required.\"\"\"\n    image = Image.open(\"my_cat.jpg\")\n    image = image.resize((224, 224))\n    image = np.array(image)\n    image = np.transpose(image, (2, 0, 1))  # Convert to CHW format\n    image = np.expand_dims(image, axis=0)  # Add batch dimension\n    # Convert to list format for JSON serialization\n    return image.tolist()\n\n\nimage = prepare_image()\n\n\nclass BentoMLUser(HttpUser):\n    \"\"\"Locust user class for sending prediction requests to the server.\"\"\"\n\n    wait_time = between(1, 2)\n\n    @task\n    def send_prediction_request(self):\n        \"\"\"Send a prediction request to the server.\"\"\"\n        payload = {\"image\": image}  # Package the image as JSON\n        self.client.post(\"/predict\", json=payload, headers={\"Content-Type\": \"application/json\"})\n

                  and then the following command can be used to measure the throughput of the model

                  locust -f locustfile_bentoml.py --host http://localhost:4040 --headless -u 50 -t 60s\n

                  You should hopefully see that the throughput of the model is higher when adaptive batching is enabled, but the speedup is largely dependent on the model you are running, the configuration of the adaptive batching and the hardware you are running on.

                  On my laptop I saw about a 1.5 - 2x speedup when adaptive batching was enabled.

              4. (Optional, requires GPU) Look through the documentation for inference on GPU and add this to your service. Check that your service works as expected by testing it with the client from the previous exercise and make sure you are seeing a speedup when running on the GPU.

                Solution

                A simple change to the bentoml.service decorator is all that is needed to run the model on the GPU.

                @bentoml.service(resources={\"gpu\": 1})\nclass MyService:\n    def __init__(self) -> None:\n        self.model = torch.load('model.pth').to('cuda:0')\n

              5. Another way to speed up inference is to use multiple workers. This duplicates the server over multiple processes, taking advantage of modern multi-core CPUs. It is similar to running the uvicorn command with the --workers flag for FastAPI applications. Add multiple workers to your service, check that it still works as expected with the client from the previous exercise, and confirm that you are seeing a speedup when running with multiple workers.

                Solution

                Multiple workers can be added to the bentoml.service decorator as shown below.

                @bentoml.service(workers=4)\nclass MyService:\n    # Service implementation\n

                Alternatively, you can set workers=\"cpu_count\" to use all available CPU cores. The speedup depends on the model you are serving, the hardware you are running on and the number of workers you are using, but it should be higher than using a single worker.

              6. In addition to increasing the throughput of your deployments, BentoML can also help with ML applications that require some kind of composition of multiple models. It is very common in production setups to have multiple models that either

                • Run in sequence, e.g. the output of one model is the input of another model. You may have a preprocessing service that preprocesses the data before it is sent to a model that makes a prediction.
                • Run concurrently, e.g. you have multiple models that run at the same time and the output of all the models is combined to make a prediction. Ensemble models are a good example of this.

                BentoML makes it easy to compose multiple models together.

                1. Implement two services that run in sequence, e.g. the output of one service is used as the input of another service. As an example, you can implement a pre- or post-processing service that is used in conjunction with the model you have implemented in the previous exercises.

                  Solution

                  The following code snippet shows how to implement two services that run in sequence.

                  bentoml_service_composition.py
                  from __future__ import annotations\n\nfrom pathlib import Path\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\nfrom PIL import Image\n\n\n@bentoml.service\nclass ImagePreprocessorService:\n    \"\"\"Image preprocessor service.\"\"\"\n\n    @bentoml.api\n    def preprocess(self, image_file: Path) -> np.ndarray:\n        \"\"\"Preprocess the input image.\"\"\"\n        image = Image.open(image_file)\n        image = image.resize((224, 224))\n        image = np.array(image)\n        image = np.transpose(image, (2, 0, 1))\n        return np.expand_dims(image, axis=0)\n\n\n@bentoml.service\nclass ImageClassifierService:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    preprocessing_service = bentoml.depends(ImagePreprocessorService)\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model.onnx\")\n\n    @bentoml.api\n    async def predict(self, image_file: Path) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        image = await self.preprocessing_service.to_async.preprocess(image_file)\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n
                2. Implement three services, where two of them run concurrently and the outputs of both are combined in a third service to make a prediction. As an example, you can expand your previous service to serve two different models and then implement a third service that combines the output of both models to make a prediction.

                  Solution

                  The following code snippet shows how to implement a service that consists of two concurrent services. The example assumes that two models called model_a.onnx and model_b.onnx are available.

                  bentoml_service_composition.py
                  from __future__ import annotations\n\nimport asyncio\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierServiceModelA:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model_a.onnx\")\n\n    @bentoml.api\n    def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n\n\n@bentoml.service\nclass ImageClassifierServiceModelB:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model_b.onnx\")\n\n    @bentoml.api\n    def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n\n\n@bentoml.service\nclass ImageClassifierService:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    model_a = bentoml.depends(ImageClassifierServiceModelA)\n    model_b = bentoml.depends(ImageClassifierServiceModelB)\n\n    @bentoml.api\n    async def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        result_a, result_b = await asyncio.gather(\n            self.model_a.to_async.predict(image), self.model_b.to_async.predict(image)\n        )\n        return (result_a + result_b) / 2\n
                3. (Optional) Implement a service that consists of both sequential and concurrent services.

              7. Similar to deploying a FastAPI application to the cloud, deploying a BentoML application to the cloud often requires you to first containerize it. Because BentoML is designed to be easy to use even for users not that familiar with Docker, it introduces the concept of a bentofile. A bentofile is a file that specifies how the container should be built. Below is an example of what a bentofile could look like.

                service: 'service:Summarization'\nlabels:\n  owner: bentoml-team\n  project: gallery\ninclude:\n  - '*.py'\npython:\n  packages:\n    - torch\n    - transformers\n

                which can then be used to build a bento using the following command

                bentoml build\n

                A bento is not a docker image, but it can be used to build a docker image with the following command

                bentoml containerize summarization:latest\n
                1. Can you figure out how the different parts of the bentofile are used to build the docker image? Additionally, can you figure out from the source repository how the bentofile is used to build the docker image?

                  Solution

                  The service part specifies both what the container should be called and what service it should serve, e.g. the last statement in the corresponding dockerfile is CMD [\"bentoml\", \"serve\", \"service:Summarization\"]. The labels part is used to specify metadata labels for the container, see this link for more info. The include part corresponds to COPY statements in the dockerfile, and finally the python part specifies which Python packages should be installed in the container, corresponding to RUN pip install ... in the dockerfile.

                  Regarding how the bentofile is used to build the docker image, the bentoml package contains a number of templates (written using the jinja2 templating language) that are used to generate the dockerfiles. The templates can be found here.

                2. Take whatever service from the previous exercises and try to containerize it. You are free to either write a bentofile or a dockerfile to do this.

                  Solution

                  The following bentofile can be used to containerize the very first service we implemented in this set of exercises.

                  service: 'bentoml_service:ImageClassifierService'\nlabels:\n  owner: bentoml-team\n  project: gallery\ninclude:\n- 'bentoml_service.py'\n- 'model.onnx'\npython:\n  packages:\n    - onnxruntime\n    - numpy\n

                  The corresponding dockerfile would look something like this

                  FROM python:3.11-slim\nWORKDIR /bento\nCOPY bentoml_service.py .\nCOPY model.onnx .\nRUN pip install onnxruntime numpy bentoml\nCMD [\"bentoml\", \"serve\", \"bentoml_service:ImageClassifierService\"]\n
                3. Deploy the container to GCP Run and test that it works.

                  Solution

                  The following command can be used to deploy the container to GCP Run. We assume that you have already built the container and called it bentoml_service:latest.

                  docker tag bentoml_service:latest \\\n    <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest\ngcloud run deploy bentoml-service \\\n    --image=<region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest \\\n    --platform managed \\\n    --port 3000  # default used by BentoML\n

                  where <region>, <project-id> and <repository-name> should be replaced with the region, the id of the project and the name of the artifact registry repository you are deploying to. The service should now be available at the URL that is printed in the terminal.
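
                  As a quick sanity check (a minimal sketch; the URL below is a hypothetical placeholder, use the one printed by gcloud), you can point the same BentoML client at the deployed service:

                  import bentoml\nimport numpy as np\n\n# hypothetical Cloud Run URL, replace with the one printed by gcloud run deploy\nservice_url = \"https://bentoml-service-xxxxxxxx-ew.a.run.app\"\n\nwith bentoml.SyncHTTPClient(service_url) as client:\n    dummy_image = np.random.rand(1, 3, 224, 224).astype(np.float32)\n    print(client.predict(image=dummy_image))\n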

              This completes the exercises on the BentoML framework. If you want to dive deeper into it, we can recommend looking into their tasks feature for use cases with very long running times and their built-in model management feature to unify the way models are loaded, managed and served.

              "},{"location":"s7_deployment/ml_deployment/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. How would you export a scikit-learn model to ONNX? What method is exported when you export a scikit-learn model to ONNX?

                Solution

                It is possible to export a scikit-learn model to ONNX using the sklearn-onnx package. The following code snippet shows how to export a scikit-learn model to ONNX.

                import numpy as np\nfrom sklearn.datasets import load_iris\nfrom sklearn.ensemble import RandomForestClassifier\nfrom skl2onnx import to_onnx\n\nX, y = load_iris(return_X_y=True)\nmodel = RandomForestClassifier(n_estimators=2)\nmodel.fit(X, y)  # the model needs to be fitted before it can be converted\nonx = to_onnx(model, X[:1].astype(np.float32))\nwith open(\"model.onnx\", \"wb\") as f:\n    f.write(onx.SerializeToString())\n

                The method that is exported when you export a scikit-learn model to ONNX is the predict method (for classifiers, the probabilities from predict_proba are typically also exported as a second output).
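
                As an illustrative sketch (assuming the model.onnx file produced by the snippet above), the exported graph can be run with onnxruntime, where the first output corresponds to predict:

                import numpy as np\nfrom onnxruntime import InferenceSession\n\nsess = InferenceSession(\"model.onnx\")\ninput_name = sess.get_inputs()[0].name\ndummy_input = np.random.randn(1, 4).astype(np.float32)\noutputs = sess.run(None, {input_name: dummy_input})\nprint(outputs[0])  # predicted class labels, i.e. what model.predict would return\n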

              2. In your own words, describe what the concept of a computational graph means.

                Solution

                A computational graph is a way to represent the mathematical operations that are performed in a model. It is essentially a graph where the nodes are the operations and the edges are the data that is passed between them. The computational graph normally represents the forward pass of the model and is the reason that we can easily backpropagate through the model to train it, because the graph contains all the necessary information to calculate the gradients of the model.
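
                A tiny sketch in PyTorch (assuming torch is installed) of a computational graph being built during the forward pass and then traversed backwards to get gradients:

                import torch\n\n# the forward pass builds the graph: y = (x * w + b) ** 2\nx = torch.tensor(2.0)\nw = torch.tensor(3.0, requires_grad=True)\nb = torch.tensor(1.0, requires_grad=True)\ny = (x * w + b) ** 2\n\n# backpropagation traverses the recorded graph to compute gradients\ny.backward()\nprint(w.grad, b.grad)  # dy/dw = 2*(x*w+b)*x = 28, dy/db = 2*(x*w+b) = 14\n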

              3. In your own words, explain why fusing operations together in the computational graph often leads to better performance?

                Solution

                Each time we want to do a computation, the data needs to be loaded from memory into the CPU/GPU. This is a slow process and the more operations we have, the more times we need to load the data. By fusing operations together, we can reduce the number of times we need to load the data, because we can do multiple operations on the same data before we need to load new data.
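
                A minimal sketch of this idea (assuming PyTorch >= 2.0): torch.compile can fuse a chain of element-wise operations into a single kernel, so the data is only loaded once instead of once per operation.

                import torch\n\n\ndef unfused(x):\n    # three separate element-wise ops, each reading and writing the full tensor\n    return (x * 2 + 1).relu()\n\n\n# torch.compile can fuse the element-wise chain into a single kernel\nfused = torch.compile(unfused)\n\nx = torch.randn(1024, 1024)\nassert torch.allclose(unfused(x), fused(x))\n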

              This ends the module on tools specifically designed for serving machine learning models. As stated in the beginning of the module, there are a lot of different tools that can be used to serve machine learning models and the choice of tool often depends on the specific use case. In general, we recommend that whenever you want to serve a machine learning model, you should try out a few different frameworks and see which one fits your use case the best.

              "},{"location":"s7_deployment/testing_apis/","title":"M24 - API Testing","text":""},{"location":"s7_deployment/testing_apis/#api-testing","title":"API testing","text":"

              Core Module

              API testing, similar to unit testing, is a type of software testing that involves testing the application programming interface (API) directly to ensure it meets requirements for functionality, reliability, performance, and security. The core difference from the unit testing we have been implementing until now is that instead of testing the individual functions, we are testing the entire API as a whole. API testing is therefore a form of integration testing. Additionally, another difference is that we need to simulate API calls that should be as similar as possible to the ones that will be made by the users of the API.

              There are in general two things that we want to test when we are working with APIs:

              • Does the API work as intended? e.g. for a given input, does it return the expected output?
              • Can the API handle the expected load? e.g. if we send 1000 requests per second, does it crash?

              In this module, we go over how to do each of them.

              "},{"location":"s7_deployment/testing_apis/#testing-for-functionality","title":"Testing for functionality","text":"

              Similar to when we wrote unit tests for our code back in this module, we can also write tests for our API that check that our code does what it is supposed to do, e.g. by using assert statements. As always we recommend implementing the tests in a separate folder called tests, but we also recommend adding subfolders to separate the different types of tests. For example, for the type of machine learning projects and APIs we have been working with in this course:

              my_project\n|-- src/\n|   |-- train.py\n|   |-- data.py\n|   |-- app.py\n|-- tests/\n|   |-- unittests/\n|   |   |-- test_train.py\n|   |   |-- test_data.py\n|   |-- integrationtests/\n|   |   |-- test_apis.py\n
              "},{"location":"s7_deployment/testing_apis/#exercises","title":"\u2754 Exercises","text":"

              In these exercises, we are going to assume that we want to test an API written in FastAPI (see this module). If the API is written in a different framework then how to write the tests may have to change.

              1. Start by installing httpx which is the client we are going to use during testing:

                pip install httpx\n

                Remember to add it to your requirements.txt file.

              2. If you have already done the module on unittesting then you should already have a tests/ folder. If not then create one. Inside the tests/ folder create a new folder called integrationtests/. Inside the integrationtests/ folder create a file called test_apis.py and write the following code:

                from fastapi.testclient import TestClient\nfrom app.main import app\nclient = TestClient(app)\n

                this code will create a client that can be used to send requests to the API. The app variable is the FastAPI application that we want to test.

              3. Now you can write tests that check that the API works as intended, much like you would write unit tests. For example, if you have a root endpoint that just returns a simple welcome message, you could write a test like this:

                def test_read_root():\n    response = client.get(\"/\")\n    assert response.status_code == 200\n    assert response.json() == {\"message\": \"Welcome to the MNIST model inference API!\"}\n

                make sure to always assert that the status code is what you expect and that the response is what you expect. Add such tests for all the endpoints in your API.
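
                As a hedged sketch (the /predict endpoint, the test image path and the response key below are hypothetical and should be adapted to your own API), a test for an endpoint that accepts a file upload could look like this:

                def test_predict_endpoint():\n    \"\"\"Hypothetical test for a /predict endpoint that accepts an image upload.\"\"\"\n    with open(\"tests/integrationtests/test_image.jpg\", \"rb\") as f:\n        response = client.post(\"/predict\", files={\"file\": (\"test_image.jpg\", f, \"image/jpeg\")})\n    assert response.status_code == 200\n    assert \"prediction\" in response.json()\n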

                Application with lifespans

                If you have an application with lifespan events, e.g. you have implemented the lifespan function in your FastAPI application, you need to use the TestClient as a context manager (in a with statement) instead. This ensures that the startup and shutdown events are triggered during the test. Here is an example:

                def test_read_root():\n    with TestClient(app) as client:\n        response = client.get(\"/\")\n        assert response.status_code == 200\n        assert response.json() == {\"message\": \"Welcome to the MNIST model inference API!\"}\n
              4. To run the tests, you can use the following command:

                pytest tests/integrationtests/test_apis.py\n

                Make sure that all your tests pass.

              "},{"location":"s7_deployment/testing_apis/#load-testing","title":"Load testing","text":"

              The next type of testing we are going to implement for our application is load testing, which is a kind of performance testing. The goal of load testing is to determine how an application behaves under both normal and peak conditions. The purpose is to identify the maximum operating capacity of an application as well as any bottlenecks and to determine which element is causing degradation.

              Before we get started on the exercises, we recommend that you start by defining an environment variable that contains the endpoint of your API, since we need the API running to be able to test it. To begin with, you can just run the API locally, thus in a terminal window run the following command:

              uvicorn app.main:app --reload\n

              by default the API will be running on http://localhost:8000 which we can then define as an environment variable:

              WindowsMac/Linux
              set MYENDPOINT=http://localhost:8000\n
              export MYENDPOINT=http://localhost:8000\n

              However, the end goal is to test an API you have deployed in the cloud. If you have used Google Cloud Run to deploy your API then you can get the endpoint by going to the UI and looking at the service details:

              The endpoint can be seen in the top center. It always starts with `https://` followed by a random string and then `.a.run.app`

              However, we can also use the gcloud command to get the endpoint:

              WindowsMac/Linux
              for /f \"delims=\" %i in ^\n('gcloud run services describe <name> --region=<region> --format=\"value(status.url)\"') do set MYENDPOINT=%i\n
              export MYENDPOINT=$(gcloud run services describe <name> --region=<region> --format=\"value(status.url)\")\n

              where you need to define <name> and <region> with the name of your service and the region it is deployed in.

              "},{"location":"s7_deployment/testing_apis/#exercises_1","title":"\u2754 Exercises","text":"

              For the exercises, we are going to use the locust framework for load testing (the name is a reference to a swarm of locusts invading your application). It is a Python framework that allows you to write tests that simulate many users interacting with your application. It is very easy to get started with and very easy to integrate into your CI/CD pipeline.

              1. Install locust

                pip install locust\n

                Remember to add it to your requirements.txt file.

              2. Make sure you have written an API that you can test. Otherwise, you can for simplicity just use this simple example

                Simple hello world FastAPI example

                model.py
                from fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/\")\ndef read_root():\n    \"\"\"Root endpoint.\"\"\"\n    return {\"Hello\": \"World\"}\n\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n    \"\"\"Get an item by id.\"\"\"\n    return {\"item_id\": item_id}\n
              3. Add a new folder to your tests/ folder called performancetests and inside it create a file called locustfile.py. To that file, you need to add the appropriate code to simulate the users that you want to test. You can read more about how to write a locustfile.py here.

                Solution

                Here we provide a solution to the above simple example:

                locustfile.py
                import random\n\nfrom locust import HttpUser, between, task\n\n\nclass MyUser(HttpUser):\n    \"\"\"A simple Locust user class that defines the tasks to be performed by the users.\"\"\"\n\n    wait_time = between(1, 2)\n\n    @task\n    def get_root(self) -> None:\n        \"\"\"A task that simulates a user visiting the root URL of the FastAPI app.\"\"\"\n        self.client.get(\"/\")\n\n    @task(3)\n    def get_item(self) -> None:\n        \"\"\"A task that simulates a user visiting a random item URL of the FastAPI app.\"\"\"\n        item_id = random.randint(1, 10)\n        self.client.get(f\"/items/{item_id}\")\n
              4. Then try to run the locust command:

                locust -f tests/performancetests/locustfile.py\n

                and then navigate to http://localhost:8089 in your web browser. You should see a page that looks similar to the top of this figure.

                Here you can define the number of users you want to simulate and how many users you want to spawn per second. Finally, you can define which endpoint you want to test. When you are ready, press Start.

                Afterward, you should see the results of the test in the web browser. Answer the following questions:

                • What is the average response time of your API?
                • What is the 99th percentile response time of your API?
                • How many requests per second can your API handle?
              5. Maybe of more use to us is running locust in the terminal. To do this you can run the following command:

                WindowsMac/Linux
                locust -f tests/performancetests/locustfile.py \\\n    --headless --users 10 --spawn-rate 1 --run-time 1m --host %MYENDPOINT%\n
                locust -f tests/performancetests/locustfile.py \\\n    --headless --users 10 --spawn-rate 1 --run-time 1m --host $MYENDPOINT\n

                this will run the test with 10 users that are spawned at a rate of 1 per second for 1 minute.

              6. (Optional) A good use case for load testing in our case is to test that our API can handle a load right after it has been deployed. To do this we need to add appropriate steps to our CI/CD pipeline. Try adding locust to an existing or new workflow file in your .github/workflows/ folder, such that it runs after the deployment step.

                Solution

                The solution here expects that a service called production-model has been deployed to Google Cloud Run. Then the following steps can be added to a workflow file, to first authenticate with Google Cloud, extract the relevant URL, and then run the load test:

                - name: Auth with GCP\n  uses: google-github-actions/auth@v2\n  with:\n    credentials_json: ${{ secrets.GCP_SA_KEY }}\n\n- name: Set up Cloud SDK\n  uses: google-github-actions/setup-gcloud@v2\n\n- name: Extract deployed model URL\n  run: |\n    DEPLOYED_MODEL_URL=$(gcloud run services describe production-model \\\n      --region=europe-west1 \\\n      --format='value(status.url)')\n    echo \"DEPLOYED_MODEL_URL=$DEPLOYED_MODEL_URL\" >> $GITHUB_ENV\n\n- name: Run load test on deployed model\n  env:\n    DEPLOYED_MODEL_URL: ${{ env.DEPLOYED_MODEL_URL }}\n  run: |\n    locust -f tests/performance/locustfile.py \\\n      --headless -u 100 -r 10 --run-time 10m --host=$DEPLOYED_MODEL_URL --csv=/locust/results\n\n- name: Upload locust results\n  uses: actions/upload-artifact@v4\n  with:\n    name: locust-results\n    path: /locust\n

                the results can afterward be downloaded from the artifacts tab in the GitHub UI.

              "},{"location":"s7_deployment/testing_apis/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. In the locust framework, what does the @task decorator do and what does @task(3) mean?

                Solution

                The @task decorator is used to define a task that a user can perform. The @task(3) decorator is used to define a task that a user can perform that is three times more likely to be performed than the other tasks.

              2. In the locust framework, what does the wait_time attribute do?

                Solution

                The wait_time attribute is used to define how long a user should wait between tasks. It can either be a fixed number or a random number between two values.

                from locust import HttpUser, task, between, constant\n\nclass MyUser(HttpUser):\n    wait_time = between(5, 9)\n    # or\n    wait_time = constant(5)\n
              3. Load testing can give numbers on average response time, 99th percentile response time, and requests per second. What do these numbers tell us about the user experience of the API?

                Solution

                The average response time and the 99th percentile response time are both measures of how \"snappy\" the API feels to the user. While the average response time is normally considered the most important, the 99th percentile response time is also important because it tells us if a small number of users are experiencing very slow response times. The requests per second tells us how many users the API can handle at the same time. If this number is too low, it can lead to users experiencing slow response times and may indicate that something is wrong with the API.

              "},{"location":"s8_monitoring/","title":"Monitoring","text":"

              Slides

              • Learn how to detect data drifting using the evidently framework

                M27: Data Drifting

              • Learn how to setup a prometheus monitoring system for your application

                M28: System Monitoring

              We have now reached the end of our machine-learning pipeline. We have successfully developed, trained and deployed a machine learning model. However, the question then becomes: can you trust that your newly deployed model still works as expected after 1 day without you intervening? What about 1 month? What about 1 year?

              There may be corner cases where an ML model keeps working as expected, but the vast majority of ML models will perform worse over time because they stop generalizing well enough. For example, assume you have just deployed an application that classifies images from phones, when suddenly a new phone comes out with a new kind of sensor that takes images that either have a very weird aspect ratio or something else your model is not robust towards. There is nothing wrong with this; you can essentially just retrain your model on new data that accounts for this corner case. However, you need a mechanism that informs you when this happens.

              This is where monitoring comes into play. Monitoring practices are in charge of collecting any information about your application in some format that can then be analyzed and reacted upon. Monitoring is essential to securing the longevity of your applications.

              As with many other sub-fields within MLOps, we can divide monitoring into classic monitoring and ML-specific monitoring. Classic monitoring (known from classic DevOps) is often about

              • Errors: Is my application working without problems?
              • Logs: What is going on?
              • Performance: How fast is my application?

              All of these are basic pieces of information you are interested in regardless of what type of application you are trying to deploy. However, there is also machine learning related monitoring that especially relates to data. Take the example above with the new phone: this we would in general consider to be a data drifting problem, i.e. the data you are trying to do inference on has drifted away from the distribution of data your model was trained on. Such monitoring problems are unique to machine learning applications and need to be handled separately.

              We are in this session going to see examples of both kinds of monitoring.

              Learning objectives

              The learning objectives of this session are:

              • Understand the concepts of data drifting in machine learning applications
              • Can detect data drifting using the evidently framework
              • Understand the importance of different system level monitoring and can conceptually implement it
              "},{"location":"s8_monitoring/data_drifting/","title":"M27 - Data Drifting","text":""},{"location":"s8_monitoring/data_drifting/#data-drifting","title":"Data drifting","text":"

              Data drifting is one of the core reasons why model accuracy degrades over time in production. For machine learning models, data drift is the change in model input data that leads to model performance degradation. In practical terms, this means that the model is receiving input that is outside of the scope that it was trained on, as seen in the figure below, which shows the underlying distribution of a particular feature slowly increasing in value over two years.

              Image credit

              In some cases, it may be that normalizing some feature in a better way allows your model to generalize better, but this is not always the case. The reason for such a drift is commonly some external factor that you essentially have no control over. That really only leaves you with one option: retrain your model on the newly received input features and deploy that model to production. This process will probably repeat over the lifetime of your application if you want to keep it up-to-date with the real world.

              Image credit

              We have now come up with a solution to the data drift problem, but there is one important detail that we have not taken care of: when should we actually trigger the retraining? We do not want to wait around for our model performance to degrade, thus we need tools that can detect when we are seeing a drift in our data.

              "},{"location":"s8_monitoring/data_drifting/#exercises","title":"\u2754 Exercises","text":"

              For these exercises we are going to use the framework Evidently developed by EvidentlyAI. Evidently currently supports drift detection for both regression and classification models. The exercises are in large part taken from here and in general we recommend, if you are in doubt about an exercise, to look at the docs for the API and examples (their documentation can be a bit lacking sometimes, so you may also have to dive into the source code).

              Additionally, we want to stress that data drift detection, concept drift detection etc. is still an active field of research and multiple frameworks therefore exist for doing this kind of detection. In addition to Evidently, we can also mention NannyML, WhyLogs and deepcheck.

              1. Start by installing Evidently

                pip install evidently\n

                you will also need scikit-learn and pandas installed if you do not already have them.

              2. Hopefully you have already gone through session S7 on deployment. As part of the deployment to GCP functions you should have developed an application that can classify the iris dataset, based on a model trained by this script. We are going to convert this into a FastAPI application for our purpose here:

                1. Convert your GCP function into a FastAPI application. The appropriate curl command should look something like this:

                  curl -X 'POST' \\\n    'http://127.0.0.1:8000/iris_v1/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0' \\\n    -H 'accept: application/json' \\\n    -d ''\n

                  and the response body should look like this:

                  {\n    \"prediction\": \"Iris-Setosa\",\n    \"prediction_int\": 0\n}\n

                  We have implemented a solution in this file (called v1) if you need help.

                2. Next we are going to add some functionality to our application: we want the user input to be saved to a database whenever our application is called. However, to not slow down the response to our user, we want to implement this as a background task. A background task is a function that is executed after the user has gotten their response. Implement a background task that saves the user input to a database implemented as a simple .csv file. You can read more about background tasks here. The header of the database should look something like this:

                  time, sepal_length, sepal_width, petal_length, petal_width, prediction\n2022-12-28 17:24:34.045649, 1.0, 1.0, 1.0, 1.0, 1\n2022-12-28 17:24:44.026432, 2.0, 2.0, 2.0, 2.0, 1\n...\n

                  thus the input, timestamp and predicted value should all be saved. We have implemented a solution in this file (called v2) if you need help; a minimal sketch is also shown below.
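
                  A minimal sketch of such a background task (the /iris_v2/ route and the placeholder prediction below are assumptions, replace them with your own route and model call):

                  import csv\nfrom datetime import datetime\n\nfrom fastapi import BackgroundTasks, FastAPI\n\napp = FastAPI()\n\n\ndef save_to_database(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, prediction: int) -> None:\n    \"\"\"Append the input, timestamp and prediction to a simple csv database.\"\"\"\n    with open(\"prediction_database.csv\", \"a\", newline=\"\") as f:\n        csv.writer(f).writerow([datetime.now(), sepal_length, sepal_width, petal_length, petal_width, prediction])\n\n\n@app.post(\"/iris_v2/\")\nasync def iris_inference(\n    sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, background_tasks: BackgroundTasks\n):\n    \"\"\"Run inference and save the input to the database after the response is sent.\"\"\"\n    prediction = 0  # placeholder, call your trained model here\n    background_tasks.add_task(save_to_database, sepal_length, sepal_width, petal_length, petal_width, prediction)\n    return {\"prediction\": prediction}\n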

                3. Call your API a number of times to generate some dummy data in the database.

              3. Create a new data_drift.py file where we are going to implement the data drifting detection and reporting. Start by adding both the real iris data and your generated dummy data as pandas dataframes.

                import pandas as pd\nfrom sklearn import datasets\nreference_data = datasets.load_iris(as_frame=True).frame\ncurrent_data = pd.read_csv('prediction_database.csv')\n

                if done correctly you will most likely end up with two dataframes that look like

                # reference_data\nsepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target\n0                  5.1               3.5                1.4               0.2       0\n1                  4.9               3.0                1.4               0.2       0\n...\n148                6.2               3.4                5.4               2.3       2\n149                5.9               3.0                5.1               1.8       2\n[150 rows x 5 columns]\n\n# current_data\ntime                         sepal_length   sepal_width   petal_length   petal_width   prediction\n2022-12-28 17:24:34.045649   1.0            1.0            1.0           1.0           1\n...\n2022-12-28 17:24:34.045649   1.0            1.0            1.0           1.0           1\n[10 rows x 5 columns]\n

                Standardize the dataframes so that they have the same column names, and drop the time column from the current_data dataframe. A sketch of how this could be done is shown below.
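
                A minimal sketch of this standardization (assuming the column layout from the dataframes above):

                # drop the timestamp column and align the column names with the reference data\ncurrent_data = current_data.drop(columns=[\"time\"])\ncurrent_data.columns = [\n    \"sepal length (cm)\",\n    \"sepal width (cm)\",\n    \"petal length (cm)\",\n    \"petal width (cm)\",\n    \"target\",\n]\n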

              4. We are now ready to generate some reports about data drifting:

                1. Try executing the following code:

                  from evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference_data, current_data=current_data)\nreport.save_html('report.html')\n

                  and open the generated .html page. What does it say about your data? Has it drifted? Make sure to poke around to understand what the different plots are actually showing.

                2. Data drifting is not the only kind of reporting Evidently can make; we can also get reports on data quality. Try first adding a few NaN values to your reference data. Secondly, try changing the report to

                  from evidently.metric_preset import DataDriftPreset, DataQualityPreset\nreport = Report(metrics=[DataDriftPreset(), DataQualityPreset()])\n

                  and re-run the report. Check out the newly generated report. Again, go over the generated plots and make sure that it picked up on the missing values you just added.

                3. The final report preset we will look at is the TargetDriftPreset. Target drift means that our model is over/under-predicting certain classes, or in general terms that the distribution of predicted values differs from the ground truth distribution of targets. Try adding the TargetDriftPreset to the Report class, re-run the analysis and inspect the result. Have your targets drifted?

              5. Evidently reports are meant for debugging, exploration and reporting of results. However, as we stated in the beginning, what we are actually interested in is methods for automatically detecting when we are beginning to drift. For this we will need to look at Test and TestSuites:

                1. Let's start with a simple test that checks if there are any missing values in our dataset:

                  from evidently.test_suite import TestSuite\nfrom evidently.tests import TestNumberOfMissingValues\ndata_test = TestSuite(tests=[TestNumberOfMissingValues()])\ndata_test.run(reference_data=reference_data, current_data=current_data)\n

                  again we could call data_test.save_html to get a nice view of the results (feel free to try it out), but additionally we can also call the data_test.as_dict() method, which will return a dict with the test results. Which dictionary key contains the information about whether all tests have passed or not?

                2. Take a look at this colab notebook that contains all tests implemented in Evidently. Pick 5 tests of your choice, where at least 1 fails by default, and implement them as a TestSuite. Then try changing the arguments of the tests so they better fit your use case and get them all passing.

              6. (Optional) When doing monitoring in practice, we are not always interested in running on all data collected from our API; maybe we only want the last N entries or just the last hour of observations. Since we are already logging the timestamps of when our API is called, we can use those for filtering. Implement a simple filter that either takes an integer n and returns the last n entries in our database, or takes some datetime t and filters away observations earlier than this.

              7. Evidently by default only supports structured data, e.g. tabular data (so does nearly every other framework). The question then becomes how we can extend this to unstructured data such as images or text? The solution is to extract structured features from the data which we then can run the analysis on.

                1. (Optional) For images the simple solution would be to flatten the images and consider each pixel a feature, however this does not work in practice because changes in the individual pixels do not really tell us anything about the image. Instead we should derive some features such as:

                  • Average brightness
                  • Contrast of image
                  • Image sharpness
                  • ...

                  These are all numbers that can make up a feature vector for an image; a sketch of how such features could be extracted is shown below. Try doing this yourself, for example by extracting such features from the MNIST and FashionMNIST datasets, and check if you can detect a drift between the two sets.
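
                  A minimal sketch of extracting such features with PIL (the exact feature definitions are just one reasonable choice):

                  import numpy as np\nfrom PIL import Image, ImageFilter, ImageStat\n\n\ndef extract_image_features(image: Image.Image) -> dict:\n    \"\"\"Extract simple structured features from an image for drift analysis.\"\"\"\n    gray = image.convert(\"L\")\n    stat = ImageStat.Stat(gray)\n    edges = gray.filter(ImageFilter.FIND_EDGES)\n    return {\n        \"avg_brightness\": stat.mean[0],\n        \"contrast\": stat.stddev[0],\n        \"sharpness\": float(np.array(edges).mean()),\n    }\n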

                2. (Optional) For text a common approach is to extract some higher-level embedding such as the very classical GloVe embedding. Try following this tutorial to understand how drift detection is done on text.

                3. Let's instead take a deep learning based approach to doing this. Let's consider the CLIP model, which is normally used for matching images and text (e.g. for image captioning and retrieval). For our purpose this is perfect because we can use the model to get abstract feature embeddings for both images and text:

                  from PIL import Image\nimport requests\n# requires transformers package: pip install transformers\nfrom transformers import CLIPProcessor, CLIPModel\n\nmodel = CLIPModel.from_pretrained(\"openai/clip-vit-base-patch32\")\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n\nurl = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# set either text=None or images=None when only the other is needed\ninputs = processor(text=[\"a photo of a cat\", \"a photo of a dog\"], images=image, return_tensors=\"pt\", padding=True)\n\nimg_features = model.get_image_features(inputs['pixel_values'])\ntext_features = model.get_text_features(inputs['input_ids'], inputs['attention_mask'])\n

                  Both img_features and text_features are in this case (512,) abstract feature embeddings that should be able to tell us something about our data distribution. Try using this method to extract features on two different datasets such as CIFAR10 and SVHN if you want to work with vision, or the IMDB movie review and Amazon review datasets for text. After extracting the features, try running some of the data distribution testing you just learned about.

              8. (Optional) If we have multiple applications and want to run monitoring for each of them, we often also want the monitoring to be a deployed application (that only we can access). Implement a /monitoring/ endpoint that does all the reporting we just went through, such that you have two endpoints:

                http://127.0.0.1:8000/iris_infer/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0 # user endpoint\nhttp://127.0.0.1:8000/iris_monitoring/ # monitoring endpoint\n

                Our monitoring endpoint should return an HTML page showing either an Evidently report or test suite. Try implementing this endpoint. We have implemented a solution in this file if you need help with how to return an HTML page from a FastAPI application.

              9. As a final exercise, we recommend that you try implementing this to run directly in the cloud. You will need to implement it in a container, e.g. a GCP Run service, because the data gathering from the endpoint should still be implemented as a background task. For this to work you will need to change the following:

              10. Instead of saving the input to a local file you should store it either in a GCP bucket or in a BigQuery SQL table (this is the better solution, but also out-of-scope for this course)

              11. You can either run the data analysis locally by just pulling the predictions and training data from cloud storage, or alternatively you can deploy the analysis as its own endpoint that can be invoked. For the latter option we recommend that this should require authentication.

              That ends the module on detection of data drifting, data quality etc. If it has not already been made clear, monitoring of machine learning applications is an extremely hard discipline because it is not clear cut when we should actually respond to a feature beginning to drift and when it is probably fine. That comes down to the individual application and what kind of rules should be implemented. Additionally, the tools presented here are in no way complete and are especially limited in one way: they only consider the marginal distribution of the data. Every analysis we have done has been on the distribution per feature (the marginal distribution), however as the image below shows it is possible for data to have drifted to another distribution with the marginals being approximately the same.

              There are methods such as Maximum Mean Discrepancy (MMD) tests that are able to do testing on multivariate distributions, which you are free to dive into. In this course we will just always recommend considering multiple features when making decisions regarding your deployed applications.

              "},{"location":"s8_monitoring/monitoring/","title":"M28 - System Monitoring","text":""},{"location":"s8_monitoring/monitoring/#monitoring","title":"Monitoring","text":"

              In this module we are going to look into more classical monitoring of applications. The key concept we are often working with here is called telemetry. Telemetry in general refers to any automatic measurement and wireless transmission of data from our application. It could be numbers such as:

              • The number of requests our application is receiving per minute/hour/day. This number is of interest because it is directly proportional to the running cost of the application.
              • The amount of time (on average) our application runs per request. This number is of interest because it is most likely the core contributor to the latency that our users are experiencing (which we want to be low).
              • ...

              In general there are three different kinds of telemetry we are interested in:

              • Metrics: quantitative measurements of the system, usually numbers that are aggregated over a period of time. Example: the number of requests per minute. Purpose: metrics are used to get an overview of the system and are often used to create dashboards.
              • Logs: textual or structured records generated by applications. They provide a detailed account of events, errors, warnings, and informational messages that occur during the operation of the system. Example: system logs, error logs. Purpose: logs are essential for diagnosing issues, debugging, and auditing; they provide a detailed history of what happened in a system, making it easier to trace the root cause of problems and track the behavior of components over time.
              • Traces: detailed records of specific transactions or events as they move through a system. A trace typically includes information about the sequence of operations, timing, and dependencies between different components. Example: distributed tracing in a microservices architecture. Purpose: traces help in understanding the flow of a request or a transaction across different components; they are valuable for identifying bottlenecks, understanding latency, and troubleshooting issues related to the flow of data or control.

              We are mainly going to focus in this module on metrics.

              "},{"location":"s8_monitoring/monitoring/#local-instrumentator","title":"Local instrumentator","text":"

              Before we look into the cloud, let's at least conceptually understand how a given instance of an app can expose values that we may be interested in monitoring.

              The standard framework for exposing metrics is called Prometheus. Prometheus is a time series database that is designed to store metrics. It is designed to be very easy to instrument applications with and to scale to large amounts of data. The way it works is that the instrumented application exposes a /metrics endpoint that Prometheus can query (scrape) to get the current state of the metrics. The metrics are exposed in a format called the Prometheus text format.

              "},{"location":"s8_monitoring/monitoring/#exercises","title":"\u2754 Exercises","text":"
              1. Start by installing prometheus-fastapi-instrumentator in python

                pip install prometheus-fastapi-instrumentator\n

                this will allow us to easily instrument our FastAPI application with prometheus.

              2. Create a simple FastAPI application in a file called app.py. You can reuse any application from the previous module on APIs. To that file now add the following code:

                from prometheus_fastapi_instrumentator import Instrumentator\n\n# your app code here\n\nInstrumentator().instrument(app).expose(app)\n

                This will instrument your application with prometheus and expose the metrics on the /metrics endpoint.

              3. Run the app using the uvicorn server. Make sure that the app exposes the endpoints you expect it to expose, but also check out the /metrics endpoint.

              4. The /metrics endpoint exposes multiple metrics. A metric always looks like this:

                # TYPE key <type>\nkey value\n

                i.e. it is essentially a dictionary of key-value pairs with the added functionality of a <type>. Look at this page on the different types Prometheus metrics can have and try to understand the different metrics being exposed.

              5. Look at the documentation for the prometheus-fastapi-instrumentator and try to add at least one more metric to your application. Rerun the application and confirm that the new metric is being exposed.
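
              As a sketch of the idea (using the underlying prometheus_client library directly, which the instrumentator builds on; the metric name and dummy endpoint below are just examples), a custom counter could be added like this:

              from fastapi import FastAPI\nfrom prometheus_client import Counter\nfrom prometheus_fastapi_instrumentator import Instrumentator\n\napp = FastAPI()\n\n# custom counter, will show up on the /metrics endpoint alongside the default metrics\nprediction_counter = Counter(\"model_predictions_total\", \"Number of predictions served\")\n\n\n@app.get(\"/predict\")\ndef predict():\n    \"\"\"Dummy prediction endpoint that increments the counter on every call.\"\"\"\n    prediction_counter.inc()\n    return {\"prediction\": 0}\n\n\nInstrumentator().instrument(app).expose(app)\n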

              "},{"location":"s8_monitoring/monitoring/#cloud-monitoring","title":"Cloud monitoring","text":"

              Any cloud system with respect for itself will have some kind of monitoring system. GCP has a service called Monitoring that is designed to monitor all the different services. By default it will monitor a lot of metrics out of the box. However, the question is if we want to monitor more than the default metrics. The complexity that comes with doing monitoring in the cloud is that we need more than one container: we at least need one container actually running the application (also exposing the /metrics endpoint) and then we need another container that collects the metrics from the first container and stores them in a database. To implement such a system of containers that need to talk to each other we in general need a container orchestration system such as Kubernetes. This is out of scope for this course, but we can use a feature of Cloud Run called sidecar containers to achieve the same effect. A sidecar container is a container that runs alongside the main container and can be used to do things such as collecting metrics.

              "},{"location":"s8_monitoring/monitoring/#exercises_1","title":"\u2754 Exercises","text":"
              1. Overall we recommend that you just become familiar with the monitoring tab for your Cloud Run service (see the image above). Try to invoke your service a couple of times and see what happens to the metrics over time.

                1. (Optional) If you really want to load test your application we recommend checking out the tool locust. Locust is a Python based load testing tool that can be used to simulate many users accessing your application at the same time.
              2. Try creating a service level objective (SLO). In short, an SLO is a target for how well your application should be performing. Click the Create SLO button and fill it out with what you consider to be a good SLO for your application.

              3. (Optional) To expose our own metrics we need to set up a sidecar container. To do this, follow the instructions here. We have set up a simple example that uses FastAPI and Prometheus that you can find here. After you have correctly set up the sidecar container you should be able to see the metrics in the monitoring tab.

              "},{"location":"s8_monitoring/monitoring/#alert-systems","title":"Alert systems","text":"

              A core problem within monitoring is alert systems. The alert system is in charge of sending out alerts to relevant people when some telemetry or metric we are tracking is not behaving as it should. When and how many alerts should be sent out is a subjective choice, and in general it should be proportional to how important the metric/telemetry is. We commonly run into what is referred to as the Goldilocks problem, where we want just the right amount of alerts; however, it is more often the case that we either have

              • Too many alerts, such that they become irrelevant and the really important ones are overlooked, often referred to as alert fatigue
              • Or alternatively, too few alerts, such that problems that should have triggered an alert are not dealt with when they happen, which can have unforeseen consequences.

              Therefore, setting up proper alert systems can be as challenging as setting up the systems that actually collect the metrics we want to trigger alerts on.

              "},{"location":"s8_monitoring/monitoring/#exercises_2","title":"\u2754 Exercises","text":"

              In this exercise we are going to look at how we can set up automatic alerting such that we get a message every time one of our applications is not behaving as expected.

              1. Go to the Monitoring service. Then go to Alerting tab.

              2. Start by setting up a notification channel. We recommend setting it up with an email.

              3. Next let's create a policy. Clicking the Add Condition button should bring up a window as below. You are free to set up the condition as you want, but the image shows one way to set up an alert that will react to the number of times a cloud function is invoked (actually it measures the number of log entries from cloud functions).

              4. After adding the condition, add the notification channel you created in one of the earlier steps. Remember to also add some documentation that should be sent with the alert to better describe what the alert is actually doing.

              5. When the alert is set up you need to trigger it. If you set up the condition as in the image above, you just need to invoke the cloud function many times. Here is a small code snippet that you can execute on your laptop to call a cloud function many times (you need to change the url and payload depending on your function):

                import requests\nurl = 'https://us-central1-dtumlops-335110.cloudfunctions.net/function-2'\npayload = {'message': 'Hello, General Kenobi'}\n\nfor _ in range(1000):\n    r = requests.get(url, params=payload)\n
              6. Make sure that you get the alert through the notification channel you set up.

              "},{"location":"s9_scalable_applications/","title":"Scaling applications","text":"

              Slides

              • Learn how to setup distributed data loading in your PyTorch application

                M29: Distributed Data Loading

              • Learn how to do distributed training in PyTorch using pytorch-lightning

                M30: Distributed Training

              • Learn how to do scalable inference in PyTorch

                M31: Scalable Inference

              This module is all about scaling the applications that we are building. We are here going to use a very narrow definition of scaling, namely that we want our applications to run faster; one should note that in general scaling is a much broader term. There are many different ways to scale your applications and we are going to look at three of these related to different tasks in machine learning algorithms:

              • Scaling data loading
              • Scaling training
              • Scaling inference

              We are going to approach the term scaling from two different angles and both should result in your application running faster. The first approach is leveraging multiple devices, such as using multiple CPU cores or parallelizing training across multiple GPUs. The second approach is more analytical, where we are going to look at how we can design smaller model architectures that run faster.

              It should be noted that this module is specific to working with PyTorch applications. In particular, we are going to see how we can both improve base PyTorch code and how to utilize PyTorch Lightning, which we introduced in module M14 on boilerplate, to improve the scaling of our applications. If your application is written using another framework, the same techniques in these modules most likely transfer, but you may need to look up how to apply them in that specific framework.

              If you manage to complete all modules in this session, feel free to check out the extra module on scalable hyperparameter optimization.

              Learning objectives

              The learning objectives of this session are:

              • Understand how data loading during training can be parallelized and have experimented with it
              • Understand the different paradigms for distributed training and can run multi-GPU experiments using the framework pytorch-lightning
              • Knowledge of different ways, including quantization, pruning, architecture tuning etc. to improve inference speed
              "},{"location":"s9_scalable_applications/data_loading/","title":"M29 - Distributed Data Loading","text":""},{"location":"s9_scalable_applications/data_loading/#distributed-data-loading","title":"Distributed Data Loading","text":"

              Core Module

              One way that deep learning fundamentally changed the way we think about data in machine learning is that more data is always better. This was very much not the case with more traditional machine learning algorithms (random forest, support vector machines etc.) where a plateau in performance was often reached for a certain amount of data and did not improve if more was added. However, as deep learning models have become deeper and deeper and thereby more and more data-hungry performance seems to be ever increasing or at least not reaching a plateau in the same way as for traditional machine learning.

              Image credit

              As we are trying to feed more and more data into our models, the obvious first question to ask is how to do this efficiently. As a general rule of thumb, we want the performance bottleneck to be the forward/backward e.g. the actual computation in our neural network and not the data loading. By bottleneck, we here refer to the part of our pipeline that is restricting how fast we can process data. If data loading is our bottleneck, then our compute device can sit idle while waiting for data to arrive, which is both inefficient and costly. For example, if you are using a cloud provider for training deep learning models, you are paying by the hour per device, and thus not using them fully can be costly in the long run.

              In the first set of exercises, we are therefore going to focus on distributed data loading i.e. how to load data in parallel to make sure that we always have data ready for our compute devices. We are in the following going to look at what is going on behind the scenes when we use PyTorch to parallelize data loading.

              "},{"location":"s9_scalable_applications/data_loading/#a-closer-look-at-data-loading","title":"A closer look at Data loading","text":"

              Before we talk about distributed applications it is important to understand the physical layout of a standard CPU (the brain of your computer).

              Most modern CPUs are a single chip that consists of multiple cores. Each core can further be divided into threads. In most laptops, the core count is 4 with commonly 2 threads per core. This means that the common laptop has 8 threads. The number of threads a compute unit has is important because it directly corresponds to the number of parallel operations that can be executed, i.e. one per thread. In a Python terminal you should be able to get the number of cores in your machine by writing (try it):

              import multiprocessing\n\ncores = multiprocessing.cpu_count()\nprint(f\"Number of cores: {cores}, Number of threads: {2*cores}\")  # assumes 2 threads per core\n

              A distributed application is in general any kind of application that parallelizes some or all of its workload. We are in these exercises only focusing on distributed data loading, which happens primarily on the CPU. In PyTorch it is easy to parallelize data loading if you are using their dataset/data loader interface:

              from torch.utils.data import Dataset, DataLoader\n\nclass MyDataset(Dataset):\n    def __init__(self, ...):\n        # whatever logic is needed to init the data set\n        self.data = ...\n\n    def __len__(self):\n        # the dataloader needs to know how many items the dataset contains\n        return len(self.data)\n\n    def __getitem__(self, idx):\n        # return one item\n        return self.data[idx]\n\ndataset = MyDataset()\ndataloader = DataLoader(\n    dataset,\n    batch_size=8,\n    num_workers=4  # number of workers used to load data in parallel\n)\n

              Let's take a deep dive into what happens when we request a batch from our dataloader e.g. next(dataloader). First, we must understand that we have a thread that plays the role of the main and the remaining threads (in the above example we request 4) are called workers. When the dataloader is created, we create this structure and make sure that all threads have a copy of our dataset definition so each can call the __getitem__ method.

              Then comes the actual part where we request a batch of data. Assume that we have a batch size of 8 and we do not do any shuffling. In this step, the master thread then distributes the list of requested data points ([0,1,2,3,4,5,6,7]) to the four worker threads. With 8 indices and 4 workers, each worker will receive 2 indices.

              Each worker thread then calls the __getitem__ method for all the indices it has received. When all workers are done, the loaded data points get sent back to the master thread and collected into a single structure/tensor.

              Each arrow corresponds to a communication between two threads, which is not a free operation. In total, to get a single batch (not counting the initial startup cost) in this example we need to do 8 communication operations. This may seem like a small price to pay, but that is not always the case. If the processing time of __getitem__ is very low (e.g. the data is stored in memory and we just need to index into it), then it does not make sense to use multiprocessing: the computational savings from doing the look-up operations in parallel are smaller than the communication cost between the main thread and the workers. Multiprocessing makes sense when the processing time of __getitem__ is high (e.g. the data is stored on the hard drive).
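
              To see the trade-off in practice, here is a small self-contained sketch (our own illustration, not part of the exercise files) that times a dataset with an artificially slow __getitem__ for different values of num_workers:

              import time\nimport torch\nfrom torch.utils.data import Dataset, DataLoader\n\nclass SlowDataset(Dataset):\n    \"\"\"Dataset where every item takes a few milliseconds to load, mimicking slow disk reads.\"\"\"\n    def __len__(self):\n        return 256\n\n    def __getitem__(self, idx):\n        time.sleep(0.005)  # simulate an expensive load\n        return torch.randn(3, 64, 64)\n\nif __name__ == \"__main__\":  # guard needed because workers are started as separate processes\n    for num_workers in [0, 2, 4]:\n        loader = DataLoader(SlowDataset(), batch_size=32, num_workers=num_workers)\n        tic = time.time()\n        for batch in loader:\n            pass\n        print(f\"num_workers={num_workers}: {time.time() - tic:.2f} seconds\")\n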

              It is this trade-off that we are going to investigate in the exercises.

              "},{"location":"s9_scalable_applications/data_loading/#exercises","title":"\u2754 Exercises","text":"

              This exercise is intended to be done on the labeled faces in the wild (LFW) dataset. The dataset consists of images of famous people extracted from the internet. The dataset has been used to drive the field of facial verification, which you can read more about here. We are going to imagine that this dataset cannot fit in memory, and your job is therefore to construct a data pipeline that can be parallelized, based on loading the raw data files (.jpg) at runtime.

              1. Download the dataset and extract it to a folder. It does not matter if you choose the non-aligned or aligned version of the dataset.

              2. We provide the lfw_dataset.py file where we have started the process of defining a data class. Fill out the __init__, __len__ and __getitem__. Note that __getitem__ expects that you return a single img which should be a torch.Tensor. Loading should be done using PIL Image, as PIL images are the default input format for torchvision for transforms (for data augmentation).

              3. Make sure that the script runs without any additional arguments

                python lfw_dataset.py\n
              4. Visualize a single batch by filling out the codeblock after the first TODO right after defining the dataloader. The visualization should be shown when launching the script as

                python lfw_dataset.py -visualize_batch\n

                Hint: this tutorial.

              5. Experiment with how the number of workers influences the performance. We have already provided code that will pass over 100 batches from the dataset 5 times and calculate how long it took, which you can play around with by calling

                python lfw_dataset.py -get_timing -num_workers 1\n

                Make an errorbar plot with the number of workers along the x-axis and the timing along the y-axis (a small plotting sketch is shown after this exercise list). The errorbars should correspond to the standard deviation over the 5 runs. HINT: if it is taking too long to evaluate, measure the time over fewer batches (set the -batches_to_check flag). Also, if you are not seeing an improvement, try increasing the batch size (since data loading is parallelized per batch).

                For certain machines like Macs with the M1 chipset it is necessary to set the multiprocessing_context flag in the dataloader to \"fork\". This essentially tells the dataloader how the worker processes should be created.

              6. Retry the experiment where you change the data augmentation to be more complex:

                lfw_trans = transforms.Compose([\n    transforms.RandomAffine(5, (0.1, 0.1), (0.5, 2.0)),\n    # add more transforms here\n    transforms.ToTensor()\n])\n

                by making the augmentation more computationally demanding, it should be easier to get a boost in performance when using multiple workers because the data augmentation is also executed in parallel.

              7. (Optional, requires access to GPU) If you are transferring data to the GPU it is beneficial to set the pin_memory flag to True. By setting this flag we are essentially telling PyTorch that it can lock the data in place in page-locked memory, which will make the transfer between the host (CPU) and the device (GPU) faster.

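              As a hint for the plotting part of exercise 5, a minimal errorbar plot could look like the sketch below, where the timing numbers are placeholders that you should replace with your own measurements:

              import numpy as np\nimport matplotlib.pyplot as plt\n\nworkers = [1, 2, 4, 8]\n# placeholder numbers: one row per num_workers setting, one column per repetition\ntimings = np.array([\n    [10.1, 10.3, 9.9, 10.2, 10.0],\n    [5.6, 5.8, 5.5, 5.7, 5.6],\n    [3.1, 3.3, 3.0, 3.2, 3.1],\n    [2.4, 2.6, 2.3, 2.5, 2.4],\n])\n\nplt.errorbar(workers, timings.mean(axis=1), yerr=timings.std(axis=1), marker=\"o\")\nplt.xlabel(\"Number of workers\")\nplt.ylabel(\"Time for 100 batches [s]\")\nplt.savefig(\"timing_vs_workers.png\")\n
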
              This ends the module on distributed data loading in PyTorch. If you want to go into more details we highly recommend that you read this paper that goes into great detail on analyzing how data loading in PyTorch works and performance benchmarks.

              "},{"location":"s9_scalable_applications/distributed_training/","title":"M30 - Distributed Training","text":""},{"location":"s9_scalable_applications/distributed_training/#distributed-training","title":"Distributed Training","text":"

              In this module we are going to look at distributed training. Distributed training is one of the key ingredients behind all the awesome results that deep learning models are producing. For example: Alphafold, the highly praised model from Deepmind that seems to have solved protein structure prediction, was trained in a distributed fashion for a few weeks. The training was done on 16 TPUv3s (specialized hardware), which is approximately equal to 100-200 modern GPUs. This means that training Alphafold without distributed training on a single GPU (probably not even possible) would take a couple of years! It is therefore currently simply impossible to train some of the state-of-the-art (SOTA) models within deep learning without taking advantage of distributed training.

              When we talk about distributed training, there are a number of different paradigms that we may use to parallelize our computations

              • Data parallel (DP) training
              • Distributed data parallel (DDP) training
              • Sharded training

              In this module we are going to look at data parallel training, which is the original way of doing parallel training, and distributed data parallel training, which is an improved version of data parallel. If you want to know more about sharded training, which is the newest of the paradigms, you can read more about it in this blog post, which describes how sharded training can save over 60% of the memory used during your training.

              Finally, we want to note that for all the exercises in the module you are going to need a multi GPU setup. If you have not already gained access to multi GPU machines on GCP (see the quotas exercises in this module) you will need to find another way of running the exercises. For DTU Students I can recommend checking out this optional module on using the high performance cluster (HPC) where you can get access to multi GPU resources.

              "},{"location":"s9_scalable_applications/distributed_training/#data-parallel","title":"Data parallel","text":"

              While data parallel today is in general seen as obsolete compared to distributed data parallel, we are still going to investigate it a bit since it offers the simplest form of distributed computation in a deep learning pipeline.

              The figure below shows both the forward and backward step in the data parallel paradigm.

              The steps are the following:

              • Whenever we try to do a forward call, e.g. out=model(batch), we take the batch and divide it equally between all devices. If we have a batch size of N and M devices, each device will be sent N/M datapoints.

              • Afterwards each device receives a copy of the model, i.e. a copy of the weights that currently parametrize our neural network.

              • In this step we perform the actual forward pass in parallel. This is the step that can help us scale our training.

              • Finally we need to send back the output of each replicated model to the primary device.

              Similar to the analysis we did of parallel data loading, we cannot always expect that this will actually take less time than doing the forward call on a single GPU. If we are parallelizing over M devices, we essentially need to do 3xM communication calls to send the batch, model and output between the devices. If the savings from the parallel forward call do not outweigh this, then it will take longer.

              In addition, we also have the backward path to focus on

              • As the forward pass collected the output on the primary device, this is also where the loss is accumulated. Thus, loss gradients are first calculated on the primary device

              • Next we scatter the gradient to all the workers

              • The workers then perform a parallel backward pass through their individual model

              • Finally, we reduce (sum) the gradients from all the workers on the main process such that we can do gradient descent.

              One of the big downsides of using data parallel is that all the replicas are destroyed after each backward call. This means that we over and over again need to replicate our model and send it to the devices that are part of the computations.

              Even though it may seem like implementing data parallel requires a lot of logic in your code, in PyTorch we can very simply enable data parallel training by wrapping our model in the nn.DataParallel class.

              from torch import nn\nmodel = MyModelClass()\nmodel = nn.DataParallel(model, device_ids=[0, 1])  # data parallel on gpu 0 and 1\npreds = model(input)  # same as usual\n
              "},{"location":"s9_scalable_applications/distributed_training/#exercises","title":"\u2754 Exercises","text":"

              Please note that the exercise only makes sense if you have access to multiple GPUs.

              1. Create a new script (call it data_parallel.py) where you take a copy of model FashionCNN from the fashion_mnist.py script. Instantiate the model and wrap torch.nn.DataParallel around it such that it can be executed in data parallel.

              2. Try to run inference in parallel on multiple devices (pass a batch multiple times and time it) e.g.

                import time\nstart = time.time()\nfor _ in range(n_reps):\n    out = model(batch)\nend = time.time()\n

                Does data parallel decrease the inference time? If no, can you explain why that may be? Try playing around with the batch size, and see if data parallel is more beneficial for larger batch sizes.

              "},{"location":"s9_scalable_applications/distributed_training/#distributed-data-parallel","title":"Distributed data parallel","text":"

              It should be clear that there is a huge disadvantage of using the data parallel paradigm to scale your applications: the model needs to be replicated on each pass (because it is destroyed at the end), which requires a large transfer of data. This is the main problem that distributed data parallel tries to solve.

              The key difference between distributed data parallel and data parallel is that we move the model update (the gradient step) to happen on each device in parallel instead of only on the main device. This has the consequence that we do not need to re-replicate the model on each step; instead we just keep a local version on each device that we keep updating. The full set of steps (as shown in the figure):

              • Initialize an exact copy of the model on each device

              • From disk (or memory) we start by loading data into a section of page-locked host memory per device. Page-locked memory is essentially a way to reserve a piece of a computer's memory for a specific transfer that is going to happen over and over again, to speed it up. The page-locked regions are loaded with non-overlapping data.

              • Transfer data from page-locked memory to each device in parallel

              • Perform forward pass in parallel

              • Do an all-reduce operation on the gradients. An all-reduce operation is a so-called all-to-all operation, meaning that every process sends its own gradient to all other processes and also receives the gradients from all other processes.

              • Reduce the combined gradient signal from all processes and update the individual model in parallel. Since all processes received the same gradient information, all models will still be in sync.

              Thus, in distributed data parallel we end up doing only a single communication call between all processes, compared to all the communication going on in data parallel. While all-reduce is a more expensive operation than many of the other communication operations that we can do, because we only have to do a single one we gain a huge performance boost. Empirically, distributed data parallel tends to be 2-3 times faster than data parallel.

              However, this performance increase does not come for free. Where we could implement data parallel in a single line in PyTorch, distributed data parallel is much more involved.
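
              To give an idea of what this extra machinery looks like, below is a minimal, stripped-down sketch of a DDP training script, assumed to be launched with torchrun --nproc_per_node=2 train_ddp.py (this is our own illustration, not the distributed_example.py provided with the exercises):

              import torch\nimport torch.distributed as dist\nfrom torch.nn.parallel import DistributedDataParallel as DDP\nfrom torch.utils.data import DataLoader, DistributedSampler, TensorDataset\n\ndef main():\n    dist.init_process_group(backend=\"gloo\")  # use \"nccl\" when training on GPUs\n    rank = dist.get_rank()\n\n    model = torch.nn.Linear(10, 2)\n    ddp_model = DDP(model)  # gradients are automatically all-reduced during backward\n\n    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))\n    sampler = DistributedSampler(dataset)  # gives each process a non-overlapping subset of the data\n    loader = DataLoader(dataset, batch_size=8, sampler=sampler, pin_memory=True)\n\n    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)\n    for x, y in loader:\n        optimizer.zero_grad()\n        loss = torch.nn.functional.cross_entropy(ddp_model(x), y)\n        loss.backward()\n        optimizer.step()\n    print(f\"rank {rank} finished\")\n    dist.destroy_process_group()\n\nif __name__ == \"__main__\":\n    main()\n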

              "},{"location":"s9_scalable_applications/distributed_training/#exercises_1","title":"\u2754 Exercises","text":"
              1. We have provided an example of how to do distributed data parallel training in PyTorch in the two files distributed_example.py and distributed_example.sh. Your objective is to get an understanding of the necessary components in the script to get this kind of distributed training to work. Try to answer the following questions (HINT: try to Google around):

                1. What is the function of the DDP wrapper?

                2. What is the function of the DistributedSampler?

                3. Why is it necessary to call dist.barrier() before passing a batch into the model?

                4. What do the different environment variables in the .sh file do?

              2. Try to benchmark the runs using 1 and 2 GPUs

              3. The first exercise has hopefully convinced you that it can be quite troublesome to write distributed training applications yourself. Luckily for us, PyTorch-lightning can take care of this for us such that we do not have to care about the specific details. To get your model training on multiple GPUs you need to change two arguments in the trainer: the accelerator flag and the gpus flag (see the sketch after this exercise list). In addition to this, you can read through this guide about any additional steps you may need to do (for many of you, it should just work). Try running your model on multiple GPUs.

              4. Try benchmarking your training using 1 and 2 GPUs, e.g. try running a couple of epochs and measure how long it takes. How much of a speedup can you actually get? Why can you not get a speedup of 2?

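              As referenced in exercise 3, a sketch of what the trainer change could look like is shown below. Note that the exact argument names depend on your pytorch-lightning version (older versions use gpus=2, newer versions use accelerator and devices), so check the linked guide for the version you have installed:

              import torch\nimport pytorch_lightning as pl\nfrom torch.utils.data import DataLoader, TensorDataset\n\nclass LitModel(pl.LightningModule):\n    \"\"\"Tiny stand-in model, only here to make the sketch self-contained.\"\"\"\n    def __init__(self):\n        super().__init__()\n        self.layer = torch.nn.Linear(10, 2)\n\n    def training_step(self, batch, batch_idx):\n        x, y = batch\n        return torch.nn.functional.cross_entropy(self.layer(x), y)\n\n    def configure_optimizers(self):\n        return torch.optim.SGD(self.parameters(), lr=0.1)\n\nloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))), batch_size=8)\n\ntrainer = pl.Trainer(\n    accelerator=\"gpu\",  # train on GPUs\n    devices=2,          # use 2 GPUs; older lightning versions use gpus=2 instead\n    strategy=\"ddp\",     # distributed data parallel under the hood\n    max_epochs=1,\n)\ntrainer.fit(LitModel(), loader)\n
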
              "},{"location":"s9_scalable_applications/inference/","title":"M31 - Scalable Inference","text":""},{"location":"s9_scalable_applications/inference/#scalable-inference","title":"Scalable Inference","text":"

              Inference is the task of applying our trained model to some new and unseen data, often called prediction. Thus, scaling inference is different from scaling data loading and training, mainly due to inference normally only using a single data point (or a few). As we can neither parallelize the data loading nor parallelize using multiple GPUs (at least not in any efficient way), this is of no use to us when we are doing inference. Additionally, performing inference is often not something we do on machines that can perform large computations, as most inference today is actually either done on edge devices e.g. mobile phones or in low-cost-low-compute cloud environments. Thus, we need to be smarter about how we scale inference than just throwing more computing power at it.

              In this module, we are going to look at various ways that you can either reduce the size of your model or make your model faster. Both are important for running inference fast regardless of the setup you are running your model on. We want to note that this is still very much an active area of research and therefore best practices for what to do in a specific situation can change.

              "},{"location":"s9_scalable_applications/inference/#choosing-the-right-architecture","title":"Choosing the right architecture","text":"

              Assume you are starting a completely new project and have to come up with a model architecture for doing this. What is your strategy? The common way to do this is to look at prior work on similar problems that you are facing and either directly choose the same architecture or create some slight variation hereof. This is a great way to get started, but the architecture that you end up choosing may be optimal in terms of performance but not inference speed.

              The fact is that not all base architectures are created equal, and a 10K parameter model with one architecture can have a significantly different inference speed than another 10K parameter model with a different architecture. For example, consider the figure below which compares a number of models from the timm package, colored based on their base architecture. The general trend is that the number of images that can be processed by a model per second (y-axis) is inversely proportional to the number of parameters (x-axis). However, we in general see that convolutional base architectures (conv) are more efficient than transformers (vit) for the same parameter budget.

              Image credit"},{"location":"s9_scalable_applications/inference/#exercises","title":"\u2754 Exercises","text":"

              As discussed in this blogpost, the largest increase in inference speed you will see (given some specific hardware) comes from choosing an efficient model architecture. In the exercises below we are going to investigate the inference speed of different architectures.

              1. Start by checking out this table which contains a list of pretrained weights in torchvision. Try finding an

                • Efficient net
                • Resnet
                • Transformer based

                model that has in the range of 20-30 million parameters.

              2. Write a small script that first initializes all models, creates a dummy input tensor of shape [100, 3, 256, 256] and then measures the time it takes to do a forward pass on the input tensor. Make sure to do this multiple times to get a good average time.

                Solution

                In this solution, we have chosen to use the efficientnet b5 (30.4M parameters), resnet50 (25.6M parameters) and the swin v2 transformer tiny (28.4M parameters) models.

                import time\nimport torch\nfrom torchvision import models\n\nmodel_list = [\"efficientnet_b5\", \"resnet50\", \"swin_v2_t\"]\nimage = torch.randn(100, 3, 256, 256)\n\nn_reps = 10\nfor i, m in enumerate(model_list):\n    model = models.get_model(m)\n    tic = time.time()\n    for _ in range(n_reps):\n        _ = model(image)\n    toc = time.time()\n    print(f\"Model {i} took: {(toc - tic) / n_reps}\")\n
              3. Do the results make sense? Based on the above figure we would expect that efficientnet is faster than resnet, which is faster than the transformer based model. Is this also what you are seeing?

              4. To figure out why one net is more efficient than another we can try to count the operations each network needs to do for inference. An operation can here be defined as a FLOP (floating point operation), which is any mathematical operation (such as +, -, *, /) or assignment that involves floating-point numbers. Luckily for us, someone has already created a python package for calculating this in pytorch: ptflops

                1. Install the package

                  pip install ptflops\n
                2. Try calling the get_model_complexity_info function from the ptflops package on the networks from the previous exercise. What are the results?

                  Solution
                  from ptflops import get_model_complexity_info\nfrom torchvision import models\n\nmodel_list = [\"efficientnet_b5\", \"resnet50\", \"swin_v2_t\"]\nfor model_name in model_list:\n    macs, params = get_model_complexity_info(\n        models.get_model(model_name),  # use the current model, not just the first in the list\n        (3, 256, 256),\n        backend='pytorch',\n        print_per_layer_stat=False,\n    )\n    print(f\"Model {model_name} has {params} parameters and uses {macs}\")\n
              5. In the table from the initial exercise, you could also see the overall performance of each network on the Imagenet-1K dataset. Given this performance, the inference speed and the FLOPs count, what network would you choose to use in a production setting? Discuss when choosing one over another should be considered.

              "},{"location":"s9_scalable_applications/inference/#quantization","title":"Quantization","text":"

              Quantization is a technique where all computations are performed with integers instead of floats. We are essentially taking all continuous signals and converting them into discretized signals.

              Image credit

              As discussed in this blogpost series, float (32-bit) is the primarily used precision in machine learning because it strikes a good balance between memory consumption, precision and computational requirements. However, this does not mean that we cannot take advantage of quantization during inference to improve the speed of our model. For instance:

              • Floating-point computations are slower than integer operations

              • Recent hardware often has specialized support for doing integer operations

              • Many neural networks are actually not bottlenecked by how many computations they need to do but by how fast we can transfer data e.g. the memory bandwidth and cache of your system is the limiting factor. Therefore working with 8-bit integers vs 32-bit floats means that we can approximately move data around 4 times as fast.

              • Storing models in 8-bit integers instead of 32-bit floats saves us approximately 75% of the RAM/hard disk space whenever we save a checkpoint. This is especially useful in relation to deploying models using docker (as you hopefully remember) as it will lower the size of our docker images.

              But how do we convert between floats and integers in quantization? In most cases we often use a linear affine quantization:

              $$ x_{int} = \\text{round}\\left( \\frac{x_{float}}{s} + z \\right) $$

              where $s$ is a scale and $z$ is the so-called zero point. The small sketch below shows what this mapping looks like for a single tensor in PyTorch. But how does this relate to doing inference in a neural network? The figure below shows all the conversions that we need to make to our standard inference pipeline to actually do computations in quantized format.
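
              As a small illustration of the formula above, and a head start on the first exercise below, the following sketch maps a float tensor to 8-bit integers and back in PyTorch:

              import torch\n\nx = torch.randn(4)           # float32 tensor\nscale, zero_point = 0.1, 10  # example quantization parameters\nxq = torch.quantize_per_tensor(x, scale=scale, zero_point=zero_point, dtype=torch.quint8)\n\nprint(xq.int_repr())    # the stored integers, i.e. round(x / scale) + zero_point (clamped to [0, 255])\nprint(xq.dequantize())  # back to floats: (integer - zero_point) * scale, close to x up to rounding error\n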

              Image credit"},{"location":"s9_scalable_applications/inference/#exercises_1","title":"\u2754 Exercises","text":"
              1. Let's look at how quantized tensors look in PyTorch

                1. Start by creating a tensor that contains random numbers

                2. Next call the torch.quantize_per_tensor function on the tensor. What does the quantized tensor look like? How do the values relate to the scale and zero_point arguments?

                3. Finally, try to call the .dequantize() method on the tensor. Do you get a tensor back that is close to what you initially started out with?

              2. As you hopefully saw in the first exercise we are going to perform a number of rounding errors when doing quantization and naively we would expect that this would accumulate and lead to a much worse model. However, in practice we observe that quantization still works, and we actually have a mathematically sound reason for this. Can you figure out why quantization still works with all the small rounding errors? HINT: it has to do with the central limit theorem

              3. Let's move on to quantization of our model. Follow this tutorial from PyTorch on how to do quantization. The goal is to construct a model model_fp32 that works on normal floats and a quantized version model_int8. For simplicity you can just use one of the models from the tutorial.

              4. Let's try to benchmark our quantized model and see if all the trouble that we went through actually paid off (a small timing sketch is shown below). Also try to perform the benchmark on the non-quantized model and see if you get a difference. If you do not get an improvement, explain why that may be.
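
              For the benchmarking part, a simple timing sketch could look like the one below; it assumes that you already have model_fp32 and model_int8 from the tutorial and that you adjust the dummy input to match your model:

              import time\nimport torch\n\ndef benchmark(model, inp, n_reps: int = 100) -> float:\n    \"\"\"Return the average time per forward pass in seconds.\"\"\"\n    with torch.inference_mode():\n        tic = time.time()\n        for _ in range(n_reps):\n            _ = model(inp)\n    return (time.time() - tic) / n_reps\n\ninp = torch.randn(1, 3, 224, 224)  # adjust the shape to whatever your model expects\nprint(f\"float32 model: {benchmark(model_fp32, inp):.4f} s per forward pass\")\nprint(f\"int8 model:    {benchmark(model_int8, inp):.4f} s per forward pass\")\n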

              "},{"location":"s9_scalable_applications/inference/#pruning","title":"Pruning","text":"

              Pruning is another way of reducing the model size and maybe improving the performance of our network. As the figure below illustrates, in pruning we simply remove weights in our network that we do not consider important for the task at hand. By removing, we here mean that the weight gets set to 0. There are many ways to determine if a weight is important, but the general rule is that the importance of a weight is proportional to its magnitude. This makes intuitive sense, since weights in all linear operations (fully connected or convolutional) are always multiplied onto the incoming value, thus a small weight means a small outgoing activation. A small sketch of this idea is shown below.
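
              To make the idea concrete before we turn to torch.nn.utils.prune in the exercises, the sketch below manually zeroes the roughly 30% smallest-magnitude weights of a linear layer (our own illustration, not the exercise code):

              import torch\nfrom torch import nn\n\nlayer = nn.Linear(20, 10)\nweights = layer.weight.detach()\n\n# threshold chosen such that roughly 30% of the weights fall below it in absolute value\nthreshold = weights.abs().flatten().kthvalue(int(0.3 * weights.numel())).values\nmask = weights.abs() > threshold\n\nwith torch.no_grad():\n    layer.weight.mul_(mask)  # small-magnitude weights are set to 0, the rest are kept\n\nprint(f\"Fraction of zero weights: {(layer.weight == 0).float().mean():.2f}\")\n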

              Image credit"},{"location":"s9_scalable_applications/inference/#exercises_2","title":"\u2754 Exercises","text":"
              1. We provide a start script that implements the famous LeNet in this file. Open and run it just to make sure that you know the network.

              2. PyTorch already has some pruning methods implemented in its package. Import the prune module from torch.nn.utils in the script.

              3. Try to prune the weights of the first convolutional layer by calling

                prune.random_unstructured(module_1, name=\"weight\", amount=0.3)  # (1)!\n
                1. You can read about the prune method here.

                Try printing the named_parameters and named_buffers before and after the module is pruned. Can you explain the difference and what the connection is to the module_1.weight attribute?

              4. Try pruning the bias of the same module, this time using the l1_unstructured function from the pruning module. Again check the named_parameters and named_buffers attributes to make sure you understand the difference between L1 pruning and random pruning.

              5. Instead of pruning only a single module in the model let's try pruning the whole model. To do this we just need to iterate over all named_modules in the model like this:

                from torch import nn\n\nfor name, module in new_model.named_modules():\n    # only layers that actually have a weight parameter can be pruned\n    if isinstance(module, (nn.Conv2d, nn.Linear)):\n        prune.l1_unstructured(module, name='weight', amount=0.2)\n

                But what if we wanted to apply different pruning to different layers? Implement a pruning scheme where

                • The weights of convolutional layers are L1 pruned with amount=0.2
                • The weights of linear layers are unstructured pruned with amount=0.4

                Run print(dict(new_model.named_buffers()).keys()) after the pruning to confirm that all weights have been correctly pruned.

              6. The pruning we have looked at until now has only been local in nature, i.e. we have applied the pruning independently for each layer, not accounting globally for how much we should actually prune. As you may realize, this can quickly lead to a network that is pruned too much. Instead, the more common approach is to prune globally, where we remove the smallest X amount of connections:

                1. Start by creating a tuple over all the weights with the following format

                  parameters_to_prune = (\n    (model.conv1, 'weight'),\n    # fill in the rest of the modules yourself\n    (model.fc3, 'weight'),\n)\n

                  The tuple needs to have length 5. Challenge: Can you construct the tuple using for loops, such that the code works for arbitrary size networks?

                2. Next prune using the global_unstructured function to globally prune the tuple of parameters

                  prune.global_unstructured(\n    parameters_to_prune,\n    pruning_method=prune.L1Unstructured,\n    amount=0.2,\n)\n
                3. Check that the amount that has been pruned is actually equal to the 20% specified in the pruning. We provide the following function that for a given submodule (for example model.conv1) computes the amount of pruned weights

                  def check_prune_level(module: nn.Module):\n    sparsity_level = 100 * float(torch.sum(module.weight == 0) / module.weight.numel())\n    print(f\"Sparsity level of module {sparsity_level}\")\n
              7. With a pruned network we really want to see if all our effort actually ended up with a network that is faster and/or smaller in memory. Do the following to the globally pruned network from the previous exercises:

                1. First we need to make the pruning of our network permanent. Right now it is only semi-permanent as we are still keeping a copy of the original weights in memory. Make the change permanent by calling prune.remove on every pruned module in the model. Hint: iterate over the parameters_to_prune tuple.

                2. Next try to measure the time of a single inference (repeated 100 times) for both the pruned and non-pruned network

                  import time\ntic = time.time()\nfor _ in range(100):\n    _ = network(torch.randn(100, 1, 28, 28))\ntoc = time.time()\nprint(toc - tic)\n

                  Is the pruned network actually faster? If not can you explain why?

                3. Next let's measure the size of our network (called pruned_network) and a freshly initialized network (called network):

                  torch.save(pruned_network.state_dict(), 'pruned_network.pt')\ntorch.save(network.state_dict(), 'network.pt')\n

                  Look up the size of each file. Is the pruned network actually smaller? If not, can you explain why?

                4. Repeat the last exercise, but this time start by converting all pruned weights to sparse format first by calling the .to_sparse() method on each pruned weight (a small sketch follows this exercise list). Is the saved model smaller now?

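              For the last sub-exercise, a sketch of saving the pruned weights in sparse format could look like the following, assuming prune.remove has already been called so that the zeros live directly in the weight tensors:

              import torch\n\nsparse_state_dict = {\n    name: param.to_sparse() if param.dim() > 1 else param  # only weight matrices benefit from sparse storage\n    for name, param in pruned_network.state_dict().items()\n}\ntorch.save(sparse_state_dict, 'pruned_network_sparse.pt')\ntorch.save(pruned_network.state_dict(), 'pruned_network_dense.pt')\n# compare the file sizes on disk afterwards, e.g. with `ls -lh *.pt`\n
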
              This ends the exercises on pruning. As you probably realized in the last couple of exercises, pruning does not guarantee speedups out of the box. This is because linear operations in PyTorch do not take advantage of sparse structures out of the box. To actually get speedups we would need to dive deep into sparse tensor operations, which still do not guarantee a speedup, because the performance of these operations depends on the sparsity structure of the pruned weights. Investigating this is out of scope for these exercises, but we highly recommend checking it out if you are interested in sparse networks.

              "},{"location":"s9_scalable_applications/inference/#knowledge-distillation","title":"Knowledge distillation","text":"

              Knowledge distillation is somewhat similar to pruning, in the sense that it tries to find a smaller model that can perform equally well as a large model; however, it does so in a completely different way. Knowledge distillation is a model compression technique that builds on the work of Bucila et al., in which we try to distill/compress the knowledge of a large, complex model (also called the teacher model) into a simpler model (also called the student model).

              The best known example of this is the DistilBERT model. The DistilBERT model is a smaller version of the large natural-language processing model BERT, which achieves 97% of the performance of BERT while only containing 40% of the weights and being 60% faster. You can see in the figure below how it is much smaller in size compared to other models developed at the same time.

              Image credit

              Knowledge distillation works by assuming we have a big teacher model that is already performing well and that we want to compress. By running our training set through our large model we get a softmax distribution for each and every training sample. The goal of the student is to both match the original labels of the training data and match the softmax distribution of the teacher model. The intuition behind doing this is that the teacher model needs to be more complex to learn the complex inter-class relationships from just (one-hot) labels. The student, on the other hand, is fed the softmax distributions from the teacher directly, which explicitly encode these inter-class relationships, and thus does not need the same capacity to learn what the teacher learned. A sketch of one common way to formulate the distillation loss is shown below.
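
              One common way to implement this matching of softmax distributions is a KL-divergence loss between temperature-softened teacher and student outputs, as in the sketch below (exercise 7 later uses a simpler cross-entropy variant):

              import torch\nimport torch.nn.functional as F\n\ndef distillation_loss(student_logits, teacher_logits, targets, T: float = 2.0, alpha: float = 0.5):\n    \"\"\"Combine the usual label loss with a soft-target loss against the teacher.\"\"\"\n    hard_loss = F.cross_entropy(student_logits, targets)\n    soft_loss = F.kl_div(\n        F.log_softmax(student_logits / T, dim=-1),\n        F.softmax(teacher_logits / T, dim=-1),\n        reduction=\"batchmean\",\n    ) * T * T  # scaling keeps gradient magnitudes comparable across temperatures\n    return alpha * hard_loss + (1 - alpha) * soft_loss\n\n# usage with random tensors standing in for real model outputs\nstudent_logits, teacher_logits = torch.randn(8, 10), torch.randn(8, 10)\ntargets = torch.randint(0, 10, (8,))\nprint(distillation_loss(student_logits, teacher_logits, targets))\n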

              Image credit"},{"location":"s9_scalable_applications/inference/#exercises_3","title":"\u2754 Exercises","text":"

              Let's try implementing model distillation ourselves. We are going to see if we can achieve this on the cifar10 dataset. Do note that the exercises below can take quite a long time to finish because they involve training multiple networks and therefore involve some waiting.

              1. Start by installing the transformers and datasets packages from Huggingface

                pip install transformers\npip install datasets\n

                which we are going to use to download the cifar10 dataset and a teacher model.

              2. Next download the cifar10 dataset

                from datasets import load_dataset\ndataset = load_dataset(\"cifar10\")\n
              3. Next let's initialize our teacher model. For this we consider a large transformer based model:

                from transformers import AutoFeatureExtractor, AutoModelForImageClassification\nextractor = AutoFeatureExtractor.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\nmodel = AutoModelForImageClassification.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\n
              4. To get the logits (un-normalized softmax scores) from our teacher model for a single datapoint from the training dataset you would extract it like this:

                sample_img = dataset['train'][0]['img']\npreprocessed_img = extractor(sample_img, return_tensors='pt')\noutput = model(**preprocessed_img)\nprint(output.logits)\n# tensor([[ 3.3682, -0.3160, -0.2798, -0.5006, -0.5529, -0.5625, -0.6144, -0.4671, 0.2807, -0.3066]])\n

                Repeat this process for the whole training dataset and store the result somewhere.

              5. Implement a simple convolutional model. You can create a custom one yourself or use a small one from torchvision.

              6. Train the model on cifar10 to convergence, so you have a base result on how the model is performing.

              7. Redo the training, but this time add knowledge distillation to your training objective. It should look like this:

                for batch in dataset:\n    # ...\n    img, target, teacher_logits = batch\n    preds = model(img)\n    loss = torch.nn.functional.cross_entropy(preds, target)\n    # the teacher logits need to be turned into probabilities to act as soft targets\n    loss_teacher = torch.nn.functional.cross_entropy(preds, teacher_logits.softmax(dim=-1))\n    loss = loss + loss_teacher\n    loss.backward()\n    # ...\n
              8. Compare the final performance obtained with and without knowledge distillation. Did the performance improve or not?

              This ends the module on scaling inference in machine learning models.

              "},{"location":"samples/","title":"Collection of sample applications","text":""},{"location":"tools/","title":"Tools","text":"

              Just a collection of tools and scripts for running the course.

              "}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"

              Machine Learning Operations

              Repository for course 02476 at DTU.

              Checkout the homepage!

              "},{"location":"#i-course-information","title":"\u2139\ufe0f Course information","text":"
              • Course responsible
                • Postdoc Nicki Skafte Detlefsen, nsde@dtu.dk
                • Professor S\u00f8ren Hauberg, sohau@dtu.dk
              • 5 ECTS (European Credit Transfer System), corresponding to 140 hours of work
              • 3 week period in January
              • Master level course
              • Grade: Pass/not passed
              • Type of assessment: project report
              • Recommended prerequisites: DTU course 02456 (Deep Learning) or experience with the following topics:

                • General understanding of machine learning (datasets, probability, classifiers, overfitting etc.)
                • Basic knowledge of deep learning (backpropagation, convolutional neural networks, auto-encoders etc.)
                • Coding in PyTorch. On the first day, we provide some exercises in PyTorch to get everyone's skills up-to-date as fast as possible.
              "},{"location":"#course-setup","title":"\ud83d\udcbb Course setup","text":"

              Start by cloning or downloading this repository

              git clone https://github.com/SkafteNicki/dtu_mlops\n

              If you do not have git installed (yet) we will touch upon it in the course. The folder will contain all the exercise material for this course and lectures. Additionally, you should join our Slack channel which we use for communication. The link may be expired, write to me.

              "},{"location":"#course-organization","title":"\ud83d\udcc2 Course organization","text":"

              We highly recommend that when going through the material you use the homepage which is the corresponding GitHub Pages version of this repository that is more nicely rendered, and also includes some special HTML magic provided by Material for MkDocs.

              The course is divided into sessions, denoted by capital S, and modules, denoted by capital M. A session corresponds to a full day of work if you are following the course, meaning approximately 9 hours of work. Each session (S) corresponds to a topic within MLOps and consists of multiple modules (M) that each cover a specific topic.

              Importantly we differ between core modules and optional modules. Core modules will be marked by

              Core Module

              at the top of their corresponding page. Core modules are important to go through to be able to pass the course. You are highly recommended to still do the optional modules.

              Additionally, be aware of the following icons throughout the course material:

              • This icon can be expanded to show code belonging to a given exercise

                Example

                I will contain some code for an exercise.

              • This icon can be expanded to show a solution for a given exercise

                Solution

                I will present a solution to the exercise.

              • This icon (1) can be expanded to show a hint or a note for a given exercise

                1. I am a hint or note
              "},{"location":"#mlops-what-is-it","title":"\ud83c\udd92 MLOps: What is it?","text":"

              Machine Learning Operations (MLOps) is a rather new field that has seen its uprise as machine learning and particularly deep learning has become a widely available technology. The term itself is a compound of \"machine learning\" and \"operations\" and covers everything that has to do with the management of the production ML lifecycle.

              The lifecycle of production ML can largely be divided into three phases:

              1. Design: The initial phase starts with an investigation of the problem. Based on this analysis, several requirements can be prioritized for what we want our future model to do. Since machine learning requires data to be trained, we also investigate in this step what data we have and if we need to source it in some other way.

              2. Model development: Based on the design phase we can begin to conjure some machine learning algorithms to solve our problems. As always, the initial step often involves doing some data analysis to make sure that our model is learning the signal that we want it to learn. Secondly, is the machine learning engineering phase, where the particular model architecture is chosen. Finally, we also need to do validation and testing to make sure that our model is generalizing well.

              3. Operations: Based on the model development phase, we now have a model that we want to use. The operations phase is where we create an automatic pipeline that makes sure that whenever we make changes to our codebase they get automatically incorporated into our model, such that we do not slow down production. Equally important is the ongoing monitoring of already deployed models to make sure that they behave exactly as we specified them.

              It is important to note that the three steps are a cycle, meaning that when you have successfully deployed a machine learning model that is not the end of it. Your initial requirements may change, forcing you to revisit the design phase. Some new algorithms may show promising results, so you revisit the model development phase to implement this. Finally, you may try to cut the cost of running your model in production, making you revisit the operations phase, and trying to optimize some steps.

              The focus in this course is particularly on the Operations part of MLOps as this is what many data scientists are missing in their toolbox to take all the knowledge they have about data processing and model development into a production setting.

              "},{"location":"#learning-objectives","title":"\u2754 Learning objectives","text":"

              General course objective

              Introduce the student to a number of coding practices that will help them organize, scale, monitor and deploy machine learning models either in a research or production setting. To provide hands-on experience with a number of frameworks, both local and in the cloud, for building large scale machine learning models.

              This includes:

              • Organize code in an efficient way for easy maintainability and shareability
              • Understand the importance of reproducibility and how to create reproducible containerized applications and experiments
              • Capable of using version control to efficiently collaborate on code development
              • Knowledge of continuous integration (CI) and continuous machine learning (CML) for automating code development
              • Being able to debug, profile, visualize and monitor multiple experiments to assess model performance
              • Capable of using online cloud-based computing services to scale experiments
              • Demonstrate knowledge about different distributed training paradigms within machine learning and how to apply them
              • Deploy machine learning models, both locally and in the cloud
              • Conduct a research project in collaboration with fellow students using the frameworks taught in the course
              • Have lots of fun and share memes! :)
              "},{"location":"#references","title":"\ud83d\udcd3 References","text":"

              Additional reading resources (in no particular order):

              • Ref 1 Introduction blog post for those who have never heard about MLOps and want to get an overview.

              • Ref 2 Great document from Google about the different levels of MLOps.

              • Ref 3 Another introduction to the principles of MLOps and the different stages of MLOps.

              • Ref 4 Great paper about the technical debt in machine learning.

              • Ref 5 Interview study that uncovers many of the pain points that ML engineers go through when doing MLOps.

              Other courses with content similar to this:

              • Made with ML. Great online MLOps course that also covers additional topics on the foundations of working with ML.

              • Full stack deep learning. Another MLOps online course going through the whole developer pipeline.

              • MLOps Zoomcamp. MLOps online course that includes many of the same topics.

              "},{"location":"#contributing","title":"\ud83d\udc68\u200d\ud83c\udfeb Contributing","text":"

              If you want to contribute to the course, we are happy to have you! Anything from fixing typos to adding new content is welcome. For building the course material locally, it is a simple two-step process:

              pip install -r requirements.txt\nmkdocs serve\n

              This will start a local server that you can access at http://127.0.0.1:8000 and it will automatically update when you make changes to the course material. When you have something that you want to contribute, please make a pull request.

              "},{"location":"#license","title":"\u2755 License","text":"

              I highly value open source, and the content of this course is therefore free to use under the Apache 2.0 license. If you use parts of this course in your work, please cite using:

              @misc{skafte_mlops,\n    author       = {Nicki Skafte Detlefsen},\n    title        = {Machine Learning Operations},\n    howpublished = {\\url{https://github.com/SkafteNicki/dtu_mlops}},\n    year         = {2024}\n}\n
              "},{"location":"pages/faq/","title":"Frequently asked questions","text":"

              For further questions, please contact Nicki.

              "},{"location":"pages/faq/#when-is-the-next-time-the-course-is-running","title":"When is the next time the course is running \u2754","text":"

              The course always runs in January, during the 3-week period at DTU. The exact dates can be found in the academic calendar.

              "},{"location":"pages/faq/#is-it-possible-to-attend-the-course-fully-online","title":"Is it possible to attend the course fully online \u2754","text":"

              Mostly yes. All exercises are provided online and lectures will be recorded and streamed. However, do note that

              • For project days (see which days in the time plan) you will need to agree with your project group that you are working from home.
              • We have limited TA resources and will be prioritizing students coming to campus for help. If you are attending online, feel free to ask questions on our Slack channel and we will help to the best of our ability.

              Overall we try to support flexible learning as much as possible with some limitations.

              "},{"location":"pages/faq/#what-are-the-prerequisites-for-taking-this-course","title":"What are the prerequisites for taking this course \u2754","text":"

              We recommend that you have a basic understanding of machine learning concepts such as what a dataset is, what probabilities are, what a classifier is, what overfitting means etc. This corresponds to the curriculum covered in course 02450. The actual focus of the course is not on machine learning models, but we will be using these basic concepts throughout the exercises.

              Additionally, we recommend basic knowledge about deep learning and how to code in PyTorch, corresponding to the curriculum covered in 02456. From prior experience, we know that not all students have gained knowledge about deep learning models before this course, and we will be covering the basics of how to code in PyTorch in one of the first modules of the course to get everyone up to speed.

              "},{"location":"pages/faq/#i-will-be-missing-x-days-of-the-course-will-that-be-a-problem","title":"I will be missing X days of the course, will that be a problem \u2754","text":"

              Depends. The course is fairly intensive, with most students working from 9-17 every day. If you already know that you will be missing X days of the course, then I highly recommend that you go through some of the first sessions before the course starts to give yourself a bit of breathing room. If you are not able to do so, please be aware that an additional effort may be needed from you to keep up with your fellow students.

              "},{"location":"pages/faq/#how-many-should-we-be-in-a-group-for-the-projects","title":"How many should we be in a group for the projects \u2754","text":"

              Between 3 and 5. The projects are designed to be done in groups meaning that we intentionally make them too big for one person to do alone. Luckily, a lot of the work that you need to do can be done in parallel, so it is not as bad as it sounds.

              "},{"location":"pages/faq/#when-will-the-exam-take-place","title":"When will the exam take place \u2754","text":"

              From 2025 and onwards, the exam only consists of a project report. The report should be handed in at midnight on the final day of the course. For January 2025, this means the 24th.

              "},{"location":"pages/faq/#where-can-i-find-information-regarding-the-exam","title":"Where can I find information regarding the exam \u2754","text":"

              Look at the bottom of this page. Details will be updated as we get closer to the exam date.

              "},{"location":"pages/faq/#can-i-use-chatgpt-or-similar-tools-for-the-exercises-project-exam-report-coding-writing","title":"Can I use ChatGPT or similar tools for the exercises, project, exam report (coding + writing) \u2754","text":"

              Yes, yes, and yes, but remember that it's a tool and you need to validate the output before using it. For the exam report we would prefer that you formulate the answers in your own words, because it is intended for you to describe what you have been doing in your project. The I in LLM stands for intelligence.

              "},{"location":"pages/faq/#i-am-a-phd-student-not-enrolled-at-dtu-can-i-take-the-course","title":"I am a PhD student not enrolled at DTU, can I take the course \u2754","text":"

               Yes, PhD students from other universities can attend the course. You can check out this page or in general contact phdcourses@dtu.dk for more information. Do note that the registration deadline is usually at the beginning of December.

              "},{"location":"pages/faq/#i-am-a-foreign-student-and-my-home-university-doesnt-accept-passnot-pass-what-can-i-do","title":"I am a foreign student and my home university doesn't accept pass/not pass, what can I do \u2754","text":"

               We can give a grade on the Danish 7-point grading scale for foreign students who need it, where their home university does not accept pass/not pass. You need to contact the course responsible Nicki within the first week of the course to request this. Secondly, we may need to further validate your work, so please be prepared to do a short oral exam on one of the last days of the course.

              "},{"location":"pages/faq/#i-am-a-euroteq-student-any-special-rules-for-me","title":"I am a EuroTEQ student, any special rules for me \u2754","text":"

              Not really, you will attend the course as any other student. However, we will provide a special Slack channel for you, trying to make sure that you can get the same help as students from DTU who can attend the course on campus.

              "},{"location":"pages/overview/","title":"Summary of course content","text":"

               There are a lot of moving parts in this course, so it may be hard to understand how it all fits together. This page provides a summary of the frameworks used in this course, i.e. the stack of tools. In the figure below we provide an overview of how the different tools of the course interact with each other. The list after the figure provides a short description of each part.

               The MLOps stack in the course. This is just an example of one stack, and depending on your use case you may want to use a different stack of tools that better fits your needs. Regardless of the stack, the principles of MLOps are the same.

               • PyTorch: the backbone of our code; it provides the computational engine and the data structures that we need to define our models.
               • PyTorch Lightning: a framework that provides a high-level interface to PyTorch. It provides a lot of the functionality that we need to train our models, such as logging, checkpointing, early stopping, etc., so that we do not have to implement it ourselves. It also allows us to scale our models to multiple GPUs and multiple nodes.
               • Conda: controls the dependencies and Python interpreter, enabling us to construct reproducible virtual environments.
               • Hydra: used for configuring our experiments, allowing us to define a hierarchical configuration structure in config files.
               • Weights and Biases: allows us to track and log any values and hyperparameters for our experiments.
               • Profiler: whenever we run into performance bottlenecks with our code, we can use the profiler to find the cause of the bottleneck.
               • Debugger: when we run into bugs in our code, we can use the debugger to find the cause of the bug.
               • Cookiecutter: used for organizing our code and creating project templates.
               • Docker: a tool that allows us to create a container that contains all the dependencies and code that we need to run our applications.
               • DVC: used for controlling the versions of our data and synchronizing between local and remote data storage.
               • Git (in combination with GitHub): used for version control of our code, allowing multiple developers to work together on a shared codebase.
               • Pytest: used to write unit tests for our code, to make sure that new changes do not break the code base.
               • Pylint and Flake8: used for linting our code and keeping a consistent coding style; they check our code for common mistakes and style issues.
               • GitHub Actions: runs our unit tests and other checks on our code in a continuous manner, e.g. after we commit and push our code.
               • Cloud Build: automates the process of building our docker images and pushing them to our artifact registry.
               • Artifact Registry: a service that allows us to store our docker images for later use by other services.
               • Cloud Storage: a scalable and secure storage solution for our data and trained models.
               • Compute Engine: a scalable and secure compute solution for general compute tasks.
               • Vertex AI: used for running our training experiments in an easy and scalable manner.
               • FastAPI: provides a high-level interface for creating a REST API for our model.
               • Cloud Functions: used for simple deployments of our code, running it in response to events through simple Python functions.
               • Cloud Run: used for more complex deployments of our code, running it in response to events through docker containers.
               • Cloud Monitoring: gives us the tools to keep track of important logs and errors from the other cloud services.
               • Evidently AI: provides a framework and dashboard for monitoring whether our deployed model is experiencing any drift.
               • OpenTelemetry: provides a standard for collecting and exporting telemetry data from our deployed model.

               "},{"location":"pages/projects/","title":"Project work","text":"

              Slides

               Approximately 1/3 of the course time is dedicated to project work. The projects will serve as the basis of your exam. In the project, you will essentially re-apply everything that you learn throughout the course to a self-chosen problem. The overall goals of the project are:

              • Being able to work in a group on a larger project
              • To formulate a project within the provided guidelines
               • Apply the material taught in the course to the problem
              • Present your findings

               In the projects you are free to work on whatever problem you want. That said, we have one specific requirement: you need to incorporate a third-party framework into your project. If you want inspiration for projects, here are some examples:

              1. Classification of tweets

              2. Translating from English to German

              3. Classification of scientific papers

              4. Classification of rice types from images

               We hope most students will be able to form groups by themselves. The expected group size is between 3 and 5. If you are not able to form a group, please post in the #looking-for-group channel on Slack or be present on the 4th day of the course (the day before the project work starts), where we will help students who have not found a group yet.

              "},{"location":"pages/projects/#open-source-tools","title":"Open-source tools","text":"

               We strive to keep the tools taught in this course as open-source as possible. The great thing about the open-source community is that whatever problem you are working on, there is probably some package out there that can get you at least 10% of the way. For the project, we want to enforce this point and you are required to include some third-party package, that is neither PyTorch nor one of the tools already covered in the course, into your project.

               If you have no idea what framework to include, the PyTorch ecosystem is a great place for finding open-source frameworks that can help you accelerate your own projects, with PyTorch as the backbone. All tools in the ecosystem should work well together with PyTorch. However, it is important to note that the ecosystem is not a complete list of all the awesome packages that exist to extend the functionality of PyTorch. If you are still missing inspiration for frameworks to use, we highly recommend these three that have been used in previous years of the course:

               • PyTorch Image Models. PyTorch Image Models (also known as TIMM) is by far the most used computer vision package (maybe except for torchvision). It contains models, scripts and pre-trained weights for a lot of state-of-the-art image models within computer vision.

               • Transformers. The Transformers repository from the Huggingface group focuses on state-of-the-art Natural Language Processing (NLP). It provides many pre-trained models to perform tasks on text such as classification, information extraction, question answering, summarization, translation and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

               • PyTorch-Geometric. PyTorch Geometric (PyG) is a library for geometric deep learning. It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers.

              "},{"location":"pages/projects/#project-days","title":"Project days","text":"

               Each project day is fully dedicated to project work, except for possible external inspirational lectures in the morning. The group decides exactly where they want to work on the project, how they want to work on the project, how to distribute the workload etc. We strongly encourage you to parallelize the work during the project, because there are a lot of tasks to do, but it is important that all group members have at least some understanding of the whole project.

              Remember that the focus of the project work is not to demonstrate that you can work with the biggest and baddest deep learning model, but instead that you show that you can incorporate the tools that are taught throughout the course in a meaningful way.

               Also note that the project is not expected to be very large in scope. It may simply be that you want to train model X on dataset Y. You will be given approximately 4 full days to work on the project. It is better to start out with a smaller project and then add complexity along the way if you have time.

              "},{"location":"pages/projects/#day-1","title":"Day 1","text":"

               The first project day is all about getting started on the projects and formulating exactly what you want to work on as a group.

               1. Start by brainstorming projects! Try to figure out exactly what you want to work with and begin to investigate what third-party package could support the project.

               2. When you have come up with an idea, write a project description. The description is the deliverable for today and should be at least 300 words. Try to answer the following questions in the description:

                • Overall goal of the project
                 • What framework are you going to use, and how do you intend to include the framework in your project?
                • What data are you going to run on (initially, may change)
                • What models do you expect to use
               3. (Optional) If you want to think more about the product design of your project, feel free to fill out the ML canvas (or part of it). You can read more about the different fields of the canvas here.

               4. After having done the project description, you can start on the actual coding of the project. In the next section, a to-do list is attached that summarizes what we are doing in the course. You are NOT expected to fulfill all bullet points from week 1 today.

               The project description will serve as a guideline for us at the exam, helping us check that you have reached the goals you set out to achieve. By the end of the day, you should commit your project description to the README.md file belonging to your project repository. If you filled out the ML canvas, feel free to include that as part of the README.md file. Also remember to commit whatever you have done on the project until now. When you have done this, go to DTU Learn and hand in (as a group) the link to your GitHub repository as an assignment.

              We will briefly (before next Monday) look over your GitHub repository and project description to check that everything is fine. If we have any questions/concerns we will contact you.

              "},{"location":"pages/projects/#day-2","title":"Day 2","text":"

              The goal for today is simply to continue working on your project. Start with bullet points in the checklist from week 1 and continue with bullet points for week 2.

              "},{"location":"pages/projects/#day-3","title":"Day 3","text":"

               Continue working on your project; today you should hopefully focus on the bullet points in the checklist from week 2. There is no deliverable today, but make sure that you have committed all your progress at the end of the day. We will again briefly look over the repositories and will reach out to your group if we are worried about the progression of your project.

              "},{"location":"pages/projects/#day-4","title":"Day 4","text":"

               We have now entered the final week of the course and the second to last project day. You are most likely continuing with bullet points from week 2, but should hopefully begin to look at the bullet points from week 3 today. These are in general much more complex, so we recommend waiting to look at them until you have completed most of the points from week 2. We also recommend that you begin to fill out the report template.

              "},{"location":"pages/projects/#day-5","title":"Day 5","text":"

               Today you are finishing your project. We recommend that you start by creating an architectural overview of your project similar to this figure. I recommend using draw.io for creating this kind of diagram, but feel free to use any tool you like. Otherwise you should just continue working on your project, checking off as many bullet points as possible. Finally, you should also prepare yourself for the exam tomorrow.

              "},{"location":"pages/projects/#project-hints","title":"Project hints","text":"

               Below are some hints to prevent you from getting stuck during the project work on problems that previous groups have encountered.

              Data

              • Start out small! We recommend that you start out with less than 1GB of data. If the dataset you want to work with is larger, then subsample it. You can use dvc to version control your data and only download the full dataset when you are ready to train the model.

               • Be aware of many smaller files. DVC does not handle many small files well, and they can take a long time to download. If you have many small files, consider zipping them together and then unzipping them at runtime.

               • You do not need to use DVC for everything regarding data. One workflow is to use DVC only for version controlling the data, and when you need the data you simply download it from the source. For example, if you are storing your data in a GCP bucket, you can use the gsutil command to download the data or access it directly using the cloud storage file system. A small sketch of both approaches is shown below.
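
               A minimal sketch of both approaches. The bucket name and data paths are placeholders, and the DVC commands assume that DVC is already initialized in the repository with a remote configured:

               # version control a (subsampled) dataset with DVC\ndvc add data/raw\ngit add data/raw.dvc data/.gitignore\ngit commit -m \"add raw data\"\ndvc push\n\n# or skip DVC for the download itself and pull directly from a GCP bucket\ngsutil -m cp -r gs://<your-bucket>/data ./data\n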

              Modelling

              • Again, start out small! Start with a simple model and then add complexity as you go along. It is better to have a simple model that works than a complex model that does not work.

               • Try fine-tuning a pre-trained model. This is often much faster than training a model from scratch. A minimal sketch of how this could look is shown below.
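
               As an illustration, a small sketch of fine-tuning a pre-trained model with TIMM. The model name, the number of classes and the choice to freeze the backbone are just example choices, not requirements:

               import timm\nimport torch\n\n# load a pre-trained backbone and replace the classifier with a fresh 10-class head\nmodel = timm.create_model(\"resnet18\", pretrained=True, num_classes=10)\n\n# optionally freeze everything except the new classifier (named 'fc' for timm resnets)\nfor name, param in model.named_parameters():\n    if \"fc\" not in name:\n        param.requires_grad = False\n\noptimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)\n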

              Deployment

               • When getting around to deployment, always start out by running your application locally first, then run it locally inside a docker container, and only then try to deploy it in the cloud. This way you catch errors early and do not waste time debugging cloud deployment issues. A sketch of this progression is shown below.
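
               A sketch of the progression, assuming a FastAPI application in main.py with a /health endpoint; the application name, ports and GCP identifiers are placeholders:

               # 1. run the application locally\nuvicorn main:app --port 8000\ncurl http://localhost:8000/health\n\n# 2. run the same application inside a docker container\ndocker build -t myapp:latest .\ndocker run -p 8000:8000 myapp:latest\ncurl http://localhost:8000/health\n\n# 3. only then deploy the container to the cloud, e.g. with Cloud Run\ngcloud run deploy myapp --image=<region>-docker.pkg.dev/<project>/<repo>/myapp:latest --region=<region>\n
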
              "},{"location":"pages/projects/#project-checklist","title":"Project checklist","text":"

               Please note that all the lists are exhaustive, meaning that I do not expect you to have completed every point on the checklist for the exam.

              "},{"location":"pages/projects/#week-1","title":"Week 1","text":"
              • Create a git repository
              • Make sure that all team members have write access to the GitHub repository
               • Create a dedicated environment for your project to keep track of your packages
              • Create the initial file structure using cookiecutter
               • Fill out the make_dataset.py file such that it downloads whatever data you need and preprocesses it (if necessary)
              • Add a model file and a training script and get that running
              • Remember to fill out the requirements.txt file with whatever dependencies that you are using
              • Remember to comply with good coding practices (pep8) while doing the project
              • Do a bit of code typing and remember to document essential parts of your code
              • Setup version control for your data or part of your data
              • Construct one or multiple docker files for your code
              • Build the docker files locally and make sure they work as intended
               • Write one or multiple configuration files for your experiments
               • Use Hydra to load the configurations and manage your hyperparameters (a minimal sketch is shown after this list)
               • When you have something that works somewhat, remember at some point to do some profiling and see if you can optimize your code
              • Use Weights & Biases to log training progress and other important metrics/artifacts in your code. Additionally, consider running a hyperparameter optimization sweep.
              • Use PyTorch-lightning (if applicable) to reduce the amount of boilerplate in your code
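
               As an example of the two configuration bullet points above, a minimal Hydra setup could look like the following. It assumes Hydra 1.2+ and a hypothetical config.yaml next to the script containing lr and batch_size keys:

               import hydra\nfrom omegaconf import DictConfig\n\n\n@hydra.main(config_path=\".\", config_name=\"config\", version_base=None)\ndef train(cfg: DictConfig) -> None:\n    # hyperparameters are read from config.yaml and can be overridden on the command line,\n    # e.g. python train.py lr=1e-4 batch_size=128\n    print(f\"Training with lr={cfg.lr} and batch_size={cfg.batch_size}\")\n\n\nif __name__ == \"__main__\":\n    train()\n
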
              "},{"location":"pages/projects/#week-2","title":"Week 2","text":"
              • Write unit tests related to the data part of your code
              • Write unit tests related to model construction and or model training
              • Calculate the coverage.
              • Get some continuous integration running on the GitHub repository
               • Create a data storage in a GCP Bucket for your data and preferably link this with your data version control setup
              • Create a trigger workflow for automatically building your docker images
              • Get your model training in GCP using either the Engine or Vertex AI
               • Create a FastAPI application that can do inference using your model (a minimal sketch is shown after this list)
              • If applicable, consider deploying the model locally using torchserve
              • Deploy your model in GCP using either Functions or Run as the backend
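
               For the FastAPI bullet point above, a minimal sketch of an inference application; the endpoint names and the dummy prediction are placeholders for your own model code:

               from fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/health\")\ndef health() -> dict:\n    return {\"status\": \"ok\"}\n\n\n@app.post(\"/predict\")\ndef predict(features: list[float]) -> dict:\n    # placeholder: replace with a forward pass through your trained model\n    prediction = sum(features) / len(features)\n    return {\"prediction\": prediction}\n

               Locally this could be served with uvicorn main:app --reload and tested with a simple curl or requests call.
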
              "},{"location":"pages/projects/#week-3","title":"Week 3","text":"
              • Check how robust your model is towards data drifting
              • Setup monitoring for the system telemetry of your deployed model
              • Setup monitoring for the performance of your deployed model
              • If applicable, play around with distributed data loading
              • If applicable, play around with distributed model training
               • Play around with quantization, compilation and pruning for your trained models to increase inference speed (see the sketch below)
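
               A small sketch of the last bullet point, showing dynamic quantization and compilation in PyTorch. The model here is just a stand-in for your own, and torch.compile assumes PyTorch 2.0+:

               import torch\n\nmodel = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))\nmodel.eval()\n\n# dynamic int8 quantization of the linear layers for faster CPU inference\nquantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)\n\n# just-in-time compilation of the model graph (PyTorch 2.0+)\ncompiled_model = torch.compile(model)\n
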
              "},{"location":"pages/projects/#additional","title":"Additional","text":"
              • Revisit your initial project description. Did the project turn out as you wanted?
               • Make sure all group members have an understanding of all parts of the project
               • Upload all your code to GitHub
              "},{"location":"pages/projects/#exam","title":"Exam","text":"

               From January 2025 the exam only consists of a project report. The report should be handed in at midnight on the final day of the course. For January 2025, this means the 24th. We provide a template folder called reports. As the first task you should copy the folder and all its content to your project repository. Then, your job is to fill out the README.md file which contains the report template. The file itself contains instructions on how to fill it out and instructions on using the included report.py file for validating your work. You hand in the template by simply including it in your project repository. By midnight on the final day of the course, we will automatically scrape the report and use it as the basis for grading you. Therefore, changes after this point are not registered.

              "},{"location":"pages/timeplan/","title":"Timeplan","text":"

              Slides

               The course is organised into exercise days (2/3 of the course) and project days (1/3 of the course).

               Exercise days start at 9:00 in the morning with a lecture (usually 30-45 min) that gives some context about at least one of the topics of that day. Additionally, the previous day's exercises may briefly be touched upon. The remainder of the day will be spent on solving exercises, either individually or in small groups. For some people the exercises may be fast to do and for others they will take the whole day. We will provide help throughout the day. We will try to answer questions on Slack, but help will be prioritized for students physically on campus.

               Project days are intended for project work and you are therefore responsible for making an agreement with your group on when and where you are going to work. On the first project day there will be a lecture at 9:00 with project information. On other project days we may also start the day with an external lecture, which we highly recommend that you participate in. During each project day we will have office hours where you can ask questions about the project.

              Below is an overall timeplan for each day, including the presentation topic of the day and the frameworks that you will be using in the exercises.

               Recordings (link to drive folder with mp4 files):

              • \ud83c\udfa52023 Lectures
              • \ud83c\udfa52024 Lectures
              "},{"location":"pages/timeplan/#week-1","title":"Week 1","text":"

              In the first week you will be introduced to a number of development practices for organizing and developing code, especially with a focus on making everything reproducible.

               • 6/1/25 Monday: Deep learning software\ud83d\udcdd (Frameworks: Terminal, Conda, IDE, PyTorch; Format: Exercises)
               • 7/1/25 Tuesday: MLOps: what is it?\ud83d\udcdd (Frameworks: Git, CookieCutter, Pep8, DVC; Format: Exercises)
               • 8/1/25 Wednesday: Reproducibility\ud83d\udcdd (Frameworks: Docker, Hydra; Format: Exercises)
               • 9/1/25 Thursday: Debugging\ud83d\udcdd (Frameworks: Debugger, Profiler, Wandb, Lightning; Format: Exercises)
               • 10/1/25 Friday: Project work\ud83d\udcdd (Format: Projects)
               "},{"location":"pages/timeplan/#week-2","title":"Week 2","text":"

               The second week is about automation and the cloud. Automation will help us make sure that our code does not break when we make changes to it. The cloud will help us scale up our applications, and we learn how to use different services to develop a full machine learning pipeline.

               • 13/1/25 Monday: Continuous Integration\ud83d\udcdd (Frameworks: Pytest, Github actions, Pre-commit, CML; Format: Exercises)
               • 14/1/25 Tuesday: The Cloud\ud83d\udcdd (Frameworks: GCP Engine, Bucket, Artifact registry, Vertex AI; Format: Exercises)
               • 15/1/25 Wednesday: Deployment\ud83d\udcdd (Frameworks: FastAPI, Torchserve, GCP Functions, GCP Run; Format: Exercises)
               • 16/1/25 Thursday: No lecture (Format: Projects)
               • 17/1/25 Friday: Company presentation (TBA) (Format: Projects)
               "},{"location":"pages/timeplan/#week-3","title":"Week 3","text":"

               In the final week we look into advanced topics such as monitoring and scaling of applications. Monitoring is especially important for the longevity of the applications that we develop: we need to be able to deploy them either locally or in the cloud and have the tools to monitor how they behave over time. Scaling of applications is an important topic if we ever want our applications to be used by many people at the same time.

               • 20/1/25 Monday: Monitoring\ud83d\udcdd (Frameworks: Evidently AI, Prometheus, GCP Monitoring; Format: Exercises)
               • 21/1/25 Tuesday: Scalable applications\ud83d\udcdd (Frameworks: PyTorch, Lightning; Format: Exercises)
               • 22/1/25 Wednesday: Company presentation (TBA) (Format: Projects)
               • 23/1/25 Thursday: No lecture (Format: Projects)
               • 24/1/25 Friday: No lecture (Format: Projects)
               "},{"location":"reports/","title":"Exam template for 02476 Machine Learning Operations","text":"

               This is the report template for the exam. Please only remove the text formatted with three dashes in front and behind, like:

              --- question 1 fill here ---

              where you instead should add your answers. Any other changes may have unwanted consequences when your report is auto-generated at the end of the course. For questions where you are asked to include images, start by adding the image to the figures subfolder (please only use .png, .jpg or .jpeg) and then add the following code in your answer:

              ![my_image](figures/<image>.<extension>)\n

              In addition to this markdown file, we also provide the report.py script that provides two utility functions:

              Running:

              python report.py html\n

               will generate a .html page of your report. After the deadline for answering this template, we will auto-scrape everything in this reports folder and then use this utility to generate an .html page that will serve as your final hand-in.

              Running

              python report.py check\n

              will check your answers in this template against the constraints listed for each question e.g. is your answer too short, too long, or have you included an image when asked to.

              For both functions to work you mustn't rename anything. The script has two dependencies that can be installed with

              pip install click markdown\n
              "},{"location":"reports/#overall-project-checklist","title":"Overall project checklist","text":"

               The checklist is exhaustive, which means that it includes everything that you could do on the project within the curriculum of this course. Therefore, we do not at all expect that you have checked all boxes at the end of the project.

              "},{"location":"reports/#week-1","title":"Week 1","text":"
              • Create a git repository
              • Make sure that all team members have write access to the GitHub repository
               • Create a dedicated environment for your project to keep track of your packages
              • Create the initial file structure using cookiecutter
               • Fill out the make_dataset.py file such that it downloads whatever data you need and preprocesses it (if necessary)
              • Add a model file and a training script and get that running
              • Remember to fill out the requirements.txt file with whatever dependencies that you are using
              • Remember to comply with good coding practices (pep8) while doing the project
              • Do a bit of code typing and remember to document essential parts of your code
              • Setup version control for your data or part of your data
              • Construct one or multiple docker files for your code
              • Build the docker files locally and make sure they work as intended
               • Write one or multiple configuration files for your experiments
               • Use Hydra to load the configurations and manage your hyperparameters
               • When you have something that works somewhat, remember at some point to do some profiling and see if you can optimize your code
              • Use Weights & Biases to log training progress and other important metrics/artifacts in your code. Additionally, consider running a hyperparameter optimization sweep.
              • Use PyTorch-lightning (if applicable) to reduce the amount of boilerplate in your code
              "},{"location":"reports/#week-2","title":"Week 2","text":"
              • Write unit tests related to the data part of your code
              • Write unit tests related to model construction and or model training
              • Calculate the coverage.
              • Get some continuous integration running on the GitHub repository
               • Create a data storage in a GCP Bucket for your data and preferably link this with your data version control setup
              • Create a trigger workflow for automatically building your docker images
              • Get your model training in GCP using either the Engine or Vertex AI
              • Create a FastAPI application that can do inference using your model
              • If applicable, consider deploying the model locally using torchserve
              • Deploy your model in GCP using either Functions or Run as the backend
              "},{"location":"reports/#week-3","title":"Week 3","text":"
              • Check how robust your model is towards data drifting
              • Setup monitoring for the system telemetry of your deployed model
              • Setup monitoring for the performance of your deployed model
              • If applicable, play around with distributed data loading
              • If applicable, play around with distributed model training
               • Play around with quantization, compilation and pruning for your trained models to increase inference speed
              "},{"location":"reports/#additional","title":"Additional","text":"
              • Revisit your initial project description. Did the project turn out as you wanted?
               • Make sure all group members have an understanding of all parts of the project
               • Upload all your code to GitHub
              "},{"location":"reports/#group-information","title":"Group information","text":""},{"location":"reports/#question-1","title":"Question 1","text":"

              Enter the group number you signed up on

              Answer:

              --- question 1 fill here ---

              "},{"location":"reports/#question-2","title":"Question 2","text":"

              Enter the study number for each member in the group

              Example:

              sXXXXXX, sXXXXXX, sXXXXXX

              Answer:

              --- question 2 fill here ---

              "},{"location":"reports/#question-3","title":"Question 3","text":"

              What framework did you choose to work with and did it help you complete the project?

              Recommended answer length: 100-200 words.

              Example: We used the third-party framework ... in our project. We used functionality ... and functionality ... from the package to do ... and ... in our project.

              Answer:

              --- question 3 fill here ---

              "},{"location":"reports/#coding-environment","title":"Coding environment","text":"

               In the following section we are interested in learning more about your local development environment.

              "},{"location":"reports/#question-4","title":"Question 4","text":"

               Explain how you managed dependencies in your project. Explain the process a new team member would have to go through to get an exact copy of your environment.

              Recommended answer length: 100-200 words

              Example: We used ... for managing our dependencies. The list of dependencies was auto-generated using ... . To get a complete copy of our development environment, one would have to run the following commands

              Answer:

              --- question 4 fill here ---

              "},{"location":"reports/#question-5","title":"Question 5","text":"

              We expect that you initialized your project using the cookiecutter template. Explain the overall structure of your code. Did you fill out every folder or only a subset?

              Recommended answer length: 100-200 words

               Example: From the cookiecutter template we have filled out the ... , ... and ... folder. We have removed the ... folder because we did not use any ... in our project. We have added an ... folder that contains ... for running our experiments.

               Answer:

              --- question 5 fill here ---

              "},{"location":"reports/#question-6","title":"Question 6","text":"

               Did you implement any rules for code quality and format? Additionally, explain in your own words why these concepts matter in larger projects.

              Recommended answer length: 50-100 words.

              Answer:

              --- question 6 fill here ---

              "},{"location":"reports/#version-control","title":"Version control","text":"

               In the following section we are interested in how version control was used in your project during development to collaborate and increase the quality of your code.

              "},{"location":"reports/#question-7","title":"Question 7","text":"

              How many tests did you implement and what are they testing in your code?

              Recommended answer length: 50-100 words.

               Example: In total we have implemented X tests. Primarily we are testing ... and ... as these are the most critical parts of our application, but also ... .

              Answer:

              --- question 7 fill here ---

              "},{"location":"reports/#question-8","title":"Question 8","text":"

               What is the total code coverage (in percentage) of your code? If your code had a code coverage of 100% (or close to), would you still trust it to be error free? Explain your reasoning.

              Recommended answer length: 100-200 words.

               Example: The total code coverage of our code is X%, which includes all our source code. We are far from 100% coverage of our code and even if we were then...

              Answer:

              --- question 8 fill here ---

              "},{"location":"reports/#question-9","title":"Question 9","text":"

               Did your workflow include using branches and pull requests? If yes, explain how. If not, explain how branches and pull requests can help improve version control.

              Recommended answer length: 100-200 words.

               Example: We made use of both branches and PRs in our project. In our group, each member had a branch that they worked on in addition to the main branch. To merge code we ...

              Answer:

              --- question 9 fill here ---

              "},{"location":"reports/#question-10","title":"Question 10","text":"

              Did you use DVC for managing data in your project? If yes, then how did it improve your project to have version control of your data. If no, explain a case where it would be beneficial to have version control of your data.

              Recommended answer length: 100-200 words.

              Example: We did make use of DVC in the following way: ... . In the end it helped us in ... for controlling ... part of our pipeline

              Answer:

              --- question 10 fill here ---

              "},{"location":"reports/#question-11","title":"Question 11","text":"

               Discuss your continuous integration setup. What kind of continuous integration are you running (unittesting, linting, etc.)? Do you test multiple operating systems, Python versions etc.? Do you make use of caching? Feel free to insert a link to one of your GitHub actions workflows.

              Recommended answer length: 200-300 words.

               Example: We have organized our continuous integration into 3 separate files: one for doing ..., one for running ... testing and one for running ... . In particular for our ..., we used ... . An example of a triggered workflow can be seen here:

              Answer:

              --- question 11 fill here ---

              "},{"location":"reports/#running-code-and-tracking-experiments","title":"Running code and tracking experiments","text":"

              In the following section we are interested in learning more about the experimental setup for running your code and especially the reproducibility of your experiments.

              "},{"location":"reports/#question-12","title":"Question 12","text":"

               How did you configure experiments? Did you make use of config files? Explain with coding examples how you would run an experiment.

              Recommended answer length: 50-100 words.

              Example: We used a simple argparser, that worked in the following way: Python my_script.py --lr 1e-3 --batch_size 25

              Answer:

              --- question 12 fill here ---

              "},{"location":"reports/#question-13","title":"Question 13","text":"

               Reproducibility of experiments is important. Related to the last question, how did you make sure that no information is lost when running experiments and that your experiments are reproducible?

              Recommended answer length: 100-200 words.

              Example: We made use of config files. Whenever an experiment is run the following happens: ... . To reproduce an experiment one would have to do ...

              Answer:

              --- question 13 fill here ---

              "},{"location":"reports/#question-14","title":"Question 14","text":"

              Upload 1 to 3 screenshots that show the experiments that you have done in W&B (or another experiment tracking service of your choice). This may include loss graphs, logged images, hyperparameter sweeps etc. You can take inspiration from this figure. Explain what metrics you are tracking and why they are important.

              Recommended answer length: 200-300 words + 1 to 3 screenshots.

               Example: As seen in the first image we have tracked ... and ... which both inform us about ... in our experiments. As seen in the second image we are also tracking ... and ...

              Answer:

              --- question 14 fill here ---

              "},{"location":"reports/#question-15","title":"Question 15","text":"

              Docker is an important tool for creating containerized applications. Explain how you used docker in your experiments? Include how you would run your docker images and include a link to one of your docker files.

              Recommended answer length: 100-200 words.

              Example: For our project we developed several images: one for training, inference and deployment. For example to run the training docker image: docker run trainer:latest lr=1e-3 batch_size=64. Link to docker file:

              Answer:

              --- question 15 fill here ---

              "},{"location":"reports/#question-16","title":"Question 16","text":"

              When running into bugs while trying to run your experiments, how did you perform debugging? Additionally, did you try to profile your code or do you think it is already perfect?

              Recommended answer length: 100-200 words.

              Example: Debugging method was dependent on group member. Some just used ... and others used ... . We did a single profiling run of our main code at some point that showed ...

              Answer:

              --- question 16 fill here ---

              "},{"location":"reports/#working-in-the-cloud","title":"Working in the cloud","text":"

              In the following section we would like to know more about your experience when developing in the cloud.

              "},{"location":"reports/#question-17","title":"Question 17","text":"

              List all the GCP services that you made use of in your project and shortly explain what each service does?

              Recommended answer length: 50-200 words.

              Example: We used the following two services: Engine and Bucket. Engine is used for... and Bucket is used for...

              Answer:

              --- question 17 fill here ---

              "},{"location":"reports/#question-18","title":"Question 18","text":"

               The backbone of GCP is the Compute engine. Explain how you made use of this service and what type of VMs you used.

              Recommended answer length: 100-200 words.

               Example: We used the compute engine to run our ... . We used instances with the following hardware: ... and we started them using a custom container: ...

              Answer:

              --- question 18 fill here ---

              "},{"location":"reports/#question-19","title":"Question 19","text":"

              Insert 1-2 images of your GCP bucket, such that we can see what data you have stored in it. You can take inspiration from this figure.

              Answer:

              --- question 19 fill here ---

              "},{"location":"reports/#question-20","title":"Question 20","text":"

              Upload one image of your GCP artifact registry, such that we can see the different images that you have stored. You can take inspiration from this figure.

              Answer:

              --- question 20 fill here ---

              "},{"location":"reports/#question-21","title":"Question 21","text":"

               Upload one image of your GCP cloud build history, so we can see the history of the images that have been built in your project. You can take inspiration from this figure.

              Answer:

              --- question 21 fill here ---

              "},{"location":"reports/#question-22","title":"Question 22","text":"

               Did you manage to deploy your model, either locally or in the cloud? If not, describe why. If yes, describe how and preferably how you invoke your deployed service.

              Recommended answer length: 100-200 words.

               Example: For deployment we wrapped our model into an application using ... . We first tried locally serving the model, which worked. Afterwards we deployed it in the cloud, using ... . To invoke the service a user would call curl -X POST -F \"file=@file.json\" <weburl>

              Answer:

              --- question 22 fill here ---

              "},{"location":"reports/#question-23","title":"Question 23","text":"

              Did you manage to implement monitoring of your deployed model? If yes, explain how it works. If not, explain how monitoring would help the longevity of your application.

              Recommended answer length: 100-200 words.

              Example: We did not manage to implement monitoring. We would like to have monitoring implemented such that over time we could measure ... and ... that would inform us about this ... behaviour of our application.

              Answer:

              --- question 23 fill here ---

              "},{"location":"reports/#question-24","title":"Question 24","text":"

              How many credits did you end up using during the project and what service was most expensive?

              Recommended answer length: 25-100 words.

               Example: Group member 1 used ..., Group member 2 used ..., in total ... credits were spent during development. The service costing the most was ... due to ...

              Answer:

              --- question 24 fill here ---

              "},{"location":"reports/#overall-discussion-of-project","title":"Overall discussion of project","text":"

              In the following section we would like you to think about the general structure of your project.

              "},{"location":"reports/#question-25","title":"Question 25","text":"

               Include a figure that describes the overall architecture of your system and what services you make use of. You can take inspiration from this figure. Additionally, in your own words, explain the overall steps in the figure.

              Recommended answer length: 200-400 words

              Example:

              The starting point of the diagram is our local setup, where we integrated ... and ... and ... into our code. Whenever we commit code and push to github, it auto triggers ... and ... . From there the diagram shows ...

              Answer:

              --- question 25 fill here ---

              "},{"location":"reports/#question-26","title":"Question 26","text":"

              Discuss the overall struggles of the project. Where did you spend most time and what did you do to overcome these challenges?

              Recommended answer length: 200-400 words.

               Example: The biggest challenges in the project were using ... tool to do ... . The reason for this was ...

              Answer:

              --- question 26 fill here ---

              "},{"location":"reports/#question-27","title":"Question 27","text":"

              State the individual contributions of each team member. This is required information from DTU, because we need to make sure all members contributed actively to the project

              Recommended answer length: 50-200 words.

              Example: Student sXXXXXX was in charge of developing of setting up the initial cookie cutter project and developing of the docker containers for training our applications. Student sXXXXXX was in charge of training our models in the cloud and deploying them afterwards. All members contributed to code by...

              Answer:

              --- question 27 fill here ---

              "},{"location":"s10_extra/","title":"Extra learning modules","text":"

               The modules listed here are not part of the core course, but expand on some of the other topics. Some of them may still be under construction and may in the future be moved into other sessions.

              • Learn how to setup a simple documentation system for your application

                M32: Documentation

              • Learn how to do hyperparameter optimization using Optuna

                M33: Hyperparameter Optimization

              • Learn how to use HPC systems that use PBS to do job scheduling

                M34: High Performance Clusters

              "},{"location":"s10_extra/calibration/","title":"Calibration of ML models","text":"

              Danger

              Module is still under development

              "},{"location":"s10_extra/calibration/#methods","title":"Methods","text":""},{"location":"s10_extra/calibration/#exercises","title":"\u2754 Exercises","text":"
              1. Implement a script

              2. Implement temperature scaling

              3. Implement label smoothing

                 # assumes y_true holds one-hot encoded targets (one row per sample) and num_classes is defined\nalpha = 0.1  # smoothing factor\nfor i in range(len(y_true)):\n    y_true[i] = (1 - alpha) * y_true[i] + alpha / num_classes\n
               4. Implement mixup (a minimal sketch of one common formulation is shown after this list)

              5. Implement cutmix

              6. Implement the Focal Loss

               7. Implement it in a continuous integration setup
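
               For the mixup exercise, a minimal sketch of one common formulation, assuming one-hot encoded targets and a Beta-distributed mixing coefficient:

               import torch\n\ndef mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2) -> tuple[torch.Tensor, torch.Tensor]:\n    \"\"\"Return a convex combination of the batch and a shuffled version of it.\"\"\"\n    lam = torch.distributions.Beta(alpha, alpha).sample()\n    perm = torch.randperm(x.size(0))\n    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]\n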

              "},{"location":"s10_extra/calibration/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":""},{"location":"s10_extra/design/","title":"Designing MLOps pipelines","text":"

              Danger

              Module is still under development

              \"Machine learning engineering is 10% machine learning and 90% engineering.\" - Chip Huyen

               We highly recommend that you read the book Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications by Chip Huyen, which gives a fantastic overview of the thought processes that go into designing modern machine learning systems.

              "},{"location":"s10_extra/design/#the-stack","title":"The stack","text":"

               Have you ever encountered the concept of a full stack developer? A full stack developer is a developer who can develop both client and server software or, in more general terms, a developer who can take care of the complete development pipeline.

               Below is an image of the massive amount of tools that exist under the MLOps umbrella.

              "},{"location":"s10_extra/design/#visualizing-the-design","title":"Visualizing the design","text":""},{"location":"s10_extra/documentation/","title":"M32 - Documentation","text":""},{"location":"s10_extra/documentation/#documentation","title":"Documentation","text":"

               In today's rapidly evolving software development landscape, effective documentation is a crucial component of any project. The ability to create clear, concise, and user-friendly technical documentation can make a significant difference in the success of your codebase. We have all probably encountered code that we wanted to use, only to abandon it because it was missing the documentation needed to get started with it.

              Technical documentation or code documentation can be many things:

              • Plain text, images and videos explaining core concepts for your software
               • Documentation of the API: how to call a function or class, what the different parameters are, etc.
              • Code examples of how to use certain functionality

               and many more. In this module we are going to focus on setting up a very basic documentation system that will automatically help you document the API of your code. For this reason we recommend that, before continuing with this module, you have completed module M7 on good coding practices or have similar experience with writing docstrings for Python functions and classes.

              There are different systems for writing documentation. In fact there is a lot to choose from:

              • MkDocs
              • Sphinx
              • GitBook
              • Docusaurus
              • Doxygen
              • Jekyll

               It is important to note that all of these are static site generators. The word static here refers to the fact that when the content is generated and served on a website, the underlying HTML code will not change. It may contain HTML elements that are dynamic (like video), but the site itself does not change (1).

              1. Good examples of dynamic sites are any social media or news media where new posts, pages etc. are constantly added over time. Good examples of static sites are documentation, blogposts etc.

               In this module we are going to look at Mkdocs, which (in my opinion) is one of the easiest systems to get started with because all documentation is written in markdown and the build system is written in Python. As an alternative, you can consider doing the exercises in Sphinx, which is probably the most used documentation system for Python code. Sphinx offers more customization than Mkdocs and is therefore generally preferred for larger projects with complex documentation, but for smaller projects Mkdocs is easier to get started with and is sufficient.

              Mkdocs by default does not include many features and for that reason we are directly going to dive into using the material for mkdocs theme that provides a lot of nice customization to create professional static sites. In fact, this whole course is written in mkdocs using the material theme.

              "},{"location":"s10_extra/documentation/#mkdocs","title":"Mkdocs","text":"

              The core file when using mkdocs is the mkdocs.yaml file, which is the configuration file for the project:

              site_name: Documentation of my project\nsite_author: Jane Doe\ndocs_dir: source # (1)!\n\ntheme:\n    language: en\n    name: material # (2)!\n    features: # (3)!\n    - content.code.copy\n    - content.code.annotate\n\nplugins: # (4)!\n    - search\n    - mkdocstrings\n\nnav: # (5)!\n  - Home: index.md\n
               1. This indicates the source directory of our documentation. If the layout of your documentation is a bit different from what is described above, you may need to change this.

              2. The overall theme of your documentation. We recommend the material theme but there are many more to choose from and you can also create your own.

               3. The features section is where features supported by your given theme can be enabled. In this example we have enabled the content.code.copy feature, which adds a small copy button to all code blocks, and the content.code.annotate feature, which allows you to add annotations like this box to code blocks.

               4. Plugins add new functionality to your documentation. In this case we have added two plugins that add functionality for searching through our documentation and automatically adding documentation from docstrings. Remember that some plugins require you to install additional Python packages, so remember to add them to your requirements.txt file.

              5. The nav section is where you define the navigation structure of your documentation. When you add new .md files to the source folder you then need to add them to the nav section.

              And that is more or less what you need to get started. In general, if you need help with configuration of your documentation in mkdocs I recommend looking at this page and this page.

              "},{"location":"s10_extra/documentation/#exercises","title":"Exercises","text":"

              In this set of exercises we assume that you have completed module M6 on code structure and therefore have a repository that at least contains the following:

              \u251c\u2500\u2500 pyproject.toml     <- Project configuration file with package metadata\n\u2502\n\u251c\u2500\u2500 docs               <- Documentation folder\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 index.md       <- Homepage for your documentation\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 mkdocs.yaml     <- Configuration file for mkdocs\n\u2502   \u2502\n\u2502   \u2514\u2500\u2500 source/        <- Source directory for documentation files\n\u2502\n\u2514\u2500\u2500 src                <- Source code for use in this project.\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 __init__.py    <- Makes src a Python module\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 models         <- model implementations, training script\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u251c\u2500\u2500 model.py\n\u2502   \u2502   \u251c\u2500\u2500 train_model.py\n...\n

               It is not important exactly what is in the src folder for the exercises, but we are going to refer to the above structure in the exercises, so adjust accordingly if you deviate from this. Additionally, we are going to assume that your project code is installed in your environment such that it can be imported as normal Python code.

              1. We are going to need two Python packages to get started: mkdocs and material for mkdocs. Install with

                pip install \"mkdocs-material >= 4.8.0\" # (1)!\n
                1. Since mkdocs is a dependency of mkdocs-material we only need to install the latter.
              2. Run in your terminal (from the docs folder):

                mkdocs serve # (1)!\n
                1. mkdocs serve will automatically rebuild the whole site whenever you save a file inside the docs folder. This is not a problem if you have a fairly small site with not that many pages (or elements), but can take a long time for large sites. Consider running with the --dirty option for only re-building the site for files that have been changed.

                which should render the index.md file as the homepage. You can leave the documentation server running during the remaining exercises.

               3. We are now ready to document the API of our code:

                 1. Make sure you have at least one function and class inside your src module. If you do not, you can for simplicity copy the following module into the src/models/model.py file

                   import torch\n\nclass MyNeuralNet(torch.nn.Module):\n    \"\"\"Basic neural network class.\n\n    Args:\n        in_features: number of input features\n        out_features: number of output features\n\n    \"\"\"\n    def __init__(self, in_features: int, out_features: int) -> None:\n        super().__init__()  # initialize the base torch.nn.Module before assigning submodules\n        self.l1 = torch.nn.Linear(in_features, 500)\n        self.l2 = torch.nn.Linear(500, out_features)\n        self.r = torch.nn.ReLU()\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass of the model.\n\n        Args:\n            x: input tensor expected to be of shape [N,in_features]\n\n        Returns:\n            Output tensor with shape [N,out_features]\n\n        \"\"\"\n        return self.l2(self.r(self.l1(x)))\n

                   and add the following function to the src/predict_model.py file:

                   import torch\n\n\ndef predict(\n    model: torch.nn.Module,\n    dataloader: torch.utils.data.DataLoader\n) -> torch.Tensor:\n    \"\"\"Run prediction for a given model and dataloader.\n\n    Args:\n        model: model to use for prediction\n        dataloader: dataloader with batches\n\n    Returns:\n        Tensor of shape [N, d] where N is the number of samples and d is the output dimension of the model\n\n    \"\"\"\n    return torch.cat([model(batch) for batch in dataloader])\n
                2. Add a markdown file to the docs/source folder called my_api.md and add that file to the nav: section in the mkdocs.yaml file.

                3. To that file add the following code:

                  # My API\n\n::: src.models.model.MyNeuralNet\n\n::: src.predict_model.predict\n

                  The ::: indicator tells mkdocs that it should look up the corresponding function/module and render it on the given page. Thus, if a function/module is located somewhere else, change the paths accordingly.

                4. Make sure that the documentation correctly includes your function and module on the given page.

                5. (Optional) Include more functions/modules in your documentation.

              4. (Optional) Look through the documentation for mkdocstrings and try to improve the layout a bit. In particular, the headings, docstrings and signatures could be of interest to adjust.

              5. Finally, try to build a final version of your documentation

                mkdocs build\n

                this should result in a site folder that contains the actual HTML code for documentation.

              "},{"location":"s10_extra/documentation/#publish-your-documentation","title":"Publish your documentation","text":"

              To publish your documentation you need a place to host your built documentation, e.g. the content of the site folder you built in the last exercise. There are many places to host your documentation, but if you only need a static site and are already hosting your code through GitHub, then a good option is GitHub Pages. GitHub Pages is free to use for your public projects.

              Before getting started with this set of exercises you should have completed module M16 on GitHub Actions, so you already know about workflow files.

              "},{"location":"s10_extra/documentation/#exercises_1","title":"Exercises","text":"
              1. Start by adding a new file called deploy_docs.yaml to the .github/workflows folder. Add the following code to that file and save it.

                name: Deploy docs\n\non:\n  push:\n    branches:\n      - main\n\npermissions:\n  contents: write # (1)!\n\njobs:\n  deploy:\n    name: Deploy docs\n    runs-on: ubuntu-latest\n    steps:\n    - name: Checkout code\n      uses: actions/checkout@v4\n      with:\n        fetch-depth: 0\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: \"3.11\"\n        cache: 'pip'\n        cache-dependency-path: requirements.txt\n\n    - name: Install dependencies\n      run: pip install -r requirements.txt\n\n    - name: Deploy docs\n      run: mkdocs gh-deploy --force\n
                1. It is important to give write permissions to this action because it is not only reading your code but also pushing code.

                Before continuing, make sure you understand what the different steps of the workflow do; in particular, we recommend looking at the documentation of the mkdocs gh-deploy command.

              2. Commit and push the file. Check that the action is executed and, if it succeeds, that your built site is pushed to a branch called gh-pages. If the action does not succeed, figure out what is wrong and fix it!

              3. After confirming that the action is working, you need to configure GitHub to publish the content being built by GitHub Actions. Do the following:

                • Go to the Settings tab and then the Pages subsection
                • In the Source setting choose Deploy from a branch
                • In the Branch setting choose the gh-pages branch and /(root) folder and save

                This should then start deploying your site to https://<your-username>.github.io/<your-reponame>/. If it does not, you may need to recommit and trigger the GitHub Actions build again.

              4. Make sure your documentation is published and looks as it should.

              This ends the module on technical documentation. We cannot stress enough how important it is to write proper documentation for larger projects that need to be maintained over a longer time. Writing documentation is an iterative process, and it is often best to do it while writing the code.

              "},{"location":"s10_extra/high_performance_clusters/","title":"M34 - High Performance Clusters","text":""},{"location":"s10_extra/high_performance_clusters/#high-performance-clusters","title":"High Performance Clusters","text":"

              As discussed in the intro session on the cloud, cloud providers offer near-infinite compute resources. However, using these resources often comes at a hefty price, and it is therefore important to be aware of another resource many have access to: High Performance Clusters (HPC). HPCs exist all over the world, and in many cases you already have access to one or can easily get access to one. If you are a university student, you most likely have a local HPC that you can access through your institution. Otherwise, there exist public HPC resources that everybody (with a project) can apply for. As an example, in the EU we have the EuroHPC initiative, which currently runs 8 different supercomputers and has a centralized location for applying for resources, open both to research projects and start-ups.

              Depending on your application you may have different needs, and it is therefore important to also be aware of the different tiers of HPC. In Europe, HPC centers are often categorized such that Tier 0 are European centers with petaflop or exascale machines, Tier 1 are national centers of supercomputers, and Tier 2 are regional centers. The lower the tier, the larger the applications that can be run.

              "},{"location":"s10_extra/high_performance_clusters/#cluster-architectures","title":"Cluster architectures","text":"

              In very general terms, clusters come as two different kinds of systems: supercomputers and LSF (Load Sharing Facility) systems. A supercomputer (as shown below) is organized into different modules that are separated by network links. When you log in to a supercomputer you will meet the front end, which contains all the software needed to run computations. When you submit a job it will get sent to the backend modules, which in most cases include: general compute modules (CPU), acceleration modules (GPU), a memory module (RAM) and finally a storage module (HDD). Depending on your application you may need one module more than another. For example, in deep learning the acceleration module is important, whereas in physics simulations the general compute module / storage module is probably more important.

              Overview of the Meluxina supercomputer that's part of EuroHPC. Image credit

              Alternatively, an LSF system is a network of computers where each computer has its own CPU, GPU, RAM etc., and the individual computers (or nodes) are then connected by network. The important difference between a supercomputer and an LSF system is how the resources are organized. When comparing the two, it is generally better to run on an LSF system if you are only requesting resources that can be handled by a single node, whereas it is better to run on a supercomputer if you have a resource-intensive application that requires many devices to communicate with each other.

              Regardless of the cluster architecture, on the software side of HPC the most important part is what's called the HPC scheduler. Without an HPC scheduler, an HPC cluster would just be a bunch of servers with different jobs interfering with each other. The problem is that when you have a large collection of resources and a large collection of users, you cannot rely on the users running their applications without interfering with each other. An HPC scheduler makes sure that whenever a user requests to run an application, the request is put in a queue, and the application is run whenever the resources it asks for become available.

              The biggest batch control systems for doing scheduling on HPC are:

              • SLURM
              • MOAB HPC Suite
              • PBS Works

              We are going to take a look at PBS Works as that is what is installed on our local university cluster.

              "},{"location":"s10_extra/high_performance_clusters/#exercises","title":"\u2754 Exercises","text":"

              Exercise files

              The following exercises are focused on local students at DTU who want to use our local HPC resources. That said, the steps in the exercises are fairly general and carry over to other types of clusters. For the purpose of this exercise we are going to see how we can run this image classifier script, but feel free to work with whatever application you want to.

              1. Start by accessing the cluster. This can either be through ssh in a terminal or, if you want a graphical interface, ThinLinc can be installed. In general, we recommend that DTU students follow the steps here, as the setup depends on whether you are on campus or not.

              2. When you have access to the cluster, we start with the setup phase, in which we set up the environment necessary for our computations. If you have accessed the cluster through the graphical interface, start by opening a terminal.

                1. Let's start by setting up conda for controlling our dependencies. If you have not already worked with conda, please check out module M2 on package managers and virtual environments. In general you should be able to set up (mini)conda through these two commands:

                  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nsh Miniconda3-latest-Linux-x86_64.sh\n
                2. Close the terminal and open a new one for the installation to complete. Type conda in the terminal to check that everything is fine. Then go ahead and create a new environment that we can install dependencies in

                  conda create -n \"hpc_env\" python=3.10 --no-default-packages\n

                  and activate it.

                3. Copy over any files you need. For the image classifier script you need the requirements file and the actual application.

                4. Next, install all the requirements you need. If you want to run the image classifier script you can run this command in the terminal

                  pip install -r image_classifier_requirements.txt\n

                  using this requirements file.

              3. That's all the setup needed. You will need to go through the creation of an environment and installation of requirements whenever you start a new project (there is no need to reinstall conda). Next, we need to look at how to submit jobs on the cluster. We are now ready to submit our first job:

                1. Start by checking the statistics for the different clusters. Try the qstat command, which should give an overview of the different clusters, the number of running jobs and the number of pending jobs. On many systems you can also try the much more user-friendly classstat command.

                2. Figure out which queue you want to use. For the sake of the exercises it needs to be one with GPU support. For DTU students, any queue whose name starts with gpu is GPU accelerated.

                3. Now we are going to develop a bash script for submitting our job. We have provided an example of such a script. Take a careful look, go through each line and make sure you understand it. Afterwards, change it to your needs (queue and student email).

                4. Try to submit the script:

                  bsub < jobscript.sh\n

                  You can check the status of your script by running the bstat command. Hopefully, the job should go through really quickly. Take a look at the output file; it should be called something like gpu_*.out. Also take a look at the gpu_*.err file. Do both files look as they should?

              4. Let's now try to run our application on the cluster. To do that we need to take care of two things:

                1. First we need to load the correct version of CUDA. A cluster system often contains multiple versions of specific software to suit the needs of all its users, and it is the users who are in charge of loading the correct software during job submission. The only extra software that needs to be loaded for most PyTorch applications is a CUDA module. You can check which modules are available on the cluster with

                  module avail\n

                  Afterwards, add the correct CUDA version you need to the jobscript.sh file. If you are trying to run the provided image classifier script then the correct version is CUDA/11.7 (can be seen in the requirements file).

                  # add to the bottom of the file\nmodule load cuda/11.7\n
                2. We are now ready to add in our application. The only thing we need to take care of is telling the system to run it using the Python interpreter connected to the hpc_env environment we created in the beginning. Try typing:

                  which python\n

                  which should give you the full path. Then add to the bottom of the jobscript file:

                  ~/miniconda3/envs/hpc_env/bin/python \\\n    image_classifier.py \\\n    --trainer.accelerator 'gpu' --trainer.devices 1  --trainer.max_epochs 5\n

                  which will run the image classifier script (change it if you are running something else).

                3. Finally submit the job:

                  bsub < jobscript.sh\n

                  and check when it is done that it has produced what you expected.

                4. (Optional) If your application supports multiple GPUs, also try that out. You first need to change the jobscript to request multiple GPUs, and additionally you need to tell your application to run on multiple GPUs. For the image classifier script this can be done by changing the --trainer.devices flag to 2 (or higher).

              This ends the module on using HPC systems.

              "},{"location":"s10_extra/hyperparameters/","title":"M33 - Hyperparameter optimization","text":""},{"location":"s10_extra/hyperparameters/#hyperparameter-optimization","title":"Hyperparameter optimization","text":"

              Outdated module

              This module has not been updated for a long time and therefore some functionality of Optuna, which is used in these exercises, may not be included. If you have completed the module on Weights & Biases then we highly recommend instead using their sweep functionality.

              Hyperparameter optimization is not a new idea within machine learning, but it has seen somewhat of a renaissance with the rise of deep learning. This can mainly be attributed to the following:

              • Trying to beat state-of-the-art often comes down to very small differences in performance, and hyperparameter optimization can help squeeze out a bit more
              • Deep learning models are in general not that robust towards the choice of hyperparameters, so choosing the wrong set may lead to a model that does not work

              However, the problem with doing hyperparameter optimization of deep learning models is that it can take over a week to train a single model. In most cases we therefore cannot do a full grid search over all hyperparameter combinations to get the best model. Instead, we have to use some tricks that will help us speed up the search. In these exercises we are going to be integrating Optuna into our models, which will provide the tools for speeding up our search.

              It should be noted that for a lot of deep learning models, not every hyperparameter is optimized; instead, practitioners rely on heuristic guidelines (\"rules of thumb\") based on what seems to work in general, e.g. a learning rate of 0.01 seems to work great with the Adam optimizer. That said, these rules probably only apply to 80% of deep learning models, whereas for the last 20% the recommendations may be suboptimal. Here is a great site that has collected an extensive list of these recommendations, taken from the excellent deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

              In practice, I recommend trying to identify (through experimentation) which hyperparameters are important for the performance of your model and then spending your computational budget on optimizing them while setting the rest to a \"recommended value\".

              "},{"location":"s10_extra/hyperparameters/#exercises","title":"\u2754 Exercises","text":"

              Exercise files

              1. Start by installing optuna: pip install optuna

              2. Initially we will look at the cross_validate.py file. It implements simple K-fold cross-validation of a random forest on the sklearn digits dataset (a subset of MNIST). Look over the script and try to run it.

              3. We will now try to write the same code using Optuna. Please note that the script has a variable OPTUNA=False that you can use to change which part of the code should run. The three main concepts of Optuna are

                • A trial: a single experiment

                • A study: a collection of trials

                • The objective: function to determine how \"good\" a trial is

                Let's start by writing the objective function, which we have already started in the script. For now you do not need to care about the trial argument; just assume that it contains the hyperparameters needed to define your random forest model. The output of the objective function should be a single number that we want to optimize. (HINT: did you remember to do K-fold cross-validation inside your objective function?)

              4. Next, let's focus on the trial. Inside the objective function, the trial should be used to suggest what parameters to use next. Take a look at the documentation for trial or at the code examples and figure out how to define the hyperparameters of the model. A sketch combining this and the previous step is shown below.
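                A minimal sketch of how the objective function and the trial suggestions could fit together, assuming the sklearn digits dataset and a random forest as in cross_validate.py; the parameter names and ranges are only illustrative:

                import optuna\nfrom sklearn.datasets import load_digits\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import cross_val_score\n\n\ndef objective(trial: optuna.Trial) -> float:\n    \"\"\"Return the mean K-fold accuracy for a set of suggested hyperparameters.\"\"\"\n    n_estimators = trial.suggest_int(\"n_estimators\", 10, 200)  # illustrative range\n    max_depth = trial.suggest_int(\"max_depth\", 2, 32)  # illustrative range\n    x, y = load_digits(return_X_y=True)\n    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)\n    return cross_val_score(model, x, y, cv=5).mean()  # 5-fold cross-validation\n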

              5. Finally, let's launch a study. It can be as simple as

                study = optuna.create_study()\nstudy.optimize(objective, n_trials=100)\n

                but let's play around a bit with it:

                1. By default the .optimize method will minimize the objective (by definition the optimum of an objective function is at its minimum). Is the score your objective function is returning something that should be minimized? If not, a simple solution is to put a - in front of the metric. However, look through the documentation on how to change the direction of the optimization.

                2. Optuna will by default do Bayesian optimization when sampling the hyperparameters (using the Tree-structured Parzen Estimator to suggest new trials). However, since this example is quite simple, we can actually perform a full grid search. How would you do this in Optuna? (A small sketch covering this and the direction change follows after this list.)

                3. Compare the performance of a single Optuna run using Bayesian optimization with n_trials=10 to an exhaustive grid search that searches through all hyperparameter combinations. What is the performance/time trade-off between these two solutions?
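                A minimal sketch of both knobs discussed above, assuming the objective function defined earlier in these exercises; the grid values are only illustrative:

                  import optuna\nfrom optuna.samplers import GridSampler\n\n# change the direction so that higher objective values are better\nstudy = optuna.create_study(direction=\"maximize\")\n\n# exhaustive grid search instead of the default TPE sampler (grid values are illustrative)\nsearch_space = {\"n_estimators\": [10, 50, 100], \"max_depth\": [4, 8, 16]}\ngrid_study = optuna.create_study(sampler=GridSampler(search_space), direction=\"maximize\")\ngrid_study.optimize(objective, n_trials=9)  # 3 x 3 = 9 combinations in the grid\n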

              6. In addition to doing Bayesian optimization, the other great part about Optuna is that it has native support for pruning unpromising trials. Pruning refers to stopping trials for hyperparameter combinations that do not seem to lead anywhere. You may have a learning rate that is so high that training is diverging, or a neural network with so many parameters that it is just overfitting to the training data. This, however, raises the question: what constitutes an unpromising trial? This is up to you to define based on prior experimentation.

                1. Start by looking at the fashion_trainer.py script. It's a simple classification network for classifying images in the FashionMNIST dataset. Run the script with the default hyperparameters to get a feeling for how the training should progress. Note down the performance on the test set.

                2. Start by defining a validation set and a validation dataloader that we can use for hyperparameter optimization (HINT: use 5-10% of your training data).

                3. Now, adjust the script to use Optuna. The hyperparameters listed below should at least be included in the hyperparameter search. For some we have already defined the search space, but for the remaining ones you need to come up with a good range of values to investigate. When done integrating Optuna, run a small study (n_trials=3) to check that the code is working.

                  The hyperparameters and their search spaces are:
                  • Learning rate: 1e-6 to 1e0
                  • Number of output features in the second last layer: ???
                  • The amount of dropout to apply: ???
                  • Batch size: ???
                  • Use batch normalization or not: {True, False}
                  • (Optional) Different activation functions: {nn.ReLU, nn.Tanh, nn.RReLU, nn.LeakyReLU, nn.ELU}
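                  As a hedged illustration, the suggestions for these hyperparameters could look roughly like the sketch below; the concrete ranges filled in for the ??? entries are assumptions that you should replace with values you find reasonable, and the actual training/evaluation is left as a placeholder:

                  import optuna\n\n\ndef objective(trial: optuna.Trial) -> float:\n    \"\"\"Suggest the hyperparameters listed above; the ranges for the ??? entries are assumptions.\"\"\"\n    params = {\n        \"lr\": trial.suggest_float(\"lr\", 1e-6, 1e0, log=True),\n        \"hidden_features\": trial.suggest_int(\"hidden_features\", 32, 512),  # assumed range\n        \"dropout\": trial.suggest_float(\"dropout\", 0.0, 0.7),  # assumed range\n        \"batch_size\": trial.suggest_categorical(\"batch_size\", [32, 64, 128, 256]),  # assumed values\n        \"use_batchnorm\": trial.suggest_categorical(\"use_batchnorm\", [True, False]),\n    }\n    # build the model and dataloaders from params, train, and evaluate on the validation set\n    # (this part depends on your fashion_trainer.py implementation)\n    return 0.0  # placeholder: return the validation accuracy instead\n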
                4. If implemented correctly the number of hyperparameter combinations should be at least 1000, meaning that we not only need Bayesian optimization but probably also pruning to succeed. Check out the page for built-in pruners in Optuna. Implement pruning in the script. I recommend using either the MedianPruner or the PercentilePruner.
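                  A minimal sketch of how a pruner could be attached, assuming the objective reports a validation metric once per epoch; the pruner arguments are illustrative:

                  import optuna\n\n# attach a pruner to the study; the arguments are illustrative\nstudy = optuna.create_study(\n    direction=\"maximize\",\n    pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=2),\n)\n\n# inside the objective's training loop, report intermediate values and stop unpromising trials:\n#     trial.report(val_accuracy, step=epoch)\n#     if trial.should_prune():\n#         raise optuna.TrialPruned()\n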

                5. Re-run the study using pruning with a large number of trials (n_trials>50)

                6. Take a look at this visualization page for ideas on how to visualize the study you just did. Make at least two visualizations of the study and make sure that you understand them.
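                  As a hedged pointer, two of the built-in plots could be produced as below, assuming a finished study object and that plotly is installed:

                  import optuna.visualization as vis\n\nfig_history = vis.plot_optimization_history(study)  # objective value per trial\nfig_importance = vis.plot_param_importances(study)  # which hyperparameters mattered most\nfig_history.show()\nfig_importance.show()\n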

                7. Pruning is great for spending your computational budget better; however, it comes with a trade-off. What is it, and what hyperparameter should one be especially careful about when using pruning?

                8. Finally, what parameter combination achieved the best performance? What is the test set performance for this set of parameters? Did you improve over the initial set of hyperparameters?

              7. The exercises until now have focused on doing the hyperparameter search sequentially, meaning that we test one set of parameters at a time. It is a fine approach because you can easily let it run for a week without any interaction. However, assuming that you have the computational resources to run in parallel, how do you do that?

                1. To run hyperparameter search in parallel we need a common database that all experiments can read and write to. We are going to use the recommended MySQL. You do not have to understand what SQL is to complete this exercise, but it is basically a language (like Python) for managing databases. Install MySQL.

                2. Next we are going to initialize a database that we can read and write to. For this exercise we are going to focus on a locally stored database, but it could of course also be located in the cloud.

                  mysql -u root -e \"CREATE DATABASE IF NOT EXISTS example\"\n

                  You can also do this directly in Python when calling the create_study command by setting the storage and load_if_exists=True arguments, as sketched below.
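                  A minimal sketch of that Python-only alternative, reusing the study name and storage string from this exercise:

                  import optuna\n\nstudy = optuna.create_study(\n    study_name=\"distributed-example\",\n    storage=\"mysql://root@localhost/example\",\n    load_if_exists=True,  # reuse the study if it already exists in the database\n)\n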

                3. Now we are going to create an Optuna study in our database

                  optuna create-study --study-name \"distributed-example\" --storage \"mysql://root@localhost/example\"\n
                4. Change how you initialize the study so that it reads and writes to the database. That is, instead of doing

                  study = optuna.create_study()\n

                  then do

                  study = optuna.load_study(\n    study_name=\"distributed-example\", storage=\"mysql://root@localhost/example\"\n)\n

                  where the study_name and storage should match how the study was created.

                5. For running in parallel, you can either open up an extra terminal and simply launch your script once per open terminal, or you can use the provided parallel_lancher.py, which will launch multiple executions of your script. It should be used as:

                  python parallel_lancher.py myscript.py --num_parallel 2\n
                6. Finally, make sure that you can access the results

              That's all on how to do hyperparameter optimization in a scalable way. If you feel like it, you can try to apply these techniques to the ongoing corrupted MNIST example, where you are free to choose which hyperparameters you want to use.

              "},{"location":"s10_extra/infrastructure_as_code/","title":"Infrastructure as code","text":"

              Danger

              Module is still under development

              "},{"location":"s10_extra/infrastructure_as_code/#infrastructure-as-code-iac","title":"Infrastructure as Code (IaC)","text":"

              Infrastructure as Code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. The IT infrastructure managed by this comprises both physical equipment such as bare-metal servers as well as virtual machines and associated configuration resources. The definitions are written in a high-level programming language and can be versioned, and the code can be tested and validated.

              "},{"location":"s10_extra/infrastructure_as_code/#terraform","title":"Terraform","text":"

              Terraform is an open-source infrastructure-as-code tool created by HashiCorp. It enables users to define and provision datacenter infrastructure using a high-level, human-readable configuration language known as HashiCorp Configuration Language (HCL), or optionally JSON. It supports a multitude of cloud providers, including AWS, Azure, Google Cloud, and many others.

              "},{"location":"s10_extra/infrastructure_as_code/#installation","title":"Installation","text":"

              To install Terraform, download the appropriate package for your operating system from the official Terraform website. Once downloaded, unzip the package and move the binary to a directory included in your system's PATH.

              "},{"location":"s10_extra/infrastructure_as_code/#getting-started","title":"Getting started","text":"

              To get started with Terraform, you need to create a configuration file. This is a human-readable file that describes the infrastructure and set of resources to be created, and it is saved with a .tf extension. Here is an example of a simple Terraform configuration file that creates an AWS EC2 instance:

              provider \"aws\" {\n  region = \"us-west-2\"\n}\n\nresource \"aws_instance\" \"example\" {\n  ami           = \"ami-0c55b159cbfafe1f0\"\n  instance_type = \"t2.micro\"\n}\n

              To create the infrastructure described in the configuration file, navigate to the directory containing the file and run the following commands:

              terraform init\nterraform apply\n

              The terraform init command is used to initialize a working directory containing Terraform configuration files. This is the first command that should be run after writing a new Terraform configuration or cloning an existing one from version control. The terraform apply command is used to apply the changes required to reach the desired state of the configuration, or the pre-determined set of actions generated by a terraform plan execution plan.

              "},{"location":"s10_extra/infrastructure_as_code/#resources","title":"Resources","text":"
              • Terraform documentation
              "},{"location":"s10_extra/kubernetes/","title":"Kubernetes","text":"

              Danger

              Module is still under development

              "},{"location":"s10_extra/kubernetes/#kubernetes","title":"Kubernetes","text":"

              Kubernetes, also known as K8s, is an open-source platform designed to automate the deployment, scaling, and operation of application containers across clusters of hosts. It provides the framework to run distributed systems resiliently, handling scaling and failover for your applications, providing deployment patterns, and more.

              "},{"location":"s10_extra/kubernetes/#what-is-kubernetes","title":"What is Kubernetes?","text":""},{"location":"s10_extra/kubernetes/#brief-history","title":"Brief History","text":"

              Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation.

              "},{"location":"s10_extra/kubernetes/#core-functions","title":"Core Functions","text":"

              Kubernetes makes it easier to deploy and manage containerized applications at scale.

              "},{"location":"s10_extra/kubernetes/#key-concepts","title":"Key Concepts","text":"
              • Pods
              • Nodes
              • Clusters
              • ...
              "},{"location":"s10_extra/kubernetes/#kubernetes-architecture","title":"Kubernetes Architecture","text":"

              Kubernetes follows a client-server architecture. At a high level, it consists of a Control Plane (master) and Nodes (workers).

              Overview of Kubernetes Architecture. Image Credit: Kubernetes Official Documentation"},{"location":"s10_extra/kubernetes/#control-plane-components","title":"Control Plane Components","text":"
              • API Server: The frontend for Kubernetes.
              • etcd: Consistent and highly-available key value store.
              • ...
              "},{"location":"s10_extra/kubernetes/#node-components","title":"Node Components","text":"
              • Kubelet: An agent that runs on each node.
              • Container Runtime: The software responsible for running containers.
              • ...
              "},{"location":"s10_extra/kubernetes/#minikube-local-kubernetes-environment","title":"Minikube: Local Kubernetes Environment","text":"

              Minikube is a tool that allows you to run Kubernetes locally. It runs a single-node Kubernetes cluster inside a VM on your laptop for users looking to try out Kubernetes or develop with it day-to-day.

              "},{"location":"s10_extra/kubernetes/#installing-minikube","title":"Installing Minikube","text":"
              1. System Requirements: Ensure your system meets the minimum requirements.
              2. Download and Install: Visit Minikube's official installation guide.
              3. Start Minikube: Run minikube start.
              "},{"location":"s10_extra/kubernetes/#exercises","title":"\u2754 Exercises","text":"
              1. Install Minikube following the steps above.
              2. Validate the installation by typing minikube in a terminal.
              3. Ensure that kubectl, the command-line tool for Kubernetes, is correctly installed by typing kubectl in a terminal.
              "},{"location":"s10_extra/kubernetes/#yatai-model-serving-platform-for-kubernetes","title":"Yatai: Model Serving Platform for Kubernetes","text":"

              Yatai is a model serving platform, making it easier to deploy machine learning models in Kubernetes environments.

              "},{"location":"s10_extra/kubernetes/#what-is-yatai","title":"What is Yatai?","text":"

              Yatai simplifies the deployment, management, and scaling of machine learning models in Kubernetes.

              "},{"location":"s10_extra/kubernetes/#getting-started-with-yatai","title":"Getting Started with Yatai","text":"
              1. Installation: Steps to install Yatai in your Kubernetes cluster.
              2. Basic Usage: How to deploy your first model using Yatai.
              "},{"location":"s10_extra/kubernetes/#additional-resources","title":"Additional Resources","text":"
              • Official Kubernetes Documentation
              • Interactive Tutorials
              • Community Forums
              • ...
              "},{"location":"s10_extra/orchestration/","title":"Orchestration","text":"

              Danger

              Module is still under development

              "},{"location":"s10_extra/orchestration/#workflow-orchestration","title":"Workflow orchestration","text":""},{"location":"s10_extra/orchestration/#prefect","title":"Prefect","text":"

              If you give an MLOps engineer a job

              • Could you just set up this pipeline to train this model?
              • Could you set up logging?
              • Could you do it every day?
              • Could you make it retry if it fails?
              • Could you send me a message when it succeeds?
              • Could you visualize the dependencies?
              • Could you add caching?
              • Could you add collaborators who don't code to run it ad hoc, e.g. could you add a UI?
              pip install prefect\n
              from prefect import flow, task\n
              "},{"location":"s10_extra/orchestration/#exercises","title":"\u2754 Exercises","text":"
              1. Start by installing prefect:

                pip install prefect\n
              2. Start a local Prefect server instance in your virtual environment.

                prefect server start\n
              3. The great thing about Prefect is that the orchestration tasks and flows are written in pure Python, as the small sketch below illustrates.
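                A minimal sketch of what a task/flow pair could look like with the Prefect decorator API; the function names and logic are just placeholders:

                from prefect import flow, task\n\n\n@task\ndef get_data() -> list[int]:\n    \"\"\"A toy task that produces some data.\"\"\"\n    return [1, 2, 3]\n\n\n@task\ndef compute_mean(data: list[int]) -> float:\n    \"\"\"A toy task that computes the mean of the data.\"\"\"\n    return sum(data) / len(data)\n\n\n@flow\ndef my_pipeline() -> float:\n    \"\"\"A flow that chains the two tasks together.\"\"\"\n    data = get_data()\n    return compute_mean(data)\n\n\nif __name__ == \"__main__\":\n    my_pipeline()\n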

              "},{"location":"s10_extra/quantization/","title":"Quantization","text":""},{"location":"s10_extra/quantization/#quantization","title":"Quantization","text":"

              Danger

              Module is still under development

              "},{"location":"s10_extra/quantization/#exercises","title":"\u2754 Exercises","text":"

              We are in these exercises going to be looking at two different kinds of quantization strategies: quantization-aware training and post-training quantization. As the names suggest, the quantization is either applied while training or after training. There are good reasons for doing both:

              • If the model you are going to deploy in the end needs to be quantized, either due to hard requirements for how big the model can be or in an effort to optimize inference time, quantization-aware training is the better approach. The reason is that the model is specifically optimized to always be quantized, which in general results in a better model.

              • If the most important metric for deployment is the overall performance of the model, with no regard to model size and inference speed, post-training quantization is the better option. This most likely allows you to train a better model to begin with and then try converting the model afterwards. In the best case this can be done without any hit to performance.

              • Start by installing Intel Neural Compressor

                pip install neural_compressor\n

                and remember to add this to your requirements.txt file.

              • Let's create a new script called model_converter.py and start by filling it with some simple code for loading a given float32 model checkpoint. You should already have such code from earlier exercises. Preferably, add a small CLI interface to load a model by passing the filename on the command line:

                python model_converter.py model_checkpoint.ckpt\n
                Solution

                We are here going to assume that you are either loading an ONNX model or a PyTorch Lightning checkpoint:

                import onnx\nimport typer\nfrom my_model import MyModel  # your own LightningModule from the earlier exercises\n\napp = typer.Typer()\n\n\n@app.command()\ndef quantize(model_checkpoint: str) -> None:\n    \"\"\"Load a model from either a PyTorch Lightning checkpoint (.ckpt) or an ONNX file.\"\"\"\n    if model_checkpoint.endswith(\".ckpt\"):\n        model = MyModel.load_from_checkpoint(model_checkpoint)\n    else:\n        model = onnx.load(model_checkpoint)\n    print(f\"Loaded model of type {type(model)}\")\n\n\nif __name__ == \"__main__\":\n    app()\n
              • Next you also need to add the code that actually quantizes the loaded model; a hedged sketch of one way to do this is shown below.
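                Since this module is still under development and the exact Intel Neural Compressor workflow is not spelled out here, the following is only a sketch of post-training dynamic quantization using ONNX Runtime as a stand-in; it assumes the model has been exported to models/checkpoint.onnx (the same filenames used in the size-calculation solution below):

                from onnxruntime.quantization import QuantType, quantize_dynamic\n\n# post-training dynamic quantization of the float32 ONNX model to int8 weights\nquantize_dynamic(\n    model_input=\"models/checkpoint.onnx\",  # assumed path of the float32 model\n    model_output=\"models/checkpoint_quantized.onnx\",\n    weight_type=QuantType.QInt8,\n)\n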

              • Finally, calculate the size (in MB) of the original model and the quantized model. How much smaller is the quantized model?

                Solution

                Assuming the models are saved as models/checkpoint.onnx and models/checkpoint_quantized.onnx, we can calculate the size using os.path.getsize in Python:

                import os\n\noriginal_size = os.path.getsize(\"models/checkpoint.onnx\") / (1024 * 1024)\nquantized_size = os.path.getsize(\"models/checkpoint_quantized.onnx\") / (1024 * 1024)\n

                The quantized model should be very close to 4 times smaller, as int8 only uses 1/4 of the bits to store the weights compared to the float32 format.

              "},{"location":"s10_extra/quantization/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":""},{"location":"s1_development_environment/","title":"Setting up a development environment","text":"

              Slides

              • Learn the basics of the command line, and how to use it to navigate your file system and run programs.

                M1: Command line

              • Learn how package managers work in Python and how to create reproducible virtual environments using conda and pip.

                M2: Package Manager

              • Learn how to use a modern editor for code development.

                M3: Editor

              • Refresh your PyTorch skills and implement a simple deep-learning model.

                M4: Deep Learning Software

              Today we start our journey into the world of machine learning operations (MLOps). However, before we can get started, we need to make sure that you have a basic understanding of a couple of topics, as we will be using these throughout the course. In particular, today is all about getting set up with a proper development environment that can support your journey. Most of you probably already have experience with these topics, and it will be mostly repetition.

              The reason we are starting here is that many students are missing very basic skills that are never taught but are just expected to be picked up on your own. This session will only cover the most basic skills to get you started on your development journey. If you wish to learn more about basic computer science skills in general, we highly recommend that you check out The Missing Semester of Your CS Education course from MIT.

              Learning objectives

              The learning objectives of this session are:

              • Understand the basics of the command line.
              • Being able to create reproducible virtual environments.
              • Able to use a modern editor for code development
              • Write and run a Python program, implementing a simple deep-learning model
              "},{"location":"s1_development_environment/command_line/","title":"M1 - The command line","text":""},{"location":"s1_development_environment/command_line/#the-command-line","title":"The command line","text":"

              Core Module

              Image credit

              Contrary to popular belief, the command line (also commonly known as the terminal) is not a mythical being that has existed since the dawn of time. Instead, it was created at a time when it was not a given that your computer had a graphical interface that you could interact with. Think of it as a text interface to your computer.

              The terminal is a well-known concept to users of Linux; however, Mac and (especially) Windows users often do not need it and therefore never encounter it. Having a basic understanding of how to use a command line can help improve your workflow. The reason that the command line is an important tool to get to know is that doing any kind of MLOps will require us to be able to interact with many different tools, many of which do not have a graphical interface. Additionally, when we get to working in the cloud later in the course, you will be forced to interact with the command line.

              Note if you already are a terminal wizard then feel free to skip the exercises below. They are very elementary.

              "},{"location":"s1_development_environment/command_line/#the-anatomy-of-the-command-line","title":"The anatomy of the command line","text":"

              Regardless of the operating system, all command lines look more or less the same:

              As already stated, it is essentially just a big text interface to interact with your computer. As the image illustrates, when trying to execute a command, there are several parts to it:

              1. The prompt is the part where you type your commands. It usually contains the name of the current directory you are in, followed by some kind of sign: $, >, : are the usual ones. It can also contain other information, such as in the case of the above image which also shows the current conda environment.
              2. The command is the actual command you want to execute. For example, ls or cd.
              3. The options are additional flags that modify how a command behaves. For example, the -l in ls -l.
              4. The arguments are the actual inputs that you pass to the command. For example, figures in ls -l figures or .. in cd ..

              The core difference between options and arguments is that options are optional, while arguments are not.

              Image credit"},{"location":"s1_development_environment/command_line/#exercises","title":"\u2754 Exercises","text":"

              We have put a cheat sheet in the exercise files folder belonging to this session, that gives a quick overview of the different commands that can be executed in the command line.

              Windows users

              We highly recommend that you install Windows Subsystem for Linux (WSL). This will install a full Linux system on your Windows machine. Please follow this guide. Remember to run commands from an elevated (as administrator) Windows Command Prompt. You can in general complete all exercises in the course from a normal Windows Command Prompt, but some are easier to do if you run from WSL.

              If you decide to run in WSL, you need to remember that you now have two different systems, and installing a package on one system does not mean that it is installed on the other. For example, if you install pip in WSL, you need to install it again in Windows if you want to use it there.

              If you decide to not run in WSL, please always work in a Windows Command Prompt and not Powershell.

              1. Start by opening a terminal.

              2. To navigate inside a terminal, we rely on the cd command and pwd command. Make sure you know how to go back and forth in your file system. (1)

                1. Your terminal should support tab-completion which can help finish commands for you!
              3. The ls command is important when we want to know the content of a folder. Try to use the command, and also try it with the additional option -l. What does it show?

              4. Make sure to familiarize yourself with the which, echo, cat, wget, less, and top commands. Also, familiarize yourself with the > operator. You are probably going to use some of them throughout the course or in your future career. For Windows users, these commands may be named something else, e.g., where command on Windows corresponds to which.

              5. It is also important that you know how to edit a file through the terminal. Most systems should have the nano editor installed; otherwise, try to figure out which editor is installed on your system.

                1. Type nano in the terminal.

                2. Write the following text in the script

                  if __name__ == \"__main__\":\n    print(\"Hello world!\")\n
                3. Save the script and try to execute it.

                4. Afterward, try to edit the file through the terminal (change Hello world to something else).

              6. All terminals come with a programming language. The most common one is called bash, and being able to write simple programs in bash can come in handy. For example, one common case is wanting to execute multiple Python programs sequentially, which can be done through a bash script.

                Windows users

                Bash is not part of Windows, so you need to run this part through WSL. If you did not install WSL, you can skip this part or, as an alternative, do the exercises in PowerShell, which is the native Windows scripting language (not recommended).

                1. Write a bash script (in nano) and try executing it:

                  #!/bin/bash\n# A sample Bash script, by Ryan\necho Hello World!\n
                2. Change the bash script to call the Python program you just wrote.

                3. Try to Google how to write a simple for-loop that executes the Python script 10 times in a row.

              7. A trick you may need throughout this course is setting environment variables. An environment variable is just a dynamically named value that may alter the way running processes behave on a computer. The syntax for setting an environment variable depends on your operating system:

                WindowsLinux/Mac
                set MY_VAR=hello\necho %MY_VAR%\n
                export MY_VAR=hello\necho $MY_VAR\n
                1. Try to set an environment variable and print it out.

                2. To use an environment variable in a Python program, you can use the os.environ mapping from the os module. Write a Python program that prints out the environment variable you just set.

                3. If you have a collection of environment variables, these can be stored in a file called .env. The file is formatted as follows:

                  MY_VAR=hello\nMY_OTHER_VAR=world\n

                  To load the environment variables from the file, you can use the python-dotenv package. Install it with pip install python-dotenv and then try to load the environment variables from the file and print them out.

                  from dotenv import load_dotenv\nload_dotenv()\nimport os\nprint(os.environ[\"MY_VAR\"])\n
              "},{"location":"s1_development_environment/command_line/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. Here is one command from later in the course when we are going to work in the cloud

                gcloud compute instances create-with-container instance-1 \\\n    --container-image=gcr.io/<project-id>/gcp_vm_tester \\\n    --zone=europe-west1-b\n

                Identify the command, options, and arguments.

                Solution
                • The command is gcloud compute instances create-with-container.
                • The options are --container-image=gcr.io/<project-id>/gcp_vm_tester and --zone=europe-west1-b.
                • The argument is instance-1.

                The tricky part of this example is that commands can have subcommands, which are also commands. In this case, compute is a subcommand to gcloud, instances is a subcommand to compute, and create-with-container is a subcommand to instances.

              2. Two common options that nearly all commands have are -h and -V. What does each of them do?

                Solution

                The -h (or --help) option prints the help message for the command, including subcommands and arguments. Try it out by executing python -h. The -V (or --version) option prints the version of the installed program. Try it out by executing python --version.

              This ends the module on the command line. If you are still not comfortable working with the command line, fear not as we are going to use it extensively throughout the course. If you want to spend additional time on this topic, we highly recommend that you watch this video on how to use the command line.

              If you are interested in personalizing your command line, you can check out the starship project, which allows you to customize your command line with a lot of different options.

              "},{"location":"s1_development_environment/deep_learning_software/","title":"M4 - Deep Learning Software","text":""},{"location":"s1_development_environment/deep_learning_software/#deep-learning-software","title":"Deep Learning Software","text":"

              Core Module

              Deep learning has, since its revolution back in 2012, transformed our lives. From Google Translate to driverless cars to personal assistants to protein engineering, deep learning is transforming nearly every sector of our economy and our lives. However, it did not take long before people realized that deep learning is not a simple beast to tame and it comes with its own kinds of problems, especially if you want to use it in a production setting. In particular, the concept of technical debt was invented to indicate the significant maintenance costs at a system level that it takes to run machine learning in production. MLOps should very much be seen as the response to the concept of technical debt, namely that we should develop methods, processes, and tools (with inspiration from classical DevOps) to counter the problems we run into when working with deep learning models.

              It is important to note that all the concepts and tools that have been developed for MLOps can be used together with more classical machine learning models (think K-nearest neighbor, Random forest, etc.), however, deep learning comes with its own set of problems which mostly have to do with the sheer size of the data and models we are working with. For these reasons, we are focusing on working with deep learning models in this course.

              "},{"location":"s1_development_environment/deep_learning_software/#software-landscape-for-deep-learning","title":"Software Landscape for Deep Learning","text":"

              Regarding software for Deep Learning, the landscape is currently dominated by three software frameworks (listed in order of when they were published):

              • TensorFlow

              • PyTorch

              • JAX

              We won't go into a longer discussion on which framework is best, as it is pointless. PyTorch and TensorFlow have been around for the longest and therefore have bigger communities and feature sets at this point in time. They are both very similar in the sense that they both have features directed against research and production. JAX is kind of the new kid on the block, which in many ways improves on PyTorch and TensorFlow, but is still not as mature as the other frameworks. As the frameworks use different kinds of programming principles (object-oriented vs. functional programming), comparing them is essentially meaningless.

              In this course, we have chosen to work with PyTorch because we find it a bit more intuitive and it is the framework that we use for our day-to-day research life. Additionally, as of right now, it is absolutely the dominating framework for published models, research papers, and competition winners.

              The intention behind this set of exercises is to bring everyone's PyTorch skills up to date. If you already are a PyTorch-Jedi, feel free to pass the first set of exercises, but I recommend that you still complete it. The exercises are, in large part, taken directly from the deep learning course at Udacity. Note that these exercises are given as notebooks, which is the last time we are going to use them actively in the course. Instead, after this set of exercises, we are going to focus on writing code in Python scripts.

              The notebooks contain a lot of explanatory text. The exercises that you are supposed to fill out are inlined in the text in small \"exercise\" blocks:

              If you need a refresher on any deep learning topic in general throughout the course, we recommend finding the relevant chapter in the deep learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (which can also be found in the literature folder). It is not necessary to be good at deep learning to pass this course as the focus is on all the software needed to get deep learning models into production. However, it's important to have a basic understanding of the concepts.

              "},{"location":"s1_development_environment/deep_learning_software/#exercises","title":"\u2754 Exercises","text":"

              Exercise files

              1. Start a Jupyter Notebook session in your terminal (assuming you are standing at the root of the course material). Alternatively, you should be able to open the notebooks directly in your code editor. For VS code users you can read more about how to work with Jupyter Notebooks in VS code here

              2. Complete the Tensors in PyTorch notebook. It focuses on the basic manipulation of PyTorch tensors. You can pass this notebook if you are comfortable doing this.

              3. Complete the Neural Networks in PyTorch notebook. It focuses on building a very simple neural network using the PyTorch nn.Module interface.

              4. Complete the Training Neural Networks notebook. It focuses on how to write a simple training loop for training a neural network.

              5. Complete the Fashion MNIST notebook, which summarizes concepts learned in notebooks 2 and 3 on building a neural network for classifying the Fashion MNIST dataset.

              6. Complete the Inference and Validation notebook. This notebook adds important concepts on how to do inference and validation on our neural network.

              7. Complete the Saving_and_Loading_Models notebook. This notebook addresses how to save and load model weights. This is important if you want to share a model with someone else.

              "},{"location":"s1_development_environment/deep_learning_software/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. If tensor a has shape [N, d] and tensor b has shape [M, d] how can we calculate the pairwise distance between rows in a and b without using a for loop?

                Solution

                We can take advantage of broadcasting to do this

                a = torch.randn(N, d)\nb = torch.randn(M, d)\ndist = torch.sum((a.unsqueeze(1) - b.unsqueeze(0))**2, dim=2)  # shape [N, M]\n
              2. What should be the size of S for an input image of size 1x28x28, and how many parameters does the neural network then have?

                from torch import nn\nneural_net = nn.Sequential(\n    nn.Conv2d(1, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3), nn.ReLU(), nn.Flatten(), nn.Linear(S, 10)\n)\n
                Solution

                Since both convolutions have a kernel size of 3, stride 1 (default value) and no padding that means that we lose 2 pixels in each dimension, because the kernel can not be centered on the edge pixels. Therefore, the output of the first convolution would be 32x26x26. The output of the second convolution would be 64x24x24. The size of S must therefore be 64 * 24 * 24 = 36864. The number of parameters in a convolutional layer is kernel_size * kernel_size * in_channels * out_channels + out_channels (last term is the bias) and the number of parameters in a linear layer is in_features * out_features + out_features (last term is the bias). Therefore, the total number of parameters in the network is 3*3*1*32 + 32 + 3*3*32*64 + 64 + 36864*10 + 10 = 387,466, which could be calculated by running:

                sum([prod(p.shape) for p in neural_net.parameters()])\n
              3. A working training loop in PyTorch should have these three function calls: optimizer.zero_grad(), loss.backward(), optimizer.step(). Explain what would happen in the training loop (or implement it) if you forgot each of the function calls.

                Solution

                optimizer.zero_grad() is in charge of zeroing the gradients. If this is not done, gradients accumulate across steps, leading to incorrect gradient estimates and unstable updates. loss.backward() is in charge of calculating the gradients. If this is not done, the gradients will not be calculated and the optimizer will not be able to update the weights. optimizer.step() is in charge of updating the weights. If this is not done, the weights will not be updated and the model will not learn anything. A minimal loop showing where the three calls fit is sketched below.
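                A minimal sketch of a training loop with the three calls in place, using a dummy linear model and random data purely for illustration:

                import torch\nfrom torch import nn\n\nmodel = nn.Linear(10, 1)\noptimizer = torch.optim.SGD(model.parameters(), lr=0.01)\nloss_fn = nn.MSELoss()\nx, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch\n\nfor epoch in range(10):\n    optimizer.zero_grad()        # reset gradients from the previous step\n    loss = loss_fn(model(x), y)  # forward pass and loss computation\n    loss.backward()              # compute gradients\n    optimizer.step()             # update the weights\n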

              "},{"location":"s1_development_environment/deep_learning_software/#final-exercise","title":"Final exercise","text":"

              As the final exercise, we will develop a simple baseline model that we will continue to develop during the course. For this exercise, we provide the data in the data/corruptmnist folder. Do NOT use the data in the corruptmnist_v2 folder as that is intended for another exercise. As the name suggests, this is a (subsampled) corrupted version of the regular MNIST. Your overall task is the following:

              Implement an MNIST neural network that achieves at least 85% accuracy on the test set.

              Before any training can start, you should identify the corruption that we have applied to the MNIST dataset to create the corrupted version. This can help you identify what kind of neural network to use to get good performance, but any network should be able to achieve this.

              One key point of this course is trying to stay organized. Spending time now organizing your code will save time in the future as you start to add more and more features. As subgoals, please fulfill the following exercises:

              1. Implement your model in a script called model.py.

                Starting point for model.py model.py
                from torch import nn\n\n\nclass MyAwesomeModel(nn.Module):\n    \"\"\"My awesome model.\"\"\"\n\n    def __init__(self) -> None:\n        super().__init__()\n        self.fc1 = nn.Linear(784, 128)\n
                Solution

                The provided solution implements a convolutional neural network with 3 convolutional layers and a single fully connected layer. Because the MNIST dataset consists of images, we want an architecture that can take advantage of the spatial information in the images.

                model.py
                import torch\nfrom torch import nn\n\n\nclass MyAwesomeModel(nn.Module):\n    \"\"\"My awesome model.\"\"\"\n\n    def __init__(self) -> None:\n        super().__init__()\n        self.conv1 = nn.Conv2d(1, 32, 3, 1)\n        self.conv2 = nn.Conv2d(32, 64, 3, 1)\n        self.conv3 = nn.Conv2d(64, 128, 3, 1)\n        self.dropout = nn.Dropout(0.5)\n        self.fc1 = nn.Linear(128, 10)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass.\"\"\"\n        x = torch.relu(self.conv1(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.relu(self.conv2(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.relu(self.conv3(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.flatten(x, 1)\n        x = self.dropout(x)\n        return self.fc1(x)\n\n\nif __name__ == \"__main__\":\n    model = MyAwesomeModel()\n    print(f\"Model architecture: {model}\")\n    print(f\"Number of parameters: {sum(p.numel() for p in model.parameters())}\")\n\n    dummy_input = torch.randn(1, 1, 28, 28)\n    output = model(dummy_input)\n    print(f\"Output shape: {output.shape}\")\n
              2. Implement your data setup in a script called data.py. The data was saved using torch.save, so to load it you should use torch.load.

                Saving the model

                When saving the model, you should use torch.save(model.state_dict(), \"model.pt\"), and when loading it, you should use model.load_state_dict(torch.load(\"model.pt\")). If you instead do torch.save(model, \"model.pt\"), it will try to save not only the model weights but also the model definition, which can lead to problems when loading the model later on if you have changed the model definition in the meantime (which you most likely are going to do).
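
                A minimal sketch of the recommended pattern (assuming the MyAwesomeModel class from the previous exercise):

                import torch\nfrom model import MyAwesomeModel\n\nmodel = MyAwesomeModel()\ntorch.save(model.state_dict(), \"model.pt\")  # save only the weights\n\nmodel = MyAwesomeModel()  # re-create the architecture from code\nmodel.load_state_dict(torch.load(\"model.pt\"))  # then load the weights into it\n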

                Starting point for data.py data.py
                import torch\n\n\ndef corrupt_mnist():\n    \"\"\"Return train and test dataloaders for corrupt MNIST.\"\"\"\n    # exchange with the corrupted mnist dataset\n    train = torch.randn(50000, 784)\n    test = torch.randn(10000, 784)\n    return train, test\n
                Solution

                Data is stored in .pt files which can be loaded using torch.load (1). We iterate over the files, load them and concatenate them into a single tensor. In particular, we have highlighted the use of the .unsqueeze function. Convolutional neural networks (which we propose as a solution) need the data to be in the shape [N, C, H, W] where N is the number of samples, C is the number of channels, H is the height of the image and W is the width of the image. The dataset is stored in the shape [N, H, W] and therefore we need to add a channel dimension.

                1. The .pt files are nothing other than pickle files in disguise. The torch.save/torch.load functions are essentially wrappers around the pickle module in Python, which produces serialized files. However, it is a convention to use .pt to indicate that the file contains PyTorch tensors.

                We have additionally in the solution added functionality for plotting the images together with the labels for inspection. Remember: all good machine learning starts with a good understanding of the data.

                data.py
                from __future__ import annotations\n\nimport matplotlib.pyplot as plt  # only needed for plotting\nimport torch\nfrom mpl_toolkits.axes_grid1 import ImageGrid  # only needed for plotting\n\nDATA_PATH = \"data/corruptmnist\"\n\n\ndef corrupt_mnist() -> tuple[torch.utils.data.Dataset, torch.utils.data.Dataset]:\n    \"\"\"Return train and test dataloaders for corrupt MNIST.\"\"\"\n    train_images, train_target = [], []\n    for i in range(5):\n        train_images.append(torch.load(f\"{DATA_PATH}/train_images_{i}.pt\"))\n        train_target.append(torch.load(f\"{DATA_PATH}/train_target_{i}.pt\"))\n    train_images = torch.cat(train_images)\n    train_target = torch.cat(train_target)\n\n    test_images = torch.load(f\"{DATA_PATH}/test_images.pt\")\n    test_target = torch.load(f\"{DATA_PATH}/test_target.pt\")\n\n    train_images = train_images.unsqueeze(1).float()\n    test_images = test_images.unsqueeze(1).float()\n    train_target = train_target.long()\n    test_target = test_target.long()\n\n    train_set = torch.utils.data.TensorDataset(train_images, train_target)\n    test_set = torch.utils.data.TensorDataset(test_images, test_target)\n\n    return train_set, test_set\n\n\ndef show_image_and_target(images: torch.Tensor, target: torch.Tensor) -> None:\n    \"\"\"Plot images and their labels in a grid.\"\"\"\n    row_col = int(len(images) ** 0.5)\n    fig = plt.figure(figsize=(10.0, 10.0))\n    grid = ImageGrid(fig, 111, nrows_ncols=(row_col, row_col), axes_pad=0.3)\n    for ax, im, label in zip(grid, images, target):\n        ax.imshow(im.squeeze(), cmap=\"gray\")\n        ax.set_title(f\"Label: {label.item()}\")\n        ax.axis(\"off\")\n    plt.show()\n\n\nif __name__ == \"__main__\":\n    train_set, test_set = corrupt_mnist()\n    print(f\"Size of training set: {len(train_set)}\")\n    print(f\"Size of test set: {len(test_set)}\")\n    print(f\"Shape of a training point {(train_set[0][0].shape, train_set[0][1].shape)}\")\n    print(f\"Shape of a test point {(test_set[0][0].shape, test_set[0][1].shape)}\")\n    show_image_and_target(train_set.tensors[0][:25], train_set.tensors[1][:25])\n
              3. Implement training and evaluation of your model in main.py script. The main.py script should be able to take additional subcommands indicating if the model should be trained or evaluated. It will look something like this:

                python main.py train --lr 1e-4\npython main.py evaluate trained_model.pt\n

                which can be implemented in various ways. We provide you with a starting script that uses the click library to define a command line interface (CLI), which you can learn more about in this module.

                VS code and command line arguments

                If you try to execute the above code in VS Code using the debugger (F5) or the built-in run functionality in the upper right corner:

                you will get an error message saying that you need to select a command to run, e.g. main.py needs either the train or the evaluate command. This can be fixed by adding a launch.json file to a specialized .vscode folder in the root of the project. The launch.json file should look something like this:

                {\n    \"version\": \"0.2.0\",\n    \"configurations\": [\n        {\n            \"name\": \"Python: Current File\",\n            \"type\": \"python\",\n            \"request\": \"launch\",\n            \"program\": \"${file}\",\n            \"args\": [\n                \"train\",\n                \"--lr\",\n                \"1e-4\"\n            ],\n            \"console\": \"integratedTerminal\",\n            \"justMyCode\": true\n        }\n    ]\n}\n

                This will inform VS Code that when we execute the current file (in this case main.py) we want to run it with the train command and additionally pass the --lr argument with the value 1e-4. You can read more about creating a launch.json file here. If you want to have multiple configurations, you can add them to the configurations list as additional dictionaries.

                Starting point for main.py main.py
                import click\nimport torch\nfrom data_solution import corrupt_mnist\nfrom model import MyAwesomeModel\n\n\n@click.group()\ndef cli() -> None:\n    \"\"\"Command line interface.\"\"\"\n\n\n@click.command()\n@click.option(\"--lr\", default=1e-3, help=\"learning rate to use for training\")\ndef train(lr) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(lr)\n\n    # TODO: Implement training loop here\n    model = MyAwesomeModel()\n    train_set, _ = corrupt_mnist()\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\ndef evaluate(model_checkpoint) -> None:\n    \"\"\"Evaluate a trained model.\"\"\"\n    print(\"Evaluating like my life depends on it\")\n    print(model_checkpoint)\n\n    # TODO: Implement evaluation logic here\n    model = torch.load(model_checkpoint)\n    _, test_set = corrupt_mnist()\n\n\ncli.add_command(train)\ncli.add_command(evaluate)\n\n\nif __name__ == \"__main__\":\n    cli()\n
                Solution

                The solution implements a simple training loop and evaluation loop. Furthermore, we have added additional hyperparameters that can be passed to the training loop. Highlighted in the solution are the different lines where we take care that our model and data are moved to GPU (or Apple MPS accelerator if you have a newer Mac) if available.

                main.py
                import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom model import MyAwesomeModel\n\nfrom data import corrupt_mnist\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.group()\ndef cli() -> None:\n    \"\"\"Command line interface.\"\"\"\n\n\n@click.command()\n@click.option(\"--lr\", default=1e-3, help=\"learning rate to use for training\")\n@click.option(\"--batch_size\", default=32, help=\"batch size to use for training\")\n@click.option(\"--epochs\", default=10, help=\"number of epochs to train for\")\ndef train(lr, batch_size, epochs) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(f\"{lr=}, {batch_size=}, {epochs=}\")\n\n    model = MyAwesomeModel().to(DEVICE)\n    train_set, _ = corrupt_mnist()\n\n    train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n    loss_fn = torch.nn.CrossEntropyLoss()\n    optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n    statistics = {\"train_loss\": [], \"train_accuracy\": []}\n    for epoch in range(epochs):\n        model.train()\n        for i, (img, target) in enumerate(train_dataloader):\n            img, target = img.to(DEVICE), target.to(DEVICE)\n            optimizer.zero_grad()\n            y_pred = model(img)\n            loss = loss_fn(y_pred, target)\n            loss.backward()\n            optimizer.step()\n            statistics[\"train_loss\"].append(loss.item())\n\n            accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n            statistics[\"train_accuracy\"].append(accuracy)\n\n            if i % 100 == 0:\n                print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n    print(\"Training complete\")\n    torch.save(model.state_dict(), \"model.pth\")\n    fig, axs = plt.subplots(1, 2, figsize=(15, 5))\n    axs[0].plot(statistics[\"train_loss\"])\n    axs[0].set_title(\"Train loss\")\n    axs[1].plot(statistics[\"train_accuracy\"])\n    axs[1].set_title(\"Train accuracy\")\n    fig.savefig(\"training_statistics.png\")\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\ndef evaluate(model_checkpoint) -> None:\n    \"\"\"Evaluate a trained model.\"\"\"\n    print(\"Evaluating like my life depended on it\")\n    print(model_checkpoint)\n\n    model = MyAwesomeModel().to(DEVICE)\n    model.load_state_dict(torch.load(model_checkpoint))\n\n    _, test_set = corrupt_mnist()\n    test_dataloader = torch.utils.data.DataLoader(test_set, batch_size=32)\n\n    model.eval()\n    correct, total = 0, 0\n    for img, target in test_dataloader:\n        img, target = img.to(DEVICE), target.to(DEVICE)\n        y_pred = model(img)\n        correct += (y_pred.argmax(dim=1) == target).float().sum().item()\n        total += target.size(0)\n    print(f\"Test accuracy: {correct / total}\")\n\n\ncli.add_command(train)\ncli.add_command(evaluate)\n\n\nif __name__ == \"__main__\":\n    cli()\n
              4. As documentation that your model is working when running the train command, the script needs to produce a single plot with the training curve (training step vs training loss). When the evaluate command is run, it should write the test set accuracy to the terminal.

              It is part of the exercise not to implement this in notebooks, as code development in real life happens in scripts. As the model is simple to run (for now), you should be able to complete the exercise on your laptop, even if you are only training on the CPU. That said, you are allowed to upload your scripts to your own \"Google Drive\" and then call them from a Google Colab notebook, which is shown in the image below, where all code is placed in the fashion_trainer.py script and the Colab notebook is just used to execute it.

              Be sure to have completed the final exercise before the next session, as we will be building on top of the model you have created.

              "},{"location":"s1_development_environment/editor/","title":"M3 - Editor","text":""},{"location":"s1_development_environment/editor/#editoride","title":"Editor/IDE","text":"

              Core Module

              Notebooks can be great for testing out ideas, developing simple code, and explaining and visualizing certain aspects of a codebase. Remember that Jupyter Notebook was created to \"...allows you to create and share documents that contain live code, equations, visualizations, and narrative text.\" However, any larger machine learning project will require you to work in multiple .py files, and here notebooks will provide a suboptimal workflow. Therefore, to truly get \"work done,\" you will need a good editor/IDE.

              Many opinions exist on this matter, but for simplicity, we recommend getting started with one of the following 3:

              Editor, webpage, and a (biased) comment on each:
              • Spyder (https://www.spyder-ide.org/): a MATLAB-like environment that is easy to get started with
              • Visual Studio Code (https://code.visualstudio.com/): support for multiple languages with a fairly easy setup
              • PyCharm (https://www.jetbrains.com/pycharm/): an IDE for Python professionals; will take a bit of time getting used to

              We highly recommend Visual Studio (VS) Code if you do not already have an editor installed (or just want to try something new). We, therefore, put additional effort into explaining VS Code.

              Below, you see an overview of the VS Code interface

              Image credit

              The main components of VS Code are:

              • The action bar: VS Code is not an editor meant for a single language and can do many things. One of the core reasons that VS Code has become so popular is that custom plug-ins called extensions can be installed to add functionality to VS Code. It is in the action bar that you can navigate between these different applications when you have installed them.

              • The sidebar: The sidebar has different functionality depending on what extension you have open. In most cases, the sidebar will just contain the file explorer.

              • The editor: This is where your code is. VS Code supports several layouts in the editor (one column, two columns, etc.). You can make a custom layout by dragging a file to where you want the layout to split.

              • The panel: The panel contains a terminal for you to interact with. This can be used to quickly try out code by opening a Python interpreter, manage environments, etc.

              • The status bar: The status bar contains information based on the extensions you have installed. In particular, for Python development, the status bar can be used to change the conda environment.

              "},{"location":"s1_development_environment/editor/#exercises","title":"\u2754 Exercises","text":"

              The overall goal of the exercises is that you should start familiarizing yourself with the editor you have chosen. If you are already an expert in one of them, feel free to skip the rest. You should at least be able to:

              • Create a new file
              • Run a Python script
              • Change the Python environment

              The instructions below are specific to Visual Studio Code, but we recommend that you try to answer the questions if using another editor. In the exercise_files folder belonging to this session, we have put cheat sheets for VS Code (one for Windows and one for Mac/Linux) that can give you an easy overview of the different macros in VS Code. The following exercises are just to get you started, but you can find many more tutorials here.

              1. VS Code is a general editor for many languages, and to get proper Python support, we need to install some extensions. In the action bar, go to the extension tab and search for python in the marketplace. From here, we highly recommend installing the following packages:

                • Python: general Python support for VS Code
                • Pylance: language server for Python that provides better code completion and type-checking
                • Jupyter: support for Jupyter notebooks directly in VS Code
                • Python Environment Manager: allows for easy management of virtual environments
              2. If you install the Python package, you should see something like this in your status bar:

                which indicates that you are using the stock Python installation instead of the one you have created using conda. Click it and change the Python environment to the one you want to use.

              3. One of the most useful tools in VS Code is the ability to navigate the whole project using the built-in Explorer. To take advantage of VS Code, you need to make sure what you are working on is a project. Create a folder called hello (somewhere on your laptop) and open it in VS Code (Click File in the menu and then select Open Folder). You should end up with a completely clean workspace (as shown below). Click the New file button and create a file called hello.py.

                Image credit

              4. Finally, let's run some code. Add something simple to the hello.py file like:

                Image credit

                and click the run button as shown in the image. It should create a new terminal, activate the environment that you have chosen, and finally run your script. In addition to clicking the run button, you can also:

                • Select some code and press Shift+Enter to run it in the terminal
                • Select some code and right-click, choosing to run in an interactive window (where you can interact with the results like in a Jupyter Notebook)

              That's the basics of using VS Code. We highly recommend that you revisit this tutorial during the course when we get to topics such as debugging and version control, which VS Code can help with. We can also recommend this blog post that goes over some good extensions for AI/ML development in VS Code.

              "},{"location":"s1_development_environment/editor/#a-note-on-jupyter-notebooks-in-production-environments","title":"A note on Jupyter notebooks in production environments","text":"

              As already stated, Jupyter Notebooks are great for development as they allow developers to easily test out new ideas. However, they often lead to pain points when models need to be deployed. We highly recommend reading section 5.1.1 of this paper by Shankar et al. which in more detail discusses the strong opinions on Jupyter notebooks that exist within the developer community.

              All this said, there exists one simple tool to make notebooks work better in a production setting. It's called nbconvert and can be installed with

              pip install nbconvert\n

              You may need some further dependencies such as Pandoc, TeX and Pyppeteer for it to work (see install instructions here). After this, converting a notebook to a .py script is as simple as:

              jupyter nbconvert --to=script my_notebook.ipynb\n

              which will produce a similarly named script called my_notebook.py. We highly recommend that you stick to developing scripts directly during the course to get experience with doing so, but nbconvert can be a fantastic tool to have in your toolbox.

              "},{"location":"s1_development_environment/editor/#ai-assistance","title":"AI assistance","text":"

              You are probably all familiar with using AI tools for solving different tasks in your daily life and you have most likely also used AI tools like ChatGPT or similar for programming. However, most of these tools are not directly integrated into your editor, which can lead to a lot of context-switching that in general leads to lower productivity.

              In this section we are therefore going to look at GitHub Copilot, an AI tool that integrates directly into your editor, eliminating the need to switch between browser tabs or external tools. The strength of having AI directly in your editor is that it can provide suggestions based on the code you are currently writing and, in general, it has access to a larger context than a standalone tool.

              "},{"location":"s1_development_environment/editor/#exercises_1","title":"\u2754 Exercises","text":"
              1. As of writing this GitHub Copilot is free for all students, teachers and maintainers of popular open-source projects. As a student, sign up for the Student Developer Pack

              2. Install the GitHub Copilot extension in your editor

              3. GitHub Copilot has many different features, but the most important one is the ability to provide suggestions based on the code you are currently writing. Try to write some code in a new Python file and see if you can get some suggestions from GitHub Copilot on how to complete the code. If you have no idea what to try out here is a simple example of starting out coding a neural network in PyTorch:

                import torch\nfrom torch import nn\nclass Net(nn.Module):\n

                GitHub Copilot will most likely suggest completing the code using linear layers with an input dimension of 28*28. Can you explain why it suggests this and where this bias comes from?

              4. The second feature that can be very useful is the ability to directly chat or ask questions regarding your code. Try highlighting (in your code editor) the code from the previous exercise and press Ctrl+i which should open a chat window. Ask it to complete it with a convolutional neural network instead of a linear one.

              5. Finally, let's try the built-in chat feature. You can get to this by clicking the Chat icon in the Activity bar and begin asking questions similar to how you would ask ChatGPT. However, we also have the option to provide context either from the code editor or the terminal. Try saving the following code in a Python script copilot.py:

                import torch\nfrom torch import nn\nclass Net(nn.Module):\n    def __init__(self):\n        super(Net, self).__init__()\n        self.fc1 = nn.Linear(28*28, 128)\n        self.fc2 = nn.Linear(128, 64)\n        self.fc3 = nn.Linear(64, 10)\n    def forward(self, x):\n        x = x.view(-1, 28*28)\n        x = torch.relu(self.fc1(x))\n        x = torch.relu(self.fc2(x))\n        x = self.fc3(x)\n        return x\n\nmodel = Net()\nprint(model(torch.randn(1, 1, 14, 14)))\n

                and run it in the terminal: python copilot.py. It will naturally give you an error, but you can now ask GitHub Copilot for help. The easiest way to do this is by highlighting the output in the terminal and then running the GitHub Copilot: Explain This (Terminal) command (see the image below, use Ctrl+Shift+P to open the command palette and search for the command). Does the explanation make sense, i.e. can you figure out what to change to get the code running?

              6. (Optional) Just to investigate the difference between using Github Copilot and ChatGPT, try to redo the previous exercises using ChatGPT. What are the main differences between the two tools? (1)

                1. Remember that ChatGPT is a general AI model, meaning that it was trained to be good at many different tasks, whereas GitHub Copilot (which uses OpenAI's Codex model under the hood) was specifically trained to be good at coding.

              That was a small introduction to GitHub Copilot. We highly recommend that you try to use it during the course to see how it can help you solve both the exercises and the final project. However, when using AI tools it is always important to remember that they are not perfect and that you need to critically evaluate the suggestions they provide. In the end, you are the one responsible for the code you write, not the AI tool.

              "},{"location":"s1_development_environment/package_manager/","title":"M2 - Package Manager","text":""},{"location":"s1_development_environment/package_manager/#package-managers-and-virtual-environments","title":"Package managers and virtual environments","text":"

              Core Module

              Python is a great programming language and this is mostly due to its vast ecosystem of packages. No matter what you want to do, there is probably a package that can get you started. Just try to remember the last time you wrote a program using only the Python standard library. Probably never. For this reason, we need a way to install third-party packages and this is where package managers come into play.

              You have probably already used pip for the longest time, which is the default package manager for Python. pip is great for beginners but it is missing one essential feature that you will need as a developer or data scientist: virtual environments. Virtual environments are an essential way to make sure that the dependencies of different projects do not cross-contaminate each other. As a naive example, consider project A that requires torch==1.3.0 and project B that requires torch==2.0, then doing

              cd project_A  # move to project A\npip install torch==1.3.0  # install old torch version\ncd ../project_B  # move to project B\npip install torch==2.0  # install new torch version\ncd ../project_A  # move back to project A\npython main.py  # try executing main script from project A\n

              will mean that even though we are executing the main script from project A's folder, it will use torch==2.0 instead of torch==1.3.0, because that is the last version we installed: in both cases pip installs the package into the same environment, in this case the global environment. Instead, if we did something like:

              Unix/macOSWindows
              cd project_A  # move to project A\npython -m venv env  # create a virtual environment in project A\nsource env/bin/activate  # activate that virtual environment\npip install torch==1.3.0  # Install the old torch version into the virtual environment belonging to project A\ncd ../project_B  # move to project B\npython -m venv env  # create a virtual environment in project B\nsource env/bin/activate  # activate that virtual environment\npip install torch==2.0  # Install new torch version into the virtual environment belonging to project B\ncd ../project_A  # Move back to project A\nsource env/bin/activate  # Activate the virtual environment belonging to project A\npython main.py  # Succeed in executing the main script from project A\n
              cd project_A  # Move to project A\npython -m venv env  # Create a virtual environment in project A\n.\\env\\Scripts\\activate  # Activate that virtual environment\npip install torch==1.3.0  # Install the old torch version into the virtual environment belonging to project A\ncd ../project_B  # Move to project B\npython -m venv env  # Create a virtual environment in project B\n.\\env\\Scripts\\activate  # Activate that virtual environment\npip install torch==2.0  # Install new torch version into the virtual environment belonging to project B\ncd ../project_A  # Move back to project A\n.\\env\\Scripts\\activate  # Activate the virtual environment belonging to project A\npython main.py  # Succeed in executing the main script from project A\n

              then we would be sure that torch==1.3.0 is used when executing main.py in project A because we are using two different virtual environments. In the above case, we used the venv module which is the built-in Python module for creating virtual environments. venv+pip is arguably a good combination but when working on multiple projects it can quickly become a hassle to manage all the different virtual environments yourself, remembering which Python version to use, which packages to install and so on.
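
              Regardless of which tool created the environment, a quick way to check which interpreter and environment are actually active (just a small sanity check, not part of the exercises) is to ask Python itself:

              import sys\n\nprint(sys.executable)  # path of the interpreter currently running, e.g. something like .../project_A/env/bin/python\nprint(sys.prefix)      # root folder of the active (virtual) environment\n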

              For this reason, a number of package managers have been created that can help you manage your virtual environments and dependencies, with some of the most popular being:

              • conda
              • pipenv
              • poetry
              • pipx
              • hatch
              • pdm

              with more being created every year (rye is looking like an interesting project). This is considered a problem in the Python community because it means that there is no standard way of managing dependencies, unlike other languages such as npm for Node.js or cargo for Rust.

              Image credit

              In the course, we do not care about which package manager you use, but we do care that you use one. If you are already familiar with one package manager, then skip this exercise and continue to use that. The best recommendation that I can give regarding package managers, in general, is to find one you like and then stick with it. A lot of time can be wasted on trying to find the perfect package manager, but in the end, they all do the same with some minor differences. Check out this blog post if you want a fairly up-to-date evaluation of the different environment management and packaging tools that exist in the Python ecosystem.

              If you are not familiar with any package managers, then we recommend that you use conda and pip for this course. You probably already have conda installed on your laptop, which is great. What conda does especially well, is that it allows you to create virtual environments with different Python versions, which can be really useful if you encounter dependencies that have not been updated in a long time. In this course specifically, we are going to recommend the following workflow

              • Use conda to create virtual environments with specific Python versions
              • Use pip to install packages in that environment

              Installing packages with pip inside conda environments has been considered a bad practice for a long time, but since conda>=4.6 it is considered safe to do so. The reason for this is that conda now has a built-in compatibility layer that makes sure that pip installed packages are compatible with the other packages installed in the environment.

              "},{"location":"s1_development_environment/package_manager/#python-dependencies","title":"Python dependencies","text":"

              Before we get started with the exercises, let's first talk a bit about Python dependencies. One of the most common ways to specify dependencies in the Python community is through a requirements.txt file, which is a simple text file that contains a list of all the packages that you want to install. The format allows you to specify the package name and version number you want, with 7 different operators:

              package1           # any version\npackage2 == x.y.z  # exact version\npackage3 >= x.y.z  # at least version x.y.z\npackage4 >  x.y.z  # newer than version x.y.z\npackage5 <= x.y.z  # at most version x.y.z\npackage6 <  x.y.z  # older than version x.y.z\npackage7 ~= x.y.z  # at least version x.y.z, but lower than version x.(y+1).0\n

              In general, all packages (should) follow the semantic versioning standard, which means that the version number is split into three parts: x.y.z where x is the major version, y is the minor version and z is the patch version.
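
              As a small illustration (a sketch assuming the packaging library is installed, pip install packaging), version strings can be parsed and compared programmatically:

              from packaging.version import Version\n\nv = Version(\"2.1.3\")\nprint(v.major, v.minor, v.micro)  # 2 1 3, i.e. x, y and z\nprint(Version(\"2.1.3\") < Version(\"2.10.0\"))  # True, versions are compared numerically, not lexicographically\n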

              The reason that we need to specify the version number is that we want to make sure that we can reproduce our code at a later point. If we do not specify the version number, then we are at the mercy of the package maintainer to not change the API of the package. This is especially important when working with machine learning models, as we want to make sure that we can reproduce the exact same model at a later point.

              Finally, we also need to discuss dependency resolution, which is the process of figuring out which packages are compatible. This is a non-trivial problem, and there exist a lot of different algorithms for doing this. If you have ever thought that pip and conda were taking a long time to install something, then it is probably because they were trying to figure out which packages are compatible with each other. For example, if you try to install

              pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.19\" --dry-run\n

              then it would simply fail because there are no versions of matplotlib and numpy under the given constraints that are compatible with each other. In this case, we would need to relax the constraints to something like

              pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.21\" --dry-run\n

              to make it work.

              "},{"location":"s1_development_environment/package_manager/#exercises","title":"\u2754 Exercises","text":"

              For hints regarding how to use conda you can check out the cheat sheet in the exercise folder.

              1. Download and install conda. You are free to either install full conda or the much simpler version miniconda. The core difference between the two is that full conda already comes with a lot of packages preinstalled, which you would have to install yourself when using miniconda. The downside is that full conda is a much larger installation, which can be a huge disadvantage on smaller devices. Make sure that your installation is working by writing conda help in a terminal; it should show you the help message for conda. If this does not work you probably need to set a system variable to point to the conda installation

              2. If you have successfully installed conda, then you should be able to execute the conda command in a terminal.

                Conda will always tell you what environment you are currently in, indicated by the (env_name) in the prompt. By default, it will always start in the (base) environment.

              3. Try creating a new virtual environment. Make sure that it is called my_environment and that it installs version 3.11 of Python. What command should you execute to do this?

                Use Python 3.8 or higher

                We highly recommend that you use Python 3.8 or higher for this course. In general, we recommend that you use the second latest version of Python that is available (currently Python 3.11 as of writing this). This is because the latest version of Python is often not supported by all dependencies. You can always check the status of different Python version support here.

                Solution
                conda create --name my_environment python=3.11\n
              4. Which conda command gives you a list of all the environments that you have created?

                Solution
                conda env list\n
              5. Which conda command gives you a list of the packages installed in the current environment?

                Solution
                conda list\n
                1. How do you easily export this list to a text file? Do this, and make sure you export it to a file called environment.yaml, as conda uses another format by default than pip.

                  Solution
                  conda env export > environment.yaml\n
                2. Inspect the file to see what is in it.

                3. The environment.yaml file you have created is one way to secure reproducibility between users because anyone should be able to get an exact copy of your environment if they have your environment.yaml file. Try creating a new environment directly from your environment.yaml file and check that the packages being installed exactly match what you originally had.

                  Solution
                  conda env create --file environment.yaml\n
              6. As the introduction states, it is fairly safe to use pip inside conda today. What is the corresponding pip command that gives you a list of all pip installed packages? And how do you export this to a requirements.txt file?

                Solution
                pip list # List all installed packages\npip freeze > requirements.txt # Export all installed packages to a requirements.txt file\n
              7. If you look through the requirements that both pip and conda produce then you will see that it is often filled with a lot more packages than what you are using in your project. What you are interested in are the packages that you import in your code: from package import module. One way to get around this is to use the package pipreqs, which will automatically scan your project and create a requirements file specific to that. Let's try it out:

                1. Install pipreqs:

                  pip install pipreqs\n
                2. Either try out pipreqs on one of your own projects or try it out on some other online project. What does the requirements.txt file that pipreqs produces look like compared to the files produced by either pip or conda?

              "},{"location":"s1_development_environment/package_manager/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. Try executing the command

                pip install \"pytest < 4.6\" pytest-cov==2.12.1\n

                based on the error message you get, what would be a compatible way to install these?

                Solution

                As pytest-cov==2.12.1 requires a pytest version of at least 4.6, we can simply change the command to:

                pip install \"pytest >= 4.6\" pytest-cov==2.12.1\n

                but there of course exist other solutions as well.

              This ends the module on setting up virtual environments. While the methods mentioned in the exercises are great ways to construct requirements files automatically, sometimes it is just easier to sit down and create the files manually, as that way you ensure that only the most necessary requirements are installed when creating a new environment.

              "},{"location":"s2_organisation_and_version_control/","title":"Organization and version control","text":"

              Slides

              • Learn the basics of version control and how to use git to track changes to your code and collaborate with others.

                M5: Git

              • Learn how to organize Python code into a library, package it and use templates to create new projects.

                M6: Code Structure

              • Learn different coding practices and how to use them to improve the quality of your code.

                M7: Good Coding Practice

              • Learn how to version control data using dvc.

                M8: Data Version Control

              • Learn the different ways to setup command line interfaces for your applications.

                M9: Command Line Interfaces

              Today we take our first steps into the world of MLOps. The set of modules in this session focuses on getting organized and making sure that you are familiar with good development practices. While many of the practices you will learn about in these modules may not seem that important when you are a single person working on a project, it is crucial when working in large groups that the differences in how different people organize and write their code are minimized. The topics in this session will focus on:

              • Version control to help track and manage changes to your code and data
              • Coding practices for staying organized in large projects

              Image credit

              Some exercises in this course are very loosely stated (including the exercises today). You are expected to seek out information before you ask for help (Google is your friend!), as you will learn more from trying to solve the problems yourself, and it is more representative of how the \"real world\" works.

              Learning objectives

              The learning objectives of this session are:

              • Understand the basics of version control and can use git to track changes to your code
              • Knowledge of how to package Python code into a library and how to organize your code for reuse
              • Understand different coding practices and how to use them to improve the quality of your code
              • Can use dvc to version control data
              "},{"location":"s2_organisation_and_version_control/cli/","title":"M9 - Command Line Interfaces","text":""},{"location":"s2_organisation_and_version_control/cli/#command-line-interfaces","title":"Command line interfaces","text":"

              As we already laid out in the very first module, the command line is a powerful tool for interacting with your computer. You should already now be familiar with running basic Python commands in the terminal:

              python my_script.py\n

              However, as your projects grow in size and complexity, you will often find yourself in need of more advanced ways of interacting with your code. This is where command line interfaces (CLIs) come into play. A CLI can be seen as a way for you to define the user interface of your application directly in the terminal. Thus, there is no right or wrong way of creating a CLI; it is all about what makes sense for your application.

              In this module we are going to look at three different ways of creating a CLI for your machine learning projects. They each serve slightly different purposes and can therefore be combined in the same project. However, you will most likely also feel that they overlap in some areas. That is completely fine, and it is up to you to decide which one to use in which situation.

              "},{"location":"s2_organisation_and_version_control/cli/#project-scripts","title":"Project scripts","text":"

              You might already be familiar with the concept of executable scripts. An executable script is a Python script that can be run directly from the terminal without having to call the Python interpreter. This has been possible for a long time in Python, by the inclusion of a so-called shebang line at the top of the script. However, we are going to look at a specific way of defining executable scripts using the standard pyproject.toml file, which you should have learned about in this module.

              "},{"location":"s2_organisation_and_version_control/cli/#exercises","title":"\u2754 Exercises","text":"
              1. We are going to assume that you have a training script in your project that you would like to be able to run from the terminal directly without having to call the Python interpreter. Let's assume it is located like this

                src/\n\u251c\u2500\u2500 my_project/\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 train.py\npyproject.toml\n

                In your pyproject.toml file add the following lines. You will need to alter the paths to match your project.

                [project.scripts]\ntrain = \"my_project.train:main\"\n

                what do you think the train = \"my_project.train:main\" line does?

                Solution

                The line tells Python that we want to create an executable script called train that should run the main function in the train.py file located in the my_project package.
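
                For this to work, my_project/train.py must of course define a main function. A minimal sketch (the body here is just a placeholder, not the actual training code):

                def main() -> None:\n    \"\"\"Entry point invoked by the train executable defined in pyproject.toml.\"\"\"\n    print(\"Training day and night\")\n\n\nif __name__ == \"__main__\":\n    main()\n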

              2. Now, all that is left to do is install the project again in editable mode

                pip install -e .\n

                and you should now be able to run the following command in the terminal

                train\n

                Try it out and see if it works.

              3. Add additional scripts to your pyproject.toml file that allows you to run other scripts in your project from the terminal.

                Solution

                We assume that you also have a script called evaluate.py in the my_project package.

                [project.scripts]\ntrain = \"my_project.train:main\"\nevaluate = \"my_project.evaluate:main\"\n

              That is all there really is to it. You can now run your scripts directly from the terminal without having to call the Python interpreter. Some good examples of Python packages that use this approach are numpy, pylint and kedro.

              "},{"location":"s2_organisation_and_version_control/cli/#command-line-arguments","title":"Command line arguments","text":"

              If you have worked with Python for some time you are probably familiar with the argparse package, which allows you to directly pass in additional arguments to your script in the terminal

              python my_script.py --arg1 val1 --arg2 val2\n

              argparse is a very simple way of constructing what is called a command line interface. However, one limitation of argparse is that it is not straightforward to define a CLI with subcommands. If we take git as an example, git is the main command but it has multiple subcommands: push, pull, commit etc. that can all take their own arguments. This kind of nested CLI with subcommands is possible to build using only argparse, but it requires a bit of extra work.

              You could of course ask why we would want to define such a CLI at all. The main argument is to give users of our code a single entry point to interact with our application instead of having multiple scripts. As long as all subcommands are properly documented, our interface should be simple to interact with (again, think of git, where each subcommand can be given the -h flag to get specific help).
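
              For reference, here is a minimal sketch (only an illustration, not part of the exercises) of how subcommands can be set up with plain argparse using subparsers; it works, but it quickly becomes verbose as the interface grows:

              import argparse\n\nparser = argparse.ArgumentParser()\nsubparsers = parser.add_subparsers(dest=\"command\", required=True)\n\ntrain_parser = subparsers.add_parser(\"train\")  # python main.py train --lr 1e-4\ntrain_parser.add_argument(\"--lr\", type=float, default=1e-3)\n\nevaluate_parser = subparsers.add_parser(\"evaluate\")  # python main.py evaluate model.pt\nevaluate_parser.add_argument(\"model_checkpoint\")\n\nargs = parser.parse_args()\nprint(args.command, args)\n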

              Instead of using argparse we are here going to look at the typer package. typer extends the functionality of argparse to allow for easy definition of subcommands and many other things, which we are not going to touch upon in this module. For completeness we should also mention that typer is not the only package for doing this; among other excellent frameworks for easily creating command line interfaces we can mention click.

              "},{"location":"s2_organisation_and_version_control/cli/#exercises_1","title":"\u2754 Exercises","text":"
              1. Start by installing the typer package

                pip install typer\n

                remember to add the package to your requirements.txt file.

              2. To get you started with typer, let's just create a simple hello world type of script. Create a new Python file called greetings.py and use the typer package to create a command line interface such that running the following lines

                python greetings.py            # should print \"Hello World!\"\npython greetings.py --count=3  # should print \"Hello World!\" three times\npython greetings.py --help     # should print the help message, informing the user of the possible arguments\n

                executes and gives the expected output. Relevant documentation.

                Solution

                Importantly, typer requires you to provide type hints for the arguments, because it needs these to work properly.

                import typer\napp = typer.Typer()\n\n@app.command()\ndef hello(count: int = 1, name: str = \"World\"):\n    for x in range(count):\n        typer.echo(f\"Hello {name}!\")\n\nif __name__ == \"__main__\":\n    app()\n
              3. Next, let's try a slightly harder example. Below is a simple script that trains a support vector machine on the breast cancer dataset from scikit-learn.

                iris_classifier.py

                iris_classifier.py
                from sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\n\ndef train():\n    \"\"\"Train and evaluate the model.\"\"\"\n    # Load the dataset\n    data = load_breast_cancer()\n    x = data.data\n    y = data.target\n\n    # Split the dataset into training and testing sets\n    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n    # Standardize the features\n    scaler = StandardScaler()\n    x_train = scaler.fit_transform(x_train)\n    x_test = scaler.transform(x_test)\n\n    # Train a Support Vector Machine (SVM) model\n    model = SVC(kernel=\"linear\", random_state=42)\n    model.fit(x_train, y_train)\n\n    # Make predictions on the test set\n    y_pred = model.predict(x_test)\n\n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    report = classification_report(y_test, y_pred)\n\n    print(f\"Accuracy: {accuracy:.2f}\")\n    print(\"Classification Report:\")\n    print(report)\n    return accuracy, report\n\n\nif __name__ == \"__main__\":\n    train()\n

                Implement a CLI for the script such that the following commands can be run

                python iris_classifier.py train --output 'model.ckpt'  # should train the model and save it to 'model.ckpt'\npython iris_classifier.py train -o 'model.ckpt'  # should be the same as above\n
                Solution

                We are here making use of the short-name option in typer to give a shorter alias to the --output option.

                iris_classifier.py
                import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\n\n\n@app.command()\ndef train(output: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\"):\n    \"\"\"Train and evaluate the model.\"\"\"\n    # Load the dataset\n    data = load_breast_cancer()\n    x = data.data\n    y = data.target\n\n    # Split the dataset into training and testing sets\n    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n    # Standardize the features\n    scaler = StandardScaler()\n    x_train = scaler.fit_transform(x_train)\n    x_test = scaler.transform(x_test)\n\n    # Train a Support Vector Machine (SVM) model\n    model = SVC(kernel=\"linear\", random_state=42)\n    model.fit(x_train, y_train)\n\n    # Make predictions on the test set\n    y_pred = model.predict(x_test)\n\n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    report = classification_report(y_test, y_pred)\n\n    with open(output, \"wb\") as f:\n        pickle.dump(model, f)\n\n    print(f\"Accuracy: {accuracy:.2f}\")\n    print(\"Classification Report:\")\n    print(report)\n    return accuracy, report\n\n\nif __name__ == \"__main__\":\n    app()\n
              4. Next, let's create a CLI that has more than a single command. Continue working on the basic machine learning application from the previous exercise, but this time we want to define two separate commands

                python iris_classifier.py train --output 'model.ckpt'\npython iris_classifier.py evaluate 'model.ckpt'\n
                Solution

                The only key difference between the two is that in the train command we define the output argument to be an optional parameter, i.e. we provide a default, whereas for the evaluate command the model file is a required parameter.

                iris_classifier.py
                import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\n\n# Load the dataset\ndata = load_breast_cancer()\nx = data.data\ny = data.target\n\n# Split the dataset into training and testing sets\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n# Standardize the features\nscaler = StandardScaler()\nx_train = scaler.fit_transform(x_train)\nx_test = scaler.transform(x_test)\n\n\n@app.command()\ndef train(output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n    \"\"\"Train the model.\"\"\"\n    # Train a Support Vector Machine (SVM) model\n    model = SVC(kernel=\"linear\", random_state=42)\n    model.fit(x_train, y_train)\n\n    with open(output_file, \"wb\") as f:\n        pickle.dump(model, f)\n\n\n@app.command()\ndef evaluate(model_file):\n    \"\"\"Evaluate the model.\"\"\"\n    with open(model_file, \"rb\") as f:\n        model = pickle.load(f)\n\n    # Make predictions on the test set\n    y_pred = model.predict(x_test)\n\n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    report = classification_report(y_test, y_pred)\n\n    print(f\"Accuracy: {accuracy:.2f}\")\n    print(\"Classification Report:\")\n    print(report)\n    return accuracy, report\n\n\nif __name__ == \"__main__\":\n    app()\n
              5. Finally, let's try to define subcommands for our commands, e.g. something similar to how git has the subcommand remote, which itself has multiple subcommands like add, rename etc. Continue with the simple machine learning application from the previous exercises, but this time define a CLI such that

                python iris_classifier.py train svm --kernel 'linear'\npython iris_classifier.py train knn -k 5\n

                e.g. the train command now has two subcommands for training different machine learning models (in this case SVM and KNN), each of which takes arguments that are unique to that model. Relevant documentation.

                Solution iris_classifier.py
                import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\ntrain_app = typer.Typer()\napp.add_typer(train_app, name=\"train\")\n\n# Load the dataset\ndata = load_breast_cancer()\nx = data.data\ny = data.target\n\n# Split the dataset into training and testing sets\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n# Standardize the features\nscaler = StandardScaler()\nx_train = scaler.fit_transform(x_train)\nx_test = scaler.transform(x_test)\n\n\n@train_app.command()\ndef svm(kernel: str = \"linear\", output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n    \"\"\"Train a SVM model.\"\"\"\n    model = SVC(kernel=kernel, random_state=42)\n    model.fit(x_train, y_train)\n\n    with open(output_file, \"wb\") as f:\n        pickle.dump(model, f)\n\n\n@train_app.command()\ndef knn(k: int = 5, output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n    \"\"\"Train a KNN model.\"\"\"\n    model = KNeighborsClassifier(n_neighbors=k)\n    model.fit(x_train, y_train)\n\n    with open(output_file, \"wb\") as f:\n        pickle.dump(model, f)\n\n\n@app.command()\ndef evaluate(model_file):\n    \"\"\"Evaluate the model.\"\"\"\n    with open(model_file, \"rb\") as f:\n        model = pickle.load(f)\n\n    # Make predictions on the test set\n    y_pred = model.predict(x_test)\n\n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    report = classification_report(y_test, y_pred)\n\n    print(f\"Accuracy: {accuracy:.2f}\")\n    print(\"Classification Report:\")\n    print(report)\n    return accuracy, report\n\n\nif __name__ == \"__main__\":\n    app()\n
6. (Optional) Let's try to combine what we have learned until now. Try to make your typer CLI into an executable script using the pyproject.toml file and try it out!

                Solution

Assuming that our iris_classifier.py script from before is placed in the src/my_project folder, we should just add

[project.scripts]\niris_classifier = \"src.my_project.iris_classifier:app\"\n

                and remember to install the project in editable mode

                pip install -e .\n

                and you should now be able to run the following command in the terminal

                iris_classifier train knn\n

This covers the basics of typer, but feel free to dive deeper into how the package can help you customize your CLIs. Check out this page on adding colors to your CLI or this page on validating the inputs to your CLI.

              "},{"location":"s2_organisation_and_version_control/cli/#non-python-code","title":"Non-Python code","text":"

The two sections above have shown you how to create a simple CLI for your Python scripts. However, when doing machine learning projects, you often have a lot of non-Python code that you would like to run from the terminal. In the learning modules you have completed so far, you have already encountered a couple of CLI tools that are used in our projects:

              • conda for managing environments
              • git for version control of code
              • dvc for version control of data

As we begin to move into the next couple of learning modules, we are going to encounter even more CLI tools that we need to interact with. Here is an example of a long command that you might need to run in your project in the future

              docker run -v $(pwd):/app -w /app --gpus all --rm -it my_image:latest python my_script.py --arg1 val1 --arg2 val2\n

              This can be a lot to remember, and it can be easy to make mistakes. Instead it would be nice if we could just do

              run my_command --arg1=val1 --arg2=val2\n

i.e. easier to remember because we have removed a lot of the hard-to-remember stuff, but we are still able to configure it to our liking. To help with this, we are going to look at the invoke package. invoke is a Python package that allows you to define tasks that can be run from the terminal. It is a bit like a more advanced version of the Makefile that you might have encountered in other projects. Some good alternatives to invoke are just and task, but we have chosen to focus on invoke in this module because it can be installed as a Python package, making installation across different systems easier.
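As a teaser for the exercises below, here is a minimal sketch of what such a task could look like in a tasks.py file, wrapping the long docker command from above (the image name, script name and arguments are just the placeholders from that example):

from invoke import task\n\n\n@task\ndef docker_run(ctx, arg1=\"val1\", arg2=\"val2\"):\n    \"\"\"Run my_script.py inside the docker image with all the usual flags.\"\"\"\n    ctx.run(\n        \"docker run -v $(pwd):/app -w /app --gpus all --rm -it my_image:latest \"\n        f\"python my_script.py --arg1 {arg1} --arg2 {arg2}\",\n        echo=True,\n    )\n

which could then be run as invoke docker-run --arg1 val1 --arg2 val2 (invoke exposes the task as docker-run because underscores in task names are translated to dashes by default).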

              "},{"location":"s2_organisation_and_version_control/cli/#exercises_2","title":"\u2754 Exercises","text":"
              1. Start by installing invoke

                pip install invoke\n

                remember to add the package to your requirements.txt file.

              2. Add a tasks.py file to your repository and try to just run

                invoke --list\n

which should work but inform you that no tasks have been added yet.

              3. Let's now try to add a task to the tasks.py file. The way to do this with invoke is to import the task decorator from invoke and then decorate a function with it:

from invoke import task\nimport os\n\n@task\ndef python(ctx):\n    \"\"\"Print the path of the current Python interpreter.\"\"\"\n    ctx.run(\"which python\" if os.name != \"nt\" else \"where python\")\n

the first argument of any task-decorated function is the ctx context argument, which implements the run method for running any command just as we would run it in the terminal. In this case we have simply implemented a task that prints the path of the current Python interpreter, and it works on all operating systems. Check that it works by running:

invoke python\n
4. Let's try to create a task that simplifies the process of git add, git commit, git push. Create a task such that the following command can be run

                invoke git --message \"My commit message\"\n

                Implement it and use the command to commit the taskfile you just created!

                Solution
@task\ndef git(ctx, message):\n    ctx.run(\"git add .\")\n    ctx.run(f\"git commit -m '{message}'\")\n    ctx.run(\"git push\")\n
5. As you have hopefully realized by now, the most important method in invoke is the ctx.run method, which actually runs the commands you want to run in the terminal. This method takes multiple additional arguments. Try out the arguments warn, pty and echo and explain in your own words what they do.

                Solution
• warn: If set to True the command will not raise an exception if it fails. This can be useful if you want to run multiple commands and you do not want the whole process to stop if one of the commands fails.
• pty: If set to True the command will be run in a pseudo-terminal. Whether you want to enable this or not depends on the command you are running. Here is a good explanation of when/why you should use it.
• echo: If set to True the command will be printed to the terminal before it is run. A small sketch demonstrating all three arguments is shown below.
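A minimal sketch demonstrating the three arguments (the failing command is just an illustration):

from invoke import task\n\n\n@task\ndef demo(ctx):\n    \"\"\"Demonstrate the warn, echo and pty arguments of ctx.run.\"\"\"\n    # echo=True prints the command before running it, warn=True means that a\n    # non-zero exit code does not raise an exception but is reported in the result\n    result = ctx.run(\"ls this_folder_does_not_exist\", warn=True, echo=True)\n    if not result.ok:\n        print(\"The command failed, but we can continue anyway\")\n\n    # pty=True runs the command in a pseudo-terminal, which some programs need to\n    # produce colored or interactive output (not supported on Windows)\n    ctx.run(\"python --version\", echo=True, pty=True)\n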
6. Create a task that simplifies the process of bootstrapping a conda environment and installing the relevant dependencies of your project.

                Solution
@task\ndef conda(ctx, name: str = \"dtu_mlops\"):\n    ctx.run(\"conda env create -f environment.yml\", echo=True)\n    ctx.run(f\"conda activate {name}\", echo=True)\n    ctx.run(\"pip install -e .\", echo=True)\n

                and try to run the following command

                invoke conda\n
7. Assuming you have completed the exercises on using dvc for version control of data, let's also try to add a task that simplifies the process of adding new data. This is the list of commands that needs to be run to add new data to a dvc repository: dvc add, git add, git commit, git push, dvc push. Try to implement a task that simplifies this process. It needs to take two arguments: one for the folder to add and one for the commit message.

                Solution
@task\ndef dvc(ctx, folder=\"data\", message=\"Add new data\"):\n    ctx.run(f\"dvc add {folder}\")\n    ctx.run(f\"git add {folder}.dvc .gitignore\")\n    ctx.run(f\"git commit -m '{message}'\")\n    ctx.run(\"git push\")\n    ctx.run(\"dvc push\")\n

                and try to run the following command

                invoke dvc --folder 'data' --message 'Add new data'\n
8. As the final exercise, let's try to combine every way of defining CLIs that we have learned about in this module. Define a task that does the following

                • calls dvc pull to download the data
• calls an entrypoint my_cli with the subcommand train and the argument --output 'model.ckpt'
                Solution
from invoke import task\n\n@task\ndef pull_data(ctx):\n    ctx.run(\"dvc pull\")\n\n@task(pull_data)\ndef train(ctx):\n    ctx.run(\"my_cli train --output 'model.ckpt'\")\n

That is all there is to it. You should now be able to define tasks that can be run from the terminal to simplify the process of running your code. We recommend that, as you go through the learning modules in this course, you slowly start to add tasks to your tasks.py file that simplify the process of running the code you are writing.

              "},{"location":"s2_organisation_and_version_control/cli/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
              1. What is the purpose of a command line interface?

                Solution

                A command line interface is a way for you to define the user interface of your application directly in the terminal. It allows you to interact with your code in a more advanced way than just running Python scripts.

              "},{"location":"s2_organisation_and_version_control/code_structure/","title":"M6 - Code structure","text":""},{"location":"s2_organisation_and_version_control/code_structure/#code-organization","title":"Code organization","text":"

              Core Module

With a basic understanding of version control, it is now time to really begin filling up our code repository. However, the question then remains: how should we organize our code? As developers we tend not to think about code organization that much; it is instead something that is created dynamically as we need it. However, maybe we should spend some time getting organized from the start, with the chance of this making our code easier to develop and maintain in the long run. If we do not spend time organizing our code, we may end up with a mess that is hard to understand and maintain.

              Big ball of Mud

              A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems. Brian Foote and Joseph Yoder, Big Ball of Mud. Fourth Conference on Patterns Languages of Programs (PLoP '97/EuroPLoP '97) Monticello, Illinois, September 1997

We are here going to focus on the organization of data science and machine learning projects. The core difference these kinds of projects introduce compared to more traditional systems is data. The key to modern machine learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that data should influence our choice of code structure. If we had another kind of application, then the layout of our codebase should probably be different.

              "},{"location":"s2_organisation_and_version_control/code_structure/#cookiecutter","title":"Cookiecutter","text":"

We are in this course going to use the tool cookiecutter, which is a tool for creating projects from project templates. A project template is, in short, just an overall structure of how you want your folders, files etc. to be organized from the beginning. For this course we are going to be using a custom MLOps template. The template is essentially a fork of the cookiecutter data science template that has been used for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.

We are not going to argue that this template is better than every other template; we are just focusing on the fact that it is a standardized way of creating project structures for machine learning projects. By standardized we mean that if two persons are both using cookiecutter with the same template, the layout of their code follows some specific rules, enabling one to understand the other person's code faster. Code organization is therefore not only about making the code easier for you to maintain but also for others to read and understand.

              Shown below is the default code structure of cookiecutter for data science projects.
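In abridged form, such a structure typically looks something like this (the exact folders and names differ between template versions, so treat this as a sketch rather than the definitive layout):

\u251c\u2500\u2500 data/\n\u2502   \u251c\u2500\u2500 processed/      <- The final data sets used for modelling\n\u2502   \u2514\u2500\u2500 raw/            <- The original, immutable data dump\n\u251c\u2500\u2500 docs/               <- Project documentation\n\u251c\u2500\u2500 models/             <- Trained models and model predictions\n\u251c\u2500\u2500 notebooks/          <- Jupyter notebooks for exploration\n\u251c\u2500\u2500 reports/\n\u2502   \u2514\u2500\u2500 figures/        <- Generated figures used in reporting\n\u251c\u2500\u2500 <project_name>/     <- Source code for the project\n\u251c\u2500\u2500 Makefile            <- Convenience commands like make data or make train\n\u251c\u2500\u2500 pyproject.toml      <- Project metadata and dependencies\n\u251c\u2500\u2500 README.md           <- The top-level README\n\u2514\u2500\u2500 requirements.txt    <- Python dependencies\n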

What is important to keep in mind when using a template is that it is exactly that: a template. By definition a template is a guide to make something. Therefore, not all parts of a template may be important for your project at hand. Your job is to pick the parts from the template that are useful for organizing your machine learning project and add the parts that are missing.

              "},{"location":"s2_organisation_and_version_control/code_structure/#python-projects","title":"Python projects","text":"

While the same template could in principle be used regardless of what language we are using for our machine learning or data science application, there are certain considerations to take into account based on the language. Python is currently the dominant language for machine learning and data science, which is why we in this section are focusing on some of the special files you will need for your Python projects.

              The first file you may or may not know is the __init__.py file. In Python the __init__.py file is used to mark a directory as a Python package. Therefore as a bare minimum, any Python package should look something like this:

              \u251c\u2500\u2500 src/\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 file1.py\n\u2502   \u251c\u2500\u2500 file2.py\n\u251c\u2500\u2500 pyproject.toml\n

The second file to focus on is pyproject.toml. This file is important for actually turning your code into a Python project. Essentially, whenever you run pip install, pip is in charge of both downloading the package you want and installing it. For pip to be able to install a package, it needs instructions on what part of the code it should install and how to install it. This is the job of the pyproject.toml file.

Below we describe the structure of the pyproject.toml file, but also setup.py + setup.cfg, which is the \"old\" way of providing project instructions for Python projects. You may still encounter a lot of projects using setup.py + setup.cfg, so it is good to at least know about them.


pyproject.toml is the new standardized way of describing project metadata in a declarative way, introduced in PEP 621. It is written in the toml format, which is easy to read. At the very least your pyproject.toml file should include the [build-system] and [project] sections:

              [build-system]\nrequires = [\"setuptools\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"my-package-name\"\nversion = \"0.1.0\"\nauthors = [{name = \"EM\", email = \"me@em.com\"}]\ndescription = \"Something cool here.\"\nrequires-python = \">=3.8\"\ndynamic = [\"dependencies\"]\n\n[tool.setuptools.dynamic]\ndependencies = {file = [\"requirements.txt\"]}\n

the [build-system] section informs pip/python that to build this Python project it needs the two packages setuptools and wheel, and that it should call the setuptools.build_meta function to actually build the project. The [project] section essentially contains metadata about the package, what it is called etc., if we ever want to publish it to PyPI.

For specifying the dependencies of your project you have two options. Either you specify them in a requirements.txt file and add it as a dynamic field in pyproject.toml as shown above. Alternatively, you can add a dependencies field under the [project] header like this:

              [project]\ndependencies = [\n    'torch==2.1.0',\n    'matplotlib>=3.8.1'\n]\n

The improvement over setup.py + setup.cfg is that pyproject.toml also allows metadata from other tools to be specified in it, essentially making sure you only need a single configuration file for your project. For example, in the upcoming module M7 on good coding practices you will learn about the tool ruff and how it can help format your code. If we want to configure ruff for our project we can do that directly in pyproject.toml by adding additional headers:

[tool.ruff]\nruff_option = ...\n

              To read more about how to specify pyproject.toml this page is a good place to start.

setup.py is the original way of describing how a Python package should be built. The most basic setup.py file looks like this:

from setuptools import setup\n\nwith open(\"requirements.txt\") as f:\n    requirements = [line.strip() for line in f if line.strip() and not line.startswith(\"#\")]\n\nsetup(\n    name=\"my-package-name\",\n    version=\"0.1.0\",\n    author=\"EM\",\n    description=\"Something cool here.\",\n    install_requires=requirements,\n)\n

Essentially, it is the exact same meta information as in pyproject.toml, just written directly in Python syntax instead of toml. Because there was a wish to separate this meta information into its own file, the setup.cfg file was created, which can contain the exact same information as setup.py, just as a declarative config.

              [metadata]\nname = my-package-name\nversion = 0.1.0\nauthor = EM\ndescription = \"Something cool here.\"\n# ...\n

This non-standardized way of providing meta information about a package was essentially what led to the creation of pyproject.toml.

Regardless of which way a project is configured, after creating the above files the way to install the project is the same

              pip install .\n# or in developer mode\npip install -e . # (1)!\n
1. The -e is short for --editable mode, also called developer mode. Since we will continuously be iterating on our package, this is the preferred way to install it, because it means that we do not have to run pip install every time we make a change. Essentially, in developer mode changes in the Python source code take effect immediately without requiring a new installation.

after running this your code should be available to import as from project_name import ... like any other Python package you use. This is the most essential knowledge you need about creating Python packages.

              "},{"location":"s2_organisation_and_version_control/code_structure/#exercises","title":"\u2754 Exercises","text":"

After having installed cookiecutter (exercises 1 and 2), the remaining exercises are intended to take the simple CNN MNIST classifier from yesterday's exercise and force it into this structure. You are not required to fill out every folder and file in the project structure, but try to at least follow the steps in the exercises. Whenever you need to run a file, I recommend always doing this from the root directory, e.g.

              python <project_name>/data/make_dataset.py data/raw data/processed\npython <project_name>/models/train_model.py <arguments>\netc...\n

              in this way paths (for saving and loading files) are always relative to the root.

              1. Install cookiecutter framework

                pip install cookiecutter\n
2. Start a new project using this template, which is specialized for this course (1).

1. If you feel like the template can be improved in some way, feel free to either open an issue with the proposed improvement or directly send a pull request to the repository \ud83d\ude04.

                You do this by running the cookiecutter command using the template url:

                cookiecutter <url-to-template>\n

                Valid project names

When asked for a project name you should follow the PEP8 guidelines for naming packages. This means that the name should be all lowercase and if you want to separate words, you should use underscores. For example my_project is a valid name, while MyProject is not. Additionally, the package name cannot start with a number.

                Flat-layout vs src-layout

There are two common choices for how to lay out your source directory. The first is called src-layout, where the source code is always placed in a src/<project_name> folder, and the second is called flat-layout, where the source code is just placed in a <project_name> folder. The template we are using in this course uses the flat-layout, but there are pros and cons to both.

3. After having created your new project, the first step is to also create a corresponding virtual environment and install any needed requirements. If you have a virtual environment from yesterday, feel free to use that, otherwise create a new one. Then install the project in that environment

                pip install -e .\n
              4. Start by filling out the <project_name>/data/make_dataset.py file. When this file runs, it should take the raw data e.g. the corrupted MNIST files from yesterday (../data/corruptmnist) which now should be located in a data/raw folder and process them into a single tensor, normalize the tensor and save this intermediate representation to the data/processed folder. By normalization here we refer to making sure the images have mean 0 and standard deviation 1.

                Solution make_dataset.py
import click\nimport torch\n\n\ndef normalize(images: torch.Tensor) -> torch.Tensor:\n    \"\"\"Normalize images.\"\"\"\n    return (images - images.mean()) / images.std()\n\n\n@click.command()\n@click.option(\"--raw_dir\", default=\"data/raw\", help=\"Path to raw data directory\")\n@click.option(\"--processed_dir\", default=\"data/processed\", help=\"Path to processed data directory\")\ndef make_data(raw_dir: str, processed_dir: str) -> None:\n    \"\"\"Process raw data and save it to processed directory.\"\"\"\n    train_images, train_target = [], []\n    for i in range(5):\n        train_images.append(torch.load(f\"{raw_dir}/train_images_{i}.pt\"))\n        train_target.append(torch.load(f\"{raw_dir}/train_target_{i}.pt\"))\n    train_images = torch.cat(train_images)\n    train_target = torch.cat(train_target)\n\n    test_images: torch.Tensor = torch.load(f\"{raw_dir}/test_images.pt\")\n    test_target: torch.Tensor = torch.load(f\"{raw_dir}/test_target.pt\")\n\n    train_images = train_images.unsqueeze(1).float()\n    test_images = test_images.unsqueeze(1).float()\n    train_target = train_target.long()\n    test_target = test_target.long()\n\n    train_images = normalize(train_images)\n    test_images = normalize(test_images)\n\n    torch.save(train_images, f\"{processed_dir}/train_images.pt\")\n    torch.save(train_target, f\"{processed_dir}/train_target.pt\")\n    torch.save(test_images, f\"{processed_dir}/test_images.pt\")\n    torch.save(test_target, f\"{processed_dir}/test_target.pt\")\n\n\nif __name__ == \"__main__\":\n    make_data()\n
              5. This template comes with a Makefile that can be used to easily define common operations in a project. You do not have to understand the complete file but try taking a look at it. In particular the following commands may come in handy

                make data  # runs the make_dataset.py file, try it!\nmake clean  # clean __pycache__ files\nmake requirements  # install everything in the requirements.txt file\n
                Windows users

make is a GNU build tool that is not available on Windows by default. There are two recommended ways to get it running on Windows. The first is leveraging the Windows Subsystem for Linux, which you may have already installed. The second option is utilizing the chocolatey package manager, which enables Windows users to install packages similar to a Linux system.

                In general we recommend that you add commands to the Makefile as you move along in the course. If you want to know more about how to write Makefiles then this is an excellent video.

6. Put your model file (model.py) into the <project_name>/models folder and insert the relevant code from the main.py file into the train_model.py file. Make sure that whenever a model is trained it gets saved to the models folder (preferably in sub-folders).

7. When you run train_model.py, make sure that some statistics/visualizations from the trained models get saved to the reports/figures/ folder. This could be a simple .png of the training curve.

              8. (Optional) Can you figure out a way to add a train command to the Makefile such that training can be started using

                make train\n
                Solution
                train:\n    python <project_name>/models/train_model.py\n
9. Fill out the newly created <project_name>/models/predict_model.py file, such that it takes a pre-trained model file and creates predictions for some data. The recommended interface is that users can give this file either a folder with raw images that get loaded in, or a numpy or pickle file with already loaded images, e.g. something like this

                python <project_name>/models/predict_model.py \\\n    models/my_trained_model.pt \\  # file containing a pretrained model\n    data/example_images.npy  # file containing just 10 images for prediction\n
10. Fill out the file <project_name>/visualization/visualize.py with this (as a minimum, feel free to add more visualizations)

                • Loads a pre-trained network
                • Extracts some intermediate representation of the data (your training set) from your cnn. This could be the features just before the final classification layer
                • Visualize features in a 2D space using t-SNE to do the dimensionality reduction.
                • Save the visualization to a file in the reports/figures/ folder.
                Solution

The solution here depends a bit on the choice of model. However, in most cases the last layer in your model will be a fully connected layer, which we assume is named fc. The easiest way to get the features before this layer is to replace the layer with torch.nn.Identity, which essentially does nothing (see the corresponding line in the solution below). Alternatively, if you implemented everything in a torch.nn.Sequential you can just remove the last layer from the Sequential object: model = model[:-1].

                visualize.py
import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom my_project_name.model import MyAwesomeModel\nfrom sklearn.decomposition import PCA\nfrom sklearn.manifold import TSNE\n\n\n@click.command()\n@click.option(\"--model_checkpoint\", default=\"model.pth\", help=\"Path to model checkpoint\")\n@click.option(\"--processed_dir\", default=\"data/processed\", help=\"Path to processed data directory\")\n@click.option(\"--figure_dir\", default=\"reports/figures\", help=\"Path to save figures\")\n@click.option(\"--figure_name\", default=\"embeddings.png\", help=\"Name of the figure\")\ndef visualize(model_checkpoint: str, processed_dir: str, figure_dir: str, figure_name: str) -> None:\n    \"\"\"Visualize model predictions.\"\"\"\n    model = MyAwesomeModel()\n    model.load_state_dict(torch.load(model_checkpoint))  # load_state_dict modifies the model in place\n    model.eval()\n    model.fc = torch.nn.Identity()\n\n    test_images = torch.load(f\"{processed_dir}/test_images.pt\")\n    test_target = torch.load(f\"{processed_dir}/test_target.pt\")\n    test_dataset = torch.utils.data.TensorDataset(test_images, test_target)\n\n    embeddings, targets = [], []\n    with torch.inference_mode():\n        for batch in torch.utils.data.DataLoader(test_dataset, batch_size=32):\n            images, target = batch\n            predictions = model(images)\n            embeddings.append(predictions)\n            targets.append(target)\n        embeddings = torch.cat(embeddings).numpy()\n        targets = torch.cat(targets).numpy()\n\n    if embeddings.shape[1] > 500:  # Reduce dimensionality for large embeddings\n        pca = PCA(n_components=100)\n        embeddings = pca.fit_transform(embeddings)\n    tsne = TSNE(n_components=2)\n    embeddings = tsne.fit_transform(embeddings)\n\n    plt.figure(figsize=(10, 10))\n    for i in range(10):\n        mask = targets == i\n        plt.scatter(embeddings[mask, 0], embeddings[mask, 1], label=str(i))\n    plt.legend()\n    plt.savefig(f\"{figure_dir}/{figure_name}\")\n\n\nif __name__ == \"__main__\":\n    visualize()\n
11. (Optional) Feel free to create more files/visualizations (what about investigating/exploring the data distribution?)

12. Make sure to update the README.md file with a short description of how your scripts should be run

              13. Finally make sure to update the requirements.txt file with any packages that are necessary for running your code (see this set of exercises for help)

14. (Optional) Let's say that you are not satisfied with the template I have recommended that you use, which is completely fine. What should you then do? You should of course create your own template! This is actually not that hard to do.

1. As a starting point I would recommend that you fork either the mlops template, which you have already been using, or alternatively fork the data science template.

2. After forking the template, clone it down locally and let's start modifying it. The first step is changing the cookiecutter.json file. For the mlops template it looks like this:

                  {\n    \"project_name\": \"project_name\",\n    \"repo_name\": \"{{ cookiecutter.project_name.lower().replace(' ', '_') }}\",\n    \"author_name\": \"Your name (or your organization/company/team)\",\n    \"description\": \"A short description of the project.\",\n    \"python_version_number\": \"3.10\",\n    \"open_source_license\": [\"No license file\", \"MIT\", \"BSD-3-Clause\"]\n}\n

                  simply add a new line to the json file with the name of the variable you want to add and the default value you want it to have.

3. The actual template is located in the {{ cookiecutter.project_name }} folder. cookiecutter works by replacing {{ cookiecutter.<variable_name> }} with the value of the variable everywhere it sees it. Therefore, if you want to add a new file to the template, just add it to the {{ cookiecutter.project_name }} folder and make sure to add {{ cookiecutter.<variable_name> }} where you want the variable to be replaced. An example of what such a templated file can look like is shown below.
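For example, a templated README.md placed inside the {{ cookiecutter.project_name }} folder, using only variables that already exist in the cookiecutter.json shown above, could look roughly like this:

# {{ cookiecutter.project_name }}\n\n{{ cookiecutter.description }}\n\nCreated by {{ cookiecutter.author_name }}.\n

When the template is run, the placeholders are replaced with the values provided by the user.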

                4. After you have made the changes you want to the template, you should test it locally. Just run

                  cookiecutter . -f --no-input\n

                  and it should create a new folder using the default values of the cookiecutter.json file.

                5. Finally, make sure to push any changes you made to the template to GitHub, such that you in the future can use it by simply running

                  cookiecutter https://github.com/<username>/<my_template_repo>\n
              "},{"location":"s2_organisation_and_version_control/code_structure/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
1. Starting from complete scratch, what are the steps needed to create a new GitHub repository and push a specific template to it as the very first commit?

                Solution
1. Create a completely barebone repository, either using the GitHub UI or, if you have the GitHub CLI (gh, not git) installed, by running

                  gh repo create <repo_name> --public --confirm\n
                2. Run cookiecutter with the template you want to use

                  cookiecutter <template>\n

The name of the folder created by cookiecutter should be the same as the repository name you just created.

                3. Run the following sequence of commands

                  cd <project_name>\ngit init\ngit add .\ngit commit -m \"Initial commit\"\ngit remote add origin https://github.com/<username>/<repo_name>\ngit push origin master\n
                4. That's it. The template should now have been pushed to the repository as the first commit.

That ends the module on code structure and cookiecutter. We again want to stress that the point of using cookiecutter is not to follow one specific template, but rather to use some template for organizing your code. What often happens in a team is that multiple templates are needed at different stages of the development phase or for different product types, because they share a common structure while still having some specifics. Keeping templates up-to-date then becomes critical such that no team member is using an outdated template. If you ever end up in this situation, we highly recommend checking out cruft, which works alongside cookiecutter to not only create projects but also update existing ones as the template evolves. Cruft also has template validation capabilities to ensure projects match the latest version of a template.

                  "},{"location":"s2_organisation_and_version_control/dvc/","title":"M8 - Data version control","text":""},{"location":"s2_organisation_and_version_control/dvc/#data-version-control","title":"Data Version Control","text":"

                  Core Module

                  In this module, we are going to return to version control. However, this time we are going to focus on version control of data. The reason we need to separate between standard version control and data version control comes down to one problem: size.

Classic version control was developed to keep track of code files, which are all simple text files. Even a codebase that contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better with the more data you feed them, we are seeing models today that are being trained on petabytes of data (1,000,000 GB per petabyte).

Because this is an important problem, there exist a couple of frameworks that have specialized in versioning data, such as DVC, DAGsHub, Hub, Modelstore and ModelDB. Regardless of the framework, they all implement somewhat the same concept: instead of storing the actual data files, or in general any large artifact files, we store a pointer to these large files. We then version control the pointer instead of the artifact.

                  Image credit

We are in this course going to use DVC provided by iterative.ai, as they also provide tools for automating machine learning, which we are going to focus on later.

                  "},{"location":"s2_organisation_and_version_control/dvc/#dvc-what-is-it","title":"DVC: What is it?","text":"

DVC (Data Version Control) is simply an extension of git that versions not only data but also models and experiments in general. But how does it deal with these large data files? Essentially, DVC will just keep track of a small metafile that points to some remote location where your original data is stored. Metafiles essentially work as placeholders for your data files. Your large data files are then stored in some remote location such as Google Drive or an S3 bucket from Amazon.

                  Image credit

                  As the figure shows, we now have two remote locations: one for code and one for data. We use git pull/push for the code and dvc pull/push for the data. The key concept is the connection between the data file model.pkl which is fairly large and its respective metafile model.pkl.dvc which is very small. The large file is stored in the data remote and the metafile is stored in the code remote.
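To make this concrete, a metafile such as model.pkl.dvc is just a few lines of yaml, roughly like this (the hash and size below are of course made up):

outs:\n- md5: 3863d0e317dee0a55c4e59d2ec0eef33\n  size: 13972479\n  path: model.pkl\n

This small file is what git tracks, while the actual model.pkl lives in the data remote.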

                  "},{"location":"s2_organisation_and_version_control/dvc/#exercises","title":"\u2754 Exercises","text":"

                  If in doubt about some of the exercises, we recommend checking out the documentation for DVC as it contains excellent tutorials.

1. For these exercises, we are going to use Google Drive as a remote storage solution for our data. If you do not already have a Google account, please create one (we are going to use it again in later exercises). Please make sure that you have at least 1GB of free space.

                  2. Next, install DVC and the Google Drive extension

                    pip install dvc\npip install dvc-gdrive\n

If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all. If the installation fails, we recommend that you start by updating pip and then try to update dvc:

                    pip install -U pip\npip install -U dvc-gdrive\n

                    If this does not work for you, it is most likely due to a problem with pygit2 and in that case we recommend that you follow the instructions here.

                  3. In your MNIST repository run the following command from the terminal

                    dvc init\n

this will set up dvc for this repository (similar to how git init will initialize a git repository). These files should be committed to your repository using standard git.

                  4. Go to your Google Drive and create a new folder called dtu_mlops_data. Then copy the unique identifier belonging to that folder as shown in the figure below

                    Using this identifier, add it as a remote storage

                    dvc remote add -d storage gdrive://<your_identifier>\n
                  5. Check the content of the file .dvc/config. Does it contain a pointer to your remote storage? Afterwards, make sure to add this file to the next commit we are going to make:

                    git add .dvc/config\n
                  6. Call the dvc add command on your data files exactly like you would add a file with git (you do not need to add every file by itself as you can directly add the data/ folder). Doing this should create a human-readable file with the extension .dvc. This is the metafile as explained earlier that will serve as a placeholder for your data. If you are on Windows and this step fails you may need to install pywin32. At the same time, the data folder should have been added to the .gitignore file that marks which files should not be tracked by git. Confirm that this is correct.

                  7. Now we are going to add, commit and tag the metafiles so we can restore to this stage later on. Commit and tag the files, which should look something like this:

                    git add data.dvc .gitignore\ngit commit -m \"First datasets, containing 25000 images\"\ngit tag -a \"v1.0\" -m \"data v1.0\"\n
                  8. Finally, push your data to the remote storage using dvc push. You will be asked to authenticate, which involves copy-pasting the code in the link prompted. Check out your Google Drive folder. You will see that the data is not in a recognizable format anymore due to the way that dvc packs and tracks the data. The boring detail is that dvc converts the data into content-addressable storage which makes data much faster to get. Finally, make sure that your data is not stored in your Github repository.

After authenticating the first time, DVC should be set up without you having to authenticate again. If you for some reason encounter that dvc fails to authenticate, you can try to reset the authentication. Locate the file $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json where $CACHE_HOME depends on your operating system:

• macOS: ~/Library/Caches
• Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)
• Windows: {user}/AppData/Local

                    Delete the complete {gdrive_client_id} folder and retry authenticating with dvc push.

                  9. After completing the above steps, it is very easy for others (or yourself) to get setup with both code and data by simply running

                    git clone <my_repository>\ncd <my_repository>\ndvc pull\n

(assuming that you give them access rights to the folder in your drive). Try doing this (in some other location than your standard code) to make sure that the two commands indeed download both your code and data.

10. Let's look at the process of updating our data. Remember, the important aspect of version control is that we do not need to store explicit files called data_v1.pt, data_v2.pt etc., but can just have a single data.pt where we can always check out earlier versions. Start by copying the data/corruptmnist_v2 folder from this repository to your MNIST code. This contains 3 extra datafiles with 15000 additional observations. Rerun your data pipeline so these get incorporated into the files in your processed folder.

                  11. Redo the above steps, adding the new data using dvc, committing and tagging the metafiles e.g. the following commands should be executed (with appropriate input):

                    dvc add -> git add -> git commit -> git tag -> dvc push -> git push.

12. Let's say that you wanted to go back to the state of your data in v1.0. If the above steps have been done correctly, you should be able to do this using:

                    git checkout v1.0\ndvc checkout\n

                    confirm that you have reverted to the original data.

                  13. (Optional) Finally, it is important to note that dvc is not only intended to be used to store data files but also any other large files such as trained model weights (with billions of parameters these can be quite large). For example, if we always store our best-performing model in a file called best_model.ckpt then we can use dvc to version control it, store it online and make it easy for others to download. Feel free to experiment with this using your model checkpoints.

In general dvc is a great framework for version-controlling data and models. However, it is important to note that it does have some performance issues when dealing with datasets that consist of many files. Therefore, if you are ever working with a dataset that consists of many small files, it can be a good idea to:

                  • zip files into a single archive and then version control the archive. The zip archive should be placed in a data/raw folder and then unzipped in the data/processed folder.

• If possible, turn your data into 1D arrays; then it can be stored in a single file format such as .parquet or .csv. This is especially useful for tabular data. You can then version control the single file instead of the many files; a small sketch of this is shown below.
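As a small sketch of the second point, assuming a hypothetical folder of many small csv files, they could be packed into a single parquet file (requires pandas and pyarrow) that can afterwards be tracked with a single dvc add:

from pathlib import Path\n\nimport pandas as pd\n\n# hypothetical folder containing thousands of small csv files\nfiles = sorted(Path(\"data/raw\").glob(\"*.csv\"))\ndf = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)\n\n# one file is much faster for dvc to track than thousands of small ones\ndf.to_parquet(\"data/processed/all_data.parquet\")\n

followed by dvc add data/processed/all_data.parquet.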

                  "},{"location":"s2_organisation_and_version_control/dvc/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                  1. How do you know that a repository is using dvc?

                    Solution

Similar to a git repository having a .git directory, a repository using dvc needs to have a .dvc folder. Alternatively you can use the dvc status command.

2. Assume you just added a folder called data/ that you want to track with dvc. What is the sequence of 5 commands needed to successfully version control the folder? (assuming you already set up a remote)

                    Solution
                    dvc add data/\ngit add .\ngit commit -m \"added raw data\"\ngit push\ndvc push\n

That's all for today. With the combined power of git and dvc we should be able to version control everything in our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that dvc offers more than just data version control, so if you want to deep dive into dvc we recommend their pipeline feature and how this can be used to set up version-controlled experiments. Note that we are going to revisit dvc later for a more permanent (and large-scale) storage solution.
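As a small taste of the pipeline feature, stages are declared in a dvc.yaml file with their command, dependencies and outputs, roughly like this (the paths and command are illustrative):

stages:\n  train:\n    cmd: python train.py\n    deps:\n    - data/processed\n    outs:\n    - models/model.ckpt\n

Running dvc repro then only re-executes stages whose dependencies have changed.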

                  "},{"location":"s2_organisation_and_version_control/git/","title":"M5 - Git","text":""},{"location":"s2_organisation_and_version_control/git/#git","title":"Git","text":"

                  Core Module

Proper collaboration with other people requires that you can work on the same codebase in an organized manner. This is the reason that version control exists. Simply stated, it is a way to keep track of:

                  • Who made changes to the code
                  • When did the change happen
                  • What changes were made

                  For a full explanation please see this page

Secondly, it is important to note that GitHub is not git! GitHub is the dominant player when it comes to hosting repositories, but that does not mean that they are the only ones providing free repository hosting (see Bitbucket or GitLab for some other examples).

That said, we will be using git and GitHub throughout this course. It is a requirement for passing this course that you create a public repository with your code and use git to upload any code changes. How much you choose to integrate this into your own projects depends on you, but you are at least expected to be familiar with git+GitHub.

                  Image credit"},{"location":"s2_organisation_and_version_control/git/#initial-config","title":"Initial config","text":"

                  What does Git stand for?

                  The name \"git\" was given by Linus Torvalds when he wrote the very first version. He described the tool as \"the stupid content tracker\" and the name as (depending on your mood):

                  • Random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of \"get\" may or may not be relevant.
                  • Stupid. Contemptible and Despicable. simple. Take your pick from the dictionary of slang.
                  • \"Global information tracker\": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
                  • \"Goddamn idiotic truckload of sh*t\": when it breaks
                  1. Install git on your computer and make sure that your installation is working by writing git help in a terminal and it should show you the help message for git.

                  2. Create a GitHub account if you do not already have one.

3. To make sure that we do not have to type in our GitHub credentials every time we want to push some changes, we can set them once and for all on our local machine

                    # type in a terminal\ngit config credential.helper store\ngit config --global user.email <email>\n
                  "},{"location":"s2_organisation_and_version_control/git/#git-overview","title":"Git overview","text":"

The simplest way to think of version control is that it is just nodes with lines connecting them

Each node, which we call a commit, is uniquely identified by a hash string. Each node stores what our code looked like at that point in time (when we made the commit), and using the hash codes we can easily revert to a specific point in time.

The commits are made up of local changes that we make to our code. A basic workflow for adding commits is seen below

                  Assuming that we have made some changes to our local working directory and that we want to get these updates to be online in the remote repository we have to do the following steps:

• First we run the command git add. This will move our changes to the staging area. While changes are in the staging area we can very easily revert them (using git restore). No unique hash has been assigned to the code yet, so we can still overwrite it.

• To take our code from the staging area and make it into a commit, we simply run git commit, which will locally add a node to the graph. Note again that we have not pushed the commit to the online repository yet.

• Finally, we want others to be able to use the changes that we made. We do a simple git push and our commit gets online, as summarized in the small example below.
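Put together, a minimal version of this cycle looks like this (the file name and commit message are just placeholders):

git add my_script.py                # move the change to the staging area\ngit status                          # check what is currently staged\ngit commit -m \"Add my_script.py\"    # turn the staged change into a commit\ngit push                            # upload the commit to the remote repository\n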

                  Of course, the real power of version control is the ability to make branches, as in the image below

                  Image credit

Each branch can contain code that is not present on other branches. This is useful when many developers are working together on the same project.

                  "},{"location":"s2_organisation_and_version_control/git/#exercises","title":"\u2754 Exercises","text":"
1. In your GitHub account create a repository, where the intention is that you upload the code from the final exercise from yesterday

                    1. After creating the repository, clone it to your computer

                      git clone https://github.com/my_user_name/my_repository_name.git\n
                    2. Move/copy the three files from yesterday into the repository (and any other that you made)

                    3. Add the files to a commit by using git add command

4. Commit the files using the git commit command, where you use the -m argument to provide a commit message (1).

1. Writing good commit messages is a skill in itself. A commit message should be short but informative about the work you are trying to commit. Try to practice writing good commit messages throughout the course. You can see this guideline for help.
                    5. Finally push the files to your repository using git push. Make sure to check online that the files have been updated in your repository.

                    6. You can always use the command git status to check where you are in the process of making a commit.

                    7. Also checkout the git log command, which will show you the history of commits that you have made.

                  2. Make sure that you understand how to make branches, as this will allow you to try out code changes without messing with your working code. Creating a new branch can be done using:

                    # create a new branch\ngit checkout -b <my_branch_name>\n

Afterwards, you can use git checkout (1) to change between branches (remember to commit your work!). Try adding something (a file, a new line of code etc.) to the newly created branch, commit it and try changing back to the main branch afterwards. You should hopefully see that whatever you added on the new branch is not present on the main branch.

                    1. The git checkout command is used for a lot of different things in git. It can be used to change branches, to revert changes and to create new branches. An alternative is using git switch and git restore which are more modern commands.
3. If you do not already have a cloned version of this repository belonging to the course, make sure to make one! I am continuously updating/changing some of the material during the course, and I therefore recommend that you do a git pull on your local copy each day before the lecture.

4. Git may seem like a waste of time when solutions like Dropbox, Google Drive etc. exist, and that is not completely untrue when only one or two people are working on a project. However, these file management systems fall short when hundreds to thousands of people work together. For this exercise you will go through the steps of sending an open-source contribution:

                    1. Go online and find a project you do not own, where you can improve the code. You can either look at this page of good issues to get started with or for simplicity you can just choose the repository belonging to the course. Now fork the project by clicking the Fork button.

This will create a copy of the repository under your own account, which you have complete write access to. Note that code updates to the original repository do not automatically update the code in your fork.

2. Clone your fork of the project using git clone.

3. By default your local repository will be on the main branch (HINT: you can check this with the git status command). It is good practice to make a new branch when working on some changes. Use the git branch command followed by the git checkout command to create a new branch.

                    4. You are now ready to make changes to the repository. Try to find something to improve (any spelling mistakes?). When you have made the changes, do the standard git cycle: add -> commit -> push

5. Go online to the original repository and go to the Pull requests tab. Find the compare button and choose to compare the master branch of the original repo with the branch that you just created in your own repository. Check the diff on the page to make sure that it contains the changes you have made.

                    6. Write a bit about the changes you have made and click Create pull request :)

5. Forking a repository has the consequence that your fork and the repository that you forked can diverge. To mitigate this we can set what is called a remote upstream. Take a look at this page and set a remote upstream for the repository you just forked.

                    Solution
                    git remote add upstream <url-to-original-repo>\n
6. After setting the upstream branch, we need to pull and merge any updates. Take a look at this page and figure out how to do this.

                    Solution
                    git fetch upstream\ngit checkout main\ngit merge upstream/main\n
7. As a final exercise we want to simulate a merge conflict, which happens when two users try to commit changes to exactly the same lines of code in the codebase, and git is not able to resolve how the different commits should be integrated.

                    1. In your browser, open your favorite repository (it could be the one you just worked on), go to any file of your choosing and click the edit button (see image below) and make some change to the file. For example, if you choose a Python file you can just import some random packages at the top of the file. Commit the change.

                    2. Make sure not to pull the change you just made to your local computer. Locally make changes to the same file in the same lines and commit them afterwards.

                    3. Now try to git pull the online changes. What should (hopefully) happen is that git will tell you that it found a merge conflict that needs to be resolved. Open the file and you should see something like this

                      <<<<<<< HEAD\nthis is some content to mess with\ncontent to append\n=======\ntotally different content to merge later\n>>>>>>> master\n

this should be interpreted as: everything between <<<<<<< and ======= are the changes made by your local commit, and everything between ======= and >>>>>>> are the changes you are trying to pull. To fix the merge conflict you simply have to make the code in the two \"cells\" work together. When you are done, remove the identifiers <<<<<<<, ======= and >>>>>>>.
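For example, if you decide to keep both changes, the resolved file could simply end up looking like this:

this is some content to mess with\ncontent to append\ntotally different content to merge later\n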

                    4. Finally, commit the merge and try to push.

8. (Optional) The above exercises have focused on how to use git from the terminal, which I highly recommend learning. However, proper editors also have built-in support for version control. We recommend getting familiar with these features (here is a tutorial for VS Code).

                  "},{"location":"s2_organisation_and_version_control/git/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                  1. How do you know if a certain directory is a git repository?

                    Solution

You can check if there is a \".git\" directory. Alternatively, you can use the git status command.

2. Explain what the .gitignore file is used for.

                    Solution

The .gitignore file is used to tell git which files to ignore when doing a git add . command. This is useful for files that are not part of the codebase but are needed for the code to run (e.g. data files), or files that contain sensitive information (e.g. .env files that contain API keys and passwords).
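A small illustrative .gitignore could look like this:

# compiled Python files\n__pycache__/\n*.pyc\n\n# virtual environments\n.venv/\n\n# files with secrets\n.env\n\n# large data files that are tracked with dvc instead of git\ndata/\n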

                  3. You have two branches - main and devel. What sequence of commands would you need to execute to make sure that devel is in sync with main?

                    Solution
                    git checkout main\ngit pull\ngit checkout devel\ngit merge main\n
                  4. What best practices are you familiar with regarding version control?

                    Solution
                    • Use a descriptive commit message
                    • Make each commit a logical unit
                    • Incorporate others' changes frequently
                    • Share your changes frequently
                    • Coordinate with your co-workers
                    • Don't commit generated files

                  That covers the basics of git to get you started. In the exercise folder you can find a git cheat sheet with the most useful commands for future reference. Finally, we want to point out another awesome feature of GitHub: its in-browser editor. Sometimes you have a small edit that you want to make, but would still like to do this in an IDE/editor. Or you may be in a situation where you are working from another device than your usual developer machine. GitHub has a built-in editor that can be enabled by simply changing any URL from

                  https://github.com/username/repository\n

                  to

                  https://github.dev/username/repository\n

                  Try it out on your newly created repository.

                  "},{"location":"s2_organisation_and_version_control/good_coding_practice/","title":"M7 - Good coding practice","text":""},{"location":"s2_organisation_and_version_control/good_coding_practice/#good-coding-practice","title":"Good coding practice","text":"

                  Quote

                  Code is read more often than it is written. Guido van Rossum (author of Python)

                  It is hard to define exactly what good coding practices are. But the above quote by Guido does hint at what it could be, namely that it has to do with how others read and perceive your code. In general, good coding practice is about making sure that your code is easy to read and understand, not only by others but also by your future self. The key concept to keep in mind when we are talking about good coding practice is consistency. In many cases it does not matter exactly how you choose to style your code etc.; the important part is that you are consistent about it.

                  Image credit"},{"location":"s2_organisation_and_version_control/good_coding_practice/#documentation","title":"Documentation","text":"

                  Most programmers have a love-hate relationship with documentation: we absolutely hate writing it ourselves, but love when someone else has actually taken the time to add it to their code. There is no doubt that well-documented code is much easier to maintain, as you do not need to remember every detail about the code to still maintain it. The key thing to remember is that good documentation saves more time than it takes to write.

                  The problem with documentation is that there is no right or wrong way to do it. You can end up doing:

                  • Under-documentation: You only document information that is clearly visible from the code and not the complex parts that are actually hard to understand.

                  • Over-documentation: Writing too much documentation will have the opposite effect of what you want on most people: there is too much to read, so people will skip it.

                  Writing good documentation is a skill that takes time to train, so let's try to do it.

                  Quote

                  Code tells you how; Comments tell you why. Jeff Atwood

                  "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises","title":"\u2754 Exercises","text":"
                  1. Go over the most complicated file in your project. Be critical and add comments where the logic behind the code is not easily understandable. (1)

                    1. In deep learning we often work with multi-dimensional tensors that constantly change shape after each operation. It is good practice to annotate with comments when tensors undergo some reshaping. In the following example we compute the pairwise Euclidean distance between two tensors using broadcasting, which results in multiple shape operations.

                      x = torch.randn(5, 10)  # N x D\ny = torch.randn(7, 10)  # M x D\nxy = x.unsqueeze(1) - y.unsqueeze(0)  # (N x 1 x D) - (1 x M x D) = (N x M x D)\npairwise_euc_dist = xy.abs().pow(2.0).sum(dim=-1)  # N x M\n
                  2. Add docstrings to at least two Python functions/methods. Here (example 5) you can see a good example of how to use identifiable keywords such as Parameters, Args, Returns, which standardizes the way of writing docstrings. A sketch of such a docstring is shown below.
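                    To make the expectations concrete, here is a minimal sketch of a Google-style docstring; the function, its arguments and the shapes are made up for illustration.

                    from torch import Tensor\n\n\ndef normalize(x: Tensor, eps: float = 1e-8) -> Tensor:\n    \"\"\"Normalize a tensor to zero mean and unit variance along the first dimension.\n\n    Args:\n        x: Input tensor of shape [N, D].\n        eps: Small constant added to the standard deviation for numerical stability.\n\n    Returns:\n        Tensor of the same shape as the input containing the normalized values.\n    \"\"\"\n    return (x - x.mean(dim=0)) / (x.std(dim=0) + eps)\n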

                  "},{"location":"s2_organisation_and_version_control/good_coding_practice/#styling","title":"Styling","text":"

                  While Python already enforces some styling (e.g. code should be indented in a specific way), this is not enough to ensure that code from different users actually looks alike. Maybe even more troubling is that you will often see your own style of coding change as you become more and more experienced. This kind of difference in coding style is not that important to take care of when you are working on a personal project, but when multiple people are working together on the same project it is important to consider.

                  The question then remains what styling you should use. This is where PEP8 comes into play, which is the official style guide for Python. It essentially contains what is considered \"good practice\" and \"bad practice\" when coding Python.

                  For many years the most commonly used tool for checking whether your code is PEP8 compliant has been flake8. However, in this course we are going to use ruff, which is quickly gaining popularity due to how fast it is and how quickly the developers are adding new features. (1)

                  1. Both flake8 and ruff are what is called a linter or lint tool, i.e. a static code analysis program used to flag programming errors, bugs, and styling issues.
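                  To give a feeling for what a linter flags, the snippet below is a small made-up example with two issues that ruff will typically report: an unused import (rule F401) and an ambiguous variable name (rule E741).

                  import os  # unused import, ruff reports F401\n\n\ndef total(values):\n    l = sum(values)  # ambiguous variable name \"l\", ruff reports E741\n    return l\n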
                  "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises_1","title":"\u2754 Exercises","text":"
                  1. Install ruff

                    pip install ruff\n
                  2. Run ruff on your project or part of your project

                    ruff check .  # Lint all files in the current directory (and any subdirectories)\nruff check path/to/code/  # Lint all files in `/path/to/code` (and any subdirectories).\n

                    are you PEP8 compliant or are you a normal mortal?

                  You could go and fix all the small errors that ruff is reporting. However, in practice large projects instead rely on some kind of code formatter that will automatically format your code to be PEP8 compliant. For the longest time some of the biggest formatters in Python have been black and yapf, but we are going to use ruff, which also has a built-in formatter intended as a drop-in replacement for black.
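                  As a rough sketch of what a formatter does, the (made-up) snippet below shows the same code before and after formatting; only whitespace, quoting and line breaks change, the behavior stays the same.

                  # before formatting\nconfig = { 'lr':1e-4,'batch_size':64 }\ndef scale (value,factor = 2):  return value*factor\n\n# after formatting (roughly what ruff format / black produces)\nconfig = {\"lr\": 1e-4, \"batch_size\": 64}\n\n\ndef scale(value, factor=2):\n    return value * factor\n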

                  1. Try to use ruff format to format your code

                    ruff format .  # Format all files in the current directory.\nruff format /path/to/file.py  # Format a single file.\n

                  By default ruff will apply a selection of rules when we are either checking or formatting our code. However, many more rules can be activated through configuration. If you have completed module M6 on code structure you will have encountered the pyproject.toml file, which can store both build instructions for our package and configuration of developer tools. Let's try to configure ruff using the pyproject.toml file.

                  1. One aspect that is not covered by PEP8 is how import statements in Python should be organized. If you are like most people, you place your import statements at the top of the file and they are ordered simply by when you needed them. A better practice is to introduce some clear structure in our imports. In older versions of this course we used isort to do the job, but here we are going to configure ruff to do it. In your pyproject.toml file add the following lines

                    [tool.ruff]\nselect = [\"I\"]\n

                    and try re-running ruff check and ruff format. Hopefully this should reorganize your imports to follow common practice. (1)

                    1. The common practice is to first list built-in Python packages (like os) in one block, followed by third-party dependencies (like torch) in a second block and finally imports from your own package in a third block. Each block is then put in alphabetical order (see the sketch after this exercise list).
                  2. One PEP8 styling rule that is often diverged from is the recommended line length of 79 characters, which many (including myself) consider very restrictive. If your code consists of multiple levels of indentation, you can quickly run into 79 characters being limiting. For this reason many projects increase it, often to 120 characters, which seems to be the sweet spot for how many characters fit in a coding window on a laptop. Add the line

                    line-length=120\n

                    under the [tool.ruff] section in the pyproject.toml file and rerun ruff check and ruff format on your code.

                  3. Experiment yourself with further configuration of ruff. In particular we recommend adding more rules and looking at the [tool.ruff.pydocstyle] configuration to indicate how you have styled your documentation.
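                  For reference, here is a sketch of the import ordering that the \"I\" rules enforce; my_project is a placeholder for your own package name.

                  # first block: Python standard library\nimport os\nfrom pathlib import Path\n\n# second block: third-party dependencies\nimport torch\nfrom torch import nn\n\n# third block: your own package (hypothetical name)\nfrom my_project.models import MyAwesomeModel\n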

                  "},{"location":"s2_organisation_and_version_control/good_coding_practice/#typing","title":"Typing","text":"

                  In addition to writing documentation and following a specific styling, in Python we have a third way of improving the quality of our code: through typing. Typing goes back to earlier programming languages like C, C++ etc., where data types need to be explicitly stated for variables:

                  #include <iostream>\n\nint main() {\n    int x = 5 + 6;\n    float y = 0.5;\n    std::cout << \"Hello World! \" << x << \" \" << y << std::endl;\n    return 0;\n}\n

                  This is not required in Python, but it can really improve the readability of your code: you can read directly from the code what the expected types of the input arguments and the return value are. In Python the : character has been reserved for type hints. Here is one example of adding typing to a function:

                  def add2(x: int, y: int) -> int:\n    return x+y\n

                  Here we mark that both x and y are integers and, using the arrow notation ->, we mark that the output type is also an integer. Assuming that we are also going to use the function for floats and torch.Tensors, we could improve the typing by specifying a union of types. Depending on the version of Python you are using, the syntax for this differs.

                  Python <3.10 / Python >=3.10
                  from torch import Tensor  # note it is Tensor with upper case T. This is the base class of all tensors\nfrom typing import Union\ndef add2(x: Union[int, float, Tensor], y: Union[int, float, Tensor]) -> Union[int, float, Tensor]:\n    return x+y\n
                  from torch import Tensor  # note it is Tensor with upper case T. This is the base class of all tensors\ndef add2(x: int | float | Tensor, y: int | float | Tensor) -> int | float | Tensor:\n    return x+y\n

                  Finally, since this is a very generic function that also works on numpy arrays etc., we can always default to the Any type if we are not sure about all the specific types that the function can take:

                  from typing import Any\ndef add2(x: Any, y: Any) -> Any:\n    return x+y\n

                  However, in this case we are basically in the same situation as if our function were not typed, as the type hints do not help us at all. Therefore, use Any only when necessary.
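                  If you want to keep a function generic without falling back to Any, one hedged alternative is a constrained TypeVar, which also preserves the relationship between the input and output types:

                  from typing import TypeVar\n\nfrom torch import Tensor\n\nT = TypeVar(\"T\", int, float, Tensor)  # T must be one of these three types\n\n\ndef add2(x: T, y: T) -> T:\n    return x + y\n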

                  "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises_2","title":"\u2754 Exercises","text":"

                  Exercise files

                  1. We provide a file called typing_exercise.py. Add typing everywhere in the file. Please note that you will need the following import:

                    from typing import Callable, Optional, Tuple, Union, List  # you will need all of them in your code\n

                    for it to work. This cheat sheet is a good resource on typing. We also provide typing_exercise_solution.py, but try to solve the exercise yourself.

                    typing_exercise.py
                    import torch\nfrom torch import nn\n\n\nclass Network(nn.Module):\n    \"\"\"Builds a feedforward network with arbitrary hidden layers.\n\n    Arguments:\n        input_size: integer, size of the input layer\n        output_size: integer, size of the output layer\n        hidden_layers: list of integers, the sizes of the hidden layers\n\n    \"\"\"\n\n    def __init__(self, input_size, output_size, hidden_layers, drop_p=0.5) -> None:\n        super().__init__()\n        # Input to a hidden layer\n        self.hidden_layers = nn.ModuleList([nn.Linear(input_size, hidden_layers[0])])\n\n        # Add a variable number of more hidden layers\n        layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])\n        self.hidden_layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes])\n\n        self.output = nn.Linear(hidden_layers[-1], output_size)\n\n        self.dropout = nn.Dropout(p=drop_p)\n\n    def forward(self, x):\n        \"\"\"Forward pass through the network, returns the output logits.\"\"\"\n        for each in self.hidden_layers:\n            x = nn.functional.relu(each(x))\n            x = self.dropout(x)\n        x = self.output(x)\n\n        return nn.functional.log_softmax(x, dim=1)\n\n\ndef validation(model, testloader, criterion):\n    \"\"\"Validation pass through the dataset.\"\"\"\n    accuracy = 0\n    test_loss = 0\n    for images, labels in testloader:\n        images = images.resize_(images.size()[0], 784)\n\n        output = model.forward(images)\n        test_loss += criterion(output, labels).item()\n\n        ## Calculating the accuracy\n        # Model's output is log-softmax, take exponential to get the probabilities\n        ps = torch.exp(output)\n        # Class with highest probability is our predicted class, compare with true label\n        equality = labels.data == ps.max(1)[1]\n        # Accuracy is number of correct predictions divided by all predictions, just take the mean\n        accuracy += equality.type_as(torch.FloatTensor()).mean()\n\n    return test_loss, accuracy\n\n\ndef train(model, trainloader, testloader, criterion, optimizer=None, epochs=5, print_every=40) -> None:\n    \"\"\"Train a PyTorch Model.\"\"\"\n    if optimizer is None:\n        optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\n    steps = 0\n    running_loss = 0\n    for e in range(epochs):\n        # Model in training mode, dropout is on\n        model.train()\n        for images, labels in trainloader:\n            steps += 1\n\n            # Flatten images into a 784 long vector\n            images.resize_(images.size()[0], 784)\n\n            optimizer.zero_grad()\n\n            output = model.forward(images)\n            loss = criterion(output, labels)\n            loss.backward()\n            optimizer.step()\n\n            running_loss += loss.item()\n\n            if steps % print_every == 0:\n                # Model in inference mode, dropout is off\n                model.eval()\n\n                # Turn off gradients for validation, will speed up inference\n                with torch.no_grad():\n                    test_loss, accuracy = validation(model, testloader, criterion)\n\n                print(\n                    f\"Epoch: {e + 1}/{epochs}.. \",\n                    f\"Training Loss: {running_loss / print_every:.3f}.. \",\n                    f\"Test Loss: {test_loss / len(testloader):.3f}.. 
\",\n                    f\"Test Accuracy: {accuracy / len(testloader):.3f}\",\n                )\n\n                running_loss = 0\n\n                # Make sure dropout and grads are on for training\n                model.train()\n
                    Solution typing_exercise_solution.py
                    from __future__ import annotations\n\nfrom collections.abc import Callable\n\nimport torch\nfrom torch import nn\n\n\nclass Network(nn.Module):\n    \"\"\"Builds a feedforward network with arbitrary hidden layers.\n\n    Arguments:\n        input_size: integer, size of the input layer\n        output_size: integer, size of the output layer\n        hidden_layers: list of integers, the sizes of the hidden layers\n\n    \"\"\"\n\n    def __init__(\n        self,\n        input_size: int,\n        output_size: int,\n        hidden_layers: list[int],\n        drop_p: float = 0.5,\n    ) -> None:\n        super().__init__()\n        # Input to a hidden layer\n        self.hidden_layers = nn.ModuleList([nn.Linear(input_size, hidden_layers[0])])\n\n        # Add a variable number of more hidden layers\n        layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])\n        self.hidden_layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes])\n\n        self.output = nn.Linear(hidden_layers[-1], output_size)\n\n        self.dropout = nn.Dropout(p=drop_p)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass through the network, returns the output logits.\"\"\"\n        for each in self.hidden_layers:\n            x = nn.functional.relu(each(x))\n            x = self.dropout(x)\n        x = self.output(x)\n\n        return nn.functional.log_softmax(x, dim=1)\n\n\ndef validation(\n    model: nn.Module,\n    testloader: torch.utils.data.DataLoader,\n    criterion: Callable | nn.Module,\n) -> tuple[float, float]:\n    \"\"\"Validation pass through the dataset.\"\"\"\n    accuracy = 0\n    test_loss = 0\n    for images, labels in testloader:\n        images = images.resize_(images.size()[0], 784)\n\n        output = model.forward(images)\n        test_loss += criterion(output, labels).item()\n\n        ## Calculating the accuracy\n        # Model's output is log-softmax, take exponential to get the probabilities\n        ps = torch.exp(output)\n        # Class with highest probability is our predicted class, compare with true label\n        equality = labels.data == ps.max(1)[1]\n        # Accuracy is number of correct predictions divided by all predictions, just take the mean\n        accuracy += equality.type_as(torch.FloatTensor()).mean().item()\n\n    return test_loss, accuracy\n\n\ndef train(\n    model: nn.Module,\n    trainloader: torch.utils.data.DataLoader,\n    testloader: torch.utils.data.DataLoader,\n    criterion: Callable | nn.Module,\n    optimizer: None | torch.optim.Optimizer = None,\n    epochs: int = 5,\n    print_every: int = 40,\n) -> None:\n    \"\"\"Train a PyTorch Model.\"\"\"\n    if optimizer is None:\n        optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\n    steps = 0\n    running_loss = 0\n    for e in range(epochs):\n        # Model in training mode, dropout is on\n        model.train()\n        for images, labels in trainloader:\n            steps += 1\n\n            # Flatten images into a 784 long vector\n            images.resize_(images.size()[0], 784)\n\n            optimizer.zero_grad()\n\n            output = model.forward(images)\n            loss = criterion(output, labels)\n            loss.backward()\n            optimizer.step()\n\n            running_loss += loss.item()\n\n            if steps % print_every == 0:\n                # Model in inference mode, dropout is off\n                model.eval()\n\n                # Turn off gradients for validation, will speed up inference\n           
     with torch.no_grad():\n                    test_loss, accuracy = validation(model, testloader, criterion)\n\n                print(\n                    f\"Epoch: {e + 1}/{epochs}.. \",\n                    f\"Training Loss: {running_loss / print_every:.3f}.. \",\n                    f\"Test Loss: {test_loss / len(testloader):.3f}.. \",\n                    f\"Test Accuracy: {accuracy / len(testloader):.3f}\",\n                )\n\n                running_loss = 0\n\n                # Make sure dropout and grads are on for training\n                model.train()\n
                  2. mypy is what is called a static type checker. If you are using typing in your code, then a static type checker can help you find common mistakes. mypy does not run your code, but it scans it and checks that the types you have given are compatible. Install mypy

                    pip install mypy\n
                  3. Try to run mypy on the typing_exercise.py file

                    mypy typing_exercise.py\n

                    If you have added the correct type hints in the previous exercise, you should get no errors. If not, mypy should tell you where your types are incompatible. The sketch below shows the kind of mistake mypy catches.
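                    Here is a small made-up example with a deliberate type error; Python itself will happily run it (and concatenate the two strings), but mypy will complain that a str is passed where an int is expected.

                    def add2(x: int, y: int) -> int:\n    return x + y\n\n\nresult = add2(\"hello\", \"world\")  # runs in Python, but mypy flags the incompatible argument types\n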

                  "},{"location":"s2_organisation_and_version_control/good_coding_practice/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                  1. According to PEP8 what is wrong with the following code?

                    class myclass(nn.Module):\n    def TrainNetwork(self, X, y):\n        ...\n
                    Solution

                    According to PEP8, classes should follow the CapWords convention, meaning that the first letter of each word in the class name should be capitalized. Thus myclass should be MyClass. On the other hand, functions and methods should be all lowercase with words separated by underscores. Thus TrainNetwork should be train_network.

                  2. What would be the type of argument x for a function def f(x): if it should support the following inputs

                    x1 = [1, 2, 3, 4]\nx2 = (1, 2, 3, 4)\nx3 = None\nx4 = {1: \"1\", 2: \"2\", 3: \"3\", 4: \"4\"}\n
                    Solution

                    The easy solution would be to do def f(x: Any). But instead we could also go with:

                    def f(x: None | Tuple[int, ...] | List[int] | Dict[int, str]):\n

                    alternatively, we could also do

                    def f(x: None | Iterable[int]):\n

                    because list, tuple and dict are all iterables and can therefore be covered by one type (in this specific case).

                  This ends the module on coding style. We again want to emphasize that good coding style is more about having a consistent style than strictly following PEP8. A good example of this is Google, which has its own style guide for Python. This guide does not match PEP8 exactly, but it makes sure that different teams within Google working on different projects still largely follow the same style, so if a project is handed over from one team to another, the style at least will not be a problem.

                  "},{"location":"s3_reproducibility/","title":"Reproducibility","text":"

                  Slides

                  • Learn how to create reproducible computing environments using docker and how to use them to run your code.

                    M10: Docker

                  • Learn how to use hydra to manage configuration files and how to integrate it with your code.

                    M11: Config Files

                  Today is all about reproducibility - one of those concepts that everyone agrees is very important and that something should be done about, but the reality is that it is very hard to ensure full reproducibility. The last sessions have already touched a bit on how tools like conda and code organization can help make code more reproducible. Today we are going all the way to ensure that our scripts and our computing environment are fully reproducible.

                  "},{"location":"s3_reproducibility/#why-does-reproducibility-matter","title":"Why does reproducibility matter","text":"

                  Reproducibility is closely related to the scientific method:

                  Observe -> Question -> Hypotheses -> Experiment -> Conclude -> Result -> Observe -> ...

                  Not having reproducible experiments essentially breaks the cycle between doing experiments and making conclusions. If experiments are not reproducible, then we cannot expect that others will arrive at the same conclusions as ourselves. As machine learning experiments are fundamentally no different from doing chemical experiments in a laboratory, we should be equally careful in making sure our environments are reproducible (think of your laptop as your laboratory).

                  Secondly, if we focus on why reproducibility matters especially in machine learning, it is part of the bigger challenge of making sure that machine learning is trustworthy.

                  Many different aspects are needed if trustworthy machine learning is ever going to be a reality. We need robustness of our pipelines so we can trust that they do not fail under heavy load. We need integrity to make sure that pipelines are only deployed if they are of high quality. We need explainability to make sure that we understand what our machine learning models are doing, so they are not just black boxes. We need reproducibility to make sure that the results of our models can be reproduced over and over again. Finally, we need fairness to make sure that our models are not biased toward specific populations. Figure inspired by this paper.

                  Trustworthy ML is the idea that machine learning agents can be trusted. Take the example of a machine learning agent being responsible for medical diagnoses. It is very clear that we need to be able to trust that the agent gives us the correct diagnosis for the system to work in practice. Reproducibility plays a big role here, because without it we cannot be sure that the same agent deployed at two different hospitals will give the same diagnosis (given the same input).

                  Learning objectives

                  The learning objectives of this session are:

                  • To understand the importance of reproducibility in computer science
                  • To be able to use docker to create a reproducible container, including how to build them from scratch
                  • Understand different ways of configuring your code and how to use hydra to integrate with config files
                  "},{"location":"s3_reproducibility/config_files/","title":"M11 - Config Files","text":""},{"location":"s3_reproducibility/config_files/#config-files","title":"Config files","text":"

                  With docker we can make sure that our compute environment is reproducible, but that does not mean that all our experiments magically become reproducible. There are other factors that are important for creating reproducible experiments.

                  In this paper (highly recommended read) the authors tried to reproduce the results of 255 papers and tried to figure out which factors were significant for success. One of those factors was \"Hyperparameters Specified\", i.e. whether or not the authors of the paper had precisely specified the hyperparameters used to run the experiments. It should come as no surprise that this can be a determining factor for reproducibility; however, it is not a given that hyperparameters are always well specified.

                  "},{"location":"s3_reproducibility/config_files/#configure-experiments","title":"Configure experiments","text":"

                  There is really no way around it: deep learning contains a lot of hyperparameters. In general, a hyperparameter is any parameter that affects the learning process (e.g. the weights of a neural network are not hyperparameters because they are a consequence of the learning process). The problem with having many hyperparameters to control in your code is that if you are not careful about structuring them, it may be hard, after running an experiment, to figure out which hyperparameters were actually used. Lack of proper configuration management can cause serious problems with reliability, uptime, and the ability to scale a system.

                  One of the most basic ways of structuring hyperparameters is to just put them directly into your train.py script in some object:

                  class my_hp:\n    batch_size = 64\n    lr = 1e-4\n    other_hp = 12345\n\n# easy access to them\ndl = DataLoader(dataset, batch_size=my_hp.batch_size)\n

                  The problem here is that configuration is not easy. Each time you want to run a new experiment, you basically have to change the script. If you run the code multiple times without committing the changes in between, then the exact hyperparameter configuration for some experiments may be lost. Alright, with this in mind you change strategy to using an argument parser, i.e. run experiments like this

                  python train.py --batch_size 256 --learning_rate 1e-4 --other_hp 12345\n
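                  For reference, a minimal sketch of how such a train.py could expose its hyperparameters with argparse (the argument names are just for illustration):

                  import argparse\n\nparser = argparse.ArgumentParser(description=\"Train a model\")\nparser.add_argument(\"--batch_size\", type=int, default=64)\nparser.add_argument(\"--learning_rate\", type=float, default=1e-4)\nparser.add_argument(\"--other_hp\", type=int, default=12345)\nargs = parser.parse_args()\n\nprint(f\"Training with batch_size={args.batch_size} and learning_rate={args.learning_rate}\")\n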

                  This at least solves the problem with configurability. However, we can again end up losing experiments if we are not careful.

                  What we really want is some way to easily configure our experiments where the hyperparameters are systematically saved with the experiment. For this we turn our attention to Hydra, a configuration tool that is based around writing config files to keep track of hyperparameters. Hydra operates on top of OmegaConf, which is a yaml-based hierarchical configuration system.

                  A simple yaml configuration file could look like

                  #config.yaml\nhyperparameters:\n  batch_size: 64\n  learning_rate: 1e-4\n

                  with the corresponding Python code for loading the file

                  from omegaconf import OmegaConf\n# loading\nconfig = OmegaConf.load('config.yaml')\n\n# accessing in two different ways\ndl = DataLoader(dataset, batch_size=config.hyperparameters.batch_size)\noptimizer = torch.optim.Adam(model.parameters(), lr=config['hyperparameters']['learning_rate'])\n

                  or using hydra for loading the configuration

                  import hydra\n\n@hydra.main(config_name=\"basic.yaml\")\ndef main(cfg):\n    print(cfg.hyperparameters.batch_size, cfg.hyperparameters.learning_rate)\n\nif __name__ == \"__main__\":\n    main()\n

                  The idea behind refactoring our hyperparameters into .yaml files is that we disentangle the model configuration from the model itself. This also makes it easier to do version control of the configuration because we have it in a separate file.
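                  As a hedged sketch of why this helps, OmegaConf can merge a version-controlled base file with command-line overrides and save the exact configuration a run ended up using (the file names are examples):

                  from omegaconf import OmegaConf\n\nbase = OmegaConf.load(\"config.yaml\")  # the version-controlled defaults\noverrides = OmegaConf.from_cli()  # e.g. hyperparameters.batch_size=128 on the command line\nconfig = OmegaConf.merge(base, overrides)  # overrides take precedence over the defaults\n\nOmegaConf.save(config, \"used_config.yaml\")  # record exactly what this run used\n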

                  "},{"location":"s3_reproducibility/config_files/#exercises","title":"\u2754 Exercises","text":"

                  Exercise files

                  The main idea behind the exercises is to take a single script (that we provide) and use Hydra to make sure that everything gets correctly logged such that you would be able to exactly report to others how each experiment was configured. In the provided script, the hyperparameters are hardcoded into the code and your job will be to separate them out into a configuration file.

                  Note that we provide a solution (in the vae_solution folder) that can help you get through the exercise, but try to look online for your answers before looking at the solution. Remember: it's not about the result, it's about the journey.

                  1. Start by installing hydra:

                    pip install hydra-core\n

                    Remember to add it to your requirements.txt file.

                  2. Next take a look at the vae_mnist.py and model.py file and understand what is going on. It is a model we will revisit during the course.

                  3. Identify the key hyperparameters of the script. Some of them should be easy to find, but at least 3 have made it into the core part of the code. One essential hyperparameter is also not included in the script but is needed for it to be completely reproducible (HINT: the weights of any neural network are initialized at random).

                    Solution

                    From the top of the file batch_size, x_dim and hidden_dim can be found as hyperparameters. Looking through the code it can be seen that the latent_dim of the encoder and decoder, the lr of the optimizer and the epochs in the training loop are also hyperparameters. Finally, the seed is not included in the script but is needed to make the script fully reproducible, e.g. torch.manual_seed(seed).

                  4. Write a configuration file config.yaml where you write down the hyperparameters that you have found

                  5. Get the script running by loading the configuration file inside your script (using hydra) that incorporates the hyperparameters into the script. Note: you should only edit the vae_mnist.py file and not the model.py file.

                  6. Run the script

                  7. By default hydra will write the results to an outputs folder, with a sub-folder for the day the experiment was run and a further sub-folder for the time it was started. Inspect your run by going over each file that hydra has generated and check that the information has been logged. Can you find the hyperparameters?

                  8. Hydra also allows for dynamically changing and adding parameters on the fly from the command-line:

                    1. Try changing one parameter from the command-line

                      python vae_mnist.py hyperparameters.seed=1234\n
                    2. Try adding one parameter from the command-line

                      python vae_mnist.py +experiment.stuff_that_i_want_to_add=42\n
                  9. By default the file vae_mnist.log should be empty, meaning that whatever you printed to the terminal did not get picked up by Hydra. This is because Hydra under the hood makes use of the native Python logging package. This means that to also save all printed output from the script we need to replace all calls to print with log.info

                    1. Create a logger in the script:

                      import logging\nlog = logging.getLogger(__name__)\n
                    2. Exchange all calls to print with calls to log.info

                    3. Try re-running the script and make sure that the output printed to the terminal also gets saved to the vae_mnist.log file

                  10. Make sure that your script is fully reproducible. To check this you will need two runs of the script to compare. Then run the reproducibility_tester.py script as

                    python reproducibility_tester.py path/to/run/1 path/to/run/2\n

                    the script will go over the trained weights to see if they match and check that the hyperparameters were the same. Note: for the script to work, the weights should be saved to a file called trained_model.pt (this is the default of the vae_mnist.py script, so this is only relevant if you have changed how the weights are saved)

                  11. Make a new experiment using a new configuration file where you have changed a hyperparameter of your own choice. You are not allowed to change the configuration file in the script but should instead be able to provide it as an argument when launching the script e.g. something like

                    python vae_mnist.py experiment=exp2\n

                    We recommend that you use a file structure like this

                    |--conf\n|  |--config.yaml\n|  |--experiments\n|     |--exp1.yaml\n|     |--exp2.yaml\n|--my_app.py\n
                  12. Finally, an awesome feature of hydra is the instantiate feature. This allows you to define a configuration file that can be used to directly instantiate objects in Python. Try to create a configuration file that can be used to instantiate the Adam optimizer in the vae_mnist.py script.

                    Solution

                    The configuration file could look like this

                    optimizer:\n  _target_: torch.optim.Adam\n  lr: 1e-3\n  betas: [0.9, 0.999]\n  eps: 1e-8\n  weight_decay: 0\n

                    and the python code to load the configuration file and instantiate the optimizer could look like this

                    import hydra\nfrom torch import nn\n\n@hydra.main(config_name=\"adam.yaml\")\ndef main(cfg):\n    model = nn.Linear(10, 1)  # the optimizer needs some parameters to optimize\n    optimizer = hydra.utils.instantiate(cfg.optimizer, params=model.parameters())\n    print(optimizer)\n\nif __name__ == \"__main__\":\n    main()\n

                    This will print the optimizer object that is created from the configuration file.

                  "},{"location":"s3_reproducibility/config_files/#final-exercise","title":"Final exercise","text":"

                  Make your MNIST code reproducible! Apply what you have just done to the simple script to your MNIST code. The only requirement is that this time you use multiple configuration files, meaning that you should have at least one model_conf.yaml file and one training_conf.yaml file that separate out the hyperparameters that have to do with the model definition from those that have to do with the training. You can also choose to work with even more complex config setups: in the image below the configuration has two layers, such that we can individually specify hyperparameters belonging to a specific model architecture and hyperparameters for each individual optimizer that we may try.

                  Image credit"},{"location":"s3_reproducibility/docker/","title":"M10 - Docker","text":""},{"location":"s3_reproducibility/docker/#docker","title":"Docker","text":"

                  Core Module

                  Image credit

                  While the above picture may seem silly at first, it is actually pretty close to how Docker came into existence. A big part of creating an MLOps pipeline is being able to reproduce it. Reproducibility goes beyond versioning our code with git and using conda environments to keep track of our Python installations. To truly achieve reproducibility, we need to capture system-level components such as:

                  • Operating system
                  • Software dependencies (other than Python packages)

                  Docker provides this kind of system-level reproducibility by creating isolated program dependencies. In addition to providing reproducibility, one of the key features of Docker is scalability, which is important when we later discuss deployment. Because Docker ensures system-level reproducibility, it does not (conceptually) matter whether we try to start our program on a single machine or on 1000 machines at once.

                  "},{"location":"s3_reproducibility/docker/#docker-overview","title":"Docker Overview","text":"

                  Docker has three main concepts: Dockerfile, Docker image, and Docker container:

                  • A Dockerfile is a basic text document that contains all the commands a user could call on the command line to run an application. This includes installing dependencies, pulling data from online storage, setting up code, and specifying commands to run (e.g., python train.py).

                  • Running, or more correctly, building a Dockerfile will create a Docker image. An image is a lightweight, standalone/containerized, executable package of software that includes everything (application code, libraries, tools, dependencies, etc.) necessary to make an application run.

                  • Actually running an image will create a Docker container. This means that the same image can be launched multiple times, creating multiple containers.

                  The exercises today will focus on how to construct the actual Dockerfile, as this is the first step to constructing your own container.

                  "},{"location":"s3_reproducibility/docker/#docker-sharing","title":"Docker Sharing","text":"

                  The whole point of using Docker is that sharing applications becomes much easier. In general, we have two options:

                  • After creating the Dockerfile, we can simply commit it to GitHub (it's just a text file) and then ask other users to simply build the image themselves.

                  • After building the image ourselves, we can choose to upload it to an image registry such as Docker Hub, where others can get our image by simply running docker pull, making them able to instantaneously run it as a container, as shown in the figure below:

                  Image credit"},{"location":"s3_reproducibility/docker/#exercises","title":"\u2754 Exercises","text":"

                  In the following exercises, we guide you through how to build a Dockerfile for your MNIST repository that will make the training and prediction a self-contained application. Please make sure that you somewhat understand each step and do not just copy the exercise. Also, note that you probably need to execute the exercises from an elevated terminal, i.e. with administrative privileges.

                  The exercises today are only an introduction to docker and some of the steps are going to be unoptimized from a production point of view. For example, we often want to keep the size of the docker image as small as possible, which we are not focusing on in these exercises.

                  If you are using VS Code then we recommend installing the VS Code docker extension, which makes it easy to get an overview of which images have been built and which containers are running. Additionally, the extension named Dev Containers may also be beneficial for you to download.

                  1. Start by installing docker. How much trouble you need to go through depends on your operating system. For Windows and Mac we recommend installing Docker Desktop, which comes with a graphical user interface (GUI) for quickly viewing the docker images and docker containers currently built/in use. Windows users that have not installed WSL yet are going to have to do it now (as docker needs it as a backend for starting virtual machines), but you do not need to install docker inside WSL. After installing docker we recommend that you restart your laptop.

                  2. Try running the following to confirm that your installation is working:

                    docker run hello-world\n

                    which should give the message

                    Hello from Docker!\nThis message shows that your installation appears to be working correctly.\n
                  3. Next, let's try to download an image from Docker Hub. Download the busybox image:

                    docker pull busybox\n

                    which is a very small (1-5Mb) containerized application that contains the most essential GNU file utilities, shell utilities, etc.

                  4. After pulling the image, write

                    docker images\n

                    which should show you all available images. You should see the busybox image that we just downloaded.

                  5. Let's try to run this image

                    docker run busybox\n

                    You will see that nothing happens! The reason for that is we did not provide any commands to docker run. We essentially just ask it to start the busybox virtual machine, do nothing, and then close it again. Now, try again, this time with

                    docker run busybox echo \"hello from busybox\"\n

                    Note how fast this process is. In just a few seconds, Docker is able to start a virtual machine, execute a command, and kill it afterward.

                  6. Try running

                    docker ps\n

                    What does this command do? What if you add -a to the end?

                  7. If we want to run multiple commands within the virtual machine, we can start it in interactive mode

                    docker run -it busybox\n

                    This can be a great way to investigate what the filesystem of our virtual machine looks like.

                  8. As you may have already noticed by now, each time we execute docker run, we can still see small remnants of the containers using docker ps -a. These stray containers can end up taking up a lot of disk space. To remove them, use docker rm where you provide the container ID that you want to delete

                    docker rm <container_id>\n
                  9. Let's now move on to trying to construct a Dockerfile ourselves for our MNIST project. Create a file called trainer.dockerfile. The intention is that we want to develop one Dockerfile for running our training script and one for doing predictions.

                  10. Instead of starting from scratch, we nearly always want to start from some base image. For this exercise, we are going to start from a simple python image. Add the following to your Dockerfile

                    # Base image\nFROM python:3.9-slim\n
                  11. Next, we are going to install some essentials in our image. The essentials more or less consist of a Python installation. These instructions may seem familiar if you are using Linux:

                    # Install Python\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n
                  12. The previous two steps are common for any Docker application where you want to run Python. All the remaining steps are application-specific (to some degree):

                    1. Let's copy over our application (the essential parts) from our computer to the container:

                      COPY requirements.txt requirements.txt\nCOPY pyproject.toml pyproject.toml\nCOPY <project-name>/ <project-name>/\nCOPY data/ data/\n

                      Remember that we only want the essential parts to keep our Docker image as small as possible. Why do we need each of these files/folders to run training in our Docker container?

                    2. Let's set the working directory in our container and add commands that install the dependencies (1):

                      1. We split the installation into two steps so that Docker can cache our project dependencies separately from our application code. This means that if we change our application code, we do not need to reinstall all the dependencies. This is a common strategy for Docker images.

                        As an alternative, you can use RUN make requirements if you have a Makefile that installs the dependencies. Just remember to also copy over the Makefile into the Docker image.

                      WORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\nRUN pip install . --no-deps --no-cache-dir\n

                      The --no-cache-dir is quite important. Can you explain what it does and why it is important in relation to Docker?

                    3. Finally, we are going to name our training script as the entrypoint for our Docker image. The entrypoint is the application that we want to run when the image is being executed:

                      ENTRYPOINT [\"python\", \"-u\", \"<project_name>/train_model.py\"]\n

                      The -u option here makes sure that any output from our script, e.g. any print(...) statements, gets redirected to our terminal. If it is not included, you would need to use docker logs to inspect your run.

                  13. We are now ready to build our Dockerfile into a Docker image.

                    docker build -f trainer.dockerfile . -t trainer:latest\n
                    MAC M1/M2 users

                    In general, Docker images are built for a specific platform. For example, if you are using a Mac with an M1/M2 chip, then you are running on an ARM architecture. If you are using a Windows or Linux machine, then you are running on an AMD64 architecture. This is important to know when building Docker images. Thus, Docker images you build may not work on other platforms than the ones you build on. You can specify which platform you want to build for by adding the --platform argument to the docker build command:

                    docker build --platform linux/amd64 -f trainer.dockerfile . -t trainer:latest\n

                    and also when running the image:

                    docker run --platform linux/amd64 trainer:latest\n

                    Note that this will significantly increase the build and run time of your Docker image when running locally, because Docker will need to emulate the other platform. In general, for the exercises today, you should not need to specify the platform, but be aware of this if you are building Docker images on your own.

                    Please note that here we are providing two extra arguments to docker build. The -f trainer.dockerfile . (the dot is important to remember) indicates which Dockerfile we want to build (except if you named it just Dockerfile) and the -t trainer:latest is the respective name and tag that we see afterwards when running docker images (see image below). Please note that building a Docker image can take a couple of minutes.

                    Docker images and space

                    Docker images can take up a lot of space on your computer, especially the Docker images we are trying to build because PyTorch is a huge dependency. If you are running low on space, you can try to

                    docker system prune\n

                    Alternatively, you can manually delete images using docker rmi {image_name}:{image_tag}.

                  14. Try running docker images and confirm that you get output similar to the one above. If you succeed with this, then try running the docker image

                    docker run --name experiment1 trainer:latest\n

                    you should hopefully see your training starting. Please note that we can start as many containers as we want at the same time by giving them all different names using the --name tag.

                    1. You are most likely going to rebuild your Docker image multiple times, either due to an implementation error or the addition of new functionality. Therefore, instead of watching pip suffer through downloading torch for the 20th time, you can reuse the cache from the last time the Docker image was built. To do this, replace the line in your Dockerfile that installs your requirements with:

                      RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt\n

                      which mounts a persistent pip cache into the build so that already downloaded packages can be reused the next time the image is built. For building the image, you need to have the BuildKit feature enabled. If you have Docker version v23.0 or later (you can check this by running docker version), then this is enabled by default. Otherwise, you need to enable it by setting the environment variable DOCKER_BUILDKIT=1 before building the image.

                      Try changing your Dockerfile and rebuilding the image. You should see that the build process is much faster.

                  15. Remember, if you ever are in doubt about how files are organized inside a Docker image, you always have the option to start the image in interactive mode:

                    docker run -it --entrypoint sh {image_name}:{image_tag}\n
                  16. When your training has completed you will notice that any files that are created when running your training script are not present on your laptop (for example if your script is saving the trained model to a file). This is because the files were created inside your container (which is a separate little machine). To get the files you have two options:

                    1. If you already have a completed run, then you can use

                      docker cp\n

                      to copy files between your container and your laptop. For example, to copy a file called trained_model.pt from a folder inside the container you would do:

                      docker cp {container_name}:{dir_path}/{file_name} {local_dir_path}/{local_file_name}\n

                      Try this out.

                    2. A much more efficient strategy is to mount a volume that is shared between the host (your laptop) and the container. This can be done with the -v option for the docker run command. For example, if we want to automatically get the trained_model.pt file after running our training script we could simply execute the container as

                      docker run --name {container_name} -v %cd%/models:/models/ trainer:latest\n

                      this command mounts our local models folder as a corresponding models folder in the container. Any file saved by the container to this folder will be synchronized back to our host machine. Try this out! Note that if you want to mount multiple files/folders you simply repeat the -v option (if in doubt about the file organization in the container, try the next exercise first). Also note that the %cd% needs to change depending on your OS; see this page for help.

                  17. With training done we also need to write an application for prediction. Create a new Dockerfile called predict.dockerfile. This file should call your <project_name>/models/predict_model.py script instead. This image will need some trained model weights to work. Feel free to either include these during the build process or mount them afterwards. When you have created the file, try to build and run it to confirm that it works. Hint: if you are passing in the model checkpoint and prediction data as arguments to your script, your docker run probably needs to look something like

                    docker run --name predict --rm \\\n    -v %cd%/trained_model.pt:/models/trained_model.pt \\  # mount trained model file\n    -v %cd%/data/example_images.npy:/example_images.npy \\  # mount data we want to predict on\n    predict:latest \\\n    ../../models/trained_model.pt \\  # argument to script, path relative to script location in container\n    ../../example_images.npy\n
                  18. (Optional, requires GPU support) By default, a virtual machine created by docker only has access to your CPU and not your GPU. While you do not necessarily have a laptop with a GPU that supports training neural networks (e.g. one from Nvidia), it is beneficial to understand how to construct a docker image that can take advantage of a GPU, in case you in the future run on a machine that has one (e.g. in the cloud). It does take a bit more work, but many of the steps are similar to building a normal docker image.

                    1. There are three prerequisites for working with Nvidia GPU-accelerated docker containers. First, you need to have the Docker Engine installed (already taken care of), second you need an Nvidia GPU with updated GPU drivers, and finally you need to have the Nvidia container toolkit installed. The last part you most likely have not installed yet and need to do now. Some distros of Linux have known problems with the installation process, so you may have to search through the known issues in the nvidia-docker repository to find a solution.

                    2. To test that everything is working start by pulling a relevant Nvidia docker image. In my case this is the correct image:

                      docker pull nvidia/cuda:11.0.3-base-ubuntu20.04\n

                      but it may differ based on what CUDA version you have. You can find all the different official Nvidia images here. After pulling the image, try running the nvidia-smi command inside a container based on the image you just pulled. It should look something like this:

                      docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi\n

                      and should show an image like below:

                      If it does not work, try redoing the steps.

                    3. We should hopefully have a working setup now for running Nvidia accelerated docker containers. The next step is to get PyTorch inside our container, such that our PyTorch implementation also correctly identifies the GPU. Luckily for us, Nvidia provides a set of docker images for GPU-optimized software for AI, HPC and visualizations through their NGC Catalog. The containers that have to do with PyTorch can be seen here. Try pulling the latest:

                      docker pull nvcr.io/nvidia/pytorch:22.07-py3\n

                      It may take some time because the NGC images include a lot of other software for optimizing PyTorch applications. It may be possible for you to find other images for running GPU-accelerated applications that have a smaller memory footprint, but NGC is the recommended and supported way.

                    4. Let's test that this container works:

                      docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.07-py3\n

                      this should run the container in interactive mode attached to your current terminal. Try opening python in the container and try writing:

                      import torch\nprint(torch.cuda.is_available())\n

                      which hopefully should return True.

                    5. Finally, we need to incorporate all this into our already developed docker files for our application. This is also fairly easy as we just need to change our FROM statement at the beginning of our docker file:

                      FROM python:3.7-slim\n

                      change to

                      FROM  nvcr.io/nvidia/pytorch:22.07-py3\n

                      Try doing this for one of your Dockerfiles, build the image and run the container. Remember to check that your application is using the GPU by printing torch.cuda.is_available().

                  19. (Optional) Another way you can use Dockerfiles in your day-to-day work is for Dev-containers. Developer containers allow you to develop code directly inside a container, making sure that your code is running in the same environment as it will when deployed. This is especially useful if you are working on a project that has a lot of dependencies that are hard to install on your local machine. Setup instructions for VS Code and PyCharm can be found here (should be simple since we have already installed Docker):

                    • VS Code
                    • PyCharm

                    We will focus on the VS Code setup here.

                    1. First, install the Remote - Containers extension.

                    2. Create a .devcontainer folder in your project root and create a Dockerfile inside it. We will keep this file very barebones for now, so let's just define a base installation of Python:

                      FROM python:3.11-slim-buster\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n
                    3. Create a devcontainer.json file in the .devcontainer folder. This file should look something like this:

                      {\n    \"name\": \"my_working_env\",\n    \"dockerFile\": \"Dockerfile\",\n    \"postCreateCommand\": \"pip install -r requirements.txt\"\n}\n

                      This file tells VS Code that we want to use the Dockerfile that we just created and that we want to install our Python dependencies after the container has been created.

                    4. After creating these files, you should be able to open the command palette in VS Code (F1) and search for the option Remote-Containers: Reopen in Container or Remote-Containers: Rebuild and Reopen in Container. Choose either of these options.

                      This will start a new VS Code instance inside a Docker container. You should be able to see this in the bottom left corner of your VS Code window. You should also be able to see that the Python interpreter has changed to the one inside the container.

                      You are now ready to start developing inside the container. Try opening a terminal and run python and import torch to confirm that everything is working.

                  20. (Optional) In M8 on Data version control you learned about the framework dvc for version controlling data. A natural question at this point is how to incorporate dvc into our Docker image. We need to do two things:

                    • Make sure that dvc has all the correct files to pull data from our remote storage
                    • Make sure that dvc has the correct credentials to pull data from our remote storage

                    We are going to assume that dvc (and any dvc extension needed) is part of your requirements.txt file and that it is already being installed in a RUN pip install -r requirements.txt command in your Dockerfile. If not, then you need to add it.

                    1. Add the following lines to your Dockerfile

                      RUN dvc init --no-scm\nCOPY .dvc/config .dvc/config\nCOPY *.dvc .dvc/\nRUN dvc config core.no_scm true\nRUN dvc pull\n

                      The first line initializes dvc in the Docker image. The --no-scm option is needed because dvc can normally only be initialized inside a git repository, but this option allows initializing it without one. The second and third lines copy over the dvc config file and the dvc metadata files that are needed to pull data from your remote storage. The fourth line persists the no-scm setting in the config so that later dvc commands also run without git, and the last line pulls the data.

                    2. If your data is not public, we need to provide credentials in some way to pull the data. For now we are going to do it in a not-so-secure way. When dvc first connected to your drive, a credential file was created. This file is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json, where $CACHE_HOME depends on your operating system:

                      • macOS: ~/Library/Caches

                      • Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)

                      • Windows: {user}/AppData/Local

                      Find the file. The content should look similar to this (only some fields are shown):

                      {\n    \"access_token\": ...,\n    \"client_id\": ...,\n    \"client_secret\": ...,\n    \"refresh_token\": ...,\n    ...\n}\n

                      We are going to copy the file into our Docker image. This, of course, is not a secure way of doing it, but it is the easiest way to get started. As long as you are not sharing your Docker image with anyone else, then it is fine. Add the following lines to your Dockerfile before the RUN dvc pull command:

                      COPY <path_to_default.json> default.json\nRUN dvc remote modify myremote --local gdrive_service_account_json_file_path default.json\n

                      where <path_to_default.json> is the path to the default.json file that you just found. The last line tells dvc to use the default.json file as the credentials for pulling data from your remote storage. You can confirm that this works by running dvc pull in your Docker image.

                  "},{"location":"s3_reproducibility/docker/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                  1. What is the difference between a docker image and a docker container?

                    Solution

                    A Docker image is a template for a Docker container. A Docker container is a running instance of a Docker image. A Docker image is a static file, while a Docker container is a running process.

                  2. What are the 3 steps involved in containerizing an application?

                    Solution
                    1. Write a Dockerfile that includes your app (including the commands to run it) and its dependencies.
                    2. Build the image using the Dockerfile you wrote.
                    3. Run the container using the image you've built.
                  3. What advantage is there to running your application inside a Docker container instead of running the application directly on your machine?

                    Solution

                    Running inside a Docker container gives you a consistent and independent environment for your application. This means that you can be sure that your application will run the same way on your machine as it will on another machine. Thus, Docker gives the ability to abstract away the differences between different machines.

                  4. A Docker container is built from a series of layers that are stacked on top of each other. This should be clear if you look at the output when building a Docker image. What is the advantage of this?

                    Solution

                    The advantage is efficiency and reusability. When a change is made to a Docker image, only the layer(s) that are changed need to be updated. For example, if you update the application code in your Docker image, which usually is the last layer, then only that layer needs to be rebuilt, making the process much faster. Additionally, if you have multiple Docker images that share the same base image, then the base image only needs to be downloaded once.

                  This covers the absolute minimum you should know about Docker to get a working image and container. If you want to really deep dive into this topic, you can find a copy of the Docker Cookbook by S\u00e9bastien Goasguen in the literature folder.

                  If you are actively going to be using Docker in the future, one thing to consider is the image size. Even the simple images that we have built still take up gigabytes of space. Several optimization steps can be taken to reduce the image size for you or your end user. If you have time, you can read this article on different approaches to reducing image size. Additionally, you can take a look at the dive-in extension for Docker Desktop that lets you explore your Docker images in depth.

                  "},{"location":"s4_debugging_and_logging/","title":"Debugging, Profiling, Logging and Boilerplate","text":"

                  Slides

                  • Learn how to use the debugger in your editor to find bugs in your code.

                    M12: Debugging

                  • Learn how to use a profiler to identify bottlenecks in your code and from those profiles optimize the runtime of your programs.

                    M13: Profiling

                  • Learn how to systematically log experiments and hyperparameters to make your code reproducible.

                    M14: Logging

                  • Learn how to use pytorch-lightning framework to minimize boilerplate code and structure deep learning models.

                    M15: Boilerplate

                  Today we are initially going to go over three different topics that are all fundamentally necessary for any data scientist or DevOps engineer:

                  • Debugging
                  • Profiling
                  • Logging

                  All three topics can be characterized by something you are probably already familiar with. Since you started programming, you have done debugging, as nobody can write perfect code on the first try. Similarly, even if you have not directly profiled your code, I bet that you at some point have had some very slow code and optimized it to run faster. Identifying and improving are the fundamentals of profiling code. Finally, logging is a very broad term and refers to any kind of output from your applications that helps you at a later point identify the \"performance\" of your application.

                  However, while we expect you to already be familiar with these topics, we do not expect all of you to be experts, as these topics rarely get dedicated attention. Today we are going to introduce some best practices and tools to help you with each of these three important topics. As the final topic for today, we are going to learn how to minimize boilerplate so we can focus on coding what matters for our project instead of the scaffolding needed to get it working.

                  Learning objectives

                  The learning objectives of this session are:

                  • Understand the basics of debugging and how to use a debugger to find bugs in your code
                  • Be able to use a profiler to identify bottlenecks in your code and optimize the runtime of your programs based on those profiles
                  • Be familiar with an experiment logging framework for tracking experiments and hyperparameters of your code to make it reproducible
                  • Be able to use pytorch-lightning framework to minimize boilerplate code and structure deep learning models
                  "},{"location":"s4_debugging_and_logging/boilerplate/","title":"M15 - Boilerplate","text":""},{"location":"s4_debugging_and_logging/boilerplate/#minimizing-boilerplate","title":"Minimizing boilerplate","text":"

                   Boilerplate is a general term that describes any standardized text, copy, documents, methods, or procedures that may be reused without major changes to the original. But how does this relate to doing machine learning projects? If you have already done a couple of machine learning projects, you will probably have noticed a pattern: every project usually consists of these three aspects of code:

                  • a model implementation
                  • some training code
                  • a collection of utilities for saving models, logging images etc.

                   While the latter two certainly seem important, in most cases the actual development or research revolves around defining the model. In this sense, both the training code and the utilities become boilerplate that should just carry over from one project to another. The problem is usually that we have not generalized our training code to handle the small adjustments that may be required in future projects, so we end up reimplementing it every time we start a new project. This is of course a waste of time that we should try to find a solution to.

                   This is where high-level frameworks come into play. High-level frameworks are built on top of another framework (PyTorch in this case) and try to abstract/standardize how particular tasks such as training are done. At first it may seem irritating that you need to comply with someone else's code structure, but there is a good reason for that. The idea is that you can focus on what really matters (your task, model architecture etc.) and not worry about the boilerplate that comes with it.

                  The most popular high-level (training) frameworks within the PyTorch ecosystem are:

                  • fast.ai
                  • Ignite
                  • skorch
                  • Catalyst
                  • Composer
                  • PyTorch Lightning

                   They all offer many of the same features, so for most projects choosing one over the other should not matter that much. We are going to use PyTorch Lightning here, as it offers all the functionality that we are going to need later in the course.

                  "},{"location":"s4_debugging_and_logging/boilerplate/#pytorch-lightning","title":"PyTorch Lightning","text":"

                   In general, we refer to the PyTorch Lightning documentation if you are in doubt about how to format your code for specific tasks. Here we are going to explain the key concepts of the API that you need to understand to use the framework, starting with the LightningModule and the Trainer.

                  "},{"location":"s4_debugging_and_logging/boilerplate/#lightningmodule","title":"LightningModule","text":"

                   The LightningModule is a subclass of a standard nn.Module that basically adds additional structure. In addition to the standard __init__ and forward methods that need to be implemented in an nn.Module, a LightningModule requires two more methods to be implemented:

                  • training_step: should contain your actual training code e.g. given a batch of data this should return the loss that you want to optimize

                  • configure_optimizers: should return the optimizer that you want to use

                   Below, these two methods are shown added to a standard MNIST classifier.
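                   A minimal sketch of what they could look like, assuming they live inside a pl.LightningModule subclass that defines self.loss_fn (e.g. nn.CrossEntropyLoss), as in the solution later in this module:

                   def training_step(self, batch, batch_idx):\n    img, target = batch\n    preds = self(img)\n    return self.loss_fn(preds, target)\n\ndef configure_optimizers(self):\n    return torch.optim.Adam(self.parameters(), lr=1e-3)\n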

                   Compared to a standard nn.Module, the additional methods in the LightningModule specify exactly how you want to optimize your model.

                  "},{"location":"s4_debugging_and_logging/boilerplate/#trainer","title":"Trainer","text":"

                   The second component of Lightning is the Trainer object. As the name suggests, the Trainer object takes care of the actual training, automating everything that you do not want to worry about.

                   from pytorch_lightning import Trainer\nmodel = MyAwesomeModel()  # this is our LightningModule\ntrainer = Trainer()\ntrainer.fit(model)\n

                   That is essentially all you need to specify in Lightning to have a working model. The Trainer object does not have methods that you need to implement yourself, but it has a bunch of arguments that can be used to control how many epochs you want to train for, whether you want to run on GPU etc. To get the training of our model to work we just need to specify how our data should be fed into the Lightning framework.

                  "},{"location":"s4_debugging_and_logging/boilerplate/#data","title":"Data","text":"

                   For organizing the code that has to do with data in Lightning, we essentially have three different options. However, all three assume that we are using torch.utils.data.DataLoader for data loading.

                   1. If we already have a train_dataloader and possibly also a val_dataloader and test_dataloader defined, we can simply add them to our LightningModule using the similarly named methods:

                    def train_dataloader(self):\n    return DataLoader(...)\n\ndef val_dataloader(self):\n    return DataLoader(...)\n\ndef test_dataloader(self):\n    return DataLoader(...)\n
                  2. Maybe even simpler, we can directly feed such dataloaders in the fit method of the Trainer object:

                    trainer.fit(model, train_dataloader, val_dataloader)\ntrainer.test(model, test_dataloader)\n
                   3. Finally, Lightning also has the LightningDataModule that organizes data loading into a single structure, see this page for more info. Putting data loading into a DataModule makes sense, as it can then be reused between projects. A minimal sketch is shown below.
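                     The sketch assumes load_mnist() is a hypothetical placeholder for however you construct your datasets:

                     import pytorch_lightning as pl\nfrom torch.utils.data import DataLoader, random_split\n\n\nclass MNISTDataModule(pl.LightningDataModule):\n    \"\"\"Minimal datamodule sketch.\"\"\"\n\n    def __init__(self, batch_size: int = 32) -> None:\n        super().__init__()\n        self.batch_size = batch_size\n\n    def setup(self, stage=None):\n        # load_mnist() is a hypothetical helper returning (train_dataset, test_dataset)\n        full_train, self.test_set = load_mnist()\n        # assumes the usual 60000 MNIST training examples\n        self.train_set, self.val_set = random_split(full_train, [55000, 5000])\n\n    def train_dataloader(self):\n        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)\n\n    def val_dataloader(self):\n        return DataLoader(self.val_set, batch_size=self.batch_size)\n\n    def test_dataloader(self):\n        return DataLoader(self.test_set, batch_size=self.batch_size)\n

                     Training can then be started with trainer.fit(model, datamodule=MNISTDataModule()).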

                  "},{"location":"s4_debugging_and_logging/boilerplate/#callbacks","title":"Callbacks","text":"

                   Callbacks are one way to add functionality to your model that, strictly speaking, is not part of the model itself. Callbacks should therefore be seen as self-contained features that can be reused between projects. You have the option to implement callbacks yourself (by inheriting from the pytorch_lightning.callbacks.Callback base class) or use one of the built-in callbacks. Of particular interest are the ModelCheckpoint and EarlyStopping callbacks:

                   • The ModelCheckpoint callback saves checkpoints of your model. This is in principle not hard to do yourself, but ModelCheckpoint offers additional functionality such as saving checkpoints only when some metric improves, or only keeping the best K performing models:

                     model = MyModel()\ncheckpoint_callback = ModelCheckpoint(\n    dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[checkpoint_callback])\ntrainer.fit(model)\n
                  • The EarlyStopping callback can help you prevent overfitting by automatically stopping the training if a certain value is not improving anymore:

                    model = MyModel()\nearly_stopping_callback = EarlyStopping(\n    monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback])\ntrainer.fit(model)\n

                  Multiple callbacks can be used by passing them all in a list e.g.

                   trainer = Trainer(callbacks=[checkpoint_callback, early_stopping_callback])\n
                  "},{"location":"s4_debugging_and_logging/boilerplate/#exercises","title":"\u2754 Exercises","text":"

                   Please note that in the following exercises we will basically ask you to reformat all your MNIST code to follow the Lightning standard, such that we can take advantage of all the tricks the framework has to offer. The reason we did not implement our model in Lightning to begin with is that to truly understand why it is beneficial to use a high-level framework to do some of the heavy lifting, you need to have gone through some of the implementation troubles yourself.

                  1. Install pytorch lightning:

                    pip install pytorch-lightning # (1)!\n
                    1. You may also install it as pip install lightning which includes more than just the PyTorch Lightning package. This also includes Lightning Fabric and Lightning Apps which you can read more about here and here.
                  2. Convert your corrupted MNIST model into a LightningModule. You can either choose to completely override your old model or implement it in a new file. The bare minimum that you need to add while converting to get it working with the rest of lightning:

                    • The training_step method. This function should contain essentially what goes into a single training step and should return the loss at the end

                    • The configure_optimizers method

                    Please read the documentation for more info.

                    Solution lightning.py
                    import pytorch_lightning as pl\nimport torch\nfrom torch import nn\n\n\nclass MyAwesomeModel(pl.LightningModule):\n    \"\"\"My awesome model.\"\"\"\n\n    def __init__(self) -> None:\n        super().__init__()\n        self.conv1 = nn.Conv2d(1, 32, 3, 1)\n        self.conv2 = nn.Conv2d(32, 64, 3, 1)\n        self.conv3 = nn.Conv2d(64, 128, 3, 1)\n        self.dropout = nn.Dropout(0.5)\n        self.fc1 = nn.Linear(128, 10)\n\n        self.loss_fn = nn.CrossEntropyLoss()\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass.\"\"\"\n        x = torch.relu(self.conv1(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.relu(self.conv2(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.relu(self.conv3(x))\n        x = torch.max_pool2d(x, 2, 2)\n        x = torch.flatten(x, 1)\n        x = self.dropout(x)\n        return self.fc1(x)\n\n    def training_step(self, batch):\n        \"\"\"Training step.\"\"\"\n        img, target = batch\n        y_pred = self(img)\n        return self.loss_fn(y_pred, target)\n\n    def configure_optimizers(self):\n        \"\"\"Configure optimizer.\"\"\"\n        return torch.optim.Adam(self.parameters(), lr=1e-3)\n\n\nif __name__ == \"__main__\":\n    model = MyAwesomeModel()\n    print(f\"Model architecture: {model}\")\n    print(f\"Number of parameters: {sum(p.numel() for p in model.parameters())}\")\n\n    dummy_input = torch.randn(1, 1, 28, 28)\n    output = model(dummy_input)\n    print(f\"Output shape: {output.shape}\")\n
                  3. Make sure your data is formatted such that it can be loaded using the torch.utils.data.DataLoader object.

                  4. Instantiate a Trainer object. It is recommended to take a look at the trainer arguments (there are many of them) and maybe adjust some of them:

                    1. Investigate what the default_root_dir flag does

                     2. By default, Lightning will run for 1000 epochs. This may be too much (for now). Change this by setting the appropriate flag. Additionally, there also exists a flag to set the maximum number of steps that we should train for.

                      Solution

                      Setting the max_epochs will accomplish this.

                      trainer = Trainer(max_epochs=10)\n

                      Additionally, you may consider instead setting the max_steps flag to limit based on the number of steps or max_time to limit based on time. Similarly, the flags min_epochs, min_steps and min_time can be used to set the minimum number of epochs, steps or time.

                     3. To start with, we also want to limit the amount of training data to 20% of its original size. Which trainer flag do you need to set for this to work?

                      Solution

                      Setting the limit_train_batches flag will accomplish this.

                      trainer = Trainer(limit_train_batches=0.2)\n

                      Similarly, you can also set the limit_val_batches and limit_test_batches flags to limit the validation and test data.

                  5. Try fitting your model: trainer.fit(model)

                  6. Now try adding some callbacks to your trainer.

                    Solution
                    early_stopping_callback = EarlyStopping(\n    monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ncheckpoint_callback = ModelCheckpoint(\n    dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback, checkpoint_callback])\n
                   7. The previous module was all about logging in wandb, so the natural question is how Lightning supports this. Lightning supports not only wandb but also many other loggers. Common to all of them is that logging just needs to happen through the self.log method in your LightningModule:

                     1. Add self.log to your LightningModule. It should look something like this:

                      def training_step(self, batch, batch_idx):\n    data, target = batch\n    preds = self(data)\n    loss = self.criterion(preds, target)\n    acc = (target == preds.argmax(dim=-1)).float().mean()\n    self.log('train_loss', loss)\n    self.log('train_acc', acc)\n    return loss\n
                    2. Add the wandb logger to your trainer

                      trainer = Trainer(logger=pl.loggers.WandbLogger(project=\"dtu_mlops\"))\n

                      and try to train the model. Confirm that you are seeing the scalars appearing in your wandb portal.

                     3. self.log sadly only supports logging scalar tensors. Luckily, for logging other quantities we can still access the standard wandb.log through our model:

                      def training_step(self, batch, batch_idx):\n    ...\n    # self.logger.experiment is the same as wandb.log\n    self.logger.experiment.log({'logits': wandb.Histogram(preds)})\n

                      try doing this, by logging something else than scalar tensors.

                   8. Finally, we may also want to do some validation or testing. In Lightning we just need to add the validation_step and test_step methods to our lightning module and supply the respective data in the form of a separate dataloader. Try to implement at least one of them.

                    Solution

                    Both validation and test steps can be implemented in the same way as the training step:

                    def validation_step(self, batch) -> None:\n    data, target = batch\n    preds = self(data)\n    loss = self.criterion(preds, target)\n    acc = (target == preds.argmax(dim=-1)).float().mean()\n    self.log('val_loss', loss, on_epoch=True)\n    self.log('val_acc', acc, on_epoch=True)\n

                     Two things to take note of here: first, we are setting the on_epoch flag to True in the self.log method because we want to log the validation loss and accuracy only once per epoch. Second, we are not returning anything from the validation_step method, because we do not optimize over this loss.

                   9. (Optional, requires GPU) One of the big advantages of using Lightning is that you no longer need to deal with device placement, e.g. calling .to('cuda') everywhere. If you have a GPU, try to set the accelerator and devices flags in the trainer. If you do not have one, do not worry; we are going to return to this when we run training in the cloud.

                    Solution

                    The two arguments accelerator and devices can be used to specify which devices to run on and how many to run on. For example, to run on a single GPU you can do

                    trainer = Trainer(accelerator=\"gpu\", devices=1)\n

                    as an alternative the accelerator can just be set to accelerator=\"auto\" to automatically detect the best available device.

                   10. (Optional) By default, PyTorch uses float32 for representing floating point numbers. However, research has shown that neural network training is very robust to a decrease in precision. The great benefit of going from float32 to float16 is that memory consumption is approximately halved. Try out half-precision training in PyTorch Lightning. You can enable this by setting the precision flag in the Trainer.

                    Solution

                     Lightning supports two types of mixed precision training (16-bit and bfloat16) and two types of true half precision training:

                    # 16-bit mixed precision (model weights remain in torch.float32)\ntrainer = Trainer(precision=\"16-mixed\", devices=1)\n\n# 16-bit bfloat mixed precision (model weights remain in torch.float32)\ntrainer = Trainer(precision=\"bf16-mixed\", devices=1)\n\n# 16-bit precision (model weights get cast to torch.float16)\ntrainer = Trainer(precision=\"16-true\", devices=1)\n\n# 16-bit bfloat precision (model weights get cast to torch.bfloat16)\ntrainer = Trainer(precision=\"bf16-true\", devices=1)\n
                   11. (Optional) Lightning also has built-in support for profiling. Check out how to do this using the profiler argument in the Trainer object.

                   12. (Optional) Another great feature of Lightning is that it allows for easily defining command line interfaces through the Lightning CLI feature. The Lightning CLI is essentially a drop-in replacement for defining command line interfaces (covered in this module) and can also replace the need for config files (covered in this module) for securing reproducibility when working inside the Lightning framework. We highly recommend checking out the feature and trying to refactor your code such that you no longer need to call trainer.fit, but instead control everything directly from the Lightning CLI; a sketch is shown below.
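                     A minimal sketch of what this could look like, assuming a recent Lightning version where LightningCLI lives in pytorch_lightning.cli (the model import path is hypothetical):

                     from pytorch_lightning.cli import LightningCLI\n\nfrom my_project.lightning import MyAwesomeModel  # hypothetical import path for your LightningModule\n\n\ndef cli_main():\n    # exposes fit/validate/test/predict subcommands and the Trainer arguments on the command line\n    LightningCLI(MyAwesomeModel)\n\n\nif __name__ == '__main__':\n    cli_main()\n

                     The script could then be invoked with e.g. python train.py fit --trainer.max_epochs=10.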

                  13. Free exercise: Experiment with what the lightning framework is capable of. Either try out more of the trainer flags, some of the other callbacks, or maybe look into some of the other methods that can be implemented in your lightning module. Only your imagination is the limit!

                   That covers everything for today. It has been a mix of topics that should all help you write \"better\" code (by some objective measure). If you want to dive deeper into the PyTorch Lightning framework, we highly recommend looking at the different tutorials in the documentation that cover more advanced models and training cases. Additionally, we also want to highlight other frameworks in the Lightning ecosystem:

                   • Torchmetrics: a collection of machine learning metrics written in PyTorch (a small usage sketch is shown below)
                   • lightning flash: High-level framework for fast prototyping, baselining and finetuning, with an even simpler interface than Lightning
                   • lightning-bolts: Collection of SOTA pretrained models, model components, callbacks, losses and datasets for testing out ideas as fast as possible
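                   As a taste of Torchmetrics, here is a minimal sketch of how a metric could slot into a LightningModule (assuming torchmetrics>=0.11, where classification metrics take a task argument); the metric object accumulates state across batches and can be passed directly to self.log:

                   import pytorch_lightning as pl\nimport torch\nimport torchmetrics\nfrom torch import nn\n\n\nclass LitClassifier(pl.LightningModule):\n    def __init__(self) -> None:\n        super().__init__()\n        self.model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))\n        self.loss_fn = nn.CrossEntropyLoss()\n        # the metric keeps running state and handles device placement/syncing for us\n        self.train_acc = torchmetrics.Accuracy(task=\"multiclass\", num_classes=10)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        return self.model(x)\n\n    def training_step(self, batch, batch_idx):\n        img, target = batch\n        preds = self(img)\n        self.train_acc(preds, target)\n        self.log(\"train_acc\", self.train_acc, on_epoch=True)\n        return self.loss_fn(preds, target)\n\n    def configure_optimizers(self):\n        return torch.optim.Adam(self.parameters(), lr=1e-3)\n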
                  "},{"location":"s4_debugging_and_logging/debugging/","title":"M12 - Debugging","text":""},{"location":"s4_debugging_and_logging/debugging/#debugging","title":"Debugging","text":"

                   Debugging is very hard to teach and is one of the skills that comes with experience. That said, there are good and bad ways to debug a program. We are all probably familiar with just inserting print(...) statements everywhere in our code. It is easy and can often help narrow down where a problem happens. That said, it is not a great way of debugging a very large codebase. You should therefore familiarize yourself with the built-in Python debugger, as it may come in handy during the course.

                   To invoke the built-in Python debugger you can either:

                  • Set a trace directly with the Python debugger by calling

                    import pdb\npdb.set_trace()\n

                    anywhere you want to stop the code. Then you can use different commands (see the python_debugger_cheatsheet.pdf) to step through the code.

                   • If you are using an editor, you can insert inline breakpoints (in VS Code this can be done by pressing F9) and then execute the script in debug mode (inline breakpoints can often be seen as small red dots to the left of your code). The editor should then offer an interface that allows you to step through your code. Here is a guide to using the built-in debugger in VS Code.

                   • Additionally, if your program is stopping on an error and you want the debugger to start automatically where it happens, you can launch the program like this from the terminal:

                    python -m pdb -c continue my_script.py\n
                  "},{"location":"s4_debugging_and_logging/debugging/#exercises","title":"\u2754 Exercises","text":"

                  Exercise files

                   We here provide a script vae_mnist_bugs.py which contains a number of bugs. Start by going over the script and try to understand what is going on. Afterwards, try to get it running by solving the bugs. The following bugs exist in the script:

                  • One device bug (will only show if running on gpu, but try to find it anyways)
                  • One shape bug
                  • One math bug
                  • One training bug

                   Some of the bugs prevent the script from even running, while others influence the training dynamics. Try to find them all. We also provide a working version called vae_mnist_working.py (but please try to find the bugs before looking at that script). Successfully debugging and running the script should produce three files:

                  • orig_data.png containing images from the standard MNIST training set
                  • reconstructions.png reconstructions from the model
                  • generated_samples.png samples from the model

                  Again, we cannot stress enough that the exercise is actually not about finding the bugs but using a proper debugger to find them.

                  "},{"location":"s4_debugging_and_logging/logging/","title":"M14 - Logging","text":""},{"location":"s4_debugging_and_logging/logging/#logging","title":"Logging","text":"

                  Core Module

                   Logging in general refers to the practice of recording events and activities over time. Having proper logging in your applications can be extremely beneficial for a few reasons:

                   • Debugging becomes easier because we can output information about the state of our program, variables, values etc. in a more structured way, helping us identify and fix bugs or unexpected behavior.

                  • When we move into a more production environment, proper logging is essential for monitoring the health and performance of our application.

                   • It can help in auditing, as logging info about specific activities etc. helps keep a record of who did what and when.

                   • Having proper logging means that information is saved for later and can be analysed to gain insight into the behavior of our application, such as trends.

                   In this course we are going to divide logging into two categories: application logging and experiment logging. In general, application logging is important regardless of the kind of application you are developing, whereas experiment logging is important for machine learning based projects where we are running experiments.

                  "},{"location":"s4_debugging_and_logging/logging/#application-logging","title":"Application logging","text":"

                  The most basic form of logging in Python applications is the good old print statement:

                  for batch_idx, batch in enumerate(dataloader):\n    print(f\"Processing batch {batch_idx} out of {len(dataloader)}\")\n    ...\n

                  This will keep a \"record\" of the events happening in our script, in this case how far we have progressed. We could even change the print to include something like batch.shape to also have information about the current data being processed.

                   Using print statements is fine for small applications, but for proper logging we need a bit more functionality than what print can offer. Python comes with a great logging module that provides functions for flexible logging. It is exactly this module we are going to look at here.

                  The four main components to the Python logging module are:

                  1. Logger: The main entry point for using the logging system. You create instances of the Logger class to emit log messages.

                  2. Handler: Defines where the log messages go. Handlers send the log messages to specific destinations, such as the console or a file.

                  3. Formatter: Specifies the layout of the log messages. Formatters determine the structure of the log records, including details like timestamps and log message content.

                  4. Level: Specifies the severity of a log message.
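                   A minimal sketch of how these four components fit together (the logger name and format string are just for illustration):

                   import logging\nimport sys\n\n# Logger: the object we emit messages through\nlogger = logging.getLogger(\"my_app\")\nlogger.setLevel(logging.DEBUG)\n\n# Handler: decides where messages go (here: the console)\nhandler = logging.StreamHandler(sys.stdout)\nhandler.setLevel(logging.INFO)  # Level: only INFO and above pass this handler\n\n# Formatter: decides how each record is rendered\nhandler.setFormatter(logging.Formatter(\"%(asctime)s %(levelname)s %(name)s: %(message)s\"))\nlogger.addHandler(handler)\n\nlogger.debug(\"filtered out by the handler level\")\nlogger.info(\"this one reaches the console\")\n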

                   The last point especially is important to understand. Levels essentially allow us to get rid of statements like this:

                  if debug:\n    print(x.shape)\n

                   where the logging is conditional on the variable debug, which we can set at runtime. Thus, it is something we can disable for users of our application (debug=False) but have enabled when we develop it (debug=True). It also makes sense that not everything that is logged should be available to every stakeholder of a codebase. We as developers probably always want the highest level of logging, whereas users of our code need less info, and we may want to differentiate this based on the user.

                   It is also important to understand the difference between logging and error handling. Error handling in Python is done using raise statements and try/except blocks:

                   def f(x: int):\n    if not isinstance(x, int):\n        raise ValueError(\"Expected an integer\")\n    return 2 * x\n\ntry:\n    f(5)\nexcept ValueError:\n    print(\"I failed to do a thing, but continuing.\")\n

                   Why would we ever need the warning, error and critical log levels if we are just going to handle the errors? The reason is that raising exceptions is meant to change the program flow at runtime, e.g. for things we do not want the user to do but can deal with in some way. Logging is for inspecting, after a program has run, what went wrong. Sometimes you need one, sometimes the other, sometimes both, as sketched below.
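                   A small sketch of using both at the same time: the error is handled so the program can continue, but it is also logged (with traceback) for later inspection. The fallback value is just for illustration:

                   import logging\n\nlogger = logging.getLogger(__name__)\n\n\ndef f(x: int) -> int:\n    if not isinstance(x, int):\n        raise ValueError(\"Expected an integer\")\n    return 2 * x\n\n\ntry:\n    result = f(\"not an int\")\nexcept ValueError:\n    # handle the error so the program can continue, but log it for later inspection\n    logger.exception(\"f failed, falling back to a default value\")\n    result = 0\n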

                  "},{"location":"s4_debugging_and_logging/logging/#exercises","title":"\u2754 Exercises","text":"

                  Exercises are inspired by this made with ml module on the same topic. If you need help for the exercises you can find a simple solution script here.

                  1. As logging is a built-in module in Python, nothing needs to be installed. Instead start a new file called my_logger.py and start out with the following code:

                    import logging\nimport sys\n\n# Create super basic logger\nlogging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\nlogger = logging.getLogger(__name__) # (1)\n\n# Logging levels (from lowest to highest priority)\nlogger.debug(\"Used for debugging your code.\")\nlogger.info(\"Informative messages from your code.\")\nlogger.warning(\"Everything works but there is something to be aware of.\")\nlogger.error(\"There's been a mistake with the process.\")\nlogger.critical(\"There is something terribly wrong and process may terminate.\")\n
                     1. The built-in variable __name__ always contains the name of the module that is currently being run. Therefore, if we initialize our logger base using this variable, it will always be unique to our application and not conflict with loggers set up by any third-party package.

                     Try running the code. Then try changing the argument level when creating the logger. What happens when you do that?

                   2. Instead of sending logs to the terminal, we may also want to send them to a file. This can be beneficial, such that only warning level logs and higher are shown to the user, while debug and info logs are still saved while the application is running.

                    1. Try adding the following dict to your logger.py file:

                      logging_config = {\n    \"version\": 1,\n    \"formatters\": { # (1)\n        \"minimal\": {\"format\": \"%(message)s\"},\n        \"detailed\": {\n            \"format\": \"%(levelname)s %(asctime)s [%(name)s:%(filename)s:%(funcName)s:%(lineno)d]\\n%(message)s\\n\"\n        },\n    },\n    \"handlers\": { # (2)\n        \"console\": {\n            \"class\": \"logging.StreamHandler\",\n            \"stream\": sys.stdout,\n            \"formatter\": \"minimal\",\n            \"level\": logging.DEBUG,\n        },\n        \"info\": {\n            \"class\": \"logging.handlers.RotatingFileHandler\",\n            \"filename\": Path(LOGS_DIR, \"info.log\"),\n            \"maxBytes\": 10485760,  # 1 MB\n            \"backupCount\": 10,\n            \"formatter\": \"detailed\",\n            \"level\": logging.INFO,\n        },\n        \"error\": {\n            \"class\": \"logging.handlers.RotatingFileHandler\",\n            \"filename\": Path(LOGS_DIR, \"error.log\"),\n            \"maxBytes\": 10485760,  # 1 MB\n            \"backupCount\": 10,\n            \"formatter\": \"detailed\",\n            \"level\": logging.ERROR,\n        },\n    },\n    \"root\": {\n        \"handlers\": [\"console\", \"info\", \"error\"],\n        \"level\": logging.INFO,\n        \"propagate\": True,\n    },\n}\n
                      1. The formatter section determines how logs should be formatted. Here we define two separate formatters, called minimal and detailed which we can use in the next part of the code.

                       2. The handlers section is in charge of what should happen to the different levels of logging. console uses the minimal format we defined and sends logs to the stdout stream for messages of level DEBUG and higher. The info handler uses the detailed format and sends messages of level INFO and higher to a separate info.log file. The error handler does the same for messages of level ERROR and higher to a file called error.log.

                       You will need to set the LOGS_DIR variable and also figure out how to add this logging_config to your logger using the logging.config submodule; a sketch is shown after this exercise.

                     2. When the code runs successfully, check the LOGS_DIR folder and make sure that an info.log and an error.log file were created with the appropriate content.
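                     A minimal sketch of how the pieces of this exercise might be wired together, assuming logs go in a local logs folder and that logging_config is the dict defined above (LOGS_DIR must be defined before the dict is built):

                     import logging\nimport logging.config\nfrom pathlib import Path\n\nLOGS_DIR = Path(\"logs\")  # assumption: logs go in a local \"logs\" folder\nLOGS_DIR.mkdir(parents=True, exist_ok=True)\n\n# logging_config is the dict from the exercise above\nlogging.config.dictConfig(logging_config)\nlogger = logging.getLogger(__name__)\n\nlogger.info(\"goes to the console and logs/info.log\")\nlogger.error(\"goes to the console, logs/info.log and logs/error.log\")\n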

                   3. Finally, let's try to add a little bit of style and color to our logging. For this we can use rich, a great package for rich text and beautiful formatting in terminals. Install rich and add the following lines to your my_logger.py script:

                     from rich.logging import RichHandler\n\nlogger.root.handlers[0] = RichHandler(markup=True)  # set rich handler\n

                    and try re-running the script. Hopefully you should see something beautiful in your terminal like this:

                   4. (Optional) We already briefly touched on logging during the module on config files using hydra. If you want to configure hydra to use a custom logging scheme like the one we set up in the last two exercises, you can take a look at this page. In hydra you will need to provide the configuration of the logger as a config file. You can find examples of such config files here.

                  "},{"location":"s4_debugging_and_logging/logging/#experiment-logging","title":"Experiment logging","text":"

                   When most people think of machine learning, they think about the training phase. Being able to track and log experiments is an important part of understanding what is going on with your model while you are training. It can help you debug your model and tweak it to perfection. Without proper logging of experiments, it can be really hard to iterate on the model, because you do not know which changes led to an increase or decrease in performance.

                  The most basic logging we can do when running experiments is writing the metrics that our model is producing e.g. the loss or the accuracy to the terminal or a file for later inspection. We can then also use tools such as matplotlib for plotting the progression of our metrics over time. This kind of workflow may be enough when doing smaller experiments or working alone on a project, but there is no way around using a proper experiment tracker and visualizer when doing large scale experiments in collaboration with others. It especially becomes important when you want to compare performance between different runs.
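                   As an illustration, such a basic workflow could look something like the sketch below, where train_one_epoch, model and train_dataloader are hypothetical stand-ins for your own training loop:

                   import matplotlib.pyplot as plt\n\nlosses = []\nfor epoch in range(10):\n    loss = train_one_epoch(model, train_dataloader)  # hypothetical helper returning the epoch loss\n    losses.append(loss)\n    print(f\"epoch={epoch} loss={loss:.4f}\")  # \"logging\" to the terminal\n\nplt.plot(losses)\nplt.xlabel(\"epoch\")\nplt.ylabel(\"training loss\")\nplt.savefig(\"training_curve.png\")\n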

                  There exist many tools for logging your experiments, with some of them being:

                  • Tensorboard
                  • Comet
                  • MLFlow
                  • Neptune
                   • Weights and Biases

                   All of these frameworks offer many of the same functionalities; you can see a (biased) review here. We are going to use Weights and Biases (wandb), as it supports everything we need in this course. Additionally, it is an excellent tool for collaboration and sharing of results.

                   Using the Weights and Biases (wandb) dashboard we can quickly get an overview of and compare many runs over different metrics. This allows for better iteration of models and training procedures."},{"location":"s4_debugging_and_logging/logging/#exercises_1","title":"\u2754 Exercises","text":"
                   1. Start by creating an account at wandb. I recommend using your GitHub account, but feel free to choose what you want. When you are logged in you should get a 40-character API key. Copy this for later use (HINT: if you forget to copy the API key, you can find it under settings), but make sure that you do not share it with anyone or leak it in any way.

                    .env file

                    A good place to store not only your wandb API key but also other sensitive information is in a .env file. This file should be added to your .gitignore file to make sure that it is not uploaded to your repository. You can then load the variables in the .env file using the python-dotenv package. For more information see this page.

                    .env

                    WANDB_API_KEY=your-api-key\nWANDB_PROJECT=my_project\nWANDB_ENTITY=my_entity\n...\n
                    load_from_env_file.py
                     import os\n\nfrom dotenv import load_dotenv\n\nload_dotenv()\napi_key = os.getenv(\"WANDB_API_KEY\")\n

                  2. Next install wandb on your laptop

                    pip install wandb\n
                  3. Now connect to your wandb account

                    wandb login\n

                     You will be asked to provide the 40-character API key. The connection to the wandb server should remain open even when you close the terminal, such that you do not have to log in each time. If using wandb in a notebook, you need to manually close the connection using wandb.finish().
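                     If you prefer to log in programmatically rather than interactively, a sketch like the following should also work, reusing the .env file from earlier:

                     import os\n\nimport wandb\nfrom dotenv import load_dotenv\n\nload_dotenv()  # loads WANDB_API_KEY from the .env file described earlier\nwandb.login(key=os.getenv(\"WANDB_API_KEY\"))\n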

                   4. We are now ready to incorporate wandb into our code. We are going to continue development on our corrupt MNIST codebase from the previous sessions. For help, we recommend looking at this quickstart and this guide for PyTorch applications. Your first job is to alter your training script to include wandb logging, at least for the training loss.

                    Solution train.py
                    import click\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(f\"{lr=}, {batch_size=}, {epochs=}\")\n    wandb.init(\n        project=\"corrupt_mnist\",\n        config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n    )\n\n    model = MyAwesomeModel().to(DEVICE)\n    train_set, _ = corrupt_mnist()\n\n    train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n    loss_fn = torch.nn.CrossEntropyLoss()\n    optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n    for epoch in range(epochs):\n        model.train()\n        for i, (img, target) in enumerate(train_dataloader):\n            img, target = img.to(DEVICE), target.to(DEVICE)\n            optimizer.zero_grad()\n            y_pred = model(img)\n            loss = loss_fn(y_pred, target)\n            loss.backward()\n            optimizer.step()\n            accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n            wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n            if i % 100 == 0:\n                print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n\nif __name__ == \"__main__\":\n    train()\n
                     1. After running your model, check out the webpage. Hopefully you should be able to see at least one run with something logged.

                     2. Now log something other than scalar values. This could be an image, a histogram or a matplotlib figure. In all cases the logging still goes through wandb.log, but you need extra calls to wandb.Image etc. depending on what you choose to log.

                      Solution

                       In this solution we log the input images to the model every 100 steps. Additionally, we also log a histogram of the gradients to inspect whether the model is converging. Finally, we create a ROC curve, which is a matplotlib figure, and log that as well.

                      train.py
                      import click\nimport matplotlib.pyplot as plt\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\nfrom sklearn.metrics import RocCurveDisplay\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(f\"{lr=}, {batch_size=}, {epochs=}\")\n    wandb.init(\n        project=\"corrupt_mnist\",\n        config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n    )\n\n    model = MyAwesomeModel().to(DEVICE)\n    train_set, _ = corrupt_mnist()\n\n    train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n    loss_fn = torch.nn.CrossEntropyLoss()\n    optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n    for epoch in range(epochs):\n        model.train()\n\n        preds, targets = [], []\n        for i, (img, target) in enumerate(train_dataloader):\n            img, target = img.to(DEVICE), target.to(DEVICE)\n            optimizer.zero_grad()\n            y_pred = model(img)\n            loss = loss_fn(y_pred, target)\n            loss.backward()\n            optimizer.step()\n            accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n            wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n            preds.append(y_pred.detach().cpu())\n            targets.append(target.detach().cpu())\n\n            if i % 100 == 0:\n                print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n                # add a plot of the input images\n                images = wandb.Image(img[:5].detach().cpu(), caption=\"Input images\")\n                wandb.log({\"images\": images})\n\n                # add a plot of histogram of the gradients\n                grads = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None], 0)\n                wandb.log({\"gradients\": wandb.Histogram(grads)})\n\n        # add a custom matplotlib plot of the ROC curves\n        preds = torch.cat(preds, 0)\n        targets = torch.cat(targets, 0)\n\n        for class_id in range(10):\n            one_hot = torch.zeros_like(targets)\n            one_hot[targets == class_id] = 1\n            _ = RocCurveDisplay.from_predictions(\n                one_hot,\n                preds[:, class_id],\n                name=f\"ROC curve for {class_id}\",\n                plot_chance_level=(class_id == 2),\n            )\n\n        wandb.plot({\"roc\": plt})\n        # alternatively the wandb.plot.roc_curve function can be used\n\n\nif __name__ == \"__main__\":\n    train()\n
                    3. Finally, we want to log the model itself. This is done by saving the model as an artifact and then logging the artifact. You can read much more about what artifacts are here but they are essentially one or more files logged together with runs that can be versioned and equipped with metadata. Log the model after training and see if you can find it in the wandb dashboard.

                      Solution

                       In this solution we have added the calculation of final training metrics, and when we log the model we add these as metadata to the artifact.

                      train.py
                      import click\nimport matplotlib.pyplot as plt\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\nfrom sklearn.metrics import RocCurveDisplay, accuracy_score, f1_score, precision_score, recall_score\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n    \"\"\"Train a model on MNIST.\"\"\"\n    print(\"Training day and night\")\n    print(f\"{lr=}, {batch_size=}, {epochs=}\")\n    run = wandb.init(\n        project=\"corrupt_mnist\",\n        config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n    )\n\n    model = MyAwesomeModel().to(DEVICE)\n    train_set, _ = corrupt_mnist()\n\n    train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n    loss_fn = torch.nn.CrossEntropyLoss()\n    optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n    for epoch in range(epochs):\n        model.train()\n\n        preds, targets = [], []\n        for i, (img, target) in enumerate(train_dataloader):\n            img, target = img.to(DEVICE), target.to(DEVICE)\n            optimizer.zero_grad()\n            y_pred = model(img)\n            loss = loss_fn(y_pred, target)\n            loss.backward()\n            optimizer.step()\n            accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n            wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n            preds.append(y_pred.detach().cpu())\n            targets.append(target.detach().cpu())\n\n            if i % 100 == 0:\n                print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n                # add a plot of the input images\n                images = wandb.Image(img[:5].detach().cpu(), caption=\"Input images\")\n                wandb.log({\"images\": images})\n\n                # add a plot of histogram of the gradients\n                grads = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None], 0)\n                wandb.log({\"gradients\": wandb.Histogram(grads)})\n\n        # add a custom matplotlib plot of the ROC curves\n        preds = torch.cat(preds, 0)\n        targets = torch.cat(targets, 0)\n\n        for class_id in range(10):\n            one_hot = torch.zeros_like(targets)\n            one_hot[targets == class_id] = 1\n            _ = RocCurveDisplay.from_predictions(\n                one_hot,\n                preds[:, class_id],\n                name=f\"ROC curve for {class_id}\",\n                plot_chance_level=(class_id == 2),\n            )\n\n        wandb.plot({\"roc\": plt})\n        # alternatively the wandb.plot.roc_curve function can be used\n\n    final_accuracy = accuracy_score(targets, preds.argmax(dim=1))\n    final_precision = precision_score(targets, preds.argmax(dim=1), average=\"weighted\")\n    final_recall = recall_score(targets, preds.argmax(dim=1), average=\"weighted\")\n    final_f1 = f1_score(targets, preds.argmax(dim=1), average=\"weighted\")\n\n    # first we save the model to a file then log it as an artifact\n    torch.save(model.state_dict(), \"model.pth\")\n    artifact = wandb.Artifact(\n        
name=\"corrupt_mnist_model\",\n        type=\"model\",\n        description=\"A model trained to classify corrupt MNIST images\",\n        metadata={\"accuracy\": final_accuracy, \"precision\": final_precision, \"recall\": final_recall, \"f1\": final_f1},\n    )\n    artifact.add_file(\"model.pth\")\n    run.log_artifact(artifact)\n\n\nif __name__ == \"__main__\":\n    train()\n

                      After running the script you should be able to see the logged artifact in the wandb dashboard.

                  5. Weights and Bias was created with collaboration in mind and therefore lets us share our results with others.

                    1. Let's create a report that you can share. Click the Create report button (upper right corner when you are in a project workspace) and include some of the graphs/plots/images that you have generated in the report.

                    2. Make the report shareable by clicking the Share button and creating a view-only link. Send a link to your report to a group member, fellow student or a friend. In the worst case, if you have no one else to share with, you can send a link to my email nsde@dtu.dk, so I can check out your awesome work \ud83d\ude03

                  6. When calling wandb.init you can provide many additional arguments. Some of the most important are

                    • project
                    • entity
                    • job_type

                    Make sure you understand what these arguments do and try them out. It will come in handy for your group work, as these arguments essentially allow multiple users to upload their own runs to the same project in wandb.

                    Solution

                    Relevant documentation can be found here. The project indicates what project all experiments and artifacts are logged to. We want to keep this the same for all group members. The entity is the username of the person or team who owns the project, which should also be the same for all group members. The job type is important if you have different jobs that log to the same project. A common example is one script that trains a model and another that evaluates it. By setting the job type you can easily filter the runs in the wandb dashboard.
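
                    As an illustration (not part of the exercise files), a call to wandb.init using these arguments could look like the sketch below, where the entity and project names are placeholders you should replace with your own:

                    import wandb\n\n# sketch: every group member logs to the same entity/project, but with a job_type\n# describing what the run does (e.g. \"train\" in train.py and \"evaluate\" in evaluate.py)\nrun = wandb.init(\n    entity=\"my_team\",          # placeholder: the shared team/entity name\n    project=\"corrupt_mnist\",   # placeholder: the shared project name\n    job_type=\"train\",\n)\n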

                  7. Wandb also comes with a built-in feature for doing hyperparameter sweeps, which can be beneficial for getting a better-working model. Look through the documentation on how to do a hyperparameter sweep in Wandb. You at least need to create a new file called sweep.yaml and make sure that you call wandb.log in your code on an appropriate value (such as a validation loss).

                    1. Start by creating a sweep.yaml file. Relevant documentation can be found here. We recommend placing the file in a configs folder in your project.

                      Solution

                      The sweep.yaml file will depend on the kind of hyperparameters your model accepts as arguments and how they are passed to the model. For this solution we assume that the model accepts the hyperparameters lr, batch_size and epochs and that they are passed as --args (with hyphens) (1), e.g. this is how we would run the script

                      1. If the script you want to run hyperparameter sweeping on is configured using hydra then you will need to change the default command config in your sweep.yaml file. This is because wandb uses --args to pass hyperparameters to the script, whereas hydra uses args (without the hyphens). See this page for more information.
                      python train.py --lr=0.01 --batch_size=32 --epochs=10\n

                      The sweep.yaml could then look like this:

                      program: train.py\nname: sweepdemo\nmethod: bayes  # search strategy: grid, random or bayes\nproject: my_project  # change this\nentity: my_entity  # change this\nmetric:\n    goal: minimize\n    name: validation_loss\nparameters:\n    lr:\n        min: 0.0001\n        max: 0.1\n        distribution: log_uniform_values\n    batch_size:\n        values: [16, 32, 64]\n    epochs:\n        values: [5, 10, 15]\nrun_cap: 10\n
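                      For the sweep to have anything to optimize, your training script must also log the metric named under metric.name. A minimal sketch of what that could look like (the loss value here is just a stand-in for a real validation loss):

                      import wandb\n\nwandb.init()\n# ... training happens here ...\nval_loss = 0.42  # stand-in for a validation loss computed on a validation set\nwandb.log({\"validation_loss\": val_loss})  # the key must match metric.name in sweep.yaml\n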
                    2. Afterwards, you need to create a sweep using the wandb sweep command:

                      wandb sweep configs/sweep.yaml\n

                      this will output a sweep id that you need to use in the next step.

                    3. Finally, you need to run the sweep using the wandb agent command:

                      wandb agent <sweep_id>\n

                      where <sweep_id> is the id of the sweep you just created. You can find the id in the output of the wandb sweep command. The reason that we first launch the sweep and then the agent is that we can have multiple agents running at the same time, parallelizing the search for the best hyperparameters. Try this out by opening a new terminal and running the wandb agent command again (with the same <sweep_id>).

                    4. Inspect the sweep results in the wandb dashboard. You should see multiple new runs under the project you are logging the sweep to, corresponding to the different hyperparameters you tried. Make sure you understand the results and can answer what hyperparameters gave the best results and what hyperparameters had the largest impact on the results.

                      Solution

                      In the sweep dashboard you should see something like this:

                      Importantly you can:

                      1. Sort the runs based on what metric you are interested in, thereby quickly finding the best runs.
                      2. Look at the parallel coordinates plot to see if there are any tendencies in the hyperparameters that give the best results.
                      3. Look at the importance/correlation plot to see what hyperparameters have the largest impact on the results.
                  8. Next we need to understand the model registry, which will be very important later on when we get to the deployment of our models. The model registry is a centralized place for storing and versioning models. Importantly, any model in the registry is immutable, meaning that once a model is uploaded it cannot be changed. This is important for reproducibility and traceability of models.

                    The model registry is in general a repository of a team's trained models where ML practitioners publish candidates for production and share them with others. Figure from wandb.

                    1. The model registry builds on the artifact registry in wandb. Any model that is uploaded to the model registry is stored as an artifact. This means that we first need to log our trained models as artifacts before we can register them in the model registry. Make sure you have logged at least one model as an artifact before continuing.

                    2. Next, let's create a registry. Go to the model registry tab (left pane, visible from your homepage) and then click the New Registered Model button. Fill out the form and create the registry.

                    3. We then need to link our artifact to the model registry we just created. We can do this in two ways: either through the web interface or through the wandb API. In the web interface, go to the artifact you want to link to the model registry and click the Link to registry button (upper right corner). If you want to use the API you need to call the link method on an artifact object.

                      Solution

                      To use the API, create a new script called link_to_registry.py and add the following code:

                      link_to_registry.py
                      import wandb\napi = wandb.Api()\nartifact_path = \"<entity>/<project>/<artifact_name>:<version>\"\nartifact = api.artifact(artifact_path)\nartifact.link(target_path=\"<entity>/model-registry/<my_registry_name>\")\nartifact.save()\n

                      In the code <entity>, <project>, <artifact_name>, <version> and <my_registry_name> should be replaced with the appropriate values.

                    4. We are now ready to consume our model, which can be done by downloading the artifact from the model registry. In this case we use the wandb API to download the artifact.

                      import torch\nimport wandb\n\nrun = wandb.init()\nartifact = run.use_artifact('<entity>/model-registry/<my_registry_name>:<version>', type='model')\nartifact_dir = artifact.download(\"<artifact_dir>\")\nmodel = MyModel()  # the model class you defined during training\nmodel.load_state_dict(torch.load(\"<artifact_dir>/model.pth\"))\n

                      Try running this code with the appropriate values for <entity>, <my_registry_name>, <version> and <artifact_dir>. Make sure that you can load the model and that it is the same as the one you trained.

                    5. Each model in the registry has at least one alias, which is the version of the model. The most recently added model also receives the alias latest. Aliases are great for indicating where in the workflow a model is, e.g. if it is a candidate for production or if it is a model that is still being developed. Try adding an alias to one of your models in the registry, either through the web interface or as sketched below.
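
                    A small sketch of adding an alias through the API; the artifact path and alias name are placeholders:

                    import wandb\n\napi = wandb.Api()\nartifact = api.artifact(\"<entity>/<project>/<artifact_name>:<version>\")  # placeholder path\nartifact.aliases.append(\"staging\")  # add the new alias to the existing list of aliases\nartifact.save()  # persist the change on the wandb server\n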

                    6. (Optional) A model always corresponds to an artifact, and artifacts can contain metadata that we can use to automate the process of registering models. We could for example imagine that at the end of each week we run a script that registers the best model from that week. Try creating a small script using the wandb API that goes over a collection of artifacts and registers the best one.

                      Solution auto_register_best_model.py
                      import logging\nimport operator\nimport os\n\nimport click\nimport wandb\nfrom dotenv import load_dotenv\n\nlogger = logging.getLogger(__name__)\nload_dotenv()\n\n\n@click.command()\n@click.argument(\"model-name\")\n@click.option(\"--metric_name\", default=\"accuracy\", help=\"Name of the metric to choose the best model from.\")\n@click.option(\"--higher-is-better\", default=True, help=\"Whether higher metric values are better.\")\ndef stage_best_model_to_registry(model_name, metric_name, higher_is_better) -> None:\n    \"\"\"\n    Stage the best model to the model registry.\n\n    Args:\n        model_name: Name of the model to be registered.\n        metric_name: Name of the metric to choose the best model from.\n        higher_is_better: Whether higher metric values are better.\n\n    \"\"\"\n    api = wandb.Api(\n        api_key=os.getenv(\"WANDB_API_KEY\"),\n        overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n    )\n    artifact_collection = api.artifact_collection(type_name=\"model\", name=model_name)\n\n    best_metric = float(\"-inf\") if higher_is_better else float(\"inf\")\n    compare_op = operator.gt if higher_is_better else operator.lt\n    best_artifact = None\n    for artifact in list(artifact_collection.artifacts()):\n        if metric_name in artifact.metadata and compare_op(artifact.metadata[metric_name], best_metric):\n            best_metric = artifact.metadata[metric_name]\n            best_artifact = artifact\n\n    if best_artifact is None:\n        logging.error(\"No model found in registry.\")\n        return\n\n    logger.info(f\"Best model found in registry: {best_artifact.name} with {metric_name}={best_metric}\")\n    best_artifact.link(\n        target_path=f\"{os.getenv('WANDB_ENTITY')}/model-registry/{model_name}\",\n        aliases=[\"best\", \"staging\"],\n    )\n    best_artifact.save()\n    logger.info(\"Model staged to registry.\")\n\n\nif __name__ == \"__main__\":\n    stage_best_model_to_registry()\n
                  9. In the future it will be important for us to be able to run Wandb inside a docker container (together with whatever training or inference we specify). The problem here is that we cannot authenticate Wandb in the same way as in the previous exercises; it needs to happen automatically. Let's therefore look into how we can do that.

                    1. First we need to generate an authentication key, or more precisely an API key. This is in general the way any service (like a docker container) can authenticate. Start by going to https://wandb.ai/home, click your profile icon in the upper right corner and then go to settings. Scroll down to the danger zone, generate a new API key and finally copy it.
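
                    If you at some point prefer to authenticate from code instead of through the command line, a minimal sketch could look like this (assuming the key is stored in an environment variable called WANDB_API_KEY):

                    import os\n\nimport wandb\n\n# sketch: authenticate programmatically with an API key read from an environment variable\nwandb.login(key=os.getenv(\"WANDB_API_KEY\"))\n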

                    2. Next create a new docker file called wandb.docker and add the following code

                      FROM python:3.10-slim\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\nRUN pip install wandb\nCOPY s4_debugging_and_logging/exercise_files/wandb_tester.py wandb_tester.py\nENTRYPOINT [\"python\", \"-u\", \"wandb_tester.py\"]\n

                      Please take a look at the script being copied into the image and afterwards build the docker image.

                    3. When we want to run the image, we need to include an environment variable that contains the API key we generated. This will then authenticate the docker container with the wandb server:

                      docker run -e WANDB_API_KEY=<your-api-key> wandb:latest\n

                      Try running it and confirm that the results are uploaded to the wandb server (1).

                      1. If you have stored the API key in a .env file you can use the --env-file flag instead of -e to load the environment variables from the file e.g. docker run --env-file .env wandb:latest.
                  10. Feel free to experiment more with wandb as it is a great tool for logging, organizing and sharing experiments.

                  11. That is the module on logging. Please note that at this point in the course you will begin to see some overlap between the different frameworks. While we mainly used hydra for configuring our Python scripts, it can also be used to save metrics and hyperparameters similar to how wandb can. Similar arguments hold for dvc, which can also be used to log metrics. In our opinion wandb just offers a better experience when interacting with the results after logging. We want to stress that the combination of tools presented in this course may not be the best for all your future projects, and we recommend finding a setup that fits you. That said, each framework provides specific features that the others do not.

                    Finally, we want to note that during the course we really try to showcase a lot of open-source frameworks, but Wandb is not one of them. It is free to use for personal usage (with a few restrictions), but for enterprise use it does require a license. If you are eager to only work with open-source tools we highly recommend trying out MLFlow, which offers the same overall functionality as Wandb.

                    "},{"location":"s4_debugging_and_logging/profiling/","title":"M13 - Profiling","text":""},{"location":"s4_debugging_and_logging/profiling/#profilers","title":"Profilers","text":"

                    Core Module

                    "},{"location":"s4_debugging_and_logging/profiling/#profilers_1","title":"Profilers","text":"

                    In general, profiling code is about improving the performance of your code. In this session we are going to take a somewhat narrow view of what \"performance\" means: runtime, meaning the time it takes to execute your program.

                    At the bare minimum, the two questions a proper profiling of your program should be able to answer are:

                    • \u201cHow many times is each method in my code called?\u201d
                    • \u201cHow long does each of these methods take?\u201d

                    The first question is important for prioritizing optimization. If two methods A and B have approximately the same runtime, but A is called 1000 times more often than B, we should probably spend time optimizing A over B if we want to speed up our code. The second question speaks for itself, directly telling us which methods are expensive to call.

                    Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, the first one being cProfile. cProfile is Python's built-in profiler that can give you an overview of the runtime of all the functions and methods involved in your program.
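
                    In addition to the command line interface used in the exercises below, cProfile can also be invoked programmatically. A small self-contained sketch (the function is just a placeholder for your own code):

                    import cProfile\nimport pstats\n\n\ndef expensive_function():\n    return sum(i * i for i in range(1_000_000))\n\n\n# profile the call and save the statistics to a file (which can later be opened in e.g. snakeviz)\ncProfile.run(\"expensive_function()\", filename=\"program.prof\")\n\n# load the statistics again and print the 10 entries with the highest cumulative time\nstats = pstats.Stats(\"program.prof\")\nstats.sort_stats(\"cumulative\").print_stats(10)\n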

                    "},{"location":"s4_debugging_and_logging/profiling/#exercises","title":"\u2754 Exercises","text":"
                    1. Run cProfile on the vae_mnist_working.py script. Hint: you can directly call the profiler on a script using the -m argument

                      python -m cProfile -o <output_file> -s <sort_order> myscript.py\n
                    2. Try looking at the output of the profiling. Can you figure out which function took the longest to run?

                    3. Can you explain the difference between tottime and cumtime? Under what circumstances do these differ and when are they equal?

                    4. To get a better feeling for the profiled result we can try to visualize it. Python does not provide a native solution, but open-source solutions such as snakeviz exist. Try installing snakeviz and load a profiled run into it (HINT: snakeviz expects the run to have the file format .prof).

                    5. Try optimizing the run! (Hint: The data is not stored as a torch tensor). After optimizing the code make sure (using cProfile and snakeviz) that the code actually runs faster.

                    "},{"location":"s4_debugging_and_logging/profiling/#pytorch-profiling","title":"PyTorch profiling","text":"

                    Profiling machine learning code can become much more complex because we are suddenly beginning to mix different devices (CPU+GPU) that can (and should) overlap some of their computations. When profiling this kind of machine learning code we are often looking for bottlenecks. A bottleneck is simply the place in your code that is preventing other processes from performing their best. This is the reason that all major deep learning frameworks also include their own profilers that can help profile more complex applications.

                    The image below shows a typical report using the built-in profiler in PyTorch. As the image shows, the profiler looks both at the kernel time (the time spent doing actual computations) and at transfer times such as memcpy (where we are copying data between devices). It can even analyze your code and give recommendations.

                    Using the profiler can be as simple as wrapping the code that you want to profile with the torch.profiler.profile context manager

                    with torch.profiler.profile(...) as prof:\n    # code that I want to profile\n    output = model(data)\n
                    "},{"location":"s4_debugging_and_logging/profiling/#exercises_1","title":"\u2754 Exercises","text":"

                    Exercise files

                    In these exercises we investigate the profiler that is already built into PyTorch. Note that these exercises require that you have PyTorch v1.8.1 (or higher) installed. You can always check which version you currently have installed by writing (in a python interpreter):

                    import torch\nprint(torch.__version__)\n

                    But we always recommend updating to the latest PyTorch version for the best experience. Additionally, to display the results nicely (like snakeviz for cProfile) we are also going to use the tensorboard profiler extension

                    pip install torch_tb_profiler\n
                    1. A good starting point is to look at the API for the profiler. Here the important class to look at is the torch.profiler.profile class.

                    2. Let's try out a simple example (taken from here):

                      1. Try to run the following code

                        import torch\nimport torchvision.models as models\nfrom torch.profiler import profile, ProfilerActivity\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\nwith profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:\n    model(inputs)\n

                        this will profile the forward pass of a ResNet-18 model.

                      2. Running this code will produce a prof object that contains all the relevant information about the profiling. Try writing the following code:

                        print(prof.key_averages().table(sort_by=\"cpu_time_total\", row_limit=10))\n

                        what operation is taking up most of the CPU time?

                      3. Try running

                        print(prof.key_averages(group_by_input_shape=True).table(sort_by=\"cpu_time_total\", row_limit=30))\n

                        can you see any correlation between the shape of the input and the cost of the operation?

                      4. (Optional) If you have a GPU you can also profile the operations on that device:

                        with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:\n    model(inputs)\n
                      5. (Optional) As an alternative to using profile as a context manager we can also use its .start and .stop methods:

                        prof = profile(...)\nprof.start()\n...  # code I want to profile\nprof.stop()\n

                        Try doing this on the above example.

                    3. The torch.profiler.profile function takes some additional arguments. What argument would you need to set to also profile the memory usage? (Hint: this page) Try applying it to the simple example above and make sure to sort the output by self_cpu_memory_usage. One possible sketch is shown below.
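
                      A possible sketch of this, building on the earlier ResNet example and assuming the relevant argument is called profile_memory (check the linked page):

                      import torch\nimport torchvision.models as models\nfrom torch.profiler import ProfilerActivity, profile\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\n# profile_memory=True additionally records how much memory each operator allocates\nwith profile(activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True) as prof:\n    model(inputs)\n\nprint(prof.key_averages().table(sort_by=\"self_cpu_memory_usage\", row_limit=10))\n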

                    4. As mentioned we can also get a graphical output for better inspection. After having done a profiling try to export the results with:

                      prof.export_chrome_trace(\"trace.json\")\n

                      you should be able to visualize the file by going to chrome://tracing in any chromium-based web browser. Can you still identify the information printed in the previous exercises from the visualizations?

                    5. Running profiling on a single forward step can produce misleading results, as it only provides a single sample that may depend on what background processes are running on your computer. Therefore it is recommended to profile multiple iterations of your model. In that case we need to include prof.step() to tell the profiler when we are starting a new iteration

                      with profile(...) as prof:\n    for i in range(10):\n        model(inputs)\n        prof.step()\n

                      Try doing this. Is the conclusion the same regarding which operations take up most of the time? Have the percentages changed significantly?

                    6. Additionally, we can also visualize the profiling results using the profiling viewer in tensorboard.

                      1. Start by initializing the profile class with an additional argument:

                        from torch.profiler import profile, tensorboard_trace_handler\nwith profile(..., on_trace_ready=tensorboard_trace_handler(\"./log/resnet18\")) as prof:\n    ...\n

                        Try running a profiling (using a couple of iterations) and make sure that a file with the .pt.trace.json extension is produced in the log/resnet18 folder.

                      2. Now try launching tensorboard

                        tensorboard --logdir=./log\n

                        and open the page http://localhost:6006/#pytorch_profiler, where you should hopefully see an image similar to the one below:

                        Image credit

                        Try poking around in the interface.

                      3. Tensorboard has a nice feature for comparing runs under the diff tab. Try redoing a profiling run but use model = models.resnet34() instead. Load up both runs and try to look at the diff between them.

                    7. As a final exercise, try to use the profiler on the vae_mnist_working.py file from the previous module on debugging, where you profile a whole training run (not only the forward pass). What is the bottleneck during the training? Is it still the forward pass or is it something else? Can you improve the code somehow based on the information from the profiler? A sketch of how a full training loop can be profiled is shown below.
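
                    A sketch of profiling several training iterations, using the ResNet example from before as a stand-in for your own training loop and a schedule so that only some steps are recorded:

                    import torch\nimport torchvision.models as models\nfrom torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler\n\nmodel = models.resnet18()\noptimizer = torch.optim.SGD(model.parameters(), lr=0.01)\nloss_fn = torch.nn.CrossEntropyLoss()\n\nwith profile(\n    activities=[ProfilerActivity.CPU],\n    schedule=schedule(wait=1, warmup=1, active=3),  # skip 1 step, warm up 1 step, record 3 steps\n    on_trace_ready=tensorboard_trace_handler(\"./log/train_profile\"),\n) as prof:\n    for _ in range(6):\n        inputs = torch.randn(5, 3, 224, 224)\n        targets = torch.randint(0, 1000, (5,))\n        loss = loss_fn(model(inputs), targets)\n        optimizer.zero_grad()\n        loss.backward()\n        optimizer.step()\n        prof.step()  # tell the profiler that a new iteration has started\n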

                    This ends the module on profiling. If you want to go into more detail on this topic we can recommend looking into line_profiler and kernprof. A downside of using Python's cProfile is that it can only profile at a function/module level, which is great for identifying hotspots in your code. However, sometimes the cause of a computational hotspot is a single line of code in a function, which will not be caught by cProfile. An example would be a simple indexing operation such as a[idx] = b, which for large arrays and non-sequential indices is really expensive. For these cases line_profiler and kernprof are excellent tools to have in your toolbox. Additionally, if you do not like cProfile we can also recommend py-spy, which is another open-source profiling tool for Python programs.
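
                    As a small illustration of how line_profiler is typically used (a sketch, assuming line_profiler is installed and the script is run with kernprof -l -v script.py):

                    import torch\n\ntry:\n    profile  # kernprof injects this decorator when the script is run through it\nexcept NameError:\n\n    def profile(func):  # fallback so the script also runs without kernprof\n        return func\n\n\n@profile\ndef scatter_assign(a: torch.Tensor, b: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:\n    a[idx] = b  # line-by-line timing reveals whether this indexing is the hotspot\n    return a\n\n\nif __name__ == \"__main__\":\n    a = torch.zeros(10_000_000)\n    idx = torch.randint(0, a.shape[0], (1_000_000,))\n    scatter_assign(a, torch.ones(idx.shape[0]), idx)\n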

                    "},{"location":"s5_continuous_integration/","title":"Continuous Integration","text":"

                    Slides

                    • Learn how to write unit tests that cover both data and models in your ML pipeline.

                      M16: Unit testing

                    • Learn how to implement continuous integration using Github actions such that tests are automatically executed on code changes.

                      M17: Github Actions

                    • Learn how to use pre-commit to ensure that code that is not up to standard does not get committed.

                      M18: Pre-commit

                    • Learn how to implement continuous machine learning pipelines in Github actions.

                      M19: Continuous Machine Learning

                    Continuous integration is a sub-discipline of the general field of Continuous X. Continuous X is one of the core elements of modern DevOps, and by extension MLOps. Continuous X assumes that we have a (long) developer pipeline (see image below) where we want to make some changes to our code, e.g.:

                    • Update our training data or data processing
                    • Update our model architecture
                    • Something else...

                    Basically, any code change we make is expected to have an influence on the final result. The problem with making changes to the start of our pipeline is that we want the change to propagate all the way through to the end of the pipeline.

                    Image credit

                    This is where continuous X comes into play. The word continuous here refers to the fact that the pipeline should continuously be updated as we make code changes. You can also choose to think of this as the automation of processes. The X then covers the fact that the process we need to go through to automate steps in the pipeline depends on where we are in the pipeline, e.g. the tools needed to do continuous integration are different from the tools needed to do continuous delivery.

                    In this session, we are going to focus on continuous integration (CI). As indicated in the image above, continuous integration usually takes care of the first part of the developer pipeline, which has to do with the code base, code building and code testing. This is a paramount step in automation, as we would rather catch bugs at the beginning of our pipeline than at the end.

                    Learning objectives

                    The learning objectives of this session are:

                    • Being able to write unit tests that cover both data and models in your ML pipeline
                    • Know how to implement continuous integration using Github actions such that tests are automatically executed on code changes
                    • Can use pre-commit to ensure that code that is not up to standard does not get committed
                    • Know how to implement continuous integration for continuous building of containers
                    • Basic knowledge of how machine learning processes can be implemented in a continuous way
                    "},{"location":"s5_continuous_integration/cml/","title":"M19 - Continuous Machine Learning","text":""},{"location":"s5_continuous_integration/cml/#continuous-machine-learning","title":"Continuous Machine Learning","text":"

                    The continuous integration we have looked at until now is what we can consider \"classical\" continuous integration, which has its roots in DevOps and not MLOps. While the tests that we have written and the containers we have developed in the previous sessions have been about machine learning, everything we have done translates completely to how it would be done if we had developed any other application that did not include machine learning.

                    In this session, we are now going to change gears and look at continuous machine learning (CML). As the name suggests, we are now focusing on automating actual machine learning processes. The reason for doing this is the same as with continuous integration, namely that we often have a bunch of checks that we want our newly trained model to pass before we trust it to be ready for deployment. Writing unit tests ensures that the code that we use for training our model is not broken, but there exist other failure modes of a machine learning pipeline:

                    • Did I train on the correct data?
                    • Did my model converge at all?
                    • Did a metric that I care about improve?
                    • Did I overfit?
                    • Did I underfit?
                    • ...

                    All of these are questions that we can answer by writing tests that are specific to machine learning, as sketched below. In this session, we are going to look at how we can begin to use Github Actions to automate these tests.
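
                    As an example of what such a machine-learning-specific test could look like (a sketch using a toy model, not part of the exercise files), here is a test that checks that the model can actually reduce the loss on a tiny batch of data:

                    import torch\n\n\ndef test_model_can_overfit_tiny_batch():\n    \"\"\"Sanity check: if the model cannot reduce the loss on 8 samples, training is likely broken.\"\"\"\n    torch.manual_seed(0)\n    model = torch.nn.Linear(10, 2)\n    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)\n    loss_fn = torch.nn.CrossEntropyLoss()\n    x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))\n\n    initial_loss = loss_fn(model(x), y).item()\n    for _ in range(50):\n        optimizer.zero_grad()\n        loss = loss_fn(model(x), y)\n        loss.backward()\n        optimizer.step()\n\n    assert loss.item() < initial_loss, \"loss did not decrease on a tiny batch\"\n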

                    "},{"location":"s5_continuous_integration/cml/#mlops-maturity-model","title":"MLOps maturity model","text":"

                    Before getting started with the exercises, let's first take a side step and look at what is called the MLOps maturity model. The reason here is to get a better understanding of when continuous machine learning is relevant. The main idea behind the MLOps maturity model is to help organizations understand where they are in their machine learning operations journey and what the next logical steps are. The model is divided into five stages:

                    Image credit

                    Level 0

                    At this level, organizations are doing machine learning in an ad-hoc manner. There is no standardization, no version control, no testing, and no monitoring.

                    Level 1

                    At this level, organizations have started to implement DevOps practices in their machine learning workflows. They have started to use version control and may have basic continuous integration practices in place.

                    Level 2

                    At this level, organizations have started to standardize the training process and tackle the problem of creating reproducible experiments. Centralization of model artifacts and metadata is common at this level. They have started to implement model versioning and model registry practices.

                    Level 3

                    At this level, organizations have started to implement continuous integration and continuous deployment practices. They have started to automate the testing of their models and have started to monitor their models in production.

                    Level 4

                    At this level, organizations have started to implement continuous machine learning practices. They have started to automate the training, evaluation, and deployment of their models. They have started to implement automated retraining and model updates.

                    The MLOps maturity model tells us that continuous machine learning is the highest form of maturity in MLOps. It is the stage where we have automated the entire machine learning pipeline and the cases we will be going through in the exercises are therefore some of the last steps in the MLOps maturity model.

                    "},{"location":"s5_continuous_integration/cml/#exercises","title":"\u2754 Exercises","text":"

                    In the following exercises, we are going to look at two different cases where we can use continuous machine learning. The first one is a simple case where we are automatically going to trigger some workflow (like training of a model) whenever we make changes to our data. This is a very common use case in machine learning where we have a data pipeline that is continuously updating our data. The second case is connected to staging and deploying models. In this case, we are going to look at how we can automatically do further processing of our model whenever we push a new model to our model registry.

                    1. For the first set of exercises, we are going to rely on the cml framework by iterative.ai, which is a framework that is built on top of GitHub actions. The figure below describes the overall process using the cml framework. It should be clear that it is the very same process that we go through in the other continuous integration sessions: push code -> trigger GitHub actions -> do stuff. The new part in this session is that we are only going to trigger the workflow whenever data changes.

                      Image credit

                      1. If you have not already created a dataset class for the corrupted Mnist data, start by doing that. Essentially, it is a class that should inherit from torch.utils.data.Dataset and should implement the __getitem__ and __len__ methods.

                        Solution dataset.py
                        from __future__ import annotations\n\nimport os\nfrom typing import TYPE_CHECKING\n\nimport torch\nfrom torch import Tensor\nfrom torch.utils.data import Dataset\n\nif TYPE_CHECKING:\n    import torchvision.transforms.v2 as transforms\n\n\nclass MnistDataset(Dataset):\n    \"\"\"MNIST dataset for PyTorch.\n\n    Args:\n        data_folder: Path to the data folder.\n        train: Whether to load training or test data.\n        img_transform: Image transformation to apply.\n        target_transform: Target transformation to apply.\n    \"\"\"\n\n    name: str = \"MNIST\"\n\n    def __init__(\n        self,\n        data_folder: str = \"data\",\n        train: bool = True,\n        img_transform: transforms.Transform | None = None,\n        target_transform: transforms.Transform | None = None,\n    ) -> None:\n        super().__init__()\n        self.data_folder = data_folder\n        self.train = train\n        self.img_transform = img_transform\n        self.target_transform = target_transform\n        self.load_data()\n\n    def load_data(self) -> None:\n        \"\"\"Load images and targets from disk.\"\"\"\n        images, target = [], []\n        if self.train:\n            nb_files = len([f for f in os.listdir(self.data_folder) if f.startswith(\"train_images\")])\n            for i in range(nb_files):\n                images.append(torch.load(f\"{self.data_folder}/train_images_{i}.pt\"))\n                target.append(torch.load(f\"{self.data_folder}/train_target_{i}.pt\"))\n        else:\n            images.append(torch.load(f\"{self.data_folder}/test_images.pt\"))\n            target.append(torch.load(f\"{self.data_folder}/test_target.pt\"))\n        self.images = torch.cat(images, 0)\n        self.target = torch.cat(target, 0)\n\n    def __getitem__(self, idx: int) -> tuple[Tensor, Tensor]:\n        \"\"\"Return image and target tensor.\"\"\"\n        img, target = self.images[idx], self.target[idx]\n        if self.img_transform:\n            img = self.img_transform(img)\n        if self.target_transform:\n            target = self.target_transform(target)\n        return img, target\n\n    def __len__(self) -> int:\n        \"\"\"Return the number of images in the dataset.\"\"\"\n        return self.images.shape[0]\n
                      2. Then let's create a function that can report basic statistics such as the number of training samples and the number of test samples, and generate figures of sample images in the dataset and of the distribution of the classes in the dataset. This function should be called dataset_statistics and should take a path to the dataset as input.

                        Solution dataset.py
                        import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom mnist_dataset import MnistDataset\nfrom utils import show_image_and_target\n\n\n@click.command()\n@click.option(\"--datadir\", default=\"data\", help=\"Path to the data directory\")\ndef dataset_statistics(datadir: str) -> None:\n    \"\"\"Compute dataset statistics.\"\"\"\n    train_dataset = MnistDataset(data_folder=datadir, train=True)\n    test_dataset = MnistDataset(data_folder=datadir, train=False)\n    print(f\"Train dataset: {train_dataset.name}\")\n    print(f\"Number of images: {len(train_dataset)}\")\n    print(f\"Image shape: {train_dataset[0][0].shape}\")\n    print(\"\\n\")\n    print(f\"Test dataset: {test_dataset.name}\")\n    print(f\"Number of images: {len(test_dataset)}\")\n    print(f\"Image shape: {test_dataset[0][0].shape}\")\n\n    show_image_and_target(train_dataset.images[:25], train_dataset.target[:25], show=False)\n    plt.savefig(\"mnist_images.png\")\n    plt.close()\n\n    train_label_distribution = torch.bincount(train_dataset.target)\n    test_label_distribution = torch.bincount(test_dataset.target)\n\n    plt.bar(torch.arange(10), train_label_distribution)\n    plt.title(\"Train label distribution\")\n    plt.xlabel(\"Label\")\n    plt.ylabel(\"Count\")\n    plt.savefig(\"train_label_distribution.png\")\n    plt.close()\n\n    plt.bar(torch.arange(10), test_label_distribution)\n    plt.title(\"Test label distribution\")\n    plt.xlabel(\"Label\")\n    plt.ylabel(\"Count\")\n    plt.savefig(\"test_label_distribution.png\")\n    plt.close()\n\n\nif __name__ == \"__main__\":\n    dataset_statistics()\n
                      3. Next, we are going to implement a GitHub actions workflow that only activates when we make changes to our data. Create a new workflow file (call it cml_data.yaml) and make sure it only activates on push/pull-request events when data/ changes. Relevant documentation

                        Solution

                        The secret is to use the paths keyword in the workflow file. We here specify that the workflow should only trigger when the .dvc folder or any file with the .dvc extension changes, which is the case when we update our data and call dvc add data/.

                        name: DVC Workflow\n\non:\n  pull_request:\n    branches:\n    - main\n    paths:\n    - '**/*.dvc'\n    - '.dvc/**'\n
                      4. The next step is to implement steps in our workflow that do something when data changes. This is the reason why we created the dataset_statistics function. Implement a workflow that:

                        1. Checks out the code
                        2. Sets up Python
                        3. Installs dependencies
                        4. Downloads the data
                        5. Runs the dataset_statistics function on the data
                        Solution

                        This solution assumes that data is stored in a GCP bucket and that the credentials are stored in a secret called GCP_SA_KEY. If this is not the case for you, you need to adjust the workflow accordingly with the correct way to pull the data.

                        jobs:\n  run_data_checker:\n    runs-on: ubuntu-latest\n    steps:\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n        cache: 'pip'\n        cache-dependency-path: setup.py\n\n    - name: Install dependencies\n      run: |\n        make dev_requirements\n        pip list\n\n    - name: Auth with GCP\n      uses: google-github-actions/auth@v2\n      with:\n        credentials_json: ${{ secrets.GCP_SA_KEY }}\n\n    - name: Pull data\n      run: |\n        dvc pull --no-run-cache\n\n    - name: Check data statistics\n      run: |\n        python dataset_statistics.py\n
                      5. Let's make sure that the workflow works as expected for now. Create a new branch and either add or remove a file in the data/ folder. Then run

                        dvc add data/\ngit add data.dvc\ngit commit -m \"Update data\"\ngit push\n

                        to commit the changes to data. Open a pull request with the branch and make sure that the workflow activates and runs as expected.

                      6. Let's now add the cml framework such that we can automatically comment the results of the dataset_statistics function on the pull request. Look at the getting started guide for help on how to do this. You will need to write all the output of the dataset_statistics function to a file called report.md and then use the cml comment create command to create a comment in the pull request with the content of the file.

                        Solution
                        jobs:\n  dataset_statistics:\n    runs-on: ubuntu-latest\n    steps:\n    # ...all the previous steps\n    - name: Check data statistics & generate report\n      run: |\n        python src/example_mlops/data.py > data_statistics.md\n        echo '![](./mnist_images.png \"MNIST images\")' >> data_statistics.md\n        echo '![](./train_label_distribution.png \"Train label distribution\")' >> data_statistics.md\n        echo '![](./test_label_distribution.png \"Test label distribution\")' >> data_statistics.md\n\n    - name: Setup cml\n      uses: iterative/setup-cml@v2\n\n    - name: Comment on PR\n      env:\n        REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n      run: |\n        cml comment create data_statistics.md --watermark-title=\"Data Checker\" # (1)!\n
                        1. The --watermark-title flag is used to watermark the comment created by cml. It is to make sure that no new comments are created every time the workflow runs.
                      7. Make sure that the workflow works as expected. You should see a comment created by github-actions (bot) like this if you have done everything correctly:

                      8. (Optional) Feel free to add more checks to the workflow. For example, you could add a check that runs a small baseline model on the updated data and checks that the model converges. This is a very common sanity check that is done in machine learning pipelines.

                    2. For the second set of exercises, we are going to look at how to automatically run further testing of our models whenever we add them to our model registry. For that reason, do not continue with this set of exercises before you have completed the exercises on the model registry in this module.

                      The model registry is in general a repository of a team's trained models where ML practitioners publish candidates for production and share them with others. Figure from wandb.

                      1. The first step is to create a team in our weights and bias account. Some of these more advanced features are only available for teams, but every user is allowed to create one team for free. Go to your weights and bias account and create a team (the option should be on the left side of the UI). Give the team a name and select W&B cloud storage.

                      2. Now we need to generate a personal access token that can link our weights and bias account to our GitHub account. Go to this page and generate a new token. You can also find the page by clicking your profile icon in the upper right corner of Github and selecting Settings, then Developer settings, then Personal access tokens, and finally choose either Tokens (classic) or Fine-grained tokens (which is the safer option and is also what the link points to).

                        Give it a name, set what repositories it should have access to and select the permissions you want it to have. In our case, if you choose to create a Fine-grained token then it needs access to the contents:write permission. If you choose Tokens (classic) then it needs access to the repo permission. After you have created the token, copy it and save it somewhere safe.

                      3. Go to the settings of your newly created team: https://wandb.ai/teamname/settings and scroll down to the Team secrets section. Here add the token you just created as a secret with the name GITHUB_ACTIONS_TOKEN. WANDB will now be able to use this token to trigger actions in your repository.

                      4. On the same settings page, scroll down to the Webhooks settings. Click the New webhook button and fill in the following information:

                        • Name: github_actions_dispatch
                        • URL: https://api.github.com/repos/<owner>/<repo>/dispatches
                        • Access token: GITHUB_ACTIONS_TOKEN
                        • Secret: leave empty

                        You here need to replace <owner> and <repo> with your own information. The /dispatches endpoint is a special endpoint that all Github actions workflows can listen to. Thus, if you ever want to set up a webhook in some other framework that should trigger a Github action, you can use this endpoint. A sketch of how this endpoint can be called manually is shown below.
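
                        For testing purposes, the same endpoint can also be called manually, e.g. with a small Python sketch like the one below (the owner/repo placeholders and the token environment variable are our own assumptions):

                        import os\n\nimport requests\n\n# sketch: manually trigger the /dispatches endpoint that the wandb webhook would otherwise call\nresponse = requests.post(\n    \"https://api.github.com/repos/<owner>/<repo>/dispatches\",  # replace <owner> and <repo>\n    headers={\n        \"Authorization\": f\"Bearer {os.getenv('GITHUB_ACTIONS_TOKEN')}\",\n        \"Accept\": \"application/vnd.github+json\",\n    },\n    json={\n        \"event_type\": \"staged_model\",\n        \"client_payload\": {\"artifact_version_string\": \"<entity>/<project>/<artifact_name>:<version>\"},\n    },\n)\nresponse.raise_for_status()  # raises if GitHub did not accept the dispatch\n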

                      5. Next, navigate to your model registry. It should hopefully contain at least one registry with at least one model registered. If not, go back to the previous module and do that.

                      6. When you have a model in your registry, click on the View details button. Then click the New automation button. On the first page, select that you want to trigger the automation when an alias is added to a model version, set that alias to staging and select the action type to be Webhook. On the next page, select the github_actions_dispatch webhook that you just created and add this as the payload:

                        {\n    \"event_type\": \"staged_model\",\n    \"client_payload\":\n    {\n        \"event_author\": \"${event_author}\",\n        \"artifact_version\": \"${artifact_version}\",\n        \"artifact_version_string\": \"${artifact_version_string}\",\n        \"artifact_collection_name\": \"${artifact_collection_name}\",\n        \"project_name\": \"${project_name}\",\n        \"entity_name\": \"${entity_name}\"\n    }\n}\n

                        Finally, on the next page give the automation a name and click Create automation.

                        Make sure you understand overall what is happening here.

                        Solution

                        The automation is set up to trigger a webhook whenever the alias staging is added to a model version. The webhook is set up to trigger a Github action workflow that listens to the /dispatches endpoint and has the event type staged_model. The payload that is sent to the webhook contains information about the model that was staged.

                      7. We are now ready to create the Github actions workflow that listens to the /dispatches endpoint and triggers whenever a model is staged. Create a new workflow file (called stage_model.yaml) and make sure it only activates on the staged_model event. Hint: relevant documentation

                        Solution
                        name: Check staged model\n\non:\n  repository_dispatch:\n    types: staged_model\n
                      8. Next, we need to implement the steps in our workflow that do something when a model is staged. The payload that is sent to the webhook contains information about the model that was staged. Implement a workflow that:

                        1. Identifies the model that was staged
                        2. Sets an environment variable with the corresponding artifact path
                        3. Outputs the model name
                        Solution
                        jobs:\n  identify_event:\n    runs-on: ubuntu-latest\n    outputs:\n      model_name: ${{ steps.set_output.outputs.model_name }}\n    steps:\n      - name: Check event type\n        run: |\n          echo \"Event type: repository_dispatch\"\n          echo \"Payload Data: ${{ toJson(github.event.client_payload) }}\"\n\n      - name: Setting model environment variable and output\n        id: set_output\n        run: |\n          echo \"model_name=${{ github.event.client_payload.artifact_version_string }}\" >> $GITHUB_OUTPUT\n
                      9. We now need to write a script that can be executed on our staged model. In this case, we are going to run some performance tests on it to check that it is fast enough for deployment. Therefore, do the following:

                        1. In a tests/performancetests folder, create a new file called test_model.py

                        2. Implement a test that loads the model from a wandb artifact path, e.g. <entity>/<project>/<artifact_name>:<version>, and runs it on a random input. Importantly, the artifact path should be read from an environment variable called MODEL_NAME.

                        3. The test should assert that the model can do 100 predictions in less than X amount of time

                        4. Solution

                          In this solution we assume that 4 environment variables are set: WANDB_API_KEY, WANDB_ENTITY, WANDB_PROJECT and MODEL_NAME.

                          test_model.py
                          import os\nimport time\n\nimport torch\nimport wandb\nfrom my_project.models import MyModel\n\n\ndef load_model(artifact_path, logdir=\"models\"):\n    \"\"\"Download the artifact from wandb and load the model checkpoint it contains.\"\"\"\n    api = wandb.Api(\n        api_key=os.getenv(\"WANDB_API_KEY\"),\n        overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n    )\n    artifact = api.artifact(artifact_path)\n    artifact.download(root=logdir)\n    file_name = artifact.files()[0].name\n    return MyModel.load_from_checkpoint(f\"{logdir}/{file_name}\")\n\n\ndef test_model_speed():\n    model = load_model(os.getenv(\"MODEL_NAME\"))\n    start = time.time()\n    for _ in range(100):\n        model(torch.rand(1, 1, 28, 28))\n    end = time.time()\n    assert end - start < 1\n
                        5. Let's now add another job that calls the script we just wrote. It needs to:

                          • Setup the correct environment variables
                          • Checkout the code
                          • Setup Python
                          • Install dependencies
                          • Run the test

                          which is very similar to the kind of jobs we have written before.

                          Solution
                          jobs:\n  identify_event:\n    ...\n  test_model:\n    runs-on: ubuntu-latest\n    needs: identify_event\n    env:\n      WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}\n      WANDB_ENTITY: ${{ secrets.WANDB_ENTITY }}\n      WANDB_PROJECT: ${{ secrets.WANDB_PROJECT }}\n      MODEL_NAME: ${{ needs.identify_event.outputs.model_name }}\n    steps:\n    - name: Echo model name\n      run: |\n        echo \"Model name: $MODEL_NAME\"\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n        cache: 'pip'\n        cache-dependency-path: setup.py\n\n    - name: Install dependencies\n      run: |\n        pip install -r requirements.txt\n        pip list\n\n    - name: Test model\n      run: |\n        pytest tests/performancetests/test_model.py\n
                        6. Finally, we are going to assume in this setup that if the model gets this far then it is ready for deployment. We are therefore going to add a final job that will add a new alias called production to the model. Here is some relevant Python code that can be used to add the alias:

                          import click\nimport os\nimport wandb\n\n@click.command()\n@click.argument(\"artifact-path\")\n@click.option(\n    \"--aliases\", \"-a\", multiple=True, default=[\"staging\"], help=\"List of aliases to link the artifact with.\"\n)\ndef link_model(artifact_path: str, aliases: list[str]) -> None:\n    \"\"\"\n    Stage a specific model to the model registry.\n\n    Args:\n        artifact_path: Path to the artifact to stage.\n            Should be of the format \"entity/project/artifact_name:version\".\n        aliases: List of aliases to link the artifact with.\n\n    Example:\n        model_management link-model entity/project/artifact_name:version -a staging -a best\n\n    \"\"\"\n    if artifact_path == \"\":\n        click.echo(\"No artifact path provided. Exiting.\")\n        return\n\n    api = wandb.Api(\n        api_key=os.getenv(\"WANDB_API_KEY\"),\n        overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n    )\n    _, _, artifact_name_version = artifact_path.split(\"/\")\n    artifact_name, _ = artifact_name_version.split(\":\")\n\n    artifact = api.artifact(artifact_path)\n    artifact.link(target_path=f\"{os.getenv('WANDB_ENTITY')}/model-registry/{artifact_name}\", aliases=aliases)\n    artifact.save()\n    click.echo(f\"Artifact {artifact_path} linked to {aliases}\")\n

                          for example, you can run this script with the following command:

                          python link_model.py entity/project/artifact_name:version -a staging -a production\n

                          Implement a final job that calls this script and adds the production alias to the model.

                          Solution
                          jobs:\n  identify_event:\n    ...\n  test_model:\n    ...\n  add_production_alias:\n    runs-on: ubuntu-latest\n    needs: identify_event\n    env:\n      WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}\n      WANDB_ENTITY: ${{ secrets.WANDB_ENTITY }}\n      WANDB_PROJECT: ${{ secrets.WANDB_PROJECT }}\n      MODEL_NAME: ${{ needs.identify_event.outputs.model_name }}\n    steps:\n    - name: Echo model name\n      run: |\n        echo \"Model name: $MODEL_NAME\"\n\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n        cache: 'pip'\n        cache-dependency-path: setup.py\n\n    - name: Install dependencies\n      run: |\n        pip install -r requirements.txt\n        pip list\n\n    - name: Add production alias\n      run: |\n        python link_model.py $MODEL_NAME -a production\n
                        7. Finally, make sure the workflow works as expected. To try it out again and again for testing purposes, you can just manually add and then delete the staging alias to any model version in the model registry.

                        8. (Optional) Consider adding more checks to the workflow. For example, you could add a step that checks if the model is too large for deployment, runs some further evaluation scripts, or checks if the model is robust to adversarial attacks. Only the imagination sets the limits here.

                        9. (Optional) If you have got this far, consider combining principles from the two exercises. Here is an idea: we use the workflow from the second exercise to trigger a workflow that checks a staged model for performance. We then use the cml framework to automatically create a pull request, e.g. use cml pr create instead of cml comment create to create a pull request with the results of the performance test. If we are happy with the performance, we can then approve that pull request and the production alias is added to the model. This is a better workflow because it allows for human intervention before the model is deployed.

                        10. "},{"location":"s5_continuous_integration/cml/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                          1. What is the difference between continuous integration and continuous machine learning?

                            Solution

                            There are three key differences between continuous integration and continuous machine learning:

                            • Scope: CI focuses on integrating and testing software code, while CML encompasses the entire lifecycle of machine learning models, including data handling, model training, evaluation, deployment, and monitoring.
                            • Automation Focus: CI automates code testing and integration, whereas CML automates the training, evaluation, deployment, and monitoring of machine learning models.
                            • Feedback Mechanisms: CI primarily uses automated tests to provide feedback on code quality. CML uses performance metrics from deployed models to provide feedback and trigger retraining or model updates.
                          2. Imagine you get hired in the pharmaceutical industry and are asked to develop a machine learning pipeline that can automatically sort out which drugs are safe and which are not. What level of the MLOps maturity model would you strive to reach?

                            Solution

                            There is really no right or wrong answer here, but in most cases we would actually not aim for level 4. The reason is that the consequences of a bad model in this case can be severe. Therefore, we would probably not want automated retraining and model updates, which is what level 4 is about. Instead, we would probably aim for level 3 where we have automated testing and monitoring of our models but there is still human oversight in the process.

                          This ends the module on continuous machine learning. As we have hopefully convinced you, only the imagination sets the limits for what you can use GitHub Actions for in your machine learning pipeline. However, we do want to stress that it is important that human oversight is always present in the process. Automation is great, but it should never replace human judgement. This is especially true in machine learning, where the consequences of a bad model can be severe if it is used in critical decision making.

                          Finally, if you have completed the exercises on using the cloud, consider checking out the cml runner launch command, which allows you to run your workflows on cloud resources instead of on the GitHub Actions runners.

                          "},{"location":"s5_continuous_integration/github_actions/","title":"M17 - Github Actions","text":""},{"location":"s5_continuous_integration/github_actions/#github-actions","title":"GitHub actions","text":"

                          Core Module

                          With the tests established in the previous module, we are now ready to move on to implementing some continuous integration in our pipeline. As you have probably already realized, testing your code locally can be cumbersome, because

                          • You need to run it often to make sure to catch bugs early on
                          • If you want to have high code coverage of your code base, you will need many tests that take a long time to run

                          For these reasons, we want to automate the testing, such that it is done every time we push to our repository. If we combine this with only pushing to branches and only merging these branches once all automated tests have passed, our code should be fairly safe against unwanted bugs (assuming your tests cover your code well).

                          "},{"location":"s5_continuous_integration/github_actions/#github-actions_1","title":"GitHub actions","text":"

                          GitHub actions are the continuous integration solution that GitHub provides. Every GitHub account gets 2,000 minutes of free workflow run time per month (and workflows in public repositories are free), which should be more than enough for the scope of this course (and probably all personal projects you do). Getting GitHub actions set up in a repository may seem complicated at first, but workflow files that you create for one repository can more or less be reused for any other repository that you have.

                          Let's take a look at how a GitHub workflow file is organized:

                          • Initially, we start by giving the workflow a name
                          • Next, we specify on what events the workflow should be triggered. This includes both the action (pull request, push etc.) and on what branches it should activate
                          • Next, we list the jobs that we want to do. Jobs are by default executed in parallel but can also be dependent on each other
                          • In the runs-on, we can specify which operating system we want the workflow to run on.
                          • Finally, we have the steps. This is where we specify the actual commands that should be run when the workflow is executed.

                          Image credit"},{"location":"s5_continuous_integration/github_actions/#exercises","title":"\u2754 Exercises","text":"
                          1. Start by creating a .github folder in the root of your repository. Add a sub-folder to that called workflows.

                          2. Go over this page that explains how to do automated testing of Python code in GitHub actions. You do not have to understand everything, but at least get a feeling of what a workflow file should look like.

                          3. We have provided a workflow file called tests.yaml that should run your tests for you. Place this file in the .github/workflows/ folder. The workflow file consists of three steps

                            • First, a Python environment is initiated (in this case Python 3.11)

                            • Next all dependencies required to run the test are installed

                            • Finally, pytest is called and our tests will be run

                            Go over the file and try to understand the overall structure and syntax of the file.

                            tests.yaml tests.yaml
                            name: \"Run tests\"\n\non:\n  push:\n    branches: [ master, main ]\n  pull_request:\n    branches: [ master, main ]\n\njobs:\n  build:\n\n    runs-on: ubuntu-latest\n\n    steps:\n    - name: Checkout\n      uses: actions/checkout@v4\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n    - name: Install dependencies\n      run: |\n        python -m pip install --upgrade pip\n        pip install -r requirements.txt\n        pip install -r requirements_tests.txt\n    - name: Test with pytest\n      run: |\n        pytest -v\n
                          4. For the script to work you need to define the requirements.txt and requirements_tests.txt files. The first file should contain all packages required to run your code. The second file contains all additional packages required to run the tests. In your case, it may very well be that the second file is empty; however, sometimes additional packages are used for testing that are not strictly required for the scripts to run.

                          5. Finally, try pushing the changes to your repository. Hopefully, your tests should just start, and after some time you will see a green check mark next to the hash of the commit. Also, try to inspect the Actions tab where you can see the history of actions run.

                          6. Normally we develop code on only one operating system and just hope that it will work on other operating systems. However, continuous integration enables us to automatically test on other systems than the one we are using.

                            1. The provided tests.yaml only runs on one operating system. Which one?

                            2. Alter the file such that it executes the test on the two other main operating systems that exist. You can find information on available operating systems also called runners here

                              Solution

                              We can \"parametrize\" of script to run on different operating systems by using the strategy attribute. This attribute allows us to define a matrix of values that the workflow will run on. The following code will run the tests on ubuntu-latest, windows-latest, and macos-latest:

                              tests.yaml
                              jobs:\n  build:\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n
                            3. Can you also figure out how to run the tests using different Python versions?

                              Solution

                              Just add another line to the strategy attribute that specifies the Python version and use the value in the setup Python action. The following code will run the tests on Python versions 3.10, 3.11 and 3.12:

                              tests.yaml
                              jobs:\n  build:\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n        python-version: [\"3.10\", \"3.11\", \"3.12\"]\n\n    steps:\n    - uses: actions/checkout@v4\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: ${{ matrix.python-version }}\n
                            4. If you push the changes above you may see that whenever one of the tests in the matrix fails, the other tests are automatically cancelled. This is to save time and resources. However, sometimes you want all the tests to run even if one fails. Can you figure out how to do that?

                              Solution

                              You can set the fail-fast attribute to false under the strategy attribute:

                              tests.yaml
                              jobs:\n  build:\n    runs-on: ${{ matrix.os }}\n    strategy:\n      fail-fast: false\n      matrix:\n        os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n        python-version: [\"3.10\", \"3.11\", \"3.12\"]\n
                          7. As the workflow is currently implemented, GitHub actions will discard every downloaded package when the workflow has finished executing. To improve on this, we can take advantage of caching:

                            1. Figure out how to implement caching in your workflow file. You can find a guide here and here.

                              Solution tests.yaml
                              steps:\n- uses: actions/checkout@v4\n- uses: actions/setup-python@v5\n  with:\n    python-version: 3.11\n    cache: 'pip' # caching pip dependencies\n- run: pip install -r requirements.txt\n
                            2. When you have implemented a caching system go to Actions->Caches in your repository and make sure that they are correctly added. It should look something like the image below

                            3. Measure how long your workflow takes before and after adding caching to your workflow. Did it improve the runtime of your workflow?

                          8. (Optional) Code coverage can also be added to the workflow file by uploading it as an artifact after running the coverage. Follow the instructions in this post on how to do it.

                          9. With different checks in place, it is a good time to learn about branch protection rules. A branch protection rule is essentially some kind of guarding that prevents you from merging code into a branch before certain conditions are met. In this exercise, we will create a branch protection rule that requires all checks to pass before merging code into the main branch.

                            1. Start by going into your Settings -> Rules -> Rulesets and create a new branch ruleset. See the image below.

                            2. In the ruleset start by giving it a name and then set the target branches to be Default branch. This means that the ruleset will be applied to your master/main branch. As shown in the image below, two rules may be particularly beneficial when you later start working with other people:

                              • The first rule to consider is Require a pull request before merging. As the name suggests this rule requires that changes that are to be merged into the main branch must be done through a pull request. This is a good practice as it allows for code review and testing before the code is merged into the main branch. Additionally, this opens the option to specify that the code must be reviewed (or at least approved) by a certain number of people.

                              • The second rule to consider is Require status checks to pass. This rule makes sure that our workflows are passing before we can merge code into the main branch. You can select which workflows are required, as some may be nice to have passing but not strictly needed.

                              Finally, if you think the rules are a bit too restrictive, you can always allow the repository admin (i.e. you) to bypass the rules by adding Repository admin to the bypass list. Implement the following rules:

                              • At least one person needs to approve any PR
                              • All your workflows need to pass
                              • All conversations need to be resolved
                            3. If you have created the rules correctly you should see something like the image below when you try to merge a pull request. In this case, all three checks are required to pass before the code can be merged. Additionally, a single reviewer is required to approve the code. A bypass rule is also set up for the repository admin.

                          10. One problem you may have encountered is running tests that have to do with your data: the core problem is that your data is not stored in GitHub (assuming you have done module M8 - DVC) and therefore cannot be tested. However, we can download the data while running our continuous integration. Let's try to set that up:

                            1. The first problem is that we need our continuous integration pipeline to be able to authenticate with our storage solution. We can take advantage of an authentication file that is created the first time we push with DVC. It is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json where $CACHE_HOME depends on your operating system:

                              • macOS: ~/Library/Caches
                              • Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)
                              • Windows: {user}/AppData/Local

                              Find the file. The content should look similar to this (only some fields are shown):

                              {\n    \"access_token\": ...,\n    \"client_id\": ...,\n    \"client_secret\": ...,\n    \"refresh_token\": ...,\n    ...\n}\n
                            2. The content of that file should be treated as a password and not shared with the world, so the relevant question is how to use this info in a public repository. The answer is GitHub secrets, where we can store information and access it in our workflow files without it becoming public. Navigate to the secrets option (as shown below) and create a secret with the name GDRIVE_CREDENTIALS_DATA that contains the content of the file you found in the previous exercise.

                            3. Afterward, add the following code to your workflow file:

                              - uses: iterative/setup-dvc@v1\n- name: Get data\n  run: dvc pull\n  env:\n    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n

                              that runs dvc pull using the secret authentication file. For help you can visit this small repository that implements the same workflow.

                            4. Finally, add the changes, commit, push and confirm that everything works as expected. You should now be able to run unit tests that depend on your input data.

                          11. In module M6 on good coding practices (optional module) you were introduced to practices such as keeping a consistent coding style, sorting your Python imports and making sure your code follows certain standards. All this was done using the ruff framework. In this set of exercises, we will create GitHub workflows that automatically test for this.

                            1. Create a new workflow file called codecheck.yaml, that implements the following three steps

                              • Setup Python environment

                              • Installs ruff

                              • Runs ruff check and ruff format on the repository

                              (HINT: You should be able to just change the last steps of the tests.yaml workflow file)

                              Solution codecheck.yaml
                              name: Code formatting\n\non:\n  push:\n    branches:\n    - main\n  pull_request:\n    branches:\n    - main\n\njobs:\n  format:\n      runs-on: ubuntu-latest\n      steps:\n      - name: Checkout code\n        uses: actions/checkout@v4\n      - name: Set up Python\n        uses: actions/setup-python@v5\n        with:\n          python-version: 3.11\n          cache: 'pip'\n          cache-dependency-path: setup.py\n      - name: Install dependencies\n        run: |\n          pip install ruff\n          pip list\n      - name: Ruff check\n        run: ruff check .\n      - name: Ruff format\n        run: ruff format .\n
                            2. In addition to ruff we also used mypy in those sets of exercises for checking if the typing we added to our code was good enough. Add another step to the codecheck.yaml file which runs mypy on your repository.

                            3. Try to make sure that all steps pass on your repository. Especially mypy can be hard to get passing, so this exercise formally only requires you to get ruff passing.

                          12. (Optional) As you have probably already experienced in module M9 on docker, it can be cumbersome to build docker images, sometimes taking a couple of minutes each time we make changes to our code base. For this reason, we want to build a new image every time we commit our code, because a commit should mark that we believe the code to be working at that point. Thus, let's automate the process of building our docker images using GitHub Actions. Do note that a future module will look at how to build containers using cloud providers, so this exercise is very much optional.

                            1. Start by making sure you have a dockerfile in your repository. If you do not have one, you can use the following simple dockerfile:

                              FROM busybox\nCMD echo \"Howdy cowboy\"\n
                            2. Push the dockerfile to your repository

                            3. Next, create a Docker Hub account

                            4. Within Docker Hub create an access token by going to Settings -> Security. Click the New Access Token button and give it a name that you recognize.

                            5. Copy the newly created access token and head over to your GitHub repository online. Go to Settings -> Secrets -> Actions and click the New repository secret. Copy over the access token and give it the name DOCKER_HUB_TOKEN. Additionally, add two other secrets DOCKER_HUB_USERNAME and DOCKER_HUB_REPOSITORY that contain your docker username and docker repository name respectively.

                            6. Next, we are going to construct the actual Github actions workflow file

                              name: Docker Image continuous integration\n\non:\n  push:\n    branches: [ master ]\n\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v4\n    - name: Build the Docker image\n      run: |\n        echo \"${{ secrets.DOCKER_HUB_TOKEN }}\" | docker login \\\n          -u \"${{ secrets.DOCKER_HUB_USERNAME }}\" --password-stdin docker.io\n        docker build . --file Dockerfile \\\n          --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n        docker push \\\n          docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n

                              The first part of the workflow file should look somewhat recognizable. However, the last three lines are where all the magic happens. Carefully go through them and figure out what they do. If you want some help you can look at the help page for docker login, docker build and docker push.

                            7. Upload the workflow to your GitHub repository and check that it is being executed. If everything works you should be able to see the built docker image in your container repository on Docker Hub.

                            8. Make sure that you can execute docker pull locally to pull down the image that you just continuously built

                            9. (Optional) To test that the container works directly in GitHub you can also try to include an additional step that runs the container.

                                  - name: Run container\n      run: |\n        docker run ...\n
                          "},{"location":"s5_continuous_integration/github_actions/#dependabot","title":"Dependabot","text":"

                          A great feature that GitHub provides is the ability to have bots help you with maintaining your repository. One of the most useful bots is called Dependabot. As the name suggests, Dependabot helps you keep your dependencies up to date. This is important because dependencies often either contain fixes for bugs or security vulnerabilities that you want to have in your code.

                          "},{"location":"s5_continuous_integration/github_actions/#exercises_1","title":"\u2754 Exercises","text":"
                          1. To get dependabot working in your repository, we need to add a single configuration file to your repository. Create a file called .github/dependabot.yaml. Look through the documentation for how to set up the file such that it updates your Python dependencies on a weekly basis.

                            Solution

                            The following code will check for updates in the pip ecosystem every week, i.e. it will automatically look for requirements.txt files and update the packages in them.

                            version: 2\nupdates:\n  - package-ecosystem: \"pip\"\n    directory: \"/\"\n    schedule:\n      interval: \"weekly\"\n
                            1. Push the changes to your repository and check that Dependabot is working by going to the Insights tab and then the Dependency graph tab. From there, under the Dependabot tab, you should be able to see if the bot has correctly identified what files to track and if it has found any updates.

                            Click the Recent update jobs to see the history of Dependabot checking for updates. If there are no updates you can try to click the Check for updates button to force Dependabot to check for updates.

                          2. At this point Dependabot should hopefully have found some updates and created one or more pull requests. If it has not done so, you most likely need to update your requirements file such that your dependencies are correctly restricted/specified, e.g.

                            # lets assume pytorch v2.5 is the latest version\n\n# these different specifications will not trigger dependabot because\n# the latest version is included in the specification\ntorch\ntorch == 2.5\ntorch >= 2.5\ntorch ~= 2.5\n\n# these specifications will trigger dependabot because the latest\n# version is not included\ntorch < 2.5\ntorch == 2.4\ntorch <= 2.4\n

                            If you have a pull request from Dependabot, check it out and see if it looks good. If it does, you can merge it.
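                            If you are unsure how a given version specifier behaves, a quick way to sanity check it is with the packaging library (pip install packaging); the snippet below is just such a sanity check for the specifiers listed above:

                            from packaging.specifiers import SpecifierSet\nfrom packaging.version import Version\n\nlatest = Version(\"2.5\")  # pretend this is the latest torch release\nfor spec in [\"\", \"==2.5\", \">=2.5\", \"~=2.5\", \"<2.5\", \"==2.4\", \"<=2.4\"]:\n    # if the latest version is not allowed by the specifier, Dependabot has something to update\n    print(f\"torch {spec or '(unpinned)'}: latest allowed = {latest in SpecifierSet(spec)}\")\n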

                          3. (Optional) Dependabot can also help keep our GitHub Actions pipelines up-to-date. As you may have realized during this module, when we write statements like the following in our workflow files:

                            ...\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v4\n...\n

                            The @v4 specifies that we are using version 4 of the actions/checkout action. This means that if a new version of the action is released, we will not automatically get the new version. Dependabot can help us with this. Try adding to the dependabot.yaml file that Dependabot should also check for updates in the GitHub Actions ecosystem.

                            Solution
                            version: 2\nupdates:\n  - package-ecosystem: \"pip\"\n    directory: \"/\"\n    schedule:\n      interval: \"weekly\"\n  - package-ecosystem: \"github-actions\"\n    directory: \"/\"\n    schedule:\n      interval: \"weekly\"\n
                          "},{"location":"s5_continuous_integration/github_actions/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                          1. When working with GitHub actions you will often encounter the following 4 concepts:

                            • Workflow
                            • Runner
                            • Job
                            • Action

                            Try to define them in your own words.

                            Solution
                            • Workflow: A yaml file that defines the instructions to be executed on specific events. Needs to be placed in the .github/workflows folder.
                            • Runner: Workflows need to run somewhere. The environment that the workflow is executed on is called the runner. Most commonly the runner is hosted by GitHub, but it can also be self-hosted.
                            • Job: A series of steps that are executed on the same runner. A workflow must include at least one job but often contains many.
                            • Action: An action is the smallest unit in a workflow. Jobs often consist of multiple actions that are executed sequentially.
                          2. The on attribute specifies upon which events the workflow will be triggered. Assume you have set the on attribute to the following:

                            on:\n    push:\n      branches: [main]\n    pull_request:\n      branches: [main]\n    schedule:\n      - cron: \"0 0 * * *\"\n    workflow_dispatch: {}\n

                            What 4 events would trigger the execution of that action?

                            Solution
                            1. Direct push to branch main would trigger it
                            2. Any pull request opened that will merge into main will trigger it
                            3. Once a day at midnight (the cron expression 0 0 * * *) the workflow would trigger, see cron for more info
                            4. The trigger can be executed by manually triggering it through the GitHub UI, for example, shown below
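                            If you are ever in doubt about when a cron expression fires, the third-party croniter package (pip install croniter) can compute the upcoming trigger times; a small sketch for the schedule above:

                            from datetime import datetime, timezone\nfrom croniter import croniter\n\n# next three triggers of \"0 0 * * *\" after noon on 1 January 2024 (UTC)\nschedule = croniter(\"0 0 * * *\", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))\nfor _ in range(3):\n    print(schedule.get_next(datetime))  # midnight on 2, 3 and 4 January\n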

                          This ends the module on GitHub workflows. If you are more interested in this topic you can check out module M31 on documentation, which first covers building some documentation for your project locally and afterwards using GitHub actions to deploy it to GitHub Pages. Additionally, GitHub also has a lot of templates for running different continuous integration tasks. If you try to create a workflow file directly in GitHub you may encounter the following page

                          We highly recommend checking this out if you want to write any other kind of continuous integration pipeline in GitHub actions. We can also recommend this repository, which has a list of awesome actions, and the act repository, which is a tool for running your GitHub Actions locally!

                          "},{"location":"s5_continuous_integration/pre_commit/","title":"M18 - Pre-commit","text":""},{"location":"s5_continuous_integration/pre_commit/#pre-commit","title":"Pre-commit","text":"

                          One of the cornerstones of working with git is remembering to commit your work often. Committing often makes it easier to identify and revert unwanted changes that you have introduced, because each commit contains a smaller set of code changes.

                          However, as you have hopefully already seen in the course, there are a lot of mental tasks to do before you actually write git commit in the terminal. The most basic is of course making sure that you have saved all your changes and are not committing an out-of-date file. However, this also includes tasks such as styling, formatting, making sure all tests succeed etc. All these mental to-do notes do not mix well with the principle of committing often, because you in principle have to go through them every time.

                          The obvious solution to this problem is to automate all or some of these mental tasks every time we do a commit. This is where pre-commit hooks come into play, as they let us attach additional tasks that are run every time we do a git commit.

                          "},{"location":"s5_continuous_integration/pre_commit/#configuration","title":"Configuration","text":"

                          Pre-commit works by inserting whatever workflow we want to automate between the point where we run git commit and the point where we would afterwards run git push.

                          Image credit

                          The system works by looking for a file called .pre-commit-config.yaml that we can configure. If we execute

                          pre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8\n

                          you should get a sample file that looks like

                          # See https://pre-commit.com for more information\n# See https://pre-commit.com/hooks.html for more hooks\nrepos:\n-   repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v3.2.0\n    hooks:\n    -   id: trailing-whitespace\n    -   id: end-of-file-fixer\n    -   id: check-yaml\n    -   id: check-added-large-files\n

                          the file structure is very simple:

                          • It starts by listing the repositories where we want to get our pre-commits from, in this case https://github.com/pre-commit/pre-commit-hooks. This repository contains a large collection of pre-commit hooks.
                          • Next, we define which pre-commit hooks we want to use by specifying the id of each hook. The id corresponds to an id in this file: https://github.com/pre-commit/pre-commit-hooks/blob/master/.pre-commit-hooks.yaml

                          When we are done defining our .pre-commit-config.yaml we just need to install it

                          pre-commit install\n

                          this will make sure that the file is automatically executed whenever we run git commit

                          "},{"location":"s5_continuous_integration/pre_commit/#exercises","title":"\u2754 Exercises","text":"
                          1. Install pre-commit

                            pip install pre-commit\n

                            Consider adding pre-commit to a requirements_dev.txt file, as it is a development tool.

                          2. Next create the sample file

                            pre-commit sample-config > .pre-commit-config.yaml\n
                          3. The sample file already contains 4 hooks. Make sure you understand what each of them does and whether you need them at all.

                          4. pre-commit works by hooking into the git commit command, running whenever that command is run. For this to work, we need to install the hooks into git commit. Run

                            pre-commit install\n

                            to do this.

                          5. Try to commit your recently created .pre-commit-config.yaml file. Pre-commit will likely not do anything yet, because it only checks files that are being committed. Instead try to run

                            pre-commit run --all-files\n

                            that will check every file in your repository.

                          6. Try adding at least another check from the base repository to your .pre-commit-config.yaml file.

                            Solution

                            In this case we have added the check-json hook to our .pre-commit-config.yaml file, which will automatically check that all JSON files are valid.

                            repos:\n-   repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v3.2.0\n    hooks:\n    -   id: trailing-whitespace\n    -   id: end-of-file-fixer\n    -   id: check-yaml\n    -   id: check-added-large-files\n    -   id: check-json\n
                          7. If you have completed the optional module M7 on good coding practice you will have learned about the linter ruff. ruff comes with its own pre-commit hook. Try adding that to your .pre-commit-config.yaml file and see what happens when you try to commit files.

                            Solution

                            This is one way to add the ruff pre-commit hook. We run both the ruff and ruff-format hooks, and we also add the --fix argument to the ruff hook to try to fix what is possible.

                            repos:\n-   repo: https://github.com/astral-sh/ruff-pre-commit\n    rev: v0.4.7\n    hooks:\n    # try to fix what is possible\n    -   id: ruff\n        args: [\"--fix\"]\n    # perform formatting updates\n    -   id: ruff-format\n    # validate if all is fine with preview mode\n    -   id: ruff\n

                          8. (Optional) Add more hooks to your .pre-commit-config.yaml.

                          9. Sometimes you are in a hurry, so make sure that you can also do commits without running pre-commit, e.g.

                            git commit -m <message> --no-verify\n
                          10. Finally, figure out how to disable pre-commit again (if you get tired of it).

                          11. Assuming you have completed the module on GitHub Actions, let's try to add a workflow that automatically runs your pre-commit checks every time you push to your repository and then commits the resulting changes back to your repository. We recommend that you make use of

                            • this pre-commit action for installing and running pre-commit
                            • this commit action to automatically commit the changes that pre-commit makes.

                            As an alternative, you can configure the CI tool provided by the creators of pre-commit.

                            Solution

                            The workflow first uses the pre-commit action to install and run the pre-commit checks. Importantly we run it with continue-on-error: true to make sure that the workflow does not fail if the checks fail. Next, we use git diff to list the changes that pre-commit has made and then we use the git-auto-commit-action to commit those changes.

                            .github/workflows/pre_commit.yaml
                            name: Pre-commit CI\n\non:\n  pull_request:\n  push:\n    branches: [main]\n\njobs:\n  pre-commit:\n    name: Check pre-commit\n    runs-on: ubuntu-latest\n\n    permissions:\n      contents: write\n\n    steps:\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: 3.11\n\n    - name: Install pre-commit\n      uses: pre-commit/action@v3.0.1\n      continue-on-error: true\n\n    - name: List modified files\n      run: |\n        git diff --name-only\n\n    - name: Commit changes\n      uses: stefanzweifel/git-auto-commit-action@v5\n      with:\n        commit_message: Pre-commit fixes\n        commit_options: '--no-verify'\n

                          That was all about how pre-commit can be used to automate tasks. If you want to dive deeper into the topic you can check out this page on how to define your own pre-commit hooks.

                          "},{"location":"s5_continuous_integration/unittesting/","title":"M16 - Unit testing","text":""},{"location":"s5_continuous_integration/unittesting/#unit-testing","title":"Unit testing","text":"

                          Core Module

                          What often comes to mind for many developers when discussing continuous integration (CI) is code testing. Continuous integration should ensure that whenever a codebase is updated it is automatically tested, such that if bugs have been introduced they are caught early on. If you look at the MLOps cycle, continuous integration is one of the cornerstones of the operations part. However, it should be noted that applying continuous integration does not magically ensure that your code does not break. Continuous integration is only as strong as the tests that are automatically executed; it simply structures and automates this.

                          Quote

                          Continuous Integration doesn\u2019t get rid of bugs, but it does make them dramatically easier to find and remove. Martin Fowler, Chief Scientist, ThoughtWorks

                          Image credit

                          The kind of tests we are going to look at are called unit tests. Unit testing refers to the practice of writing tests that check individual parts of your code base for correctness. By a unit, you can therefore think of a function, a module or in general any object. By writing tests in this way it should be very easy to isolate which part of the code broke after an update to the code base. Another way to test your code base would be through integration testing, which is equally important but not a focus of this course.

                          Unit tests (and integration tests) are not a concept unique to MLOps but a core concept of DevOps. However, it is important to note that testing machine learning-based systems is much more difficult than testing traditional systems. The reason is that machine learning systems depend on data, which influences the state of the system. For this reason, we not only need unit tests and integration tests of our code, we also need data testing, infrastructure testing and more monitoring to check that we stay within the data distribution we are training on (more on this in module M25 on data drifting). This added complexity is illustrated in the figure below.

                          "},{"location":"s5_continuous_integration/unittesting/#pytest","title":"Pytest","text":"

                          Before we can begin to automate testing of our code base we of course need to write the tests first. It is both a hard and tedious task to do but arguably the most important aspect of continuous integration. Python offers a couple of different libraries for writing tests. We are going to use pytest.

                          "},{"location":"s5_continuous_integration/unittesting/#exercises","title":"\u2754 Exercises","text":"

                          The following exercises should be applied to your MNIST repository

                          1. The first part of doing continuous integration is writing the unit tests. We do not expect you to cover every part of the code you have developed but try to at least write tests that cover two files. Start by creating a tests folder.

                          2. Read the getting started guide for pytest which is the testing framework that we are going to use

                          3. Install pytest:

                            pip install pytest\n
                          4. Write some tests. Below are some guidelines on some tests that should be implemented, but you are of course free to implement more tests. You can at any point check if your tests are passing by typing in a terminal

                            pytest tests/\n

                            When you implement a test you need to follow two standards for pytest to be able to find your tests. First, any test files created (except __init__.py) should always be named starting with test_, i.e. test_*.py. Secondly, any test implemented needs to be wrapped into a function whose name also starts with test_:

                            # this will be found and executed by pytest\ndef test_something():\n    ...\n\n# this will not be found and executed by pytest\ndef something_to_test():\n    ...\n
                            1. Start by creating a tests/__init__.py file and fill in the following:

                              import os\n_TEST_ROOT = os.path.dirname(__file__)  # root of test folder\n_PROJECT_ROOT = os.path.dirname(_TEST_ROOT)  # root of project\n_PATH_DATA = os.path.join(_PROJECT_ROOT, \"data\")  # root of data\n

                              these can help you refer to your data files during testing. For example, in another test file, I could write

                              from tests import _PATH_DATA\n

                              which then contains the root path to my data.

                            2. Data testing: In a file called tests/test_data.py implement at least a test that checks that data gets correctly loaded. By this, we mean that you should check

                              def test_data():\n    dataset = MNIST(...)\n    assert len(dataset) == N_train for training and N_test for test\n    assert that each datapoint has shape [1,28,28] or [784] depending on how you choose to format\n    assert that all labels are represented\n

                              where N_train should be either 30,000 or 50,000 depending on whether you are using just the first subset of the corrupted MNIST data or also including the second subset. N_test should be 5,000.

                              Solution
                              import torch\n\nfrom my_project.data import corrupt_mnist\n\ndef test_data():\n    train, test = corrupt_mnist()\n    assert len(train) == 30000\n    assert len(test) == 5000\n    for dataset in [train, test]:\n        for x, y in dataset:\n            assert x.shape == (1, 28, 28)\n            assert y in range(10)\n    train_targets = torch.unique(train.tensors[1])\n    assert (train_targets == torch.arange(0,10)).all()\n    test_targets = torch.unique(test.tensors[1])\n    assert (test_targets == torch.arange(0,10)).all()\n
                            3. Model testing: In a file called tests/test_model.py implement at least a test that checks for a given input with shape X that the output of the model has shape Y.

                              Solution
                              import torch\n\nfrom my_project.model import MyAwesomeModel\n\ndef test_model():\n    model = MyAwesomeModel()\n    x = torch.randn(1, 1, 28, 28)\n    y = model(x)\n    assert y.shape == (1, 10)\n
                            4. Training testing: In a file called tests/test_training.py implement at least one test that asserts something about your training script. You are here given free hands on what should be tested but try to test something that risks being broken when developing the code.
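                              If you are unsure where to start, a minimal sketch (assuming the MyAwesomeModel and module layout used in the solutions above) could check that a single optimization step actually updates the model parameters:

                              # tests/test_training.py (a sketch, adapt it to your own training code)\nimport torch\n\nfrom my_project.model import MyAwesomeModel\n\ndef test_training_step_updates_parameters():\n    model = MyAwesomeModel()\n    x, y = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))\n    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)\n    params_before = [p.clone() for p in model.parameters()]\n    loss = torch.nn.functional.cross_entropy(model(x), y)\n    loss.backward()\n    optimizer.step()\n    changed = any((p0 != p1).any() for p0, p1 in zip(params_before, model.parameters()))\n    assert changed, \"No parameters were updated by the optimizer step\"\n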

                            5. Good code raises errors and gives out warnings in appropriate places. This is often in the case of some invalid combination of input to your script. For example, your model could check for the size of the input given to it (see code below) to make sure it corresponds to what you are expecting. Not implementing such errors would still result in PyTorch failing at a later point due to shape errors, however, these custom errors will probably make more sense to the end user. Implement at least one raised error or warning somewhere in your code and use either pytest.raises or pytest.warns to check that they are correctly raised/warned. As inspiration, the following implements ValueError in code belonging to the model:

                              # src/models/model.py\ndef forward(self, x: Tensor):\n    if x.ndim != 4:\n        raise ValueError('Expected input to a 4D tensor')\n    if x.shape[1] != 1 or x.shape[2] != 28 or x.shape[3] != 28:\n        raise ValueError('Expected each sample to have shape [1, 28, 28]')\n
                              Solution

                              The above example would be captured by a test looking something like this:

                              # tests/test_model.py\nimport pytest\nimport torch\n\nfrom my_project.model import MyAwesomeModel\n\ndef test_error_on_wrong_shape():\n    model = MyAwesomeModel()\n    with pytest.raises(ValueError, match='Expected input to a 4D tensor'):\n        model(torch.randn(1,2,3))\n    with pytest.raises(ValueError, match='Expected each sample to have shape'):\n        model(torch.randn(1,1,28,29))\n
                            6. A test is only as good as the error message it gives, and by default, assert will only report that the check failed. However, we can help ourselves and others by adding strings after assert like

                              assert len(train_dataset) == N_train, \"Dataset did not have the correct number of samples\"\n

                              Add such comments to the assert statements you just did in the previous exercises.

                            7. The tests that involve checking anything that has to do with our data, will of course fail if the data is not present. To future-proof our code, we can take advantage of the pytest.mark.skipif decorator. Use this decorator to skip your data tests if the corresponding data files do not exist. It should look something like this

                              import os.path\n@pytest.mark.skipif(not os.path.exists(file_path), reason=\"Data files not found\")\ndef test_something_about_data():\n    ...\n

                              You can read more about skipping tests here
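                              Combining this with the _PATH_DATA variable from the tests/__init__.py file created earlier, a data test could be skipped like this (a sketch; adjust the path check to however your data is stored):

                              import os\n\nimport pytest\n\nfrom tests import _PATH_DATA\n\n@pytest.mark.skipif(not os.path.exists(_PATH_DATA), reason=\"Data files not found\")\ndef test_something_about_data():\n    ...\n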

                          5. After writing the different tests, make sure that they are passing locally.

                          6. We often want to check a function/module for various input arguments. In this case, you could write the same test over and over again for different inputs, but pytest also has built-in support for this with the use of the pytest.mark.parametrize decorator. Implement a parametrized test and make sure that it runs for different inputs.

                            Solution
                            @pytest.mark.parametrize(\"batch_size\", [32, 64])\ndef test_model(batch_size: int) -> None:\n    model = MyModel()\n    x = torch.randn(batch_size, 1, 28, 28)\n    y = model(x)\n    assert y.shape == (batch_size, 10)\n
                          7. There is no direct way of measuring how good the tests you have written are. However, what we can measure is the code coverage. Code coverage refers to the percentage of your codebase that gets run when all your tests are executed. Having a high coverage at least means that most of your code is actually executed when your tests run.

                            1. Install coverage

                              pip install coverage\n
                            2. Instead of running your tests directly with pytest, now do

                              coverage run -m pytest tests/\n
                            3. To get a simple coverage report simply type

                              coverage report\n

                              which will give you the coverage percentage for each of your files. You can also write

                              coverage report -m\n

                              to get the exact lines that were missed by your tests.

                            4. Finally, try to increase the coverage by writing a new test that runs some of the lines in your codebase that are not covered yet.

                            5. Often coverage reports code coverage for files that we do not want included in the report, for example your test files. Figure out how to configure coverage to exclude some files.

                              Solution

                              You need to set the omit option. This can either be done when running coverage run or coverage report such as:

                              coverage run --omit=\"tests/*\" -m pytest tests/\n# or\ncoverage report --omit=\"tests/*\"\n

                              As an alternative you can specify this in your pyproject.toml file:

                              [tool.coverage.run]\nomit = [\"tests/*\"]\n
                          "},{"location":"s5_continuous_integration/unittesting/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                          1. Assuming you have a code coverage of 100%, would you expect that no bugs are present in your code?

                            Solution

                            No, code coverage is not a guarantee that your code is bug-free. It is just a measure of how many lines of code are run when your tests are executed. Therefore, there may still be some corner case that is not covered by your tests and will result in a bug. However, having a high code coverage is a good indicator that you have tested your code.
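                            As a small illustration of this point, consider the hypothetical function below: the single test executes every line (100% coverage), yet the function still crashes on an empty list, a case the test never exercises.

                            def mean(values: list[float]) -> float:\n    return sum(values) / len(values)  # ZeroDivisionError for an empty list\n\ndef test_mean():\n    assert mean([1.0, 2.0, 3.0]) == 2.0  # covers every line, but misses the empty-list bug\n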

                          2. Consider the following code:

                            @pytest.mark.parametrize(\"network_size\", [10, 100, 1000])\n@pytest.mark.parametrize(\"device\", [\"cpu\", \"cuda\"])\nclass MyTestClass:\n    @pytest.mark.parametrize(\"network_type\", [\"alexnet\", \"squeezenet\", \"vgg\", \"resnet\"])\n    @pytest.mark.parametrize(\"precision\", [torch.half, torch.float, torch.double])\n    def test_network1(self, network_size, device, network_type, precision):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass(network_size, network_type).to(device=device, dtype=precision)\n        ...\n\n    @pytest.mark.parametrize(\"add_dropout\", [True, False])\n    def test_network2(self, network_size, device, add_dropout):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass2(network_size, add_dropout).to(device)\n        ...\n

                            how many tests are executed when running the above code?

                            Solution

                            The answer depends on whether or not we are running on a GPU-enabled machine. The test_network1 has 4 parameters, network_size, device, network_type, precision, that respectively can take on 3, 2, 4, 3 values meaning that in total that test will be running 3x2x4x3=72 times with different parameters on a GPU-enabled machine and 36 on a machine without a GPU. A similar calculation can be done for test_network2, which only has three factors network_size, device, add_dropout that result in 3x2x2=12 test on a GPU-enabled machine and 6 on a machine without a GPU. In total, that means 84 tests would run on a machine with a GPU and 42 on a machine without a GPU.
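                            If you want to double-check the arithmetic, you can enumerate the parameter combinations with itertools:

                            from itertools import product\n\nsizes, devices = [10, 100, 1000], [\"cpu\", \"cuda\"]\nn1 = len(list(product(sizes, devices, [\"alexnet\", \"squeezenet\", \"vgg\", \"resnet\"], [\"half\", \"float\", \"double\"])))\nn2 = len(list(product(sizes, devices, [True, False])))\nprint(n1, n2, n1 + n2)  # 72 12 84 parameter combinations in total\n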

                          That covers the basics of writing unit tests for Python code. We want to note that pytest is of course not the only framework for doing this. Python has a built-in framework called unittest for this as well (but pytest offers a few more features). Another open-source framework that you could choose to check out is hypothesis, which can help catch errors in corner cases of your code. In addition to writing unit tests, it is also highly recommended to test the code you include in the docstrings of your functions and modules, to make sure that any code in your documentation is also correct. For such testing, we can highly recommend using Python's built-in framework doctest.

                          "},{"location":"s6_the_cloud/","title":"Cloud computing","text":"

                          Slides

                          • Learn how to get started with Google Cloud Platform and how to interact with the SDK.

                            M20: Cloud Setup

                          • Learn how to use different GCP services to support your machine learning pipeline.

                            M21: Cloud Services

                          Running computations locally is often sufficient when only playing around with code in the initial phase of development. However, to scale your experiments you will need more computing power than what your standard laptop/desktop can offer. You probably already have experience with running on a local cluster or similar but today's topic is about utilizing cloud computing.

                          Image credit

                          There exist numerous cloud computing providers, with some of the biggest being:

                          • Azure
                          • AWS
                          • Google Cloud Platform (GCP)
                          • Alibaba Cloud

                          They all have slight advantages and disadvantages over each other. In this course, we are going to focus on Google Cloud Platform, because they have been kind enough to sponsor $50 of cloud credit for each student. If you happen to run out of credit, you can also get some free credit for a limited amount of time when you sign up with a new account. What's important to note is that these different cloud providers all offer essentially the same set of services, and learning how to use the services of one cloud provider in many cases translates to knowing how to use the same services at another cloud provider. The services are called something different and can have a somewhat different interface/interaction pattern, but in the end it does not matter much.

                          Today's exercises are about getting to know how to work with the cloud. If you are in doubt about anything or want to deep dive into some topics, I can recommend watching this series of videos or going through the general docs.

                          Learning objectives

                          The learning objectives of this session are:

                          • In general being familiar with how the Google Cloud SDK works
                          • Being able to start different compute instances and work with them
                          • Know how to do continuous integration workflows for the building of docker images
                          • Knowledge about how to store data and containers/artifacts in cloud buckets
                          • Being able to train simple deep-learning models using a combination of cloud services
                          "},{"location":"s6_the_cloud/cloud_setup/","title":"Cloud setup","text":"

                          Core Module

                          Google Cloud Platform (GCP) is the cloud service provided by Google. The key concept, or selling point, of any cloud provider, is the idea of near-infinite resources. Without the cloud, it simply is not feasible to do many modern deep learning and machine learning tasks because they cannot be scaled locally.

                          The image below shows all the different services that the Google Cloud platform offers. We are going to be working with around 10 of these services throughout the course. Therefore, if you get done with exercises early I highly recommend that you deep dive more into the Google cloud platform.

                          Image credit"},{"location":"s6_the_cloud/cloud_setup/#exercises","title":"\u2754 Exercises","text":"

                          As the first step, we are going to get you some Google Cloud credits.

                          1. Go to https://learn.inside.dtu.dk. Go to this course. Find the recent message where there should be a download link and instructions on how to claim the $50 cloud credit. Please do not share the link anywhere as there are a limited amount of coupons. If you are not officially taking this course at DTU, Google gives $300 cloud credits whenever you sign up with a new account. NOTE that you need to provide a credit card for this so make sure to closely monitor your credit use so you do not end up spending more than the free credit.

                          2. Log in to the homepage of GCP. It should look like this:

                          3. Go to billing and make sure that your account is showing $50 of cloud credit

                            Make sure to also check out the Reports tab throughout the course. When you start to use some of the cloud services, these tabs will update with info about how much time you can use before your cloud credit runs out. Make sure that you monitor this page as you will not be given another coupon.

                          4. One way to stay organized within GCP is to create projects.

                            Create a new project called dtumlops. When you click create you should get a notification that the project is being created. The notification bell is a good way to keep track of how the processes you are running are doing throughout the course.

                          5. Next, it is time for the local setup on your laptop. We are going to install gcloud, which is part of the Google Cloud SDK. gcloud is the command line interface for working with our Google Cloud account. Nearly everything that we can do through the web interface we can also do through the gcloud interface. Follow the installation instructions here for your specific OS.

                            1. After installation, try in a terminal to type:

                              gcloud -h\n

                              the command should show the help page. If not, something went wrong in the installation (you may need to restart after installing).

                            2. Now login by typing

                              gcloud auth login\n

                              you should be sent to a web page where you link your cloud account to the gcloud interface. Afterward, also run this command:

                              gcloud auth application-default login\n

                              If you at some point want to revoke the authentication you can type:

                              gcloud auth revoke\n
                            3. Next, you will need to set the project that we just created as the default project. In your web browser, under project info, you should be able to see the Project ID belonging to your dtumlops project. Copy this and type the following command in a terminal

                              gcloud config set project <project-id>\n

                              You can also get the project info by running

                              gcloud projects list\n
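                              To double-check that both the account and the project are configured correctly, you can also inspect the active gcloud configuration; a small sanity check that needs no placeholders:

                              gcloud config list\ngcloud auth list\n

                              The output should show your Google account as the active account and the dtumlops project as the active project.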
                            4. Next, install the Google Cloud Python API:

                              pip install --upgrade google-api-python-client\n

                              Make sure that the Python interface is also installed. In a Python terminal type

                              import googleapiclient\n

                              this should work without any errors.
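                              If you want to double-check which version of the client library was installed, one option (a minimal sketch) is to ask pip directly:

                              pip show google-api-python-client\n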

                            5. (Optional) If you are using VSCode you can also download the relevant extension called Cloud Code. After installing it you should see a small Cloud Code button in the action bar.

                          6. Finally, we need to activate a couple of developer APIs that are not activated by default. In a terminal write

                            gcloud services enable apigateway.googleapis.com\ngcloud services enable servicemanagement.googleapis.com\ngcloud services enable servicecontrol.googleapis.com\n

                            you can always check which services are enabled by typing

                            gcloud services list\n

                          After following these steps, your laptop should hopefully be set up for using GCP locally. You are now ready to use their services, both locally on your laptop and in the cloud console.

                          "},{"location":"s6_the_cloud/cloud_setup/#iam-and-quotas","title":"IAM and Quotas","text":"

                          A big part of using the cloud in a bigger organization has to do with admin and quotas. Admin here in general refers to the different roles that users of GCP can have, and quotas refer to the amount of resources that a given user has access to. For example, one employee, let's say a data scientist, may only be granted access to certain GCP services that have to do with the development and training of machine learning models, with X amounts of GPUs available to use, to make sure that the employee does not spend too much money. Another employee, a DevOps engineer, probably does not need access to the same services and not necessarily the same resources.

                          In this course, we are not going to focus too much on this aspect, but it is important to know that it exists. One feature you are going to need for the project is how to share a project with other people. This is done through the IAM (Identity and Access Management) page. Simply click the Grant Access button, search for the email of the person you want to share the project with and give them either Viewer, Editor or Owner access, depending on what you want them to be able to do. The figure below shows how to do this, and the same can also be done from the terminal as sketched below.
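                          A minimal sketch of granting a teammate access from the command line; the project id, email and role here are placeholders you should adapt:

                          gcloud projects add-iam-policy-binding <project-id> \\\n    --member="user:<teammate-email>" \\\n    --role="roles/editor"\n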

                          What we are going to go through right now is how to increase the quotas for how many GPUs you have available for your project. By default, for any free accounts in GCP (or accounts using teaching credits) the default quota for GPUs that you can use is either 0 or 1 (their policies sometimes change). We will in the exercises below try to increase it.

                          "},{"location":"s6_the_cloud/cloud_setup/#exercises_1","title":"\u2754 Exercises","text":"
                          1. Start by enabling the Compute Engine service. Simply search for it in the top search bar. It should bring you to a page where you can enable the service (may take some time). We are going to look more into this service in the next module.

                          2. Next go to the IAM & Admin page, again search for it in the top search bar. The remaining steps are illustrated in the figure below.

                            1. Go to the quotas page

                            2. In the search field search for GPUs (all regions) (needs to match exactly, the search field is case sensitive), such that you get the same quota as in the image.

                            3. In the Limit column, you can see your current quota for the number of GPUs you can use. Additionally, to the right of the limit, you can see the current usage. This is worth checking if you are ever in doubt whether a job is running on a GPU or not.

                            4. Click the quota and afterward the Edit quotas button.

                            5. In the pop-up window, increase your limit to either 1 or 2.

                            6. After sending your request you can try clicking the Increase requests tab to see the status of your request

                          If you ever run into errors that contain statements about quotas when working with GPUs, you can always go to this page to see what you are currently allowed to use and try to increase it (or check from the terminal as sketched below). For example, when you get to training machine learning models using Vertex AI in the next module, you will most likely need to ask for a quota increase for that service as well.
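                          If you prefer the terminal, quotas can also be inspected with gcloud; a sketch where the region and project id are placeholders:

                          gcloud compute regions describe europe-west1 --project=<project-id>\ngcloud compute project-info describe --project=<project-id>\n

                          The first command lists the per-region quotas (including the different GPU types) and the second the project-wide quotas.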

                          Finally, we want to note that a quota increase is sometimes not allowed within 24 hours of creating an account. If your request gets rejected, we recommend waiting a day and trying again. If this still does not work, you may need to use their services some more to convince Google that you are not a bot that wants to mine crypto.

                          "},{"location":"s6_the_cloud/cloud_setup/#service-accounts","title":"Service accounts","text":"

                          At some point, you will most likely need to use a service account. A service account is a virtual account that is used to interact with the Google Cloud API. It is intended for non-human users, e.g. other machines, services, etc. For example, if you want to launch a training job from GitHub Actions, you will need to use a service account for authentication between GitHub and GCP. You can read more about how to create a service account here.

                          "},{"location":"s6_the_cloud/cloud_setup/#exercises_2","title":"\u2754 Exercises","text":"
                          1. Go to the IAM & Admin page and click on Service accounts. Alternatively, you can search for it in the top search bar.

                          2. Click the Create Service Account button. On the next page, you can give the service account a name and an ID (automatically generated, but you can change it if you want). You can also give it a description. Leave the rest as default and click Create.

                          3. Next, let's give the service account some permissions. Click on the service account you just created. In the Permissions tab click Add permissions. Your job now is to give the service account the lowest possible permissions such that it can download files from a bucket. Look at this page and try to find the role that fits the description.

                            Solution

                            The role you are looking for is Storage Object Viewer. This role allows the service account to list objects in a bucket and download objects, but nothing more. Thus even if someone gets access to the service account they cannot delete objects in the bucket.
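                            To verify which roles the service account actually ended up with, one option is to dump the project's IAM policy and look for the service account's email among the bindings (the project id is a placeholder):

                            gcloud projects get-iam-policy <project-id>\n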

                          4. To use the service account later we need to create a key for it. Click on the service account and then the Keys tab. Click Add key and then Create new key. Choose the JSON key type and click Create. This will download a JSON file to your computer. This file is the key to the service account and should be kept secret. If you lose it you can always create a new one.

                          5. Finally, everything we just did from creating the service account, giving it permissions, and creating a key can also be done through the gcloud interface. Try to find the commands to do this in the documentation.

                            Solution

                            The commands you are looking for are:

                            gcloud iam service-accounts create my-sa \\\n    --description="My first service account" --display-name="my-sa"\ngcloud projects add-iam-policy-binding $(GCP_PROJECT_NAME) \\\n    --member="serviceAccount:my-sa@$(GCP_PROJECT_NAME).iam.gserviceaccount.com" \\\n    --role="roles/storage.objectViewer"\ngcloud iam service-accounts keys create service_account_key.json \\\n    --iam-account=my-sa@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\n

                            where $(GCP_PROJECT_NAME) is the ID of your project. If you then want to delete the service account again you can run

                            gcloud iam service-accounts delete my-sa@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\n
                          "},{"location":"s6_the_cloud/cloud_setup/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                          1. What considerations should you take into account when choosing a GCP region for running a new application?

                            Solution

                            A series of factors may influence your choice of region, including:

                            • Service availability: not all services are available in all regions
                            • Resource availability: some regions have more GPUs available than others
                            • Reduced latency: if your application is running in the same region as your users, the latency will be lower
                            • Compliance: some countries and regions have strict rules about where user data can be stored, e.g. the EU's GDPR places restrictions on storing and transferring user data outside the EU
                            • Pricing: some regions may have different pricing than others
                          2. The 3 major cloud providers all have the same services, but they are called something different depending on the provider. What are the corresponding names of these GCP services in AWS and Azure?

                            • Compute Engine
                            • Cloud storage
                            • Cloud functions
                            • Cloud run
                            • Cloud build
                            • Vertex AI

                            It is important to know these correspondences to be able to navigate blog posts etc. about MLOps on the internet.

                            Solution

                            GCP             | AWS                         | Azure
                            ----------------|-----------------------------|-------------------------------------
                            Compute Engine  | Elastic Compute Cloud (EC2) | Virtual Machines
                            Cloud storage   | Simple Storage Service (S3) | Blob Storage
                            Cloud functions | Lambda Functions            | Serverless Compute
                            Cloud run       | App Runner, Fargate, Lambda | Container Apps, Container Instances
                            Cloud build     | CodeBuild                   | DevOps
                            Vertex AI       | SageMaker                   | AI Platform
                          3. Why is it always important to assign the lowest possible permissions to a service account?

                            Solution

                            The reason is that if someone gets access to the service account, they can only do what the service account is allowed to do. If the service account has permission to delete objects in a bucket, an attacker can delete all the objects in the bucket. For this reason, in most cases multiple service accounts are used, each with different permissions. This setup follows the principle of least privilege.

                          "},{"location":"s6_the_cloud/using_the_cloud/","title":"M21 - Using the Cloud","text":""},{"location":"s6_the_cloud/using_the_cloud/#using-the-cloud","title":"Using the cloud","text":"

                          Core Module

                          In this set of exercises, we are going to get more familiar with using some of the resources that GCP offers.

                          "},{"location":"s6_the_cloud/using_the_cloud/#compute","title":"Compute","text":"

                          The most basic service of any cloud provider is the ability to create and run virtual machines. In GCP this service is called the Compute Engine API. A virtual machine allows you to essentially run an operating system that behaves like a completely separate computer. There are many reasons why one would want to use virtual machines:

                          • Virtual machines allow you to scale your operations, essentially giving you access to infinitely many individual computers

                          • Virtual machines allow you to use large-scale hardware. For example, if you are developing a deep learning model on your laptop and want to know the inference time for a specific hardware configuration, you can just create a virtual machine with those specs and run your model.

                          • Virtual machines allow you to run processes in the \"background\". If you want to train a model for a week or more, you do not want to do this on your laptop as you cannot move it or do anything with it while it is training. Virtual machines allow you to just launch a job and forget about it (at least until you run out of credit).

                          "},{"location":"s6_the_cloud/using_the_cloud/#exercises","title":"\u2754 Exercises","text":"

                          We are now going to start using the cloud.

                          1. Click on the Compute Engine tab in the sidebar on the homepage of GCP.

                          2. Click the Create Instance button. You will see the following image below.

                            Give the virtual machine a meaningful name, and set the location to one that is close to where you are (to reduce latency, we recommend europe-west1). Finally, try to adjust the configuration a bit. Can you find at least two settings that alter the price of the virtual machine?

                            Solution

                            In general, the price of a virtual machine is determined by the class of hardware attached to it. Higher-class CPUs and GPUs mean higher prices. Additionally, the amount of memory and disk space also affects the price. Finally, the location of the virtual machine also affects the price.

                          3. After figuring this out, create an e2-medium instance (leave the rest configured as default). Before clicking the Create button, make sure to check the Equivalent code button. You should see a very long command that you could have typed in the terminal to create a VM similar to the one you configured through the UI.

                          4. After creating the virtual machine, in a local terminal type:

                            gcloud compute instances list\n

                            you should hopefully see the instance you have just created.

                          5. You can start a terminal directly by typing:

                            gcloud compute ssh --zone <zone> <name> --project <project-id>\n

                            You can always see the exact command that you need to run to ssh to a VM by selecting the View gcloud command option in the Compute Engine overview (see image below).

                          6. While logged into the instance, check if Python and PyTorch are installed. You should see that neither is installed. For the VM we created, we only specified what compute resources it should have, not what software should be on it. We can fix this by starting VMs based on specific docker images (it's all coming together).

                            1. GCP comes with several ready-to-go images for doing deep learning. More info can be found here. Try running this line:

                              gcloud container images list --repository=\"gcr.io/deeplearning-platform-release\"\n

                              what does the output show?

                              Solution

                              The output should show a list of images that are available for you to use. The images are essentially docker images that contain a specific software stack. The software stack is often a specific version of Python, PyTorch, TensorFlow, etc. The images are maintained by Google and are updated regularly.

                            2. Next, start (in the terminal) a new instance using a PyTorch image. The command for doing it should look something like this:

                              gcloud compute instances create <instance_name> \\\n    --zone=<zone> \\\n    --image-family=<image-family> \\\n    --image-project=deeplearning-platform-release \\\n    --accelerator="type=nvidia-tesla-k80,count=1" \\\n    --maintenance-policy TERMINATE \\\n    --metadata="install-nvidia-driver=True"\n    # the last three arguments are only needed if you want to run on GPU and have the quota to do so\n

                              You can find more info here on what <image-family> should have as value and what extra argument you need to add if you want to run on GPU (if you have access).

                              Solution

                              The command should look something like this:

                              CPUGPU
                              gcloud compute instances create my-instance \\\n    --zone=europe-west1-b \\\n    --image-family=pytorch-latest-cpu \\\n    --image-project=deeplearning-platform-release\n
                              gcloud compute instances create my-instance \\\n    --zone=europe-west1-b \\\n    --image-family=pytorch-latest-gpu \\\n    --image-project=deeplearning-platform-release \\\n    --accelerator="type=nvidia-tesla-k80,count=1" \\\n    --maintenance-policy TERMINATE\n
                            3. ssh into the VM as in one of the previous exercises. Confirm that the VM indeed contains both a Python installation and that PyTorch is installed. Hint: you also have the possibility through the web page to start a browser session directly to the VMs you create:

                          7. Everything that you have done locally can also be achieved through the web terminal, which of course comes pre-installed with the gcloud command etc.

                            Try out launching this and run some of the commands from the previous exercises.

                          8. Finally, we want to make sure that we do not forget to stop our VMs. VMs are charged by the minute, so even if you are not using them you are still paying for them. Therefore, you must remember to stop your VMs when you are not using them. You can do this by either clicking the Stop button on the VM overview page or by running the following command:

                            gcloud compute instances stop <instance-name>\n
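                            If you know that you will not need a VM again, you can also delete it entirely, which by default also removes its boot disk (a stopped VM still incurs disk storage costs); the name and zone are placeholders:

                            gcloud compute instances delete <instance-name> --zone=<zone>\n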
                          "},{"location":"s6_the_cloud/using_the_cloud/#data-storage","title":"Data storage","text":"

                          Another big part of cloud computing is the storage of data. There are many reasons that you want to store your data in the cloud including:

                          • Easily being able to share
                          • Easily expand as you need more
                          • Data is stored in multiple locations, making sure that it is not lost in case of an emergency

                          Cloud storage is luckily also very cheap. Google Cloud charges around $0.026 per GB per month. This means that around 1 TB of data costs you about $26 per month, which is more than what the same amount of storage costs on Google Drive, but the storage in Google Cloud is much more focused on enterprise usage, such that you can access the data through code.

                          "},{"location":"s6_the_cloud/using_the_cloud/#exercises_1","title":"\u2754 Exercises","text":"

                          When we did the exercise on data version control, we made dvc work together with our own Google Drive to store data. However, a big limitation of this is that we need to authenticate each time we try to either push or pull the data. The fix is to use a proper storage API instead, which GCP offers through its Cloud Storage service.

                          We are going to follow the instructions from this page

                          1. Let's start by creating a data storage. On the GCP start page, in the sidebar, click on the Cloud Storage. On the next page click the Create bucket:

                            Give the bucket a unique name, set it to a region close by and, importantly, remember to enable Object versioning under the last tab. Finally, click Create.

                          2. After creating the bucket, you should be able to see it online, and it should also show up if you type the following in your local terminal:

                            gsutil ls\n

                            gsutil is a command line tool that allows you to create, upload, download, list, move, rename and delete objects in the cloud storage. For example, you can upload a file to the cloud storage by running:

                            gsutil cp <file> gs://<bucket-name>\n
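                            Listing the contents of a bucket and downloading a file again works in much the same way (the bucket and file names are placeholders):

                            gsutil ls gs://<bucket-name>\ngsutil cp gs://<bucket-name>/<file> <local-destination>\n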
                          3. Next, we need the Google storage extension for dvc

                            pip install dvc-gs\n
                          4. Now in your corrupt MNIST repository where you have already configured dvc, we are going to change the storage from our Google Drive to our newly created Google Cloud storage.

                            dvc remote add -d remote_storage <output-from-gsutil>\n

                            In addition, we are also going to modify the remote to support object versioning (called version_aware in dvc):

                            dvc remote modify remote_storage version_aware true\n

                            This will change the default way that dvc handles data. Instead of just storing the data as content-addressable storage, it will now store the data as it looks in our local repository, which means that tools other than dvc can also read the data directly from the bucket.

                          5. The above command will change the .dvc/config file. git add and git commit the changes to that file. Finally, push data to the cloud

                            dvc push --no-run-cache  # (1)!\n
                            1. The --no-run-cache flag is used to avoid pushing the cache file to the cloud, which is not supported by the Google Cloud storage.
                          6. Finally, make sure that you can pull without having to give your credentials. The easiest way to see this is to delete the .dvc/cache folder that should be locally on your laptop and afterward do a

                            dvc pull --no-run-cache\n

                          This setup should work when trying to access the data from your laptop, which we authenticated in the previous module. However, how can you access the data from a virtual machine, inside a docker container or from a different laptop? We in general recommend two ways:

                          • You can make the bucket publicly accessible, i.e. no authentication is needed. That means that anyone with the URL to the data can access it. This is the easiest way to do it, but also the least secure. You can read more about how to make your buckets public here.

                          • You can use the service account that you created in the previous module to authenticate the VM. This is the most secure way to do it, but also the most complicated. You first need to give the service account the correct permissions. Then you need to authenticate using the service account. In dvc this is done by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the service account key file (the JSON key you downloaded when creating the key):

                            Linux/MacOSWindows
                            export GOOGLE_APPLICATION_CREDENTIALS=\"/path/to/your/credentials.json\"\n
                            set GOOGLE_APPLICATION_CREDENTIALS=\"C:\\path\\to\\your\\credentials.json\"\n
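                          If you need the same authentication inside a docker container, one possible sketch is to mount the key file into the container and set the environment variable when starting it (the image name and the key path are placeholders you should adapt):

                          docker run --rm \\\n    -v $(pwd)/service_account_key.json:/key.json \\\n    -e GOOGLE_APPLICATION_CREDENTIALS=/key.json \\\n    <image-name>\n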
                          "},{"location":"s6_the_cloud/using_the_cloud/#artifact-registry","title":"Artifact registry","text":"

                          You should hopefully at this point have seen the strength of using containers to create reproducible environments. They allow us to specify exactly the software that we want to run inside our VMs. However, you should already have run into two problems with containers

                          • The building process can take a lot of time
                          • Docker images can be large

                          For this reason, we want to move both the building process and the storage of images to the cloud. In GCP the two services that we are going to use for this are called Cloud Build for building the containers in the cloud and Artifact registry for storing the images afterward.

                          "},{"location":"s6_the_cloud/using_the_cloud/#exercises_2","title":"\u2754 Exercises","text":"

                          In these exercises, I recommend that you start with a dummy version of some code to make sure that the building process does not take too long. Below is a simple Python script that does image classification using Sklearn, together with the corresponding requirements.txt file and Dockerfile.

                          Python script main.py
                          from sklearn import datasets, metrics, svm\nfrom sklearn.model_selection import train_test_split\n\nif __name__ == \"__main__\":\n    digits = datasets.load_digits()\n\n    # flatten the images\n    n_samples = len(digits.images)\n    data = digits.images.reshape((n_samples, -1))\n\n    # Create a classifier: a support vector classifier\n    clf = svm.SVC(gamma=0.001)\n\n    # Split data into 50% train and 50% test subsets\n    X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)\n\n    # Learn the digits on the train subset\n    clf.fit(X_train, y_train)\n\n    # Predict the value of the digit on the test subset\n    predicted = clf.predict(X_test)\n\n    print(f\"Classification report for classifier {clf}:\\n{metrics.classification_report(y_test, predicted)}\\n\")\n
                          requirements.txt requirements.txt
                          scikit-learn>=1.0\n
                          Dockerfile Dockerfile
                          FROM python:3.11-slim\n\n# install build tools (gcc etc.) needed by some pip packages\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nCOPY requirements.txt requirements.txt\nCOPY main.py main.py\nWORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\n\nENTRYPOINT ["python", "-u", "main.py"]\n

                          The docker images for this application are therefore going to be substantially faster to build and smaller in size than the images we are used to that use PyTorch.

                          1. Start by enabling the services Google Artifact Registry API and Google Cloud Build API. This can be done through the website (by searching for the services) or from the terminal:

                            gcloud services enable artifactregistry.googleapis.com\ngcloud services enable cloudbuild.googleapis.com\n
                          2. The first step is creating an artifact repository in the cloud. You can either do this through the UI or using gcloud in the command line.

                            UICommand line

                            Find the Artifact Registry service (search for it in the search bar) and click on it. From there click on the Create repository button. You should see the following page:

                            Give the repository a name, make sure to set the format to Docker and specify the region. At the bottom of the page you can optionally add a cleanup policy. We recommend that you add one to keep costs down: give the policy a name, choose the Keep most recent versions option and set the keep count to 5. Click Create and you should now see the repository in the list of repositories.

                            gcloud artifacts repositories create <registry-name> \\\n    --repository-format=docker \\\n    --location=europe-west1 \\\n    --description=\"My docker registry\"\n

                            where you need to replace <registry-name> with a name of your choice. You can read more about the command here. We recommend that after creating the repository you update it with a cleanup policy to keep costs down. You can do this by running:

                            gcloud artifacts repositories set-cleanup-policies <registry-name> \\\n    --project=<project-id> \\\n    --location=<region> \\\n    --policy=policy.yaml\n

                            where the policy.yaml file should look something like this:

                            [\n    {\n        \"name\": \"keep-minimum-versions\",\n        \"action\": {\"type\": \"Keep\"},\n        \"mostRecentVersions\": {\n            \"keepCount\": 5\n        }\n    }\n]\n
                            and you can read more about the command here.

                            Whenever we in the future want to push or pull to this artifact repository we can refer to it using this URL:

                            <region>-docker.pkg.dev/<project-id>/<registry-name>\n

                            for example, europe-west1-docker.pkg.dev/dtumlops-335110/container-registry would be a valid URL (this is the one I created).
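                            As a quick check, you can also list your repositories and, once you have pushed something, the images inside one of them from the terminal (region, project id and registry name are placeholders):

                            gcloud artifacts repositories list --location=<region>\ngcloud artifacts docker images list <region>-docker.pkg.dev/<project-id>/<registry-name>\n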

                          3. We are now ready to build our containers in the cloud. In principle, GCP Cloud Build works out of the box with docker files. However, the recommended way is to add specialized cloudbuild.yaml files. You can think of the cloudbuild.yaml file as GCP's equivalent of the workflow files in GitHub Actions, which you learned about in module M16. It is essentially a file that specifies a list of steps to execute to do something, but the syntax is different.

                            Look at the documentation on how to write a cloudbuild.yaml file for building and pushing a docker image to the artifact registry. Try to implement such a file in your repository.

                            Solution

                            For building docker images the syntax is as follows:

                            steps:\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Build container image'\n  args: [\n    'build',\n    '.',\n    '-t',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n    '-f',\n    '<path-to-dockerfile>'\n  ]\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Push container image'\n  args: [\n    'push',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n  ]\n

                            where you need to replace <registry-name>, <image-name> and <path-to-dockerfile> with your own values. You can hopefully recognize the syntax from the docker exercises. In this example, we are calling the cloud-builders/docker service with both the build and push arguments.

                          4. You can now try to trigger the cloudbuild.yaml file from your local machine. What gcloud command would you use to do this?

                            Solution

                            You can trigger a build by running the following command:

                            gcloud builds submit --config=cloudbuild.yaml .\n

                            This command will submit a build to the cloud build service using the configuration file cloudbuild.yaml in the current directory.
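                            If you prefer to follow builds from the terminal instead of the web console, these commands may be helpful (the build id can be copied from the output of the list command):

                            gcloud builds list --limit=5\ngcloud builds log <build-id>\n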

                          5. Instead of relying on manually submitting builds, we can setup the building process as continuous integration such that it is triggered every time we push code to the repository. This is done by setting up a trigger in the GCP console. From the GCP homepage, navigate to the triggers panel:

                            Click on Manage repositories.

                            1. From there, click Connect Repository and go through the steps of authenticating your GitHub profile with GCP and choosing the repository that you want to set up build triggers for. For now, skip the Create a trigger (optional) part by pressing Done at the end.

                            2. Navigate back to the Triggers homepage and click Create trigger. Set the following:

                              • Give a name
                              • Event: choose Push to branch
                              • Source: choose the repository you just connected
                              • Branch: choose ^main$
                              • Configuration: choose either Autodetected or Cloud build configuration file

                              Finally, click the Create button and the trigger should show up on the triggers page.

                            3. To activate the trigger, push some code to the chosen repository.

                            4. Go to the Cloud Build page and you should see the image being built and pushed.

                              Try clicking on the build to check out the build process and build summary. As you can see from the image, if a build is failing you will often find valuable info by looking at the build summary. If your build is failing try to configure it to run in one of these regions: us-central1, us-west2, europe-west1, asia-east1, australia-southeast1, southamerica-east1 as specified in the documentation.

                            5. If/when your build is successful, navigate to the Artifact Registry page. You should hopefully find that the image you just built was pushed here. Congrats!

                          6. Make sure that you can pull your image down to your laptop

                            docker pull <region>-docker.pkg.dev/<project-id>/<registry-name>/<image-name>:<image-tag>\n

                            you will need to authenticate docker with GCP first. Instructions can be found here, but the following command should hopefully be enough to make docker and GCP talk to each other:

                            gcloud auth configure-docker <region>-docker.pkg.dev\n

                            where you need to replace <region> with the region you are using. Do note that you need to have docker actively running in the background, as at any other time you want to use docker.

                          7. Automation through the cloud is in general the way to go, but sometimes you may want to manually create images and push them to the registry. Figure out how to push an image to your Artifact Registry. For simplicity, you can just push the busybox image you downloaded during the initial docker exercises. This page should help you with the exercise.

                            Solution

                            Pushing to a repository is similar to pulling. Assuming that you have already built an image called busybox you can push it to the repository by running:

                            docker tag busybox <region>-docker.pkg.dev/<project-id>/<registry-name>/busybox:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<registry-name>/busybox:latest\n

                            where you need to replace <region>, <project-id> and <registry-name> with your own values.

                          8. (Optional) Instead of using the built-in trigger in GCP, another way to activate builds on code changes is to integrate with GitHub Actions. This has the benefit that we can make the build process depend on other steps in the pipeline. For example, in the image below we have conditioned the build to only run if tests are passing on all operating systems. Let's try to implement this.

                            1. Start by adding a new secret to Github with the name GCLOUD_SERVICE_KEY and the value of the service account key that you created in the previous module. This is needed to authenticate the Github action with GCP.

                            2. We assume that you already have a workflow file that runs some unit tests:

                              name: Unit tests & build\n\non:\n  push:\n    branches: [main]\n  pull_request:\n    branches: [main]\n\njobs:\n  test:\n

                              we now want to add a job that triggers the build process in GCP. How can you make the build job depend on the test job? Hint: Relevant documentation.

                              Solution

                              You can make the build job depend on the test job by adding the needs keyword to the build job:

                              name: Unit tests & build\n\non:\n  push:\n    branches: [main]\n  pull_request:\n    branches: [main]\n\njobs:\n  test:\n    ...\n  build:\n    needs: test\n    ...\n
                            3. Additionally, we probably only want to build the image if the job is running on our main branch e.g. not part of a pull request. How can you make the build job only run on the main branch?

                              Solution
                              name: Unit tests & build\n\non:\n  push:\n    branches: [main]\n  pull_request:\n    branches: [main]\n\njobs:\n  test:\n    ...\n  build:\n    needs: test\n    if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}\n    ...\n
                            4. Finally, we need to add the steps to submit the build job to GCP. You need four steps:

                              • Checkout the code
                              • Authenticate with GCP
                              • Setup gcloud
                              • Submit the build

                              How can you do this? Hint: For the first two steps these two Github actions can be useful: auth and setup-gcloud.

                              Solution
                              name: Unit tests & build\n\non:\n  push:\n    branches: [main]\n  pull_request:\n    branches: [main]\n\njobs:\n  test:\n    ...\n  build:\n    needs: test\n    if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}\n    runs-on: ubuntu-latest\n    steps:\n    - name: Checkout code\n      uses: actions/checkout@v4\n\n    - name: Auth with GCP\n      uses: google-github-actions/auth@v2\n      with:\n        credentials_json: ${{ secrets.GCLOUD_SERVICE_KEY }}\n\n    - name: Set up Cloud SDK\n      uses: google-github-actions/setup-gcloud@v2\n\n    - name: Submit build\n      run: gcloud builds submit --config cloudbuild_containers.yaml\n
                          9. (Optional) The cloudbuild specification format allows you to specify so-called substitutions. A substitution is simply a way to replace a variable in the cloudbuild.yaml file with a value that is known only at runtime. This can be useful for using the same cloudbuild.yaml file for multiple builds. Try to implement a substitution in your docker cloud build file such that the image name is a variable.

                            Build in substitutions

                            You have probably already encountered substitutions like $PROJECT_ID in the cloudbuild.yaml file. These are substitutions that are automatically replaced by GCP. Other commonly used ones are $BUILD_ID, $PROJECT_NUMBER and $LOCATION. You can find a full list of built-in substitutions here

                            Solution

                            We just need to add the substitutions field to the cloudbuild.yaml file. For example, if we want to replace the image name with a variable called _IMAGE_NAME we can do the following:

                            steps:\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Build container image'\n  args: [\n    'build',\n    '.',\n    '-t',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/$_IMAGE_NAME',\n    '-f',\n    '<path-to-dockerfile>'\n  ]\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Push container image'\n  args: [\n    'push',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/$_IMAGE_NAME'\n  ]\nsubstitutions:\n  _IMAGE_NAME: 'my_image'\n

                            Do note that user substitutions are prefixed with an underscore _ to distinguish them from built-in ones. You can read more here

                            1. How would you provide the value for the _IMAGE_NAME variable to the gcloud builds submit command?

                              Solution

                              You can provide the value for the _IMAGE_NAME variable by adding the --substitutions flag to the gcloud builds submit command:

                              gcloud builds submit --config=cloudbuild.yaml --substitutions=_IMAGE_NAME=my_image\n

                              If you want to provide more than one substitution you can do so by separating them with a comma.

                          "},{"location":"s6_the_cloud/using_the_cloud/#training","title":"Training","text":"

                          As the final step in our journey through different GCP services in this module, we are going to look at the training of our models. This is one of the important tasks that GCP can help us with because we can always rent more hardware as long as we have credits, meaning that we can both scale horizontally (run more experiments) and vertically (run longer experiments).

                          We are going to check out two ways of running our experiments. First, we are going to return to the Compute Engine service because it gives the simplest form of scaling of experiments. That is: we create a VM with an appropriate docker image, start it, log into the VM and run our experiments. Most people can run a couple of experiments in parallel this way. However, what if there was an abstraction layer that automatically created the VM for us, launched our experiments and then closed the VM afterward?

                          This is where the Vertex AI service comes into play. This is a dedicated service for handling ML models in GCP. Vertex AI is in principle an end-to-end service that can take care of everything machine-learning related in the cloud. In this course, we are primarily focused on just the training of our models, and then use other services for the different parts of our pipeline.

                          "},{"location":"s6_the_cloud/using_the_cloud/#exercises_3","title":"\u2754 Exercises","text":"
                          1. Let's start by going through how we could train a model using PyTorch using the Compute Engine service:

                            1. Start by creating an appropriate VM. If you want to start a VM that has PyTorch pre-installed with only CPU support you can run the following command

                              gcloud compute instances create <instance-name> \\\n    --zone europe-west1-b \\\n    --image-family=pytorch-latest-cpu \\\n    --image-project=deeplearning-platform-release\n

                              alternatively, if you have access to GPU in your GCP account you could start a VM in the following way

                              gcloud compute instances create <instance-name> \\\n    --zone europe-west4-a \\\n    --image-family=pytorch-latest-gpu \\\n    --image-project=deeplearning-platform-release \\\n    --accelerator=\"type=nvidia-tesla-v100,count=1\" \\\n    --metadata=\"install-nvidia-driver=True\" \\\n    --maintenance-policy TERMINATE\n
                            2. Next, log into your newly created VM. You can either open an ssh terminal in the cloud console or run the following command

                              gcloud beta compute ssh <instance-name>\n
                            3. It is recommended to always check that the VM we get is actually what we asked for. In this case, the VM should have PyTorch pre-installed so let's check for that by running

                              python -c \"import torch; print(torch.__version__)\"\n

                              Additionally, if you have a VM with GPU support also try running the nvidia-smi command.

                            4. When you have logged in to the VM, it works like your own machine. Therefore, to run some training code you need to go through the same setup steps as on your own machine: clone your GitHub repository, install dependencies, download data, and run the code. Try doing this to make sure you can train a model.

                          2. The above exercises should hopefully have convinced you that it can be hard to scale experiments using the Compute Engine service. The reason is that you need to manually start, set up and stop a separate VM for each experiment. Instead, let's try to use the Vertex AI service to train our models.

                            1. Start by enabling it, either by searching for Vertex AI in the cloud console and going to the service, or by running the following command:

                              gcloud services enable aiplatform.googleapis.com\n
                            2. The way we are going to use Vertex AI is to create custom jobs, because we have already developed docker containers that contain everything needed to run our code. Thus, the only command that we need is the gcloud ai custom-jobs create command. An example would be:

                              gcloud ai custom-jobs create \\\n    --region=europe-west1 \\\n    --display-name=test-run \\\n    --config=config.yaml \\\n    --command 'python src/my_project/train.py' \\\n    --args '["--epochs", "10"]'\n    # the --command and --args flags are passed to the container and are only needed if you want to change the defaults\n

                              Essentially, this command combines everything into one command: it first creates a VM with the specs specified by a configuration file, then loads a container specified again in the configuration file and finally it runs everything. An example of a config file could be:

                              CPUGPU
                              # config_cpu.yaml\nworkerPoolSpecs:\n    machineSpec:\n        machineType: n1-highmem-2\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n
                              # config_gpu.yaml\nworkerPoolSpecs:\n    machineSpec:\n        machineType: n1-standard-8\n        acceleratorType: NVIDIA_TESLA_T4 #(1)!\n        acceleratorCount: 1\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n
                              1. In this case we are requesting a Nvidia Tesla T4 GPU. This will only work if you have a quota for allocating this type of GPU in the Vertex AI service. You can check how to request quota in the last exercise of the previous module. Remember that it is not enough to just request a quota for the GPU, the request needs to be approved by Google before you can use it.

                              you can read more about the configuration formatting here and the different types of machines here. Try to execute a job using the gcloud ai custom-jobs create command. For additional documentation, you can look at the reference for the command, this page and this page

                            3. Assuming you manage to launch a job, you should see an output like this:

                              Try executing the commands that are outputted to look at both the status and the progress of your job.

                            4. In addition, you can also visit the Custom Jobs tab in training part of Vertex AI

                              You will need to select the specific region that you submitted your job to see the job.

                            5. During custom training, we do not necessarily need to use dvc for downloading our data. A more efficient way is to use cloud storage as a mounted file system. This allows us to access data directly from the cloud storage without having to download it first. All our training jobs automatically have a gcs folder mounted in the root directory. Try to access the data from your training script:

                              # loading from a bucket using mounted file system\ndata = torch.load('/gcs/<my-bucket-name>/data.pt')\n# writing to a bucket using mounted file system\ntorch.save(data, '/gcs/<my-bucket-name>/data.pt')\n

                              This should speed up the training process a bit.

                            6. Your code may depend on environment variables for authentication, for example with Weights and Biases during training. These can also be specified in the configuration file. How would you do this?

                              Solution

                              You can specify environment variables in the configuration file by adding the env field to the containerSpec field. For example, if you want to specify the WANDB_API_KEY you can do it like this:

                              workerPoolSpecs:\n    machineSpec:\n        machineType: n1-highmem-2\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n        env:\n        - name: WANDB_API_KEY\n          value: <your-wandb-api-key>\n

                              You need to replace <your-wandb-api-key> with your actual key. Also, remember that this file now contains a secret and should be treated as such.

                            7. Try to execute multiple jobs with different configurations e.g. change the --args field in the gcloud ai custom-jobs create command at the same time. This should hopefully show you how easy it is to scale experiments using the Vertex AI service.
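                            To keep track of the jobs you have submitted from the terminal, the following commands can be useful (the job id is printed when you create the job and is a placeholder here):

                            gcloud ai custom-jobs list --region=europe-west1\ngcloud ai custom-jobs stream-logs <job-id> --region=europe-west1\n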

                          "},{"location":"s6_the_cloud/using_the_cloud/#secrets-management","title":"Secrets management","text":"

                          Similar to GitHub Actions, GCP also has a secrets store that can be used to keep secrets safe. This is called the Secret Manager in GCP. By using the Secret Manager, we get the option to inject secrets into our code without having to store them in the code itself.

                          "},{"location":"s6_the_cloud/using_the_cloud/#exercises_4","title":"\u2754 Exercises","text":"
                          1. Let's look at the example from before where we have a config file like this for custom Vertex AI jobs:

                            workerPoolSpecs:\n    machineSpec:\n        machineType: n1-highmem-2\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n        env:\n        - name: WANDB_API_KEY\n          value: $WANDB_API_KEY\n

                            we do not want to store the WANDB_API_KEY in the config file, rather we would like to store it in the Secret Manager and inject it right before the job starts. Let's figure out how to do that.

                            1. Start by enabling the secrets manager API by running the following command:

                              gcloud services enable secretmanager.googleapis.com\n
                            2. Next, go to the secrets manager in the cloud console and create a new secret. You just need to give it a name, a value and leave the rest as default. Add one or more secrets like the image below.

                            3. We are going to inject the secrets into our training job by using cloudbuild. Create new cloudbuild file called vertex_ai_train.yaml and add the following content:

                              vertex_ai_train.yaml
                              steps:\n- name: \"alpine\"\n  id: \"Replace values in the training config\"\n  entrypoint: \"sh\"\n  args:\n    - '-c'\n    - |\n      apk add --no-cache gettext\n      envsubst < config.yaml > config.yaml.tmp\n      mv config.yaml.tmp config.yaml\n  secretEnv: ['WANDB_API_KEY']\n\n- name: 'alpine'\n  id: \"Show config\"\n  waitFor: ['Replace values in the training config']\n  entrypoint: \"sh\"\n  args:\n    - '-c'\n    - |\n    cat config.yaml\n\n- name: 'gcr.io/cloud-builders/gcloud'\n  id: 'Train on vertex AI'\n  waitFor: ['Replace values in the training config']\n  args: [\n    'ai',\n    'custom-jobs',\n    'create',\n    '--region',\n    'europe-west1',\n    '--display-name',\n    'example-mlops-job',\n    '--config',\n    '${_VERTEX_TRAIN_CONFIG}',\n  ]\navailableSecrets:\n  secretManager:\n  - versionName: projects/$PROJECT_ID/secrets/WANDB_API_KEY/versions/latest\n    env: 'WANDB_API_KEY'\n

                              Slowly go through the file and try to understand what each step does.

                              Solution

                              There are two parts to using secrets in cloud build. First, there is the availableSecrets field that specifies what secrets from the Secret Manager should be injected into the build. In this case, we are injecting the WANDB_API_KEY and setting it as an environment variable. The second part is the secretEnv field in the first step. This field specifies which secrets should be available in the first step. The steps are then doing:

                              1. The first step calls the envsubst command, which is a general Linux command that replaces environment variables in a file. In this case, it replaces $WANDB_API_KEY with the actual value of the secret. We then save the file as config.yaml.tmp and rename it back to config.yaml.

                              2. The second step is just to show that the replacement was successful. This is mostly for debugging purposes and can be removed.

                              3. The third step is the actual training job. It waits for the first step to finish before running.

                            4. Finally, try to trigger the build:

                              gcloud builds submit --config=vertex_ai_train.yaml\n

                              and check that the WANDB_API_KEY is correctly injected into the config.yaml file.
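                              If the value does not look right, you can read the secret back from the Secret Manager as a sanity check. Be careful: this prints the secret in plain text in your terminal:

                              gcloud secrets versions access latest --secret=WANDB_API_KEY\n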

                          "},{"location":"s6_the_cloud/using_the_cloud/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                          1. In Compute Engine, we have the option to either stop or suspend the VMs, can you describe what the difference is?

                            Solution

                            Suspended instances preserve the guest OS memory, device state, and application state. You will not be charged for a suspended VM but will be charged for the storage of the aforementioned states. Stopped instances do not preserve any of these states, and you will be charged for the storage of the disk. However, in both cases, if the VM instances have resources attached to them, such as static IPs and persistent disks, those resources are charged until they are deleted.

                          2. As seen in the exercises, a cloudbuild.yaml file often contains multiple steps. How would you make steps dependent on each other e.g. one step can only run if another step has finished? And how would you make steps execute concurrently?

                            Solution

                            In both cases, the solution is the waitFor field. If you want a step to wait for another step to finish, you need to give the first step an id and then specify that id in the waitFor field of the second step.

                            steps:\n- name: 'alpine'\n  id: 'step1'\n  entrypoint: 'sh'\n  args: ['-c', 'echo \"Hello World\"']\n- name: 'alpine'\n  id: 'step2'\n  entrypoint: 'sh'\n  args: ['-c', 'echo \"Hello World 2\"']\n  waitFor: ['step1']\n

                            If you want steps to run concurrently you can set the waitFor field to ['-']:

                            steps:\n- name: 'alpine'\n  id: 'step1'\n  entrypoint: 'sh'\n  args: ['-c', 'echo \"Hello World\"']\n- name: 'alpine'\n  id: 'step2'\n  entrypoint: 'sh'\n  args: ['-c', 'echo \"Hello World 2\"']\n  waitFor: ['-']\n

                          This ends the session on how to use Google Cloud services for now. In a future session, we are going to investigate a bit more of the services offered in GCP, in particular for deploying the models that we have just trained.

                          "},{"location":"s7_deployment/","title":"Model deployment","text":"

                          Slides

                          • Learn how requests work and how to create custom APIs

                            M22: Requests and APIs

                          • Learn how to deploy custom APIs using serverless functions and serverless containers in the cloud

                            M23: Cloud Deployment

                          • Learn how to test APIs for functionality and load

                            M24: API testing

                          • Learn about different ways to improve the deployment of machine learning models

                            M25: ML Deployment

                          • Learn how to create a frontend for your application using Streamlit

                            M26: Frontend

                          Let's say that you have spent 1000 GPU hours and trained the most awesome model that you want to share with the world. One way to do this is, of course, to just place all your code in a GitHub repository, upload a file with the trained model weights to your favorite online storage (assuming it is too big for GitHub to handle) and ask people to download the code and the weights and run everything themselves. This is a fine approach in a small research setting, but in production, you need to be able to deploy the model to an environment that is fully contained, such that people can just execute it without looking (too hard) at the code.

                          Image credit

                          In this session we look at methods specialized towards deployment of models on your local machine and also at how to deploy services in the cloud.

                          Learning objectives

                          The learning objectives of this session are:

                          • Understand the basics of requests and APIs
                          • Can create custom APIs using the framework fastapi and run them locally
                          • Knowledge about serverless deployments and how to deploy custom APIs using both serverless functions and serverless containers
                          • Can create basic continuous deployment pipelines for your models
                          • Understand the basics of frontend development and how to create a frontend for your application using Streamlit
                          • Know how to use more advanced frameworks like onnx and bentoml to deploy your machine learning models
                          "},{"location":"s7_deployment/apis/","title":"M22 - Requests and APIs","text":""},{"location":"s7_deployment/apis/#requests-and-apis","title":"Requests and APIs","text":"

                          Core Module

                          Before we can get to deploying our models we need to understand concepts such as APIs and requests. The core reason for this is that we need a new abstraction layer on top of our applications that is not Python-specific. While Python is the de facto language for machine learning, we cannot expect everybody else to use it and, in particular, we cannot expect network protocols (both local and external) to be able to communicate with our Python programs out of the box. For this reason, we need to understand requests, in particular HTTP requests, and how to create APIs that can interact with those requests.

                          "},{"location":"s7_deployment/apis/#requests","title":"Requests","text":"

                          When we are talking about requests, we are essentially talking about the communication method used in client-server types of architectures. As shown in the image below, in this architecture, the client (user) is going to send requests to a server (our machine learning application) and the server will give a response. For example, the user may send a request to get the class of a specific image, which our application will do and then send back the response in terms of a label.

                          Image credit

                          The common way of sending requests is called HTTP (Hypertext Transfer Protocol). It is essentially a specification of the intermediary transportation method between the client and server. An HTTP request essentially consists of two parts:

                          • A request URL: the location of the server we want to send our request to
                          • A request Method: describing what action we want to perform on the server

                          The common request methods are (case sensitive):

                          • GET: get data from the server
                          • POST/PUT: send data to the server
                          • DELETE: delete data on the server

                          You can read more about the different methods here. For most machine learning applications, GET and POST are the core methods to remember. Additionally, if you want to read more about HTTP in general, we highly recommend that you go over this comic strip about the HTTPS protocol, but the TLDR is that it provides privacy, integrity and identification over the web.

                          "},{"location":"s7_deployment/apis/#exercises","title":"\u2754 Exercises","text":"

                          We are going to do a couple of exercises on sending requests using the requests package to get familiar with the syntax.

                          1. Start by installing the `requests` package

                            pip install requests\n
                          2. Afterwards, create a small script and try to execute the code

                            import requests\nresponse = requests.get('https://api.github.com/this-api-should-not-exist')\nprint(response.status_code)\n

                            As you can see from the syntax, we are sending a request using the GET method. This code should return status code 404. Take a look at this page that contains a list of status codes. Next, let's call a page that exists

                            import requests\nresponse = requests.get('https://api.github.com')\nprint(response.status_code)\n

                            What is the status code now and what does it mean? Status codes are important when you have an application that is interacting with a server and wants to make sure that it does not fail, which can be done with simple if statements on the status codes

                            if response.status_code == 200:\n    print('Success!')\nelif response.status_code == 404:\n    print('Not Found.')\n
                          3. Next, try to call the following

                            response=requests.get(\"https://api.github.com/repos/SkafteNicki/dtu_mlops\")\n

                            which gives back a payload. Essentially, payload refers to any additional data that is sent from the client to the server or vice-versa. Try looking at the response.content attribute. What is the type of this attribute?

                          4. You should hopefully observe that the .content attribute is of type bytes. It is important to note that payloads are, by standard, encoded as byte objects when they are sent. To get a more human-readable version of the response, we can convert it to JSON format

                            response.json()\n

                            It is important to remember that a JSON object in Python is just a nested dictionary, which is useful if you ever want to iterate over the object in some way.
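
                            For example, a minimal sketch (reusing the repository response from before) that iterates over the top-level keys of the returned dictionary:

                            import requests\nresponse = requests.get('https://api.github.com/repos/SkafteNicki/dtu_mlops')\ndata = response.json()  # just a nested Python dictionary\nfor key, value in data.items():\n    print(key, type(value))\n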

                          5. When we use the GET method we can additionally provide a params argument, that specifies what we want the server to send back for a specific request URL:

                            response = requests.get(\n    'https://api.github.com/search/repositories',\n    params={'q': 'requests+language:python'},\n)\n

                            Before looking at response.json() can you explain what the code does? You can try looking at this page for help.

                          6. Sometimes the content of a page cannot be converted into JSON, because as already stated data is sent as bytes. Say that we want to download an image, which we can do in the following way

                            import requests\nresponse = requests.get('https://imgs.xkcd.com/comics/making_progress.png')\n

                            Try calling response.json(), what happens? Next, try calling response.content. To get the result in this case we would need to convert from bytes to an image:

                            with open(r'img.png','wb') as f:\n    f.write(response.content)\n
                          7. The GET method is the most useful one because it allows us to get data from the server. However, as stated in the beginning, multiple request methods exist, for example the POST method for sending data to the server. Try executing:

                            pload = {'username':'Olivia','password':'123'}\nresponse = requests.post('https://httpbin.org/post', data = pload)\n

                            Investigate the response (this is an artificial example because we do not control the server).

                          8. Finally, we should also know that requests can be sent directly from the command line using the curl command. Sometimes it is easier to send a request directly from the terminal and sometimes it is easier to do it from a script.

                            1. Make sure you have curl installed, or else find instructions on installing it. To check that it works, call curl --help, which should print the documentation for curl.

                            2. To execute requests.get('https://api.github.com') using curl we would simply do

                              curl -X GET \"https://api.github.com\"\ncurl -X GET -I \"https://api.github.com\" # if you want the status code\n

                              Try it yourself.

                            3. Try to redo some of the exercises yourself using curl.

                          That ends the intro session on requests. Do not worry if you are still not completely comfortable with sending requests, we are going to return to how we do it in practice when we have created our API. If you want to learn more about the requests package you can check out this tutorial and if you want to see more examples of how to use curl you can check out this page

                          "},{"location":"s7_deployment/apis/#creating-apis","title":"Creating APIs","text":"

                          Requests are all about being on the client side of our client-server architecture. We are now going to move on to the server side where we will be learning about writing the APIs that requests can interact with. An application programming interface (API) is essentially the way the developer (you) tells a user how to use the application that you have created. The API is an abstraction layer that allows the user to interact with our application in the way we want them to interact with it, without the user even having to look at the code.

                          We can take the API from GitHub as an example https://api.github.com. This API allows any user to retrieve, integrate and send data to Github without ever having to visit their webpage. The API exposes multiple endpoints that have various functions:

                          • https://api.github.com/repos/OWNER/REPO/branches: check out the branches on a given repository
                          • https://api.github.com/search/code: search through code on GitHub
                          • https://api.github.com/repos/OWNER/REPO/actions/workflows: check the status of workflows for a given repository

                          and we could go on. However, there may be functionality that Github is not interested in users having access to and they may therefore choose not to have endpoints for specific features (1).

                          1. Many companies provide public APIs to interact with their services/data. For a general list of public APIs you can check out this page. For the Danes out there, you can check out this list of public and private APIs from Danish companies and organizations.

                          The particular kind of API we are going to work with is called a REST API (or RESTful API). REST specifies a set of constraints that an API needs to fulfill to be considered RESTful. You can read more about the six guiding principles behind REST APIs on this page, but one of the most important ones to have in mind is that the client-server architecture needs to be stateless. This means that whenever a request is sent to the server it needs to be self-contained (all information included) and the server cannot rely on any previously stored information from previous requests.

                          To implement APIs in practice we are going to use FastAPI. FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+ based on standard Python type hints. FastAPI is only one of many frameworks for defining APIs; however, compared to other frameworks such as Flask and Django, it offers a sweet spot of being flexible enough to do what you want without having many additional (unnecessary) features.

                          "},{"location":"s7_deployment/apis/#exercises_1","title":"\u2754 Exercises","text":"

                          The exercises below are a condensed version of this and this tutorial. If you ever need context for the exercises, we can recommend trying to go through these. Additionally, we also provide this solution file that you can look through for help.

                          1. Install FastAPI

                            pip install fastapi\n

                            This contains the functions, modules, and variables we are going to need to define our interface.

                          2. Additionally, also install uvicorn, which is a package for running low-level server applications.

                            pip install uvicorn[standard]\n
                          3. Start by defining a small application like this in a file called main.py:

                            from fastapi import FastAPI\napp = FastAPI()\n\n@app.get(\"/\")\ndef read_root():\n    return {\"Hello\": \"World\"}\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n    return {\"item_id\": item_id}\n

                            Important here is the use of the @app.get decorator. What could this decorator refer to? Explain what the two functions are probably doing.

                          4. Next, let's launch our app. Since we called our script main.py and inside the script we initialized our API with app = FastAPI(), the application that we want to deploy can be referenced as main:app:

                            uvicorn --reload --port 8000 main:app\n

                            this will launch a server at this page: http://localhost:8000/. As you will hopefully see, this page will return the content of the root function, like the image below. Remember to also check the output in your terminal as that will give info on when and how your application is being invoked.

                            1. What webpage should you open to get the server to return 1?

                            2. Also check out the pages http://localhost:8000/docs and http://localhost:8000/redoc. What do these pages show?

                            3. The power of the docs and redoc pages is that they allow you to easily test your application with their simple UI. As shown in the image below, simply open the endpoint you want to test, click the Try it out button, input any values and execute it. It will return both the corresponding curl command for invoking your endpoint, the corresponding URL and the response of your application. Try it out.

                            4. You can also look at http://localhost:8000/openapi.json to see the generated schema, which is essentially a JSON file containing the overall specification of your application.

                            5. Try to access http://localhost:8000/items/foo, what happens in this case? When you specify types in your API, FastAPI will automatically do type validation using pydantic, making sure users can only access your API with the correct types. Therefore, remember to include types in your applications!

                          5. With the fundamentals in place let's configure it a bit more:

                            1. Let's start by changing the root function to include a bit more info. In particular, we are also interested in returning the status code so the end user can easily read it. Default status codes are included in the http built-in Python package:

                              from http import HTTPStatus\n\n@app.get(\"/\")\ndef root():\n    \"\"\" Health check.\"\"\"\n    response = {\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n    }\n    return response\n

                              try to reload the page and see what is returned now. You should not have to re-launch the app because we started uvicorn with the --reload argument.

                            2. When we decorate our functions with @app.get(\"/items/{item_id}\"), item_id is in this case what we call a path parameter, because it is a parameter that is directly included in the path of our endpoint. We have already seen how we can restrict a path parameter to a single type, but what if we want to restrict it to specific values? This is often the case if we are working with parameters of type str. In that case we need to define an enum:

                              from enum import Enum\nclass ItemEnum(Enum):\n    alexnet = \"alexnet\"\n    resnet = \"resnet\"\n    lenet = \"lenet\"\n\n@app.get(\"/restric_items/{item_id}\")\ndef read_item(item_id: ItemEnum):\n    return {\"item_id\": item_id}\n

                              Add this API, reload and execute both a valid parameter and a non-valid parameter.

                            3. In contrast to path parameters we have query parameters. In the requests exercises we saw an example of this when we were calling https://api.github.com/search/repositories with the query 'q': 'requests+language:python'. Any parameter in FastAPI that is not a path parameter will be considered a query parameter:

                              @app.get(\"/query_items\")\ndef read_item(item_id: int):\n    return {\"item_id\": item_id}\n

                              Add this API, reload and figure out how to pass in a query parameter.
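
                              As a hint, query parameters are appended to the URL after a ?; with the requests package this could look like the following sketch (assuming the app runs locally on port 8000):

                              import requests\nresponse = requests.get('http://localhost:8000/query_items', params={'item_id': 5})\nprint(response.json())  # equivalent to calling http://localhost:8000/query_items?item_id=5\n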

                            4. We have until now worked with the .get method, but let's also see an example of the .post method. As already described, the POST request method is used for uploading data to the server. Here is a simple app that saves usernames and passwords in a database (please never implement it like this in real life):

                              database = {'username': [ ], 'password': [ ]}\n\n@app.post(\"/login/\")\ndef login(username: str, password: str):\n    username_db = database['username']\n    password_db = database['password']\n    if username not in username_db and password not in password_db:\n        with open('database.csv', \"a\") as file:\n            file.write(f\"{username}, {password} \\n\")\n        username_db.append(username)\n        password_db.append(password)\n    return \"login saved\"\n

                              Make sure you understand what the function does and then try to execute it a couple of times to see your database updating. It is important to note that we sometimes in the following exercises use the .get method and sometimes the .post method. For our usage it does not really matter.

                          6. We are now moving on to figuring out how to provide different standard inputs like text, images, json to our APIs. It is important that you try out each example yourself and in particular you look at the curl commands that are necessary to invoke each application.

                            1. Here is a small application, that takes a single text input

                              @app.get(\"/text_model/\")\ndef contains_email(data: str):\n    regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n    response = {\n        \"input\": data,\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n        \"is_email\": re.fullmatch(regex, data) is not None\n    }\n    return response\n

                              What does the application do? Try it out yourself

                            2. Let's say we wanted to extend the application to check for a specific email domain, either gmail or hotmail. Assume that we want to feed this into our application as a json object e.g.

                              {\n    \"email\": \"mlops@gmail.com\",\n    \"domain_match\": \"gmail\"\n}\n

                              Figure out how to alter the data parameter such that it takes in the json object and make sure to extend the application to check if the email and domain also match. Hint: take a look at this page
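
                              One possible way to receive such a JSON body (a sketch only, with a hypothetical model and endpoint name; you still need to combine it with the regex check from before) is to declare a pydantic model:

                              from pydantic import BaseModel\n\nclass EmailDomainMatch(BaseModel):  # hypothetical model name\n    email: str\n    domain_match: str\n\n@app.post(\"/text_model_domain/\")\ndef contains_email_domain(data: EmailDomainMatch):\n    # simplified check: does the part after @ start with the requested domain?\n    return {\"is_match\": data.email.split(\"@\")[-1].startswith(data.domain_match)}\n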

                            3. Let's move on to an application that requires a file input:

                              from fastapi import UploadFile, File\nfrom typing import Optional\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...)):\n    with open('image.jpg', 'wb') as image:\n        content = await data.read()\n        image.write(content)\n        image.close()\n\n    response = {\n        \"input\": data,\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n    }\n    return response\n

                              A couple of new things are going on here: we use the specialized UploadFile and File bodies in our input definition. Additionally, we added the async/await keywords. Figure out what everything does and try to run the application (you can use any image file you like).

                            4. The above application actually does not do anything. Let's add opencv as a package and let's resize the image. It can be done with the following three lines:

                              import cv2\nimg = cv2.imread(\"image.jpg\")\nres = cv2.resize(img, (h, w))\n

                              Figure out where to add them in the application and additionally add h and w as optional parameters, with a default value of 28. Try running the application where you specify everything and one more time where you leave out h and w.

                            5. Finally, let's also figure out how to return a file from our application. You will need to add the following lines:

                              from fastapi.responses import FileResponse\ncv2.imwrite('image_resize.jpg', res)\nFileResponse('image_resize.jpg')\n

                              Figure out where to add them to the code and try running the application one more time to see that you get a file back with the resized image.

                          7. A common pattern in most applications is that we want some code to run on startup and some code to run on shutdown. FastAPI allows us to do this by controlling the lifespan of our application. This is done by implementing the lifespan function. Look at the documentation for lifespan events and implement a small application that prints Hello on startup and Goodbye on shutdown.

                            Solution

                            Here is a simple example that will print Hello on startup and Goodbye on shutdown.

                            from contextlib import asynccontextmanager\nfrom fastapi import FastAPI\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n    print(\"Hello\")\n    yield\n    print(\"Goodbye\")\n\napp = FastAPI(lifespan=lifespan)\n\n@app.get(\"/\")\ndef read_root():\n    return {\"Hello\": \"World\"}\n
                          8. Let's try to figure out how to use FastAPI in a machine learning context. Below is a script that downloads a VisionEncoderDecoder model from Hugging Face. The model can be used to create captions for a given image. Thus, calling

                            predict_step(['s7_deployment/exercise_files/my_cat.jpg'])\n

                            returns a list of strings like ['a cat laying on a couch with a stuffed animal'] (try this yourself). Create a FastAPI application that can do inference using this model e.g. it should take in an image, preferably some optional hyperparameters (like max_length) and should return a string (or list of strings) containing the generated caption.

                            simple ML application

                            from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer\nimport torch\nfrom PIL import Image\n\nmodel = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\nfeature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\ntokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\ngen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\ndef predict_step(image_paths):\n    images = []\n    for image_path in image_paths:\n        i_image = Image.open(image_path)\n        if i_image.mode != \"RGB\":\n            i_image = i_image.convert(mode=\"RGB\")\n\n        images.append(i_image)\n    pixel_values = feature_extractor(images=images, return_tensors=\"pt\").pixel_values\n    pixel_values = pixel_values.to(device)\n    output_ids = model.generate(pixel_values, **gen_kwargs)\n    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n    preds = [pred.strip() for pred in preds]\n    return preds\n\nif __name__ == \"__main__\":\n    print(predict_step(['s7_deployment/exercise_files/my_cat.jpg']))\n
                            Solution ml_app.py
                            from contextlib import asynccontextmanager\n\nimport torch\nfrom fastapi import FastAPI, File, UploadFile\nfrom PIL import Image\nfrom transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTFeatureExtractor\n\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n    \"\"\"Load and clean up model on startup and shutdown.\"\"\"\n    global model, feature_extractor, tokenizer, device, gen_kwargs\n    print(\"Loading model\")\n    model = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n    feature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n    tokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n    device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n    model.to(device)\n    gen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\n\n    yield\n\n    print(\"Cleaning up\")\n    del model, feature_extractor, tokenizer, device, gen_kwargs\n\n\napp = FastAPI(lifespan=lifespan)\n\n\n@app.post(\"/caption/\")\nasync def caption(data: UploadFile = File(...)):\n    \"\"\"Generate a caption for an image.\"\"\"\n    i_image = Image.open(data.file)\n    if i_image.mode != \"RGB\":\n        i_image = i_image.convert(mode=\"RGB\")\n\n    pixel_values = feature_extractor(images=[i_image], return_tensors=\"pt\").pixel_values\n    pixel_values = pixel_values.to(device)\n    output_ids = model.generate(pixel_values, **gen_kwargs)\n    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n    return [pred.strip() for pred in preds]\n
                          9. As the final step, we want to figure out how to include our FastAPI application in a docker container, as this will help us when we want to deploy in the cloud, because docker, as always, takes care of the dependencies of our application. For the following set of exercises you can use whichever of the previous FastAPI applications you like as the base application for the container

                            1. Start by creating a requirements.txt file for your application. You will at least need fastapi and uvicorn in the file, and we always recommend that you are specific about which versions you want to use

                              fastapi>=0.68.0,<0.69.0\nuvicorn>=0.15.0,<0.16.0\n# add anything else you application needs to be able to run\n
                            2. Next, create a Dockerfile with the following content

                              FROM python:3.9\nWORKDIR /code\nCOPY ./requirements.txt /code/requirements.txt\n\nRUN pip install --no-cache-dir --upgrade -r /code/requirements.txt\nCOPY ./app /code/app\n\nCMD [\"uvicorn\", \"app.main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"80\"]\n

                              The above assumes that your file structure looks like this

                              .\n\u251c\u2500\u2500 app\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2514\u2500\u2500 main.py\n\u251c\u2500\u2500 Dockerfile\n\u2514\u2500\u2500 requirements.txt\n

                              Hopefully, all these steps should look familiar if you already went through module M9, except for maybe the last line. However, this is just the standard way that we have run our FastAPI applications in the last couple of exercises, this time with some extra arguments regarding the ports we allow.

                            3. Next, build the corresponding docker image

                              docker build -t my_fastapi_app .\n
                            4. Finally, run the image such that a container is spun up that runs our application. The important part here is to remember to specify the -p argument (p for port) which should be the same number as the port we have specified in the last line of our Dockerfile.

                              docker run --name mycontainer -p 80:80 my_fastapi_app\n
                            5. Check that everything is working by going to the corresponding localhost page http://localhost/items/5?q=somequery
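
                              If you prefer checking from a script instead of the browser, a small sketch using the requests package (assuming the port mapping from the previous step):

                              import requests\nresponse = requests.get('http://localhost/items/5', params={'q': 'somequery'})\nassert response.status_code == 200\nprint(response.json())\n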

                          This ends the module on APIs. If you want to go further in this direction we highly recommend that you check out bentoml, which is an API framework that focuses solely on creating easy-to-understand APIs and services for ML applications. Additionally, we can also highly recommend checking out Postman, which can help design, document and in particular test the API you are writing to make sure that it works as expected.

                          "},{"location":"s7_deployment/cloud_deployment/","title":"M23 - Cloud Deployment","text":""},{"location":"s7_deployment/cloud_deployment/#cloud-deployment","title":"Cloud deployment","text":"

                          Core Module

                          We are now returning to using the cloud. At this point you should have gone through the steps of having code in your GitHub repository automatically build into a docker container, storing that container, storing your data and pulling it all together to run a training job. After the training is completed you should hopefully have a file stored in the cloud with your trained model weights.

                          Today's exercises will be about serving those model weights to an end user. We focus on two different ways of deploying our model: Google cloud functions and Google cloud run. Both services are serverless, meaning that you do not have to manage the server that runs your code.

                          GCP in general has 5 core deployment options. We are going to focus on Cloud Functions and Cloud Run, which are two of the serverless options. In contrast to these two, you have the option to deploy to Kubernetes Engine and Compute Engine which are more traditional ways of deploying your code. Here you have to manage the underlying infrastructure."},{"location":"s7_deployment/cloud_deployment/#cloud-functions","title":"Cloud Functions","text":"

                          Google Cloud Functions is the simplest way to deploy our code to the cloud. As stated above, it is a serverless service, meaning that you do not have to worry about the underlying infrastructure. You just write your code and deploy it. The service is great for small applications that can be encapsulated in a single script.

                          "},{"location":"s7_deployment/cloud_deployment/#exercises","title":"\u2754 Exercises","text":"
                          1. Go to the start page of Cloud Functions. It can be found in the sidebar on the homepage, or you can just search for it. Activate the service in the cloud console or use the following command:

                            gcloud services enable cloudfunctions.googleapis.com\n
                          2. Click the Create Function button which should take you to a screen like the image below. Make sure it is a 2nd Gen function, give it a name, set the server region to somewhere close by and change the authentication policy to Allow unauthenticated invocations so we can access it directly from a browser. Remember to note down the name and region of the function, as you will need them later.

                          3. On the next page, for Runtime pick the Python 3.11 option (or newer). This will make the inline editor show both a main.py and a requirements.txt file. Look over them and try to understand what they do. Especially, take a look at the functions-framework which is a needed requirement of any Cloud function.

                            After you have looked over the files, click the Deploy button.

                            Solution

                            The functions-framework is a lightweight, open-source framework for turning Python functions into HTTP functions. Any function that you deploy to Cloud Functions must be wrapped in the @functions_framework.http decorator.
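
                            For reference, a minimal function in this style could look like the sketch below (similar in spirit to, though not necessarily identical to, the template the inline editor generates):

                            import functions_framework\n\n@functions_framework.http\ndef hello_http(request):\n    \"\"\"Respond to an HTTP request with a greeting.\"\"\"\n    request_json = request.get_json(silent=True)\n    name = request_json[\"name\"] if request_json and \"name\" in request_json else \"World\"\n    return f\"Hello {name}!\"\n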

                          4. Afterwards, the function should begin to deploy. When it is done, you should see \u2705. Now let's test it by going to the Testing tab.

                          5. If you know what the application does, it should come as no surprise that it can run without any input. We therefore just send an empty request by clicking the Test The Function button. Does the function return the output you expected? Wait for the logs to show up. What do they show?

                            1. What should the Triggering event look like in the testing prompt for the program to respond with

                              Hallo General Kenobi!\n

                              Try it out.

                              Solution

                              The default triggering event is a JSON object with a key name and a value. Therefore the triggering event should look like this:

                              {\n    \"name\": \"General Kenobi\"\n}\n
                            2. Go to the trigger tab and go to the URL for the application. Execute the API a couple of times. How can you change the URL to make the application respond with the same output as above?

                              Solution

                              You can change the URL to include a query parameter name with the value General Kenobi. For example

                              https://us-central1-my-personal-mlops-project.cloudfunctions.net/function-3?name=General%20Kanobi\n

                              where you would need to replace everything before the ? with your URL.

                            3. Click on the metrics tab. You should hopefully see it being populated with a few data points. Identify what each panel is showing.

                              Solution
                              • Invocations/Second: The number of times the function is invoked per second
                              • Execution time (ms): The time it takes for the function to execute in milliseconds
                              • Memory usage (MB): The memory usage of the function in MB
                              • Instance count (instances): The number of instances that are running the function
                            4. Check out the logs tab. You should see that your application has already been invoked multiple times. Also, try to execute this command in a terminal:

                              gcloud functions logs read\n
                          6. Next, we are going to create our own application that takes some input so we can try to send it requests. We provide a very simple script to get started.

                            Simple script

                            sklearn_cloud_functions.py
                            # Load data\nimport pickle\n\nimport numpy as np\nfrom sklearn import datasets\nfrom sklearn.neighbors import KNeighborsClassifier\n\niris_x, iris_y = datasets.load_iris(return_X_y=True)\n\n# Split iris data in train and test data\n# A random permutation, to split the data randomly\nnp.random.seed(0)\nindices = np.random.permutation(len(iris_x))\niris_x_train = iris_x[indices[:-10]]\niris_y_train = iris_y[indices[:-10]]\niris_x_test = iris_x[indices[-10:]]\niris_y_test = iris_y[indices[-10:]]\n\n# Create and fit a nearest-neighbor classifier\n\nknn = KNeighborsClassifier()\nknn.fit(iris_x_train, iris_y_train)\nknn.predict(iris_x_test)\n\n# save model\n\nwith open(\"model.pkl\", \"wb\") as file:\n    pickle.dump(knn, file)\n
                            1. Figure out what the script does and run the script. This should create a file with a trained model.

                              Solution

                              The file trains a simple KNN model on the iris dataset and saves it to a file called model.pkl.

                            2. Next, create a storage bucket and upload the model file to the bucket. Try to do this using the gsutil command and check afterward that the file is in the bucket.

                              Solution
                              gsutil mb gs://<bucket-name>  # mb stands for make bucket\ngsutil cp <file-name> gs://<bucket-name>  # cp stands for copy\n
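
                              Alternatively, a sketch of doing the same from Python with the google-cloud-storage client (the bucket name is a placeholder):

                              from google.cloud import storage\n\nclient = storage.Client()\nbucket = client.create_bucket(\"<bucket-name>\")  # or client.bucket(\"<bucket-name>\") if it already exists\nbucket.blob(\"model.pkl\").upload_from_filename(\"model.pkl\")\n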
                            3. Create a new cloud function with the same initial settings as the first one, e.g. Python 3.11 and HTTP. Then implement in the main.py file code that:

                              • Loads the model from the bucket
                              • Takes a request with a list of integers as input
                              • Returns the prediction of the model

                              In addition to writing the main.py file, you also need to fill out the requirements.txt file. You need at least three packages to run the application. Remember to also change the Entry point to the name of your function. If your deployment fails, try to go to the Logs Explorer page in gcp which can help you identify why.

                              Solution

                              The main script should look something like this:

                              main.py
                              import pickle\n\nimport functions_framework\nfrom google.cloud import storage\n\nBUCKET_NAME = \"my_sklearn_model_bucket\"\nMODEL_FILE = \"model.pkl\"\n\nclient = storage.Client()\nbucket = client.get_bucket(BUCKET_NAME)\nblob = bucket.get_blob(MODEL_FILE)\nmy_model = pickle.loads(blob.download_as_string())\n\n\n@functions_framework.http\ndef knn_classifier(request):\n    \"\"\"Simple knn classifier function for iris prediction.\"\"\"\n    request_json = request.get_json()\n    if request_json and \"input_data\" in request_json:\n        input_data = request_json[\"input_data\"]\n        input_data = [float(in_data) for in_data in input_data]\n        input_data = [input_data]\n        prediction = my_model.predict(input_data)\n        return {\"prediction\": prediction.tolist()}\n    return {\"error\": \"No input data provided.\"}\n

                              And, the requirement file should look like this:

                              functions-framework>=3.7.0\ngoogle-cloud-storage>=2.14.0\nscikit-learn>=1.4.0\n

                              Importantly, make sure that you are using the same version of scikit-learn as you used when training the model. Otherwise, you will most likely get an error when trying to load the model.

                            4. When you have successfully deployed the model, try to make predictions with it. What should the request look like?

                              Solution

                              It depends on how exactly you have chosen to implement the main.py. But for the provided solution, the payload should look like this:

                              {\n    \"input_data\": [5.1, 3.5, 1.4, 0.2]\n}\n

                              with the corresponding curl command:

                              curl -X POST \\\n    https://your-cloud-function-url/knn_classifier \\\n    -H \"Content-Type: application/json\" \\\n    -d '{\"input_data\": [5.1, 3.5, 1.4, 0.2]}'\n
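
                              Equivalently, the request can be sent with the requests package from the earlier module, e.g.:

                              import requests\nresponse = requests.post(\n    'https://your-cloud-function-url/knn_classifier',\n    json={'input_data': [5.1, 3.5, 1.4, 0.2]},\n)\nprint(response.json())\n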
                          7. Let's try to figure out how to do the above deployment using gcloud instead of the console UI. The relevant command is gcloud functions deploy. For this command to work you will need to put the main.py and requirements.txt files in a separate folder. Try to execute the command to successfully deploy the function.

                            Solution
                            gcloud functions deploy <func-name> \\\n    --gen2 --runtime python311 --trigger-http --source <folder> --entry-point knn_classifier\n

                            where you need to replace <func-name> with the name of your function and <folder> with the path to the folder containing the main.py and requirements.txt files.

                          8. (Optional) You can finally try to redo the exercises by deploying a PyTorch application. You will essentially need to go through the same steps as the sklearn example, including uploading a trained model to storage and writing a cloud function that loads it and returns some output. You are free to choose whatever PyTorch model you want.
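
                            If you attempt this, a minimal sketch of what such a cloud function could look like is shown below; the bucket name, file name and the assumption that the model was saved as TorchScript with torch.jit.save are all placeholders/assumptions:

                            import io\n\nimport functions_framework\nimport torch\nfrom google.cloud import storage\n\nBUCKET_NAME = \"<your-model-bucket>\"  # placeholder\nMODEL_FILE = \"model.pt\"  # placeholder, assumed saved with torch.jit.save\n\nclient = storage.Client()\nblob = client.get_bucket(BUCKET_NAME).get_blob(MODEL_FILE)\nmodel = torch.jit.load(io.BytesIO(blob.download_as_bytes()))\nmodel.eval()\n\n\n@functions_framework.http\ndef torch_predict(request):\n    \"\"\"Run the model on a list of floats provided in the request body.\"\"\"\n    request_json = request.get_json(silent=True)\n    if request_json and \"input_data\" in request_json:\n        x = torch.tensor(request_json[\"input_data\"], dtype=torch.float32).unsqueeze(0)\n        with torch.no_grad():\n            prediction = model(x)\n        return {\"prediction\": prediction.tolist()}\n    return {\"error\": \"No input data provided.\"}\n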

                          "},{"location":"s7_deployment/cloud_deployment/#cloud-run","title":"Cloud Run","text":"

                          Cloud functions are great for simple deployments, that can be encapsulated in a single script with only simple requirements. However, they do not scale with more advanced applications that may depend on multiple programming languages. We are already familiar with how we can deal with this through containers and Cloud Run is the corresponding service in GCP for deploying containers.

                          "},{"location":"s7_deployment/cloud_deployment/#exercises_1","title":"\u2754 Exercises","text":"
                          1. We are going to start locally by developing a small app that we can deploy. We provide two small examples to choose from: first is a small FastAPI app consisting of a single Python script and a docker file. The second is a small Streamlit app (which you can learn more about in this module) consisting of a single docker file. You can choose which one you want to work with.

                            Simple Fastapi app simple_fastapi_app.py
                            from fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/\")\ndef read_root():\n    \"\"\"Root endpoint.\"\"\"\n    return {\"Hello\": \"World\"}\n\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n    \"\"\"Get an item by id.\"\"\"\n    return {\"item_id\": item_id}\n
                            simple_fastapi_app.dockerfile
                            FROM python:3.9-slim\n\nEXPOSE $PORT\n\nWORKDIR /app\n\nRUN apt-get update && apt-get install -y \\\n    build-essential \\\n    software-properties-common \\\n    git \\\n    && rm -rf /var/lib/apt/lists/*\n\nRUN pip install fastapi\nRUN pip install pydantic\nRUN pip install uvicorn\n\nCOPY simple_fastapi_app.py simple_fastapi_app.py\n\nCMD exec uvicorn simple_fastapi_app:app --port $PORT --host 0.0.0.0 --workers 1\n
                            Simple Streamlit app streamlit_app.dockerfile
                            FROM python:3.9-slim\n\nEXPOSE $PORT\n\nWORKDIR /app\n\nRUN apt-get update && apt-get install -y \\\n    build-essential \\\n    software-properties-common \\\n    git \\\n    && rm -rf /var/lib/apt/lists/*\n\nRUN git clone https://github.com/streamlit/streamlit-example.git .\n\nRUN pip3 install -r requirements.txt\n\nENTRYPOINT [\"streamlit\", \"run\", \"streamlit_app.py\", \"--server.port=$PORT\", \"--server.address=0.0.0.0\"]\n
                            1. Start by going over the files belonging to your chosen app and understand what it does.

                            2. Next, build the docker image belonging to the app

                              docker build -f <dockerfile> . -t gcp_test_app:latest\n
                            3. Next tag and push the image to your artifact registry

                              docker tag gcp_test_app <region>-docker.pkg.dev/<project-id>/<registry-name>/gcp_test_app:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<registry-name>/gcp_test_app:latest\n

                              Afterward, check that your artifact registry contains the pushed image.

                          2. Next, go to Cloud Run in the cloud console and enable the service or use the following command:

                            gcloud services enable run.googleapis.com\n
                          3. Click the Create Service button which should bring you to a page similar to the one below

                            Do the following:

                            • Click the select button, which will bring up all built containers, and pick the one you want to deploy. In the future, you probably want to choose the Continuously deploy new revisions from a source repository option, such that a new version is always deployed when a new container is built.

                            • Hereafter, give the service a name and select the region. We recommend choosing a region close to you.

                            • Set the authentication method to Allow unauthenticated invocations such that we can call it without providing credentials. In the future, you may only set that authenticated invocations are allowed.

                            • Expand the Container, Connections, Security tab and edit the port such that it matches the port exposed in your chosen application. If your docker file exposes the env variable $PORT you can set the port to anything.

                            Finally, click the create button and wait for the service to be deployed (may take some time).

                            Common problems

                            If you get an error saying The user-provided container failed to start and listen on the port defined by the PORT environment variable. there are two common reasons for this:

                            1. You need to add an EXPOSE statement in your docker container:

                              EXPOSE 8080\nCMD exec uvicorn my_application:app --host 0.0.0.0 --port 8080 --workers 1\n

                              and make sure that your application is also listening on that port. If you hard code the port in your application (as in the above code), it is best to set it to 8080, which is the default port for Cloud Run. Alternatively, a better approach is to set it to the $PORT environment variable, which is set by Cloud Run and can be accessed in your application:

                              EXPOSE $PORT\nCMD exec uvicorn my_application:app --host 0.0.0.0 --port $PORT --workers 1\n

                              If you do this and then want to run locally you can run it as:

                              docker run -p 8080:8080 -e PORT=8080 <image-name>:<image-tag>\n
                            2. If you are serving a large machine-learning model, it may also be that your deployed container is running out of memory. You can increase the memory limit of the container under Edit container in the Resources tab.

                          4. If you manage to deploy the service you should see an image like this:

                            You can now access your application by clicking the URL. This will access the root of your application, so you may need to add / or /<path> to the URL depending on how the app works.

                          5. Everything we just did in the console UI can also be done with the gcloud run deploy command. How would you do that?

                            Solution

                            The command should look something like this

                            gcloud run deploy <service-name> \\\n    --image <image-name>:<image-tag> --platform managed --region <region> --allow-unauthenticated\n

                            where you need to replace <service-name> with the name of your service, <image-name> with the name of your image and <region> with the region you want to deploy to. The --allow-unauthenticated flag is optional but is needed if you want to access the service without providing credentials.

                          6. After deploying using the command line, make sure that the service is up and running by using these two commands

                            gcloud run services list\ngcloud run services describe <service-name> --region <region>\n
                          7. Instead of deploying our docker container using the UI or command line, which is a manual operation, we can do it continuously by using the cloudbuild.yaml file we learned about in the previous section. This is called continuous deployment, and it is a way to automate the deployment process.

                            Image credit

                            Let's revisit the cloudbuild.yaml file from the artifact registry exercises in this module, which builds and pushes a specified docker image.

                            cloudbuild.yaml

                            cloudbuild.yaml
                            steps:\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Build container image'\n  args: [\n    'build',\n    '.',\n    '-t',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n    '-f',\n    '<path-to-dockerfile>'\n  ]\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Push container image'\n  args: [\n    'push',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n  ]\n

                            Add a third step to the cloudbuild.yaml file that deploys the container image to Cloud Run. The relevant builder image you need to use is called 'gcr.io/cloud-builders/gcloud' and the command is 'gcloud run deploy'. Afterwards, reuse the trigger you created in the previous module or create a new one to build and deploy the container image continuously. Confirm that this works by making a change to your application, pushing it to GitHub and checking that the application is updated automatically.

                            Solution

                            The full cloudbuild.yaml file should look like this:

                            steps:\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Build container image'\n  args: [\n    'build',\n    '.',\n    '-t',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n    '-f',\n    '<path-to-dockerfile>'\n  ]\n- name: 'gcr.io/cloud-builders/docker'\n  id: 'Push container image'\n  args: [\n    'push',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n  ]\n- name: 'gcr.io/cloud-builders/gcloud'\n  id: 'Deploy to Cloud Run'\n  args: [\n    'run',\n    'deploy',\n    '<service-name>',\n    '--image',\n    'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n    '--region',\n    'europe-west1',\n    '--platform',\n    'managed',\n  ]\n
                          "},{"location":"s7_deployment/cloud_deployment/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                          1. In the previous module on using the cloud you learned about the Secrets Manager in GCP. How can you use this service in combination with Cloud Run?

                            Solution

                            In the cloud console, secrets can be set in the Container(s), Volumes, Networking, Security tab under the Variables & Secrets section, see image below.

                            In the gcloud command, you can set the secret by using the --update-secrets flag.

                            gcloud run deploy <service-name> \\\n    --image <image-name>:<image-tag> --platform managed \\\n    --region <region> --allow-unauthenticated \\\n    --update-secrets=<env-var-name>=<secret-name>:<secret-version>\n
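
                            Inside the running container, a secret exposed as an environment variable can then be read like any other environment variable; a small sketch (the variable name is just an example):

                            import os\n\n# read the secret that Cloud Run injected as an environment variable (example name)\nwandb_api_key = os.environ.get(\"WANDB_API_KEY\")\nif wandb_api_key is None:\n    raise RuntimeError(\"WANDB_API_KEY is not set\")\n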

                          That ends the exercises on deployment. The exercises above are just a small taste of what deployment has to offer. In both sections, we have explicitly chosen to work with serverless deployments. But what if you wanted to do the opposite, e.g. being the one in charge of managing the cluster that handles the deployed services? If you are interested in taking deployment to the next level, you should get started with Kubernetes, which is the de-facto open-source container orchestration platform used in production environments. If you want to deep dive we recommend starting here, which describes how to make pipelines, a necessary component before you start to create your own Kubernetes cluster.

                          "},{"location":"s7_deployment/frontend/","title":"M26 - Frontend","text":""},{"location":"s7_deployment/frontend/#frontend","title":"Frontend","text":"

                          If you have gone over the deployment module you should be at the point where you have a machine learning model running in the cloud. The model can be interacted with by sending HTTP requests to the API endpoint. In general we refer to this as the backend of the application. It is the behind-the-scenes part of our application that the user does not see, and it is not really that user-friendly. Instead we want to create a frontend that the user can interact with in a more user-friendly way. This is what we will be doing in this module.

                          Another reason for splitting our application into a frontend and a backend has to do with scalability. If we have a lot of users interacting with our application, we might want to scale only the backend and not the frontend, because that is the part running our heavy machine learning model. In general, dividing an application into smaller pieces is the pattern used in microservice architectures.

                          In monolithic applications, everything the user may request from our application is handled by a single process/container. In microservice architectures, the application is split into smaller pieces that can be scaled independently. This also leads to easier maintainability and faster development.

                          Frontends have for the longest time been created using HTML, CSS and JavaScript. This is still the case, but there are now a lot of frameworks that can help us create a frontend in Python:

                          • Django
                          • Reflex
                          • Streamlit
                          • Bokeh
                          • Gradio

                          In this module we will be looking at streamlit. streamlit is an easy-to-use framework that allows us to create interactive web applications in Python. It is not nearly as powerful as a framework like Django, but it is very easy to get started with and very easy to integrate with our machine learning models.
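
                          To get a feeling for the framework before the exercises, here is a minimal sketch of a streamlit app (save it as e.g. app.py and run it with streamlit run app.py):

                          import streamlit as st\n\nst.title(\"My first frontend\")  # page title\nname = st.text_input(\"What is your name?\")  # simple input widget\nif st.button(\"Greet\"):\n    st.write(f\"Hello {name}!\")  # shown when the button is clicked\n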

                          "},{"location":"s7_deployment/frontend/#exercises","title":"\u2754 Exercises","text":"

                          In these exercises we go through the process of setting up a backend using fastapi and a frontend using streamlit, containerizing both applications and then deploying them to the cloud. We have already created an example of this which can be found in the samples/frontend_backend folder.

                          1. Let's start by creating the backend application in a backend.py file. You can use essentially any backend you want, but we will be using a simple imagenet classifier that we have created in the samples/frontend_backend/backend folder.

                            1. Create a new file called backend.py and implement a FastAPI interface with a single /predict endpoint that takes an image as input and returns the predicted class (and probabilities) of the image.

                              Solution backend.py
                              import json\nfrom contextlib import asynccontextmanager\n\nimport anyio\nimport torch\nfrom fastapi import FastAPI, File, HTTPException, UploadFile\nfrom PIL import Image\nfrom torchvision import models, transforms\n\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n    \"\"\"Context manager to start and stop the lifespan events of the FastAPI application.\"\"\"\n    global model, transform, imagenet_classes\n    # Load model\n    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)\n    model.eval()\n\n    transform = transforms.Compose(\n        [\n            transforms.Resize((224, 224)),\n            transforms.ToTensor(),\n            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),\n        ],\n    )\n\n    async with await anyio.open_file(\"imagenet-simple-labels.json\") as f:\n        imagenet_classes = json.load(f)\n\n    yield\n\n    # Clean up\n    del model\n    del transform\n    del imagenet_classes\n\n\napp = FastAPI(lifespan=lifespan)\n\n\ndef predict_image(image_path: str) -> str:\n    \"\"\"Predict image class (or classes) given image path and return the result.\"\"\"\n    img = Image.open(image_path).convert(\"RGB\")\n    img = transform(img).unsqueeze(0)\n    with torch.no_grad():\n        output = model(img)\n    _, predicted_idx = torch.max(output, 1)\n    return output.softmax(dim=-1), imagenet_classes[predicted_idx.item()]\n\n\n@app.get(\"/\")\nasync def root():\n    \"\"\"Root endpoint.\"\"\"\n    return {\"message\": \"Hello from the backend!\"}\n\n\n# FastAPI endpoint for image classification\n@app.post(\"/classify/\")\nasync def classify_image(file: UploadFile = File(...)):\n    \"\"\"Classify image endpoint.\"\"\"\n    try:\n        contents = await file.read()\n        async with await anyio.open_file(file.filename, \"wb\") as f:\n            f.write(contents)\n        probabilities, prediction = predict_image(file.filename)\n        return {\"filename\": file.filename, \"prediction\": prediction, \"probabilities\": probabilities.tolist()}\n    except Exception as e:\n        raise HTTPException(status_code=500) from e\n
                            2. Run the backend using uvicorn

                              uvicorn backend:app --reload\n
                            3. Test the backend by sending a request to the /predict endpoint, preferably using the curl command

                              Solution

                              In this example we are sending a request to the /predict endpoint with a file called my_cat.jpg. The response should be \"tabby cat\" for the solution we have provided.

                              curl -X 'POST' \\\n    'http://127.0.0.1:8000/predict' \\\n    -H 'accept: application/json' \\\n    -H 'Content-Type: multipart/form-data' \\\n    -F 'file=@my_cat.jpg;type=image/jpeg'\n
                            4. Create a requirements_backend.txt file with the dependencies needed for the backend.

                              Solution requirements_backend.txt
                              fastapi>=0.108.0\nuvicorn>=0.25.0\npython-multipart>=0.0.6\ntorch>=2.1.2\ntorchvision>=0.16.2\n
                            5. Containerize the backend into a file called backend.dockerfile.

                              Solution backend.dockerfile
                              FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc git && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nRUN mkdir /app\n\nWORKDIR /app\n\nCOPY requirements_backend.txt /app/requirements_backend.txt\nCOPY backend.py /app/backend.py\nCOPY imagenet-simple-labels.json /app/imagenet-simple-labels.json\n\nRUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements_backend.txt\n\nEXPOSE $PORT\nCMD exec uvicorn --port $PORT --host 0.0.0.0 backend:app\n
                            6. Build the backend image

                              docker build -t backend:latest -f backend.dockerfile .\n
                            7. Recheck that the backend works by running the image in a container

                              docker run --rm -p 8000:8000 -e \"PORT=8000\" backend\n

                              and test that it works by sending a request to the /predict endpoint.

                            8. Deploy the backend to Cloud Run using the gcloud command

                              Solution

                              Assuming that we have created an Artifact Registry repository called frontend-backend we can deploy the backend to Cloud Run using the following commands:

                              docker tag \\\n    backend:latest \\\n    <region>-docker.pkg.dev/<project>/frontend-backend/backend:latest\ndocker push \\\n    <region>-docker.pkg.dev/<project>/frontend-backend/backend:latest\ngcloud run deploy backend \\\n    --image=<region>-docker.pkg.dev/<project>/frontend-backend/backend:latest \\\n    --region=<region> \\\n    --platform=managed\n

                              where <region> and <project> should be replaced with the appropriate values.

                            9. Finally, test that the deployed backend works as expected by sending a request to the /predict endpoint

                              Solution

                              In this solution we are first extracting the url of the deployed backend and then sending a request to the /predict endpoint.

                              export MYENDPOINT=$(gcloud run services describe backend --region=<region> --format=\"value(status.url)\")\ncurl -X 'POST' \\\n    $MYENDPOINT/predict \\\n    -H 'accept: application/json' \\\n    -H 'Content-Type: multipart/form-data' \\\n    -F 'file=@my_cat.jpg;type=image/jpeg'\n
                          2. With the backend taken care of, let's now write our frontend. Our frontend just needs to be a \"nice\" interface to our backend. Its main functionality will be to send a request to the backend and display the result. For inspiration, see the streamlit documentation

                            1. Start by installing streamlit

                              pip install streamlit\n
                            2. Now create a file called frontend.py and implement a streamlit application. You can design it as you want, but we recommend that the frontend supports the following:

                              1. Have a file uploader that allows the user to upload an image

                              2. Display the image that the user uploaded

                              3. Have a button that sends the image to the backend and displays the result

                              For now, just assume that an environment variable called BACKEND is available that contains the URL of the backend. We will show in the next step how to get this URL automatically.

                              Solution frontend.py
                              import os\n\nimport pandas as pd\nimport requests\nimport streamlit as st\nfrom google.cloud import run_v2\n\n\ndef get_backend_url():\n    \"\"\"Get the URL of the backend service.\"\"\"\n    parent = \"projects/my-personal-mlops-project/locations/europe-west1\"\n    client = run_v2.ServicesClient()\n    services = client.list_services(parent=parent)\n    for service in services:\n        if service.name.split(\"/\")[-1] == \"production-model\":\n            return service.uri\n    return os.environ.get(\"BACKEND\", None)\n\n\ndef classify_image(image, backend):\n    \"\"\"Send the image to the backend for classification.\"\"\"\n    predict_url = f\"{backend}/predict\"\n    response = requests.post(predict_url, files={\"file\": image}, timeout=10)\n    if response.status_code == 200:\n        return response.json()\n    return None\n\n\ndef main() -> None:\n    \"\"\"Main function of the Streamlit frontend.\"\"\"\n    backend = get_backend_url()\n    if backend is None:\n        msg = \"Backend service not found\"\n        raise ValueError(msg)\n\n    st.title(\"Image Classification\")\n\n    uploaded_file = st.file_uploader(\"Upload an image\", type=[\"jpg\", \"jpeg\", \"png\"])\n\n    if uploaded_file is not None:\n        image = uploaded_file.read()\n        result = classify_image(image, backend=backend)\n\n        if result is not None:\n            prediction = result[\"prediction\"]\n            probabilities = result[\"probabilities\"][0]  # probabilities for the first (and only) image in the batch\n\n            # show the image and prediction\n            st.image(image, caption=\"Uploaded Image\")\n            st.write(\"Prediction:\", prediction)\n\n            # make a nice bar chart\n            data = {\"Class\": [f\"Class {i}\" for i in range(len(probabilities))], \"Probability\": probabilities}\n            df = pd.DataFrame(data)\n            df.set_index(\"Class\", inplace=True)\n            st.bar_chart(df, y=\"Probability\")\n        else:\n            st.write(\"Failed to get prediction\")\n\n\nif __name__ == \"__main__\":\n    main()\n
                            3. We need to make sure that the frontend knows where the backend is located, and we want that to happen automatically so we do not have to hardcode the URL into our frontend. We can do this by using the Python SDK for Google Cloud Run. The following code snippet shows how to get the URL of the backend service or fall back to an environment variable if the service is not found.

                              import os\n\nfrom google.cloud import run_v2\nimport streamlit as st\n\n@st.cache_resource  # (1)!\ndef get_backend_url():\n    \"\"\"Get the URL of the backend service.\"\"\"\n    parent = \"projects/<project>/locations/<region>\"\n    client = run_v2.ServicesClient()\n    services = client.list_services(parent=parent)\n    for service in services:\n        if service.name.split(\"/\")[-1] == \"production-model\":\n            return service.uri\n    name = os.environ.get(\"BACKEND\", None)\n    return name\n
                              1. The st.cache_resource is a decorator that tells streamlit to cache the result of the function. This is useful if the function is expensive to run and we want to avoid running it multiple times.

                              Add the above code snippet to the top of your frontend.py file and replace <project> and <region> with the appropriate values. You will need to run pip install google-cloud-run to be able to use the code snippet.

                            4. Run the frontend using streamlit

                              streamlit run frontend.py\n
                            5. Create a requirements_frontend.txt file with the dependencies needed for the frontend.

                              Solution requirements_frontend.txt
                              streamlit>=1.28.2\nrequests>=2.31.0\npandas>=2.1.3\ngoogle-cloud-run>=0.10.5\n
                            6. Containerize the frontend into a file called frontend.dockerfile.

                              Solution frontend.dockerfile
                              FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc git && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nRUN mkdir /app\n\nWORKDIR /app\n\nCOPY requirements_frontend.txt /app/requirements_frontend.txt\nCOPY frontend.py /app/frontend.py\n\nRUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements_frontend.txt\n\nEXPOSE $PORT\n\nCMD exec streamlit run frontend.py --server.port $PORT --server.address=0.0.0.0\n
                            7. Build the frontend image

                              docker build -t frontend:latest -f frontend.dockerfile .\n
                            8. Run the frontend image

                              docker run --rm -p 8001:8001 -e \"PORT=8001\" frontend\n

                              and check in your web browser (at http://localhost:8001) that the frontend works as expected.

                            9. Deploy the frontend to Cloud Run using the gcloud command

                              Solution

                              Assuming that we have created an Artifact Registry repository called frontend-backend we can deploy the frontend to Cloud Run using the following commands:

                              docker tag frontend:latest \\\n    <region>-docker.pkg.dev/<project>/frontend-backend/frontend:latest\ndocker push <region>-docker.pkg.dev/<project>/frontend-backend/frontend:latest\ngcloud run deploy frontend \\\n    --image=<region>-docker.pkg.dev/<project>/frontend-backend/frontend:latest \\\n    --region=<region> \\\n    --platform=managed\n
                            10. Test that the frontend works as expected by opening the URL of the deployed frontend in your web browser.

                          3. (Optional) If you have gotten this far, you have successfully created a frontend and a backend and deployed them to the cloud. Finally, it may be worth load testing your application to see how it performs under load. Write a locust file, which is covered in this module, and run it against your frontend (see the sketch below for a starting point). Make sure that it can handle the load you expect it to handle.
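
                            To get you started, a minimal locustfile for the frontend could look something like the sketch below. It simply requests the root page of the streamlit application; the host and the root path are assumptions that you should adapt to your own deployment, for example by running locust -f locustfile.py --host <frontend-url>.

                            from locust import HttpUser, between, task\n\n\nclass FrontendUser(HttpUser):\n    # Simulated user that repeatedly loads the streamlit frontend page\n    wait_time = between(1, 2)\n\n    @task\n    def load_frontend(self):\n        # request the root page; adapt the path if your frontend is served elsewhere\n        self.client.get('/')\n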

                          4. (Optional) Feel free to experiment further with streamlit and see what you can create. For example, you can try to create an option for the user to upload a video and then display the video with the predicted class overlaid on top of it.

                          "},{"location":"s7_deployment/frontend/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                          1. We have created separate requirements files for the frontend and the backend. Why is this a good idea?

                            Solution

                            This is a good idea because the frontend and the backend may have different dependencies. By having separate requirements files we can make sure that we only install the dependencies that are needed for the specific application. This also has the positive side effect that we can keep the docker images smaller. For example, the frontend does not need the torch library which is huge and only needed for the backend.

                          This ends the exercises for this module.

                          "},{"location":"s7_deployment/ml_deployment/","title":"M25 - ML deployment","text":""},{"location":"s7_deployment/ml_deployment/#deployment-of-machine-learning-models","title":"Deployment of Machine Learning Models","text":"

                          In one of the previous modules you learned how to use FastAPI to create an API to interact with your machine learning models. FastAPI is a great framework, but it is a general-purpose framework, meaning that it was not developed with machine learning applications in mind. This means that there are features you may find missing when running large-scale machine learning models:

                          • Dynamic batching: if you have a large number of requests coming in, you may want to process them together in batches to amortize the per-request overhead of running inference. This is especially true if you are running your model on a GPU, which is most efficient when it processes many samples at once. A conceptual sketch of this idea is shown right after this list.

                          • Async inference: FastAPI does support async requests, but there is no built-in way to call the model asynchronously. This means that if you have a large number of requests coming in, each request has to wait for the model to finish processing (because the model call is synchronous) before the next one can be served.

                          • Native GPU support: you can definitely run parts of your application on a GPU with FastAPI if you want to, but again it was not built with machine learning in mind, so you will have to do some extra work to get it working.
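
                          To make the idea of dynamic batching more concrete, the sketch below shows the core loop of a very simplified batcher: incoming requests are placed on a queue together with a future, and a background worker collects them into a batch until either a maximum batch size or a timeout is reached, after which the model is run once on the whole batch. The queue layout and the model call are assumptions for illustration only; the frameworks listed below implement this far more robustly.

                          import asyncio\nimport time\n\n\nasync def batch_worker(queue: asyncio.Queue, model, max_batch_size: int = 8, max_wait: float = 0.01) -> None:\n    # collect requests until the batch is full or the timeout is reached, then run the model once\n    while True:\n        item, future = await queue.get()\n        batch, futures = [item], [future]\n        deadline = time.monotonic() + max_wait\n        while len(batch) < max_batch_size:\n            timeout = deadline - time.monotonic()\n            if timeout <= 0:\n                break\n            try:\n                item, future = await asyncio.wait_for(queue.get(), timeout)\n            except asyncio.TimeoutError:\n                break\n            batch.append(item)\n            futures.append(future)\n        outputs = model(batch)  # assumed to return one output per input in the batch\n        for fut, out in zip(futures, outputs):\n            fut.set_result(out)\n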

                          It should come as no surprise that multiple frameworks have therefore sprung up that better support deployment of machine learning models (just listing a few here):

                          | \ud83c\udf1f Framework | \ud83e\udde9 Backend Agnostic | \ud83e\udde0 Model Agnostic | \ud83d\udcc2 Repository | \u2b50 Github Stars |
                          | --- | --- | --- | --- | --- |
                          | Cortex | \u2705 | \u2705 | \ud83d\udd17 Link | 8.0k |
                          | BentoML | \u2705 | \u2705 | \ud83d\udd17 Link | 7.2k |
                          | Ray Serve | \u2705 | \u2705 | \ud83d\udd17 Link | 34.1k |
                          | Triton Inference Server | \u2705 | \u2705 | \ud83d\udd17 Link | 8.4k |
                          | OpenVINO | \u2705 | \u2705 | \ud83d\udd17 Link | 7.3k |
                          | Seldon-core | \u2705 | \u2705 | \ud83d\udd17 Link | 4.4k |
                          | Litserve | \u2705 | \u2705 | \ud83d\udd17 Link | 2.5k |
                          | Torchserve | \u274c | \u2705 | \ud83d\udd17 Link | 4.2k |
                          | TensorFlow serve | \u274c | \u2705 | \ud83d\udd17 Link | 6.2k |
                          | vLLM | \u274c | \u274c | \ud83d\udd17 Link | 30.6k |

                          The first 7 frameworks are backend agnostic, meaning that they are intended to work with whatever computational backend your model is implemented in (TensorFlow, PyTorch, Jax, Sklearn etc.), whereas the last 3 are backend specific (PyTorch, TensorFlow and a custom framework). The first 9 frameworks are model agnostic, meaning that they are intended to work with whatever model you have implemented, whereas the last one is model specific, in this case to LLMs. When choosing a framework to deploy your model, you should consider the following:

                          • Ease of use. Some frameworks are easier to use and get started with than others, but may have fewer features. As an example from the list above, Litserve is very easy to get started with but is a relatively new framework and may not have all the features you need.

                          • Performance. Some frameworks are optimized for performance, but may be harder to use. As an example from the list above, vLLM is a very high performance framework for serving large language models but it cannot be used for other types of models.

                          • Community. Some frameworks have a large community, which can be helpful if you run into problems. As an example from the list above, Triton Inference Server is developed by Nvidia and has a large community of users. As a good rule of thumb, the more stars a repository has on Github, the larger the community.

                          In this module we are going to be looking at the BentoML framework because it strikes a good balance between ease of use and having a lot of features that can improve the performance of serving your models. However, before we dive into this serving framework, we are going to look at a general way to package our machine learning models that should work with most of the above frameworks.

                          "},{"location":"s7_deployment/ml_deployment/#model-packaging","title":"Model Packaging","text":"

                          Whenever we want to serve a machine learning model, we in general need 3 things:

                          • The computational graph of the model, e.g. how to pass data through the model to get a prediction.
                          • The weights of the model, e.g. the parameters that the model has learned during training.
                          • A computational backend that can run the model

                          In the previous module on Docker we learned how to package all of these things into a container. This is a great way to package a model, but it is not the only way. The core assumption we have made so far is that the computational backend used for serving is the same as the one we trained the model with. However, this does not need to be the case. As long as we can export our model and weights to a common format, we can run the model on any backend that supports this format.

                          This is exactly what the Open Neural Network Exchange (ONNX) is designed to do. ONNX is a standardized format for creating and sharing machine learning models. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types. The idea behind ONNX is that a model trained with a specific framework on a specific device, let's say PyTorch on your local computer, can be exported and run with an entirely different framework and hardware easily. Learning how to export your models to ONNX is therefore a great way to increase the longevity of your models and not being locked into a specific framework for serving your models.

                          The ONNX format is designed to bridge the gap between development and deployment of machine learning models, by making it easy to move models between different frameworks and hardware. For example, PyTorch is in general considered a developer-friendly framework, however it has historically been slower at running inference than dedicated serving frameworks. Image credit"},{"location":"s7_deployment/ml_deployment/#exercises","title":"\u2754 Exercises","text":"
                          1. Start by installing ONNX, ONNX runtime and ONNX script. This can be done by running the following command

                            pip install onnx onnxruntime onnxscript\n

                            the first package contains the core ONNX framework, the second contains the runtime for running ONNX models and the third is a new experimental package that is designed to make it easier to export models to ONNX.

                          2. Let's start out by converting a model to ONNX. The following code snippets show how to export a PyTorch model to ONNX.

                            PyTorch >= 2.0 / PyTorch < 2.0 or Windows / PyTorch-lightning
                            import torch\nimport torchvision\n\nmodel = torchvision.models.resnet18(weights=None)\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nonnx_model = torch.onnx.dynamo_export(\n    model=model,\n    model_args=(dummy_input,),\n    export_options=torch.onnx.ExportOptions(dynamic_shapes=True),\n)\nonnx_model.save(\"resnet18.onnx\")\n
                            import torch\nimport torchvision\n\nmodel = torchvision.models.resnet18(weights=None)\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\ntorch.onnx.export(\n    model=model,\n    args=(dummy_input,),\n    f=\"resnet18.onnx\",\n    input_names=[\"input\"],\n    output_names=[\"output\"],\n    dynamic_axes={\"input\": {0: \"batch_size\"}, \"output\": {0: \"batch_size\"}}\n)\n
                            import torch\nimport torchvision\nimport pytorch_lightning as pl\nimport onnx\nimport onnxruntime\n\nclass LitModel(pl.LightningModule):\n    def __init__(self):\n        super().__init__()\n        self.model = torchvision.models.resnet18(pretrained=True)\n        self.model.eval()\n\n    def forward(self, x):\n        return self.model(x)\n\nmodel = LitModel()\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nmodel.to_onnx(\n    file_path=\"resnet18.onnx\",\n    input_sample=dummy_input,\n    input_names=[\"input\"],\n    output_names=[\"output\"],\n    dynamic_axes={\"input\": {0: \"batch_size\"}, \"output\": {0: \"batch_size\"}}\n)\n

                            Export a model of your own choice to ONNX or just try to export the resnet18 model as shown in the examples above, and confirm that the model was exported by checking that the file exists. Can you figure out what is meant by dynamic_axes?

                            Solution

                            The dynamic_axes argument is used to specify which axes of the input tensor should be considered dynamic. This is useful when the model can accept inputs of different sizes, e.g. when the model is used in a dynamic batching scenario. In the example above we have specified that the first axis of the input tensor should be considered dynamic, meaning that the model can accept inputs of different batch sizes. While it may be tempting to specify all axes as dynamic, this can lead to slower inference times, because the ONNX runtime will not be able to optimize the computational graph as well.
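
                            A quick way to convince yourself of what the dynamic batch axis does is to run the exported model with a few different batch sizes. The snippet below assumes the model was exported to resnet18.onnx with a dynamic first axis as in the examples above; the input name is looked up from the session since it depends on which exporter you used.

                            import numpy as np\nimport onnxruntime as rt\n\nort_session = rt.InferenceSession('resnet18.onnx')\ninput_name = ort_session.get_inputs()[0].name\nfor batch_size in (1, 4, 16):\n    batch = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)\n    (output,) = ort_session.run(None, {input_name: batch})\n    print(batch_size, output.shape)  # the output batch dimension follows the input\n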

                          3. Check that the model was correctly exported by loading it using the onnx package and afterwards check the graph of the model using the following code:

                            import onnx\nmodel = onnx.load(\"resnet18.onnx\")\nonnx.checker.check_model(model)\nprint(onnx.helper.printable_graph(model.graph))\n
                          4. To get a better understanding of what is actually exported, let's try to visualize the computational graph of the model. This can be done using the open-source tool netron. You can either try it out directly in the web browser or you can install it locally using pip install netron and then run it using netron resnet18.onnx. Can you figure out which method of the model is exported to ONNX?

                            Solution

                            When a PyTorch model is exported to ONNX, it is only the forward method of the model that is exported. This means that it is the only method we have access to when we load the model later. Therefore, make sure that the forward method of your model is implemented in a way that it can be used for inference.
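
                            As a small illustration (the model below is hypothetical and only meant to show the point), any helper methods besides forward are simply ignored by the exporter:

                            import torch\n\n\nclass MyModel(torch.nn.Module):\n    def __init__(self) -> None:\n        super().__init__()\n        self.linear = torch.nn.Linear(4, 2)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        return self.linear(x)\n\n    def predict_proba(self, x: torch.Tensor) -> torch.Tensor:\n        # this helper method is never part of the exported ONNX graph\n        return self.forward(x).softmax(dim=-1)\n\n\ntorch.onnx.export(MyModel(), torch.randn(1, 4), 'my_model.onnx')\n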

                          5. After converting a model to ONNX format we can use the ONNX Runtime to run it. The benefit of this is that ONNX Runtime is able to optimize the computational graph of the model, which can lead to faster inference times. Let's try to look into that.

                            1. Figure out how to run a model using the ONNX Runtime. Relevant documentation.

                              Solution

                              To use the ONNX runtime to run a model, we first need to start an inference session, then extract the input and output names of our model and finally run the model. The following code snippet shows how to do this.

                              import numpy as np\nimport onnxruntime as rt\n\nort_session = rt.InferenceSession(\"<path-to-model>\")\ninput_names = [i.name for i in ort_session.get_inputs()]\noutput_names = [i.name for i in ort_session.get_outputs()]\nbatch = {input_names[0]: np.random.randn(1, 3, 224, 224).astype(np.float32)}\nout = ort_session.run(output_names, batch)\n
                            2. Let's experiment with performance of ONNX vs. PyTorch. Implement a benchmark that measures the time it takes to run a model using PyTorch and ONNX. Bonus points if you test for multiple input sizes. To get you started we have implemented a timing decorator that you can use to measure the time it takes to run a function.

                              from statistics import mean, stdev\nimport time\ndef timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):\n    \"\"\" Decorator that times the execution of a function. \"\"\"\n    def wrapper(*args, **kwargs):\n        timing_results = []\n        for _ in range(timing_repeat):\n            start_time = time.time()\n            for _ in range(function_repeat):\n                result = func(*args, **kwargs)\n            end_time = time.time()\n            elapsed_time = end_time - start_time\n            timing_results.append(elapsed_time)\n        print(f\"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds\")\n        return result\n    return wrapper\n
                              Solution onnx_benchmark.py
                              import sys\nimport time\nfrom statistics import mean, stdev\n\nimport onnxruntime as ort\nimport torch\nimport torchvision\n\n\ndef timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):\n    \"\"\"Decorator that times the execution of a function.\"\"\"\n\n    def wrapper(*args, **kwargs):\n        timing_results = []\n        for _ in range(timing_repeat):\n            start_time = time.time()\n            for _ in range(function_repeat):\n                result = func(*args, **kwargs)\n            end_time = time.time()\n            elapsed_time = end_time - start_time\n            timing_results.append(elapsed_time)\n        print(f\"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds\")\n        return result\n\n    return wrapper\n\n\nmodel = torchvision.models.resnet18()\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nif sys.platform == \"win32\":\n    # Windows doesn't support the new TorchDynamo-based ONNX Exporter\n    torch.onnx.export(\n        model,\n        dummy_input,\n        \"resnet18.onnx\",\n        input_names=[\"input.1\"],\n        dynamic_axes={\"input.1\": {0: \"batch_size\", 2: \"height\", 3: \"width\"}},\n    )\nelse:\n    torch.onnx.dynamo_export(model, dummy_input).save(\"resnet18.onnx\")\n\nort_session = ort.InferenceSession(\"resnet18.onnx\")\n\n\n@timing_decorator\ndef torch_predict(image) -> None:\n    \"\"\"Predict using PyTorch model.\"\"\"\n    model(image)\n\n\n@timing_decorator\ndef onnx_predict(image) -> None:\n    \"\"\"Predict using ONNX model.\"\"\"\n    ort_session.run(None, {\"input.1\": image.numpy()})\n\n\nif __name__ == \"__main__\":\n    for size in [224, 448, 896]:\n        dummy_input = torch.randn(1, 3, size, size)\n        print(f\"Image size: {size}\")\n        torch_predict(dummy_input)\n        onnx_predict(dummy_input)\n
                            3. To get a better understanding of why running the model using the ONNX runtime is usually faster, let's try to see what happens to the computational graph. By default the ONNX Runtime will apply these optimizations in online mode, meaning that the optimizations are applied when the model is loaded. However, it is also possible to apply the optimizations in offline mode, such that the optimized model is saved to disk. Below is an example of how to do this.

                              import onnxruntime as rt\nsess_options = rt.SessionOptions()\n\n# Set graph optimization level\nsess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED\n\n# To enable model serialization after graph optimization set this\nsess_options.optimized_model_filepath = \"optimized_model.onnx\"\n\nsession = rt.InferenceSession(\"<model_path>\", sess_options)\n

                              Try to apply the optimizations in offline mode and use netron to visualize both the original and optimized model side by side. Can you see any differences?

                              Solution

                              You should hopefully see that the optimized model consists of fewer nodes and edges than the original model. These nodes are often called fused nodes, because they are the result of multiple nodes being fused together. In the image below we have visualized the first part of the computational graph of a resnet18 model, before and after optimization.
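
                              If you do not want to eyeball the graphs in netron, you can also simply count the number of nodes in the two models using the onnx package. The file names below are assumptions based on the snippet above:

                              import onnx\n\noriginal = onnx.load('resnet18.onnx')\noptimized = onnx.load('optimized_model.onnx')\nprint('original:', len(original.graph.node), 'nodes')\nprint('optimized:', len(optimized.graph.node), 'nodes')\n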

                          6. As mentioned in the introduction, ONNX is able to run on many different types of hardware and execution engines. You can check all supported providers, as well as the providers available on your machine, by running the following code

                            import onnxruntime\nprint(onnxruntime.get_all_providers())\nprint(onnxruntime.get_available_providers())\n

                            Can you figure out how to set which provider the ONNX runtime should use?

                            Solution

                            The provider that the ONNX runtime should use can be set by passing the providers argument to the InferenceSession class. A list should be provided, which prioritizes the providers in the order they are listed.

                            import onnxruntime as rt\nprovider_list = ['CUDAExecutionProvider', 'CPUExecutionProvider']\nort_session = rt.InferenceSession(\"<path-to-model>\", providers=provider_list)\n

                            In this case we will prefer CUDA Execution Provider over CPU Execution Provider if both are available.

                          7. As you have probably realised in the exercises on docker, it can take a long time to build the kind of containers we are working with and they can be quite large. The reason for this is that PyTorch is a very large framework with a lot of dependencies. ONNX, on the other hand, is a much smaller framework. This makes sense, because PyTorch is a framework primarily designed for developing and training models, while ONNX is designed for serving models. Let's try to quantify this.

                            1. Construct a dockerfile that builds a docker image with PyTorch as a dependency. The dockerfile does not actually need to run anything. Repeat the same process for the ONNX runtime. Bonus points for developing a docker image that takes a build arg at build time that specifies if the image should be built with CUDA support or not.

                              Solution

                              The dockerfile for the PyTorch image could look something like this

                              inference_pytorch.dockerfile
                              FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nARG CUDA\nENV CUDA=${CUDA}\nRUN echo \"CUDA is set to: ${CUDA}\"\n\nRUN echo \"CUDA is set to: ${CUDA}\" && \\\n    if [ -n \"$CUDA\" ]; then \\\n        pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121; \\\n    else \\\n        pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu; \\\n    fi\n

                              and the dockerfile for the ONNX image could look something like this

                              inference_onnx.dockerfile
                              FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nARG CUDA\nENV CUDA=${CUDA}\n\nRUN echo \"CUDA is set to: ${CUDA}\" && \\\n    if [ -n \"$CUDA\" ]; then \\\n        pip install onnxruntime-gpu; \\\n    else \\\n        pip install onnxruntime; \\\n    fi\n
                            2. Build both containers and measure the time it takes to build them. How much faster is it to build the ONNX container compared to the PyTorch container?

                              Solution

                              On unix/linux you can use the time command to measure the time it takes to build the containers. Building both images, with and without CUDA support, can be done with the following commands

                              time docker build . -t pytorch_inference_cuda:latest -f inference_pytorch.dockerfile \\\n    --no-cache --build-arg CUDA=true\ntime docker build . -t pytorch_inference:latest -f inference_pytorch.dockerfile \\\n    --no-cache --build-arg CUDA=\ntime docker build . -t onnx_inference_cuda:latest -f inference_onnx.dockerfile \\\n    --no-cache --build-arg CUDA=true\ntime docker build . -t onnx_inference:latest -f inference_onnx.dockerfile \\\n    --no-cache --build-arg CUDA=\n

                              the --no-cache flag is used to ensure that the build process is not cached and ensure a fair comparison. On my laptop this respectively took 5m1s, 1m4s, 0m4s, 0m50s meaning that the ONNX container was respectively 7x (with CUDA) and 1.28x (no CUDA) faster to build than the PyTorch container.

                            3. Find out the size of the two docker images. It can be done in the terminal by running the docker images command. How much smaller is the ONNX model compared to the PyTorch model?

                              Solution

                              As of writing the docker image containing the PyTorch framework was 5.54GB (with CUDA) and 1.25GB (no CUDA). In comparison the ONNX image was 647MB (with CUDA) and 647MB (no CUDA). This means that the ONNX image is respectively 8.5x (with CUDA) and 1.94x (no CUDA) smaller than the PyTorch image.

                          8. (Optional) Assuming you have completed the module on FastAPI try creating a small FastAPI application that serves a model using the ONNX runtime.

                            Solution

                            Here is a simple example of how to create a FastAPI application that serves a model using the ONNX runtime.

                            onnx_fastapi.py
                            import numpy as np\nimport onnxruntime\nfrom fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/predict\")\ndef predict():\n    \"\"\"Predict using ONNX model.\"\"\"\n    # Load the ONNX model\n    model = onnxruntime.InferenceSession(\"model.onnx\")\n\n    # Prepare the input data\n    input_data = {\"input\": np.random.rand(1, 3).astype(np.float32)}\n\n    # Run the model\n    output = model.run(None, input_data)\n\n    return {\"output\": output[0].tolist()}\n

                          This completes the exercises on the ONNX format. Do note that one limitation of the ONNX format is that it is based on Protobuf, which is a binary format. A protobuf file can have a maximum size of 2GB, which means that the .onnx format by itself is not enough for very large models. However, through the use of external data it is possible to circumvent this limitation.
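
                          As a small, hedged sketch of how external data works, the onnx package can save the weights in a side-car file next to the .onnx file so that the graph itself stays below the protobuf limit. The file names and threshold below are only illustrative:

                          import onnx\n\nmodel = onnx.load('large_model.onnx')\nonnx.save_model(\n    model,\n    'large_model_external.onnx',\n    save_as_external_data=True,\n    all_tensors_to_one_file=True,\n    location='large_model_external.data',  # the weights end up in this side-car file\n    size_threshold=1024,  # only tensors larger than 1KB are moved to external data\n)\n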

                          "},{"location":"s7_deployment/ml_deployment/#bentoml","title":"BentoML","text":"

                          BentoML cloud vs BentoML OSS

                          We are only going to be looking at the open-source version of BentoML in this module. However, BentoML also has a cloud version that makes it very easy to deploy models that are coded in BentoML to the cloud. If you are interested in this, you can check out the BentoML cloud documentation. This business strategy of having an open-source product and a cloud product is very common in the machine learning space (HuggingFace, LightningAI, Weights and Biases etc.), because it allows companies to make money from the cloud product while still providing a free product to the community.

                          BentoML is a framework that is designed to make it easy to serve machine learning models. It is designed to be backend agnostic, meaning that it can be used with any computational backend. It is also model agnostic, meaning that it can be used with any machine learning model.

                          Let's consider a simple example of how to serve a model using BentoML. The following code snippet shows how to serve a model that uses the transformers library to summarize text.

                          import bentoml\nfrom transformers import pipeline\n\nEXAMPLE_INPUT = (\n    \"Breaking News: In an astonishing turn of events, the small town of Willow Creek has been taken by storm as \"\n    \"local resident Jerry Thompson's cat, Whiskers, performed what witnesses are calling a 'miraculous and gravity-\"\n    \"defying leap.' Eyewitnesses report that Whiskers, an otherwise unremarkable tabby cat, jumped a record-breaking \"\n    \"20 feet into the air to catch a fly. The event, which took place in Thompson's backyard, is now being investigated \"\n    \"by scientists for potential breaches in the laws of physics. Local authorities are considering a town festival to \"\n    \"celebrate what is being hailed as 'The Leap of the Century.'\"\n)\n\n@bentoml.service(resources={\"cpu\": \"2\"}, traffic={\"timeout\": 10})\nclass Summarization:\n    def __init__(self) -> None:\n        self.pipeline = pipeline('summarization')\n\n    @bentoml.api\n    def summarize(self, text: str = EXAMPLE_INPUT) -> str:\n        result = self.pipeline(text)\n        return result[0]['summary_text']\n

                          In BentoML we organize our services in classes, where each class is a service that we want to serve. The two important parts of the code snippet are the @bentoml.service and @bentoml.api decorators.

                          • The @bentoml.service decorator is used to specify the resources that the service should use and in general how the service should be run. In this case we are specifying that the service should use 2 CPU cores and that the timeout for the service should be 10 seconds.

                          • The @bentoml.api decorator is used to specify the API that the service should expose. In this case we are specifying that the service should have an API called summarize that takes a string as input and returns a string as output.

                          To serve the model using BentoML we can execute the following command, which is very similar to the command we used to serve the model using FastAPI.

                          bentoml serve service:Summarization\n
                          "},{"location":"s7_deployment/ml_deployment/#exercises_1","title":"\u2754 Exercises","text":"

                          In general, we advise looking through the BentoML docs if you need help with any of the exercises. We are going to assume that you have done the exercises on ONNX and we are therefore going to be using BentoML to serve ONNX models. If you have not done that part, you can still follow along but you will need to use a PyTorch model instead of an ONNX model.

                          1. Install BentoML

                            pip install bentoml\n

                            Remember to add the dependency to your requirements.txt file.

                          2. You are in principle free to serve any model you like, but we recommend just using a torchvision model as in the ONNX exercises. Write your first service in BentoML that serves a model of your choice. We recommend experimenting with providing input/output as tensors because bentoml supports this natively. Secondly, write a client that can send a request to the service and print the result. Here we recommend using the built-in bentoml.SyncHTTPClient.

                            Solution

                            The following implements a simple BentoML service that serves an ONNX resnet18 model. The service expects both the input and output to be numpy arrays.

                            bentoml_service.py
                            from __future__ import annotations\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierService:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model.onnx\")\n\n    @bentoml.api\n    def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n

                            The service can be served using the following command

                            bentoml serve bentoml_service:ImageClassifierService\n

                            To test that the service works the following client can be used

                            bentoml_client.py
                            import bentoml\nimport numpy as np\nfrom PIL import Image\n\nif __name__ == \"__main__\":\n    image = Image.open(\"my_cat.jpg\")\n    image = image.resize((224, 224))  # Resize to match the minimum input size of the model\n    image = np.array(image)\n    image = np.transpose(image, (2, 0, 1))  # Change to CHW format\n    image = np.expand_dims(image, axis=0)  # Add batch dimension\n\n    with bentoml.SyncHTTPClient(\"http://localhost:4040\") as client:\n        resp = client.predict(image=image)\n        print(resp)\n
                          3. We are now going to look at features where BentoML really sets itself apart from FastAPI. The first is adaptive batching. As you are hopefully aware, modern machine learning models can process multiple samples at the same time, and doing so increases the throughput of the model. When we train a model we often set a fixed batch size, however we cannot do that when serving the model because that would mean that we would have to wait for the batch to be full before we can process it. Adaptive batching simply refers to the process where we specify a maximum batch size and also a timeout. When either the batch is full or the timeout is reached, however many samples we have collected are sent to the model for processing. This can be a very powerful feature because it allows us to process samples as soon as they arrive, while still taking advantage of the increased throughput of batching.

                            The overall architecture of the adaptive batching feature in BentoML. The feature is implemented on the server side and mainly consists of a dispatcher that is in charge of collecting requests and sending them to the model server when either the batch is full or a timeout is reached. Image credit

                            1. Look through the documentation on adaptive batching and add adaptive batching to your service from the previous exercise. Make sure your service works as expected by testing it with the client from the previous exercise.

                              Solution bentoml_service_adaptive_batching.py
                              from __future__ import annotations\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierService:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model.onnx\")\n\n    @bentoml.api(\n        batchable=True,\n        batch_dim=(0, 0),\n        max_batch_size=128,\n        max_latency_ms=1000,\n    )\n    def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n
                            2. Try to measure the throughput of your model with and without adaptive batching. Assuming that you have completed the module on testing APIs and therefore are familiar with the locust framework, we recommend that you write a simple locustfile and use the locust command to measure the throughput of your model.

                              Solution

                              The following locust file can be used to measure the throughput of the model with and without adaptive batching.

                              locustfile.py
                              import numpy as np\nfrom locust import HttpUser, between, task\nfrom PIL import Image\n\n\ndef prepare_image():\n    \"\"\"Load and preprocess the image as required.\"\"\"\n    image = Image.open(\"my_cat.jpg\")\n    image = image.resize((224, 224))\n    image = np.array(image)\n    image = np.transpose(image, (2, 0, 1))  # Convert to CHW format\n    image = np.expand_dims(image, axis=0)  # Add batch dimension\n    # Convert to list format for JSON serialization\n    return image.tolist()\n\n\nimage = prepare_image()\n\n\nclass BentoMLUser(HttpUser):\n    \"\"\"Locust user class for sending prediction requests to the server.\"\"\"\n\n    wait_time = between(1, 2)\n\n    @task\n    def send_prediction_request(self):\n        \"\"\"Send a prediction request to the server.\"\"\"\n        payload = {\"image\": image}  # Package the image as JSON\n        self.client.post(\"/predict\", json=payload, headers={\"Content-Type\": \"application/json\"})\n

                              and then the following command can be used to measure the throughput of the model

                              locust -f locustfile.py --host http://localhost:4040 --headless -u 50 -t 60s\n

                              You should hopefully see that the throughput of the model is higher when adaptive batching is enabled, but the speedup is largely dependent on the model you are running, the configuration of the adaptive batching and the hardware you are running on.

                              On my laptop I saw about a 1.5 - 2x speedup when adaptive batching was enabled.

                          4. (Optional, requires GPU) Look through the documentation for inference on GPU and add this to your service. Check that your service works as expected by testing it with the client from the previous exercise and make sure you are seeing a speedup when running on the GPU.

                            Solution

                            A simple change to the bentoml.service decorator is all that is needed to run the model on the GPU.

                            @bentoml.service(resources={\"gpu\": 1})\nclass MyService:\n    def __init__(self) -> None:\n        self.model = torch.load('model.pth').to('cuda:0')\n

                          5. Another way to speed up inference is to use multiple workers. This duplicates the server over multiple processes, taking advantage of modern multi-core CPUs. It is similar to running the uvicorn command with the --workers flag for FastAPI applications. Implement multiple workers in your service and test that it works as expected with the client from the previous exercise. Also test that you are seeing a speedup when running with multiple workers.

                            Solution

                            Multiple workers can be added through the bentoml.service decorator as shown below.

                            @bentoml.service(workers=4)\nclass MyService:\n    # Service implementation\n

                            Alternatively, you can set workers=\"cpu_count\" to use all available CPU cores. The speedup depends on the model you are serving, the hardware you are running on and the number of workers you are using, but it should be higher than using a single worker.

                          6. In addition to increasing the throughput of your deployments, BentoML can also help with ML applications that require some kind of composition of multiple models. It is very normal in production setups to have multiple models that either

                            • Run in a sequence, e.g. the output of one model is the input of another model. You may have a preprocessing service that preprocesses the data before it is sent to a model that makes a prediction.
                            • Run concurrently, e.g. you have multiple models that are run at the same time and the output of all the models is combined to make a prediction. Ensemble models are a good example of this.

                            BentoML makes it easy to compose multiple models together.

                            1. Implement two services that run in a sequence, e.g. the output of one service is used as the input of another service. As an example you can implement either some pre- or post-processing service that is used in conjunction with the model you have implemented in the previous exercises.

                              Solution

                              The following code snippet shows how to implement two services that run in a sequence.

                              bentoml_service_composition.py
                              from __future__ import annotations\n\nfrom pathlib import Path\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\nfrom PIL import Image\n\n\n@bentoml.service\nclass ImagePreprocessorService:\n    \"\"\"Image preprocessor service.\"\"\"\n\n    @bentoml.api\n    def preprocess(self, image_file: Path) -> np.ndarray:\n        \"\"\"Preprocess the input image.\"\"\"\n        image = Image.open(image_file)\n        image = image.resize((224, 224))\n        image = np.array(image)\n        image = np.transpose(image, (2, 0, 1))\n        return np.expand_dims(image, axis=0)\n\n\n@bentoml.service\nclass ImageClassifierService:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    preprocessing_service = bentoml.depends(ImagePreprocessorService)\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model.onnx\")\n\n    @bentoml.api\n    async def predict(self, image_file: Path) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        image = await self.preprocessing_service.to_async.preprocess(image_file)\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n
                            2. Implement three services, where two of them run concurrently and the output of both services is combined in the third service to make a prediction. As an example you can expand your previous service to serve two different models and then implement a third service that combines the output of both models to make a prediction.

                              Solution

                              The following code snippet shows how to implement a service that consists of two concurrent services. The example assumes that two models called model_a.onnx and model_b.onnx are available.

                              bentoml_service_composition.py
                              from __future__ import annotations\n\nimport asyncio\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierServiceModelA:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model_a.onnx\")\n\n    @bentoml.api\n    def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n\n\n@bentoml.service\nclass ImageClassifierServiceModelB:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    def __init__(self) -> None:\n        self.model = InferenceSession(\"model_b.onnx\")\n\n    @bentoml.api\n    def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        output = self.model.run(None, {\"input\": image.astype(np.float32)})\n        return output[0]\n\n\n@bentoml.service\nclass ImageClassifierService:\n    \"\"\"Image classifier service using ONNX model.\"\"\"\n\n    model_a = bentoml.depends(ImageClassifierServiceModelA)\n    model_b = bentoml.depends(ImageClassifierServiceModelB)\n\n    @bentoml.api\n    async def predict(self, image: np.ndarray) -> np.ndarray:\n        \"\"\"Predict the class of the input image.\"\"\"\n        result_a, result_b = await asyncio.gather(\n            self.model_a.to_async.predict(image), self.model_b.to_async.predict(image)\n        )\n        return (result_a + result_b) / 2\n
                            3. (Optional) Implement a service that consists of both sequential and concurrent services.

                          7. Similar to deploying a FastAPI application to the cloud, deploying a BentoML application to the cloud often requires you to first containerize it. Because BentoML is designed to be easy to use even for users not that familiar with Docker, it introduces the concept of a bentofile. A bentofile is a file that specifies how the container should be built. Below is an example of what a bentofile could look like.

                            service: 'service:Summarization'\nlabels:\n  owner: bentoml-team\n  project: gallery\ninclude:\n  - '*.py'\npython:\n  packages:\n    - torch\n    - transformers\n

                            which can then be used to build a bento using the following command

                            bentoml build\n

                            A bento is not a docker image, but it can be used to build a docker image with the following command

                            bentoml containerize summarization:latest\n
                            1. Can you figure out how the different parts of the bentofile are used to build the docker image? Additionally, can you figure out from the source repository how the bentofile is used to build the docker image?

                              Solution

                              The service part specifies both what the container should be called and also what service it should serve e.g. the last statement in the corresponding dockerfile is CMD [\"bentoml\", \"serve\", \"service:Summarization\"]. The labels part is used to specify labels about the container, see this link for more info. The include part corresponds to COPY statements in the dockerfile and finally the python part is used to specify what python packages should be installed in the container which corresponds to RUN pip install ... in the dockerfile.

                              Regarding how the bentofile is used to build the docker image, the bentoml package contains a number of templates (written using the jinja2 templating language) that are used to generate the dockerfiles. The templates can be found here.

                            2. Take whatever service from the previous exercises and try to containerize it. You are free to either write a bentofile or a dockerfile to do this.

                              Solution

                              The following bentofile can be used to containerize the very first service we implemented in this set of exercises.

                              service: 'bentoml_service:ImageClassifierService'\nlabels:\n  owner: bentoml-team\n  project: gallery\ninclude:\n- 'bentoml_service.py'\n- 'model.onnx'\npython:\n  packages:\n    - onnxruntime\n    - numpy\n

                              The corresponding dockerfile would look something like this

                              FROM python:3.11-slim\nWORKDIR /bento\nCOPY bentoml_service.py .\nCOPY model.onnx .\nRUN pip install onnxruntime numpy bentoml\nCMD [\"bentoml\", \"serve\", \"bentoml_service:ImageClassifierService\"]\n
                            3. Deploy the container to Cloud Run and test that it works.

                              Solution

                              The following command can be used to deploy the container to Cloud Run. We assume that you have already built the container and called it bentoml_service:latest.

                              docker tag bentoml_service:latest \\\n    <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest\ngcloud run deploy bentoml-service \\\n    --image=<region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest \\\n    --platform managed \\\n    --port 3000  # default used by BentoML\n

                              where <region>, <project-id> and <repository-name> should be replaced with the appropriate values. The service should now be available at the URL that is printed in the terminal.

                          This completes the exercises on the BentoML framework. If you want to deep dive more into it, we recommend looking into the tasks feature for use cases that have a very long running time and the built-in model management feature to unify the way models are loaded, managed and served.

                          "},{"location":"s7_deployment/ml_deployment/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                          1. How would you export a scikit-learn model to ONNX? What method is exported when you export a scikit-learn model to ONNX?

                            Solution

                            It is possible to export a scikit-learn model to ONNX using the sklearn-onnx package. The following code snippet shows how to export a scikit-learn model to ONNX.

                            import numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\nfrom skl2onnx import to_onnx\n\n# fit a small model on dummy data so that there is something to export\nX = np.random.randn(100, 4).astype(np.float32)\ny = np.random.randint(0, 2, size=100)\nmodel = RandomForestClassifier(n_estimators=2).fit(X, y)\n\nonx = to_onnx(model, X[:1])  # the sample input defines the input signature of the ONNX graph\nwith open(\"model.onnx\", \"wb\") as f:\n    f.write(onx.SerializeToString())\n

                            The method that is exported when you export a scikit-learn model to ONNX is the predict method.
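                            To verify this, a minimal sketch (assuming the model.onnx file produced above) loads the exported model with onnxruntime and runs it on a dummy input; the first output corresponds to the predict method:

                            import numpy as np\nimport onnxruntime as ort\n\nsession = ort.InferenceSession(\"model.onnx\")\ninput_name = session.get_inputs()[0].name\ndummy_input = np.random.randn(1, 4).astype(np.float32)\noutputs = session.run(None, {input_name: dummy_input})\nprint(outputs[0])  # predicted class labels, i.e. the output of predict\n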

                          2. In your own words, describe what the concept of a computational graph means.

                            Solution

                            A computational graph is a way to represent the mathematical operations that are performed in a model. It is essentially a graph where the nodes are the operations and the edges are the data that is passed between them. The computational graph normally represents the forward pass of the model and is the reason that we can easily backpropagate through the model to train it, because the graph contains all the necessary information to calculate the gradients of the model.
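                            As a small illustration (not part of the original question), PyTorch builds such a graph dynamically during the forward pass and traverses it backwards to compute gradients:

                            import torch\n\nx = torch.tensor(2.0, requires_grad=True)\ny = x * 3       # node: multiplication\nz = y ** 2 + 1  # nodes: power and addition\nz.backward()    # traverse the recorded graph backwards to compute dz/dx\nprint(x.grad)   # dz/dx = 18 * x = 36 at x = 2\n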

                          3. In your own words, explain why fusing operations together in the computational graph often leads to better performance?

                            Solution

                            Each time we want to do a computation, the data needs to be loaded from memory into the CPU/GPU. This is a slow process and the more operations we have, the more times we need to load the data. By fusing operations together, we can reduce the number of times we need to load the data, because we can do multiple operations on the same data before we need to load new data.
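                            As a hedged illustration, the sketch below defines a chain of pointwise operations and compiles it with torch.compile. Whether the operations are actually fused into a single kernel depends on the backend and hardware, but the idea is that a fusing compiler can avoid repeatedly loading and storing the intermediate results:

                            import torch\n\ndef pointwise_chain(x: torch.Tensor) -> torch.Tensor:\n    # three pointwise operations that a fusing compiler can merge into one kernel\n    return 0.5 * x * (1 + torch.tanh(x))\n\ncompiled_chain = torch.compile(pointwise_chain)  # may fuse the pointwise ops\nx = torch.randn(1000, 1000)\nassert torch.allclose(pointwise_chain(x), compiled_chain(x), atol=1e-5)\n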

                          This ends the module on tools specifically designed for serving machine learning models. As stated in the beginning of the module, there are a lot of different tools that can be used to serve machine learning models and the choice of tool often depends on the specific use case. In general, we recommend that whenever you want to serve a machine learning model, you should try out a few different frameworks and see which one fits your use case the best.

                          "},{"location":"s7_deployment/testing_apis/","title":"M24 - API Testing","text":""},{"location":"s7_deployment/testing_apis/#api-testing","title":"API testing","text":"

                          Core Module

                          API testing, similar to unit testing, is a type of software testing that involves testing the application programming interface (API) directly to ensure it meets requirements for functionality, reliability, performance, and security. The core difference from the unit testing we have been implementing until now is that instead of testing the individual functions, we are testing the entire API as a whole. API testing is therefore a form of integration testing. Additionally, another difference is that we need to simulate API calls that should be as similar as possible to the ones that will be made by the users of the API.

                          There are in general two things that we want to test when we are working with APIs:

                          • Does the API work as intended? e.g. for a given input, does it return the expected output?
                          • Can the API handle the expected load? e.g. if we send 1000 requests per second, does it crash?

                          In this module, we go over how to do each of them.

                          "},{"location":"s7_deployment/testing_apis/#testing-for-functionality","title":"Testing for functionality","text":"

                          Similar to when we wrote unit tests for our code back in this module, we can also write tests for our API that check that our code does what it is supposed to do, e.g. by using assert statements. As always we recommend implementing the tests in a separate folder called tests, and additionally that you add subfolders to separate the different types of tests. For example, for the type of machine learning projects and APIs we have been working with in this course:

                          my_project\n|-- src/\n|   |-- train.py\n|   |-- data.py\n|   |-- app.py\n|-- tests/\n|   |-- unittests/\n|   |   |-- test_train.py\n|   |   |-- test_data.py\n|   |-- integrationtests/\n|   |   |-- test_apis.py\n
                          "},{"location":"s7_deployment/testing_apis/#exercises","title":"\u2754 Exercises","text":"

                          In these exercises, we are going to assume that we want to test an API written in FastAPI (see this module). If your API is written in a different framework, the way you write the tests may have to change.

                          1. Start by installing httpx which is the client we are going to use during testing:

                            pip install httpx\n

                            Remember to add it to your requirements.txt file.

                          2. If you have already done the module on unit testing then you should already have a tests/ folder. If not, create one. Inside the tests/ folder create a new folder called integrationtests/. Inside the integrationtests/ folder create a file called test_apis.py and write the following code:

                            from fastapi.testclient import TestClient\nfrom app.main import app\nclient = TestClient(app)\n

                            this code will create a client that can be used to send requests to the API. The app variable is the FastAPI application that we want to test.

                          3. Now, you can write tests that check that the API works as intended, much like you would write unit tests. For example, if you have a root endpoint that just returns a simple welcome message you could write a test like this:

                            def test_read_root():\n    response = client.get(\"/\")\n    assert response.status_code == 200\n    assert response.json() == {\"message\": \"Welcome to the MNIST model inference API!\"}\n

                            make sure to always assert that both the status code and the response are what you expect. Add such tests for all the endpoints in your API (a small sketch of what a test for an inference endpoint could look like is shown after the lifespan note below).

                            Application with lifespans

                            If you have an application with lifespan events, e.g. you have implemented the lifespan function in your FastAPI application, you need to instead use the TestClient in a with statement. Using it as a context manager ensures that the startup and shutdown events are triggered and that the connection to the application is closed after the test is done. Here is an example:

                            def test_read_root():\n    with TestClient(app) as client:\n        response = client.get(\"/\")\n        assert response.status_code == 200\n        assert response.json() == {\"message\": \"Welcome to the MNIST model inference API!\"}\n
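                            As mentioned above, a small sketch of what a test for an inference endpoint could look like is given below. The /predict endpoint, the example image path and the response format are assumptions that should be adapted to your own API:

                            def test_predict_endpoint():\n    # hypothetical /predict endpoint that accepts an image file upload\n    with open(\"tests/integrationtests/example_image.png\", \"rb\") as f:\n        response = client.post(\"/predict\", files={\"file\": (\"example_image.png\", f, \"image/png\")})\n    assert response.status_code == 200\n    assert \"prediction\" in response.json()\n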
                          4. To run the tests, you can use the following command:

                            pytest tests/integrationtests/test_apis.py\n

                            Make sure that all your tests pass.

                          "},{"location":"s7_deployment/testing_apis/#load-testing","title":"Load testing","text":"

                          The next type of testing we are going to implement for our application is load testing, which is a kind of performance testing. The goal of load testing is to determine how an application behaves under both normal and peak conditions. The purpose is to identify the maximum operating capacity of an application as well as any bottlenecks and to determine which element is causing degradation.

                          Before we get started on the exercises, we recommend that you define an environment variable that contains the endpoint of your API, since we need the API running to be able to test it. To begin with, you can just run the API locally; in a terminal window run the following command:

                          uvicorn app.main:app --reload\n

                          by default the API will be running on http://localhost:8000 which we can then define as an environment variable:

                          Windows:
                          set MYENDPOINT=http://localhost:8000\n
                          Mac/Linux:
                          export MYENDPOINT=http://localhost:8000\n

                          However, the end goal is to test an API you have deployed in the cloud. If you have used Google Cloud Run to deploy your API then you can get the endpoint by going to the UI and looking at the service details:

                          The endpoint can be seen in the top center. It always starts with `https://` followed by a random string and then `.a.run.app`

                          Alternatively, we can use the gcloud command to get the endpoint:

                          Windows:
                          for /f \"delims=\" %i in ^\n('gcloud run services describe <name> --region=<region> --format=\"value(status.url)\"') do set MYENDPOINT=%i\n
                          Mac/Linux:
                          export MYENDPOINT=$(gcloud run services describe <name> --region=<region> --format=\"value(status.url)\")\n

                          where you need to define <name> and <region> with the name of your service and the region it is deployed in.

                          "},{"location":"s7_deployment/testing_apis/#exercises_1","title":"\u2754 Exercises","text":"

                          For the exercises, we are going to use the locust framework for load testing (the name is a reference to a swarm of locusts, i.e. bugs, invading your application). It is a Python framework that allows you to write tests that simulate many users interacting with your application. It is easy to get started with and easy to integrate into your CI/CD pipeline.

                          1. Install locust

                            pip install locust\n

                            Remember to add it to your requirements.txt file.

                          2. Make sure you have written an API that you can test. Otherwise, you can for simplicity just use this simple example:

                            Simple hello world FastAPI example

                            model.py
                            from fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/\")\ndef read_root():\n    \"\"\"Root endpoint.\"\"\"\n    return {\"Hello\": \"World\"}\n\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n    \"\"\"Get an item by id.\"\"\"\n    return {\"item_id\": item_id}\n
                          3. Add a new folder to your tests/ folder called performancetests and inside it create a file called locustfile.py. To that file, you need to add the appropriate code to simulate the users that you want to test. You can read more about how to write a locustfile.py here.

                            Solution

                            Here we provide a solution to the above simple example:

                            locustfile.py
                            import random\n\nfrom locust import HttpUser, between, task\n\n\nclass MyUser(HttpUser):\n    \"\"\"A simple Locust user class that defines the tasks to be performed by the users.\"\"\"\n\n    wait_time = between(1, 2)\n\n    @task\n    def get_root(self) -> None:\n        \"\"\"A task that simulates a user visiting the root URL of the FastAPI app.\"\"\"\n        self.client.get(\"/\")\n\n    @task(3)\n    def get_item(self) -> None:\n        \"\"\"A task that simulates a user visiting a random item URL of the FastAPI app.\"\"\"\n        item_id = random.randint(1, 10)\n        self.client.get(f\"/items/{item_id}\")\n
                          4. Then try to run the locust command:

                            locust -f tests/performancetests/locustfile.py\n

                            and then navigate to http://localhost:8089 in your web browser. You should see a page that looks similar to the top of this figure.

                            Here you can define the number of users you want to simulate and how many users you want to spawn per second. Finally, you can define which endpoint you want to test. When you are ready, press the Start button.

                            Afterward, you should see the results of the test in the web browser. Answer the following questions:

                            • What is the average response time of your API?
                            • What is the 99th percentile response time of your API?
                            • How many requests per second can your API handle?
                          5. Maybe of more use to us is running locust in the terminal. To do this you can run the following command:

                            Windows:
                            locust -f tests/performancetests/locustfile.py ^\n    --headless --users 10 --spawn-rate 1 --run-time 1m --host %MYENDPOINT%\n
                            Mac/Linux:
                            locust -f tests/performancetests/locustfile.py \\\n    --headless --users 10 --spawn-rate 1 --run-time 1m --host $MYENDPOINT\n

                            this will run the test with 10 users that are spawned at a rate of 1 per second for 1 minute.

                          6. (Optional) A good use case for load testing in our case is to test that our API can handle a load right after it has been deployed. To do this we need to add appropriate steps to our CI/CD pipeline. Try adding locust to an existing or new workflow file in your .github/workflows/ folder, such that it runs after the deployment step.

                            Solution

                            The solution here expects that a service called production-model has been deployed to Google Cloud Run. Then the following steps can be added to a workflow file, to first authenticate with Google Cloud, extract the relevant URL, and then run the load test:

                            - name: Auth with GCP\n  uses: google-github-actions/auth@v2\n  with:\n    credentials_json: ${{ secrets.GCP_SA_KEY }}\n\n- name: Set up Cloud SDK\n  uses: google-github-actions/setup-gcloud@v2\n\n- name: Extract deployed model URL\n  run: |\n    DEPLOYED_MODEL_URL=$(gcloud run services describe production-model \\\n      --region=europe-west1 \\\n      --format='value(status.url)')\n    echo \"DEPLOYED_MODEL_URL=$DEPLOYED_MODEL_URL\" >> $GITHUB_ENV\n\n- name: Run load test on deployed model\n  env:\n    DEPLOYED_MODEL_URL: ${{ env.DEPLOYED_MODEL_URL }}\n  run: |\n    locust -f tests/performance/locustfile.py \\\n      --headless -u 100 -r 10 --run-time 10m --host=$DEPLOYED_MODEL_URL --csv=/locust/results\n\n- name: Upload locust results\n  uses: actions/upload-artifact@v4\n  with:\n    name: locust-results\n    path: /locust\n

                            the results can afterward be downloaded from the artifacts tab in the GitHub UI.

                          "},{"location":"s7_deployment/testing_apis/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                          1. In the locust framework, what does the @task decorator do and what does @task(3) mean?

                            Solution

                            The @task decorator is used to define a task that a user can perform. The @task(3) decorator is used to define a task that a user can perform that is three times more likely to be performed than the other tasks.

                          2. In the locust framework, what does the wait_time attribute do?

                            Solution

                            The wait_time attribute is used to define how long a user should wait between tasks. It can either be a fixed number or a random number between two values.

                            from locust import HttpUser, task, between, constant\n\nclass MyUser(HttpUser):\n    wait_time = between(5, 9)\n    # or\n    wait_time = constant(5)\n
                          3. Load testing can give numbers on average response time, 99th percentile response time, and requests per second. What do these numbers tell us about the user experience of the API?

                            Solution

                            The average response time and the 99th percentile response time are both measures of how \"snappy\" the API feels to the user. While the average response time is normally considered the most important, the 99th percentile response time also matters because it tells us whether a small number of users are experiencing very slow response times. The requests per second tells us how many users the API can handle at the same time. If this number is too low it can lead to users experiencing slow response times and may indicate that something is wrong with the API.

                          "},{"location":"s8_monitoring/","title":"Monitoring","text":"

                          Slides

                          • Learn how to detect data drifting using the evidently framework

                            M27: Data Drifting

                          • Learn how to setup a prometheus monitoring system for your application

                            M28: System Monitoring

                          We have now reached the end of our machine-learning pipeline. We have successfully developed, trained and deployed a machine learning model. However, the question then becomes whether you can trust that your newly deployed model still works as expected after 1 day without you intervening. What about 1 month? What about 1 year?

                          There may be corner cases where an ML model keeps working as expected, but the vast majority of ML models will perform worse over time because they are not generalizing well enough. For example, assume you have just deployed an application that classifies images from phones when suddenly a new phone comes out with a new kind of sensor that takes images with a very weird aspect ratio or some other property your model is not robust towards. There is nothing inherently wrong with your model; you can essentially just retrain it on new data that accounts for this corner case. However, you need a mechanism that informs you when this happens.

                          This is where monitoring comes into play. Monitoring practices are in charge of collecting information about your application in a format that can be analyzed and reacted upon. Monitoring is essential to securing the longevity of your applications.

                          As with many other sub-fields within MLOps, we can divide monitoring into classic monitoring and ML-specific monitoring. Classic monitoring (known from classic DevOps) is often about

                          • Errors: Is my application working without problems?
                          • Logs: What is going on?
                          • Performance: How fast is my application?

                          All of these are basic pieces of information you are interested in regardless of what type of application you are deploying. On top of this there is machine learning related monitoring, which especially relates to data. Take the example above with the new phone: this is what we would in general consider a data drifting problem, i.e. the data you are trying to do inference on has drifted away from the distribution of data your model was trained on. Such monitoring problems are unique to machine learning applications and need to be handled separately.

                          We are in this session going to see examples of both kinds of monitoring.

                          Learning objectives

                          The learning objectives of this session are:

                          • Understand the concepts of data drifting in machine learning applications
                          • Can detect data drifting using the evidently framework
                          • Understand the importance of different system level monitoring and can conceptually implement it
                          "},{"location":"s8_monitoring/data_drifting/","title":"M27 - Data Drifting","text":""},{"location":"s8_monitoring/data_drifting/#data-drifting","title":"Data drifting","text":"

                          Data drifting is one of the core reasons that model accuracy degrades over time in production. For machine learning models, data drift is a change in the model input data that leads to model performance degradation. In practical terms, this means that the model is receiving input that is outside of the scope it was trained on, as seen in the figure below, which shows how the underlying distribution of a particular feature has slowly been increasing in value over two years.

                          Image credit

                          In some cases, normalizing a feature in a better way may allow your model to generalize better, but this is not always the case. The reason for such a drift is commonly some external factor that you essentially have no control over. That really only leaves you with one option: retrain your model on the newly received input features and deploy that model to production. This process is probably going to repeat over the lifetime of your application if you want to keep it up-to-date with the real world.

                          Image credit

                          We have now come up with a solution to the data drift problem, but there is one important detail that we have not taken care of: when should we actually trigger the retraining? We do not want to wait around for our model performance to degrade; thus we need tools that can detect when we are seeing a drift in our data.

                          "},{"location":"s8_monitoring/data_drifting/#exercises","title":"\u2754 Exercises","text":"

                          For these exercises we are going to use the framework Evidently developed by EvidentlyAI. Evidently currently supports detection for both regression and classification models. The exercises are in large part taken from here, and in general we recommend that if you are in doubt about an exercise, you look at the docs for the API and examples (their documentation can be a bit lacking sometimes, so you may also have to dive into the source code).

                          Additionally, we want to stress that data drift detection, concept drift detection etc. is still an active field of research, and multiple frameworks therefore exist for doing this kind of detection. In addition to Evidently, we can also mention NannyML, WhyLogs and Deepchecks.

                          1. Start by installing Evidently

                            pip install evidently\n

                            you will also need scikit-learn and pandas installed if you do not already have them.

                          2. Hopefully you have already gone through session S7 on deployment. As part of the deployment to GCP Functions you should have developed an application that can classify the iris dataset, based on a model trained by this script. We are going to convert this into a FastAPI application for our purposes here:

                            1. Convert your GCP function into a FastAPI application. The appropriate curl command should look something like this:

                              curl -X 'POST' \\\n    'http://127.0.0.1:8000/iris_v1/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0' \\\n    -H 'accept: application/json' \\\n    -d ''\n

                              and the response body should look like this:

                              {\n    \"prediction\": \"Iris-Setosa\",\n    \"prediction_int\": 0\n}\n

                              We have implemented a solution in this file (called v1) if you need help.

                            2. Next we are going to add some functionality to our application: whenever our application is called, the user input should be saved to a database. However, to not slow down the response to the user, we want to implement this as a background task. A background task is a function that is executed after the user has received their response. Implement a background task that saves the user input to a database implemented as a simple .csv file (a minimal sketch is shown after this set of sub-exercises). You can read more about background tasks here. The header of the database should look something like this:

                              time, sepal_length, sepal_width, petal_length, petal_width, prediction\n2022-12-28 17:24:34.045649, 1.0, 1.0, 1.0, 1.0, 1\n2022-12-28 17:24:44.026432, 2.0, 2.0, 2.0, 2.0, 1\n...\n

                              thus the input, the timestamp and the predicted value should all be saved. We have implemented a solution in this file (called v2) if you need help.

                            3. Call your API a number of times to generate some dummy data in the database.
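                            Below is the minimal sketch of a background task referred to above. The endpoint signature matches the curl command from before, while the hard-coded prediction is a placeholder that you should replace with your own model inference:

                            from datetime import datetime\n\nfrom fastapi import BackgroundTasks, FastAPI\n\napp = FastAPI()\n\ndef save_to_database(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, prediction: int) -> None:\n    \"\"\"Append a single observation and its prediction to the csv database.\"\"\"\n    with open(\"prediction_database.csv\", \"a\") as file:\n        file.write(f\"{datetime.now()}, {sepal_length}, {sepal_width}, {petal_length}, {petal_width}, {prediction}\\n\")\n\n@app.post(\"/iris_v1/\")\nasync def iris_inference(\n    sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, background_tasks: BackgroundTasks\n):\n    prediction = 0  # placeholder, replace with actual model inference\n    background_tasks.add_task(save_to_database, sepal_length, sepal_width, petal_length, petal_width, prediction)\n    return {\"prediction\": \"Iris-Setosa\", \"prediction_int\": prediction}\n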

                          3. Create a new data_drift.py file where we are going to implement the data drifting detection and reporting. Start by adding both the real iris data and your generated dummy data as pandas dataframes.

                            import pandas as pd\nfrom sklearn import datasets\nreference_data = datasets.load_iris(as_frame=True).frame\ncurrent_data = pd.read_csv('prediction_database.csv')\n

                            if done correctly you will most likely end up with two dataframes that look like

                            # reference_data\nsepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target\n0                  5.1               3.5                1.4               0.2       0\n1                  4.9               3.0                1.4               0.2       0\n...\n148                6.2               3.4                5.4               2.3       2\n149                5.9               3.0                5.1               1.8       2\n[150 rows x 5 columns]\n\n# current_data\ntime                         sepal_length   sepal_width   petal_length   petal_width   prediction\n2022-12-28 17:24:34.045649   1.0            1.0            1.0           1.0           1\n...\n2022-12-28 17:24:34.045649   1.0            1.0            1.0           1.0           1\n[10 rows x 5 columns]\n

                            Standardize the dataframes such that they have the same column names and drop the time column from the current_data dataframe.
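                            One way of doing this standardization is sketched below; the column names are assumptions based on the csv header shown earlier:

                            feature_columns = [\"sepal_length\", \"sepal_width\", \"petal_length\", \"petal_width\"]\nreference_data = reference_data.rename(\n    columns={\n        \"sepal length (cm)\": \"sepal_length\",\n        \"sepal width (cm)\": \"sepal_width\",\n        \"petal length (cm)\": \"petal_length\",\n        \"petal width (cm)\": \"petal_width\",\n        \"target\": \"prediction\",\n    }\n)\ncurrent_data = current_data.drop(columns=[\"time\"])\ncurrent_data.columns = feature_columns + [\"prediction\"]  # also strips any whitespace-padded headers from the csv\n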

                          4. We are now ready to generate some reports about data drifting:

                            1. Try executing the following code:

                              from evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference_data, current_data=current_data)\nreport.save_html('report.html')\n

                              and open the generated .html page. What does it say about your data? Has it drifted? Make sure to poke around to understand what the different plots are actually showing.

                            2. Data drifting is not the only kind of reporting Evidently can do. We can also get reports on the data quality. First try adding a few NaN values to your reference data. Secondly, try changing the report to

                              from evidently.metric_preset import DataDriftPreset, DataQualityPreset\nreport = Report(metrics=[DataDriftPreset(), DataQualityPreset()])\n

                              and re-run the report. Check out the newly generated report. Again, go over the generated plots and make sure that it picked up on the missing values you just added.

                            3. The final report preset we will look at is the TargetDriftPreset. Target drift means that our model is over/under-predicting certain classes, or in general terms that the distribution of predicted values differs from the ground-truth distribution of targets. Try adding the TargetDriftPreset to the Report class, re-run the analysis and inspect the result. Have your targets drifted?

                          5. Evidently reports are meant for debugging, exploration and reporting of results. However, as we stated in the beginning, what we are actually interested in is methods for automatically detecting when our data is beginning to drift. For this we will need to look at Tests and TestSuites:

                            1. Let's start with a simple test that checks if there are any missing values in our dataset:

                              from evidently.test_suite import TestSuite\nfrom evidently.tests import TestNumberOfMissingValues\ndata_test = TestSuite(tests=[TestNumberOfMissingValues()])\ndata_test.run(reference_data=reference_data, current_data=current_data)\n

                              again we could call data_test.save_html to get a nice view of the results (feel free to try it out), but additionally we can also call the data_test.as_dict() method, which returns a dict with the test results. Which dictionary key contains the information on whether all tests have passed or not?

                            2. Take a look at this colab notebook that contains all tests implemented in Evidently. Pick 5 tests of your choice, where at least 1 fails by default, and implement them as a TestSuite. Then try changing the arguments of the tests so they better fit your use case and get them all passing.

                          6. (Optional) When doing monitoring in practice, we are not always interested in running on all the data collected from our API; maybe we only want the last N entries or just the last hour of observations. Since we are already logging the timestamps of when our API is called, we can use those for filtering. Implement a simple filter that either takes an integer n and returns the last n entries in our database, or takes some datetime t and filters away observations earlier than this (a minimal sketch is shown below).
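                            A minimal sketch of such a filter could look like this, assuming the csv layout described earlier:

                            from datetime import datetime\n\nimport pandas as pd\n\ndef load_latest(n: int | None = None, t: datetime | None = None) -> pd.DataFrame:\n    \"\"\"Return either the last n predictions or all predictions newer than time t.\"\"\"\n    df = pd.read_csv(\"prediction_database.csv\", skipinitialspace=True, parse_dates=[\"time\"])\n    if n is not None:\n        return df.tail(n)\n    if t is not None:\n        return df[df[\"time\"] > t]\n    return df\n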

                          7. Evidently by default only supports structured data, i.e. tabular data (as does nearly every other framework). Thus, the question becomes how we can extend this to unstructured data such as images or text. The solution is to extract structured features from the data, which we can then run the analysis on.

                            1. (Optional) For images the simple solution would be to flatten the images and consider each pixel a feature; however, this does not work well in practice because changes in individual pixels do not really tell us anything about the image. Instead we should derive features such as:

                              • Average brightness
                              • Contrast of image
                              • Image sharpness
                              • ...

                              These are all numbers that can make up a feature vector for an image. Try doing this yourself, for example by extracting such features from the MNIST and FashionMNIST datasets, and check if you can detect a drift between the two sets (a minimal sketch is shown below).
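                              The sketch below shows one of many possible ways to turn MNIST images into such a table of features; the specific feature definitions are illustrative assumptions:

                              import numpy as np\nimport pandas as pd\nimport torch\nfrom torchvision import datasets, transforms\n\ndef image_features(img: torch.Tensor) -> list[float]:\n    \"\"\"Compute simple scalar features for a single image tensor of shape [C, H, W].\"\"\"\n    arr = img.numpy()\n    brightness = float(arr.mean())\n    contrast = float(arr.std())\n    sharpness = float(np.abs(np.diff(arr, axis=-1)).mean())  # crude proxy: neighbouring-pixel differences\n    return [brightness, contrast, sharpness]\n\nmnist = datasets.MNIST(\"data\", download=True, transform=transforms.ToTensor())\nfeatures = pd.DataFrame(\n    [image_features(mnist[i][0]) for i in range(1000)],\n    columns=[\"brightness\", \"contrast\", \"sharpness\"],\n)\n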

                            2. (Optional) For text a common approach is to extract some higher-level embedding such as the classical GloVe embedding. Try following this tutorial to understand how drift detection is done on text.

                            3. Let's instead take a deep learning based approach to doing this. Let's consider the CLIP model, which is normally used to connect images and text in a shared embedding space. For our purpose this is perfect because we can use the model to get abstract feature embeddings for both images and text:

                              from PIL import Image\nimport requests\n# requires transformers package: pip install transformers\nfrom transformers import CLIPProcessor, CLIPModel\n\nmodel = CLIPModel.from_pretrained(\"openai/clip-vit-base-patch32\")\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n\nurl = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# set either text=None or images=None when only the other is needed\ninputs = processor(text=[\"a photo of a cat\", \"a photo of a dog\"], images=image, return_tensors=\"pt\", padding=True)\n\nimg_features = model.get_image_features(inputs['pixel_values'])\ntext_features = model.get_text_features(inputs['input_ids'], inputs['attention_mask'])\n

                              Both img_features and text_features are in this case 512-dimensional abstract feature embeddings that should be able to tell us something about our data distribution. Try using this method to extract features on two different datasets, like CIFAR10 and SVHN if you want to work with vision, or IMDB movie reviews and Amazon reviews for text. After extracting the features, try running some of the data distribution tests you just learned about.

                          8. (Optional) If we have multiple applications and want to run monitoring for each of them, we often also want the monitoring itself to be a deployed application (that only we can access). Implement a /monitoring/ endpoint that does all the reporting we just went through such that you have two endpoints:

                            http://127.0.0.1:8000/iris_infer/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0 # user endpoint\nhttp://127.0.0.1:8000/iris_monitoring/ # monitoring endpoint\n

                            Our monitoring endpoint should return an HTML page showing either an Evidently report or test suite. Try implementing this endpoint (a minimal sketch is shown below). We have implemented a solution in this file if you need help with how to return an HTML page from a FastAPI application.
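                            The sketch referred to above could look something like this; make_report is a hypothetical helper that wraps the Report code from the earlier exercises:

                            from fastapi import FastAPI\nfrom fastapi.responses import HTMLResponse\n\napp = FastAPI()\n\n@app.get(\"/iris_monitoring/\", response_class=HTMLResponse)\nasync def iris_monitoring() -> HTMLResponse:\n    \"\"\"Generate an Evidently report and return it as an HTML page.\"\"\"\n    report = make_report()  # hypothetical helper that builds and runs the Report from exercise 4\n    report.save_html(\"monitoring.html\")\n    with open(\"monitoring.html\", \"r\", encoding=\"utf-8\") as f:\n        return HTMLResponse(content=f.read(), status_code=200)\n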

                          9. As a final exercise, we recommend that you try implementing this to run directly in the cloud. You will need to implement it in a container, e.g. a Cloud Run service, because the data gathering from the endpoint should still be implemented as a background task. For this to work you will need to change the following:

                          10. Instead of saving the input to a local file, you should store it either in a GCP bucket or in a BigQuery table (the latter is a better solution, but also out of scope for this course).

                          11. You can either run the data analysis locally by pulling the predictions and training data from cloud storage, or alternatively deploy the analysis as its own endpoint that can be invoked. For the latter option we recommend that the endpoint requires authentication.

                          That ends the module on detection of data drifting, data quality etc. If it has not already been made clear, monitoring of machine learning applications is an extremely hard discipline because it is not clear-cut when we should actually respond to a feature beginning to drift and when it is probably fine. What kind of rules should be implemented comes down to the individual application. Additionally, the tools presented here are in no way complete and are especially limited in one way: they only consider the marginal distribution of the data. Every analysis we have done has been on the distribution per feature (the marginal distribution); however, as the image below shows, it is possible for data to have drifted to another distribution while the marginals stay approximately the same.

                          There are methods such as Maximum Mean Discrepancy (MMD) tests that are able to do testing on multivariate distributions, which you are free to dive into. In this course we will simply always recommend considering multiple features when making decisions regarding your deployed applications.

                          "},{"location":"s8_monitoring/monitoring/","title":"M28 - System Monitoring","text":""},{"location":"s8_monitoring/monitoring/#monitoring","title":"Monitoring","text":"

                          In this module we are going to look into more classical monitoring of applications. The key concept we are often working with here is called telemetry. Telemetry in general refers to any automatic measurement and wireless transmission of data from our application. It could be numbers such as:

                          • The number of requests our application is receiving per minute/hour/day. This number is of interest because it is directly proportional to the running cost of the application.
                          • The amount of time (on average) our application spends per request. This number is of interest because it is most likely the core contributor to the latency that our users are experiencing (which we want to be low).
                          • ...

                          In general there are three different kinds of telemetry we are interested in:

                          • Metrics: quantitative measurements of the system, usually numbers aggregated over a period of time, e.g. the number of requests per minute. Metrics are used to get an overview of the system and are often used to create dashboards.
                          • Logs: textual or structured records generated by applications. They provide a detailed account of events, errors, warnings and informational messages that occur during the operation of the system, e.g. system logs and error logs. Logs are essential for diagnosing issues, debugging and auditing; they provide a detailed history of what happened in a system, making it easier to trace the root cause of problems and track the behavior of components over time.
                          • Traces: detailed records of specific transactions or events as they move through a system. A trace typically includes information about the sequence of operations, timing and dependencies between different components, e.g. distributed tracing in a microservices architecture. Traces help in understanding the flow of a request or transaction across different components and are valuable for identifying bottlenecks, understanding latency and troubleshooting issues related to the flow of data or control.

                          We are mainly going to focus in this module on metrics.

                          "},{"location":"s8_monitoring/monitoring/#local-instrumentator","title":"Local instrumentator","text":"

                          Before we look into the cloud, let's at least conceptually understand how a given instance of an app can expose values that we may be interested in monitoring.

                          The standard framework for exposing metrics is called prometheus. Prometheus is a time series database that is designed to store metrics. It is also designed to be easy to instrument applications with and to scale to large amounts of data. The way prometheus works is that it exposes a /metrics endpoint that can be queried to get the current state of the metrics. The metrics are exposed in a format called the prometheus text format.

                          "},{"location":"s8_monitoring/monitoring/#exercises","title":"\u2754 Exercises","text":"
                          1. Start by installing prometheus-fastapi-instrumentator in python

                            pip install prometheus-fastapi-instrumentator\n

                            this will allow us to easily instrument our FastAPI application with prometheus.

                          2. Create a simple FastAPI application in a file called app.py. You can reuse any application from the previous module on APIs. To that file now add the following code:

                            from prometheus_fastapi_instrumentator import Instrumentator\n\n# your app code here\n\nInstrumentator().instrument(app).expose(app)\n

                            This will instrument your application with prometheus and expose the metrics on the /metrics endpoint.

                          3. Run the app using the uvicorn server. Make sure that the app exposes the endpoints you expect it to expose, but also check out the /metrics endpoint.

                          4. The /metrics endpoint exposes multiple metrics. A metric always looks like this:

                            # TYPE key <type>\nkey value\n

                            i.e. it is essentially a dictionary of key-value pairs with the added information of a <type>. Look at this page about the different types prometheus metrics can have and try to understand the different metrics being exposed.

                          5. Look at the documentation for the prometheus-fastapi-instrumentator and try to add at least one more metric to your application. Rerun the application and confirm that the new metric is being exposed.
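                            One possible way of doing this (a sketch, not the only approach) is to define a custom metric with the prometheus_client package, which the instrumentator builds on; metrics registered in the default registry should show up on the same /metrics endpoint:

                            from fastapi import FastAPI\nfrom prometheus_client import Counter\nfrom prometheus_fastapi_instrumentator import Instrumentator\n\napp = FastAPI()\nprediction_counter = Counter(\"my_app_predictions_total\", \"Number of predictions served\")\n\n@app.get(\"/predict\")\ndef predict():\n    \"\"\"Dummy prediction endpoint that counts how many times it has been called.\"\"\"\n    prediction_counter.inc()\n    return {\"prediction\": 0}\n\nInstrumentator().instrument(app).expose(app)\n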

                          "},{"location":"s8_monitoring/monitoring/#cloud-monitoring","title":"Cloud monitoring","text":"

                          Any self-respecting cloud system will have some kind of monitoring system. GCP has a service called Monitoring that is designed to monitor all the different services. By default it will monitor a lot of metrics out of the box. However, the question is whether we want to monitor more than the default metrics. The complexity that comes with doing monitoring in the cloud is that we need more than one container: we need at least one container actually running the application and exposing the /metrics endpoint, and then we need another container that collects the metrics from the first container and stores them in a database. To implement such a system of containers that need to talk to each other, we in general need a container orchestration system such as Kubernetes. This is out of scope for this course, but we can use a feature of Cloud Run called sidecar containers to achieve the same effect. A sidecar container is a container that runs alongside the main container and can be used to do things such as collecting metrics.

                          "},{"location":"s8_monitoring/monitoring/#exercises_1","title":"\u2754 Exercises","text":"
                          1. Overall we recommend that you just become familiar with the monitoring tab for your Cloud Run service (see the image above). Try to invoke your service a couple of times and see what happens to the metrics over time.

                            1. (Optional) If you really want to load test your application we recommend checking out the tool locust. Locust is a Python based load testing tool that can be used to simulate many users accessing your application at the same time.
                          2. Try creating a service level objective (SLO). In short, an SLO is a target for how well your application should be performing. Click the Create SLO button and fill it out with what you consider to be a good SLO for your application.

                          3. (Optional) To expose our own metrics we need to set up a sidecar container. To do this, follow the instructions here. We have set up a simple example that uses fastapi and prometheus that you can find here. After you have correctly set up the sidecar container you should be able to see the metrics in the monitoring tab.

                          "},{"location":"s8_monitoring/monitoring/#alert-systems","title":"Alert systems","text":"

                          A core problem within monitoring is alert systems. The alert system is in charge of sending out alerts to relevant people when some telemetry or metric we are tracking is not behaving as it should. When and how many alerts should be sent out is a subjective choice, and in general it should be proportional to how important the metric/telemetry is. We commonly run into what is referred to as the Goldilocks problem, where we want just the right amount of alerts; however, it is more often the case that we either have

                          • Too many alerts, such that they become irrelevant and the really important ones are overlooked, often referred to as alert fatigue
                          • Or, alternatively, too few alerts, such that problems that should have triggered an alert are not dealt with when they happen, which can have unforeseen consequences.

                          Therefore, setting up proper alert systems can be as challenging as setting up the systems that actually collect the metrics we want to trigger alerts on.

                          "},{"location":"s8_monitoring/monitoring/#exercises_2","title":"\u2754 Exercises","text":"

                          We are in this exercise going to look at how we can set up automatic alerting such that we get a message every time one of our applications is not behaving as expected.

                          1. Go to the Monitoring service. Then go to the Alerting tab.

                          2. Start by setting up a notification channel. We recommend setting it up with an email.

                          3. Next, let's create a policy. Clicking the Add Condition button should bring up a window as below. You are free to set up the condition as you want, but the image shows one way to set up an alert that will react to the number of times a cloud function is invoked (it actually measures the number of log entries from cloud functions).

                          4. After adding the condition, add the notification channel you created in one of the earlier steps. Remember to also add some documentation that should be sent with the alert to better describe what the alert is actually doing.

                          5. When the alert is set up, you need to trigger it. If you set up the condition as in the image above, you just need to invoke the cloud function many times. Here is a small code snippet that you can execute on your laptop to call a cloud function many times (you need to change the url and payload depending on your function):

                            import requests\n\nurl = 'https://us-central1-dtumlops-335110.cloudfunctions.net/function-2'\npayload = {'message': 'Hello, General Kenobi'}\n\nfor _ in range(1000):\n    r = requests.get(url, params=payload)\n
                          6. Make sure that you get the alert through the notification channel you set up.

                          "},{"location":"s9_scalable_applications/","title":"Scaling applications","text":"

                          Slides

                          • Learn how to setup distributed data loading in your PyTorch application

                            M29: Distributed Data Loading

                          • Learn how to do distributed training in PyTorch using pytorch-lightning

                            M30: Distributed Training

                          • Learn how to do scalable inference in PyTorch

                            M31: Scalable Inference

                          This module is all about scaling the applications that we are building. We are here going to use a very narrow definition of scaling, namely that we want our applications to run faster; however, one should note that in general scaling is a much broader term. There are many different ways to scale your applications and we are going to look at three of these related to different tasks in machine learning algorithms:

                          • Scaling data loading
                          • Scaling training
                          • Scaling inference

                          We are going to approach the term scaling from two different angles, and both should result in your application running faster. The first approach is leveraging multiple devices, such as using multiple CPU cores or parallelizing training across multiple GPUs. The second approach is more analytical, where we are going to look at how we can design smaller/faster model architectures that run faster.

                          It should be noted that this module is specific to working with PyTorch applications. In particular, we are going to see how we can both improve base PyTorch code and how to utilize PyTorch Lightning, which we introduced in module M14 on boilerplate, to improve the scaling of our applications. If your application is written using another framework, the same general techniques from these modules should still transfer, but you may have to seek out how to specifically apply them in that framework.

                          If you manage to complete all modules in this session, feel free to check out the extra module on scalable hyperparameter optimization.

                          Learning objectives

                          The learning objectives of this session are:

                          • Understand how data loading during training can be parallelized and have experimented with it
                          • Understand the different paradigms for distributed training and can run multi-GPU experiments using the framework pytorch-lightning
                          • Knowledge of different ways, including quantization, pruning, architecture tuning etc. to improve inference speed
                          "},{"location":"s9_scalable_applications/data_loading/","title":"M29 - Distributed Data Loading","text":""},{"location":"s9_scalable_applications/data_loading/#distributed-data-loading","title":"Distributed Data Loading","text":"

                          Core Module

                          One way that deep learning fundamentally changed the way we think about data in machine learning is that more data is always better. This was very much not the case with more traditional machine learning algorithms (random forests, support vector machines etc.), where performance often plateaued at a certain amount of data and did not improve if more was added. However, as deep learning models have become deeper and deeper, and thereby more and more data-hungry, performance seems to be ever increasing, or at least not reaching a plateau in the same way as for traditional machine learning.

                          Image credit

                          As we are trying to feed more and more data into our models, the obvious first question to ask is how to do this efficiently. As a general rule of thumb, we want the performance bottleneck to be the forward/backward pass, i.e. the actual computation in our neural network, and not the data loading. By bottleneck, we here refer to the part of our pipeline that is restricting how fast we can process data. If data loading is our bottleneck, then our compute device can sit idle while waiting for data to arrive, which is both inefficient and costly. For example, if you are using a cloud provider for training deep learning models, you are paying by the hour per device, and thus not using them fully can be costly in the long run.

                          In the first set of exercises, we are therefore going to focus on distributed data loading i.e. how to load data in parallel to make sure that we always have data ready for our compute devices. We are in the following going to look at what is going on behind the scenes when we use PyTorch to parallelize data loading.

                          "},{"location":"s9_scalable_applications/data_loading/#a-closer-look-at-data-loading","title":"A closer look at Data loading","text":"

                          Before we talk about distributed applications, it is important to understand the physical layout of a standard CPU (the brain of your computer).

                          A modern CPU is a single chip that consists of multiple cores, and each core can further be divided into threads. In most laptops the core count is 4, commonly with 2 threads per core, meaning that a common laptop has 8 threads. The number of threads a compute unit has is important because it directly corresponds to the number of parallel operations that can be executed, i.e. one per thread. In a Python terminal you should be able to get the number of cores in your machine by writing (try it):

                          import multiprocessing\ncores = multiprocessing.cpu_count()\nprint(f\"Number of cores: {cores}, Number of threads: {2*cores}\")\n

                          A distributed application is in general any kind of application that parallelizes some or all of its workload. We are in these exercises only focusing on distributed data loading, which happens primarily on the CPU. In PyTorch it is easy to parallelize data loading if you are using their dataset/dataloader interface:

                          from torch.utils.data import Dataset, DataLoader\nclass MyDataset(Dataset):\n    def __init__(self, ...):\n        # whatever logic is needed to init the data set\n        self.data = ...\n\n    def __getitem__(self, idx):\n        # return one item\n        return self.data[idx]\n\ndataset = MyDataset()\ndataloader = DataLoader(\n    dataset,\n    batch_size=8,\n    num_workers=4  # this is the number of threads we want to parallelize workload over\n)\n

                          Let's take a deep dive into what happens when we request a batch from our dataloader, e.g. next(iter(dataloader)). First, we must understand that we have a thread that plays the role of the main thread, while the remaining threads (in the above example we requested 4) are called workers. When the dataloader is created, we create this structure and make sure that all threads have a copy of our dataset definition so each can call the __getitem__ method.

                          Then comes the actual part where we request a batch of data. Assume that we have a batch size of 8 and we do not do any shuffling. In this step, the master thread then distributes the list of requested data points ([0,1,2,3,4,5,6,7]) to the four worker threads. With 8 indices and 4 workers, each worker will receive 2 indices.

                          Each worker thread then calls the __getitem__ method for all the indices it has received. When all workers are done, the loaded data points get sent back to the main thread and collected into a single structure/tensor.

                          Each arrow corresponds to a communication between two threads, which is not a free operation. In total, to get a single batch (not counting the initial startup cost) in this example we need to do 8 communication operations. This may seem like a small price to pay, but that may not be the case. If the processing time of __getitem__ is very low (the data is stored in memory and we just need to index into it), then it does not make sense to use multiprocessing: the computational savings from doing the look-up operations in parallel are smaller than the communication cost between the main thread and the workers. Multiprocessing makes sense when the processing time of __getitem__ is high (the data is probably stored on the hard drive).

                          It is this trade-off that we are going to investigate in the exercises.

                          "},{"location":"s9_scalable_applications/data_loading/#exercises","title":"\u2754 Exercises","text":"

                          This exercise is intended to be done on the labeled faces in the wild (LFW) dataset. The dataset consists of images of famous people extracted from the internet. The dataset has been used to drive the field of facial verification, which you can read more about here. We are going to imagine that this dataset cannot fit in memory, and your job is therefore to construct a data pipeline that can be parallelized based on loading the raw data files (.jpg) at runtime.

                          1. Download the dataset and extract it to a folder. It does not matter if you choose the non-aligned or aligned version of the dataset.

                          2. We provide the lfw_dataset.py file where we have started the process of defining a data class. Fill out the __init__, __len__ and __getitem__. Note that __getitem__ expects that you return a single img which should be a torch.Tensor. Loading should be done using PIL Image, as PIL images are the default input format for torchvision for transforms (for data augmentation).

                          3. Make sure that the script runs without any additional arguments

                            python lfw_dataset.py\n
                          4. Visualize a single batch by filling out the codeblock after the first TODO, right after defining the dataloader. The visualization should be shown when launching the script as

                            python lfw_dataset.py -visualize_batch\n

                            Hint: this tutorial.

                          5. Experiment with how the number of workers influences the performance. We have already provided code that will pass over 100 batches from the dataset 5 times and calculate how long it took, which you can play around with by calling

                            python lfw_dataset.py -get_timing -num_workers 1\n

                            Make an errorbar plot with the number of workers along the x-axis and the timing along the y-axis (a small plotting sketch is shown after the note below). The errorbars should correspond to the standard deviation over the 5 runs. HINT: if it is taking too long to evaluate, measure the time over fewer batches (set the -batches_to_check flag). Also, if you are not seeing an improvement, try increasing the batch size (since data loading is parallelized per batch).

                            For certain machines, like Macs with the M1 chipset, it is necessary to set the multiprocessing_context flag in the dataloader to \"fork\", as sketched below. This essentially tells the dataloader how the worker processes should be created.
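                            A small sketch of what that could look like (the batch size and number of workers are just examples):

                            from torch.utils.data import DataLoader\n\n# \"fork\" tells PyTorch how the worker processes should be created; mainly relevant on some macOS setups\ndataloader = DataLoader(dataset, batch_size=512, num_workers=4, multiprocessing_context=\"fork\")\n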

                          6. Retry the experiment where you change the data augmentation to be more complex:

                            lfw_trans = transforms.Compose([\n    transforms.RandomAffine(5, (0.1, 0.1), (0.5, 2.0)),\n    # add more transforms here\n    transforms.ToTensor()\n])\n

                            by making the augmentation more computationally demanding, it should be easier to get a boost in performance when using multiple workers because the data augmentation is also executed in parallel.

                          7. (Optional, requires access to a GPU) When training on a GPU it is often beneficial to set the pin_memory flag to True in the dataloader. By setting this flag we are essentially telling PyTorch that it can lock the data in place in page-locked host memory, which makes the transfer between the host (CPU) and the device (GPU) faster. A minimal sketch is shown below.
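                            A minimal sketch of how this could look (the device handling is just an example):

                            import torch\nfrom torch.utils.data import DataLoader\n\ndataloader = DataLoader(dataset, batch_size=512, num_workers=4, pin_memory=True)\n\ndevice = torch.device(\"cuda\")\nfor batch in dataloader:\n    # pinned (page-locked) host memory allows the copy to the GPU to be asynchronous\n    batch = batch.to(device, non_blocking=True)\n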

                          This ends the module on distributed data loading in PyTorch. If you want to go into more detail, we highly recommend that you read this paper, which analyzes in great depth how data loading in PyTorch works and provides performance benchmarks.

                          "},{"location":"s9_scalable_applications/distributed_training/","title":"M30 - Distributed Training","text":""},{"location":"s9_scalable_applications/distributed_training/#distributed-training","title":"Distributed Training","text":"

                          In this module we are going to look at distributed training. Distributed training is one of the key ingredients behind all the awesome results that deep learning models are producing. For example: AlphaFold, the highly praised model from DeepMind that seems to have solved protein structure prediction, was trained in a distributed fashion for a few weeks. The training was done on 16 TPUv3s (specialized hardware), which is approximately equal to 100-200 modern GPUs. This means that training AlphaFold on a single GPU without distributed training (if it were even possible) would take a couple of years! It is therefore simply impossible to train some of the current state-of-the-art (SOTA) models within deep learning without taking advantage of distributed training.

                          When we talk about distributed training, there are a number of different paradigms that we may use to parallelize our computations

                          • Data parallel (DP) training
                          • Distributed data parallel (DDP) training
                          • Sharded training

                          In this module we are going to look at data parallel training, which is the original way of doing parallel training, and distributed data parallel training, which is an improved version of data parallel. If you want to know more about sharded training, which is the newest of the paradigms, you can read more about it in this blog post, which describes how sharded training can save over 60% of the memory used during training.

                          Finally, we want to note that for all the exercises in this module you are going to need a multi-GPU setup. If you have not already gained access to multi-GPU machines on GCP (see the quotas exercises in this module) you will need to find another way of running the exercises. For DTU students, I can recommend checking out this optional module on using the high performance cluster (HPC), where you can get access to multi-GPU resources.

                          "},{"location":"s9_scalable_applications/distributed_training/#data-parallel","title":"Data parallel","text":"

                          While data parallel is today generally seen as obsolete compared to distributed data parallel, we are still going to investigate it a bit, since it offers the simplest form of distributed computation in a deep learning pipeline.

                          The figure below shows both the forward and the backward step in the data parallel paradigm.

                          The steps are the following:

                          • Whenever we do a forward call, e.g. out=model(batch), we take the batch and divide it equally between all devices. If we have a batch size of N and M devices, each device will be sent N/M data points.

                          • Afterwards, each device receives a copy of the model, e.g. a copy of the weights that currently parametrize our neural network.

                          • In this step we perform the actual forward pass in parallel. This is the step that actually helps us scale our training.

                          • Finally we need to send back the output of each replicated model to the primary device.

                          Similar to the analysis we did of parallel data loading, we cannot always expect that this will actually take less time than doing the forward call on a single GPU. If we are parallelizing over M devices, we essentially need to do 3xM communication calls to send the batch, model and output between the devices. If the time saved by the parallel forward pass does not outweigh this communication cost, the overall call will take longer.

                          In addition, we also have the backward pass to consider:

                          • As the forward pass collected the outputs on the primary device, this is also where the loss is accumulated. Thus, the loss gradients are first calculated on the primary device.

                          • Next we scatter the gradient to all the workers

                          • The workers then perform a parallel backward pass through their individual model

                          • Finally, we reduce (sum) the gradients from all the workers on the main process such that we can do gradient descent.

                          One of the big downsides of using data parallel is that all the replicas are destroyed after each backward call. This means that over and over again we need to replicate our model and send it to the devices that are part of the computation.

                          Even though it may seem like implementing data parallel requires a lot of logic in your code, in PyTorch we can enable data parallel training very simply by wrapping our model in the nn.DataParallel class.

                          from torch import nn\nmodel = MyModelClass()\nmodel = nn.DataParallel(model, device_ids=[0, 1])  # data parallel on gpu 0 and 1\npreds = model(input)  # same as usual\n
                          "},{"location":"s9_scalable_applications/distributed_training/#exercises","title":"\u2754 Exercises","text":"

                          Please note that the exercise only makes sense if you have access to multiple GPUs.

                          1. Create a new script (call it data_parallel.py) where you take a copy of the FashionCNN model from the fashion_mnist.py script. Instantiate the model and wrap torch.nn.DataParallel around it such that it can be executed in data parallel.

                          2. Try to run inference in parallel on multiple devices (pass a batch multiple times and time it) e.g.

                            import time\n\ntorch.cuda.synchronize()  # GPU calls are asynchronous, so finish pending work before starting the timer\nstart = time.time()\nfor _ in range(n_reps):\n    out = model(batch)\ntorch.cuda.synchronize()  # wait for all GPU work to finish before stopping the timer\nend = time.time()\n

                            Does data parallel decrease the inference time? If no, can you explain why that may be? Try playing around with the batch size, and see if data parallel is more beneficial for larger batch sizes.

                          "},{"location":"s9_scalable_applications/distributed_training/#distributed-data-parallel","title":"Distributed data parallel","text":"

                          It should be clear that there is a huge disadvantage of using the data parallel paradigm to scale your applications: the model needs to be replicated on each pass (because it is destroyed at the end), which requires a large transfer of data. This is the main problem that distributed data parallel tries to solve.

                          The key difference between distributed data parallel and data parallel is that we move the model update (the gradient step) to happen on each device in parallel instead of only on the main device. This has the consequence that we do not need to replicate the model on each step; instead we just keep a local version on each device that we keep updating. The full set of steps (as shown in the figure):

                          • Initialize an exact copy of the model on each device

                          • From disk (or memory) we start by loading data into a section of page-locked host memory per device. Page-locked memory is essentially a way to reserve a piece of the computer's memory for a specific transfer that is going to happen over and over again, to speed it up. The page-locked regions are loaded with non-overlapping data.

                          • Transfer data from page-locked memory to each device in parallel

                          • Perform forward pass in parallel

                          • Do an all-reduce operation on the gradients. An all-reduce operation is a so-called all-to-all operation, meaning that all processes send their own gradients to all other processes and also receive the gradients from all other processes.

                          • Reduce (sum) the combined gradient signal from all processes and update the individual models in parallel. Since all processes received the same gradient information, all models will still be in sync.

                          Thus, in distributed data parallel we end up doing only a single communication call between all processes, compared to all the communication going on in data parallel. While all-reduce is a more expensive operation than many of the other communication operations we could do, because we only have to do a single one we gain a huge performance boost. Empirically, distributed data parallel tends to be 2-3 times faster than data parallel.

                          However, this performance increase does not come for free. Where we could implement data parallel in a single line in PyTorch, distributed data parallel is much more involved.
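                          To give you an idea of the moving parts before you look at the provided example files, a stripped-down sketch could look like this; it assumes the script is launched with torchrun (which sets the rank environment variables) and that MyModelClass and dataset are placeholders for your own model and dataset:

                          import os\n\nimport torch\nfrom torch import distributed as dist\nfrom torch.nn.parallel import DistributedDataParallel as DDP\nfrom torch.utils.data import DataLoader, DistributedSampler\n\ndist.init_process_group(backend=\"nccl\")  # one process per GPU, nccl is the standard GPU backend\nrank = int(os.environ[\"LOCAL_RANK\"])\ntorch.cuda.set_device(rank)\n\nmodel = MyModelClass().to(rank)        # MyModelClass is a placeholder for your own model\nmodel = DDP(model, device_ids=[rank])  # gradients are all-reduced automatically during backward\n\nsampler = DistributedSampler(dataset)  # gives each process a non-overlapping subset of the data\ndataloader = DataLoader(dataset, batch_size=64, sampler=sampler)\n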

                          "},{"location":"s9_scalable_applications/distributed_training/#exercises_1","title":"\u2754 Exercises","text":"
                          1. We have provided an example of how to do distributed data parallel training in PyTorch in the two files distributed_example.py and distributed_example.sh. Your objective is to get an understanding of the necessary components in the script to get this kind of distributed training to work. Try to answer the following questions (HINT: try to Google around):

                            1. What is the function of the DDP wrapper?

                            2. What is the function of the DistributedSampler?

                            3. Why is it necessary to call dist.barrier() before passing a batch into the model?

                            4. What do the different environment variables in the .sh file do?

                          2. Try to benchmark the runs using 1 and 2 GPUs

                          3. The first exercise has hopefully convinced you that it can be quite a lot of trouble to write distributed training applications yourself. Luckily for us, PyTorch-lightning can take care of this for us such that we do not have to care about the specific details. To get your model training on multiple GPUs you need to change two arguments in the trainer: the accelerator flag and the gpus flag (see the sketch below). In addition to this, you can read through this guide about any additional steps you may need to take (for many of you, it should just work). Try running your model on multiple GPUs.
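                            A hedged sketch of what this could look like; model and train_dataloader are placeholders, and the exact argument names depend on your PyTorch-lightning version:

                            from pytorch_lightning import Trainer\n\ntrainer = Trainer(accelerator=\"gpu\", devices=2, max_epochs=5)  # newer Lightning versions use the devices flag\n# older releases instead used: Trainer(gpus=2, accelerator=\"ddp\")\ntrainer.fit(model, train_dataloader)\n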

                          4. Try benchmarking your training using 1 and 2 GPUs, e.g. try running a couple of epochs and measure how long it takes. How much of a speedup can you actually get? Why can you not get a speedup of 2?

                          "},{"location":"s9_scalable_applications/inference/","title":"M31 - Scalable Inference","text":""},{"location":"s9_scalable_applications/inference/#scalable-inference","title":"Scalable Inference","text":"

                          Inference is the task of applying our trained model to new and unseen data, often called prediction. Scaling inference is therefore different from scaling data loading and training, mainly because inference normally only uses a single data point (or a few). This means we can neither parallelize the data loading nor parallelize across multiple GPUs (at least not in any efficient way), so those techniques are of no use to us when doing inference. Additionally, inference is often not performed on machines that can do large computations: most inference today is done either on edge devices, e.g. mobile phones, or in low-cost, low-compute cloud environments. Thus, we need to be smarter about how we scale inference than just throwing more computing power at it.

                          In this module, we are going to look at various ways that you can either reduce the size of your model or make your model faster. Both are important for running inference fast regardless of the setup you are running your model on. We want to note that this is still very much an active area of research and therefore best practices for what to do in a specific situation can change.

                          "},{"location":"s9_scalable_applications/inference/#choosing-the-right-architecture","title":"Choosing the right architecture","text":"

                          Assume you are starting a completely new project and have to come up with a model architecture for it. What is your strategy? The common approach is to look at prior work on problems similar to the one you are facing and either directly choose the same architecture or create some slight variation of it. This is a great way to get started, but the architecture you end up choosing may be optimal in terms of performance and not in terms of inference speed.

                          The fact is that not all base architectures are created equal, and a 10K parameter model with one architecture can have a significantly different inference speed than a 10K parameter model with another architecture. For example, consider the figure below, which compares a number of models from the timm package, colored based on their base architecture. The general trend is that the number of images a model can process per second (y-axis) is inversely proportional to its number of parameters (x-axis). However, we generally see that convolutional base architectures (conv) are more efficient than transformers (vit) for the same parameter budget.

                          Image credit"},{"location":"s9_scalable_applications/inference/#exercises","title":"\u2754 Exercises","text":"

                          As discussed in this blogpost, the largest increase in inference speed you will see (given some specific hardware) comes from choosing an efficient model architecture. In the exercises below we are going to investigate the inference speed of different architectures.

                          1. Start by checking out this table which contains a list of pretrained weights in torchvision. Try finding an

                            • Efficient net
                            • Resnet
                            • Transformer based

                            model that has in the range of 20-30 million parameters.

                          2. Write a small script that first initializes all the models, then creates a dummy input tensor of shape [100, 3, 256, 256] and measures the time it takes each model to do a forward pass on the input tensor. Make sure to repeat the measurement multiple times to get a good average.

                            Solution

                            In this solution, we have chosen to use the efficientnet b5 (30.4M parameters), resnet50 (25.6M parameters) and the swin v2 transformer tiny (28.4M parameters) models.

                            import time\n\nimport torch\nfrom torchvision import models\n\nmodel_list = [\"efficientnet_b5\", \"resnet50\", \"swin_v2_t\"]\nimage = torch.randn(100, 3, 256, 256)\n\nn_reps = 10\nfor m in model_list:\n    model = models.get_model(m)\n    model.eval()  # put the model in evaluation mode before timing\n    with torch.no_grad():  # no gradients are needed for pure inference\n        tic = time.time()\n        for _ in range(n_reps):\n            _ = model(image)\n        toc = time.time()\n    print(f\"Model {m} took: {(toc - tic) / n_reps}\")\n
                          3. Do the results make sense? Based on the above figure we would expect efficientnet to be faster than resnet, which should be faster than the transformer-based model. Is this also what you are seeing?

                          4. To figure out why one network is more efficient than another, we can try to count the operations each network needs to perform during inference. An operation here we define as a FLOP (floating point operation), which is any mathematical operation (such as +, -, *, /) or assignment that involves floating-point numbers. Luckily for us, someone has already created a Python package for calculating this in PyTorch: ptflops

                            1. Install the package

                              pip install ptflops\n
                            2. Try calling the get_model_complexity_info function from the ptflops package on the networks from the previous exercise. What are the results?

                              Solution
                              from ptflops import get_model_complexity_info\nfrom torchvision import models\n\nmodel_list = [\"efficientnet_b5\", \"resnet50\", \"swin_v2_t\"]\nfor model_name in model_list:\n    macs, params = get_model_complexity_info(\n        models.get_model(model_name),  # use the current model, not always the first one in the list\n        (3, 256, 256),\n        backend='pytorch',\n        print_per_layer_stat=False,\n    )\n    print(f\"Model {model_name} has {params} parameters and uses {macs}\")\n
                          5. In the table from the initial exercise you could also see the overall performance of each network on the Imagenet-1K dataset. Given this performance, the inference speed and the FLOPs count, which network would you choose to use in a production setting? Discuss what should be considered when choosing one over another.

                          "},{"location":"s9_scalable_applications/inference/#quantization","title":"Quantization","text":"

                          Quantization is a technique where all computations are performed with integers instead of floats. We are essentially taking all continuous signals and converting them into discretized signals.

                          Image credit

                          As discussed in this blogpost series, float (32-bit) is the primarily used precision in machine learning because it strikes a good balance between memory consumption, precision and computational requirements. However, during inference we can often take advantage of quantization to improve the speed of our model. For instance:

                          • Floating-point computations are slower than integer operations

                          • Recent hardware has specialized units for doing integer operations

                          • Many neural networks are actually not bottlenecked by how many computations they need to do but by how fast data can be transferred, i.e. the memory bandwidth and cache of your system are the limiting factors. Therefore, working with 8-bit integers instead of 32-bit floats means that we can move data around approximately 4 times as fast.

                          • Storing models in integers instead of floats saves us approximately 75% of the RAM/hard-disk space whenever we save a checkpoint. This is especially useful in relation to deploying models using Docker (as you hopefully remember), as it will lower the size of our Docker images.

                          But how do we convert between floats and integers in quantization? In most cases we use a linear affine quantization:

                          $$ x_{int} = \\text{round}\\left( \\frac{x_{float}}{s} + z \\right) $$

                          where $s$ is a scale and $z$ is the so-called zero point. But how does this relate to doing inference in a neural network? The figure below shows all the conversions that we need to make to our standard inference pipeline to actually do the computations in a quantized format.

                          Image credit"},{"location":"s9_scalable_applications/inference/#exercises_1","title":"\u2754 Exercises","text":"
                          1. Let's look at how quantized tensors look in PyTorch

                            1. Start by creating a tensor that contains random numbers

                            2. Next, call the torch.quantize_per_tensor function on the tensor. What does the quantized tensor look like? How do the values relate to the scale and zero_point arguments?

                            3. Finally, try to call the .dequantize() method on the quantized tensor. Do you get back a tensor that is close to what you initially started out with? (A small sketch of these three steps is shown below.)
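                            If you get stuck, a small sketch of what these three steps could look like (the scale and zero point values are arbitrary choices):

                              import torch\n\nx = torch.randn(4)  # a tensor with random floats\nxq = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)\nprint(xq)               # the values are rounded to multiples of the scale\nprint(xq.int_repr())    # the raw int8 representation\nprint(xq.dequantize())  # close to x, but with small rounding errors\n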

                          2. As you hopefully saw in the first exercise, we make a number of rounding errors when doing quantization, and naively we would expect these to accumulate and lead to a much worse model. However, in practice we observe that quantization still works, and we actually have a mathematically sound reason for this. Can you figure out why quantization still works despite all the small rounding errors? HINT: it has to do with the central limit theorem

                          3. Let's move on to quantization of our model. Follow this tutorial from PyTorch on how to do quantization. The goal is to construct a model model_fp32 that works on normal floats and a quantized version model_int8. For simplicity you can just use one of the models from the tutorial.
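                            If you just want to see the effect quickly before working through the full tutorial, dynamic quantization (a simpler variant than the static quantization used in the tutorial) can be applied in a couple of lines; resnet18 is only used as a stand-in here for whatever float model you choose:

                            import torch\nfrom torchvision import models\n\nmodel_fp32 = models.get_model(\"resnet18\")\nmodel_fp32.eval()\n\n# dynamic quantization: the weights of the listed layer types are converted to 8-bit integers,\n# while activations are quantized on the fly during inference\nmodel_int8 = torch.quantization.quantize_dynamic(model_fp32, {torch.nn.Linear}, dtype=torch.qint8)\n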

                          4. Let's try to benchmark our quantized model and see if all the trouble we went through actually paid off. Also try to perform the benchmark on the non-quantized model and see if there is a difference. If you do not see an improvement, explain why that may be.

                          "},{"location":"s9_scalable_applications/inference/#pruning","title":"Pruning","text":"

                          Pruning is another way of reducing the model size and possibly improving the performance of our network. As the figure below illustrates, in pruning we simply remove weights in our network that we do not consider important for the task at hand. By removing, we here mean that the weight gets set to 0. There are many ways to determine if a weight is important, but the general rule is that the importance of a weight is proportional to its magnitude. This makes intuitive sense, since weights in all linear operations (fully connected or convolutional) are always multiplied onto the incoming value, so a small weight means a small outgoing activation.

                          Image credit"},{"location":"s9_scalable_applications/inference/#exercises_2","title":"\u2754 Exercises","text":"
                          1. We provide a starting script that implements the famous LeNet in this file. Open and run it just to make sure that you know the network.

                          2. PyTorch already has some pruning methods implemented in its package. Import the prune module from torch.nn.utils in the script.

                          3. Try to prune the weights of the first convolutional layer by calling

                            prune.random_unstructured(module_1, name=\"weight\", amount=0.3)  # (1)!\n
                            1. You can read about the prune method here.

                            Try printing named_parameters and named_buffers before and after the module is pruned. Can you explain the difference, and what is the connection to the module_1.weight attribute?

                          4. Try pruning the bias of the same module, this time using the l1_unstructured function from the pruning module. Again, check named_parameters and named_buffers to make sure you understand the difference between the random unstructured pruning from before and L1 unstructured pruning.

                          5. Instead of pruning only a single module in the model, let's try pruning the whole model. To do this we just need to iterate over all named_modules in the model like this:

                            for name, module in new_model.named_modules():\n    # only prune layers that actually have a weight parameter (skip e.g. activations and the root module)\n    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):\n        prune.l1_unstructured(module, name='weight', amount=0.2)\n

                            But what if we wanted to apply different pruning to different layers? Implement a pruning scheme where

                            • The weights of convolutional layers are L1 pruned with amount=0.2
                            • The weights of linear layers are unstructured pruned with amount=0.4

                            Run print(dict(new_model.named_buffers()).keys()) after the pruning to confirm that all weights have been correctly pruned.

                          6. The pruning we have looked at until now has only been local in nature, i.e. we have applied the pruning independently for each layer, not accounting globally for how much we should actually prune. As you may realize, this can quickly lead to a network that is pruned too much. Instead, the more common approach is to prune globally, where we remove the smallest X amount of connections across the whole network:

                            1. Start by creating a tuple over all the weights with the following format

                              parameters_to_prune = (\n    (model.conv1, 'weight'),\n    # fill in the rest of the modules yourself\n    (model.fc3, 'weight'),\n)\n

                              The tuple needs to have length 5. Challenge: can you construct the tuple using for loops, such that the code works for networks of arbitrary size?

                            2. Next prune using the global_unstructured function to globally prune the tuple of parameters

                              prune.global_unstructured(\n    parameters_to_prune,\n    pruning_method=prune.L1Unstructured,\n    amount=0.2,\n)\n
                            3. Check that the amount that has been pruned is actually equal to the 20% specified in the pruning call. We provide the following function that, for a given submodule (for example model.conv1), computes the amount of pruned weights

                              def check_prune_level(module: nn.Module):\n    sparsity_level = 100 * float(torch.sum(module.weight == 0) / module.weight.numel())\n    print(f\"Sparsity level of module {sparsity_level}\")\n
                          7. With a pruned network we really want to see if all our effort actually resulted in a network that is faster and/or smaller in memory. Do the following to the globally pruned network from the previous exercises:

                            1. First we need to make the pruning of our network permanent. Right now it is only semi-permanent as we are still keeping a copy of the original weights in memory. Make the change permanent by calling prune.remove on every pruned module in the model. Hint: iterate over the parameters_to_prune tuple.

                            2. Next try to measure the time of a single inference (repeated 100 times) for both the pruned and non-pruned network

                              import time\ntic = time.time()\nfor _ in range(100):\n    _ = network(torch.randn(100, 1, 28, 28))\ntoc = time.time()\nprint(toc - tic)\n

                              Is the pruned network actually faster? If not can you explain why?

                            3. Next, let's measure the size of our pruned network (called pruned_network) and a freshly initialized network (called network):

                              torch.save(pruned_network.state_dict(), 'pruned_network.pt')\ntorch.save(network.state_dict(), 'network.pt')\n

                              Look up the size of each file. Is the pruned network actually smaller? If not, can you explain why?

                            4. Repeat the last exercise, but this time start by converting all pruned weights to a sparse format by calling the .to_sparse() method on each pruned weight. Is the saved model smaller now?

                          This ends the exercises on pruning. As you probably realized in the last couple of exercises, pruning does not guarantee speedups out of the box. This is because linear operations in PyTorch do not handle sparse structures out of the box. To actually get speedups we would need to dive deep into sparse tensor operations, which still do not guarantee a speedup, because the performance of these operations depends on the sparsity structure of the pruned weights. Investigating this is out of scope for these exercises, but we highly recommend checking it out if you are interested in sparse networks.

                          "},{"location":"s9_scalable_applications/inference/#knowledge-distillation","title":"Knowledge distillation","text":"

                          Knowledge distillation is somewhat similar to pruning in the sense that it tries to find a smaller model that can perform equally well as a large model; however, it does so in a completely different way. Knowledge distillation is a model compression technique that builds on the work of Bucila et al., in which we try to distill/compress the knowledge of a large, complex model (also called the teacher model) into a simpler model (also called the student model).

                          The best known example of this is the DistilBERT model. DistilBERT is a smaller version of the large natural language processing model BERT, which achieves 97% of the performance of BERT while only containing 40% of the weights and being 60% faster. You can see in the figure below how much smaller it is compared to other models developed at the same time.

                          Image credit

                          Knowledge distillation works by assuming we have a big teacher model that is already performing well and that we want to compress. By running our training set through the large model we get a softmax distribution for each and every training sample. The goal of the student is to match both the original labels of the training data and the softmax distribution of the teacher model. The intuition behind doing this is that the teacher model needs to be more complex to learn the complex inter-class relationships from just (one-hot) labels. The student, on the other hand, is fed directly with softmax distributions from the teacher that explicitly encode these inter-class relationships, and it therefore does not need the same capacity to learn what the teacher learned.
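                          In code, the student objective is typically a weighted combination of the usual cross-entropy on the hard labels and a soft term that matches the teacher's distribution. A common (hedged) formulation, which uses a temperature T and is slightly more elaborate than the simplified objective in the exercises below, looks like this:

                          import torch\nimport torch.nn.functional as F\n\n\ndef distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):\n    \"\"\"Classic knowledge distillation objective: hard-label loss plus softened teacher matching.\"\"\"\n    hard_loss = F.cross_entropy(student_logits, target)  # match the one-hot labels\n    soft_loss = F.kl_div(\n        F.log_softmax(student_logits / T, dim=-1),\n        F.softmax(teacher_logits / T, dim=-1),\n        reduction=\"batchmean\",\n    ) * T * T  # the T^2 factor keeps gradient magnitudes comparable across temperatures\n    return alpha * hard_loss + (1 - alpha) * soft_loss\n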

                          Image credit"},{"location":"s9_scalable_applications/inference/#exercises_3","title":"\u2754 Exercises","text":"

                          Let's try implementing model distillation ourselves. We are going to see if we can achieve this on the cifar10 dataset. Do note that the exercises below can take quite a long time to finish, because they involve training multiple networks and therefore involve some waiting.

                          1. Start by installing the transformers and datasets packages from Huggingface

                            pip install transformers\npip install datasets\n

                            which we are going to use to download the cifar10 dataset and a teacher model.

                          2. Next download the cifar10 dataset

                            from datasets import load_dataset\ndataset = load_dataset(\"cifar10\")\n
                          3. Next, let's initialize our teacher model. For this we consider a large transformer-based model:

                            from transformers import AutoFeatureExtractor, AutoModelForImageClassification\nextractor = AutoFeatureExtractor.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\nmodel = AutoModelForImageClassification.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\n
                          4. To get the logits (un-normalized softmax scores) from our teacher model for a single datapoint from the training dataset you would extract it like this:

                            sample_img = dataset['train'][0]['img']\npreprocessed_img = extractor(dataset['train'][0]['img'], return_tensors='pt')\noutput =  model(**preprocessed_img)\nprint(output.logits)\n# tensor([[ 3.3682, -0.3160, -0.2798, -0.5006, -0.5529, -0.5625, -0.6144, -0.4671, 0.2807, -0.3066]])\n

                            Repeat this process for the whole training dataset and store the result somewhere (a rough sketch is shown below).
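                            A hedged sketch of how this precomputation could look; everything is simply kept in memory here, and for the full training set you may want to batch the forward passes and save the result to disk:

                            import torch\n\nmodel.eval()\nteacher_logits = []\nwith torch.no_grad():  # no gradients needed, we only want the teacher predictions\n    for sample in dataset[\"train\"]:\n        inputs = extractor(sample[\"img\"], return_tensors=\"pt\")\n        teacher_logits.append(model(**inputs).logits.squeeze(0))\nteacher_logits = torch.stack(teacher_logits)\ntorch.save(teacher_logits, \"teacher_logits.pt\")\n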

                          5. Implement a simple convolutional model. You can create a custom one yourself or use a small one from torchvision.

                          6. Train the model on cifar10 until convergence, so you have a baseline for how the model performs.

                          7. Redo the training, but this time add knowledge distillation to your training objective. It should look like this:

                            for batch in dataset:\n    # ...\n    img, target, teacher_logits = batch\n    preds = model(img)\n    loss = torch.nn.functional.cross_entropy(preds, target)\n    # the soft targets need to be probabilities, so run the stored teacher logits through a softmax\n    loss_teacher = torch.nn.functional.cross_entropy(preds, teacher_logits.softmax(dim=-1))\n    loss = loss + loss_teacher\n    loss.backward()\n    # ...\n
                          8. Compare the final performance obtained with and without knowledge distillation. Did the performance improve or not?

                          This ends the module on scaling inference in machine learning models.

                          "},{"location":"samples/","title":"Collection of sample applications","text":""},{"location":"tools/","title":"Tools","text":"

                          Just a collection of tools and scripts for running the course.

                          "}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 9e000f5347fd8caa856fb8e292b3e03bbb414032..7607e6e63288faf0a885e256820b6f15379cc7ae 100644 GIT binary patch delta 13 Ucmb=gXP58h;9#h*o5)@P02tc?h5!Hn delta 13 Ucmb=gXP58h;Al{@oycAR02)05vj6}9