diff --git a/s2_organisation_and_version_control/git/index.html b/s2_organisation_and_version_control/git/index.html index 9e4640be3..2ec36acf4 100644 --- a/s2_organisation_and_version_control/git/index.html +++ b/s2_organisation_and_version_control/git/index.html @@ -2033,7 +2033,7 @@

❔ Exercises

the repository belonging to the course. Now fork the project by clicking the Fork button.

forking

This will create a local copy of the repository which you have complete writing access to. Note that -code updates to the original repository does not update code in your local repository.

+code updates to the original repository do not update code in your local repository.

  • Clone your local fork of the project using git clone.

    @@ -2191,7 +2191,7 @@

    🧠 Knowledge check

    - January 6, 2024 + January 15, 2024 diff --git a/search/search_index.json b/search/search_index.json index c3c8cbb46..5cf443d66 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"

    Machine Learning Operations

    Repository for course 02476 at DTU.

    Check out the homepage!

    "},{"location":"#i-course-information","title":"\u2139\ufe0f Course information","text":""},{"location":"#course-setup","title":"\ud83d\udcbb Course setup","text":"

    Start by cloning or downloading this repository

    git clone https://github.com/SkafteNicki/dtu_mlops\n

    If you do not have git installed (yet), we will touch upon it in the course. The folder contains all the exercise material and lectures for this course. Additionally, you should join our Slack channel, which we use for communication. If the link has expired, write to me.

    "},{"location":"#course-organization","title":"\ud83d\udcc2 Course organization","text":"

    We highly recommend that you use the homepage when going through the material. It is the corresponding Github Pages version of this repository, which is more nicely rendered and also includes some special HTML magic provided by Material for MkDocs.

    The course is divided into sessions, denoted by capital S, and modules, denoted by capital M. A session corresponds to a full day of work if you are following the course, meaning approximately 9 hours of work. Each session (S) corresponds to a topic within MLOps and consists of multiple modules (M) that each cover a tool within the session.

    Importantly, we distinguish between core modules and optional modules. Core modules are marked by

    Core Module

    at the top of their corresponding page. Going through the core modules is important for passing the course, but we still highly recommend that you also do the optional modules.

    "},{"location":"#mlops-what-is-it","title":"\ud83c\udd92 MLOps: What is it?","text":"

    Machine Learning Operations (MLOps) is a rather new field that has emerged as machine learning, and particularly deep learning, has become a widely available technology. The term itself is a compound of \"machine learning\" and \"operations\" and covers everything that has to do with managing the production ML lifecycle.

    The lifecycle of production ML can largely be divided into three phases:

    1. Design: The initial phase starts with an investigation of the problem. Based on this analysis, several requirements can be prioritized for what we want our future model to do. Since machine learning requires data for training, we also investigate in this step what data we have and whether we need to source it in some other way.

    2. Model development: Based on the design phase we can begin to conjure some machine learning algorithms to solve our problems. As always, the initial step often involves doing some data analysis to make sure that our model is learning the signal that we want it to learn. Second comes the machine learning engineering phase, where the particular model architecture is chosen. Finally, we also need to do validation and testing to make sure that our model generalizes well.

    3. Operations: Based on the model development phase, we now have a model that we want to use. The operations phase is where we create an automatic pipeline that makes sure that whenever we make changes to our codebase they are automatically incorporated into our model, such that we do not slow down production. Equally important is the ongoing monitoring of already deployed models to make sure that they behave exactly as we specified.

    It is important to note that the three steps form a cycle, meaning that successfully deploying a machine learning model is not the end of it. Your initial requirements may change, forcing you to revisit the design phase. Some new algorithm may show promising results, so you revisit the model development phase to implement it. Finally, you may try to cut the cost of running your model in production, making you revisit the operations phase and try to optimize some steps.

    The focus in this course is particularly on the Operations part of MLOps as this is what many data scientists are missing in their toolbox to take all the knowledge they have about data processing and model development into a production setting.

    "},{"location":"#learning-objectives","title":"\u2754 Learning objectives","text":"

    General course objective

    Introduce the student to a number of coding practices that will help them organize, scale, monitor and deploy machine learning models either in a research or a production setting. To provide hands-on experience with a number of frameworks, both local and in the cloud, for working with large scale machine learning models.

    This includes:

    "},{"location":"#references","title":"\ud83d\udcd3 References","text":"

    Additional reading resources (in no particular order):

    Other courses with content similar to this:

    "},{"location":"#contributing","title":"\ud83d\udc68\u200d\ud83c\udfeb Contributing","text":"

    If you want to contribute to the course, we are happy to have you! Anything from fixing typos to adding new content is welcome. Building the course material locally is a simple two-step process:

    pip install -r requirements.txt\nmkdocs serve\n

    This will start a local server that you can access at localhost:8000, which automatically updates when you make changes to the course material. When you have something you want to contribute, please make a pull request.

    "},{"location":"#license","title":"\u2755 License","text":"

    I highly value open source, and the content of this course is therefore free to use under the Apache 2.0 license. If you use parts of this course in your work, please cite using:

    @misc{skafte_mlops,\n    author       = {Nicki Skafte Detlefsen},\n    title        = {Machine Learning Operations},\n    howpublished = {\\url{https://github.com/SkafteNicki/dtu_mlops}},\n    year         = {2024}\n}\n
    "},{"location":"challenges/","title":"Challenges","text":"

    If you have managed to go through all the other material, congratulations, you are already well on your way to becoming an MLOps engineer with a great overview of tools, concepts and techniques within the field. Below are listed some technically hard problems regarding MLOps. These are meant as inspiration to get you to dive deeper into using all the cloud services that GCP offers. You are also free to continue working on your project.

    "},{"location":"faq/","title":"Frequently asked questions","text":"

    For further questions, please contact Nicki.

    "},{"location":"faq/#is-it-possible-to-attend-the-course-fully-online","title":"Is it possible to attend the course fully online \u2754","text":"

    Mostly yes. All exercises are provided online and lectures will be recorded and streamed. However, do note that

    Overall we try to support flexible learning as much as possible with some limitations.

    "},{"location":"faq/#what-are-the-prerequisites-for-taking-this-course","title":"What are the prerequisites for taking this course \u2754","text":"

    We recommend that you have a basic understanding of machine learning concepts such as what a dataset is, what probabilities are, what a classifier is, what overfitting means etc. This corresponds to the curriculum covered in course 02450. The actual focus of the course is not on machine learning models, but we will be using these basic concepts throughout the exercises.

    Additionally, we recommend basic knowledge about deep learning and how to code in Pytorch, corresponding to the curriculum covered in 02456. From prior experience, we know that not all students have gained knowledge about deep learning models before this course, and we will be covering the basics of how to code in PyTorch in one of the first modules of the course to get everyone up to speed.

    "},{"location":"faq/#i-will-be-missing-x-days-of-the-course-will-that-be-a-problem","title":"I will be missing X days of the course, will that be a problem \u2754","text":"

    Depends. The course is fairly intensive, with most students working from 9-17 every day. If you already know that you will be missing X days of the course, then I highly recommend that you go through some of the first sessions before the course starts to give yourself a bit of breathing room. If you are not able to do so, please be aware that an additional effort may be needed from you to keep up with your fellow students.

    "},{"location":"faq/#how-many-should-we-be-in-a-group-for-the-projects","title":"How many should we be in a group for the projects \u2754","text":"

    Between 3 and 5. The projects are designed to be done in groups meaning that we intentionally make them too big for one person to do alone. Luckily, a lot of the work that you need to do can be done in parallel, so it is not as bad as it sounds.

    "},{"location":"faq/#when-will-the-exam-take-place","title":"When will the exam take place \u2754","text":"

    The oral part of the exam, which is a small project demo, always falls on the last day of the course. For January 2024, this means the 19th. The written part, which is a small project report, should be handed in at midnight on the final course day.

    "},{"location":"faq/#where-can-i-find-information-regarding-the-exam","title":"Where can I find information regarding the exam \u2754","text":"

    Look at the bottom of this page. Details will be updated as we get closer to the exam date.

    "},{"location":"faq/#can-i-use-chatgpt-or-similar-tools-for-the-exercises-project-exam-report-coding-writing","title":"Can I use ChatGPT or similar tools for the exercises, project, exam report (coding + writing) \u2754","text":"

    Yes, yes, and yes, but remember that it's a tool and you need to validate the output before using it. For the exam report we would prefer that you formulate the answers in your own words, because it is intended for you to describe what you have been doing in your project. The I in LLM stands for intelligence.

    "},{"location":"faq/#i-am-a-foreign-student-and-my-home-university-doesnt-accept-passnot-pass-what-can-i-do","title":"I am a foreign student and my home university doesn't accept pass/not pass, what can I do \u2754","text":"

    We can give a grade on the Danish 7-point grading scale for foreign students who need it because their home university does not accept pass/not pass. You need to contact the course responsible Nicki within the first week of the course to request this. Secondly, make sure to also inform us about it during the oral part of the exam, because we need to ask you additional questions to be able to give an exact grade.

    "},{"location":"faq/#i-am-a-euroteq-student-any-special-rules-for-me","title":"I am a EuroTEQ student, any special rules for me \u2754","text":"

    You will be allowed to attend the oral part of the exam online and we will provide a special Slack channel for you, trying to make sure that you get the same help as students from DTU who can attend the course on campus.

    "},{"location":"overview/","title":"Summary of course content","text":"

    There are a lot of moving parts in this course, so it may be hard to understand how it all fits together. This page provides a summary of the frameworks in this course, i.e. the stack of tools used. In the figure below we have provided an overview of how the different tools of the course interact with each other. The table after the figure provides a short description of each part.

    The MLOps stack in the course. This is just an example of one stack, and depending on your use case you may want to use a different stack of tools that better fits your needs. Regardless of the stack, the principles of MLOps are the same.

    Framework and description:

      • Pytorch is the backbone of our code; it provides the computational engine and the core data structures that we build on.
      • Pytorch lightning is a framework that provides a high-level interface to Pytorch. It provides a lot of the functionality that we need to train our models, such as logging, checkpointing, early stopping etc., so that we do not have to implement it ourselves. It also allows us to scale our models to multiple GPUs and multiple nodes.
      • We control the dependencies and python interpreter using Conda, which enables us to construct reproducible virtual environments.
      • For configuring our experiments we use Hydra, which allows us to define a hierarchical configuration structure in config files.
      • Using Weights and Biases allows us to track and log any values and hyperparameters for our experiments.
      • Whenever we run into performance bottlenecks with our code we can use the Profiler to find the cause of the bottleneck.
      • When we run into bugs in our code we can use the Debugger to find the cause of the bug.
      • For organizing our code and creating templates we can use Cookiecutter.
      • Docker is a tool that allows us to create a container that contains all the dependencies and code that we need to run our code.
      • For controlling the versions of our data and synchronizing between local and remote data storage, we can use DVC, which makes this process easy.
      • For version control of our code we use Git (together with Github), which allows multiple developers to work together on a shared codebase.
      • We can use Pytest to write unit tests for our code, to make sure that new changes to the code do not break the code base.
      • For linting our code and keeping a consistent coding style we can use tools such as Pylint and Flake8, which check our code for common mistakes and style issues.
      • For running our unit tests and other checks on our code in a continuous manner, e.g. after we commit and push our code, we can use Github actions, which automate this process.
      • Using Cloud build we can automate the process of building our docker images and pushing them to our container registry.
      • Container registry is a service that allows us to store our docker images for later use by other services.
      • For storing our data and trained models we can use Cloud storage, which provides a scalable and secure storage solution.
      • For general compute tasks we can use Compute engine, which provides a scalable and secure compute solution.
      • For training our experiments in an easy and scalable manner we can use Vertex AI.
      • For creating a REST API for our model we can use FastAPI, which provides a high-level interface for creating APIs.
      • For simple deployments of our code we can use Cloud functions, which allow us to run our code in response to events through simple python functions.
      • For more complex deployments of our code we can use Cloud run, which allows us to run our code in response to events through docker containers.
      • Cloud monitoring gives us the tools to keep track of important logs and errors from the other cloud services.
      • For monitoring whether our deployed model is experiencing any drift we can use Evidently AI, which provides a framework and dashboard for monitoring drift.
      • For monitoring the telemetry of our deployed model we can use OpenTelemetry, which provides a standard for collecting and exporting telemetry data.
    "},{"location":"projects/","title":"Project work","text":"

    Slides

    Approximately 1/3 of the course time is dedicated to doing project work. The projects will serve as the basis of your exam. In the project, you will essentially re-apply everything that you learn throughout the course to a self-chosen project. The overall goals of the project are:

    In the projects you are free to work on whatever problem you want. That said, we have one specific requirement: you need to incorporate some third-party framework into your project. If you want inspiration for projects, here are some examples:

    1. Classification of tweets

    2. Translating from English to German

    3. Classification of scientific papers

    4. Classification of rice types from images

    We hope most students will be able to form groups by themselves. Expected group size is between 3 and 5. If you are not able to form a group, please make sure to post in the #looking-for-group channel on Slack or make sure to be present on the 4th day of the course (the day before the project work starts) where we will help students that have not found a group yet.

    "},{"location":"projects/#open-source-tools","title":"Open-source tools","text":"

    We strive to keep the tools taught in this course as open-source as possible. The great thing about the open-source community is that whatever problem you are working on, there is probably some package out there that can get you at least 10% of the way. For the project, we want to reinforce this point, and you are required to include some third-party package, that is neither Pytorch nor one of the tools already covered in the course, into your project.

    If you have no idea what framework to include, the Pytorch ecosystem is a great place for finding open-source frameworks that can help you accelerate your own projects where Pytorch is the backbone. All tools in the ecosystem should work well together with Pytorch. However, it is important to note that the ecosystem is not a complete list of all the awesome packages that exist to extend the functionality of Pytorch. If you are still missing inspiration for frameworks to use, we highly recommend these three that have been used in previous years of the course:

    "},{"location":"projects/#project-days","title":"Project days","text":"

    Each project day is fully dedicated to project work, except for maybe an external inspirational lecture in the morning. The group decides exactly where they want to work on the project, how they want to work on it, how to distribute the workload etc. We strongly encourage you to parallelize work during the project, because there are a lot of tasks to do, but it is important that all group members have at least some understanding of the whole project.

    Remember that the focus of the project work is not to demonstrate that you can work with the biggest and baddest deep learning model, but instead that you show that you can incorporate the tools that are taught throughout the course in a meaningful way.

    Also note that the project is not expected to be very large in scope. It may simply be that you want to train X model on Y data. You will be given approximately 4 full days to work on the project. It is better to start out with a smaller project and then add complexity along the way if you have time.

    "},{"location":"projects/#day-1","title":"Day 1","text":"

    The first project day is all about getting started on the project and formulating exactly what you want to work on as a group.

    1. Start by brainstorming projects! Try to figure out exactly what you want to work with and begin to investigate what third-party package can support the project.

    2. When you have come up with an idea, write a project description. The description is the delivery for today and should be at least 300 words. Try to answer the following questions in the description:

      • Overall goal of the project
      • What framework are you going to use, and how do you intend to include the framework in your project?
      • What data are you going to run on (initially, may change)
      • What models do you expect to use
    3. (Optional) If you want to think more about the product design of your project, feel free to fill out the ML canvas (or part of it). You can read more about the different fields of the canvas here.

    4. After having written the project description, you can start on the actual coding of the project. In the next section, a to-do list is attached that summarizes what we are doing in the course. You are NOT expected to fulfill all bullet points from week 1 today.

    The project description will serve as a guideline for us at the exam to check that you have reached the goals that you set out to achieve. By the end of the day, you should commit your project description to the README.md file belonging to your project repository. If you filled out the ML canvas, feel free to include that as part of the README.md file. Also remember to commit whatever you have done on the project so far. When you have done this, go to DTU Learn and hand in (as a group) the link to your github repository as an assignment.

    We will briefly (before next Monday) look over your github repository and project description to check that everything is fine. If we have any questions/concerns we will contact you.

    "},{"location":"projects/#day-2","title":"Day 2","text":"

    The goal for today is simply to continue working on your project. Start with bullet points in the checklist from week 1 and continue with bullet points for week 2.

    "},{"location":"projects/#day-3","title":"Day 3","text":"

    Continue working on your project; today you should hopefully focus on the bullet points in the checklist from week 2. There is no delivery for this week, but make sure that you have committed all your progress at the end of the day. We will again briefly look over the repositories and will reach out to your group if we are worried about the progression of your project.

    "},{"location":"projects/#day-4","title":"Day 4","text":"

    We have now entered the final week of the course and the second-to-last project day. You are most likely continuing with bullet points from week 2, but you should hopefully begin to look at the bullet points from week 3 today. These are in general much more complex, so we recommend holding off on them until you have completed most of the points from week 2. We also recommend that you begin to fill out our report template.

    "},{"location":"projects/#day-5","title":"Day 5","text":"

    Today you are finishing your project. We recommend that you start by creating an architectural overview of your project similar to this figure. I recommend using draw.io for creating this kind of diagram, but feel free to use any tool you like. Otherwise, you should just continue working on your project, checking off as many bullet points as possible. Finally, you should also prepare yourself for the exam tomorrow.

    "},{"location":"projects/#project-checklist","title":"Project checklist","text":"

    Please note that the lists are exhaustive, meaning that I do not expect you to have completed every point on the checklist for the exam.

    "},{"location":"projects/#week-1","title":"Week 1","text":""},{"location":"projects/#week-2","title":"Week 2","text":""},{"location":"projects/#week-3","title":"Week 3","text":""},{"location":"projects/#additional","title":"Additional","text":""},{"location":"projects/#exam","title":"Exam","text":"

    The exam consists of a written and an oral element, and both contribute to the overall evaluation of whether you pass or do not pass the course.

    For the written part of the exam we provide a template folder called reports. As the first task you should copy the folder and all its content to your project repository. Then, your job is to fill out the README.md file, which contains the report template. The file itself contains instructions on how to fill it out and on using the included report.py file. You hand in the template by simply including it in your project repository. By midnight on 20/1 we will scrape it automatically, and changes after this point are therefore not registered.

    For the oral part of the exam you will be given a time slot where you have to show up for 5-7 min and give a very short demo of your project. What we are interested in seeing is essentially a live demo of your deployed application/project. We will possibly also ask questions regarding the overall curriculum of the course. Importantly, you should have your deployed application, the github repository with your project code, your W&B account and your GCP account ready before you enter the exam so we can quickly jump around. We will send out the time slots during the last week.

    "},{"location":"timeplan/","title":"Timeplan","text":"

    Slides

    The course is organised into exercise days (2/3 of the course) and project days (1/3 of the course).

    Exercise days start at 9:00 in the morning with a lecture (15-30 min) that gives some context for at least one of the topics of that day. Additionally, the previous day's exercises may briefly be touched upon. The remainder of the day will be spent solving exercises, either individually or in small groups. For some people the exercises may be quick to do, and for others they will take the whole day. We will provide help throughout the day. We will try to answer questions on Slack, but help will be prioritized for students physically on campus.

    Project days are intended for project work, and you are therefore responsible for agreeing with your group on when and where you are going to work. On the first project day there will be a lecture at 9:00 with project information. On other project days we may also start the day with an external lecture, which we highly recommend that you attend. During each project day we will have office hours where you can ask questions about the project.

    Below is an overall timeplan for each day, including the presentation topic of the day and the frameworks that you will be using in the exercises.

    Legend: \ud83d\udcdd Slides, \ud83c\udfa5 Recording.

    Note

    The dates listed below are for the January 2024 version of the course. The lectures and recordings are currently from the January 2023 version of the course. Please note that for January 2024, the first week starts on a Tuesday and ends on a Saturday.

    "},{"location":"timeplan/#week-1","title":"Week 1","text":"

    In the first week you will be introduced to a number of development practices for organizing and developing code, especially with a focus on making everything reproducible.

    Date | Day | Presentation topic | Frameworks | Format
    ---- | --- | ------------------ | ---------- | ------
    2/1 | Tuesday | Deep learning software \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) | Terminal, Conda, IDE, Pytorch | Exercises
    3/1 | Wednesday | MLOps: what is it? \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2023) | Git, CookieCutter, Pep8, DVC | Exercises
    4/1 | Thursday | Reproducibility \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) | Docker, Hydra | Exercises
    5/1 | Friday | Debugging \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) | Debugger, Profiler, Wandb, Lightning | Exercises
    6/1 | Saturday | Pytorch ecosystem \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) | - | Projects
    "},{"location":"timeplan/#week-2","title":"Week 2","text":"

    The second week is about automation and the cloud. Automation will help us make sure that our code does not break when we make changes to it. The cloud will help us scale up our applications, and we learn how to use different services to help develop a full machine learning pipeline.

    Date | Day | Presentation topic | Frameworks | Format
    ---- | --- | ------------------ | ---------- | ------
    8/1 | Monday | Continuous Integration \ud83d\udcdd \ud83c\udfa5 | Pytest, Github actions, Pre-commit, CML | Exercises
    9/1 | Tuesday | The Cloud \ud83d\udcdd \ud83c\udfa5 | GCP Engine, Bucket, Container registry, Vertex AI | Exercises
    10/1 | Wednesday | Deployment \ud83d\udcdd \ud83c\udfa5 | FastAPI, Torchserve, GCP Functions, Run | Exercises
    11/1 | Thursday | No lecture \ud83c\udfa5 | - | Projects
    12/1 | Friday | No lecture \ud83c\udfa5 | - | Projects
    "},{"location":"timeplan/#week-3","title":"Week 3","text":"

    For the final week we look into advanced topics such as monitoring and scaling of applications. Monitoring is especially important for the longevity of the applications that we develop: that we actually can deploy them either locally or in the cloud, and that we have the tools to monitor how they behave over time. Scaling of applications is an important topic if we ever want our applications to be used by many people at the same time.

    Date | Day | Presentation topic | Frameworks | Format
    ---- | --- | ------------------ | ---------- | ------
    15/1 | Monday | Monitoring \ud83d\udcdd \ud83c\udfa5 | Evidently AI, OpenTelemetry, Signoz | Exercises
    16/1 | Tuesday | Scalable applications \ud83d\udcdd \ud83c\udfa5 | Pytorch, Lightning | Exercises
    17/1 | Wednesday | - | - | Projects
    18/1 | Thursday | - | - | Projects
    19/1 | Friday | - | - | Exam
    "},{"location":"reports/","title":"Exam template for 02476 Machine Learning Operations","text":"

    This is the report template for the exam. Please only remove the text formatted with three dashes in front and behind, like:

    --- question 1 fill here ---

    where you should instead add your answers. Any other changes may have unwanted consequences when your report is auto-generated at the end of the course. For questions where you are asked to include images, start by adding the image to the figures subfolder (please only use .png, .jpg or .jpeg) and then add the following code in your answer:

    ![my_image](figures/<image>.<extension>)\n

    In addition to this markdown file, we also provide the report.py script that provides two utility functions:

    Running:

    python report.py html\n

    will generate an .html page of your report. After the deadline for answering this template, we will auto-scrape everything in this reports folder and then use this utility to generate an .html page that will serve as your final hand-in.

    Running

    python report.py check\n

    will check your answers in this template against the constraints listed for each question, e.g. is your answer too short or too long, and have you included an image when asked to.

    For both functions to work it is important that you do not rename anything. The script has two dependencies that can be installed with pip install click markdown.

    "},{"location":"reports/#overall-project-checklist","title":"Overall project checklist","text":"

    The checklist is exhaustive, which means that it includes everything that you could possibly do on the project in relation to the curriculum of this course. Therefore, we do not at all expect that you have checked off all boxes by the end of the project.

    "},{"location":"reports/#week-1","title":"Week 1","text":""},{"location":"reports/#week-2","title":"Week 2","text":""},{"location":"reports/#week-3","title":"Week 3","text":""},{"location":"reports/#additional","title":"Additional","text":""},{"location":"reports/#group-information","title":"Group information","text":""},{"location":"reports/#question-1","title":"Question 1","text":"

    Enter the group number you signed up on

    Answer:

    --- question 1 fill here ---

    "},{"location":"reports/#question-2","title":"Question 2","text":"

    Enter the study number for each member in the group

    Example:

    sXXXXXX, sXXXXXX, sXXXXXX

    Answer:

    --- question 2 fill here ---

    "},{"location":"reports/#question-3","title":"Question 3","text":"

    What framework did you choose to work with and did it help you complete the project?

    Answer length: 100-200 words.

    Example: We used the third-party framework ... in our project. We used functionality ... and functionality ... from the package to do ... and ... in our project.

    Answer:

    --- question 3 fill here ---

    "},{"location":"reports/#coding-environment","title":"Coding environment","text":"

    In the following section we are interested in learning more about your local development environment.

    "},{"location":"reports/#question-4","title":"Question 4","text":"

    Explain how you managed dependencies in your project. Explain the process a new team member would have to go through to get an exact copy of your environment.

    Answer length: 100-200 words

    Example: We used ... for managing our dependencies. The list of dependencies was auto-generated using ... . To get a complete copy of our development environment, one would have to run the following commands

    Answer:

    --- question 4 fill here ---

    "},{"location":"reports/#question-5","title":"Question 5","text":"

    We expect that you initialized your project using the cookiecutter template. Explain the overall structure of your code. Did you fill out every folder or only a subset?

    Answer length: 100-200 words

    Example: From the cookiecutter template we have filled out the ... , ... and ... folder. We have removed the ... folder because we did not use any ... in our project. We have added an ... folder that contains ... for running our experiments.

    Answer:

    --- question 5 fill here ---

    "},{"location":"reports/#question-6","title":"Question 6","text":"

    Did you implement any rules for code quality and format? Additionally, explain in your own words why these concepts matter in larger projects.

    Answer length: 50-100 words.

    Answer:

    --- question 6 fill here ---

    "},{"location":"reports/#version-control","title":"Version control","text":"

    In the following section we are interested in how version control was used in your project during development to collaborate and increase the quality of your code.

    "},{"location":"reports/#question-7","title":"Question 7","text":"

    How many tests did you implement and what are they testing in your code?

    Answer length: 50-100 words.

    Example: In total we have implemented X tests. Primarily we are testing ... and ... as these are the most critical parts of our application, but also ... .

    Answer:

    --- question 7 fill here ---

    "},{"location":"reports/#question-8","title":"Question 8","text":"

    What is the total code coverage (in percentage) of your code? If your code had a code coverage of 100% (or close to it), would you still trust it to be error free? Explain your reasoning.

    Answer length: 100-200 words.

    Example: The total code coverage of our code is X%, which includes all our source code. We are far from 100% coverage of our code, and even if we were then...

    Answer:

    --- question 8 fill here ---

    "},{"location":"reports/#question-9","title":"Question 9","text":"

    Did your workflow include using branches and pull requests? If yes, explain how. If not, explain how branches and pull requests can help improve version control.

    Answer length: 100-200 words.

    Example: We made use of both branches and PRs in our project. In our group, each member had a branch that they worked on in addition to the main branch. To merge code we ...

    Answer:

    --- question 9 fill here ---

    "},{"location":"reports/#question-10","title":"Question 10","text":"

    Did you use DVC for managing data in your project? If yes, then how did it improve your project to have version control of your data. If no, explain a case where it would be beneficial to have version control of your data.

    Answer length: 100-200 words.

    Example: We did make use of DVC in the following way: ... . In the end it helped us in ... for controlling ... part of our pipeline

    Answer:

    --- question 10 fill here ---

    "},{"location":"reports/#question-11","title":"Question 11","text":"

    Discuss your continuous integration setup. What kind of CI are you running (unit testing, linting, etc.)? Do you test multiple operating systems, python versions etc.? Do you make use of caching? Feel free to insert a link to one of your github actions workflows.

    Answer length: 200-300 words.

    Example: We have organized our CI into 3 separate files: one for doing ..., one for running ... testing and one for running ... . In particular for our ..., we used ... . An example of a triggered workflow can be seen here:

    Answer:

    --- question 11 fill here ---

    "},{"location":"reports/#running-code-and-tracking-experiments","title":"Running code and tracking experiments","text":"

    In the following section we are interested in learning more about the experimental setup for running your code and especially the reproducibility of your experiments.

    "},{"location":"reports/#question-12","title":"Question 12","text":"

    How did you configure experiments? Did you make use of config files? Explain with coding examples how you would run an experiment.

    Answer length: 50-100 words.

    Example: We used a simple argparser, that worked in the following way: python my_script.py --lr 1e-3 --batch_size 25

    Answer:

    --- question 12 fill here ---

    "},{"location":"reports/#question-13","title":"Question 13","text":"

    Reproducibility of experiments is important. Related to the last question, how did you ensure that no information is lost when running experiments and that your experiments are reproducible?

    Answer length: 100-200 words.

    Example: We made use of config files. Whenever an experiment is run the following happens: ... . To reproduce an experiment one would have to do ...

    Answer:

    --- question 13 fill here ---

    "},{"location":"reports/#question-14","title":"Question 14","text":"

    Upload 1 to 3 screenshots that show the experiments that you have done in W&B (or another experiment tracking service of your choice). This may include loss graphs, logged images, hyperparameter sweeps etc. You can take inspiration from this figure. Explain what metrics you are tracking and why they are important.

    Answer length: 200-300 words + 1 to 3 screenshots.

    Example: As seen in the first image we have tracked ... and ... which both inform us about ... in our experiments. As seen in the second image we are also tracking ... and ...

    Answer:

    --- question 14 fill here ---

    "},{"location":"reports/#question-15","title":"Question 15","text":"

    Docker is an important tool for creating containerized applications. Explain how you used docker in your experiments. Include how you would run your docker images and include a link to one of your docker files.

    Answer length: 100-200 words.

    Example: For our project we developed several images: one for training, inference and deployment. For example to run the training docker image: docker run trainer:latest lr=1e-3 batch_size=64. Link to docker file:

    Answer:

    --- question 15 fill here ---

    "},{"location":"reports/#question-16","title":"Question 16","text":"

    When running into bugs while trying to run your experiments, how did you perform debugging? Additionally, did you try to profile your code or do you think it is already perfect?

    Answer length: 100-200 words.

    Example: Debugging method was dependent on group member. Some just used ... and others used ... . We did a single profiling run of our main code at some point that showed ...

    Answer:

    --- question 16 fill here ---

    "},{"location":"reports/#working-in-the-cloud","title":"Working in the cloud","text":"

    In the following section we would like to know more about your experience when developing in the cloud.

    "},{"location":"reports/#question-17","title":"Question 17","text":"

    List all the GCP services that you made use of in your project and briefly explain what each service does.

    Answer length: 50-200 words.

    Example: We used the following two services: Engine and Bucket. Engine is used for... and Bucket is used for...

    Answer:

    --- question 17 fill here ---

    "},{"location":"reports/#question-18","title":"Question 18","text":"

    The backbone of GCP is the Compute engine. Explain how you made use of this service and what type of VMs you used.

    Answer length: 100-200 words.

    Example: We used the compute engine to run our ... . We used instances with the following hardware: ... and we started them using a custom container: ...

    Answer:

    --- question 18 fill here ---

    "},{"location":"reports/#question-19","title":"Question 19","text":"

    Insert 1-2 images of your GCP bucket, such that we can see what data you have stored in it. You can take inspiration from this figure.

    Answer:

    --- question 19 fill here ---

    "},{"location":"reports/#question-20","title":"Question 20","text":"

    Upload one image of your GCP container registry, such that we can see the different images that you have stored. You can take inspiration from this figure.

    Answer:

    --- question 20 fill here ---

    "},{"location":"reports/#question-21","title":"Question 21","text":"

    Upload one image of your GCP cloud build history, so we can see the history of the images that have been built in your project. You can take inspiration from this figure.

    Answer:

    --- question 21 fill here ---

    "},{"location":"reports/#question-22","title":"Question 22","text":"

    Did you manage to deploy your model, either locally or in the cloud? If not, describe why not. If yes, describe how and preferably how you invoke your deployed service.

    Answer length: 100-200 words.

    Example: For deployment we wrapped our model into an application using ... . We first tried locally serving the model, which worked. Afterwards we deployed it in the cloud, using ... . To invoke the service a user would call curl -X POST -F \"file=@file.json\" <weburl>

    Answer:

    --- question 22 fill here ---

    "},{"location":"reports/#question-23","title":"Question 23","text":"

    Did you manage to implement monitoring of your deployed model? If yes, explain how it works. If not, explain how monitoring would help the longevity of your application.

    Answer length: 100-200 words.

    Example: We did not manage to implement monitoring. We would like to have monitoring implemented such that over time we could measure ... and ... that would inform us about this ... behaviour of our application.

    Answer:

    --- question 23 fill here ---

    "},{"location":"reports/#question-24","title":"Question 24","text":"

    How many credits did you end up using during the project and what service was most expensive?

    Answer length: 25-100 words.

    Example: Group member 1 used ..., Group member 2 used ..., in total ... credits were spent during development. The service costing the most was ... due to ...

    Answer:

    --- question 24 fill here ---

    "},{"location":"reports/#overall-discussion-of-project","title":"Overall discussion of project","text":"

    In the following section we would like you to think about the general structure of your project.

    "},{"location":"reports/#question-25","title":"Question 25","text":"

    Include a figure that describes the overall architecture of your system and what services you make use of. You can take inspiration from this figure. Additionally, in your own words, explain the overall steps in the figure.

    Answer length: 200-400 words

    Example:

    The starting point of the diagram is our local setup, where we integrated ... and ... and ... into our code. Whenever we commit code and push to github, it auto-triggers ... and ... . From there the diagram shows ...

    Answer:

    --- question 25 fill here ---

    "},{"location":"reports/#question-26","title":"Question 26","text":"

    Discuss the overall struggles of the project. Where did you spend most time and what did you do to overcome these challenges?

    Answer length: 200-400 words.

    Example: The biggest challenge in the project was using the ... tool to do ... . The reason for this was ...

    Answer:

    --- question 26 fill here ---

    "},{"location":"reports/#question-27","title":"Question 27","text":"

    State the individual contributions of each team member. This is required information from DTU, because we need to make sure all members contributed actively to the project

    Answer length: 50-200 words.

    Example: Student sXXXXXX was in charge of setting up the initial cookiecutter project and developing the docker containers for training our applications. Student sXXXXXX was in charge of training our models in the cloud and deploying them afterwards. All members contributed to code by...

    Answer:

    --- question 27 fill here ---

    "},{"location":"s10_extra/","title":"Extra learning modules","text":"

    The modules listed here are not part of the core course, but expand on some of the other topics. Some of them may still be under construction and may in the future be moved into other sessions.

    "},{"location":"s10_extra/cli/","title":"M30 - Command Line Interfaces","text":""},{"location":"s10_extra/cli/#command-line-interfaces","title":"Command line interfaces","text":"

    If you have worked with python for some time you are probably familiar with the argparse package, which allows you to directly pass in additional arguments to your script in the terminal

    python my_script.py --arg1 val1 --arg2 val2\n
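    For reference, a minimal my_script.py that supports a call like the one above could look as follows (the argument names, types and defaults here are only placeholders):

    import argparse\n\n# define the command line interface of the script\nparser = argparse.ArgumentParser(description='Example script with two options.')\nparser.add_argument('--arg1', type=str, default='val1', help='First argument.')\nparser.add_argument('--arg2', type=str, default='val2', help='Second argument.')\nargs = parser.parse_args()  # parse whatever was passed in the terminal\nprint(args.arg1, args.arg2)\n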

    argparse is a very simple way of constructing what is called a command line interface (CLI). A CLI allows you to interact with your application directly in the terminal instead of having to change things in your code. It is essentially a text-based user interface (UI), in contrast to the graphical user interfaces (GUIs) that we know from all our desktop applications.

    However, one limitation of argparse is that it is not easy to define a CLI with subcommands. If we take git as an example, git is the main command, but it has multiple subcommands: push, pull, commit etc. that can all take their own arguments. This kind of CLI with subcommands is somewhat possible to do using only argparse, however it requires a bit of hacking.

    You could of course ask why we would want the possibility of defining such a CLI at all. The main argument is to give users of our code a single entry point to interact with our application instead of having multiple scripts. As long as all subcommands are properly documented, our interface should be simple to interact with (again think of git, where each subcommand can be given the -h arg to get specific help).

    Instead of using argparse we are going to look at the click package here. click extends the functionality of argparse to allow for easy definition of subcommands and many other things, which we are not going to touch upon in this module. For completeness we should also mention that click is not the only package for doing this; among other excellent frameworks for easily creating command line interfaces we can mention Typer.

    "},{"location":"s10_extra/cli/#exercises","title":"\u2754 Exercises","text":"

    Exercise files

    1. Install click

      pip install click\n
    2. Create a new python file greetings.py and add the following code:

      import click\n\n@click.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef hello(count, name):\n    \"\"\"Simple program that greets NAME for a total of COUNT times.\"\"\"\n    for x in range(count):\n        click.echo(f\"Hello {name}!\")\n\nif __name__ == '__main__':\n    hello()\n

      try running the program in the following ways

      python greetings.py\npython greetings.py --count=3\npython greetings.py --help\n
    3. Make sure you understand what the click.command() and click.option decorators do. You can find the full API docs here.

    4. As stated above, the power of using a tool like click is due to its ability to define subcommands. In click this is done through the click.group() decorator. To the code example from above, add another command:

      @click.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef howdy(count, name):\n    for x in range(count):\n        click.echo(f\"Howdy {name}!\")\n

      and by using the click.group() decorator make these commands into subcommands such that you are able to call the script in the following way (a minimal sketch of this pattern is included after the exercise list)

      python greetings.py hello\npython greetings.py howdy\n
    5. As a final exercise we provide you with a script that is ready to run as it is, but your job will be to turn it into a script with multiple subcommands, with multiple arguments for each subcommand.

      1. Start by taking a look at the provided code. It is a simple script that runs the K-nearest neighbour classification algorithm on the iris dataset and produces a plot of the decision boundary.

      2. Create a script that has the following subcommands with input arguments

        • Subcommand train: Load data, train model and save. Should take a single argument -o that specifies the filename the trained model should be saved to.
        • Subcommand infer: Load trained model and runs prediction on input data. Should take two arguments: -i that specifies which trained model to load and -d to specify a user defined datapoint to run inference on.
        • Subcommand plot: Load trained model and constructs the decision boundary plot from the code. Should take two arguments: -i that specifies a trained model to load and -o the file to write the generated plot to
        • Subcommand optim: Load data, runs hyperparameter optimization and prints optimal parameters. Should at least take a single argument that in some way adjust the hyperparameter optimization (free to choose how)

        In the end we would like the script to be callable in the following ways

        python main.py train -o 'model.ckpt'\npython main.py infer -i 'model.ckpt' -d [[0,1]]\npython main.py plot -i 'model.ckpt' -o 'generated_plot.png'\npython main.py optim\n
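    As a hint for the exercises above, a minimal sketch of the click.group() pattern could look as follows (the group name cli is just a placeholder, and howdy would be registered in the same way as hello):

    import click\n\n@click.group()\ndef cli():\n    \"\"\"Single entry point that collects the subcommands.\"\"\"\n\n@cli.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef hello(count, name):\n    for x in range(count):\n        click.echo(f\"Hello {name}!\")\n\n# howdy (and train/infer/plot/optim in the last exercise) is registered the same way with @cli.command()\n\nif __name__ == '__main__':\n    cli()  # dispatches to the chosen subcommand\n

    With this structure, python greetings.py --help lists the available subcommands, and python greetings.py hello runs only the hello command.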
    "},{"location":"s10_extra/design/","title":"Designing MLOps pipelines","text":"

    Danger

    Module is still under development

    \"Machine learning engineering is 10% machine learning and 90% engineering.\" - Chip Huyen

    We highly recommend that you read the book Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications by Chip Huyen, which gives a fantastic overview of the thought processes that go into designing modern machine learning systems.

    "},{"location":"s10_extra/design/#the-stack","title":"The stack","text":"

    Have you ever encountered the concept of a full stack developer? A full stack developer is a developer who can develop both client and server software or, in more general terms, a developer who can take care of the complete development pipeline.

    Below is an image of the massive number of tools that exist under the MLOps umbrella.

    "},{"location":"s10_extra/design/#visualizing-the-design","title":"Visualizing the design","text":""},{"location":"s10_extra/documentation/","title":"M31 - Documentation","text":""},{"location":"s10_extra/documentation/#documentation","title":"Documentation","text":"

    In today's rapidly evolving software development landscape, effective documentation is a crucial component of any project. The ability to create clear, concise, and user-friendly technical documentation can make a significant difference in the success of your codebase. We have all probably encountered code that we wanted to use, only to abandon it because it was missing the documentation we needed to get started with it.

    Technical documentation or code documentation can be many things:

    and many more. In this module we are going to focus on setting up a very basic documentation system that automatically helps you document the API of your code. For this reason we recommend that, before continuing with this module, you have completed module M7 on good coding practices or have similar experience with writing docstrings for python functions and classes.

    There are different systems for writing documentation. In fact there are a lot to choose from:

    It is important to note that all of these are static site generators. The word static here refers to the fact that once the content is generated and served on a website, the underlying HTML code does not change. It may contain HTML elements that are dynamic (like video), but the site itself does not change (1).

    1. Good examples of dynamic sites are any social media or news media where new posts, pages etc. are constantly added over time. Good examples of static sites are documentation, blogposts etc.

    In this module we are going to look at Mkdocs, which (in my opinion) is one of the easiest systems to get started with because all documentation is written in markdown and the build system is written in Python. As an alternative, you can consider doing the exercises in Sphinx, which is probably the most used documentation system for Python code. Sphinx offers more customization than Mkdocs and is therefore generally preferred for larger projects with complex documentation, but for smaller projects Mkdocs should be easier to get started with and is sufficient.

    Mkdocs by default does not include many features, and for that reason we are going to dive directly into using the Material for Mkdocs theme, which provides a lot of nice customization for creating professional static sites. In fact, this whole course is written in mkdocs using the material theme.

    "},{"location":"s10_extra/documentation/#mkdocs","title":"Mkdocs","text":"

    The core file when using mkdocs is the mkdocs.yml file, which is the configuration file for the project:

    site_name: Documentation of my project\nsite_author: Jane Doe\ndocs_dir: source # (1)!\n\ntheme:\n    language: en\n    name: material # (2)!\n    features: # (3)!\n    - content.code.copy\n    - content.code.annotate\n\nplugins: # (4)!\n    - search\n    - mkdocstrings\n\nnav: # (5)!\n  - Home: index.md\n
    1. This indicates the source directory of our documentation. If the layout of your documentation is a bit different from what is described above, you may need to change this.

    2. The overall theme of your documentation. We recommend the material theme but there are many more to choose from and you can also create your own.

    3. The features section is where features that are supported by your given theme can be enabled. In this example we have enabled the content.code.copy feature, which adds a small copy button to all code blocks, and the content.code.annotate feature, which allows you to add annotations like this box to code blocks.

    4. Plugins add new functionality to your documentation. In this case we have added two plugins that add functionality for searching through our documentation and for automatically adding documentation from docstrings. Remember that some plugins require you to install additional Python packages, so remember to add them to your requirements.txt file.

    5. The nav section is where you define the navigation structure of your documentation. When you add new .md files to the source folder you then need to add them to the nav section.

    And that is more or less what you need to get started. In general, if you need help with configuration of your documentation in mkdocs I recommend looking at this page and this page.

    "},{"location":"s10_extra/documentation/#exercises","title":"Exercises","text":"

    In this set of exercises we assume that you have completed module M6 on code structure and therefore have a repository that at least contains the following:

    \u251c\u2500\u2500 pyproject.toml     <- Project configuration file with package metadata\n\u2502\n\u251c\u2500\u2500 docs               <- Documentation folder\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 index.md       <- Homepage for your documentation\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 mkdocs.yml     <- Configuration file for mkdocs\n\u2502   \u2502\n\u2502   \u2514\u2500\u2500 source/        <- Source directory for documentation files\n\u2502\n\u2514\u2500\u2500 src                <- Source code for use in this project.\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 __init__.py    <- Makes src a Python module\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 models         <- model implementations, training script\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u251c\u2500\u2500 model.py\n\u2502   \u2502   \u251c\u2500\u2500 train_model.py\n...\n

    It is not important exactly what is in the src folder for the exercises, but we are going to refer to the above structure in the exercises, so adjust accordingly if you deviate from this. Additionally, we are going to assume that your project code is installed in your environment such that it can be imported as normal Python code.

    1. We are going to need two python packages to get started: mkdocs and material for mkdocs. Install with

      pip install \"mkdocs-material >= 4.8.0\" # (1)!\n
      1. Since mkdocs is a dependency of mkdocs-material we only need to install the latter.
    2. Run in your terminal (from the docs folder):

      mkdocs serve # (1)!\n
      1. mkdocs serve will automatically rebuild the whole site whenever you save a file inside the docs folder. This is not a problem if you have a fairly small site without that many pages (or elements), but it can take a long time for large sites. Consider running with the --dirty option to only re-build the site for files that have changed.

      which should render the index.md file as the homepage. You can leave the documentation server running during the remaining exercises.

    3. We are now ready to document the API of our code:

      1. Make sure you have at least one function and class inside your src module. If you do not, you can for simplicity copy the following module to the src/models/model.py file

        import torch\n\nclass MyNeuralNet(torch.nn.Module):\n    \"\"\"Basic neural network class.\n\n    Args:\n        in_features: number of input features\n        out_features: number of output features\n\n    \"\"\"\n    def __init__(self, in_features: int, out_features: int) -> None:\n        super().__init__()  # initialize the parent nn.Module before assigning submodules\n        self.l1 = torch.nn.Linear(in_features, 500)\n        self.l2 = torch.nn.Linear(500, out_features)\n        self.r = torch.nn.ReLU()\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass of the model.\n\n        Args:\n            x: input tensor expected to be of shape [N,in_features]\n\n        Returns:\n            Output tensor with shape [N,out_features]\n\n        \"\"\"\n        return self.l2(self.r(self.l1(x)))\n

        and add the following function to the src/predict_model.py file:

        import torch\n\n\ndef predict(\n    model: torch.nn.Module,\n    dataloader: torch.utils.data.DataLoader,\n) -> torch.Tensor:\n    \"\"\"Run prediction for a given model and dataloader.\n\n    Args:\n        model: model to use for prediction\n        dataloader: dataloader with batches\n\n    Returns:\n        Tensor of shape [N, d] where N is the number of samples and d is the output dimension of the model\n\n    \"\"\"\n    return torch.cat([model(batch) for batch in dataloader], 0)  # concatenate the per-batch predictions\n
      2. Add a markdown file to the docs/source folder called my_api.md and add that file to the nav: section in the mkdocs.yml file.

      3. To that file add the following code:

        # My API\n\n::: src.models.model.MyNeuralNet\n\n::: src.predict_model.predict\n

        The ::: indicator tells mkdocs that it should look for the corresponding function/module and then render it on the given page. Thus, if your function/module is located somewhere else, change the paths accordingly.

      4. Make sure that the documentation correctly includes your function and module on the given page.

      5. (Optional) Include more functions/modules in your documentation.

    4. (Optional) Look through the documentation for mkdocstrings and try to improve the layout a bit. In particular, the headings, docstrings and signatures could be of interest to adjust; a hypothetical sketch of where such options live is shown below.
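
      Such options are set under the mkdocstrings plugin in mkdocs.yml. The snippet below is only a sketch (the specific option values are assumptions, adjust them to your liking):

      plugins:\n  - search\n  - mkdocstrings:\n      handlers:\n        python:\n          options:\n            docstring_style: google   # assumption: match the style you write docstrings in\n            show_source: true         # include a collapsible block with the source code\n            show_root_heading: true   # render a heading for the documented object itself\n            separate_signature: true  # render the signature on its own line\n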

    5. Finally, try to build a final version of your documentation

      mkdocs build\n

      this should result in a site folder that contains the actual HTML code for your documentation.

    "},{"location":"s10_extra/documentation/#publish-your-documentation","title":"Publish your documentation","text":"

    To publish your documentation you need a place to host your built documentation, e.g. the content of the site folder you built in the last exercise. There are many places to host your documentation, but if you only need a static site and are already hosting your code through Github, then a good option is Github Pages. Github Pages is free to use for your public projects.

    Before getting started with this set of exercises you should have completed module M16 on github actions so you already know about workflow files.

    "},{"location":"s10_extra/documentation/#exercises_1","title":"Exercises","text":"
    1. Start by adding a new file called deploy_docs.yaml to the .github/workflows folder. Add the following code to that file and save it.

      name: Deploy docs\n\non:\n  push:\n    branches:\n      - main\n\npermissions:\n  contents: write # (1)\n\njobs:\n  deploy:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v3\n        with:\n          fetch-depth: 0\n      - uses: actions/setup-python@v4\n        with:\n          python-version: \"3.10\" # quoted so YAML does not parse it as the number 3.1\n      - uses: actions/cache@v2\n        with:\n          key: ${{ github.ref }}\n          path: .cache\n      - run: pip install -r requirements.txt\n      - run: mkdocs gh-deploy --force\n
      1. It is important to give write permissions to this action because it is not only reading your code but will actually also push code.

      Before continuing, make sure you understand what the different steps of the workflow do, and we especially recommend looking at the documentation of the mkdocs gh-deploy command.
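
      Note that the workflow installs from requirements.txt, so that file needs to list the packages used to build your documentation. As a minimal, hypothetical sketch (assuming the material theme and mkdocstrings setup from the exercises above):

      mkdocs-material >= 4.8.0\nmkdocstrings[python]\n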

    2. Commit and push the file. Check that the action is executed and, if it succeeds, that your built site is pushed to a branch called gh-pages. If the action does not succeed, then figure out what is wrong and fix it!

    3. After confirming that your action is working, you need to configure Github to actually publish the content being built by Github Actions. Do the following:

      • Go to the Settings tab and then the Pages subsection
      • In the Source setting choose the Deploy from a branch
      • In the Branch setting choose the gh-pages branch and /(root) folder and save

      This should then start deploying your site to https://<your-username>.github.io/<your-reponame>/. If it does not, you may need to recommit and trigger the Github Actions build again.

    4. Make sure your documentation is published and looks as it should.

    This ends the module on technical documentation. We cannot stress enough how important it is to write proper documentation for larger projects that need to be maintained over a longer time. It is often an iterative process, but it is usually best to do it while writing the code.

    "},{"location":"s10_extra/frontend/","title":"Frontend","text":"

    Danger

    Module is still under development

    "},{"location":"s10_extra/frontend/#streamlit","title":"Streamlit","text":"

    streamlit

    "},{"location":"s10_extra/frontend/#exercises","title":"\u2754 Exercises","text":"
    1. Start by installing streamlit
    pip install streamlit\n

    and run streamlit hello afterwards to check that everything works as expected.
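
    If you want something of your own to experiment with, the following is a minimal, hypothetical sketch of a streamlit app (save it as e.g. app.py and run it with streamlit run app.py):

    import streamlit as st\n\nst.title(\"My first app\")\nname = st.text_input(\"What is your name?\")\nif name:\n    st.write(f\"Hello {name}!\")  # re-runs and updates whenever the input changes\n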

    "},{"location":"s10_extra/high_performance_clusters/","title":"M33 - High Performance Clusters","text":""},{"location":"s10_extra/high_performance_clusters/#high-performance-clusters","title":"High Performance Clusters","text":"

    As discussed in the intro session on the cloud, cloud providers offer near infinite compute resources. However, using these resources often comes at a hefty price and it is therefore important to be aware of another resource many have access to: High Performance Clusters or HPC. HPCs exist all over the world, and many times you already have access to one or can easily get access to one. If you are a university student you most likely have a local HPC that you can access through your institution. Otherwise, there exist public HPC resources that everybody (with a project) can apply for. As an example, in the EU we have the EuroHPC initiative that currently has 8 different supercomputers, with a centralized location for applying for resources that is open for both research projects and start-ups.

    Depending on your application, you may have different needs and it is therefore important to be aware of the different tiers of HPC. In Europe, HPC centers are often categorized such that Tier-0 are European centers with petascale or exascale machines, Tier 1 are national centers of supercomputers, and Tier 2 are regional centers. The lower the tier, the larger the applications it is possible to run.

    Image credit"},{"location":"s10_extra/high_performance_clusters/#cluster-architectures","title":"Cluster architectures","text":"

    In very general terms, clusters come as two different kinds of systems: supercomputers and LSF (Load Sharing Facility) systems. A supercomputer (as shown below) is organized into different modules that are separated by network links. When you log in to a supercomputer you will meet the front end, which contains all the software needed to run computations. When you submit a job it will get sent to the backend modules, which in most cases include: general compute modules (CPU), acceleration modules (GPU), a memory module (RAM) and finally a storage module (HDD). Depending on your application you may need one module more than another. For example, in deep learning the acceleration module is important, whereas in physics simulations the general compute module / storage module is probably more important.

    Overview of the Meluxina supercomputer that's part of EuroHPC. Image credit

    Alternatively, an LSF system is a network of computers where each computer has its own CPU, GPU, RAM etc. and the individual computers (or nodes) are then connected by network. The important difference between a supercomputer and an LSF system is how the resources are organized. When comparing the two, it is generally better to run on an LSF system if you are only requesting resources that can be handled by a single node, whereas it is better to run on a supercomputer if you have a resource intensive application that requires many devices to communicate with each other.

    Regardless of the cluster architecture, on the software side of HPC the most important part is what's called the HPC scheduler. Without an HPC scheduler an HPC cluster would just be a bunch of servers with different jobs interfering with each other. The problem is that when you have a large collection of resources and a large collection of users, you cannot rely on the users just running their applications without interfering with each other. An HPC scheduler is in charge of making sure that whenever a user requests to run an application, the request gets put in a queue, and whenever the resources the application asks for are available, the application gets run.

    The biggest batch control systems for doing scheduling on HPC are:

    We are going to take a look at how LSF works as that is what is installed on our local university cluster.

    "},{"location":"s10_extra/high_performance_clusters/#exercises","title":"\u2754 Exercises","text":"

    Exercise files

    The following exercises are focused on local students at DTU that want to use our local HPC resources. That said, the steps in the exercises are fairly general and apply to other types of clusters as well. For the purpose of this exercise we are going to see how we can run this image classifier script, but feel free to work with whatever application you want to.

    1. Start by accessing the cluster. This can either be done through ssh in a terminal or, if you want a graphical interface, by installing thinlinc. In general we recommend that DTU students follow the steps here, as the setup depends on whether you are on campus or not.

    2. When you have access to the cluster we are going to start with the setup phase, where we set up the environment necessary for our computations. If you have accessed the cluster through the graphical interface, start by opening a terminal.

      1. Let's start by setting up conda for controlling our dependencies. If you have not already worked with conda, please check out module M2 on package managers and virtual environments. In general you should be able to set up (mini)conda through these two commands:

        wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nsh Miniconda3-latest-Linux-x86_64.sh\n
      2. Close the terminal and open a new one for the installation to complete. Type conda in the terminal to check that everything is fine. Go ahead and create a new environment that we can install dependencies in

        conda create -n \"hpc_env\" python=3.10 --no-default-packages\n

        and activate it.

      3. Copy over any files you need. For the image classifier script you need the requirements file and the actual application.

      4. Next, install all the requirements you need. If you want to run the image classifier script you can run this command in the terminal

        pip install -r image_classifier_requirements.txt\n

        using this requirements file.

    3. That's all the setup needed. You would need to go through the creation of an environment and the installation of requirements whenever you start a new project (no need to reinstall conda). For the next step we need to look at how to submit jobs on the cluster. We are now ready to submit our first job to the cluster:

      1. Start by checking the statistics for the different clusters. Try the qstat command, which should give an overview of the different clusters, the number of running jobs and the number of pending jobs. On many systems you can also try the much more user-friendly classstat command.

      2. Figure out which queue you want to use. For the sake of the exercises it needs to be one with GPU support. For DTU students, any queue that starts with gpu is GPU accelerated.

      3. Now we are going to develop a bash script for submitting our job. We have provided an example of such a script. Take a careful look, go through each line and make sure you understand it. Afterwards, change it to your needs (queue and student email). A minimal, hypothetical sketch of what such a jobscript can look like is also shown at the end of this exercise.

      4. Try to submit the script:

        bsub < jobscript.sh\n

        You can check the status of your script by running the bstat command. Hopefully, the job should go through really quickly. Take a look at the output file, it should be called something like gpu_*.out. Also take a look at the gpu_*.err file. Do both files look as they should?
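
    As promised above, here is a minimal, hypothetical sketch of an LSF jobscript. The queue name, resource requests and email are assumptions; adapt them to the provided example and your own cluster:

      #!/bin/sh\n#BSUB -q gpuv100                            # queue name (assumption, pick one of your cluster's GPU queues)\n#BSUB -J my_job                             # job name\n#BSUB -n 4                                  # number of CPU cores\n#BSUB -gpu \"num=1:mode=exclusive_process\"   # request a single GPU\n#BSUB -W 1:00                               # walltime limit (hh:mm)\n#BSUB -R \"rusage[mem=8GB]\"                  # memory request\n#BSUB -u your_email@example.com             # email for notifications (assumption)\n#BSUB -N                                    # send a notification when the job is done\n#BSUB -o gpu_%J.out                         # file for standard output\n#BSUB -e gpu_%J.err                         # file for standard error\n\necho \"Running on $(hostname)\"\n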

    4. Let's now try to run our application on the cluster. To do that we need to take care of two things:

      1. First we need to load the correct version of CUDA. A cluster system often contains multiple versions of specific software to suit the needs of all its users, and it is the users that are in charge of loading the correct software during job submission. The only extra software that needs to be loaded for most Pytorch applications is a CUDA module. You can check which modules are available on the cluster with

        module avail\n

        Afterwards, add the correct CUDA version you need to the jobscript.sh file. If you are trying to run the provided image classifier script then the correct version is CUDA/11.7 (can be seen in the requirements file).

        # add to the bottom of the file\nmodule load cuda/11.7\n
      2. We are now ready to add our application. The only thing we need to take care of is telling the system to run it using the Python version that is connected to the hpc_env environment we created in the beginning. Try typing:

        which python\n

        which should give you the full path. Then add to the bottom of the jobscript file:

        ~/miniconda3/envs/hpc_env/bin/python \\\n    image_classifier.py \\\n    --trainer.accelerator 'gpu' --trainer.devices 1  --trainer.max_epochs 5\n

        which will run the image classifier script (change it if you are running something else).

      3. Finally submit the job:

        bsub < jobscript.sh\n

        and check when it is done that it has produced what you expected.

      4. (Optional) If your application supports multiple GPUs, also try that out. You would first need to change the jobscript to request multiple GPUs and additionally you would need to tell your application to run on multiple GPUs. For the image classifier script this can be done by changing the --trainer.devices flag to 2 (or higher).

    This ends the module on using HPC systems.

    "},{"location":"s10_extra/hyperparameters/","title":"M32 - Hyperparameter optimization","text":""},{"location":"s10_extra/hyperparameters/#hyperparameter-optimization","title":"Hyperparameter optimization","text":"

    Hyperparameter optimization is not a new idea within machine learning but has seen somewhat of a renaissance with the rise of deep learning. This can mainly be attributed to the following:

    However, the problem with doing hyperparameter optimization of a deep learning model is that it can take over a week to train a single model. In most cases we therefore cannot do a full grid search of all hyperparameter combinations to get the best model. Instead we have to use some tricks that will help us speed up our search. In these exercises we are going to be integrating Optuna into our different models, which will provide the tools for speeding up our search.

    It should be noted that for a lot of deep learning models not every hyperparameter that is included in the model is optimized; instead one relies on heuristic guidelines (\"rules of thumb\") based on what seems to work in general, e.g. a learning rate of 0.01 seems to work great with the Adam optimizer. That said, these rules probably only apply to 80% of deep learning models, whereas for the last 20% the recommendations may be suboptimal. Here is a great site that has collected an extensive list of these recommendations, taken from the excellent deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

    In practice, I recommend trying to identify (through experimentation) which hyperparameters are important for the performance of your model and then spending your computational budget trying to optimize them while setting the rest to a \"recommended value\".

    "},{"location":"s10_extra/hyperparameters/#exercises","title":"\u2754 Exercises","text":"

    Exercise files

    1. Start by installing optuna: pip install optuna

    2. Initially we will look at the cross_validate.py file. It implements simple K-fold cross-validation of a random forest on the sklearn digits dataset (a subset of MNIST). Look over the script and try to run it.

    3. We will now try to write the same code in Optuna. Please note that the script has a variable OPTUNA=False that you can use to change which part of the code should run. The three main concepts of Optuna are

      • A trial: a single experiment

      • A study: a collection of trials

      • The objective: function to determine how \"good\" a trial is

      Let's start by writing the objective function, which we have already started in the script. For now you do not need to care about the trial argument, just assume that it contains the hyperparameters needed to define your random forest model. The output of the objective function should be a single number that we want to optimize. (HINT: did you remember to do K-fold cross-validation inside your objective function?)

    4. Next let's focus on the trial. Inside the objective function the trial should be used to suggest what parameters to use next. Take a look at the documentation for trial or take a look at the code examples and figure out how to define the hyperparameters of the model. A minimal sketch of how this could look is shown below.
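
      The following is only a toy sketch, not the exercise solution: the search space and model settings are assumptions, but it shows the general pattern of suggesting hyperparameters from the trial and doing cross-validation inside the objective:

      import optuna\nfrom sklearn.datasets import load_digits\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import cross_val_score\n\n\ndef objective(trial: optuna.Trial) -> float:\n    # hypothetical search space, adjust to the hyperparameters of your own model\n    n_estimators = trial.suggest_int(\"n_estimators\", 10, 200)\n    max_depth = trial.suggest_int(\"max_depth\", 2, 32)\n    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)\n    x, y = load_digits(return_X_y=True)\n    return cross_val_score(clf, x, y, cv=5).mean()  # K-fold cross-validation inside the objective\n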

    5. Finally, let's launch a study. It can be as simple as

      study = optuna.create_study()\nstudy.optimize(objective, n_trials=100)\n

      but let's play around a bit with it:

      1. By default the .optimize method will minimize the objective (by definition the optimum of an objective function is at its minimum). Is the score your objective function is returning something that should be minimized? If not, a simple solution is to put a - in front of the metric. However, look through the documentation on how to change the direction of the optimization.

      2. Optuna will by default do Bayesian optimization when sampling the hyperparameters (its default sampler is the Tree-structured Parzen Estimator). However, since this example is quite simple, we can actually perform a full grid search. How would you do this in Optuna?

      3. Compare the performance of a single Optuna run using Bayesian optimization with n_trials=10 to an exhaustive grid search that searches through all hyperparameter combinations. What is the performance/time trade-off for these two solutions?

    6. In addition to doing Bayesian optimization, the other great part about Optuna is that it has native support for pruning unpromising trials. Pruning refers to stopping trials for hyperparameter combinations that do not seem to lead anywhere. You may have a learning rate that is so high that training is diverging, or a neural network with so many parameters that it is just overfitting to the training data. This however begs the question: what constitutes an unpromising trial? This is up to you to define based on prior experimentation.

      1. Start by looking at the fashion_trainer.py script. It's a simple classification network for classifying images in the FashionMNIST dataset. Run the script with the default hyperparameters to get a feeling of how the training should progress. Note down the performance on the test set.

      2. Start by defining a validation set and a validation dataloader that we can use for hyperparameter optimization (HINT: use 5-10% of your training data).

      3. Now, adjust the script to use Optuna. The 5 hyperparameters listed in the table below should at least be included in the hyperparameter search. For some we have already defined the search space, but for the remaining you need to come up with a good range of values to investigate. When done integrating Optuna, run a small study (n_trials=3) to check that the code is working.

        Hyperparameter | Search space\n--- | ---\nLearning rate | 1e-6 to 1e0\nNumber of output features in the second last layer | ???\nThe amount of dropout to apply | ???\nBatch size | ???\nUse batch normalization or not | {True, False}\n(Optional) Different activation functions | {nn.ReLU, nn.Tanh, nn.RReLU, nn.LeakyReLU, nn.ELU}\n
      4. If implemented correctly the number of hyperparameter combinations should be at least 1000, meaning that we not only need Bayesian optimization but probably also need pruning to succeed. Check out the page for built-in pruners in Optuna. Implement pruning in the script; I recommend using either the MedianPruner or the PercentilePruner. A toy sketch of how pruning hooks into a training loop is shown below.
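
        The snippet below is a self-contained toy sketch (the \"training loop\" is faked) that only illustrates the reporting/pruning pattern; in your script you would report the validation metric of your actual model instead:

        import optuna\n\n\ndef objective(trial: optuna.Trial) -> float:\n    lr = trial.suggest_float(\"lr\", 1e-6, 1e0, log=True)\n    val_acc = 0.0\n    for epoch in range(10):\n        val_acc += 0.1 * lr / (1 + lr)  # dummy stand-in for one epoch of training + validation\n        trial.report(val_acc, step=epoch)  # report the intermediate value to the pruner\n        if trial.should_prune():  # let the pruner decide if this trial should be stopped early\n            raise optuna.TrialPruned()\n    return val_acc\n\n\nstudy = optuna.create_study(direction=\"maximize\", pruner=optuna.pruners.MedianPruner())\nstudy.optimize(objective, n_trials=50)\n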

      5. Re-run the study using pruning with a large number of trials (n_trials>50)

      6. Take a look at this visualization page for ideas on how to visualize the study you just did. Make at least two visualizations of the study and make sure that you understand them.

      7. Pruning is great for better spending your computational budget, however it comes with a trade-off. What is it and what hyperparameter should one be especially careful about when using pruning?

      8. Finally, what parameter combination achieved the best performance? What is the test set performance for this set of parameters? Did you improve over the initial set of hyperparameters?

    7. The exercises until now have focused on doing the hyperparameter search sequentially, meaning that we test one set of parameters at a time. This is a fine approach because you can easily let it run for a week without any interaction. However, assuming that you have the computational resources to run in parallel, how do you do that?

      1. To run hyperparameter search in parallel we need a common database that all experiments can read from and write to. We are going to use the recommended MySQL. You do not have to understand what SQL is to complete this exercise, but it is basically a language (like Python) for managing databases. Install MySQL.

      2. Next we are going to initialize a database that we can read from and write to. For this exercise we are going to focus on a locally stored database, but it could of course also be located in the cloud.

        mysql -u root -e \"CREATE DATABASE IF NOT EXISTS example\"\n

        You can also do this directly in Python when calling optuna.create_study by setting the storage and load_if_exists=True arguments, as sketched below.
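
        A minimal sketch of that alternative (using the same database and study name as in this exercise):

        import optuna\n\nstudy = optuna.create_study(\n    study_name=\"distributed-example\",\n    storage=\"mysql://root@localhost/example\",\n    load_if_exists=True,  # reuse the study if it has already been created\n)\n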

      3. Now we are going to create an Optuna study in our database

        optuna create-study --study-name \"distributed-example\" --storage \"mysql://root@localhost/example\"\n
      4. Change how you initialize the study to read and write to the database. Therefore, instead of doing

        study = optuna.create_study()\n

        then do

        study = optuna.load_study(\n    study_name=\"distributed-example\", storage=\"mysql://root@localhost/example\"\n)\n

        where the study_name and storage should match how the study was created.

      5. For running in parallel, you can either open up an extra terminal and simply launch your script once per open terminal, or you can use the provided parallel_lancher.py that will launch multiple executions of your script. It should be used as:

        python parallel_lancher.py myscript.py --num_parallel 2\n
      6. Finally, make sure that you can access the results

    That's all on how to do hyperparameter optimization in a scalable way. If you feel like it, you can try to apply these techniques on the ongoing corrupted MNIST example, where you are free to choose what hyperparameters you want to use.

    "},{"location":"s10_extra/kubernetes/","title":"Kubernetes","text":"

    Danger

    Module is still under development

    "},{"location":"s10_extra/kubernetes/#kubernetes","title":"Kubernetes","text":"

    Kubernetes, also known as K8s, is an open-source platform designed to automate the deployment, scaling, and operation of application containers across clusters of hosts. It provides the framework to run distributed systems resiliently, handling scaling and failover for your applications, providing deployment patterns, and more.

    "},{"location":"s10_extra/kubernetes/#what-is-kubernetes","title":"What is Kubernetes?","text":""},{"location":"s10_extra/kubernetes/#brief-history","title":"Brief History","text":"

    Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation.

    "},{"location":"s10_extra/kubernetes/#core-functions","title":"Core Functions","text":"

    Kubernetes makes it easier to deploy and manage containerized applications at scale.

    "},{"location":"s10_extra/kubernetes/#key-concepts","title":"Key Concepts","text":""},{"location":"s10_extra/kubernetes/#kubernetes-architecture","title":"Kubernetes Architecture","text":"

    Kubernetes follows a client-server architecture. At a high level, it consists of a Control Plane (master) and Nodes (workers).

    Overview of Kubernetes Architecture. Image Credit: Kubernetes Official Documentation"},{"location":"s10_extra/kubernetes/#control-plane-components","title":"Control Plane Components","text":""},{"location":"s10_extra/kubernetes/#node-components","title":"Node Components","text":""},{"location":"s10_extra/kubernetes/#minikube-local-kubernetes-environment","title":"Minikube: Local Kubernetes Environment","text":"

    Minikube is a tool that allows you to run Kubernetes locally. It runs a single-node Kubernetes cluster inside a VM on your laptop for users looking to try out Kubernetes or develop with it day-to-day.

    "},{"location":"s10_extra/kubernetes/#installing-minikube","title":"Installing Minikube","text":"
    1. System Requirements: Ensure your system meets the minimum requirements.
    2. Download and Install: Visit Minikube's official installation guide.
    3. Start Minikube: Run minikube start.
    "},{"location":"s10_extra/kubernetes/#exercises","title":"\u2754 Exercises","text":"
    1. Install Minikube following the steps above.
    2. Validate the installation by typing minikube in a terminal.
    3. Ensure that kubectl, the command-line tool for Kubernetes, is correctly installed by typing kubectl in a terminal.
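
    Assuming you have already run minikube start, you can additionally verify that the local cluster is actually up by asking it about its own state, for example (the exact output will differ depending on your setup):

      kubectl get nodes     # should list the single minikube node as Ready\nkubectl get pods -A   # should list the system pods running inside the cluster\n
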
    "},{"location":"s10_extra/kubernetes/#yatai-model-serving-platform-for-kubernetes","title":"Yatai: Model Serving Platform for Kubernetes","text":"

    Yatai is a model serving platform, making it easier to deploy machine learning models in Kubernetes environments.

    "},{"location":"s10_extra/kubernetes/#what-is-yatai","title":"What is Yatai?","text":"

    Yatai simplifies the deployment, management, and scaling of machine learning models in Kubernetes.

    "},{"location":"s10_extra/kubernetes/#getting-started-with-yatai","title":"Getting Started with Yatai","text":"
    1. Installation: Steps to install Yatai in your Kubernetes cluster.
    2. Basic Usage: How to deploy your first model using Yatai.
    "},{"location":"s10_extra/kubernetes/#additional-resources","title":"Additional Resources","text":""},{"location":"s10_extra/onnx/","title":"Onnx","text":""},{"location":"s10_extra/onnx/#onnx","title":"Onnx","text":"

    Danger

    Module is still under development

    "},{"location":"s10_extra/onnx/#model-packaging","title":"Model packaging","text":"

    Whenever we want to serve a machine learning model, what we are actually interested in is doing predictions, e.g. given a new datapoint we pass it through our model (forward pass) and the returned value is the predicted value of that datapoint. At a high level, model predictions depend on three things:

    We have already, in module M9 on Docker, touched on how to take care of all these things. Containers make it easy to link a codebase, model weights and code dependencies into a single object. In general we can refer to this as model packaging because, as the name suggests, we are packaging our model into a format that is independent of the actual environment that we are trying to run the model in.

    However, containers are not the only way to do model packaging. If we put some light restrictions on the device we want to run our model predictions on, we can achieve the same result using ONNX. The Open Neural Network Exchange (ONNX) is a standardized format for creating and sharing machine learning models. ONNX provides an open source format for machine learning models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.

    Image credit

    As the above image indicates, the idea behind ONNX is that a model trained with a specific framework on a specific device, let's say Pytorch on your local computer, can be exported and run with an entirely different framework and hardware easily. Not all frameworks are created equal. For example, Pytorch is in general considered a developer-friendly framework, however it has historically been slow to run inference with compared to a framework such as Caffe2. ONNX allows you to mix and match frameworks based on different use cases, and essentially increases the longevity of your model.

    "},{"location":"s10_extra/onnx/#exercises","title":"\u2754 Exercises","text":"
    1. Start by installing ONNX:

      pip install onnx\npip install onnxruntime\n

      the first package includes the basic building blocks for implementing generalized ONNX models and the second package is for running ONNX models optimally on different hardware.

    2. As a test that your installation is working, try executing the following Python code

      import onnxruntime\nonnxruntime.get_all_providers()\n

      these providers are translation layers implemented in the ONNX runtime, such that the same ONNX model can run on completely different hardware. Can you identify at least two of the providers that are necessary for running standard Pytorch code on CPU and GPU? Can you identify others?

    3. One big advantage of having a standardized format is that we can easily visualize the computational graph of our model, because it consists only of core ONNX operations. We are here going to use the open-source tool netron for visualization. You can either choose to download the program or just run it in your web browser. To have a model to visualize, you first need to export one to the ONNX format; a minimal sketch of how that could be done is shown below.
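
    The following is a hypothetical sketch (the choice of resnet18 and the input size are assumptions) of exporting a Pytorch model to ONNX and running it with onnxruntime:

      import numpy as np\nimport onnxruntime\nimport torch\nimport torchvision\n\n# export a pretrained model to the ONNX format\nmodel = torchvision.models.resnet18(weights=\"DEFAULT\")\nmodel.eval()\ndummy_input = torch.randn(1, 3, 224, 224)\ntorch.onnx.export(model, dummy_input, \"resnet18.onnx\", input_names=[\"input\"], output_names=[\"output\"])\n\n# run the exported model with onnxruntime\nsession = onnxruntime.InferenceSession(\"resnet18.onnx\", providers=[\"CPUExecutionProvider\"])\noutputs = session.run(None, {\"input\": np.random.randn(1, 3, 224, 224).astype(np.float32)})\nprint(outputs[0].shape)  # (1, 1000)\n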

    "},{"location":"s10_extra/pipeline/","title":"Pipelines and workflows","text":"

    Danger

    Module is still under development

    Image credit"},{"location":"s10_extra/pipeline/#dags","title":"DAGs","text":"

    Directed Acyclic Graph (DAG)

    "},{"location":"s10_extra/pipeline/#exercises","title":"\u2754 Exercises","text":"
    1. Start by installing prefect:

      pip install prefect\n
    2. Start a local Prefect server instance in your virtual environment.

      prefect server start\n
    3. The great thing about Prefect is that the orchestration tasks and flows are written in pure Python.
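
    As a minimal, hypothetical sketch of what that looks like (the task names and data are made up), a flow is just a decorated Python function that calls decorated task functions:

      from prefect import flow, task\n\n\n@task\ndef get_data() -> list[int]:\n    return [1, 2, 3]\n\n\n@task\ndef process(data: list[int]) -> int:\n    return sum(data)\n\n\n@flow\ndef my_pipeline() -> None:\n    data = get_data()\n    print(process(data))\n\n\nif __name__ == \"__main__\":\n    my_pipeline()  # runs the flow and reports it to the Prefect server if one is running\n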

    "},{"location":"s1_development_environment/","title":"Getting started - Setting up a development environment","text":"

    Slides

    Today we start our journey into the world of machine learning operations (MLOps). However, before we can really get started, we need to make sure that you have a basic understanding of a couple of topics, as we will be using these throughout the course. In particular, today is all about getting set up with a proper development environment that can support your journey. Most of you probably already have experience with these topics, and it will be mostly repetition.

    The reason we are starting here is that many students are missing very basic skills that are never taught but are just expected to be picked up on your own. This session will only cover the most basic skills to get you started on your development journey. If you wish to learn more about basic computer science skills in general, we highly recommend that you check out The Missing Semester of Your CS Education course from MIT.

    Learning objectives

    The learning objectives of this session are:

    "},{"location":"s1_development_environment/command_line/","title":"M1 - The command line","text":""},{"location":"s1_development_environment/command_line/#the-command-line","title":"The command line","text":"

    Core Module

    Image credit

    Contrary to popular belief, the command line (also commonly known as the terminal) is not a mythical being that has existed since the dawn of time. Instead, it was created at a time when it was not given that your computer had a graphical interface that you could interact with. Think of it as a text interface to your computer.

    The terminal is a well-known concept to users of Linux; however, Mac and (especially) Windows users often do not need it and therefore never encounter it. Having a basic understanding of how to use a command line can help improve your workflow. The reason that the command line is an important tool to get to know is that doing any kind of MLOps will require us to be able to interact with many different tools, many of which do not have a graphical interface. Additionally, when we get to working in the cloud later in the course, you will be forced to interact with the command line.

    Note if you already are a terminal wizard then feel free to skip the exercises below. They are very elementary.

    "},{"location":"s1_development_environment/command_line/#the-anatomy-of-the-command-line","title":"The anatomy of the command line","text":"

    Regardless of the operating system, all command lines look more or less the same:

    As already stated, it is essentially just a big text interface to interact with your computer. As the image illustrates, when trying to execute a command, there are several parts to it:

    1. The prompt is the part where you type your commands. It usually contains the name of the current directory you are in, followed by some kind of sign: $, >, : are the usual ones. It can also contain other information, such as in the case of the above image which also shows the current conda environment.
    2. The command is the actual command you want to execute. For example, ls or cd
    3. The options are additional arguments that you can pass to the command. For example, ls -l or cd ...
    4. The arguments are the actual arguments that you pass to the command. For example, ls -l figures or cd ...

    The core difference between options and arguments is that options are optional, while arguments are not.

    Image credit"},{"location":"s1_development_environment/command_line/#exercises","title":"\u2754 Exercises","text":"

    We have put a cheat sheet in the exercise files folder belonging to this session, that gives a quick overview of the different commands that can be executed in the command line.

    Windows users

    We highly recommend that you install Windows Subsystem for Linux (WSL). This will install a full Linux system on your Windows machine. Please follow this guide. Remember to run commands from an elevated (as administrator) Windows Command Prompt. You can in general complete all exercises in the course from a normal Windows Command Prompt, but some are easier to do if you run from WSL.

    If you decide to run in WSL you need to remember that you now have two different systems, and installing a package on one system does not mean that it is installed on the other. For example, if you install pip in WSL, you need to install it again in Windows if you want to use it there.

    If you decide to not run in WSL, please always work in a Windows Command Prompt and not Powershell.

    1. Start by opening a terminal.

    2. To navigate inside a terminal, we rely on the cd command and pwd command. Make sure you know how to go back and forth in your file system. (1)

      1. Your terminal should support tab-completion which can help finish commands for you!
    3. The ls command is important when we want to know the content of a folder. Try to use the command, and also try it with the additional option -l. What does it show?

    4. Make sure to familiarize yourself with the which, echo, cat, wget, less and top commands. Also, familiarize yourself with the > operator. You are probably going to use some of them throughout the course or in your future career. For Windows users, these commands may be named something else, e.g. where command on Windows corresponds to which.

    5. It is also important that you know how to edit a file through the terminal. Most systems should have the nano editor installed; otherwise, try to figure out which one is installed on your system.

      1. Type nano in the terminal

      2. Write the following text in the script

        if __name__ == \"__main__\":\n    print(\"Hello world!\")\n
      3. Save the script and try to execute it

      4. Afterward, try to edit the file through the terminal (change Hello world to something else)

    6. All terminals come with their own programming language. The most common one is called bash. It can come in handy to be able to write simple programs in bash. For example, one case is that you want to execute multiple Python programs sequentially, which can be done through a bash script.

      Windows users

      Bash is not part of Windows, so you need to run this part through WSL. If you did not install WSL, you can skip this part or as an alternative do the exercises in Powershell which is the native Windows scripting language (not recommended).

      1. Write a bash script (in nano) and try executing it:

        #!/bin/bash\n# A sample Bash script, by Ryan\necho Hello World!\n
      2. Change the bash script to call the Python program you just wrote.

      3. Try to Google how to write a simple for-loop that executes the Python script 10 times in a row.

    "},{"location":"s1_development_environment/command_line/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
    1. Here is one command from later in the course when we are going to work in the cloud

      gcloud compute instances create-with-container instance-1 \\\n    --container-image=gcr.io/<project-id>/gcp_vm_tester \\\n    --zone=europe-west1-b\n

      Identify the command, options and arguments.

      Solution
      • The command is gcloud compute instances create-with-container.
      • The options are --container-image=gcr.io/<project-id>/gcp_vm_tester and --zone=europe-west1-b.
      • The arguments are instance-1.

      The tricky part of this example is that commands can have subcommands, which are also commands. In this case compute is a subcommand to gcloud, instances is a subcommand to compute and create-with-container is a subcommand to instances

    2. Two common arguments that nearly all commands have are the -h and -V options. What does each of them do?

      Solution

      The -h (or --help) option prints the help message for the command, including subcommands and arguments. Try it out by executing python -h. The -V (or --version) option prints the version of the installed program. Try it out by executing python --version.

    This ends the module on the command line. If you are still not comfortable working with the command line, fear not as we are going to use it extensively throughout the course. If you want to spend additional time on this topic, we highly recommend that you watch this video on how to use the command line.

    If you are interested in personalizing your command line, you can check out the starship project, which allows you to customize your command line with a lot of different options.

    "},{"location":"s1_development_environment/deep_learning_software/","title":"M4 - Deep Learning Software","text":""},{"location":"s1_development_environment/deep_learning_software/#deep-learning-software","title":"Deep Learning Software","text":"

    Core Module

    Deep learning has since its revolution back in 2012 transformed our lives. From Google Translate to driverless cars to personal assistants to protein engineering, deep learning is transforming nearly every sector of our economy and our lives. However, it did not take long before people realized that deep learning is not a simple beast to tame and it comes with its own kinds of problems, especially if you want to use it in a production setting. In particular the concept of technical debt was invented to indicate the significant maintenance costs at a system level that it takes to run machine learning in production. MLOps should very much be seen as the response to the concept of technical debt, namely that we should develop methods, processes and tools (with inspiration from classical DevOps) to counter the problems we run into when working with deep learning models.

    It is important to note that all the concepts and tools that have been developed for MLOps can absolutely be used together with more classical machine learning models (think K-nearest neighbor, Random forest etc.), however deep learning comes with its own set of problems which mostly have to do with the sheer size of the data and models we are working with. For these reasons, we are focusing on working with deep learning models in this course.

    "},{"location":"s1_development_environment/deep_learning_software/#software-landscape-for-deep-learning","title":"Software landscape for Deep Learning","text":"

    Regarding software for Deep Learning, the landscape is currently dominated by three software frameworks (listed in order of when they were published):

    We won't go into a longer discussion on which framework is best, as it is pointless. Pytorch and Tensorflow have been around for the longest and therefore have bigger communities and feature sets at this point in time. They are both very similar in the sense that they both have features directed at research and production. JAX is kind of the new kid on the block, which in many ways improves on Pytorch and Tensorflow, but is still not as mature as the other frameworks. As the frameworks use different kinds of programming principles (object oriented vs. functional programming), comparing them is essentially meaningless.

    In this course we have chosen to work with Pytorch, because we find it a bit more intuitive and it is the framework that we use for our day to day research life. Additionally, as of right now it is absolutely the dominating framework for published models, research papers and competition winners

    The intention behind this set of exercises is to bring everyone's Pytorch skills up-to-date. If you already are a Pytorch-Jedi feel free to skip the first set of exercises, but I recommend that you still complete them. The exercises are in large part taken directly from the deep learning course at udacity. Note that these exercises are given as notebooks, which is the last time we are going to use them actively in the course. Instead, after this set of exercises, we are going to focus on writing code in Python scripts.

    The notebooks contain a lot of explanatory text. The exercises that you are supposed to fill out are inlined in the text in small \"exercise\" blocks:

    If you need a refresher on any deep learning topic in general throughout the course, we recommend finding the relevant chapter in the deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville (it can also be found in the literature folder). It is absolutely not necessary to be good at deep learning to pass this course as the focus is on all the software needed to get deep learning models into production. However, it is important to have a basic understanding of the concepts.

    "},{"location":"s1_development_environment/deep_learning_software/#exercises","title":"\u2754 Exercises","text":"

    Exercise files

    1. Start a jupyter notebook session in your terminal (assuming you are standing in the root of the course material). Alternatively you should be able to open the notebooks directly in your code editor. For VS code users you can read more about how to work with jupyter notebooks in VS code here

    2. Complete the Tensors in Pytorch notebook. It focuses on basic manipulation of Pytorch tensors. You can pass this notebook if you are comfortable doing this.

    3. Complete the Neural Networks in Pytorch notebook. It focuses on building a very simple neural network using the Pytorch nn.Module interface.

    4. Complete the Training Neural Networks notebook. It focuses on how to write a simple training loop for training a neural network.

    5. Complete the Fashion MNIST notebook, which summarizes the concepts learned in notebooks 2 and 3 by building a neural network for classifying the Fashion MNIST dataset.

    6. Complete the Inference and Validation notebook. This notebook adds important concepts on how to do inference and validation on our neural network.

    7. Complete the Saving_and_Loading_Models notebook. This notebook addresses how to save and load model weights. This is important if you want to share a model with someone else.

    "},{"location":"s1_development_environment/deep_learning_software/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
    1. If tensor a has shape [N, d] and tensor b has shape [M, d] how can we calculate the pairwise distance between rows in a and b without using a for loop?

      Solution

      We can take advantage of broadcasting to do this

      a = torch.randn(N, d)\nb = torch.randn(M, d)\ndist = torch.sum((a.unsqueeze(1) - b.unsqueeze(0))**2, dim=2)  # shape [N, M]\n
    2. What should be the size of S for an input image of size 1x28x28, and how many parameters does the neural network then have?

      from torch import nn\nneural_net = nn.Sequential(\n    nn.Conv2d(1, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3), nn.ReLU(), nn.Flatten(), nn.Linear(S, 10)\n)\n
      Solution

      Since both convolutions have a kernel size of 3, stride 1 (default value) and no padding that means that we lose 2 pixels in each dimension, because the kernel can not be centered on the edge pixels. Therefore, the output of the first convolution would be 32x26x26. The output of the second convolution would be 64x24x24. The size of S must therefore be 64 * 24 * 24 = 36864. The number of parameters in a convolutional layer is kernel_size * kernel_size * in_channels * out_channels + out_channels (last term is the bias) and the number of parameters in a linear layer is in_features * out_features + out_features (last term is the bias). Therefore, the total number of parameters in the network is 3*3*1*32 + 32 + 3*3*32*64 + 64 + 36864*10 + 10 = 387,466, which could be calculated by running:

      sum([prod(p.shape) for p in neural_net.parameters()])\n
    3. A working training loop in Pytorch should have these three function calls: optimizer.zero_grad(), loss.backward(), optimizer.step(). Explain what would happen in the training loop (or implement it) if you forgot each of the function calls.

      Solution

      optimizer.zero_grad() is in charge of zeroing the gradients. If this is not done, then gradients would accumulate over the steps, leading to exploding gradients. loss.backward() is in charge of calculating the gradients. If this is not done, then the gradients would not be calculated and the optimizer would not be able to update the weights. optimizer.step() is in charge of updating the weights. If this is not done, then the weights would not be updated and the model would not learn anything.

    "},{"location":"s1_development_environment/deep_learning_software/#final-exercise","title":"Final exercise","text":"

    As the final exercise we will develop a simple baseline model which we will continue to develop during the course. For this exercise we provide the data in the data/corruptmnist folder. Do NOT use the data in the corruptmnist_v2 folder as that is intended for another exercise. As the name suggests this is a (subsampled) corrupted version of regular MNIST. Your overall task is the following:

    Implement a MNIST neural network that achieves at least 85 % accuracy on the test set.

    Before any training can start, you should identify what corruption we have applied to the MNIST dataset to create the corrupted version. This can help you identify what kind of neural network to use to get good performance, but any network should really be able to achieve this.

    One key point of this course is trying to stay organized. Spending time now organizing your code, will save time in the future as you start to add more and more features. As subgoals, please fulfill the following exercises

    1. Implement your model in a script called model.py

    2. Implement your data setup in a script called data.py. The data was saved using torch.save, so to load it you should use torch.load.

      Saving the model

      When saving the model, you should use torch.save(model.state_dict(), \"model.pt\") and when loading the model you should use model.load_state_dict(torch.load(\"model.pt\")). If you do torch.save(model, \"model.pt\") this can lead to problems when loading the model later on, as it will try to not only save the model weights but also the model definition. This can lead to problems if you change the model definition later on (which you most likely are going to do).
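
      A minimal sketch of that pattern (the architecture here is just a placeholder):

      import torch\nfrom torch import nn\n\nmodel = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))\ntorch.save(model.state_dict(), \"model.pt\")  # save only the weights (the state dict)\n\nmodel = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))  # re-create the architecture\nmodel.load_state_dict(torch.load(\"model.pt\"))  # load the weights back into it\n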

    3. Implement training and evaluation of your model in a main.py script. The main.py script should be able to take an additional subcommand indicating if the model should train or evaluate. It will look something like this:

      python main.py train --lr 1e-4\npython main.py evaluate trained_model.pt\n

      which can be implemented in various ways.

      VS code and command line arguments

      If you try to execute the above code in VS code using the debugger (F5) or the built-in run functionality in the upper right corner:

      you will get an error message saying that you need to select a command to run, e.g. main.py needs either the train or evaluate command. This can be fixed by adding a launch.json to a specialized .vscode folder in the root of the project. The launch.json file should look something like this:

      {\n    \"version\": \"0.2.0\",\n    \"configurations\": [\n        {\n            \"name\": \"Python: Current File\",\n            \"type\": \"python\",\n            \"request\": \"launch\",\n            \"program\": \"${file}\",\n            \"args\": [\n                \"train\",\n                \"--lr\",\n                \"1e-4\"\n            ],\n            \"console\": \"integratedTerminal\",\n            \"justMyCode\": true\n        }\n    ]\n}\n

      This will inform VS code that when we execute the current file (in this case main.py) we want to run it with the train command and additionally pass the --lr argument with the value 1e-4. You can read more about creating a launch.json file here. If you want to have multiple configurations you can add them to the configurations list as additional dictionaries.

    To start you off, a very basic version of each script is provided in the final_exercise folder. We have already implemented some logic, especially to make sure you can easily run the different subcommands for the main.py script. If you are interested in how this is done you can check out this optional module on defining command line interfaces (CLI). We additionally provide a requirements.txt with suggestions for what packages are necessary to complete the exercise.

    As documentation that your model is actually working, when running the train command the script needs to produce a single plot with the training curve (training step vs training loss). When the evaluate command is run, it should write the test set accuracy to the terminal.

    It is part of the exercise to not implement this in notebooks, as code development in real life happens in scripts. As the model is simple to run (for now) you should be able to complete the exercise on your laptop, even if you are only training on CPU. That said, you are allowed to upload your scripts to your own \"Google Drive\" and then call your scripts from a Google Colab notebook, which is shown in the image below where all code is placed in the fashion_trainer.py script and the Colab notebook is just used to execute it.

    Be sure to have completed the final exercise before the next session, as we will be building on top of the model you have created.

    "},{"location":"s1_development_environment/editor/","title":"M3 - Editor","text":""},{"location":"s1_development_environment/editor/#editoride","title":"Editor/IDE","text":"

    Core Module

    Notebooks can be great for testing out ideas, developing simple code and explaining and visualizing certain aspects of a codebase. Remember that Jupyter notebook was created with the intention to \"...allows you to create and share documents that contain live code, equations, visualizations and narrative text.\" However, any larger machine learning project will require you to work in multiple .py files and here notebooks will provide a suboptimal workflow. Therefore, for truly getting \"work done\" you will need a good editor / IDE.

    Many opinions exist on this matter, but for simplicity we recommend getting started with one of the following 3:

    Editor | Webpage | Comment (Biased opinion)\n--- | --- | ---\nSpyder | https://www.spyder-ide.org/ | Matlab like environment that is easy to get started with\nVisual studio code | https://code.visualstudio.com/ | Support for multiple languages with fairly easy setup\nPyCharm | https://www.jetbrains.com/pycharm/ | IDE for python professionals. Will take a bit of time getting used to\n

    We highly recommend Visual studio (VS) code if you do not already have an editor installed (or just want to try something new). We therefore put additional effort into explaining VS code.

    Below you see an overview of the vs code interface

    Image credit

    The main components of VS code are:

    "},{"location":"s1_development_environment/editor/#exercises","title":"\u2754 Exercises","text":"

    The overall goal of the exercises, is that you should start familiarizing yourself with the editor that you have chosen. If you are already an expert in one of them, feel free to skip the rest. You should at least be able to:

    The instructions below are specific to Visual studio code but we recommend that you try to answer the questions if using another editor. In the exercise_files folder belonging to this session we have put cheat sheets for VS code (one for Windows and one for Mac/Linux), that can give you an easy overview of the different macros in VS code. The following exercises are just to get you started but you can find many more tutorials here.

    1. VS code is a general editor for many languages and to get proper python support we need to install some extensions. In the action bar go to the extensions tab and search for python in the marketplace. From here we highly recommend installing the following extensions:

      • Python: general python support for VS code
      • Pylance: language server for python that provides better code completion and type checking
      • Jupyter: support for jupyter notebooks directly in VSCode
      • Python Environment Manager: allows for easy management of virtual environments
2. If you install the Python extension you should see something like this in your status bar:

      which indicates that you are using the stock python installation, instead of the one you have created using conda. Click it and change the python environment to the one you actually want to use.

3. One of the most useful tools in VS Code is the ability to navigate the whole project using the built-in Explorer. To really take advantage of VS Code you need to make sure that what you are working on is a project. Create a folder called hello (somewhere on your laptop) and open it in VS Code (click File in the menu and then select Open Folder). You should end up with a completely clean workspace (as shown below). Click the New file button and create a file called hello.py.

      Image credit

4. Finally, let's run some code. Add something simple to the hello.py file like:

      Image credit

      and click the run button as shown in the image. It should create a new terminal, activate the environment that you have chosen and finally run your script. In addition to clicking the run button, you can also

      • Select some code and press Shift+Enter to run it in the terminal
• Select some code and right click, choosing to run it in an interactive window (where you can interact with the results like in a Jupyter notebook)

That's the basics of using VS Code. We highly recommend that you revisit this tutorial during the course when we get to topics such as debugging and version control, which VS Code can help with.

    "},{"location":"s1_development_environment/editor/#a-note-on-jupyter-notebooks-in-production-environments","title":"A note on jupyter notebooks in production environments","text":"

As already stated, Jupyter notebooks are great for development as they allow developers to easily test out new ideas. However, they often lead to pain points when models actually need to be deployed. We highly recommend reading section 5.1.1 of this paper by Shankar et al. which discusses in more detail the strong opinions on Jupyter notebooks that exist within the developer community.

All this said, there exists at least one simple tool to make notebooks work better in a production setting. It's called nbconvert and can be installed with

    conda install nbconvert # or pip install nbconvert\n

You may need some further dependencies such as Pandoc, TeX and Pyppeteer for it to work (see install instructions here). After this, converting a notebook to a .py script is as simple as:

    jupyter nbconvert --to=script my_notebook.ipynb\n

which will produce a similarly named script called my_notebook.py. We highly recommend that you stick to developing scripts directly during the course to get experience with doing so, but nbconvert can be a fantastic tool to have in your toolbox.

    "},{"location":"s1_development_environment/package_manager/","title":"M2 - Package Manager","text":""},{"location":"s1_development_environment/package_manager/#package-managers-and-virtual-environments","title":"Package managers and virtual environments","text":"

    Core Module

Python is a great programming language and this is mostly due to its vast ecosystem of packages. No matter what you want to do, there is probably a package that can get you started. Just try to remember the last time you wrote a program using only the Python standard library. Probably never. For this reason, we need a way to install third-party packages and this is where package managers come into play.

You have probably already been using pip, the default package manager for Python, for the longest time. pip is great for beginners but it is missing one essential feature that you will need as a developer or data scientist: virtual environments. Virtual environments are an essential way to make sure that the dependencies of different projects do not cross-contaminate each other. As a naive example, consider project A that requires torch==1.3.0 and project B that requires torch==2.0; then doing

    cd project_A  # move to project A\npip install torch==1.3.0  # install old torch version\ncd ../project_B  # move to project B\npip install torch==2.0  # install new torch version\ncd ../project_A  # move back to project A\npython main.py  # try executing main script from project A\n

will mean that even though we are executing the main script from project A's folder, it will use torch==2.0 instead of torch==1.3.0. This is because in both cases pip installs the package into the same environment, in this case the global environment, and torch==2.0 was the last version we installed. Instead, if we did something like:

    Unix/macOSWindows
    cd project_A  # move to project A\npython -m venv env  # create a virtual environment in project A\nsource env/bin/activate  # activate that virtual environment\npip install torch==1.3.0  # install old torch version into the virtual environment belonging to project A\ncd ../project_B  # move to project B\npython -m venv env  # create a virtual environment in project B\nsource env/bin/activate  # activate that virtual environment\npip install torch==2.0  # install new torch version into the virtual environment belonging to project B\ncd ../project_A  # move back to project A\nsource env/bin/activate  # activate the virtual environment belonging to project A\npython main.py  # succeed in executing main script from project A\n
    cd project_A  # move to project A\npython -m venv env  # create a virtual environment in project A\n.\\env\\Scripts\\activate  # activate that virtual environment\npip install torch==1.3.0  # install old torch version into the virtual environment belonging to project A\ncd ../project_B  # move to project B\npython -m venv env  # create a virtual environment in project B\n.\\env\\Scripts\\activate  # activate that virtual environment\npip install torch==2.0  # install new torch version into the virtual environment belonging to project B\ncd ../project_A  # move back to project A\n.\\env\\Scripts\\activate  # activate the virtual environment belonging to project A\npython main.py  # succeed in executing main script from project A\n

    then we would be sure that torch==1.3.0 is used when executing main.py in project A because we are using two different virtual environments. In the above case, we used the venv module which is the built-in Python module for creating virtual environments. venv+pip is arguably a good combination but when working on multiple projects it can quickly become a hassle to manage all the different virtual environments yourself, remembering which Python version to use, which packages to install and so on.

    For this reason, a number of package managers have been created that can help you manage your virtual environments and dependencies, with some of the most popular being:

with more being created every year (rye is looking like an interesting project). This is considered a problem in the Python community, because it means that there is no standard way of managing dependencies as in other languages, such as npm for Node.js or cargo for Rust.

    Image credit

In the course, we do not care about which package manager you use, but we do care that you use one. If you are already familiar with one package manager, then skip this exercise and continue to use that. The best recommendation that I can give regarding package managers in general is to find one you like and then stick with it. A lot of time can be wasted trying to find the perfect package manager, but in the end they all do the same thing with some minor differences. Check out this blog post if you want a fairly up-to-date evaluation of the different environment management and packaging tools that exist in the Python ecosystem.

    If you are not familiar with any package managers, then we recommend that you use conda and pip for this course. You probably already have conda installed on your laptop, which is great. What conda does especially well, is that it allows you to create virtual environments with different Python versions, which can be really useful if you encounter dependencies that have not been updated in a long time. In this course specifically, we are going to recommend the following workflow

    Installing packages with pip inside conda environments has been considered a bad practice for a long time, but since conda>=4.6 it is considered safe to do so. The reason for this is that conda now has a built-in compatibility layer that makes sure that pip installed packages are compatible with the other packages installed in the environment.
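To make the recommended conda + pip combination concrete, a rough sketch of the workflow could look like the following (the environment name, Python version and requirements file are just placeholders for your own project):

conda create -n my_project python=3.11 -y  # create an isolated environment with a specific Python version\nconda activate my_project                  # activate the environment before installing anything\npip install -r requirements.txt            # install the project dependencies with pip inside the environment\n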

    "},{"location":"s1_development_environment/package_manager/#python-dependencies","title":"Python dependencies","text":"

    Before we get started with the exercises, let's first talk a bit about Python dependencies. One of the most common ways to specify dependencies in the Python community is through a requirements.txt file, which is a simple text file that contains a list of all the packages that you want to install. The format allows you to specify the package name and version number you want, with 7 different operators:

package1           # any version\npackage2 == x.y.z  # exact version\npackage3 >= x.y.z  # at least version x.y.z\npackage4 >  x.y.z  # newer than version x.y.z\npackage5 <= x.y.z  # at most version x.y.z\npackage6 <  x.y.z  # older than version x.y.z\npackage7 ~= x.y.z  # at least version x.y.z but older than x.(y+1)\n

    In general, all packages (should) follow the semantic versioning standard, which means that the version number is split into three parts: x.y.z where x is the major version, y is the minor version and z is the patch version.

    The reason that we need to specify the version number is that we want to make sure that we can reproduce our code at a later point. If we do not specify the version number, then we are at the mercy of the package maintainer to not change the API of the package. This is especially important when working with machine learning models, as we want to make sure that we can reproduce the exact same model at a later point.
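As an example of this reasoning, a purely illustrative requirements.txt for a project that should be reproducible could pin the packages that matter most for the results exactly, and be more relaxed about the rest (package names and versions here are just assumptions):

torch == 2.1.0        # pinned exactly so the trained model can be reproduced later\nnumpy ~= 1.26.0       # any 1.26.x patch release is acceptable\nmatplotlib >= 3.8.0   # only used for plotting, so a newer version is fine\n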

    Finally, we also need to discuss dependency resolution, which is the process of figuring out which packages are compatible. This is a non-trivial problem, and there exist a lot of different algorithms for doing this. If you have ever thought that pip and conda were taking a long time to install something, then it is probably because they were trying to figure out which packages are compatible with each other. For example, if you try to install

    pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.19\" --dry-run\n

    then it would simply fail because there are no versions of matplotlib and numpy under the given constraints that are compatible with each other. In this case, we would need to relax the constraints to something like

    pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.21\" --dry-run\n

    to make it work.

    "},{"location":"s1_development_environment/package_manager/#exercises","title":"\u2754 Exercises","text":"

    For hints regarding how to use conda you can check out the cheat sheet in the exercise folder.

1. Download and install conda. You are free to either install the full conda distribution or the much simpler version, miniconda. The core difference between the two is that the full distribution already comes with a lot of packages that you would normally have to install yourself with miniconda. The downside is that the full distribution is a much larger installation, which can be a huge disadvantage on smaller devices. Make sure that your installation is working by writing conda help in a terminal; it should show you the help message for conda. If this does not work you probably need to set some system variables to point to the conda installation.

    2. If you have successfully installed conda, then you should be able to execute the conda command in a terminal.

      Conda will always tell you what environment you are currently in, indicated by the (env_name) in the prompt. By default it will always start in the (base) environment.

3. Try creating a new virtual environment. Make sure that it is called my_environment and that it installs version 3.11 of Python. What command should you execute to do this?

      Use Python 3.8 or higher

      We highly recommend that you use Python 3.8 or higher for this course. In general, we recommend that you use the second latest version of Python that is available (currently Python 3.11 as of writing this). This is because the latest version of Python is often not supported by all dependencies. You can always check the status of different Python version support here.

    4. Which conda command gives you a list of all the environments that you have created?

    5. Which conda command gives you a list of the packages installed in the current environment?

1. How do you easily export this list to a text file? Do this, and make sure you export it to a file called environment.yaml, as conda by default uses another format than pip.

      2. Inspect the file to see what is in it.

3. The environment.yaml file you have created is one way to secure reproducibility between users, because anyone should be able to get an exact copy of your environment if they have your environment.yaml file. Try creating a new environment directly from your environment.yaml file and check that the packages being installed exactly match what you originally had.

6. As the introduction states, it is fairly safe to use pip inside conda today. What is the corresponding pip command that gives you a list of all pip installed packages? And how do you export this to a requirements.txt file?

7. If you look through the requirements that both pip and conda produce, you will see that they are often filled with a lot more packages than what you are actually using in your project. What you are really interested in are the packages that you import in your code: from package import module. One way to get around this is to use the package pipreqs, which will automatically scan your project and create a requirements file specific to it. Let's try it out:

      1. Install pipreqs:

        pip install pipreqs\n
2. Either try out pipreqs on one of your own projects or try it out on some other online project. What does the requirements.txt file pipreqs produces look like compared to the files produced by either pip or conda?

    "},{"location":"s1_development_environment/package_manager/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
    1. Try executing the command

      pip install \"pytest < 4.6\" pytest-cov==2.12.1\n

Based on the error message you get, what would be a compatible way to install these?

      Solution

As pytest-cov==2.12.1 requires a version of pytest newer than 4.6, we can simply change the command to be:

      pip install \"pytest >= 4.6\" pytest-cov==2.12.1\n

but there of course exist other solutions as well.

This ends the module on setting up virtual environments. While the methods mentioned in the exercises are great ways to construct requirements files automatically, sometimes it is just easier to sit down and manually create the files, as you in that way ensure that only the most necessary requirements are installed when creating a new environment.

    "},{"location":"s2_organisation_and_version_control/","title":"Getting started with MLOps - Organization and version control","text":"

    Slides

Today we take our first steps into the world of MLOps. The set of modules in this session focuses on getting organized and making sure that you are familiar with good development practices. While many of the practices you will learn about in these modules do not seem that important when you are a single person working on a project, it is crucial when working in large groups that the differences in how different people organize and write their code are minimized. The topics in this session will focus on:

    Image credit

Some exercises in this course are very loosely stated (including the exercises today). You are expected to seek out information before you ask for help (Google is your friend!) as you will both learn more from trying to solve the problems yourself, and it is more realistic to how the \"real world\" works.

    Learning objectives

    The learning objectives of this session are:

    "},{"location":"s2_organisation_and_version_control/code_structure/","title":"M6 - Code structure","text":""},{"location":"s2_organisation_and_version_control/code_structure/#code-organization","title":"Code organization","text":"

    Core Module

With a basic understanding of version control, it is now time to really begin filling up our code repository. However, the question then remains: how should we organize our code? As developers we tend not to think about code organization that much. It is instead something that is created dynamically as we need it. However, maybe we should spend some time initially getting organized, with the chance of this making our code easier to develop and maintain in the long run. If we do not spend time organizing our code, we may end up with a mess of code that is hard to understand or maintain.

    Big ball of Mud

    A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems. Brian Foote and Joseph Yoder, Big Ball of Mud. Fourth Conference on Patterns Languages of Programs (PLoP '97/EuroPLoP '97) Monticello, Illinois, September 1997

We are here going to focus on the organization of data science projects and machine learning projects. The core difference this kind of project introduces compared to more traditional systems is data. The key to modern machine learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that data should influence our choice of code structure. If we had another kind of application, then the layout of our codebase should probably be different.

    "},{"location":"s2_organisation_and_version_control/code_structure/#cookiecutter","title":"Cookiecutter","text":"

We are in this course going to use the tool cookiecutter, which is a tool for creating projects from project templates. A project template is in short just an overall structure of how you want your folders, files etc. to be organised from the beginning. For this course we are going to be using a custom MLOps template. The template is essentially a fork of the cookiecutter data science template that has been used for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.

We are not going to argue that this template is better than every other template, we are just focusing on the fact that it is a standardized way of creating project structures for machine learning projects. By standardized we mean that if two persons are both using cookiecutter with the same template, the layout of their code follows some specific rules, enabling one to faster understand the other person's code. Code organization is therefore not only about making the code easier for you to maintain but also for others to read and understand.

    Below is seen the default code structure of cookiecutter for data science projects.

What is important to keep in mind when using a template is that it is exactly that: a template. By definition a template is a guide to make something. Therefore, not all parts of a template may be important for your project at hand. Your job is to pick the parts from the template that are useful for organizing your machine learning project and add the parts that are missing.

    "},{"location":"s2_organisation_and_version_control/code_structure/#python-projects","title":"Python projects","text":"

While the same template in principle could be used regardless of what language we are using for our machine learning or data science application, there are certain considerations to take into account based on the language used. Python is currently the dominant language for machine learning and data science, which is why we in this section focus on some of the special files you will need for your Python projects.

    The first file you may or may not know is the __init__.py file. In Python the __init__.py file is used to mark a directory as a Python package. Therefore as a bare minimum, any Python package should look something like this:

    \u251c\u2500\u2500 src/\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 file1.py\n\u2502   \u251c\u2500\u2500 file2.py\n\u251c\u2500\u2500 pyproject.toml\n

    The second file to focus on is the pyproject.toml. This file is important for actually converting your code into a Python project. Essentially, whenever you run pip install, pip is in charge of both downloading the package you want but also in charge of installing it. For pip to be able to install a package it needs instructions on what part of the code it should install and how to install it. This is the job of the pyproject.toml file.

Below we have added a description of the structure of the pyproject.toml file but also of setup.py + setup.cfg, which is the \"old\" way of providing project instructions for Python projects. However, you may still encounter a lot of projects using setup.py + setup.cfg so it is good to at least know about them.

    pyproject.tomlsetup.py + setup.cfg

pyproject.toml is the new standardized way of describing project metadata in a declarative way, introduced in PEP 621. It is written in the toml format which is easy to read. At the very least your pyproject.toml file should include the [build-system] and [project] sections:

    [build-system]\nrequires = [\"setuptools\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"my-package-name\"\nversion = \"0.1.0\"\nauthors = [{name = \"EM\", email = \"me@em.com\"}]\ndescription = \"Something cool here.\"\nrequires-python = \">=3.8\"\ndynamic = [\"dependencies\"]\n\n[tool.setuptools.dynamic]\ndependencies = {file = [\"requirements.txt\"]}\n

the [build-system] section informs pip/python that to build this Python project it needs the two packages setuptools and wheel and that it should call the setuptools.build_meta function to actually build the project. The [project] section essentially contains metadata regarding the package, what it is called etc., if we ever want to publish it to PyPI.

For specifying dependencies of your project you have two options. Either you specify them in a requirements.txt file and add it as a dynamic field in pyproject.toml as shown above. Alternatively, you can add a dependencies field under the [project] header like this:

    [project]\ndependencies = [\n    'torch==2.1.0',\n    'matplotlib>=3.8.1'\n]\n

The improvement over setup.py + setup.cfg is that pyproject.toml also allows for metadata from other tools to be specified in it, essentially making sure you only need a single file for your project. For example, in the next module M7 on good coding practices you will learn about the tool ruff and how it can help format your code. If we want to configure ruff for our project we can do that directly in pyproject.toml by adding additional headers:

[tool.ruff]\nruff_option = ...\n

    To read more about how to specify pyproject.toml this page is a good place to start.

setup.py is the original way of describing how a Python package should be built. The most basic setup.py file will look like this:

from setuptools import setup\nfrom pip.req import parse_requirements  # note: pip.req only exists in old pip versions (<10)\nrequirements = [str(ir.req) for ir in parse_requirements(\"requirements.txt\")]\nsetup(\n    name=\"my-package-name\",\n    version=\"0.1.0\",\n    author=\"EM\",\n    description=\"Something cool here.\",\n    install_requires=requirements,\n)\n

Essentially, it is the exact same meta information as in pyproject.toml, just written directly in Python syntax instead of toml. Because there was a wish to keep this meta information in a separate file, the setup.cfg file was created, which can contain the exact same information as setup.py just in a declarative config.

    [metadata]\nname = my-package-name\nversion = 0.1.0\nauthor = EM\ndescription = \"Something cool here.\"\n# ...\n

This non-standardized way of providing meta information regarding a package was essentially what led to the creation of pyproject.toml.

    Regardless of what way a project is configured, after creating the above files the correct way to install them would be the same

    pip install .\n# or in developer mode\npip install -e . # (1)!\n
1. The -e is short for --editable mode, also called developer mode. Since we will continuously be iterating on our package this is the preferred way to install it, because it means that we do not have to run pip install every time we make a change. Essentially, in developer mode changes in the Python source code immediately take effect without requiring a new installation.

after running this your code should be available to import as from project_name import ... like any other Python package you use. This is the most essential knowledge you need about creating Python packages.

    "},{"location":"s2_organisation_and_version_control/code_structure/#exercises","title":"\u2754 Exercises","text":"

After having installed cookiecutter (exercises 1 and 2), the remaining exercises are intended to be used for taking the simple CNN MNIST classifier from yesterday's exercise and forcing it into this structure. You are not required to fill out every folder and file in the project structure, but try to at least follow the steps in the exercises. Whenever you need to run a file I recommend always doing this from the root directory, e.g.

    python <project_name>/data/make_dataset.py data/raw data/processed\npython <project_name>/models/train_model.py <arguments>\netc...\n

    in this way paths (for saving and loading files) are always relative to the root.

    1. Install cookiecutter framework

      pip install cookiecutter\n
2. Start a new project using this template, which is specialized for this course (1).

1. If you feel like the template can be improved in some way, feel free to either open an issue with the proposed improvement or directly send a pull request to the repository \ud83d\ude04.

      You do this by running the cookiecutter command using the template url:

      cookiecutter <url-to-template>\n

      Valid project names

When asked for a project name you should follow the PEP8 guidelines for naming packages. This means that the name should be all lowercase and if you want to separate words, you should use underscores. For example my_project is a valid name, while MyProject is not. Additionally, the package name cannot start with a number.

      Flat-layout vs src-layout

There are two common choices on how to layout your source directory. The first is called src-layout, where the source code is always placed in a src/<project_name> folder, and the second is called flat-layout, where the source code is just placed in a <project_name> folder. The template we are using in this course uses the flat-layout, but there are pros and cons for both.

3. After having created your new project, the first step is to also create a corresponding virtual environment and install any needed requirements. If you have a virtual environment from yesterday feel free to use that, otherwise create a new one. Then install the project in that environment

      pip install -e .\n
    4. Start by filling out the <project_name>/data/make_dataset.py file. When this file runs, it should take the raw data e.g. the corrupted MNIST files from yesterday (../data/corruptmnist) which now should be located in a data/raw folder and process them into a single tensor, normalize the tensor and save this intermediate representation to the data/processed folder. By normalization here we refer to making sure the images have mean 0 and standard deviation 1.
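  A minimal sketch of what such a script could contain is shown below. Note that the file pattern and the assumption that the raw data is stored as .pt tensor files are ours; adapt the loading to however your raw corrupted MNIST files actually look.

  from pathlib import Path\nimport torch\n\ndef normalize(images: torch.Tensor) -> torch.Tensor:\n    # standardize the images to mean 0 and standard deviation 1\n    return (images - images.mean()) / images.std()\n\nif __name__ == '__main__':\n    raw_dir, processed_dir = Path('data/raw'), Path('data/processed')\n    # assumption: each raw file is a .pt tensor of images; adjust the pattern to your data\n    images = torch.cat([torch.load(f) for f in sorted(raw_dir.glob('train_images_*.pt'))])\n    processed_dir.mkdir(parents=True, exist_ok=True)\n    torch.save(normalize(images.float()), processed_dir / 'train_images.pt')\n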

    5. This template comes with a Makefile that can be used to easily define common operations in a project. You do not have to understand the complete file but try taking a look at it. In particular the following commands may come in handy

      make data  # runs the make_dataset.py file, try it!\nmake clean  # clean __pycache__ files\nmake requirements  # install everything in the requirements.txt file\n
      Windows users

make is a GNU build tool that is by default not available on Windows. There are two recommended ways to get it running on Windows. The first is leveraging the Windows Subsystem for Linux, which you may have already installed. The second option is utilizing the chocolatey package manager, which enables Windows users to install packages similarly to a Linux system.

      In general we recommend that you add commands to the Makefile as you move along in the course. If you want to know more about how to write Makefiles then this is an excellent video.

6. Put your model file (model.py) into the <project_name>/models folder and insert the relevant code from the main.py file into the train_model.py file. Make sure that whenever a model is trained and saved, it gets saved to the models folder (preferably in sub-folders).

7. When you run train_model.py, make sure that some statistics/visualizations from the trained models get saved to the reports/figures/ folder. This could be a simple .png of the training curve.
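  As a rough sketch of how the end of a training script could handle both the checkpoint and the figure (all names and paths below are placeholders, and model and train_losses are assumed to come from your own training loop):

  import os\nimport matplotlib.pyplot as plt\nimport torch\n\nos.makedirs('models/my_run', exist_ok=True)\ntorch.save(model.state_dict(), 'models/my_run/model.pt')  # persist the trained weights\nplt.plot(train_losses)                                    # training step vs training loss\nplt.xlabel('Training step')\nplt.ylabel('Training loss')\nplt.savefig('reports/figures/training_curve.png')\n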

    8. (Optional) Can you figure out a way to add a train command to the Makefile such that training can be started using

      make train\n
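  One possible way to define such a target is sketched below, assuming your training entry point is <project_name>/models/train_model.py (remember that the recipe line in a Makefile must be indented with a tab):

  train:\n\tpython <project_name>/models/train_model.py\n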
9. Fill out the newly created <project_name>/models/predict_model.py file, such that it takes a pre-trained model file and creates predictions for some data. The recommended interface is that users can give this file either a folder with raw images that get loaded in, or a numpy or pickle file with already loaded images, e.g. something like this

      python <project_name>/models/predict_model.py \\\n    models/my_trained_model.pt \\  # file containing a pretrained model\n    data/example_images.npy  # file containing just 10 images for prediction\n
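  A minimal sketch of the prediction logic itself could look like the following. It assumes the checkpoint was saved with torch.save(model.state_dict(), ...), that the data file is a .npy array of images, and that MyAwesomeModel and the import path are placeholders for your own model class; proper argument parsing and the raw-image folder case are left out:

  import sys\nimport numpy as np\nimport torch\nfrom my_project.models.model import MyAwesomeModel  # hypothetical import path, adjust to your project\n\nmodel_checkpoint, data_path = sys.argv[1], sys.argv[2]\nmodel = MyAwesomeModel()\nmodel.load_state_dict(torch.load(model_checkpoint))\nmodel.eval()\nimages = torch.from_numpy(np.load(data_path)).float()\nwith torch.no_grad():\n    predictions = model(images).argmax(dim=1)\nprint(predictions)\n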
10. Fill out the file <project_name>/visualization/visualize.py with code that does the following (as a minimum, feel free to add more visualizations). A rough sketch is shown after the list below.

      • Loads a pre-trained network
      • Extracts some intermediate representation of the data (your training set) from your cnn. This could be the features just before the final classification layer
      • Visualize features in a 2D space using t-SNE to do the dimensionality reduction.
      • Save the visualization to a file in the reports/figures/ folder.
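  As referenced above, a rough sketch of such a visualization script is shown here. It assumes that model, train_images and train_targets are already loaded, and that the intermediate features can be extracted through a hypothetical model.backbone attribute; adjust this to how your own model is actually structured:

  import matplotlib.pyplot as plt\nimport torch\nfrom sklearn.manifold import TSNE\n\nmodel.eval()\nwith torch.no_grad():\n    features = model.backbone(train_images)  # hypothetical: representation just before the classifier\nembedding = TSNE(n_components=2).fit_transform(features.numpy())\nplt.scatter(embedding[:, 0], embedding[:, 1], c=train_targets, s=2)\nplt.savefig('reports/figures/tsne_embedding.png')\n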
11. (Optional) Feel free to create more files/visualizations (what about investigating/exploring the data distribution?)

    12. Make sure to update the README.md file with a short description on how your scripts should be run

    13. Finally make sure to update the requirements.txt file with any packages that are necessary for running your code (see this set of exercises for help)

14. (Optional) Let's say that you are not satisfied with the template I have recommended you use, which is completely fine. What should you then do? You should of course create your own template! This is actually not that hard to do.

1. As a starting point I would recommend that you fork either the mlops template which you have already been using or alternatively the data science template.

2. After forking the template, clone it down locally and let's start modifying it. The first step is changing the cookiecutter.json file. For the mlops template it looks like this:

        {\n    \"project_name\": \"project_name\",\n    \"repo_name\": \"{{ cookiecutter.project_name.lower().replace(' ', '_') }}\",\n    \"author_name\": \"Your name (or your organization/company/team)\",\n    \"description\": \"A short description of the project.\",\n    \"python_version_number\": \"3.10\",\n    \"open_source_license\": [\"No license file\", \"MIT\", \"BSD-3-Clause\"]\n}\n

        simply add a new line to the json file with the name of the variable you want to add and the default value you want it to have.
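For example, adding a purely hypothetical docker_base_image variable would mean adding one extra key/value pair inside the JSON object (remember the comma on the preceding line):

\"docker_base_image\": \"python:3.11-slim\"\n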

      3. The actual template is located in the {{ cookiecutter.project_name }} folder. cookiecutter works by replacing everywhere that it sees {{ cookiecutter.<variable_name> }} with the value of the variable. Therefore, if you want to add a new file to the template, just add it to the {{ cookiecutter.project_name }} folder and make sure to add the {{ cookiecutter.<variable_name> }} where you want the variable to be replaced.

      4. After you have made the changes you want to the template, you should test it locally. Just run

        cookiecutter . -f --no-input\n

        and it should create a new folder using the default values of the cookiecutter.json file.

      5. Finally, make sure to push any changes you made to the template to GitHub, such that you in the future can use it by simply running

        cookiecutter https://github.com/<username>/<my_template_repo>\n
    "},{"location":"s2_organisation_and_version_control/code_structure/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
1. Starting from complete scratch, what are the steps needed to create a new GitHub repository and push a specific template to it as the very first commit?

      Solution
1. Create a completely barebone repository, either using the GitHub UI or, if you have the GitHub CLI installed (gh, not git), by running

        gh repo create <repo_name> --public --confirm\n
      2. Run cookiecutter with the template you want to use

        cookiecutter <template>\n

The name of the folder created by cookiecutter should be the same as the repository name you just created.

      3. Run the following sequence of commands

        cd <project_name>\ngit init\ngit add .\ngit commit -m \"Initial commit\"\ngit remote add origin https://github.com/<username>/<repo_name>\ngit push origin master\n
      4. That's it. The template should now have been pushed to the repository as the first commit.

That ends the module on code structure and cookiecutter. We again want to stress that the point of using cookiecutter is not about following one specific template, but instead just to use any template for organizing your code. What often happens in a team is that multiple templates are needed in different stages of the development phase or for different product types, because they share a common structure while still having some specifics. Keeping templates up-to-date then becomes critical such that no team member is using an outdated template. If you ever end up in this situation, we highly recommend checking out cruft, which works alongside cookiecutter to not only create projects but also update existing ones as the template evolves. Cruft additionally has template validation capabilities to ensure projects match the latest version of a template.

        "},{"location":"s2_organisation_and_version_control/dvc/","title":"M8 - Data version control","text":""},{"location":"s2_organisation_and_version_control/dvc/#data-version-control","title":"Data Version Control","text":"

        Core Module

In this module, we are going to return to version control. However, this time we are going to focus on version control of data. The reason we need to distinguish between standard version control and data version control comes down to one problem: size.

Classic version control was developed to keep track of code files, which are all simple text files. Even a codebase that contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better with the more data you feed them, we are seeing models today that are being trained on petabytes of data (1.000.000 GB).

Because this is an important concept, there exist a couple of frameworks that have specialized in versioning data, such as DVC, DAGsHub, Hub, Modelstore and ModelDB. Regardless of the framework, they all implement somewhat the same concept: instead of storing the actual data files, or in general any large artifact files, we instead store a pointer to these large files. We then version control the pointer instead of the artifact.

        Image credit

We are in this course going to use DVC provided by iterative.ai as they also provide tools for automating machine learning, which we are going to focus on later.

        "},{"location":"s2_organisation_and_version_control/dvc/#dvc-what-is-it","title":"DVC: What is it?","text":"

DVC (Data Version Control) is simply an extension of git to version not only code but also data, models and experiments in general. But how does it deal with these large data files? Essentially, DVC will just keep track of a small metafile that will then point to some remote location where your original data is stored. Metafiles essentially work as placeholders for your data files. Your large data files are then stored in some remote location such as Google Drive or an S3 bucket from Amazon.

        Image credit

        As the figure shows, we now have two remote locations: one for code and one for data. We use git pull/push for the code and dvc pull/push for the data. The key concept is the connection between the data file model.pkl which is fairly large and its respective metafile model.pkl.dvc which is very small. The large file is stored in the data remote and the metafile is stored in the code remote.
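To make the metafile idea concrete, a .dvc file is just a small, human-readable YAML file. Its contents look roughly like the sketch below (the hash, size and file-count values here are made up):

outs:\n- md5: a1b2c3d4e5f678901234567890abcdef.dir\n  size: 54950048\n  nfiles: 8\n  path: data\n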

        "},{"location":"s2_organisation_and_version_control/dvc/#exercises","title":"\u2754 Exercises","text":"

        If in doubt about some of the exercises, we recommend checking out the documentation for DVC as it contains excellent tutorials.

        1. For these exercises, we are going to use Google drive as a remote storage solution for our data. If you do not already have a Google account, please create one (we are going to use it again in later exercises). Please make sure that you at least have 1GB of free space.

        2. Next, install DVC and the Google Drive extension

          pip install dvc\npip install \"dvc[gdrive]\"\n

          If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all. If you encounter that the installation fails, we recommend that you start by updating pip and then trying to update dvc:

pip install -U pip\npip install -U \"dvc[gdrive]\"\n

          If this does not work for you, it is most likely due to a problem with pygit2 and in that case we recommend that you follow the instructions here.

        3. In your MNIST repository run the following command from the terminal

          dvc init\n

this will set up dvc for this repository (similar to how git init will initialize a git repository). These files should be committed using standard git to your repository.

        4. Go to your Google Drive and create a new folder called dtu_mlops_data. Then copy the unique identifier belonging to that folder as shown in the figure below

          Using this identifier, add it as a remote storage

          dvc remote add -d storage gdrive://<your_identifier>\n
        5. Check the content of the file .dvc/config. Does it contain a pointer to your remote storage? Afterwards, make sure to add this file to the next commit we are going to make:

          git add .dvc/config\n
        6. Call the dvc add command on your data files exactly like you would add a file with git (you do not need to add every file by itself as you can directly add the data/ folder). Doing this should create a human-readable file with the extension .dvc. This is the metafile as explained earlier that will serve as a placeholder for your data. If you are on Windows and this step fails you may need to install pywin32. At the same time, the data folder should have been added to the .gitignore file that marks which files should not be tracked by git. Confirm that this is correct.

        7. Now we are going to add, commit and tag the metafiles so we can restore to this stage later on. Commit and tag the files, which should look something like this:

          git add data.dvc .gitignore\ngit commit -m \"First datasets, containing 25000 images\"\ngit tag -a \"v1.0\" -m \"data v1.0\"\n
8. Finally, push your data to the remote storage using dvc push. You will be asked to authenticate, which involves copy-pasting the code in the link prompted. Check out your Google Drive folder. You will see that the data is not in a recognizable format anymore due to the way that dvc packs and tracks the data. The boring detail is that dvc converts the data into content-addressable storage, which makes data much faster to get. Finally, make sure that your data is not stored in your GitHub repository.

After authenticating the first time, DVC should be set up without having to authenticate again. If you for some reason encounter that dvc fails to authenticate, you can try to reset the authentication. Locate the file $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json where $CACHE_HOME depends on your operating system:

          macOSLinuxWindows

          ~/Library/Caches

          ~/.cache This is the typical location, but it may vary depending on what distro you are running

          {user}/AppData/Local

          Delete the complete {gdrive_client_id} folder and retry authenticating with dvc push.

9. After completing the above steps, it is very easy for others (or yourself) to get set up with both code and data by simply running

          git clone <my_repository>\ncd <my_repository>\ndvc pull\n

(assuming that you give them access rights to the folder in your drive). Try doing this (in some other location than your standard code) to make sure that the two commands indeed download both your code and data.

10. Let's look at the process of updating our data. Remember, the important aspect of version control is that we do not need to store explicit files called data_v1.pt, data_v2.pt etc. but can just have a single data.pt where we can always check out earlier versions. Initially start by copying the data/corruptmnist_v2 folder from this repository to your MNIST code. This contains 3 extra datafiles with 15000 additional observations. Rerun your data pipeline so these get incorporated into the files in your processed folder.

        11. Redo the above steps, adding the new data using dvc, committing and tagging the metafiles e.g. the following commands should be executed (with appropriate input):

          dvc add -> git add -> git commit -> git tag -> dvc push -> git push.

12. Let's say that you wanted to go back to the state of your data in v1.0. If the above steps have been done correctly, you should be able to do this using:

          git checkout v1.0\ndvc checkout\n

          confirm that you have reverted to the original data.

        13. (Optional) Finally, it is important to note that dvc is not only intended to be used to store data files but also any other large files such as trained model weights (with billions of parameters these can be quite large). For example, if we always store our best-performing model in a file called best_model.ckpt then we can use dvc to version control it, store it online and make it easy for others to download. Feel free to experiment with this using your model checkpoints.

In general dvc is a great framework for version-controlling data and models. However, it is important to note that it does have some performance issues when dealing with datasets that consist of many files. Therefore, if you are ever working with a dataset that consists of many small files, it can be a good idea to:

        • zip files into a single archive and then version control the archive. The zip archive should be placed in a data/raw folder and then unzipped in the data/processed folder.

        • If possible turn your data into 1D arrays, then it can be stored in a single file such as .parquet or .csv. This is especially useful for tabular data. Then you can version control the single file instead of the many files.

        "},{"location":"s2_organisation_and_version_control/dvc/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
        1. How do you know that a repository is using dvc?

          Solution

Similar to a git repository having a .git directory, a repository using dvc needs to have a .dvc folder. Alternatively you can use the dvc status command.

2. Assume you just added a folder called data/ that you want to track with dvc. What is the sequence of 5 commands to successfully version control the folder? (assuming you have already set up a remote)

          Solution
          dvc add data/\ngit add .\ngit commit -m \"added raw data\"\ngit push\ndvc push\n

That's all for today. With the combined power of git and dvc we should be able to version control everything in our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that dvc offers more than just data version control, so if you want to deep dive into dvc we recommend their pipeline feature and how this can be used to set up version-controlled experiments. Note that we are going to revisit dvc later for a more permanent (and large-scale) storage solution.

        "},{"location":"s2_organisation_and_version_control/git/","title":"M5 - Git","text":""},{"location":"s2_organisation_and_version_control/git/#git","title":"Git","text":"

        Core Module

Proper collaboration with other people requires that you can work on the same codebase in an organized manner. This is the reason that version control exists. Simply stated, it is a way to keep track of:

        • Who made changes to the code
        • When did the change happen
        • What changes were made

        For a full explanation please see this page

Secondly, it is important to note that GitHub is not git! GitHub is the dominating player when it comes to hosting repositories but that does not mean that they are the only ones providing free repository hosting (see bitbucket or gitlab for some other examples).

        That said we will be using git and GitHub throughout this course. It is a requirement for passing this course that you create a public repository with your code and use git to upload any code changes. How much you choose to integrate this into your own projects depends, but you are at least expected to be familiar with git+GitHub.

        Image credit"},{"location":"s2_organisation_and_version_control/git/#initial-config","title":"Initial config","text":"

        What does Git stand for?

        The name \"git\" was given by Linus Torvalds when he wrote the very first version. He described the tool as \"the stupid content tracker\" and the name as (depending on your mood):

        • Random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of \"get\" may or may not be relevant.
        • Stupid. Contemptible and Despicable. simple. Take your pick from the dictionary of slang.
        • \"Global information tracker\": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
        • \"Goddamn idiotic truckload of sh*t\": when it breaks
        1. Install git on your computer and make sure that your installation is working by writing git help in a terminal and it should show you the help message for git.

        2. Create a GitHub account if you do not already have one.

        3. To make sure that we do not have to type in our GitHub username every time that we want to do some changes, we can once and for all set them on our local machine

# type in a terminal\ngit config credential.helper store\ngit config --global user.email <email>\ngit config --global user.name <username>  # also set your username so commits are attributed correctly\n
        "},{"location":"s2_organisation_and_version_control/git/#git-overview","title":"Git overview","text":"

The simplest way to think of version control is that it is just nodes with lines connecting them

Each node, which we call a commit, is uniquely identified by a hash string. Each node stores what our code looked like at that point in time (when we made the commit), and using the hash codes we can easily revert to a specific point in time.

The commits are made up of local changes that we make to our code. A basic workflow for adding commits is seen below

        Assuming that we have made some changes to our local working directory and that we want to get these updates to be online in the remote repository we have to do the following steps:

• First we run the command git add. This will move our changes to the staging area. While changes are in the staging area we can very easily revert them (using git restore). No unique hash has therefore been assigned to the code yet, and we can still overwrite it.

• To take our code from the staging area and turn it into a commit, we simply run git commit, which will locally add a node to the graph. It is important to note again that we have not pushed the commit to the online repository yet.

        • Finally, we want others to be able to use the changes that we made. We do a simple git push and our commit gets online
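In terms of commands, the cycle described above looks like this (the file name and message are just examples):

git add train_model.py                 # stage the change\ngit commit -m \"Add training script\"    # create the commit locally\ngit push                               # upload the commit to the remote repository\n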

        Of course, the real power of version control is the ability to make branches, as in the image below

        Image credit

Each branch can contain code that is not present on other branches. This is useful when many developers are working together on the same project.

        "},{"location":"s2_organisation_and_version_control/git/#exercises","title":"\u2754 Exercises","text":"
1. In your GitHub account create a repository, where the intention is that you upload the code from the final exercise from yesterday

          1. After creating the repository, clone it to your computer

            git clone https://github.com/my_user_name/my_repository_name.git\n
          2. Move/copy the three files from yesterday into the repository (and any other that you made)

          3. Add the files to a commit by using git add command (1)

1. Writing good commit messages is a skill in itself. A commit message should be short but informative about the work you are trying to commit. Try to practise writing good commit messages throughout the course. You can see this guideline for help.
          4. Commit the files using git commit

          5. Finally push the files to your repository using git push. Make sure to check online that the files have been updated in your repository.

          6. You can always use the command git status to check where you are in the process of making a commit.

          7. Also checkout the git log command, which will show you the history of commits that you have made.

        2. Make sure that you understand how to make branches, as this will allow you to try out code changes without messing with your working code. Creating a new branch can be done using:

          # create a new branch\ngit checkout -b <my_branch_name>\n

Afterwards, you can use git checkout to change between branches (remember to commit your work!). Try adding something (a file, a new line of code etc.) to the newly created branch, commit it and try changing back to the main branch afterwards. You should hopefully see that whatever you added on the new branch is not present on the main branch.

        3. If you do not already have a cloned version of this repository belonging to the course, make sure to make one! I am continuously updating/changing some of the material during the course and I therefore recommend that you each day before the lecture do a git pull on your local copy

4. Git may seem like a waste of time when solutions like Dropbox, Google Drive etc. exist, and it is not completely untrue when only one or two people are working on a project. However, these file management systems fall short when hundreds to thousands of people work together. For this exercise you will go through the steps of sending an open-source contribution:

          1. Go online and find a project you do not own, where you can improve the code. You can either look at this page of good issues to get started with or for simplicity you can just choose the repository belonging to the course. Now fork the project by clicking the Fork button.

This will create a copy of the repository under your own GitHub account, which you have complete write access to. Note that code updates to the original repository do not update code in your fork.

          2. Clone your local fork of the project using git clone.

3. By default your local repository will be on the main branch (HINT: you can check this with the git status command). It is good practice to make a new branch when working on some changes. Use the git branch command followed by the git checkout command to create a new branch.

          4. You are now ready to make changes to the repository. Try to find something to improve (any spelling mistakes?). When you have made the changes, do the standard git cycle: add -> commit -> push

5. Go online to the original repository and go to the Pull requests tab. Find the compare button and choose to compare the master branch of the original repo with the branch that you just created in your own repository. Check the diff on the page to make sure that it contains the changes you have made.

          6. Write a bit about the changes you have made and click Create pull request :)

5. Forking a repository has the consequence that your fork and the repository that you forked can diverge. To mitigate this we can set what is called a remote upstream. Take a look at this page and set a remote upstream for the repository you just forked.

6. After setting the upstream branch, we need to pull and merge any updates. Take a look at this page and figure out how to do this.

7. As a final exercise we want to simulate a merge conflict, which happens when two users try to commit changes to exactly the same lines of code in the codebase, and git is not able to resolve how the different commits should be integrated.

          1. In your browser, open your favorite repository (it could be the one you just worked on), go to any file of your choosing and click the edit button (see image below) and make some change to the file. For example, if you choose a python file you can just import some random packages at the top of the file. Commit the change.

          2. Make sure not to pull the change you just made to your local computer. Locally make changes to the same file in the same lines and commit them afterwards.

          3. Now try to git pull the online changes. What should (hopefully) happen is that git will tell you that it found a merge conflict that needs to be resolved. Open the file and you should see something like this

            <<<<<<< HEAD\nthis is some content to mess with\ncontent to append\n=======\ntotally different content to merge later\n>>>>>>> master\n

this should be interpreted as: everything between <<<<<<< and ======= are the changes made by your local commit and everything between ======= and >>>>>>> are the changes you are trying to pull. To fix the merge conflict you simply have to make the code in the two \"cells\" work together. When you are done, remove the identifiers <<<<<<<, ======= and >>>>>>>.
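For example, if you decide that both changes should be kept, the resolved section of the file would simply read:

this is some content to mess with\ncontent to append\ntotally different content to merge later\n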

          4. Finally, commit the merge and try to push.

        8. (Optional) The above exercises have focused on how to use git from the terminal, which I highly recommend learning. However, most proper editors also have built-in support for version control. We recommend getting familiar with these features (here is a tutorial for VS Code).

        "},{"location":"s2_organisation_and_version_control/git/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
        1. How do you know if a certain directory is a git repository?

          Solution

          You can check if there is a \".git\" directory. Alternatively, you can use the git status command.

        2. Explain what the .gitignore file is used for.

          Solution

          The .gitignore file is used to tell git which files to ignore, e.g. which files should not be staged when running a git add . command. This is useful for files that are not part of the codebase but are needed for the code to run (e.g. data files) or files that contain sensitive information (e.g. .env files that contain API keys and passwords).

        3. You have two branches - main and devel. What sequence of commands would you need to execute to make sure that devel is in sync with main?

          Solution
          git checkout main\ngit pull\ngit checkout devel\ngit merge main\n
        4. What best practices are you familiar with regarding version control?

          Solution
          • Use a descriptive commit message
          • Make each commit a logical unit
          • Incorporate others' changes frequently
          • Share your changes frequently
          • Coordinate with your co-workers
          • Don't commit generated files

        That covers the basics of git to get you started. In the exercise folder you can find a git cheat sheet with the most useful commands for future reference. Finally, we want to point out another awesome feature of GitHub: the in-browser editor. Sometimes you have a small edit that you want to make but would still like to do it in an IDE/editor, or you may be working from another device than your usual developer machine. GitHub has a built-in editor that can be enabled by simply changing any URL from

        https://github.com/username/repository\n

        to

        https://github.dev/username/repository\n

        Try it out on your newly created repository.

        "},{"location":"s2_organisation_and_version_control/good_coding_practice/","title":"M7 - Good coding practice","text":""},{"location":"s2_organisation_and_version_control/good_coding_practice/#good-coding-practice","title":"Good coding practice","text":"

        Quote

        Code is read more often than it is written. Guido van Rossum (creator of Python)

        It is hard to define exactly what good coding practices are. But the above quote by Guido does hint at what they could be, namely that they have to do with how others read and perceive your code. In general, good coding practice is about making sure that your code is easy to read and understand, not only by others but also by your future self. The key concept to keep in mind when we are talking about good coding practice is consistency. In many cases it does not matter exactly how you choose to style your code etc.; the important part is that you are consistent about it.

        Image credit"},{"location":"s2_organisation_and_version_control/good_coding_practice/#documentation","title":"Documentation","text":"

        Most programmers have a love-hate relationship with documentation: we absolutely hate writing it ourselves, but love when someone else has actually taken the time to add it to their code. There is no doubt that well-documented code is much easier to maintain, as you do not need to remember every detail about the code to maintain it. It is key to remember that good documentation saves more time than it takes to write.

        The problem with documentation is that there is no universally right or wrong way to do it, and you can easily end up with either:

        • Under documentation: You document information that is clearly visible from the code and not the complex parts that are actually hard to understand.

        • Over documentation: Writing too much documentation will have the opposite effect of what you want on most people: there is too much to read, so people will skip it.

        Writing good documentation is a skill that takes time to train, so let's try to practice it.

        Quote

        Code tells you how; Comments tell you why. Jeff Atwood

        "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises","title":"\u2754 Exercises","text":"
        1. Go over the most complicated file in your project. Be critical and add comments where the logic behind the code is not easily understandable. (1)

          1. In deep learning we often work with multi-dimensional tensors that constantly change shape after each operation. It is good practice to annotate with a comment whenever a tensor undergoes some reshaping. In the following example we compute the pairwise squared Euclidean distance between two sets of points using broadcasting, which results in multiple shape operations.

            x = torch.randn(5, 10)  # N x D\ny = torch.randn(7, 10)  # M x D\nxy = x.unsqueeze(1) - y.unsqueeze(0)  # (N x 1 x D) - (1 x M x D) = (N x M x D)\npairwise_euc_dist = xy.abs().pow(2.0).sum(dim=-1)  # N x M\n
        2. Add docstrings to at least two Python functions/methods. You can see here (example 5) a good example of how to use identifiable keywords such as Parameters, Args and Returns, which standardizes the way of writing docstrings. A minimal sketch is shown below.
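
        For reference, here is a minimal sketch of what a Google-style docstring could look like; the normalize function and its arguments are purely hypothetical and only meant for illustration:

          def normalize(x: list[float], eps: float = 1e-8) -> list[float]:\n    \"\"\"Scale the values in x to the range [0, 1].\n\n    Args:\n        x: The numbers to normalize.\n        eps: Small constant guarding against division by zero.\n\n    Returns:\n        A new list with the normalized numbers.\n    \"\"\"\n    lo, hi = min(x), max(x)\n    return [(v - lo) / (hi - lo + eps) for v in x]\n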

        "},{"location":"s2_organisation_and_version_control/good_coding_practice/#styling","title":"Styling","text":"

        While Python already enforces some styling (e.g. code should be indented in a specific way), this is not enough to ensure that code from different users actually looks alike. Maybe even more troubling, you will often see your own coding style change as you become more experienced. Such differences in coding style are not that important when you are working on a personal project, but when multiple people work together on the same project they are important to consider.

        The question then remains what styling you should use. This is where PEP8, the official style guide for Python, comes into play. It essentially describes what is considered \"good practice\" and \"bad practice\" when writing Python.

        For many years the most commonly used tool to check if your code is PEP8 compliant has been flake8. However, in this course we are going to be using ruff, which is quickly gaining popularity due to how fast it is and how quickly the developers are adding new features. (1)

        1. Both flake8 and ruff are what is called a linter or lint tool, which is any kind of static code analysis program used to flag programming errors, bugs and styling errors.
        "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises_1","title":"\u2754 Exercises","text":"
        1. Install ruff

          pip install ruff\n
        2. Run ruff on your project or part of your project

          ruff check .  # Lint all files in the current directory (and any subdirectories)\nruff check path/to/code/  # Lint all files in `/path/to/code` (and any subdirectories).\n

          are you PEP8 compliant or are you a normal mortal?

        You could go and fix all the small errors that ruff reports by hand. However, in practice large projects instead rely on some kind of code formatter that automatically formats your code to be PEP8 compliant. For the longest time the biggest formatters in Python have been black and yapf, but we are going to use ruff, which also has a built-in formatter that should be a drop-in replacement for black.

        1. Try to use ruff format to format your code

          ruff format .  # Format all files in the current directory.\nruff format /path/to/file.py  # Format a single file.\n

        By default ruff applies a selection of rules when checking or formatting our code. However, many more rules can be activated through configuration. If you have completed module M6 on code structure you will have encountered the pyproject.toml file, which can store both build instructions for our package and configuration of developer tools. Let's try to configure ruff using the pyproject.toml file.

        1. One aspect that is not covered by PEP8 is how import statements in Python should be organized. If you are like most people, you place your import statements at the top of the file and they are ordered simply by when you needed them. A better practice is to introduce some clear structure in our imports. In older versions of this course we used isort for this job, but here we are going to configure ruff to do it. In your pyproject.toml file add the following lines

          [tool.ruff]\nselect = [\"I\"]\n

          and try re-running ruff check and ruff format. Hopefully this should reorganize your imports to follow common practice. (1)

          1. The common practice is to first list built-in Python packages (like os) in one block, followed by third-party dependencies (like torch) in a second block and finally imports from your own package in a third block. Each block is then put in alphabetical order.
        2. One PEP8 styling rule that is often diverged from is the recommended line length of 79 characters, which many (including myself) consider very restrictive. If your code consists of multiple levels of indentation, you can quickly run into 79 characters being limiting. For this reason many projects increase it, often to 120 characters, which seems to be the sweet spot for how many characters fit in a coding window on a laptop. Add the line

          line-length=120\n

          under the [tool.ruff] section in the pyproject.toml file and rerun ruff check and ruff format on your code.

        3. Experiment yourself with further configuration of ruff. In particular we recommend adding more rules and looking at the [tool.ruff.pydocstyle] configuration to indicate how you have styled your documentation.

        "},{"location":"s2_organisation_and_version_control/good_coding_practice/#typing","title":"Typing","text":"

        In addition to writing documentation and following a specific style, in Python we have a third way of improving the quality of our code: typing. Typing goes back to earlier programming languages like C, C++ etc. where data types need to be explicitly stated for variables:

        #include <iostream>\n\nint main() {\n    int x = 5 + 6;\n    float y = 0.5;\n    std::cout << \"Hello World! \" << x << std::endl;\n    return 0;\n}\n

        This is not required by Python, but it can really improve the readability of code: you can read directly from the signature what the expected types of the input arguments and the return value are. In Python the : character has been reserved for type hints. Here is one example of adding typing to a function:

        def add2(x: int, y: int) -> int:\n    return x+y\n

        here we mark that both x and y are integers, and using the arrow notation -> we mark that the output type is also an integer. Assuming that we are also going to use the function for floats and torch.Tensors, we could improve the typing by specifying a union of types. Depending on the version of Python you are using, the syntax for this differs.

        Python <3.10:

        from torch import Tensor  # note it is Tensor with upper case T. This is the base class of all tensors\nfrom typing import Union\ndef add2(x: Union[int, float, Tensor], y: Union[int, float, Tensor]) -> Union[int, float, Tensor]:\n    return x+y\n

        Python >=3.10:

        from torch import Tensor  # note it is Tensor with upper case T. This is the base class of all tensors\ndef add2(x: int | float | Tensor, y: int | float | Tensor) -> int | float | Tensor:\n    return x+y\n

        Finally, since this is a very generic function it also works on numpy arrays etc. In such cases we can always default to the Any type if we are not sure about all the specific types that the function can take:

        from typing import Any\ndef add2(x: Any, y: Any) -> Any:\n    return x+y\n

        However, in this case we are basically back to where we started, as if our function were not typed at all, because the Any hints do not help us. Therefore, use Any only when necessary.
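
        If the only real requirement is that the two inputs and the output share the same type, a constrained TypeVar is often a better fit than Any. A minimal sketch (an addition for illustration, not part of the original example):

        from typing import TypeVar\n\nfrom torch import Tensor\n\nT = TypeVar(\"T\", int, float, Tensor)  # x, y and the return value must all share one of these types\n\ndef add2(x: T, y: T) -> T:\n    return x + y\n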

        "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises_2","title":"\u2754 Exercises","text":"

        Exercise files

        1. We provide a file called typing_exercise.py. Add typing everywhere in the file. Please note that you will need the following import:

          from typing import Callable, Optional, Tuple, Union, List  # you will need all of them in your code\n

          for it to work. This cheat sheet is a good resource on typing. We also provide typing_exercise_solution.py, but try to solve the exercise yourself.

        2. mypy is what is called a static type checker. If you are using typing in your code, then a static type checker can help you find common mistakes. mypy does not run your code, but it scans it and checks that the types you have given are compatible. Install mypy

          pip install mypy\n
        3. Try to run mypy on the typing_exercise.py file

          mypy typing_exercise.py\n

          If you have solved the typing exercise above correctly then you should get no errors. If not, mypy will tell you where your types are incompatible. A small example of the kind of mistake mypy catches is shown below.
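
          As an illustration (not one of the provided exercise files), consider the following hypothetical snippet; running mypy on it would flag the last line because a str is passed where an int is expected:

          def days_to_seconds(days: int) -> int:\n    return days * 24 * 60 * 60\n\nseconds = days_to_seconds(\"7\")  # mypy flags this: argument has type \"str\", expected \"int\"\n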

        "},{"location":"s2_organisation_and_version_control/good_coding_practice/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
        1. According to PEP8 what is wrong with the following code?

          class myclass(nn.Module):\n    def TrainNetwork(self, X, y):\n        ...\n
          Solution

          According to PEP8, classes should follow the CapWords convention, meaning that the first letter in each word of the class name is capitalized. Thus myclass should be MyClass. Functions and methods, on the other hand, should be all lowercase with words separated by underscores. Thus TrainNetwork should be train_network.

        2. What would be the type of argument x for a function def f(x): if it should support the following inputs

          x1 = [1, 2, 3, 4]\nx2 = (1, 2, 3, 4)\nx3 = None\nx4 = {1: \"1\", 2: \"2\", 3: \"3\", 4: \"4\"}\n
          Solution

          The easy solution would be to do def f(x: Any), but we could also go with:

          def f(x: None | Tuple[int, ...] | List[int] | Dict[int, str]):\n

          alternatively, we could also do

          def f(x: None | Iterable[int]):\n

          because lists, tuples and dicts are all iterables and can therefore be covered by one type (in this specific case).

        This ends the module on coding style. We again want to emphasize that good coding style is more about being consistent than about strictly following PEP8. A good example of this is Google, which has its own style guide for Python. This guide does not match PEP8 exactly, but it makes sure that different teams within Google working on different projects largely follow the same style, so if a project is handed from one team to another, style at least will not be a problem.

        "},{"location":"s3_reproducibility/","title":"Reproducibility","text":"

        Slides

        Today is all about reproducibility - one of those concepts that everyone agrees is very important and that something should be done about, but the reality is that it is very hard to achieve full reproducibility. The last sessions have already touched a bit on how tools like conda and code organization can help make code more reproducible. Today we are going all the way, trying to ensure that our scripts and our computing environment are fully reproducible.

        "},{"location":"s3_reproducibility/#why-does-reproducibility-matter","title":"Why does reproducibility matter","text":"

        Reproducibility is closely related to the scientific method:

        Observe -> Question -> Hypotheses -> Experiment -> Conclude -> Result -> Observe -> ...

        Not having reproducible experiments essentially breaks the cycle between doing experiments and making conclusions. If experiments are not reproducible, then we cannot expect that others will arrive at the same conclusions as ourselves. As machine learning experiments are fundamentally the same as doing chemical experiments in a laboratory, we should be equally careful in making sure our environments are reproducible (think of your laptop as your laboratory).

        Secondly, if we focus on why reproducibility matters especially in machine learning, it is part of the bigger challenge of making sure that machine learning is trustworthy.

        Many different aspects are needed if trustworthy machine learning is ever going to be a reality. We need robustness of our pipelines so we can trust that they do not fail under heavy load. We need integrity to make sure that pipelines are only deployed if they are of high quality. We need explainability to make sure that we understand what our machine learning models are doing, so they are not just black boxes. We need reproducibility to make sure that the results of our models can be reproduced over and over again. Finally, we need fairness to make sure that our models are not biased toward specific populations. Figure inspired by this paper.

        Trustworthy ML is the idea that machine learning agents can be trusted. Take the example of a machine learning agent being responsible for medical diagnoses. It is very clear that we need to be able to trust that the agent gives us the correct diagnosis for the system to work in practice. Reproducibility plays a big role here, because without it we cannot be sure that the same agent deployed at two different hospitals will give the same diagnosis (given the same input).

        Learning objectives

        The learning objectives of this session are:

        • To understand the importance of reproducibility in computer science
        • To be able to use docker to create a reproducible container, including how to build them from scratch
        • Understand different ways of configuring your code and how to use hydra to integrate with config files
        "},{"location":"s3_reproducibility/config_files/","title":"M10 - Config Files","text":""},{"location":"s3_reproducibility/config_files/#config-files","title":"Config files","text":"

        With docker we can make sure that our compute environment is reproducible, but that does not mean that all our experiments magically become reproducible. There are other factors that are important for creating reproducible experiments.

        In this paper (a highly recommended read) the authors tried to reproduce the results of 255 papers and to figure out which factors were significant for succeeding. One of those factors was \"Hyperparameters Specified\", i.e. whether or not the authors of a paper had precisely specified the hyperparameters that were used to run the experiments. It should come as no surprise that this can be a determining factor for reproducibility; however, it is not a given that hyperparameters are always well specified.

        "},{"location":"s3_reproducibility/config_files/#configure-experiments","title":"Configure experiments","text":"

        There is really no way around it: deep learning involves a lot of hyperparameters. In general, a hyperparameter is any parameter that affects the learning process but is not itself learned (e.g. the weights of a neural network are not hyperparameters because they are a consequence of the learning process). The problem with having many hyperparameters to control in your code is that, if you are not careful and do not structure them, it may be hard after running an experiment to figure out which hyperparameters were actually used. Lack of proper configuration management can cause serious problems with reliability, uptime, and the ability to scale a system.

        One of the most basic ways of structuring hyperparameters is just to put them directly into your train.py script in some object:

        class my_hp:\n    batch_size = 64\n    lr = 128\n    other_hp = 12345\n\n# easy access to them\ndl = DataLoader(dataset, batch_size=my_hp.batch_size)\n

        The problem here is that configuration is not easy. Each time you want to run a new experiment, you basically have to change the script. If you run the code multiple times without committing the changes in between, then the exact hyperparameter configuration for some experiments may be lost. Alright, with this in mind you change strategy and use an argument parser, e.g. run experiments like this

        python train.py --batch_size 256 --learning_rate 1e-4 --other_hp 12345\n

        This at least solves the configurability problem. However, we can again end up losing track of experiments if we are not careful about saving the arguments that were used.
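
        For reference, a minimal sketch of what such an argument-parser based train.py could look like; the argument names follow the command above, everything else is just an illustrative assumption:

        import argparse\n\nparser = argparse.ArgumentParser(description=\"Train a model\")\nparser.add_argument(\"--batch_size\", type=int, default=64)\nparser.add_argument(\"--learning_rate\", type=float, default=1e-4)\nparser.add_argument(\"--other_hp\", type=int, default=12345)\nargs = parser.parse_args()\n\nprint(args.batch_size, args.learning_rate, args.other_hp)  # use these values in your training code\n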

        What we really want is some way to easily configure our experiments where the hyperparameters are systematically saved with the experiment. For this we turn our attention to Hydra, a configuration tool that is based around writing config files to keep track of hyperparameters. Hydra operates on top of OmegaConf which is a yaml based hierarchical configuration system.

        A simple yaml configuration file could look like

        #config.yaml\nhyperparameters:\n  batch_size: 64\n  learning_rate: 1e-4\n

        with the corresponding python code for loading the file

        from omegaconf import OmegaConf\n# loading\nconfig = OmegaConf.load('config.yaml')\n\n# accessing in two different ways\ndl = DataLoader(dataset, batch_size=config.hyperparameters.batch_size)\noptimizer = torch.optim.Adam(model.parameters(), lr=config['hyperparameters']['learning_rate'])\n

        or using hydra for loading the configuration

        import hydra\n\n@hydra.main(config_name=\"config.yaml\")\ndef main(cfg):\n    print(cfg.hyperparameters.batch_size, cfg.hyperparameters.learning_rate)\n\nif __name__ == \"__main__\":\n    main()\n

        The idea behind refactoring our hyperparameters into .yaml files is that we disentangle the configuration from the code. In this way it is easier to version control the configuration, because it lives in a separate file.
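
        As a side note, OmegaConf can also write a config back to disk, which is a simple way to snapshot exactly which hyperparameters a run used (Hydra does something similar automatically in its output folder). A minimal sketch, assuming the config.yaml from above:

        from omegaconf import OmegaConf\n\nconfig = OmegaConf.load(\"config.yaml\")\nconfig.hyperparameters.learning_rate = 3e-4  # perhaps changed for this particular run\nOmegaConf.save(config, \"used_config.yaml\")  # store the exact configuration next to the results\n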

        "},{"location":"s3_reproducibility/config_files/#exercises","title":"\u2754 Exercises","text":"

        Exercise files

        The main idea behind the exercises is to take a single script (that we provide) and use Hydra to make sure that everything gets correctly logged such that you would be able to exactly report to others how each experiment was configured. In the provided script, the hyperparameters are hardcoded into the code and your job will be to separate them out into a configuration file.

        Note that we provide a solution (in the vae_solution folder) that can help you get through the exercise, but try to look online for your answers before looking at the solution. Remember: it's not about the result, it's about the journey.

        1. Start by installing hydra: pip install hydra-core --upgrade

        2. Next take a look at the vae_mnist.py and model.py files and understand what is going on. It is a model we will revisit during the course.

        3. Identify the key hyperparameters of the script. Some of them should be easy to find, but at least 3 are hidden inside the core part of the code. One essential hyperparameter is also not included in the script but is needed to make it completely reproducible (HINT: the weights of any neural network are initialized at random).

        4. Write a configuration file config.yaml where you write down the hyperparameters that you have found

        5. Get the script running by loading the configuration file inside your script (using hydra) so that the hyperparameters are incorporated into the script. Note: you should only edit the vae_mnist.py file and not the model.py file.

        6. Run the script

        7. By default hydra will write the results to an outputs folder, with a sub-folder for the day the experiment was run and a further one for the time it was started. Inspect your run by going over each file that hydra has generated and check that the information has been logged. Can you find the hyperparameters?

        8. Hydra also allows for dynamically changing and adding parameters on the fly from the command-line:

          1. Try changing one parameter from the command-line

            python vae_mnist.py hyperparameters.seed=1234\n
          2. Try adding one parameter from the command-line

            python vae_mnist.py +experiment.stuff_that_i_want_to_add=42\n
        9. By default the file vae_mnist.log will be empty, meaning that whatever you printed to the terminal did not get picked up by Hydra. This is because Hydra under the hood makes use of the native Python logging package. This means that to also save all printed output from the script we need to replace all calls to print with log.info

          1. Create a logger in the script:

            import logging\nlog = logging.getLogger(__name__)\n
          2. Exchange all calls to print with calls to log.info

          3. Try re-running the script and make sure that the output printed to the terminal also gets saved to the vae_mnist.log file

        10. Make sure that your script is fully reproducible. To check this you will need two runs of the script to compare. Then run the reproducibility_tester.py script as

          python reproducibility_tester.py path/to/run/1 path/to/run/2\n

          the script will go over the trained weights to see if they match and check that the hyperparameters were the same. Note: for the script to work, the weights should be saved to a file called trained_model.pt (this is the default of the vae_mnist.py script, so this is only relevant if you have changed how the weights are saved)

        11. Finally, make a new experiment using a new configuration file where you have changed a hyperparameter of your own choice. You are not allowed to change the configuration file in the script but should instead be able to provide it as an argument when launching the script e.g. something like

          python vae_mnist.py experiment=exp2\n

          We recommend that you use a file structure like this

          |--conf\n|  |--config.yaml\n|  |--experiments\n|     |--exp1.yaml\n|     |--exp2.yaml\n|--my_app.py\n
        "},{"location":"s3_reproducibility/config_files/#final-exercise","title":"Final exercise","text":"

        Make your MNIST code reproducible! Apply what you have just done to the simple script to your own MNIST code. The only requirement is that this time you use multiple configuration files, meaning that you should have at least one model_conf.yaml file and one training_conf.yaml file that separate the hyperparameters that have to do with the model definition from those that have to do with the training. You can also choose to work with even more complex config setups: in the image below the configuration has two layers, such that we can individually specify hyperparameters belonging to a specific model architecture and hyperparameters for each individual optimizer that we may try. A minimal sketch of combining two config files is shown below.
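
        Conceptually, combining several config files boils down to merging them. With OmegaConf (which Hydra builds on) that could look like the following sketch; the file names follow the exercise text and are otherwise just an assumption:

        from omegaconf import OmegaConf\n\nmodel_cfg = OmegaConf.load(\"model_conf.yaml\")\ntrain_cfg = OmegaConf.load(\"training_conf.yaml\")\ncfg = OmegaConf.merge(model_cfg, train_cfg)  # later configs override earlier ones on overlapping keys\nprint(OmegaConf.to_yaml(cfg))  # inspect the combined configuration\n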

        Image credit"},{"location":"s3_reproducibility/docker/","title":"M9 - Docker","text":""},{"location":"s3_reproducibility/docker/#docker","title":"Docker","text":"

        Core Module

        Image credit

        While the above picture may seem silly at first, it is actually pretty close to how docker came into existence. A big part of creating an MLOps pipeline is that you are able to reproduce it. Reproducibility goes beyond versioning our code with git and using conda environments to keep track of our Python installations. To really get reproducibility we also need to capture system-level components like

        • operating system
        • software dependencies (other than python packages)

        Docker provides this kind of system-level reproducibility by creating isolated program environments. In addition to reproducibility, one of docker's key features is scalability, which is important when we later discuss deployment. Because docker is system-level reproducible, it does not (conceptually) matter if we try to start our program on a single machine or on 1000 machines at once.

        "},{"location":"s3_reproducibility/docker/#docker-overview","title":"Docker overview","text":"

        Docker has three main concepts: docker file, docker image and docker container:

        • A docker file is a basic text document that contains all the commands a user could call on the command line to run an application. This includes installing dependencies, pulling data from online storage, setting up code and specifying which commands you want to run (e.g. python train.py).

        • Running, or more correctly building, a docker file will create a docker image. An image is a lightweight, standalone/containerized, executable package of software that includes everything (application code, libraries, tools, dependencies etc.) necessary to make an application run.

        • Actually running an image will create a docker container. This means that the same image can be launched multiple times, creating multiple containers.

        The exercises today will focus on how to construct the actual docker file, as this is the first step to constructing your own container.

        "},{"location":"s3_reproducibility/docker/#docker-sharing","title":"Docker sharing","text":"

        The whole point of using docker is that sharing applications becomes much easier. In general, we have two options

        • After creating the Dockerfile we can simply commit it to GitHub (it's just a text file) and then ask other users to build the image themselves.

        • After building the image ourselves, we can choose to upload it to an image registry such as Docker Hub, where others can get our image by simply running docker pull, enabling them to instantaneously run it as a container, as shown in the figure below

        Image credit"},{"location":"s3_reproducibility/docker/#exercises","title":"\u2754 Exercises","text":"

        In the following exercises we guide you through building a docker file for your MNIST repository that will make training and prediction a self-contained application. Please make sure that you understand each step at least somewhat and do not just copy the commands from the exercise. Also note that you probably need to execute the exercises from an elevated terminal, e.g. with administrative privileges.

        The exercises today are only an introduction to docker and some of the steps are unoptimized from a production point of view. For example, we often want to keep the size of a docker image as small as possible, which we are not focusing on in these exercises.

        If you are using VS Code then we recommend installing the docker VS Code extension to easily get an overview of which images have been built and which containers are running. Additionally, the extension named Dev Containers may also be beneficial for you to download.

        1. Start by installing docker. How much trouble you need to go through depends on your operating system. For Windows and Mac we recommend installing Docker Desktop, which comes with a graphical user interface (GUI) for quickly viewing the docker images and docker containers currently built/in use. Windows users that have not installed WSL yet are going to have to do it now (as docker needs it as a backend for starting virtual machines), but you do not need to install docker inside WSL. After installing docker we recommend that you restart your laptop.

        2. Try running the following to confirm that your installation is working:

          docker run hello-world\n

          which should give the message

          Hello from Docker!\nThis message shows that your installation appears to be working correctly.\n
        3. Next let's try to download an image from Docker Hub. Download the busybox image:

          docker pull busybox\n

          which is a very small (1-5 MB) containerized application that contains the most essential GNU fileutils, shellutils etc.

        4. After pulling the image, write

          docker images\n

          which should show you all images that are available. You should see the busybox image that we just downloaded.

        5. Let's try to run this image

          docker run busybox\n

          you will see that nothing happens! The reason is that we did not provide any command to docker run. We essentially just asked it to start the busybox virtual machine, do nothing and then close it again. Now try again, this time with

          docker run busybox echo \"hello from busybox\"\n

          Note how fast this process is. In just a few seconds, Docker is able to start a virtual machine, execute a command and kill it afterwards.

        6. Try running

          docker ps\n

          what does this command do? What if you add -a to the end?

        7. If we want to run multiple commands within the virtual machine, we can start it in interactive mode

          docker run -it busybox\n

          this can be a great way to investigate what the filesystem of our virtual machine looks like.

        8. As you may have noticed by now, each time we execute docker run we can still see small remnants of the containers using docker ps -a. These stray containers can end up taking a lot of disk space. To remove them, use docker rm and provide the id of the container that you want to delete

          docker rm <container_id>\n
        9. Let's now move on to constructing a docker file ourselves for our MNIST project. Create a file called trainer.dockerfile. The intention is that we want to develop one dockerfile for running our training script and one for doing predictions.

        10. Instead of starting from scratch we nearly always want to start from some base image. For this exercise we are going to start from a simple python image. Add the following to your Dockerfile

          # Base image\nFROM python:3.9-slim\n
        11. Next we are going to install some essentials in our image, in this case compiler tools that some Python packages need when being installed. These instructions may seem familiar if you are using Linux:

          # install system build dependencies\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n
        12. The previous two steps are common for any docker application where you want to run python. All the remaining steps are application specific (to some degree):

          1. Let's copy over our application (the essential parts) from our computer to the image:

            COPY requirements.txt requirements.txt\nCOPY pyproject.toml pyproject.toml\nCOPY <project-name>/ <project-name>/\nCOPY data/ data/\n

            Remember that we only want the essential parts to keep our docker image as small as possible. Why do we need each of these files/folders to run training in our docker container?

          2. Let's set the working directory in our container and add commands that install the dependencies (1):

            1. We split the installation into two steps, such that docker can cache our project dependencies separately from our application code. This means that if we change our application code, we do not need to reinstall all the dependencies. This is a common strategy for docker images.

              As an alternative you can use RUN make requirements if you have a Makefile that installs the dependencies. Just remember to also copy over the Makefile into the docker image.

            WORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\nRUN pip install . --no-deps --no-cache-dir\n

            the --no-cache-dir flag is quite important. Can you explain what it does and why it is important in relation to docker?

          3. Finally, we are going to name our training script as the entrypoint for our docker image. The entrypoint is the application that we want to run when the image is being executed:

            ENTRYPOINT [\"python\", \"-u\", \"<project_name>/train_model.py\"]\n

            the \"u\" here makes sure that any output from our script e.g. any print(...) statements gets redirected to our terminal. If not included you would need to use docker logs to inspect your run.

        13. We are now ready to build our docker file into a docker image

          docker build -f trainer.dockerfile . -t trainer:latest\n
          MAC M1/M2 users

          In general docker images are built for a specific platform. For example, if you are using a Mac with an M1/M2 chip then you are running on an ARM architecture, whereas if you are using a Windows or Linux machine then you are most likely running on an AMD64 architecture. This is important to know when building docker images, because images you build may not work on platforms other than the one you built them on. You can specify which platform you want to build for by adding the --platform argument to the docker build command:

          docker build --platform linux/amd64 -f trainer.dockerfile . -t trainer:latest\n

          and also when running the image:

          docker run --platform linux/amd64 trainer:latest\n

          Do note that this will significantly increase the build and run time of your docker image when running locally, because docker will need to emulate the other platform. In general, for the exercises today you should not need to specify the platform, but be aware of this if you are building docker images on your own.

          Please note that here we are providing two extra arguments to docker build. The -f trainer.dockerfile . (the dot is important to remember) indicates which dockerfile we want to build (only needed if you named it something other than just Dockerfile) and the -t trainer:latest is the name and tag that we see afterwards when running docker images (see image below). Also note that building a docker image can take a couple of minutes.

          Docker images and space

          Docker images can take up a lot of space on your computer, especially the docker images we are trying to build here, because Pytorch is a huge dependency. If you are running low on space, you can run

          docker system prune\n

          alternatively you can manually delete images using docker rmi {image_name}:{image_tag}.

        14. Try running docker images and confirm that you get output similar to the one above. If you succeed with this, then try running the docker image

          docker run --name experiment1 trainer:latest\n

          you should hopefully see your training start. Please note that we can start as many containers as we want at the same time by giving them all different names using the --name argument.

          1. You are most likely going to re-build your docker image multiple times, either due to an implementation error or the addition of new functionality. Therefore, instead of watching pip suffer through downloading torch for the 20th time, you can reuse the cache from the last time the docker image was built. To do this, replace the line in your dockerfile that installs your requirements with:

            RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt\n

            which mounts a persistent pip cache into the image build. For building the image you need to have the BuildKit feature enabled. If you have docker version v23.0 or later (you can check this by running docker version) then it is enabled by default; otherwise you need to enable it by setting the environment variable DOCKER_BUILDKIT=1 before building the image.

            Try changing your dockerfile and re-building the image. You should see that the build process is much faster.

        15. Remember, if you are ever in doubt about how files are organized inside a docker image, you always have the option to start the image in interactive mode:

          docker run -it --entrypoint sh {image_name}:{image_tag}\n
        16. When your training has completed you will notice that any files created when running your training script are not present on your laptop (for example if your script saves the trained model to a file). This is because the files were created inside your container (which is its own little machine). To get the files out you have two options:

          1. If you already have a completed run then you can use

            docker cp\n

            to copy the files between your container and laptop. For example to copy a file called trained_model.pt from a folder you would do:

            docker cp {container_name}:{dir_path}/{file_name} {local_dir_path}/{local_file_name}\n

            Try this out.

          2. A much more efficient strategy is to mount a volume that is shared between the host (your laptop) and the container. This can be done with the -v option for the docker run command. For example, if we want to automatically get the trained_model.pt file after running our training script we could simply execute the container as

            docker run --name {container_name} -v %cd%/models:/models/ trainer:latest\n

            this command mounts our local models folder as a corresponding models folder in the container. Any file saved by the container to this folder will be synchronized back to our host machine. Try this out! Note that if you have multiple files/folders that you want to mount, you simply need to add multiple -v options (if in doubt about the file organization in the container, try doing the next exercise first). Also note that the %cd% part may need to change depending on your OS, see this page for help.

        17. With training done we also need to write an application for prediction. Create a new docker file called predict.dockerfile. This file should call your <project_name>/models/predict_model.py script instead. This image will need some trained model weights to work. Feel free to either include these during the build process or mount them afterwards. When you have created the file, try to build and run it to confirm that it works. Hint: if you are passing the model checkpoint and prediction data as arguments to your script, your docker run probably needs to look something like

          docker run --name predict --rm \\\n    -v %cd%/trained_model.pt:/models/trained_model.pt \\  # mount trained model file\n    -v %cd%/data/example_images.npy:/example_images.npy \\  # mount data we want to predict on\n    predict:latest \\\n    ../../models/trained_model.pt \\  # argument to script, path relative to script location in container\n    ../../example_images.npy\n
        18. (Optional, requires GPU support) By default a virtual machine created by docker only has access to your CPU and not your GPU. While you do not necessarily have a laptop with a GPU that supports training of neural networks (e.g. one from Nvidia), it is beneficial to understand how to construct a docker image that can take advantage of a GPU, in case you in the future run it on a machine that does have a GPU (e.g. in the cloud). It does take a bit more work, but many of the steps are similar to building a normal docker image.

          1. There are three prerequisites for working with Nvidia GPU accelerated docker containers. First you need to have the Docker Engine installed (already taken care of), secondly an Nvidia GPU with updated GPU drivers and finally the Nvidia container toolkit installed. The last part you most likely have not installed yet and need to do now. Some distros of Linux have known problems with the installation process, so you may have to search through the known issues in the nvidia-docker repository to find a solution.

          2. To test that everything is working start by pulling a relevant Nvidia docker image. In my case this is the correct image:

            docker pull nvidia/cuda:11.0.3-base-ubuntu20.04\n

            but it may differ based on what CUDA version you have. You can find all the different official Nvidia images here. After pulling the image, try running the nvidia-smi command inside a container based on the image you just pulled. It should look something like this:

            docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi\n

            and should show an image like below:

            If it does not work, try redoing the steps.

          3. We should hopefully now have a working setup for running Nvidia accelerated docker containers. The next step is to get Pytorch inside our container, such that our Pytorch implementation also correctly identifies the GPU. Luckily for us, Nvidia provides a set of docker images with GPU-optimized software for AI, HPC and visualization through their NGC Catalog. The containers that have to do with Pytorch can be seen here. Try pulling the latest:

            docker pull nvcr.io/nvidia/pytorch:22.07-py3\n

            It may take some time, because the NGC images include a lot of other software for optimizing Pytorch applications. You may be able to find other images for running GPU accelerated applications that have a smaller footprint, but NGC is the recommended and supported way.

          4. Let's test that this container works:

            docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.07-py3\n

            this should run the container in interactive mode, attached to your current terminal. Try opening python inside the container and writing:

            import torch\nprint(torch.cuda.is_available())\n

            which hopefully should return True.

          5. Finally, we need to incorporate all this into our already developed docker files for our application. This is also fairly easy as we just need to change our FROM statement in the beginning of our docker file:

            FROM python:3.7-slim\n

            change to

            FROM  nvcr.io/nvidia/pytorch:22.07-py3\n

            try doing this to one of your docker files, build the image and run the container. Remember to check that your application is using GPU by printing torch.cuda.is_available().

        19. (Optional) Another way you can use dockerfiles in your day-to-day work is for dev containers. Developer containers allow you to develop code directly inside a container, making sure that your code runs in the same environment as it will when deployed. This is especially useful if you are working on a project that has a lot of dependencies that are hard to install on your local machine. Setup instructions for VS Code and PyCharm can be found here (it should be simple since we have already installed docker):

          • VS code
          • Pycharm

          We focus on the VS code setup here.

          1. First install the Remote - Containers extension.

          2. Create a .devcontainer folder in your project root and create a Dockerfile inside it. We keep this file very barebones for now, so let's just define a base installation of Python:

            FROM python:3.11-slim-buster\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n
          3. Create a devcontainer.json file in the .devcontainer folder. This file should look something like this:

            {\n    \"name\": \"my_working_env\",\n    \"dockerFile\": \"Dockerfile\",\n    \"postCreateCommand\": \"pip install -r requirements.txt\"\n}\n

            this file tells VS code that we want to use the Dockerfile that we just created and that we want to install our python dependencies after the container has been created.

          4. After creating these files, you should be able to open the command palette in VS code (F1) and search for the option Remote-Containers: Reopen in Container or Remote-Containers: Rebuild and Reopen in Container. Choose either of these options.

            This will start a new VS code instance inside a docker container. You should be able to see this in the bottom left corner of your VS code window. You should also be able to see that the python interpreter has changed to the one inside the container.

            You are now ready to start developing inside the container. Try opening a terminal and run python and import torch to confirm that everything is working.

        20. (Optional) In M8 on Data version control you learned about the framework dvc for version controlling data. A natural question at this point would then be how to incorporate dvc into our docker image. We need to do two things:

          • Make sure that dvc has all the correct files to pull data from our remote storage
          • Make sure that dvc has the correct credentials to pull data from our remote storage

          We are going to assume that dvc (and any dvc extension needed) is part of your requirements.txt file and that it is already being installed by a RUN pip install -r requirements.txt command in your dockerfile. If not, then you need to add it.

          1. Add the following lines to your dockerfile

            RUN dvc init --no-scm\nCOPY .dvc/config .dvc/config\nCOPY *.dvc *.dvc\nRUN dvc config core.no_scm true\nRUN dvc pull\n

            The first line initializes dvc in the docker image. The --no-scm option is needed because normally dvc can only be initialized inside a git repository, but this option allows dvc to be initialized without being in one. The second and third lines copy over the dvc config file and the dvc metadata files that are needed to pull data from your remote storage. The last line pulls the data.

          2. If your data is not public, we need to provide credentials in some way to pull the data. We are for now going to do it in a not-so-secure way. When dvc first connected to your drive a credential file was created. This file is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json, where $CACHE_HOME depends on your operating system:

            • macOS: ~/Library/Caches
            • Linux: ~/.cache (this is the typical location, but it may vary depending on which distro you are running)
            • Windows: {user}/AppData/Local

            Find the file. The content should look similar to this (only some fields are shown):

            {\n    \"access_token\": ...,\n    \"client_id\": ...,\n    \"client_secret\": ...,\n    \"refresh_token\": ...,\n    ...\n}\n

            We are going to copy the file into our docker image. This is of course not a secure way of doing it, but it is the easiest way to get started, and as long as you are not sharing your docker image with anyone else it is fine. Add the following lines to your dockerfile before the RUN dvc pull command:

            COPY <path_to_default.json> default.json\nRUN dvc remote modify myremote --local gdrive_service_account_json_file_path default.json\n

            where <path_to_default.json> is the path to the default.json file that you just found. The last line tells dvc to use the default.json file as the credentials for pulling data from your remote storage. You can confirm that this works by running dvc pull in your docker image.

            "},{"location":"s3_reproducibility/docker/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
            1. What is the difference between a docker image and a docker container?

              Solution

              A docker image is a template for a docker container. A docker container is a running instance of a docker image. A docker image is a static file, while a docker container is a running process.

            2. What are the 3 steps involved in containerizing an application?

              Solution
              1. Write a Dockerfile that includes your app (including the commands to run it) and its dependencies
              2. Build the image using the Dockerfile you wrote
              3. Run the container using the image you've built
            3. What advantage is there to running your application inside a docker container instead of running the application directly on your machine?

              Solution

              Running inside a docker container gives you a consistent and independent environment for your application. This means that you can be sure that your application will run the same way on your machine as it will on another machine. Thus, docker gives the ability to abstract away the differences between different machines.

            4. A docker image is built from a series of layers that are stacked on top of each other. This should be clear if you look at the output when building a docker image. What is the advantage of this?

              Solution

              The advantage is efficiency and reusability. When a change is made to a docker image, only the layer(s) that changed need to be updated. For example, if you update the application code in your docker image, which usually is the last layer, then only that layer needs to be rebuilt, making the process much faster. Additionally, if you have multiple docker images that share the same base image, then the base image only needs to be downloaded once.

            That covers the absolute minimum you should know about docker to get a working image and container. If you really want to deep dive into this topic you can find a copy of the Docker Cookbook by S\u00e9bastien Goasguen in the literature folder.

            If you are going to be actively using docker in the near future, one thing to consider is the image size. Even the simple images that we have built still take up gigabytes in size. A number of optimization steps can be taken to reduce the image size for you or your end user. If you have time you can read this article on different approaches to reducing image size. Additionally, you can take a look at the dive-in extension for Docker Desktop that lets you explore your docker images in depth.

            "},{"location":"s4_debugging_and_logging/","title":"Debugging, Profiling, Logging and Boilerplate","text":"

            Slides

            Today we are initially going to go over three different topics that are all fundamentally necessary for any data scientist or DevOps engineer:

            • Debugging
            • Profiling
            • Logging

            All three topics are things you are probably already somewhat familiar with. Since you started programming, you have done debugging, as nobody can write perfect code on the first try. Similarly, even if you have not directly profiled your code, I bet that you at some point have had some very slow code and optimized it to run faster. Identifying and improving bottlenecks is the fundamental idea of profiling code. Finally, logging is a very broad term and basically refers to any kind of output from your applications that helps you at a later point identify the \"performance\" of your application.

            However, while we expect you to already be familiar with these topics, we do not expect all of you to be experts, as these are topics that are rarely given much focus. Today we are going to introduce some best practices and tools to help you with each of these three important topics.

            As the final topic for today we are going to learn about how we can minimize boilerplate and focus on coding what actually matters for our project instead of all the boilerplate to get it working.

            Learning objectives

            The learning objectives of this session are:

            • Understand the basics of debugging and how to use a debugger to find bugs in your code
            • Be able to use a profiler to identify bottlenecks in your code and, based on those profiles, optimize the runtime of your programs
            • Be familiar with an experiment logging framework for tracking experiments and hyperparameters of your code to make it reproducible
            • Be able to use pytorch-lightning framework to minimize boilerplate code and structure deep learning models
            "},{"location":"s4_debugging_and_logging/boilerplate/","title":"M14 - Boilerplate","text":""},{"location":"s4_debugging_and_logging/boilerplate/#minimizing-boilerplate","title":"Minimizing boilerplate","text":"

            Boilerplate is a general term that describes any standardized text, copy, documents, methods, or procedures that may be used over and over again without making major changes to the original. But how does this relate to doing machine learning projects? If you have already tried doing a couple of projects within machine learning you will probably have seen a pattern: every project usually consists of these three aspects of code:

            • a model implementation
            • some training code
            • a collection of utilities for saving models, logging images etc.

            While the latter two certainly seem important, in most cases the actual development or research revolves around defining the model. In this sense, both the training code and the utilities become boilerplate that should just carry over from one project to another. The problem usually is that we have not generalized our training code to take care of the small adjustments that may be required in future projects, and we therefore end up implementing it over and over again every time we start a new project. This is of course a waste of our time that we should try to find a solution to.

            This is where high-level frameworks come into play. High-level frameworks are built on top of another framework (Pytorch in this case) and try to abstract/standardize how to do particular tasks such as training. At first it may seem irritating that you need to comply with someone else's code structure, however there is a good reason for that. The idea is that you can focus on what really matters (your task, model architecture etc.) and do not have to worry about the actual boilerplate that comes with it.

            The most popular high-level (training) frameworks within the Pytorch ecosystem are:

            • fast.ai
            • Ignite
            • skorch
            • Catalyst
            • Composer
            • Pytorch Lightning

            They all offer many of the same features, so choosing one over the other for most projects should not matter that much. We are here going to use Pytorch Lightning, as it offers all the functionality that we are going to need later in the course.

            "},{"location":"s4_debugging_and_logging/boilerplate/#pytorch-lightning","title":"Pytorch Lightning","text":"

            In general we refer to the documentation from Pytorch lightning if in doubt about how to format your code for doing specific tasks. We are here going to explain the key concepts of the API that you need to understand to use the framework, starting with the LightningModule and the Trainer.

            "},{"location":"s4_debugging_and_logging/boilerplate/#lightningmodule","title":"LightningModule","text":"

            The LightningModule is a subclass of a standard nn.Module that basically adds additional structure. In addition to the standard __init__ and forward methods that need to be implemented in an nn.Module, a LightningModule requires two more methods to be implemented:

            • training_step: should contain your actual training code e.g. given a batch of data this should return the loss that you want to optimize

            • configure_optimizers: should return the optimizer that you want to use

            Below, these two methods are shown added to a standard MNIST classifier.
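
            A minimal sketch of what this could look like (the exact architecture, layer sizes and learning rate here are illustrative assumptions, not the course's reference solution):

            from torch import nn, optim\nimport pytorch_lightning as pl\n\nclass MyAwesomeModel(pl.LightningModule):\n    def __init__(self):\n        super().__init__()\n        # placeholder architecture for flattened 28x28 MNIST images\n        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))\n        self.criterion = nn.CrossEntropyLoss()\n\n    def forward(self, x):\n        return self.backbone(x)\n\n    def training_step(self, batch, batch_idx):\n        # a single training step: compute and return the loss for one batch\n        data, target = batch\n        preds = self(data)\n        return self.criterion(preds, target)\n\n    def configure_optimizers(self):\n        # the optimizer that the Trainer should use\n        return optim.Adam(self.parameters(), lr=1e-3)\n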

            Compared to a standard nn.Module, the additional methods in the LightningModule basically specify exactly how you want to optimize your model.

            "},{"location":"s4_debugging_and_logging/boilerplate/#trainer","title":"Trainer","text":"

            The second component of lightning is the Trainer object. As the name suggests, the Trainer object takes care of the actual training, automating everything that you do not want to worry about.

            from pytorch_lightning import Trainer\nmodel = MyAwesomeModel()  # this is our LightningModule\ntrainer = Trainer()\ntrainer.fit(model)\n

            That is essentially all you need to specify in lightning to have a working model. The trainer object does not have methods that you need to implement yourself, but it has a bunch of arguments that can be used to control how many epochs you want to train for, whether you want to run on gpu, etc. To get the training of our model to work we just need to specify how our data should be fed into the lightning framework.

            "},{"location":"s4_debugging_and_logging/boilerplate/#data","title":"Data","text":"

            For organizing our code that has to do with data in Lightning we essentially have three different options. However, all three assume that we are using torch.utils.data.DataLoader for the dataloading.

            1. If we already have a train_dataloader and possibly also a val_dataloader and test_dataloader defined, we can simply add them to our LightningModule using the similarly named methods:

              def train_dataloader(self):\n    return DataLoader(...)\n\ndef val_dataloader(self):\n    return DataLoader(...)\n\ndef test_dataloader(self):\n    return DataLoader(...)\n
            2. Maybe even simpler, we can directly feed such dataloaders in the fit method of the Trainer object:

              trainer.fit(model, train_dataloader, val_dataloader)\ntrainer.test(model, test_dataloader)\n
            3. Finally, Lightning also has the LightningDataModule that organizes data loading into a single structure, see this page for more info. Putting data loading into a DataModule makes sense as it can then be reused between projects (a minimal sketch of such a data module is shown below).
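
            A minimal sketch of such a data module, using random tensors as a stand-in for a real dataset:

            import torch\nimport pytorch_lightning as pl\nfrom torch.utils.data import DataLoader, TensorDataset\n\nclass MyDataModule(pl.LightningDataModule):\n    def __init__(self, batch_size: int = 32):\n        super().__init__()\n        self.batch_size = batch_size\n\n    def setup(self, stage=None):\n        # load or prepare your datasets here\n        self.train_set = TensorDataset(torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,)))\n\n    def train_dataloader(self):\n        return DataLoader(self.train_set, batch_size=self.batch_size)\n\n# the data module can then be passed directly to the trainer:\n# trainer.fit(model, datamodule=MyDataModule())\n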

            "},{"location":"s4_debugging_and_logging/boilerplate/#callbacks","title":"Callbacks","text":"

            Callbacks are one way to add functionality that, strictly speaking, is not part of your model itself. Callbacks should therefore be seen as self-contained features that can be reused between projects. You have the option to implement callbacks yourself (by inheriting from the pytorch_lightning.callbacks.Callback base class) or use one of the built-in callbacks. Of particular interest are the ModelCheckpoint and EarlyStopping callbacks:

            • The ModelCheckpoint makes sure to save checkpoints of your model. This is in principle not hard to do yourself, but the ModelCheckpoint callback offers additional functionality such as saving checkpoints only when some metric improves, or only saving the best K performing models, etc.

              model = MyModel()\ncheckpoint_callback = ModelCheckpoint(\n    dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[checkpoint_callback])\ntrainer.fit(model)\n
            • The EarlyStopping callback can help you prevent overfitting by automatically stopping the training if a certain value is not improving anymore:

              model = MyModel()\nearly_stopping_callback = EarlyStopping(\n    monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback])\ntrainer.fit(model)\n

            Multiple callbacks can be used by passing them all in a list e.g.

            trainer = Trainer(callbacks=[checkpoint_callback, early_stopping_callback])\n
            "},{"location":"s4_debugging_and_logging/boilerplate/#exercises","title":"\u2754 Exercises","text":"

            Please note that in the following exercises we will basically ask you to reformat all your MNIST code to follow the lightning standard, such that we can take advantage of all the tricks the framework has to offer. The reason we did not implement our model in lightning to begin with is that to truly understand why it is beneficial to use a high-level framework to do some of the heavy lifting, you need to have gone through some of the implementation troubles yourself.

            1. Install pytorch lightning:

              pip install pytorch-lightning # (1)!\n
              1. You may also install it as pip install lightning which includes more than just the Pytorch Lightning package. This also includes Lightning Fabric and Lightning Apps which you can read more about here and here.
            2. Convert your corrupted MNIST model into a LightningModule. You can either choose to completely override your old model or implement it in a new file. The bare minimum that you need to add while converting to get it working with the rest of lightning:

              • The training_step method. This function should contain essentially what goes into a single training step and should return the loss at the end

              • The configure_optimizers method

              Please read the documentation for more info.

            3. Make sure your data is formatted such that it can be loaded using the torch.utils.data.DataLoader object.

            4. Instantiate a Trainer object. It is recommended to take a look at the trainer arguments (there are many of them) and maybe adjust some of them:

              1. Investigate what the default_root_dir flag does

              2. By default lightning will run for 1000 epochs. This may be too much (for now). Change this by changing the appropriate flag. Additionally, there also exists a flag to set the maximum number of steps that we should train for.

              3. To start with we also want to limit the amount of training data to 20% of its original size. Which trainer flag do you need to set for this to work?

            5. Try fitting your model: trainer.fit(model)

            6. Now try adding some callbacks to your trainer.

            7. The previous module was all about logging in wandb, so the natural question is how lightning supports this. Lightning does not only support wandb, but also many other loggers. Common to all of them is that logging just needs to happen through the self.log method in your LightningModule:

              1. Add self.log to your LightningModule. It should look something like this:

                def training_step(self, batch, batch_idx):\n    data, target = batch\n    preds = self(data)\n    loss = self.criterion(preds, target)\n    acc = (target == preds.argmax(dim=-1)).float().mean()\n    self.log('train_loss', loss)\n    self.log('train_acc', acc)\n    return loss\n
              2. Add the wandb logger to your trainer

                trainer = Trainer(logger=pl.loggers.WandbLogger(project=\"dtu_mlops\"))\n

                and try to train the model. Confirm that you are seeing the scalars appearing in your wandb portal.

              3. self.log sadly only supports logging scalar tensors. Luckily, for logging other quantities we can still access the standard wandb.log through our model

                def training_step(self, batch, batch_idx):\n    ...\n    # self.logger.experiment is the same as wandb.log\n    self.logger.experiment.log({'logits': wandb.Histogram(preds)})\n

                Try doing this by logging something other than scalar tensors.

            8. Finally, we maybe also want to do some validation or testing. In lightning we just need to add the validation_step and test_step to our lightning module and supply the respective data in the form of a separate dataloader. Try to implement at least one of them.

            9. (Optional, requires GPU) One of the big advantages of using lightning is that you no longer need to deal with device placement, e.g. calling .to('cuda') everywhere. If you have a GPU, try to set the gpus flag in the trainer. If you do not have one, do not worry, we are going to return to this when we run training in the cloud.

            10. (Optional) By default Pytorch uses float32 for representing floating point numbers. However, research has shown that neural network training is very robust towards a decrease in precision. The great benefit of going from float32 to float16 is that we get approximately half the memory consumption. Try out half-precision training in Pytorch lightning. You can enable this by setting the precision flag in the Trainer.

            11. (Optional) Lightning also has built-in support for profiling. Check out how to do this using the profiler argument in the Trainer object.

            12. (Optional) Another great feature of Lightning is that it allows for easily defining command line interfaces through the Lightning CLI feature. The Lightning CLI is essentially a drop-in replacement for defining command line interfaces (covered in this module) and can also replace the need for config files (covered in this module) for securing reproducibility when working inside the Lightning framework. We highly recommend checking out the feature and trying to refactor your code such that you do not need to call trainer.fit anymore but instead control it directly from the Lightning CLI.

            13. Free exercise: Experiment with what the lightning framework is capable of. Either try out more of the trainer flags, some of the other callbacks, or maybe look into some of the other methods that can be implemented in your lightning module. Only your imagination is the limit!

            That covers everything for today. It has been a mix of topics that all should help you write \"better\" code (by some objective measure). If you want to deep dive more into the Pytorch lightning framework, we highly recommend looking at the different tutorials in the documentation that cover more advanced models and training cases. Additionally, we also want to highlight other frameworks in the lightning ecosystem:

            • Torchmetrics: collection of machine learning metrics written in Pytorch
            • lightning flash: High-level framework for fast prototyping, baselining and finetuning with an even simpler interface than lightning
            • lightning-bolts: Collection of SOTA pretrained models, model components, callbacks, losses and datasets for testing out ideas as fast as possible
            "},{"location":"s4_debugging_and_logging/debugging/","title":"M11 - Debugging","text":""},{"location":"s4_debugging_and_logging/debugging/#debugging","title":"Debugging","text":"

            Debugging is very hard to teach and is one of the skills that just comes with experience. That said, there are good and bad ways to debug a program. We are all probably familiar with just inserting print(...) statements everywhere in our code. It is easy and can often help narrow down where the problem happens. However, this is not a great way of debugging when dealing with a very large codebase. You should therefore familiarize yourself with the built-in python debugger, as it may come in handy during the course.

            To invoke the built-in python debugger you can either:

            • Set a trace directly with the python debugger by calling

              import pdb\npdb.set_trace()\n

              anywhere you want to stop the code. Then you can use different commands (see the python_debugger_cheatsheet.pdf) to step through the code.

            • If you are using an editor, then you can insert inline breakpoints (in VS code this can be done by pressing F9) and then execute the script in debug mode (inline breakpoints can often be seen as small red dots to the left of your code). The editor should then offer some interface that allows you to step through your code. Here is a guide to using the built-in debugger in VS Code.

            • Additionally, if your program is stopping on an error and you automatically want to start the debugger where it happens, then you can simply launch the program like this from the terminal

              python -m pdb -c continue my_script.py\n
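
            As a small aside, on Python 3.7 or newer you can also call the built-in breakpoint() function instead of importing pdb explicitly; by default it drops you into the same pdb session:

            def buggy_function(x):\n    y = x * 2\n    breakpoint()  # by default equivalent to: import pdb; pdb.set_trace()\n    return y + 1\n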
            "},{"location":"s4_debugging_and_logging/debugging/#exercises","title":"\u2754 Exercises","text":"

            Exercise files

            We here provide a script vae_mnist_bugs.py which contains a number of bugs that you need to fix to get it running. Start by going over the script and try to understand what is going on. Afterwards, try to get it running by solving the bugs. The following bugs exist in the script:

            • One device bug (will only show if running on gpu, but try to find it anyways)
            • One shape bug
            • One math bug
            • One training bug

            Some of the bugs prevent the script from even running, while others influence the training dynamics. Try to find them all. We also provide a working version called vae_mnist_working.py (but please try to find the bugs before looking at that script). Successfully debugging and running the script should produce three files:

            • orig_data.png containing images from the standard MNIST training set
            • reconstructions.png reconstructions from the model
            • generated_samples.png samples from the model

            Again, we cannot stress enough that the exercise is actually not about finding the bugs but using a proper debugger to find them.

            "},{"location":"s4_debugging_and_logging/logging/","title":"M13 - Logging","text":""},{"location":"s4_debugging_and_logging/logging/#logging","title":"Logging","text":"

            Core Module

            Logging in general refers to the practice of recording events and activities over time. Having proper logging in your applications can be extremely beneficial for a few reasons:

            • Debugging becomes easier because we can output information about the state of our program, variables, values etc. in a more structured way to help identify and fix bugs or unexpected behavior.

            • When we move into a more production environment, proper logging is essential for monitoring the health and performance of our application.

            • It can help in auditing, as logging info about specific activities can help keep a record of who did what and when.

            • Having proper logging means that information is saved for later analysis, which can provide insight into the behavior of our application, such as trends.

            We are in this course going to divide the kinds of logging we can do into two categories: application logging and experiment logging. In general, application logging is important regardless of the kind of application you are developing, whereas experiment logging is important for machine learning based projects where we are doing experiments.

            "},{"location":"s4_debugging_and_logging/logging/#application-logging","title":"Application logging","text":"

            The most basic form of logging in Python applications is the good old print statement:

            for batch_idx, batch in enumerate(dataloader):\n    print(f\"Processing batch {batch_idx} out of {len(dataloader)}\")\n    ...\n

            This will keep a \"record\" of the events happening in our script, in this case how far we have progressed. We could even change the print to include something like batch.shape to also have information about the current data being processed.

            Using print statements is fine for small applications, but to have proper logging we need a bit more functionality than what print can offer. Python actually comes with a great logging module that defines functions for flexible logging. It is exactly this module we are going to look at here.

            The four main components to the Python logging module are:

            1. Logger: The main entry point for using the logging system. You create instances of the Logger class to emit log messages.

            2. Handler: Defines where the log messages go. Handlers send the log messages to specific destinations, such as the console or a file.

            3. Formatter: Specifies the layout of the log messages. Formatters determine the structure of the log records, including details like timestamps and log message content.

            4. Level: Specifies the severity of a log message.
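
            As a rough sketch of how these four components fit together (the logger name and format string are just examples):

            import logging\nimport sys\n\nlogger = logging.getLogger(\"my_app\")  # 1. Logger: the entry point for emitting messages\nhandler = logging.StreamHandler(sys.stdout)  # 2. Handler: send messages to stdout\nhandler.setFormatter(logging.Formatter(\"%(asctime)s %(levelname)s %(message)s\"))  # 3. Formatter\nlogger.addHandler(handler)\nlogger.setLevel(logging.INFO)  # 4. Level: ignore everything below INFO\n\nlogger.info(\"application started\")\nlogger.debug(\"this message is filtered out by the level\")\n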

            The last point, levels, is especially important to understand. Levels essentially allow us to get rid of statements like this:

            if debug:\n    print(x.shape)\n

            where the logging is conditional on the variable debug, which we can set at runtime. Thus, it is something we can disable for users of our application (debug=False) but have enabled when we develop the application (debug=True). And it makes sense that not all things that are logged should be available to all stakeholders of a codebase. We as developers probably always want the highest level of logging, whereas users of our code need less info, and we may want to differentiate this based on the user.
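
            With logging levels the same effect can be achieved without a custom debug flag; a small sketch:

            import logging\n\nlogger = logging.getLogger(__name__)\n\ndef forward_pass(x):\n    # replaces the conditional print above; the message is only shown\n    # when the logging level is set to DEBUG, e.g. during development\n    logger.debug(\"input shape: %s\", x.shape)\n    ...\n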

            It is also important to understand the difference between logging and error handling. Error handling in Python is done using raise statements and try/except blocks like:

            def f(x: int):\n    if not isinstance(x, int):\n        raise ValueError(\"Expected an integer\")\n    return 2 * x\n\ntry:\n    f(5)\nexcept ValueError:\n    print(\"I failed to do a thing, but continuing.\")\n

            Why would we ever need to log warning, error or critical levels of information if we are just going to handle the exception? The reason is that raising exceptions is meant to change the program flow at runtime, e.g. for things we do not want the user to do but can deal with in some way. Logging is for after a program has run, to inspect what went wrong. Sometimes you need one, sometimes the other, sometimes both.
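
            The two are often combined: we handle the exception so the program can continue, but log it so we can inspect what happened later. A small sketch, reusing the f function from above:

            import logging\n\nlogger = logging.getLogger(__name__)\n\ntry:\n    result = f(\"not an int\")\nexcept ValueError:\n    # logger.exception logs at ERROR level and automatically includes the traceback\n    logger.exception(\"f failed, falling back to a default value\")\n    result = 0\n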

            "},{"location":"s4_debugging_and_logging/logging/#exercises","title":"\u2754 Exercises","text":"

            Exercises are inspired by this made with ml module on the same topic. If you need help for the exercises you can find a simple solution script here.

            1. As logging is a built-in module in Python, nothing needs to be installed. Instead start a new file called my_logger.py and start out with the following code:

              import logging\nimport sys\n\n# Create super basic logger\nlogging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\nlogger = logging.getLogger(__name__) # (1)\n\n# Logging levels (from lowest to highest priority)\nlogger.debug(\"Used for debugging your code.\")\nlogger.info(\"Informative messages from your code.\")\nlogger.warning(\"Everything works but there is something to be aware of.\")\nlogger.error(\"There's been a mistake with the process.\")\nlogger.critical(\"There is something terribly wrong and process may terminate.\")\n
              1. The built-in variable __name__ always contains the name of the module that is currently being run. Therefore, if we initialize our logger using this variable, it will always be unique to our application and not conflict with loggers set up by any third-party package.

              Try running the code. Then try changing the argument level when creating the logger. What happens when you do that?

            2. Instead of sending logs to the terminal, we may also want to send them to a file. This can be beneficial, such that only warning level logs and higher are shown to the user, while debug and info logs are still saved to a file when the application is running.

              1. Try adding the following dict to your logger.py file:

                logging_config = {\n    \"version\": 1,\n    \"formatters\": { # (1)\n        \"minimal\": {\"format\": \"%(message)s\"},\n        \"detailed\": {\n            \"format\": \"%(levelname)s %(asctime)s [%(name)s:%(filename)s:%(funcName)s:%(lineno)d]\\n%(message)s\\n\"\n        },\n    },\n    \"handlers\": { # (2)\n        \"console\": {\n            \"class\": \"logging.StreamHandler\",\n            \"stream\": sys.stdout,\n            \"formatter\": \"minimal\",\n            \"level\": logging.DEBUG,\n        },\n        \"info\": {\n            \"class\": \"logging.handlers.RotatingFileHandler\",\n            \"filename\": Path(LOGS_DIR, \"info.log\"),\n            \"maxBytes\": 10485760,  # 10 MB\n            \"backupCount\": 10,\n            \"formatter\": \"detailed\",\n            \"level\": logging.INFO,\n        },\n        \"error\": {\n            \"class\": \"logging.handlers.RotatingFileHandler\",\n            \"filename\": Path(LOGS_DIR, \"error.log\"),\n            \"maxBytes\": 10485760,  # 10 MB\n            \"backupCount\": 10,\n            \"formatter\": \"detailed\",\n            \"level\": logging.ERROR,\n        },\n    },\n    \"root\": {\n        \"handlers\": [\"console\", \"info\", \"error\"],\n        \"level\": logging.INFO,\n        \"propagate\": True,\n    },\n}\n
                1. The formatter section determines how logs should be formatted. Here we define two separate formatters, called minimal and detailed which we can use in the next part of the code.

                2. The handlers section is in charge of what should happen to the different levels of logging. console uses the minimal format we defined and sends logs to the stdout stream for messages of level DEBUG and higher. The info handler uses the detailed format and sends messages of level INFO and higher to a separate info.log file. The error handler does the same for messages of level ERROR and higher to a file called error.log.

                You will need to set the LOGS_DIR variable and also figure out how to apply this logging_config to your logger using the logging.config submodule.

              2. When the code successfully runs, check the LOGS_DIR folder and make sure that an info.log and an error.log file were created with the appropriate content.

            3. Finally, let's try to add a little bit of style and color to our logging. For this we can use rich, which is a great package for rich text and beautiful formatting in terminals. Install rich and add the following to your my_logger.py script:

              from rich.logging import RichHandler\n\nlogger.root.handlers[0] = RichHandler(markup=True)  # set rich handler\n

              and try re-running the script. Hopefully you should see something beautiful in your terminal like this:

            4. (Optional) We already briefly touched on logging during the module on config files using hydra. If you want to configure hydra to use a custom logging scheme like the one we set up in the last two exercises, you can take a look at this page. In hydra you will need to provide the configuration of the logger as a config file. You can find examples of such config files here.

            "},{"location":"s4_debugging_and_logging/logging/#experiment-logging","title":"Experiment logging","text":"

            When most people think of machine learning, they think about the training phase. Being able to track and log experiments is an important part of understanding what is going on with your model while you are training. It can help you debug your model and help tweak your models to perfection. Without proper logging of experiments, it can be really hard to iterate on the model because you do not know which changes led to an increase or decrease in performance.

            The most basic logging we can do when running experiments is writing the metrics that our model is producing e.g. the loss or the accuracy to the terminal or a file for later inspection. We can then also use tools such as matplotlib for plotting the progression of our metrics over time. This kind of workflow may be enough when doing smaller experiments or working alone on a project, but there is no way around using a proper experiment tracker and visualizer when doing large scale experiments in collaboration with others. It especially becomes important when you want to compare performance between different runs.

            There exist many tools for logging your experiments, with some of them being:

            • Tensorboard
            • Comet
            • MLFlow
            • Neptune
            • Weights and Biases

            All of the frameworks offer many of the same functionalities; you can see a (biased) review here. We are going to use Weights and Biases (wandb), as it supports everything we need in this course. Additionally, it is an excellent tool for collaboration and sharing of results.

            Using the Weights and Biases (wandb) dashboard we can quickly get an overview and compare many runs over different metrics. This allows for better iteration of models and training procedure."},{"location":"s4_debugging_and_logging/logging/#exercises_1","title":"\u2754 Exercises","text":"
            1. Start by creating an account at wandb. I recommend using your github account but feel free to choose what you want. When you are logged in you should get an API key of length 40. Copy this for later use (HINT: if you forgot to copy the API key, you can find it under settings).

            2. Next install wandb on your laptop

              pip install wandb\n
            3. Now connect to your wandb account

              wandb login\n

              you will be asked to provide the 40-character API key. The connection to the wandb server should remain open even when you close the terminal, such that you do not have to log in each time. If using wandb in a notebook you need to manually close the connection using wandb.finish().

            4. With everything set up we are now ready to incorporate wandb into our code. The interface is fairly simple, and this guide should give enough hints to get you through the exercise (HINT: the two methods you need to call are wandb.init and wandb.log). To start with, logging the training loss of your model will be enough. A minimal sketch is shown below.
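
              A minimal sketch of what this could look like (the project name and the fake loss values are just placeholders for your own code):

              import random\nimport wandb\n\nwandb.init(project=\"dtu_mlops\")  # creates a new run in your wandb workspace\nfor epoch in range(10):\n    loss = 1.0 / (epoch + 1) + random.random() * 0.1  # stand-in for your real training loss\n    wandb.log({\"train_loss\": loss, \"epoch\": epoch})\n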

            5. After running your model, check out the webpage. Hopefully you should be able to see at least one run with something logged.

            6. Now log something other than scalar values. This could be an image, a histogram or a matplotlib figure. In all cases the logging still goes through wandb.log, but you need extra calls to wandb.Image etc. depending on what you choose to log. A small sketch is shown below.
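
              For example, a small sketch that logs a randomly generated image and a histogram, just to illustrate the API:

              import numpy as np\nimport wandb\n\nwandb.init(project=\"dtu_mlops\")\nimg = np.random.randint(0, 255, (28, 28), dtype=np.uint8)  # stand-in for a real sample\nwandb.log({\"example_image\": wandb.Image(img), \"weights\": wandb.Histogram(np.random.randn(1000))})\n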

            7. Finally, let's create a report that you can share. Click the Create report button and include some of the graphs/plots/images that you have generated in the report.

            8. To make sure that you have completed today's exercises, make the report shareable by clicking the Share button and creating a view-only link. Send the link to my email nsde@dtu.dk, so I can check out your awesome work \ud83d\ude03

            9. When calling wandb.init you have two arguments called project and entity. Make sure that you understand these and try them out. It will come in handy for your group work, as they essentially allow multiple users to upload their own runs to the same project in wandb.

            10. Wandb also comes with a built-in feature for doing hyperparameter sweeps, which can be beneficial for getting a better working model. Look through the documentation on how to do a hyperparameter sweep in Wandb. You at least need to create a new file called sweep.yaml and make sure that you call wandb.log in your code on an appropriate value. Note: if you want hydra and wandb to work together you will need to change the command config in your sweep.yaml file, see this page.

            11. In the future it will be important for us to be able to run Wandb inside a docker container (together with whatever training or inference we specify). The problem here is that we cannot authenticate Wandb in the same way as in the previous exercise; it needs to happen automatically. Let's therefore look into how we can do that.

              1. First we need to generate an authentication key, or more precisely an API key. This is in general the way any service (like a docker container) can authenticate. Start by going to https://wandb.ai/home, click your profile icon in the upper right corner and then go to settings. Scroll down to the danger zone, generate a new API key and finally copy it.

              2. Next create a new docker file called wandb.docker and add the following code

                FROM python:3.9\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\nRUN pip install wandb\nCOPY s4_debugging_and_logging/exercise_files/wandb_tester.py wandb_tester.py\nENTRYPOINT [\"python\", \"-u\", \"wandb_tester.py\"]\n

                please take a look at the script being copied into the image and afterwards build the docker image.

              3. When we want to run the image, we need to include an environment variable that contains the API key we generated. This will then authenticate the docker container with the wandb server:

                docker run -e WANDB_API_KEY=<your-api-key> wandb:latest\n

                Try running it and confirm that the results are uploaded to the wandb server.

            12. Feel free to experiment more with wandb as it is a great tool for logging, organizing and sharing experiments.

            That is the module on logging. Please note that at this point in the course you will begin to see some overlap between the different frameworks. While we mainly used hydra for configuring our python scripts, it can also be used to save metrics and hyperparameters, similar to how wandb can. A similar argument holds for dvc, which can also be used to log metrics. In our opinion wandb just offers a better experience when interacting with the results after logging. We want to stress that the combination of tools presented in this course may not be the best for all your future projects, and we recommend finding a setup that fits you. That said, each framework provides specific features that the others do not.

            Finally, we want to note that while we really try to showcase a lot of open source frameworks during the course, Wandb is not one of them. It is free for personal use (with a few restrictions), but for enterprise it does require a license. If you are eager to only work with open-source tools we highly recommend trying out MLFlow, which offers the same overall functionality as Wandb.

            "},{"location":"s4_debugging_and_logging/profiling/","title":"M12 - Profiling","text":""},{"location":"s4_debugging_and_logging/profiling/#profilers","title":"Profilers","text":"

            Core Module

            "},{"location":"s4_debugging_and_logging/profiling/#profilers_1","title":"Profilers","text":"

            In general profiling code is about improving the performance of your code. In this session we are going to take a somewhat narrow approach to what \"performance\" is: runtime, meaning the time it takes to execute your program.

            At the bare minimum, the two questions a proper profiling of your program should be able to answer are:

            • \u201cHow many times is each method in my code called?\u201d
            • \u201cHow long does each of these methods take?\u201d

            The first question is important for prioritizing optimization. If two methods A and B have approximately the same runtime, but A is called 1000 times more often than B, we should probably spend our time optimizing A over B if we want to speed up our code. The second question speaks for itself: it directly tells us which methods are expensive to call.

            Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, the first one being cProfile. cProfile is Python's built-in profiler and can give you an overview of the runtime of all the functions and methods involved in your program.
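
            Besides the command line interface used in the exercises below, cProfile can also be invoked directly from Python; a small sketch:

            import cProfile\nimport pstats\n\ndef main():\n    # stand-in for the code you actually want to profile\n    return sum(i * i for i in range(100_000))\n\ncProfile.run(\"main()\", \"profile_output.prof\")  # save raw statistics to a file\nstats = pstats.Stats(\"profile_output.prof\")\nstats.sort_stats(\"cumulative\").print_stats(10)  # print the 10 most expensive calls\n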

            "},{"location":"s4_debugging_and_logging/profiling/#exercises","title":"\u2754 Exercises","text":"
            1. Run cProfile on the vae_mnist_working.py script. Hint: you can directly call the profiler on a script using the -m arg

              python -m cProfile -o <output_file> -s <sort_order> myscript.py\n
            2. Try looking at the output of the profiling. Can you figure out which function took the longest to run?

            3. Can you explain the difference between tottime and cumtime? Under what circumstances do these differ, and when are they equal?

            4. To get a better feeling of the profiled result we can try to visualize it. Python does not provide a native solution, but open-source solutions such as snakeviz exist. Try installing snakeviz and load a profiled run into it (HINT: snakeviz expects the run to have the file format .prof).

            5. Try optimizing the run! (Hint: The data is not stored as torch tensor). After optimizing the code make sure (using cProfile and snakeviz) that the code actually runs faster.

            "},{"location":"s4_debugging_and_logging/profiling/#pytorch-profiling","title":"Pytorch profiling","text":"

            Profiling machine learning code can become much more complex because we are suddenly beginning to mix different devices (CPU+GPU) that can (and should) overlap some of their computations. When profiling this kind of machine learning code we are often looking for bottlenecks. A bottleneck is simply the place in your code that is preventing other processes from performing their best. This is the reason that all major deep learning frameworks also include their own profilers that can help profile more complex applications.

            The image below shows a typical report using the built-in profiler in PyTorch. As the image shows, the profiler looks both at the kernel time (the time spent doing actual computations) and at transfer times such as memcpy (where we are copying data between devices). It can even analyze your code and give recommendations.

            Using the profiler can be as simple as wrapping the code that you want to profile with the torch.profiler.profile context manager

            with torch.profiler.profile(...) as prof:\n    # code that I want to profile\n    output = model(data)\n
            "},{"location":"s4_debugging_and_logging/profiling/#exercises_1","title":"\u2754 Exercises","text":"

            Exercise files

            In these exercises we investigate the profiler that is built into PyTorch. Note that these exercises require that you have PyTorch v1.8.1 (or higher) installed. You can always check which version you currently have installed by writing (in a python interpreter):

            import torch\nprint(torch.__version__)\n

            That said, we always recommend updating to the latest Pytorch version for the best experience. Additionally, to display the results nicely (like snakeviz for cProfile) we are also going to use the tensorboard profiler extension

            pip install torch_tb_profiler\n
            1. A good starting point is to look at the API for the profiler. The important class to look at here is the torch.profiler.profile class.

            2. Let's try out a simple example (taken from here):

              1. Try to run the following code

                import torch\nimport torchvision.models as models\nfrom torch.profiler import profile, ProfilerActivity\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\nwith profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:\n    model(inputs)\n

                this will profile the forward pass of a Resnet 18 model.

              2. Running this code will produce a prof object that contains all the relevant information about the profiling. Try writing the following code:

                print(prof.key_averages().table(sort_by=\"cpu_time_total\", row_limit=10))\n

                Which operation is taking up most of the CPU time?

              3. Try running

                print(prof.key_averages(group_by_input_shape=True).table(sort_by=\"cpu_time_total\", row_limit=30))\n

                can you see any correlation between the shape of the input and the cost of the operation?

              4. (Optional) If you have a GPU you can also profile the operations on that device:

                with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:\n    model(inputs)\n
              5. (Optional) As an alternative to using profile as a context manager we can also use its .start and .stop methods:

                prof = profile(...)\nprof.start()\n...  # code I want to profile\nprof.stop()\n

                Try doing this on the above example.

            3. The torch.profiler.profile function takes some additional arguments. What argument would you need to set to also profile the memory usage? (Hint: this page) Try doing it on the simple example above and make sure to sort the output by self_cpu_memory_usage.

            4. As mentioned we can also get a graphical output for better inspection. After having done a profiling try to export the results with:

              prof.export_chrome_trace(\"trace.json\")\n

              you should be able to visualize the file by going to chrome://tracing in any chromium based web browser. Can you still identify the information printed in the previous exercises from the visualizations?

            5. Running profiling on a single forward step can produce misleading results as it only provides a single sample that may depend on what background processes are running on your computer. Therefore it is recommended to profile multiple iterations of your model. In that case we need to include prof.step() to tell the profiler when we start a new iteration

              with profile(...) as prof:\n    for i in range(10):\n        model(inputs)\n        prof.step()\n

              Try doing this. Is the conclusion the same regarding which operations take up most of the time? Have the percentages changed significantly?

            6. Additionally, we can also visualize the profiling results using the profiling viewer in tensorboard.

              1. Start by initializing the profile class with an additional argument:

                from torch.profiler import profile, tensorboard_trace_handler\nwith profile(..., on_trace_ready=tensorboard_trace_handler(\"./log/resnet18\")) as prof:\n    ...\n

                Try running a profiling (using a couple of iterations) and make sure that a file with the .pt.trace.json extension is produced in the log/resnet18 folder.

              2. Now try launching tensorboard

                tensorboard --logdir=./log\n

                and open the page http://localhost:6006/#pytorch_profiler, where you should hopefully see an image similar to the one below:

                Image credit

                Try poking around in the interface.

              3. Tensorboard has a nice feature for comparing runs under the diff tab. Try redoing a profiling run but use model = models.resnet34() instead. Load up both runs and try to look at the diff between them.

            7. As a final exercise, try to use the profiler on the vae_mnist_working.py file from the previous module on debugging, where you profile a whole training run (not only the forward pass). What is the bottleneck during the training? Is it still the forward pass or is it something else? Can you improve the code somehow based on the information from the profiler?

            This ends the module on profiling. If you want to go into more detail on this topic we can recommend looking into line_profiler and kernprof. A downside of using python's cProfile is that it can only profile at a function/module level, which is great for identifying hotspots in your code. However, sometimes the cause of a computational hotspot is a single line of code in a function, which will not be caught by cProfile. An example would be a simple indexing operation such as a[idx] = b, which for large arrays and non-sequential indexes is really expensive. For these cases line_profiler and kernprof are excellent tools to have in your toolbox. Additionally, if you do not like cProfile we can also recommend py-spy, which is another open-source profiling tool for python programs.

            "},{"location":"s5_continuous_integration/","title":"Continuous Integration","text":"

            Slides

            Continuous integration is a sub-discipline of the general field of Continuous X. Continuous X is one of the core elements of modern DevOps, and by extension MLOps. Continuous X assumes that we have a (long) developer pipeline (see image below) where we want to make some changes to our code e.g:

            • Update our training data or data processing
            • Update our model architecture
            • Something else...

            Basically, any code change we make is expected to have an influence on the final result. The challenge with making changes at the start of our pipeline is that we want the change to propagate all the way through to the end of the pipeline.

            Image credit

            This is where continuous X comes into play. The word continuous here refers to the fact that the pipeline should continuously be updated as we make code changes. You can also choose to think of this as the automatization of processes. The X then covers the fact that the process we need to go through to automate steps in the pipeline depends on where in the pipeline we are, e.g. the tools needed to do continuous integration are different from the tools needed to do continuous delivery.

            In this session, we are going to focus on continuous integration (CI). As indicated in the image above, CI usually takes care of the first part of the developer pipeline, which has to do with the code base, code building and code testing. This is a paramount step in automatization, as we would rather catch bugs at the beginning of our pipeline than at the end.

            Learning objectives

            The learning objectives of this session are:

            • Being able to write unit tests that cover both data and models in your ML pipeline
            • Know how to implement CI using Github actions such that tests are automatically executed on code changes
            • Can use pre-commit to ensure that code that is not up to standard does not get committed
            • Know how to implement CI for continuous building of containers
            • Basic knowledge of how machine learning processes can be implemented in a continuous way
            "},{"location":"s5_continuous_integration/auto_docker/","title":"M18 - Continuous Containers","text":""},{"location":"s5_continuous_integration/auto_docker/#continuous-docker-building","title":"Continuous docker building","text":"

            The Github Actions we learned about in M16 are a powerful tool that can be used for much more than simply running the tests that we write for our application. In this module we are going to look at how we can use them for continuously building docker images. As you have already seen, building a docker image can take a couple of minutes each time we make changes to our code base. For this reason we ideally only want to build a new image every time we commit our code. Thus, it should come as no surprise that we can automate the building process, and furthermore we can take advantage of online compute power to parallelize the process.

            As discussed in the initial module on docker, docker hub is an online solution for storing built docker images in the cloud, which are then easy to pull down onto whatever machine you want to run them on. Docker hub is free for personal use, as long as the images you push are public. In this session we are going to look at how we can automatically build and push our docker images to docker hub. In a future module we are going to look at the exact same process of building and pushing containers, but this time to a general cloud provider.

            "},{"location":"s5_continuous_integration/auto_docker/#exercises","title":"\u2754 Exercises","text":"

            For these exercises you can work with any docker file of your choosing. If you want an easy docker file, you can use the following:

            FROM busybox\nCMD echo \"Howdy cowboy\"\n

            Alternatively, you can choose to focus on automatizing the training and prediction docker files from M9. You will most likely need to change the docker images for your applications if they contain any references to your data, e.g. if you have a COPY data/ data/ statement in the file. Since we do not store our data in Github, we cannot copy it during the build process.

            1. Start by pushing whatever docker file you want to have continuously built to your repository

            2. Start by creating a Docker Hub account

            3. Next, within Docker Hub create an access token by going to Settings -> Security. Click the New Access Token button and give it a name that you recognize.

            4. Copy the newly created access token and head over to your Github repository online. Go to Settings -> Secrets -> Actions and click the New repository secret button. Copy over the access token and give it the name DOCKER_HUB_TOKEN. Additionally, add two other secrets DOCKER_HUB_USERNAME and DOCKER_HUB_REPOSITORY that contain your docker username and docker repository name respectively.

            5. Next we are going to construct the actual Github actions workflow file:

              name: Docker Image CI\n\non:\n    push:\n        branches: [ master ]\n\njobs:\n    build:\n        runs-on: ubuntu-latest\n        steps:\n        - uses: actions/checkout@v2\n        - name: Build the Docker image\n          run: |\n            echo \"${{ secrets.DOCKER_HUB_TOKEN }}\" | docker login \\\n                -u \"${{ secrets.DOCKER_HUB_USERNAME }}\" --password-stdin docker.io\n            docker build . --file Dockerfile \\\n                --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n            docker push docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n

              The first part of the workflow file should look somewhat recognizable. However, the last three lines are where all the magic happens. Carefully go through them and figure out what they do. If you want some help you can look at the help pages for docker login, docker build and docker push.

            6. Upload the workflow to your github repository and check that it is being executed. If everything works you should be able to see the built docker image in your container repository on docker hub.

            7. Make sure that you can execute docker pull locally to pull down the image that you just continuously built

            8. (Optional) To test that the container works directly in github you can also try to include an additional step that actually runs the container.

                  - name: Run container\n      run: |\n        docker run ...\n

            That ends the session on continuous docker building. We are going to revisit this topic after introducing the basic concepts of working in the cloud, as it will make our life easier in the long run, when we get to continuous deployment (CD), that our containers are stored in the same place where we are going to run them. For completeness it is worth mentioning that docker hub also offers the possibility of building your images in a continuous way, by specifying so-called build rules.

            "},{"location":"s5_continuous_integration/cml/","title":"M19 - Continuous Machine Learning","text":""},{"location":"s5_continuous_integration/cml/#continuous-machine-learning","title":"Continuous Machine Learning","text":"

            The continuous integration we have looked at until now is what we can consider \"classical\" continuous integration, which has its roots in DevOps and not MLOps. While the tests that we have written and the containers we have developed in the previous sessions have been centered around machine learning, everything we have done translates completely to how it would be done if we had developed any other application that did not include machine learning.

            In this session, we are going to change gears and look at continuous machine learning (CML). As the name may suggest, we are now focusing on automatizing actual machine learning processes. You may ask why we need continuous integration principles baked into machine learning pipelines? The reason is the same as with any continuous integration, namely that we have a bunch of checks that we want our newly trained model to pass before we trust it. Writing unittests ensures that our code is not broken, but there are other failure modes of a machine learning pipeline that should be checked before the model is ready for deployment:

            • Did I train on the correct data?
            • Did my model converge at all?
            • Did it reach a certain performance threshold?

            Answering these questions in a continuous way is possible through continuous machine learning. For this session, we are going to use cml by iterative.ai. Strictly speaking, using the cml framework is not a necessary component for doing continuous machine learning, but it is a streamlined way of doing it and offers tools to easily get a report about how a specific run performed. If we were just interested in triggering model training every time we do a git push, we essentially just need to include

            run: python train.py\n

            to any of our workflow files.

            The figure below describes the overall process using the cml framework. It should be clear that it is the very same process that we go through in the other continuous integration sessions: push code -> trigger github actions -> do stuff. The new part in this session is that we want a report of the findings of the automated run to appear after the run is done.

            Image credit"},{"location":"s5_continuous_integration/cml/#exercises","title":"\u2754 Exercises","text":"
            1. We are first going to revisit our train.py script. If we want cml to automatically be able to report the performance of our trained model to us after it is trained, we need to give it some statistics to work with. Below is some pseudo-code that computes the classification report and the confusion matrix of our trained model. Create a copy of your training script (call it train_cml.py) and make sure your script is also producing a classification report and confusion matrix as in the pseudo-code.

              # assume we have a trained model\nimport matplotlib.pyplot as plt\nimport torch\nfrom sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay\npreds, target = [], []\nfor batch in train_dataloader:\n    x, y = batch\n    probs = model(x)\n    preds.append(probs.argmax(dim=-1))\n    target.append(y.detach())\n\ntarget = torch.cat(target, dim=0)\npreds = torch.cat(preds, dim=0)\n\nreport = classification_report(target, preds)\nwith open(\"classification_report.txt\", 'w') as outfile:\n    outfile.write(report)\nconfmat = confusion_matrix(target, preds)\ndisp = ConfusionMatrixDisplay(confusion_matrix=confmat)\ndisp.plot()\nplt.savefig('confusion_matrix.png')\n
            2. Similar to what we have looked at until now, automation happens using github workflow files. The main difference from the continuous integration we have looked at so far is that we are actually going to train our model whenever we do a git push. Copy the following code into a new workflow (called cml.yaml) and add that file to the folder where you keep your workflow files.

              name: train-my-model\non: [push]\njobs:\n  run:\n    runs-on: [ubuntu-latest]\n    steps:\n      - uses: actions/checkout@v2\n      - uses: iterative/setup-cml@v1\n      - name: Train model\n        run: |\n          pip install -r requirements.txt  # install dependencies\n          python train.py  # run training\n      - name: Write report\n        env:\n          # this authenticates that the right permissions are in place\n          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n        run: |\n          # send all information to report.md that will be reported to us when the workflow finish\n          cat classification_report.txt >> report.md\n          cml-publish confusion_matrix.png --md >> report.md\n          cml-send-comment report.md\n

              Nearly everything in the workflow file should look familiar, except the last two lines.

            3. Try pushing the workflow file to your github repository and make sure that it completes. If it does not, you may need to adjust the workflow file slightly.

            4. Send yourself a pull-request. I recommend seeing this very short video on how to send yourself a pull-request with a small change. If your workflow file is executed correctly you should see github-actions commenting with a performance report on your PR.

            5. (Optional) cml is offered by the same people behind dvc and it should therefore come as no surprise that these features can interact with each other. If you want to deep dive into this, here is a great starting point.

            This ends the session on continuous machine learning. If you have not already noticed, one limitation of using github actions is that their default runners, e.g. runs-on: [ubuntu-latest], are only CPU machines (see hardware config). As we all know, modern machine learning more or less requires hardware acceleration (=GPUs) to train within reasonable time. Luckily for us, cml also integrates with large cloud providers, and I therefore recommend that after going through the modules on cloud computing you return to this exercise and experiment with setting up self-hosted runners.

            "},{"location":"s5_continuous_integration/github_actions/","title":"M16 - Github Actions","text":""},{"location":"s5_continuous_integration/github_actions/#github-actions","title":"Github actions","text":"

            Core Module

            With the tests established in the previous module we are now ready to move on to actually implementing some continuous integration in our pipeline. As you probably have already realized, testing your code locally can be cumbersome, because

            • You need to run it often to make sure to catch bugs early on
            • If you want to have high code coverage of your code base, you will need many tests that take a long time to run

            For these reasons we want to automate the testing, such that it is done every time we push to our repository. If we combine this with only pushing to branches and then only merging these branches whenever all automated tests have passed, our code should be fairly safe against unwanted bugs (assuming your tests cover your code well).

            "},{"location":"s5_continuous_integration/github_actions/#github-actions_1","title":"Github actions","text":"

            Github actions are the CI solution that Github provides. Each of your repositories gets 2,000 minutes of free testing per month which should be more than enough for the scope of this course (and probably all personal projects you do). Getting Github actions setup in a repository may seem complicated at first, but workflow files that you create for one repository can more or less be reused for any other repository that you have.

            Lets take a look at how a github workflow file is organized:

            • Initially we start by giving the workflow a name
            • Next we specify on what events the workflow should be triggered. This includes both the action (pull request, push etc) and on what branches it should activate
            • Next we list the jobs that we want to do. Jobs are by default executed in parallel but can also be dependent on each other
            • In the runs-on section we can specify which operating system we want the workflow to run on. We can also specify multiple.
            • Finally we have the steps. This is where we specify the actual commands that should be run when the workflow is executed.
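
             Putting these pieces together, a minimal workflow could look like the sketch below (the name, trigger and commands are placeholders, not part of the course material):

             name: example-workflow  # name of the workflow\non:  # events that trigger the workflow\n  push:\n    branches: [main]\njobs:\n  example-job:  # jobs run in parallel by default\n    runs-on: ubuntu-latest  # operating system to run on\n    steps:  # the actual commands to execute\n      - uses: actions/checkout@v2\n      - name: Say hello\n        run: echo \"Hello from Github actions\"\n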

            Image credit"},{"location":"s5_continuous_integration/github_actions/#exercises","title":"\u2754 Exercises","text":"
            1. Start by creating a .github folder in the root of your repository. Add a sub-folder to that called workflows.

            2. Go over this page that explains how to do automated testing of python code in github actions. You do not have to understand everything, but at least get a feeling of what a workflow file should look like.

            3. We have provided a workflow file called tests.yml that should run your tests for you. Place this file in the .github/workflows/ folder. The workflow file consists of three steps

              • First a python environment is set up (in this case python 3.8)

              • Next all dependencies required to run the tests are installed

              • Finally, pytest is called and the tests will be run

            4. For the script to work you need to define the requirements.txt and requirements_tests.txt. The first file should contain all packages required to run your code. The second file is all additional packages required to run the tests. In your simple case it may very well be that the second file is empty, however sometimes additional packages are used for testing that are not strictly required for the scripts to run.

            5. Finally, try pushing the changes to your repository. Hopefully your tests should just start, and after some time you will see a green check mark next to the hash of the commit. Also try to check out the Actions tab where you can see the history of actions run.

            6. Normally we develop code on one operating system and just hope that it will work on other operating systems. However, CI enables us to automatically test on systems other than the one we develop on.

              1. The provided tests.yml only runs on one operating system. Which one?

              2. Alter the file (or write a new one) so that it executes the tests on the two other main operating systems.
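
                One way to do this (a sketch, not the only solution; the python version and requirement files are assumed to match your tests.yml) is to use a strategy matrix over operating systems:

                jobs:\n  test:\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [ubuntu-latest, windows-latest, macos-latest]  # run the same job on all three systems\n    steps:\n      - uses: actions/checkout@v2\n      - uses: actions/setup-python@v4\n        with:\n          python-version: \"3.8\"\n      - name: Install dependencies and run tests\n        run: |\n          pip install -r requirements.txt -r requirements_tests.txt\n          pytest tests/\n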

            7. As the workflow is currently setup, github actions will destroy every downloaded package when the workflow has been executed. To improve this we can take advantage of caching:

              1. Figure out how to implement caching in your workflow file. You can find a guide here and here.
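
                As a starting point (a sketch for a Linux runner that caches pip's download cache; the key layout is just one common choice), a caching step could look like:

                - uses: actions/cache@v3\n  with:\n    path: ~/.cache/pip  # where pip stores downloaded packages on Linux\n    key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}\n    restore-keys: |\n      ${{ runner.os }}-pip-\n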

              2. When you have implemented a caching system go to Actions->Caches in your repository and make sure that they are correctly added. It should look something like the image below

              3. Measure how long your workflow takes before and after adding caching to your workflow. Did it improve the runtime of your workflow?

            8. (Optional) Code coverage can also be added to the workflow file by uploading it as an artifact after running the coverage. Follow the instructions in this post on how to do it.

            9. As stated in the introduction, ideally we want to only push our code to branches, such that our workflows run before we actually merge code into our codebase. We can directly prevent bad behavior by adding branch protection rules to our repository. Take the image below as an example from one of my own PRs:

              In this example, the PR cannot be merged into the main branch before the following are fulfilled: at least 2 reviewers with write access have approved the PR, all Github actions marked as Required are passing and all conversations need to be resolved. Since not all important tests are passing, further changes are necessary. We want to implement something similar. Do the following:

              1. On your Github repository of choice, go to Settings -> Branches -> Add branch protection rule:

              2. To your main/master branch add the following rules:

                • At least one person needs to approve any PR
                • All your workflows have to pass
                • All conversations need to be resolved
              3. To test that everything works, try creating a PR (possibly with a small bug) and see that your main/master branch is protected

            10. One problem you may have encountered is that tests that depend on your data cannot run in CI, the core problem being that your data is not stored in github (assuming you have done module M8 - DVC). However, it is possible for us to download the data while running our CI. Lets try to set that up:

              1. The first problem is that our CI needs to be able to authenticate with our storage solution. We can take advantage of an authentication file that is created the first time we push with DVC. It is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json where $CACHE_HOME depends on your operating system:

                • macOS: ~/Library/Caches

                • Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)

                • Windows: {user}/AppData/Local

                Find the file. The content should look similar to this (only some fields are shown):

                {\n    \"access_token\": ...,\n    \"client_id\": ...,\n    \"client_secret\": ...,\n    \"refresh_token\": ...,\n    ...\n}\n
              2. The content of that file should be treated as a password and not shared with the world, and the relevant question is therefore how to use this info in a public repository. The answer is github secrets, where we can store information and access it in our workflow files while it is still not public. Navigate to the secrets option (as shown below) and create a secret with the name GDRIVE_CREDENTIALS_DATA that contains the content of the file you found in the previous exercise.

              3. Afterwards, add the following code to your workflow file:

                - uses: iterative/setup-dvc@v1\n- name: Get data\n  run: dvc pull\n  env:\n    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n

                that runs dvc pull using the secret authentication file. For help you can visit this small repository that implements the same workflow.

              4. Finally, add the changes, commit, push and confirm that everything works as expected. You should now be able to run unit tests that depend on your input data.

            11. In module M6 on good coding practices (optional module) of the course you were introduced to a couple of good coding practices, such as being consistent with your coding style, how your imports are sorted and that your code follows certain standards. All this was done using the ruff framework. In this set of exercises we will set up github workflows that will automatically test for this.

              1. Create a new workflow file called codecheck.yml, that implements the following three steps

                • Setup python environment

                • Installs ruff

                • Runs ruff check and ruff format on the repository

                (HINT: You should be able to just change the last steps of the tests.yml workflow file)

              2. In addition to ruff we also used mypy in that set of exercises for checking if the typing we added to our code was good enough. Add another step to the codecheck.yml file which runs mypy on your repository.
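
                A sketch of what the final steps of such a codecheck.yml workflow could look like (the src/ folder is an assumption, adjust it to your own layout):

                - name: Install linting tools\n  run: pip install ruff mypy\n- name: Ruff check\n  run: ruff check .\n- name: Ruff format\n  run: ruff format --check .\n- name: Type check\n  run: mypy src/\n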

              3. Try to make sure that all steps are passing on your repository. Especially mypy can be hard to get passing, so this exercise formally only requires you to get ruff passing.

            "},{"location":"s5_continuous_integration/github_actions/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
            1. When working with Github actions you will often encounter the following 4 concepts:

              • Workflow
              • Runner
              • Job
              • Action

              Try to define them with your own words.

              Solution
              • Workflow: A yaml file that defines the instructions to execute on specific events. Needs to be placed in the .github/workflows folder.
              • Runner: Workflows need to run somewhere. The environment that the workflow is being executed on is called the runner. Most commonly the runner is hosted by Github but it can also be hosted by yourself.
              • Job: A series of steps which are executed on the same runner. A workflow must include at least one job, but often contains many.
              • Action: An action is the smallest unit in a workflow. Jobs often consist of multiple actions that are executed sequentially.
            2. The on attribute specifies upon which events the workflow will be triggered. Assume you have set the on attribute to the following:

              on:\n    push:\n      branches: [main]\n    pull_request:\n      branches: [main]\n    schedule:\n      - cron: \"0 0 * * *\"\n    workflow_dispatch: {}\n

              What 4 events would trigger the execution of that action?

              Solution
              1. Direct push to branch main would trigger it
              2. Any pull request opened that will merge into main would trigger it
              3. Every day at midnight (00:00), the cron schedule would trigger the workflow
              4. The workflow can be triggered manually through the Github UI (the workflow_dispatch event), example shown below

            This ends the module on Github workflows. If you are more interested in this topic you can check out module M31 on documentation, which first includes locally building some documentation for your project and afterwards using Github actions to deploy it to Github Pages. Additionally, Github also has a lot of templates for running many common CI tasks. If you try to create a workflow file directly in Github you may encounter the following page

            We highly recommend checking this out if you want to write any other kind of CI pipeline in Github actions. We can also recommend this repository that has a list of awesome actions, and check out the act repository which is a tool for running your GitHub Actions locally!

            "},{"location":"s5_continuous_integration/pre_commit/","title":"M17 - Pre commit","text":""},{"location":"s5_continuous_integration/pre_commit/#pre-commit","title":"Pre-commit","text":"

            One of the cornerstones of working with git is remembering to commit your work often. Committing often makes it easier to identify and revert unwanted changes that you have introduced, because the code changes per commit become smaller.

            However, as you have hopefully already seen in the course, there are a lot of mental tasks to do before you actually write git commit in the terminal. The most basic is of course making sure that you have saved all your changes and are not committing an out-of-date file. However, this also includes tasks such as styling, formatting, making sure all tests succeed etc. All these mental to-do notes do not mix well with the principle of committing often, because you in principle have to go through them every time.

            The obvious solution to this problem is to automate all or some of these mental tasks every time that we do a commit. This is where pre-commit hooks come into play, as they can help us attach additional tasks that should be run every time that we do a git commit.

            "},{"location":"s5_continuous_integration/pre_commit/#configuration","title":"Configuration","text":"

            Pre-commit works by inserting whatever checks we want to automate in between us running git commit and the commit actually being created, before we would afterwards do a git push.

            Image credit

            The system works by looking for a file called .pre-commit-config.yaml that we can configure. If we execute

            pre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8\n

            you should get a sample file that looks like

            # See https://pre-commit.com for more information\n# See https://pre-commit.com/hooks.html for more hooks\nrepos:\n-   repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v3.2.0\n    hooks:\n    -   id: trailing-whitespace\n    -   id: end-of-file-fixer\n    -   id: check-yaml\n    -   id: check-added-large-files\n

            the file structure is very simple:

            • It starts by listing the repositories where we want to get our pre-commits from, in this case https://github.com/pre-commit/pre-commit-hooks. This repository contains a large collection of pre-commit hooks.
            • Next we need to define which pre-commit hooks we want to use, by specifying the id of the different hooks. The id corresponds to an id in this file: https://github.com/pre-commit/pre-commit-hooks/blob/master/.pre-commit-hooks.yaml

            When we are done defining our .pre-commit-config.yaml we just need to install it

            pre-commit install\n

            this will make sure that the file is automatically executed whenever we run git commit

            "},{"location":"s5_continuous_integration/pre_commit/#exercises","title":"\u2754 Exercises","text":"
            1. Install pre-commit

              pip install pre-commit\n
            2. Next create the sample file

              pre-commit sample-config > .pre-commit-config.yaml\n
            3. The sample file already contains 4 hooks. Make sure you understand what each does and if you need them at all.

            4. pre-commit works by hooking into the git commit command, running whenever that command is run. For this to work, we need to install the hooks into git commit. Run

              pre-commit install\n

              to do this.

            5. Try to commit your recently created .pre-commit-config.yaml file. Pre-commit will likely not do much, because it only checks files that are being committed. Instead try to run

              pre-commit run --all-files\n

              that will check every file in your repository.

            6. Try adding at least another check from the base repository to your .pre-commit-config.yaml file.

            7. If you have completed the optional module M7 on good coding practice you will have learned about the linter ruff. ruff comes with its own pre-commit hook. Try adding that to your .pre-commit-config.yaml file and see what happens when you try to commit files.
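
              A sketch of what such an entry could look like (the rev is a placeholder, pin it to whatever recent release of ruff-pre-commit you prefer):

              repos:\n-   repo: https://github.com/astral-sh/ruff-pre-commit\n    rev: v0.1.3  # placeholder, use a recent release\n    hooks:\n    -   id: ruff\n    -   id: ruff-format\n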

            8. (Optional) Add more hooks to your .pre-commit-config.yaml.

            9. Sometimes you are in a hurry, so make sure that you also can do commits without running pre-commit e.g.

              git commit -m <message> --no-verify\n
            10. Finally, figure out how to disable pre-commit again (if you get tired of it).

            That was all about how pre-commit can be used to automate tasks. If you want to deep dive more into the topic you can checkout this page on how to define your own pre-commit hooks.

            "},{"location":"s5_continuous_integration/unittesting/","title":"M15 - Unittesting","text":""},{"location":"s5_continuous_integration/unittesting/#unit-testing","title":"Unit testing","text":"

            Core Module

            What often comes to mind for many developers when discussing continuous integration (CI) is code testing. CI should ensure that whenever a codebase is updated it is automatically tested, such that if bugs have been introduced in the codebase they will be caught early on. If you look at the MLOps cycle, CI is one of the cornerstones of the operations part. However, it should be noted that applying CI does not magically ensure that your code does not break. CI is only as strong as the tests that are automatically executed. CI simply structures and automates this.

            Quote

            Continuous Integration doesn\u2019t get rid of bugs, but it does make them dramatically easier to find and remove. Martin Fowler, Chief Scientist, ThoughtWorks

            Image credit

            The kind of tests we are going to look at are called unit tests. Unit testing refers to the practice of writing tests that test individual parts of your code base for correctness. By a unit, you can therefore think of a function, a module or in general any object. By writing tests in this way it should be very easy to isolate which part of the code broke after an update to the code base. Another way to test your code base would be through integration testing, which is equally important but we are not going to focus on it in this course.

            Unit tests (and integration tests) are not a concept unique to MLOps but a core concept of DevOps. However, it is important to note that testing machine learning-based systems is much more difficult than testing traditional systems. The reason for this is that machine learning systems depend on data, which influences the state of our system. For this reason, we not only need unit tests and integration tests of our code, we also need data testing, infrastructure testing and more monitoring to check that we stay within the data distribution we are training on (more on this in module M25 on data drifting). This added complexity is illustrated in the figure below.

            "},{"location":"s5_continuous_integration/unittesting/#pytest","title":"Pytest","text":"

            Before we can begin to automate testing of our code base we of course need to write the tests first. It is both a hard and tedious task to do but arguably the most important aspect of CI. Python offers a couple of different libraries for writing tests. We are going to use pytest.

            "},{"location":"s5_continuous_integration/unittesting/#exercises","title":"\u2754 Exercises","text":"

            The following exercises should be applied to your MNIST repository

            1. The first part of doing CI is writing the unit tests. We do not expect you to cover every part of the code you have developed but try to at least write tests that cover two files. Start by creating a tests folder.

            2. Read the getting started guide for pytest which is the testing framework that we are going to use

            3. Install pytest:

              pip install pytest\n
            4. Write some tests. Below are some guidelines on some tests that should be implemented, but you are of course free to implement more tests. You can at any point check if your tests are passing by typing in a terminal

              pytest tests/\n

              When you implement a test you need to follow two standards, for pytest to be able to find your tests. First any files created (except __init__.py) should always start with test_*.py. Secondly, any test implemented needs to be wrapped into its own function that again needs to start with test_:

              # this will be found and executed by pytest\ndef test_something():\n    ...\n\n# this will not be found and executed by pytest\ndef something_to_test():\n    ...\n
              1. Start by creating a tests/__init__.py file and fill in the following:

                import os\n_TEST_ROOT = os.path.dirname(__file__)  # root of test folder\n_PROJECT_ROOT = os.path.dirname(_TEST_ROOT)  # root of project\n_PATH_DATA = os.path.join(_PROJECT_ROOT, \"Data\")  # root of data\n

                these can help you refer to your data files during testing. For example, in another test file, I could write

                from tests import _PATH_DATA\n

                which then contains the root path to my data.

              2. Data testing: In a file called tests/test_data.py implement at least a test that checks that data gets correctly loaded. By this, we mean that you should check

                def test_data():\n    dataset = MNIST(...)\n    assert len(dataset) == N_train  # N_train for the training set, N_test for the test set\n    # assert that each datapoint has shape [1, 28, 28] or [784] depending on how you choose to format\n    # assert that all labels are represented\n

                where N_train should be either 30,000 or 50,000 depending on whether you are using just the first subset of the corrupted MNIST data or also including the second subset. N_test should be 5,000.

              3. Model testing: In a file called tests/test_model.py implement at least a test that checks for a given input with shape X that the output of the model has shape Y.

              4. Training testing: In a file called tests/test_training.py implement at least one test that asserts something about your training script. You are here given free hands on what should be tested but try to test something that risks being broken when developing the code.

              5. Good code raises errors and gives out warnings in appropriate places. This is often the case for invalid combinations of inputs to your script. For example, your model could check the size of the input given to it (see the code below) to make sure it corresponds to what you are expecting. Not implementing such errors would still result in Pytorch failing at a later point due to shape errors, however, these custom errors will probably make more sense to the end user. Implement at least one raised error or warning somewhere in your code and use either pytest.raises or pytest.warns to check that they are correctly raised/warned. As inspiration, the following implements a ValueError in code belonging to the model:

                # src/models/model.py\ndef forward(self, x: Tensor):\n    if x.ndim != 4:\n        raise ValueError('Expected input to a 4D tensor')\n    if x.shape[1] != 1 or x.shape[2] != 28 or x.shape[3] != 28:\n        raise ValueError('Expected each sample to have shape [1, 28, 28]')\n

                which would be captured by a test looking something like this:

                # tests/test_model.py\nimport pytest\nimport torch\n\ndef test_error_on_wrong_shape():\n    with pytest.raises(ValueError, match='Expected input to a 4D tensor'):\n        model(torch.randn(1,2,3))\n
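
                For completeness, here is a sketch of the pytest.warns counterpart (the helper function below is purely hypothetical and not part of the course code):

                # hypothetical example of testing a warning with pytest.warns\nimport warnings\n\nimport pytest\n\ndef batch_size_check(batch_size: int) -> None:\n    \"\"\"Hypothetical helper that warns on very large batch sizes.\"\"\"\n    if batch_size > 1024:\n        warnings.warn(\"Batch size is very large, this may exhaust memory\")\n\ndef test_warning_on_large_batch_size():\n    with pytest.warns(UserWarning, match=\"Batch size is very large\"):\n        batch_size_check(2048)\n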
              6. A test is only as good as the error message it gives, and by default, assert will only report that the check failed. However, we can help ourselves and others by adding strings after assert like

                assert len(train_dataset) == N_train, \"Dataset did not have the correct number of samples\"\n

                Add such comments to the assert statements you just did.

              7. The tests that involve checking anything that has to do with our data will of course fail if the data is not present. To future-proof our code, we can take advantage of the pytest.mark.skipif decorator. Use this decorator to skip your data tests if the corresponding data files do not exist. It should look something like this

                import os.path\n\nimport pytest\n\n@pytest.mark.skipif(not os.path.exists(file_path), reason=\"Data files not found\")\ndef test_something_about_data():\n    ...\n

                You can read more about skipping tests here

            5. After writing the different tests, make sure that they are passing locally.

            6. We often want to check a function/module for various input arguments. In this case, you could write the same test over and over again for different inputs, but pytest also has built-in support for this with the use of the pytest.mark.parametrize decorator. Implement a parametrized test and make sure that it runs for different inputs.
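
              A minimal sketch (MyAwesomeModel, the import path and the expected shapes are assumptions, adjust them to your own project):

              import pytest\nimport torch\n\nfrom src.models.model import MyAwesomeModel  # adjust the import to your own project layout\n\n@pytest.mark.parametrize(\"batch_size\", [1, 16, 64])\ndef test_model_output_shape(batch_size: int) -> None:\n    model = MyAwesomeModel()\n    x = torch.randn(batch_size, 1, 28, 28)\n    assert model(x).shape == (batch_size, 10)\n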

            7. There is no way of measuring how good the tests you have written are. However, what we can measure is the code coverage. Code coverage refers to the percentage of your codebase that actually gets run when all your tests are executed. Having a high coverage at least means that most of your code actually gets run by your tests.

              1. Install coverage

                pip install coverage\n
              2. Instead of running your tests directly with pytest, now do

                coverage run -m pytest tests/\n
              3. To get a simple coverage report simply type

                coverage report\n

                which will give you the percentage of coverage for each of your files. You can also write

                coverage report -m\n

                to get the exact lines that were missed by your tests.

              4. Finally, try to increase the coverage by writing a new test that runs some of the lines in your codebase that are not covered yet.

              5. Often coverage reports the code coverage on files that we actually do not want to get a code coverage for. Figure out how to configure coverage to exclude some files.

            "},{"location":"s5_continuous_integration/unittesting/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
            1. Assuming you have a code coverage of 100%, would you expect that no bugs are present in your code?

              Solution

              No, code coverage is not a guarantee that your code is bug-free. It is just a measure of how many lines of code are run when your tests are executed. Therefore, there may still be some corner case that is not covered by your tests and will result in a bug. However, having a high code coverage is a good indicator that you have tested your code.

            2. Consider the following code:

              @pytest.mark.parametrize(\"network_size\", [10, 100, 1000])\n@pytest.mark.parametrize(\"device\", [\"cpu\", \"cuda\"])\nclass MyTestClass:\n    @pytest.mark.parametrize(\"network_type\", [\"alexnet\", \"squeezenet\", \"vgg\", \"resnet\"])\n    @pytest.mark.parametrize(\"precision\", [torch.half, torch.float, torch.double])\n    def test_network1(self, network_size, device, network_type, precision):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass(network_size, network_type).to(device=device, dtype=precision)\n        ...\n\n    @pytest.mark.parametrize(\"add_dropout\", [True, False])\n    def test_network2(self, network_size, device, add_dropout):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass2(network_size, add_dropout).to(device)\n        ...\n

              how many tests are executed when running the above code?

              Solution

              The answer depends on whether or not we are running on a GPU-enabled machine. The test_network1 has 4 parameters, network_size, device, network_type, precision, that respectively can take on 3, 2, 4, 3 values meaning that in total that test will be running 3x2x4x3=72 times with different parameters on a GPU-enabled machine and 36 on a machine without a GPU. A similar calculation can be done for test_network2, which only has three factors network_size, device, add_dropout that result in 3x2x2=12 test on a GPU-enabled machine and 6 on a machine without a GPU. In total, that means 84 tests would run on a machine with a GPU and 42 on a machine without a GPU.

            That covers the basics of writing unit tests for Python code. We want to note that pytest is of course not the only framework for doing this. Python has a built-in framework called unittest for doing this as well (but pytest offers a few more features). Another open-source framework that you could choose to check out is hypothesis, which can help catch errors in corner cases of your code. In addition to writing unit tests it is also highly recommended to test the code that you include in the docstrings belonging to your functions and modules, to make sure that any code in your documentation is also correct. For such testing, we can highly recommend using Python's built-in framework doctest.
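
            As a small illustration (the function below is purely hypothetical, not part of the course code), a doctest lives directly in the docstring and can be checked with the built-in doctest module:

            def add_one(x: int) -> int:\n    \"\"\"Add one to the input.\n\n    Example:\n        >>> add_one(1)\n        2\n    \"\"\"\n    return x + 1\n\nif __name__ == \"__main__\":\n    import doctest\n    doctest.testmod()  # checks all doctests in this module\n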

            "},{"location":"s6_the_cloud/","title":"Cloud computing","text":"

            Slides

            Running computations locally is often sufficient when only playing around with code in the initial phase of development. However, to really scale your experiments you will need more computing power than what your standard laptop/desktop can offer. You probably already have experience with running on a local cluster or similar, but today's topic is about utilizing cloud computing.

            Image credit

            There exists a large number of cloud compute providers, with some of the biggest being:

            • Azure
            • AWS
            • Google Cloud project
            • Alibaba cloud

            They all have slight advantages and disadvantages over each other. In this course we are going to focus on Google cloud, because they have been kind enough to sponsor $50 of cloud credit to each student. If you happen to run out of credit, you can also get some free credit for a limited amount of time when you sign up with a new account. What's important to note is that all these different cloud providers have the same set of services, and learning how to use the services of one cloud provider in many cases translates to also knowing how to use the same services at another cloud provider. The services are called something different and can have a bit of a different interface/interaction pattern, but in the end it does not really matter.

            Today's exercises are about getting to know how to work with the cloud. If you are in doubt about anything or want to deep dive into some topics, I can recommend watching this series of videos or going through the general docs.

            Learning objectives

            The learning objectives of this session are:

            • In general being familiar with how the Google cloud SDK works
            • Being able to start different compute instances and work with them
            • Knowing how to set up continuous integration workflows for building docker images
            • Knowledge about how to store data and containers/artifacts in cloud buckets
            • Being able to train simple Deep Learning models using a combination of cloud services
            "},{"location":"s6_the_cloud/cloud_setup/","title":"Cloud setup","text":"

            Core Module

            Google Cloud Platform (GCP) is the cloud service provided by Google. The key concept, or selling point, of any cloud provider is the idea of near-infinite resources. Without the cloud it simply is not feasible to do many modern deep learning and machine learning tasks because they cannot be scaled locally.

            The image below shows a subset of all the different services that the Google cloud platform offers. The ones marked in red are the ones we are actually going to investigate in this course. Therefore, if you get done with exercises early I highly recommend that you deep dive more into the Google cloud platform.

            Image credit"},{"location":"s6_the_cloud/cloud_setup/#exercises","title":"\u2754 Exercises","text":"

            As the first step we are going to get you setup with some Google cloud credits.

            1. Go to https://learn.inside.dtu.dk. Go to this course. Find the recent message where there should be a download link and instructions on how to claim the $50 cloud credit. Please do not share the link anywhere as there is a limited number of coupons. If you are not officially taking this course at DTU, Google gives $300 cloud credits whenever you sign up with a new account. NOTE that you need to provide a credit card for this, so make sure to closely monitor your credit use so you do not end up spending more than the free credit.

            2. Login to the homepage of gcp. It should look like this:

            3. Go to billing and make sure that your account is showing $50 of cloud credit

              Make sure to also check out the Reports tab throughout the course. When you start to use some of the cloud services, this tab will update with info about how much of your credit you have used and how long you can keep going before your cloud credit runs out. Make sure that you monitor this page as you will not be given another coupon.

            4. One way to stay organized within GCP is to create projects.

              Create a new project called dtumlops. When you click create you should get a notification that the project is being created. The notification bell is a good way to keep track of how the processes you are running are doing throughout the course.

            5. For setup we are going to install gcloud. gcloud is the command line interface for working with our Google cloud account. Nearly everything that we can do through the web interface we can also do through the gcloud interface. Follow the installation instructions here for your specific OS.

              1. After installation, try in a terminal to type:

                gcloud -h\n

                the command should run and show the help page. If not, something went wrong in the installation (you may need to restart after installing).

              2. Now login by typing

                gcloud auth login\n

                you should be sent to a web page where you link your cloud account to the gcloud interface. Afterwards, also run this command:

                gcloud auth application-default login\n

                If you at some point want to revoke this you can type:

                gcloud auth revoke\n
              3. Next you will need to set the project that we just created. In your web browser, under project info, you should be able to see the Project ID belonging to your dtumlops project. Copy this and type the following command in a terminal

                gcloud config set project <project-id>\n

                You can also get the project info by running

                gcloud projects list\n
              4. Next install the Google cloud python API:

                pip install --upgrade google-api-python-client\n

                Make sure that the python interface is also installed. In a python terminal type

                import googleapiclient\n

                this should work without any errors.
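
                As a quick sanity check of the Python client (a sketch; it assumes that you have run gcloud auth application-default login and that the relevant APIs are available to your account), you can mirror gcloud projects list from Python:

                from googleapiclient import discovery\n\n# uses the application-default credentials created by `gcloud auth application-default login`\nservice = discovery.build(\"cloudresourcemanager\", \"v1\")\nprojects = service.projects().list().execute()\nfor project in projects.get(\"projects\", []):\n    print(project[\"projectId\"], project[\"lifecycleState\"])\n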

              5. (Optional) If you are using VSCode you can also download the relevant extension called Cloud Code. After installing it you should see a small Cloud Code button in the action bar.

            6. Finally, we need to activate a couple of developer APIs that are not activated by default. In a terminal write

              gcloud services enable apigateway.googleapis.com\ngcloud services enable servicemanagement.googleapis.com\ngcloud services enable servicecontrol.googleapis.com\n

              you can always check which services are enabled by typing

              gcloud services list\n

            After following these steps your laptop should hopefully be set up for using gcp locally. You are now ready to use their services, both locally on your laptop and in the cloud console.

            "},{"location":"s6_the_cloud/cloud_setup/#iam-and-quotas","title":"IAM and Quotas","text":"

            A big part of using the cloud in a bigger organisation has to do with admin and quotas. Admin here in general refers to the different roles that users of GCP can have, and quotas refer to the amount of resources that a given user has access to. For example one employee, lets say a data scientist, may only be granted access to certain GCP services that have to do with development and training of machine learning models, with X amount of GPUs available to use, to make sure that the employee does not spend too much money. Another employee, a devops engineer, probably does not need access to the same services and not necessarily the same resources.

            In this course we are not going to focus too much on this aspect but it is important to know that it exists. One feature you are going to need for doing the project is how to share a project with other people. This is done through the IAM (Identities and Access Management) page. Simply click the Grant Access button, search for the email of the person you want to share the project with and give them either Viewer, Editor or Owner access, depending on what you want them to be able to do. The figure below shows how to do this.

            What we are going to go through right now is how to increase the quota for how many GPUs you have available for your project. By default, for any free account in GCP (or account using teaching credits) the quota for GPUs that you can use is either 0 or 1 (their policies sometimes change). We will in the exercises below try to increase it.

            "},{"location":"s6_the_cloud/cloud_setup/#exercises_1","title":"\u2754 Exercises","text":"
            1. Start by enabling the Compute Engine service. Simply search for it in the top search bar. It should bring you to a page where you can enable the service (this may take some time). We are going to look more into this service in the next module.

            2. Next go to the IAM & Admin page, again search for it in the top search bar. The remaining steps are illustrated in the figure below.

              1. Go to the quotas page

              2. In the search field search for GPUs (all regions) (needs to match exactly, the search field is case sensitive), such that you get the same quota as in the image.

              3. In the Limit column you can see your current quota for the number of GPUs you can use. Additionally, to the right of the limit you can see the current usage. It is worth checking this if you are ever in doubt whether a job is running on GPU or not.

              4. Click the quota and afterwards the Edit quotas button.

              5. In the pop-up window, increase your limit to either 1 or 2.

              6. After sending your request you can try clicking the Increase requests tab to see the status of your request

            If you ever run into errors when working with GPUs that contain statements about quotas, you can always go to this page to see what you are actually allowed to use currently and try to increase it. For example, when you get to training machine learning models using Vertex AI in the next module, you will most likely need to ask for a quota increase for that service as well.

            Finally, we want to note that a quota increase is sometimes not allowed within 24 hours of creating an account. If your request gets rejected, we recommend to wait a day and try again. If this does still not work, you may need to use their services some more to make sure you are not a bot that wants to mine crypto.

            "},{"location":"s6_the_cloud/cloud_setup/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
            1. What considerations should you take into account when choosing a GCP region for running a new application?

              Solution

              A series of factors may influence your choice of region, including:

              • Services availability in the region, not all services are available in all regions
              • Resource availability: some regions have more GPUs available than others
              • Reduced latency: if your application is running in the same region as your users, the latency will be lower
              • Compliance: some countries have strict rules that require user info to be stored inside a particular region, e.g. the EU has GDPR rules that require all user data to be stored in the EU
              • Pricing: some regions may have different pricing than others
            2. The 3 major cloud providers all have the same services, but they are called something different depending on the provider. What are the corresponding names of these GCP services in AWS and Azure?

              • Compute Engine
              • Cloud storage
              • Cloud functions
              • Cloud run
              • Cloud build
              • Vertex AI

              It is important to know these correspondences to navigate blogpost etc. about MLOps on the internet.

              Solution
              • Compute Engine: AWS Elastic Compute Cloud (EC2), Azure Virtual Machines
              • Cloud storage: AWS Simple Storage Service (S3), Azure Blob Storage
              • Cloud functions: AWS Lambda, Azure Functions (serverless compute)
              • Cloud run: AWS App Runner / Fargate / Lambda, Azure Container Apps / Container Instances
              • Cloud build: AWS CodeBuild, Azure DevOps
              • Vertex AI: AWS SageMaker, Azure AI Platform
            "},{"location":"s6_the_cloud/using_the_cloud/","title":"Using the cloud","text":"

            Core Module

            In this set of exercises we are going to get more familiar with using some of the resources that the Google cloud platform offers.

            "},{"location":"s6_the_cloud/using_the_cloud/#compute","title":"Compute","text":"

            The most basic service of any cloud provider is the ability to create and run virtual machines. In gcp this service is called the Compute Engine API. A virtual machine allows you to essentially run an operating system that behaves like a completely separate computer. There are many reasons why one would want to use virtual machines:

            • Virtual machines allow you to scale your operations, essentially giving you access to infinitely many individual computers

            • Virtual machines allow you to use large-scale hardware. For example, if you are developing a deep learning model on your laptop and want to know the inference time for a specific hardware configuration, you can just create a virtual machine with those specs and run your model.

            • Virtual machines allow you to run processes in the \"background\". If you want to train a model for a week or more, you do not want to do this on your own laptop as you cannot really move it or do anything with it while it is training. Virtual machines allow you to just launch a job and forget about it (at least until you run out of credit).

            "},{"location":"s6_the_cloud/using_the_cloud/#exercises","title":"\u2754 Exercises","text":"

            We are now going to start actually using the cloud.

            1. Click on the Compute Engine tab in sidebar on the homepage of gcp.

            2. Try to Create instance. You will see the following image below.

              Give it a meaningful name, set the location to some location that is closer to where you actually are (to reduce latency). Finally, try to adjust the configuration a bit. What two factors are affecting the price of the compute unit?

            3. After figuring this out, create an e2-medium instance (leave the rest configured as default). Before clicking the Create button make sure to check the Equivalent Command Line button. You should see a very long command that you could have typed instead to do the exact same thing.

            4. Now in a local terminal type:

              gcloud compute instances list\n

              you should hopefully see the instance you have just created.

            5. You can start a terminal directly by typing:

              gcloud beta compute ssh --zone <zone> <name> --project <project-id>\n

              You can always see the exact command that you need to run to ssh to a VM by selecting the View gcloud command option in the Compute Engine overview (see image below).

            6. While logged into the instance, check if Python and Pytorch are installed. You should see that neither is installed. For the VM we have only specified what compute resources it should have, not what software should be on it. We can fix this by starting VMs based on specific docker images (it's all coming together).

              1. gcp comes with a number of ready-to-go images for doing deep learning. More info can be found here. Try running this line:

                gcloud container images list --repository=\"gcr.io/deeplearning-platform-release\"\n

                what does the output show?

              2. Next, start (in the terminal) a new instance using a Pytorch image. The command for doing it should look something like this:

                # add the last three arguments only if you want to run on GPU\ngcloud compute instances create <instance_name> \\\n    --zone=<zone> \\\n    --image-family=<image-family> \\\n    --image-project=deeplearning-platform-release \\\n    --accelerator=\"type=nvidia-tesla-k80,count=1\" \\\n    --maintenance-policy TERMINATE \\\n    --metadata=\"install-nvidia-driver=True\"\n

                You can find more info here on what <image-family> should have as value and what extra argument you need to add if you want to run on GPU (if you have access).

              3. ssh to the VM as in one of the previous exercises. Confirm that the image indeed contains both a python installation and that Pytorch is installed. Hint: you also have the possibility through the web page to start a browser session directly to the VMs you create:

            7. Finally, everything that you have done locally can also be achieved through the web terminal, which of course comes pre-installed with the gcloud command etc.

              Try out launching this and run some of the commands from the previous exercises.

            Stopping VMs

            If you are not careful you can end up wasting a lot of credits on virtual machines that you are not using. VMs are charged by the minute, so even if you are not using them you are still paying for them. Therefore, it is important that you remember to stop your VMs when you are not using them. You can do this by either clicking the Stop button in the VM overview page or by running the following command:

            gcloud compute instances stop <instance-name>\n
            "},{"location":"s6_the_cloud/using_the_cloud/#data-storage","title":"Data storage","text":"

            Another big part of cloud computing is storage of data. There are many reasons why you may want to store your data in the cloud, including:

            • Easily being able to share
            • Easily expand as you need more
            • Data is stored in multiple locations, making sure that it is not lost in case of an emergency

            Cloud storage is luckily also very cheap. Google cloud only takes around $0.026 per GB per month. This means that around 1 TB of data would cost you $26 per month, which is more than what the same amount of data would cost on Google Drive, but the storage in Google cloud is much more focused on enterprise use where you have a need for accessing data through an API.

            "},{"location":"s6_the_cloud/using_the_cloud/#exercises_1","title":"\u2754 Exercises","text":"

            When we did the exercise on data version control, we made dvc work together with our own Google drive to store data. However, a big limitation of this is that we need to authenticate each time we try to either push or pull the data. The solution is to instead use an API, which is offered through gcp.

            We are going to follow the instructions from this page

            1. Lets start by creating a storage bucket. On the GCP startpage, in the sidebar, click on Cloud Storage. On the next page click the Create bucket button:

              Give the bucket a unique name, set it to a region close by and, importantly, remember to enable Object versioning under the last tab. Finally click Create.

            2. After creating the storage, you should be able to see it online and you should be able to see it if you type in your local terminal:

              gsutil ls\n

              gsutil is an additional command to gcloud, that provides more command line options.

            3. Next we need the Google storage extension for dvc

              pip install dvc[gs]\n
            4. Now in your MNIST repository where you have already configured dvc, we are going to change the storage from our Google drive to our newly created Google cloud storage.

              dvc remote add -d remote_storage <output-from-gsutils>\n

              In addition we are also going to modify the remote to support object versioning (called version_aware in dvc):

              dvc remote modify remote_storage version_aware true\n

              This will change the default way that dvc handles data. Instead of just storing the latest version of the data as content-addressable storage it will now store the data as it looks in our local repository, which allows us to not only use dvc to download our data.

            5. The above command will change the .dvc/config file. git add and git commit the changes to that file. Finally, push data to the cloud

              dvc push\n
            6. Finally, make sure that you can pull without having to give your credentials. The easiest way to see this is to delete the .dvc/cache folder that should be locally on your laptop and afterwards do a dvc pull.

            This setup should work when trying to access the data from your laptop, which we authenticated in the previous module. However, how can you access the data from a virtual machine, inside a docker container or from a different laptop? We in general recommend two ways:

            • You can make the bucket publicly accessible, i.e. no authentication needed. That means that anyone with the url to the data can access it. This is the easiest way to do it, but also the least secure. You can read more about how to make your buckets public here.

            • You can create a service account, which is a more secure way of accessing data. A service account is essentially a second user which you can give access to specific services. You can read more about how to create a service account here. Once you have created a service account you can give it access to a specific bucket by going to the Permissions tab of the bucket and adding the service account as a member.

              If you need to authenticate your service account from a VM, you can do it by running the following command:

              gcloud auth activate-service-account --key-file=<key-file>\n

              where the <key-file> is the json file that you downloaded when you created the service account (DO NOT SHARE THIS).
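
              The same key file can also be used from Python (a sketch; it assumes pip install google-cloud-storage, which is not part of the course requirements):

              from google.cloud import storage\n\n# path to the service-account json key (do not commit this file to git)\nclient = storage.Client.from_service_account_json(\"path/to/key.json\")\n\nbucket = client.bucket(\"<bucket-name>\")  # the bucket you created earlier\nfor blob in bucket.list_blobs():\n    print(blob.name)\n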

            "},{"location":"s6_the_cloud/using_the_cloud/#artifact-registry","title":"Artifact registry","text":"

            You should hopefully at this point have seen the strength of using containers to create reproducible environments. They allow us to specify exactly the software that we want to run inside our VMs. However, you should already have run into two problems with containers

            • Building process can take a lot of time
            • Docker images can be large

            For this reason we want to move both the building process and the storage of images to the cloud. In GCP the service for this is called Artifact registry, formerly known as Container registry.

            "},{"location":"s6_the_cloud/using_the_cloud/#exercises_2","title":"\u2754 Exercises","text":"

            For the purpose of these exercises I recommend that you start out with a dummy version of some code to make sure that the building process does not take too long. You are more than free to fork this repository. The repository contains a simple python script that does image classification using sklearn. The docker images for this application are therefore going to be substantially faster to build and smaller in size than the Pytorch-based images we are used to.

            1. Start by enabling the services: Google Artifact Registry API and Google Cloud Build API. This can be done through the web page (by searching for the services) or can also be enabled from the terminal:

              gcloud services enable artifactregistry.googleapis.com\ngcloud services enable cloudbuild.googleapis.com\n
            2. Google cloud building can in principle work out of the box with docker files. However, the recommended way is to add specialized cloudbuild.yaml files. They should look something like this:

              steps:\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['build', '-t', 'gcr.io/<project-id>/<image-name>', '.']\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['push', 'gcr.io/<project-id>/<image-name>']\n

              which essentially is a basic yaml file that contains a list of steps, where each step consists of the service that should be used and the arguments for that service. In the above example we are calling the same service (cloud-builders/docker) with different arguments (build and then push). Implement such a file in your repository. Hint: if you forked the repository then you at least need to change the <project-id>.

            3. From the gcp homepage, navigate to the triggers panel:

              Click on the manage repositories.

            4. From there, click the Connect Repository and go through the steps of authenticating your github profile with gcp and choose the repository that you want to set up build triggers for. For now, skip the Create a trigger (optional) part by pressing Done in the end.

            5. Navigate back to the Triggers homepage and click Create trigger. Set the following:

              • Give a name
              • Event: choose Push to branch
              • Source: choose the repository you just connected
              • Branch: choose ^main$
              • Configuration: choose either Autodetected or Cloud build configuration file

              Finally click the Create button and the trigger should show up on the triggers page.

            6. To activate the trigger, push some code to the chosen repository.

            7. Go to the Cloud Build page and you should see the image being build and pushed.

              Try clicking on the build to check out the build process and building summary. As you can see from the image, if a build is failing you will often find valuable info by looking at the build summary. If your build is failing, try to configure it to run in one of these regions: us-central1, us-west2, europe-west1, asia-east1, australia-southeast1, southamerica-east1 as specified in the documentation.

            8. If/when your build is successful, navigate to the Artifact Registry page. You should hopefully find that the image you just built was pushed here. Congrats!

            9. Finally, to pull your image down to your laptop

              docker pull gcr.io/<project-id>/<image_name>:<image_tag>\n

              you will need to authenticate docker with gcp first. Instructions can be found here, but the following command should hopefully be enough to make docker and gcp talk to each other:

              gcloud auth configure-docker\n

              Note: To do this you need to have docker actively running in the background, as any other time you want to use docker.

            10. Automation through the cloud is in general the way to go, but sometimes you may want to manually create images and push them to the registry. Figure out how to push an image to your Artifact Registry. For simplicity you can just push the busybox image you downloaded during the initial docker exercises. This page should help you with the exercise.

            "},{"location":"s6_the_cloud/using_the_cloud/#training","title":"Training","text":"

            As our final step in our journey through different GCP services in this module we are going to look at training our models. This is one of the important tasks that GCP can help us with, because we can always rent more hardware as long as we have credits, meaning that we can scale both horizontally (run more experiments) and vertically (run longer experiments).

            We are going to check out two ways of running our experiments. First we are going to return to the Compute Engine service because it gives the simplest form of scaling of experiments. That is: we create a VM with an appropriate docker image, start it, log in to the VM and run our experiments. It is possible for most people to run a couple of experiments in parallel this way. However, what if there was an abstraction layer that automatically created a VM for us, launched our experiments and then closed the VM afterwards?

            This is where the Vertex AI service comes into play. This is a dedicated service for handling ML models in GCP. Vertex AI is in principle an end-to-end service that can take care of everything machine learning related in the cloud. In this course we primarily focus on just the training of our models, and then use other services for different parts of our pipeline.

            "},{"location":"s6_the_cloud/using_the_cloud/#exercises_3","title":"\u2754 Exercises","text":"
            1. Let's start by seeing how we could train a Pytorch model using the Compute Engine service:

              1. Start by creating an appropriate VM. If you want to start a VM that has Pytorch pre-installed with only CPU support you can run the following command

                gcloud compute instances create <instance-name> \\\n    --zone europe-west1-b \\\n    --image-family=pytorch-latest-cpu \\\n    --image-project=deeplearning-platform-release\n

                alternatively, if you have access to GPU in your GCP account you could start a VM in the following way

                gcloud compute instances create <instance-name> \\\n    --zone europe-west4-a \\\n    --image-family=pytorch-latest-gpu \\\n    --image-project=deeplearning-platform-release \\\n    --accelerator=\"type=nvidia-tesla-v100,count=1\" \\\n    --metadata=\"install-nvidia-driver=True\" \\\n    --maintenance-policy TERMINATE\n
              2. Next, log in to your newly created VM. You can either open an ssh terminal in the cloud console or run the following command

                gcloud beta compute ssh <instance-name>\n
              3. It is recommended to always check that the VM we get is actually what we asked for. In this case the VM should have Pytorch pre-installed, so let's check for that by running

                python -c \"import torch; print(torch.__version__)\"\n

                Additionally, if you have a VM with GPU support also try running the nvidia-smi command.

              4. When you have logged in to the VM, it works as your own machine. Therefore, to run some training code you would need to do the same setup steps you have done on your own machine: clone your github repository, install dependencies, download data, run code. Try doing this to make sure you can train a model.

            2. (Optional, may not work as intended) The last step in the previous exercise involves a lot of setup that would be necessary to do every time we create a new VM, making horizontal scaling of experiments cumbersome. However, we have already developed docker images that can take care of most of the setup.

              1. Let's for simplicity just create a very small docker image (called gcp_vm_tester.dockerfile) that you can use

                FROM gcr.io/deeplearning-platform-release/pytorch-cpu\nRUN pip install matplotlib\n

                this basically just extends the base Pytorch image to also install matplotlib. The important part about the docker images that we want to use here is that they should not have an ENTRYPOINT at the end, because we do not want the docker container to actually run our scripts, just install dependencies on startup.

              2. Let's build the image and manually push it to our container registry in gcp. Build with:

                docker build -f gcp_vm_tester.dockerfile . -t gcp_vm_tester:latest\n

                and then push with

                docker tag gcp_vm_tester gcr.io/<project-id>/gcp_vm_tester\ndocker push gcr.io/<project-id>/gcp_vm_tester\n

                confirm by going to the container registry in the cloud console and check that the image has been correctly pushed.

              3. Let's then create a VM with that particular docker image. Instead of using gcloud compute instances create we are now using the gcloud compute instances create-with-container command

                gcloud compute instances create-with-container <instance-name> \\\n    --container-image=gcr.io/<project-id>/gcp_vm_tester \\\n    --zone europe-west1-b\n
              4. Confirm that everything works by accessing your newly created VM and run both of these commands

                python -c \"import torch; print(torch.__version__)\"\npython -c \"import matplotlib; print(matplotlib.__version__)\"\n
            3. We are now moving on to the final way to train our code, using the Vertex AI service.

              1. Start by enabling it by searching for Vertex AI in the cloud console and going to the service

              2. The way we are going to use Vertex AI is to create custom jobs, because we have already developed docker containers that contain everything needed to run our code. Thus the only command that we actually need is the gcloud ai custom-jobs create command. An example here would be:

                gcloud ai custom-jobs create \\\n    --region=europe-west1 \\\n    --display-name=test-run \\\n    --config=config.yaml\n

                Essentially, this one command does everything: it first creates a VM with the specs specified by a configuration file, then loads a container, again specified in the configuration file, and finally it runs everything. An example of a config file could be:

                # config_cpu.yaml\nworkerPoolSpecs:\n    machineSpec:\n        machineType: n1-highmem-2\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n

                if you only want to run on CPU and another example for GPU:

                # config_gpu.yaml\nworkerPoolSpecs:\n    machineSpec:\n        machineType: n1-standard-8\n        acceleratorType: NVIDIA_TESLA_T4 #(1)!\n        acceleratorCount: 1\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n
                1. In this case we are requesting an Nvidia Tesla T4 GPU. This will only work if you have quota for allocating this type of GPU in the Vertex AI service. You can check how to request quota in the last exercise of the previous module. Remember that it is not enough to just request quota for the GPU, the request needs to be approved by Google before you can use it.

                You can read more about the configuration formatting here and the different types of machines here. Try to execute a job using the gcloud ai custom-jobs create command. For additional documentation you can check out the documentation on the command and this page and this page.

              3. Assuming you manage to launch a job, you should see an output like this:

                Try executing the commands that are outputted to look at both the status and the progress of your job.

              4. In addition you can also visit the Custom Jobs tab in the training part of Vertex AI

                Check it out.

              5. During custom training we do not necessarily need to use dvc for downloading our data. A more efficient way is to use cloud storage as a mounted file system. This allows us to access data directly from the cloud storage without having to download it first. All our training jobs automatically have a gcs folder mounted in the root directory. Try to access the data from your training script:

                # loading from a bucket using mounted file system\ndata = torch.load('/gcs/<my-bucket-name>/data.pt')\n# writing to a bucket using mounted file system\ntorch.save(data, '/gcs/<my-bucket-name>/data.pt')\n

                This should speed up the training process a bit.

            This ends the session on how to use Google cloud services for now. In a future session we are going to investigate a bit more of the services offered in GCP, in particular for deploying the models that we have just trained.

            "},{"location":"s7_deployment/","title":"08. Model deployment","text":"

            Slides

            Let's say that you have spent 1000 GPU hours and trained the most awesome model that you want to share with the world. One way to do this is of course to just place all your code in a github repository, upload a file with the trained model weights to your favorite online storage (assuming it is too big for github to handle) and ask people to download your code and the weights to run it themselves. This is a fine approach in a small research setting, but in production you need to be able to deploy the model to an environment that is fully contained, such that people can just execute it without looking (too hard) at the code.

            Image credit

            In this session we try to look at methods specialized towards deployment of models on your local machine and also how to deploy services in the cloud.

            Learning objectives

            The learning objectives of this session are:

            • Understand the basics of requests and APIs
            • Can create custom APIs using the framework fastapi and run it locally
            • Knowledge about serverless deployments and how to deploy custom APIs using both serverless functions and serverless containers
            "},{"location":"s7_deployment/apis/","title":"M22 - Requests and APIs","text":""},{"location":"s7_deployment/apis/#requests-and-apis","title":"Requests and APIs","text":"

            Core Module

            Before we can get to deployment of our models we need to understand concepts such as APIs and requests. The core reason for this is that we need a new abstraction layer on top of our applications that is not Python-specific. While Python is the de facto language for machine learning, we cannot expect everybody else to use it and in particular, we cannot expect network protocols (both local and external) to be able to communicate with our Python programs out of the box. For this reason, we need to understand requests, in particular HTTP requests, and how to create APIs that can interact with those requests.

            "},{"location":"s7_deployment/apis/#requests","title":"Requests","text":"

            When we are talking about requests, we are essentially talking about the communication method used in client-server types of architectures. As shown in the image below, in this architecture, the client (user) is going to send requests to a server (our machine learning application) and the server will give a response. For example, the user may send a request to get the class of a specific image, which our application will do and then send back the response in terms of a label.

            Image credit

            The common way of sending requests is called HTTP (Hypertext Transfer Protocol). It is essentially a specification of the intermediary transportation method between the client and server. An HTTP request essentially consists of two parts:

            • A request URL: the location of the server we want to send our request to
            • A request Method: describing what action we want to perform on the server

            The common request methods are (case sensitive):

            • GET: get data from the server
            • POST/PUT: send data to the server
            • DELETE: delete data on the server

            You can read more about the different methods here. For most machine learning applications, GET and POST are the core methods to remember. Additionally, if you want to read more about HTTP in general we highly recommend that you go over this comic strip protocol, but the TLDR is that it provides privacy, integrity and identification over the web.

            "},{"location":"s7_deployment/apis/#exercises","title":"\u2754 Exercises","text":"

            We are going to do a couple of exercises on sending requests using requests package to get familiar with the syntax.

            1. Start by installing the requests package

              pip install requests\n
            2. Afterwards, create a small script and try to execute the code

              import requests\nresponse = requests.get('https://api.github.com/this-api-should-not-exist')\nprint(response.status_code)\n

              As you can see from the syntax, we are sending a request using the GET method. This code should return status code 404. Take a look at this page that contains a list of status codes. Next, let's call a page that exists

              import requests\nresponse = requests.get('https://api.github.com')\nprint(response.status_code)\n

              What is the status code now and what does it mean? Status codes are important when you have an application that is interacting with a server and wants to make sure that it does not fail, which can be done with simple if statements on the status codes

              if response.status_code == 200:\n    print('Success!')\nelif response.status_code == 404:\n    print('Not Found.')\n
            3. Next, try to call the following

              response=requests.get(\"https://api.github.com/repos/SkafteNicki/dtu_mlops\")\n

              which gives back a payload. Essentially, payload refers to any additional data that is sent from the client to the server or vice-versa. Try looking at the response.content attribute. What is the type of this attribute?

            4. You should hopefully observe that the .content attribute is of type bytes. It is important to note that the standard way of sending payloads is to encode them into byte objects. To get a more human-readable version of the response, we can convert it to JSON format

              response.json()\n

              It is important to remember that a JSON object in Python is just a nested dictionary, which is handy if you ever want to iterate over the object in some way, as in the sketch below.
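
              A minimal hedged example of this, reusing the https://api.github.com call from before:

              import requests\nresponse = requests.get('https://api.github.com')\npayload = response.json()  # just a nested Python dictionary\nfor key, value in payload.items():  # iterate over the top-level keys\n    print(key, value)\n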

            5. When we use the GET method we can additionally provide a params argument, that specifies what we want the server to send back for a specific request URL:

              response = requests.get(\n    'https://api.github.com/search/repositories',\n    params={'q': 'requests+language:python'},\n)\n

              Before looking at response.json(), can you explain what the code does? You can try looking at this page for help.

            6. Sometimes the content of a page cannot be converted into JSON, because as already stated data is sent as bytes. Say that we want to download an image, which we can do in the following way

              import requests\nresponse = requests.get('https://imgs.xkcd.com/comics/making_progress.png')\n

              Try calling response.json(), what happens? Next, try calling response.content. To get the result in this case we would need to convert from bytes to an image:

              with open(r'img.png','wb') as f:\n    f.write(response.content)\n
            7. The get method is the most useful method because it allows us to get data from the server. However, as stated in the beginning multiple request methods exist, for example, the POST method for sending data to the server. Try executing:

              pload = {'username':'Olivia','password':'123'}\nresponse = requests.post('https://httpbin.org/post', data = pload)\n

              Investigate the response (this is an artificial example because we do not control the server).

            8. Finally, we should also know that requests can be sent directly from the command line using the curl command. Sometimes it is easier to send a request directly from the terminal and sometimes it is easier to do it from a script.

              1. Make sure you have curl installed, or else find instructions on installing it. To check, call curl --help which should show the documentation for curl.

              2. To execute requests.get('https://api.github.com') using curl we would simply do

                curl -X GET \"https://api.github.com\"\ncurl -X GET -I \"https://api.github.com\" # if you want the status code\n

                Try it yourself.

              3. Try to redo some of the exercises yourself using curl.

            That ends the intro session on requests. Do not worry if you are still not completely comfortable with sending requests, we are going to return to how we do it in practice when we have created our API. If you want to learn more about the requests package you can check out this tutorial and if you want to see more examples of how to use curl you can check out this page

            "},{"location":"s7_deployment/apis/#creating-apis","title":"Creating APIs","text":"

            Requests are all about being on the client side of our client-server architecture. We are now going to move on to the server side where we will be learning about writing the APIs that requests can interact with. An application programming interface (API) is essentially the way the developer (you) tells a user how to use the application that you have created. The API is an abstraction layer that allows the user to interact with our application in the way we want them to interact with it, without the user even having to look at the code.

            We can take the API from github as an example https://api.github.com. This API allows any user to retrieve, integrate and send data to Github without ever having to visit their webpage. The API exposes multiple endpoints that have various functions:

            • https://api.github.com/repos/OWNER/REPO/branches: check out the branches on a given repository
            • https://api.github.com/search/code: search through Github for repositories
            • https://api.github.com/repos/OWNER/REPO/actions/workflows: check the status of workflows for a given repository

            and we could go on. However, there may be functionality that Github is not interested in users having access to and they may therefore choose not to have endpoints for specific features (1).

            1. Many companies provide public APIs to interact with their services/data. For a general list of public APIs you can check out this page. For the Danes out there, you can check out this list of public and private APIs from Danish companies and organizations.

            The particular kind of API we are going to work with is called a REST API (or RESTful API). REST specifies constraints that a particular API needs to fulfill to be considered RESTful. You can read more about the six guiding principles behind REST APIs on this page, but one of the most important to have in mind is that the client-server architecture needs to be stateless. This means that whenever a request is sent to the server it needs to be self-contained (all information included) and the server cannot rely on any previously stored information from previous requests.

            To implement APIs in practice we are going to use FastAPI. FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+ based on standard Python type hints. FastAPI is only one of many frameworks for defining APIs, however, compared to other frameworks such as Flask and Django it offers a sweet spot of being flexible enough to do what you want without having many additional (unnecessary) features.

            "},{"location":"s7_deployment/apis/#exercises_1","title":"\u2754 Exercises","text":"

            The exercises below are a condensed version of this and this tutorial. If you ever need context for the exercises, we can recommend trying to go through these. Additionally, we also provide this solution file that you can look through for help.

            1. Install FastAPI

              pip install fastapi\n

              This contains the functions, modules, and variables we are going to need to define our interface.

            2. Additionally, also install uvicorn which is a package for defining low level server applications.

              pip install uvicorn[standard]\n
            3. Start by defining a small application like this in a file called main.py:

              from fastapi import FastAPI\napp = FastAPI()\n\n@app.get(\"/\")\ndef read_root():\n    return {\"Hello\": \"World\"}\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n    return {\"item_id\": item_id}\n

              Important here is the use of the @app.get decorator. What could this decorator refer to? Explain what the two functions are probably doing.

            4. Next, let's launch our app. Since we called our script main.py and inside the script initialized our API with app = FastAPI(), the application that we want to deploy can be referenced by main:app:

              uvicorn --reload --port 8000 main:app\n

              this will launch a server at this page: http://localhost:8000/. As you will hopefully see, this page will return the content of the root function, like the image below. Remember to also check the output in your terminal as that will give info on when and how your application is being invoked.

              1. What webpage should you open to get the server to return 1?

              2. Also check out the pages: http://localhost:8000/docs and http://localhost:8000/redoc. What do these pages show?

              3. The power of the docs and redoc pages is that they allow you to easily test your application with their simple UI. As shown in the image below, simply open the endpoint you want to test, click the Try it out button, input any values and execute it. It will return the corresponding curl command for invoking your endpoint, the corresponding URL and the response of your application. Try it out.

              4. You can also visit http://localhost:8000/openapi.json to check out the schema that is generated, which essentially is a json file containing the overall specification of your program.

              5. Try to access http://localhost:8000/items/foo, what happens in this case? When you specify types in your API, FastAPI will automatically do type validation using pydantic, making sure users can only access your API with the correct types. Therefore, remember to include types in your applications!
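
                A minimal sketch of observing this from Python, assuming the app from above is running locally on port 8000, could be:

                import requests\n# 'foo' cannot be parsed as an int, so FastAPI/pydantic should reject the request\nresponse = requests.get('http://localhost:8000/items/foo')\nprint(response.status_code)  # expect 422 (unprocessable entity)\nprint(response.json())  # the validation error details\n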

            5. With the fundamentals in place let's configure it a bit more:

              1. Let's start by changing the root function to include a bit more info. In particular, we are also interested in returning the status code so the end user can easily read it. Default status codes are included in the built-in http python package:

                from http import HTTPStatus\n\n@app.get(\"/\")\ndef root():\n    \"\"\" Health check.\"\"\"\n    response = {\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n    }\n    return response\n

                Try to reload the page and see what is returned now. You should not have to re-launch the app because we started uvicorn with the --reload argument.

              2. When we decorate our functions with @app.get(\"/items/{item_id}\"), item_id is in this case what we call a path parameter because it is directly included in the path of our endpoint. We have already seen how we can restrict a path parameter to a single type, but what if we want to restrict it to specific values? This is often the case if we are working with parameters of type str. In this case we would need to define an enum:

                from enum import Enum\nclass ItemEnum(Enum):\n    alexnet = \"alexnet\"\n    resnet = \"resnet\"\n    lenet = \"lenet\"\n\n@app.get(\"/restric_items/{item_id}\")\ndef read_item(item_id: ItemEnum):\n    return {\"item_id\": item_id}\n

                Add this API, reload and execute both a valid parameter and a non-valid parameter.

              3. In contrast to path parameters we have query parameters. In the requests exercises we saw an example of this where we were calling https://api.github.com/search/code with the query 'q': 'requests+language:python'. Any parameter in FastAPI that is not a path parameter, will be considered a query parameter:

                @app.get(\"/query_items\")\ndef read_item(item_id: int):\n    return {\"item_id\": item_id}\n

                Add this API, reload and figure out how to pass in a query parameter (one way of doing it from Python is sketched below).
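
                A hedged sketch of how such an endpoint could be called, assuming the app runs locally on port 8000:

                import requests\n# query parameters go after '?' in the URL, or can be given via the params argument\nresponse = requests.get('http://localhost:8000/query_items', params={'item_id': 42})\nprint(response.json())\n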

              4. We have until now worked with the .get method, but let's also see an example of the .post method. As already described, the POST request method is used for uploading data to the server. Here is a simple app that saves a username and password in a database (please never implement it like this in real life):

                database = {'username': [ ], 'password': [ ]}\n\n@app.post(\"/login/\")\ndef login(username: str, password: str):\n    username_db = database['username']\n    password_db = database['password']\n    if username not in username_db and password not in password_db:\n        with open('database.csv', \"a\") as file:\n            file.write(f\"{username}, {password} \\n\")\n        username_db.append(username)\n        password_db.append(password)\n    return \"login saved\"\n

                Make sure you understand what the function does and then try to execute it a couple of times to see your database updating. It is important to note that we sometimes in the following exercises use the .get method and sometimes the .post method. For our usage it does not really matter.

            6. We are now moving on to figuring out how to provide different standard inputs like text, images, json to our APIs. It is important that you try out each example yourself and in particular you look at the curl commands that are necessary to invoke each application.

              1. Here is a small application, that takes a single text input

                @app.get(\"/text_model/\")\ndef contains_email(data: str):\n    regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n    response = {\n        \"input\": data,\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n        \"is_email\": re.fullmatch(regex, data) is not None\n    }\n    return response\n

                What does the application do? Try it out yourself

              2. Let's say we wanted to extend the application to check for a specific email domain, either gmail or hotmail. Assume that we want to feed this into our application as a json object e.g.

                {\n    \"email\": \"mlops@gmail.com\",\n    \"domain_match\": \"gmail\"\n}\n

                Figure out how to alter the data parameter such that it takes in the json object, and make sure to extend the application to check if the email and domain also match. Hint: take a look at this page. A possible starting point is sketched below.
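
                One possible starting point (a sketch only, where the model name EmailDomainMatch is our own invention) is to define a pydantic model for the request body:

                from fastapi import FastAPI\nfrom pydantic import BaseModel\n\napp = FastAPI()\n\nclass EmailDomainMatch(BaseModel):  # hypothetical name for the request body\n    email: str\n    domain_match: str\n\n@app.post(\"/check_domain/\")\ndef check_domain(data: EmailDomainMatch):\n    # naive check: does the part after '@' start with the requested domain?\n    return {\"is_match\": data.email.split(\"@\")[-1].startswith(data.domain_match)}\n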

              3. Let's move on to an application that requires a file input:

                from fastapi import UploadFile, File\nfrom typing import Optional\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...)):\n    with open('image.jpg', 'wb') as image:\n        content = await data.read()\n        image.write(content)\n        image.close()\n\n    response = {\n        \"input\": data,\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n    }\n    return response\n

                A couple of new things are going on here: we use the specialized UploadFile and File bodies in our input definition. Additionally, we added the async/await keywords. Figure out what everything does and try to run the application (you can use any image file you like).

              4. The above application actually does not do anything. Let's add opencv as a package and let's resize the image. It can be done with the following three lines:

                import cv2\nimg = cv2.imread(\"image.jpg\")\nres = cv2.resize(img, (h, w))\n

                Figure out where to add them in the application and additionally add h and w as optional parameters, with a default value of 28. Try running the application where you specify everything and one more time where you leave out h and w.

              5. Finally, let's also figure out how to return a file from our application. You will need to add the following lines:

                from fastapi.responses import FileResponse\ncv2.imwrite('image_resize.jpg', res)\nFileResponse('image_resize.jpg')\n

                Figure out where to add them to the code and try running the application one more time to see that you get a file back with the resized image. One possible way of combining the last three steps is sketched below.
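
                Here is one hedged sketch (assuming opencv-python is installed; it is not the only solution) of how the upload, resize and file response pieces could fit together, with h and w becoming optional query parameters:

                import cv2\nfrom fastapi import FastAPI, File, UploadFile\nfrom fastapi.responses import FileResponse\n\napp = FastAPI()\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...), h: int = 28, w: int = 28):\n    # save the uploaded image to disk\n    with open('image.jpg', 'wb') as image:\n        image.write(await data.read())\n    # resize to the (optional) requested size and return the resized file\n    img = cv2.imread('image.jpg')\n    res = cv2.resize(img, (h, w))\n    cv2.imwrite('image_resize.jpg', res)\n    return FileResponse('image_resize.jpg')\n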

            7. (Optional) Let's try to figure out how to use FastAPI in a machine learning context. Below is a script that downloads a VisionEncoderDecoder model from huggingface. The model can be used to create captions for a given image. Thus calling

              predict_step(['s7_deployment/exercise_files/my_cat.jpg'])\n

              returns a list of strings like ['a cat laying on a couch with a stuffed animal'] (try this yourself). Create a FastAPI application that can do inference using this model, i.e. it should take in an image, preferably an optional json object for configuring some of the hyperparameters (like max_length), and should return a string containing the generated caption. A possible skeleton is sketched after the script below.

              from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer\nimport torch\nfrom PIL import Image\n\nmodel = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\nfeature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\ntokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\ngen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\ndef predict_step(image_paths):\n    images = []\n    for image_path in image_paths:\n        i_image = Image.open(image_path)\n        if i_image.mode != \"RGB\":\n            i_image = i_image.convert(mode=\"RGB\")\n\n        images.append(i_image)\n    pixel_values = feature_extractor(images=images, return_tensors=\"pt\").pixel_values\n    pixel_values = pixel_values.to(device)\n    output_ids = model.generate(pixel_values, **gen_kwargs)\n    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n    preds = [pred.strip() for pred in preds]\n    return preds\n
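
              A possible skeleton for such an application (a sketch only: it assumes the script above lives in the same file, and the /caption/ endpoint name is our own choice) could be:

              from fastapi import FastAPI, File, UploadFile\n\napp = FastAPI()\n\n@app.post(\"/caption/\")  # hypothetical endpoint name\nasync def caption(data: UploadFile = File(...), max_length: int = 16):\n    # save the uploaded image and reuse predict_step from the script above\n    with open('uploaded.jpg', 'wb') as f:\n        f.write(await data.read())\n    gen_kwargs['max_length'] = max_length  # optional hyperparameter override\n    return {\"caption\": predict_step(['uploaded.jpg'])[0]}\n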
            8. As the final step, we want to figure out how to include our FastAPI application in a docker container, as it will help us when we want to deploy in the cloud because docker as always can take care of the dependencies for our application. For the following set of exercises you can take any of the previous FastAPI applications as the base application for the container.

              1. Start by creating a requirements.txt file for your application. You will at least need fastapi and uvicorn in the file and we always recommend that you are specific about the versions you want to use

                fastapi>=0.68.0,<0.69.0\nuvicorn>=0.15.0,<0.16.0\n# add anything else your application needs to be able to run\n
              2. Next, create a Dockerfile with the following content

                FROM python:3.9\nWORKDIR /code\nCOPY ./requirements.txt /code/requirements.txt\n\nRUN pip install --no-cache-dir --upgrade -r /code/requirements.txt\nCOPY ./app /code/app\n\nCMD [\"uvicorn\", \"app.main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"80\"]\n

                The above assumes that your file structure looks like this

                .\n\u251c\u2500\u2500 app\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2514\u2500\u2500 main.py\n\u251c\u2500\u2500 Dockerfile\n\u2514\u2500\u2500 requirements.txt\n

                Hopefully, all these steps should look familiar if you already went through module M9, except for maybe the last line. However, this is just the standard way that we have run our FastAPI applications in the last couple of exercises, this time with some extra arguments regarding the ports we allow.

              3. Next, build the corresponding docker image

                docker build -t my_fastapi_app .\n
              4. Finally, run the image such that a container is spun up that runs our application. The important part here is to remember to specify the -p argument (p for port) which should be the same number as the port we have specified in the last line of our Dockerfile.

                docker run --name mycontainer -p 80:80 my_fastapi_app\n
              5. Check that everything is working by going to the corresponding localhost page http://localhost/items/5?q=somequery

            9. (Optional) In module M15 on unit testing you learned how to write unit tests for your data pipeline and model. It should come as no surprise that the same can also be done for your API. Doing so should be able to tell you if your API is working as you expect it to. The only complication regarding APIs is that you need a server to do the testing, and we cannot use uvicorn for this. Check out this page on how to test FastAPI applications, and add a file called test_api.py to your tests folder with appropriate tests for your API; a minimal starting point is sketched below.
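
              Assuming your FastAPI instance can be imported as app from app.main (adjust the import to your own project layout), such a file could start like this:

              # tests/test_api.py\nfrom fastapi.testclient import TestClient\nfrom app.main import app  # adjust to wherever your FastAPI instance lives\n\nclient = TestClient(app)\n\ndef test_read_root():\n    response = client.get(\"/\")\n    assert response.status_code == 200\n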

            This ends the module on APIs. If you want to go further in this direction we highly recommend that you check out bentoml which is an API standard that focuses solely on creating easy-to-understand APIs and services for ml-applications. Additionally, we can also highly recommend checking out Postman which can help design, document and in particular test the API you are writing to make sure that it works as expected.

            "},{"location":"s7_deployment/cloud_deployment/","title":"M24 - Cloud Deployment","text":""},{"location":"s7_deployment/cloud_deployment/#cloud-deployment","title":"Cloud deployment","text":"

            Core Module

            We are now returning to using the cloud. At this point you should have gone through the steps of having code in your github repository automatically build into a docker container, storing that container, storing data and pulling it all together to make a training run. After the training is completed you should hopefully have a file stored in the cloud with your trained model weights.

            Today's exercises will be about serving those model weights to an end user. We focus on two different ways of deploying our model: Google Cloud Functions and Google Cloud Run.

            "},{"location":"s7_deployment/cloud_deployment/#cloud-functions","title":"Cloud Functions","text":"

            Cloud functions are the easiest way to get started with deployment because they are what is called serverless. For serverless deployment we still need a server to do the actual workload, however the core concept is that you do not have to manage the server. Everything is magically taken care of behind the scenes.

            "},{"location":"s7_deployment/cloud_deployment/#exercises","title":"\u2754 Exercises","text":"
            1. Go to the start page of Cloud Functions. It can be found in the sidebar on the homepage or you can just search for it. Activate the service if it is not already active.

            2. Click the Create Function button which should take you to a screen like the image below. Give it a name, set the server region to somewhere close by and change the authentication policy to Allow unauthenticated invocations so we can access it directly from a browser. Remember to note down the URL of the service somewhere.

            3. On the next page, for Runtime pick the Python 3.9 option. This will make the inline editor show both a main.py and a requirements.txt file. Look over them. Click the Deploy button in the lower left corner.

            4. Afterwards you should see a green check mark beside your function meaning that it is deployed. Click the Test function button which will take you to the testing page.

            5. If you know what the application does, it should come as no surprise that it can run without any input. We therefore just send an empty request by clicking the Test The Function button. Does the function return the output you expected? Wait for the logs to show up. What do they show?

              1. What should the Triggering event look like in the testing prompt for the program to respond with

                Good day to you sir!\n

                Try it out.

              2. Click on the metrics tab. Identify what each panel is showing.

              3. Go to the trigger tab and go to the url for the application.

              4. Check out the logs tab. You should see that your application has already been invoked multiple times. Also try to execute this command in a terminal:

                gcloud functions logs read\n
            6. Next, we are going to create an application that actually takes some input so we can try to send it requests. We provide a very simple sklearn_cloud_function.py script to get started.

              1. Figure out what the script does and run the script. This should create a file with a trained model.

              2. Next create a storage bucket and upload the model file to the bucket. You can either do this through the webpage or run the following commands:

                gsutil mb gs://<bucket-name>  # mb stands for make bucket\ngsutil cp <file-name> gs://<bucket-name>  # cp stands for copy\n

                check that the file is in the bucket.

              3. Create a new cloud function with the same initial settings as the first one. Also choose Python 3.9, but this time change the code to something that can actually use the model we just uploaded. Here is a code snippet to help you:

                from google.cloud import storage\nimport pickle\n\nBUCKET_NAME = ...\nMODEL_FILE = ...\n\nclient = storage.Client()\nbucket = client.get_bucket(BUCKET_NAME)\nblob = bucket.get_blob(MODEL_FILE)\nmy_model = pickle.loads(blob.download_as_string())\n\ndef knn_classifier(request):\n    \"\"\" will do stuff to your request \"\"\"\n    request_json = request.get_json()\n    if request_json and 'input_data' in request_json:\n        data = request_json['input_data']\n        input_data = list(map(int, data.split(',')))\n        prediction = my_model.predict([input_data])\n        return f'Belongs to class: {prediction}'\n    else:\n        return 'No input data received'\n

                Some notes: * For testing the above code locally you will need to install the google-cloud-storage python package * Remember to change the Entry point * Remember to also fill out the requirements.txt file. You need at least two packages to run the application, with google-cloud-storage being one of them. * If your deployment fails, try to go to the Logs Explorer page in gcp which can help you identify why.

              4. When you have successfully deployed the model, try to make predictions with it, for example as sketched below.
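
                A hedged example of calling the deployed function from Python, where <function-url> is the URL you noted down earlier and the comma-separated input assumes the format expected by the snippet above:

                import requests\n# replace <function-url> with the trigger URL of your deployed cloud function\nresponse = requests.post('<function-url>', json={'input_data': '1,2,3,4'})\nprint(response.text)\n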

            7. You can finally try to redo the exercises deploying a Pytorch application. You will essentially need to go through the same steps as the sklearn example, including uploading a trained model to a storage bucket and writing a cloud function that loads it and returns some output. You are free to choose whatever Pytorch model you want.

            "},{"location":"s7_deployment/cloud_deployment/#cloud-run","title":"Cloud Run","text":"

            Cloud functions are great for simple deployments that can be encapsulated in a single script with only simple requirements. However, they do not really scale to more advanced applications that may depend on multiple programming languages. We are already familiar with how to deal with this through containers, and Cloud Run is the corresponding service in GCP for deploying containers.

            "},{"location":"s7_deployment/cloud_deployment/#exercises_1","title":"\u2754 Exercises","text":"
            1. We are going to start locally by developing a small app that we can deploy. We provide two small examples to choose from: first a small FastAPI app consisting of this .py file and this dockerfile. Secondly, a small streamlit application consisting of just this dockerfile. You are free to choose which application to work with.

              1. Start by going over the files belonging to your chosen app and understand what it does.

              2. Next build the docker image belonging to the app

                docker build -f <dockerfile> . -t gcp_test_app:latest\n
              3. Next tag and push the image to your container registry

                docker tag gcp_test_app gcr.io/<project-id>/gcp_test_app\ndocker push gcr.io/<project-id>/gcp_test_app\n

                afterwards check your container registry to confirm that you have successfully pushed the image.

            2. Next go to Cloud Run in the cloud console and enable the service

            3. Click the Create Service button which should bring you to a page similar to the one below

              Do the following: * Click the select button, which will bring up all built containers, and pick the one you want to deploy. In the future you probably want to choose Continuously deploy new revision from a source repository such that a new version is always deployed when a new container is built. * Hereafter, give the service a name and select the region. We recommend choosing a region close to you, however it does not really matter that much for our use case. * Set the authentication method to Allow unauthenticated invocations such that we can call it without providing credentials. In the future you may want to allow only authenticated invocations. * Expand the Container, Connections, Security tab and edit the port such that it matches the port exposed in your chosen application.

              Finally, click the create button and wait for the service to be deployed (may take some time).

            4. If you manage to deploy the service you should see an image like this:

              You can now access your application by clicking the url. This will access the root of your application, so you may need to add / or /<path> to the url depending on how the app works.

            5. Everything we just did to deploy a container can be reproduced using the following command:

              gcloud run deploy $APP --image $TAG --platform managed --region $REGION --allow-unauthenticated\n

              and checked using these two commands

              gcloud run services list\ngcloud run services describe $APP --region $REGION\n

              feel free to experiment doing the deployment from the command line.

            6. Instead of deploying our docker container using the UI or command line, which is a manual operation, we can do it in a continuous manner by using the cloudbuild.yaml file we learned about in the previous section. We just need to add a new step to the file. We provide an example

              steps:\n# Build the container image\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['build', '-t', 'gcr.io/$PROJECT_ID/<container-name>:latest', '.'] #(1)!\n# Push the container image to Container Registry\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['push', 'gcr.io/$PROJECT_ID/<container-name>:latest']\n# Deploy container image to Cloud Run\n- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'\n  entrypoint: gcloud\n  args:\n  - 'run'\n  - 'deploy'\n  - '<service-name>'\n  - '--image'\n  - 'gcr.io/$PROJECT_ID/<container-name>:latest'\n  - '--region'\n  - '<region>'\n
              1. This line assumes you are standing in the root of your repository and tries to build the docker image specified in a file called Dockerfile and tag it with the name gcr.io/$PROJECT_ID/my_deployment:latest. Therefore, if you want to point to another dockerfile you need to add the -f option to the command. For example, if you want to point to my_app/my_serving_app.dockerfile you need to change the line to

                args: ['build', '-f', 'my_app/my_serving_app.dockerfile', '-t', 'gcr.io/$PROJECT_ID/my_deployment:latest', '.']\n

              where you need to replace <container-name> with the name of your container, <service-name> with the name of the service you want to deploy and <region> with the region you want to deploy to. Afterwards you need to set up a trigger (or reuse the one you already have) to build the container and deploy it to cloud run. Confirm that this works by making a change to your application, pushing it to github and seeing if the application is updated continuously. For help you can look here. If you succeeded, congratulations, you have now set up a continuous deployment pipeline.

            That ends the exercises on deployment. The exercises above are just a small taste of what deployment has to offer. In both sections we have explicitly chosen to work with serverless deployments. But what if you wanted to do the opposite, e.g. be the one in charge of managing the cluster that handles the deployed services? If you are really interested in taking deployment to the next level you should get started on kubernetes, which is the de-facto open-source container orchestration platform used in production environments. If you want to deep dive we recommend starting here, which describes how to make pipelines that are a necessary component before you start to create your own kubernetes cluster.

            "},{"location":"s7_deployment/local_deployment/","title":"M23 - Local Deployment","text":""},{"location":"s7_deployment/local_deployment/#local-deployment","title":"Local Deployment","text":"

            Regardless of your application, model and use case, the first step in serving your model should always be to deploy it locally. The simple reason for that is debugging: if you deploy directly to the cloud you often get less verbose error messages and/or the iteration time is much slower because it simply takes much longer to deploy to the cloud than locally. Local deployment should therefore always be the first step with any new application.

            For this module we are going to focus on deployment of deep learning models, in particular Pytorch models, which are used throughout the course. Pytorch has historically been developed for research purposes, where iterating on quick ideas was valued over fast computations. This is evident since Pytorch uses a dynamic graph underneath to represent the computations being created whenever you run calculations. The graph is important, as it keeps track of how to do backpropagation through your Pytorch application. However, running code dynamically is notoriously slower than compiling your code before running it. Let's therefore first consider another way of compiling our code.

            "},{"location":"s7_deployment/local_deployment/#compilation","title":"Compilation","text":"

            If you have ever coded in any low-level language such as c, fortran or c++ you should be familiar with the term compiling. Compiling is the task of taking a computer program written in one language and translating it into another. In most cases this means taking whatever you have written in your preferred programming language and translating it into machine code that the computer can execute. But what does compilation have to do with coding Pytorch models?

            It happens that Pytorch comes with its own compiler that can optimize your model for you. It can be found in the submodule torch.jit. Jit stands for just-in-time, meaning that compilation runs at the same time we are executing the code. If you know anything about low-level languages such as c/c++ you know that we normally compile the code before we run it. With jit we essentially merge the two phases into one. jit has two compilation modes, called respectively script and trace. In the exercises we are going to look at script as it is the easiest to get started with and works without any code changes for nearly all kinds of models. If you ever encounter that script does not work for you, then trace can be used, which is more general.

            The major reasons why we want to compile our models with torch.jit are:

            • Scripted code can be invoked in its own interpreter, which is basically a restricted Python interpreter. This interpreter does not acquire the Global Interpreter Lock (GIL), and so many requests can be processed on the same instance simultaneously.
            • The scripted format allows us to save the whole model to disk and load it into another environment, such as in a server written in a language other than Python
            • Scripted code gives us a representation in which we can do compiler optimizations on the code to provide more efficient execution
            • Scripted code allows us to interface with many backend/device runtimes that require a broader view of the program than individual operators.
            "},{"location":"s7_deployment/local_deployment/#exercises","title":"\u2754 Exercises","text":"

            We are here going to look at torch.jit.script for compiling our code.

            1. To see the difference in these exercises, we start out with a large model. Download one of the large image classification models from torchvision such as ResNet-152. For the purpose of the exercise it does not matter if you work with a randomly initialized model or a pretrained version.

            2. Next try to script the model using torch.jit.script. You can find the documentation here.

            3. Just to confirm that compiling our model using torch.jit.script did not change its output, try checking that the output of the scripted model corresponds to the output of the non-scripted model. You can do this on a single random datapoint, and you should check that the top-5 predicted classes are the same

              assert torch.allclose(unscripted_top5_indices, scripted_top5_indices)\n

              Hint: use torch.topk.
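
              As hinted, a minimal sketch of this check (assuming a torchvision ResNet-152 and a random input; random weights are fine for this comparison) could be:

              import torch\nimport torchvision\n\nmodel = torchvision.models.resnet152(weights=None)  # newer torchvision uses the weights argument\nmodel.eval()\nscripted_model = torch.jit.script(model)\n\nx = torch.randn(1, 3, 224, 224)\nwith torch.no_grad():\n    unscripted_top5_indices = torch.topk(model(x), k=5).indices\n    scripted_top5_indices = torch.topk(scripted_model(x), k=5).indices\nassert torch.allclose(unscripted_top5_indices, scripted_top5_indices)\n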

            4. Finally, try benchmarking the non-scripted model against the scripted model. I recommend using the built-in benchmarker in Pytorch: torch.utils.benchmark.Timer, which you can read more about here. Do you see an increase in performance of the scripted model compared to the non-scripted model? If so, what is the percentage increase in efficiency? A sketch of how the timer could be used is shown below.
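
              A hedged sketch of using the timer, building on the model, scripted_model and x names from the previous sketch:

              from torch.utils import benchmark\n\nt_unscripted = benchmark.Timer(stmt='model(x)', globals={'model': model, 'x': x})\nt_scripted = benchmark.Timer(stmt='model(x)', globals={'model': scripted_model, 'x': x})\nprint(t_unscripted.timeit(10))  # timing statistics for the non-scripted model\nprint(t_scripted.timeit(10))  # timing statistics for the scripted model\n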

            "},{"location":"s7_deployment/local_deployment/#torchserve","title":"Torchserve","text":"

            For locally deploying our model we are going to look at Torchserve. Torchserve (illustrated below) is a combined service for packaging and serving multiple Pytorch models at the same time.

            Image credit

            Before we go into the details of Torchserve, an important question is why we need such an abstraction on top of our developed model. Why can't we just do:

            python inference.py --my_model model_checkpoint.pt --new_datapoint img.png\n

            If we were never going to do anything other than call the model ourselves, then it is probably not worth adding anything else. However, if we ever want anyone else to interact with our model, we need to comply with standard ways of requesting and sending data. This is especially true when the next step is to start deploying our model in the cloud. Torchserve essentially brings an inference API on top of our model that turns our model into a client-server type of system: the client (user) is going to send requests to a server (our application) and the server will give a response. The request will be sent as a standard HTTP request, which Torchserve will help us decode into a useful input that we can then do inference on and return the result, again as a standardized HTTP response. Torchserve is in that regard similar to FastAPI or Flask if you have ever used one of those frameworks.

            Finally, the packaging part of Torchserve is necessary because we cannot give Torchserve a raw file of trained model weights, as these are essentially just a list of floats. We need a file that contains both the model definition and the trained weights, such that the model essentially becomes independent of the python interpreter.

            "},{"location":"s7_deployment/local_deployment/#exercises_1","title":"\u2754 Exercises","text":"

            Torchserve can be a bit rough around the edges but is fairly easy to work with. We are largely going to follow the instructions listed in the readme file for Torchserve. The intention in these exercises is to serve a Resnet type neural network that is trained for classification on ImageNet. Additional documentation can be found here.

            1. Install torchserve and its dependencies. There are separate instructions on the homepage depending on whether you are using Windows, WSL or Linux/Mac.

            2. Create a folder called model_store. This is where we will store the model that we are going to deploy

            3. Try to run the torchserve --model-store model_store command. If the service starts with no errors, you have installed it correctly and can continue the exercise. Else it is Googling time!

            4. Next, let's create a model we can serve. If you have done the previous exercises on compiling using scripting, we highly recommend initializing and saving such a model

              model = ResnetFromTorchVision(pretrained=True)\nscript_model = torch.jit.script(model)\nscript_model.save('deployable_model.pt')\n
            5. Call the model archiver. We have provided a file called index_to_name.json that maps from predicted class indices to interpretable class name e.g. 1->\"goldfish\". This file should be provided as the extra-files argument such that the deployed model automatically outputs the class name. Note that this files of course only works for models trained on imagenet.

              torch-model-archiver \\\n    --model-name my_fancy_model \\\n    --version 1.0 \\\n    --serialized-file path/to/serialized_model.pt \\\n    --export-path model_store \\\n    --extra-files index_to_name.json \\\n    --handler image_classifier\n
            6. Checkout the model_store folder. Has the model archiver correctly created a model (with .mar extension) inside the folder?

            7. Finally, we are going to deploy our model and use it:

              1. Start serving your model in one terminal:

                torchserve --start --ncs --model-store model_store --models my_fancy_model=my_fancy_model.mar\n
              2. Next, pick an image that you want to do inference on. It can be any image you want, but try to pick one that actually contains an object from the set of imagenet classes. I have also provided an image of my own cat in the my_cat.jpg file.

              3. Open another terminal, which we are going to use for inference. The easiest way to do inference is using curl directly in the terminal, but you are also free to experiment with the requests API directly in python (a Python sketch is shown after the curl example). Using curl should look something like this

                curl http://127.0.0.1:8080/predictions/my_fancy_model -T my_image.jpg\n
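
                If you prefer doing it from Python, a hedged equivalent using the requests package could look like this:

                import requests\n# send the image as multipart form data to the running Torchserve instance\nwith open('my_cat.jpg', 'rb') as f:\n    response = requests.post('http://127.0.0.1:8080/predictions/my_fancy_model', files={'data': f})\nprint(response.json())\n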
            8. Torchserve supports serving multiple models, not just one. Create a new vision model (either another resnet model or something similar), script it, save it, archive it in the same model store folder and then re-run torchserve like this

              torchserve --start --ncs --model-store model_store --models all\n

              Make sure that you can do inference with both models by calling curl.

            That ends the module on local deployment. Hopefully in this phase you have gained a bit of experience with sending HTTP requests, as this will be very important in the next module when we try to deploy the models in the cloud.

            "},{"location":"s8_monitoring/","title":"Monitoring","text":"

            Slides

            We have now reached the end of our machine learning pipeline. We have successfully developed, trained and deployed a machine learning model. However, the question then becomes whether you can trust that your newly deployed model still works as expected after 1 day without you intervening. What about 1 month? What about 1 year?

            There may be corner cases where an ML model keeps working as expected, but the vast majority of ML models will perform worse over time because they are not generalizing well enough. For example, assume you have just deployed an application that classifies images from phones, when suddenly a new phone comes out with a new kind of sensor that takes images with a very weird aspect ratio or some other property your model is not robust towards. There is nothing wrong with this, you can essentially just retrain your model on new data that accounts for this corner case, however you need a mechanism that informs you.

            This is where monitoring comes into play. Monitoring practices are in charge of collecting information about your application in some format that can then be analyzed and reacted on. Monitoring is essential to securing the longevity of your applications.

            As with many other sub-fields within MLOps we can divide monitoring into classic monitoring and ML specific monitoring. Classic monitoring (known from classic DevOps) is often about

            • Errors: Is my application working without problems?
            • Logs: What is actually going on?
            • Performance: How fast is my application?

            All these are basic pieces of information you are interested in regardless of what type of application you are trying to deploy. However, there is also ML-related monitoring that especially relates to data. Take the example above with the new phone: this is what we would in general consider a data drifting problem, i.e. the data you are trying to do inference on has drifted away from the distribution of data your model was trained on. Such monitoring problems are unique to machine learning applications and need to be handled separately.

            We are in this session going to see examples of both kinds of monitoring.

            Learning objectives

            The learning objectives of this session are:

            • Understand the concepts of data drifting in machine learning applications
            • Can detect data drifting using the evidently framework
            • Understand the importance of different system level monitoring and can conceptually implement it
            "},{"location":"s8_monitoring/data_drifting/","title":"M25 - Data Drifting","text":""},{"location":"s8_monitoring/data_drifting/#data-drifting","title":"Data drifting","text":"

            Data drifting is one of the core reasons why model accuracy degrades over time in production. For machine learning models, data drift is the change in model input data that leads to model performance degradation. In practical terms, this means that the model is receiving input that is outside of the scope that it was trained on, as seen in the figure below, which shows that the underlying distribution of a particular feature has slowly been increasing in value over two years.

            Image credit

            In some cases, it may be that normalizing some feature in a better way allows your model to generalize better, but this is not always the case. The reason for such a drift is commonly some external factor that you essentially have no control over. That really only leaves you with one option: retrain your model on the newly received input features and deploy that model to production. This process will probably repeat over the lifetime of your application if you want to keep it up-to-date with the real world.

            Image credit

            We have now come up with a solution to the data drift problem, but there is one important detail that we have not taken care of: when should we actually trigger the retraining? We do not want to wait around for our model performance to degrade, thus we need tools that can detect when we are seeing a drift in our data.

            "},{"location":"s8_monitoring/data_drifting/#exercises","title":"\u2754 Exercises","text":"

            For these exercises we are going to use the framework Evidently developed by EvidentlyAI. Evidently currently supports drift detection for both regression and classification models. The exercises are in large part taken from here and in general we recommend, if you are in doubt about an exercise, to look at the docs for the API and examples (their documentation can be a bit lacking sometimes, so you may also have to dive into the source code).

            Additionally, we want to stress that data drift detection, concept drift detection etc. is still an active field of research and therefore multiple frameworks exist for doing this kind of detection. In addition to Evidently, we can also mention NannyML, WhyLogs and deepcheck.

            1. Start by installing Evidently

              pip install evidently\n

              you will also need scikit-learn and pandas installed if you do not already have them.

            2. Hopefully you have already gone through session S7 on deployment. As part of the deployment to GCP functions you should have developed an application that can classify the iris dataset, based on a model trained by this script. We are going to convert this into a FastAPI application for our purpose here:

              1. Convert your GCP function into a FastAPI application. The appropriate curl command should look something like this:

                curl -X 'POST' \\\n    'http://127.0.0.1:8000/iris_v1/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0' \\\n    -H 'accept: application/json' \\\n    -d ''\n

                and the response body should look like this:

                {\n    \"prediction\": \"Iris-Setosa\",\n    \"prediction_int\": 0\n}\n

                We have implemented a solution in this file (called v1) if you need help.

              2. Next we are going to add some functionality to our application: the user input should be saved to a database whenever our application is called. However, to not slow down the response to our user we want to implement this as a background task. A background task is a function that is executed after the user has received their response. Implement a background task that saves the user input to a database implemented as a simple .csv file (a minimal sketch is shown after this item). You can read more about background tasks here. The header of the database should look something like this:

                time, sepal_length, sepal_width, petal_length, petal_width, prediction\n2022-12-28 17:24:34.045649, 1.0, 1.0, 1.0, 1.0, 1\n2022-12-28 17:24:44.026432, 2.0, 2.0, 2.0, 2.0, 1\n...\n

                thus both input, timestamp and predicted value should be saved. We have implemented a solution in this file (called v2) if you need help.
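                A minimal sketch of such a background task could look like this; the endpoint name matches the curl example above, while save_to_database and the hard-coded prediction are placeholders you should replace with your own logic:

                  from datetime import datetime

                  from fastapi import BackgroundTasks, FastAPI

                  app = FastAPI()

                  def save_to_database(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, prediction: int) -> None:
                      # append a single row to the csv file acting as our database
                      with open("prediction_database.csv", "a") as file:
                          row = f"{datetime.now()}, {sepal_length}, {sepal_width}, {petal_length}, {petal_width}, {prediction}"
                          file.write(row + "\n")

                  @app.post("/iris_v1/")
                  def iris_inference(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, background_tasks: BackgroundTasks):
                      prediction = 0  # replace with a call to your actual model
                      background_tasks.add_task(save_to_database, sepal_length, sepal_width, petal_length, petal_width, prediction)
                      return {"prediction": "Iris-Setosa", "prediction_int": prediction}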

              3. Call your API a number of times to generate some dummy data in the database.

            3. Create a new data_drift.py file where we are going to implement the data drifting detection and reporting. Start by adding both the real iris data and your generated dummy data as pandas dataframes.

              import pandas as pd\nfrom sklearn import datasets\nreference_data = datasets.load_iris(as_frame=True).frame\ncurrent_data = pd.read_csv('prediction_database.csv')\n

              if done correctly you will most likely end up with two dataframes that look like

              # reference_data\nsepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target\n0                  5.1               3.5                1.4               0.2       0\n1                  4.9               3.0                1.4               0.2       0\n...\n148                6.2               3.4                5.4               2.3       2\n149                5.9               3.0                5.1               1.8       2\n[150 rows x 5 columns]\n\n# current_data\ntime                         sepal_length   sepal_width   petal_length   petal_width   prediction\n2022-12-28 17:24:34.045649   1.0            1.0            1.0           1.0           1\n...\n2022-12-28 17:24:34.045649   1.0            1.0            1.0           1.0           1\n[10 rows x 5 columns]\n

              Standardize the dataframes such that they have the same column names and drop the time column from the current_data dataframe (a minimal sketch is shown below).
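              A minimal sketch of this standardization, assuming the column names shown above, could look like this (depending on how you wrote the csv you may also need to strip whitespace from the column names):

                reference_data = reference_data.rename(
                    columns={
                        "sepal length (cm)": "sepal_length",
                        "sepal width (cm)": "sepal_width",
                        "petal length (cm)": "petal_length",
                        "petal width (cm)": "petal_width",
                    }
                )
                current_data.columns = current_data.columns.str.strip()  # remove stray spaces from the csv header
                current_data = current_data.rename(columns={"prediction": "target"}).drop(columns=["time"])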

            4. We are now ready to generate some reports about data drifting:

              1. Try executing the following code:

                from evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference_data, current_data=current_data)\nreport.save_html('report.html')\n

                and open the generated .html page. What does it say about your data? Has it drifted? Make sure to poke around to understand what the different plots are actually showing.

              2. Data drifting is not the only kind of reporting Evidently can make. We can also get reports on the data quality. Try first adding a few NaN values to your reference data. Secondly, try changing the report to

                from evidently.metric_preset import DataDriftPreset, DataQualityPreset\nreport = Report(metrics=[DataDriftPreset(), DataQualityPreset()])\n

                and re-run the report. Check out the newly generated report. Again go over the generated plots and make sure that it picked up on the missing values you just added.

              3. The final report preset we will look at is the TargetDriftPreset. Target drift means that our model is over/under predicting certain classes, or in general terms that the distribution of predicted values differs from the ground truth distribution of targets. Try adding the TargetDriftPreset to the Report class (see the sketch below), re-run the analysis and inspect the result. Have your targets drifted?
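                A sketch of the combined report could look as follows, assuming TargetDriftPreset is imported from evidently.metric_preset like the other presets (Evidently by default looks for columns named target and prediction, so make sure your dataframes contain them, or pass an explicit column mapping):

                  from evidently.metric_preset import DataDriftPreset, DataQualityPreset, TargetDriftPreset
                  from evidently.report import Report

                  report = Report(metrics=[DataDriftPreset(), DataQualityPreset(), TargetDriftPreset()])
                  report.run(reference_data=reference_data, current_data=current_data)
                  report.save_html("report.html")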

            5. Evidently reports are meant for debugging, exploration and reporting of results. However, as we stated in the beginning, what we are actually interested in are methods for automatically detecting when we are beginning to drift. For this we will need to look at Test and TestSuites:

              1. Lets start with a simple test that checks if there are any missing values in our dataset:

                from evidently.test_suite import TestSuite\nfrom evidently.tests import TestNumberOfMissingValues\ndata_test = TestSuite(tests=[TestNumberOfMissingValues()])\ndata_test.run(reference_data=reference_data, current_data=current_data)\n

                again we could run data_test.save_html to get a nice view of the results (feel free to try it out), but additionally we can also call the data_test.as_dict() method, which will give a dict with the test results. What dictionary key contains whether all tests have passed or not?

              2. Take a look at this colab notebook that contains all tests implemented in Evidently. Pick 5 tests of your choice, where at least 1 fails by default, and implement them as a TestSuite (a small sketch is shown below). Then try changing the arguments of the tests so they better fit your use case and get them all passing.
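                A minimal sketch of such a TestSuite could look like this; the exact test names are listed in the notebook and may differ between Evidently versions, so the ones below are only examples:

                  from evidently.test_suite import TestSuite
                  from evidently.tests import (  # test names may vary between Evidently versions
                      TestNumberOfColumns,
                      TestNumberOfMissingValues,
                      TestNumberOfRows,
                  )

                  suite = TestSuite(tests=[TestNumberOfRows(), TestNumberOfColumns(), TestNumberOfMissingValues()])
                  suite.run(reference_data=reference_data, current_data=current_data)
                  print(suite.as_dict())  # inspect which tests passed and which failed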

            6. (Optional) When doing monitoring in practice, we are not always interested in running on all data collected from our API, maybe only the last N entries or maybe only data from the last hour of observations. Since we are already logging the timestamps of when our API is called, we can use them for filtering. Implement a simple filter that either takes an integer n and returns the last n entries in our database, or some datetime t that filters away observations earlier than this (a minimal sketch is shown below).
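              A minimal sketch of such a filter could look like this; the function name and csv filename are just assumptions matching the earlier exercises:

                from datetime import datetime
                from typing import Optional

                import pandas as pd

                def filter_database(n: Optional[int] = None, t: Optional[datetime] = None) -> pd.DataFrame:
                    """Return the newest part of the prediction database, by row count and/or timestamp."""
                    df = pd.read_csv("prediction_database.csv")
                    df.columns = df.columns.str.strip()
                    df["time"] = pd.to_datetime(df["time"])
                    if t is not None:
                        df = df[df["time"] > t]  # keep only observations newer than t
                    if n is not None:
                        df = df.tail(n)  # keep only the last n entries
                    return df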

            7. Evidently by default only supports structured data, i.e. tabular data (so does nearly every other framework). The question then becomes how we can extend it to unstructured data such as images or text? The solution is to extract structured features from the data which we then can run the analysis on.

              1. (Optional) For images the simple solution would be to flatten the images and consider each pixel a feature, however this does not work in practice because changes in the individual pixels do not really tell anything about the image. Instead we should derive some features such as:

                • Average brightness
                • Contrast of image
                • Image sharpness
                • ...

                These are all numbers that can make up a feature vector for an image. Try doing this yourself, for example by extracting such features from the MNIST and FashionMNIST datasets (a sketch of the feature extraction is shown below), and check if you can detect a drift between the two sets.
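                A minimal sketch of such a feature extraction could look like this; the feature definitions (in particular using the mean gradient magnitude as a sharpness proxy) are just one possible choice:

                  import numpy as np
                  import pandas as pd
                  from torchvision import datasets

                  def image_features(img: np.ndarray) -> dict:
                      """Simple hand-crafted features for a single grayscale image."""
                      gy, gx = np.gradient(img.astype(float))
                      return {
                          "brightness": img.mean(),
                          "contrast": img.std(),
                          "sharpness": np.sqrt(gx**2 + gy**2).mean(),  # mean gradient magnitude as a sharpness proxy
                      }

                  mnist = datasets.MNIST("data", train=True, download=True)
                  fashion = datasets.FashionMNIST("data", train=True, download=True)

                  # a subset is enough to get a feel for the drift analysis
                  reference_data = pd.DataFrame([image_features(np.array(mnist[i][0])) for i in range(1000)])
                  current_data = pd.DataFrame([image_features(np.array(fashion[i][0])) for i in range(1000)])
                  # these two dataframes can now be fed to an Evidently report or test suite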

              2. (Optional) For text a common approach is to extract some higher-level embedding such as the very classical GLOVE embedding. Try following this tutorial to understand how drift detection is done on text.

              3. Let's instead take a deep learning based approach to doing this. Let's consider the CLIP model, which is normally used to do image captioning. For our purpose this is perfect, because we can use the model to get abstract feature embeddings for both images and text:

                from PIL import Image\nimport requests\n# requires transformers package: pip install transformers\nfrom transformers import CLIPProcessor, CLIPModel\n\nmodel = CLIPModel.from_pretrained(\"openai/clip-vit-base-patch32\")\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n\nurl = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# set either text=None or images=None when only the other is needed\ninputs = processor(text=[\"a photo of a cat\", \"a photo of a dog\"], images=image, return_tensors=\"pt\", padding=True)\n\nimg_features = model.get_image_features(inputs['pixel_values'])\ntext_features = model.get_text_features(inputs['input_ids'], inputs['attention_mask'])\n

                Both img_features and text_features are in this case a (512,) abstract feature embedding that should be able to tell us something about our data distribution. Try using this method to extract features on two different datasets like CIFAR10 and SVHN if you want to work with vision, or the IMDB movie review and Amazon review datasets for text. After extracting the features, try running some of the data distribution testing you just learned about.

            8. (Optional) If we have multiple applications and want to run monitoring for each application, we often want the monitoring itself to be a deployed application (that only we can access). Implement a /monitoring/ endpoint that does all the reporting we just went through such that you have two endpoints:

              http://127.0.0.1:8000/iris_infer/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0 # user endpoint\nhttp://127.0.0.1:8000/iris_monitoring/ # monitoring endpoint\n

              Our monitoring endpoint should return an HTML page, either showing an Evidently report or a test suite (a minimal sketch is shown below). Try implementing this endpoint. We have implemented a solution in this file if you need help with how to return an HTML page from a FastAPI application.
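              A minimal sketch of such an endpoint could look like this; it simply reuses the data loading from data_drift.py, writes the Evidently report to disk and returns the resulting HTML (remember to also standardize the column names as in the earlier exercise before running the report):

                import pandas as pd
                from evidently.metric_preset import DataDriftPreset
                from evidently.report import Report
                from fastapi import FastAPI
                from fastapi.responses import HTMLResponse
                from sklearn import datasets

                app = FastAPI()

                @app.get("/iris_monitoring/", response_class=HTMLResponse)
                def iris_monitoring() -> HTMLResponse:
                    # same data loading as in data_drift.py
                    reference_data = datasets.load_iris(as_frame=True).frame
                    current_data = pd.read_csv("prediction_database.csv")
                    report = Report(metrics=[DataDriftPreset()])
                    report.run(reference_data=reference_data, current_data=current_data)
                    report.save_html("monitoring.html")  # Evidently writes the report to disk as html
                    with open("monitoring.html", encoding="utf-8") as f:
                        return HTMLResponse(content=f.read())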

            9. As a final exercise, we recommend that you try implementing this to run directly in the cloud. You will need to implement it in a container, e.g. a Cloud Run service, because the data gathering from the endpoint should still be implemented as a background task. For this to work you will need to change the following:

            10. Instead of saving the input to a local file you should either store it in a GCP bucket or a BigQuery SQL table (the latter is a better solution, but also out-of-scope for this course)

            11. You can either run the data analysis locally by just pulling the predictions and training data from cloud storage, or alternatively you can deploy the analysis as its own endpoint that can be invoked. For the latter option we recommend that it should require authentication.

            That ends the module on detection of data drifting, data quality etc. If this has not already been made clear, monitoring of machine learning applications is an extremely hard discipline, because it is not clear cut when we should actually respond to a feature beginning to drift and when it is probably fine. It comes down to the individual application what kind of rules should be implemented. Additionally, the tools presented here are also in no way complete and are especially limited in one way: they only consider the marginal distribution of data. Every analysis that we have done has been on the distribution per feature (the marginal distribution), however as the image below shows it is possible for data to have drifted to another distribution with the marginals being approximately the same.

            There are methods such as Maximum Mean Discrepancy (MMD) tests that are able to do testing on multivariate distributions, which you are free to dive into. In this course we will just always recommend considering multiple features when making decisions regarding your deployed applications.

            "},{"location":"s8_monitoring/monitoring/","title":"M26 - System Monitoring","text":""},{"location":"s8_monitoring/monitoring/#monitoring","title":"Monitoring","text":"

            In this module we are going to look into more classical monitoring of applications. The key concept we are often working with here is called telemetry. Telemetry in general refers to any automatic measurement and (wireless) transmission of data from our application. It could be numbers such as:

            • The number of requests our application is receiving per minute/hour/day. This number is of interest because it is directly proportional to the running cost of the application.
            • The amount of time (on average) our application runs per request. This number is of interest because it most likely is the core contributor to the latency that our users are experiencing (which we want to be low).
            • ...

            In general there are three different kinds of telemetry we are interested in:

            • Metrics — Description: quantitative measurements of the system; they are usually numbers that are aggregated over a period of time, e.g. the number of requests per minute. Example: the number of requests per minute. Purpose: metrics are used to get an overview of the system and are often used to create dashboards.
            • Logs — Description: textual or structured records generated by applications; they provide a detailed account of events, errors, warnings, and informational messages that occur during the operation of the system. Example: system logs, error logs. Purpose: logs are essential for diagnosing issues, debugging, and auditing; they provide a detailed history of what happened in a system, making it easier to trace the root cause of problems and track the behavior of components over time.
            • Traces — Description: detailed records of specific transactions or events as they move through a system; a trace typically includes information about the sequence of operations, timing, and dependencies between different components. Example: distributed tracing in a microservices architecture. Purpose: traces help in understanding the flow of a request or a transaction across different components; they are valuable for identifying bottlenecks, understanding latency, and troubleshooting issues related to the flow of data or control.

            We are mainly going to focus in this module on metrics.

            "},{"location":"s8_monitoring/monitoring/#local-instrumentator","title":"Local instrumentator","text":"

            Before we look into the cloud, let's at least conceptually understand how a given instance of an app can expose values that we may be interested in monitoring.

            The standard framework for exposing metrics is called Prometheus. Prometheus is a time series database that is designed to store metrics. It is also designed to be very easy to instrument applications with and to scale to large amounts of data. The way Prometheus works is that the application exposes a /metrics endpoint that can be queried to get the current state of the metrics. The metrics are exposed in a format called the Prometheus text format.

            "},{"location":"s8_monitoring/monitoring/#exercises","title":"\u2754 Exercises","text":"
            1. Start by installing prometheus-fastapi-instrumentator in python

              pip install prometheus-fastapi-instrumentator\n

              this will allow us to easily instrument our FastAPI application with prometheus.

            2. Create a simple FastAPI application in a file called app.py. You can reuse any application from the previous module on APIs. To that file now add the following code:

              from prometheus_fastapi_instrumentator import Instrumentator\n\n# your app code here\n\nInstrumentator().instrument(app).expose(app)\n

              This will instrument your application with prometheus and expose the metrics on the /metrics endpoint.

            3. Run the app using the uvicorn server. Make sure that the app exposes the endpoints you expect it to expose, but make sure you also check out the /metrics endpoint.

            4. The /metrics endpoint exposes multiple metrics. Metrics always look like this:

              # TYPE key <type>\nkey value\n

              e.g. it is essentially a dictionary of key-value pairs with the added functionality of a <type>. Look at this page over the different types Prometheus metrics can have and try to understand the different metrics being exposed.

            5. Look at the documentation for the prometheus-fastapi-instrumentator and try to add at least one more metric to your application (a sketch of adding a custom metric is shown below). Rerun the application and confirm that the new metric is being exposed.
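              The instrumentator documents its own way of adding metrics, but a simple alternative sketch, under the assumption that the /metrics endpoint exposes the default prometheus_client registry, is to define a plain prometheus_client counter and increment it in your endpoint (the metric and endpoint names below are made up):

                from fastapi import FastAPI
                from prometheus_client import Counter
                from prometheus_fastapi_instrumentator import Instrumentator

                app = FastAPI()

                # custom counter registered in the default prometheus_client registry
                prediction_counter = Counter("my_predictions_total", "Number of predictions served")

                @app.get("/predict")
                def predict():
                    prediction_counter.inc()  # increment every time the endpoint is called
                    return {"prediction": 0}

                Instrumentator().instrument(app).expose(app)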

            "},{"location":"s8_monitoring/monitoring/#cloud-monitoring","title":"Cloud monitoring","text":"

            Any cloud system with respect for itself will have some kind of monitoring system. GCP has a service called Monitoring that is designed to monitor all the different services. By default it will monitor a lot of metrics out of the box. However, the question is if we want to monitor more than the default metrics. The complexity that comes with doing monitoring in the cloud is that we need more than one container. We at least need one container actually running the application that also exposes the /metrics endpoint, and then we need another container that collects the metrics from the first container and stores them in a database. To implement such a system of containers that need to talk to each other, we in general need a container orchestration system such as Kubernetes. This is out of scope for this course, but we can use a feature of Cloud Run called sidecar containers to achieve the same effect. A sidecar container is a container that runs alongside the main container and can be used to do things such as collecting metrics.

            "},{"location":"s8_monitoring/monitoring/#exercises_1","title":"\u2754 Exercises","text":"
            1. Overall we recommend that you just become familiar with the monitoring tab for your Cloud Run service (see the image above). Try to invoke your service a couple of times and see what happens to the metrics over time.

              1. (Optional) If you really want to load test your application we recommend checking out the tool locust. Locust is a Python based load testing tool that can be used to simulate many users accessing your application at the same time.
            2. Try creating a service level objective (SLO). In short, an SLO is a target for how well your application should be performing. Click the Create SLO button and fill it out with what you consider to be a good SLO for your application.

            3. (Optional) To expose our own metrics we need to set up a sidecar container. To do this, follow the instructions here. We have set up a simple example that uses FastAPI and Prometheus that you can find here. After you have correctly set up the sidecar container you should be able to see the metrics in the monitoring tab.

            "},{"location":"s8_monitoring/monitoring/#alert-systems","title":"Alert systems","text":"

            A core problem within monitoring is alert systems. The alert system is in charge of sending out alerts to relevant people when some telemetry or metric we are tracking is not behaving as it should. How many alerts to send out, and when, is a subjective choice and should in general be proportional to how important the metric/telemetry is. We commonly run into what is referred to as the goldilocks problem, where we want just the right amount of alerts, however it is more often the case that we either have

            • Too many alerts, such that they become irrelevant and the really important ones are overlooked, often referred to as alert fatigue
            • Or alternatively, too few alerts, such that problems that should have triggered an alert are not dealt with when they happen, which can have unforeseen consequences.

            Therefore, setting up proper alert systems can be as challenging as setting up the systems that actually collect the metrics we want to trigger alerts on.

            "},{"location":"s8_monitoring/monitoring/#exercises_2","title":"\u2754 Exercises","text":"

            In this exercise we are going to look at how we can set up automatic alerting such that we get a message every time one of our applications is not behaving as expected.

            1. Go to the Monitoring service. Then go to Alerting tab.

            2. Start by setting up a notification channel. We recommend setting it up with an email.

            3. Next let's create a policy. Clicking the Add Condition button should bring up a window as below. You are free to set up the condition as you want, but the image shows one way to set up an alert that will react to the number of times a cloud function is invoked (actually it measures the amount of log entries from cloud functions).

            4. After adding the condition, add the notification channel you created in one of the earlier steps. Remember to also add some documentation that should be sent with the alert to better describe what the alert is actually doing.

            5. When the alert is set up you need to trigger it. If you set up the condition as in the image above, you just need to invoke the cloud function many times. Here is a small code snippet that you can execute on your laptop to call a cloud function many times (you need to change the url and payload depending on your function):

              import time\nimport requests\nurl = 'https://us-central1-dtumlops-335110.cloudfunctions.net/function-2'\npayload = {'message': 'Hello, General Kenobi'}\n\nfor _ in range(1000):\n    r = requests.get(url, params=payload)\n
            6. Make sure that you get the alert through the notification channel you set up.

            "},{"location":"s9_scalable_applications/","title":"Scaling applications","text":"

            Slides

            This module is all about scaling the applications that we are building. We are here going to use a very narrow definition of scaling, namely that we want our applications to run faster, however one should note that in general scaling is a much broader term. There are many different ways to scale your applications, and we are going to look at three of them, related to different tasks in machine learning algorithms:

            • Scaling data loading
            • Scaling training
            • Scaling inference

            We are going to approach the term scaling from two different angles that both should result in your application running faster. The first approach is leveraging multiple devices, such as using multiple CPU cores or parallelizing training across multiple GPUs. The second approach is more analytical, where we are actually going to look at how we can design smaller/faster model architectures.

            It should be noted that this module is specific to working with Pytorch applications. In particular we are going to see how we can both improve base Pytorch code and how to utilize Pytorch Lightning, which we introduced in module M14 on boilerplate, to improve the scaling of our applications. If your application is written using another framework, we can guarantee that the same techniques in these modules transfer to that framework, but it may require you to seek out how to apply them specifically.

            If you manage to complete all modules in this session, feel free to checkout the extra module on scalable hyperparameter optimization.

            Learning objectives

            The learning objectives of this session are:

            • Understand how data loading during training can be parallelized and have experimented with it
            • Understand the different paradigms for distributed training and can run multi-gpu experiments using the framework pytorch-lightning
            • Knowledge of different ways, including quantization, pruning, architecture tuning etc. to improve inference speed
            "},{"location":"s9_scalable_applications/data_loading/","title":"M27 - Distributed Data Loading","text":""},{"location":"s9_scalable_applications/data_loading/#distributed-data-loading","title":"Distributed Data Loading","text":"

            Core Module

            One way that deep learning fundamentally changed the way we think about data in machine learning is that more data is always better. This was very much not the case with more traditional machine learning algorithms (random forest, support vector machines etc.) where a plateau in performance was often reached after a certain amount of data and did not improve if more was added. However, as deep learning models have become deeper and deeper and thereby more and more data hungry, performance seems to be ever increasing, or at least not reaching a plateau in the same way as for traditional machine learning.

            Image credit

            As we are trying to feed more and more data into our models, an obvious first question to ask is how to do this in an efficient way. As a general rule of thumb, we want the performance bottleneck to be the forward/backward pass, i.e. the actual computation in our neural network, and not the data loading. By bottleneck we here refer to the part of our pipeline that is restricting how fast we can process data. If data loading is our bottleneck, then our compute device can sit idle while waiting for data to arrive, which is both inefficient and costly. For example, if you are using a cloud provider for training deep learning models, you are paying by the hour per device, and thus not using them fully can be costly in the long run.

            In the first set of exercises we are therefore going to focus on distributed data loading, i.e. how to load data in parallel to make sure that we always have data ready for our compute devices. We are in the following going to look at what is going on behind the scenes when we use Pytorch to parallelize data loading.

            "},{"location":"s9_scalable_applications/data_loading/#a-closer-look-on-data-loading","title":"A closer look on Data loading","text":"

            Before we talk about distributed applications, it is important to understand the physical layout of a standard CPU (the brain of your computer).

            Most modern CPUs are a single chip that consists of multiple cores. Each core can further be divided into threads. In most laptops the core count is 4, commonly with 2 threads per core. This means that the common laptop has 8 threads. The number of threads a compute unit has is important, because that directly corresponds to the number of parallel operations that can be executed, i.e. one per thread. In a Python terminal you should be able to get the number of logical processors (threads) in your machine by writing (try it):

            import multiprocessing\nthreads = multiprocessing.cpu_count()  # number of logical processors, i.e. threads\nprint(f\"Number of threads: {threads}\")\n

            A distributed application is in general any kind of application that parallelizes some or all of its workload. In these exercises we are only focusing on distributed data loading, which happens primarily on the CPU. In Pytorch it is easy to parallelize data loading if you are using their dataset/dataloader interface:

            from torch.utils.data import Dataset, DataLoader\nclass MyDataset(Dataset):\n    def __init__(self, ...):\n        # whatever logic is needed to init the data set\n        self.data = ...\n\n    def __getitem__(self, idx):\n        # return one item\n        return self.data[idx]\n\ndataset = MyDataset()\ndataloader = DataLoader(\n    dataset,\n    batch_size=8,\n    num_workers=4  # this is the number of threads we want to parallelize workload over\n)\n

            Let's take a deep dive into what happens when we request a batch from our dataloader, e.g. next(iter(dataloader)). First we must understand that we have a thread that plays the role of the main thread, and the remaining threads (in the above example we request 4) are called workers. When the dataloader is created, we create this structure and make sure that all threads have a copy of our dataset definition so each can call the __getitem__ method.

            Then comes the actual part where we request a batch of data. Assume that we have a batch size of 8 and we do not do any shuffling. In this step the master thread distributes the list of requested data points ([0,1,2,3,4,5,6,7]) to the four worker threads. With 8 indices and 4 workers, each worker will receive 2 indices.

            Each worker thread then calls the __getitem__ method for all the indices it has received. When all workers are done, the loaded datapoints get sent back to the master thread and collected into a single structure/tensor.

            Each arrow corresponds to a communication between two threads, which is not a free operation. In total, to get a single batch (not counting the initial startup cost) in this example, we need to do 8 communication operations. This may seem like a small price to pay, but that may not be the case. If the processing time of __getitem__ is very low (data is stored in memory, we just need to index into it), then it does not make sense to use multiprocessing: the computational saving from doing the look-up operations in parallel is smaller than the communication cost between the main thread and the workers. Multiprocessing makes sense when the processing time of __getitem__ is high (data is probably stored on the hard drive).

            It is this trade-off that we are going to investigate in the exercises.

            "},{"location":"s9_scalable_applications/data_loading/#exercises","title":"\u2754 Exercises","text":"

            This exercise is intended to be done on the labeled faces in the wild (LFW) dataset. The dataset consists of images of famous people extracted from the internet. The dataset has been used to drive the field of facial verification, which you can read more about here. We are going to imagine that this dataset cannot fit in memory, and your job is therefore to construct a data pipeline that can be parallelized, based on loading the raw datafiles (.jpg) at runtime.

            1. Download the dataset and extract to a folder. It does not matter if you choose the non-aligned or aligned version of the dataset.

            2. We provide the lfw_dataset.py file where we have started the process of defining a data class. Fill out the __init__, __len__ and __getitem__ methods. Note that __getitem__ is expected to return a single img which should be a torch.Tensor. Loading should be done using PIL Image, as PIL images are the default input format for torchvision transforms (for data augmentation).

            3. Make sure that the script runs without any additional arguments

              python lfw_dataset.py\n
            4. Visualize a single batch by filling out the codeblock after the first TODO right after defining the dataloader. The visualization should be shown when launching the script as

              python lfw_dataset.py -visualize_batch\n

              Hint: this tutorial.

            5. Experiment with how the number of workers influences the performance. We have already provided code that will pass over 100 batches from the dataset 5 times and calculate how long it took, which you can play around with by calling

              python lfw_dataset.py -get_timing -num_workers 1\n

              Make an errorbar plot with the number of workers along the x-axis and the timing along the y-axis (a plotting sketch is shown after this item). The errorbars should correspond to the standard deviation over the 5 runs. HINT: if it is taking too long to evaluate, measure the time over fewer batches (set the -batches_to_check flag). Also, if you are not seeing an improvement, try increasing the batch size (since data loading is parallelized per batch).

              For certain machines, like Macs with the M1 chipset, it is necessary to set the multiprocessing_context flag in the dataloader to \"fork\". This essentially tells the dataloader how the worker processes should be created.
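              A minimal plotting sketch could look like this; the numbers below are made-up placeholders that you should replace with your own measurements:

                import matplotlib.pyplot as plt
                import numpy as np

                # hypothetical results: mean and std of the measured time (in seconds) per worker count
                workers = [0, 1, 2, 4, 8]
                mean_time = np.array([12.1, 10.5, 7.3, 5.2, 5.0])
                std_time = np.array([0.4, 0.3, 0.2, 0.2, 0.3])

                plt.errorbar(workers, mean_time, yerr=std_time, marker="o", capsize=3)
                plt.xlabel("Number of workers")
                plt.ylabel("Time to load 100 batches [s]")
                plt.savefig("worker_timing.png")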

            6. Retry the experiment where you change the data augmentation to be more complex:

              lfw_trans = transforms.Compose([\n    transforms.RandomAffine(5, (0.1, 0.1), (0.5, 2.0)),\n    # add more transforms here\n    transforms.ToTensor()\n])\n

              By making the augmentation more computationally demanding, it should be easier to get a boost in performance when using multiple workers, because the data augmentation is also executed in parallel.

            7. (Optional, requires access to a GPU) If your dataset fits in GPU memory, it is beneficial to set the pin_memory flag to True in the dataloader. By setting this flag we are essentially telling Pytorch that it can lock the data in place in memory, which will make the transfer between the host (CPU) and the device (GPU) faster.

            This ends the module on distributed data loading in Pytorch. If you want to go into more detail, we highly recommend that you read this paper, which goes into great detail analyzing how data loading in Pytorch works and provides performance benchmarks.

            "},{"location":"s9_scalable_applications/distributed_training/","title":"M28 - Distributed Training","text":""},{"location":"s9_scalable_applications/distributed_training/#distributed-training","title":"Distributed Training","text":"

            In this module we are going to look at distributed training. Distributed training is one of the key ingredients in all the awesome results that deep learning models are producing. For example: Alphafold, the highly praised model from Deepmind that seems to have solved protein structure prediction, was trained in a distributed fashion for a few weeks. The training was done on 16 TPUv3s (specialized hardware), which is approximately equal to 100-200 modern GPUs. This means that training Alphafold without distributed training on a single GPU (probably not even possible) would take a couple of years! It is therefore currently simply impossible to train some of the state-of-the-art (SOTA) models within deep learning without taking advantage of distributed training.

            When we talk about distributed training, there are a number of different paradigms that we may use to parallelize our computations

            • Data parallel (DP) training
            • Distributed data parallel (DDP) training
            • Sharded training

            In this module we are going to look at data parallel training, which is the original way of doing parallel training, and distributed data parallel training, which is an improved version of data parallel. If you want to know more about sharded training, which is the newest of the paradigms, you can read more about it in this blog post, which describes how sharding can save over 60% of the memory used during your training.

            Finally, we want to note that for all the exercises in this module you are going to need a multi-GPU setup. If you have not already gained access to multi-GPU machines on GCP (see the quotas exercises in this module) you will need to find another way of running the exercises. For DTU students, I can recommend checking out this optional module on using the high performance cluster (HPC) where you can get access to multi-GPU resources.

            "},{"location":"s9_scalable_applications/distributed_training/#data-parallel","title":"Data parallel","text":"

            While data parallel is in general seen as obsolete today compared to distributed data parallel, we are still going to investigate it a bit, since it offers the simplest form of distributed computation in a deep learning pipeline.

            The figure below shows both the forward and backward step in the data parallel paradigm.

            The steps are the following:

            • Whenever we do a forward call, e.g. out=model(batch), we take the batch and divide it equally between all devices. If we have a batch size of N and M devices, each device will be sent N/M datapoints.

            • Afterwards each device receives a copy of the model e.g. a copy of the weights that currently parametrizes our neural network.

            • In this step we perform the actual forward pass in parallel. This is the actual step that can help us scale our training.

            • Finally we need to send back the output of each replicated model to the primary device.

            Similar to the analysis we did of parallel data loading, we cannot always expect that this will actually take less time than doing the forward call on a single GPU. If we are parallelizing over M devices, we essentially need to do 3xM communication calls to send batch, model and output between the devices. If the parallel forward call does not outweigh this, then the parallel version will take longer.

            In addition, we also have the backward path to focus on:

            • As the forward pass ended with the output collected on the primary device, this is also where the loss is accumulated. Thus, loss gradients are first calculated on the primary device.

            • Next we scatter the gradient to all the workers

            • The workers then perform a parallel backward pass through their individual model

            • Finally, we reduce (sum) the gradients from all the workers on the main process such that we can do gradient descent.

            One of the big downsides of using data parallel is that all the replicas are destroyed after each backward call. This means that we over and over again need to replicate our model and send it to the devices that are part of the computation.

            Even though it seems like a lot of logic is needed to implement data parallel in your code, in Pytorch we can enable data parallel training very simply by wrapping our model in the nn.DataParallel class.

            from torch import nn\nmodel = MyModelClass()\nmodel = nn.DataParallel(model, device_ids=[0, 1])  # data parallel on gpu 0 and 1\npreds = model(input)  # same as usual\n
            "},{"location":"s9_scalable_applications/distributed_training/#exercises","title":"\u2754 Exercises","text":"

            Please note that the exercise only makes sense if you have access to multiple GPUs.

            1. Create a new script (call it data_parallel.py) where you take a copy of the model FashionCNN from the fashion_mnist.py script. Instantiate the model and wrap torch.nn.DataParallel around it such that it can be executed in data parallel.

            2. Try to run inference in parallel on multiple devices (pass a batch through the model multiple times and time it), e.g.

              import time\nstart = time.time()\nfor _ in range(n_reps):\n    out = model(batch)\nend = time.time()\n

              Does data parallel decrease the inference time? If no, can you explain why that may be? Try playing around with the batch size, and see if data parallel is more beneficial for larger batch sizes.

            "},{"location":"s9_scalable_applications/distributed_training/#distributed-data-parallel","title":"Distributed data parallel","text":"

            It should be clear that there is a huge disadvantage of using the data parallel paradigm to scale your applications: the model needs to be replicated on each pass (because it is destroyed at the end), which requires a large transfer of data. This is the main problem that distributed data parallel tries to solve.

            The key difference between distributed data parallel and data parallel is that we move the model update (the gradient step) to happen on each device in parallel instead of only on the main device. This has the consequence that we do not need to replicate the model on each step, instead we just keep a local version on each device that we keep updating. The full set of steps (as shown in the figure):

            • Initialize an exact copy of the model on each device

            • From disk (or memory) we start by loading data into a section of page-locked host memory per device. Page-locked memory is essentially a way to reserve a piece of the computer's memory for a specific transfer that is going to happen over and over again, to speed it up. The page-locked regions are loaded with non-overlapping data.

            • Transfer data from page-locked memory to each device in parallel

            • Perform forward pass in parallel

            • Do an all-reduce operation on the gradients. An all-reduce operation is a so-called all-to-all operation, meaning that all processes send their own gradients to all other processes and also receive from all other processes.

            • Reduce the combined gradient signal from all processes and update the individual model in parallel. Since all processes received the same gradient information, all models will still be in sync.

            Thus, in distributed data parallel we end up doing only a single communication call between all processes, compared to all the communication going on in data parallel. While all-reduce is a more expensive operation than many of the other communication operations that we can do, because we only have to do a single one we gain a huge performance boost. Empirically, distributed data parallel tends to be 2-3 times faster than data parallel.

            However, this performance increase does not come for free. Where we could implement data parallel in a single line in Pytorch, distributed data parallel is much more involved.

            "},{"location":"s9_scalable_applications/distributed_training/#exercises_1","title":"\u2754 Exercises","text":"
            1. We have provided an example of how to do distributed data parallel training in Pytorch in the two files distributed_example.py and distributed_example.sh. Your objective is to get an understanding of the necessary components in the script to get this kind of distributed training to work. Try to answer the following questions (HINT: try to Google around):

              1. What is the function of the DDP wrapper?

              2. What is the function of the DistributedSampler?

              3. Why is it necessary to call dist.barrier() before passing a batch into the model?

              4. What do the different environment variables in the .sh file do?

            2. Try to benchmark the runs using 1 and 2 GPUs

            3. The first exercise has hopefully convinced you that it can be quite a lot of trouble to write distributed training applications yourself. Luckily for us, Pytorch-lightning can take care of this such that we do not have to care about the specific details. To get your model training on multiple GPUs you need to change two arguments in the trainer: the accelerator flag and the gpus flag (a minimal sketch is shown below). In addition to this, you can read through this guide about any additional steps you may need to do (for many of you, it should just work). Try running your model on multiple GPUs.
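              A minimal sketch of the change, assuming a LightningModule called MyModel from the earlier modules and noting that recent Pytorch-lightning versions have renamed the gpus flag to devices, could look like:

                from pytorch_lightning import Trainer

                model = MyModel()  # your LightningModule from earlier modules

                # train on 2 GPUs, letting lightning handle the distributed details
                trainer = Trainer(accelerator="gpu", devices=2)  # older versions: Trainer(gpus=2)
                trainer.fit(model)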

            4. Try benchmarking your training using 1 and 2 GPUs, e.g. try running a couple of epochs and measure how long it takes. How much of a speedup can you actually get? Why can you not get a speedup of 2?

            "},{"location":"s9_scalable_applications/inference/","title":"M29 - Scalable Inference","text":""},{"location":"s9_scalable_applications/inference/#scalable-inference","title":"Scalable Inference","text":"

            Inference is the task of applying our trained model to some new and unseen data, often called prediction. Scaling inference is therefore different from scaling data loading and training, mainly because inference normally only uses a single data point (or a few). As we can neither parallelize the data loading nor parallelize across multiple GPUs (at least not in any efficient way), those approaches are of no use to us when we are doing inference. Secondly, inference is often not something we do on machines that can perform large computations, as most inference today is actually done either on edge devices, e.g. mobile phones, or in low-cost-low-compute cloud environments. Thus, we need to be smarter about how we scale inference than just throwing more compute at it.

            In this module we are going to look at various ways that you can either reduce the size of your model and/or make your model faster. Both are important for running inference fast regardless of the setup you are running your model on. We want to note that this is still very much an active area of research and therefore best practices for what to do in a specific situation can change.

            "},{"location":"s9_scalable_applications/inference/#choosing-the-right-architecture","title":"Choosing the right architecture","text":"

            Assume you are starting a completely new project and have to come up with a model architecture. What is your strategy? The common way to do this is to look at prior work on problems similar to the one you are facing and either directly choose the same architecture or create some slight variation hereof. This is a great way to get started, but the architecture that you end up choosing may be optimal in terms of performance but not inference speed.

            The fact is that not all base architectures are created equal, and a 10K parameter model with one architecture can have significantly different inference speed than another 10K parameter model with another architecture. For example, consider the figure below, which compares a number of models from the timm package, colored based on their base architecture. The general trend is that the number of images that can be processed by a model per second (y-axis) is inversely proportional to the number of parameters (x-axis). However, we in general see that convolutional base architectures (conv) are more efficient than transformers (vit) for the same parameter budget.

            Image credit"},{"location":"s9_scalable_applications/inference/#exercises","title":"\u2754 Exercises","text":"

            As discussed in this blogpost, the largest increase in inference speed you will see (given some specific hardware) comes from choosing an efficient model architecture. In the exercises below we are going to investigate the inference speed of different architectures.

            1. Start by checking out this table which contains a list of pretrained weights in torchvision. Try finding an

              • Efficientnet
              • Resnet
              • Transformer based

              model that has in the range of 20-30 million parameters (a small sketch for checking the parameter count is shown below).
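              A small sketch for checking the parameter counts could look like this; the model names below are just examples from the torchvision table, and the weights=None keyword assumes a recent torchvision version:

                from torchvision import models

                for name in ["efficientnet_b4", "resnet50", "swin_t"]:
                    model = getattr(models, name)(weights=None)  # random weights are enough for counting
                    n_params = sum(p.numel() for p in model.parameters())
                    print(f"{name}: {n_params / 1e6:.1f}M parameters")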

            2. Write a small script that initializes all the models and does inference with them. It should look something like this

              import time\n\nimport torch\nfrom torchvision import models\n\nm1 = models.ModelArchitecture1()  # replace with the three architectures you chose above\nm2 = models.ModelArchitecture2()\nm3 = models.ModelArchitecture3()\n\ninput = torch.randn(100, 3, 256, 256)\nn_reps = 10\n\nfor i, m in enumerate([m1, m2, m3]):\n    tic = time.time()\n    for _ in range(n_reps):\n        _ = m(input)\n    toc = time.time()\n    print(f\"Model {i} took: {(toc - tic) / n_reps}\")\n
            3. Do the results make sense? Based on the above figure we would expect that efficientnet is faster than resnet, which is faster than the transformer based model. Is this also what you are seeing?

            4. To figure out why one network is more efficient than another, we can try to count the operations each network needs to do for inference. An operation we can here define as a FLOP (floating point operation), which is any mathematical operation (such as +, -, *, /) or assignment that involves floating-point numbers. Luckily for us, someone has already created a Python package for calculating this in Pytorch: ptflops

              1. Install the package

                pip install ptflops\n
              2. Try calling the get_model_complexity_info function from the ptflops package on the networks from the previous exercise (a small sketch is shown below). What are the results?
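                A small sketch of how the call could look; the argument names and return values follow the ptflops documentation, so double check against the package version you installed:

                  from ptflops import get_model_complexity_info
                  from torchvision import models

                  model = models.resnet50(weights=None)
                  macs, params = get_model_complexity_info(model, (3, 224, 224), as_strings=True, print_per_layer_stat=False)
                  print(f"Computational complexity: {macs}")
                  print(f"Number of parameters: {params}")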

            5. In the table from the initial exercise, you could also see the overall performance of each network on the Imagenet-1K dataset. Given this performance, the inference speed and the FLOPs count, which network would you choose to use in a production setting? Discuss what should be considered when choosing one over another.

            "},{"location":"s9_scalable_applications/inference/#quantization","title":"Quantization","text":"

            Quantization is a technique where all computations are performed with integers instead of floats. We are essentially taking all continuous signals and converting them into discretized signals.

            Image credit

            As discussed in this blogpost series, float (32-bit) is the primary precision used in machine learning because it strikes a good balance between memory consumption, precision and computational requirements. However, that does not mean that we cannot take advantage of quantization during inference to improve the speed of our model. For instance:

            • Floating-point computations are slower than integer operations

            • Recent hardware has specialized units for doing integer operations

            • Many neural networks are actually not bottlenecked by how many computations they need to do, but by how fast data can be transferred, e.g. the memory bandwidth and cache of your system are the limiting factors. Therefore, working with 8-bit integers instead of 32-bit floats means that we can move data around approximately 4 times as fast.

            • Storing models as integers instead of floats saves us approximately 75% of the RAM/harddisk space whenever we save a checkpoint. This is especially useful in relation to deploying models using docker (as you hopefully remember), as it will lower the size of our docker images.

            But how do we convert between floats and integers in quantization? In most cases we use a linear affine quantization:

            $$ x_{int} = \\text{round}\\left( \\frac{x_{float}}{s} + z \\right) $$

            where $s$ is a scale and $z$ is the so-called zero point. But how does this translate to doing inference in a neural network? The figure below shows all the conversions that we need to make to our standard inference pipeline to actually do computations in a quantized format.

            Image credit"},{"location":"s9_scalable_applications/inference/#exercises_1","title":"\u2754 Exercises","text":"
            1. Let's look at how quantized tensors look in Pytorch:

              1. Start by creating a tensor that contains some random numbers.

              2. Next call the torch.quantize_per_tensor function on the tensor (a small sketch is shown after this list). What does the quantized tensor look like? How do the values relate to the scale and zero_point arguments?

              3. Finally, try to call the .dequantize() method on the tensor. Do you get a tensor back that is close to what you initially started out with?
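                A small sketch covering the three sub-exercises above; the scale and zero_point values are arbitrary choices for illustration:

                  import torch

                  x = torch.randn(4)  # a small tensor with random numbers
                  xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

                  print(x)                # original float values
                  print(xq)               # quantized representation
                  print(xq.int_repr())    # the underlying integers, roughly round(x / scale) + zero_point
                  print(xq.dequantize())  # back to floats, close to x but with rounding error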

            2. As you hopefully saw in the first exercise, we make a number of rounding errors when doing quantization, and naively we would expect that these would accumulate and lead to a much worse model. However, in practice we observe that quantization still works, and we actually have a mathematically sound reason for this. Can you figure out why quantization still works despite all the small rounding errors? HINT: it has to do with the central limit theorem.

            3. Let's move on to quantization of our model. Follow this tutorial from Pytorch on how to do quantization. The goal is to construct a model model_fc32 that works on normal floats and a quantized version model_int8. For simplicity you can just use one of the models from the tutorial.

            4. Let's try to benchmark our quantized model and see if all the trouble that we went through actually paid off. Also try to perform the benchmark on the non-quantized model and see if you get a difference. If you do not get an improvement, explain why that may be.

            "},{"location":"s9_scalable_applications/inference/#pruning","title":"Pruning","text":"

            Pruning is another way to reduce the model size and maybe improve the performance of our network. As the figure below illustrates, in pruning we simply remove weights in our network that we do not consider important for the task at hand. By removing, we here mean that the weight gets set to 0. There are many ways to determine if a weight is important, but the general rule is that the importance of a weight is proportional to its magnitude. This makes intuitive sense, since weights in all linear operations (fully connected or convolutional) are always multiplied onto the incoming value, thus a small weight means a small outgoing activation.

            Image credit"},{"location":"s9_scalable_applications/inference/#exercises_2","title":"\u2754 Exercises","text":"
            1. We provide a start script that implements the famous LeNet in this file. Open and run it just to make sure that you know the network.

            2. Pytorch already has some pruning methods implemented in its package. Import the prune module from torch.nn.utils in the script.

            3. Try to prune the weights of the first convolutional layer by calling

              prune.random_unstructured(module_1, name=\"weight\", amount=0.3)  # (1)!\n
              1. You can read about the prune method here.

              Try printing the named_parameters and named_buffers before and after the module is pruned. Can you explain the difference, and what is the connection to the module_1.weight attribute?

            4. Try pruning the bias of the same module, this time using the l1_unstructured function from the pruning module. Again check the named_parameters and named_buffers to make sure you understand the difference between L1 pruning and random unstructured pruning.

            5. Instead of pruning only a single module in the model, let's try pruning the whole model. To do this we just need to iterate over all named_modules in the model like this:

              for name, module in new_model.named_modules():\n    # only modules that actually have a weight parameter can be pruned this way\n    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):\n        prune.l1_unstructured(module, name='weight', amount=0.2)\n

              But what if we wanted to apply different pruning to different layers? Implement a pruning scheme where

              • The weights of convolutional layers are L1 pruned with amount=0.2
              • The weights of linear layers are unstructured pruned with amount=0.4

              Run print(dict(new_model.named_buffers()).keys()) after the pruning to confirm that all weights have been correctly pruned.

            6. The pruning we have looked at until now has only been local in nature, i.e. we have applied the pruning independently for each layer, not accounting globally for how much we should actually prune. As you may realize, this can quickly lead to a network that is pruned too much. Instead, the more common approach is to prune globally, where we remove the smallest X% of connections across the whole network:

              1. Start by creating a tuple over all the weights with the following format

                parameters_to_prune = (\n    (model.conv1, 'weight'),\n    # fill in the rest of the modules yourself\n    (model.fc3, 'weight'),\n)\n

                The tuple needs to have length 5. Challenge: can you construct the tuple using for loops, such that the code works for arbitrarily sized networks?

              2. Next prune using the global_unstructured function to globally prune the tuple of parameters

                prune.global_unstructured(\n    parameters_to_prune,\n    pruning_method=prune.L1Unstructured,\n    amount=0.2,\n)\n
              3. Check that the amount that has been pruned is actually equal to the 20% specified in the pruning. We provide the following function that for a given submodule (for example model.conv1) computes the amount of pruned weights (a sketch of a corresponding global check is shown after this exercise list)

                def check_prune_level(module: nn.Module):\n    sparsity_level = 100 * float(torch.sum(module.weight == 0) / module.weight.numel())\n    print(f\"Sparsity level of module {sparsity_level}\")\n
            7. With a pruned network we really want to see if all our effort actually ended up with a network that is faster and/or smaller in memory. Do the following to the globally pruned network from the previous exercises:

              1. First we need to make the pruning of our network permanent. Right now it is only semi-permanent as we are still keeping a copy of the original weights in memory. Make the change permanent by calling prune.remove on every pruned module in the model. Hint: iterate over the parameters_to_prune tuple.

              2. Next try to measure the time of a single inference (repeated 100 times) for both the pruned and non-pruned network

                import time\ntic = time.time()\nfor _ in range(100):\n    _ = network(torch.randn(100, 1, 28, 28))\ntoc = time.time()\nprint(toc - tic)\n

                Is the pruned network actually faster? If not can you explain why?

              3. Next let's measure the size of our network (called pruned_network) and a freshly initialized network (called network):

                torch.save(pruned_network.state_dict(), 'pruned_network.pt')\ntorch.save(network.state_dict(), 'network.pt')\n

                Look up the size of each file. Is the pruned network actually smaller? If not, can you explain why?

              4. Repeat the last exercise, but this time convert all pruned weights to sparse format first by calling the .to_sparse() method on each pruned weight. Is the saved model smaller now?
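
            As mentioned in exercise 6, a global counterpart to the per-module check_prune_level function could look like this minimal sketch (it assumes the parameters_to_prune tuple constructed above):

            import torch\n\ndef global_sparsity(parameters_to_prune):\n    '''Overall percentage of pruned (zero) weights across the given (module, name) pairs.'''\n    zeros, total = 0, 0\n    for module, name in parameters_to_prune:\n        weight = getattr(module, name)\n        zeros += int(torch.sum(weight == 0))\n        total += weight.numel()\n    return 100 * zeros / total\n\n# print(f'Global sparsity: {global_sparsity(parameters_to_prune):.1f}%')\n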

            This ends the exercises on pruning. As you probably realized in the last couple of exercises, pruning does not guarantee speedups out of the box. This is because linear operations in Pytorch do not handle sparse structures out of the box, as the small sketch below illustrates. To actually get speedups we would need to take a deep dive into sparse tensor operations, which again does not even guarantee a speedup, because the performance of these operations depends on the sparsity structure of the pruned weights. Investigating this is out of scope for these exercises, but we highly recommend checking it out if you are interested in sparse networks.
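
            To make this point concrete, here is a small sketch (not part of the exercises) that contrasts the dense kernel, which multiplies the zeros anyway, with the dedicated torch.sparse.mm kernel; whether the sparse version is actually faster depends heavily on the sparsity level and structure of the weights:

            import torch\n\nw = torch.randn(1024, 1024)\nw[w.abs() < 1.0] = 0.0                            # roughly 68% of the weights become zero\nx = torch.randn(1024, 64)\n\ndense_out = w @ x                                 # dense matmul, the zeros are multiplied anyway\nsparse_out = torch.sparse.mm(w.to_sparse(), x)    # the same computation with a sparse kernel\nprint((dense_out - sparse_out).abs().max())       # ~0 up to floating point error\n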

            "},{"location":"s9_scalable_applications/inference/#knowledge-distillation","title":"Knowledge distillation","text":"

            Knowledge distillation is somewhat similar to pruning, in the sense that it tries to find a smaller model that can perform equally well as a large model, however it does so in a completely different way. Knowledge distillation is a model compression technique that builds on the work of Bucila et al., in which we try to distill/compress the knowledge of a large complex model (also called the teacher model) into a simpler model (also called the student model).

            The best known example of this is the DistilBERT model. DistilBERT is a smaller version of the large natural language processing model BERT, which achieves 97% of the performance of BERT while being 40% smaller and 60% faster. You can see in the figure below how it is much smaller in size compared to other models developed at the same time.

            Image credit

            Knowledge distillation works by assuming we have a big teacher model that is already performing well and that we want to compress. By running our training set through the large model we get a softmax distribution for each and every training sample. The goal of the student is to match both the original labels of the training data and the softmax distribution of the teacher model. The intuition behind doing this is that the teacher model needs a lot of capacity to learn the complex inter-class relationships from just (one-hot) labels. The student, on the other hand, is fed directly with softmax distributions from the teacher that explicitly encode these inter-class relationships, and thus does not need the same capacity to learn what the teacher learned.

            Image credit"},{"location":"s9_scalable_applications/inference/#exercises_3","title":"\u2754 Exercises","text":"

            Let's try implementing model distillation ourselves. We are going to see if we can achieve this on the cifar10 dataset. Do note that the exercises below can take quite a long time to finish because they involve training multiple networks and therefore involve some waiting.

            1. Start by installing the transformers and datasets packages from Huggingface

              pip install transformers\npip install datasets\n

              which we are going to use to download the cifar10 dataset and a teacher model.

            2. Next download the cifar10 dataset

              from datasets import load_dataset\ndataset = load_dataset(\"cifar10\")\n
            3. Next let's initialize our teacher model. For this we consider a large transformer based model:

              from transformers import AutoFeatureExtractor, AutoModelForImageClassification\nextractor = AutoFeatureExtractor.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\nmodel = AutoModelForImageClassification.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\n
            4. To get the logits (un-normalized softmax scores) from our teacher model for a single datapoint from the training dataset you would extract it like this:

              sample_img = dataset['train'][0]['img']\npreprocessed_img = extractor(dataset['train'][0]['img'], return_tensors='pt')\noutput =  model(**preprocessed_img)\nprint(output.logits)\n# tensor([[ 3.3682, -0.3160, -0.2798, -0.5006, -0.5529, -0.5625, -0.6144, -0.4671, 0.2807, -0.3066]])\n

              Repeat this process for the whole training dataset and store the results somewhere (a minimal sketch of one way to do this is shown after this exercise list).

            5. Implement a simple convolutional model. You can create a custom one yourself or use a small one from torchvision.

            6. Train the model on cifar10 to convergence, so you have a base result on how the model is performing.

            7. Redo the training, but this time add knowledge distillation to your training objective. It should look like this:

              for batch in dataset:\n    # ...\n    img, target, teacher_logits = batch\n    preds = model(img)\n    loss = torch.nn.functional.cross_entropy(preds, target)\n    # soften the raw teacher logits into a probability distribution before using them as soft targets\n    # (a temperature-scaled variant of this loss is sketched after this exercise list)\n    loss_teacher = torch.nn.functional.cross_entropy(preds, teacher_logits.softmax(dim=-1))\n    loss = loss + loss_teacher\n    loss.backward()\n    # ...\n
            8. Compare the final performance obtained with and without knowledge distillation. Did the performance improve or not?
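
            For exercise 4, a minimal sketch of precomputing the teacher logits over the training set could look like the following (the batch size of 64 and the file name teacher_logits.pt are arbitrary choices, adapt them to your setup):

            import torch\n\nteacher_logits = []\nwith torch.no_grad():\n    for i in range(0, len(dataset['train']), 64):\n        batch_imgs = dataset['train'][i : i + 64]['img']\n        inputs = extractor(batch_imgs, return_tensors='pt')\n        teacher_logits.append(model(**inputs).logits)\ntorch.save(torch.cat(teacher_logits), 'teacher_logits.pt')\n

            For exercise 7, a commonly used refinement of the simple objective above is the temperature-scaled distillation loss of Hinton et al.; below is a hedged sketch of one way to write it, where the temperature T and the weight alpha are hyperparameters you would need to tune:

            import torch.nn.functional as F\n\ndef distillation_loss(preds, target, teacher_logits, T=2.0, alpha=0.5):\n    '''Weighted sum of the usual cross entropy and a soft loss that matches the teacher.'''\n    hard_loss = F.cross_entropy(preds, target)\n    soft_loss = F.kl_div(\n        F.log_softmax(preds / T, dim=-1),\n        F.softmax(teacher_logits / T, dim=-1),\n        reduction='batchmean',\n    ) * T * T\n    return alpha * hard_loss + (1 - alpha) * soft_loss\n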

            This ends the module on scaling inference in machine learning models.

            "},{"location":"tools/","title":"Tools","text":"

            Just a collection of tools and scripts for running the course.

            "}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"

            Machine Learning Operations

            Repository for course 02476 at DTU.

            Checkout the homepage!

            "},{"location":"#i-course-information","title":"\u2139\ufe0f Course information","text":"
            • Course responsible

            • Postdoc Nicki Skafte Detlefsen, nsde@dtu.dk

            • Professor S\u00f8ren Hauberg, sohau@dtu.dk

            • 5 ECTS (European Credit Transfer System), corresponding to 140 hours of work

            • 3 week period in January
            • Master level course
            • Grade: Pass/not passed
            • Type of assessment: oral presentation + project report
            • Recommended prerequisites: DTU course 02456 (Deep Learning) or experience with the following topics:

            • General understanding of machine learning (datasets, probability, classifiers, overfitting etc.)

            • Basic knowledge of deep learning (backpropagation, convolutional neural networks, auto-encoders etc.)
            • Coding in PyTorch. The first day we provide some exercises in PyTorch to get everyone's skills up-to-date as fast as possible.
            "},{"location":"#course-setup","title":"\ud83d\udcbb Course setup","text":"

            Start by cloning or downloading this repository

            git clone https://github.com/SkafteNicki/dtu_mlops\n

            If you do not have git installed (yet) we will touch upon it in the course. The folder will contain all the exercise material for this course and lectures. Additionally, you should join our Slack channel which we use for communication. The link may be expired, write to me.

            "},{"location":"#course-organization","title":"\ud83d\udcc2 Course organization","text":"

            We highly recommend that when going through the material you use the homepage which is the corresponding Github pages version of this repository that is more nicely rendered, that also includes some special HTML magic provided by Material for MkDocs.

            The course is divided into sessions, denoted by capital S, and modules, denoted by capital M. A session corresponds to a full day of work if you are following the course, meaning approximately 9 hours of work. Each session (S) corresponds to a topic within MLOps and consists of multiple modules (M) that each cover a tool within the session.

            Importantly we differ between core modules and optional modules. Core modules will be marked by

            Core Module

            at the top of their corresponding page. Core modules are important to go through to be able to pass the course. You are highly recommended to still do the optional modules.

            "},{"location":"#mlops-what-is-it","title":"\ud83c\udd92 MLOps: What is it?","text":"

            Machine Learning Operations (MLOps) is a rather new field that has seen its uprise as machine learning and particularly deep learning has become a widely available technology. The term itself is a compound of \"machine learning\" and \"operations\" and covers everything that has to do with the management of the production ML lifecycle.

            The lifecycle of production ML can largely be divided into three phases:

            1. Design: The initial phase starts with an investigation of the problem. Based on this analysis, several requirements can be prioritized for what we want our future model to do. Since machine learning requires data to be trained, we also investigate in this step what data we have and if we need to source it in some other way.

            2. Model development: Based on the design phase we can begin to conjure up some machine learning algorithms to solve our problems. As always, the initial step often involves doing some data analysis to make sure that our model is learning the signal that we want it to learn. Second is the machine learning engineering phase, where the particular model architecture is chosen. Finally, we also need to do validation and testing to make sure that our model is generalizing well.

            3. Operations: Based on the model development phase, we now have a model that we want to use. Operations is where we create an automatic pipeline that makes sure that whenever we make changes to our codebase they get automatically incorporated into our model, such that we do not slow down production. Equally important is the ongoing monitoring of already deployed models to make sure that they behave exactly as we specified them.

            It is important to note that the three steps are a cycle, meaning that when you have successfully deployed a machine learning model that is not the end of it. Your initial requirements may change, forcing you to revisit the design phase. Some new algorithms may show promising results, so you revisit the model development phase to implement this. Finally, you may try to cut the cost of running your model in production, making you revisit the operations phase, and trying to optimize some steps.

            The focus in this course is particularly on the Operations part of MLOps as this is what many data scientists are missing in their toolbox to take all the knowledge they have about data processing and model development into a production setting.

            "},{"location":"#learning-objectives","title":"\u2754 Learning objectives","text":"

            General course objective

            Introduce the student to a number of coding practices that will help them organize, scale, monitor and deploy machine learning models either in a research or production setting. To provide hands-on experience with a number of frameworks, both local and in the cloud, for working with large-scale machine learning models.

            This includes:

            • Organize code in an efficient way for easy maintainability and shareability
            • Understand the importance of reproducibility and how to create reproducible containerized applications and experiments
            • Capable of using version control to efficiently collaborate on code development
            • Knowledge of continuous integration (CI) and continuous machine learning (CML) for automating code development
            • Being able to debug, profile, visualize and monitor multiple experiments to assess model performance
            • Capable of using online cloud-based computing services to scale experiments
            • Demonstrate knowledge about different distributed training paradigms within machine learning and how to apply them
            • Deploy machine learning models, both locally and in the cloud
            • Conduct a research project in collaboration with fellow students using the frameworks taught in the course
            • Have lots of fun and share memes! :)
            "},{"location":"#references","title":"\ud83d\udcd3 References","text":"

            Additional reading resources (in no particular order):

            • Ref 1 Introduction blog post for those who have never heard about MLOps and want to get an overview.

            • Ref 2 Great document from Google about the different levels of MLOps.

            • Ref 3 Another introduction to the principles of MLOps and the different stages of MLOps.

            • Ref 4 Great paper about the technical debt in machine learning.

            • Ref 5 Interview study that uncovers many of the pain points that ML engineers go through when doing MLOps.

            Other courses with content similar to this:

            • Made with ML. Great online MLOps course that also covers additional topics on the foundations of working with ML.

            • Full stack deep learning. Another MLOps online course going through the whole developer pipeline.

            • MLOps Zoomcamp. MLOps online course that includes many of the same topics.

            "},{"location":"#contributing","title":"\ud83d\udc68\u200d\ud83c\udfeb Contributing","text":"

            If you want to contribute to the course, we are happy to have you! Anything from fixing typos to adding new content is welcome. For building the course material locally, it is a simple two-step process:

            pip install -r requirements.txt\nmkdocs serve\n

            This will start a local server that you can access at localhost:8000 and that will automatically update when you make changes to the course material. When you have something that you want to contribute, please make a pull request.

            "},{"location":"#license","title":"\u2755 License","text":"

            I highly value open source, and the content of this course is therefore free to use under the Apache 2.0 license. If you use parts of this course in your work, please cite using:

            @misc{skafte_mlops,\n    author       = {Nicki Skafte Detlefsen},\n    title        = {Machine Learning Operations},\n    howpublished = {\\url{https://github.com/SkafteNicki/dtu_mlops}},\n    year         = {2024}\n}\n
            "},{"location":"challenges/","title":"Challenges","text":"

            If you have managed to go through all other material, congratulations, you are already well on your way to becoming an MLOps engineer with a great overview of tools, concepts and techniques within the field. Below are listed some technically hard problems regarding MLOps. These are meant as inspiration to get you to dive deeper into using all the cloud services that gcp offers. You are also free to continue work on your project.

            • Currently testing takes place in Github, but it should come as no surprise that gcp can also take care of this. Implement testing on gcp. This blogpost can probably help.

            • In the lectures we set up cloud build to automatically build a docker container for training whenever we pushed code to our github repository. However, we also set up CI testing in github. If tests are failing on github the building of the docker image is still being done, essentially wasting our precious cloud credit. Set up a system so cloud building only commences when all tests are passing.

            • Authenticating between gcp, wandb and dvc can be tricky to do in a secure way. Figure out how to use the Secret Manager in gcp to pass secrets e.g. API keys during the build process of docker images. This page may help

            • We have already done deployment through Cloud Functions. The native extension to cloud functions is the service Cloud Run which allows for more than just code snippets to be deployed. Checkout this service and try to deploy a container using it.

            • All deployments we have done in the course have been serverless, because it makes it easier for us to focus on the actual application we are trying to deploy instead of focusing on server management. That said, going through the trouble of using a server orchestrator yourself can be worth it in many situations. Figure out how to use kubernetes in gcp. It will involve getting familiar with the kubernetes API and probably also kubeflow for managing pipelines on the server.

            • Vertex AI is the newest ML service on gcp. It combines many of the features of the AI platform service you have already used with the AutoML service. Figure out how to use Vertex AI service to either train a custom model or use their AutoML feature. This blogpost can be a good place to start.

            • If you want different services to be able to talk to each other the correct way is to set up a system using the Pub and Sub (publish and subscribe) service in gcp. Essentially it allows a service to publish a message and other services to subscribe and react to it. For example the AI platform could publish a message every time a model was done training and cloud build could subscribe to that, automatically starting to build a docker image using the trained model. Investigate Pub and Sub and try to make two services talk to each other.

            • In the deployment exercises you probably looked at least once at the logs. We can automate what we do with the logs using the Logs Explorer service, which collects all logs from all services that you are using. Set up logs routing for one of your deployed services to your cloud storage. Afterwards set up a VM that consumes the logs and accumulates them.

            "},{"location":"faq/","title":"Frequently asked questions","text":"

            For further questions, please contact Nicki.

            "},{"location":"faq/#is-it-possible-to-attend-the-course-fully-online","title":"Is it possible to attend the course fully online \u2754","text":"

            Mostly yes. All exercises are provided online and lectures will be recorded and streamed. However, do note that

            • For project days (see which days in the time plan) you will need to agree with your project group that you are working from home.
            • The oral part of the exam takes place on campus. EuroTEQ students are exempt from this rule. If you have an extremely good reason for not being able to come to campus on the exam date, please contact us within the first week of the course.
            • We have limited TA resources and will be prioritizing students coming to campus for help. If you are attending online, feel free to ask questions on our Slack channel and we will help to the best of our ability.

            Overall we try to support flexible learning as much as possible with some limitations.

            "},{"location":"faq/#what-are-the-prerequisites-for-taking-this-course","title":"What are the prerequisites for taking this course \u2754","text":"

            We recommend that you have a basic understanding of machine learning concepts such as what a dataset is, what probabilities are, what a classifier is, what overfitting means etc. This corresponds to the curriculum covered in course 02450. The actual focus of the course is not on machine learning models, but we will be using these basic concepts throughout the exercises.

            Additionally, we recommend basic knowledge about deep learning and how to code in Pytorch, corresponding to the curriculum covered in 02456. From prior experience, we know that not all students have gained knowledge about deep learning models before this course, and we will be covering the basics of how to code in PyTorch in one of the first modules of the course to get everyone up to speed.

            "},{"location":"faq/#i-will-be-missing-x-days-of-the-course-will-that-be-a-problem","title":"I will be missing X days of the course, will that be a problem \u2754","text":"

            Depends. The course is fairly intensive, with most students working from 9-17 every day. If you already know that you will be missing X days of the course, then I highly recommend that you go through some of the first sessions before the course starts to give yourself a bit of breathing room. If you are not able to do so, please be aware that an additional effort may be needed from you to keep up with your fellow students.

            "},{"location":"faq/#how-many-should-we-be-in-a-group-for-the-projects","title":"How many should we be in a group for the projects \u2754","text":"

            Between 3 and 5. The projects are designed to be done in groups meaning that we intentionally make them too big for one person to do alone. Luckily, a lot of the work that you need to do can be done in parallel, so it is not as bad as it sounds.

            "},{"location":"faq/#when-will-the-exam-take-place","title":"When will the exam take place \u2754","text":"

            The oral part of the exam, which is a small project demo, always falls on the last day of the course. For January 2024, this means the 19th. The written part, which is a small project report, should be handed in at midnight on the final course day.

            "},{"location":"faq/#where-can-i-find-information-regarding-the-exam","title":"Where can I find information regarding the exam \u2754","text":"

            Look at the bottom of this page. Details will be updated as we get closer to the exam date.

            "},{"location":"faq/#can-i-use-chatgpt-or-similar-tools-for-the-exercises-project-exam-report-coding-writing","title":"Can I use ChatGPT or similar tools for the exercises, project, exam report (coding + writing) \u2754","text":"

            Yes, yes, and yes, but remember that it's a tool and you need to validate the output before using it. We would prefer for the exam report that you formulate the answers in your own words because it is intended for you to describe what you have been doing in your project. The I in LLM stands for intelligence.

            "},{"location":"faq/#i-am-a-foreign-student-and-my-home-university-doesnt-accept-passnot-pass-what-can-i-do","title":"I am a foreign student and my home university doesn't accept pass/not pass, what can I do \u2754","text":"

            We can give a grade on the Danish 7-point grading scale for foreign students who need it, where their home university does not accept pass/no-pass. You need to contact the course responsible Nicki within the first week of the course to request this. Secondly, make sure to also inform us about it during the oral part of the exam because we need to ask you additional questions to be able to give an exact grade.

            "},{"location":"faq/#i-am-a-euroteq-student-any-special-rules-for-me","title":"I am a EuroTEQ student, any special rules for me \u2754","text":"

            You will be allowed to attend the oral part of the exam online and we will provide a special Slack channel for you, trying to make sure that you get the same help as students from DTU who can attend the course on campus.

            "},{"location":"overview/","title":"Summary of course content","text":"

            There are a lot of moving parts in this course, so it may be hard to understand how it all fits together. This page provides a summary of the frameworks in this course, i.e. the stack of tools used. In the figure below we have provided an overview of how the different tools of the course interact with each other. The table after the figure provides a short description of each of the parts.

            The MLOps stack in the course. This is just an example of one stack, and depending on your use case you may want to use a different stack of tools that better fits your needs. Regardless of the stack, the principles of MLOps are the same.
            • Pytorch is the backbone of our code; it provides the computational engine and the data structures that we need.
            • Pytorch lightning is a framework that provides a high-level interface to Pytorch. It provides a lot of the functionality that we need to train our models, such as logging, checkpointing, early stopping, etc., so that we do not have to implement it ourselves. It also allows us to scale our models to multiple GPUs and multiple nodes.
            • Conda controls the dependencies and python interpreter, enabling us to construct reproducible virtual environments.
            • Hydra is used for configuring our experiments and allows us to define a hierarchical configuration structure in config files.
            • Weights and Bias allows us to track and log any values and hyperparameters for our experiments.
            • The Profiler helps us find the cause of performance bottlenecks in our code.
            • The Debugger helps us find the cause of bugs in our code.
            • Cookiecutter is used for organizing our code and creating templates.
            • Docker allows us to create a container that contains all the dependencies and code that we need to run our code.
            • DVC controls the versions of our data and the synchronization between local and remote data storage.
            • Git (in combination with Github) handles version control of our code and allows multiple developers to work together on a shared codebase.
            • Pytest lets us write unit tests for our code, to make sure that new changes to the code do not break the code base.
            • Pylint and Flake8 are used for linting our code and keeping a consistent coding style; they check our code for common mistakes and style issues.
            • Github actions run our unit tests and other checks on our code in a continuous manner, e.g. after we commit and push our code.
            • Cloud build automates the process of building our docker images and pushing them to our container registry.
            • Container registry is a service that allows us to store our docker images for later use by other services.
            • Cloud storage provides a scalable and secure solution for storing our data and trained models.
            • Compute engine provides a scalable and secure solution for general compute tasks.
            • Vertex AI lets us train our experiments in an easy and scalable manner.
            • FastAPI provides a high-level interface for creating a REST API for our model.
            • Cloud functions allow simple deployments of our code, running it in response to events through simple python functions.
            • Cloud run allows more complex deployments of our code, running it in response to events through docker containers.
            • Cloud monitoring gives us the tools to keep track of important logs and errors from the other cloud services.
            • Evidently AI provides a framework and dashboard for monitoring whether our deployed model is experiencing any drift.
            • OpenTelemetry provides a standard for collecting and exporting telemetry data from our deployed model."},{"location":"projects/","title":"Project work","text":"

            Slides

            Approximately 1/3 of the course time is dedicated to doing project work. The projects will serve as the basis of your exam. In the project, you will essentially re-apply everything that you learn throughout the course to a self-chosen project. The overall goals of the project are:

            • Being able to work in a group on a larger project
            • To formulate a project within the provided guidelines
            • Apply the material taught in the course to the problem
            • Present your findings

            In the projects you are free to work on whatever problem you want. That said, we have one specific requirement: you need to incorporate some third-party framework into your project. If you want inspiration for projects, here are some examples

            1. Classification of tweets

            2. Translating from English to German

            3. Classification of scientific papers

            4. Classification of rice types from images

            We hope most students will be able to form groups by themselves. Expected group size is between 3 and 5. If you are not able to form a group, please make sure to post in the #looking-for-group channel on Slack or make sure to be present on the 4th day of the course (the day before the project work starts) where we will help students that have not found a group yet.

            "},{"location":"projects/#open-source-tools","title":"Open-source tools","text":"

            We strive to keep the tools taught in this course as open-source as possible. The great thing about the open-source community is that whatever problem you are working on, there is probably some package out there that can get you at least 10% of the way. For the project, we want to enforce this point and you are required to include some third-party package, that is neither Pytorch nor one of the tools already covered in the course, into your project.

            If you have no idea what framework to include, the Pytorch ecosystem is a great place for finding open-source frameworks that can help you accelerate your own projects where Pytorch is the backbone. All tools in the ecosystem should work well together with Pytorch. However, it is important to note that the ecosystem is not a complete list of all the awesome packages that exist to extend the functionality of Pytorch. If you are still missing inspiration for frameworks to use, we highly recommend these three that have been used in previous years of the course:

            • PyTorch Image Models. PyTorch Image Models (also known as TIMM) is the absolutely most used computer vision package (maybe except for torchvision). It contains models, scripts and pre-trained weights for a lot of state-of-the-art image models within computer vision.

            • Transformers. The Transformers repository from the Huggingface group focuses on state-of-the-art Natural Language Processing (NLP). It provides many pre-trained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

            • Pytorch-Geometric. PyTorch Geometric (PyG) is a geometric deep learning framework. It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers.

            "},{"location":"projects/#project-days","title":"Project days","text":"

            Each project day is fully dedicated to project work, except for maybe external inspirational lectures in the morning. The group decides exactly where they want to work on the project, how they want to work on the project, how to distribute the workload etc. We strongly encourage you to parallelize work during the project, because there are a lot of tasks to do, but it is important that all group members at least have some understanding of the whole project.

            Remember that the focus of the project work is not to demonstrate that you can work with the biggest and baddest deep learning model, but instead that you show that you can incorporate the tools that are taught throughout the course in a meaningful way.

            Also note that the project is not expected to be very large in scope. It may simply be that you want to train X model on Y data. You will approximately be given 4 full days to work on the project. It is better that you start out with a smaller project and then add complexity along the way if you have time.

            "},{"location":"projects/#day-1","title":"Day 1","text":"

            The first project day is all about getting started on the project and formulating exactly what you want to work on as a group.

            1. Start by brainstorming projects! Try to figure out exactly what you want to work with and begin to investigate what third-party package can support the project.

            2. When you have come up with an idea, write a project description. The description is the delivery for today and should be at least 300 words. Try to answer the following questions in the description:

              • Overall goal of the project
              • What framework are you going to use and how do you intend to include the framework in your project?
              • What data are you going to run on (initially, may change)
              • What models do you expect to use
            3. (Optional) If you want to think more about the product design of your project, feel free to fill out the ML canvas (or part of it). You can read more about the different fields of the canvas here.

            4. After having written the project description, you can start on the actual coding of the project. In the next section, a to-do list is attached that summarizes what we are doing in the course. You are NOT expected to fulfill all bullet points from week 1 today.

            The project description will serve as a guideline for us at the exam to check that you have somewhat reached the goals that you set out to do. By the end of the day, you should commit your project description to the README.md file belonging to your project repository. If you filled out the ML canvas, feel free to include that as part of the README.md file. Also remember to commit whatever you have done on the project until now. When you have done this, go to DTU Learn and hand in (as a group) the link to your github repository as an assignment.

            We will briefly (before next Monday) look over your github repository and project description to check that everything is fine. If we have any questions/concerns we will contact you.

            "},{"location":"projects/#day-2","title":"Day 2","text":"

            The goal for today is simply to continue working on your project. Start with bullet points in the checklist from week 1 and continue with bullet points for week 2.

            "},{"location":"projects/#day-3","title":"Day 3","text":"

            Continue working on your project; today you should hopefully focus on the bullet points in the checklist from week 2. There is no delivery for today, but make sure that you have committed all your progress at the end of the day. We will again briefly look over the repositories and will reach out to your group if we are worried about the progression of your project.

            "},{"location":"projects/#day-4","title":"Day 4","text":"

            We have now entered the final week of the course and the second to last project day. You are most likely continuing with bullet points from week 2, but should hopefully begin to look at the bullet points from week 3 today. These are in general much more complex, so we recommend waiting with them until you have completed most of the points from week 2. We also recommend that you begin to fill out the report template.

            "},{"location":"projects/#day-5","title":"Day 5","text":"

            Today you are finishing your project. We recommend that you start by creating an architectural overview of your project similar to this figure. I recommend using draw.io for creating this kind of diagram, but feel free to use any tool you like. Otherwise you should just continue working on your project, checking off as many bullet points as possible. Finally, you should also prepare yourself for the exam tomorrow.

            "},{"location":"projects/#project-checklist","title":"Project checklist","text":"

            Please note that all the lists are exhaustive, meaning that I do not expect you to have completed every point on the checklist for the exam.

            "},{"location":"projects/#week-1","title":"Week 1","text":"
            • Create a git repository
            • Make sure that all team members have write access to the github repository
            • Create a dedicated environment for your project to keep track of your packages
            • Create the initial file structure using cookiecutter
            • Fill out the make_dataset.py file such that it downloads whatever data you need and
            • Add a model file and a training script and get that running
            • Remember to fill out the requirements.txt file with whatever dependencies that you are using
            • Remember to comply with good coding practices (pep8) while doing the project
            • Do a bit of code typing and remember to document essential parts of your code
            • Setup version control for your data or part of your data
            • Construct one or multiple docker files for your code
            • Build the docker files locally and make sure they work as intended
            • Write one or multiple configurations files for your experiments
            • Used Hydra to load the configurations and manage your hyperparameters
            • When you have something that works somewhat, remember at some point to do some profiling and see if you can optimize your code
            • Use Weights & Biases to log training progress and other important metrics/artifacts in your code. Additionally, consider running a hyperparameter optimization sweep.
            • Use Pytorch-lightning (if applicable) to reduce the amount of boilerplate in your code
            "},{"location":"projects/#week-2","title":"Week 2","text":"
            • Write unit tests related to the data part of your code
            • Write unit tests related to model construction and or model training
            • Calculate the coverage.
            • Get some continuous integration running on the github repository
            • Create a data storage in GCP Bucket for your data and preferably link this with your data version control setup
            • Create a trigger workflow for automatically building your docker images
            • Get your model training in GCP using either the Engine or Vertex AI
            • Create a FastAPI application that can do inference using your model
            • If applicable, consider deploying the model locally using torchserve
            • Deploy your model in GCP using either Functions or Run as the backend
            "},{"location":"projects/#week-3","title":"Week 3","text":"
            • Check how robust your model is towards data drifting
            • Setup monitoring for the system telemetry of your deployed model
            • Setup monitoring for the performance of your deployed model
            • If applicable, play around with distributed data loading
            • If applicable, play around with distributed model training
            • Play around with quantization, compilation and pruning for your trained models to increase inference speed
            "},{"location":"projects/#additional","title":"Additional","text":"
            • Revisit your initial project description. Did the project turn out as you wanted?
            • Make sure all group members have an understanding about all parts of the project
            • Uploaded all your code to github
            "},{"location":"projects/#exam","title":"Exam","text":"

            The exam consists of a written and an oral element, and both contribute to the overall evaluation of whether you pass or do not pass the course.

            For the written part of the exam we provide a template folder called reports. As the first task you should copy the folder and all its content to your project repository. Then, your job is to fill out the README.md file which contains the report template. The file itself contains instructions on how to fill it out and instructions on using the included report.py file. You will hand in the template by simply including it in your project repository. By midnight on the 20/1 we will scrape it automatically, and changes after this point are therefore not registered.

            For the oral part of the exam you will be given a time slot where you have to show up for 5-7 min and give a very short demo of your project. What we are interested in seeing is essentially a live demo of your deployed application/project. We will possibly also ask questions regarding the overall curriculum of the course. Importantly, you should have your deployed application, the github repository with your project code, your W&B account and your GCP account ready before you enter the exam so we can quickly jump around. We will send out the time slots during the last week.

            "},{"location":"timeplan/","title":"Timeplan","text":"

            Slides

            The course is organised into exercise days (2/3 of the course) and project days (1/3 of the course).

            Exercise days start at 9:00 in the morning with a lecture (15-30 min) that will give some context about at least one of the topics of that day. Additionally, the previous day's exercises may shortly be touched upon. The remainder of the day will be spent on solving exercises either individually or in small groups. For some people the exercises may be fast to do and for others it will take the whole day. We will provide help throughout the day. We will try to answer questions on slack but help will be prioritized for students physically on campus.

            Project days are intended for project work and you are therefore responsible for making an agreement with your group on when and where you are going to work. On the first project day there will be a lecture at 9:00 with project information. On other project days we may also start the day with an external lecture, which we highly recommend that you participate in. During each project day we will have office hours where you can ask questions about the project.

            Below is an overall timeplan for each day, including the presentation topic of the day and the frameworks that you will be using in the exercises.

            Legend: \ud83d\udcdd Slides, \ud83c\udfa5 Recording.

            Note

            Current dates listed below are for the January 2024 version of the course. The lectures and recordings are currently from the January 2023 version of the course. Please note that for January 2024, the first week starts on a Tuesday and ends on a Saturday.

            "},{"location":"timeplan/#week-1","title":"Week 1","text":"

            In the first week you will be introduced to a number of development practices for organizing and developing code, especially with a focus on making everything reproducible.

            Date Day Presentation topic Frameworks Format
            2/1 Tuesday Deep learning software \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) Terminal, Conda, IDE, Pytorch Exercises
            3/1 Wednesday MLOps: what is it? \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2023) Git, CookieCutter, Pep8, DVC Exercises
            4/1 Thursday Reproducibility \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) Docker, Hydra Exercises
            5/1 Friday Debugging \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) Debugger, Profiler, Wandb, Lightning Exercises
            6/1 Saturday Pytorch ecosystem \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) - Projects"},{"location":"timeplan/#week-2","title":"Week 2","text":"

            The second week is about automatization and the cloud. Automatization will help us make sure that our code does not break when we make changes to it. The cloud will help us scale up our applications and we will learn how to use different services to help develop a full machine learning pipeline.

            Date Day Presentation topic Frameworks Format
            8/1 Monday Continuous Integration \ud83d\udcdd \ud83c\udfa5 Pytest, Github actions, Pre-commit, CML Exercises
            9/1 Tuesday The Cloud \ud83d\udcdd \ud83c\udfa5 GCP Engine, Bucket, Container registry, Vertex AI Exercises
            10/1 Wednesday Deployment \ud83d\udcdd \ud83c\udfa5 FastAPI, Torchserve, GCP Functions, Run Exercises
            11/1 Thursday No lecture \ud83c\udfa5 - Projects
            12/1 Friday No lecture \ud83c\udfa5 - Projects"},{"location":"timeplan/#week-3","title":"Week 3","text":"

            For the final week we look into advanced topics such as monitoring and scaling of applications. Monitoring is especially important for the longevity of the applications that we develop: it ensures that we can actually deploy them either locally or in the cloud and that we have the tools to monitor how they behave over time. Scaling of applications is an important topic if we ever want our applications to be used by many people at the same time.

            Date Day Presentation topic Frameworks Format
            15/1 Monday Monitoring \ud83d\udcdd \ud83c\udfa5 Evidently AI, OpenTelemetry, Signoz Exercises
            16/1 Tuesday Scalable applications \ud83d\udcdd \ud83c\udfa5 Pytorch, Lightning Exercises
            17/1 Wednesday - - Projects
            18/1 Thursday - - Projects
            19/1 Friday - - Exam"},{"location":"reports/","title":"Exam template for 02476 Machine Learning Operations","text":"

            This is the report template for the exam. Please only remove the text formatted with three dashes in front and behind like:

            --- question 1 fill here ---

            where you instead should add your answers. Any other changes may have unwanted consequences when your report is auto-generated at the end of the course. For questions where you are asked to include images, start by adding the image to the figures subfolder (please only use .png, .jpg or .jpeg) and then add the following code in your answer:

            ![my_image](figures/<image>.<extension>)\n

            In addition to this markdown file, we also provide the report.py script that provides two utility functions:

            Running:

            python report.py html\n

            will generate an .html page of your report. After the deadline for answering this template, we will autoscrape everything in this reports folder and then use this utility to generate an .html page that will serve as your final hand-in.

            Running

            python report.py check\n

            will check your answers in this template against the constraints listed for each question, e.g. is your answer too short, too long, or have you included an image when asked to.

            For both functions to work it is important that you do not rename anything. The script has two dependencies that can be installed with pip install click markdown.

            "},{"location":"reports/#overall-project-checklist","title":"Overall project checklist","text":"

            The checklist is exhaustive, which means that it includes everything that you could possibly do on the project in relation to the curriculum in this course. Therefore, we do not expect at all that you have checked off all boxes at the end of the project.

            "},{"location":"reports/#week-1","title":"Week 1","text":"
            • Create a git repository
            • Make sure that all team members have write access to the github repository
            • Create a dedicated environment for your project to keep track of your packages
            • Create the initial file structure using cookiecutter
            • Fill out the make_dataset.py file such that it downloads whatever data you need and
            • Add a model file and a training script and get that running
            • Remember to fill out the requirements.txt file with whatever dependencies that you are using
            • Remember to comply with good coding practices (pep8) while doing the project
            • Do a bit of code typing and remember to document essential parts of your code
            • Setup version control for your data or part of your data
            • Construct one or multiple docker files for your code
            • Build the docker files locally and make sure they work as intended
            • Write one or multiple configurations files for your experiments
            • Used Hydra to load the configurations and manage your hyperparameters
            • When you have something that works somewhat, remember at some point to do some profiling and see if you can optimize your code
            • Use Weights & Biases to log training progress and other important metrics/artifacts in your code. Additionally, consider running a hyperparameter optimization sweep.
            • Use Pytorch-lightning (if applicable) to reduce the amount of boilerplate in your code
            "},{"location":"reports/#week-2","title":"Week 2","text":"
            • Write unit tests related to the data part of your code
            • Write unit tests related to model construction and or model training
            • Calculate the coverage.
            • Get some continuous integration running on the github repository
            • Create a data storage in GCP Bucket for your data and preferably link this with your data version control setup
            • Create a trigger workflow for automatically building your docker images
            • Get your model training in GCP using either the Engine or Vertex AI
            • Create a FastAPI application that can do inference using your model
            • If applicable, consider deploying the model locally using torchserve
            • Deploy your model in GCP using either Functions or Run as the backend
            "},{"location":"reports/#week-3","title":"Week 3","text":"
            • Check how robust your model is towards data drifting
            • Setup monitoring for the system telemetry of your deployed model
            • Setup monitoring for the performance of your deployed model
            • If applicable, play around with distributed data loading
            • If applicable, play around with distributed model training
            • Play around with quantization, compilation and pruning for your trained models to increase inference speed
            "},{"location":"reports/#additional","title":"Additional","text":"
            • Revisit your initial project description. Did the project turn out as you wanted?
            • Make sure all group members have an understanding about all parts of the project
            • Uploaded all your code to github
            "},{"location":"reports/#group-information","title":"Group information","text":""},{"location":"reports/#question-1","title":"Question 1","text":"

            Enter the group number you signed up on

            Answer:

            --- question 1 fill here ---

            "},{"location":"reports/#question-2","title":"Question 2","text":"

            Enter the study number for each member in the group

            Example:

            sXXXXXX, sXXXXXX, sXXXXXX

            Answer:

            --- question 2 fill here ---

            "},{"location":"reports/#question-3","title":"Question 3","text":"

            What framework did you choose to work with and did it help you complete the project?

            Answer length: 100-200 words.

            Example: We used the third-party framework ... in our project. We used functionality ... and functionality ... from the package to do ... and ... in our project.

            Answer:

            --- question 3 fill here ---

            "},{"location":"reports/#coding-environment","title":"Coding environment","text":"

            In the following section we are interested in learning more about your local development environment.

            "},{"location":"reports/#question-4","title":"Question 4","text":"

            Explain how you managed dependencies in your project. Explain the process a new team member would have to go through to get an exact copy of your environment.

            Answer length: 100-200 words

            Example: We used ... for managing our dependencies. The list of dependencies was auto-generated using ... . To get a complete copy of our development environment, one would have to run the following commands

            Answer:

            --- question 4 fill here ---

            "},{"location":"reports/#question-5","title":"Question 5","text":"

            We expect that you initialized your project using the cookiecutter template. Explain the overall structure of your code. Did you fill out every folder or only a subset?

            Answer length: 100-200 words

            Example: From the cookiecutter template we have filled out the ... , ... and ... folder. We have removed the ... folder because we did not use any ... in our project. We have added an ... folder that contains ... for running our experiments.

            Answer:

            --- question 5 fill here ---

            "},{"location":"reports/#question-6","title":"Question 6","text":"

            Did you implement any rules for code quality and format? Additionally, explain in your own words why these concepts matter in larger projects.

            Answer length: 50-100 words.

            Answer:

            --- question 6 fill here ---

            "},{"location":"reports/#version-control","title":"Version control","text":"

            In the following section we are interested in how version control was used in your project during development to collaborate and increase the quality of your code.

            "},{"location":"reports/#question-7","title":"Question 7","text":"

            How many tests did you implement and what are they testing in your code?

            Answer length: 50-100 words.

            Example: In total we have implemented X tests. Primarily we are testing ... and ... as these are the most critical parts of our application, but also ... .

            Answer:

            --- question 7 fill here ---

            "},{"location":"reports/#question-8","title":"Question 8","text":"

            What is the total code coverage (in percentage) of your code? If your code had a code coverage of 100% (or close to), would you still trust it to be error free? Explain your reasoning.

            Answer length: 100-200 words.

            Example: The total code coverage of our code is X%, which includes all our source code. We are far from 100% coverage of our code and even if we were then...

            Answer:

            --- question 8 fill here ---

            "},{"location":"reports/#question-9","title":"Question 9","text":"

            Did your workflow include using branches and pull requests? If yes, explain how. If not, explain how branches and pull requests can help improve version control.

            Answer length: 100-200 words.

            Example: We made use of both branches and PRs in our project. In our group, each member had a branch that they worked on in addition to the main branch. To merge code we ...

            Answer:

            --- question 9 fill here ---

            "},{"location":"reports/#question-10","title":"Question 10","text":"

            Did you use DVC for managing data in your project? If yes, then how did it improve your project to have version control of your data? If no, explain a case where it would be beneficial to have version control of your data.

            Answer length: 100-200 words.

            Example: We did make use of DVC in the following way: ... . In the end it helped us in ... for controlling ... part of our pipeline

            Answer:

            --- question 10 fill here ---

            "},{"location":"reports/#question-11","title":"Question 11","text":"

            Discuss your continuous integration setup. What kind of CI are you running (unit testing, linting, etc.)? Do you test multiple operating systems, Python versions, etc.? Do you make use of caching? Feel free to insert a link to one of your github actions workflows.

            Answer length: 200-300 words.

            Example: We have organized our CI into 3 separate files: one for doing ..., one for running ... testing and one for running ... . In particular for our ..., we used ... . An example of a triggered workflow can be seen here:

            Answer:

            --- question 11 fill here ---

            "},{"location":"reports/#running-code-and-tracking-experiments","title":"Running code and tracking experiments","text":"

            In the following section we are interested in learning more about the experimental setup for running your code and especially the reproducibility of your experiments.

            "},{"location":"reports/#question-12","title":"Question 12","text":"

            How did you configure experiments? Did you make use of config files? Explain with coding examples how you would run an experiment.

            Answer length: 50-100 words.

            Example: We used a simple argparser, that worked in the following way: python my_script.py --lr 1e-3 --batch_size 25

            Answer:

            --- question 12 fill here ---

            "},{"location":"reports/#question-13","title":"Question 13","text":"

            Reproducibility of experiments is important. Related to the last question, how did you ensure that no information is lost when running experiments and that your experiments are reproducible?

            Answer length: 100-200 words.

            Example: We made use of config files. Whenever an experiment is run the following happens: ... . To reproduce an experiment one would have to do ...

            Answer:

            --- question 13 fill here ---

            "},{"location":"reports/#question-14","title":"Question 14","text":"

            Upload 1 to 3 screenshots that show the experiments that you have done in W&B (or another experiment tracking service of your choice). This may include loss graphs, logged images, hyperparameter sweeps etc. You can take inspiration from this figure. Explain what metrics you are tracking and why they are important.

            Answer length: 200-300 words + 1 to 3 screenshots.

            Example: As seen in the first image we have tracked ... and ... which both inform us about ... in our experiments. As seen in the second image we are also tracking ... and ...

            Answer:

            --- question 14 fill here ---

            "},{"location":"reports/#question-15","title":"Question 15","text":"

            Docker is an important tool for creating containerized applications. Explain how you used docker in your experiments. Include how you would run your docker images and include a link to one of your docker files.

            Answer length: 100-200 words.

            Example: For our project we developed several images: one for training, inference and deployment. For example to run the training docker image: docker run trainer:latest lr=1e-3 batch_size=64. Link to docker file:

            Answer:

            --- question 15 fill here ---

            "},{"location":"reports/#question-16","title":"Question 16","text":"

            When running into bugs while trying to run your experiments, how did you perform debugging? Additionally, did you try to profile your code or do you think it is already perfect?

            Answer length: 100-200 words.

            Example: Debugging method was dependent on group member. Some just used ... and others used ... . We did a single profiling run of our main code at some point that showed ...

            Answer:

            --- question 16 fill here ---

            "},{"location":"reports/#working-in-the-cloud","title":"Working in the cloud","text":"

            In the following section we would like to know more about your experience when developing in the cloud.

            "},{"location":"reports/#question-17","title":"Question 17","text":"

            List all the GCP services that you made use of in your project and shortly explain what each service does.

            Answer length: 50-200 words.

            Example: We used the following two services: Engine and Bucket. Engine is used for... and Bucket is used for...

            Answer:

            --- question 17 fill here ---

            "},{"location":"reports/#question-18","title":"Question 18","text":"

            The backbone of GCP is the Compute Engine. Explain how you made use of this service and what type of VMs you used.

            Answer length: 100-200 words.

            Example: We used the compute engine to run our ... . We used instances with the following hardware: ... and we started them using a custom container: ...

            Answer:

            --- question 18 fill here ---

            "},{"location":"reports/#question-19","title":"Question 19","text":"

            Insert 1-2 images of your GCP bucket, such that we can see what data you have stored in it. You can take inspiration from this figure.

            Answer:

            --- question 19 fill here ---

            "},{"location":"reports/#question-20","title":"Question 20","text":"

            Upload one image of your GCP container registry, such that we can see the different images that you have stored. You can take inspiration from this figure.

            Answer:

            --- question 20 fill here ---

            "},{"location":"reports/#question-21","title":"Question 21","text":"

            Upload one image of your GCP cloud build history, so we can see the history of the images that have been built in your project. You can take inspiration from this figure.

            Answer:

            --- question 21 fill here ---

            "},{"location":"reports/#question-22","title":"Question 22","text":"

            Did you manage to deploy your model, either locally or in the cloud? If not, describe why. If yes, describe how and preferably how you invoke your deployed service.

            Answer length: 100-200 words.

            Example: For deployment we wrapped our model into an application using ... . We first tried locally serving the model, which worked. Afterwards we deployed it in the cloud, using ... . To invoke the service a user would call curl -X POST -F \"file=@file.json\"<weburl>

            Answer:

            --- question 22 fill here ---

            "},{"location":"reports/#question-23","title":"Question 23","text":"

            Did you manage to implement monitoring of your deployed model? If yes, explain how it works. If not, explain how monitoring would help the longevity of your application.

            Answer length: 100-200 words.

            Example: We did not manage to implement monitoring. We would like to have monitoring implemented such that over time we could measure ... and ... that would inform us about this ... behaviour of our application.

            Answer:

            --- question 23 fill here ---

            "},{"location":"reports/#question-24","title":"Question 24","text":"

            How many credits did you end up using during the project and what service was most expensive?

            Answer length: 25-100 words.

            Example: Group member 1 used ..., Group member 2 used ..., in total ... credits were spent during development. The service costing the most was ... due to ...

            Answer:

            --- question 24 fill here ---

            "},{"location":"reports/#overall-discussion-of-project","title":"Overall discussion of project","text":"

            In the following section we would like you to think about the general structure of your project.

            "},{"location":"reports/#question-25","title":"Question 25","text":"

            Include a figure that describes the overall architecture of your system and which services you make use of. You can take inspiration from this figure. Additionally, in your own words, explain the overall steps in the figure.

            Answer length: 200-400 words

            Example:

            The starting point of the diagram is our local setup, where we integrated ... and ... and ... into our code. Whenever we commit code and push to github, it automatically triggers ... and ... . From there the diagram shows ...

            Answer:

            --- question 25 fill here ---

            "},{"location":"reports/#question-26","title":"Question 26","text":"

            Discuss the overall struggles of the project. Where did you spend the most time and what did you do to overcome these challenges?

            Answer length: 200-400 words.

            Example: The biggest challenge in the project was using ... tool to do ... . The reason for this was ...

            Answer:

            --- question 26 fill here ---

            "},{"location":"reports/#question-27","title":"Question 27","text":"

            State the individual contributions of each team member. This is required information from DTU, because we need to make sure all members contributed actively to the project

            Answer length: 50-200 words.

            Example: Student sXXXXXX was in charge of setting up the initial cookie cutter project and developing the docker containers for training our applications. Student sXXXXXX was in charge of training our models in the cloud and deploying them afterwards. All members contributed to code by...

            Answer:

            --- question 27 fill here ---

            "},{"location":"s10_extra/","title":"Extra learning modules","text":"

            The modules listed here are not part of the core course, but expand on some of the other topics. Some of them may still be under construction and may in the future be moved into other sessions.

            "},{"location":"s10_extra/cli/","title":"M30 - Command Line Interfaces","text":""},{"location":"s10_extra/cli/#command-line-interfaces","title":"Command line interfaces","text":"

            If you have worked with Python for some time, you are probably familiar with the argparse package, which allows you to directly pass additional arguments to your script in the terminal

            python my_script.py --arg1 val1 --arg2 val2\n

            argparse is a very simple way of constructing what is called a command line interface (CLI). A CLI allows you to interact with your application directly in the terminal instead of having to change things in your code. It is essentially a text-based user interface (UI), in contrast to the graphical user interfaces (GUIs) that we know from all our desktop applications.
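
            To make this concrete, here is a minimal sketch of what such a script could look like (the argument names simply mirror the usage example above and are otherwise arbitrary):

            import argparse\n\n# minimal sketch of a script exposing --arg1 and --arg2 on the command line\nparser = argparse.ArgumentParser(description=\"My script\")\nparser.add_argument(\"--arg1\", type=str, default=\"val1\")\nparser.add_argument(\"--arg2\", type=str, default=\"val2\")\nargs = parser.parse_args()  # parses e.g. --arg1 foo --arg2 bar\nprint(args.arg1, args.arg2)\n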

            However, one limitation of argparse is that it is not easy to define a CLI with subcommands. If we take git as an example, git is the main command but it has multiple subcommands: push, pull, commit etc. that can all take their own arguments. This kind of CLI with subcommands is somewhat possible to do using only argparse, however it requires a bit of hacking.

            You could of course ask why we would want to define such a CLI at all. The main argument is to give users of our code a single entrypoint to interact with our application instead of having multiple scripts. As long as all subcommands are properly documented, our interface should be simple to interact with (again think of git, where each subcommand can be given the -h arg to get specific help).

            Instead of using argparse we are here going to look at the click package. click extends the functionalities of argparse to allow for easy definition of subcommands and many other things, which we are not going to touch upon in this module. For completeness we should also mention that click is not the only package for doing this; among other excellent frameworks for easily creating command line interfaces we can mention Typer.

            "},{"location":"s10_extra/cli/#exercises","title":"\u2754 Exercises","text":"

            Exercise files

            1. Install click

              pip install click\n
            2. Create a new python file greetings.py and add the following code:

              import click\n\n@click.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef hello(count, name):\n    \"\"\"Simple program that greets NAME for a total of COUNT times.\"\"\"\n    for x in range(count):\n        click.echo(f\"Hello {name}!\")\n\nif __name__ == '__main__':\n    hello()\n

              try running the program in the following ways

              python greetings.py\npython greetings.py --count=3\npython greetings.py --help\n
            3. Make sure you understand what the click.command() and click.option decorators do. You can find the full API docs here.

            4. As stated above, the power of using a tool like click is due to its ability to define subcommands. In click this is done through the click.group() decorator. To the code example from above, add another command:

              @click.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef howdy(count, name):\n    for x in range(count):\n        click.echo(f\"Howdy {name}!\")\n

              and by using the click.group() decorator turn these commands into subcommands, such that you would be able to call the script in the following way (a hedged sketch of one possible solution is included at the end of these exercises)

              python greetings.py hello\npython greetings.py howdy\n
            5. As a final exercise we provide you with a script that is ready to run as it is, but your job will be to turn it into a script with multiple subcommands, with multiple arguments for each subcommand.

              1. Start by taking a look at the provided code. It is a simple script that runs the K-nearest neighbour classification algorithm on the iris dataset and produces a plot of the decision boundary.

              2. Create a script that has the following subcommands with input arguments

                • Subcommand train: Load data, train model and save. Should take a single argument -o that specifies the filename the trained model should be saved to.
                • Subcommand infer: Load trained model and run prediction on input data. Should take two arguments: -i that specifies which trained model to load and -d to specify a user defined datapoint to run inference on.
                • Subcommand plot: Load trained model and construct the decision boundary plot from the code. Should take two arguments: -i that specifies a trained model to load and -o the file to write the generated plot to
                • Subcommand optim: Load data, run hyperparameter optimization and print optimal parameters. Should take at least a single argument that in some way adjusts the hyperparameter optimization (free to choose how)

                In the end we would like the script to be callable in the following ways

                python main.py train -o 'model.ckpt'\npython main.py infer -i 'model.ckpt' -d [[0,1]]\npython main.py plot -i 'model.ckpt' -o 'generated_plot.png'\npython main.py optim\n
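
            For reference, a minimal sketch of the click.group() pattern from exercise 4 above (one possible solution, not the only one) could look like this:

            import click\n\n@click.group()\ndef cli():\n    \"\"\"Entrypoint that collects all subcommands.\"\"\"\n\n@cli.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef hello(count, name):\n    for _ in range(count):\n        click.echo(f\"Hello {name}!\")\n\n@cli.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef howdy(count, name):\n    for _ in range(count):\n        click.echo(f\"Howdy {name}!\")\n\nif __name__ == '__main__':\n    cli()  # e.g. python greetings.py hello --count=3\n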
            "},{"location":"s10_extra/design/","title":"Designing MLOps pipelines","text":"

            Danger

            Module is still under development

            \"Machine learning engineering is 10% machine learning and 90% engineering.\" - Chip Huyen

            We highly recommend that you read the book Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications by Chip Huyen, which gives a fantastic overview of the thought processes that go into designing modern machine learning systems.

            "},{"location":"s10_extra/design/#the-stack","title":"The stack","text":"

            Have you ever encountered the concept of a full stack developer? A full stack developer is a developer who can develop both client and server software or, in more general terms, a developer who can take care of the complete development pipeline.

            Below is an image of the massive amount of tools that exist under the MLOps umbrella.

            "},{"location":"s10_extra/design/#visualizing-the-design","title":"Visualizing the design","text":""},{"location":"s10_extra/documentation/","title":"M31 - Documentation","text":""},{"location":"s10_extra/documentation/#documentation","title":"Documentation","text":"

            In today's rapidly evolving software development landscape, effective documentation is a crucial component of any project. The ability to create clear, concise, and user-friendly technical documentation can make a significant difference in the success of your codebase. We have all probably encountered code that we wanted to use, only to abandon it because it was missing the documentation needed to get started with it.

            Technical documentation or code documentation can be many things:

            • Plain text, images and videos explaining core concepts for your software
            • Documentation of the API: how to call a function or class, what the different parameters are, etc.
            • Code examples of how to use certain functionality

            and many more. We are in this module going to focus on setting up a very basic documentation system that will automatically help you document the API of your code. For this reason we recommend that, before continuing with this module, you have completed module M7 on good coding practices or have similar experience with writing docstrings for Python functions and classes.
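
            As a small reminder, a minimal sketch of such a docstring (in the Google style also used in the code examples later in this module) could look like this:

            def add(a: int, b: int) -> int:\n    \"\"\"Add two numbers.\n\n    Args:\n        a: first number\n        b: second number\n\n    Returns:\n        The sum of a and b.\n\n    \"\"\"\n    return a + b\n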

            There are different systems for writing documentation. In fact there are many to choose from:

            • MkDocs
            • Sphinx
            • GitBook
            • Docusaurus
            • Doxygen
            • Jekyll

            It is important to note that all of these are static site generators. The word static here refers to the fact that when the content is generated and served on a website, the underlying HTML code does not change. It may contain HTML elements that are dynamic (like video), but the site itself does not change (1).

            1. Good examples of dynamic sites are any social media or news media where new posts, pages etc. are constantly added over time. Good examples of static sites are documentation, blogposts etc.

            We are in this module going to look at Mkdocs, which (in my opinion) is one of the easiest systems to get started with because all documentation is written in markdown and the build system is written in Python. As an alternative, you can consider doing the exercises in Sphinx, which is probably the most used documentation system for Python code. Sphinx offers more customization than Mkdocs, so it is generally preferred for larger projects with complex documentation, but for smaller projects Mkdocs should be easier to get started with and is sufficient.

            Mkdocs by default does not include many features and for that reason we are directly going to dive into using the material for mkdocs theme that provides a lot of nice customization to create professional static sites. In fact, this whole course is written in mkdocs using the material theme.

            "},{"location":"s10_extra/documentation/#mkdocs","title":"Mkdocs","text":"

            The core file when using mkdocs is the mkdocs.yml file, which is the configuration file for the project:

            site_name: Documentation of my project\nsite_author: Jane Doe\ndocs_dir: source # (1)!\n\ntheme:\n    language: en\n    name: material # (2)!\n    features: # (3)!\n    - content.code.copy\n    - content.code.annotate\n\nplugins: # (4)!\n    - search\n    - mkdocstrings\n\nnav: # (5)!\n  - Home: index.md\n
            1. This indicates the source directory of our documentation. If the layout of your documentation is a bit different from what is described above, you may need to change this.

            2. The overall theme of your documentation. We recommend the material theme but there are many more to choose from and you can also create your own.

            3. The features section is where features that are supported by your given theme can be enabled. In this example we have enabled the content.code.copy feature, which adds a small copy button to all code blocks, and the content.code.annotate feature, which allows you to add annotations like this box to code blocks.

            4. Plugins add new functionality to your documentation. In this case we have added two plugins that add functionality for searching through our documentation and automatically adding documentation from docstrings. Remember that some plugins require you to install additional Python packages, so remember to add them to your requirements.txt file.

            5. The nav section is where you define the navigation structure of your documentation. When you add new .md files to the source folder you then need to add them to the nav section.

            And that is more or less what you need to get started. In general, if you need help with configuration of your documentation in mkdocs I recommend looking at this page and this page.

            "},{"location":"s10_extra/documentation/#exercises","title":"Exercises","text":"

            In this set of exercises we assume that you have completed module M6 on code structure and therefore have a repository that at least contains the following:

            \u251c\u2500\u2500 pyproject.toml     <- Project configuration file with package metadata\n\u2502\n\u251c\u2500\u2500 docs               <- Documentation folder\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 index.md       <- Homepage for your documentation\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 mkdocs.yml     <- Configuration file for mkdocs\n\u2502   \u2502\n\u2502   \u2514\u2500\u2500 source/        <- Source directory for documentation files\n\u2502\n\u2514\u2500\u2500 src                <- Source code for use in this project.\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 __init__.py    <- Makes src a Python module\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 models         <- model implementations, training script\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u251c\u2500\u2500 model.py\n\u2502   \u2502   \u251c\u2500\u2500 train_model.py\n...\n

            It is not important exactly what is in the src folder for the exercises, but we are going to refer to the above structure in the exercises, so adjust accordingly if you deviate from this. Additionally, we are going to assume that your project code is installed in your environment such that it can be imported as normal Python code.

            1. We are going to need two python packages to get started: mkdocs and material for mkdocs. Install with

              pip install \"mkdocs-material >= 4.8.0\" # (1)!\n
              1. Since mkdocs is a dependency of mkdocs-material we only need to install the latter.
            2. Run in your terminal (from the docs folder):

              mkdocs serve # (1)!\n
              1. mkdocs serve will automatically rebuild the whole site whenever you save a file inside the docs folder. This is not a problem if you have a fairly small site with not that many pages (or elements), but it can take a long time for large sites. Consider running with the --dirty option to only re-build the parts of the site for files that have changed.

              which should render the index.md file as the homepage. You can leave the documentation server running during the remaining exercises.

            3. We are now ready to document the API of our code:

              1. Make sure you have at least one function and one class inside your src module. If you do not, you can for simplicity copy the following module into the src/models/model.py file

                import torch\n\nclass MyNeuralNet(torch.nn.Module):\n    \"\"\"Basic neural network class.\n\n    Args:\n        in_features: number of input features\n        out_features: number of output features\n\n    \"\"\"\n    def __init__(self, in_features: int, out_features: int) -> None:\n        super().__init__()  # needed before assigning submodules\n        self.l1 = torch.nn.Linear(in_features, 500)\n        self.l2 = torch.nn.Linear(500, out_features)\n        self.r = torch.nn.ReLU()\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass of the model.\n\n        Args:\n            x: input tensor expected to be of shape [N,in_features]\n\n        Returns:\n            Output tensor with shape [N,out_features]\n\n        \"\"\"\n        return self.l2(self.r(self.l1(x)))\n

                and add the following function to the src/predict_model.py file:

                import torch\n\n\ndef predict(\n    model: torch.nn.Module,\n    dataloader: torch.utils.data.DataLoader\n) -> list[torch.Tensor]:\n    \"\"\"Run prediction for a given model and dataloader.\n\n    Args:\n        model: model to use for prediction\n        dataloader: dataloader with batches\n\n    Returns:\n        A list of tensors, one per batch, each of shape [batch_size, d] where d is the output dimension of the model\n\n    \"\"\"\n    return [model(batch) for batch in dataloader]\n
              2. Add a markdown file to the docs/source folder called my_api.md and add that file to the nav: section in the mkdocs.yaml file.

              3. To that file add the following code:

                # My API\n\n::: src.models.model.MyNeuralNet\n\n::: src.predict_model.predict\n

                The ::: indicator tells mkdocs that it should look for the corresponding function/module and then render it on the given page. Thus, if your function/module is located somewhere else, change the paths accordingly.

              4. Make sure that the documentation correctly includes your function and module on the given page.

              5. (Optional) Include more functions/modules in your documentation.

            4. (Optional) Look through the documentation for mkdocstrings and try to improve the layout a bit. Especially, the headings, docstrings and signatures could be of interest to adjust.

            5. Finally, try to build a final version of your documentation

              mkdocs build\n

              this should result in a site folder that contains the actual HTML code for documentation.

            "},{"location":"s10_extra/documentation/#publish-your-documentation","title":"Publish your documentation","text":"

            To publish your documentation you need a place to host your built documentation, e.g. the content of the site folder you built in the last exercise. There are many places to host your documentation, but if you only need a static site and are already hosting your code through Github, then a good option is Github Pages. Github Pages is free to use for your public projects.

            Before getting started with this set of exercises you should have completed module M16 on github actions so you already know about workflow files.

            "},{"location":"s10_extra/documentation/#exercises_1","title":"Exercises","text":"
            1. Start by adding a new file called deploy_docs.yaml to the .github/workflows folder. Add the following code to that file and save it.

              name: Deploy docs\n\non:\n    push:\n        branches:\n            - main\n\npermissions:\n    contents: write # (1)\n\njobs:\n    deploy:\n        runs-on: ubuntu-latest\n        steps:\n            - uses: actions/checkout@v3\n              with:\n                fetch-depth: 0\n            - uses: actions/setup-python@v4\n              with:\n                python-version: \"3.10\"\n            - uses: actions/cache@v2\n              with:\n                key: ${{ github.ref }}\n                path: .cache\n            - run: pip install -r requirements.txt\n            - run: mkdocs gh-deploy --force\n
              1. It is important to give write permissions to this action because it is not only reading your code but will actually also push code.

              Before continuing, make sure you understand what the different steps of the workflow do; in particular we recommend looking at the documentation of the mkdocs gh-deploy command.

            2. Commit and push the file. Check that the action is executed and, if it succeeds, that your built project is pushed to a branch called gh-pages. If the action does not succeed, then figure out what is wrong and fix it!

            3. After confirming that our action is working, you need to configure Github to actually publish the content being built by Github Actions. Do the following:

              • Go to the Settings tab and then the Pages subsection
              • In the Source setting choose the Deploy from a branch
              • In the Branch setting choose the gh-pages branch and /(root) folder and save

              This should then start deploying your site to https://<your-username>.github.io/<your-reponame>/. If it does not do this you may need to recommit and trigger the github actions build again.

            4. Make sure your documentation is published and looks as it should.

            This ends the module on technical documentation. We cannot stress enough how important it is to write proper documentation for larger projects that need to be maintained over a longer time. It is often an iterative process, but it is best to do it while writing the code.

            "},{"location":"s10_extra/frontend/","title":"Frontend","text":"

            Danger

            Module is still under development

            "},{"location":"s10_extra/frontend/#streamlit","title":"Streamlit","text":"

            streamlit

            "},{"location":"s10_extra/frontend/#exercises","title":"\u2754 Exercises","text":"
            1. Start by installing streamlit
            pip install streamlit\n

            and run streamlit hello afterwards to check that everything works as expected.
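
            Since the rest of this module is still under development, here is a minimal, hedged sketch of what a small streamlit app could look like (the file name app.py is arbitrary); you would run it with streamlit run app.py:

            import streamlit as st\n\n# minimal sketch of a streamlit frontend, saved as e.g. app.py\nst.title(\"My MLOps frontend\")\nname = st.text_input(\"What is your name?\")\nif name:\n    st.write(f\"Hello {name}!\")\n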

            "},{"location":"s10_extra/high_performance_clusters/","title":"M33 - High Performance Clusters","text":""},{"location":"s10_extra/high_performance_clusters/#high-performance-clusters","title":"High Performance Clusters","text":"

            As discussed in the intro session on the cloud, cloud providers offer near infinite compute resources. However, using these resources often comes at a hefty price and it is therefore important to be aware of another resource many have access to: High Performance Clusters or HPC. HPCs exist all over the world, and many times you already have access to one or can easily get access to one. If you are a university student you most likely have a local HPC that you can access through your institution. Otherwise, there exist public HPC resources that everybody (with a project) can apply for. As an example, in the EU we have the EuroHPC initiative that currently has 8 different supercomputers with a centralized location for applying for resources, open to both research projects and start-ups.

            Depending on your application, you may have different needs and it is therefore important to also be aware of the different tiers of HPC. In Europe, HPCs are often categorized such that Tier 0 are European centers with petaflop or exascale machines, Tier 1 are national centers of supercomputers, and Tier 2 are regional centers. The lower the tier, the larger the applications it is possible to run.

            Image credit"},{"location":"s10_extra/high_performance_clusters/#cluster-architectures","title":"Cluster architectures","text":"

            In very general terms, clusters come as two different kinds of systems: supercomputers and LSF (Load Sharing Facility) systems. A supercomputer (as shown below) is organized into different modules that are separated by network links. When you log in to a supercomputer you will meet the front end, which contains all the software needed to run computations. When you submit a job it will get sent to the backend modules, which in most cases include: general compute modules (CPU), acceleration modules (GPU), a memory module (RAM) and finally a storage module (HDD). Depending on your application you may need one module more than another. For example, in deep learning the acceleration module is important, but in physics simulations the general compute module / storage module is probably more important.

            Overview of the Meluxina supercomputer that's part of EuroHPC. Image credit

            Alternatively, an LSF system is a network of computers where each computer has its own CPU, GPU, RAM etc. and the individual computers (or nodes) are then connected by network. The important difference between a supercomputer and an LSF system is how the resources are organized. When comparing supercomputers to LSF systems, it is generally better to run on an LSF system if you are only requesting resources that can be handled by a single node, whereas it is better to run on a supercomputer if you have a resource intensive application that requires many devices to communicate with each other.

            Regardless of the cluster architecture, on the software side of HPC the most important part is what's called the HPC scheduler. Without an HPC scheduler, an HPC cluster would just be a bunch of servers with different jobs interfering with each other. The problem is that when you have a large collection of resources and a large collection of users, you cannot rely on the users just running their applications without interfering with each other. An HPC scheduler makes sure that whenever a user requests to run an application, the request gets put in a queue, and whenever the resources the application asks for are available, the application is run.

            The biggest batch control systems for doing scheduling on HPC are:

            • SLURM
            • MOAB HPC Suite
            • PBS Works

            We are going to take a look at PBS works as that is what is installed on our local university cluster.

            "},{"location":"s10_extra/high_performance_clusters/#exercises","title":"\u2754 Exercises","text":"

            Exercise files

            The following exercises are focused on local students at DTU that want to use our local HPC resources. That said, the steps in the exercises are fairly general to other types of clusters. For the purpose of this exercise we are going to see how we can run this image classifier script, but feel free to work with whatever application you want to.

            1. Start by accessing the cluster. This can either be through ssh in a terminal or, if you want a graphical interface, through thinlinc. In general we recommend that DTU students follow the steps here, as the setup depends on whether you are on campus or not.

            2. When you have access to the cluster we are going to start with the setup phase. In the setup phase we are going to set up the environment necessary for our computations. If you have accessed the cluster through the graphical interface, start by opening a terminal.

              1. Let's start by setting up conda for controlling our dependencies. If you have not already worked with conda, please check out module M2 on package managers and virtual environments. In general you should be able to set up (mini)conda through these two commands:

                wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nsh Miniconda3-latest-Linux-x86_64.sh\n
              2. Close the terminal and open a new one for the installation to complete. Type conda in the terminal to check that everything is fine. Go ahead and create a new environment that we can install dependencies in

                conda create -n \"hpc_env\" python=3.10 --no-default-packages\n

                and activate it.

              3. Copy over any files you need. For the image classifier script you need the requirements file and the actual application.

              4. Next, install all the requirements you need. If you want to run the image classifier script you can run this command in the terminal

                pip install -r image_classifier_requirements.txt\n

                using this requirements file.

            3. That's all the setup needed. You would need to go through the creation of an environment and installation of requirements whenever you start a new project (no need to reinstall conda). For the next step we need to look at how to submit jobs on the cluster. We are now ready to submit our first job to the cluster:

              1. Start by checking the statistics for the different clusters. Try to use the qstat command, which should give an overview of the different clusters, the number of running jobs and the number of pending jobs. For many systems you can also try the much more user-friendly classstat command.

              2. Figure out which queue you want to use. For the sake of the exercises it needs to be one with GPU support. For DTU students, any queue that starts with gpu is GPU accelerated.

              3. Now we are going to develop a bash script for submitting our job. We have provided an example of such a script. Take a careful look, go over each line and make sure you understand it. Afterwards, change it to your needs (queue and student email).

              4. Try to submit the script:

                bsub < jobscript.sh\n

                You can check the status of your script by running the bstat command. Hopefully, the job should go through really quickly. Take a look at the output file, it should be called something like gpu_*.out. Also take a look at the gpu_*.err file. Do both files look as they should?

            4. Let's now try to run our application on the cluster. To do that we need to take care of two things:

              1. First we need to load the correct version of CUDA. A cluster system often contains multiple versions of specific software to suit the needs of all its users, and it is the users who are in charge of loading the correct software during job submission. The only extra software that needs to be loaded for most Pytorch applications is a CUDA module. You can check which modules are available on the cluster with

                module avail\n

                Afterwards, add the correct CUDA version you need to the jobscript.sh file. If you are trying to run the provided image classifier script then the correct version is CUDA/11.7 (can be seen in the requirements file).

                # add to the bottom of the file\nmodule load cuda/11.7\n
              2. We are now ready to add in our application. The only thing we need to take care of is telling the system to run it using the python version that is connected to our hpc_env we created in the beginning. Try typing:

                which python\n

                which should give you the full path. Then add to the bottom of the jobscript file:

                ~/miniconda3/envs/hpc_env/bin/python \\\n    image_classifier.py \\\n    --trainer.accelerator 'gpu' --trainer.devices 1  --trainer.max_epochs 5\n

                which will run the image classifier script (change it if you are running something else).

              3. Finally submit the job:

                bsub < jobscript.sh\n

                and check when it is done that it has produced what you expected.

              4. (Optional) If your application supports multiple GPUs, also try that out. You would first need to change the jobscript to request multiple GPUs and additionally you would need to tell your application to run on multiple GPUs. For the image classifier script this can be done by changing the --trainer.devices flag to 2 (or higher).

            This ends the module on using HPC systems.

            "},{"location":"s10_extra/hyperparameters/","title":"M32 - Hyperparameter optimization","text":""},{"location":"s10_extra/hyperparameters/#hyperparameter-optimization","title":"Hyperparameter optimization","text":"

            Hyperparameter optimization is not a new idea within machine learning but has seen something of a renaissance with the rise of deep learning. This can mainly be attributed to the following:

            • Trying to beat state-of-the-art often comes down to very small differences in performance, and hyperparameter optimization can help squeeze out a bit more
            • Deep learning models are in general not that robust towards the choice of hyperparameters, so choosing the wrong set may lead to a model that does not work

            However, the problem with doing hyperparameter optimization of deep learning models is that it can take over a week to train a single model. In most cases we therefore cannot do a full grid search of all hyperparameter combinations to get the best model. Instead we have to use some tricks that will help us speed up our search. In these exercises we are going to be integrating optuna into our different models, which will provide the tools for speeding up our search.
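
            To give a feeling for what this looks like, here is a minimal, hedged sketch of an optuna objective for a random forest on the sklearn digits dataset (the hyperparameter names and ranges are just illustrative examples, not the ones required by the exercises below):

            import optuna\nfrom sklearn.datasets import load_digits\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import cross_val_score\n\ndef objective(trial: optuna.Trial) -> float:\n    # optuna suggests the hyperparameters for this trial\n    n_estimators = trial.suggest_int(\"n_estimators\", 10, 200)\n    max_depth = trial.suggest_int(\"max_depth\", 2, 32)\n    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)\n    x, y = load_digits(return_X_y=True)\n    return cross_val_score(model, x, y, cv=5).mean()  # 5-fold cross validation accuracy\n\nstudy = optuna.create_study(direction=\"maximize\")  # we want to maximize accuracy\nstudy.optimize(objective, n_trials=20)\nprint(study.best_params)\n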

            It should be noted that for a lot of deep learning models not every hyperparameter is optimized; instead one relies on heuristic guidelines (\"rules of thumb\") based on what seems to work in general, e.g. a learning rate of 0.01 seems to work great with the Adam optimizer. That said, these rules probably only apply to 80% of deep learning models, whereas for the last 20% the recommendations may be suboptimal. Here is a great site that has collected an extensive list of these recommendations, taken from the excellent deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

            In practice, I recommend trying to identify (through experimentation) which hyperparameters are important for the performance of your model and then spending your computational budget trying to optimize them while setting the rest to a \"recommended value\".

            "},{"location":"s10_extra/hyperparameters/#exercises","title":"\u2754 Exercises","text":"

            Exercise files

            1. Start by installing optuna: pip install optuna

            2. Initially we will look at the cross_validate.py file. It implements simple K-fold cross validation of a random forest on the sklearn digits dataset (a subset of MNIST). Look over the script and try to run it.

            3. We will now try to write the same code in optuna. Please note that the script has a variable OPTUNA=False that you can use to change what part of the code should run. The three main concepts of optuna are

              • A trial: a single experiment

              • A study: a collection of trials

              • The objective: function to determine how \"good\" a trial is

              Let's start by writing the objective function, which we have already started in the script. For now you do not need to care about the trial argument, just assume that it contains the hyperparameters needed to define your random forest model. The output of the objective function should be a single number that we want to optimize. (HINT: did you remember to do K-fold cross validation inside your objective function?)

            4. Next let's focus on the trial. Inside the objective function the trial should be used to suggest what parameters to use next. Take a look at the documentation for trial or take a look at the code examples and figure out how to define the hyperparameters of the model.

            5. Finally let's launch a study. It can be as simple as

              study = optuna.create_study()\nstudy.optimize(objective, n_trials=100)\n

              but let's play around a bit with it:

              1. By default the .optimize method will minimize the objective (by definition the optimum of an objective function is at its minimum). Is the score your objective function is returning something that should be minimized? If not, a simple solution is to put a - in front of the metric. However, look through the documentation on how to change the direction of the optimization.

              2. Optuna will by default do Bayesian optimization when sampling the hyperparameters (using an evolutionary algorithm for suggesting new trials). However, since this example is quite simple, we can actually perform a full grid search. How would you do this in Optuna?

              3. Compare the performance of a single optuna run using Bayesian optimization with n_trials=10 with an exhaustive grid search that searches through all hyperparameter combinations. What is the performance/time trade-off for these two solutions?

            6. In addition to doing Bayesian optimization, the other great part about Optuna is that it has native support for pruning unpromising trials. Pruning refers to stopping trials for hyperparameter combinations that do not seem to lead anywhere. You may have a learning rate that is so high that training diverges, or a neural network with so many parameters that it just overfits to the training data. This however begs the question: what constitutes an unpromising trial? This is up to you to define based on prior experimentation.

              1. Start by looking at the fashion_trainer.py script. It's a simple classification network for classifying images in the FashionMNIST dataset. Run the script with the default hyperparameters to get a feeling of how the training should progress. Note down the performance on the test set.

              2. Start by defining a validation set and a validation dataloader that we can use for hyperparameter optimization (HINT: use 5-10% of your training data).

              3. Now, adjust the script to use Optuna. The 5 hyperparameters listed below should at least be included in the hyperparameter search. For some we have already defined the search space, but for the remaining you need to come up with a good range of values to investigate. When done integrating optuna, run a small study (n_trials=3) to check that the code is working.

                Hyperparameters and their search spaces: Learning rate: 1e-6 to 1e0; Number of output features in the second last layer: ???; The amount of dropout to apply: ???; Batch size: ???; Use batch normalization or not: {True, False}; (Optional) Different activation functions: {nn.ReLU, nn.Tanh, nn.RReLU, nn.LeakyReLU, nn.ELU}
              4. If implemented correctly the number of hyperparameter combinations should be at least 1000, meaning that we not only need Bayesian optimization but probably also need pruning to succeed. Check out the page for built-in pruners in Optuna. Implement pruning in the script. I recommend using either the MedianPruner or the PercentilePruner (a hedged sketch of the report/prune pattern is included at the end of these exercises).

              5. Re-run the study using pruning with a large number of trials (n_trials>50)

              6. Take a look at this visualization page for ideas on how to visualize the study you just did. Make at least two visualizations of the study and make sure that you understand them.

              7. Pruning is great for better spending your computational budget, however it comes with a trade-off. What is it and what hyperparameter should one be especially careful about when using pruning?

              8. Finally, what parameter combination achieved the best performance? What is the test set performance for this set of parameters? Did you improve over the initial set of hyperparameters?

            7. The exercises until now have focused on doing the hyperparameter search sequentially, meaning that we test one set of parameters at a time. It is a fine approach because you can easily let it run for a week without any interaction. However, assuming that you have the computational resources to run in parallel, how do you do that?

              1. To run hyperparameter search in parallel we need a common database that all experiments can read and write to. We are going to use the recommended mysql. You do not have to understand what SQL is to complete this exercise, but it is basically a language (like python) for managing databases. Install mysql.

              2. Next we are going to initialize a database that we can read and write to. For this exercise we are going to focus on a locally stored database, but it could of course also be located in the cloud.

                mysql -u root -e \"CREATE DATABASE IF NOT EXISTS example\"\n

                you can also do this directly in python when calling the create_study command by also setting the storage and load_if_exists=True flags.

              3. Now we are going to create an Optuna study in our database

                optuna create-study --study-name \"distributed-example\" --storage \"mysql://root@localhost/example\"\n
              4. Change how you initialize the study to read and write to the database. Therefore, instead of doing

                study = optuna.create_study()\n

                then do

                study = optuna.load_study(\n    study_name=\"distributed-example\", storage=\"mysql://root@localhost/example\"\n)\n

                where the study_name and storage should match how the study was created.

              5. For running in parallel, you can either open up an extra terminal and simply launch your script once per open terminal, or you can use the provided parallel_lancher.py which will launch multiple executions of your script. It should be used as:

                python parallel_lancher.py myscript.py --num_parallel 2\n
              6. Finally, make sure that you can access the results
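
            Returning to the pruning exercise above, here is a minimal, hedged sketch of the report/prune pattern with a MedianPruner; the fake \"validation accuracy\" below simply stands in for the real FashionMNIST training loop from the script:

            import optuna\n\ndef objective(trial: optuna.Trial) -> float:\n    lr = trial.suggest_float(\"lr\", 1e-6, 1e0, log=True)\n    val_acc = 0.0\n    for epoch in range(10):\n        # stand-in for one epoch of training followed by validation\n        val_acc += 0.1 * (1.0 - abs(lr - 1e-2))\n        trial.report(val_acc, step=epoch)  # report the intermediate score\n        if trial.should_prune():  # the pruner stops unpromising trials early\n            raise optuna.TrialPruned()\n    return val_acc\n\nstudy = optuna.create_study(direction=\"maximize\", pruner=optuna.pruners.MedianPruner())\nstudy.optimize(objective, n_trials=50)\n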

            That's all on how to do hyperparameter optimization in a scalable way. If you feel like it you can try to apply these techniques to the ongoing corrupted MNIST example, where you are free to choose what hyperparameters you want to use.

            "},{"location":"s10_extra/kubernetes/","title":"Kubernetes","text":"

            Danger

            Module is still under development

            "},{"location":"s10_extra/kubernetes/#kubernetes","title":"Kubernetes","text":"

            Kubernetes, also known as K8s, is an open-source platform designed to automate the deployment, scaling, and operation of application containers across clusters of hosts. It provides the framework to run distributed systems resiliently, handling scaling and failover for your applications, providing deployment patterns, and more.

            "},{"location":"s10_extra/kubernetes/#what-is-kubernetes","title":"What is Kubernetes?","text":""},{"location":"s10_extra/kubernetes/#brief-history","title":"Brief History","text":"

            Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation.

            "},{"location":"s10_extra/kubernetes/#core-functions","title":"Core Functions","text":"

            Kubernetes makes it easier to deploy and manage containerized applications at scale.

            "},{"location":"s10_extra/kubernetes/#key-concepts","title":"Key Concepts","text":"
            • Pods
            • Nodes
            • Clusters
            • ...
            "},{"location":"s10_extra/kubernetes/#kubernetes-architecture","title":"Kubernetes Architecture","text":"

            Kubernetes follows a client-server architecture. At a high level, it consists of a Control Plane (master) and Nodes (workers).

            Overview of Kubernetes Architecture. Image Credit: Kubernetes Official Documentation"},{"location":"s10_extra/kubernetes/#control-plane-components","title":"Control Plane Components","text":"
            • API Server: The frontend for Kubernetes.
            • etcd: Consistent and highly-available key value store.
            • ...
            "},{"location":"s10_extra/kubernetes/#node-components","title":"Node Components","text":"
            • Kubelet: An agent that runs on each node.
            • Container Runtime: The software responsible for running containers.
            • ...
            "},{"location":"s10_extra/kubernetes/#minikube-local-kubernetes-environment","title":"Minikube: Local Kubernetes Environment","text":"

            Minikube is a tool that allows you to run Kubernetes locally. It runs a single-node Kubernetes cluster inside a VM on your laptop for users looking to try out Kubernetes or develop with it day-to-day.

            "},{"location":"s10_extra/kubernetes/#installing-minikube","title":"Installing Minikube","text":"
            1. System Requirements: Ensure your system meets the minimum requirements.
            2. Download and Install: Visit Minikube's official installation guide.
            3. Start Minikube: Run minikube start.
            "},{"location":"s10_extra/kubernetes/#exercises","title":"\u2754 Exercises","text":"
            1. Install Minikube following the steps above.
            2. Validate the installation by typing minikube in a terminal.
            3. Ensure that kubectl, the command-line tool for Kubernetes, is correctly installed by typing kubectl in a terminal.
            "},{"location":"s10_extra/kubernetes/#yatai-model-serving-platform-for-kubernetes","title":"Yatai: Model Serving Platform for Kubernetes","text":"

            Yatai is a model serving platform, making it easier to deploy machine learning models in Kubernetes environments.

            "},{"location":"s10_extra/kubernetes/#what-is-yatai","title":"What is Yatai?","text":"

            Yatai simplifies the deployment, management, and scaling of machine learning models in Kubernetes.

            "},{"location":"s10_extra/kubernetes/#getting-started-with-yatai","title":"Getting Started with Yatai","text":"
            1. Installation: Steps to install Yatai in your Kubernetes cluster.
            2. Basic Usage: How to deploy your first model using Yatai.
            "},{"location":"s10_extra/kubernetes/#additional-resources","title":"Additional Resources","text":"
            • Official Kubernetes Documentation
            • Interactive Tutorials
            • Community Forums
            • ...
            "},{"location":"s10_extra/onnx/","title":"Onnx","text":""},{"location":"s10_extra/onnx/#onnx","title":"Onnx","text":"

            Danger

            Module is still under development

            "},{"location":"s10_extra/onnx/#model-packaging","title":"Model packaging","text":"

Whenever we want to serve a machine learning model, what we are actually interested in is doing predictions, e.g. given a new datapoint we pass it through our model (forward pass) and the returned value is the predicted value of that datapoint. At a high level, model predictions depend on three things:

• The codebase that implements the model's prediction method
• The model weights, which contain an actual instance of the model
• Code dependencies necessary for running the codebase.

We have already touched on how to take care of all these things in module M9 on Docker. Containers make it easy to link a codebase, model weights and code dependencies into a single object. In general we can refer to this as model packaging because, as the name suggests, we are packaging our model into a format that is independent of the actual environment that we are trying to run the model in.

However, containers are not the only way to do model packaging. If we put some light restrictions on the device we want to run our model predictions on, we can achieve the same result using ONNX. The Open Neural Network Exchange (ONNX) is a standardized format for creating and sharing machine learning models. ONNX provides an open-source format for machine learning models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.

            Image credit

As the above image indicates, the idea behind ONNX is that a model trained with a specific framework on a specific device, let's say Pytorch on your local computer, can easily be exported and run with an entirely different framework and hardware. Not all frameworks are created equal: Pytorch is in general considered a developer-friendly framework, but it has historically been slower to run inference with than a framework such as Caffe2. ONNX allows you to mix and match frameworks based on different use cases and essentially increases the longevity of your model.

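To make this more concrete, below is a minimal sketch of how a small Pytorch model could be exported to the ONNX format with torch.onnx.export. The model, input shape and file name are made up for illustration; it is not the course's reference implementation.

import torch\nfrom torch import nn\n\n# a tiny made-up model, just for illustration\nmodel = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))\nmodel.eval()\n\n# the export traces the model with a dummy input of the expected shape\ndummy_input = torch.randn(1, 10)\ntorch.onnx.export(model, dummy_input, \"model.onnx\", input_names=[\"input\"], output_names=[\"output\"])\n

The resulting model.onnx file contains both the computational graph and the weights and can be loaded by any ONNX-compatible runtime.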
            "},{"location":"s10_extra/onnx/#exercises","title":"\u2754 Exercises","text":"
            1. Start by installing ONNX:

              pip install onnx\npip install onnxruntime\n

the first package includes the basic building blocks for implementing generalized ONNX models and the second package is for running ONNX models optimally on different hardware.

2. As a test that your installation is working, try executing the following Python code

              import onnxruntime\nonnxruntime.get_all_providers()\n

these providers are translation layers implemented in ONNX Runtime, such that the same ONNX model can run on completely different hardware. Can you identify at least two of the providers that are necessary for running standard Pytorch code on CPU and GPU? Can you identify others? A small inference sketch using one of these providers is shown after this exercise list.

3. One big advantage of having a standardized format is that we can easily visualize the computational graph of our model because it consists only of core ONNX operations. We are here going to use the open-source tool netron for visualization. You can either choose to download the program or just run it in your web browser.

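As a small follow-up to the exercises, here is a hedged sketch of how an exported model.onnx file (for example the one from the export sketch earlier) could be run with ONNX Runtime on CPU; the file name and input shape are assumptions carried over from that sketch.

import numpy as np\nimport onnxruntime\n\n# explicitly ask for the CPU provider; CUDAExecutionProvider could be used on a GPU machine\nsession = onnxruntime.InferenceSession(\"model.onnx\", providers=[\"CPUExecutionProvider\"])\n\ninput_name = session.get_inputs()[0].name\nx = np.random.randn(1, 10).astype(np.float32)  # ONNX Runtime expects numpy arrays\noutputs = session.run(None, {input_name: x})\nprint(outputs[0].shape)\n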
            "},{"location":"s10_extra/pipeline/","title":"Pipelines and workflows","text":"

            Danger

            Module is still under development

            Image credit"},{"location":"s10_extra/pipeline/#dags","title":"DAGs","text":"

            Directed Acyclic Graph (DAG)

            "},{"location":"s10_extra/pipeline/#exercises","title":"\u2754 Exercises","text":"
            1. Start by installing prefect:

              pip install prefect\n
            2. Start a local Prefect server instance in your virtual environment.

              prefect server start\n
            3. The great thing about Prefect is that the orchestration tasks and flows are written in pure Python.

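To give a feel for what that looks like, here is a minimal sketch of a Prefect workflow, assuming Prefect 2.x is installed; the task and flow names are made up for illustration.

from prefect import flow, task\n\n@task\ndef get_data():\n    # a made-up task that pretends to fetch some data\n    return [1, 2, 3]\n\n@task\ndef train(data):\n    # a made-up task that pretends to train a model\n    return sum(data) / len(data)\n\n@flow\ndef training_pipeline():\n    data = get_data()\n    train(data)\n\nif __name__ == \"__main__\":\n    training_pipeline()\n

If your Prefect configuration points at the local server you started above, the flow run should show up in its UI.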
            "},{"location":"s1_development_environment/","title":"Getting started - Setting up a development environment","text":"

            Slides

            Today we start our journey into the world of machine learning operations (MLOps). However, before we can really get started, we need to make sure that you have a basic understanding of a couple of topics, as we will be using these throughout the course. In particular, today is all about getting set up with a proper development environment that can support your journey. Most of you probably already have experience with these topics, and it will be mostly repetition.

The reason we are starting here is that many students are missing very basic skills that are never taught but that they are expected to pick up on their own. This session will only cover the most basic skills to get you started on your development journey. If you wish to learn more about basic computer science skills in general, we highly recommend that you check out The Missing Semester of Your CS Education course from MIT.

            Learning objectives

            The learning objectives of this session are:

• Understand the basics of the command line
• Be able to create reproducible virtual environments
• Use a modern IDE / editor for code development
• Write and run a Python program implementing a simple deep learning model
            "},{"location":"s1_development_environment/command_line/","title":"M1 - The command line","text":""},{"location":"s1_development_environment/command_line/#the-command-line","title":"The command line","text":"

            Core Module

            Image credit

            Contrary to popular belief, the command line (also commonly known as the terminal) is not a mythical being that has existed since the dawn of time. Instead, it was created at a time when it was not given that your computer had a graphical interface that you could interact with. Think of it as a text interface to your computer.

The terminal is a well-known concept to users of Linux; however, Mac and (especially) Windows users often do not need it and therefore never encounter it. Having a basic understanding of how to use a command line can help improve your workflow. The reason that the command line is an important tool to get to know is that doing any kind of MLOps will require us to interact with many different tools, many of which do not have a graphical interface. Additionally, when we get to working in the cloud later in the course, you will be forced to interact with the command line.

Note: if you are already a terminal wizard then feel free to skip the exercises below. They are very elementary.

            "},{"location":"s1_development_environment/command_line/#the-anatomy-of-the-command-line","title":"The anatomy of the command line","text":"

            Regardless of the operating system, all command lines look more or less the same:

            As already stated, it is essentially just a big text interface to interact with your computer. As the image illustrates, when trying to execute a command, there are several parts to it:

            1. The prompt is the part where you type your commands. It usually contains the name of the current directory you are in, followed by some kind of sign: $, >, : are the usual ones. It can also contain other information, such as in the case of the above image which also shows the current conda environment.
            2. The command is the actual command you want to execute. For example, ls or cd
            3. The options are additional arguments that you can pass to the command. For example, ls -l or cd ...
            4. The arguments are the actual arguments that you pass to the command. For example, ls -l figures or cd ...

            The core difference between options and arguments is that options are optional, while arguments are not.

            Image credit"},{"location":"s1_development_environment/command_line/#exercises","title":"\u2754 Exercises","text":"

            We have put a cheat sheet in the exercise files folder belonging to this session, that gives a quick overview of the different commands that can be executed in the command line.

            Windows users

            We highly recommend that you install Windows Subsystem for Linux (WSL). This will install a full Linux system on your Windows machine. Please follow this guide. Remember to run commands from an elevated (as administrator) Windows Command Prompt. You can in general complete all exercises in the course from a normal Windows Command Prompt, but some are easier to do if you run from WSL.

            If you decide to run in WSL you need to remember that you now have two different systems, and installing a package on one system does not mean that it is installed on the other. For example, if you install pip in WSL, you need to install it again in Windows if you want to use it there.

            If you decide to not run in WSL, please always work in a Windows Command Prompt and not Powershell.

            1. Start by opening a terminal.

            2. To navigate inside a terminal, we rely on the cd command and pwd command. Make sure you know how to go back and forth in your file system. (1)

              1. Your terminal should support tab-completion which can help finish commands for you!
            3. The ls command is important when we want to know the content of a folder. Try to use the command, and also try it with the additional option -l. What does it show?

4. Make sure to familiarize yourself with the which, echo, cat, wget, less and top commands. Also, familiarize yourself with the > operator. You are probably going to use some of them throughout the course or in your future career. For Windows users, these commands may be named something else, e.g. the where command on Windows corresponds to which.

5. It is also important that you know how to edit a file through the terminal. Most systems should have the nano editor installed; otherwise try to figure out which one is installed on your system.

              1. Type nano in the terminal

              2. Write the following text in the script

                if __name__ == \"__main__\":\n    print(\"Hello world!\")\n
              3. Save the script and try to execute it

              4. Afterward, try to edit the file through the terminal (change Hello world to something else)

6. All terminals come with their own programming language. The most common one is called bash. Being able to write simple programs in bash can come in handy. For example, one common case is that you want to execute multiple Python programs sequentially, which can be done through a bash script.

              Windows users

              Bash is not part of Windows, so you need to run this part through WSL. If you did not install WSL, you can skip this part or as an alternative do the exercises in Powershell which is the native Windows scripting language (not recommended).

              1. Write a bash script (in nano) and try executing it:

                #!/bin/bash\n# A sample Bash script, by Ryan\necho Hello World!\n
              2. Change the bash script to call the Python program you just wrote.

              3. Try to Google how to write a simple for-loop that executes the Python script 10 times in a row.

            "},{"location":"s1_development_environment/command_line/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
            1. Here is one command from later in the course when we are going to work in the cloud

gcloud compute instances create-with-container instance-1 \\\n    --container-image=gcr.io/<project-id>/gcp_vm_tester \\\n    --zone=europe-west1-b\n

              Identify the command, options and arguments.

              Solution
              • The command is gcloud compute instances create-with-container.
              • The options are --container-image=gcr.io/<project-id>/gcp_vm_tester and --zone=europe-west1-b.
              • The arguments are instance-1.

The tricky part of this example is that commands can have subcommands, which are themselves commands. In this case compute is a subcommand to gcloud, instances is a subcommand to compute and create-with-container is a subcommand to instances.

2. Two common options that nearly all commands have are -h and -V. What does each of them do?

              Solution

              The -h (or --help) option prints the help message for the command, including subcommands and arguments. Try it out by executing python -h. The -V (or --version) option prints the version of the installed program. Try it out by executing python --version.

            This ends the module on the command line. If you are still not comfortable working with the command line, fear not as we are going to use it extensively throughout the course. If you want to spend additional time on this topic, we highly recommend that you watch this video on how to use the command line.

            If you are interested in personalizing your command line, you can check out the starship project, which allows you to customize your command line with a lot of different options.

            "},{"location":"s1_development_environment/deep_learning_software/","title":"M4 - Deep Learning Software","text":""},{"location":"s1_development_environment/deep_learning_software/#deep-learning-software","title":"Deep Learning Software","text":"

            Core Module

Deep learning has since its revolution back in 2012 transformed our lives. From Google Translate to driverless cars to personal assistants to protein engineering, deep learning is transforming nearly every sector of our economy and our lives. However, it did not take long before people realized that deep learning is not a simple beast to tame and that it comes with its own kinds of problems, especially if you want to use it in a production setting. In particular, the concept of technical debt was introduced in this context to describe the significant maintenance costs, at a system level, of running machine learning in production. MLOps should very much be seen as the response to the concept of technical debt, namely that we should develop methods, processes and tools (with inspiration from classical DevOps) to counter the problems we run into when working with deep learning models.

It is important to note that all the concepts and tools that have been developed for MLOps can absolutely be used together with more classical machine learning models (think K-nearest neighbors, random forests etc.); however, deep learning comes with its own set of problems, which mostly have to do with the sheer size of the data and models we are working with. For these reasons, we are focusing on working with deep learning models in this course.

            "},{"location":"s1_development_environment/deep_learning_software/#software-landscape-for-deep-learning","title":"Software landscape for Deep Learning","text":"

            Regarding software for Deep Learning, the landscape is currently dominated by three software frameworks (listed in order of when they were published):

            • Tensorflow

            • Pytorch

            • JAX

We won't go into a longer discussion on which framework is best, as it is pointless. Pytorch and Tensorflow have been around the longest and therefore have bigger communities and feature sets at this point in time. They are both very similar in the sense that they have features aimed at both research and production. JAX is kind of the new kid on the block, which in many ways improves on Pytorch and Tensorflow, but it is still not as mature as the other frameworks. As the frameworks use different kinds of programming principles (object-oriented vs. functional programming), comparing them is essentially meaningless.

In this course we have chosen to work with Pytorch because we find it a bit more intuitive and it is the framework that we use in our day-to-day research life. Additionally, as of right now it is by far the dominant framework for published models, research papers and competition winners.

The intention behind this set of exercises is to bring everyone's Pytorch skills up to date. If you are already a Pytorch Jedi, feel free to skip the first set of exercises, but I recommend that you still complete them. The exercises are in large part taken directly from the deep learning course at Udacity. Note that these exercises are given as notebooks, which is the last time we are going to use them actively in the course. Instead, after this set of exercises, we are going to focus on writing code in Python scripts.

The notebooks contain a lot of explanatory text. The exercises that you are supposed to fill out are inlined in the text in small \"exercise\" blocks:

If you need a refresher on any deep learning topic in general throughout the course, we recommend finding the relevant chapter in the deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville (it can also be found in the literature folder). It is absolutely not necessary to be good at deep learning to pass this course, as the focus is on all the software needed to get deep learning models into production. However, it is important to have a basic understanding of the concepts.

            "},{"location":"s1_development_environment/deep_learning_software/#exercises","title":"\u2754 Exercises","text":"

            Exercise files

            1. Start a jupyter notebook session in your terminal (assuming you are standing in the root of the course material). Alternatively you should be able to open the notebooks directly in your code editor. For VS code users you can read more about how to work with jupyter notebooks in VS code here

            2. Complete the Tensors in Pytorch notebook. It focuses on basic manipulation of Pytorch tensors. You can pass this notebook if you are comfortable doing this.

            3. Complete the Neural Networks in Pytorch notebook. It focuses on building a very simple neural network using the Pytorch nn.Module interface.

            4. Complete the Training Neural Networks notebook. It focuses on how to write a simple training loop for training a neural network.

5. Complete the Fashion MNIST notebook, which summarizes the concepts learned in notebooks 2 and 3 by building a neural network for classifying the Fashion MNIST dataset.

            6. Complete the Inference and Validation notebook. This notebook adds important concepts on how to do inference and validation on our neural network.

            7. Complete the Saving_and_Loading_Models notebook. This notebook addresses how to save and load model weights. This is important if you want to share a model with someone else.

            "},{"location":"s1_development_environment/deep_learning_software/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
            1. If tensor a has shape [N, d] and tensor b has shape [M, d] how can we calculate the pairwise distance between rows in a and b without using a for loop?

              Solution

              We can take advantage of broadcasting to do this

              a = torch.randn(N, d)\nb = torch.randn(M, d)\ndist = torch.sum((a.unsqueeze(1) - b.unsqueeze(0))**2, dim=2)  # shape [N, M]\n
            2. What should be the size of S for an input image of size 1x28x28, and how many parameters does the neural network then have?

              from torch import nn\nneural_net = nn.Sequential(\n    nn.Conv2d(1, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3), nn.ReLU(), nn.Flatten(), nn.Linear(S, 10)\n)\n
              Solution

Since both convolutions have a kernel size of 3, stride 1 (the default value) and no padding, we lose 2 pixels in each dimension, because the kernel cannot be centered on the edge pixels. Therefore, the output of the first convolution is 32x26x26 and the output of the second convolution is 64x24x24. The size of S must therefore be 64 * 24 * 24 = 36864. The number of parameters in a convolutional layer is kernel_size * kernel_size * in_channels * out_channels + out_channels (the last term is the bias) and the number of parameters in a linear layer is in_features * out_features + out_features (the last term is the bias). Therefore, the total number of parameters in the network is 3*3*1*32 + 32 + 3*3*32*64 + 64 + 36864*10 + 10 = 387,466, which could be calculated by running:

from math import prod\nsum([prod(p.shape) for p in neural_net.parameters()])\n
            3. A working training loop in Pytorch should have these three function calls: optimizer.zero_grad(), loss.backward(), optimizer.step(). Explain what would happen in the training loop (or implement it) if you forgot each of the function calls.

              Solution

optimizer.zero_grad() is in charge of zeroing the gradients. If this is not done, then gradients would accumulate over the steps, leading to exploding gradients. loss.backward() is in charge of calculating the gradients. If this is not done, then the gradients would not be calculated and the optimizer would not be able to update the weights. optimizer.step() is in charge of updating the weights. If this is not done, then the weights would not be updated and the model would not learn anything. A minimal sketch of a loop containing all three calls is shown below.

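To make the solution above concrete, here is a minimal, self-contained sketch of a training loop containing the three calls; the data and model are synthetic stand-ins, not the course's corrupted MNIST setup.

import torch\nfrom torch import nn\n\n# tiny synthetic dataset and model, just to have something runnable\nx, y = torch.randn(64, 10), torch.randint(0, 2, (64,))\nmodel = nn.Linear(10, 2)\ncriterion = nn.CrossEntropyLoss()\noptimizer = torch.optim.SGD(model.parameters(), lr=0.1)\n\nfor epoch in range(5):\n    optimizer.zero_grad()  # reset gradients from the previous step\n    loss = criterion(model(x), y)\n    loss.backward()  # compute gradients of the loss w.r.t. the parameters\n    optimizer.step()  # update the parameters using the gradients\n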
            "},{"location":"s1_development_environment/deep_learning_software/#final-exercise","title":"Final exercise","text":"

As the final exercise we will develop a simple baseline model that we will continue to develop during the course. For this exercise we provide the data in the data/corruptmnist folder. Do NOT use the data in the corruptmnist_v2 folder as that is intended for another exercise. As the name suggests, this is a (subsampled) corrupted version of regular MNIST. Your overall task is the following:

            Implement a MNIST neural network that achieves at least 85 % accuracy on the test set.

Before any training can start, you should identify what corruption we have applied to the MNIST dataset to create the corrupted version. This can help you identify what kind of neural network to use to get good performance, but any network should really be able to achieve this.

One key point of this course is trying to stay organized. Spending time now on organizing your code will save time in the future as you start to add more and more features. As subgoals, please fulfill the following exercises:

            1. Implement your model in a script called model.py

            2. Implement your data setup in a script called data.py. The data was saved using torch.save, so to load it you should use torch.load.

              Saving the model

When saving the model, you should use torch.save(model.state_dict(), \"model.pt\") and when loading it you should use model.load_state_dict(torch.load(\"model.pt\")). If you do torch.save(model, \"model.pt\") instead, it will try to save not only the model weights but also the model definition, which can lead to problems when loading the model later on if you have changed the model definition in the meantime (which you most likely are going to do). A minimal sketch of the recommended pattern is shown after this exercise list.

3. Implement training and evaluation of your model in a main.py script. The main.py script should be able to take additional subcommands indicating if the model should train or evaluate. It will look something like this:

              python main.py train --lr 1e-4\npython main.py evaluate trained_model.pt\n

which can be implemented in various ways; one minimal sketch is shown after this exercise list.

              VS code and command line arguments

If you try to execute the above code in VS code using the debugger (F5) or the built-in run functionality in the upper right corner:

you will get an error message saying that you need to select a command to run, e.g. main.py needs either the train or evaluate command. This can be fixed by adding a launch.json file to a specialized .vscode folder in the root of the project. The launch.json file should look something like this:

              {\n    \"version\": \"0.2.0\",\n    \"configurations\": [\n        {\n            \"name\": \"Python: Current File\",\n            \"type\": \"python\",\n            \"request\": \"launch\",\n            \"program\": \"${file}\",\n            \"args\": [\n                \"train\",\n                \"--lr\",\n                \"1e-4\"\n            ],\n            \"console\": \"integratedTerminal\",\n            \"justMyCode\": true\n        }\n    ]\n}\n

This will inform VS code that when we execute the current file (in this case main.py) we want to run it with the train command and additionally pass the --lr argument with the value 1e-4. You can read more about creating a launch.json file here. If you want to have multiple configurations you can add them to the configurations list as additional dictionaries.
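As promised above, here is a minimal sketch of the recommended saving and loading pattern; MyModel is a hypothetical stand-in for whatever model you implement in model.py.

import torch\nfrom torch import nn\n\nclass MyModel(nn.Module):  # hypothetical stand-in for the model you implement in model.py\n    def __init__(self):\n        super().__init__()\n        self.layer = nn.Linear(784, 10)\n\n    def forward(self, x):\n        return self.layer(x)\n\nmodel = MyModel()\ntorch.save(model.state_dict(), \"model.pt\")  # save only the weights\n\nnew_model = MyModel()  # the model definition must be available when loading\nnew_model.load_state_dict(torch.load(\"model.pt\"))\n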

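And here is one possible, but by no means the only, way of handling the train and evaluate subcommands in main.py, sketched with argparse from the standard library; the provided exercise files may use a different CLI approach.

import argparse\n\ndef train(lr):\n    print(f\"Training with learning rate {lr}\")  # replace with your training loop\n\ndef evaluate(model_checkpoint):\n    print(f\"Evaluating {model_checkpoint}\")  # replace with your evaluation code\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description=\"Train or evaluate the model\")\n    subparsers = parser.add_subparsers(dest=\"command\", required=True)\n    train_parser = subparsers.add_parser(\"train\")\n    train_parser.add_argument(\"--lr\", type=float, default=1e-4)\n    eval_parser = subparsers.add_parser(\"evaluate\")\n    eval_parser.add_argument(\"model_checkpoint\")\n    args = parser.parse_args()\n    if args.command == \"train\":\n        train(args.lr)\n    else:\n        evaluate(args.model_checkpoint)\n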
To start you off, a very basic version of each script is provided in the final_exercise folder. We have already implemented some logic, especially to make sure you can easily run the different subcommands. If you are interested in how this is done you can check out this optional module on defining command line interfaces (CLI). We additionally provide a requirements.txt with suggestions for what packages are necessary to complete the exercise.

As documentation that your model is actually working, when running the train command the script needs to produce a single plot with the training curve (training step vs. training loss). When the evaluate command is run, it should write the test set accuracy to the terminal.

It is part of the exercise not to implement this in notebooks, as code development in real life happens in scripts. As the model is simple to run (for now) you should be able to complete the exercise on your laptop, even if you are only training on CPU. That said, you are allowed to upload your scripts to your own \"Google Drive\" and then call them from a Google Colab notebook, as shown in the image below where all code is placed in the fashion_trainer.py script and the Colab notebook is just used to execute it.

            Be sure to have completed the final exercise before the next session, as we will be building on top of the model you have created.

            "},{"location":"s1_development_environment/editor/","title":"M3 - Editor","text":""},{"location":"s1_development_environment/editor/#editoride","title":"Editor/IDE","text":"

            Core Module

Notebooks can be great for testing out ideas, developing simple code and explaining and visualizing certain aspects of a codebase. Remember that the Jupyter notebook was created with the intention that it \"...allows you to create and share documents that contain live code, equations, visualizations and narrative text.\" However, any larger machine learning project will require you to work in multiple .py files, and here notebooks will provide a suboptimal workflow. Therefore, to truly get \"work done\" you will need a good editor / IDE.

            Many opinions exist on this matter, but for simplicity we recommend getting started with one of the following 3:

Editor, webpage and a (biased) comment on each:
• Spyder (https://www.spyder-ide.org/): Matlab-like environment that is easy to get started with
• Visual Studio Code (https://code.visualstudio.com/): Support for multiple languages with a fairly easy setup
• PyCharm (https://www.jetbrains.com/pycharm/): IDE for Python professionals. Will take a bit of time getting used to

We highly recommend Visual Studio (VS) Code if you do not already have an editor installed (or if you just want to try something new). We therefore put additional effort into explaining VS Code.

Below you see an overview of the VS Code interface.

            Image credit

            The main components of VS code are:

• The action bar: VS code is not an editor meant for a single language and can do many things. One of the core reasons that VS code has become so popular is that custom plug-ins called extensions can be installed to add functionality to VS code. It is in the action bar that you can navigate between these different applications once you have installed them.

• The side bar: The side bar has different functionality depending on which extension you have open. In most cases, the side bar will just contain the file explorer.

• The editor: This is where your code is. VS code supports a number of layouts in the editor (one column, two columns etc.). You can make a custom layout by dragging a file to where you want the layout to split.

• The panel: The panel contains a terminal for you to interact with. This can quickly be used to try out code by opening a Python interpreter, to manage environments etc.

            • The status bar: The status bar contains information based on the extensions that you have installed. In particular for python development, the status bar can be used to change conda environment.

            "},{"location":"s1_development_environment/editor/#exercises","title":"\u2754 Exercises","text":"

The overall goal of the exercises is that you start familiarizing yourself with the editor you have chosen. If you are already an expert in one of them, feel free to skip the rest. You should at least be able to:

            • Create a new file
            • Run a python script
            • Change the python environment

            The instructions below are specific to Visual studio code but we recommend that you try to answer the questions if using another editor. In the exercise_files folder belonging to this session we have put cheat sheets for VS code (one for Windows and one for Mac/Linux), that can give you an easy overview of the different macros in VS code. The following exercises are just to get you started but you can find many more tutorials here.

1. VS code is a general editor for many languages and to get proper Python support we need to install some extensions. In the action bar go to the extensions tab and search for python in the marketplace. From here we highly recommend installing the following packages:

              • Python: general python support for VS code
              • Pylance: language server for python that provides better code completion and type checking
              • Jupyter: support for jupyter notebooks directly in VSCode
              • Python Environment Manager: allows for easy management of virtual environments
            2. If you install the Python package you should see something like this in your status bar:

              which indicates that you are using the stock python installation, instead of the one you have created using conda. Click it and change the python environment to the one you actually want to use.

3. One of the most useful tools in VS Code is the ability to navigate the whole project using the built-in Explorer. To really take advantage of VS Code you need to make sure that what you are working on is a project. Create a folder called hello (somewhere on your laptop) and open it in VS Code (click File in the menu and then select Open Folder). You should end up with a completely clean workspace (as shown below). Click the New file button and create a file called hello.py.

              Image credit

4. Finally, let's run some code. Add something simple to the hello.py file like:

              Image credit

              and click the run button as shown in the image. It should create a new terminal, activate the environment that you have chosen and finally run your script. In addition to clicking the run button, you can also

              • Select some code and press Shift+Enter to run it in the terminal
• Select some code and right click, choosing to run it in an interactive window (where you can interact with the results like in a jupyter notebook)

That's the basics of using VS code. We highly recommend that you revisit this tutorial during the course when we get to topics such as debugging and version control, which VS code can help with.

            "},{"location":"s1_development_environment/editor/#a-note-on-jupyter-notebooks-in-production-environments","title":"A note on jupyter notebooks in production environments","text":"

As already stated, jupyter notebooks are great for development as they allow developers to easily test out new ideas. However, they often lead to pain points when models actually need to be deployed. We highly recommend reading section 5.1.1 of this paper by Shankar et al. which discusses in more detail the strong opinions on jupyter notebooks that exist within the developer community.

All this said, there exists at least one simple tool to make notebooks work better in a production setting. It's called nbconvert and can be installed with

            conda install nbconvert # or pip install nbconvert\n

You may need some further dependencies such as Pandoc, TeX and Pyppeteer for it to work (see the install instructions here). After this, converting a notebook to a .py script is as simple as:

            jupyter nbconvert --to=script my_notebook.ipynb\n

which will produce a similarly named script called my_notebook.py. We highly recommend that you stick to developing scripts directly during the course to get experience with doing so, but nbconvert can be a fantastic tool to have in your toolbox.

            "},{"location":"s1_development_environment/package_manager/","title":"M2 - Package Manager","text":""},{"location":"s1_development_environment/package_manager/#package-managers-and-virtual-environments","title":"Package managers and virtual environments","text":"

            Core Module

Python is a great programming language and this is mostly due to its vast ecosystem of packages. No matter what you want to do, there is probably a package that can get you started. Just try to remember the last time you wrote a program using only the Python standard library. Probably never. For this reason, we need a way to install third-party packages and this is where package managers come into play.

            You have probably already used pip for the longest time, which is the default package manager for Python. pip is great for beginners but it is missing one essential feature that you will need as a developer or data scientist: virtual environments. Virtual environments are an essential way to make sure that the dependencies of different projects do not cross-contaminate each other. As a naive example, consider project A that requires torch==1.3.0 and project B that requires torch==2.0, then doing

            cd project_A  # move to project A\npip install torch==1.3.0  # install old torch version\ncd ../project_B  # move to project B\npip install torch==2.0  # install new torch version\ncd ../project_A  # move back to project A\npython main.py  # try executing main script from project A\n

will mean that even though we are executing the main script from project A's folder, it will use torch==2.0 instead of torch==1.3.0 because that is the last version we installed: in both cases pip installs the package into the same environment, in this case the global environment. Instead, if we did something like:

            Unix/macOSWindows
            cd project_A  # move to project A\npython -m venv env  # create a virtual environment in project A\nsource env/bin/activate  # activate that virtual environment\npip install torch==1.3.0  # install old torch version into the virtual environment belonging to project A\ncd ../project_B  # move to project B\npython -m venv env  # create a virtual environment in project B\nsource env/bin/activate  # activate that virtual environment\npip install torch==2.0  # install new torch version into the virtual environment belonging to project B\ncd ../project_A  # move back to project A\nsource env/bin/activate  # activate the virtual environment belonging to project A\npython main.py  # succeed in executing main script from project A\n
            cd project_A  # move to project A\npython -m venv env  # create a virtual environment in project A\n.\\env\\Scripts\\activate  # activate that virtual environment\npip install torch==1.3.0  # install old torch version into the virtual environment belonging to project A\ncd ../project_B  # move to project B\npython -m venv env  # create a virtual environment in project B\n.\\env\\Scripts\\activate  # activate that virtual environment\npip install torch==2.0  # install new torch version into the virtual environment belonging to project B\ncd ../project_A  # move back to project A\n.\\env\\Scripts\\activate  # activate the virtual environment belonging to project A\npython main.py  # succeed in executing main script from project A\n

            then we would be sure that torch==1.3.0 is used when executing main.py in project A because we are using two different virtual environments. In the above case, we used the venv module which is the built-in Python module for creating virtual environments. venv+pip is arguably a good combination but when working on multiple projects it can quickly become a hassle to manage all the different virtual environments yourself, remembering which Python version to use, which packages to install and so on.

            For this reason, a number of package managers have been created that can help you manage your virtual environments and dependencies, with some of the most popular being:

            • conda
            • pipenv
            • poetry
            • pipx
            • hatch
            • pdm

with more being created every year (rye is looking like an interesting project). This is considered a problem in the Python community, because it means that there is no standard way of managing dependencies as in other languages, such as npm for Node.js or cargo for Rust.

            Image credit

In the course, we do not care about which package manager you use, but we do care that you use one. If you are already familiar with one package manager, then skip this exercise and continue to use that. The best recommendation that I can give regarding package managers, in general, is to find one you like and then stick with it. A lot of time can be wasted on trying to find the perfect package manager, but in the end, they all do the same with some minor differences. Check out this blog post if you want a fairly up-to-date evaluation of the different environment management and packaging tools that exist in the Python ecosystem.

            If you are not familiar with any package managers, then we recommend that you use conda and pip for this course. You probably already have conda installed on your laptop, which is great. What conda does especially well, is that it allows you to create virtual environments with different Python versions, which can be really useful if you encounter dependencies that have not been updated in a long time. In this course specifically, we are going to recommend the following workflow

            • Use conda to create virtual environments with specific Python versions
            • Use pip to install packages in that environment

            Installing packages with pip inside conda environments has been considered a bad practice for a long time, but since conda>=4.6 it is considered safe to do so. The reason for this is that conda now has a built-in compatibility layer that makes sure that pip installed packages are compatible with the other packages installed in the environment.

            "},{"location":"s1_development_environment/package_manager/#python-dependencies","title":"Python dependencies","text":"

            Before we get started with the exercises, let's first talk a bit about Python dependencies. One of the most common ways to specify dependencies in the Python community is through a requirements.txt file, which is a simple text file that contains a list of all the packages that you want to install. The format allows you to specify the package name and version number you want, with 7 different operators:

package1           # any version\npackage2 == x.y.z  # exact version\npackage3 >= x.y.z  # at least version x.y.z\npackage4 >  x.y.z  # newer than version x.y.z\npackage5 <= x.y.z  # at most version x.y.z\npackage6 <  x.y.z  # older than version x.y.z\npackage7 ~= x.y.z  # install version newer than x.y.z and older than x.y+1\n

            In general, all packages (should) follow the semantic versioning standard, which means that the version number is split into three parts: x.y.z where x is the major version, y is the minor version and z is the patch version.

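If you want to inspect these version numbers and specifiers programmatically, the packaging library (which pip builds on internally) can parse them. A small sketch, assuming packaging is installed (pip install packaging):

from packaging.specifiers import SpecifierSet\nfrom packaging.version import Version\n\nv = Version(\"2.1.3\")\nprint(v.major, v.minor, v.micro)  # 2 1 3\n\n# ~=2.1.3 means: at least 2.1.3, but staying below 2.2\nspec = SpecifierSet(\"~=2.1.3\")\nprint(Version(\"2.1.9\") in spec)  # True\nprint(Version(\"2.2.0\") in spec)  # False\n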
            The reason that we need to specify the version number is that we want to make sure that we can reproduce our code at a later point. If we do not specify the version number, then we are at the mercy of the package maintainer to not change the API of the package. This is especially important when working with machine learning models, as we want to make sure that we can reproduce the exact same model at a later point.

            Finally, we also need to discuss dependency resolution, which is the process of figuring out which packages are compatible. This is a non-trivial problem, and there exist a lot of different algorithms for doing this. If you have ever thought that pip and conda were taking a long time to install something, then it is probably because they were trying to figure out which packages are compatible with each other. For example, if you try to install

            pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.19\" --dry-run\n

            then it would simply fail because there are no versions of matplotlib and numpy under the given constraints that are compatible with each other. In this case, we would need to relax the constraints to something like

            pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.21\" --dry-run\n

            to make it work.

            "},{"location":"s1_development_environment/package_manager/#exercises","title":"\u2754 Exercises","text":"

            For hints regarding how to use conda you can check out the cheat sheet in the exercise folder.

1. Download and install conda. You are free to either install the full conda or the much simpler version miniconda. The core difference between the two is that full conda already comes with a lot of packages that you would have to install yourself when using miniconda. The downside is that full conda is a much larger package, which can be a huge disadvantage on smaller devices. Make sure that your installation is working by writing conda help in a terminal; it should show you the help message for conda. If this does not work you probably need to set some system variable to point to the conda installation.

            2. If you have successfully installed conda, then you should be able to execute the conda command in a terminal.

              Conda will always tell you what environment you are currently in, indicated by the (env_name) in the prompt. By default it will always start in the (base) environment.

            3. Try creating a new virtual environment. Make sure that it is called my_enviroment and that it installs version 3.11 of Python. What command should you execute to do this?

              Use Python 3.8 or higher

              We highly recommend that you use Python 3.8 or higher for this course. In general, we recommend that you use the second latest version of Python that is available (currently Python 3.11 as of writing this). This is because the latest version of Python is often not supported by all dependencies. You can always check the status of different Python version support here.

            4. Which conda command gives you a list of all the environments that you have created?

            5. Which conda command gives you a list of the packages installed in the current environment?

1. How do you easily export this list to a text file? Do this, and make sure you export it to a file called enviroment.yaml, as conda by default uses another format than pip.

              2. Inspect the file to see what is in it.

3. The enviroment.yaml file you have created is one way to secure reproducibility between users, because anyone should be able to get an exact copy of your environment if they have your enviroment.yaml file. Try creating a new environment directly from your enviroment.yaml file and check that the packages being installed exactly match what you originally had.

6. As the introduction states, it is fairly safe to use pip inside conda today. What is the corresponding pip command that gives you a list of all pip installed packages? And how do you export this to a requirements.txt file?

7. If you look through the requirements that both pip and conda produce, then you will see that they are often filled with a lot more packages than what you are actually using in your project. What you are really interested in are the packages that you import in your code: from package import module. One way to get around this is to use the package pipreqs, which will automatically scan your project and create a requirements file specific to it. Let's try it out:

              1. Install pipreqs:

                pip install pipreqs\n
2. Either try out pipreqs on one of your own projects or try it out on some other online project. What does the requirements.txt file that pipreqs produces look like compared to the files produced by either pip or conda?

            "},{"location":"s1_development_environment/package_manager/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
            1. Try executing the command

              pip install \"pytest < 4.6\" pytest-cov==2.12.1\n

              based on the error message you get, what would be a compatible way to install these?

              Solution

As pytest-cov==2.12.1 requires a version of pytest that is at least 4.6, we can simply change the command to:

              pip install \"pytest >= 4.6\" pytest-cov==2.12.1\n

but there of course exist other solutions as well.

This ends the module on setting up virtual environments. While the methods mentioned in the exercises are great ways to construct requirements files automatically, sometimes it is just easier to sit down and create the files manually, as you in that way ensure that only the most necessary requirements are installed when creating a new environment.

            "},{"location":"s2_organisation_and_version_control/","title":"Getting started with MLOps - Organization and version control","text":"

            Slides

Today we take our first steps into the world of MLOps. The set of modules in this session focuses on getting organized and making sure that you are familiar with good development practices. While many of the practices you will learn about in these modules do not seem that important when you are a single person working on a project, they are crucial when working in large groups, where the differences in how different people organize and write their code should be minimized. The topics in this session will focus on:

            • Version control for helping tracking and managing changes to your code and data
            • Coding practices for staying organized in large projects

            Image credit

Some exercises in this course are very loosely stated (including the exercises today). You are expected to seek out information before you ask for help (Google is your friend!), as you will learn more from trying to solve the problems yourself, and it is more realistic of how the \"real world\" works.

            Learning objectives

            The learning objectives of this session are:

• Understand the basics of version control and be able to use git to track changes to your code
• Know how to package Python code into a library and how to organize your code for reuse
• Understand different coding practices and how to use them to improve the quality of your code
• Be able to use dvc to version control data
            "},{"location":"s2_organisation_and_version_control/code_structure/","title":"M6 - Code structure","text":""},{"location":"s2_organisation_and_version_control/code_structure/#code-organization","title":"Code organization","text":"

            Core Module

With a basic understanding of version control, it is now time to really begin filling up our code repository. However, the question then remains: how should we organize our code? As developers we tend not to think about code organization that much. It is instead something that is just created dynamically as we need it. However, maybe we should spend some time upfront getting organized, with the chance that this makes our code easier to develop and maintain in the long run. If we do not spend time organizing our code, we may end up with a mess of code that is hard to understand or maintain.

            Big ball of Mud

            A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems. Brian Foote and Joseph Yoder, Big Ball of Mud. Fourth Conference on Patterns Languages of Programs (PLoP '97/EuroPLoP '97) Monticello, Illinois, September 1997

We are here going to focus on the organization of data science and machine learning projects. The core difference these kinds of projects introduce compared to more traditional systems is data. The key to modern machine learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that data should influence our choice of code structure. If we had another kind of application, then the layout of our codebase should probably be different.

            "},{"location":"s2_organisation_and_version_control/code_structure/#cookiecutter","title":"Cookiecutter","text":"

We are in this course going to use the tool cookiecutter, which is a tool for creating projects from project templates. A project template is in short just an overall structure of how you want your folders, files etc. to be organized from the beginning. For this course we are going to be using a custom MLOps template. The template is essentially a fork of the cookiecutter data science template that has been used for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.

We are not going to argue that this template is better than every other template; we simply focus on the fact that it is a standardized way of creating project structures for machine learning projects. By standardized we mean that if two people are both using cookiecutter with the same template, the layout of their code follows the same specific rules, enabling one to understand the other person's code faster. Code organization is therefore not only about making the code easier for you to maintain but also for others to read and understand.

Below you can see the default cookiecutter code structure for data science projects.

What is important to keep in mind when using a template is that it is exactly that: a template. By definition a template is a guide for making something. Therefore, not all parts of a template may be important for your project at hand. Your job is to pick the parts of the template that are useful for organizing your machine learning project and add the parts that are missing.

            "},{"location":"s2_organisation_and_version_control/code_structure/#python-projects","title":"Python projects","text":"

While the same template in principle could be used regardless of what language we were using for our machine learning or data science application, there are certain considerations to take into account based on the language. Python is currently the dominant language for machine learning and data science, which is why we in this section are focusing on some of the special files you will need for your Python projects.

            The first file you may or may not know is the __init__.py file. In Python the __init__.py file is used to mark a directory as a Python package. Therefore as a bare minimum, any Python package should look something like this:

            \u251c\u2500\u2500 src/\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 file1.py\n\u2502   \u251c\u2500\u2500 file2.py\n\u251c\u2500\u2500 pyproject.toml\n

The second file to focus on is the pyproject.toml. This file is important for actually converting your code into a Python project. Essentially, whenever you run pip install, pip is in charge of both downloading the package you want and installing it. For pip to be able to install a package it needs instructions on what part of the code it should install and how to install it. This is the job of the pyproject.toml file.

Below we describe both the structure of the pyproject.toml file and setup.py + setup.cfg, which is the \"old\" way of providing project instructions for Python projects. However, you may still encounter a lot of projects using setup.py + setup.cfg, so it is good to at least know about them.

            pyproject.tomlsetup.py + setup.cfg

pyproject.toml is the new standardized way of describing project metadata in a declarative way, introduced in PEP 621. It is written in the toml format, which is easy to read. At the very least your pyproject.toml file should include the [build-system] and [project] sections:

            [build-system]\nrequires = [\"setuptools\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"my-package-name\"\nversion = \"0.1.0\"\nauthors = [{name = \"EM\", email = \"me@em.com\"}]\ndescription = \"Something cool here.\"\nrequires-python = \">=3.8\"\ndynamic = [\"dependencies\"]\n\n[tool.setuptools.dynamic]\ndependencies = {file = [\"requirements.txt\"]}\n

the [build-system] section informs pip/python that to build this Python project it needs the two packages setuptools and wheel and that it should call the setuptools.build_meta function to actually build the project. The [project] section essentially contains metadata regarding the package, what it's called etc., in case we ever want to publish it to PyPI.

For specifying dependencies of your project you have two options. Either you specify them in a requirements.txt file and reference it as a dynamic field in pyproject.toml as shown above. Alternatively, you can add a dependencies field under the [project] header like this:

            [project]\ndependencies = [\n    'torch==2.1.0',\n    'matplotlib>=3.8.1'\n]\n

The improvement over setup.py + setup.cfg is that pyproject.toml also allows metadata from other tools to be specified in it, essentially making sure you only need a single configuration file for your project. For example, in the next [module M7 on good coding practices] you will learn about the tool ruff and how it can help format your code. If we want to configure ruff for our project we can do that directly in pyproject.toml by adding additional headers:

[tool.ruff]\nruff_option = ...\n

To read more about how to specify pyproject.toml, this page is a good place to start.

setup.py is the original way of describing how a Python package should be built. The most basic setup.py file looks something like this:

from setuptools import setup\n\n# read the requirements file directly instead of relying on pip internals (pip.req no longer exists)\nwith open(\"requirements.txt\") as f:\n    requirements = [line.strip() for line in f if line.strip() and not line.startswith(\"#\")]\n\nsetup(\n    name=\"my-package-name\",\n    version=\"0.1.0\",\n    author=\"EM\",\n    description=\"Something cool here.\",\n    install_requires=requirements,\n)\n

Essentially, it is the exact same meta information as in pyproject.toml, just written directly in Python syntax instead of toml. Because there was a wish to separate this meta information into its own file, the setup.cfg file was created, which can contain the exact same information as setup.py just as a declarative config.

            [metadata]\nname = my-package-name\nversion = 0.1.0\nauthor = EM\ndescription = \"Something cool here.\"\n# ...\n

This non-standardized way of providing meta information about a package was essentially what led to the creation of pyproject.toml.

Regardless of which way a project is configured, after creating the above files the way to install the project is the same:

            pip install .\n# or in developer mode\npip install -e . # (1)!\n
1. The -e is short for --editable mode, also called developer mode. Since we will be continuously iterating on our package, this is the preferred way to install it, because it means that we do not have to run pip install every time we make a change. Essentially, in developer mode changes in the Python source code take effect immediately without requiring a new installation.

After running this, your code should be available to import as from project_name import ... like any other Python package you use. This is the most essential part of creating Python packages.

            "},{"location":"s2_organisation_and_version_control/code_structure/#exercises","title":"\u2754 Exercises","text":"

After having installed cookiecutter (exercises 1 and 2), the remaining exercises are about taking the simple CNN MNIST classifier from yesterday's exercise and forcing it into this structure. You are not required to fill out every folder and file in the project structure, but try to at least follow the steps in the exercises. Whenever you need to run a file I recommend always doing this from the root directory, e.g.

            python <project_name>/data/make_dataset.py data/raw data/processed\npython <project_name>/models/train_model.py <arguments>\netc...\n

In this way, paths (for saving and loading files) are always relative to the root.

1. Install the cookiecutter framework

              pip install cookiecutter\n
2. Start a new project using this template, which is specialized for this course (1).

1. If you feel like the template can be improved in some way, feel free to either open an issue with the proposed improvement or directly send a pull request to the repository \ud83d\ude04.

              You do this by running the cookiecutter command using the template url:

              cookiecutter <url-to-template>\n

              Valid project names

When asked for a project name you should follow the PEP8 guidelines for naming packages. This means that the name should be all lowercase and if you want to separate words, you should use underscores. For example my_project is a valid name, while MyProject is not. Additionally, the package name cannot start with a number.

              Flat-layout vs src-layout

There are two common choices for how to lay out your source directory. The first is called src-layout, where the source code is always placed in a src/<project_name> folder, and the second is called flat-layout, where the source code is just placed in a <project_name> folder. The template we are using in this course uses the flat-layout, but there are pros and cons for both.

3. After having created your new project, the first step is to also create a corresponding virtual environment and install any needed requirements. If you have a virtual environment from yesterday feel free to use that, otherwise create a new one. Then install the project in that environment

              pip install -e .\n
4. Start by filling out the <project_name>/data/make_dataset.py file. When this file runs, it should take the raw data, e.g. the corrupted MNIST files from yesterday (../data/corruptmnist) which should now be located in a data/raw folder, process them into a single tensor, normalize the tensor and save this intermediate representation to the data/processed folder. By normalization here we refer to making sure the images have mean 0 and standard deviation 1.
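
  As a rough sketch (not the official solution), the file could look something like the following. The raw file names (train_images_*.pt, train_target_*.pt) and the folder layout are assumptions, so adapt them to how your raw data is actually stored:

  import sys\nfrom pathlib import Path\n\nimport torch\n\n\ndef make_dataset(raw_dir: str = \"data/raw\", processed_dir: str = \"data/processed\") -> None:\n    \"\"\"Load the raw corrupted MNIST files, normalize them and save a single tensor.\"\"\"\n    raw, processed = Path(raw_dir), Path(processed_dir)\n    processed.mkdir(parents=True, exist_ok=True)\n\n    # the raw data is assumed to be split over several train_images_*.pt / train_target_*.pt files\n    images = torch.cat([torch.load(f) for f in sorted(raw.glob(\"train_images_*.pt\"))]).float()\n    targets = torch.cat([torch.load(f) for f in sorted(raw.glob(\"train_target_*.pt\"))])\n\n    # normalize to mean 0 and standard deviation 1\n    images = (images - images.mean()) / images.std()\n\n    torch.save(images, processed / \"train_images.pt\")\n    torch.save(targets, processed / \"train_target.pt\")\n\n\nif __name__ == \"__main__\":\n    make_dataset(*sys.argv[1:])  # e.g. python <project_name>/data/make_dataset.py data/raw data/processed\n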

            5. This template comes with a Makefile that can be used to easily define common operations in a project. You do not have to understand the complete file but try taking a look at it. In particular the following commands may come in handy

              make data  # runs the make_dataset.py file, try it!\nmake clean  # clean __pycache__ files\nmake requirements  # install everything in the requirements.txt file\n
              Windows users

make is a GNU build tool that is not available on Windows by default. There are two recommended ways to get it running on Windows. The first is leveraging the Windows Subsystem for Linux, which you may have already installed. The second option is utilizing the chocolatey package manager, which enables Windows users to install packages similar to Linux systems.

              In general we recommend that you add commands to the Makefile as you move along in the course. If you want to know more about how to write Makefiles then this is an excellent video.

6. Put your model file (model.py) into the <project_name>/models folder and insert the relevant code from the main.py file into the train_model.py file. Make sure that whenever a model is trained it gets saved to the models folder (preferably in sub-folders).
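
  A rough sketch of how the training and saving logic could be tied together is shown below. The model class (MyAwesomeModel), the package name (my_project) and the hyperparameters are assumptions; the important part is that the trained weights end up in a sub-folder of models/:

  from datetime import datetime\nfrom pathlib import Path\n\nimport torch\n\nfrom my_project.models.model import MyAwesomeModel  # hypothetical: your own model class\n\n\ndef train(lr: float = 1e-3, epochs: int = 10) -> None:\n    \"\"\"Train the model on the processed data and save the weights to the models folder.\"\"\"\n    model = MyAwesomeModel()\n    images = torch.load(\"data/processed/train_images.pt\")\n    targets = torch.load(\"data/processed/train_target.pt\")\n    optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n    criterion = torch.nn.CrossEntropyLoss()\n\n    for _ in range(epochs):  # simplified full-batch training loop\n        optimizer.zero_grad()\n        loss = criterion(model(images), targets)\n        loss.backward()\n        optimizer.step()\n\n    # save the trained model in a time-stamped sub-folder of models/\n    out_dir = Path(\"models\") / datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n    out_dir.mkdir(parents=True, exist_ok=True)\n    torch.save(model.state_dict(), out_dir / \"model.pt\")\n\n\nif __name__ == \"__main__\":\n    train()\n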

7. When you run train_model.py, make sure that some statistics/visualizations from the trained models get saved to the reports/figures/ folder. This could be a simple .png of the training curve.

            8. (Optional) Can you figure out a way to add a train command to the Makefile such that training can be started using

              make train\n
9. Fill out the newly created <project_name>/models/predict_model.py file, such that it takes a pre-trained model file and creates predictions for some data. The recommended interface is that users can give this file either a folder with raw images that get loaded in, or a numpy or pickle file with already loaded images, e.g. something like this

              python <project_name>/models/predict_model.py \\\n    models/my_trained_model.pt \\  # file containing a pretrained model\n    data/example_images.npy  # file containing just 10 images for prediction\n
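
  A minimal, hedged sketch of such a script is shown below. It assumes that the whole model object was saved with torch.save (rather than just a state_dict) and that .npy files contain the images as a numpy array; adjust to your own conventions:

  import sys\n\nimport numpy as np\nimport torch\n\n\ndef predict(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:\n    \"\"\"Run the model on a batch of images and return the predicted class indices.\"\"\"\n    model.eval()\n    with torch.no_grad():\n        return model(images).argmax(dim=-1)\n\n\nif __name__ == \"__main__\":\n    model_file, data_file = sys.argv[1], sys.argv[2]\n    model = torch.load(model_file)  # assumes the whole model object was saved, not just a state_dict\n    if data_file.endswith(\".npy\"):\n        images = torch.from_numpy(np.load(data_file)).float()\n    else:\n        images = torch.load(data_file)\n    print(predict(model, images))\n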
10. Fill out the file <project_name>/visualization/visualize.py with the following (as a minimum, feel free to add more visualizations); a sketch is shown after the list

              • Loads a pre-trained network
              • Extracts some intermediate representation of the data (your training set) from your cnn. This could be the features just before the final classification layer
              • Visualize features in a 2D space using t-SNE to do the dimensionality reduction.
              • Save the visualization to a file in the reports/figures/ folder.
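
  Below is a hedged sketch of how these steps could be put together. The model.backbone attribute used to extract intermediate features is a made-up name; use whatever layer or method exposes the features just before the classification layer in your own model:

  import matplotlib.pyplot as plt\nimport torch\nfrom sklearn.manifold import TSNE\n\n\ndef visualize(model_file: str = \"models/my_trained_model.pt\") -> None:\n    \"\"\"Embed intermediate features of the training set in 2D with t-SNE and save the plot.\"\"\"\n    model = torch.load(model_file)  # assumes the whole model object was saved\n    model.eval()\n\n    images = torch.load(\"data/processed/train_images.pt\")\n    targets = torch.load(\"data/processed/train_target.pt\")\n\n    with torch.no_grad():\n        features = model.backbone(images)  # hypothetical attribute returning an N x D tensor\n\n    embedding = TSNE(n_components=2).fit_transform(features.numpy())  # N x 2\n    plt.scatter(embedding[:, 0], embedding[:, 1], c=targets.numpy(), s=5, cmap=\"tab10\")\n    plt.savefig(\"reports/figures/tsne_embedding.png\")\n\n\nif __name__ == \"__main__\":\n    visualize()\n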
11. (Optional) Feel free to create more files/visualizations (what about investigating/exploring the data distribution?)

12. Make sure to update the README.md file with a short description of how your scripts should be run

            13. Finally make sure to update the requirements.txt file with any packages that are necessary for running your code (see this set of exercises for help)

14. (Optional) Let's say that you are not satisfied with the template I have recommended that you use, which is completely fine. What should you then do? You should of course create your own template! This is actually not that hard to do.

1. As a starting point I would recommend that you fork either the mlops template, which you have already been using, or alternatively the data science template.

2. After forking the template, clone it down locally and let's start modifying it. The first step is changing the cookiecutter.json file. For the mlops template it looks like this:

                {\n    \"project_name\": \"project_name\",\n    \"repo_name\": \"{{ cookiecutter.project_name.lower().replace(' ', '_') }}\",\n    \"author_name\": \"Your name (or your organization/company/team)\",\n    \"description\": \"A short description of the project.\",\n    \"python_version_number\": \"3.10\",\n    \"open_source_license\": [\"No license file\", \"MIT\", \"BSD-3-Clause\"]\n}\n

Simply add a new line to the json file with the name of the variable you want to add and the default value you want it to have.

              3. The actual template is located in the {{ cookiecutter.project_name }} folder. cookiecutter works by replacing everywhere that it sees {{ cookiecutter.<variable_name> }} with the value of the variable. Therefore, if you want to add a new file to the template, just add it to the {{ cookiecutter.project_name }} folder and make sure to add the {{ cookiecutter.<variable_name> }} where you want the variable to be replaced.

              4. After you have made the changes you want to the template, you should test it locally. Just run

                cookiecutter . -f --no-input\n

                and it should create a new folder using the default values of the cookiecutter.json file.

5. Finally, make sure to push any changes you made to the template to GitHub, such that in the future you can use it by simply running

                cookiecutter https://github.com/<username>/<my_template_repo>\n
            "},{"location":"s2_organisation_and_version_control/code_structure/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
1. Starting from complete scratch, what are the steps needed to create a new GitHub repository and push a specific template to it as the very first commit?

              Solution
1. Create a completely barebones repository, either using the GitHub UI, or if you have the GitHub CLI installed (not git) you can run

                gh repo create <repo_name> --public --confirm\n
              2. Run cookiecutter with the template you want to use

                cookiecutter <template>\n

The name of the folder created by cookiecutter should be the same as the repository name you just created.

              3. Run the following sequence of commands

                cd <project_name>\ngit init\ngit add .\ngit commit -m \"Initial commit\"\ngit remote add origin https://github.com/<username>/<repo_name>\ngit push origin master\n
              4. That's it. The template should now have been pushed to the repository as the first commit.

That ends the module on code structure and cookiecutter. We again want to stress that the point of using cookiecutter is not about following one specific template, but about using any template for organizing your code. What often happens in a team is that multiple templates are needed at different stages of the development phase or for different product types, because they share a common structure while still having some specifics. Keeping templates up-to-date then becomes critical, such that no team member is using an outdated template. If you ever end up in this situation, we highly recommend checking out cruft, which works alongside cookiecutter to not only create projects but also update existing ones as the template evolves. Cruft additionally has template validation capabilities to ensure projects match the latest version of a template.

                "},{"location":"s2_organisation_and_version_control/dvc/","title":"M8 - Data version control","text":""},{"location":"s2_organisation_and_version_control/dvc/#data-version-control","title":"Data Version Control","text":"

                Core Module

In this module, we are going to return to version control. However, this time we are going to focus on version control of data. The reason we need to distinguish between standard version control and data version control comes down to one problem: size.

Classic version control was developed to keep track of code files, which are all simple text files. Even a codebase that contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better the more data you feed them, we are seeing models today that are being trained on petabytes of data (1.000.000 GB).

Because this is an important problem, there exist a couple of frameworks that specialize in versioning data, such as DVC, DAGsHub, Hub, Modelstore and ModelDB. Regardless of the framework, they all implement roughly the same concept: instead of storing the actual data files, or in general any large artifact files, we store a pointer to these large files. We then version control the pointer instead of the artifact.

                Image credit

We are in this course going to use DVC provided by iterative.ai as they also provide tools for automating machine learning, which we are going to focus on later.

                "},{"location":"s2_organisation_and_version_control/dvc/#dvc-what-is-it","title":"DVC: What is it?","text":"

DVC (Data Version Control) is simply an extension of git that can version not only data but also models and experiments in general. But how does it deal with these large data files? Essentially, DVC will just keep track of a small metafile that points to some remote location where your original data is stored. Metafiles essentially work as placeholders for your data files. Your large data files are then stored in some remote location such as Google Drive or an S3 bucket from Amazon.

                Image credit

                As the figure shows, we now have two remote locations: one for code and one for data. We use git pull/push for the code and dvc pull/push for the data. The key concept is the connection between the data file model.pkl which is fairly large and its respective metafile model.pkl.dvc which is very small. The large file is stored in the data remote and the metafile is stored in the code remote.

                "},{"location":"s2_organisation_and_version_control/dvc/#exercises","title":"\u2754 Exercises","text":"

                If in doubt about some of the exercises, we recommend checking out the documentation for DVC as it contains excellent tutorials.

1. For these exercises, we are going to use Google Drive as a remote storage solution for our data. If you do not already have a Google account, please create one (we are going to use it again in later exercises). Please make sure that you have at least 1GB of free space.

                2. Next, install DVC and the Google Drive extension

                  pip install dvc\npip install \"dvc[gdrive]\"\n

If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all. If the installation fails, we recommend that you start by updating pip and then try to update dvc:

pip install -U pip\npip install -U \"dvc[gdrive]\"\n

                  If this does not work for you, it is most likely due to a problem with pygit2 and in that case we recommend that you follow the instructions here.

                3. In your MNIST repository run the following command from the terminal

                  dvc init\n

this will set up dvc for this repository (similar to how git init will initialize a git repository). These files should be committed using standard git to your repository.

                4. Go to your Google Drive and create a new folder called dtu_mlops_data. Then copy the unique identifier belonging to that folder as shown in the figure below

                  Using this identifier, add it as a remote storage

                  dvc remote add -d storage gdrive://<your_identifier>\n
                5. Check the content of the file .dvc/config. Does it contain a pointer to your remote storage? Afterwards, make sure to add this file to the next commit we are going to make:

                  git add .dvc/config\n
                6. Call the dvc add command on your data files exactly like you would add a file with git (you do not need to add every file by itself as you can directly add the data/ folder). Doing this should create a human-readable file with the extension .dvc. This is the metafile as explained earlier that will serve as a placeholder for your data. If you are on Windows and this step fails you may need to install pywin32. At the same time, the data folder should have been added to the .gitignore file that marks which files should not be tracked by git. Confirm that this is correct.

                7. Now we are going to add, commit and tag the metafiles so we can restore to this stage later on. Commit and tag the files, which should look something like this:

                  git add data.dvc .gitignore\ngit commit -m \"First datasets, containing 25000 images\"\ngit tag -a \"v1.0\" -m \"data v1.0\"\n
8. Finally, push your data to the remote storage using dvc push. You will be asked to authenticate, which involves copy-pasting the code from the link you are prompted with. Check out your Google Drive folder. You will see that the data is not in a recognizable format anymore due to the way that dvc packs and tracks the data. The boring detail is that dvc converts the data into content-addressable storage, which makes data much faster to get. Finally, make sure that your data is not stored in your GitHub repository.

After authenticating the first time, DVC should be set up without having to authenticate again. If dvc for some reason fails to authenticate, you can try to reset the authentication. Locate the file $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json where $CACHE_HOME depends on your operating system:

                  macOSLinuxWindows

                  ~/Library/Caches

                  ~/.cache This is the typical location, but it may vary depending on what distro you are running

                  {user}/AppData/Local

                  Delete the complete {gdrive_client_id} folder and retry authenticating with dvc push.

9. After completing the above steps, it is very easy for others (or yourself) to get set up with both code and data by simply running

                  git clone <my_repository>\ncd <my_repository>\ndvc pull\n

(assuming that you give them access rights to the folder in your drive). Try doing this (in some other location than your standard code) to make sure that the two commands indeed download both your code and data.

10. Let's look at the process of updating our data. Remember, the important aspect of version control is that we do not need to store explicit files called data_v1.pt, data_v2.pt etc., but can just have a single data.pt where we can always check out earlier versions. Start by copying the data/corruptmnist_v2 folder from this repository to your MNIST code. This contains 3 extra datafiles with 15000 additional observations. Rerun your data pipeline so these get incorporated into the files in your processed folder.

                11. Redo the above steps, adding the new data using dvc, committing and tagging the metafiles e.g. the following commands should be executed (with appropriate input):

                  dvc add -> git add -> git commit -> git tag -> dvc push -> git push.

12. Let's say that you wanted to go back to the state of your data in v1.0. If the above steps have been done correctly, you should be able to do this using:

                  git checkout v1.0\ndvc checkout\n

Confirm that you have reverted to the original data.

                13. (Optional) Finally, it is important to note that dvc is not only intended to be used to store data files but also any other large files such as trained model weights (with billions of parameters these can be quite large). For example, if we always store our best-performing model in a file called best_model.ckpt then we can use dvc to version control it, store it online and make it easy for others to download. Feel free to experiment with this using your model checkpoints.

In general dvc is a great framework for version-controlling data and models. However, it is important to note that it does have some performance issues when dealing with datasets that consist of many files. Therefore, if you are ever working with a dataset that consists of many small files, it can be a good idea to:

• Zip the files into a single archive and then version control the archive. The zip archive should be placed in a data/raw folder and then unzipped in the data/processed folder.

• If possible, turn your data into 1D arrays so it can be stored in a single file such as .parquet or .csv. This is especially useful for tabular data. Then you can version control the single file instead of the many files.

                "},{"location":"s2_organisation_and_version_control/dvc/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                1. How do you know that a repository is using dvc?

                  Solution

Similar to a git repository having a .git directory, a repository using dvc needs to have a .dvc folder. Alternatively you can use the dvc status command.

2. Assume you just added a folder called data/ that you want to track with dvc. What is the sequence of 5 commands needed to successfully version control the folder? (assuming you have already set up a remote)

                  Solution
                  dvc add data/\ngit add .\ngit commit -m \"added raw data\"\ngit push\ndvc push\n

That's all for today. With the combined power of git and dvc we should be able to version control everything in our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that dvc offers more than just data version control, so if you want to deep dive into dvc we recommend their pipeline feature and how it can be used to set up version-controlled experiments. Note that we are going to revisit dvc later for a more permanent (and large-scale) storage solution.

                "},{"location":"s2_organisation_and_version_control/git/","title":"M5 - Git","text":""},{"location":"s2_organisation_and_version_control/git/#git","title":"Git","text":"

                Core Module

Proper collaboration with other people requires that you can work on the same codebase in an organized manner. This is the reason that version control exists. Simply stated, it is a way to keep track of:

                • Who made changes to the code
                • When did the change happen
                • What changes were made

                For a full explanation please see this page

Secondly, it is important to note that GitHub is not git! GitHub is the dominant player when it comes to hosting repositories, but that does not mean that they are the only one providing free repository hosting (see bitbucket or gitlab for some other examples).

That said, we will be using git and GitHub throughout this course. It is a requirement for passing this course that you create a public repository with your code and use git to upload any code changes. How much you choose to integrate this into your own projects is up to you, but you are at least expected to be familiar with git+GitHub.

                Image credit"},{"location":"s2_organisation_and_version_control/git/#initial-config","title":"Initial config","text":"

                What does Git stand for?

                The name \"git\" was given by Linus Torvalds when he wrote the very first version. He described the tool as \"the stupid content tracker\" and the name as (depending on your mood):

                • Random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of \"get\" may or may not be relevant.
                • Stupid. Contemptible and Despicable. simple. Take your pick from the dictionary of slang.
                • \"Global information tracker\": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
                • \"Goddamn idiotic truckload of sh*t\": when it breaks
                1. Install git on your computer and make sure that your installation is working by writing git help in a terminal and it should show you the help message for git.

                2. Create a GitHub account if you do not already have one.

3. To make sure that we do not have to type in our GitHub credentials every time we want to make some changes, we can set them once and for all on our local machine

                  # type in a terminal\ngit config credential.helper store\ngit config --global user.email <email>\n
                "},{"location":"s2_organisation_and_version_control/git/#git-overview","title":"Git overview","text":"

The simplest way to think of version control is that it is just nodes with lines connecting them

Each node, which we call a commit, is uniquely identified by a hash string. Each node stores what our code looked like at that point in time (when we made the commit), and using the hash codes we can easily revert to a specific point in time.

The commits are made up of local changes that we make to our code. A basic workflow for adding commits is seen below

                Assuming that we have made some changes to our local working directory and that we want to get these updates to be online in the remote repository we have to do the following steps:

• First we run the command git add. This will move our changes to the staging area. While changes are in the staging area we can very easily revert them (using git restore). No unique hash has been assigned to the code yet, so we can still overwrite it.

• To take our code from the staging area and turn it into a commit, we simply run git commit, which will locally add a node to the graph. It is important to note that we still have not pushed the commit to the online repository.

• Finally, we want others to be able to use the changes that we made. We do a simple git push and our commit goes online

                Of course, the real power of version control is the ability to make branches, as in the image below

                Image credit

Each branch can contain code that is not present on other branches. This is useful when many developers are working together on the same project.

                "},{"location":"s2_organisation_and_version_control/git/#exercises","title":"\u2754 Exercises","text":"
1. In your GitHub account create a repository, where the intention is that you upload the code from the final exercise from yesterday

                  1. After creating the repository, clone it to your computer

                    git clone https://github.com/my_user_name/my_repository_name.git\n
                  2. Move/copy the three files from yesterday into the repository (and any other that you made)

3. Add the files to a commit by using the git add command (1)

1. Writing good commit messages is a skill in itself. A commit message should be short but informative about the work you are trying to commit. Try to practise writing good commit messages throughout the course. You can see this guideline for help.
                  4. Commit the files using git commit

                  5. Finally push the files to your repository using git push. Make sure to check online that the files have been updated in your repository.

                  6. You can always use the command git status to check where you are in the process of making a commit.

                  7. Also checkout the git log command, which will show you the history of commits that you have made.

                2. Make sure that you understand how to make branches, as this will allow you to try out code changes without messing with your working code. Creating a new branch can be done using:

                  # create a new branch\ngit checkout -b <my_branch_name>\n

Afterwards, you can use git checkout to change between branches (remember to commit your work!). Try adding something (a file, a new line of code etc.) to the newly created branch, commit it and try changing back to master afterwards. You should hopefully see that whatever you added on the branch is not present on the main branch.

                3. If you do not already have a cloned version of this repository belonging to the course, make sure to make one! I am continuously updating/changing some of the material during the course and I therefore recommend that you each day before the lecture do a git pull on your local copy

4. Git may seem like a waste of time when solutions like Dropbox, Google Drive etc. exist, and that is not completely untrue when only one or two people are working on a project. However, these file management systems fall short when hundreds to thousands of people work together. For this exercise you will go through the steps of sending an open-source contribution:

                  1. Go online and find a project you do not own, where you can improve the code. You can either look at this page of good issues to get started with or for simplicity you can just choose the repository belonging to the course. Now fork the project by clicking the Fork button.

                    This will create a local copy of the repository which you have complete writing access to. Note that code updates to the original repository do not update code in your local repository.

                  2. Clone your local fork of the project using git clone.

3. By default your local repository will be on the main branch (HINT: you can check this with the git status command). It is good practice to make a new branch when working on some changes. Use the git branch command followed by the git checkout command to create a new branch.

                  4. You are now ready to make changes to the repository. Try to find something to improve (any spelling mistakes?). When you have made the changes, do the standard git cycle: add -> commit -> push

5. Go online to the original repository and go to the Pull requests tab. Find the compare button and choose to compare the master branch of the original repo with the branch that you just created in your own repository. Check the diff on the page to make sure that it contains the changes you have made.

                  6. Write a bit about the changes you have made and click Create pull request :)

5. Forking a repository has the consequence that your fork and the repository that you forked can diverge. To mitigate this we can set what is called a remote upstream. Take a look at this page and set a remote upstream for the repository you just forked.

6. After setting the upstream branch, we need to pull and merge any updates. Take a look at this page and figure out how to do this.

7. As a final exercise we want to simulate a merge conflict, which happens when two users try to commit changes to exactly the same lines of code in the codebase, and git is not able to resolve how the different commits should be integrated.

                  1. In your browser, open your favorite repository (it could be the one you just worked on), go to any file of your choosing and click the edit button (see image below) and make some change to the file. For example, if you choose a python file you can just import some random packages at the top of the file. Commit the change.

                  2. Make sure not to pull the change you just made to your local computer. Locally make changes to the same file in the same lines and commit them afterwards.

                  3. Now try to git pull the online changes. What should (hopefully) happen is that git will tell you that it found a merge conflict that needs to be resolved. Open the file and you should see something like this

                    <<<<<<< HEAD\nthis is some content to mess with\ncontent to append\n=======\ntotally different content to merge later\n>>>>>>> master\n

This should be interpreted as: everything between <<<<<<< and ======= are the changes made by your local commit, and everything between ======= and >>>>>>> are the changes you are trying to pull. To fix the merge conflict you simply have to make the code in the two \"cells\" work together. When you are done, remove the identifiers <<<<<<<, ======= and >>>>>>>.

                  4. Finally, commit the merge and try to push.

8. (Optional) The above exercises have focused on how to use git from the terminal, which I highly recommend learning. However, if you are using a proper editor it most likely also has built-in support for version control. We recommend getting familiar with these features (here is a tutorial for VS Code)

                "},{"location":"s2_organisation_and_version_control/git/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                1. How do you know if a certain directory is a git repository?

                  Solution

You can check if there is a \".git\" directory. Alternatively, you can use the git status command.

2. Explain what the gitignore file is used for.

                  Solution

                  The file gitignore is used to tell git which files to ignore when doing a git add . command. This is useful for files that are not part of the codebase, but are needed for the code to run (e.g. data files) or files that contain sensitive information (e.g. .env files that contain API keys and passwords).

                3. You have two branches - main and devel. What sequence of commands would you need to execute to make sure that devel is in sync with main?

                  Solution
                  git checkout main\ngit pull\ngit checkout devel\ngit merge main\n
                4. What best practices are you familiar with regarding version control?

                  Solution
                  • Use a descriptive commit message
                  • Make each commit a logical unit
                  • Incorporate others' changes frequently
                  • Share your changes frequently
                  • Coordinate with your co-workers
                  • Don't commit generated files

That covers the basics of git to get you started. In the exercise folder you can find a git cheat sheet with the most useful commands for future reference. Finally, we want to point out another awesome feature of GitHub: the in-browser editor. Sometimes you have a small edit that you want to make, but would still like to do this in an IDE/editor. Or you may be in a situation where you are working from another device than your usual developer machine. GitHub has a built-in editor that can be enabled by simply changing any URL from

                https://github.com/username/repository\n

                to

                https://github.dev/username/repository\n

                Try it out on your newly created repository.

                "},{"location":"s2_organisation_and_version_control/good_coding_practice/","title":"M7 - Good coding practice","text":""},{"location":"s2_organisation_and_version_control/good_coding_practice/#good-coding-practice","title":"Good coding practice","text":"

                Quote

                Code is read more often than it is written. Guido Van Rossum (author of Python)

It is hard to define exactly what good coding practices are. But the above quote by Guido does hint at what it could be, namely that it has to do with how others observe and perceive your code. In general, good coding practice is about making sure that your code is easy to read and understand, not only by others but also by your future self. The key concept to keep in mind when we are talking about good coding practice is consistency. In many cases it does not matter exactly how you choose to style your code etc., the important part is that you are consistent about it.

                Image credit"},{"location":"s2_organisation_and_version_control/good_coding_practice/#documentation","title":"Documentation","text":"

Most programmers have a love-hate relationship with documentation: we absolutely hate writing it ourselves, but love it when someone else has actually taken the time to add it to their code. There is no doubt that well-documented code is much easier to maintain, as you do not need to remember all details about the code to still maintain it. It is key to remember that good documentation saves more time than it takes to write.

                The problem with documentation is that there is no right or wrong way to do it. You can end up doing:

                • Under documentation: You document information that is clearly visible from the code and not the complex parts that are actually hard to understand.

                • Over documentation: Writing too much documentation will have the opposite effect on most people than what you want: there is too much to read, so people will skip it.

Writing good documentation is a skill that takes time to train, so let's try to do it.

                Quote

                Code tells you how; Comments tell you why. Jeff Atwood

                "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises","title":"\u2754 Exercises","text":"
                1. Go over the most complicated file in your project. Be critical and add comments where the logic behind the code is not easily understandable. (1)

1. In deep learning we often work with multi-dimensional tensors that constantly change shape after each operation. It is good practice to annotate with comments when tensors undergo some reshaping. In the following example we compute the pairwise euclidean distance between two tensors using broadcasting, which results in multiple shape operations.

import torch\n\nx = torch.randn(5, 10)  # N x D\ny = torch.randn(7, 10)  # M x D\nxy = x.unsqueeze(1) - y.unsqueeze(0)  # (N x 1 x D) - (1 x M x D) = (N x M x D)\npairwise_euc_dist = xy.pow(2.0).sum(dim=-1).sqrt()  # N x M\n
2. Add docstrings to at least two python functions/methods. You can see here (example 5) a good example of how to use identifiable keywords such as Parameters, Args, Returns, which standardizes the way of writing docstrings.
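
  As a small, hedged illustration of the style, a Google-style docstring on a made-up helper function could look like this:

  import torch\n\n\ndef normalize(images: torch.Tensor) -> torch.Tensor:\n    \"\"\"Normalize a batch of images to zero mean and unit standard deviation.\n\n    Args:\n        images: Tensor of shape (N, H, W) containing the raw images.\n\n    Returns:\n        Tensor of the same shape with mean 0 and standard deviation 1.\n    \"\"\"\n    return (images - images.mean()) / images.std()\n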

                "},{"location":"s2_organisation_and_version_control/good_coding_practice/#styling","title":"Styling","text":"

While python already enforces some styling (e.g. code should be indented in a specific way), this is not enough to ensure that code from different users actually looks alike. Maybe even more troubling, you will often see that your own style of coding changes as you become more and more experienced. This kind of difference in coding style is not that important to take care of when you are working on a personal project, but when multiple people are working together on the same project it is important to consider.

The question then remains what styling you should use. This is where PEP8 comes into play, which is the official style guide for python. It essentially contains what is considered \"good practice\" and \"bad practice\" when coding python.

For many years the most commonly used tool to check if your code is PEP8 compliant has been flake8. However, in this course we are going to use ruff, which is quickly gaining popularity due to how fast it is and how quickly the developers are adding new features. (1)

1. Both flake8 and ruff are what is called a linter or lint tool, which is any kind of static code analysis program used to flag programming errors, bugs, and styling errors.
                "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises_1","title":"\u2754 Exercises","text":"
                1. Install ruff

                  pip install ruff\n
                2. Run ruff on your project or part of your project

                  ruff check .  # Lint all files in the current directory (and any subdirectories)\nruff check path/to/code/  # Lint all files in `/path/to/code` (and any subdirectories).\n

                  are you PEP8 compliant or are you a normal mortal?

You could go and fix all the small errors that ruff is giving you. However, in practice large projects instead rely on some kind of code formatter that will automatically format your code to be PEP8 compliant. Some of the biggest formatters in Python for the longest time have been black and yapf, but we are going to use ruff, which also has a built-in formatter that should be a drop-in replacement for black.

                1. Try to use ruff format to format your code

                  ruff format .  # Format all files in the current directory.\nruff format /path/to/file.py  # Format a single file.\n

By default ruff will apply a selection of rules when we are either checking or formatting the code. However, many more rules can be activated through configuration. If you have completed module M6 on code structure you will have encountered the pyproject.toml file, which can store both build instructions for our package and configuration of developer tools. Let's try to configure ruff using the pyproject.toml file.

1. One aspect that is not covered by PEP8 is how import statements in Python should be organized. If you are like most people, you place your import statements at the top of the file and they are ordered simply by when you needed them. A better practice is to introduce some clear structure in your imports. In older versions of this course we have used isort for this, but here we are going to configure ruff to do the job. In your pyproject.toml file add the following lines

                  [tool.ruff]\nselect = [\"I\"]\n

                  and try re-running ruff check and ruff format. Hopefully this should reorganize your imports to follow common practice. (1)

1. The common practice is to first list built-in python packages (like os) in one block, followed by third-party dependencies (like torch) in a second block and finally imports from your own package in a third block. Each block is then put in alphabetical order.
2. One PEP8 styling rule that is often diverged from is the recommended line length of 79 characters, which by many (including myself) is considered very restrictive. If your code consists of multiple levels of indentation, you can quickly run into 79 characters being limiting. For this reason many projects increase it, often to 120 characters, which seems to be the sweet spot for how many characters fit in a coding window on a laptop. Add the line

                  line-length=120\n

                  under the [tool.ruff] section in the pyproject.toml file and rerun ruff check and ruff format on your code.

3. Experiment yourself with further configuration of ruff. In particular we recommend adding more rules and looking at the [tool.ruff.pydocstyle] configuration to indicate how you have styled your documentation.

                "},{"location":"s2_organisation_and_version_control/good_coding_practice/#typing","title":"Typing","text":"

In addition to writing documentation and following a specific styling, in python we have a third way of improving the quality of our code: through typing. Typing goes back to earlier programming languages like C, C++ etc., where data types needed to be explicitly stated for variables:

#include <iostream>\n\nint main() {\n    int x = 5 + 6;\n    float y = 0.5;\n    std::cout << \"Hello World! \" << x << \" \" << y << std::endl;\n    return 0;\n}\n

This is not required by python, but it can really improve the readability of code, in that you can directly read from the code what the expected types of input arguments and return values are. In python the : character has been reserved for type hints. Here is one example of adding typing to a function:

                def add2(x: int, y: int) -> int:\n    return x+y\n

Here we mark that both x and y are integers, and using the arrow notation -> we mark that the output type is also an integer. Assuming that we are also going to use the function for floats and torch.Tensors, we could improve the typing by specifying a union of types. Depending on the version of python you are using, the syntax for this can be different.

                python <3.10python >=3.10
                from torch import Tensor  # note it is Tensor with upper case T. This is the base class of all tensors\nfrom typing import Union\ndef add2(x: Union[int, float, Tensor], y: Union[int, float, Tensor]) -> Union[int, float, Tensor]:\n    return x+y\n
                from torch import Tensor  # note it is Tensor with upper case T. This is the base class of all tensors\ndef add2(x: int | float | Tensor, y: int | float | Tensor) -> int | float | Tensor:\n    return x+y\n

Finally, since this is a very generic function that also works on numpy arrays etc., we can always default to the Any type if we are not sure about all the specific types that the function can take

                from typing import Any\ndef add2(x: Any, y: Any) -> Any:\n    return x+y\n

However, in this case we are basically in the same situation as if our function were not typed, as the type hints do not help us at all. Therefore, use Any only when necessary.

                "},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises_2","title":"\u2754 Exercises","text":"

                Exercise files

                1. We provide a file called typing_exercise.py. Add typing everywhere in the file. Please note that you will need the following import:

                  from typing import Callable, Optional, Tuple, Union, List  # you will need all of them in your code\n

                  for it to work. This cheat sheet is a good resource on typing. We also provide typing_exercise_solution.py, but try to solve the exercise yourself.

                2. mypy is what is called a static type checker. If you are using typing in your code, then a static type checker can help you find common mistakes. mypy does not run your code, but it scans it and checks that the types you have given are compatible. Install mypy

                  pip install mypy\n
3. Try to run mypy on the typing_exercise.py file

                  mypy typing_exercise.py\n

If you have solved the first exercise correctly then you should get no errors. If not, mypy should tell you where your types are incompatible.
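
  To get a feeling for the kind of mistake a static type checker catches, consider this small made-up example. The call runs fine at runtime (it silently concatenates the strings), but mypy flags it because the annotations promise integers:

  def add2(x: int, y: int) -> int:\n    return x + y\n\n\nresult = add2(\"1\", \"2\")  # runs at runtime and returns \"12\", but mypy reports incompatible argument types\n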

                "},{"location":"s2_organisation_and_version_control/good_coding_practice/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                1. According to PEP8 what is wrong with the following code?

                  class myclass(nn.Module):\n    def TrainNetwork(self, X, y):\n        ...\n
                  Solution

According to PEP8, classes should follow the CapWords convention, meaning that the first letter of each word in the class name should be capitalized. Thus myclass should be MyClass. On the other hand, functions and methods should be all lowercase with words separated by underscores. Thus TrainNetwork should be train_network.
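
  As a small sketch, the corrected version of the class from the question could look like this:

  from torch import nn\n\n\nclass MyClass(nn.Module):  # classes use the CapWords convention\n    def train_network(self, x, y):  # functions/methods use lowercase_with_underscores\n        ...\n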

2. What would be the type of argument x for a function def f(x): if it should support the following inputs

                  x1 = [1, 2, 3, 4]\nx2 = (1, 2, 3, 4)\nx3 = None\nx4 = {1: \"1\", 2: \"2\", 3: \"3\", 4: \"4\"}\n
                  Solution

                  The easy solution would be to do def f(x : Any). But instead we could also go with:

                  def f(x: None | Tuple[int, ...] | List[int] | Dict[int, str]):\n

                  alternatively, we could also do

                  def f(x: None | Iterable[int]):\n

                  because both list, tuple and dict are iterables and therefore can be covered by one type (in this specific case).

This ends the module on coding style. We again want to emphasize that good coding style is more about having a consistent style than strictly following PEP8. A good example of this is Google, which has its own style guide for Python. This guide does not match PEP8 exactly, but it makes sure that different teams within Google working on different projects still follow the same style to a large degree, so if a project is handed from one team to another then at least style will not be a problem.

                "},{"location":"s3_reproducibility/","title":"Reproducibility","text":"

                Slides

Today is all about reproducibility - one of those concepts that everyone agrees is very important and that something should be done about, but the reality is that it is very hard to ensure full reproducibility. The last sessions have already touched a bit on how tools like conda and code organization can help make code more reproducible. Today we are going all the way to ensure that our scripts and our computing environment are fully reproducible.

                "},{"location":"s3_reproducibility/#why-does-reproducibility-matter","title":"Why does reproducibility matter","text":"

                Reproducibility is closely related to the scientific method:

                Observe -> Question -> Hypotheses -> Experiment -> Conclude -> Result -> Observe -> ...

                Not having reproducible experiments essentially breaks the cycle between doing experiments and making conclusions. If experiments are not reproducible, then we do not expect that others will arrive at the same conclusion as ourselves. As machine learning experiments are fundamentally the same as doing chemical experiments in a laboratory, we should be equally careful in making sure our environments are reproducible (think of your laptop as your laboratory).

                Secondly, if we focus on why reproducibility matters especially in machine learning, it is part of the bigger challenge of making sure that machine learning is trustworthy.

                Many different aspects are needed if trustworthy machine learning is ever going to be a reality. We need robustness of our pipelines so we can trust that they do not fail under heavy load. We need integrity to make sure that pipelines are deployed if they are of high quality. We need explainability to make sure that we understand what our machine learning models are doing, so it is not just a black box. We need reproducibility to make sure that the results of our models can be reproduced over and over again. Finally, we need fairness to make sure that our models are not biased toward specific populations. Figure inspired by this paper.

Trustworthy ML is the idea that machine learning agents can be trusted. Take the example of a machine learning agent being responsible for medical diagnoses. It is very clear that we need to be able to trust that the agent gives us the correct diagnosis for the system to work in practice. Reproducibility plays a big role here, because without it we cannot be sure that the same agent deployed at two different hospitals will give the same diagnosis (given the same input).

                Learning objectives

                The learning objectives of this session are:

                • To understand the importance of reproducibility in computer science
                • To be able to use docker to create a reproducible container, including how to build them from scratch
                • Understand different ways of configuring your code and how to use hydra to integrate with config files
                "},{"location":"s3_reproducibility/config_files/","title":"M10 - Config Files","text":""},{"location":"s3_reproducibility/config_files/#config-files","title":"Config files","text":"

                With docker we can make sure that our compute environment is reproducible, but that does not mean that all our experiments magically become reproducible. There are other factors that are important for creating reproducible experiments.

In this paper (highly recommended read) the authors tried to reproduce the results of 255 papers and to figure out which factors were significant for success. One of those factors was \"Hyperparameters Specified\", e.g. whether or not the authors of the paper had precisely specified the hyperparameters that were used to run the experiments. It should come as no surprise that this can be a determining factor for reproducibility; however, it is not a given that hyperparameters are always well specified.

                "},{"location":"s3_reproducibility/config_files/#configure-experiments","title":"Configure experiments","text":"

There is really no way around it: deep learning contains a lot of hyperparameters. In general, a hyperparameter is any parameter that affects the learning process (e.g. the weights of a neural network are not hyperparameters because they are a consequence of the learning process). The problem with having many hyperparameters to control in your code is that if you are not careful about structuring them, it may be hard after running an experiment to figure out which hyperparameters were actually used. Lack of proper configuration management can cause serious problems with reliability, uptime, and the ability to scale a system.

One of the most basic ways of structuring hyperparameters is just to put them directly into your train.py script in some object:

class my_hp:\n    batch_size = 64\n    lr = 1e-4\n    other_hp = 12345\n\n# easy access to them\ndl = DataLoader(dataset, batch_size=my_hp.batch_size)\n

The problem here is that configuration is not easy. Each time you want to run a new experiment, you basically have to change the script. If you run the code multiple times without committing the changes in between, the exact hyperparameter configuration for some experiments may be lost. Alright, with this in mind you change strategy to use an argument parser, e.g. run experiments like this

                python train.py --batch_size 256 --learning_rate 1e-4 --other_hp 12345\n
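
As a hedged sketch, the argument-parser approach inside train.py could look like this (the hyperparameter names simply mirror the command above):

import argparse\n\n\ndef parse_args() -> argparse.Namespace:\n    \"\"\"Parse the hyperparameters from the command line.\"\"\"\n    parser = argparse.ArgumentParser(description=\"Train a model\")\n    parser.add_argument(\"--batch_size\", type=int, default=64)\n    parser.add_argument(\"--learning_rate\", type=float, default=1e-4)\n    parser.add_argument(\"--other_hp\", type=int, default=12345)\n    return parser.parse_args()\n\n\nif __name__ == \"__main__\":\n    args = parse_args()\n    print(f\"Training with batch_size={args.batch_size} and learning_rate={args.learning_rate}\")\n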

This at least solves the problem with configurability. However, we can again end up losing experiments if we are not careful.

                What we really want is some way to easily configure our experiments where the hyperparameters are systematically saved with the experiment. For this we turn our attention to Hydra, a configuration tool that is based around writing config files to keep track of hyperparameters. Hydra operates on top of OmegaConf which is a yaml based hierarchical configuration system.

                A simple yaml configuration file could look like

                #config.yaml\nhyperparameters:\n  batch_size: 64\n  learning_rate: 1e-4\n

                with the corresponding python code for loading the file

                from omegaconf import OmegaConf\n# loading\nconfig = OmegaConf.load('config.yaml')\n\n# accessing in two different ways\ndl = DataLoader(dataset, batch_size=config.hyperparameters.batch_size)\noptimizer = torch.optim.Adam(model.parameters(), lr=config['hyperparameters']['learning_rate'])\n

                or using hydra for loading the configuration

                import hydra\n\n@hydra.main(config_name=\"config.yaml\")\ndef main(cfg):\n    print(cfg.hyperparameters.batch_size, cfg.hyperparameters.learning_rate)\n\nif __name__ == \"__main__\":\n    main()\n

                The idea behind refactoring our hyperparameters into .yaml files is that we disentangle the model configuration from the model. In this way it is easier to do version control of the configuration because we have it in a separate file.

                "},{"location":"s3_reproducibility/config_files/#exercises","title":"\u2754 Exercises","text":"

                Exercise files

                The main idea behind the exercises is to take a single script (that we provide) and use Hydra to make sure that everything gets correctly logged such that you would be able to exactly report to others how each experiment was configured. In the provided script, the hyperparameters are hardcoded into the code and your job will be to separate them out into a configuration file.

                Note that we provide a solution (in the vae_solution folder) that can help you get through the exercise, but try to look online for answers before looking at the solution. Remember: it's not about the result, it's about the journey.

                1. Start by installing hydra: pip install hydra-core --upgrade

                2. Next, take a look at the vae_mnist.py and model.py files and understand what is going on. It is a model we will revisit during the course.

                3. Identify the key hyperparameters of the script. Some of them should be easy to find, but at least 3 are hidden inside the core part of the code. One essential hyperparameter is also not included in the script but is needed to make it completely reproducible (HINT: the weights of any neural network are initialized at random).

                4. Write a configuration file config.yaml where you write down the hyperparameters that you have found

                5. Get the script running by loading the configuration file inside your script (using hydra) so that the hyperparameters are read from the configuration file. Note: you should only edit the vae_mnist.py file and not the model.py file.

                6. Run the script

                7. By default hydra will write the results to an outputs folder, with a sub-folder for the day the experiment was run and a further sub-folder for the time it was started. Inspect your run by going over each file that hydra has generated and check that the information has been logged. Can you find the hyperparameters?

                8. Hydra also allows for dynamically changing and adding parameters on the fly from the command-line:

                  1. Try changing one parameter from the command-line

                    python vae_mnist.py hyperparameters.seed=1234\n
                  2. Try adding one parameter from the command-line

                    python vae_mnist.py +experiment.stuff_that_i_want_to_add=42\n
                9. By default the file vae_mnist.log should be empty, meaning that whatever you printed to the terminal did not get picked up by Hydra. This is because Hydra under the hood makes use of the native python logging package. This means that to also save all printed output from the script we need to replace all calls to print with log.info

                  1. Create a logger in the script:

                    import logging\nlog = logging.getLogger(__name__)\n
                  2. Exchange all calls to print with calls to log.info

                  3. Try re-running the script and make sure that the output printed to the terminal also gets saved to the vae_mnist.log file

                10. Make sure that your script is fully reproducible. To check this you will need two runs of the script to compare. Then run the reproducibility_tester.py script as

                  python reproducibility_tester.py path/to/run/1 path/to/run/2\n

                  the script will go over the trained weights to see if they match and check that the hyperparameters were the same. Note: for the script to work, the weights should be saved to a file called trained_model.pt (this is the default of the vae_mnist.py script, so this is only relevant if you have changed how the weights are saved)

                11. Finally, make a new experiment using a new configuration file where you have changed a hyperparameter of your own choice. You are not allowed to change the configuration file in the script but should instead be able to provide it as an argument when launching the script, e.g. something like

                  python vae_mnist.py experiment=exp2\n

                  We recommend that you use a file structure like this (a minimal sketch of the corresponding config files is shown after this list)

                  |--conf\n|  |--config.yaml\n|  |--experiments\n|     |--exp1.yaml\n|     |--exp2.yaml\n|--my_app.py\n
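                One way to realize such a structure with Hydra is to use a defaults list in the top-level config that points to a config group with one file per experiment. The sketch below is only one possible setup (not the course's reference solution); it assumes the group folder is named experiment so that it matches the experiment=exp2 override above, and uses the # @package _global_ directive to place the values at the top level of the config:

                #conf/config.yaml\ndefaults:\n  - experiment: exp1  # which experiment file to load by default\n\n#conf/experiment/exp1.yaml\n# @package _global_\nhyperparameters:\n  batch_size: 64\n  learning_rate: 1e-4\n\n#conf/experiment/exp2.yaml\n# @package _global_\nhyperparameters:\n  batch_size: 128\n  learning_rate: 1e-3\n

                In my_app.py the decorator would then point at the conf folder, e.g. @hydra.main(config_path=\"conf\", config_name=\"config\"), so that running python my_app.py experiment=exp2 swaps in the second set of hyperparameters.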
                "},{"location":"s3_reproducibility/config_files/#final-exercise","title":"Final exercise","text":"

                Make your MNIST code reproducible! Apply what you have just done to the simple script to your MNIST code. The only requirement is that you this time use multiple configuration files, meaning that you should have at least one model_conf.yaml file and one training_conf.yaml file that separate out the hyperparameters that have to do with the model definition from those that have to do with the training. You can also choose to work with even more complex config setups: in the image below the configuration has two layers such that we can individually specify hyperparameters belonging to a specific model architecture and hyperparameters for each individual optimizer that we may try.

                Image credit"},{"location":"s3_reproducibility/docker/","title":"M9 - Docker","text":""},{"location":"s3_reproducibility/docker/#docker","title":"Docker","text":"

                Core Module

                Image credit

                While the above picture may seem silly at first, it is actually pretty close to how docker came into existence. A big part of creating an MLOps pipeline is that you are able to reproduce it. Reproducibility goes beyond versioning our code with git and using conda environments to keep track of our python installations. To really get reproducibility we also need to capture system level components like

                • operating system
                • software dependencies (other than python packages)

                Docker provides this kind of system-level reproducibility by creating isolated environments for programs and their dependencies. In addition to providing reproducibility, one of the key features of docker is also scalability, which is important when we later on are going to discuss deployment. Because docker is system-level reproducible, it does not (conceptually) matter if we try to start our program on a single machine or on 1000 machines at once.

                "},{"location":"s3_reproducibility/docker/#docker-overview","title":"Docker overview","text":"

                Docker has three main concepts: docker file, docker image and docker container (a small command-line sketch tying them together is shown after the list):

                • A docker file is a basic text document that contains all the commands a user would call on the command line to run an application. This includes installing dependencies, pulling data from online storage, setting up code and specifying the commands that you want to run (e.g. python train.py)

                • Running, or more correctly building a docker file will create a docker image. An image is a lightweight, standalone/containerized, executable package of software that includes everything (application code, libraries, tools, dependencies etc.) necessary to make an application run.

                • Actually running an image will create a docker container. This means that the same image can be launched multiple times, creating multiple containers.
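                To make the relationship between the three concepts concrete, they map onto two commands (the file and image names below are purely illustrative):

                # dockerfile -> image: build the instructions in myapp.dockerfile into an image\ndocker build -f myapp.dockerfile . -t myapp:latest\n\n# image -> container: every docker run starts a new container from that image\ndocker run myapp:latest\n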

                The exercises today will focus on how to construct the actual docker file, as this is the first step to constructing your own container.

                "},{"location":"s3_reproducibility/docker/#docker-sharing","title":"Docker sharing","text":"

                The whole point of using docker is that sharing applications becomes much easier. In general, we have two options

                • After creating the Dockerfile we can simply commit it to GitHub (it's just a text file) and then ask other users to build the image themselves.

                • After building the image ourselves, we can choose to upload it to an image registry such as Docker Hub where others can get our image by simply running docker pull, making them able to instantaneously run it as a container, as shown in the figure below (example commands are sketched after this list)
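                As a rough sketch of the second option, where the registry is Docker Hub and the user and image names are placeholders (not part of the course material):

                # tag the locally built image with your Docker Hub username and push it\ndocker tag myapp:latest <docker_hub_username>/myapp:latest\ndocker push <docker_hub_username>/myapp:latest\n\n# anyone can now pull the image and run it as a container\ndocker pull <docker_hub_username>/myapp:latest\ndocker run <docker_hub_username>/myapp:latest\n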

                Image credit"},{"location":"s3_reproducibility/docker/#exercises","title":"\u2754 Exercises","text":"

                In the following exercises we guide you through how to build a dockerfile for your MNIST repository that will make the training and prediction a self-contained application. Please make sure that you somewhat understand each step and do not just copy the commands from the exercise. Also note that you probably need to execute the exercises from an elevated terminal, e.g. with administrative privileges.

                The exercises today are only an introduction to docker and some of the steps are going to be unoptimized from a production point of view. For example, we often want to keep the size of a docker image as small as possible, which we are not focusing on in these exercises.

                If you are using VS Code then we recommend installing the Docker VS Code extension to easily get an overview of which images have been built and which containers are running. Additionally, the extension named Dev Containers may also be beneficial for you to download.

                1. Start by installing docker. How much trouble you need to go through depends on your operating system. For Windows and Mac we recommend installing Docker Desktop, which comes with a graphical user interface (GUI) for quickly viewing the docker images and docker containers currently built/in use. Windows users that have not installed WSL yet are going to have to do it now (as docker needs it as a backend for starting virtual machines), but you do not need to install docker inside WSL. After installing docker we recommend that you restart your laptop.

                2. Try running the following to confirm that your installation is working:

                  docker run hello-world\n

                  which should give the message

                  Hello from Docker!\nThis message shows that your installation appears to be working correctly.\n
                3. Next let's try to download an image from docker hub. Download the busybox image:

                  docker pull busybox\n

                  which is a very small (1-5 MB) containerized application that contains the most essential GNU fileutils, shellutils, etc.

                4. After pulling the image, write

                  docker images\n

                  which should show you all images that are available. You should see the busybox image that we just downloaded.

                5. Let's try to run this image

                  docker run busybox\n

                  you will find that nothing happens! The reason is that we did not provide any command to docker run. We essentially just asked it to start the busybox virtual machine, do nothing and then close it again. Now, try again, this time with

                  docker run busybox echo \"hello from busybox\"\n

                  Note how fast this process is. In just a few seconds, Docker is able to start a virtual machine, execute a command and kill it afterwards.

                6. Try running

                  docker ps\n

                  what does this command do? What if you add -a to the end?

                7. If we want to run multiple commands within the virtual machine, we can start it in interactive mode

                  docker run -it busybox\n

                  this can be a great way to investigate what the filesystem of our virtual machine looks like.

                8. As you may have already noticed by now, each time we execute docker run we can still see small remnants of the containers using docker ps -a. These stray containers can end up taking a lot of disk space. To remove them, use docker rm and provide the container id that you want to delete

                  docker rm <container_id>\n
                9. Let's now move on to constructing a docker file ourselves for our MNIST project. Create a file called trainer.dockerfile. The intention is that we want to develop one dockerfile for running our training script and one for doing predictions.

                10. Instead of starting from scratch we nearly always want to start from some base image. For this exercise we are going to start from a simple python image. Add the following to your Dockerfile

                  # Base image\nFROM python:3.9-slim\n
                11. Next we are going to install some essentials in our image. Since we start from a python base image, the essentials here are mainly the system build tools needed to compile some python packages. These instructions may seem familiar if you are using linux:

                  # install python\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n
                12. The previous two steps are common for any docker application where you want to run python. All the remaining steps are application specific (to some degree):

                  1. Let's copy over our application (the essential parts) from our computer to the container:

                    COPY requirements.txt requirements.txt\nCOPY pyproject.toml pyproject.toml\nCOPY <project-name>/ <project-name>/\nCOPY data/ data/\n

                    Remember that we only want the essential parts to keep our docker image as small as possible. Why do we need each of these files/folders to run training in our docker container?

                  2. Let's set the working directory in our container and add commands that install the dependencies (1):

                    1. We split the installation into two steps, such that docker can cache our project dependencies separately from our application code. This means that if we change our application code, we do not need to reinstall all the dependencies. This is a common strategy for docker images.

                      As an alternative you can use RUN make requirements if you have a Makefile that installs the dependencies. Just remember to also copy over the Makefile into the docker image.

                    WORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\nRUN pip install . --no-deps --no-cache-dir\n

                    the --no-cache-dir flag is quite important. Can you explain what it does and why it is important in relation to docker?

                  3. Finally, we are going to set our training script as the entrypoint for our docker image. The entrypoint is the application that we want to run when the image is executed:

                    ENTRYPOINT [\"python\", \"-u\", \"<project_name>/train_model.py\"]\n

                    the \"u\" here makes sure that any output from our script e.g. any print(...) statements gets redirected to our terminal. If not included you would need to use docker logs to inspect your run.

                13. We are now ready to build our docker file into a docker image

                  docker build -f trainer.dockerfile . -t trainer:latest\n
                  MAC M1/M2 users

                  In general docker images are built for a specific platform. For example, if you are using a Mac with an M1/M2 chip then you are running on an ARM architecture, whereas if you are using a Windows or Linux machine then you are most likely running on an AMD64 architecture. This is important to know when building docker images, because images you build may not work on platforms other than the one you built them on. You can specify which platform you want to build for by adding the --platform argument to the docker build command:

                  docker build --platform linux/amd64 -f trainer.dockerfile . -t trainer:latest\n

                  and also when running the image:

                  docker run --platform linux/amd64 trainer:latest\n

                  Do note that this will significantly increase the build and run time of your docker image when running locally, because docker will need to emulate the other platform. In general, for the exercises today you should not need to specify the platform, but be aware of this if you are building docker images on your own.

                  Please note that here we are providing two extra arguments to docker build. The -f trainer.dockerfile . (the dot is important to remember) indicates which dockerfile we want to build from (not needed if you named it just Dockerfile) and the -t trainer:latest is the name and tag that we see afterwards when running docker images (see image below). Please note that building a docker image can take a couple of minutes.

                  Docker images and space

                  Docker images can take up a lot of space on your computer, especially the docker images we are trying to build here because Pytorch is a huge dependency. If you are running low on space, you can try to

                  docker system prune\n

                  alternatively you can manually delete images using docker rmi {image_name}:{image_tag}.

                14. Try running docker images and confirm that you get output similar to the one above. If you succeed with this, then try running the docker image

                  docker run --name experiment1 trainer:latest\n

                  you should hopefully see your training starting. Please note that we can start as many containers as we want at the same time by giving them all different names using the --name tag.

                  1. You are most likely going to re-build your docker image multiple times, either due to an implementation error or the addition of new functionality. Therefore, instead of watching pip suffer through downloading torch for the 20th time, you can reuse the cache from the last time the docker image was built. To do this, replace the line in your dockerfile that installs your requirements with:

                    RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt\n

                    which mounts your local pip cache to the docker image. For building the image you need to have enabled the BuildKit feature. If you have docker version v23.0 or later (you can check this by running docker version) then this is enabled by default. Else you need to enable it by setting the environment variable DOCKER_BUILDKIT=1 before building the image.

                    Try changing your dockerfile and re-building the image. You should see that the build process is much faster.

                15. Remember, if you are ever in doubt about how files are organized inside a docker image, you always have the option to start the image in interactive mode:

                  docker run -it --entrypoint sh {image_name}:{image_tag}\n
                16. When your training has completed you will notice that any files created when running your training script are not present on your laptop (for example if your script saves the trained model to a file). This is because the files were created inside your container (which is its own little machine). To get the files you have two options:

                  1. If you already have a completed run then you can use

                    docker cp\n

                    to copy files between your container and laptop. For example, to copy a file called trained_model.pt out of the container you would do:

                    docker cp {container_name}:{dir_path}/{file_name} {local_dir_path}/{local_file_name}\n

                    Try this out.

                  2. A much more efficient strategy is to mount a volume that is shared between the host (your laptop) and the container. This can be done with the -v option for the docker run command. For example, if we want to automatically get the trained_model.pt file after running our training script we could simply execute the container as

                    docker run --name {container_name} -v %cd%/models:/models/ trainer:latest\n

                    this command mounts our local models folder as a corresponding models folder in the container. Any file saved by the container to this folder will be synchronized back to our host machine. Try this out! Note that if you have multiple files/folders that you want to mount, you can simply repeat the -v option (if in doubt about the file organization in the container, try doing the next exercise first). Also note that the %cd% part needs to change depending on your OS; see this page for help.

                17. With training done we also need to write an application for prediction. Create a new docker image called predict.dockerfile. This file should call your <project_name>/models/predict_model.py script instead. This image will need some trained model weights to work. Feel free to either include these during the build process or mount them afterwards. When you have created the file, try to build and run it to confirm that it works. Hint: if you are passing in the model checkpoint and prediction data as arguments to your script, your docker run command probably needs to look something like

                  docker run --name predict --rm \\\n    -v %cd%/trained_model.pt:/models/trained_model.pt \\  # mount trained model file\n    -v %cd%/data/example_images.npy:/example_images.npy \\  # mount data we want to predict on\n    predict:latest \\\n    ../../models/trained_model.pt \\  # argument to script, path relative to script location in container\n    ../../example_images.npy\n
                18. (Optional, requires GPU support) By default a virtual machine created by docker only has access to your CPU and not your GPU. While you do not necessarily have a laptop with a GPU that supports training of neural networks (e.g. one from Nvidia), it is beneficial that you understand how to construct a docker image that can take advantage of a GPU, in case you in the future were to run it on a machine that has one (e.g. in the cloud). It does take a bit more work, but many of the steps will be similar to building a normal docker image.

                  1. There are three prerequisites for working with Nvidia GPU accelerated docker containers. First you need to have the Docker Engine installed (already taken care of), then have an Nvidia GPU with updated GPU drivers and finally have the Nvidia container toolkit installed. The last part you likely have not installed yet and will need to do now. Some distros of Linux have known problems with the installation process, so you may have to search through known issues in the nvidia-docker repository to find a solution

                  2. To test that everything is working start by pulling a relevant Nvidia docker image. In my case this is the correct image:

                    docker pull nvidia/cuda:11.0.3-base-ubuntu20.04\n

                    but it may differ based on what cuda version you have. You can find all the different official Nvidia images here. After pulling the image, try running the nvidia-smi command inside a container based on the image you just pulled. It should look something like this:

                    docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi\n

                    and should show an image like below:

                    If it does not work, try redoing the steps.

                  3. We should hopefully have a working setup now for running Nvidia accelerated docker containers. The next step is to get Pytorch inside our container, such that our Pytorch implementation also correctly identifies the GPU. Luckily for us, Nvidia provides a set of docker images for GPU-optimized software for AI, HPC and visualizations through their NGC Catalog. The containers that have to do with Pytorch can be seen here. Try pulling the latest:

                    docker pull nvcr.io/nvidia/pytorch:22.07-py3\n

                    It may take some time, because the NGC images include a lot of other software for optimizing Pytorch applications. It may be possible for you to find other images for running GPU accelerated applications that have a smaller memory footprint, but NGC is the recommended and supported way.

                  4. Let's test that this container works:

                    docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.07-py3\n

                    this should run the container in interactive mode attached to your current terminal. Try opening python in the container and writing:

                    import torch\nprint(torch.cuda.is_available())\n

                    which hopefully should return True.

                  5. Finally, we need to incorporate all this into our already developed docker files for our application. This is also fairly easy as we just need to change our FROM statement in the beginning of our docker file:

                    FROM python:3.7-slim\n

                    change to

                    FROM  nvcr.io/nvidia/pytorch:22.07-py3\n

                    try doing this to one of your docker files, build the image and run the container. Remember to check that your application is using GPU by printing torch.cuda.is_available().

                19. (Optional) Another way you can use dockerfiles in your day-to-day work is for dev containers. Developer containers allow you to develop code directly inside a container, making sure that your code is running in the same environment as it will when deployed. This is especially useful if you are working on a project that has a lot of dependencies that are hard to install on your local machine. Setup instructions for VS Code and PyCharm can be found here (it should be simple since we have already installed docker):

                  • VS code
                  • Pycharm

                  We focus on the VS code setup here.

                  1. First install the Remote - Containers extension.

                  2. Create a .devcontainer folder in your project root and create a Dockerfile inside it. We keep this file very barebones for now, so let's just define a base installation of python:

                    FROM python:3.11-slim-buster\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n
                  3. Create a devcontainer.json file in the .devcontainer folder. This file should look something like this:

                    {\n    \"name\": \"my_working_env\",\n    \"dockerFile\": \"Dockerfile\",\n    \"postCreateCommand\": \"pip install -r requirements.txt\"\n}\n

                    this file tells VS code that we want to use the Dockerfile that we just created and that we want to install our python dependencies after the container has been created.

                  4. After creating these files, you should be able to open the command palette in VS code (F1) and search for the option Remote-Containers: Reopen in Container or Remote-Containers: Rebuild and Reopen in Container. Choose either of these options.

                    This will start a new VS code instance inside a docker container. You should be able to see this in the bottom left corner of your VS code window. You should also be able to see that the python interpreter has changed to the one inside the container.

                    You are now ready to start developing inside the container. Try opening a terminal and run python and import torch to confirm that everything is working.

                20. (Optional) In M8 on data version control you learned about the framework dvc for version controlling data. A natural question at this point is how to incorporate dvc into our docker image. We need to do two things:

                  • Make sure that dvc has all the correct files to pull data from our remote storage
                  • Make sure that dvc has the correct credentials to pull data from our remote storage

                  We are going to assume that dvc (and any dvc extension needed) is part of your requirements.txt file and that it is already being installed by a RUN pip install -r requirements.txt command in your dockerfile. If not, then you need to add it.

                  1. Add the following lines to your dockerfile

                    RUN dvc init --no-scm\nCOPY .dvc/config .dvc/config\nCOPY *.dvc *.dvc\nRUN dvc config core.no_scm true\nRUN dvc pull\n

                    The first line initializes dvc in the docker image. The --no-scm option is needed because normally dvc can only be initialized inside a git repository, but this option allows us to initialize dvc without being in one. The second and third lines copy over the dvc config file and the dvc metadata files that are needed to pull data from your remote storage. The fourth line configures dvc to keep operating without source control (core.no_scm) and the last line pulls the data.

                  2. If your data is not public, we need to provide credentials in some way to pull the data. We are for now going to do it in a not-so-secure way. When dvc first connected to your drive, a credential file was created. This file is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json, where $CACHE_HOME depends on your operating system:

                    • macOS: ~/Library/Caches

                    • Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)

                    • Windows: {user}/AppData/Local

                    Find the file. The content should look similar to this (only some fields are shown):

                    {\n    \"access_token\": ...,\n    \"client_id\": ...,\n    \"client_secret\": ...,\n    \"refresh_token\": ...,\n    ...\n}\n

                    We are going to copy the file into our docker image. This of course is not a secure way of doing it, but it is the easiest way to get started. As long as you are not sharing your docker image with anyone else, then it is fine. Add the following lines to your dockerfile before the RUN dvc pull command:

                    COPY <path_to_default.json> default.json\nRUN dvc remote modify myremote --local gdrive_service_account_json_file_path default.json\n

                    where <path_to_default.json> is the path to the default.json file that you just found. The last line tells dvc to use the default.json file as the credentials for pulling data from your remote storage. You can confirm that this works by running dvc pull in your docker image.

                    "},{"location":"s3_reproducibility/docker/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                    1. What is the difference between a docker image and a docker container?

                      Solution

                      A docker image is a template for a docker container. A docker container is a running instance of a docker image. A docker image is a static file, while a docker container is a running process.

                    2. What are the 3 steps involved in containerizing an application?

                      Solution
                      1. Write a Dockerfile that includes your app (including the commands to run it) and its dependencies
                      2. Build the image using the Dockerfile you wrote
                      3. Run the container using the image you've built
                    3. What advantage is there to running your application inside a docker container instead of running the application directly on your machine?

                      Solution

                      Running inside a docker container gives you a consistent and independent environment for your application. This means that you can be sure that your application will run the same way on your machine as it will on another machine. Thus, docker gives the ability to abstract away the differences between different machines.

                    4. A docker image is built from a series of layers that are stacked on top of each other. This should be clear if you look at the output when building a docker image. What is the advantage of this?

                      Solution

                      The advantage is efficiency and reusability. When a change is made to a docker image, only the layer(s) that changed need to be updated. For example, if you update the application code in your docker image, which usually is the last layer, then only that layer needs to be rebuilt, making the process much faster. Additionally, if you have multiple docker images that share the same base image, then the base image only needs to be downloaded once.

                    This covers the absolute minimum you should know about docker to get a working image and container. If you want to really deep dive into this topic you can find a copy of the Docker Cookbook by S\u00e9bastien Goasguen in the literature folder.

                    If you are actively going to be using docker in the near future, one thing to consider is the image size. Even the simple images that we have built still take up GBs of space. A number of optimization steps can be taken to reduce the image size for you or your end user. If you have time, you can read this article on different approaches to reducing image size. Additionally, you can take a look at the dive-in extension for Docker Desktop that lets you explore your docker images in depth.

                    "},{"location":"s4_debugging_and_logging/","title":"Debugging, Profiling, Logging and Boilerplate","text":"

                    Slides

                    Today we are initially going to go over three different topics that are all fundamentally necessary for any data scientist or DevOps engineer:

                    • Debugging
                    • Profiling
                    • Logging

                    All three topics relate to something you are probably already familiar with. Since you started programming, you have done debugging, as nobody can write perfect code on the first try. Similarly, even if you have not directly profiled your code, I bet that you at some point have had some very slow code and optimized it to run faster. Identifying and improving the slow parts is the fundamental idea of profiling code. Finally, logging is a very broad term and basically refers to any kind of output from your applications that helps you at a later point identify the \"performance\" of your application.

                    However, while we expect you to already be familiar with these topics, we do not expect all of you to be experts in them, as they are rarely topics that get dedicated focus. Today we are going to introduce some best practices and tools to help you with each of these three important topics.

                    As the final topic for today we are going to learn about how we can minimize boilerplate and focus on coding what actually matters for our project instead of all the boilerplate to get it working.

                    Learning objectives

                    The learning objectives of this session are:

                    • Understand the basics of debugging and how to use a debugger to find bugs in your code
                    • Can use a profiler to identify bottlenecks in your code and from those profiles optimize the runtime of your programs
                    • Familiar with an experiment logging framework for tracking experiments and hyperparameters of your code to make it reproducible
                    • Be able to use pytorch-lightning framework to minimize boilerplate code and structure deep learning models
                    "},{"location":"s4_debugging_and_logging/boilerplate/","title":"M14 - Boilerplate","text":""},{"location":"s4_debugging_and_logging/boilerplate/#minimizing-boilerplate","title":"Minimizing boilerplate","text":"

                    Boilerplate is a general term that describes any standardized text, copy, documents, methods, or procedures that may be reused without making major changes to the original. But how does this relate to doing machine learning projects? If you have already tried doing a couple of projects within machine learning, you will probably have seen a pattern: every project usually consists of these three aspects of code:

                    • a model implementation
                    • some training code
                    • a collection of utilities for saving models, logging images etc.

                    While the latter two certainly seem important, in most cases the actual development or research revolves around defining the model. In this sense, both the training code and the utilities become boilerplate that should just carry over from one project to another. The problem usually is that we have not generalized our training code to take care of the small adjustments that may be required in future projects, and we therefore end up implementing it over and over again every time we start a new project. This is of course a waste of our time that we should try to find a solution to.

                    This is where high-level frameworks come into play. High-level frameworks are built on top of another framework (Pytorch in this case) and try to abstract/standardize how to do particular tasks such as training. At first it may seem irritating that you need to comply with someone else's code structure, however there is a good reason for that. The idea is that you can focus on what really matters (your task, model architecture etc.) and do not have to worry about the actual boilerplate that comes with it.

                    The most popular high-level (training) frameworks within the Pytorch ecosystem are:

                    • fast.ai
                    • Ignite
                    • skorch
                    • Catalyst
                    • Composer
                    • Pytorch Lightning

                    They all offer many of the same features, so choosing one over the other for most projects should not matter that much. We are here going to use Pytorch Lightning, as it offers all the functionality that we are going to need later in the course.

                    "},{"location":"s4_debugging_and_logging/boilerplate/#pytorch-lightning","title":"Pytorch Lightning","text":"

                    In general we refer to the documentation from Pytorch lightning if in doubt about how to format your code for doing specific tasks. We are here going to explain the key concepts of the API that you need to understand to use the framework, starting with the LightningModule and the Trainer.

                    "},{"location":"s4_debugging_and_logging/boilerplate/#lightningmodule","title":"LightningModule","text":"

                    The LightningModule is a subclass of a standard nn.Module that basically adds additional structure. In addition to the standard __init__ and forward methods that need to be implemented in an nn.Module, a LightningModule further requires two more methods to be implemented:

                    • training_step: should contain your actual training code e.g. given a batch of data this should return the loss that you want to optimize

                    • configure_optimizers: should return the optimizer that you want to use

                    Below these two methods are shown added to a standard MNIST classifier.
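                    The snippet below is a minimal sketch rather than the exact classifier used in the course material; the architecture, loss and learning rate are illustrative:

                    import torch\nfrom torch import nn\nimport pytorch_lightning as pl\n\nclass MyAwesomeModel(pl.LightningModule):\n    def __init__(self):\n        super().__init__()\n        # small illustrative classifier for flattened 28x28 MNIST images\n        self.model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))\n        self.criterion = nn.CrossEntropyLoss()\n\n    def forward(self, x):\n        return self.model(x)\n\n    def training_step(self, batch, batch_idx):\n        # given a batch, return the loss that should be optimized\n        data, target = batch\n        return self.criterion(self(data), target)\n\n    def configure_optimizers(self):\n        # return the optimizer to use during training\n        return torch.optim.Adam(self.parameters(), lr=1e-3)\n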

                    Compared to a standard nn.Module, the additional methods in the LightningModule basically specify exactly how you want to optimize your model.

                    "},{"location":"s4_debugging_and_logging/boilerplate/#trainer","title":"Trainer","text":"

                    The second component of lightning is the Trainer object. As the name suggests, the Trainer object takes care of the actual training, automating everything that you do not want to worry about.

                    from pytorch_lightning import Trainer\nmodel = MyAwesomeModel()  # this is our LightningModule\ntrainer = Trainer()\ntrainer.fit(model)\n

                    That is essentially all you need to specify in lightning to have a working model. The trainer object does not have methods that you need to implement yourself, but it has a bunch of arguments that can be used to control how many epochs you want to train for, whether you want to run on GPU, etc. To get the training of our model to work we just need to specify how our data should be fed into the lightning framework.
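                    Before moving on to the data, here is a small sketch of how some of these trainer arguments could be set (the values are illustrative; see the Trainer documentation for the full list of arguments):

                    from pytorch_lightning import Trainer\n\ntrainer = Trainer(\n    max_epochs=10,               # train for at most 10 epochs\n    limit_train_batches=0.2,     # only use 20% of the training batches in each epoch\n    default_root_dir=\"outputs\",  # where logs and checkpoints are stored\n)\ntrainer.fit(model)\n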

                    "},{"location":"s4_debugging_and_logging/boilerplate/#data","title":"Data","text":"

                    For organizing our code that has to do with data in Lightning we essentially have three different options. However, all three assume that we are using torch.utils.data.DataLoader for the dataloading.

                    1. If we already have a train_dataloader and possibly also a val_dataloader and test_dataloader defined, we can simply add them to our LightningModule using the similarly named methods:

                      def train_dataloader(self):\n    return DataLoader(...)\n\ndef val_dataloader(self):\n    return DataLoader(...)\n\ndef test_dataloader(self):\n    return DataLoader(...)\n
                    2. Maybe even simpler, we can directly feed such dataloaders in the fit method of the Trainer object:

                      trainer.fit(model, train_dataloader, val_dataloader)\ntrainer.test(model, test_dataloader)\n
                    3. Finally, Lightning also has the LightningDataModule that organizes data loading into a single structure, see this page for more info. Putting data loading into a DataModule makes sense as it can then be reused between projects.

                    "},{"location":"s4_debugging_and_logging/boilerplate/#callbacks","title":"Callbacks","text":"

                    Callbacks are one way to add additional functionality to your model that, strictly speaking, is not already part of your model. Callbacks should therefore be seen as self-contained features that can be reused between projects. You have the option to implement callbacks yourself (by inheriting from the pytorch_lightning.callbacks.Callback base class) or use one of the built-in callbacks. Of particular interest are the ModelCheckpoint and EarlyStopping callbacks:

                    • The ModelCheckpoint callback makes sure to save checkpoints of your model. This is in principle not hard to do yourself, but the ModelCheckpoint callback offers additional functionality such as saving checkpoints only when some metric improves, or only keeping the best K performing models, etc.

                      from pytorch_lightning.callbacks import ModelCheckpoint\n\nmodel = MyModel()\ncheckpoint_callback = ModelCheckpoint(\n    dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[checkpoint_callback])\ntrainer.fit(model)\n
                    • The EarlyStopping callback can help you prevent overfitting by automatically stopping the training if a certain value is not improving anymore:

                      from pytorch_lightning.callbacks import EarlyStopping\n\nmodel = MyModel()\nearly_stopping_callback = EarlyStopping(\n    monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback])\ntrainer.fit(model)\n

                    Multiple callbacks can be used by passing them all in a list e.g.

                    trainer = Trainer(callbacks=[checkpoint_callback, early_stopping_callback])\n
                    "},{"location":"s4_debugging_and_logging/boilerplate/#exercises","title":"\u2754 Exercises","text":"

                    Please note that in the following exercises we will basically ask you to reformat all your MNIST code to follow the lightning standard, such that we can take advantage of all the tricks the framework has to offer. The reason we did not implement our model in lightning to begin with is that, to truly understand why it is beneficial to use a high-level framework to do some of the heavy lifting, you need to have gone through some of the implementation troubles yourself.

                    1. Install pytorch lightning:

                      pip install pytorch-lightning # (1)!\n
                      1. You may also install it as pip install lightning which includes more than just the Pytorch Lightning package. This also includes Lightning Fabric and Lightning Apps which you can read more about here and here.
                    2. Convert your corrupted MNIST model into a LightningModule. You can either choose to completely override your old model or implement it in a new file. The bare minimum that you need to add while converting to get it working with the rest of lightning:

                      • The training_step method. This function should contain essentially what goes into a single training step and should return the loss at the end

                      • The configure_optimizers method

                      Please read the documentation for more info.

                    3. Make sure your data is formatted such that it can be loaded using the torch.utils.data.DataLoader object.

                    4. Instantiate a Trainer object. It is recommended to take a look at the trainer arguments (there are many of them) and maybe adjust some of them:

                      1. Investigate what the default_root_dir flag does

                      2. By default lightning will run for 1000 epochs. This may be too much (for now). Change this by changing the appropriate flag. Additionally, there also exists a flag to set the maximum number of steps that we should train for.

                      3. To start with, we also want to limit the amount of training data to 20% of its original size. Which trainer flag do you need to set for this to work?

                    5. Try fitting your model: trainer.fit(model)

                    6. Now try adding some callbacks to your trainer.

                    7. The previous module was all about logging in wandb, so the natural question is how lightning supports this. Lightning does not only support wandb, but also many other loggers. Common for all of them is that logging just needs to happen through the self.log method in your LightningModule:

                      1. Add self.log to your LightningModule. It should look something like this:

                        def training_step(self, batch, batch_idx):\n    data, target = batch\n    preds = self(data)\n    loss = self.criterion(preds, target)\n    acc = (target == preds.argmax(dim=-1)).float().mean()\n    self.log('train_loss', loss)\n    self.log('train_acc', acc)\n    return loss\n
                      2. Add the wandb logger to your trainer

                        trainer = Trainer(logger=pl.loggers.WandbLogger(project=\"dtu_mlops\"))\n

                        and try to train the model. Confirm that you are seeing the scalars appearing in your wandb portal.

                      3. self.log sadly only supports logging scalar tensors. Luckily, for logging other quantities we can still access the standard wandb.log through our model

                        def training_step(self, batch, batch_idx):\n    ...\n    # self.logger.experiment is the same as wandb.log\n    self.logger.experiment.log({'logits': wandb.Histogram(preds)})\n

                        try doing this by logging something other than scalar tensors.

                    8. Finally, we may also want to do some validation or testing. In lightning we just need to add the validation_step and test_step methods to our lightning module and supply the respective data in the form of a separate dataloader. Try to implement at least one of them.

                    9. (Optional, requires GPU) One of the big advantages of using lightning is that you no longer need to deal with device placement, e.g. calling .to('cuda') everywhere. If you have a GPU, try to set the gpus flag in the trainer. If you do not have one, do not worry, we are going to return to this when we are going to run training in the cloud.

                    10. (Optional) By default Pytorch uses float32 for representing floating point numbers. However, research has shown that neural network training is very robust towards a decrease in precision. The great benefit of going from float32 to float16 is that we get approximately half the memory consumption. Try out half-precision training in Pytorch lightning. You can enable this by setting the precision flag in the Trainer.

                    11. (Optional) Lightning also have built-in support for profiling. Checkout how to do this using the profiler argument in the Trainer object.

                    12. (Optional) Another great feature of Lightning is that it allows for easily defining command line interfaces through the Lightning CLI feature. The Lightning CLI is essentially a drop-in replacement for defining command line interfaces (covered in this module) and can also replace the need for config files (covered in this module) for securing reproducibility when working inside the Lightning framework. We highly recommend checking out the feature and that you try to refactor your code such that you do not need to call trainer.fit anymore, but it is instead directly controlled from the Lightning CLI.

                    13. Free exercise: Experiment with what the lightning framework is capable of. Either try out more of the trainer flags, some of the other callbacks, or maybe look into some of the other methods that can be implemented in your lightning module. Only your imagination is the limit!

                    That covers everything for today. It has been a mix of topics that all should help you write \"better\" code (by some objective measure). If you want to deep dive more into the Pytorch lightning framework, we highly recommend looking at the different tutorials in the documentation that cover more advanced models and training cases. Additionally, we also want to highlight other frameworks in the lightning ecosystem:

                    • Torchmetrics: collection of machine learning metrics written in Pytorch
                    • lightning flash: High-level framework for fast prototyping, baselining and finetuning with an even simpler interface than lightning
                    • lightning-bolts: Collection of SOTA pretrained models, model components, callbacks, losses and datasets for testing out ideas as fast as possible
                    "},{"location":"s4_debugging_and_logging/debugging/","title":"M11 - Debugging","text":""},{"location":"s4_debugging_and_logging/debugging/#debugging","title":"Debugging","text":"

                    Debugging is very hard to teach and is one of the skills that just comes with experience. That said, there are good and bad ways to debug a program. We are all probably familiar with just inserting print(...) statements everywhere in our code. It is easy and can often help narrow down where the problem happens. That said, it is not a great way of debugging when dealing with a very large codebase. You should therefore familiarize yourself with the built-in python debugger, as it may come in handy during the course.

                    To invoke the built-in python debugger you can either:

                    • Set a trace directly with the python debugger by calling

                      import pdb\npdb.set_trace()\n

                      anywhere you want to stop the code. Then you can use different commands (see the python_debugger_cheatsheet.pdf) to step through the code.

                    • If you are using an editor, then you can insert inline breakpoints (in VS Code this can be done by pressing F9) and then execute the script in debug mode (inline breakpoints can often be seen as small red dots to the left of your code). The editor should then offer some interface to allow you to step through your code. Here is a guide to using the built-in debugger in VS Code.

                    • Additionally, if your program is stopping on an error and you automatically want to start the debugger where it happens, then you can simply launch the program like this from the terminal

                      python -m pdb -c continue my_script.py\n
                    "},{"location":"s4_debugging_and_logging/debugging/#exercises","title":"\u2754 Exercises","text":"

                    Exercise files

                    We here provide a script vae_mnist_bugs.py which contains a number of bugs that you need to fix to get it running. Start by going over the script and try to understand what is going on. Hereafter, try to get it running by solving the bugs. The following bugs exist in the script:

                    • One device bug (will only show if running on gpu, but try to find it anyways)
                    • One shape bug
                    • One math bug
                    • One training bug

                    Some of the bugs prevent the script from even running, while others influence the training dynamics. Try to find them all. We also provide a working version called vae_mnist_working.py (but please try to find the bugs before looking at the script). Successfully debugging and running the script should produce three files:

                    • orig_data.png containing images from the standard MNIST training set
                    • reconstructions.png reconstructions from the model
                    • generated_samples.png samples from the model

                    Again, we cannot stress enough that the exercise is actually not about finding the bugs but using a proper debugger to find them.

                    "},{"location":"s4_debugging_and_logging/logging/","title":"M13 - Logging","text":""},{"location":"s4_debugging_and_logging/logging/#logging","title":"Logging","text":"

                    Core Module

                    Logging in general refers to the practice of recording events and activities over time. Having proper logging in your applications can be extremely beneficial for a few reasons:

                    • Debugging becomes easier because we can, in a more structured way, output information about the state of our program, variables, values etc. to help identify and fix bugs or unexpected behavior.

                    • When we move into a more production environment, proper logging is essential for monitoring the health and performance of our application.

                    • It can help in auditing, as logging info about specific activities can help keep a record of who did what and when.

                    • Having proper logging means that info is saved for later, which can be analysed to gain insight into the behavior of our application, such as trends.

                    We are in this course going to divide the kind of logging we can do into two categories: application logging and experiment logging. In general, application logging is important regardless of the kind of application you are developing, whereas experiment logging is important for machine learning based projects where we are doing experiments.

                    "},{"location":"s4_debugging_and_logging/logging/#application-logging","title":"Application logging","text":"

                    The most basic form of logging in Python applications is the good old print statement:

                    for batch_idx, batch in enumerate(dataloader):\n    print(f\"Processing batch {batch_idx} out of {len(dataloader)}\")\n    ...\n

                    This will keep a \"record\" of the events happening in our script, in this case how far we have progressed. We could even change the print to include something like batch.shape to also have information about the current data being processed.

                    Using print statements is fine for small applications, but to have proper logging we need a bit more functionality than what print can offer. Python actually comes with a great logging module, that defines functions for flexible logging. It is exactly this we are going to look at in this module.

                    The four main components to the Python logging module are:

                    1. Logger: The main entry point for using the logging system. You create instances of the Logger class to emit log messages.

                    2. Handler: Defines where the log messages go. Handlers send the log messages to specific destinations, such as the console or a file.

                    3. Formatter: Specifies the layout of the log messages. Formatters determine the structure of the log records, including details like timestamps and log message content.

                    4. Level: Specifies the severity of a log message.

The last point is especially important to understand. Levels essentially allow us to get rid of statements like this:

                    if debug:\n    print(x.shape)\n

where the logging is conditional on the variable debug, which we can set at runtime. Thus, it is something we can disable for users of our application (debug=False) but have enabled when we develop the application (debug=True). It also makes sense that not everything that is logged should be available to all stakeholders of a codebase. We as developers probably always want the highest level of logging, whereas users of our code need less information, and we may want to differentiate this based on the user.
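
As a hedged illustration, the conditional print above can be replaced by a debug-level log statement whose visibility is controlled entirely by the level configured at runtime (the variable x here is just a stand-in for your own data):

import logging\n\nlogging.basicConfig(level=logging.INFO)  # flip to logging.DEBUG while developing\nlogger = logging.getLogger(__name__)\n\nx = [[1, 2, 3], [4, 5, 6]]  # stand-in for a tensor or array\nlogger.debug(\"x has %d rows\", len(x))  # only emitted when the level is DEBUG\nlogger.info(\"processing started\")       # emitted at INFO and below\n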

It is also important to understand the difference between logging and error handling. Error handling in Python is done using raise statements and try/except blocks like:

def f(x: int):\n    if not isinstance(x, int):\n        raise ValueError(\"Expected an integer\")\n    return 2 * x\n\ntry:\n    f(5)\nexcept ValueError:\n    print(\"I failed to do a thing, but continuing.\")\n

Why would we ever need to log warning, error or critical levels of information if we are just going to handle the exception anyway? The reason is that raising exceptions is meant to change the program flow at runtime, e.g. for things we do not want the user to do but that we can deal with in some way. Logging is for after a program has run, so we can inspect what went wrong. Sometimes you need one, sometimes the other, sometimes both, as sketched below.
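
A minimal sketch of combining the two (reusing the small function f from above): the exception changes the program flow, while the log record preserves the information for later inspection:

import logging\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\ndef f(x: int) -> int:\n    if not isinstance(x, int):\n        raise ValueError(\"Expected an integer\")\n    return 2 * x\n\ntry:\n    result = f(\"not an int\")\nexcept ValueError:\n    logger.exception(\"f failed, falling back to a default value\")  # logs message + traceback\n    result = 0\n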

                    "},{"location":"s4_debugging_and_logging/logging/#exercises","title":"\u2754 Exercises","text":"

Exercises are inspired by this made with ml module on the same topic. If you need help with the exercises you can find a simple solution script here.

                    1. As logging is a built-in module in Python, nothing needs to be installed. Instead start a new file called my_logger.py and start out with the following code:

                      import logging\nimport sys\n\n# Create super basic logger\nlogging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\nlogger = logging.getLogger(__name__) # (1)\n\n# Logging levels (from lowest to highest priority)\nlogger.debug(\"Used for debugging your code.\")\nlogger.info(\"Informative messages from your code.\")\nlogger.warning(\"Everything works but there is something to be aware of.\")\nlogger.error(\"There's been a mistake with the process.\")\nlogger.critical(\"There is something terribly wrong and process may terminate.\")\n
1. The built-in variable __name__ always contains the name of the script or module that is currently being run. Therefore, if we initialize our logger using this variable, it will always be unique to our application and not conflict with loggers set up by any third-party package.

Try running the code. Then try changing the argument level when creating the logger. What happens when you do that?

2. Instead of sending logs to the terminal, we may also want to send them to a file. This can be beneficial so that only warning level logs and higher are shown to the user, while debug and info logs are still saved to a file while the application is running.

                      1. Try adding the following dict to your logger.py file:

logging_config = {\n    \"version\": 1,\n    \"formatters\": { # (1)\n        \"minimal\": {\"format\": \"%(message)s\"},\n        \"detailed\": {\n            \"format\": \"%(levelname)s %(asctime)s [%(name)s:%(filename)s:%(funcName)s:%(lineno)d]\\n%(message)s\\n\"\n        },\n    },\n    \"handlers\": { # (2)\n        \"console\": {\n            \"class\": \"logging.StreamHandler\",\n            \"stream\": sys.stdout,\n            \"formatter\": \"minimal\",\n            \"level\": logging.DEBUG,\n        },\n        \"info\": {\n            \"class\": \"logging.handlers.RotatingFileHandler\",\n            \"filename\": Path(LOGS_DIR, \"info.log\"),\n            \"maxBytes\": 10485760,  # 10 MB\n            \"backupCount\": 10,\n            \"formatter\": \"detailed\",\n            \"level\": logging.INFO,\n        },\n        \"error\": {\n            \"class\": \"logging.handlers.RotatingFileHandler\",\n            \"filename\": Path(LOGS_DIR, \"error.log\"),\n            \"maxBytes\": 10485760,  # 10 MB\n            \"backupCount\": 10,\n            \"formatter\": \"detailed\",\n            \"level\": logging.ERROR,\n        },\n    },\n    \"root\": {\n        \"handlers\": [\"console\", \"info\", \"error\"],\n        \"level\": logging.INFO,\n        \"propagate\": True,\n    },\n}\n
                        1. The formatter section determines how logs should be formatted. Here we define two separate formatters, called minimal and detailed which we can use in the next part of the code.

2. The handlers section is in charge of what should happen to the different levels of logging. console uses the minimal format we defined and sends logs to the stdout stream for messages of level DEBUG and higher. The info handler uses the detailed format and sends messages of level INFO and higher to a separate info.log file. The error handler does the same for messages of level ERROR and higher to a file called error.log.

You will need to set the LOGS_DIR variable and also figure out how to apply this logging_config to your logger using the logging.config submodule. A minimal sketch of how this could look is shown below.
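
One possible way to wire it up (a sketch, assuming the logging_config dict above and that logs should live in a local logs/ folder; remember that LOGS_DIR must be defined before the dict is created):

import logging\nimport logging.config\nimport sys\nfrom pathlib import Path\n\nLOGS_DIR = Path(\"logs\")  # assumed location, pick whatever fits your project\nLOGS_DIR.mkdir(parents=True, exist_ok=True)\n\n# ... define logging_config as shown above ...\nlogging.config.dictConfig(logging_config)\nlogger = logging.getLogger(__name__)\nlogger.info(\"This message should end up both in the console and in info.log\")\n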

2. When the code successfully runs, check the LOGS_DIR folder and make sure that an info.log and an error.log file were created with the appropriate content.

3. Finally, let's try to add a little bit of style and color to our logging. For this we can use the package rich, which provides rich text and beautiful formatting in terminals. Install rich and add the following lines to your my_logger.py script:

from rich.logging import RichHandler\n\nlogger.root.handlers[0] = RichHandler(markup=True)  # replace the console handler with a rich handler\n

                      and try re-running the script. Hopefully you should see something beautiful in your terminal like this:

4. (Optional) We already briefly touched on logging during the module on config files using hydra. If you want to configure hydra to use a custom logging scheme like the one we set up in the last two exercises, you can take a look at this page. In hydra you will need to provide the configuration of the logger as a config file. You can find examples of such config files here.

                    "},{"location":"s4_debugging_and_logging/logging/#experiment-logging","title":"Experiment logging","text":"

When most people think of machine learning, they think about the training phase. Being able to track and log experiments is an important part of understanding what is going on with your model while you are training. It can help you debug your model and help you tweak it to perfection. Without proper logging of experiments, it can be really hard to iterate on the model because you do not know which changes led to an increase or decrease in performance.

                    The most basic logging we can do when running experiments is writing the metrics that our model is producing e.g. the loss or the accuracy to the terminal or a file for later inspection. We can then also use tools such as matplotlib for plotting the progression of our metrics over time. This kind of workflow may be enough when doing smaller experiments or working alone on a project, but there is no way around using a proper experiment tracker and visualizer when doing large scale experiments in collaboration with others. It especially becomes important when you want to compare performance between different runs.

                    There exist many tools for logging your experiments, with some of them being:

                    • Tensorboard
                    • Comet
                    • MLFlow
                    • Neptune
• Weights and Biases

All of the frameworks offer many of the same functionalities; you can see a (biased) review here. We are going to use Weights and Biases (wandb), as it supports everything we need in this course. Additionally, it is an excellent tool for collaboration and sharing of results.

Using the Weights and Biases (wandb) dashboard we can quickly get an overview of and compare many runs across different metrics. This allows for better iteration on models and training procedures."},{"location":"s4_debugging_and_logging/logging/#exercises_1","title":"\u2754 Exercises","text":"
1. Start by creating an account at wandb. I recommend using your github account but feel free to choose what you want. When you are logged in you should get an API key of length 40. Copy this for later use (HINT: if you forget to copy the API key, you can always find it again under settings).

                    2. Next install wandb on your laptop

                      pip install wandb\n
                    3. Now connect to your wandb account

                      wandb login\n

you will be asked to provide the 40-character API key. The connection to the wandb server should remain open even when you close the terminal, such that you do not have to log in each time. If you are using wandb in a notebook you need to manually close the connection using wandb.finish().

4. With it all set up we are now ready to incorporate wandb into our code. The interface is fairly simple, and this guide should give enough hints to get you through the exercise. (HINT: the two methods you need to call are wandb.init and wandb.log). To start with, logging the training loss of your model will be enough; a sketch of what this could look like is shown below.
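
A minimal, hedged sketch of what this could look like (using a toy linear model and random data as a stand-in for your own model and dataloader; the project name is hypothetical):

import torch\nimport wandb\n\nwandb.init(project=\"mnist-mlops\")  # hypothetical project name\n\nmodel = torch.nn.Linear(784, 10)\noptimizer = torch.optim.Adam(model.parameters(), lr=1e-3)\nloss_fn = torch.nn.CrossEntropyLoss()\n\nfor step in range(100):\n    x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))  # stand-in for a real dataloader\n    loss = loss_fn(model(x), y)\n    optimizer.zero_grad()\n    loss.backward()\n    optimizer.step()\n    wandb.log({\"train_loss\": loss.item()})  # one point per step in the wandb dashboard\n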

5. After running your model, check out the webpage. Hopefully you should be able to see at least one run with something logged.

6. Now log something other than scalar values. This could be an image, a histogram or a matplotlib figure. In all cases the logging still goes through wandb.log, but you need extra calls to wandb.Image etc. depending on what you choose to log; see the sketch below.
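
For example, a hedged sketch of logging an image and a histogram (the arrays here are random placeholders for your own data, and the project name is hypothetical):

import numpy as np\nimport wandb\n\nwandb.init(project=\"mnist-mlops\")  # hypothetical project name\nwandb.log({\n    \"examples\": wandb.Image(np.random.rand(28, 28), caption=\"a random 28x28 image\"),\n    \"weights\": wandb.Histogram(np.random.randn(1000)),\n})\n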

7. Finally, let's create a report that you can share. Click the Create report button and include some of the graphs/plots/images that you have generated in the report.

8. To make sure that you have completed today's exercises, make the report shareable by clicking the Share button and create a view-only link. Send the link to my email nsde@dtu.dk, so I can check out your awesome work \ud83d\ude03

9. When calling wandb.init you have two arguments called project and entity. Make sure that you understand these and try them out. They will come in handy for your group work as they essentially allow multiple users to upload their own runs to the same project in wandb.

10. Wandb also comes with a built-in feature for doing hyperparameter sweeps, which can be beneficial for getting a better working model. Look through the documentation on how to do a hyperparameter sweep in Wandb. You at least need to create a new file called sweep.yaml and make sure that you call wandb.log in your code on an appropriate value. Note: if you want hydra and wandb to work together you will need to change the command config in your sweep.yaml file, see this page.

11. In the future it will be important for us to be able to run Wandb inside a docker container (together with whatever training or inference we specify). The problem here is that we cannot authenticate Wandb in the same way as in the previous exercise; it needs to happen automatically. Let's therefore look into how we can do that.

1. First we need to generate an authentication key, or more precisely an API key. This is in general the way any service (like a docker container) can authenticate. Start by going to https://wandb.ai/home, click your profile icon in the upper right corner and then go to settings. Scroll down to the danger zone, generate a new API key and finally copy it.

                      2. Next create a new docker file called wandb.docker and add the following code

                        FROM python:3.9\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\nRUN pip install wandb\nCOPY s4_debugging_and_logging/exercise_files/wandb_tester.py wandb_tester.py\nENTRYPOINT [\"python\", \"-u\", \"wandb_tester.py\"]\n

Please take a look at the script being copied into the image and afterwards build the docker image.

3. When we want to run the image, we need to include an environment variable that contains the API key we generated. This will then authenticate the docker container with the wandb server:

                        docker run -e WANDB_API_KEY=<your-api-key> wandb:latest\n

Try running it and confirm that the results are uploaded to the wandb server.

                    12. Feel free to experiment more with wandb as it is a great tool for logging, organizing and sharing experiments.

That is the module on logging. Please note that at this point in the course you will begin to see some overlap between the different frameworks. While we mainly used hydra for configuring our python scripts, it can also be used to save metrics and hyperparameters similar to how wandb can. Similar arguments hold for dvc, which can also be used to log metrics. In our opinion wandb just offers a better experience when interacting with the results after logging. We want to stress that the combination of tools presented in this course may not be the best for all your future projects, and we recommend finding a setup that fits you. That said, each framework provides specific features that the others do not.

Finally, we want to note that during the course we really try to showcase a lot of open-source frameworks; Wandb is not one of them. It is free to use for personal usage (with a few restrictions) but for enterprise use it does require a license. If you are eager to work only with open-source tools we highly recommend trying out MLFlow, which offers the same overall functionality as Wandb.

                    "},{"location":"s4_debugging_and_logging/profiling/","title":"M12 - Profiling","text":""},{"location":"s4_debugging_and_logging/profiling/#profilers","title":"Profilers","text":"

                    Core Module

                    "},{"location":"s4_debugging_and_logging/profiling/#profilers_1","title":"Profilers","text":"

                    In general profiling code is about improving the performance of your code. In this session we are going to take a somewhat narrow approach to what \"performance\" is: runtime, meaning the time it takes to execute your program.

At the bare minimum, the two questions a proper profiling of your program should be able to answer are:

• \u201cHow many times is each method in my code called?\u201d
• \u201cHow long does each of these methods take?\u201d

The first question is important for prioritizing optimization. If two methods A and B have approximately the same runtime, but A is called 1000 times more often than B, we should probably spend our time optimizing A over B if we want to speed up our code. The second question speaks for itself, directly telling us which methods are the most expensive to call.

Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, the first one being cProfile. cProfile is Python's built-in profiler that can give you an overview of the runtime of all the functions and methods involved in your program; a small sketch of how it can be used programmatically is shown below.
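
Besides the command-line interface used in the exercises below, cProfile can also be invoked from Python and its output inspected with the standard pstats module. A minimal sketch (the profiled function is just a toy example):

import cProfile\nimport pstats\n\ndef expensive() -> float:\n    return sum(i ** 0.5 for i in range(1_000_000))\n\ncProfile.run(\"expensive()\", \"output.prof\")  # profile the call and dump stats to a file\nstats = pstats.Stats(\"output.prof\")\nstats.sort_stats(\"cumulative\").print_stats(10)  # show the 10 most expensive calls\n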

                    "},{"location":"s4_debugging_and_logging/profiling/#exercises","title":"\u2754 Exercises","text":"
                    1. Run the cProfile on the vae_mnist_working.py script. Hint: you can directly call the profiler on a script using the -m arg

                      python -m cProfile -o <output_file> -s <sort_order> myscript.py\n
                    2. Try looking at the output of the profiling. Can you figure out which function took the longest to run?

3. Can you explain the difference between tottime and cumtime? Under what circumstances do these differ and when are they equal?

4. To get a better feeling for the profiled result we can try to visualize it. Python does not provide a native solution, but open-source solutions such as snakeviz exist. Try installing snakeviz and load a profiled run into it (HINT: snakeviz expects the run to have the file format .prof).

5. Try optimizing the run! (Hint: the data is not stored as a torch tensor). After optimizing the code make sure (using cProfile and snakeviz) that the code actually runs faster.

                    "},{"location":"s4_debugging_and_logging/profiling/#pytorch-profiling","title":"Pytorch profiling","text":"

Profiling machine learning code can become much more complex because we suddenly begin to mix different devices (CPU + GPU), which can (and should) overlap some of their computations. When profiling this kind of machine learning code we are often looking for bottlenecks. A bottleneck is simply the place in your code that is preventing other processes from performing their best. This is the reason that all major deep learning frameworks also include their own profilers that can help profile more complex applications.

The image below shows a typical report using the built-in profiler in pytorch. As the image shows, the profiler looks both at the kernel time (the time spent doing actual computations) and at transfer times such as memcpy (where we are copying data between devices). It can even analyze your code and give recommendations.

Using the profiler can be as simple as wrapping the code that you want to profile with the torch.profiler.profile context manager:

                    with torch.profiler.profile(...) as prof:\n    # code that I want to profile\n    output = model(data)\n
                    "},{"location":"s4_debugging_and_logging/profiling/#exercises_1","title":"\u2754 Exercises","text":"

                    Exercise files

In these exercises we investigate the profiler that is already built into PyTorch. Note that these exercises require that you have PyTorch v1.8.1 (or higher) installed. You can always check which version you currently have installed by writing (in a python interpreter):

                    import torch\nprint(torch.__version__)\n

However, we always recommend updating to the latest Pytorch version for the best experience. Additionally, to display the results nicely (like snakeviz for cProfile) we are also going to use the tensorboard profiler extension

                    pip install torch_tb_profiler\n
1. A good starting point is to look at the API for the profiler. Here the important class to look at is the torch.profiler.profile class.

2. Let's try out a simple example (taken from here):

                      1. Try to run the following code

                        import torch\nimport torchvision.models as models\nfrom torch.profiler import profile, ProfilerActivity\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\nwith profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:\n    model(inputs)\n

this will profile the forward pass of the Resnet 18 model.

2. Running this code will produce a prof object that contains all the relevant information about the profiling. Try writing the following code:

                        print(prof.key_averages().table(sort_by=\"cpu_time_total\", row_limit=10))\n

Which operation is taking up most of the CPU time?

                      3. Try running

                        print(prof.key_averages(group_by_input_shape=True).table(sort_by=\"cpu_time_total\", row_limit=30))\n

Can you see any correlation between the shape of the input and the cost of the operation?

                      4. (Optional) If you have a GPU you can also profile the operations on that device:

                        with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:\n    model(inputs)\n
5. (Optional) As an alternative to using profile as a context manager we can also use its .start and .stop methods:

                        prof = profile(...)\nprof.start()\n...  # code I want to profile\nprof.stop()\n

                        Try doing this on the above example.

3. The torch.profiler.profile function takes some additional arguments. What argument would you need to set to also profile the memory usage? (Hint: this page) Try applying it to the simple example above and make sure to sort the output by self_cpu_memory_usage.

                    4. As mentioned we can also get a graphical output for better inspection. After having done a profiling try to export the results with:

                      prof.export_chrome_trace(\"trace.json\")\n

                      you should be able to visualize the file by going to chrome://tracing in any chromium based web browser. Can you still identify the information printed in the previous exercises from the visualizations?

5. Running profiling on a single forward step can produce misleading results, as it only provides a single sample that may depend on what background processes are running on your computer. It is therefore recommended to profile multiple iterations of your model. In that case we need to include prof.step() to tell the profiler when we are starting a new iteration

                      with profile(...) as prof:\n    for i in range(10):\n        model(inputs)\n        prof.step()\n

Try doing this. Is the conclusion the same about which operations take up most of the time? Have the percentages changed significantly?

                    6. Additionally, we can also visualize the profiling results using the profiling viewer in tensorboard.

                      1. Start by initializing the profile class with an additional argument:

                        from torch.profiler import profile, tensorboard_trace_handler\nwith profile(..., on_trace_ready=tensorboard_trace_handler(\"./log/resnet18\")) as prof:\n    ...\n

Try running a profiling (using a couple of iterations) and make sure that a file with the .pt.trace.json extension is produced in the log/resnet18 folder.

                      2. Now try launching tensorboard

                        tensorboard --logdir=./log\n

                        and open the page http://localhost:6006/#pytorch_profiler, where you should hopefully see an image similar to the one below:

                        Image credit

                        Try poking around in the interface.

3. Tensorboard has a nice feature for comparing runs under the diff tab. Try redoing a profiling run but use model = models.resnet34() instead. Load up both runs and try to look at the diff between them.

7. As a final exercise, try to use the profiler on the vae_mnist_working.py file from the previous module on debugging, where you profile a whole training run (not only the forward pass). What is the bottleneck during training? Is it still the forward pass or is it something else? Can you improve the code based on the information from the profiler?

This ends the module on profiling. If you want to go into more detail on this topic we can recommend looking into line_profiler and kernprof. A downside of using Python's cProfile is that it can only profile at a function/module level, which is great for identifying hotspots in your code. However, sometimes the cause of a computational hotspot is a single line of code inside a function, which will not be pinpointed by cProfile. An example would be a simple indexing operation such as a[idx] = b, which for large arrays and non-sequential indices is really expensive. For these cases line_profiler and kernprof are excellent tools to have in your toolbox; a small sketch is shown below. Additionally, if you do not like cProfile we can also recommend py-spy, which is another open-source profiling tool for python programs.
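
As a hedged example of how line-level profiling with line_profiler/kernprof can look (assuming pip install line_profiler; when the script is run with kernprof -l -v script.py the profile decorator is injected automatically, so we add a no-op fallback for normal runs):

import numpy as np\n\ntry:\n    profile  # injected as a builtin by kernprof\nexcept NameError:\n    def profile(func):  # no-op fallback so the script also runs without kernprof\n        return func\n\n@profile\ndef scatter(a: np.ndarray, idx: np.ndarray, b: np.ndarray) -> float:\n    a[idx] = b          # line-by-line timing will show if this assignment dominates\n    return float(a.sum())\n\nscatter(np.zeros(10_000_000), np.random.randint(0, 10_000_000, 100_000), np.ones(100_000))\n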

                    "},{"location":"s5_continuous_integration/","title":"Continuous Integration","text":"

                    Slides

Continuous integration is a sub-discipline of the general field of Continuous X. Continuous X is one of the core elements of modern DevOps, and by extension MLOps. Continuous X assumes that we have a (long) developer pipeline (see image below) where we want to make some changes to our code, e.g.:

                    • Update our training data or data processing
                    • Update our model architecture
                    • Something else...

Basically, we expect any code change to have an influence on the final result. The problem with making changes at the start of our pipeline is that we want the change to propagate all the way through to the end of the pipeline.

                    Image credit

This is where continuous X comes into play. The word continuous here refers to the fact that the pipeline should continuously be updated as we make code changes. You can also choose to think of this as the automation of processes. The X covers the fact that the process we need to go through to automate steps in the pipeline depends on where we are in the pipeline, e.g. the tools needed to do continuous integration are different from the tools needed to do continuous delivery.

In this session, we are going to focus on continuous integration (CI). As indicated in the image above, CI usually takes care of the first part of the developer pipeline, which has to do with the code base, code building and code testing. This is a paramount step in automation, as we would rather catch bugs at the beginning of our pipeline than at the end.

                    Learning objectives

                    The learning objectives of this session are:

                    • Being able to write unit tests that cover both data and models in your ML pipeline
                    • Know how to implement CI using Github actions such that tests are automatically executed on code changes
                    • Can use pre-commit to secure that code that is not up to standard does not get committed
                    • Know how to implement CI for continuous building of containers
                    • Basic knowledge of how machine learning processes can be implemented in a continuous way
                    "},{"location":"s5_continuous_integration/auto_docker/","title":"M18 - Continuous Containers","text":""},{"location":"s5_continuous_integration/auto_docker/#continuous-docker-building","title":"Continuous docker building","text":"

The Github Actions we learned about in M16 are a powerful tool that can be used for much more than simply running the tests that we write for our application. In this module we are going to look at how we can use it for continuously building docker images. As you have already seen, building a docker image can take a couple of minutes each time we make changes to our code base. For this reason we really only want to build a new image when we commit our code. Thus, it should come as no surprise that we can also automate the building process, and furthermore we can take advantage of online compute power to parallelize the process.

As discussed in the initial module on docker, docker hub is an online solution for storing built docker images in the cloud, making them easy to pull down on whatever machine you want to run on. Docker hub is free to use for personal use, as long as the images you push are public. In this session we are going to look at how we can automatically build and push our docker builds to docker hub. In a future module we are also going to look at the exact same process of building and pushing containers, but this time to a general cloud provider.

                    "},{"location":"s5_continuous_integration/auto_docker/#exercises","title":"\u2754 Exercises","text":"

                    For these exercises you can choose to work with any docker file of your choosing. If you want an easy docker file, you can use the following:

                    FROM busybox\nCMD echo \"Howdy cowboy\"\n

Alternatively, you can choose to focus on automating the training and prediction docker files from M9. You will most likely need to change the docker image for your applications if they contain any references to your data, e.g. if you have a COPY data/ data/ statement in the file. Since we do not store our data in Github, we cannot copy it during the build process.

1. Start by pushing whatever docker file you want to be continuously built to your repository

                    2. Start by creating a Docker Hub account

                    3. Next, within Docker Hub create an access token by going to Settings -> Security. Click the New Access Token button and give it a name that you recognize.

                    4. Copy the newly created access token and head over to your Github repository online. Go to Settings -> Secrets -> Actions and click the New repository secret. Copy over the access token and give it the name DOCKER_HUB_TOKEN. Additionally, add two other secrets DOCKER_HUB_USERNAME and DOCKER_HUB_REPOSITORY that contains your docker username and docker repository name respectively.

                    5. Next we are going to construct the actual Github actions workflow file:

name: Docker Image CI\n\non:\n    push:\n        branches: [ master ]\n\njobs:\n    build:\n        runs-on: ubuntu-latest\n        steps:\n        - uses: actions/checkout@v2\n        - name: Build the Docker image\n          run: |\n            echo \"${{ secrets.DOCKER_HUB_TOKEN }}\" | docker login \\\n                -u \"${{ secrets.DOCKER_HUB_USERNAME }}\" --password-stdin docker.io\n            docker build . --file Dockerfile \\\n                --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n            docker push docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n

The first part of the workflow file should look somewhat recognizable. However, the last three lines are where all the magic happens. Carefully go through them and figure out what they do. If you want some help you can look at the help pages for docker login, docker build and docker push.

6. Upload the workflow to your github repository and check that it is being executed. If everything works you should be able to see the built docker image in your container repository on docker hub.

7. Make sure that you can execute docker pull locally to pull down the image that you just continuously built

                    8. (Optional) To test that the container works directly in github you can also try to include an additional step that actually runs the container.

- name: Run container\n  run: |\n    docker run ...\n

That ends the session on continuous docker building. We are going to revisit this topic after introducing the basic concepts of working in the cloud, as it will make our life easier in the long run, when we get to continuous deployment (CD), that our containers are stored in the same place where we are going to run them. For completeness it is worth mentioning that docker hub also offers the possibility of building your images in a continuous way, by specifying so-called build rules.

                    "},{"location":"s5_continuous_integration/cml/","title":"M19 - Continuous Machine Learning","text":""},{"location":"s5_continuous_integration/cml/#continuous-machine-learning","title":"Continuous Machine Learning","text":"

The continuous integration we have looked at until now is what we can consider \"classical\" continuous integration, which has its roots in DevOps rather than MLOps. While the tests that we have written and the containers we have developed in the previous sessions have been centered around machine learning, everything we have done translates completely to how it would be done if we had developed any other application that did not include machine learning.

In this session, we are going to change gears and look at continuous machine learning (CML). As the name suggests, we are now focusing on automating actual machine learning processes. You may ask why we need continuous integration principles baked into machine learning pipelines. The reason is the same as with any continuous integration, namely that we have a bunch of checks that we want our newly trained model to pass before we trust it. Writing unit tests secures that our code is not broken, but there are other failure modes of a machine learning pipeline that should be checked before the model is ready for deployment (a small sketch of such checks is shown after the list):

                    • Did I train on the correct data?
                    • Did my model converge at all?
                    • Did it reach a certain threshold at all?
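
As a hedged illustration of what such checks could look like in code (the function name, metric names and threshold are made up for this example):

def check_training_run(train_losses: list[float], val_accuracy: float, accuracy_threshold: float = 0.9) -> None:\n    \"\"\"Fail loudly if a freshly trained model does not meet our minimum requirements.\"\"\"\n    assert train_losses[-1] < train_losses[0], \"Model did not converge: training loss never decreased\"\n    assert val_accuracy >= accuracy_threshold, (\n        f\"Validation accuracy {val_accuracy:.3f} is below the required {accuracy_threshold}\"\n    )\n\ncheck_training_run(train_losses=[2.3, 1.1, 0.4], val_accuracy=0.93)\n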

Answering these questions in a continuous way is possible through continuous machine learning. For this session, we are going to use cml by iterative.ai. Strictly speaking, using the cml framework is not a necessary component for doing continuous machine learning, but it is a streamlined way of doing this and offers tools to easily get a report about how a specific run performed. If we were just interested in triggering model training every time we do a git push, we would essentially just need to include

                    run: python train.py\n

                    to any of our workflow files.

The figure below describes the overall process using the cml framework. It should be clear that it is the very same process that we go through as in the other continuous integration sessions: push code -> trigger github actions -> do stuff. The new part in this session is that we want a report of the findings of the automated run to appear after the run is done.

                    Image credit"},{"location":"s5_continuous_integration/cml/#exercises","title":"\u2754 Exercises","text":"
1. We are first going to revisit our train.py script. If we want cml to automatically be able to report the performance of our trained model to us after it is trained, we need to give it some statistics to work with. Below is some pseudo-code that computes the accuracy and the confusion matrix of our trained model. Create a copy of your training script (call it train_cml.py) and make sure your script also produces a classification report and confusion matrix as in the pseudo-code.

# assume we have a trained model\nimport matplotlib.pyplot as plt\nimport torch\nfrom sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay\n\npreds, target = [], []\nfor batch in train_dataloader:\n    x, y = batch\n    probs = model(x)\n    preds.append(probs.argmax(dim=-1))\n    target.append(y.detach())\n\ntarget = torch.cat(target, dim=0)\npreds = torch.cat(preds, dim=0)\n\nreport = classification_report(target, preds)\nwith open(\"classification_report.txt\", 'w') as outfile:\n    outfile.write(report)\nconfmat = confusion_matrix(target, preds)\ndisp = ConfusionMatrixDisplay(confusion_matrix=confmat)\ndisp.plot()  # draw the confusion matrix onto the current matplotlib figure\nplt.savefig('confusion_matrix.png')\n
2. Similar to what we have looked at until now, automation happens using github workflow files. The main difference from the continuous integration we have looked at so far is that we are actually going to train our model whenever we do a git push. Copy the following code into a new workflow (called cml.yaml) and add that file to the folder where you keep your workflow files.

                      name: train-my-model\non: [push]\njobs:\n  run:\n    runs-on: [ubuntu-latest]\n    steps:\n      - uses: actions/checkout@v2\n      - uses: iterative/setup-cml@v1\n      - name: Train model\n        run: |\n          pip install -r requirements.txt  # install dependencies\n          python train.py  # run training\n      - name: Write report\n        env:\n          # this authenticates that the right permissions are in place\n          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n        run: |\n          # send all information to report.md that will be reported to us when the workflow finish\n          cat classification_report.txt >> report.md\n          cml-publish confusion_matrix.png --md >> report.md\n          cml-send-comment report.md\n

                      Nearly everything in the workflow file should look familiar, except the last two lines.

                    3. Try pushing the workflow file to your github repository and make sure that it completes. If it does not, you may need to adjust the workflow file slightly.

4. Send yourself a pull-request. I recommend seeing this very short video on how to send yourself a pull-request with a small change. If your workflow file is executed correctly you should see github-actions commenting with a performance report on your PR.

                    5. (Optional) cml is offered by the same people behind dvc and it should therefore come as no surprise that these features can interact with each other. If you want to deep dive into this, here is a great starting point.

This ends the session on continuous machine learning. If you have not already noticed, one limitation of using github actions is that their default runners, e.g. runs-on: [ubuntu-latest], are CPU-only machines (see hardware config). As we all know, modern machine learning more or less requires hardware acceleration (=GPUs) to train within reasonable time. Luckily for us, cml also integrates with the large cloud providers and I therefore recommend that after going through the modules on cloud computing you return to this exercise and experiment with setting up self-hosted runners.

                    "},{"location":"s5_continuous_integration/github_actions/","title":"M16 - Github Actions","text":""},{"location":"s5_continuous_integration/github_actions/#github-actions","title":"Github actions","text":"

                    Core Module

With the tests established in the previous module we are now ready to move on to actually implementing some continuous integration in our pipeline. As you have probably already realized, testing your code locally can be cumbersome to do, because

                    • You need to run it often to make sure to catch bugs early on
• If you want to have high code coverage of your code base, you will need many tests that take a long time to run

For these reasons we want to automate the testing, such that it is done every time we push to our repository. If we combine this with only pushing to branches and then only merging these branches whenever all automated testing has passed, our code should be fairly safe against unwanted bugs (assuming your tests cover your code well).

                    "},{"location":"s5_continuous_integration/github_actions/#github-actions_1","title":"Github actions","text":"

Github actions are the CI solution that Github provides. Each of your repositories gets 2,000 minutes of free testing per month, which should be more than enough for the scope of this course (and probably all personal projects you do). Getting Github actions set up in a repository may seem complicated at first, but workflow files that you create for one repository can more or less be reused for any other repository that you have.

Let's take a look at how a github workflow file is organized:

                    • Initially we start by giving the workflow a name
• Next we specify on what events the workflow should be triggered. This includes both the action (pull request, push etc.) and on what branches it should activate
                    • Next we list the jobs that we want to do. Jobs are by default executed in parallel but can also be dependent on each other
• In runs-on we can specify which operating system we want the workflow to run on. We also have the possibility to specify multiple.
                    • Finally we have the steps. This is where we specify the actual commands that should be run when the workflow is executed.

                    Image credit"},{"location":"s5_continuous_integration/github_actions/#exercises","title":"\u2754 Exercises","text":"
                    1. Start by creating a .github folder in the root of your repository. Add a sub-folder to that called workflows.

                    2. Go over this page that explains how to do automated testing of python code in github actions. You do not have to understand everything, but at least get a feeling of what a workflow file should look like.

3. We have provided a workflow file called tests.yml that should run your tests for you. Place this file in the .github/workflows/ folder. The workflow file consists of three steps

• First a python environment is set up (in this case python 3.8)

                      • Next all dependencies required to run the test are installed

• Finally, pytest is called and the tests will be run

4. For the script to work you need to define the requirements.txt and requirements_tests.txt files. The first file should contain all packages required to run your code. The second file contains all additional packages required to run the tests. In your case it may very well be that the second file is empty; however, sometimes additional packages are used for testing that are not strictly required for the scripts to run.

5. Finally, try pushing the changes to your repository. Hopefully your tests should just start, and you will after some time see a green check mark next to the hash of the commit. Also try to check out the Actions tab where you can see the history of actions run.

6. Normally we develop code on one operating system and just hope that it will work on other operating systems. However, CI enables us to automatically test on systems other than our own.

                      1. The provided tests.yml only runs on one operating system. Which one?

2. Alter the file (or write a new one) that executes the tests on the two other main operating systems that exist.

7. As the workflow is currently set up, github actions will discard every downloaded package when the workflow has finished executing. To improve this we can take advantage of caching:

                      1. Figure out how to implement caching in your workflow file. You can find a guide here and here.

                      2. When you have implemented a caching system go to Actions->Caches in your repository and make sure that they are correctly added. It should look something like the image below

                      3. Measure how long your workflow takes before and after adding caching to your workflow. Did it improve the runtime of your workflow?

                    8. (Optional) Code coverage can also be added to the workflow file by uploading it as an artifact after running the coverage. Follow the instructions in this post on how to do it.

                    9. As stated in the introduction, ideally we want to only push our code to branches, such that our workflows run before we actually merge code into our codebase. We can directly prevent bad behavior by adding branch protection rules to our repository. Take the image below as an example from one of my own PRs:

In this example, the PR cannot be merged into the main branch before the following is fulfilled: at least 2 reviewers with write access have approved the PR, all Github actions marked as Required are passing and all conversations have been resolved. Since not all important tests are passing, further changes are necessary. We want to implement something similar. Do the following:

                      1. On your Github repository of choice, go to Settings -> Branches -> Add branch protection rule:

                      2. To your main/master branch add the following rules:

                        • At least one person needs to approve any PR
• All your workflows have to pass
• All conversations need to be resolved
                      3. To test that everything works, try creating a PR (possibly with a small bug) and see that your main/master branch is protected

10. One problem you may have encountered is running tests that have to do with your data, with the core problem being that your data is not actually stored in github (assuming you have done module M8 - DVC) and therefore cannot be tested. However, it is possible for us to download data while running our CI. Let's try to set that up:

1. The first problem is that our CI needs to be able to authenticate with our storage solution. We can take advantage of an authentication file that is created the first time we push with DVC. It is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json where $CACHE_HOME depends on your operating system:

• macOS: ~/Library/Caches
• Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)
• Windows: {user}/AppData/Local

                        Find the file. The content should look similar to this (only some fields are shown):

                        {\n    \"access_token\": ...,\n    \"client_id\": ...,\n    \"client_secret\": ...,\n    \"refresh_token\": ...,\n    ...\n}\n
2. The content of that file should be treated as a password and not shared with the world, and the relevant question is therefore how to use this info in a public repository. The answer is github secrets, where we can store information and access it in our workflow files while keeping it private. Navigate to the secrets option (as shown below) and create a secret with the name GDRIVE_CREDENTIALS_DATA that contains the content of the file you found in the previous exercise.

                      3. Afterwards, add the following code to your workflow file:

                        - uses: iterative/setup-dvc@v1\n- name: Get data\n  run: dvc pull\n  env:\n    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n

                        that runs dvc pull using the secret authentication file. For help you can visit this small repository that implements the same workflow.

4. Finally, add the changes, commit, push and confirm that everything works as expected. You should now be able to run unit tests that depend on your input data.

11. In module M6 on good coding practices (optional module) you were introduced to a couple of good coding practices, such as being consistent with your coding style, how your Python packages are sorted and that your code follows certain standards. All this was done using the ruff framework. In this set of exercises we will set up github workflows that automatically test for this.

                      1. Create a new workflow file called codecheck.yml, that implements the following three steps

                        • Setup python environment

                        • Installs ruff

                        • Runs ruff check and ruff format on the repository

                        (HINT: You should be able to just change the last steps of the tests.yml workflow file)

2. In addition to ruff we also used mypy in that set of exercises to check if the typing we added to our code was good enough. Add another step to the codecheck.yml file which runs mypy on your repository.

3. Try to make sure that all steps pass on your repository. Especially mypy can be hard to get passing, so this exercise formally only requires you to get ruff passing.

                    "},{"location":"s5_continuous_integration/github_actions/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                    1. When working with Github actions you will often encounter the following 4 concepts:

                      • Workflow
                      • Runner
                      • Job
                      • Action

                      Try to define them with your own words.

                      Solution
                      • Workflow: A yaml file that defines the instructions to execute on specific events. Needs to be placed in the .github/workflows folder.
• Runner: Workflows need to run somewhere. The environment that the workflow is being executed on is called the runner. Most commonly the runner is hosted by Github, but it can also be hosted by yourself.
                      • Job: A series of steps which are executed on the same runner. A workflow must include at least one job, but often contains many.
• Action: An action is the smallest unit in a workflow. Jobs often consist of multiple actions that are executed sequentially.
2. The on attribute specifies upon which events the workflow will be triggered. Assume you have set the on attribute to the following:

                      on:\n    push:\n      branches: [main]\n    pull_request:\n      branches: [main]\n    schedule:\n      - cron: \"0 0 * * *\"\n    workflow_dispatch: {}\n

                      What 4 events would trigger the execution of that action?

                      Solution
                      1. Direct push to branch main would trigger it
                      2. Any pull request opened that will merge into main would trigger it
3. Once a day, at midnight (00:00), the workflow would trigger due to the cron schedule
4. The workflow can be triggered manually through the Github UI because of workflow_dispatch, example shown below

This ends the module on Github workflows. If you are more interested in this topic you can check out module M31 on documentation, which first includes locally building some documentation for your project and afterwards using Github actions to deploy it to Github Pages. Additionally, Github also has a lot of templates for running many common CI tasks. If you try to create a workflow file directly in Github you may encounter the following page

We highly recommend checking this out if you want to write any other kind of CI pipeline in Github actions. We can also recommend this repository that has a list of awesome actions, and check out the act repository, which is a tool for running your GitHub Actions locally!

                    "},{"location":"s5_continuous_integration/pre_commit/","title":"M17 - Pre commit","text":""},{"location":"s5_continuous_integration/pre_commit/#pre-commit","title":"Pre-commit","text":"

One of the cornerstones of working with git is remembering to commit your work often. Committing often makes it easier to identify and revert unwanted changes that you have introduced, because the code changes per commit become smaller.

However, as you have hopefully already seen in the course, there are a lot of mental tasks to do before you actually write git commit in the terminal. The most basic one is of course making sure that you have saved all your changes and are not committing an out-of-date file. However, this also includes tasks such as styling, formatting, making sure all tests succeed etc. All these mental to-do notes do not mix well with the principle of committing often, because you in principle have to go through them every time.

The obvious solution to this problem is to automate all or some of these mental tasks every time that we do a commit. This is where pre-commit hooks come into play, as they can help us attach additional tasks that should be run every time that we do a git commit.

                    "},{"location":"s5_continuous_integration/pre_commit/#configuration","title":"Configuration","text":"

Pre-commit works by inserting whatever workflow we want to automate in between when we do a git commit and when we would afterwards do a git push.

                    Image credit

                    The system works by looking for a file called .pre-commit-config.yaml that we can configure. If we execute

                    pre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8\n

                    you should get a sample file that looks like

                    # See https://pre-commit.com for more information\n# See https://pre-commit.com/hooks.html for more hooks\nrepos:\n-   repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v3.2.0\n    hooks:\n    -   id: trailing-whitespace\n    -   id: end-of-file-fixer\n    -   id: check-yaml\n    -   id: check-added-large-files\n

                    the file structure is very simple:

                    • It starts by listing the repositories where we want to get our pre-commits from, in this case https://github.com/pre-commit/pre-commit-hooks. This repository contains a large collection of pre-commit hooks.
• Next we need to define which pre-commit hooks we want by specifying the id of the different hooks. The id corresponds to an id in this file: https://github.com/pre-commit/pre-commit-hooks/blob/master/.pre-commit-hooks.yaml

                    When we are done defining our .pre-commit-config.yaml we just need to install it

                    pre-commit install\n

this will make sure that the hooks are automatically executed whenever we run git commit

                    "},{"location":"s5_continuous_integration/pre_commit/#exercises","title":"\u2754 Exercises","text":"
                    1. Install pre-commit

                      pip install pre-commit\n
                    2. Next create the sample file

                      pre-commit sample-config > .pre-commit-config.yaml\n
3. The sample file already contains 4 hooks. Make sure you understand what each does and whether you need them at all.

                    4. pre-commit works by hooking into the git commit command, running whenever that command is run. For this to work, we need to install the hooks into git commit. Run

                      pre-commit install\n

                      to do this.

5. Try to commit your recently created .pre-commit-config.yaml file. It will likely not do much, because pre-commit only checks the files that are being committed. Instead try to run

                      pre-commit run --all-files\n

                      that will check every file in your repository.

                    6. Try adding at least another check from the base repository to your .pre-commit-config.yaml file.

                    7. If you have completed the optional module M7 on good coding practice you will have learned about the linter ruff. ruff comes with its own pre-commit hook. Try adding that to your .pre-commit-config.yaml file and see what happens when you try to commit files.

                    8. (Optional) Add more hooks to your .pre-commit-config.yaml.

9. Sometimes you are in a hurry, so make sure that you can also do commits without running pre-commit, e.g.

                      git commit -m <message> --no-verify\n
                    10. Finally, figure out how to disable pre-commit again (if you get tired of it).

That was all about how pre-commit can be used to automate tasks. If you want to deep dive more into the topic you can check out this page on how to define your own pre-commit hooks.

                    "},{"location":"s5_continuous_integration/unittesting/","title":"M15 - Unittesting","text":""},{"location":"s5_continuous_integration/unittesting/#unit-testing","title":"Unit testing","text":"

                    Core Module

What often comes to mind for many developers when discussing continuous integration (CI) is code testing. CI should ensure that whenever a codebase is updated it is automatically tested, such that bugs that have been introduced are caught early on. If you look at the MLOps cycle, CI is one of the cornerstones of the operations part. However, it should be noted that applying CI does not magically ensure that your code does not break. CI is only as strong as the tests that are automatically executed. CI simply structures and automates this.

                    Quote

                    Continuous Integration doesn\u2019t get rid of bugs, but it does make them dramatically easier to find and remove. Martin Fowler, Chief Scientist, ThoughtWorks

                    Image credit

The kind of tests we are going to look at are called unit tests. Unit testing refers to the practice of writing tests that check individual parts of your codebase for correctness. By a unit, you can therefore think of a function, a module or in general any object. By writing tests in this way it should be very easy to isolate which part of the code broke after an update to the codebase. Another way to test your codebase would be through integration testing, which is equally important but which we are not going to focus on in this course.

Unit tests (and integration tests) are not a concept unique to MLOps but a core concept of DevOps. However, it is important to note that testing machine learning-based systems is much more difficult than testing traditional systems. The reason for this is that machine learning systems depend on data, which influences the state of our system. For this reason, we not only need unit tests and integration tests of our code, we also need data testing, infrastructure testing and more monitoring to check that we stay within the data distribution we are training on (more on this in module M25 on data drifting). This added complexity is illustrated in the figure below.

                    "},{"location":"s5_continuous_integration/unittesting/#pytest","title":"Pytest","text":"

Before we can begin to automate the testing of our codebase we of course need to write the tests first. It is both a hard and tedious task, but arguably the most important aspect of CI. Python offers a couple of different libraries for writing tests. We are going to use pytest.
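
To get a first feel for what a test looks like, here is a minimal, hypothetical example: a tiny function together with a pytest test for it. The add function and the file names are placeholders and not part of your MNIST project.

# my_math.py (hypothetical module)\ndef add(x: int, y: int) -> int:\n    return x + y\n\n\n# tests/test_my_math.py\nfrom my_math import add\n\n\ndef test_add():\n    # pytest finds this because both the file name and the function name start with test_\n    assert add(1, 2) == 3\n    assert add(-1, 1) == 0\n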

                    "},{"location":"s5_continuous_integration/unittesting/#exercises","title":"\u2754 Exercises","text":"

                    The following exercises should be applied to your MNIST repository

                    1. The first part of doing CI is writing the unit tests. We do not expect you to cover every part of the code you have developed but try to at least write tests that cover two files. Start by creating a tests folder.

                    2. Read the getting started guide for pytest which is the testing framework that we are going to use

                    3. Install pytest:

                      pip install pytest\n
4. Write some tests. Below are some guidelines for tests that should be implemented, but you are of course free to implement more. You can at any point check if your tests are passing by typing in a terminal

                      pytest tests/\n

When you implement a test you need to follow two standards for pytest to be able to find your tests. First, any test file created (except __init__.py) should always be named test_*.py. Secondly, any test needs to be wrapped in its own function whose name also starts with test_:

                      # this will be found and executed by pytest\ndef test_something():\n    ...\n\n# this will not be found and executed by pytest\ndef something_to_test():\n    ...\n
                      1. Start by creating a tests/__init__.py file and fill in the following:

                        import os\n_TEST_ROOT = os.path.dirname(__file__)  # root of test folder\n_PROJECT_ROOT = os.path.dirname(_TEST_ROOT)  # root of project\n_PATH_DATA = os.path.join(_PROJECT_ROOT, \"Data\")  # root of data\n

These can help you refer to your data files during testing. For example, in another test file, I could write

                        from tests import _PATH_DATA\n

                        which then contains the root path to my data.

                      2. Data testing: In a file called tests/test_data.py implement at least a test that checks that data gets correctly loaded. By this, we mean that you should check

                        def test_data():\n    dataset = MNIST(...)\n    assert len(dataset) == N_train for training and N_test for test\n    assert that each datapoint has shape [1,28,28] or [784] depending on how you choose to format\n    assert that all labels are represented\n

where N_train should be either 30,000 or 50,000 depending on whether you are using just the first subset of the corrupted MNIST data or also including the second subset. N_test should be 5,000. A hedged sketch of what such a test could look like is shown below.
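
As a starting point, here is a sketch of such a test. It assumes a hypothetical helper called corrupt_mnist() that returns the train and test datasets; adapt the name, the loading code and the expected sizes to your own project.

# tests/test_data.py (sketch only, adapt to your own data loading code)\nimport torch\n\nN_TRAIN = 30_000  # or 50_000 if you also include the second subset\nN_TEST = 5_000\n\n\ndef test_data():\n    # corrupt_mnist() is a hypothetical helper returning (train_dataset, test_dataset)\n    train_dataset, test_dataset = corrupt_mnist()\n\n    assert len(train_dataset) == N_TRAIN\n    assert len(test_dataset) == N_TEST\n\n    for x, _ in train_dataset:\n        assert x.shape == (1, 28, 28)  # or (784,) depending on how you format the data\n\n    labels = torch.tensor([int(y) for _, y in train_dataset])\n    assert (torch.unique(labels) == torch.arange(10)).all()  # all 10 classes are represented\n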

                      3. Model testing: In a file called tests/test_model.py implement at least a test that checks for a given input with shape X that the output of the model has shape Y.

4. Training testing: In a file called tests/test_training.py implement at least one test that asserts something about your training script. You are given free rein on what should be tested, but try to test something that risks being broken when developing the code.

5. Good code raises errors and gives out warnings in appropriate places. This is often the case for some invalid combination of inputs to your script. For example, your model could check the size of the input given to it (see code below) to make sure it corresponds to what you are expecting. Not implementing such errors would still result in Pytorch failing at a later point due to shape errors, however, these custom errors will probably make more sense to the end user. Implement at least one raised error or warning somewhere in your code and use either pytest.raises or pytest.warns to check that they are correctly raised/warned. As inspiration, the following implements a ValueError in code belonging to the model:

                        # src/models/model.py\ndef forward(self, x: Tensor):\n    if x.ndim != 4:\n        raise ValueError('Expected input to a 4D tensor')\n    if x.shape[1] != 1 or x.shape[2] != 28 or x.shape[3] != 28:\n        raise ValueError('Expected each sample to have shape [1, 28, 28]')\n

                        which would be captured by a test looking something like this:

# tests/test_model.py\nimport pytest\nimport torch\n\n\ndef test_error_on_wrong_shape():\n    with pytest.raises(ValueError, match='Expected input to a 4D tensor'):\n        model(torch.randn(1, 2, 3))  # model here is an instance of your model class\n
                      6. A test is only as good as the error message it gives, and by default, assert will only report that the check failed. However, we can help ourselves and others by adding strings after assert like

                        assert len(train_dataset) == N_train, \"Dataset did not have the correct number of samples\"\n

                        Add such comments to the assert statements you just did.

7. The tests that involve checking anything that has to do with our data will of course fail if the data is not present. To future-proof our code, we can take advantage of the pytest.mark.skipif decorator. Use this decorator to skip your data tests if the corresponding data files do not exist. It should look something like this

import os.path\n\nimport pytest\n\n\n# file_path should point to the data file(s) your test depends on\n@pytest.mark.skipif(not os.path.exists(file_path), reason=\"Data files not found\")\ndef test_something_about_data():\n    ...\n

                        You can read more about skipping tests here

                    5. After writing the different tests, make sure that they are passing locally.

6. We often want to check a function/module for various input arguments. In this case, you could write the same test over and over again for the different inputs, but pytest also has built-in support for this with the pytest.mark.parametrize decorator. Implement a parametrized test and make sure that it runs for different inputs. A small sketch of what this can look like is shown below.
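
Here is a hedged sketch of what a parametrized test could look like; the add function is a hypothetical stand-in for whatever function from your own code you choose to test.

# tests/test_parametrized.py (sketch, replace add with something from your own code)\nimport pytest\n\n\ndef add(x, y):  # hypothetical function under test\n    return x + y\n\n\n@pytest.mark.parametrize(\"x, y, expected\", [(1, 2, 3), (0, 0, 0), (-1, 1, 0)])\ndef test_add(x, y, expected):\n    # the test body runs once for every tuple in the list above\n    assert add(x, y) == expected\n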

7. There is no way of measuring how good the tests you have written are. However, what we can measure is the code coverage. Code coverage refers to the percentage of your codebase that actually gets run when all your tests are executed. Having a high coverage at least means that most of your code actually runs when the tests are executed (a small illustrative example is shown after these exercises).

                      1. Install coverage

                        pip install coverage\n
                      2. Instead of running your tests directly with pytest, now do

                        coverage run -m pytest tests/\n
                      3. To get a simple coverage report simply type

                        coverage report\n

which will give you the percentage of coverage in each of your files. You can also write

                        coverage report -m\n

                        to get the exact lines that were missed by your tests.

                      4. Finally, try to increase the coverage by writing a new test that runs some of the lines in your codebase that are not covered yet.

5. Often coverage reports coverage for files that we do not actually want coverage numbers for. Figure out how to configure coverage to exclude some files.
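
To make the coverage metric a bit more concrete, here is a small hypothetical example. The test below only exercises the first branch of clip, so coverage report -m would flag the remaining lines as missed.

# example.py (hypothetical, only meant to illustrate what coverage measures)\ndef clip(x: float, low: float, high: float) -> float:\n    if x < low:\n        return low  # covered by the test below\n    if x > high:\n        return high  # never executed by the test below, reported as a missed line\n    return x  # also missed\n\n\n# tests/test_example.py\nfrom example import clip\n\n\ndef test_clip_low():\n    assert clip(-5, 0, 10) == 0  # only covers the x < low branch\n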

                    "},{"location":"s5_continuous_integration/unittesting/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
                    1. Assuming you have a code coverage of 100%, would you expect that no bugs are present in your code?

                      Solution

                      No, code coverage is not a guarantee that your code is bug-free. It is just a measure of how many lines of code are run when your tests are executed. Therefore, there may still be some corner case that is not covered by your tests and will result in a bug. However, having a high code coverage is a good indicator that you have tested your code.

                    2. Consider the following code:

@pytest.mark.parametrize(\"network_size\", [10, 100, 1000])\n@pytest.mark.parametrize(\"device\", [\"cpu\", \"cuda\"])\nclass TestMyClass:  # classes need to start with Test for pytest to collect them\n    @pytest.mark.parametrize(\"network_type\", [\"alexnet\", \"squeezenet\", \"vgg\", \"resnet\"])\n    @pytest.mark.parametrize(\"precision\", [torch.half, torch.float, torch.double])\n    def test_network1(self, network_size, device, network_type, precision):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass(network_size, network_type).to(device=device, dtype=precision)\n        ...\n\n    @pytest.mark.parametrize(\"add_dropout\", [True, False])\n    def test_network2(self, network_size, device, add_dropout):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass2(network_size, add_dropout).to(device)\n        ...\n

                      how many tests are executed when running the above code?

                      Solution

The answer depends on whether or not we are running on a GPU-enabled machine. test_network1 has 4 parameters, network_size, device, network_type, precision, that respectively can take on 3, 2, 4, 3 values, meaning that in total that test will be running 3x2x4x3=72 times with different parameters on a GPU-enabled machine and 36 times on a machine without a GPU. A similar calculation can be done for test_network2, which only has three factors network_size, device, add_dropout that result in 3x2x2=12 tests on a GPU-enabled machine and 6 on a machine without a GPU. In total, that means 84 tests would run on a machine with a GPU and 42 on a machine without a GPU.

That covers the basics of writing unit tests for Python code. We want to note that pytest is of course not the only framework for doing this. Python has a built-in framework called unittest for doing this as well (but pytest offers a few more features). Another open-source framework that you could choose to check out is hypothesis, which can help catch errors in corner cases of your code. In addition to writing unit tests it is also highly recommended to test the code that you include in the docstrings belonging to your functions and modules, to make sure that any code in your documentation is also correct. For such testing, we can highly recommend using Python's built-in framework doctest.
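
As a small, hypothetical illustration of what doctest-style testing looks like: the example in the docstring below is executed and compared against the expected output when the module is run through doctest.

# my_module.py (hypothetical example of a doctest)\ndef square(x: float) -> float:\n    \"\"\"Return the square of x.\n\n    Example:\n        >>> square(3)\n        9\n    \"\"\"\n    return x * x\n\n\nif __name__ == \"__main__\":\n    import doctest\n\n    doctest.testmod()  # python my_module.py prints nothing if all examples pass\n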

                    "},{"location":"s6_the_cloud/","title":"Cloud computing","text":"

                    Slides

Running computations locally is often sufficient when you are only playing around with code in the initial phase of development. However, to really scale your experiments you will need more computing power than what your standard laptop/desktop can offer. You probably already have experience with running on a local cluster or similar, but today's topic is about utilizing cloud computing.

                    Image credit

There exists a large number of cloud compute providers, with some of the biggest being:

                    • Azure
                    • AWS
• Google Cloud Platform
                    • Alibaba cloud

They all have slight advantages and disadvantages over each other. In this course we are going to focus on Google Cloud, because they have been kind enough to sponsor $50 of cloud credit for each student. If you happen to run out of credit, you can also get some free credit for a limited amount of time when you sign up with a new account. What's important to note is that these cloud providers all offer the same core set of services, and learning how to use the services of one cloud provider in many cases translates to knowing how to use the same services at another cloud provider. The services are called something different and can have a slightly different interface/interaction pattern, but in the end it does not really matter.

Today's exercises are about getting to know how to work with the cloud. If you are in doubt about anything or want to deep dive into some topics, I can recommend watching this series of videos or going through the general docs.

                    Learning objectives

                    The learning objectives of this session are:

• Being familiar with how the Google Cloud SDK works in general
                    • Being able to start different compute instances and work with them
• Knowing how to set up continuous integration workflows for building docker images
                    • Knowledge about how to store data and containers/artifacts in cloud buckets
                    • Being able to train simple Deep Learning models using a combination of cloud services
                    "},{"location":"s6_the_cloud/cloud_setup/","title":"Cloud setup","text":"

                    Core Module

Google Cloud Platform (GCP) is the cloud service provided by Google. The key concept, or selling point, of any cloud provider is the idea of near-infinite resources. Without the cloud it simply is not feasible to do many modern deep learning and machine learning tasks because they cannot be scaled locally.

The image below shows a subset of all the different services that the Google Cloud Platform offers. The ones marked in red are the ones we are actually going to investigate in this course. Therefore, if you get done with the exercises early I highly recommend that you deep dive more into the Google Cloud Platform.

                    Image credit"},{"location":"s6_the_cloud/cloud_setup/#exercises","title":"\u2754 Exercises","text":"

As the first step we are going to get you set up with some Google Cloud credits.

1. Go to https://learn.inside.dtu.dk. Go to this course. Find the recent message where there should be a download link and instructions on how to claim the $50 cloud credit. Please do not share the link anywhere as there is a limited number of coupons. If you are not officially taking this course at DTU, Google gives $300 in cloud credit whenever you sign up with a new account. NOTE that you need to provide a credit card for this, so make sure to closely monitor your credit use so you do not end up spending more than the free credit.

2. Log in to the homepage of GCP. It should look like this:

                    3. Go to billing and make sure that your account is showing $50 of cloud credit

Make sure to also check out the Reports tab throughout the course. When you start using some of the cloud services, this tab will update with info about how much you can use before your cloud credit runs out. Make sure that you monitor this page as you will not be given another coupon.

                    4. One way to stay organized within GCP is to create projects.

Create a new project called dtumlops. When you click create you should get a notification that the project is being created. The notification bell is a good way to keep track of how the processes you are running are doing throughout the course.

                    5. For setup we are going to install gcloud. gcloud is the command line interface for working with our Google cloud account. Nearly everything that we can do through the web interface we can also do through the gcloud interface. Follow the installation instructions here for your specific OS.

                      1. After installation, try in a terminal to type:

                        gcloud -h\n

the command should execute and show the help page. If not, something went wrong in the installation (you may need to restart your terminal after installing).

                      2. Now login by typing

                        gcloud auth login\n

you should be sent to a web page where you link your cloud account to the gcloud interface. Afterwards, also run this command:

                        gcloud auth application-default login\n

                        If you at some point want to revoke this you can type:

                        gcloud auth revoke\n
3. Next you will need to set the project that we just created. In your web browser, under project info, you should be able to see the Project ID belonging to your dtumlops project. Copy this and type the following command in a terminal

                        gcloud config set project <project-id>\n

                        You can also get the project info by running

                        gcloud projects list\n
                      4. Next install the Google cloud python API:

                        pip install --upgrade google-api-python-client\n

                        Make sure that the python interface is also installed. In a python terminal type

                        import googleapiclient\n

                        this should work without any errors.

                      5. (Optional) If you are using VSCode you can also download the relevant extension called Cloud Code. After installing it you should see a small Cloud Code button in the action bar.

                    6. Finally, we need to activate a couple of developer APIs that are not activated by default. In a terminal write

                      gcloud services enable apigateway.googleapis.com\ngcloud services enable servicemanagement.googleapis.com\ngcloud services enable servicecontrol.googleapis.com\n

                      you can always check which services are enabled by typing

                      gcloud services list\n

After following these steps your laptop should hopefully be set up for using GCP locally. You are now ready to use their services, both locally on your laptop and in the cloud console.

                    "},{"location":"s6_the_cloud/cloud_setup/#iam-and-quotas","title":"IAM and Quotas","text":"

A big part of using the cloud in a bigger organisation has to do with admin and quotas. Admin here in general refers to the different roles that users of GCP can have, and quotas refer to the amount of resources that a given user has access to. For example one employee, let's say a data scientist, may only be granted access to certain GCP services that have to do with development and training of machine learning models, with X amount of GPUs available, to make sure that the employee does not spend too much money. Another employee, a devops engineer, probably does not need access to the same services and not necessarily the same resources.

In this course we are not going to focus too much on this aspect but it is important to know that it exists. One feature you are going to need for the project is how to share a project with other people. This is done through the IAM (Identity and Access Management) page. Simply click the Grant Access button, search for the email of the person you want to share the project with and give them either Viewer, Editor or Owner access, depending on what you want them to be able to do. The figure below shows how to do this.

What we are going to go through right now is how to increase the quota for how many GPUs you have available for your project. By default, for any free account in GCP (or account using teaching credits) the quota for GPUs that you can use is either 0 or 1 (their policies sometimes change). We will try to increase it in the exercises below.

                    "},{"location":"s6_the_cloud/cloud_setup/#exercises_1","title":"\u2754 Exercises","text":"
1. Start by enabling the Compute Engine service. Simply search for it in the top search bar. It should bring you to a page where you can enable the service (may take some time). We are going to look more into this service in the next module.

                    2. Next go to the IAM & Admin page, again search for it in the top search bar. The remaining steps are illustrated in the figure below.

                      1. Go to the quotas page

                      2. In the search field search for GPUs (all regions) (needs to match exactly, the search field is case sensitive), such that you get the same quota as in the image.

3. In the Limit column you can see what your current quota for the number of GPUs you can use is. Additionally, to the right of the limit you can see the current usage. This is worth checking in on if you are ever in doubt whether a job is actually running on a GPU or not.

4. Click the quota and afterwards the Edit quotas button.

5. In the pop-up window, increase your limit to either 1 or 2.

                      6. After sending your request you can try clicking the Increase requests tab to see the status of your request

If you ever run into errors when working with GPUs that contain statements about quotas, you can always try to go to this page to see what you are currently allowed to use and try to increase it. For example, when you get to training machine learning models using Vertex AI in the next module, you will most likely need to ask for a quota increase for that service as well.

Finally, we want to note that a quota increase is sometimes not allowed within 24 hours of creating an account. If your request gets rejected, we recommend waiting a day and trying again. If this still does not work, you may need to use their services some more to make sure you are not a bot that wants to mine crypto.

                    "},{"location":"s6_the_cloud/cloud_setup/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"
1. What considerations should you take into account when choosing a GCP region for running a new application?

                      Solution

                      A series of factors may influence your choice of region, including:

• Service availability: not all services are available in all regions
                      • Resource availability: some regions have more GPUs available than others
                      • Reduced latency: if your application is running in the same region as your users, the latency will be lower
• Compliance: some countries have strict rules that require user info to be stored inside a particular region, e.g. the EU has GDPR rules that require user data to be stored in the EU
                      • Pricing: some regions may have different pricing than others
                    2. The 3 major cloud providers all have the same services, but they are called something different depending on the provider. What are the corresponding names of these GCP services in AWS and Azure?

                      • Compute Engine
                      • Cloud storage
                      • Cloud functions
                      • Cloud run
                      • Cloud build
                      • Vertex AI

It is important to know these correspondences to be able to navigate blog posts etc. about MLOps on the internet.

Solution

• Compute Engine (GCP) → Elastic Compute Cloud (EC2) on AWS, Virtual Machines on Azure
• Cloud storage (GCP) → Simple Storage Service (S3) on AWS, Blob Storage on Azure
• Cloud functions (GCP) → Lambda Functions on AWS, Serverless Compute on Azure
• Cloud run (GCP) → App Runner, Fargate, Lambda on AWS; Container Apps, Container Instances on Azure
• Cloud build (GCP) → CodeBuild on AWS, DevOps on Azure
• Vertex AI (GCP) → SageMaker on AWS, AI Platform on Azure
                    "},{"location":"s6_the_cloud/using_the_cloud/","title":"Using the cloud","text":"

                    Core Module

In this set of exercises we are going to get more familiar with using some of the resources that the Google Cloud Platform offers.

                    "},{"location":"s6_the_cloud/using_the_cloud/#compute","title":"Compute","text":"

The most basic service of any cloud provider is the ability to create and run virtual machines. In GCP this service is called the Compute Engine API. A virtual machine allows you to essentially run an operating system that behaves like a completely separate computer. There are many reasons why one would want to use virtual machines:

                    • Virtual machines allow you to scale your operations, essentially giving you access to infinitely many individual computers

• Virtual machines allow you to use large-scale hardware. For example, if you are developing a deep learning model on your laptop and want to know the inference time for a specific hardware configuration, you can just create a virtual machine with those specs and run your model.

• Virtual machines allow you to run processes in the \"background\". If you want to train a model for a week or more, you do not want to do this on your own laptop as you cannot really move it or do anything with it while it is training. Virtual machines allow you to just launch a job and forget about it (at least until you run out of credit).

                    "},{"location":"s6_the_cloud/using_the_cloud/#exercises","title":"\u2754 Exercises","text":"

                    We are now going to start actually using the cloud.

1. Click on the Compute Engine tab in the sidebar on the homepage of GCP.

2. Try to Create instance. You should see something like the image below.

Give it a meaningful name and set the location to one that is close to where you actually are (to reduce latency). Finally, try to adjust the configuration a bit. Which two factors are affecting the price of the compute unit?

3. After figuring this out, create an e2-medium instance (leave the rest configured as default). Before clicking the Create button, make sure to check the Equivalent command line button. You should see a very long command that you could have typed instead to do the exact same thing.

                    4. Now in a local terminal type:

                      gcloud compute instances list\n

                      you should hopefully see the instance you have just created.

                    5. You can start a terminal directly by typing:

                      gcloud beta compute ssh --zone <zone> <name> --project <project-id>\n

You can always see the exact command that you need to run to ssh to a VM by selecting the View gcloud command option in the Compute Engine overview (see image below).

6. While logged into the instance, check if Python and Pytorch are installed. You should see that neither is installed. For the VM we created, we only specified what compute resources it should have, not what software should be installed on it. We can fix this by starting VMs based on specific docker images (it's all coming together).

1. GCP comes with a number of ready-to-go images for doing deep learning. More info can be found here. Try running this line:

                        gcloud container images list --repository=\"gcr.io/deeplearning-platform-release\"\n

                        what does the output show?

                      2. Next, start (in the terminal) a new instance using a Pytorch image. The command for doing it should look something like this:

gcloud compute instances create <instance_name> \\\n    --zone=<zone> \\\n    --image-family=<image-family> \\\n    --image-project=deeplearning-platform-release\n# add these arguments (appended to the command above) if you want to run on GPU:\n#   --accelerator=\"type=nvidia-tesla-k80,count=1\" \\\n#   --maintenance-policy TERMINATE \\\n#   --metadata=\"install-nvidia-driver=True\"\n

You can find more info here on what value <image-family> should have and what extra arguments you need to add if you want to run on GPU (if you have access).

3. ssh into the VM as in one of the previous exercises. Confirm that the image indeed contains both a Python installation and Pytorch. Hint: you also have the possibility through the web page to start a browser session directly to the VMs you create:

                    7. Finally, everything that you have done locally can also be achieved through the web terminal, which of course comes pre-installed with the gcloud command etc.

                      Try out launching this and run some of the commands from the previous exercises.

                    Stopping VMs

                    If you are not careful you can end up wasting a lot of credits on virtual machines that you are not using. VMs are charged by the minute, so even if you are not using them you are still paying for them. Therefore, it is important that you remember to stop your VMs when you are not using them. You can do this by either clicking the Stop button in the VM overview page or by running the following command:

                    gcloud compute instances stop <instance-name>\n
                    "},{"location":"s6_the_cloud/using_the_cloud/#data-storage","title":"Data storage","text":"

Another big part of cloud computing is the storage of data. There are many reasons why you would want to store your data in the cloud, including:

                    • Easily being able to share
                    • Easily expand as you need more
• Data is stored in multiple locations, making sure that it is not lost in case of an emergency

Cloud storage is luckily also very cheap. Google Cloud only charges around $0.026 per GB per month. This means that around 1 TB of data would cost you $26 per month, which is more than what the same amount of data would cost on Google Drive, but the storage in Google Cloud is much more focused on enterprise use where you need to access data through an API.

                    "},{"location":"s6_the_cloud/using_the_cloud/#exercises_1","title":"\u2754 Exercises","text":"

When we did the exercise on data version control, we made dvc work together with our own Google Drive to store data. However, a big limitation of this is that we need to authenticate each time we try to either push or pull the data. The solution is to use an API instead, which is offered through GCP.

                    We are going to follow the instructions from this page

1. Let's start by creating a data storage bucket. On the GCP start page, in the sidebar, click on Cloud Storage. On the next page click Create bucket:

Give the bucket a unique name, set it to a region close by and, importantly, remember to enable Object versioning under the last tab. Finally click Create.

2. After creating the bucket, you should be able to see it online and also list it if you type in your local terminal:

                      gsutil ls\n

gsutil is an additional command-line tool that complements gcloud with more options for working with storage.

                    3. Next we need the Google storage extension for dvc

                      pip install dvc[gs]\n
                    4. Now in your MNIST repository where you have already configured dvc, we are going to change the storage from our Google drive to our newly created Google cloud storage.

dvc remote add -d remote_storage <output-from-gsutil>\n

                      In addition we are also going to modify the remote to support object versioning (called version_aware in dvc):

                      dvc remote modify remote_storage version_aware true\n

This will change the default way that dvc handles data. Instead of storing the latest version of the data as content-addressable storage, it will now store the data as it looks in our local repository, which allows us to also access the data without going through dvc.

                    5. The above command will change the .dvc/config file. git add and git commit the changes to that file. Finally, push data to the cloud

                      dvc push\n
6. Finally, make sure that you can pull without having to give your credentials. The easiest way to check this is to delete the .dvc/cache folder that should be locally on your laptop and afterwards do a dvc pull.

This setup should work when trying to access the data from your laptop, which we authenticated in the previous module. However, how can you access the data from a virtual machine, inside a docker container or from a different laptop? We generally recommend two ways:

• You can make the bucket publicly accessible, i.e. no authentication needed. That means that anyone with the URL to the data can access it. This is the easiest way to do it, but also the least secure. You can read more about how to make your buckets public here.

• You can create a service account, which is a more secure way of accessing data. A service account is essentially a second user which you can give access to specific services. You can read more about how to create a service account here. Once you have created a service account you can give it access to a specific bucket by going to the Permissions tab of the bucket and adding the service account as a member.

                      If you need to authenticate your service account from a VM, you can do it by running the following command:

                      gcloud auth activate-service-account --key-file=<key-file>\n

where <key-file> is the json file that you downloaded when you created the service account (DO NOT SHARE THIS).
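
If you prefer to access the bucket programmatically instead of going through dvc, a minimal sketch using the google-cloud-storage Python client could look like the one below. It assumes that you have pip installed google-cloud-storage, that the bucket name and object path (placeholders here) are replaced with your own, and that GOOGLE_APPLICATION_CREDENTIALS points to the service-account key file (or that you are already authenticated through gcloud).

import os\nfrom google.cloud import storage  # pip install google-cloud-storage\n\n# assumption: path to the service-account key file you downloaded (never commit this file)\nos.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = \"service_account_key.json\"\n\nclient = storage.Client()\nbucket = client.bucket(\"my-dtumlops-bucket\")  # hypothetical bucket name, replace with your own\n\n# list a few objects to confirm that authentication works\nfor blob in client.list_blobs(\"my-dtumlops-bucket\", max_results=5):\n    print(blob.name)\n\n# download a single file from the bucket to the local machine\nbucket.blob(\"data/data.pt\").download_to_filename(\"data.pt\")  # hypothetical object path\n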

                    "},{"location":"s6_the_cloud/using_the_cloud/#artifact-registry","title":"Artifact registry","text":"

You should hopefully at this point have seen the strength of using containers to create reproducible environments. They allow us to specify exactly the software that we want to run inside our VMs. However, you have probably already run into two problems with containers:

                    • Building process can take a lot of time
                    • Docker images can be large

                    For this reason we want to move both the building process and the storage of images to the cloud. In GCP the service for this is called Artifact registry, formerly known as Container registry.

                    "},{"location":"s6_the_cloud/using_the_cloud/#exercises_2","title":"\u2754 Exercises","text":"

For the purpose of these exercises I recommend that you start out with a dummy version of some code to make sure that the building process does not take too long. You are more than welcome to fork this repository. The repository contains a simple Python script that does image classification using sklearn. The docker images for this application are therefore going to be substantially faster to build and smaller in size than the Pytorch-based images we are used to.

1. Start by enabling the services Google Artifact Registry API and Google Cloud Build API. This can be done through the website (by searching for the services) or from the terminal:

                      gcloud services enable artifactregistry.googleapis.com\ngcloud services enable cloudbuild.googleapis.com\n
2. Google Cloud Build can in principle work out of the box with Dockerfiles. However, the recommended way is to add specialized cloudbuild.yaml files. They should look something like this:

steps:\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['build', '-t', 'gcr.io/<project-id>/<image-name>', '.']\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['push', 'gcr.io/<project-id>/<image-name>']\n

which is essentially a basic yaml file that contains a list of steps, where each step consists of the service that should be used and the arguments for that service. In the above example we are calling the same service (cloud-builders/docker) with different arguments (first build and then push). Implement such a file in your repository. Hint: if you forked the repository then you at least need to change the <project-id>.

                    3. From the gcp homepage, navigate to the triggers panel:

Click on Manage repositories.

4. From there, click Connect Repository and go through the steps of authenticating your github profile with GCP and choosing the repository that you want to set up build triggers for. For now, skip the Create a trigger (optional) part by pressing Done at the end.

                    5. Navigate back to the Triggers homepage and click Create trigger. Set the following:

                      • Give a name
                      • Event: choose Push to branch
                      • Source: choose the repository you just connected
                      • Branch: choose ^main$
                      • Configuration: choose either Autodetected or Cloud build configuration file

                      Finally click the Create button and the trigger should show up on the triggers page.

                    6. To activate the trigger, push some code to the chosen repository.

7. Go to the Cloud Build page and you should see the image being built and pushed.

Try clicking on the build to check out the build process and the build summary. As you can see from the image, if a build is failing you will often find valuable info by looking at the build summary. If your build is failing, try to configure it to run in one of these regions: us-central1, us-west2, europe-west1, asia-east1, australia-southeast1, southamerica-east1, as specified in the documentation.

8. If/when your build is successful, navigate to the Artifact Registry page. You should hopefully find that the image you just built was pushed there. Congrats!

9. Finally, try to pull your image down to your laptop

                      docker pull gcr.io/<project-id>/<image_name>:<image_tag>\n

                      you will need to authenticate docker with gcp first. Instructions can be found here, but the following command should hopefully be enough to make docker and gcp talk to each other:

                      gcloud auth configure-docker\n

                      Note: To do this you need to have docker actively running in the background, as any other time you want to use docker.

10. Automation through the cloud is in general the way to go, but sometimes you may want to manually create images and push them to the registry. Figure out how to push an image to your Artifact Registry. For simplicity you can just push the busybox image you downloaded during the initial docker exercises. This page should help you with the exercise.

                    "},{"location":"s6_the_cloud/using_the_cloud/#training","title":"Training","text":"

As the final step in our journey through different GCP services in this module, we are going to look at training our models. This is one of the important tasks that GCP can help us with, because we can always rent more hardware as long as we have credits, meaning that we can scale both horizontally (run more experiments) and vertically (run longer experiments).

We are going to check out two ways of running our experiments. First we are going to return to the Compute Engine service, because it gives the simplest form of scaling of experiments. That is: we create a VM with an appropriate docker image, we start it, log in to the VM and run our experiments. It is possible for most people to run a couple of experiments in parallel this way. However, what if there was an abstraction layer that automatically created the VM for us, launched our experiments and then closed down the VM afterwards?

This is where the Vertex AI service comes into play. This is a dedicated service for handling ML models in the cloud on GCP. Vertex AI is in principle an end-to-end service that can take care of everything machine learning related in the cloud. In this course we are primarily focused on just the training of our models, and we then use other services for different parts of our pipeline.

                    "},{"location":"s6_the_cloud/using_the_cloud/#exercises_3","title":"\u2754 Exercises","text":"
1. Let's start by seeing how we could train a Pytorch model using the Compute Engine service:

1. Start by creating an appropriate VM. If you want to start a VM that has Pytorch pre-installed with only CPU support, you can run the following command

                        gcloud compute instances create <instance-name> \\\n    --zone europe-west1-b \\\n    --image-family=pytorch-latest-cpu \\\n    --image-project=deeplearning-platform-release\n

                        alternatively, if you have access to GPU in your GCP account you could start a VM in the following way

                        gcloud compute instances create <instance-name> \\\n    --zone europe-west4-a \\\n    --image-family=pytorch-latest-gpu \\\n    --image-project=deeplearning-platform-release \\\n    --accelerator=\"type=nvidia-tesla-v100,count=1\" \\\n    --metadata=\"install-nvidia-driver=True\" \\\n    --maintenance-policy TERMINATE\n
2. Next, log in to your newly created VM. You can either open an ssh terminal in the cloud console or run the following command

                        gcloud beta compute ssh <instance-name>\n
3. It is recommended to always check that the VM we get is actually what we asked for. In this case the VM should have Pytorch pre-installed, so let's check for that by running

                        python -c \"import torch; print(torch.__version__)\"\n

                        Additionally, if you have a VM with GPU support also try running the nvidia-smi command.

4. When you have logged in to the VM, it works like your own machine. Therefore, to run some training code you need to do the same setup steps you have done on your own machine: clone your github repository, install dependencies, download data, run the code. Try doing this to make sure you can train a model.

                    2. (Optional, may not work as intended) The last step in the previous exercise involves a lot of setup that would be necessary to do every time we create a new VM, making horizontal scaling of experiments cumbersome. However, we have already developed docker images that can take care of most of the setup.

1. Let's for simplicity just create a very small docker image (called gcp_vm_tester.dockerfile) that you can use

                        FROM gcr.io/deeplearning-platform-release/pytorch-cpu\nRUN pip install matplotlib\n

This basically just extends the base Pytorch image to also install matplotlib. The important part about the docker images that we want to use here is that they should not have an ENTRYPOINT at the end, because we do not want the docker container to actually run our scripts, just to install dependencies on startup.

2. Let's build the docker image and manually push it to our registry in GCP. Build with:

docker build -f gcp_vm_tester.dockerfile . -t gcp_vm_tester:latest\n

                        and then push with

docker tag gcp_vm_tester gcr.io/<project-id>/gcp_vm_tester\ndocker push gcr.io/<project-id>/gcp_vm_tester\n

Confirm by going to the registry in the cloud console and checking that the image has been correctly pushed.

3. Let's then create a VM with that particular docker image. Instead of using the gcloud compute instances create command, we now use the gcloud compute instances create-with-container command

gcloud compute instances create-with-container <instance-name> \\\n    --container-image=gcr.io/<project-id>/gcp_vm_tester \\\n    --zone europe-west1-b\n
4. Confirm that everything works by accessing your newly created VM and running both of these commands

                        python -c \"import torch; print(torch.__version__)\"\npython -c \"import matplotlib; print(matplotlib.__version__)\"\n
3. We are now moving on to the final way to train our code, using the Vertex AI service.

1. Start by enabling it: search for Vertex AI in the cloud console and go to the service

2. The way we are going to use Vertex AI is by creating custom jobs, because we have already developed docker containers that contain everything needed to run our code. Thus the only command that we actually need to use is the gcloud ai custom-jobs create command. An example would be:

                        gcloud ai custom-jobs create \\\n    --region=europe-west1 \\\n    --display-name=test-run \\\n    --config=config.yaml\n

Essentially, this one command combines everything: it first creates a VM with the specs specified by a configuration file, then loads a container also specified in the configuration file, and finally it runs everything. An example of a config file could be:

                        # config_cpu.yaml\nworkerPoolSpecs:\n    machineSpec:\n        machineType: n1-highmem-2\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n

                        if you only want to run on CPU and another example for GPU:

                        # config_gpu.yaml\nworkerPoolSpecs:\n    machineSpec:\n        machineType: n1-standard-8\n        acceleratorType: NVIDIA_TESLA_T4 #(1)!\n        acceleratorCount: 1\n    replicaCount: 1\n    containerSpec:\n        imageUri: gcr.io/<project-id>/<docker-img>\n
1. In this case we are requesting a Nvidia Tesla T4 GPU. This will only work if you have quota for allocating this type of GPU in the Vertex AI service. You can check how to request quota in the last exercise of the previous module. Remember that it is not enough to just request quota for the GPU, the request needs to be approved by Google before you can use it.

you can read more about the configuration format here and the different types of machines here. Try to execute a job using the gcloud ai custom-jobs create command. For additional documentation you can check out the documentation on the command as well as this page and this page

3. Assuming you manage to launch a job, you should see an output like this:

Try executing the commands that are outputted to look at both the status and the progress of your job.

4. In addition you can also visit the Custom Jobs tab in the training part of Vertex AI

                        Check it out.

5. During custom training we do not necessarily need to use dvc for downloading our data. A more efficient way is to use cloud storage as a mounted file system. This allows us to access data directly from the cloud storage without having to download it first. All our training jobs automatically get a gcs folder mounted in the root directory. Try to access the data from your training script:

                        # loading from a bucket using mounted file system\ndata = torch.load('/gcs/<my-bucket-name>/data.pt')\n# writing to a bucket using mounted file system\ntorch.save(data, '/gcs/<my-bucket-name>/data.pt')\n

This should speed up the training process a bit.

                    This ends the session on how to use Google cloud services for now. In a future session we are going to investigate a bit more of the services offered in GCP, in particular for deploying the models that we have just trained.

                    "},{"location":"s7_deployment/","title":"08. Model deployment","text":"

                    Slides

Let's say that you have spent 1000 GPU hours and trained the most awesome model that you want to share with the world. One way to do this is of course to just place all your code in a github repository, upload a file with the trained model weights to your favorite online storage (assuming it is too big for github to handle) and ask people to download your code and the weights to run the code by themselves. This is a fine approach in a small research setting, but in production you need to be able to deploy the model to an environment that is fully contained, such that people can just execute it without looking (too hard) at the code.

                    Image credit

                    In this session we try to look at methods specialized towards deployment of models on your local machine and also how to deploy services in the cloud.

                    Learning objectives

                    The learning objectives of this session are:

                    • Understand the basics of requests and APIs
• Be able to create custom APIs using the framework fastapi and run them locally
                    • Knowledge about serverless deployments and how to deploy custom APIs using both serverless functions and serverless containers
                    "},{"location":"s7_deployment/apis/","title":"M22 - Requests and APIs","text":""},{"location":"s7_deployment/apis/#requests-and-apis","title":"Requests and APIs","text":"

                    Core Module

Before we can get to the deployment of our models we need to understand concepts such as APIs and requests. The core reason for this is that we need a new abstraction layer on top of our applications that is not Python-specific. While Python is the de facto language for machine learning, we cannot expect everybody else to use it and, in particular, we cannot expect network protocols (both local and external) to be able to communicate with our Python programs out of the box. For this reason, we need to understand requests, in particular HTTP requests, and how to create APIs that can interact with those requests.

                    "},{"location":"s7_deployment/apis/#requests","title":"Requests","text":"

When we are talking about requests, we are essentially talking about the communication method used in client-server architectures. As shown in the image below, in this architecture the client (user) sends requests to a server (our machine learning application) and the server returns a response. For example, the user may send a request to get the class of a specific image, which our application will compute and then send back as a response in the form of a label.

                    Image credit

                    The common way of sending requests is called HTTP (Hypertext Transfer Protocol). It is essentially a specification of the intermediary transportation method between the client and server. An HTTP request essentially consists of two parts:

                    • A request URL: the location of the server we want to send our request to
                    • A request Method: describing what action we want to perform on the server

                    The common request methods are (case sensitive):

                    • GET: get data from the server
                    • POST/PUT: send data to the server
                    • DELETE: delete data on the server

You can read more about the different methods here. For most machine learning applications, GET and POST are the core methods to remember. Additionally, if you want to read more about HTTP(S) in general we highly recommend that you go over this comic strip about the protocol, but the TLDR is that it provides privacy, integrity and identification over the web.

                    "},{"location":"s7_deployment/apis/#exercises","title":"\u2754 Exercises","text":"

We are going to do a couple of exercises on sending requests using the requests package to get familiar with the syntax.

1. Start by installing the requests package

                      pip install requests\n
                    2. Afterwards, create a small script and try to execute the code

                      import requests\nresponse = requests.get('https://api.github.com/this-api-should-not-exist')\nprint(response.status_code)\n

                      As you can see from the syntax, we are sending a request using the GET method. This code should return status code 404. Take a look at this page that contains a list of status codes. Next, let's call a page that exists

                      import requests\nresponse = requests.get('https://api.github.com')\nprint(response.status_code)\n

What is the status code now and what does it mean? Status codes are important when you have an application that interacts with a server and you want to make sure that it does not fail, which can be done with simple if statements on the status codes

                      if response.status_code == 200:\n    print('Success!')\nelif response.status_code == 404:\n    print('Not Found.')\n
                    3. Next, try to call the following

                      response = requests.get(\"https://api.github.com/repos/SkafteNicki/dtu_mlops\")\n

                      which gives back a payload. Essentially, payload refers to any additional data that is sent from the client to the server or vice-versa. Try looking at the response.content attribute. What is the type of this attribute?

                    4. You should hopefully observe that the .content attribute is of type bytes. It is important to note that the standard way of sending payloads is to encode them into byte objects. To get a more human-readable version of the response, we can convert it to JSON format

                      response.json()\n

                      It is important to remember that a JSON object in Python is just a nested dictionary, which is handy if you ever want to iterate over the object in some way.
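
                      For example, since the parsed response is just a dictionary, it can be looped over directly. A minimal sketch, reusing the response from the request above:

                      data = response.json()\nfor key, value in data.items():  # iterate over the top-level fields of the payload\n    print(key, value)\n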

                    5. When we use the GET method we can additionally provide a params argument, that specifies what we want the server to send back for a specific request URL:

                      response = requests.get(\n    'https://api.github.com/search/repositories',\n    params={'q': 'requests+language:python'},\n)\n

                      Before looking at response.json(), can you explain what the code does? You can try looking at this page for help.
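
                      One way to see the effect of the params argument is to inspect the final URL that requests constructed, as in this small sketch:

                      print(response.url)  # shows the query string that was appended to the base URL\nprint(response.status_code)\n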

                    6. Sometimes the content of a page cannot be converted into JSON, because as already stated data is sent as bytes. Say that we want to download an image, which we can do in the following way

                      import requests\nresponse = requests.get('https://imgs.xkcd.com/comics/making_progress.png')\n

                      Try calling response.json(); what happens? Next, try calling response.content. To get the result in this case we need to write the raw bytes to an image file:

                      with open(r'img.png','wb') as f:\n    f.write(response.content)\n
                    7. The get method is the most useful method because it allows us to get data from the server. However, as stated in the beginning multiple request methods exist, for example, the POST method for sending data to the server. Try executing:

                      pload = {'username':'Olivia','password':'123'}\nresponse = requests.post('https://httpbin.org/post', data = pload)\n

                      Investigate the response (this is an artificial example because we do not control the server).

                    8. Finally, we should also know that requests can be sent directly from the command line using the curl command. Sometimes it is easier to send a request directly from the terminal and sometimes it is easier to do it from a script.

                      1. Make sure you have curl installed, or else find instructions on installing it. To check, call `curl --help`, which should print the documentation for curl.

                      2. To execute requests.get('https://api.github.com') using curl we would simply do

                        curl -X GET \"https://api.github.com\"\ncurl -X GET -I \"https://api.github.com\" # if you want the status code\n

                        Try it yourself.

                      3. Try to redo some of the exercises yourself using curl.

                    That ends the intro session on requests. Do not worry if you are still not completely comfortable with sending requests; we are going to return to how we do it in practice when we have created our API. If you want to learn more about the requests package you can check out this tutorial, and if you want to see more examples of how to use curl you can check out this page.

                    "},{"location":"s7_deployment/apis/#creating-apis","title":"Creating APIs","text":"

                    Requests are all about being on the client side of our client-server architecture. We are now going to move on to the server side, where we will be learning about writing the APIs that requests can interact with. An application programming interface (API) is essentially the way the developer (you) tells a user how to use the application that you have created. The API is an abstraction layer that allows the user to interact with our application in the way we want them to, without the user ever having to look at the code.

                    We can take the API from GitHub as an example: https://api.github.com. This API allows any user to retrieve, integrate and send data to GitHub without ever having to visit their webpage. The API exposes multiple endpoints that have various functions:

                    • https://api.github.com/repos/OWNER/REPO/branches: check out the branches on a given repository
                    • https://api.github.com/search/code: search through GitHub for code
                    • https://api.github.com/repos/OWNER/REPO/actions/workflows: check the status of workflows for a given repository

                    and we could go on. However, there may be functionality that Github is not interested in users having access to and they may therefore choose not to have endpoints for specific features (1).

                    1. Many companies provide public APIs to interact with their services/data. For a general list of public APIs you can check out this page. For the Danes out there, you can check out this list of public and private APIs from Danish companies and organizations.

                    The particular kind of API we are going to work with is called a REST API (or RESTful API). The REST standard specifies constraints that a particular API needs to fulfill to be considered RESTful. You can read more about the six guiding principles behind REST APIs on this page, but one of the most important to have in mind is that the client-server architecture needs to be stateless. This means that whenever a request is sent to the server it needs to be self-contained (all information included) and the server cannot rely on information stored from previous requests.

                    To implement APIs in practice we are going to use FastAPI. FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+ based on standard Python type hints. FastAPI is only one of many frameworks for defining APIs; however, compared to other frameworks such as Flask and Django it offers a sweet spot of being flexible enough to do what you want without having many additional (unnecessary) features.

                    "},{"location":"s7_deployment/apis/#exercises_1","title":"\u2754 Exercises","text":"

                    The exercises below are a condensed version of this and this tutorial. If you ever need context for the exercises, we can recommend trying to go through these. Additionally, we also provide this solution file that you can look through for help.

                    1. Install FastAPI

                      pip install fastapi\n

                      This contains the functions, modules, and variables we are going to need to define our interface.

                    2. Additionally, also install uvicorn, which is a package for running low-level server applications.

                      pip install uvicorn[standard]\n
                    3. Start by defining a small application like this in a file called main.py:

                      from fastapi import FastAPI\napp = FastAPI()\n\n@app.get(\"/\")\ndef read_root():\n    return {\"Hello\": \"World\"}\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n    return {\"item_id\": item_id}\n

                      Important here is the use of the @app.get decorator. What could this decorator refer to? Explain what the two functions are probably doing.

                    4. Next, let's launch our app. Since we called our script main.py and inside the script initialized our API with app = FastAPI(), the application that we want to deploy can be referenced by main:app:

                      uvicorn --reload --port 8000 main:app\n

                      this will launch a server at this page: http://localhost:8000/. As you will hopefully see, this page will return the content of the root function, like the image below. Remember to also check the output in your terminal as that will give info on when and how your application is being invoked.

                      1. What webpage should you open to get the server to return 1?

                      2. Also check out the pages http://localhost:8000/docs and http://localhost:8000/redoc. What do these pages show?

                      3. The power of the docs and redoc pages is that they allow you to easily test your application with their simple UI. As shown in the image below, simply open the endpoint you want to test, click the Try it out button, input any values and execute it. It will return the corresponding curl command for invoking your endpoint, the corresponding URL and the response of your application. Try it out.

                      4. You can also check out http://localhost:8000/openapi.json to see the schema that is generated, which is essentially a JSON file containing the overall specification of your application.

                      5. Try to access http://localhost:8000/items/foo, what happens in this case? When you specify types in your API, FastAPI will automatically do type validation using pydantic, making sure users can only access your API with the correct types. Therefore, remember to include types in your applications!

                    5. With the fundamentals in place let's configure it a bit more:

                      1. Let's start by changing the root function to include a bit more info. In particular, we are also interested in returning the status code so the end user can easily read that. Default status codes are included in the built-in Python http package:

                        from http import HTTPStatus\n\n@app.get(\"/\")\ndef root():\n    \"\"\" Health check.\"\"\"\n    response = {\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n    }\n    return response\n

                        try to reload the page and see what is returned now. You should not have to re-launch the app because we launched uvicorn with the --reload argument.

                      2. When we decorate our functions with @app.get(\"/items/{item_id}\"), item_id is in this case what we call a path parameter, because it is a parameter that is directly included in the path of our endpoint. We have already seen how we can restrict a path to a single type, but what if we want to restrict it to specific values? This is often the case if we are working with parameters of type str. In this case we need to define an Enum:

                        from enum import Enum\nclass ItemEnum(Enum):\n    alexnet = \"alexnet\"\n    resnet = \"resnet\"\n    lenet = \"lenet\"\n\n@app.get(\"/restric_items/{item_id}\")\ndef read_item(item_id: ItemEnum):\n    return {\"item_id\": item_id}\n

                        Add this API, reload and execute both a valid parameter and a non-valid parameter.

                      3. In contrast to path parameters we have query parameters. In the requests exercises we saw an example of this where we were calling https://api.github.com/search/repositories with the query 'q': 'requests+language:python'. Any parameter in FastAPI that is not a path parameter will be considered a query parameter:

                        @app.get(\"/query_items\")\ndef read_item(item_id: int):\n    return {\"item_id\": item_id}\n

                        Add this API, reload and figure out how to pass in a query parameter.
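
                        If you get stuck, here is a sketch of how the endpoint could be called once you have it working, assuming the app is running locally on port 8000:

                        import requests\n# equivalent to opening http://localhost:8000/query_items?item_id=42 in a browser\nresponse = requests.get('http://localhost:8000/query_items', params={'item_id': 42})\nprint(response.json())\n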

                      4. We have until now worked with the .get method, but let's also see an example of the .post method. As already described, the POST request method is used for uploading data to the server. Here is a simple app that saves username and password in a database (please never implement it like this in real life):

                        database = {'username': [ ], 'password': [ ]}\n\n@app.post(\"/login/\")\ndef login(username: str, password: str):\n    username_db = database['username']\n    password_db = database['password']\n    if username not in username_db and password not in password_db:\n        with open('database.csv', \"a\") as file:\n            file.write(f\"{username}, {password} \\n\")\n        username_db.append(username)\n        password_db.append(password)\n    return \"login saved\"\n

                        Make sure you understand what the function does and then try to execute it a couple of times to see your database updating. It is important to note that we sometimes in the following exercises use the .get method and sometimes the .post method. For our usage it does not really matter.

                    6. We are now moving on to figuring out how to provide different standard inputs like text, images and JSON to our APIs. It is important that you try out each example yourself and in particular that you look at the curl commands that are necessary to invoke each application.

                      1. Here is a small application, that takes a single text input

                        @app.get(\"/text_model/\")\ndef contains_email(data: str):\n    regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n    response = {\n        \"input\": data,\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n        \"is_email\": re.fullmatch(regex, data) is not None\n    }\n    return response\n

                        What does the application do? Try it out yourself

                      2. Let's say we wanted to extend the application to check for a specific email domain, either gmail or hotmail. Assume that we want to feed this into our application as a json object e.g.

                        {\n    \"email\": \"mlops@gmail.com\",\n    \"domain_match\": \"gmail\"\n}\n

                        Figure out how to alter the data parameter such that it takes in the JSON object, and make sure to extend the application to check if the email and domain also match. Hint: take a look at this page; a sketch of a possible starting point is shown below.
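
                        A minimal sketch of how a JSON body can be accepted, assuming a pydantic model with field names matching the JSON object above (the domain check itself is left for you to implement):

                        from pydantic import BaseModel\n\nclass EmailDomainQuery(BaseModel):\n    email: str\n    domain_match: str\n\n@app.post('/text_model/')\ndef contains_email_domain(data: EmailDomainQuery):\n    # FastAPI parses the JSON request body into the pydantic model\n    return {'email': data.email, 'domain_match': data.domain_match}\n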

                      3. Let's move on to an application that requires a file input:

                        from fastapi import UploadFile, File\nfrom typing import Optional\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...)):\n    with open('image.jpg', 'wb') as image:\n        content = await data.read()\n        image.write(content)\n        image.close()\n\n    response = {\n        \"input\": data,\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n    }\n    return response\n

                        A couple of new things are going on here: we use the specialized UploadFile and File bodies in our input definition. Additionally, we added the async/await keywords. Figure out what everything does and try to run the application (you can use any image file you like).

                      4. The above application actually does not do anything. Let's add opencv as a package and let's resize the image. It can be done with the following three lines:

                        import cv2\nimg = cv2.imread(\"image.jpg\")\nres = cv2.resize(img, (h, w))\n

                        Figure out where to add them in the application and additionally add h and w as optional parameters, with a default value of 28. Try running the application where you specify everything and one more time where you leave out h and w.

                      5. Finally, let's also figure out how to return a file from our application. You will need to add the following lines:

                        from fastapi.responses import FileResponse\ncv2.imwrite('image_resize.jpg', res)\nFileResponse('image_resize.jpg')\n

                        Figure out where to add them to the code and try running the application one more time to see that you get a file back with the resized image.

                    7. (Optional) Let's try to figure out how to use FastAPI in a machine learning context. Below is a script that downloads a VisionEncoderDecoder model from Hugging Face. The model can be used to create captions for a given image. Thus, calling

                      predict_step(['s7_deployment/exercise_files/my_cat.jpg'])\n

                      returns a list of strings like ['a cat laying on a couch with a stuffed animal'] (try this yourself). Create a FastAPI application that can do inference using this model, i.e. it should take in an image, preferably with an optional JSON object for configuring some of the hyperparameters (like max_length), and should return a string containing the generated caption. A minimal sketch of such an endpoint is shown after the script.

                      from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer\nimport torch\nfrom PIL import Image\n\nmodel = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\nfeature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\ntokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\ngen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\ndef predict_step(image_paths):\n    images = []\n    for image_path in image_paths:\n        i_image = Image.open(image_path)\n        if i_image.mode != \"RGB\":\n            i_image = i_image.convert(mode=\"RGB\")\n\n        images.append(i_image)\n    pixel_values = feature_extractor(images=images, return_tensors=\"pt\").pixel_values\n    pixel_values = pixel_values.to(device)\n    output_ids = model.generate(pixel_values, **gen_kwargs)\n    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n    preds = [pred.strip() for pred in preds]\n    return preds\n
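
                      Below is a minimal sketch of how such an endpoint could look, assuming it lives in the same file as the script above so that predict_step and gen_kwargs are available (the endpoint name, the temporary file name and the use of a query parameter instead of a JSON body are our own simplifications):

                      from fastapi import FastAPI, File, UploadFile\n\napp = FastAPI()\n\n@app.post('/caption/')\nasync def caption(data: UploadFile = File(...), max_length: int = 16):\n    # save the uploaded image to disk and reuse predict_step from the script above\n    with open('uploaded_image.jpg', 'wb') as f:\n        f.write(await data.read())\n    gen_kwargs['max_length'] = max_length\n    preds = predict_step(['uploaded_image.jpg'])\n    return {'caption': preds[0]}\n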
                    8. As the final step, we want to figure out how to include our FastAPI application in a docker container, as this will help us when we want to deploy it in the cloud, because docker as always takes care of the dependencies for our application. For the following set of exercises you can use whichever of the previous FastAPI applications you want as the base application for the container.

                      1. Start by creating a requirements.txt file for your application. You will at least need fastapi and uvicorn in the file, and we always recommend that you are specific about the versions you want to use

                        fastapi>=0.68.0,<0.69.0\nuvicorn>=0.15.0,<0.16.0\n# add anything else you application needs to be able to run\n
                      2. Next, create a Dockerfile with the following content

                        FROM python:3.9\nWORKDIR /code\nCOPY ./requirements.txt /code/requirements.txt\n\nRUN pip install --no-cache-dir --upgrade -r /code/requirements.txt\nCOPY ./app /code/app\n\nCMD [\"uvicorn\", \"app.main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"80\"]\n

                        The above assumes that your file structure looks like this

                        .\n\u251c\u2500\u2500 app\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2514\u2500\u2500 main.py\n\u251c\u2500\u2500 Dockerfile\n\u2514\u2500\u2500 requirements.txt\n

                        Hopefully, all these steps should look familiar if you already went through module M9, except for maybe the last line. However, this is just the standard way that we have run our FastAPI applications in the last couple of exercises, this time with some extra arguments regarding the ports we allow.

                      3. Next, build the corresponding docker image

                        docker build -t my_fastapi_app .\n
                      4. Finally, run the image such that a container is spun up that runs our application. The important part here is to remember to specify the -p argument (p for port) which should be the same number as the port we have specified in the last line of our Dockerfile.

                        docker run --name mycontainer -p 80:80 my_fastapi_app\n
                      5. Check that everything is working by going to the corresponding localhost page http://localhost/items/5?q=somequery

                    9. (Optional) In module M15 on unit testing you learned how to write unit tests for your data pipeline and model. It should come as no surprise that the same can also be done for your API. Doing so should be able to tell you if your API is working as you expect it to. The only complication regarding APIs is that you need a server to do the testing, and we cannot use uvicorn for this. Check out this page on how to test FastAPI applications, and add a file called test_api.py to your tests folder with appropriate tests for your API. A minimal sketch is shown below.
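
                      A minimal sketch of what such a test file could contain, using FastAPI's built-in test client and assuming your application object is importable as app from app.main:

                      from fastapi.testclient import TestClient\nfrom app.main import app\n\nclient = TestClient(app)\n\ndef test_read_root():\n    # the root endpoint should answer with status code 200\n    response = client.get('/')\n    assert response.status_code == 200\n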

                    This ends the module on APIs. If you want to go further in this direction we highly recommend that you check out bentoml, which is an API standard that focuses solely on creating easy-to-understand APIs and services for ML applications. Additionally, we can also highly recommend checking out Postman, which can help design, document and in particular test the API you are writing to make sure that it works as expected.

                    "},{"location":"s7_deployment/cloud_deployment/","title":"M24 - Cloud Deployment","text":""},{"location":"s7_deployment/cloud_deployment/#cloud-deployment","title":"Cloud deployment","text":"

                    Core Module

                    We are now returning to using the cloud. At this point you should have gone through the steps of having code in your GitHub repository automatically build into a docker container, storing that container, storing your data and pulling it all together to make a training run. After the training is completed you should hopefully have a file stored in the cloud with your trained model weights.

                    Today's exercises will be about serving those model weights to an end user. We focus on two different ways of deploying our model: Google Cloud Functions and Google Cloud Run.

                    "},{"location":"s7_deployment/cloud_deployment/#cloud-functions","title":"Cloud Functions","text":"

                    Cloud Functions are the easiest way to get started with deployment because they are what is called serverless. For serverless deployment we still need a server to do the actual workload; however, the core concept is that you do not have to manage the server. Everything is magically taken care of behind the scenes.

                    "},{"location":"s7_deployment/cloud_deployment/#exercises","title":"\u2754 Exercises","text":"
                    1. Go to the start page of Cloud Functions. It can be found in the sidebar on the homepage or you can just search for it. Activate the service if it is not already active.

                    2. Click the Create Function button which should take you to a screen like the image below. Give it a name, set the server region to somewhere close by and change the authentication policy to Allow unauthenticated invocations so we can access it directly from a browser. Remember to note down the URL of the service somewhere.

                    3. On the next page, for Runtime pick the Python 3.9 option. This will make the inline editor show both a main.py and a requirements.txt file. Look over them. Click the Deploy button in the lower left corner.

                    4. Afterwards you should see a green check mark beside your function meaning that it is deployed. Click the Test function button which will take you to the testing page.

                    5. If you know what the application does, it should come as no surprise that it can run without any input. We therefore just send an empty request by clicking the Test The Function button. Does the function return the output you expected? Wait for the logs to show up. What do they show?

                      1. What should the Triggering event look like in the testing prompt for the program to respond with

                        Good day to you sir!\n

                        Try it out.

                      2. Click on the metrics tab. Identify what each panel is showing.

                      3. Go to the trigger tab and open the URL for the application.

                      4. Check out the logs tab. You should see that your application has already been invoked multiple times. Also try to execute this command in a terminal:

                        gcloud functions logs read\n
                    6. Next, we are going to create an application that actually takes some input so we can try to send it requests. We provide a very simple sklearn_cloud_function.py script to get started.

                      1. Figure out what the script does and run it. This should create a file with a trained model.

                      2. Next create a storage bucket and upload the model file to the bucket. You can either do this through the webpage or run the following commands:

                        gsutil mb gs://<bucket-name>  # mb stands for make bucket\ngsutil cp <file-name> gs://<bucket-name>  # cp stands for copy\n

                        check that the file is in the bucket.

                      3. Create a new cloud function with the same initial settings as the first one. Also choose Python 3.9, but this time change the code to something that can actually use the model we just uploaded. Here is a code snippet to help you:

                        from google.cloud import storage\nimport pickle\n\nBUCKET_NAME = ...\nMODEL_FILE = ...\n\nclient = storage.Client()\nbucket = client.get_bucket(BUCKET_NAME)\nblob = bucket.get_blob(MODEL_FILE)\nmy_model = pickle.loads(blob.download_as_string())\n\ndef knn_classifier(request):\n    \"\"\"Run inference on the data in the request.\"\"\"\n    request_json = request.get_json()\n    if request_json and 'input_data' in request_json:\n        data = request_json['input_data']\n        input_data = list(map(int, data.split(',')))\n        prediction = my_model.predict([input_data])\n        return f'Belongs to class: {prediction}'\n    else:\n        return 'No input data received'\n

                        Some notes:

                        • For locally testing the above code you will need to install the google-cloud-storage Python package.
                        • Remember to change the Entry point.
                        • Remember to also fill out the requirements.txt file. You need at least two packages to run the application, with google-cloud-storage being one of them.
                        • If your deployment fails, try to go to the Logs Explorer page in GCP, which can help you identify why.

                      4. When you have successfully deployed the model, try to make predictions with it.
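
                        A sketch of what an invocation could look like from the command line, assuming the snippet above (which parses input_data as a comma-separated list of integers); replace the URL with the trigger URL of your own function:

                        curl -X POST https://<region>-<project-id>.cloudfunctions.net/<function-name> \\\n    -H 'Content-Type: application/json' \\\n    -d '{\"input_data\": \"5,3,1,0\"}'\n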

                    7. You can finally try to redo the exercises by deploying a PyTorch application. You will essentially need to go through the same steps as with the sklearn example: uploading a trained model to a storage bucket and writing a cloud function that loads the model and returns some output. You are free to choose whatever PyTorch model you want.

                    "},{"location":"s7_deployment/cloud_deployment/#cloud-run","title":"Cloud Run","text":"

                    Cloud Functions are great for simple deployments that can be encapsulated in a single script with only simple requirements. However, they do not really scale to more advanced applications that may depend on multiple programming languages. We are already familiar with how we can deal with this through containers, and Cloud Run is the corresponding service in GCP for deploying containers.

                    "},{"location":"s7_deployment/cloud_deployment/#exercises_1","title":"\u2754 Exercises","text":"
                    1. We are going to start locally by developing a small app that we can deploy. We provide two small examples to choose from: first, a small FastAPI app consisting of this .py file and this dockerfile; secondly, a small streamlit application consisting of just this dockerfile. You are free to choose which application to work with.

                      1. Start by going over the files belonging to the app of your choice and understand what it does.

                      2. Next build the docker image belonging to the app

                        docker build -f <dockerfile> . -t gcp_test_app:latest\n
                      3. Next tag and push the image to your container registry

                        docker tag gcp_test_app gcr.io/<project-id>/gcp_test_app\ndocker push gcr.io/<project-id>/gcp_test_app\n

                        Afterwards, check your container registry to verify that you have successfully pushed the image.

                    2. Next, go to Cloud Run in the cloud console and enable the service.

                    3. Click the Create Service button which should bring you to a page similar to the one below

                      Do the following:

                      • Click the select button, which will bring up all built containers, and pick the one you want to deploy. In the future you probably want to choose Continuously deploy new revision from a source repository such that a new version is always deployed when a new container is built.
                      • Hereafter, give the service a name and select the region. We recommend choosing a region close to you, however it does not really matter that much for our use case.
                      • Set the authentication method to Allow unauthenticated invocations such that we can call it without providing credentials. In the future you may want to allow only authenticated invocations.
                      • Expand the Container, Connections, Security tab and edit the port such that it matches the port exposed in your chosen application.

                      Finally, click the create button and wait for the service to be deployed (may take some time).

                    4. If you manage to deploy the service you should see an image like this:

                      You can now access your application by clicking the URL. This will access the root of your application, so you may need to add / or /<path> to the URL depending on how the app works.

                    5. Everything we just did to deploy a container can be reproduced using the following command:

                      gcloud run deploy $APP --image $TAG --platform managed --region $REGION --allow-unauthenticated\n

                      and checked using these two commands

                      gcloud run services list\ngcloud run services describe $APP --region $REGION\n

                      Feel free to experiment with doing the deployment from the command line.

                    6. Instead of deploying our docker container using the UI or the command line, which are manual operations, we can do it in a continuous manner by using the cloudbuild.yaml file we learned about in the previous section. We just need to add a new step to the file. We provide an example:

                      steps:\n# Build the container image\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['build', '-t', 'gcr.io/$PROJECT_ID/<container-name>:latest', '.'] #(1)!\n# Push the container image to Container Registry\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['push', 'gcr.io/$PROJECT_ID/<container-name>:latest']\n# Deploy container image to Cloud Run\n- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'\n  entrypoint: gcloud\n  args:\n  - 'run'\n  - 'deploy'\n  - '<service-name>'\n  - '--image'\n  - 'gcr.io/$PROJECT_ID/<container-name>:latest'\n  - '--region'\n  - '<region>'\n
                      1. This line assumes you are standing in the root of your repository and builds the docker image specified in a file called Dockerfile, tagging it with the name gcr.io/$PROJECT_ID/my_deployment:latest. Therefore, if you want to point to another dockerfile you need to add the -f option to the command. For example, if you want to point to my_app/my_serving_app.dockerfile you need to change the line to

                        args: ['build', '-f', 'my_app/my_serving_app.dockerfile', '-t', 'gcr.io/$PROJECT_ID/my_deployment:latest', '.']\n

                      where you need to replace <container-name> with the name of your container, <service-name> with the name of the service you want to deploy and <region> with the region you want to deploy to. Afterwards you need to set up a trigger (or reuse the one you already have) to build the container and deploy it to Cloud Run. Confirm that this works by making a change to your application, pushing it to GitHub and checking if the application is updated continuously. You can look here for help. If you succeeded, congratulations, you have now set up a continuous deployment pipeline.

                    That ends the exercises on deployment. The exercises above are just a small taste of what deployment has to offer. In both sections we have explicitly chosen to work with serverless deployments. But what if you wanted to do the opposite, e.g. be the one in charge of managing the cluster that handles the deployed services? If you are really interested in taking deployment to the next level you should get started with Kubernetes, which is the de facto open-source container orchestration platform used in production environments. If you want to deep dive we recommend starting here, which describes how to make pipelines, a necessary component before you start to create your own Kubernetes cluster.

                    "},{"location":"s7_deployment/local_deployment/","title":"M23 - Local Deployment","text":""},{"location":"s7_deployment/local_deployment/#local-deployment","title":"Local Deployment","text":"

                    Regardless of your application, model and use case, the starting point for serving your model should always be to deploy it locally. The simple reason for that is debugging: if you deploy directly to the cloud you often get less verbose error messages and/or the iteration time is much slower, because it simply takes much longer to deploy to the cloud than locally. Local deployment should therefore always be the first step with any new application.

                    For this module we are going to focus on deployment of deep learning models, in particular PyTorch models, which are used throughout the course. PyTorch has historically been developed for research purposes, where quickly iterating on ideas was valued over fast computations. This is evident since PyTorch uses a dynamic graph underneath to represent the computations that are created whenever you run calculations. The graph is important, as it keeps track of how to do backpropagation through your PyTorch application. However, running code dynamically is notoriously slower than compiling your code before running it. Let's therefore first consider another way of compiling our code.

                    "},{"location":"s7_deployment/local_deployment/#compilation","title":"Compilation","text":"

                    If you have ever coded in any low-level language such as C, Fortran or C++ you should be familiar with the term compiling. Compiling is the task of taking a computer program written in one language and translating it into another. In most cases this means taking whatever you have written in your preferred programming language and translating it into machine code that the computer can execute. But what does compilation have to do with coding PyTorch models?

                    It happens that PyTorch comes with its own compiler that can optimize your model for you. It can be found in the submodule torch.jit. Jit stands for just-in-time, meaning that compilation runs at the same time we are executing the code. If you know anything about low-level languages such as C/C++ you know that we normally compile the code before we run it. With jit we essentially merge the two phases into one. jit has two compilation modes, called respectively script and trace. We are in the exercises going to look at script, as it is the easiest to get started with and works without any code changes for nearly all kinds of models. If you ever encounter that script does not work for you, then trace can be used, which is more general.

                    The major reasons why we want to compile our models with torch.jit are:

                    • Scripted code can be invoked in its own interpreter, which is basically a restricted Python interpreter. This interpreter does not acquire the Global Interpreter Lock (GIL), and so many requests can be processed on the same instance simultaneously.
                    • The scripted format allows us to save the whole model to disk and load it into another environment, such as a server written in a language other than Python.
                    • Scripted code gives us a representation in which we can do compiler optimizations on the code to provide more efficient execution.
                    • Scripted code allows us to interface with many backend/device runtimes that require a broader view of the program than individual operators.
                    "},{"location":"s7_deployment/local_deployment/#exercises","title":"\u2754 Exercises","text":"

                    We are here going to look at torch.jit.script for compiling our code.

                    1. To see the difference in these exercises, we start out with a large model. Download one of the large image classification models from torchvision, such as ResNet-152. For the purpose of the exercise it does not matter if you work with a randomly initialized model or a pretrained version.

                    2. Next try to script the model using torch.jit.script. You can find the documentation here.

                    3. Just to confirm that compiling our model using torch.jit.script did not change its output, try checking that the output of the scripted model corresponds to the output of the non-scripted model. You can do this on a single random datapoint, and you should check that the top-5 predicted classes are the same

                      assert torch.allclose(unscripted_top5_indices, scripted_top5_indices)\n

                      Hint: use torch.topk.
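
                      A minimal sketch of how this check could look, assuming a randomly initialized ResNet-152 (the input shape is the standard ImageNet size):

                      import torch\nfrom torchvision.models import resnet152\n\nmodel = resnet152()\nmodel.eval()  # disable batch-norm statistics updates so both runs are deterministic\nscripted_model = torch.jit.script(model)\n\nx = torch.randn(1, 3, 224, 224)  # a single random datapoint\nunscripted_top5_indices = torch.topk(model(x), k=5).indices\nscripted_top5_indices = torch.topk(scripted_model(x), k=5).indices\nassert torch.allclose(unscripted_top5_indices, scripted_top5_indices)\n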

                    4. Finally, try benchmarking the non-scripted model against the scripted model. We recommend using the built-in benchmarker in PyTorch, torch.utils.benchmark.Timer, which you can read more about here. Do you see an increase in performance of the scripted model compared to the non-scripted model? If so, what is the percentage increase in efficiency? A sketch of the benchmarking setup is shown below.
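
                      A sketch of how the benchmarking could be set up, assuming model, scripted_model and x from the check above:

                      from torch.utils.benchmark import Timer\n\nt_unscripted = Timer(stmt='model(x)', globals={'model': model, 'x': x})\nt_scripted = Timer(stmt='model(x)', globals={'model': scripted_model, 'x': x})\nprint(t_unscripted.timeit(50))  # average time per forward pass over 50 runs\nprint(t_scripted.timeit(50))\n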

                    "},{"location":"s7_deployment/local_deployment/#torchserve","title":"Torchserve","text":"

                    For locally deploying our model we are going to look at Torchserve. Torchserve (illustrated below) is a combined service for packaging and serving multiple PyTorch models at the same time.

                    Image credit

                    Before we go into the details of Torchserve, an important question is why we need such an abstraction on top of our developed model. Why can't we just do:

                    python inference.py --my_model model_checkpoint.pt --new_datapoint img.png\n

                    If we were never going to do anything other than call the model ourselves, then it is probably not worth adding anything else. However, if we ever want anyone else to interact with our model, we need to comply with standard ways of requesting and sending data. This is especially true when the next step is to start deploying our model in the cloud. Torchserve essentially brings an inference API on top of our model that turns our model into a client-server type of system: the client (user) sends requests to a server (our application) and the server gives back a response. The request will be sent as a standard HTTP request, which Torchserve will help us decode into a useful input that we can then do inference on, returning the result again as a standardized HTTP response. Torchserve is in that regard similar to FastAPI or Flask, if you have ever used one of those frameworks.

                    Finally, the packaging part of Torchserve is necessary because we cannot give Torchserve a raw file of trained model weights, as these are essentially just a list of floats. We need a file that contains both the model definition and the trained weights, such that the model becomes independent of the Python interpreter.

                    "},{"location":"s7_deployment/local_deployment/#exercises_1","title":"\u2754 Exercises","text":"

                    Torchserve can be a bit rough around the edges but is fairly easy to work with. We are largely going to follow the instructions listed in the readme file for Torchserve. The intention in these exercises is to serve a Resnet type neural network that is trained for classification on ImageNet. Additional documentation can be found here.

                    1. Install torchserve and its dependencies. There are separate instructions on the homepage depending on whether you are using Windows, WSL or Linux/Mac.

                    2. Create a folder called model_store. This is where we will store the model that we are going to deploy

                    3. Try to run the torchserve --model-store model_store command. If the service starts with no errors, you have installed it correctly and can continue the exercise. Else it is Googling time!

                    4. Next, let's create a model we can serve. If you have done the previous exercises on compiling using scripting, we highly recommend initializing and saving such a model

                      model = ResnetFromTorchVision(pretrained=True)\nscript_model = torch.jit.script(model)\nscript_model.save('deployable_model.pt')\n
                    5. Call the model archiver. We have provided a file called index_to_name.json that maps from predicted class indices to interpretable class names, e.g. 1->\"goldfish\". This file should be provided as the extra-files argument such that the deployed model automatically outputs the class name. Note that this file of course only works for models trained on ImageNet.

                      torch-model-archiver \\\n    --model-name my_fancy_model \\\n    --version 1.0 \\\n    --serialized-file path/to/serialized_model.pt \\\n    --export-path model_store \\\n    --extra-files index_to_name.json \\\n    --handler image_classifier\n
                    6. Checkout the model_store folder. Has the model archiver correctly created a model (with .mar extension) inside the folder?

                    7. Finally, we are going to deploy our model and use it:

                      1. Start serving your model in one terminal:

                        torchserve --start --ncs --model-store model_store --models my_fancy_model=my_fancy_model.mar\n
                      2. Next, pick an image that you want to do inference on. It can be any image that you want, but try to pick one that actually contains an object from the set of ImageNet classes. I have also provided an image of my own cat in the my_cat.jpg file.

                      3. Open another terminal, which we are going to use for inference. The easiest way to do inference is using curl directly in the terminal but you are also free to experiment with the requests API directly in python. Using curl should look something like this

                        curl http://127.0.0.1:8080/predictions/my_fancy_model -T my_image.jpg\n
                    8. Torchserve supports serving multiple models, not just one. Create a new vision model (either another resnet model or something similar), script it, save it, archive it in the same model store folder and then re-run torchserve like this

                      torchserve --start --ncs --model-store model_store --models all\n

                      Make sure that you can do inference with both models by calling curl.

                    That ends the module on local deployment. Hopefully, in this phase you have gained a bit of experience with sending HTTP requests, as this will be very important in the next module when we try to deploy the models in the cloud.

                    "},{"location":"s8_monitoring/","title":"Monitoring","text":"

                    Slides

                    We have now reached the end of our machine learning pipeline. We have successfully developed, trained and deployed a machine learning model. However, the question then becomes whether you can trust that your newly deployed model still works as expected after 1 day without you intervening. What about 1 month? What about 1 year?

                    There may be corner cases where an ML model keeps working as expected, but the vast majority of ML models will perform worse over time because they do not generalize well enough. For example, assume you have just deployed an application that classifies images from phones, when suddenly a new phone comes out with a new kind of sensor that takes images that either have a very weird aspect ratio or something else your model is not robust towards. There is nothing wrong with this; you can essentially just retrain your model on new data that accounts for this corner case, however you need a mechanism that informs you when to do so.

                    This is where monitoring comes into play. Monitoring practices are in charge of collecting information about your application in some format that can then be analyzed and reacted on. Monitoring is essential to securing the longevity of your applications.

                    As with many other sub-fields within MLOps we can divide monitoring into classic monitoring and ML specific monitoring. Classic monitoring (known from classic DevOps) is often about

                    • Errors: Is my application working without problems?
                    • Logs: What is actually going on?
                    • Performance: How fast is my application?

                    All these are basic pieces of information you are interested in regardless of what type of application you are trying to deploy. However, there is also ML-related monitoring that especially relates to data. Take the example above with the new phone: this we would in general consider to be a data drifting problem, i.e. the data you are trying to do inference on has drifted away from the distribution of data your model was trained on. Such monitoring problems are unique to machine learning applications and need to be handled separately.

                    We are in this session going to see examples of both kinds of monitoring.

                    Learning objectives

                    The learning objectives of this session are:

                    • Understand the concepts of data drifting in machine learning applications
                    • Be able to detect data drifting using the Evidently framework
                    • Understand the importance of different system level monitoring and can conceptually implement it
                    "},{"location":"s8_monitoring/data_drifting/","title":"M25 - Data Drifting","text":""},{"location":"s8_monitoring/data_drifting/#data-drifting","title":"Data drifting","text":"

                    Data drifting is one of the core reasons why model accuracy degrades over time in production. For machine learning models, data drift is the change in model input data that leads to model performance degradation. In practical terms, this means that the model is receiving input that is outside of the scope that it was trained on, as seen in the figure below, which shows that the underlying distribution of a particular feature has slowly been increasing in value over two years.

                    Image credit

                    In some cases, it may be that normalizing some feature in a better way allows your model to generalize better, but this is not always the case. The reason for such a drift is commonly some external factor that you essentially have no control over. That really only leaves you with one option: retrain your model on the newly received input features and deploy that model to production. This process is probably going to repeat over the lifetime of your application if you want to keep it up-to-date with the real world.

                    Image credit

                    We have now come up with a solution to the data drift problem, but there is one important detail that we have not taken care of: when should we actually trigger the retraining? We do not want to wait around for our model performance to degrade, thus we need tools that can detect when we are seeing a drift in our data.

                    "},{"location":"s8_monitoring/data_drifting/#exercises","title":"\u2754 Exercises","text":"

                    For these exercises we are going to use the framework Evidently, developed by EvidentlyAI. Evidently currently supports drift detection for both regression and classification models. The exercises are in large part taken from here and, in general, if you are in doubt about an exercise we recommend looking at the docs for the API and examples (their documentation can be a bit lacking sometimes, so you may also have to dive into the source code).

                    Additionally, we want to stress that data drift detection, concept drift detection etc. is still an active field of research, and therefore multiple frameworks exist for doing this kind of detection. In addition to Evidently, we can also mention NannyML, WhyLogs and deepcheck.

                    1. Start by installing Evidently

                      pip install evidently\n

                      you will also need scikit-learn and pandas installed if you do not already have them.

                    2. Hopefully you have already gone through session S7 on deployment. As part of the deployment to GCP Functions you should have developed an application that can classify the iris dataset, based on a model trained by this script. We are going to convert this into a FastAPI application for our purpose here:

                      1. Convert your GCP function into a FastAPI application. The appropriate curl command should look something like this:

                        curl -X 'POST' \\\n    'http://127.0.0.1:8000/iris_v1/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0' \\\n    -H 'accept: application/json' \\\n    -d ''\n

                        and the response body should look like this:

                        {\n    \"prediction\": \"Iris-Setosa\",\n    \"prediction_int\": 0\n}\n

                        We have implemented a solution in this file (called v1) if you need help.

                      2. Next, we are going to add some functionality to our application. We want the input from the user to be saved to a database whenever our application is called. However, to not slow down the response to the user we want to implement this as a background task. A background task is a function that is executed after the user has gotten their response. Implement a background task that saves the user input to a database implemented as a simple .csv file. You can read more about background tasks here. The header of the database should look something like this:

                        time, sepal_length, sepal_width, petal_length, petal_width, prediction\n2022-12-28 17:24:34.045649, 1.0, 1.0, 1.0, 1.0, 1\n2022-12-28 17:24:44.026432, 2.0, 2.0, 2.0, 2.0, 1\n...\n

                        thus both input, timestamp and predicted value should be saved. We have implemented a solution in this file (called v2) if you need help.
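
                        A minimal sketch of how the background task could be wired up (the add_to_database helper and the endpoint signature are our own choices, matching the header above; the model call itself is left out):

                        from datetime import datetime\nfrom fastapi import BackgroundTasks, FastAPI\n\napp = FastAPI()\n\ndef add_to_database(timestamp: str, sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, prediction: int):\n    # append a single row to the csv database\n    with open('prediction_database.csv', 'a') as file:\n        file.write(f'{timestamp}, {sepal_length}, {sepal_width}, {petal_length}, {petal_width}, {prediction}\\n')\n\n@app.post('/iris_v2/')\ndef iris_inference(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, background_tasks: BackgroundTasks):\n    prediction = 0  # replace with a call to your trained model\n    background_tasks.add_task(\n        add_to_database, str(datetime.now()), sepal_length, sepal_width, petal_length, petal_width, prediction\n    )\n    return {'prediction': 'Iris-Setosa', 'prediction_int': prediction}\n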

                      3. Call your API a number of times to generate some dummy data in the database.

                    3. Create a new data_drift.py file where we are going to implement the data drifting detection and reporting. Start by adding both the real iris data and your generated dummy data as pandas dataframes.

                      import pandas as pd\nfrom sklearn import datasets\nreference_data = datasets.load_iris(as_frame='auto').frame\ncurrent_data = pd.read_csv('prediction_database.csv')\n

                      if done correctly you will most likely end up with two dataframes that look like

                      # reference_data\nsepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target\n0                  5.1               3.5                1.4               0.2       0\n1                  4.9               3.0                1.4               0.2       0\n...\n148                6.2               3.4                5.4               2.3       2\n149                5.9               3.0                5.1               1.8       2\n[150 rows x 5 columns]\n\n# current_data\ntime                         sepal_length   sepal_width   petal_length   petal_width   prediction\n2022-12-28 17:24:34.045649   1.0            1.0            1.0           1.0           1\n...\n2022-12-28 17:24:34.045649   1.0            1.0            1.0           1.0           1\n[10 rows x 5 columns]\n

                      Standardize the dataframes such that they have the same column names, and drop the time column from the current_data dataframe. A small sketch is shown below.
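
                      A small sketch of the standardization, assuming the column layout shown above (the exact renaming is up to you):

                      current_data = current_data.drop(columns=['time'])\ncurrent_data.columns = [\n    'sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'target'\n]\n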

                    4. We are now ready to generate some reports about data drifting:

                      1. Try executing the following code:

                        from evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference_data, current_data=current_data)\nreport.save_html('report.html')\n

                        and open the generated .html page. What does it say about your data? Has it drifted? Make sure to poke around to understand what the different plots are actually showing.

                      2. Data drifting is not the only kind of reporting Evidently can make. We can also get reports on data quality. First try adding a few NaN values to your reference data. Secondly, try changing the report to

                        from evidently.metric_preset import DataDriftPreset, DataQualityPreset\nreport = Report(metrics=[DataDriftPreset(), DataQualityPreset()])\n

                        and re-run the report. Checkout the newly generated report. Again go over the generated plots and make sure that it picked up on the missing values you just added.

                      3. The final report preset we will look at is the TargetDriftPreset. Target drift means that our model is over/under-predicting certain classes, or in general terms that the distribution of predicted values differs from the ground truth distribution of targets. Try adding the TargetDriftPreset to the Report class, re-run the analysis and inspect the result. Have your targets drifted?

                    5. Evidently reports are meant for debugging, exploration and reporting of results. However, as we stated in the beginning, what we are actually interested in are methods for automatically detecting when we are beginning to drift. For this we will need to look at Test and TestSuites:

                      1. Lets start with a simple test that checks if there are any missing values in our dataset:

                        from evidently.test_suite import TestSuite\nfrom evidently.tests import TestNumberOfMissingValues\ndata_test = TestSuite(tests=[TestNumberOfMissingValues()])\ndata_test.run(reference_data=reference_data, current_data=current_data)\n

                        Again, we could run data_test.save_html to get a nice view of the results (feel free to try it out), but we can also call the data_test.as_dict() method, which will give a dict with the test results. What dictionary key contains whether all tests have passed or not?

                      2. Take a look at this colab notebook that contains all tests implemented in Evidently. Pick 5 tests of your choice, where at least one fails by default, and implement them as a TestSuite. Then try changing the arguments of the tests so they better fit your use case and get them all passing.

                    6. (Optional) When doing monitoring in practice, we are not always interested in running on all data collected from our API, maybe only on the last N entries or just on the last hour of observations. Since we are already logging the timestamps of when our API is called, we can use them for filtering. Implement a simple filter that either takes an integer n and returns the last n entries in our database, or some datetime t that filters away observations earlier than this. A sketch is shown below.
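
                      A sketch of such a filter, assuming the database is loaded as a pandas dataframe with a time column as above:

                      from typing import Optional\n\nimport pandas as pd\n\ndef filter_data(df: pd.DataFrame, n: Optional[int] = None, t: Optional[pd.Timestamp] = None) -> pd.DataFrame:\n    # keep either the last n rows or only rows newer than timestamp t\n    if n is not None:\n        return df.tail(n)\n    if t is not None:\n        return df[pd.to_datetime(df['time']) > t]\n    return df\n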

                    7. Evidently by default only supports structured data, i.e. tabular data (so does nearly every other framework). Thus, the question becomes how we can extend this to unstructured data such as images or text. The solution is to extract structured features from the data, which we can then run the analysis on.

                      1. (Optional) For images the simple solution would be to flatten the images and consider each pixel a feature, however this does not work in practice because changes in the individual pixels do not really tell anything about the image. Instead we should derive some features such as:

                        • Average brightness
                        • Contrast of image
                        • Image sharpness
                        • ...

                        These are all numbers that can make up a feature vector for an image. Try doing this yourself, for example by extracting such features from the MNIST and FashionMNIST datasets, and check if you can detect a drift between the two sets. A small sketch to get you started is shown below.
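
                        A minimal sketch of such a feature extractor, assuming each image is a grayscale torch tensor of shape (1, H, W) with values in [0, 1] (as produced by torchvision's ToTensor transform):

                        import pandas as pd\nimport torch\n\ndef image_features(img: torch.Tensor) -> dict:\n    return {\n        'brightness': img.mean().item(),  # average pixel intensity\n        'contrast': img.std().item(),  # spread of pixel intensities\n        'sharpness': (img[..., 1:] - img[..., :-1]).abs().mean().item(),  # mean horizontal gradient as a crude sharpness proxy\n    }\n\n# build one dataframe per dataset and feed them to Evidently as reference/current data\n# reference = pd.DataFrame([image_features(img) for img, _ in mnist])\n# current = pd.DataFrame([image_features(img) for img, _ in fashion_mnist])\n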

                      2. (Optional) For text a common approach is to extract some higher-level embedding such as the classical GloVe embedding. Try following this tutorial to understand how drift detection is done on text.

                      3. Let's instead take a deep learning based approach to doing this. Let's consider the CLIP model, which is normally used to match images and text (for example for zero-shot classification). For our purpose this is perfect, because we can use the model to get abstract feature embeddings for both images and text:

                        from PIL import Image\nimport requests\n# requires transformers package: pip install transformers\nfrom transformers import CLIPProcessor, CLIPModel\n\nmodel = CLIPModel.from_pretrained(\"openai/clip-vit-base-patch32\")\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n\nurl = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# set either text=None or images=None when only the other is needed\ninputs = processor(text=[\"a photo of a cat\", \"a photo of a dog\"], images=image, return_tensors=\"pt\", padding=True)\n\nimg_features = model.get_image_features(inputs['pixel_values'])\ntext_features = model.get_text_features(inputs['input_ids'], inputs['attention_mask'])\n

                        Both img_features and text_features are in this case (512,)-dimensional abstract feature embeddings that should be able to tell us something about our data distribution. Try using this method to extract features from two different datasets, for example CIFAR10 and SVHN if you want to work with vision, or IMDB movie reviews and Amazon reviews for text. After extracting the features, try running some of the data distribution testing you just learned about, as sketched below.
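
                        A sketch of how to connect the embeddings to Evidently, assuming you have stacked the per-image embeddings from the two datasets into tensors emb_ref and emb_cur of shape (N, 512) (the variable names are assumptions):

                        import pandas as pd\nfrom evidently.metric_preset import DataDriftPreset\nfrom evidently.report import Report\n\ncolumns = [f'feat_{i}' for i in range(512)]  # one column per embedding dimension\nreference = pd.DataFrame(emb_ref.detach().numpy(), columns=columns)\ncurrent = pd.DataFrame(emb_cur.detach().numpy(), columns=columns)\n\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference, current_data=current)\nreport.save_html('embedding_drift.html')\n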

                    8. (Optional) If we have multiple applications and want to run monitoring for each of them, we often want the monitoring itself to be a deployed application (that only we can access). Implement a /monitoring/ endpoint that does all the reporting we just went through, such that you have two endpoints:

                      http://127.0.0.1:8000/iris_infer/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0 # user endpoint\nhttp://127.0.0.1:8000/iris_monitoring/ # monitoring endpoint\n

                      Our monitoring endpoint should return an HTML page showing either an Evidently report or a test suite. Try implementing this endpoint; a sketch is shown below. We have implemented a solution in this file if you need help with how to return an HTML page from a FastAPI application.
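
                      A minimal sketch of such an endpoint, assuming a load_latest_data helper (hypothetical, it should read your logged requests and training data) that returns the reference and current dataframes:

                      from evidently.metric_preset import DataDriftPreset\nfrom evidently.report import Report\nfrom fastapi import FastAPI\nfrom fastapi.responses import HTMLResponse\n\napp = FastAPI()\n\n@app.get('/iris_monitoring/')\ndef iris_monitoring():\n    reference, current = load_latest_data()  # hypothetical helper reading your logged data\n    report = Report(metrics=[DataDriftPreset()])\n    report.run(reference_data=reference, current_data=current)\n    report.save_html('monitoring.html')\n    with open('monitoring.html') as f:\n        return HTMLResponse(content=f.read())\n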

                    9. As a final exercise, we recommend that you try implementing this to run directly in the cloud. You will need to implement this in a container, e.g. as a GCP Cloud Run service, because the data gathering from the endpoint should still be implemented as a background task. For this to work you will need to change the following:

                    10. Instead of saving the input to a local file you should store it either in a GCP bucket or in a BigQuery table (the latter is a better solution, but also out of scope for this course)

                    11. You can either run the data analysis locally by pulling the predictions and training data from cloud storage, or alternatively you can deploy it as its own endpoint that can be invoked. For the latter option we recommend that it should require authentication.

                    That ends the module on detection of data drifting, data quality etc. If it has not already been made clear, monitoring of machine learning applications is an extremely hard discipline, because it is not clear-cut when we should actually respond to a feature beginning to drift and when it is probably fine. What kind of rules should be implemented comes down to the individual application. Additionally, the tools presented here are in no way complete and are especially limited in one way: they only consider the marginal distribution of the data. Every analysis that we have done has been on the distribution per feature (the marginal distribution); however, as the image below shows, it is possible for data to have drifted to another distribution while the marginals stay approximately the same.

                    There are methods such as Maximum Mean Discrepancy (MMD) tests that are able to do testing on multivariate distributions, which you are free to dive into. In this course we will just recommend always considering multiple features when making decisions regarding your deployed applications.

                    "},{"location":"s8_monitoring/monitoring/","title":"M26 - System Monitoring","text":""},{"location":"s8_monitoring/monitoring/#monitoring","title":"Monitoring","text":"

                    In this module we are going to look into more classical monitoring of applications. The key concept we are often working with here is called telemetry. Telemetry in general refers to any automatic measurement and wireless transmission of data from our application. It could be numbers such as:

                    • The number of requests our application is receiving per minute/hour/day. This number is of interest because it is directly proportional to the running cost of the application.
                    • The amount of time (on average) our application takes to handle a request. This number is of interest because it is most likely the core contributor to the latency that our users experience (which we want to be low).
                    • ...

                    In general there are three different kinds of telemetry we are interested in:

                    • Metrics: quantitative measurements of the system, usually numbers that are aggregated over a period of time. Example: the number of requests per minute. Purpose: metrics are used to get an overview of the system, often by powering dashboards.
                    • Logs: textual or structured records generated by applications. They provide a detailed account of events, errors, warnings, and informational messages that occur during the operation of the system. Example: system logs, error logs. Purpose: logs are essential for diagnosing issues, debugging, and auditing; they provide a detailed history of what happened in a system, making it easier to trace the root cause of problems and track the behavior of components over time.
                    • Traces: detailed records of specific transactions or events as they move through a system. A trace typically includes information about the sequence of operations, timing, and dependencies between different components. Example: distributed tracing in a microservices architecture. Purpose: traces help in understanding the flow of a request or transaction across different components; they are valuable for identifying bottlenecks, understanding latency, and troubleshooting issues related to the flow of data or control.

                    In this module we are mainly going to focus on metrics.

                    "},{"location":"s8_monitoring/monitoring/#local-instrumentator","title":"Local instrumentator","text":"

                    Before we look into the cloud, let's at least conceptually understand how a given instance of an app can expose values that we may be interested in monitoring.

                    The standard framework for exposing metrics is called Prometheus. Prometheus is a time series database that is designed to store metrics. It is also designed to be very easy to instrument applications with and to scale to large amounts of data. The way Prometheus works is that the application exposes a /metrics endpoint that can be queried to get the current state of the metrics. The metrics are exposed in a format called the Prometheus text format.

                    "},{"location":"s8_monitoring/monitoring/#exercises","title":"\u2754 Exercises","text":"
                    1. Start by installing prometheus-fastapi-instrumentator in python

                      pip install prometheus-fastapi-instrumentator\n

                      this will allow us to easily instrument our FastAPI application with prometheus.

                    2. Create a simple FastAPI application in a file called app.py. You can reuse any application from the previous module on APIs. To that file now add the following code:

                      from prometheus_fastapi_instrumentator import Instrumentator\n\n# your app code here\n\nInstrumentator().instrument(app).expose(app)\n

                      This will instrument your application with Prometheus and expose the metrics on the /metrics endpoint. For reference, a complete minimal app is sketched below.
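
                      A complete minimal sketch (the root endpoint is just a placeholder, reuse your own endpoints instead):

                      from fastapi import FastAPI\nfrom prometheus_fastapi_instrumentator import Instrumentator\n\napp = FastAPI()\n\n@app.get('/')\ndef root():\n    return {'message': 'Hello from an instrumented app'}\n\n# expose all collected metrics on the /metrics endpoint\nInstrumentator().instrument(app).expose(app)\n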

                    3. Run the app using the uvicorn server (e.g. uvicorn app:app --reload). Make sure that the app exposes the endpoints you expect it to, but also check out the /metrics endpoint.

                    4. The /metrics endpoint exposes multiple metrics. Metrics always look like this:

                      # TYPE key <type>\nkey value\n

                      i.e. it is essentially a dictionary of key-value pairs with the added functionality of a <type>. Look at this page about the different types Prometheus metrics can have and try to understand the different metrics being exposed.

                    5. Look at the documentation for the prometheus-fastapi-instrumentator and try to add at least one more metric to your application, as sketched below. Rerun the application and confirm that the new metric is being exposed.
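
                      One simple way to add your own metric (a sketch, assuming the instrumentator exposes the default prometheus_client registry, which is the default behavior) is to create a counter with the prometheus_client package and increment it inside an endpoint:

                      from fastapi import FastAPI\nfrom prometheus_client import Counter\nfrom prometheus_fastapi_instrumentator import Instrumentator\n\napp = FastAPI()\nprediction_counter = Counter('my_predictions_total', 'Number of predictions served')  # will show up under /metrics\n\n@app.get('/predict')\ndef predict():\n    prediction_counter.inc()  # increment on every call\n    return {'prediction': 0}\n\nInstrumentator().instrument(app).expose(app)\n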

                    "},{"location":"s8_monitoring/monitoring/#cloud-monitoring","title":"Cloud monitoring","text":"

                    Any cloud system with respect for itself will have some kind of monitoring system. GCP has a service called Monitoring that is designed to monitor all the different services. By default it will monitor a lot of metrics out of the box. However, the question is if we want to monitor more than the default metrics. The complexity that comes with doing monitoring in the cloud is that we need more than one container: we at least need one container actually running the application and exposing the /metrics endpoint, and then we need another container that is collecting the metrics from the first container and storing them in a database. To implement such a system of containers that need to talk to each other we in general need a container orchestration system such as Kubernetes. This is out of scope for this course, but we can use a feature of Cloud Run called sidecar containers to achieve the same effect. A sidecar container is a container that runs alongside the main container and can be used to do things such as collecting metrics.

                    "},{"location":"s8_monitoring/monitoring/#exercises_1","title":"\u2754 Exercises","text":"
                    1. Overall we recommend that you just become familiar with the monitoring tab for your Cloud Run service (see the image above). Try to invoke your service a couple of times and see what happens to the metrics over time.

                      1. (Optional) If you really want to load test your application we recommend checking out the tool locust. Locust is a Python based load testing tool that can be used to simulate many users accessing your application at the same time.
                    2. Try creating a service level objective (SLO). In short, an SLO is a target for how well your application should be performing. Click the Create SLO button and fill it out with what you consider to be a good SLO for your application.

                    3. (Optional) To expose our own metrics we need to set up a sidecar container. To do this, follow the instructions here. We have set up a simple example that uses fastapi and prometheus that you can find here. After you have correctly set up the sidecar container you should be able to see the metrics in the monitoring tab.

                    "},{"location":"s8_monitoring/monitoring/#alert-systems","title":"Alert systems","text":"

                    A core problem within monitoring is alert systems. The alert system is in charge of sending out alerts to relevant people when some telemetry or metric we are tracking is not behaving as it should. When and how many alerts should be sent out is a subjective choice, and in general it should be proportional to how important the metric/telemetry is. We commonly run into what is referred to as the Goldilocks problem, where we want just the right amount of alerts; however, it is more often the case that we either have

                    • Too many alerts, such that they become irrelevant and the really important ones are overlooked, often referred to as alert fatigue
                    • Or alternatively, too few alerts, so that problems that should have triggered an alert are not dealt with when they happen, which can have unforeseen consequences.

                    Therefore, setting up proper alert systems can be as challenging as setting up the systems that collect the metrics we want to trigger alerts on.

                    "},{"location":"s8_monitoring/monitoring/#exercises_2","title":"\u2754 Exercises","text":"

                    In this exercise we are going to look at how we can set up automatic alerting such that we get a message every time one of our applications is not behaving as expected.

                    1. Go to the Monitoring service. Then go to the Alerting tab.

                    2. Start by setting up a notification channel. We recommend setting it up with an email.

                    3. Next let's create a policy. Clicking the Add Condition button should bring up a window like the one below. You are free to set up the condition as you want, but the image shows one way to set up an alert that will react to the number of times a cloud function is invoked (actually, it measures the number of log entries from cloud functions).

                    4. After adding the condition, add the notification channel you created in one of the earlier steps. Remember to also add some documentation that should be sent with the alert, to better describe what the alert is actually doing.

                    5. When the alert is set up you need to trigger it. If you set up the condition as in the image above you just need to invoke the cloud function many times. Here is a small code snippet that you can execute on your laptop to call a cloud function many times (you need to change the url and payload depending on your function):

                      import time\nimport requests\nurl = 'https://us-central1-dtumlops-335110.cloudfunctions.net/function-2'\npayload = {'message': 'Hello, General Kenobi'}\n\nfor _ in range(1000):\n    r = requests.get(url, params=payload)\n
                    6. Make sure that you get the alert through the notification channel you setup.

                    "},{"location":"s9_scalable_applications/","title":"Scaling applications","text":"

                    Slides

                    This module is all about scaling the applications that we are building. We are here going to use a very narrow definition of scaling, namely that we want our applications to run faster; however, one should note that in general scaling is a much broader term. There are many different ways to scale your applications and we are going to look at three of these, related to different tasks in machine learning:

                    • Scaling data loading
                    • Scaling training
                    • Scaling inference

                    We are going to approach the term scaling from two different angles that both should result in your application running faster. The first approach is leveraging multiple devices, such as using multiple CPU cores or parallelizing training across multiple GPUs. The second approach is more analytical, where we are going to look at how we can design smaller/faster model architectures.

                    It should be noted that this module is specific to working with Pytorch applications. In particular we are going to see both how we can improve base Pytorch code and how to utilize Pytorch Lightning, which we introduced in module M14 on boilerplate, to improve the scaling of our applications. If your application is written using another framework, the same techniques in these modules should transfer to that framework, but you may need to seek out how to apply them specifically.

                    If you manage to complete all modules in this session, feel free to checkout the extra module on scalable hyperparameter optimization.

                    Learning objectives

                    The learning objectives of this session are:

                    • Understand how data loading during training can be parallelized and have experimented with it
                    • Understand the different paradigms for distributed training and can run multi-gpu experiments using the framework pytorch-lightning
                    • Knowledge of different ways to improve inference speed, including quantization, pruning, architecture tuning, etc.
                    "},{"location":"s9_scalable_applications/data_loading/","title":"M27 - Distributed Data Loading","text":""},{"location":"s9_scalable_applications/data_loading/#distributed-data-loading","title":"Distributed Data Loading","text":"

                    Core Module

                    One way that deep learning fundamentally changed the way we think about data in machine learning is that more data is always better. This was very much not the case with more traditional machine learning algorithms (random forest, support vector machines etc.), where a plateau in performance was often reached at a certain amount of data and performance did not improve if more was added. However, as deep learning models have become deeper and deeper, and thereby more and more data hungry, performance seems to be ever increasing, or at least not reaching a plateau in the same way as for traditional machine learning.

                    Image credit

                    As we are trying to feed more and more data into our models, an obvious first question to ask is how to do this in an efficient way. As a general rule of thumb we want the performance bottleneck to be the forward/backward pass, i.e. the actual computation in our neural network, and not the data loading. By bottleneck we here refer to the part of our pipeline that restricts how fast we can process data. If data loading is our bottleneck, then our compute device can sit idle while waiting for data to arrive, which is both inefficient and costly. For example, if you are using a cloud provider for training deep learning models, you are paying by the hour per device, and thus not using them fully can be costly in the long run.

                    In the first set of exercises we are therefore going to focus on distributed data loading, i.e. how to load data in parallel to make sure that we always have data ready for our compute devices. In the following we are going to look at what is going on behind the scenes when we use Pytorch to parallelize data loading.

                    "},{"location":"s9_scalable_applications/data_loading/#a-closer-look-on-data-loading","title":"A closer look on Data loading","text":"

                    Before we talk about distributed applications it is important to understand the physical layout of a standard CPU (the brain of your computer).

                    Most modern CPUs are a single chip that consists of multiple cores. Each core can further be divided into threads. In most laptops the core count is 4, commonly with 2 threads per core. This means that the common laptop has 8 threads. The number of threads a compute unit has is important, because it directly corresponds to the number of parallel operations that can be executed, i.e. one per thread. In a Python terminal you should be able to get the number of logical cores (threads) in your machine by writing (try it):

                    import multiprocessing\n# note: cpu_count() returns the number of logical cores, i.e. threads, on most systems\nnum_threads = multiprocessing.cpu_count()\nprint(f'Number of threads: {num_threads}')\n

                    A distributed application is in general any kind of application that parallelizes some or all of its workload. In these exercises we are only focusing on distributed data loading, which happens primarily on the CPU. In Pytorch it is easy to parallelize data loading if you are using their dataset/dataloader interface:

                    from torch.utils.data import Dataset, DataLoader\n\nclass MyDataset(Dataset):\n    def __init__(self, ...):\n        # whatever logic is needed to init the data set\n        self.data = ...\n\n    def __len__(self):\n        # number of items in the dataset, needed by the default sampler\n        return len(self.data)\n\n    def __getitem__(self, idx):\n        # return one item\n        return self.data[idx]\n\ndataset = MyDataset()\ndataloader = DataLoader(\n    dataset,\n    batch_size=8,\n    num_workers=4  # the number of worker processes we want to parallelize the loading over\n)\n

                    Let's take a deep dive into what happens when we request a batch from our dataloader, e.g. next(iter(dataloader)). First we must understand that we have one thread that plays the role of the main thread, while the remaining workers (in the above example we requested 4) do the loading. When the dataloader is created, we create this structure and make sure that all workers have a copy of our dataset definition so each can call the __getitem__ method.

                    Then comes the actual part where we request a batch of data. Assume that we have a batch size of 8 and we do not do any shuffling. In this step the master thread distributes the list of requested data points ([0,1,2,3,4,5,6,7]) to the four workers. With 8 indices and 4 workers, each worker will receive 2 indices.

                    Each worker then calls the __getitem__ method for all the indices it has received. When all workers are done, the loaded datapoints get sent back to the master thread and collected into a single structure/tensor.

                    Each arrow corresponds to a communication between two threads, which is not a free operation. In total, to get a single batch (not counting the initial startup cost) in this example we need to do 8 communication operations. This may seem like a small price to pay, but that may not be the case. If the processing time of __getitem__ is very low (data is stored in memory, we just need to index to get it) then it does not make sense to use multiprocessing: the computational saving from doing the look-up operations in parallel is smaller than the communication cost between the main thread and the workers. Multiprocessing makes sense when the processing time of __getitem__ is high (data is probably stored on the hard drive).

                    It is this trade-off that we are going to investigate in the exercises. A small timing sketch for exploring it is shown below.
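
                    A minimal sketch of how such a timing experiment could look, assuming the dataset object from above:

                    import time\nfrom torch.utils.data import DataLoader\n\nfor num_workers in [0, 1, 2, 4, 8]:\n    dataloader = DataLoader(dataset, batch_size=8, num_workers=num_workers)\n    start = time.time()\n    for batch_idx, batch in enumerate(dataloader):\n        if batch_idx == 100:  # only time the first 100 batches\n            break\n    end = time.time()\n    print(f'num_workers={num_workers}: {end - start:.2f} seconds')\n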

                    "},{"location":"s9_scalable_applications/data_loading/#exercises","title":"\u2754 Exercises","text":"

                    This exercise is intended to be done on the labeled faces in the wild (LFW) dataset. The dataset consists of images of famous people extracted from the internet. The dataset has been used to drive the field of facial verification, which you can read more about here. We are going to imagine that this dataset cannot fit in memory, and your job is therefore to construct a data pipeline that can be parallelized, based on loading the raw datafiles (.jpg) at runtime.

                    1. Download the dataset and extract to a folder. It does not matter if you choose the non-aligned or aligned version of the dataset.

                    2. We provide the lfw_dataset.py file where we have started the process of defining a data class. Fill out the __init__, __len__ and __getitem__ methods. Note that __getitem__ is expected to return a single img which should be a torch.Tensor. Loading should be done using PIL Image, as PIL images are the default input format for torchvision transforms (for data augmentation).

                    3. Make sure that the script runs without any additional arguments

                      python lfw_dataset.py\n
                    4. Visualize a single batch by filling out the codeblock after the first TODO, right after defining the dataloader. The visualization should be shown when launching the script as

                      python lfw_dataset.py -visualize_batch\n

                      Hint: this tutorial.

                    5. Experiment with how the number of workers influences the performance. We have already provided code that will pass over 100 batches from the dataset 5 times and calculate how long it took, which you can play around with by calling

                      python lfw_dataset.py -get_timing -num_workers 1\n

                      Make an errorbar plot with the number of workers along the x-axis and the timing along the y-axis. The errorbars should correspond to the standard deviation over the 5 runs. HINT: if it is taking too long to evaluate, measure the time over fewer batches (set the -batches_to_check flag). Also, if you are not seeing an improvement, try increasing the batch size (since data loading is parallelized per batch).

                      For certain machines, like Macs with the M1 chipset, it is necessary to set the multiprocessing_context flag in the dataloader to \"fork\". This essentially tells the dataloader how the worker processes should be created.

                    6. Retry the experiment where you change the data augmentation to be more complex:

                      lfw_trans = transforms.Compose([\n    transforms.RandomAffine(5, (0.1, 0.1), (0.5, 2.0)),\n    # add more transforms here\n    transforms.ToTensor()\n])\n

                      by making the augmentation more computationally demanding, it should be easier to get a boost in performance when using multiple workers, because the data augmentation is also executed in parallel.

                    7. (Optional, requires access to a GPU) If you are sending your data to the GPU during training, it can be beneficial to set the pin_memory flag to True. By setting this flag we are essentially telling Pytorch that it can lock the data in place in page-locked host memory, which makes the transfer between the host (CPU) and the device (GPU) faster.

                    This ends the module on distributed data loading in Pytorch. If you want to go into more detail, we highly recommend that you read this paper, which goes into great detail analyzing how data loading in Pytorch works and provides performance benchmarks.

                    "},{"location":"s9_scalable_applications/distributed_training/","title":"M28 - Distributed Training","text":""},{"location":"s9_scalable_applications/distributed_training/#distributed-training","title":"Distributed Training","text":"

                    In this module we are going to look at distributed training. Distributed training is one of the key ingredients in all the awesome results that deep learning models are producing. For example: Alphafold, the highly praised model from Deepmind that seems to have solved protein structure prediction, was trained in a distributed fashion for a few weeks. The training was done on 16 TPUv3s (specialized hardware), which is approximately equal to 100-200 modern GPUs. This means that training Alphafold on a single GPU without distributed training (probably not even possible) would take a couple of years! It is therefore currently simply impossible to train some of the state-of-the-art (SOTA) models within deep learning without taking advantage of distributed training.

                    When we talk about distributed training, there are a number of different paradigms that we may use to parallelize our computations

                    • Data parallel (DP) training
                    • Distributed data parallel (DDP) training
                    • Sharded training

                    In this module we are going to look at data parallel training, which is the original way of doing parallel training, and distributed data parallel training, which is an improved version of data parallel. If you want to know more about sharded training, which is the newest of the paradigms, you can read more about it in this blog post, which describes how sharded training can save over 60% of the memory used during your training.

                    Finally, we want to note that for all the exercises in the module you are going to need a multi-GPU setup. If you have not already gained access to multi-GPU machines on GCP (see the quotas exercises in this module) you will need to find another way of running the exercises. For DTU students I can recommend checking out this optional module on using the high performance cluster (HPC), where you can get access to multi-GPU resources.

                    "},{"location":"s9_scalable_applications/distributed_training/#data-parallel","title":"Data parallel","text":"

                    While data parallel today is generally seen as obsolete compared to distributed data parallel, we are still going to investigate it a bit, since it offers the simplest form of distributed computation in a deep learning pipeline.

                    The figure below shows both the forward and backward steps in the data parallel paradigm.

                    The steps are the following:

                    • Whenever we do a forward call, e.g. out=model(batch), we take the batch and divide it equally between all devices. If we have a batch size of N and M devices, each device will be sent N/M datapoints.

                    • Afterwards each device receives a copy of the model e.g. a copy of the weights that currently parametrizes our neural network.

                    • In this step we perform the actual forward pass in parallel. This is the step that helps us scale our training.

                    • Finally we need to send back the output of each replicated model to the primary device.

                    Similar to the analysis we did of parallel data loading, we cannot always expect that this will actually take less time than doing the forward call on a single GPU. If we are parallelizing over M devices, we essentially need to do 3xM communication calls to send the batch, model and output between the devices. If the savings from the parallel forward call do not outweigh this, then it will take longer.

                    In addition, we also have the backward pass to focus on:

                    • As the forward pass collected the output on the primary device, this is also where the loss is accumulated. Thus, loss gradients are first calculated on the primary device.

                    • Next we scatter the gradient to all the workers

                    • The workers then perform a parallel backward pass through their individual model

                    • Finally, we reduce (sum) the gradients from all the workers on the main process such that we can do gradient descent.

                    One of the big downsides of using data parallel is that all the replicas are destroyed after each backward call. This means that we over and over again need to replicate our model and send it to the devices that are part of the computations.

                    Even though it seems like a lot of logic goes into implementing data parallel, in Pytorch we can very simply enable data parallel training by wrapping our model in the nn.DataParallel class.

                    from torch import nn\nmodel = MyModelClass()\nmodel = nn.DataParallel(model, device_ids=[0, 1])  # data parallel on gpu 0 and 1\npreds = model(input)  # same as usual\n
                    "},{"location":"s9_scalable_applications/distributed_training/#exercises","title":"\u2754 Exercises","text":"

                    Please note that the exercise only makes sense if you have access to multiple GPUs.

                    1. Create a new script (call it data_parallel.py) where you take a copy of the model FashionCNN from the fashion_mnist.py script. Instantiate the model and wrap torch.nn.DataParallel around it such that it can be executed in data parallel.

                    2. Try to run inference in parallel on multiple devices (pass a batch multiple times and time it) e.g.

                      import time\n\nn_reps = 10  # number of repetitions to average the timing over\nstart = time.time()\nfor _ in range(n_reps):\n    out = model(batch)\nend = time.time()\nprint(f'Average inference time: {(end - start) / n_reps}')\n

                      Does data parallel decrease the inference time? If not, can you explain why that may be? Try playing around with the batch size, and see if data parallel is more beneficial for larger batch sizes.

                    "},{"location":"s9_scalable_applications/distributed_training/#distributed-data-parallel","title":"Distributed data parallel","text":"

                    It should be clear that there is a huge disadvantage of using the data parallel paradigm to scale your applications: the model needs to be replicated on each pass (because it is destroyed at the end), which requires a large transfer of data. This is the main problem that distributed data parallel tries to solve.

                    The key difference between distributed data parallel and data parallel is that we move the model update (the gradient step) to happen on each device in parallel, instead of only on the main device. This has the consequence that we do not need to replicate the model on each step; instead we just keep a local version on each device that we keep updating. The full set of steps (as shown in the figure) is:

                    • Initialize an exact copy of the model on each device

                    • From disk (or memory) we start by loading data into a section of page-locked host memory per device. Page-locked memory is essentially a way to reserve a piece of the computer's memory for a specific transfer that is going to happen over and over again, to speed it up. The page-locked regions are loaded with non-overlapping data.

                    • Transfer data from page-locked memory to each device in parallel

                    • Perform forward pass in parallel

                    • Do an all-reduce operation on the gradients. An all-reduce operation is a so-called all-to-all operation, meaning that all processes send their own gradient to all other processes and also receive the gradients from all other processes.

                    • Reduce the combined gradient signal from all processes and update the individual model in parallel. Since all processes received the same gradient information, all models will still be in sync.

                    Thus, in distributed data parallel we end up doing only a single communication call between all processes, compared to all the communication going on in data parallel. While all-reduce is a more expensive operation than many of the other communication operations that we can do, because we only have to do a single one we gain a huge performance boost. Empirically, distributed data parallel tends to be 2-3 times faster than data parallel.

                    However, this performance increase does not come for free. Where we could implement data parallel in a single line in Pytorch, distributed data parallel is much more involved.

                    "},{"location":"s9_scalable_applications/distributed_training/#exercises_1","title":"\u2754 Exercises","text":"
                    1. We have provided an example of how to do distributed data parallel training in Pytorch in the two files distributed_example.py and distributed_example.sh. Your objective is to get an understanding of the necessary components in the script to get this kind of distributed training to work. Try to answer the following questions (HINT: try to Google around):

                      1. What is the function of the DDP wrapper?

                      2. What is the function of the DistributedSampler?

                      3. Why is it necessary to call dist.barrier() before passing a batch into the model?

                      4. What do the different environment variables in the .sh file do?

                    2. Try to benchmark the runs using 1 and 2 GPUs

                    3. The first exercise has hopefully convinced you that it can be quite troublesome to write distributed training applications yourself. Luckily for us, Pytorch Lightning can take care of this such that we do not have to care about the specific details. To get your model training on multiple GPUs you need to change two arguments in the trainer: the accelerator flag and the gpus flag, as sketched below. In addition to this, you can read through this guide about any additional steps you may need to do (for many of you, it should just work). Try running your model on multiple GPUs.
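
                      A rough sketch of the change (the exact argument names depend on your PyTorch Lightning version, so check the guide linked above if these do not match your installation):

                      from pytorch_lightning import Trainer\n\n# newer Lightning versions\ntrainer = Trainer(accelerator='gpu', devices=2, strategy='ddp')\n# older versions instead used: Trainer(gpus=2, accelerator='ddp')\ntrainer.fit(model)  # model is your own LightningModule\n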

                    4. Try benchmarking your training using 1 and 2 GPUs, e.g. try running a couple of epochs and measuring how long it takes. How much of a speedup can you actually get? Why can you not get a speedup of 2?

                    "},{"location":"s9_scalable_applications/inference/","title":"M29 - Scalable Inference","text":""},{"location":"s9_scalable_applications/inference/#scalable-inference","title":"Scalable Inference","text":"

                    Inference is the task of applying our trained model to some new and unseen data, often called prediction. Thus, scaling inference is different from scaling data loading and training, mainly because inference normally only uses a single data point (or a few). As we can neither parallelize the data loading nor parallelize over multiple GPUs (at least not in any efficient way), these techniques are of no use to us when we are doing inference. Secondly, inference is often not something we do on machines that can perform large computations, as most inference today is actually done either on edge devices, e.g. mobile phones, or in low-cost, low-compute cloud environments. Thus, we need to be smarter about how we scale inference than just throwing more compute at it.

                    In this module we are going to look at various ways that you can reduce the size of your model and/or make your model faster. Both are important for running inference fast, regardless of the setup you are running your model on. We want to note that this is still very much an active area of research, and therefore best practices for what to do in a specific situation can change.

                    "},{"location":"s9_scalable_applications/inference/#choosing-the-right-architecture","title":"Choosing the right architecture","text":"

                    Assume you are starting a completely new project and have to come up with a model architecture. What is your strategy? The common way to do this is to look at prior work on problems similar to the one you are facing and either directly choose the same architecture or create some slight variation thereof. This is a great way to get started, but the architecture that you end up choosing may be optimal in terms of performance but not in terms of inference speed.

                    The fact is that not all base architectures are created equal, and a 10K parameter model with one architecture can have significantly different inference speed than another 10K parameter model with another architecture. For example, consider the figure below, which compares a number of models from the timm package, colored based on their base architecture. The general trend is that the number of images that can be processed by a model per second (y-axis) is inversely proportional to the number of parameters (x-axis). However, we in general see that convolutional base architectures (conv) are more efficient than transformers (vit) for the same parameter budget.

                    Image credit"},{"location":"s9_scalable_applications/inference/#exercises","title":"\u2754 Exercises","text":"

                    As discussed in this blogpost, the largest increase in inference speed you will see (given some specific hardware) comes from choosing an efficient model architecture. In the exercises below we are going to investigate the inference speed of different architectures.

                    1. Start by checking out this table which contains a list of pretrained weights in torchvision. Try finding an

                      • Efficientnet
                      • Resnet
                      • Transformer based

                      model that has in the range of 20-30 million parameters.

                    2. Write a small script that initializes all models and does inference with them. It should look something like this:

                      import time\n\nimport torch\nfrom torchvision import models\n\nm1 = models.ModelArchitechture1()\nm2 = models.ModelArchitechture2()\nm3 = models.ModelArchitechture3()\n\ninput = torch.randn(100, 3, 256, 256)\nn_reps = 10  # number of repetitions to average the timing over\n\nfor i, m in enumerate([m1, m2, m3]):\n    tic = time.time()\n    for _ in range(n_reps):\n        _ = m(input)\n    toc = time.time()\n    print(f\"Model {i} took: {(toc - tic) / n_reps}\")\n
                    3. Do the results make sense? Based on the above figure we would expect that efficientnet is faster than resnet, which is faster than the transformer based model. Is this also what you are seeing?

                    4. To figure out why one network is more efficient than another, we can try to count the operations each network needs to do for inference. An operation here we can define as a FLOP (floating point operation), which is any mathematical operation (such as +, -, *, /) or assignment that involves floating-point numbers. Luckily for us someone has already created a python package for calculating this in pytorch: ptflops

                      1. Install the package

                        pip install ptflops\n
                      2. Try calling the get_model_complexity_info function from the ptflops package on the networks from the previous exercise, as sketched below. What are the results?
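
                        A sketch of how that call could look, assuming the 256x256 input resolution from the timing script above (check the ptflops documentation if the signature differs in your installed version):

                        from ptflops import get_model_complexity_info\n\nfor i, m in enumerate([m1, m2, m3]):\n    macs, params = get_model_complexity_info(m, (3, 256, 256), as_strings=True, print_per_layer_stat=False)\n    print(f'Model {i}: {macs}, {params}')\n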

                    5. In the table from the initial exercise, you can also see the overall performance of each network on the Imagenet-1K dataset. Given this performance, the inference speed and the FLOPs count, which network would you choose to use in a production setting? Discuss when choosing one over another should be considered.

                    "},{"location":"s9_scalable_applications/inference/#quantization","title":"Quantization","text":"

                    Quantization is a technique where all computations are performed with integers instead of floats. We are essentially taking all continuous signals and converting them into discretized signals.

                    Image credit

                    As discussed in this blogpost series, float (32-bit) is the primarily used precision in machine learning because it strikes a good balance between memory consumption, precision and computational requirements. However, that does not mean that we cannot take advantage of quantization during inference to improve the speed of our model. For instance:

                    • Floating-point computations are slower than integer operations

                    • Recent hardware has specialized support for doing integer operations

                    • Many neural networks are actually not bottlenecked by how many computations they need to do but by how fast we can transfer data e.g. the memory bandwidth and cache of your system is the limiting factor. Therefore working with 8-bit integers vs 32-bit floats means that we can approximately move data around 4 times as fast.

                    • Storing models in integers instead of floats save us approximately 75% of the ram/harddisk space whenever we save a checkpoint. This is especially useful in relation to deploying models using docker (as you hopefully remember) as it will lower the size of our docker images.

                    But how do we convert between floats and integers in quantization? In most cases we use a linear affine quantization:

                    $$ x_{int} = \\text{round}\\left( \\frac{x_{float}}{s} + z \\right) $$

                    where $s$ is a scale and $z$ is the so-called zero point. But how does this relate to doing inference in a neural network? The figure below shows all the conversions that we need to make to our standard inference pipeline to actually do computations in a quantized format.
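
                    To make the formula concrete, here is a small sketch of the round-trip in PyTorch (the scale and zero point are chosen arbitrarily for illustration):

                    import torch\n\nx_float = torch.tensor([0.1, -0.5, 1.2])\nx_quant = torch.quantize_per_tensor(x_float, scale=0.05, zero_point=0, dtype=torch.qint8)\nprint(x_quant.int_repr())  # the underlying integers, approximately round(x / s) + z\nprint(x_quant.dequantize())  # back to floats, (x_int - z) * s, with small rounding errors\n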

                    Image credit"},{"location":"s9_scalable_applications/inference/#exercises_1","title":"\u2754 Exercises","text":"
                    1. Let's look at how quantized tensors look in Pytorch:

                      1. Start by creating a tensor that contains random numbers.

                      2. Next call the torch.quantize_per_tensor function on the tensor. What does the quantized tensor look like? How do the values relate to the scale and zero_point arguments?

                      3. Finally, try to call the .dequantize() method on the tensor. Do you get back a tensor that is close to what you initially started out with?

                    2. As you hopefully saw in the first exercise, we introduce a number of rounding errors when doing quantization, and naively we would expect these to accumulate and lead to a much worse model. However, in practice we observe that quantization still works, and we actually have a mathematically sound reason for this. Can you figure out why quantization still works despite all the small rounding errors? HINT: it has to do with the central limit theorem.

                    3. Let's move on to quantization of our model. Follow this tutorial from Pytorch on how to do quantization. The goal is to construct a model model_fc32 that works on normal floats and a quantized version model_int8. For simplicity you can just use one of the models from the tutorial.

                    4. Let's try to benchmark our quantized model and see if all the trouble that we went through actually paid off. Also try to perform the benchmark on the non-quantized model and see if you get a difference. If you do not get an improvement, explain why that may be.

                    "},{"location":"s9_scalable_applications/inference/#pruning","title":"Pruning","text":"

                    Pruning is another way of reducing the model size and maybe improving the performance of our network. As the figure below illustrates, in pruning we simply remove weights in our network that we do not consider important for the task at hand. By removing, we here mean that the weight gets set to 0. There are many ways to determine if a weight is important, but the general rule is that the importance of a weight is proportional to its magnitude. This makes intuitive sense, since weights in all linear operations (fully connected or convolutional) are always multiplied onto the incoming value, so a small weight means a small outgoing activation.

                    Image credit"},{"location":"s9_scalable_applications/inference/#exercises_2","title":"\u2754 Exercises","text":"
                    1. We provide a start script that implements the famous LeNet in this file. Open and run it just to make sure that you know the network.

                    2. Pytorch already has some pruning methods implemented in its package. Import the prune module from torch.nn.utils in the script.

                    3. Try to prune the weights of the first convolutional layer by calling

                      prune.random_unstructured(module_1, name=\"weight\", amount=0.3)  # (1)!\n
                      1. You can read about the prune method here.

                      Try printing the named_parameters and named_buffers before and after the module is pruned. Can you explain the difference, and what is the connection to the module_1.weight attribute?

                    4. Try pruning the bias of the same module, this time using the l1_unstructured function from the pruning module. Again check the named_parameters and named_buffers attributes to make sure you understand the difference between L1 pruning and unstructured pruning.

                    5. Instead of pruning only a single module in the model, let's try pruning the whole model. To do this we just need to iterate over all named_modules in the model like this:

                      for name, module in new_model.named_modules():\n    if hasattr(module, 'weight'):  # skip container modules and modules without a weight parameter\n        prune.l1_unstructured(module, name='weight', amount=0.2)\n

                      But what if we wanted to apply different pruning to different layers? Implement a pruning scheme where

                      • The weights of convolutional layers are L1 pruned with amount=0.2
                      • The weights of linear layers are unstructured pruned with amount=0.4

                      Run print(dict(new_model.named_buffers()).keys()) after the pruning to confirm that all weights have been correctly pruned.

                    6. The pruning we have looked at until now has only been local in nature, i.e. we have applied the pruning independently for each layer, not accounting globally for how much we should actually prune. As you may realize, this can quickly lead to a network that is pruned too much. Instead, the more common approach is to prune globally, where we remove the smallest X amount of connections across the whole network:

                      1. Start by creating a tuple over all the weights with the following format

                        parameters_to_prune = (\n    (model.conv1, 'weight'),\n    # fill in the rest of the modules yourself\n    (model.fc3, 'weight'),\n)\n

                        The tuple needs to have length 5. Challenge: Can you construct the tuple using for loops, such that the code works for arbitrary size networks?

                      2. Next prune using the global_unstructured function to globally prune the tuple of parameters

                        prune.global_unstructured(\n    parameters_to_prune,\n    pruning_method=prune.L1Unstructured,\n    amount=0.2,\n)\n
                      3. Check that the amount that has been pruned is actually equal to the 20% specified in the pruning. We provide the following function that, for a given submodule (for example model.conv1), computes the amount of pruned weights:

                        def check_prune_level(module: nn.Module):\n    sparsity_level = 100 * float(torch.sum(module.weight == 0) / module.weight.numel())\n    print(f\"Sparsity level of module {sparsity_level}\")\n
                    7. With a pruned network we really want to see if all our effort actually ended up with a network that is faster and/or smaller in memory. Do the following to the globally pruned network from the previous exercises:

                      1. First we need to make the pruning of our network permanent. Right now it is only semi-permanent as we are still keeping a copy of the original weights in memory. Make the change permanent by calling prune.remove on every pruned module in the model. Hint: iterate over the parameters_to_prune tuple.

                      2. Next try to measure the time of a single inference (repeated 100 times) for both the pruned and non-pruned network

                        import time\ntic = time.time()\nfor _ in range(100):\n    _ = network(torch.randn(100, 1, 28, 28))\ntoc = time.time()\nprint(toc - tic)\n

                        Is the pruned network actually faster? If not can you explain why?

                      3. Next let's measure the size of our network (called pruned_network) and a freshly initialized network (called network):

                        torch.save(pruned_network.state_dict(), 'pruned_network.pt')\ntorch.save(network.state_dict(), 'network.pt')\n

                        Look up the size of each file. Is the pruned network actually smaller? If not, can you explain why?

                      4. Repeat the last exercise, but this time start by converting all pruned weights to sparse format first by calling the .to_sparse() method on each pruned weight. Is the saved model smaller now?

                    This ends the exercises on pruning. As you probably realized in the last couple of exercises, pruning does not guarantee speedups out of the box. This is because linear operations in Pytorch do not handle sparse structures out of the box. To actually get speedups we would need to dive deep into sparse tensor operations, which again do not even guarantee a speedup, because the performance of these operations depends on the sparsity structure of the pruned weights. Investigating this is out of scope for these exercises, but we highly recommend checking it out if you are interested in sparse networks.

                    "},{"location":"s9_scalable_applications/inference/#knowledge-distillation","title":"Knowledge distillation","text":"

                    Knowledge distillation is somewhat similar to pruning, in the sense that it tries to find a smaller model that can perform equally well as a large model; however, it does so in a completely different way. Knowledge distillation is a model compression technique that builds on the work of Bucila et al., in which we try to distill/compress the knowledge of a large complex model (also called the teacher model) into a simpler model (also called the student model).

                    The best known example of this is the DistilBERT model. The DistilBERT model is a smaller version of the large natural-language processing model BERT, which achieves 97% of the performance of BERT while only containing 40% of the weights and being 60% faster. You can see in the figure below how it is much smaller in size compared to other models developed at the same time.

                    Image credit

                    Knowledge distillation works by assuming we have a big teacher model that is already performing well and that we want to compress. By running our training set through our large model we get a softmax distribution for each and every training sample. The goal of the student is to both match the original labels of the training data and match the softmax distribution of the teacher model. The intuition behind doing this is that the teacher model needs to be more complex to learn the complex inter-class relationships from just (one-hot) labels. The student, on the other hand, is directly fed softmax distributions from the teacher that explicitly encode these inter-class relationships and thus does not need the same capacity to learn the same as the teacher.

                    Image credit"},{"location":"s9_scalable_applications/inference/#exercises_3","title":"\u2754 Exercises","text":"

                    Let's try implementing model distillation ourselves. We are going to see if we can achieve this on the cifar10 dataset. Do note that the exercises below can take quite a long time to finish because they involve training multiple networks and therefore involve some waiting.

                    1. Start by installing the transformers and datasets packages from Huggingface

                      pip install transformers\npip install datasets\n

                      which we are going to use to download the cifar10 dataset and a teacher model.

                    2. Next download the cifar10 dataset

                      from datasets import load_dataset\ndataset = load_dataset(\"cifar10\")\n
                    3. Next let's initialize our teacher model. For this we consider a large transformer based model:

                      from transformers import AutoFeatureExtractor, AutoModelForImageClassification\nextractor = AutoFeatureExtractor.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\nmodel = AutoModelForImageClassification.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\n
                    4. To get the logits (the un-normalized scores that go into the softmax) from our teacher model for a single datapoint from the training dataset, you would extract them like this:

                      sample_img = dataset['train'][0]['img']\npreprocessed_img = extractor(sample_img, return_tensors='pt')\noutput = model(**preprocessed_img)\nprint(output.logits)\n# tensor([[ 3.3682, -0.3160, -0.2798, -0.5006, -0.5529, -0.5625, -0.6144, -0.4671, 0.2807, -0.3066]])\n

                      Repeat this process for the whole training dataset and store the result somewhere; a sketch of such a loop is shown below.
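
                      A sketch of how that loop could look (it will take a while to run; consider batching the inputs and using a GPU if you have one available):

                      import torch\n\nteacher_logits = []\nwith torch.no_grad():  # no gradients needed, we only want the teacher predictions\n    for sample in dataset['train']:\n        inputs = extractor(sample['img'], return_tensors='pt')\n        teacher_logits.append(model(**inputs).logits.squeeze(0))\nteacher_logits = torch.stack(teacher_logits)\ntorch.save(teacher_logits, 'teacher_logits.pt')  # store for later use in the distillation objective\n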

                    5. Implement a simple convolutional model. You can create a custom one yourself or use a small one from torchvision.

                    6. Train the model on cifar10 to convergence, so you have a base result on how the model is performing.

                    7. Redo the training, but this time add knowledge distillation to your training objective. It should look like this:

                      for batch in dataset:\n    # ...\n    img, target, teacher_logits = batch\n    preds = model(img)\n    loss = torch.nn.functional.cross_entropy(preds, target)\n    # use the teacher's softmax distribution as soft targets for the student\n    loss_teacher = torch.nn.functional.cross_entropy(preds, teacher_logits.softmax(dim=-1))\n    loss = loss + loss_teacher\n    loss.backward()\n    # ...\n
                    8. Compare the final performance obtained with and without knowledge distillation. Did the performance improve or not?

                    This ends the module on scaling inference in machine learning models.

                    "},{"location":"tools/","title":"Tools","text":"

                    Just a collection of tools and scripts for running the course.

                    "}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 6ca664eb3..0519d0396 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ