From cbc32aaa89c7ee2f028474bb3b516da51aeced9a Mon Sep 17 00:00:00 2001 From: <> Date: Thu, 21 Nov 2024 09:49:47 +0000 Subject: [PATCH] Deployed e9259bf with MkDocs version: 1.6.1 --- s7_deployment/ml_deployment/index.html | 4 ++-- search/search_index.json | 2 +- sitemap.xml.gz | Bin 127 -> 127 bytes 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/s7_deployment/ml_deployment/index.html b/s7_deployment/ml_deployment/index.html index c45981f1f..7772f86c5 100644 --- a/s7_deployment/ml_deployment/index.html +++ b/s7_deployment/ml_deployment/index.html @@ -1984,7 +1984,7 @@
Machine Learning Operations
Repository for course 02476 at DTU.
Checkout the homepage!
"},{"location":"#i-course-information","title":"\u2139\ufe0f Course information","text":"
Recommended prerequisites: DTU course 02456 (Deep Learning) or experience with the following topics:
Start by cloning or downloading this repository
git clone https://github.com/SkafteNicki/dtu_mlops\n
If you do not have git installed (yet), we will touch upon it in the course. The folder contains all the exercise material and lectures for this course. Additionally, you should join our Slack channel, which we use for communication. If the link has expired, write to me.
"},{"location":"#course-organization","title":"\ud83d\udcc2 Course organization","text":"We highly recommend that when going through the material you use the homepage which is the corresponding GitHub Pages version of this repository that is more nicely rendered, and also includes some special HTML magic provided by Material for MkDocs.
The course is divided into sessions, denoted by capital S, and modules, denoted by capital M. A session corresponds to a full day of work if you are following the course, meaning approximately 9 hours of work. Each session (S) corresponds to a topic within MLOps and consists of multiple modules (M) that each cover a specific topic.
Importantly, we distinguish between core modules and optional modules. Core modules will be marked by
Core Module
at the top of their corresponding page. You need to go through the core modules to be able to pass the course. We still highly recommend that you also do the optional modules.
Additionally, be aware of the following icons throughout the course material:
This icon can be expanded to show code belonging to a given exercise
Example: I will contain some code for an exercise.
This icon can be expanded to show a solution for a given exercise
Solution: I will present a solution to the exercise.
This icon (1) can be expanded to show a hint or a note for a given exercise
Machine Learning Operations (MLOps) is a rather new field that has seen its rise as machine learning, and particularly deep learning, has become a widely available technology. The term itself is a compound of \"machine learning\" and \"operations\" and covers everything that has to do with the management of the production ML lifecycle.
The lifecycle of production ML can largely be divided into three phases:
Design: The initial phase starts with an investigation of the problem. Based on this analysis, several requirements can be prioritized for what we want our future model to do. Since machine learning requires data to be trained, we also investigate in this step what data we have and if we need to source it in some other way.
Model development: Based on the design phase we can begin to conjure up some machine learning algorithms to solve our problems. As always, the initial step often involves doing some data analysis to make sure that our model is learning the signal that we want it to learn. Second comes the machine learning engineering phase, where the particular model architecture is chosen. Finally, we also need to do validation and testing to make sure that our model generalizes well.
Operations: Based on the model development phase, we now have a model that we want to use. The operations phase is where we create an automated pipeline that makes sure that whenever we make changes to our codebase, they are automatically incorporated into our model, such that we do not slow down production. Equally important is the ongoing monitoring of already deployed models to make sure that they behave exactly as we specified.
It is important to note that the three steps form a cycle, meaning that when you have successfully deployed a machine learning model, that is not the end of it. Your initial requirements may change, forcing you to revisit the design phase. Some new algorithms may show promising results, so you revisit the model development phase to implement them. Finally, you may try to cut the cost of running your model in production, making you revisit the operations phase and try to optimize some steps.
The focus in this course is particularly on the Operations part of MLOps as this is what many data scientists are missing in their toolbox to take all the knowledge they have about data processing and model development into a production setting.
"},{"location":"#learning-objectives","title":"\u2754 Learning objectives","text":"General course objective
Introduce the student to a number of coding practices that will help them organize, scale, monitor and deploy machine learning models either in a research or a production setting. To provide hands-on experience with a number of frameworks, both local and in the cloud, for working with large-scale machine learning models.
This includes:
Additional reading resources (in no particular order):
Ref 1 Introduction blog post for those who have never heard about MLOps and want to get an overview.
Ref 2 Great document from Google about the different levels of MLOps.
Ref 3 Another introduction to the principles of MLOps and the different stages of MLOps.
Ref 4 Great paper about the technical debt in machine learning.
Ref 5 Interview study that uncovers many of the pain points that ML engineers go through when doing MLOps.
Other courses with content similar to this:
Made with ML. Great online MLOps course that also covers additional topics on the foundations of working with ML.
Full stack deep learning. Another MLOps online course going through the whole developer pipeline.
MLOps Zoomcamp. MLOps online course that includes many of the same topics.
If you want to contribute to the course, we are happy to have you! Anything from fixing typos to adding new content is welcome. For building the course material locally, it is a simple two-step process:
pip install -r requirements.txt\nmkdocs serve\n
This will start a local server that you can access at http://127.0.0.1:8000
and will automatically update when you make changes to the course material. When you have something that you want to contribute, please make a pull request.
I highly value open source, and the content of this course is therefore free to use under the Apache 2.0 license. If you use parts of this course in your work, please cite using:
@misc{skafte_mlops,\n author = {Nicki Skafte Detlefsen},\n title = {Machine Learning Operations},\n howpublished = {\\url{https://github.com/SkafteNicki/dtu_mlops}},\n year = {2024}\n}\n
"},{"location":"pages/faq/","title":"Frequently asked questions","text":"For further questions, please contact Nicki.
"},{"location":"pages/faq/#when-is-the-next-time-the-course-is-running","title":"When is the next time the course is running \u2754","text":"The course always runs in January, during the 3-week period at DTU. The exact dates can be found in the academic calendar.
"},{"location":"pages/faq/#is-it-possible-to-attend-the-course-fully-online","title":"Is it possible to attend the course fully online \u2754","text":"Mostly yes. All exercises are provided online and lectures will be recorded and streamed. However, do note that
Overall we try to support flexible learning as much as possible with some limitations.
"},{"location":"pages/faq/#what-are-the-prerequisites-for-taking-this-course","title":"What are the prerequisites for taking this course \u2754","text":"We recommend that you have a basic understanding of machine learning concepts such as what a dataset is, what probabilities are, what a classifier is, what overfitting means etc. This corresponds to the curriculum covered in course 02450. The actual focus of the course is not on machine learning models, but we will be using these basic concepts throughout the exercises.
Additionally, we recommend basic knowledge about deep learning and how to code in PyTorch, corresponding to the curriculum covered in 02456. From prior experience, we know that not all students have gained knowledge about deep learning models before this course, and we will be covering the basics of how to code in PyTorch in one of the first modules of the course to get everyone up to speed.
"},{"location":"pages/faq/#i-will-be-missing-x-days-of-the-course-will-that-be-a-problem","title":"I will be missing X days of the course, will that be a problem \u2754","text":"Depends. The course is fairly intensive, with most students working from 9-17 every day. If you already know that you will be missing X days of the course, then I highly recommend that you go through some of the first sessions before the course starts to give yourself a bit of breathing room. If you are not able to do so, please be aware that an additional effort may be needed from you to keep up with your fellow students.
"},{"location":"pages/faq/#how-many-should-we-be-in-a-group-for-the-projects","title":"How many should we be in a group for the projects \u2754","text":"Between 3 and 5. The projects are designed to be done in groups meaning that we intentionally make them too big for one person to do alone. Luckily, a lot of the work that you need to do can be done in parallel, so it is not as bad as it sounds.
"},{"location":"pages/faq/#when-will-the-exam-take-place","title":"When will the exam take place \u2754","text":"From 2025 and onwards, the exam only consist of a project report. The report should be handed in at midnight on the final day of the course. For January 2025, this means the 24th.
"},{"location":"pages/faq/#where-can-i-find-information-regarding-the-exam","title":"Where can I find information regarding the exam \u2754","text":"Look at the bottom of this page. Details will be updated as we get closer to the exam date.
"},{"location":"pages/faq/#can-i-use-chatgpt-or-similar-tools-for-the-exercises-project-exam-report-coding-writing","title":"Can I use ChatGPT or similar tools for the exercises, project, exam report (coding + writing) \u2754","text":"Yes, yes, and yes, but remember that its a tool and you need to validate the output before using it. We would prefer for the exam report that you formulate the answers in your own words because it is intended for you do describe what you have been doing in your project. The I in LLM stands for intelligence.
"},{"location":"pages/faq/#i-am-a-phd-student-not-enrolled-at-dtu-can-i-take-the-course","title":"I am a PhD student not enrolled at DTU, can I take the course \u2754","text":"Yes, PhD students from other universities can attend the course. You can checkout this page for more information or in general you can contact phdcourses@dtu.dk for more information. Do note that the registration deadline is usually in beginning of December.
"},{"location":"pages/faq/#i-am-a-foreign-student-and-my-home-university-doesnt-accept-passnot-pass-what-can-i-do","title":"I am a foreign student and my home university doesn't accept pass/not pass, what can I do \u2754","text":"We can give a grade on the Danish 7-point grading scale for foreign students who need it, where their home university does not accept pass/no-pass. You need to contact the course responsible Nicki within the first week of the course to request this. Secondly, we may need to further validate your work, so please be prepared for doing a short oral exam on one of the last days of the course.
"},{"location":"pages/faq/#i-am-a-euroteq-student-any-special-rules-for-me","title":"I am a EuroTEQ student, any special rules for me \u2754","text":"Not really, you will attend the course as any other student. However, we will provide a special Slack channel for you, trying to make sure that you can get the same help as students from DTU who can attend the course on campus.
"},{"location":"pages/overview/","title":"Summary of course content","text":"There are a lot of moving parts in this course, so it may be hard to understand how it all fits together. This page provides a summary of the frameworks in this course e.g. the stack of tools used. In the figure below we have provided an overview on how the different tools of the course interacts with each other. The table after the figure provides a short description of each of the parts.
The MLOps stack in the course. This is just an example of one stack, and depending on your use case you may want to use a different stack of tools that better fits your needs. Regardless of the stack, the principles of MLOps are the same.

Framework | Description
--- | ---
PyTorch | The backbone of our code; it provides the computational engine and the data structures that we need to define our models and data.
PyTorch Lightning | A framework that provides a high-level interface to PyTorch. It provides a lot of the functionality that we need to train our models, such as logging, checkpointing, early stopping, etc., such that we do not have to implement it ourselves. It also allows us to scale our models to multiple GPUs and multiple nodes.
Conda | We control the dependencies and Python interpreter using Conda, which enables us to construct reproducible virtual environments.
Hydra | For configuring our experiments we use Hydra, which allows us to define a hierarchical configuration structure in config files.
Weights and Biases | Allows us to track and log any values and hyperparameters for our experiments.
Profiler | Whenever we run into performance bottlenecks with our code, we can use the profiler to find the cause of the bottleneck.
Debugger | When we run into bugs in our code, we can use the debugger to find the cause of the bug.
Cookiecutter | For organizing our code and creating templates.
Docker | A tool that allows us to create a container that contains all the dependencies and code that we need to run our code.
DVC | For controlling the versions of our data and synchronizing between local and remote data storage; DVC makes this process easy.
Git / GitHub | For version control of our code, allowing multiple developers to work together on a shared codebase.
Pytest | For writing unit tests for our code, to make sure that new changes to the code do not break the code base.
Pylint / Flake8 | For linting our code and keeping a consistent coding style; these tools check our code for common mistakes and style issues.
GitHub Actions | For running our unit tests and other checks on our code in a continuous manner, e.g. after we commit and push our code.
Cloud Build | We can automate the process of building our docker images and pushing them to our artifact registry.
Artifact Registry | A service that allows us to store our docker images for later use by other services.
Cloud Storage | For storing our data and trained models; provides a scalable and secure storage solution.
Compute Engine | For general compute tasks; provides a scalable and secure compute solution.
Vertex AI | For training our experiments in an easy and scalable manner.
FastAPI | For creating a REST API for our model; provides a high-level interface for creating APIs.
Cloud Functions | For simple deployments of our code, running it in response to events through simple Python functions.
Cloud Run | For more complex deployments of our code, running it in response to events through docker containers.
Cloud Monitoring | Gives us the tools to keep track of important logs and errors from the other cloud services.
Evidently AI | For monitoring whether our deployed model is experiencing any drift; provides a framework and dashboard for monitoring drift.
OpenTelemetry | For monitoring the telemetry of our deployed model; provides a standard for collecting and exporting telemetry data.
"},{"location":"pages/projects/","title":"Project work","text":"Slides
Approximately 1/3 of the course time is dedicated to doing project work. The projects will serve as the basis of your exam. In the project, you will essentially re-apply everything that you learn throughout the course to a self-chosen project. The overall goals of the project are:
In the projects you are free to work on whatever problem you want. That said, we have one specific requirement: you need to incorporate some third-party framework into your project. If you want inspiration for projects, here are some examples:
Classification of tweets
Translating from English to German
Classification of scientific papers
Classification of rice types from images
We hope most students will be able to form groups by themselves. Expected group size is between 3 and 5. If you are not able to form a group, please make sure to post in the #looking-for-group
channel on Slack or make sure to be present on the 4th day of the course (the day before the project work starts) where we will help students that have not found a group yet.
We strive to keep the tools taught in this course as open source as possible. The great thing about the open-source community is that whatever problem you are working on, there is probably some package out there that can get you at least 10% of the way. For the project, we want to enforce this point, and you are required to include some third-party package, that is neither PyTorch nor one of the tools already covered in the course, into your project.
If you have no idea what framework to include, the PyTorch ecosystem is a great place for finding open-source frameworks that can help you accelerate your own projects that have PyTorch as the back-end engine. All tools in the ecosystem should work well together with PyTorch. However, it is important to note that the ecosystem is not a complete list of all the awesome packages that exist to extend the functionality of PyTorch. If you are still missing inspiration for frameworks to use, we highly recommend these three that have been used in previous years of the course (short usage sketches for each are included after the list):
PyTorch Image Models. PyTorch Image Models (also known as TIMM) is by far the most used computer vision package (maybe except for torchvision
). It contains models, scripts and pre-trained weights for a lot of state-of-the-art image models within computer vision.
Transformers. The Transformers repository from the Huggingface group focuses on state-of-the-art Natural Language Processing (NLP). It provides many pre-trained models to perform tasks on text such as classification, information extraction, question answering, summarization, translation, text generation, etc. in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
PyTorch-Geometric. PyTorch Geometric (PyG) is a library for geometric deep learning. It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers.
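To give a feel for how little code is needed to get started with these frameworks, below are a few minimal, hedged sketches; the model names, tasks and layer sizes are placeholder choices, not requirements. First, creating a pre-trained image model with TIMM:

import timm
import torch

# 'resnet50' and num_classes=10 are placeholder choices for illustration
model = timm.create_model('resnet50', pretrained=True, num_classes=10)
model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy batch with a single image
with torch.no_grad():
    logits = model(x)  # tensor of shape [1, 10]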
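Similarly, a minimal sketch of the Transformers pipeline API; the task string is just one example, and the first call downloads a default pre-trained model:

from transformers import pipeline

# 'sentiment-analysis' is an example task; tasks such as 'translation' or 'summarization' also exist
classifier = pipeline('sentiment-analysis')
print(classifier('MLOps makes it much easier to get models into production!'))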
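And a minimal sketch of a two-layer graph convolutional network in PyTorch Geometric, where the in/hidden/out channel sizes are arbitrary placeholders:

import torch
from torch_geometric.nn import GCNConv

class SimpleGCN(torch.nn.Module):
    # Two-layer graph convolutional network; channel sizes are arbitrary placeholders
    def __init__(self, in_channels: int = 16, hidden_channels: int = 32, out_channels: int = 7) -> None:
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: node features [num_nodes, in_channels], edge_index: graph connectivity [2, num_edges]
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)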
Each project day is fully dedicated to project work, except for maybe external inspirational lectures in the morning. The group decides exactly where they want to work on the project, how they want to work on the project, how to distribute the workload, etc. We strongly encourage you to parallelize work during the project, because there are a lot of tasks to do, but it is important that all group members at least have some understanding of the whole project.
Remember that the focus of the project work is not to demonstrate that you can work with the biggest and baddest deep learning model, but instead that you show that you can incorporate the tools that are taught throughout the course in a meaningful way.
Also note that the project is not expected to be very large in scope. It may simply be that you want to train X model on Y data. You will be given approximately 4 full days to work on the project. It is better to start out with a smaller project and then add complexity along the way if you have time.
"},{"location":"pages/projects/#day-1","title":"Day 1","text":"The first project days is all about getting started on the projects and formulating exactly what you want to work on as a group.
Start by brainstorming projects! Try to figure out exactly what you want to work with and begin to investigate what third-party package can support the project.
When you have come up with an idea, write a project description. The description is the delivery for today and should be at least 300 words. Try to answer the following questions in the description:
(Optional) If you want to think more about the product design of your project, feel free to fill out the ML canvas (or part of it). You can read more about the different fields on canvas here.
After having done the project description, you can start on the actual coding of the project. In the next section, a to-do list is attached that summarizes what we are doing in the course. You are NOT expected to fulfill all bullet points from week 1 today.
The project description will serve as a guideline for us at the exam to check that you have reached the goals that you set out to achieve. By the end of the day, you should commit your project description to the README.md
file belonging to your project repository. If you filled out the ML canvas, feel free to include that as part of the README.md
file. Also remember to commit whatever you have done on the project until now. When you have done this, go to DTU Learn and hand in (as a group) the link to your GitHub repository as an assignment.
We will briefly (before next Monday) look over your GitHub repository and project description to check that everything is fine. If we have any questions/concerns we will contact you.
"},{"location":"pages/projects/#day-2","title":"Day 2","text":"The goal for today is simply to continue working on your project. Start with bullet points in the checklist from week 1 and continue with bullet points for week 2.
"},{"location":"pages/projects/#day-3","title":"Day 3","text":"Continue working on your project, today you should hopefully focus on the bullet points in the checklist from week 2. There is no delivery for this week, but make sure that you have committed all your progress at the end of the day. We will again briefly look over the repositories and will reach out to your group if we are worried about the progression of your project.
"},{"location":"pages/projects/#day-4","title":"Day 4","text":"We have now entered the final week of the course and the second last project day. You are most likely continuing with bullet points from week 2, but should hopefully begin to look at the bullet points from week 3 today. These are in general much more complex, so we recommend looking at them until you have completed most from week 2. We also recommend that you being to fill our report template.
"},{"location":"pages/projects/#day-5","title":"Day 5","text":"Today you are finishing your project. We recommend that you start by creating a architechtual overview of your project similar to this figure. I recommend using draw.io for creating this kind of diagram, but feel free to use any tool you like. Else you should just continue working on your project, checking of as many bullet points as possible. Finally, you should also prepare yourself for the exam tomorrow.
"},{"location":"pages/projects/#project-hints","title":"Project hints","text":"Below are listed some hints to prevent you from getting stuck during the project work with problems that previous groups have encountered.
Data
Start out small! We recommend that you start out with less than 1GB of data. If the dataset you want to work with is larger, then subsample it. You can use dvc to version control your data and only download the full dataset when you are ready to train the model.
Be aware of having many small files. DVC
does not handle many small files well, and they can take a long time to download. If you have many small files, consider zipping them together and then unzipping them at runtime.
You do not need to use DVC for everything regarding data. Your workflow can be to just use DVC for version controlling the data, but when you need to get it you can simply download it from the source. For example, if you are storing your data in a GCP bucket, you can use the gsutil command to download the data or access it directly using the Cloud Storage file system (a sketch of this is shown below).
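As a sketch of the latter option, data in a GCP bucket can also be fetched directly from Python with the google-cloud-storage client; the bucket and object names below are hypothetical:

from google.cloud import storage  # requires the google-cloud-storage package

client = storage.Client()
bucket = client.bucket('my-mlops-data')            # hypothetical bucket name
blob = bucket.blob('processed/train.pt')           # hypothetical object path inside the bucket
blob.download_to_filename('data/processed/train.pt')  # local destination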
Modelling
Again, start out small! Start with a simple model and then add complexity as you go along. It is better to have a simple model that works than a complex model that does not work.
Try fine-tuning a pre-trained model. This is often much faster than training a model from scratch; a minimal sketch of what this could look like is shown below.
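A minimal sketch of such fine-tuning with a torchvision backbone; the choice of resnet18, ImageNet weights and 10 output classes are placeholders:

import torch
import torchvision

model = torchvision.models.resnet18(weights='IMAGENET1K_V1')  # load a pre-trained backbone
for param in model.parameters():
    param.requires_grad = False  # freeze the pre-trained weights
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # replace the head for 10 placeholder classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # only optimize the new head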
Deployment
Please note that all the lists are exhaustive, meaning that I do not expect you to have completed every point on the checklist for the exam.
"},{"location":"pages/projects/#week-1","title":"Week 1","text":"make_dataset.py
file such that it downloads whatever data you need andrequirements.txt
file with whatever dependencies that you are usingpep8
) while doing the project

From January 2025 the exam only consists of a project report. The report should be handed in at midnight on the final day of the course. For January 2025, this means the 24th. We provide a template folder called reports. As the first task you should copy the folder and all its content to your project repository. Then, your job is to fill out the README.md
file which contains the report template. The file itself contains instructions on how to fill it out and instructions on using the included report.py
file for validating your work. You will hand in the template by simply including it in your project repository. By midnight on the final day of the course, we will automatically scrape the report and use it as the basis for grading you. Therefore, changes after this point are not registered.
Slides
The course is organised into exercise days (2/3 of the course) and project days (1/3 of the course).
Exercise days start at 9:00 in the morning with a lecture (usually 30-45 min) that gives some context about at least one of the topics of that day. Additionally, the previous day's exercises may briefly be touched upon. The remainder of the day will be spent solving exercises, either individually or in small groups. For some people the exercises may be fast to do, and for others they will take the whole day. We will provide help throughout the day. We will try to answer questions on Slack, but help will be prioritized for students physically on campus.
Project days are intended for project work, and you are therefore responsible for making an agreement with your group about when and where you are going to work. On the first project day there will be a lecture at 9:00 with project information. On other project days we may also start the day with an external lecture, which we highly recommend that you participate in. During each project day we will have office hours where you can ask questions about the project.
Below is an overall timeplan for each day, including the presentation topic of the day and the frameworks that you will be using in the exercises.
Recordings (link to drive folder with mp4 files):
In the first week you will be introduced to a number of development practices for organizing and developing code, especially with a focus on making everything reproducible.
Date | Day | Presentation topic | Frameworks | Format
--- | --- | --- | --- | ---
6/1/25 | Monday | Deep learning software\ud83d\udcdd | Terminal, Conda, IDE, PyTorch | Exercises
7/1/25 | Tuesday | MLOps: what is it?\ud83d\udcdd | Git, CookieCutter, Pep8, DVC | Exercises
8/1/25 | Wednesday | Reproducibility\ud83d\udcdd | Docker, Hydra | Exercises
9/1/25 | Thursday | Debugging\ud83d\udcdd | Debugger, Profiler, Wandb, Lightning | Exercises
10/1/25 | Friday | Project work\ud83d\udcdd | - | Projects
"},{"location":"pages/timeplan/#week-2","title":"Week 2","text":"The second week is about automation and the cloud. Automation will help us make sure that our code does not break when we make changes to it. The cloud will help us scale up our applications, and we learn how to use different services to help develop a full machine learning pipeline.
Date | Day | Presentation topic | Frameworks | Format
--- | --- | --- | --- | ---
13/1/25 | Monday | Continuous Integration\ud83d\udcdd | Pytest, Github actions, Pre-commit, CML | Exercises
14/1/25 | Tuesday | The Cloud\ud83d\udcdd | GCP Engine, Bucket, Artifact registry, Vertex AI | Exercises
15/1/25 | Wednesday | Deployment\ud83d\udcdd | FastAPI, Torchserve, GCP Functions, GCP Run | Exercises
16/1/25 | Thursday | No lecture | - | Projects
17/1/25 | Friday | Company presentation (TBA) | - | Projects
"},{"location":"pages/timeplan/#week-3","title":"Week 3","text":"For the final week we look into advanced topics such as monitoring and scaling of applications. Monitoring is especially important for the longevity of the applications that we develop: we need to be able to deploy them either locally or in the cloud and to have the tools to monitor how they behave over time. Scaling of applications is an important topic if we ever want our applications to be used by many people at the same time.
Date | Day | Presentation topic | Frameworks | Format
--- | --- | --- | --- | ---
20/1/25 | Monday | Monitoring\ud83d\udcdd | Evidently AI, Prometheus, GCP Monitoring | Exercises
21/1/25 | Tuesday | Scalable applications\ud83d\udcdd | PyTorch, Lightning | Exercises
22/1/25 | Wednesday | Company presentation (TBA) | - | Projects
23/1/25 | Thursday | No lecture | - | Projects
24/1/25 | Friday | No lecture | - | Projects
"},{"location":"reports/","title":"Exam template for 02476 Machine Learning Operations","text":"This is the report template for the exam. Please only remove the text formatted with three dashes in front and behind, like:
--- question 1 fill here ---
where you instead should add your answers. Any other changes may have unwanted consequences when your report is auto-generated at the end of the course. For questions where you are asked to include images, start by adding the image to the figures
subfolder (please only use .png
, .jpg
or .jpeg
) and then add the following code in your answer:
![my_image](figures/<image>.<extension>)\n
In addition to this markdown file, we also provide the report.py
script that provides two utility functions:
Running:
python report.py html\n
will generate a .html
page of your report. After the deadline for answering this template, we will auto-scrape everything in this reports
folder and then use this utility to generate an .html
page that will serve as your final hand-in.
Running
python report.py check\n
will check your answers in this template against the constraints listed for each question e.g. is your answer too short, too long, or have you included an image when asked to.
For both functions to work you mustn't rename anything. The script has two dependencies that can be installed with
pip install click markdown\n
"},{"location":"reports/#overall-project-checklist","title":"Overall project checklist","text":"The checklist is exhaustive which means that it includes everything that you could do on the project included in the curriculum in this course. Therefore, we do not expect at all that you have checked all boxes at the end of the project.
"},{"location":"reports/#week-1","title":"Week 1","text":"make_dataset.py
file such that it downloads whatever data you need andrequirements.txt
file with whatever dependencies that you are usingpep8
) while doing the project

Enter the group number you signed up on
Answer:
--- question 1 fill here ---
"},{"location":"reports/#question-2","title":"Question 2","text":"Enter the study number for each member in the group
Example:
sXXXXXX, sXXXXXX, sXXXXXX
Answer:
--- question 2 fill here ---
"},{"location":"reports/#question-3","title":"Question 3","text":"What framework did you choose to work with and did it help you complete the project?
Recommended answer length: 100-200 words.
Example: We used the third-party framework ... in our project. We used functionality ... and functionality ... from the package to do ... and ... in our project.
Answer:
--- question 3 fill here ---
"},{"location":"reports/#coding-environment","title":"Coding environment","text":"In the following section we are interested in learning more about you local development environment.
"},{"location":"reports/#question-4","title":"Question 4","text":"Explain how you managed dependencies in your project? Explain the process a new team member would have to go through to get an exact copy of your environment.
Recommended answer length: 100-200 words
Example: We used ... for managing our dependencies. The list of dependencies was auto-generated using ... . To get a complete copy of our development environment, one would have to run the following commands
Answer:
--- question 4 fill here ---
"},{"location":"reports/#question-5","title":"Question 5","text":"We expect that you initialized your project using the cookiecutter template. Explain the overall structure of your code. Did you fill out every folder or only a subset?
Recommended answer length: 100-200 words
Example: From the cookiecutter template we have filled out the ... , ... and ... folder. We have removed the ... folder because we did not use any ... in our project. We have added an ... folder that contains ... for running our experiments. Answer:
--- question 5 fill here ---
"},{"location":"reports/#question-6","title":"Question 6","text":"Did you implement any rules for code quality and format? Additionally, explain with your own words why these concepts matters in larger projects.
Recommended answer length: 50-100 words.
Answer:
--- question 6 fill here ---
"},{"location":"reports/#version-control","title":"Version control","text":"In the following section we are interested in how version control was used in your project during development to corporate and increase the quality of your code.
"},{"location":"reports/#question-7","title":"Question 7","text":"How many tests did you implement and what are they testing in your code?
Recommended answer length: 50-100 words.
Example: In total we have implemented X tests. Primarily we are testing ... and ... as these the most critical parts of our application but also ... .
Answer:
--- question 7 fill here ---
"},{"location":"reports/#question-8","title":"Question 8","text":"What is the total code coverage (in percentage) of your code? If you code had an code coverage of 100% (or close to), would you still trust it to be error free? Explain you reasoning.
Recommended answer length: 100-200 words.
Example: The total code coverage of code is X%, which includes all our source code. We are far from 100% coverage of our ** code and even if we were then...*
Answer:
--- question 8 fill here ---
"},{"location":"reports/#question-9","title":"Question 9","text":"Did you workflow include using branches and pull requests? If yes, explain how. If not, explain how branches and pull request can help improve version control.
Recommended answer length: 100-200 words.
Example: We made use of both branches and PRs in our project. In our group, each member had an branch that they worked on in addition to the main branch. To merge code we ...
Answer:
--- question 9 fill here ---
"},{"location":"reports/#question-10","title":"Question 10","text":"Did you use DVC for managing data in your project? If yes, then how did it improve your project to have version control of your data. If no, explain a case where it would be beneficial to have version control of your data.
Recommended answer length: 100-200 words.
Example: We did make use of DVC in the following way: ... . In the end it helped us in ... for controlling ... part of our pipeline
Answer:
--- question 10 fill here ---
"},{"location":"reports/#question-11","title":"Question 11","text":"Discuss you continuous integration setup. What kind of continuous integration are you running (unittesting, linting, etc.)? Do you test multiple operating systems, Python version etc. Do you make use of caching? Feel free to insert a link to one of your GitHub actions workflow.
Recommended answer length: 200-300 words.
Example: We have organized our continuous integration into 3 separate files: one for doing ..., one for running ... testing and one for running ... . In particular for our ..., we used ... .An example of a triggered workflow can be seen here:
Answer:
--- question 11 fill here ---
"},{"location":"reports/#running-code-and-tracking-experiments","title":"Running code and tracking experiments","text":"In the following section we are interested in learning more about the experimental setup for running your code and especially the reproducibility of your experiments.
"},{"location":"reports/#question-12","title":"Question 12","text":"How did you configure experiments? Did you make use of config files? Explain with coding examples of how you would run a experiment.
Recommended answer length: 50-100 words.
Example: We used a simple argparser, that worked in the following way: Python my_script.py --lr 1e-3 --batch_size 25
Answer:
--- question 12 fill here ---
"},{"location":"reports/#question-13","title":"Question 13","text":"Reproducibility of experiments are important. Related to the last question, how did you secure that no information is lost when running experiments and that your experiments are reproducible?
Recommended answer length: 100-200 words.
Example: We made use of config files. Whenever an experiment is run the following happens: ... . To reproduce an experiment one would have to do ...
Answer:
--- question 13 fill here ---
"},{"location":"reports/#question-14","title":"Question 14","text":"Upload 1 to 3 screenshots that show the experiments that you have done in W&B (or another experiment tracking service of your choice). This may include loss graphs, logged images, hyperparameter sweeps etc. You can take inspiration from this figure. Explain what metrics you are tracking and why they are important.
Recommended answer length: 200-300 words + 1 to 3 screenshots.
Example: As seen in the first image when have tracked ... and ... which both inform us about ... in our experiments. As seen in the second image we are also tracking ... and ...
Answer:
--- question 14 fill here ---
"},{"location":"reports/#question-15","title":"Question 15","text":"Docker is an important tool for creating containerized applications. Explain how you used docker in your experiments? Include how you would run your docker images and include a link to one of your docker files.
Recommended answer length: 100-200 words.
Example: For our project we developed several images: one for training, inference and deployment. For example to run the training docker image: docker run trainer:latest lr=1e-3 batch_size=64
. Link to docker file:
Answer:
--- question 15 fill here ---
"},{"location":"reports/#question-16","title":"Question 16","text":"When running into bugs while trying to run your experiments, how did you perform debugging? Additionally, did you try to profile your code or do you think it is already perfect?
Recommended answer length: 100-200 words.
Example: Debugging method was dependent on group member. Some just used ... and others used ... . We did a single profiling run of our main code at some point that showed ...
Answer:
--- question 16 fill here ---
"},{"location":"reports/#working-in-the-cloud","title":"Working in the cloud","text":"In the following section we would like to know more about your experience when developing in the cloud.
"},{"location":"reports/#question-17","title":"Question 17","text":"List all the GCP services that you made use of in your project and shortly explain what each service does?
Recommended answer length: 50-200 words.
Example: We used the following two services: Engine and Bucket. Engine is used for... and Bucket is used for...
Answer:
--- question 17 fill here ---
"},{"location":"reports/#question-18","title":"Question 18","text":"The backbone of GCP is the Compute engine. Explained how you made use of this service and what type of VMs you used?
Recommended answer length: 100-200 words.
Example: We used the compute engine to run our ... . We used instances with the following hardware: ... and we started the using a custom container: ...
Answer:
--- question 18 fill here ---
"},{"location":"reports/#question-19","title":"Question 19","text":"Insert 1-2 images of your GCP bucket, such that we can see what data you have stored in it. You can take inspiration from this figure.
Answer:
--- question 19 fill here ---
"},{"location":"reports/#question-20","title":"Question 20","text":"Upload one image of your GCP artifact registry, such that we can see the different images that you have stored. You can take inspiration from this figure.
Answer:
--- question 20 fill here ---
"},{"location":"reports/#question-21","title":"Question 21","text":"Upload one image of your GCP cloud build history, so we can see the history of the images that have been build in your project. You can take inspiration from this figure.
Answer:
--- question 21 fill here ---
"},{"location":"reports/#question-22","title":"Question 22","text":"Did you manage to deploy your model, either in locally or cloud? If not, describe why. If yes, describe how and preferably how you invoke your deployed service?
Recommended answer length: 100-200 words.
Example: For deployment we wrapped our model into an application using ... . We first tried serving the model locally, which worked. Afterwards we deployed it in the cloud, using ... . To invoke the service a user would call curl -X POST -F \"file=@file.json\"<weburl>
Answer:
--- question 22 fill here ---
"},{"location":"reports/#question-23","title":"Question 23","text":"Did you manage to implement monitoring of your deployed model? If yes, explain how it works. If not, explain how monitoring would help the longevity of your application.
Recommended answer length: 100-200 words.
Example: We did not manage to implement monitoring. We would like to have monitoring implemented such that over time we could measure ... and ... that would inform us about this ... behaviour of our application.
Answer:
--- question 23 fill here ---
"},{"location":"reports/#question-24","title":"Question 24","text":"How many credits did you end up using during the project and what service was most expensive?
Recommended answer length: 25-100 words.
Example: Group member 1 used ..., Group member 2 used ..., in total ... credits were spent during development. The service costing the most was ... due to ...
Answer:
--- question 24 fill here ---
"},{"location":"reports/#overall-discussion-of-project","title":"Overall discussion of project","text":"In the following section we would like you to think about the general structure of your project.
"},{"location":"reports/#question-25","title":"Question 25","text":"Include a figure that describes the overall architecture of your system and what services that you make use of. You can take inspiration from this figure. Additionally in your own words, explain the overall steps in figure.
Recommended answer length: 200-400 words
Example:
The starting point of the diagram is our local setup, where we integrated ... and ... and ... into our code. Whenever we commit code and push to github, it auto triggers ... and ... . From there the diagram shows ...
Answer:
--- question 25 fill here ---
"},{"location":"reports/#question-26","title":"Question 26","text":"Discuss the overall struggles of the project. Where did you spend most time and what did you do to overcome these challenges?
Recommended answer length: 200-400 words.
Example: The biggest challenges in the project were using ... tool to do ... . The reason for this was ...
Answer:
--- question 26 fill here ---
"},{"location":"reports/#question-27","title":"Question 27","text":"State the individual contributions of each team member. This is required information from DTU, because we need to make sure all members contributed actively to the project
Recommended answer length: 50-200 words.
Example: Student sXXXXXX was in charge of setting up the initial cookiecutter project and developing the docker containers for training our applications. Student sXXXXXX was in charge of training our models in the cloud and deploying them afterwards. All members contributed to code by...
Answer:
--- question 27 fill here ---
"},{"location":"s10_extra/","title":"Extra learning modules","text":"All modules listed here are not part of the core course but expand on some of the other topics. Some of them may still be under construction and may in the future be moved into other sessions.
Learn how to set up a simple documentation system for your application
M32: Documentation
Learn how to do hyperparameter optimization using Optuna
M33: Hyperparameter Optimization
Learn how to use HPC systems that use PBS to do job scheduling
M34: High Performance Clusters
Danger
Module is still under development
"},{"location":"s10_extra/calibration/#methods","title":"Methods","text":""},{"location":"s10_extra/calibration/#exercises","title":"\u2754 Exercises","text":"Implement a script
Implement temperature scaling (a minimal sketch of a wrapper is included after this exercise list)
Implement label smoothing
alpha = 0.1\nfor i in range(len(y_true)):\n y_true[i] = (1 - alpha) * y_true[i] + alpha / num_classes\n
Implement mixup
Implement cutmix
Implement the Focal Loss
Implement it in a continuous integration setup
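For the temperature scaling exercise above, a minimal sketch of what such a wrapper could look like; fitting the temperature on a validation set is left as part of the exercise:

import torch

class TemperatureScaler(torch.nn.Module):
    # Wraps a trained classifier and divides its logits by a learned temperature
    def __init__(self, model: torch.nn.Module) -> None:
        super().__init__()
        self.model = model
        self.temperature = torch.nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x) / self.temperature  # T > 1 softens the predicted probabilities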
Danger
Module is still under development
\"Machine learning engineering is 10% machine learning and 90% engineering.\" - Chip Huyen
We highly recommend that you read the book Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications by Chip Huyen which gives an fantastic overview of the thought processes that goes into designing moder machine learning systems.
"},{"location":"s10_extra/design/#the-stack","title":"The stack","text":"Have you ever encountered the concept of full stack developer. A full stack developer is an developer who can both develop client and server software or in more general terms, it is a developer who can take care of the complete developer pipeline.
Below is seen an image of the massive amounts of tools that exist within the MLOps umbrella.
"},{"location":"s10_extra/design/#visualizing-the-design","title":"Visualizing the design","text":""},{"location":"s10_extra/documentation/","title":"M32 - Documentation","text":""},{"location":"s10_extra/documentation/#documentation","title":"Documentation","text":"In today's rapidly evolving software development landscape, effective documentation is a crucial component of any project. The ability to create clear, concise, and user-friendly technical documentation can make a significant difference in the success of your codebase. We all probably encountered code that we wanted to use, only for us to abandon using it because it was missing documentation such that we could get started with it.
Technical documentation or code documentation can be many things:
and many more. We are in this module going to focus on setting up a very basic documentation system that will automatically help you document the API of your code. For this reason we recommend that before continuing with this module that you have completed module M7 on good coding practices or have similar experience with writing docstrings for Python functions and classes.
There are different systems for writing documentation. In fact there is a lot to choose from:
Important to note that all these are static site generators. The word static here refers to that when the content is generated and served on a webside, the underlying HTML code will not change. It may contain HTML elements that dynamic (like video), but the site does not change (1).
We are in this module going to look at Mkdocs, which (in my opinion) is one of the easiest systems to get started with because all documentation is written in markdown and the build system is written in Python. As an alternativ, you can consider doing the exercises in Sphinx which is probably the most used documentation system for Python code. Sphinx offer more customization than Mkdocs, so is generally preferred for larger projects with complex documentation, but for smaller projects Mkdocs should be easier to get started with and is sufficient.
Mkdocs by default does not include many features and for that reason we are directly going to dive into using the material for mkdocs theme that provides a lot of nice customization to create professional static sites. In fact, this whole course is written in mkdocs using the material theme.
"},{"location":"s10_extra/documentation/#mkdocs","title":"Mkdocs","text":"The core file when using mkdocs is the mkdocs.yaml
file, which is the configuration file for the project:
site_name: Documentation of my project\nsite_author: Jane Doe\ndocs_dir: source # (1)!\n\ntheme:\n language: en\n name: material # (2)!\n features: # (3)!\n - content.code.copy\n - content.code.annotate\n\nplugins: # (4)!\n - search\n - mkdocstrings\n\nnav: # (5)!\n - Home: index.md\n
This indicates the source directory of our documentation. If the layout of your documentation is a bit different than what described above, you may need to change this.
The overall theme of your documentation. We recommend the material
theme but there are many more to choose from and you can also create your own.
The featuers
section is where features that are supported by your given theme can be enabled. In this example we have enabled content.code.copy
feature which adds a small copy button to all code block and the content.code.annotate
feature which allows you to add annotations like this box to code blocks.
Plugins add new functionality to your documentation. In this case we have added two plugins that add functionality for searching through our documentation and automatically adding documentation from docstrings. Remember that some plugins requires you to install additional Python packages with those plugins, so remember to add them to your requirements.txt
file.
The nav
section is where you define the navigation structure of your documentation. When you add new .md
files to the source
folder you then need to add them to the nav
section.
And that is more or less what you need to get started. In general, if you need help with configuration of your documentation in mkdocs I recommend looking at this page and this page.
"},{"location":"s10_extra/documentation/#exercises","title":"Exercises","text":"In this set of exercises we assume that you have completed module M6 on code structure and therefore have a repository that at least contains the following:
\u251c\u2500\u2500 pyproject.toml <- Project configuration file with package metadata\n\u2502\n\u251c\u2500\u2500 docs <- Documentation folder\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 index.md <- Homepage for your documentation\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 mkdocs.yaml <- Configuration file for mkdocs\n\u2502 \u2502\n\u2502 \u2514\u2500\u2500 source/ <- Source directory for documentation files\n\u2502\n\u2514\u2500\u2500 src <- Source code for use in this project.\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 __init__.py <- Makes src a Python module\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 models <- model implementations, training script\n\u2502 \u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u2502 \u251c\u2500\u2500 model.py\n\u2502 \u2502 \u251c\u2500\u2500 train_model.py\n...\n
It is not important exactly what is in the src
folder for the exercises, but we are going to refer to the above structure in the exercises, so adjust accordingly if you diviate from this. Additionally, we are going to assume that your project code is installed in your environment such that it can be imported as normal Python code.
We are going to need two Python packages to get started: mkdocs and material for mkdocs. Install with
pip install \"mkdocs-material >= 4.8.0\" # (1)!\n
mkdocs
is a dependency of mkdocs-material
we only need to install the latter.Run in your terminal (from the docs
folder):
mkdocs serve # (1)!\n
mkdocs serve
will automatically rebuild the whole site whenever you save a file inside the docs
folder. This is not a problem if you have a fairly small site with not that many pages (or elements), but can take a long time for large sites. Consider running with the --dirty
option for only re-building the site for files that have been changed.which should render the index.md
file as the homepage. You can leave the documentation server running during the remaining exercises.
We are no ready to document the API of our code:
Make sure you at least have one function and class inside your src
module. If you do not have you can for simplicity copy the following module to the src/models/model.py
file
import torch\n\nclass MyNeuralNet(torch.nn.Module):\n \"\"\"Basic neural network class.\n\n Args:\n in_features: number of input features\n out_features: number of output features\n\n \"\"\"\n def __init__(self, in_features: int, out_features: int) -> None:\n self.l1 = torch.nn.Linear(in_features, 500)\n self.l2 = torch.nn.Linear(500, out_features)\n self.r = torch.nn.ReLU()\n\n def forward(self, x: torch.Tensor) -> torch.Tensor:\n \"\"\"Forward pass of the model.\n\n Args:\n x: input tensor expected to be of shape [N,in_features]\n\n Returns:\n Output tensor with shape [N,out_features]\n\n \"\"\"\n return self.l2(self.r(self.l1(x)))\n
and the following function to add src/predict_model.py
file:
def predict(\n model: torch.nn.Module,\n dataloader: torch.utils.data.DataLoader\n) -> None:\n \"\"\"Run prediction for a given model and dataloader.\n\n Args:\n model: model to use for prediction\n dataloader: dataloader with batches\n\n Returns\n Tensor of shape [N, d] where N is the number of samples and d is the output dimension of the model\n\n \"\"\"\n return [model(batch) for batch in dataloader]\n
Add a markdown file to the docs/source
folder called my_api.md
and add that file to the nav:
section in the mkdocs.yaml
file.
To that file add the following code:
# My API\n\n::: src.models.model.MyNeuralNet\n\n::: src.predict_model.predict\n
The :::
indicator tells mkdocs that it should look for the corresponding function/module and then render it on the given page. Thus, if you have a function/module located in another location change the paths accordingly.
Make sure that the documentation correctly includes your function and module on the given page.
(Optional) Include more functions/modules in your documentation.
(Optional) Look through the documentation for mkdocstrings and try to improve the layout a bit. Especially, the headings, docstrings and signatures could be of interest to adjust.
Finally, try to build a final version of your documentation
mkdocs build\n
this should result in a site
folder that contains the actual HTML code for documentation.
To publish your documentation you need a place to host your build documentation e.g. the content of the site
folder you build in the last exercise. There are many places to host your documentation, but if you only need a static site and are already hosting your code through Github, then a good option is Github Pages. Github pages is free to use for your public projects.
Before getting started with this set of exercises you should have completed module M16 on GitHub actions so you already know about workflow files.
"},{"location":"s10_extra/documentation/#exercises_1","title":"Exercises","text":"Start by adding a new file called deploy_docs.yaml
to the .github/workflows
folder. Add the following cod to that file and save it.
name: Deploy docs\n\non:\npush:\n branches:\n - main\n\npermissions:\n contents: write # (1)!\n\njobs:\n deploy:\n name: Deploy docs\n runs-on: ubuntu-latest\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n with:\n fetch-depth: 0\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n\n - name: Install dependencies\n run: pip install -r requirements.txt\n\n - name: Deploy docs\n run: mkdocs gh-deploy --force\n
write
permissions to this actions because it is not only reading your code but it will also push code.Before continuing, make sure you understand what the different steps of the workflow does and especially we recommend looking at the documentation of the mkdocs gh-deploy
command.
Commit and push the file. Check that the action is executed and if it succeeds, that your build project is pushed to a branch called gh-pages
. If the action does not succeeds, then figure out what is wrong and fix it!
After confirming that our action is working, you need to configure Github to publish the content being build by Github Actions. Do the following:
Source
setting choose the Deploy from a branch
Branch
setting choose the gh-pages
branch and /(root)
folder and save
This should then start deploying your site to https://<your-username>.github.io/<your-reponame>/
. If it does not, you may need to recommit and trigger the GitHub Actions build again.
Make sure your documentation is published and looks as it should.
This ends the module on technical documentation. We cannot stress enough how important it is to write proper documentation for larger projects that need to be maintained over a longer time. Writing documentation is an iterative process, but it is best done while writing the code.
"},{"location":"s10_extra/high_performance_clusters/","title":"M34 - High Performance Clusters","text":""},{"location":"s10_extra/high_performance_clusters/#high-performance-clusters","title":"High Performance Clusters","text":"As discussed in the intro session on the cloud, cloud providers offers near infinite compute resources. However, using these resources comes at a hefty price often and it is therefore important to be aware of another resource many have access to: High Performance Clusters or HPC. HPCs exist all over the world, and many time you already have access to one or can easily get access to one. If you are an university student you most likely have a local HPC that you can access through your institution. Else, there exist public HPC resources that everybody (with a project) can apply for. As an example in the EU we have EuroHPC initiative that currently has 8 different supercomputers with a centralized location for applying for resources that are both open for research projects and start-ups.
Depending on your application, you may have different needs, and it is therefore important to also be aware of the different tiers of HPC. In Europe, HPC systems are often categorized such that Tier 0 are European centers with petaflop or exascale machines, Tier 1 are national centers of supercomputers, and Tier 2 are regional centers. The lower the tier number, the larger the applications it is possible to run.
"},{"location":"s10_extra/high_performance_clusters/#cluster-architectures","title":"Cluster architectures","text":"In very general terms, cluster can come as two different kind of systems: supercomputers and LSF (Load Sharing Facility). A supercomputer (as shown below) is organized into different modules, that are separated by network link. When you login to a supercomputer you will meet the front end which contains all the software needed to run computations. When you submit a job it will get sent to the backend modules which in most cases includes: general compute modules (CPU), acceleration modules (GPU), a memory module (RAM) and finally a storage module (HDD). Depending on your application you may need one module more than another. For example in deep learning the acceleration module is important but in physics simulation the general compute module / storage model is probably more important.
Overview of the Meluxina supercomputer that's part of EuroHPC. Image creditAlternatively, an LSF system is a network of computers where each computer has its own CPU, GPU, RAM, etc., and the individual computers (or nodes) are then connected by a network. The important difference between a supercomputer and an LSF system is how the resources are organized. When comparing the two, it is generally better to run on an LSF system if you are only requesting resources that can be handled by a single node, whereas it is better to run on a supercomputer if you have a resource-intensive application that requires many devices to communicate with each other.
Regardless of the cluster architecture, on the software side of HPC the most important part is what's called the HPC scheduler. Without an HPC scheduler, an HPC cluster would just be a bunch of servers with different jobs interfering with each other. The problem is that when you have a large collection of resources and a large collection of users, you cannot rely on the users running their applications without interfering with each other. An HPC scheduler makes sure that whenever a user requests to run an application, the request gets put in a queue, and whenever the resources the application asks for become available, the application gets run.
The biggest batch control systems for doing scheduling on HPC are:
We are going to take a look at how PBS works, as that is what is installed on our local university cluster.
"},{"location":"s10_extra/high_performance_clusters/#exercises","title":"\u2754 Exercises","text":"Exercise files
The following exercises are focused on local students at DTU who want to use our local HPC resources. That said, the steps in the exercises are fairly general and apply to other types of clusters. For the purpose of this exercise we are going to see how we can run this image classifier script, but feel free to work with whatever application you want.
Start by accessing the cluster. This can either be through ssh
in a terminal or, if you want a graphical interface, thinlinc can be installed. In general we recommend that DTU students follow the steps here, as the setup depends on whether you are on campus or not.
When you have access to the cluster, we are going to start with the setup phase, where we set up the environment necessary for our computations. If you have accessed the cluster through the graphical interface, start by opening a terminal.
Let's start by setting up conda for controlling our dependencies. If you have not already worked with conda
, please check out module M2 on package managers and virtual environments. In general, you should be able to set up (mini)conda with these two commands:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nsh Miniconda3-latest-Linux-x86_64.sh\n
Close the terminal and open a new one for the installation to complete. Type conda
in the terminal to check that everything is fine. Go ahead and create a new environment that we can install dependencies in
conda create -n \"hpc_env\" python=3.10 --no-default-packages\n
and activate it.
Copy over any files you need. For the image classifier script you need the requirements file and the actual application.
Next, install all the requirements you need. If you want to run the image classifier script you can run this command in the terminal
pip install -r image_classifier_requirements.txt\n
using this requirements file.
That's all the setup needed. You will need to go through the creation of an environment and the installation of requirements whenever you start a new project (no need to reinstall conda). Next, we need to look at how to submit jobs on the cluster. We are now ready to submit our first job to the cluster:
Start by checking the statistics for the different clusters. Try to use both the qstat
command, which should give an overview of the different clusters, the number of running jobs and the number of pending jobs. On many systems you can also try the much more user-friendly classstat
command.
Figure out which queue you want to use. For the sake of the exercises it needs to be one with GPU support. For DTU students, any queue that starts with gpu
is GPU accelerated.
Now we are going to develop a bash script for submitting our job. We have provided an example of such a script. Take a careful look, go through each line, and make sure you understand it. Afterwards, change it to your needs (queue and student email).
Try to submit the script:
bsub < jobscript.sh\n
You can check the status of your script by running the bstat
command. Hopefully, the job should go through really quickly. Take a look at the output file, it should be called something like gpu_*.out
. Also take a look at the gpu_*.err
file. Do both files look as they should?
Let's now try to run our application on the cluster. To do that, we need to take care of two things:
First we need to load the correct version of CUDA. A cluster system often contains multiple versions of specific software to suit the needs of all its users, and it is the user who is in charge of loading the correct software during job submission. The only extra software that needs to be loaded for most PyTorch applications is a CUDA module. You can check which modules are available on the cluster with
module avail\n
Afterwards, add the correct CUDA version you need to the jobscript.sh
file. If you are trying to run the provided image classifier script then the correct version is CUDA/11.7
(can be seen in the requirements file).
# add to the bottom of the file\nmodule load cuda/11.7\n
We are now ready to add in our application. The only thing we need to take care of is telling the system to run it using the python
version that is connected to our hpc_env
environment that we created in the beginning. Try typing:
which python\n
which should give you the full path. Then add to the bottom of the jobscript
file:
~/miniconda3/envs/hpc_env/bin/python \\\n image_classifier.py \\\n --trainer.accelerator 'gpu' --trainer.devices 1 --trainer.max_epochs 5\n
which will run the image classifier script (change it if you are running something else).
Finally submit the job:
bsub < jobscript.sh\n
and check when it is done that it has produced what you expected.
(Optional) If your application supports multiple GPUs, also try that out. You first need to change the jobscript to request multiple GPUs and additionally tell your application to run on multiple GPUs. For the image classifier script this can be done by changing the --trainer.devices
flag to 2
(or higher).
This ends the module on using HPC systems.
"},{"location":"s10_extra/hyperparameters/","title":"M33 - Hyperparameter optimization","text":""},{"location":"s10_extra/hyperparameters/#hyperparameter-optimization","title":"Hyperparameter optimization","text":"Outdated module
This module has not been updated for a long time, and therefore some functionality of Optuna, which is used in these exercises, may not be covered. If you have completed the module on Weights & Biases, then we highly recommend using their sweep functionality instead.
Hyperparameter optimization is not a new idea within machine learning, but it has seen somewhat of a renaissance with the rise of deep learning. This can mainly be attributed to the following:
However, the problem with doing hyperparameter optimization of deep learning models is that it can take over a week to train a single model. In most cases we therefore cannot do a full grid search over all hyperparameter combinations to get the best model. Instead, we have to use some tricks that will help us speed up the search. In these exercises we are going to integrate Optuna into our models, which will provide the tools for speeding up the search.
It should be noted that for a lot of deep learning models, not every hyperparameter is optimized; instead, practitioners rely on heuristic guidelines (\"rules of thumb\") based on what seems to work in general, e.g. a learning rate of 0.01 seems to work great with the Adam optimizer. That said, these rules probably only apply to 80% of deep learning models, whereas for the last 20% the recommendations may be suboptimal. Here is a great site that has collected an extensive list of these recommendations, taken from the excellent deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
In practice, I recommend trying to identify (through experimentation) which hyperparameters are important for the performance of your model, and then spending your computational budget on optimizing them while setting the rest to a \"recommended value\".
"},{"location":"s10_extra/hyperparameters/#exercises","title":"\u2754 Exercises","text":"Exercise files
Start by installing optuna: pip install optuna
Initially we will look at the cross_validate.py
file. It implements simple K-fold cross validation of a random forest model on the sklearn digits dataset (a subset of MNIST). Look over the script and try to run it; a rough sketch of what it does is shown below.
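In case you want a mental model of what the script does before opening it, here is a minimal sketch (the actual cross_validate.py may differ in details such as the exact hyperparameter values):
from sklearn.datasets import load_digits\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import cross_val_score\n\n# load the digits dataset (a small 8x8-pixel dataset similar to MNIST)\nx, y = load_digits(return_X_y=True)\n\n# K-fold cross validation of a random forest with fixed hyperparameters\nmodel = RandomForestClassifier(n_estimators=100, max_depth=10)\nscores = cross_val_score(model, x, y, cv=5)\nprint(f\"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}\")\n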
We will now try to write the same code with Optuna. Please note that the script has a variable OPTUNA=False
that you can use to change which part of the code should run. The three main concepts of Optuna are
A trial: a single experiment
A study: a collection of trials
The objective: function to determine how \"good\" a trial is
Let's start by writing the objective function, which we have already started in the script. For now, you do not need to care about the trial
argument, just assume that it contains the hyperparameters needed to define your random forest model. The output of the objective function should be a single number that we want to optimize. (HINT: did you remember to do K-fold crossvalidation inside your objective function?)
Next, let's focus on the trial. Inside the objective
function, the trial should be used to suggest which parameters to use next. Take a look at the documentation for trial or at the code examples and figure out how to define the hyperparameters of the model; a minimal sketch is shown below.
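If you get stuck, here is a minimal sketch of how the trial and the objective fit together (the hyperparameter names and ranges are only placeholders):
import optuna\nfrom sklearn.datasets import load_digits\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import cross_val_score\n\n\ndef objective(trial: optuna.Trial) -> float:\n    \"\"\"Let the trial suggest hyperparameters and return a single score to optimize.\"\"\"\n    n_estimators = trial.suggest_int(\"n_estimators\", 10, 200)\n    max_depth = trial.suggest_int(\"max_depth\", 2, 20)\n    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)\n    # remember to do K-fold cross validation inside the objective\n    x, y = load_digits(return_X_y=True)\n    return cross_val_score(model, x, y, cv=5).mean()\n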
Finally, let's launch a study. It can be as simple as
study = optuna.create_study()\nstudy.optimize(objective, n_trials=100)\n
but let's play around a bit with it:
By default the .optimize
method will minimize the objective (by definition the optimum of an objective function is at its minimum). Is the score your objective function is returning something that should be minimized? If not, a simple solution is to put a -
in front of the metric. A cleaner solution is to look through the documentation on how to change the direction of the optimization, as sketched below.
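A sketch of how the direction can be set when creating the study:
study = optuna.create_study(direction=\"maximize\")  # the default is \"minimize\"\nstudy.optimize(objective, n_trials=100)\n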
Optuna will by default do Bayesian optimization when sampling the hyperparameters (using the Tree-structured Parzen Estimator sampler to suggest new trials). However, since this example is quite simple, we can actually perform a full grid search. How would you do this in Optuna?
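One possible way (a sketch, reusing the placeholder hyperparameter names from the objective above) is to pass a GridSampler with the full search space:
search_space = {\"n_estimators\": [10, 50, 100, 200], \"max_depth\": [2, 5, 10, 20]}\nstudy = optuna.create_study(direction=\"maximize\", sampler=optuna.samplers.GridSampler(search_space))\nstudy.optimize(objective, n_trials=16)  # 4 x 4 combinations in the grid\n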
Compare the performance of a single Optuna run using Bayesian optimization with n_trials=10
with an exhaustive grid search that searches through all hyperparameter combinations. What is the performance/time trade-off between these two solutions?
In addition to doing Bayesian optimization, the other great part about Optuna is that it has native support for pruning unpromising trials. Pruning refers to stopping trials for hyperparameter combinations that do not seem to lead anywhere. You may have a learning rate that is so high that training diverges, or a neural network with so many parameters that it just overfits to the training data. This raises the question: what constitutes an unpromising trial? This is up to you to define based on prior experimentation.
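To make the mechanics concrete before you start, here is a minimal, runnable sketch of how pruning hooks into an objective; the per-epoch metric is just simulated here and stands in for your real validation score:
import random\n\nimport optuna\n\n\ndef objective(trial: optuna.Trial) -> float:\n    \"\"\"Toy objective that reports an intermediate value each epoch.\"\"\"\n    trial.suggest_float(\"lr\", 1e-5, 1e-1, log=True)\n    accuracy = 0.0\n    for epoch in range(10):\n        accuracy += random.uniform(0.0, 0.1)  # stand-in for one epoch of training + validation\n        trial.report(accuracy, step=epoch)  # report the intermediate value to the pruner\n        if trial.should_prune():  # the pruner decides whether to stop this trial early\n            raise optuna.TrialPruned()\n    return accuracy\n\n\nstudy = optuna.create_study(direction=\"maximize\", pruner=optuna.pruners.MedianPruner(n_warmup_steps=3))\nstudy.optimize(objective, n_trials=20)\n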
Start by looking at the fashion_trainer.py
script. It's a simple classification network for classifying images in the FashionMNIST dataset. Run the script with the default hyperparameters to get a feeling for how the training should progress. Note down the performance on the test set.
Start by defining a validation set and a validation dataloader that we can use for hyperparameter optimization (HINT: use 5-10% of your training data, as sketched below).
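A sketch of how the split could be done, assuming train_set is a regular PyTorch Dataset:
import torch\n\nn_val = int(0.1 * len(train_set))  # hold out 10% of the training data for validation\ntrain_subset, val_subset = torch.utils.data.random_split(train_set, [len(train_set) - n_val, n_val])\nval_dataloader = torch.utils.data.DataLoader(val_subset, batch_size=64)\n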
Now, adjust the script to use Optuna. The 5 hyperparameters listed in the table above should at least be included in the hyperparameter search. For some we have already defined the search space, but for the remaining ones you need to come up with a good range of values to investigate. When done integrating Optuna, run a small study (n_trials=3
) to check that the code is working.
nn.ReLU
, nn.Tanh
, nn.RReLU
, nn.LeakyReLU
, nn.ELU
} If implemented correctly the number of hyperparameter combinations should be at least 1000, meaning that we not only need baysian optimization but probably also need pruning to succeed. Checkout the page for built-in pruners in Optuna. Implement pruning in the script. I recommend using either the MedianPruner
or the ProcentilePruner
.
Re-run the study using pruning with a large number of trials (n_trials>50
)
Take a look at this visualization page for ideas on how to visualize the study you just did. Make at least two visualization of the study and make sure that you understand them.
Pruning is great for better spending your computational budged, however it comes with a trade-off. What is it and what hyperparameter should one be especially careful about when using pruning?
Finally, what parameter combination achieved the best performance? What is the test set performance for this set of parameters. Did you improve over the initial set of hyperparameters?
The exercises until now have focused on doing the hyperparameter searching sequentially, meaning that we test one set of parameters at the time. It is a fine approach because you can easily let it run for a week without any interaction. However, assuming that you have the computational resources to run in parallel, how do you do that?
To run hyperparameter search in parallel we need a common database that all experiments can read and write to. We are going to use the recommended mysql
. You do not have to understand what SQL is to complete this exercise, but it is basically a language (like python) for managing databases. Install mysql.
Next we are going to initialize a database that we can read and write to. For this exercises we are going to focus on a locally stored database but it could of course also be located in the cloud.
mysql -u root -e \"CREATE DATABASE IF NOT EXISTS example\"\n
you can also do this directly in Python when calling the create_study
command by also setting the storage
and load_if_exists=True
flags.
Now we are going to create a Optuna study in our database
optuna create-study --study-name \"distributed-example\" --storage \"mysql://root@localhost/example\"\n
Change how you initialize the study to read and write to the database. Therefore, instead of doing
study = optuna.create_study()\n
then do
study = optuna.load_study(\n study_name=\"distributed-example\", storage=\"mysql://root@localhost/example\"\n)\n
where the study_name
and storage
should match how the study was created.
For running in parallel, you can either open up a extra terminal and simple launch your script once per open terminal or you can use the provided parallel_lancher.py
that will launch multiple executions of your script. It should be used as:
python parallel_lancher.py myscript.py --num_parallel 2\n
Finally, make sure that you can access the results
That's all on how to do hyperparameter optimization in a scalable way. If you feel like it you can try to apply these techniques on the ongoing corrupted MNIST example, where you are free to choose what hyperparameters that you want to use.
"},{"location":"s10_extra/infrastructure_as_code/","title":"Infrastructure as code","text":"Danger
Module is still under development
"},{"location":"s10_extra/infrastructure_as_code/#infrastructure-as-code-iac","title":"Infrastructure as Code (IaC)","text":"Infrastructure as Code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. The IT infrastructure managed by this comprises both physical equipment such as bare-metal servers as well as virtual machines and associated configuration resources. The definitions are written in a high-level programming language and can be versioned, and the code can be tested and validated.
"},{"location":"s10_extra/infrastructure_as_code/#terraform","title":"Terraform","text":"Terraform is an open-source infrastructure as code software tool created by HashiCorp. It enables users to define and provision a datacenter infrastructure using a high-level configuration language known as Hashicorp Configuration Language (HCL), or optionally JSON. It allows infrastructure to be expressed as code in a simple, human-readable language called HCL (HashiCorp Configuration Language). It supports a multitude of cloud providers, including AWS, Azure, Google Cloud, and many others.
"},{"location":"s10_extra/infrastructure_as_code/#installation","title":"Installation","text":"To install Terraform, download the appropriate package for your operating system from the official Terraform website. Once downloaded, unzip the package and move the binary to a directory included in your system's PATH.
"},{"location":"s10_extra/infrastructure_as_code/#getting-started","title":"Getting started","text":"To get started with Terraform, you need to create a configuration file. This file is a human-readable file that describes the infrastructure and set of resources to be created. The file is saved with a .tf
extension. Here is an example of a simple Terraform configuration file that creates an AWS EC2 instance:
provider \"aws\" {\n region = \"us-west-2\"\n}\n\nresource \"aws_instance\" \"example\" {\n ami = \"ami-0c55b159cbfafe1f0\"\n instance_type = \"t2.micro\"\n}\n
To create the infrastructure described in the configuration file, navigate to the directory containing the file and run the following commands:
terraform init\nterraform apply\n
The terraform init
command is used to initialize a working directory containing Terraform configuration files. This is the first command that should be run after writing a new Terraform configuration or cloning an existing one from version control. The terraform apply
command is used to apply the changes required to reach the desired state of the configuration, or the pre-determined set of actions generated by a terraform plan
execution plan.
Danger
Module is still under development
"},{"location":"s10_extra/kubernetes/#kubernetes","title":"Kubernetes","text":"Kubernetes, also known as K8s, is an open-source platform designed to automate the deployment, scaling, and operation of application containers across clusters of hosts. It provides the framework to run distributed systems resiliently, handling scaling and failover for your applications, providing deployment patterns, and more.
"},{"location":"s10_extra/kubernetes/#what-is-kubernetes","title":"What is Kubernetes?","text":""},{"location":"s10_extra/kubernetes/#brief-history","title":"Brief History","text":"Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation.
"},{"location":"s10_extra/kubernetes/#core-functions","title":"Core Functions","text":"Kubernetes makes it easier to deploy and manage containerized applications at scale.
"},{"location":"s10_extra/kubernetes/#key-concepts","title":"Key Concepts","text":"Kubernetes follows a client-server architecture. At a high level, it consists of a Control Plane (master) and Nodes (workers).
Overview of Kubernetes Architecture. Image Credit: Kubernetes Official Documentation"},{"location":"s10_extra/kubernetes/#control-plane-components","title":"Control Plane Components","text":"Minikube is a tool that allows you to run Kubernetes locally. It runs a single-node Kubernetes cluster inside a VM on your laptop for users looking to try out Kubernetes or develop with it day-to-day.
"},{"location":"s10_extra/kubernetes/#installing-minikube","title":"Installing Minikube","text":"minikube start
.minikube
in a terminal.kubectl
in a terminal.Yatai is a model serving platform, making it easier to deploy machine learning models in Kubernetes environments.
"},{"location":"s10_extra/kubernetes/#what-is-yatai","title":"What is Yatai?","text":"Yatai simplifies the deployment, management, and scaling of machine learning models in Kubernetes.
"},{"location":"s10_extra/kubernetes/#getting-started-with-yatai","title":"Getting Started with Yatai","text":"Danger
Module is still under development
"},{"location":"s10_extra/orchestration/#workflow-orchestration","title":"Workflow orchestration","text":""},{"location":"s10_extra/orchestration/#prefect","title":"Prefect","text":"If you give an MLOps engineer a job
pip install prefect\n
from prefect import task, Flow\n
"},{"location":"s10_extra/orchestration/#exercises","title":"\u2754 Exercises","text":"Start by installing prefect
:
pip install prefect\n
Start a local Prefect server instance in your virtual environment.
prefect server start\n
The great thing about Prefect is that the orchestration tasks and flows are written in pure Python.
Danger
Module is still under development
"},{"location":"s10_extra/quantization/#exercises","title":"\u2754 Exercises","text":"We are in these exercises going to be looking at two different kinds of quantization strategies: quantization-aware training and post-training quantization. As the names suggest, the quantization is either applied while training or after training. There are good reasons for doing both:
If the model you are going to deploy in the end needs to be quantized, either due to hard requirements for how the big the model can be or in the effort to optimize inference time, quantization-aware training is the better approach. The reason here being that the model is specifically optimized to always be quantized and therefore in general end up with a better model.
If the most important metric for deployment is the overall performance of the model with no regards to model size and inference speed, post-training quantization is the better option. This allows you to most likely train a better model to begin with and then try out converting the model afterwards. In the best case this can be done without any hits to performance.
Start by installing intel neural compressor
pip install neural_compressor\n
and remember to add this to your requirements.txt
file.
Let's start a new script called model_converter.py
. Start by filling it with some simple code for loading a given float32
model checkpoint. You should already have such code from earlier exercises. Preferably, add a small CLI interface to load a model by passing the filename in the command line:
python model_converter.py model_checkpoint.ckpt\n
Solution We are here going to assume that you are either loading from a onnx
model or alternatively loading a PyTorch Lightning checkpoint:
from typer import App\nimport onnx\nfrom onnx.onnx_ml_pb2 import ModelProto\nfrom pytorch_lightning import LightningModule\nfrom my_model import MyModel\napp = App()\n\n@app.command()\n@app.argument(\"model_checkpoint\")\ndef quantize(model_checkpoint: ModelProto | LightningModule) -> None:\n if isinstance(model_checkpoint, LightningModule):\n model = MyModel.load_from_checkpoint(model_checkpoint)\n else:\n model = onnx.load(model_checkpoint)\n
Next you also need to add
Finally, calculate the size (in MB) of the original model and the quantized model. How much smaller is the quantized model?
SolutionAssuming the models are saved as checkpoint.ckpt
and checkpoint_quantized.ckpt
we can calculate the size using os.path.getsize
in Python:
original_size = os.path.getsize(\"models/checkpoint.onnx\") / (1024 * 1024)\nquantized_size = os.path.getsize(\"models/checkpoint_quantized.onnx\") / (1024 * 1024)\n
The quantized model should be very close to 4 times smaller as int4
only uses 1/4 the bits to store weights compared to float32
format.
Slides
Learn the basics of the command line, and how to use it to navigate your file system and run programs.
M1: Command line
Learn how package managers work in Python and how to create reproducible virtual environments using conda
and pip
.
M2: Package Manager
Learn how to use a modern editor for code development.
M3: Editor
Refresh your PyTorch skills and implement a simple deep-learning model.
M4: Deep Learning Software
Today we start our journey into the world of machine learning operations (MLOps). However, before we can get started, we need to make sure that you have a basic understanding of a couple of topics, as we will be using these throughout the course. In particular, today is all about getting set up with a proper development environment that can support your journey. Most of you probably already have experience with these topics, and it will be mostly repetition.
The reason we are starting here is that many students are missing very basic skills that are never taught but are just expected to be picked up by yourself. This session will only cover the most basic skills to get you started on your development journey. If you wish to learn more about basic computer science skills in general, we highly recommend that you check out The Missing Semester of Your CS Education course from MIT.
Learning objectives
The learning objectives of this session are:
Core Module
Image creditContrary to popular belief, the command line (also commonly known as the terminal) is not a mythical being that has existed since the dawn of time. Instead, it was created at a time when it was not given that your computer had a graphical interface that you could interact with. Think of it as a text interface to your computer.
The terminal is a well-known concept to users of Linux; however, MAC and (especially) Windows users often do not need it and therefore encounter it. Having a basic understanding of how to use a command line can help improve your workflow. The reason that the command line is an important tool to get to know is that doing any kind of MLOps will require us to be able to interact with many different tools, many of which do not have a graphical interface. Additionally, when we get to working in the cloud later in the course, you will be forced to interact with the command line.
Note if you already are a terminal wizard then feel free to skip the exercises below. They are very elementary.
"},{"location":"s1_development_environment/command_line/#the-anatomy-of-the-command-line","title":"The anatomy of the command line","text":"Regardless of the operating system, all command lines look more or less the same:
As already stated, it is essentially just a big text interface to interact with your computer. As the image illustrates, when trying to execute a command, there are several parts to it:
$
, >
, :
are the usual ones. It can also contain other information, such as in the case of the above image which also shows the current conda
environment.ls
or cd
.ls -l
or cd ..
.ls -l figures
or cd ..
.The core difference between options and arguments is that options are optional, while arguments are not.
Image credit"},{"location":"s1_development_environment/command_line/#exercises","title":"\u2754 Exercises","text":"We have put a cheat sheet in the exercise files folder belonging to this session, that gives a quick overview of the different commands that can be executed in the command line.
Windows usersWe highly recommend that you install Windows Subsystem for Linux (WSL). This will install a full Linux system on your Windows machine. Please follow this guide. Remember to run commands from an elevated (as administrator) Windows Command Prompt. You can in general complete all exercises in the course from a normal Windows Command Prompt, but some are easier to do if you run from WSL.
If you decide to run in WSL, you need to remember that you now have two different systems, and installing a package on one system does not mean that it is installed on the other. For example, if you install pip
in WSL, you need to install it again in Windows if you want to use it there.
If you decide to not run in WSL, please always work in a Windows Command Prompt and not Powershell.
Start by opening a terminal.
To navigate inside a terminal, we rely on the cd
command and pwd
command. Make sure you know how to go back and forth in your file system. (1)
The ls
command is important when we want to know the content of a folder. Try to use the command, and also try it with the additional option -l
. What does it show?
Make sure to familiarize yourself with the which
, echo
, cat
, wget
, less
, and top
commands. Also, familiarize yourself with the >
operator. You are probably going to use some of them throughout the course or in your future career. For Windows users, these commands may be named something else, e.g., where
command on Windows corresponds to which
.
It is also significant that you know how to edit a file through the terminal. Most systems should have the nano
editor installed; else, try to figure out which one is installed on your system.
Type nano
in the terminal.
Write the following text in the script
if __name__ == \"__main__\":\n print(\"Hello world!\")\n
Save the script and try to execute it.
Afterward, try to edit the file through the terminal (change Hello world
to something else).
All terminals come with a programming language. The most common system is called bash
, which can come in handy when being able to write simple programs in bash. For example, one case is that you want to execute multiple Python programs sequentially, which can be done through a bash script.
Bash is not part of Windows, so you need to run this part through WSL. If you did not install WSL, you can skip this part or as an alternative do the exercises in Powershell which is the native Windows scripting language (not recommended).
Write a bash script (in nano
) and try executing it:
#!/bin/bash\n# A sample Bash script, by Ryan\necho Hello World!\n
Change the bash script to call the Python program you just wrote.
Try to Google how to write a simple for-loop that executes the Python script 10 times in a row.
A trick you may need throughout this course is setting environment variables. An environment variable is just a dynamically named value that may alter the way running processes behave on a computer. The syntax for setting an environment variable depends on your operating system:
WindowsLinux/Macset MY_VAR=hello\necho %MY_VAR%\n
export MY_VAR=hello\necho $MY_VAR\n
Try to set an environment variable and print it out.
To use an environment variable in a Python program, you can use the os.environ
function from the os
module. Write a Python program that prints out the environment variable you just set.
If you have a collection of environment variables, these can be stored in a file called .env
. The file is formatted as follows:
MY_VAR=hello\nMY_OTHER_VAR=world\n
To load the environment variables from the file, you can use the python-dotenv
package. Install it with pip install python-dotenv
and then try to load the environment variables from the file and print them out.
from dotenv import load_dotenv\nload_dotenv()\nimport os\nprint(os.environ[\"MY_VAR\"])\n
Here is one command from later in the course when we are going to work in the cloud
gcloud compute instances create-with-container instance-1 \\\n --container-image=gcr.io/<project-id>/gcp_vm_tester\n --zone=europe-west1-b\n
Identify the command, options, and arguments.
Solutiongcloud compute instances create-with-container
.--container-image=gcr.io/<project-id>/gcp_vm_tester
and --zone=europe-west1-b
.instance-1
.The tricky part of this example is that commands can have subcommands, which are also commands. In this case, compute
is a subcommand to gcloud
, instances
is a subcommand to compute
, and create-with-container
is a subcommand to instances
.
Two common arguments that nearly all commands have are the -h
and -V
options. What does each of them do?
The -h
(or --help
) option prints the help message for the command, including subcommands and arguments. Try it out by executing python -h
. The -V
(or --version
) option prints the version of the installed program. Try it out by executing python --version
.
This ends the module on the command line. If you are still not comfortable working with the command line, fear not as we are going to use it extensively throughout the course. If you want to spend additional time on this topic, we highly recommend that you watch this video on how to use the command line.
If you are interested in personalizing your command line, you can check out the starship project, which allows you to customize your command line with a lot of different options.
"},{"location":"s1_development_environment/deep_learning_software/","title":"M4 - Deep Learning Software","text":""},{"location":"s1_development_environment/deep_learning_software/#deep-learning-software","title":"Deep Learning Software","text":"Core Module
Deep learning has, since its revolution back in 2012, transformed our lives. From Google Translate to driverless cars to personal assistants to protein engineering, deep learning is transforming nearly every sector of our economy and our lives. However, it did not take long before people realized that deep learning is not a simple beast to tame and it comes with its own kinds of problems, especially if you want to use it in a production setting. In particular, the concept of technical debt was invented to indicate the significant maintenance costs at a system level that it takes to run machine learning in production. MLOps should very much be seen as the response to the concept of technical debt, namely that we should develop methods, processes, and tools (with inspiration from classical DevOps) to counter the problems we run into when working with deep learning models.
It is important to note that all the concepts and tools that have been developed for MLOps can be used together with more classical machine learning models (think K-nearest neighbor, Random forest, etc.), however, deep learning comes with its own set of problems which mostly have to do with the sheer size of the data and models we are working with. For these reasons, we are focusing on working with deep learning models in this course.
"},{"location":"s1_development_environment/deep_learning_software/#software-landscape-for-deep-learning","title":"Software Landscape for Deep Learning","text":"Regarding software for Deep Learning, the landscape is currently dominated by three software frameworks (listed in order of when they were published):
TensorFlow
PyTorch
JAX
We won't go into a longer discussion on which framework is best, as it is pointless. PyTorch and TensorFlow have been around for the longest and therefore have bigger communities and feature sets at this point in time. They are both very similar in the sense that they both have features directed against research and production. JAX is kind of the new kid on the block, which in many ways improves on PyTorch and TensorFlow, but is still not as mature as the other frameworks. As the frameworks use different kinds of programming principles (object-oriented vs. functional programming), comparing them is essentially meaningless.
In this course, we have chosen to work with PyTorch because we find it a bit more intuitive and it is the framework that we use for our day-to-day research life. Additionally, as of right now, it is absolutely the dominating framework for published models, research papers, and competition winners.
The intention behind this set of exercises is to bring everyone's PyTorch skills up to date. If you already are a PyTorch-Jedi, feel free to pass the first set of exercises, but I recommend that you still complete it. The exercises are, in large part, taken directly from the deep learning course at Udacity. Note that these exercises are given as notebooks, which is the last time we are going to use them actively in the course. Instead, after this set of exercises, we are going to focus on writing code in Python scripts.
The notebooks contain a lot of explanatory text. The exercises that you are supposed to fill out are inlined in the text in small \"exercise\" blocks:
If you need a refresher on any deep learning topic in general throughout the course, we recommend finding the relevant chapter in the deep learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (which can also be found in the literature folder). It is not necessary to be good at deep learning to pass this course as the focus is on all the software needed to get deep learning models into production. However, it's important to have a basic understanding of the concepts.
"},{"location":"s1_development_environment/deep_learning_software/#exercises","title":"\u2754 Exercises","text":"Exercise files
Start a Jupyter Notebook session in your terminal (assuming you are standing at the root of the course material). Alternatively, you should be able to open the notebooks directly in your code editor. For VS code users you can read more about how to work with Jupyter Notebooks in VS code here
Complete the Tensors in PyTorch notebook. It focuses on the basic manipulation of PyTorch tensors. You can pass this notebook if you are comfortable doing this.
Complete the Neural Networks in PyTorch notebook. It focuses on building a very simple neural network using the PyTorch nn.Module
interface.
Complete the Training Neural Networks notebook. It focuses on how to write a simple training loop for training a neural network.
Complete the Fashion MNIST notebook, which summarizes concepts learned in notebooks 2 and 3 on building a neural network for classifying the Fashion MNIST dataset.
Complete the Inference and Validation notebook. This notebook adds important concepts on how to do inference and validation on our neural network.
Complete the Saving_and_Loading_Models notebook. This notebook addresses how to save and load model weights. This is important if you want to share a model with someone else.
If tensor a
has shape [N, d]
and tensor b
has shape [M, d]
how can we calculate the pairwise distance between rows in a
and b
without using a for loop?
We can take advantage of broadcasting to do this
a = torch.randn(N, d)\nb = torch.randn(M, d)\ndist = torch.sum((a.unsqueeze(1) - b.unsqueeze(0))**2, dim=2) # shape [N, M]\n
What should be the size of S
for an input image of size 1x28x28, and how many parameters does the neural network then have?
from torch import nn\nneural_net = nn.Sequential(\n nn.Conv2d(1, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3), nn.ReLU(), nn.Flatten(), nn.Linear(S, 10)\n)\n
Solution Since both convolutions have a kernel size of 3, stride 1 (default value) and no padding that means that we lose 2 pixels in each dimension, because the kernel can not be centered on the edge pixels. Therefore, the output of the first convolution would be 32x26x26. The output of the second convolution would be 64x24x24. The size of S
must therefore be 64 * 24 * 24 = 36864
. The number of parameters in a convolutional layer is kernel_size * kernel_size * in_channels * out_channels + out_channels
(last term is the bias) and the number of parameters in a linear layer is in_features * out_features + out_features
(last term is the bias). Therefore, the total number of parameters in the network is 3*3*1*32 + 32 + 3*3*32*64 + 64 + 36864*10 + 10 = 387,466
, which could be calculated by running:
sum([prod(p.shape) for p in neural_net.parameters()])\n
A working training loop in PyTorch should have these three function calls: optimizer.zero_grad()
, loss.backward()
, optimizer.step()
. Explain what would happen in the training loop (or implement it) if you forgot each of the function calls.
optimizer.zero_grad()
is in charge of zeroring the gradient. If this is not done, then gradients would accumulate over the steps leading to exploding gradients. loss.backward()
is in charge of calculating the gradients. If this is not done, then the gradients will not be calculated and the optimizer will not be able to update the weights. optimizer.step()
is in charge of updating the weights. If this is not done, then the weights will not be updated and the model will not learn anything.
As the final exercise, we will develop a simple baseline model that we will continue to develop during the course. For this exercise, we provide the data in the data/corruptmnist
folder. Do NOT use the data in the corruptmnist_v2
folder as that is intended for another exercise. As the name suggests, this is a (subsampled) corrupted version of the regular MNIST. Your overall task is the following:
Implement an MNIST neural network that achieves at least 85% accuracy on the test set.
Before any training can start, you should identify the corruption that we have applied to the MNIST dataset to create the corrupted version. This can help you identify what kind of neural network to use to get good performance, but any network should be able to achieve this.
One key point of this course is trying to stay organized. Spending time now organizing your code will save time in the future as you start to add more and more features. As subgoals, please fulfill the following exercises:
Implement your model in a script called model.py
.
model.py
model.pyfrom torch import nn\n\n\nclass MyAwesomeModel(nn.Module):\n \"\"\"My awesome model.\"\"\"\n\n def __init__(self) -> None:\n super().__init__()\n self.fc1 = nn.Linear(784, 128)\n
Solution The provided solution implements a convolutional neural network with 3 convolutional layers and a single fully connected layer. Because the MNIST dataset consists of images, we want an architecture that can take advantage of the spatial information in the images.
model.pyimport torch\nfrom torch import nn\n\n\nclass MyAwesomeModel(nn.Module):\n \"\"\"My awesome model.\"\"\"\n\n def __init__(self) -> None:\n super().__init__()\n self.conv1 = nn.Conv2d(1, 32, 3, 1)\n self.conv2 = nn.Conv2d(32, 64, 3, 1)\n self.conv3 = nn.Conv2d(64, 128, 3, 1)\n self.dropout = nn.Dropout(0.5)\n self.fc1 = nn.Linear(128, 10)\n\n def forward(self, x: torch.Tensor) -> torch.Tensor:\n \"\"\"Forward pass.\"\"\"\n x = torch.relu(self.conv1(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.relu(self.conv2(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.relu(self.conv3(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.flatten(x, 1)\n x = self.dropout(x)\n return self.fc1(x)\n\n\nif __name__ == \"__main__\":\n model = MyAwesomeModel()\n print(f\"Model architecture: {model}\")\n print(f\"Number of parameters: {sum(p.numel() for p in model.parameters())}\")\n\n dummy_input = torch.randn(1, 1, 28, 28)\n output = model(dummy_input)\n print(f\"Output shape: {output.shape}\")\n
Implement your data setup in a script called data.py
. The data was saved using torch.save
, so to load it you should use torch.load
.
Saving the model
When saving the model, you should use torch.save(model.state_dict(), \"model.pt\")
, and when loading the model, you should use model.load_state_dict(torch.load(\"model.pt\"))
. If you do torch.save(model, \"model.pt\")
, this can lead to problems when loading the model later on, as it will try to not only save the model weights but also the model definition. This can lead to problems if you change the model definition later on (which you most likely are going to do).
data.py
model.pyimport torch\n\n\ndef corrupt_mnist():\n \"\"\"Return train and test dataloaders for corrupt MNIST.\"\"\"\n # exchange with the corrupted mnist dataset\n train = torch.randn(50000, 784)\n test = torch.randn(10000, 784)\n return train, test\n
Solution Data is stored in .pt
files which can be loaded using torch.load
(1). We iterate over the files, load them and concatenate them into a single tensor. In particular, we have highlighted the use of .unsqueeze
function. Convolutional neural networks (which we propose as a solution) need the data to be in the shape [N, C, H, W]
where N
is the number of samples, C
is the number of channels, H
is the height of the image and W
is the width of the image. The dataset is stored in the shape [N, H, W]
and therefore we need to add a channel.
.pt
files are nothing else than a .pickle
file in disguise. The torch.save/torch.load
function is essentially a wrapper around the pickle
module in Python, which produces serialized files. However, it is convention to use .pt
to indicate that the file contains PyTorch tensors.We have additionally in the solution added functionality for plotting the images together with the labels for inspection. Remember: all good machine learning starts with a good understanding of the data.
model.pyfrom __future__ import annotations\n\nimport matplotlib.pyplot as plt # only needed for plotting\nimport torch\nfrom mpl_toolkits.axes_grid1 import ImageGrid # only needed for plotting\n\nDATA_PATH = \"data/corruptmnist\"\n\n\ndef corrupt_mnist() -> tuple[torch.utils.data.Dataset, torch.utils.data.Dataset]:\n \"\"\"Return train and test dataloaders for corrupt MNIST.\"\"\"\n train_images, train_target = [], []\n for i in range(5):\n train_images.append(torch.load(f\"{DATA_PATH}/train_images_{i}.pt\"))\n train_target.append(torch.load(f\"{DATA_PATH}/train_target_{i}.pt\"))\n train_images = torch.cat(train_images)\n train_target = torch.cat(train_target)\n\n test_images = torch.load(f\"{DATA_PATH}/test_images.pt\")\n test_target = torch.load(f\"{DATA_PATH}/test_target.pt\")\n\n train_images = train_images.unsqueeze(1).float()\n test_images = test_images.unsqueeze(1).float()\n train_target = train_target.long()\n test_target = test_target.long()\n\n train_set = torch.utils.data.TensorDataset(train_images, train_target)\n test_set = torch.utils.data.TensorDataset(test_images, test_target)\n\n return train_set, test_set\n\n\ndef show_image_and_target(images: torch.Tensor, target: torch.Tensor) -> None:\n \"\"\"Plot images and their labels in a grid.\"\"\"\n row_col = int(len(images) ** 0.5)\n fig = plt.figure(figsize=(10.0, 10.0))\n grid = ImageGrid(fig, 111, nrows_ncols=(row_col, row_col), axes_pad=0.3)\n for ax, im, label in zip(grid, images, target):\n ax.imshow(im.squeeze(), cmap=\"gray\")\n ax.set_title(f\"Label: {label.item()}\")\n ax.axis(\"off\")\n plt.show()\n\n\nif __name__ == \"__main__\":\n train_set, test_set = corrupt_mnist()\n print(f\"Size of training set: {len(train_set)}\")\n print(f\"Size of test set: {len(test_set)}\")\n print(f\"Shape of a training point {(train_set[0][0].shape, train_set[0][1].shape)}\")\n print(f\"Shape of a test point {(test_set[0][0].shape, test_set[0][1].shape)}\")\n show_image_and_target(train_set.tensors[0][:25], train_set.tensors[1][:25])\n
Implement training and evaluation of your model in main.py
script. The main.py
script should be able to take additional subcommands indicating if the model should be trained or evaluated. It will look something like this:
python main.py train --lr 1e-4\npython main.py evaluate trained_model.pt\n
which can be implemented in various ways. We provide you with a starting script that uses the click
library to define a command line interface (CLI), which you can learn more about in this module.
If you try to execute the above code in VS code using the debugger (F5) or the build run functionality in the upper right corner:
you will get an error message saying that you need to select a command to run e.g. main.py
either needs the train
or evaluate
command. This can be fixed by adding a launch.json
to a specialized .vscode
folder in the root of the project. The launch.json
file should look something like this:
{\n \"version\": \"0.2.0\",\n \"configurations\": [\n {\n \"name\": \"Python: Current File\",\n \"type\": \"python\",\n \"request\": \"launch\",\n \"program\": \"${file}\",\n \"args\": [\n \"train\",\n \"--lr\",\n \"1e-4\"\n ],\n \"console\": \"integratedTerminal\",\n \"justMyCode\": true\n }\n ]\n}\n
This will inform VS code that then we execute the current file (in this case main.py
) we want to run it with the train
command and additionally pass the --lr
argument with the value 1e-4
. You can read more about creating a launch.json
file here. If you want to have multiple configurations you can add them to the configurations
list as additional dictionaries.
main.py
main.pyimport click\nimport torch\nfrom data_solution import corrupt_mnist\nfrom model import MyAwesomeModel\n\n\n@click.group()\ndef cli() -> None:\n \"\"\"Command line interface.\"\"\"\n\n\n@click.command()\n@click.option(\"--lr\", default=1e-3, help=\"learning rate to use for training\")\ndef train(lr) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(lr)\n\n # TODO: Implement training loop here\n model = MyAwesomeModel()\n train_set, _ = corrupt_mnist()\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\ndef evaluate(model_checkpoint) -> None:\n \"\"\"Evaluate a trained model.\"\"\"\n print(\"Evaluating like my life depends on it\")\n print(model_checkpoint)\n\n # TODO: Implement evaluation logic here\n model = torch.load(model_checkpoint)\n _, test_set = corrupt_mnist()\n\n\ncli.add_command(train)\ncli.add_command(evaluate)\n\n\nif __name__ == \"__main__\":\n cli()\n
Solution The solution implements a simple training loop and evaluation loop. Furthermore, we have added additional hyperparameters that can be passed to the training loop. Highlighted in the solution are the different lines where we take care that our model and data are moved to GPU (or Apple MPS accelerator if you have a newer Mac) if available.
main.pyimport click\nimport matplotlib.pyplot as plt\nimport torch\nfrom model import MyAwesomeModel\n\nfrom data import corrupt_mnist\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.group()\ndef cli() -> None:\n \"\"\"Command line interface.\"\"\"\n\n\n@click.command()\n@click.option(\"--lr\", default=1e-3, help=\"learning rate to use for training\")\n@click.option(\"--batch_size\", default=32, help=\"batch size to use for training\")\n@click.option(\"--epochs\", default=10, help=\"number of epochs to train for\")\ndef train(lr, batch_size, epochs) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(f\"{lr=}, {batch_size=}, {epochs=}\")\n\n model = MyAwesomeModel().to(DEVICE)\n train_set, _ = corrupt_mnist()\n\n train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n loss_fn = torch.nn.CrossEntropyLoss()\n optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n statistics = {\"train_loss\": [], \"train_accuracy\": []}\n for epoch in range(epochs):\n model.train()\n for i, (img, target) in enumerate(train_dataloader):\n img, target = img.to(DEVICE), target.to(DEVICE)\n optimizer.zero_grad()\n y_pred = model(img)\n loss = loss_fn(y_pred, target)\n loss.backward()\n optimizer.step()\n statistics[\"train_loss\"].append(loss.item())\n\n accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n statistics[\"train_accuracy\"].append(accuracy)\n\n if i % 100 == 0:\n print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n print(\"Training complete\")\n torch.save(model.state_dict(), \"model.pth\")\n fig, axs = plt.subplots(1, 2, figsize=(15, 5))\n axs[0].plot(statistics[\"train_loss\"])\n axs[0].set_title(\"Train loss\")\n axs[1].plot(statistics[\"train_accuracy\"])\n axs[1].set_title(\"Train accuracy\")\n fig.savefig(\"training_statistics.png\")\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\ndef evaluate(model_checkpoint) -> None:\n \"\"\"Evaluate a trained model.\"\"\"\n print(\"Evaluating like my life depended on it\")\n print(model_checkpoint)\n\n model = MyAwesomeModel().to(DEVICE)\n model.load_state_dict(torch.load(model_checkpoint))\n\n _, test_set = corrupt_mnist()\n test_dataloader = torch.utils.data.DataLoader(test_set, batch_size=32)\n\n model.eval()\n correct, total = 0, 0\n for img, target in test_dataloader:\n img, target = img.to(DEVICE), target.to(DEVICE)\n y_pred = model(img)\n correct += (y_pred.argmax(dim=1) == target).float().sum().item()\n total += target.size(0)\n print(f\"Test accuracy: {correct / total}\")\n\n\ncli.add_command(train)\ncli.add_command(evaluate)\n\n\nif __name__ == \"__main__\":\n cli()\n
As documentation that your model is working when running the train
command, the script needs to produce a single plot with the training curve (training step vs training loss). When the evaluate
command is run, it should write the test set accuracy to the terminal.
It is part of the exercise to not implement in notebooks, as code development in real life happens in scripts. As the model is simple to run (for now), you should be able to complete the exercise on your laptop, even if you are only training on CPU. That said, you are allowed to upload your scripts to your own \"Google Drive\" and then you can call your scripts from a Google Colab notebook, which is shown in the image below where all code is placed in the fashion_trainer.py
script and the Colab notebook is just used to execute it.
Be sure to have completed the final exercise before the next session, as we will be building on top of the model you have created.
"},{"location":"s1_development_environment/editor/","title":"M3 - Editor","text":""},{"location":"s1_development_environment/editor/#editoride","title":"Editor/IDE","text":"Core Module
Notebooks can be great for testing out ideas, developing simple code, and explaining and visualizing certain aspects of a codebase. Remember that Jupyter Notebook was created to \"...allows you to create and share documents that contain live code, equations, visualizations, and narrative text.\" However, any larger machine learning project will require you to work in multiple .py
files, and here notebooks will provide a suboptimal workflow. Therefore, to truly get \"work done,\" you will need a good editor/IDE.
Many opinions exist on this matter, but for simplicity, we recommend getting started with one of the following 3:
Editor Webpage Comment (Biased opinion) Spyder https://www.spyder-ide.org/ A Matlab-like environment that is easy to get started with Visual Studio Code https://code.visualstudio.com/ Support for multiple languages with fairly easy setup PyCharm https://www.jetbrains.com/pycharm/ An IDE for Python professionals. Will take a bit of time getting used toWe highly recommend Visual Studio (VS) Code if you do not already have an editor installed (or just want to try something new). We, therefore, put additional effort into explaining VS Code.
Below, you see an overview of the VS Code interface
Image creditThe main components of VS Code are:
The action bar: VS Code is not an editor meant for a single language and can do many things. One of the core reasons that VS Code has become so popular is that custom plug-ins called extensions can be installed to add functionality to VS Code. It is in the action bar that you can navigate between these different applications when you have installed them.
The sidebar: The sidebar has different functionality depending on what extension you have open. In most cases, the sidebar will just contain the file explorer.
The editor: This is where your code is. VS Code supports several layouts in the editor (one column, two columns, etc.). You can make a custom layout by dragging a file to where you want the layout to split.
The panel: The panel contains a terminal for you to interact with. It can be used to quickly try out code by opening a python
interpreter, to manage environments, etc.
The status bar: The status bar contains information based on the extensions you have installed. In particular, for Python development, the status bar can be used to change the conda environment.
The overall goal of the exercises is that you should start familiarizing yourself with the editor you have chosen. If you are already an expert in one of them, feel free to skip the rest. You should at least be able to:
The instructions below are specific to Visual Studio Code, but we recommend that you try to answer the questions if using another editor. In the exercise_files
folder belonging to this session, we have put cheat sheets for VS Code (one for Windows and one for Mac/Linux) that can give you an easy overview of the different macros in VS Code. The following exercises are just to get you started, but you can find many more tutorials here.
VS Code is a general editor for many languages, and to get proper Python support, we need to install some extensions. In the action bar
, go to the extension
tab and search for python
in the marketplace. From here, we highly recommend installing the following packages:
If you install the Python
package, you should see something like this in your status bar:
which indicates that you are using the stock Python installation instead of the one you have created using conda
. Click it and change the Python environment to the one you want to use.
One of the most useful tools in VS Code is the ability to navigate the whole project using the built-in Explorer
. To take advantage of VS Code, you need to make sure what you are working on is a project. Create a folder called hello
(somewhere on your laptop) and open it in VS Code (Click File
in the menu and then select Open Folder
). You should end up with a completely clean workspace (as shown below). Click the New file
button and create a file called hello.py
.
Image credit
Finally, let's run some code. Add something simple to the hello.py
file like:
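If you need inspiration, a couple of lines such as the following are enough (any small snippet works, it is only there to give you something to run):
print(\"Hello from VS Code!\")\nfor i in range(3):\n    print(f\"The square of {i} is {i * i}\")\n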
Image credit
and click the run
button as shown in the image. It should create a new terminal, activate the environment that you have chosen, and finally run your script. In addition to clicking the run
button, you can also:
Shift+Enter
to run it in the terminalThat's the basics of using VS Code. We highly recommend that you revisit this tutorial during the course when we get to topics such as debugging and version control, which VS Code can help with. We can also recommend this blog post that goes over some good extensions for AI/ML development in VS Code.
"},{"location":"s1_development_environment/editor/#a-note-on-jupyter-notebooks-in-production-environments","title":"A note on Jupyter notebooks in production environments","text":"As already stated, Jupyter Notebooks are great for development as they allow developers to easily test out new ideas. However, they often lead to pain points when models need to be deployed. We highly recommend reading section 5.1.1 of this paper by Shankar et al. which in more detail discusses the strong opinions on Jupyter notebooks that exist within the developer community.
All this said, there exists one simple tool to make notebooks work better in a production setting. It's called nbconvert
and can be installed with
pip install nbconvert\n
You may need some further dependencies such as Pandoc, TeX and Pyppeteer for it to work (see install instructions here). After this, converting a notebook to a .py
script is as simple as:
jupyter nbconvert --to=script my_notebook.ipynb\n
which will produce a similarly named script called my_notebook.py
. We highly recommend that you stick to developing scripts directly during the course to get experience with doing so, but nbconvert
can be a fantastic tool to have in your toolbox.
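As a side note, nbconvert can also execute a notebook from the command line, which can be handy if you ever need to run a notebook end-to-end without opening Jupyter:
jupyter nbconvert --to notebook --execute my_notebook.ipynb\n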
You are probably all familiar with using AI tools for solving different tasks in your daily life and you have most likely also used AI tools like ChatGPT or similar for programming. However, most of these tools are not directly integrated into your editor, which can lead to a lot of context-switching that in general leads to lower productivity.
We are therefore in this section going to be looking at GitHub Copilot, which is an AI tool that directly integrates into your editor, eliminating the need to switch between browser tabs or external tools. In addition, the strength of having AI directly in your editor is that it can provide suggestions based on the code you are currently writing and in general it just has access to a larger context than a standalone tool.
"},{"location":"s1_development_environment/editor/#exercises_1","title":"\u2754 Exercises","text":"As of writing this GitHub Copilot is free for all students, teachers and maintainers of popular open-source projects. As a student, sign up for the Student Developer Pack
Install the GitHub Copilot extension in your editor
GitHub Copilot has many different features, but the most important one is the ability to provide suggestions based on the code you are currently writing. Try to write some code in a new Python file and see if you can get some suggestions from GitHub Copilot on how to complete the code. If you have no idea what to try out here is a simple example of starting out coding a neural network in PyTorch:
import torch\nfrom torch import nn\nclass Net(nn.Module):\n
Github Copilot will most likely suggest you complete the code using linear layers with an input dimension of 28*28
. Can you explain why it suggests this and where this bias comes from?
The second feature that can be very useful is the ability to directly chat or ask questions regarding your code. Try highlighting (in your code editor) the code from the previous exercise and press Ctrl+i
which should open a chat window. Ask it to complete it with a convolutional neural network instead of a linear one.
Finally, let's try the built-in chat feature. You can get to this by clicking the Chat
icon in the Activity bar and begin to ask questions similar to how you would ask ChatGPT. However, we have also the option to provide context either from the code editor or the terminal. Try saving the following code in a Python script copilot.py
:
import torch\nfrom torch import nn\nclass Net(nn.Module):\n def __init__(self):\n super(Net, self).__init__()\n self.fc1 = nn.Linear(28*28, 128)\n self.fc2 = nn.Linear(128, 64)\n self.fc3 = nn.Linear(64, 10)\n def forward(self, x):\n x = x.view(-1, 28*28)\n x = torch.relu(self.fc1(x))\n x = torch.relu(self.fc2(x))\n x = self.fc3(x)\n return x\n\nmodel = Net()\nprint(model(torch.randn(1, 1, 14, 14)))\n
and run it in the terminal: python copilot.py
. It will naturally give you an error, but you can now ask GitHub Copilot for help. The easiest way to do this is by highlighting the output in the terminal and then running the Github Copilot: Explain This (Terminal)
command (see the image below, use Ctrl+Shift+P
to open the command palette and search for the command). Does the explanation make sense e.g. can you figure out what to change to get the code running?
(Optional) Just to investigate the difference between using Github Copilot and ChatGPT, try to redo the previous exercises using ChatGPT. What are the main differences between the two tools? (1)
That was a small introduction to GitHub Copilot. We highly recommend that you try to use it during the course to see how it can help you solve both the exercises and the final project. However, when using AI tools it is always important to remember that they are not perfect and that you need to critically evaluate the suggestions they provide. In the end, you are the one responsible for the code you write, not the AI tool.
"},{"location":"s1_development_environment/package_manager/","title":"M2 - Package Manager","text":""},{"location":"s1_development_environment/package_manager/#package-managers-and-virtual-environments","title":"Package managers and virtual environments","text":"Core Module
Python is a great programming language and this is mostly due to its vast ecosystem of packages. No matter what you want to do, there is probably a package that can get you started. Just try to remember the last time you wrote a program using only the Python standard library. Probably never. For this reason, we need a way to install third-party packages, and this is where package managers come into play.
You have probably already used pip
for the longest time, which is the default package manager for Python. pip
is great for beginners but it is missing one essential feature that you will need as a developer or data scientist: virtual environments. Virtual environments are an essential way to make sure that the dependencies of different projects do not cross-contaminate each other. As a naive example, consider project A that requires torch==1.3.0
and project B that requires torch==2.0
, then doing
cd project_A # move to project A\npip install torch==1.3.0 # install old torch version\ncd ../project_B # move to project B\npip install torch==2.0 # install new torch version\ncd ../project_A # move back to project A\npython main.py # try executing main script from project A\n
will mean that even though we are executing the main script from project A's folder, it will use torch==2.0
instead of torch==1.3.0
because that is the last version we installed: in both cases pip
installs the package into the same environment, in this case the global environment. Instead, if we did something like:
cd project_A # move to project A\npython -m venv env # create a virtual environment in project A\nsource env/bin/activate # activate that virtual environment\npip install torch==1.3.0 # Install the old torch version into the virtual environment belonging to project A\ncd ../project_B # move to project B\npython -m venv env # create a virtual environment in project B\nsource env/bin/activate # activate that virtual environment\npip install torch==2.0 # Install new torch version into the virtual environment belonging to project B\ncd ../project_A # Move back to project A\nsource env/bin/activate # Activate the virtual environment belonging to project A\npython main.py # Succeed in executing the main script from project A\n
cd project_A # Move to project A\npython -m venv env # Create a virtual environment in project A\n.\\env\\Scripts\\activate # Activate that virtual environment\npip install torch==1.3.0 # Install the old torch version into the virtual environment belonging to project A\ncd ../project_B # Move to project B\npython -m venv env # Create a virtual environment in project B\n.\\env\\Scripts\\activate # Activate that virtual environment\npip install torch==2.0 # Install new torch version into the virtual environment belonging to project B\ncd ../project_A # Move back to project A\n.\\env\\Scripts\\activate # Activate the virtual environment belonging to project A\npython main.py # Succeed in executing the main script from project A\n
then we would be sure that torch==1.3.0
is used when executing main.py
in project A because we are using two different virtual environments. In the above case, we used the venv module which is the built-in Python module for creating virtual environments. venv+pip
is arguably a good combination but when working on multiple projects it can quickly become a hassle to manage all the different virtual environments yourself, remembering which Python version to use, which packages to install and so on.
For this reason, a number of package managers have been created that can help you manage your virtual environments and dependencies, with some of the most popular being:
with more being created every year (rye is looking like an interesting project). This is considered a problem in the Python community because it means that there is no standard way of managing dependencies, as there is in other languages such as npm
for node.js
or cargo
for rust
.
In the course, we do not care about which package manager you use, but we do care that you use one. If you are already familiar with one package manager, then skip this exercise and continue to use that. The best recommendation that I can give regarding package managers, in general, is to find one you like and then stick with it. A lot of time can be wasted on trying to find the perfect package manager, but in the end, they all do the same with some minor differences. Check out this blog post if you want a fairly up-to-date evaluation of the different environment management and packaging tools that exist in the Python ecosystem.
If you are not familiar with any package managers, then we recommend that you use conda
and pip
for this course. You probably already have conda installed on your laptop, which is great. What conda does especially well, is that it allows you to create virtual environments with different Python versions, which can be really useful if you encounter dependencies that have not been updated in a long time. In this course specifically, we are going to recommend the following workflow
conda
to create virtual environments with specific Python versionspip
to install packages in that environmentInstalling packages with pip
inside conda
environments has been considered a bad practice for a long time, but since conda>=4.6
it is considered safe to do so. The reason for this is that conda
now has a built-in compatibility layer that makes sure that pip
installed packages are compatible with the other packages installed in the environment.
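In practice, the recommended workflow boils down to a few commands like the following (the environment name and Python version are just examples):
conda create --name my_env python=3.11 # conda manages the environment and the Python version\nconda activate my_env # switch to the environment\npip install -r requirements.txt # pip manages the packages inside the environment\n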
Before we get started with the exercises, let's first talk a bit about Python dependencies. One of the most common ways to specify dependencies in the Python community is through a requirements.txt
file, which is a simple text file that contains a list of all the packages that you want to install. The format allows you to specify the package name and version number you want, with 7 different operators:
package1 # any version\npackage2 == x.y.z # exact version\npackage3 >= x.y.z # at least version x.y.z\npackage4 > x.y.z # newer than version x.y.z\npackage5 <= x.y.z # at most version x.y.z\npackage6 < x.y.z # older than version x.y.z\npackage7 ~= x.y.z # at least version x.y.z and older than version x.(y+1)\n
In general, all packages (should) follow the semantic versioning standard, which means that the version number is split into three parts: x.y.z
where x
is the major version, y
is the minor version and z
is the patch version.
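As a hypothetical example of how these operators interact with semantic versioning, a requirements.txt file could contain:
torch ~= 2.1.0 # accepts 2.1.1, 2.1.2, ... but not 2.2.0\nnumpy >= 1.24, < 2.0 # an explicit lower and upper bound\nmatplotlib == 3.8.1 # pin the exact version for full reproducibility\n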
The reason that we need to specify the version number is that we want to make sure that we can reproduce our code at a later point. If we do not specify the version number, then we are at the mercy of the package maintainer to not change the API of the package. This is especially important when working with machine learning models, as we want to make sure that we can reproduce the exact same model at a later point.
Finally, we also need to discuss dependency resolution, which is the process of figuring out which packages are compatible. This is a non-trivial problem, and there exist a lot of different algorithms for doing this. If you have ever thought that pip
and conda
were taking a long time to install something, then it is probably because they were trying to figure out which packages are compatible with each other. For example, if you try to install
pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.19\" --dry-run\n
then it would simply fail because there are no versions of matplotlib
and numpy
under the given constraints that are compatible with each other. In this case, we would need to relax the constraints to something like
pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.21\" --dry-run\n
to make it work.
"},{"location":"s1_development_environment/package_manager/#exercises","title":"\u2754 Exercises","text":"For hints regarding how to use conda
you can check out the cheat sheet in the exercise folder.
Download and install conda
. You are free to either install full conda
or the much simpler version miniconda
. The core difference between the two packages is that conda
already comes with a lot of packages that you would normally have to install with miniconda
. The downside is that conda
is a much larger package which can be a huge disadvantage on smaller devices. Make sure that your installation is working by writing conda help
in a terminal; it should show you the help message for conda. If this does not work, you probably need to set a system variable (e.g. your PATH) to point to the conda installation
If you have successfully installed conda, then you should be able to execute the conda
command in a terminal.
Conda will always tell you what environment you are currently in, indicated by the (env_name)
in the prompt. By default, it will always start in the (base)
environment.
Try creating a new virtual environment. Make sure that it is called my_environment
and that it installs version 3.11 of Python. What command should you execute to do this?
We highly recommend that you use Python 3.8 or higher for this course. In general, we recommend that you use the second latest version of Python that is available (currently Python 3.11 as of writing this). This is because the latest version of Python is often not supported by all dependencies. You can always check the status of different Python version support here.
Solutionconda create --name my_environment python=3.11\n
Which conda
command gives you a list of all the environments that you have created?
conda env list\n
Which conda
command gives you a list of the packages installed in the current environment?
conda list\n
How do you easily export this list to a text file? Do this, and make sure you export it to a file called environment.yaml
, as conda uses another format by default than pip
.
conda list --explicit > environment.yaml\n
Inspect the file to see what is in it.
The environment.yaml
file you have created is one way to secure reproducibility between users because anyone should be able to get an exact copy of your environment if they have your environment.yaml
file. Try creating a new environment directly from your environment.yaml
file and check that the packages being installed exactly match what you originally had.
conda env create --file environment.yaml\n
As the introduction states, it is fairly safe to use pip
inside conda
today. What is the corresponding pip
command that gives you a list of all pip
installed packages? And how do you export this to requirements.txt
file?
pip list # List all installed packages\npip freeze > requirements.txt # Export all installed packages to a requirements.txt file\n
If you look through the requirements that both pip
and conda
produce then you will see that it is often filled with a lot more packages than what you are using in your project. What you are interested in are the packages that you import in your code: from package import module
. One way to get around this is to use the package pipreqs
, which will automatically scan your project and create a requirements file specific to that. Let's try it out:
Install pipreqs
:
pip install pipreqs\n
Either try out pipreqs
on one of your own projects or try it out on some other online project. What does the requirements.txt
file pipreqs
produces look like compared to the files produced by either pip
or conda
.
Try executing the command
pip install \"pytest < 4.6\" pytest-cov==2.12.1\n
based on the error message you get, what would be a compatible way to install these?
SolutionAs pytest-cov==2.12.1
requires a version of pytest
newer than 4.6
, we can simply change the command to be:
pip install \"pytest >= 4.6\" pytest-cov==2.12.1\n
but there of course exist other solutions as well.
This ends the module on setting up virtual environments. While the methods mentioned in the exercises are great ways to construct requirements files automatically, sometimes it is just easier to sit down and write the files manually, as that way you ensure that only the necessary requirements are installed when creating a new environment.
"},{"location":"s2_organisation_and_version_control/","title":"Organization and version control","text":"Slides
Learn the basics of version control and how to use git
to track changes to your code and collaborate with others.
M5: Git
Learn how to organize Python code into a library, package it and use templates to create new projects.
M6: Code Structure
Learn different coding practices and how to use them to improve the quality of your code.
M7: Good Coding Practice
Learn how to version control data using dvc
.
M8: Data Version Control
Learn the different ways to setup command line interfaces for your applications.
M9: Command Line Interfaces
Today we take our first steps into the world of MLOps. The set of modules in this session focuses on getting organized and making sure that you are familiar with good development practices. While many of the practices you will learn about in these modules may not seem that important when you are a single person working on a project, when working in large groups it is crucial to minimize the differences in how people organize and write their code. The topics in this session will focus on:
Some exercises in this course are very loosely stated (including the exercises today). You are expected to seek out information before you ask for help (Google is your friend!), as you will learn more from trying to solve the problems yourself, and it is closer to how the \"real world\" works.
Learning objectives
The learning objectives of this session are:
git
to track changes to your codedvc
to version control dataAs we already laid out in the very first module, the command line is a powerful tool for interacting with your computer. You should already now be familiar with running basic Python commands in the terminal:
python my_script.py\n
However, as your projects grow in size and complexity, you will often find yourself in need of more advanced ways of interacting with your code. This is where command line interface (CLI) comes into play. A CLI can be seen as a way for you to define the user interface of your application directly in the terminal. Thus, there is no right or wrong way of creating a CLI, it is all about what makes sense for your application.
In this module we are going to look at three different ways of creating a CLI for your machine learning projects. They are all serving a bit different purposes and can therefore be combined in the same project. However, you will most likely also feel that they are overlapping in some areas. That is completely fine, and it is up to you to decide which one to use in which situation.
"},{"location":"s2_organisation_and_version_control/cli/#project-scripts","title":"Project scripts","text":"You might already be familiar with the concept of executable scripts. An executable script is a Python script that can be run directly from the terminal without having to call the Python interpreter. This has been possible for a long time in Python, by the inclusion of a so-called shebang line at the top of the script. However, we are going to look at a specific way of defining executable scripts using the standard pyproject.toml
file, which you should have learned about in this module.
We are going to assume that you have a training script in your project that you would like to be able to run from the terminal directly without having to call the Python interpreter. Lets assume it is located like this
src/\n\u251c\u2500\u2500 my_project/\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u251c\u2500\u2500 train.py\npyproject.toml\n
In your pyproject.toml
file add the following lines. You will need to alter the paths to match your project.
[project.scripts]\ntrain = \"my_project.train:main\"\n
what do you think the train = \"my_project.train:main\"
line do?
The line tells Python that we want to create an executable script called train
that should run the main
function in the train.py
file located in the my_project
package.
Now, all that is left to do is install the project again in editable mode
pip install -e .\n
and you should now be able to run the following command in the terminal
train\n
Try it out and see if it works.
Add additional scripts to your pyproject.toml
file that allows you to run other scripts in your project from the terminal.
We assume that you also have a script called evaluate.py
in the my_project
package.
[project.scripts]\ntrain = \"my_project.train:main\"\nevaluate = \"my_project.evaluate:main\"\n
That is all there really is to it. You can now run your scripts directly from the terminal without having to call the Python interpreter. Some good examples of Python packages that uses this approach are numpy, pylint and kedro.
"},{"location":"s2_organisation_and_version_control/cli/#command-line-arguments","title":"Command line arguments","text":"If you have worked with Python for some time you are probably familiar with the argparse
package, which allows you to directly pass in additional arguments to your script in the terminal
python my_script.py --arg1 val1 --arg2 val2\n
argparse
is a very simple way of constructing what is called a command line interfaces. However, one limitation of argparse
is the possibility of easily defining an CLI with subcommands. If we take git
as an example, git
is the main command but it has multiple subcommands: push
, pull
, commit
etc. that all can take their own arguments. This kind of second CLI with subcommands is somewhat possible to do using only argparse
, however it requires a bit of hacks.
You could of course ask the question why we at all would like to have the possibility of defining such CLI. The main argument here is to give users of our code a single entrypoint to interact with our application instead of having multiple scripts. As long as all subcommands are proper documented, then our interface should be simple to interact with (again think git
where each subcommand can be given the -h
arg to get specific help).
Instead of using argparse
we are here going to look at the yyper package. typer
extends the functionalities of argparse
to allow for easy definition of subcommands and many other things, which we are not going to touch upon in this module. For completeness we should also mention that typer
is not the only package for doing this, and of other excellent frameworks for creating command line interfaces easily we can mention click.
Start by installing the typer
package
pip install typer\n
remember to add the package to your requirements.txt
file.
To get you started with typer
, let's just create a simple hello world type of script. Create a new Python file called greetings.py
and use the typer
package to create a command line interface such that running the following lines
python greetings.py # should print \"Hello World!\"\npython greetings.py --count=3 # should print \"Hello World!\" three times\npython greetings.py --help # should print the help message, informing the user of the possible arguments\n
executes and gives the expected output. Relevant documentation.
SolutionImportantly for typer
is that you need to provide type hints for the arguments. This is because typer
needs these to be able to work properly.
import typer\napp = typer.Typer()\n\n@app.command()\ndef hello(count: int = 1, name: str = \"World\"):\n for x in range(count):\n typer.echo(f\"Hello {name}!\")\n\nif __name__ == \"__main__\":\n app()\n
Next, lets try on a bit harder example. Below is a simple script that trains a support vector machine on the iris dataset.
iris_classifier.py
iris_classifier.pyfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\n\ndef train():\n \"\"\"Train and evaluate the model.\"\"\"\n # Load the dataset\n data = load_breast_cancer()\n x = data.data\n y = data.target\n\n # Split the dataset into training and testing sets\n x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n # Standardize the features\n scaler = StandardScaler()\n x_train = scaler.fit_transform(x_train)\n x_test = scaler.transform(x_test)\n\n # Train a Support Vector Machine (SVM) model\n model = SVC(kernel=\"linear\", random_state=42)\n model.fit(x_train, y_train)\n\n # Make predictions on the test set\n y_pred = model.predict(x_test)\n\n # Evaluate the model\n accuracy = accuracy_score(y_test, y_pred)\n report = classification_report(y_test, y_pred)\n\n print(f\"Accuracy: {accuracy:.2f}\")\n print(\"Classification Report:\")\n print(report)\n return accuracy, report\n\n\nif __name__ == \"__main__\":\n train()\n
Implement a CLI for the script such that the following commands can be run
python iris_classifier.py train --output 'model.ckpt' # should train the model and save it to 'model.ckpt'\npython iris_classifier.py train -o 'model.ckpt' # should be the same as above\n
Solution We are here making use of the short name option in typer for giving an shorter alias to the --output
option.
import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\n\n\n@app.command()\ndef train(output: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\"):\n \"\"\"Train and evaluate the model.\"\"\"\n # Load the dataset\n data = load_breast_cancer()\n x = data.data\n y = data.target\n\n # Split the dataset into training and testing sets\n x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n # Standardize the features\n scaler = StandardScaler()\n x_train = scaler.fit_transform(x_train)\n x_test = scaler.transform(x_test)\n\n # Train a Support Vector Machine (SVM) model\n model = SVC(kernel=\"linear\", random_state=42)\n model.fit(x_train, y_train)\n\n # Make predictions on the test set\n y_pred = model.predict(x_test)\n\n # Evaluate the model\n accuracy = accuracy_score(y_test, y_pred)\n report = classification_report(y_test, y_pred)\n\n with open(output, \"wb\") as f:\n pickle.dump(model, f)\n\n print(f\"Accuracy: {accuracy:.2f}\")\n print(\"Classification Report:\")\n print(report)\n return accuracy, report\n\n\nif __name__ == \"__main__\":\n app()\n
Next lets create a CLI that has more than a single command. Continue working in the basic machine learning application from the previous exercise, but this time we want to define two separate commands
python iris_classifier.py train --output 'model.ckpt'\npython iris_classifier.py evaluate 'model.ckpt'\n
Solution The only key difference between the two is that in the train
command we define the output
argument to to be an optional parameter e.g. we provide a default and for the evaluate
command it is a required parameter.
import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\n\n# Load the dataset\ndata = load_breast_cancer()\nx = data.data\ny = data.target\n\n# Split the dataset into training and testing sets\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n# Standardize the features\nscaler = StandardScaler()\nx_train = scaler.fit_transform(x_train)\nx_test = scaler.transform(x_test)\n\n\n@app.command()\ndef train(output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n \"\"\"Train the model.\"\"\"\n # Train a Support Vector Machine (SVM) model\n model = SVC(kernel=\"linear\", random_state=42)\n model.fit(x_train, y_train)\n\n with open(output_file, \"wb\") as f:\n pickle.dump(model, f)\n\n\n@app.command()\ndef evaluate(model_file):\n \"\"\"Evaluate the model.\"\"\"\n with open(model_file, \"rb\") as f:\n model = pickle.load(f)\n\n # Make predictions on the test set\n y_pred = model.predict(x_test)\n\n # Evaluate the model\n accuracy = accuracy_score(y_test, y_pred)\n report = classification_report(y_test, y_pred)\n\n print(f\"Accuracy: {accuracy:.2f}\")\n print(\"Classification Report:\")\n print(report)\n return accuracy, report\n\n\nif __name__ == \"__main__\":\n app()\n
Finally, let's try to define subcommands for our subcommands e.g. something similar to how git
has the subcommand remote
which in itself has multiple subcommands like add
, rename
etc. Continue on the simple machine learning application from the previous exercises, but this time define a cli such that
python iris_classifier.py train svm --kernel 'linear'\npython iris_classifier.py train knn -k 5\n
e.g the train
command now has two subcommands for training different machine learning models (in this case SVM and KNN) which each takes arguments that are unique to that model. Relevant documentation.
import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\ntrain_app = typer.Typer()\napp.add_typer(train_app, name=\"train\")\n\n# Load the dataset\ndata = load_breast_cancer()\nx = data.data\ny = data.target\n\n# Split the dataset into training and testing sets\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n# Standardize the features\nscaler = StandardScaler()\nx_train = scaler.fit_transform(x_train)\nx_test = scaler.transform(x_test)\n\n\n@train_app.command()\ndef svm(kernel: str = \"linear\", output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n \"\"\"Train a SVM model.\"\"\"\n model = SVC(kernel=kernel, random_state=42)\n model.fit(x_train, y_train)\n\n with open(output_file, \"wb\") as f:\n pickle.dump(model, f)\n\n\n@train_app.command()\ndef knn(k: int = 5, output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n \"\"\"Train a KNN model.\"\"\"\n model = KNeighborsClassifier(n_neighbors=k)\n model.fit(x_train, y_train)\n\n with open(output_file, \"wb\") as f:\n pickle.dump(model, f)\n\n\n@app.command()\ndef evaluate(model_file):\n \"\"\"Evaluate the model.\"\"\"\n with open(model_file, \"rb\") as f:\n model = pickle.load(f)\n\n # Make predictions on the test set\n y_pred = model.predict(x_test)\n\n # Evaluate the model\n accuracy = accuracy_score(y_test, y_pred)\n report = classification_report(y_test, y_pred)\n\n print(f\"Accuracy: {accuracy:.2f}\")\n print(\"Classification Report:\")\n print(report)\n return accuracy, report\n\n\nif __name__ == \"__main__\":\n app()\n
(Optional) Let's try to combine what we have learned until now. Try to make your typer
cli into a executable script using the pyproject.toml
file and try it out!
Assuming that our iris_classifier.py
script from before is placed in src/my_project
folder, we should just add
[project.scripts]\niris_classifier = \"my_project.iris_classifier:app\"\n
and remember to install the project in editable mode
pip install -e .\n
and you should now be able to run the following command in the terminal
iris_classifier train knn\n
This covers the basic of typer
but feel free to deep dive into how the package can help you custimize your CLIs. Checkout this page on adding colors to your CLI or this page on validating the inputs to your CLI.
The two sections above have shown you how to create a simple CLI for your Python scripts. However, when doing machine learning projects, you often have a lot of non-Python code that you would like to run from the terminal. Based on the learning modules you have already completed, you have already encountered a couple of CLI tools that are used in our projects:
As we begin to move into the next couple of learning modules, we are going to encounter even more CLI tools that we need to interact with. Here is a example of long command that you might need to run in your project in the future
docker run -v $(pwd):/app -w /app --gpus all --rm -it my_image:latest python my_script.py --arg1 val1 --arg2 val2\n
This can be a lot to remember, and it can be easy to make mistakes. Instead it would be nice if we could just do
run my_command --arg1=val1 --arg2=val2\n
e.g. easier to remember because we have remove a lot of the hard-to-remember stuff, but we are still able to configure it to our liking. To help with this, we are going to look at the invoke package. invoke
is a Python package that allows you to define tasks that can be run from the terminal. It is a bit like a more advanced version of the Makefile that you might have encountered in other programming languages. Some good alternatives to invoke
are just and task, but we have chosen to focus on invoke
in this module because it can be installed as a Python package making installation across different systems easier.
Start by installing invoke
pip install invoke\n
remember to add the package to your requirements.txt
file.
Add a tasks.py
file to your repository and try to just run
invoke --list\n
which should work but inform you that no tasks are added yet.
Let's now try to add a task to the tasks.py
file. The way to do this with invoke is to import the task
decorator from invoke
and then decorate a function with it:
from invoke import task\nimport os\n\n@task\ndef python(ctx):\n \"\"\" \"\"\"\n ctx.run(\"which python\" if os.name != \"nt\" else \"where python\")\n
the first argument of any task-decorated function is the ctx
context argument that implements the run
method for running any command as we run them in the terminal. In this case we have simply implemented a task that returns the current Python interpreter but it works for all operating systems. Check that it works by running:
invoke hello\n
Lets try to create a task that simplifies the process of git add
, git commit
, git push
. Create a task such that the following command can be run
invoke git --message \"My commit message\"\n
Implement it and use the command to commit the taskfile you just created!
Solution@task\ndef git(ctx, message):\n ctx.run(\"git add .\")\n ctx.run(f\"git commit -m '{message}'\")\n ctx.run(\"git push\")\n
As you have hopefully realized by now, the most important method in invoke
is the ctx.run
method which actually run the commands you want to run in the terminal. This command takes multiple additional arguments. Try out the arguments warn
, pty
, echo
and explain in your own words what they do.
warn
: If set to True
the command will not raise an exception if the command fails. This can be useful if you want to run multiple commands and you do not want the whole process to stop if one of the commands fail.pty
: If set to True
the command will be run in a pseudo-terminal. If you want to enable this or not, depends on the command you are running. Here is a good explanation of when/why you should use it.echo
: If set to True
the command will be printed to the terminal before it is run.Create a command that simplifies the process of bootstrapping a conda
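To see how these arguments fit together, here is a small illustrative task; the linting command is just a stand-in for any command that might fail:
from invoke import task\n\n@task\ndef lint(ctx):\n    \"\"\"Run a linter without aborting the task if it finds problems.\"\"\"\n    result = ctx.run(\"ruff check .\", warn=True, echo=True)  # echo prints the command before running it\n    if not result.ok:  # with warn=True a non-zero exit code does not raise an exception\n        print(\"Linting reported issues, but the task continues\")\n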
environment and install the relevant dependencies of your project.
@task\ndef conda(ctx, name: str = \"dtu_mlops\"):\n ctx.run(f\"conda env create -f environment.yml\", echo=True)\n ctx.run(f\"conda activate {name}\", echo=True)\n ctx.run(f\"pip install -e .\", echo=True)\n
and try to run the following command
invoke conda\n
Assuming you have completed the exercises on using dvc for version control of data, lets also try to add a task that simplifies the process of adding new data. This is the list of commands that need to be run to add new data to a dvc repository: dvc add
, git add
, git commit
, git push
, dvc push
. Try to implement a task that simplifies this process. It needs to take two arguments for defining the folder to add and the commit message.
@task\ndef dvc(ctx, folder=\"data\", message=\"Add new data\"):\n ctx.run(f\"dvc add {folder}\")\n ctx.run(f\"git add {folder}.dvc .gitignore\")\n ctx.run(f\"git commit -m '{message}'\")\n ctx.run(f\"git push\")\n ctx.run(f\"dvc push\")\n
and try to run the following command
invoke dvc --folder 'data' --message 'Add new data'\n
As the final exercise, lets try to combine every way of defining CLIs we have learned about in this module. Define a task that does the following
dvc pull
to download the datamy_cli
with the subcommand train
with the arguments --output 'model.ckpt'
from invoke import task\n\n@task\ndef pull_data(ctx):\n ctx.run(\"dvc pull\")\n\n@task(pull_data)\ndef train(ctx):\n ctx.run(\"my_cli train --output 'model.ckpt'\")\n
That is all there is to it. You should now be able to define tasks that can be run from the terminal to simplify the process of running your code. We recommend that, as you go through the learning modules in this course, you slowly start to add tasks to your tasks.py
file that simplifies the process of running the code you are writing.
What is the purpose of a command line interface?
SolutionA command line interface is a way for you to define the user interface of your application directly in the terminal. It allows you to interact with your code in a more advanced way than just running Python scripts.
Core Module
With a basic understanding of version control, it is now time to really begin filling up our code repository. However, the question then remains how to organize our code? As developers we tend to not think about code organization that much. It is instead something that just dynamically is being created as we may need it. However, maybe we should spend some time initially getting organized with the chance of this making our code easier to develop and maintain in the long run. If we do not spend time organizing our code, we may end up with a mess of code that is hard to understand or maintain
Big ball of Mud
A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems. Brian Foote and Joseph Yoder, Big Ball of Mud. Fourth Conference on Patterns Languages of Programs (PLoP '97/EuroPLoP '97) Monticello, Illinois, September 1997
We are here going to focus on the organization of data science projects and machine learning projects. The core difference this kind of projects introduces compared to more traditional systems, is data. The key to modern machine learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that data should influence our choice of code structure. If we had another kind of application, then the layout of our codebase should probably be different.
"},{"location":"s2_organisation_and_version_control/code_structure/#cookiecutter","title":"Cookiecutter","text":"We are in this course going to use the tool cookiecutter, which is tool for creating projects from project templates. A project template is in short just an overall structure of how you want your folders, files etc. to be organised from the beginning. For this course we are going to be using a custom MLOps template. The template is essentially a fork of the cookiecutter data science template template that has been used for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.
We are not going to argue that this template is better than every other template, we are just focusing on that it is a standardized way of creating project structures for machine learning projects. By standardized we mean, that if two persons are both using cookiecutter
with the same template, the layout of their code does follow some specific rules, enabling one to faster understand the other person's code. Code organization is therefore not only to make the code easier for you to maintain but also for others to read and understand.
Shown below is the default code structure of cookiecutter for data science projects.
What is important to keep in mind when using a template, is that it exactly is a template. By definition a template is a guide to make something. Therefore, not all parts of a template may be important for your project at hand. Your job is to pick the parts from the template that is useful for organizing your machine learning project and add the parts that are missing.
"},{"location":"s2_organisation_and_version_control/code_structure/#python-projects","title":"Python projects","text":"While the same template in principal could be used regardless of what language we were using for our machine learning or data science application, there are certain considerations to take into account based on what language we are using. Python is the dominant language for machine learning and data science currently, which is why we in this section are focusing on some of the special files you will need for your Python projects.
The first file you may or may not know is the __init__.py
file. In Python the __init__.py
file is used to mark a directory as a Python package. Therefore as a bare minimum, any Python package should look something like this:
\u251c\u2500\u2500 src/\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u251c\u2500\u2500 file1.py\n\u2502 \u251c\u2500\u2500 file2.py\n\u251c\u2500\u2500 pyproject.toml\n
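The __init__.py file can be left completely empty; it can also be used to define what the package exposes, for example (the function name here is just an illustration):
# src/__init__.py -- an empty file is also perfectly fine\nfrom .file1 import some_function\n\n__all__ = [\"some_function\"]\n__version__ = \"0.1.0\"\n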
The second file to focus on is the pyproject.toml
. This file is important for actually converting your code into a Python project. Essentially, whenever you run pip install
, pip
is in charge of both downloading the package you want but also in charge of installing it. For pip
to be able to install a package it needs instructions on what part of the code it should install and how to install it. This is the job of the pyproject.toml
file.
Below we have both added a description of the structure of the pyproject.toml
file but also setup.py + setup.cfg
which is the \"old\" way of providing project instructions regarding Python project. However, you may still encounter a lot of projects using setup.py + setup.cfg
so it is good to at least know about them.
pyproject.toml
is the new standardized way of describing project metadata in a declaratively way, introduced in PEP 621. It is written in toml format which is easy to read. At the very least your pyproject.toml
file should include the [build-system]
and [project]
sections:
[build-system]\nrequires = [\"setuptools\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"my-package-name\"\nversion = \"0.1.0\"\nauthors = [{name = \"EM\", email = \"me@em.com\"}]\ndescription = \"Something cool here.\"\nrequires-python = \">=3.8\"\ndynamic = [\"dependencies\"]\n\n[tool.setuptools.dynamic]\ndependencies = {file = [\"requirements.txt\"]}\n
the [build-system]
informs pip
/python
that to build this Python project it needs the two packages setuptools
and wheels
and that it should call the setuptools.build_meta function to actually build the project. The [project]
section essentially contains metadata regarding the package, what its called etc. if we ever want to publish it to PyPI.
For specifying dependencies of your project you have two options. Either you specify them in a requirements.txt
file and it as a dynamic field in pyproject.toml
as shown above. Alternatively, you can add a dependencies
field under the [project]
header like this:
[project]\ndependencies = [\n 'torch==2.1.0',\n 'matplotlib>=3.8.1'\n]\n
The improvement over setup.py + setup.cfg
is that pyproject.toml
also allows for metadata from other tools to be specified in it, essentially making sure you only need a single file for your project. For example, in the next [module M7 on good coding practices] you will learn about the tool ruff
and how it can help format your code. If we want to configure ruff
for our project we can do that directly in pyproject.toml
by adding additional headers:
[ruff]\nruff_option = ...\n
To read more about how to specify pyproject.toml
this page is a good place to start.
setup.py
is the original way to describing how a Python package should be build. The most basic setup.py
file will look like this:
from setuptools import setup\nfrom pip.req import parse_requirements\nrequirements = [str(ir.req) for ir in parse_requirements(\"requirements.txt\")]\nsetup(\n name=\"my-package-name\",\n version=\"0.1.0\",\n author=\"EM\",\n description=\"Something cool here.\"\n install_requires=requirements,\n)\n
Essentially, the it is the exact same meta information as in pyproject.toml
, just written directly in Python syntax instead of toml
. Because there was a wish to deperate this meta information into a separate file, the setup.cfg
file was created which can contain the exact same information as setup.py
just in a declarative config.
[metadata]\nname = my-package-name\nversion = 0.1.0\nauthor = EM\ndescription = \"Something cool here.\"\n# ...\n
This non-standardized way of providing meta information regarding a package was essentially what lead to the creation of pyproject.toml
.
Regardless of what way a project is configured, after creating the above files the correct way to install them would be the same
pip install .\n# or in developer mode\npip install -e . # (1)!\n
-e
is short for --editable
mode also called developer mode. Since we will continuously iterating on our package this is the preferred way to install our package, because that means that we do not have to run pip install
every time we make a change. Essentially, in developer mode changes in the Python source code can immediately take place without requiring a new installation.after running this your code should be available to import as from project_name import ...
like any other Python package you use. This is the most essential you need to know about creating Python packages.
After having installed cookiecutter (exercise 1 and 2), the remaining exercises are intended to be used on taking the simple CNN MNIST classifier from yesterdays exercise and force it into this structure. You are not required to fill out every folder and file in the project structure, but try to at least follow the steps in exercises. Whenever you need to run a file I recommend always doing this from the root directory e.g.
python <project_name>/data/make_dataset.py data/raw data/processed\npython <project_name>/models/train_model.py <arguments>\netc...\n
in this way paths (for saving and loading files) are always relative to the root.
Install cookiecutter framework
pip install cookiecutter\n
Start a new project using this template, that is specialized for this course (1).
You do this by running the cookiecutter command using the template url:
cookiecutter <url-to-template>\n
Valid project names
When asked for a project name you should follow the PEP8 guidelines for naming packages. This means that the name should be all lowercase and if you want to separate words, you should use underscores. For example my_project
is a valid name, while MyProject
is not. Additionally, the packaage name cannot start with a number.
There are two common choices on how layout your source directory. The first is called src-layout where the source code is always place in a src/<project_name>
folder and the second is called flat-layout where the source code is place is just placed in a <project_name>
folder. The template we are using in this course is using the flat-layout, but there are pros and cons for both.
After having created your new project, the first step is to also create a corresponding virtual environment and install any needed requirements. If you have a virtual environment from yesterday feel free to use that else create a new. Then install the project in that environment
pip install -e .\n
Start by filling out the <project_name>/data/make_dataset.py
file. When this file runs, it should take the raw data e.g. the corrupted MNIST files from yesterday (../data/corruptmnist
) which now should be located in a data/raw
folder and process them into a single tensor, normalize the tensor and save this intermediate representation to the data/processed
folder. By normalization here we refer to making sure the images have mean 0 and standard deviation 1.
import click\nimport torch\n\n\ndef normalize(images: torch.Tensor) -> torch.Tensor:\n \"\"\"Normalize images.\"\"\"\n return (images - images.mean()) / images.std()\n\n\n@click.command()\n@click.option(\"raw_dir\", default=\"data/raw\", help=\"Path to raw data directory\")\n@click.option(\"processed_dir\", default=\"data/processed\", help=\"Path to processed data directory\")\ndef make_data(raw_dir: str, processed_dir: str) -> None:\n \"\"\"Process raw data and save it to processed directory.\"\"\"\n train_images, train_target = [], []\n for i in range(5):\n train_images.append(torch.load(f\"{raw_dir}/train_images_{i}.pt\"))\n train_target.append(torch.load(f\"{raw_dir}/train_target_{i}.pt\"))\n train_images = torch.cat(train_images)\n train_target = torch.cat(train_target)\n\n test_images: torch.Tensor = torch.load(f\"{raw_dir}/test_images.pt\")\n test_target: torch.Tensor = torch.load(f\"{raw_dir}/test_target.pt\")\n\n train_images = train_images.unsqueeze(1).float()\n test_images = test_images.unsqueeze(1).float()\n train_target = train_target.long()\n test_target = test_target.long()\n\n train_images = normalize(train_images)\n test_images = normalize(test_images)\n\n torch.save(train_images, f\"{processed_dir}/train_images.pt\")\n torch.save(train_target, f\"{processed_dir}/train_target.pt\")\n torch.save(test_images, f\"{processed_dir}/test_images.pt\")\n torch.save(test_target, f\"{processed_dir}/test_target.pt\")\n\n\nif __name__ == \"__main__\":\n make_data()\n
This template comes with a Makefile
that can be used to easily define common operations in a project. You do not have to understand the complete file but try taking a look at it. In particular the following commands may come in handy
make data # runs the make_dataset.py file, try it!\nmake clean # clean __pycache__ files\nmake requirements # install everything in the requirements.txt file\n
Windows users make
is a GNU build tool that is by default not available on Windows. There are two recommended ways to get it running on Windows. The first is leveraging linux subsystem for Windows which you maybe have already installed. The second option is utilizing the chocolatey package manager, which enables Windows users to install packages similar to Linux system.
In general we recommend that you add commands to the Makefile
as you move along in the course. If you want to know more about how to write Makefile
s then this is an excellent video.
Put your model file (model.py
) into <project_name>/models
folder and insert the relevant code from the main.py
file into the train_model.py
file. Make sure that whenever a model is trained, it gets saved to the models
folder (preferably in sub-folders).
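As a minimal sketch (assuming a PyTorch model and one sub-folder per training run, both of which are just illustrative choices), the saving step could look like this:
from datetime import datetime\nfrom pathlib import Path\n\nimport torch\n\nmodel = torch.nn.Linear(784, 10)  # placeholder for your trained model\nsave_dir = Path(\"models\") / datetime.now().strftime(\"%Y-%m-%d_%H-%M-%S\")  # one sub-folder per run\nsave_dir.mkdir(parents=True, exist_ok=True)\ntorch.save(model.state_dict(), save_dir / \"model.pt\")\n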
When you run train_model.py
, make sure that some statistics/visualizations from the trained models get saved to the reports/figures/
folder. This could be a simple .png
of the training curve.
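A minimal sketch of such a figure, assuming you collect the per-step losses in a list called train_losses during training (the name is just illustrative):
import matplotlib.pyplot as plt\n\ntrain_losses = [0.9, 0.7, 0.5, 0.4, 0.35]  # example values; in practice collected during training\nplt.figure()\nplt.plot(train_losses)\nplt.xlabel(\"Training step\")\nplt.ylabel(\"Loss\")\nplt.savefig(\"reports/figures/training_curve.png\")  # assumes the reports/figures folder exists, as in the template\n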
(Optional) Can you figure out a way to add a train
command to the Makefile
such that training can be started using
make train\n
Solution train:\n python <project_name>/models/train_model.py\n
Fill out the newly created <project_name>/models/predict_model.py
file, such that it takes a pre-trained model file and creates predictions for some data. The recommended interface is that users can give this file either a folder with raw images that get loaded in or a numpy
or pickle
file with already loaded images e.g. something like this
python <project_name>/models/predict_model.py \\\n models/my_trained_model.pt \\ # file containing a pretrained model\n data/example_images.npy # file containing just 10 images for prediction\n
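A minimal sketch of what such a script could look like, assuming the full model object was saved with torch.save and that the data file is a numpy array of images (both assumptions may differ from your setup):
import sys\n\nimport numpy as np\nimport torch\n\n\ndef predict(model_file: str, data_file: str) -> torch.Tensor:\n    \"\"\"Return class predictions for the images in data_file.\"\"\"\n    model = torch.load(model_file)  # assumes the full model object was saved, not just a state_dict\n    model.eval()\n    images = torch.from_numpy(np.load(data_file)).float()\n    with torch.inference_mode():\n        return model(images).argmax(dim=1)\n\n\nif __name__ == \"__main__\":\n    print(predict(sys.argv[1], sys.argv[2]))\n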
Fill out the file <project_name>/visualization/visualize.py
with this (as a minimum, feel free to add more visualizations)
reports/figures/
folder. The solution here depends a bit on the choice of model. However, in most cases the last layer in your model will be a fully connected layer, which we assume is named fc
. The easiest way to get the features before this layer is to replace the layer with torch.nn.Identity
which essentially does nothing (see highlighted line below). Alternatively, if you implemented everything in a torch.nn.Sequential
you can just remove the last layer from the Sequential
object: model = model[:-1]
.
import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom my_project_name.model import MyAwesomeModel\nfrom sklearn.decomposition import PCA\nfrom sklearn.manifold import TSNE\n\n\n@click.command()\n@click.option(\"--model-checkpoint\", default=\"model.pth\", help=\"Path to model checkpoint\")\n@click.option(\"--processed-dir\", default=\"data/processed\", help=\"Path to processed data directory\")\n@click.option(\"--figure-dir\", default=\"reports/figures\", help=\"Path to save figures\")\n@click.option(\"--figure-name\", default=\"embeddings.png\", help=\"Name of the figure\")\ndef visualize(model_checkpoint: str, processed_dir: str, figure_dir: str, figure_name: str) -> None:\n \"\"\"Visualize model predictions.\"\"\"\n model = MyAwesomeModel()\n model.load_state_dict(torch.load(model_checkpoint))\n model.eval()\n model.fc = torch.nn.Identity()\n\n test_images = torch.load(f\"{processed_dir}/test_images.pt\")\n test_target = torch.load(f\"{processed_dir}/test_target.pt\")\n test_dataset = torch.utils.data.TensorDataset(test_images, test_target)\n\n embeddings, targets = [], []\n with torch.inference_mode():\n for batch in torch.utils.data.DataLoader(test_dataset, batch_size=32):\n images, target = batch\n predictions = model(images)\n embeddings.append(predictions)\n targets.append(target)\n embeddings = torch.cat(embeddings).numpy()\n targets = torch.cat(targets).numpy()\n\n if embeddings.shape[1] > 500: # Reduce dimensionality for large embeddings\n pca = PCA(n_components=100)\n embeddings = pca.fit_transform(embeddings)\n tsne = TSNE(n_components=2)\n embeddings = tsne.fit_transform(embeddings)\n\n plt.figure(figsize=(10, 10))\n for i in range(10):\n mask = targets == i\n plt.scatter(embeddings[mask, 0], embeddings[mask, 1], label=str(i))\n plt.legend()\n plt.savefig(f\"{figure_dir}/{figure_name}\")\n
(Optional) Feel free to create more files/visualizations (what about investigating/exploring the data distribution?)
Make sure to update the README.md
file with a short description on how your scripts should be run
Finally make sure to update the requirements.txt
file with any packages that are necessary for running your code (see this set of exercises for help)
(Optional) Let's say that you are not satisfied with the template I have recommended that you use, which is completely fine. What should you then do? You should of course create your own template! This is actually not that hard to do.
As a starting point I would recommend that you fork either the mlops template, which you have already been using, or alternatively the data science template.
After forking the template, clone it down locally and let's start modifying it. The first step is changing the cookiecutter.json
file. For the mlops template it looks like this:
{\n \"project_name\": \"project_name\",\n \"repo_name\": \"{{ cookiecutter.project_name.lower().replace(' ', '_') }}\",\n \"author_name\": \"Your name (or your organization/company/team)\",\n \"description\": \"A short description of the project.\",\n \"python_version_number\": \"3.10\",\n \"open_source_license\": [\"No license file\", \"MIT\", \"BSD-3-Clause\"]\n}\n
To add a new variable, simply add a new line to the JSON file with the name of the variable and the default value you want it to have.
The actual template is located in the {{ cookiecutter.project_name }}
folder. cookiecutter
works by replacing everywhere that it sees {{ cookiecutter.<variable_name> }}
with the value of the variable. Therefore, if you want to add a new file to the template, just add it to the {{ cookiecutter.project_name }}
folder and make sure to add the {{ cookiecutter.<variable_name> }}
where you want the variable to be replaced.
After you have made the changes you want to the template, you should test it locally. Just run
cookiecutter . -f --no-input\n
and it should create a new folder using the default values of the cookiecutter.json
file.
Finally, make sure to push any changes you made to the template to GitHub, such that in the future you can use it by simply running
cookiecutter https://github.com/<username>/<my_template_repo>\n
Starting completely from scratch, what are the steps needed to create a new GitHub repository and push a specific template to it as the very first commit?
SolutionCreate a completely barebones repository, either using the GitHub UI or, if you have the GitHub CLI installed (not git
) you can run
gh repo create <repo_name> --public --confirm\n
Run cookiecutter
with the template you want to use
cookiecutter <template>\n
The name of the folder created by cookiecutter
should be the same as you just used.
Run the following sequence of commands
cd <project_name>\ngit init\ngit add .\ngit commit -m \"Initial commit\"\ngit remote add origin https://github.com/<username>/<repo_name>\ngit push origin master\n
That's it. The template should now have been pushed to the repository as the first commit.
That ends the module on code structure and cookiecutter
. We again want to stress that the point of using cookiecutter
is not about following one specific template, but about using some template for organizing your code. What often happens in a team is that multiple templates are needed in different stages of the development phase or for different product types, because they share a common structure while still having some specifics. Keeping templates up-to-date then becomes critical, such that no team member is using an outdated template. If you ever end up in this situation, we highly recommend checking out cruft, which works alongside cookiecutter
to not only create projects but also update existing ones as the template evolves. Cruft also has template validation capabilities to ensure projects match the latest version of a template.
Core Module
In this module, we are going to return to version control. However, this time we are going to focus on version control of data. The reason we need to distinguish between standard version control and data version control comes down to one problem: size.
Classic version control was developed to keep track of code files, which are all simple text files. Even a codebase that contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better the more data you feed them, we are seeing models today that are being trained on petabytes of data (1.000.000 GB).
Because this is an important problem, there exist a couple of frameworks that specialize in versioning data, such as DVC, DAGsHub, Hub, Modelstore and ModelDB. Regardless of the framework, they all implement roughly the same concept: instead of storing the actual data files, or any large artifact files in general, we store a pointer to these large files. We then version control the pointer instead of the artifact.
Image creditWe are in this course going to use DVC
provided by iterative.ai as they also provide tools for automating machine learning, which we are going to focus on later.
DVC (Data Version Control) is simply an extension of git
that can version not only data but also models and experiments in general. But how does it deal with these large data files? Essentially, DVC
will just keep track of a small metafile that will then point to some remote location where your original data is stored. Metafiles essentially work as placeholders for your data files. Your large data files are then stored in some remote location such as Google Drive or an S3
bucket from Amazon.
Image credit
As the figure shows, we now have two remote locations: one for code and one for data. We use git pull/push
for the code and dvc pull/push
for the data. The key concept is the connection between the data file model.pkl
which is fairly large and its respective metafile model.pkl.dvc
which is very small. The large file is stored in the data remote and the metafile is stored in the code remote.
If in doubt about some of the exercises, we recommend checking out the documentation for DVC as it contains excellent tutorials.
For these exercises, we are going to use Google drive as a remote storage solution for our data. If you do not already have a Google account, please create one (we are going to use it again in later exercises). Please make sure that you at least have 1GB of free space.
Next, install DVC and the Google Drive extension
pip install dvc\npip install dvc-gdrive\n
If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all. If the installation fails, we recommend that you start by updating pip and then try updating dvc
:
pip install -U pip\npip install -U dvc-gdrive\n
If this does not work for you, it is most likely due to a problem with pygit2
and in that case we recommend that you follow the instructions here.
In your MNIST repository run the following command from the terminal
dvc init\n
this will set up dvc
for this repository (similar to how git init
will initialize a git repository). These files should be committed using standard git
to your repository.
Go to your Google Drive and create a new folder called dtu_mlops_data
. Then copy the unique identifier belonging to that folder as shown in the figure below
Using this identifier, add it as a remote storage
dvc remote add -d storage gdrive://<your_identifier>\n
Check the content of the file .dvc/config
. Does it contain a pointer to your remote storage? Afterwards, make sure to add this file to the next commit we are going to make:
git add .dvc/config\n
Call the dvc add
command on your data files exactly like you would add a file with git
(you do not need to add every file by itself as you can directly add the data/
folder). Doing this should create a human-readable file with the extension .dvc
. This is the metafile as explained earlier that will serve as a placeholder for your data. If you are on Windows and this step fails you may need to install pywin32
. At the same time, the data
folder should have been added to the .gitignore
file that marks which files should not be tracked by git. Confirm that this is correct.
Now we are going to add, commit and tag the metafiles so we can restore to this stage later on. Commit and tag the files, which should look something like this:
git add data.dvc .gitignore\ngit commit -m \"First datasets, containing 25000 images\"\ngit tag -a \"v1.0\" -m \"data v1.0\"\n
Finally, push your data to the remote storage using dvc push
. You will be asked to authenticate, which involves following the prompted link and copy-pasting a code back. Check out your Google Drive folder. You will see that the data is not in a recognizable format anymore due to the way that dvc
packs and tracks the data. The boring detail is that dvc
converts the data into content-addressable storage, which makes the data much faster to retrieve. Finally, make sure that your data is not stored in your GitHub repository.
After authenticating the first time, DVC
should be set up without you having to authenticate again. If dvc for some reason fails to authenticate, you can try to reset the authentication. Locate the file $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
~/Library/Caches
~/.cache
This is the typical location, but it may vary depending on what distro you are running
{user}/AppData/Local
Delete the complete {gdrive_client_id}
folder and retry authenticating with dvc push
.
After completing the above steps, it is very easy for others (or yourself) to get setup with both code and data by simply running
git clone <my_repository>\ncd <my_repository>\ndvc pull\n
(assuming that you give them access rights to the folder in your drive). Try doing this (in some other location than your standard code) to make sure that the two commands indeed download both your code and data.
Let's now look at the process of updating our data. Remember, the important aspect of version control is that we do not need to store explicit files called data_v1.pt
, data_v2.pt
etc. but just have a single data.pt
where we can always check out earlier versions. Start by copying the data/corruptmnist_v2
folder from this repository to your MNIST code. This contains 3 extra datafiles with 15000 additional observations. Rerun your data pipeline so these get incorporated into the files in your processed
folder.
Redo the above steps, adding the new data using dvc
, committing and tagging the metafiles e.g. the following commands should be executed (with appropriate input):
dvc add -> git add -> git commit -> git tag -> dvc push -> git push
.
Let's say that you wanted to go back to the state of your data in v1.0. If the above steps have been done correctly, you should be able to do this using:
git checkout v1.0\ndvc checkout\n
confirm that you have reverted to the original data.
(Optional) Finally, it is important to note that dvc
is not only intended to be used to store data files but also any other large files such as trained model weights (with billions of parameters these can be quite large). For example, if we always store our best-performing model in a file called best_model.ckpt
then we can use dvc
to version control it, store it online and make it easy for others to download. Feel free to experiment with this using your model checkpoints.
In general dvc
is a great framework for version-controlling data and models. However, it is important to note that it does have some performance issues when dealing with datasets that consist of many files. Therefore, if you are ever working with a dataset that consists of many small files, it can be a good idea to:
zip files into a single archive and then version control the archive. The zip
archive should be placed in a data/raw
folder and then unzipped in the data/processed
folder.
If possible, turn your data into 1D arrays; then it can be stored in a single file such as .parquet
or .csv
. This is especially useful for tabular data. Then you can version control the single file instead of the many files (see the sketch below).
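A hedged sketch of the second suggestion for a folder of small image files, assuming pandas (with a parquet engine) and Pillow are available; the paths are made up for illustration:
from pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nfrom PIL import Image\n\nrows = []\nfor file in Path(\"data/raw/images\").glob(\"*.png\"):  # hypothetical folder of many small files\n    pixels = np.asarray(Image.open(file)).ravel()  # flatten each image to a 1D array\n    rows.append({\"name\": file.stem, **{f\"pixel_{i}\": int(v) for i, v in enumerate(pixels)}})\n\npd.DataFrame(rows).to_parquet(\"data/processed/images.parquet\")  # one file to version control\n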
How do you know that a repository is using dvc?
SolutionSimilar to a git repository having a .git
directory, a repository using dvc needs to have a .dvc
folder. Alternatively, you can use the dvc status
command.
Assume you just added a folder called data/
that you want to track with dvc
. What is the sequence of 5 commands to successfully version control the folder? (assuming you have already set up a remote)
dvc add data/\ngit add .\ngit commit -m \"added raw data\"\ngit push\ndvc push\n
That's all for today. With the combined power of git
and dvc
we should be able to version control everything in our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that dvc
offers more than just data version control, so if you want to deep dive into dvc
we recommend their pipeline feature and how this can be used to setup version controlled experiments. Note that we are going to revisit dvc
later for a more permanent (and large-scale) storage solution.
Core Module
Proper collaboration with other people requires that you can work on the same codebase in an organized manner. This is the reason that version control exists. Simply stated, it is a way to keep track of:
For a full explanation please see this page
Secondly, it is important to note that GitHub is not git! GitHub is the dominating player when it comes to hosting repositories, but that does not mean that they are the only ones providing free repository hosting (see Bitbucket or GitLab for some other examples).
That said we will be using git and GitHub throughout this course. It is a requirement for passing this course that you create a public repository with your code and use git to upload any code changes. How much you choose to integrate this into your own projects depends, but you are at least expected to be familiar with git+GitHub.
Image credit"},{"location":"s2_organisation_and_version_control/git/#initial-config","title":"Initial config","text":"What does Git stand for?
The name \"git\" was given by Linus Torvalds when he wrote the very first version. He described the tool as \"the stupid content tracker\" and the name as (depending on your mood):
Install git on your computer and make sure that your installation is working by writing git help
in a terminal and it should show you the help message for git.
Create a GitHub account if you do not already have one.
To make sure that we do not have to type in our GitHub username every time we want to make some changes, we can set it once and for all on our local machine
# type in a terminal\ngit config credential.helper store\ngit config --global user.email <email>\n
The most simple way to think of version control, is that it is just nodes with lines connecting them
Each node, which we call a commit is uniquely identified by a hash string. Each node, stores what our code looked like at that point in time (when we made the commit) and using the hash codes we can easily revert to a specific point in time.
The commits are made up of local changes that we make to our code. A basic workflow for adding commits is seen below
Assuming that we have made some changes to our local working directory and that we want to get these updates to be online in the remote repository we have to do the following steps:
First we run the command git add
. This will move our changes to the staging area. While changes are in the staging area we can very easily revert them (using git restore
). A unique hash has therefore not yet been assigned to the code, and we can still overwrite it.
To take our code from the staging area and make it into a commit, we simply run git commit
which will locally add a note to the graph. It is important again, that we have not pushed the commit to the online repository yet.
Finally, we want others to be able to use the changes that we made. We do a simple git push
and our commit gets online
Of course, the real power of version control is the ability to make branches, as in the image below
Image creditEach branch can contain code that is not present on other branches. This is useful when many developers are working together on the same project.
"},{"location":"s2_organisation_and_version_control/git/#exercises","title":"\u2754 Exercises","text":"In your GitHub account create an repository, where the intention is that you upload the code from the final exercise from yesterday
After creating the repository, clone it to your computer
git clone https://github.com/my_user_name/my_repository_name.git\n
Move/copy the three files from yesterday into the repository (and any other that you made)
Add the files to a commit by using git add
command
Commit the files using git commit
command where you use the -m
argument to provide a commit message (1).
Finally push the files to your repository using git push
. Make sure to check online that the files have been updated in your repository.
You can always use the command git status
to check where you are in the process of making a commit.
Also checkout the git log
command, which will show you the history of commits that you have made.
Make sure that you understand how to make branches, as this will allow you to try out code changes without messing with your working code. Creating a new branch can be done using:
# create a new branch\ngit checkout -b <my_branch_name>\n
Afterwards, you can use git checkout
(1) to change between branches (remember to commit your work!). Try adding something (a file, a new line of code etc.) to the newly created branch, commit it and try changing back to master afterwards. You should hopefully see that whatever you added on the branch is not present on the main branch.
git checkout
command is used for a lot of different things in git. It can be used to change branches, to revert changes and to create new branches. An alternative is using git switch
and git restore
which are more modern commands.If you do not already have a cloned version of this repository belonging to the course, make sure to make one! I am continuously updating/changing some of the material during the course and I therefore recommend that you each day before the lecture do a git pull
on your local copy
Git may seem like a waste of time when solutions like Dropbox, Google Drive etc. exist, and that is not completely untrue when only one or two people are working on a project. However, these file management systems fall short when hundreds to thousands of people work together. For this exercise you will go through the steps of making an open-source contribution:
Go online and find a project you do not own, where you can improve the code. You can either look at this page of good issues to get started with or for simplicity you can just choose the repository belonging to the course. Now fork the project by clicking the Fork button.
This will create a copy of the repository under your own account, which you have complete write access to. Note that code updates to the original repository do not automatically update the code in your fork.
Clone your local fork of the project using git clone
.
By default your local repository will be on the main branch
(HINT: you can check this with the git status
command). It is good practice to make a new branch when working on some changes. Use the git branch
command followed by the git checkout
command to create a new branch.
You are now ready to make changes to the repository. Try to find something to improve (any spelling mistakes?). When you have made the changes, do the standard git cycle: add -> commit -> push
Go online to the original repository and go to the Pull requests
tab. Find the compare
button and choose to compare the master branch
of the original repo with the branch that you just created in your own repository. Check the diff on the page to make sure that it contains the changes you have made.
Write a bit about the changes you have made and click Create pull request
:)
Forking a repository has the consequence that your fork and the repository that you forked can diverge. To mitigate this we can set what is called a remote upstream. Take a look on this page, and set a remote upstream for the repository you just forked.
Solutiongit remote add upstream <url-to-original-repo>\n
After setting the upstream branch, we need to pull and merge any updates. Take a look on this page and figure out how to do this.
Solutiongit fetch upstream\ngit checkout main\ngit merge upstream/main\n
As a final exercise we want to simulate a merge conflict, which happens when two users try to commit changes to exactly the same lines of code in the codebase, and git is not able to resolve how the different commits should be integrated.
In your browser, open your favorite repository (it could be the one you just worked on), go to any file of your choosing and click the edit button (see image below) and make some change to the file. For example, if you choose a Python file you can just import some random packages at the top of the file. Commit the change.
Make sure not to pull the change you just made to your local computer. Locally, make changes to the same lines in the same file and commit them afterwards.
Now try to git pull
the online changes. What should (hopefully) happen is that git will tell you that it found a merge conflict that needs to be resolved. Open the file and you should see something like this
<<<<<<< HEAD\nthis is some content to mess with\ncontent to append\n=======\ntotally different content to merge later\n>>>>>>> master\n
this should be interpreted as: everything between <<<<<<<
and =======
are the changes made by your local commit and everything between =======
and >>>>>>>
are the changes you are trying to pull. To fix the merge conflict you simply have to make the code in the two \"cells\" work together. When you are done, remove the identifiers <<<<<<<
, =======
and >>>>>>>
.
Finally, commit the merge and try to push.
(Optional) The above exercises have focused on how to use git from the terminal, which I highly recommend learning. However, if you are using a proper editor, it most likely also has built-in support for version control. We recommend getting familiar with these features (here is a tutorial for VS Code)
How do you know if a certain directory is a git repository?
SolutionYou can check if there is a \".git\" directory. Alternatively, you can use the git status
command.
Explain what the file gitignore
is used for?
The file gitignore
is used to tell git which files to ignore when doing a git add .
command. This is useful for files that are not part of the codebase, but are needed for the code to run (e.g. data files) or files that contain sensitive information (e.g. .env
files that contain API keys and passwords).
You have two branches - main and devel. What sequence of commands would you need to execute to make sure that devel is in sync with main?
Solutiongit checkout main\ngit pull\ngit checkout devel\ngit merge main\n
What best practices are you familiar with regarding version control?
SolutionThat covers the basics of git to get you started. In the exercise folder you can find a git cheat sheet with the most useful commands for future reference. Finally, we want to point out another awesome feature of GitHub: the in-browser editor. Sometimes you have a small edit that you want to make but would still like to do it in an IDE/editor. Or you may be in the situation where you are working from another device than your usual developer machine. GitHub has a built-in editor that can simply be enabled by changing any URL from
https://github.com/username/repository\n
to
https://github.dev/username/repository\n
Try it out on your newly created repository.
"},{"location":"s2_organisation_and_version_control/good_coding_practice/","title":"M7 - Good coding practice","text":""},{"location":"s2_organisation_and_version_control/good_coding_practice/#good-coding-practice","title":"Good coding practice","text":"Quote
Code is read more often than it is written. Guido Van Rossum (author of Python)
It is hard to define exactly what good coding practices are. But the above quote by Guido does hint at what it could be, namely that it has to do with how others observe and perceive your code. In general, good coding practice is about making sure that your code is easy to read and understand, not only by others but also by your future self. The key concept to keep in mind when we are talking about good coding practice is consistency. In many cases it does not matter exactly how you choose to style your code; the important part is that you are consistent about it.
Image credit"},{"location":"s2_organisation_and_version_control/good_coding_practice/#documentation","title":"Documentation","text":"Most programmers have a love-hate relationship with documentation: we absolutely hate writing it ourselves, but love when someone else has actually taken the time to add it to their code. There is no doubt that well-documented code is much easier to maintain, as you do not need to remember all details about the code to still maintain it. It is key to remember that good documentation saves more time than it takes to write.
The problem with documentation is that there is no right or wrong way to do it. You can end up doing:
Under documentation: You document information that is clearly visible from the code and not the complex parts that are actually hard to understand.
Over documentation: Writing too much documentation will have the opposite effect on most people than what you want: there is too much to read, so people will skip it.
Writing good documentation is a skill that takes time to train, so let's try to do it.
Quote
Code tells you how; Comments tell you why. Jeff Atwood
"},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises","title":"\u2754 Exercises","text":"Go over the most complicated file in your project. Be critical and add comments where the logic behind the code is not easily understandable. (1)
In deep learning we often work with multi-dimensional tensors that constantly change shape after each operation. It is good practice to annotate with comments when tensors undergo some reshaping. In the following example we compute the pairwise Euclidean distance between two tensors using broadcasting, which results in multiple shape operations.
x = torch.randn(5, 10) # N x D\ny = torch.randn(7, 10) # M x D\nxy = x.unsqueeze(1) - y.unsqueeze(0) # (N x 1 x D) - (1 x M x D) = (N x M x D)\npairwise_euc_dist = xy.abs().pow(2.0).sum(dim=-1) # N x M\n
Add docstrings to at least two Python functions/methods. You can see here (example 5) a good example of how to use identifiable keywords such as Parameters
, Args
, Returns
which standardizes the way of writing docstrings.
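As a short illustration (Google-style; the function itself is made up for the example), such a docstring could look like:
def accuracy(preds: list[int], labels: list[int]) -> float:\n    \"\"\"Compute the fraction of correct predictions.\n\n    Args:\n        preds: Predicted class indices.\n        labels: Ground-truth class indices.\n\n    Returns:\n        The accuracy as a number between 0 and 1.\n    \"\"\"\n    return sum(p == t for p, t in zip(preds, labels)) / len(labels)\n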
While Python already enforces some styling (e.g. code should be indented in a specific way), this is not enough to ensure that code from different users actually looks alike. Maybe even more troubling is that you will often see your own style of coding change as you become more and more experienced. This kind of difference in coding style is not that important to take care of when you are working on a personal project, but when multiple people are working together on the same project it is important to consider.
The question then remains what styling you should use. This is where PEP8 comes into play, which is the official style guide for Python. It essentially contains what is considered \"good practice\" and \"bad practice\" when coding Python.
For many years the most commonly used tool to check if your code is PEP8 compliant has been flake8. However, in this course we are going to use ruff, which is quickly gaining popularity due to how fast it is and how quickly the developers are adding new features. (1)
flake8
and ruff
are what are called linters or lint tools, i.e. static code analysis programs used to flag programming errors, bugs, and styling errors. Install ruff
pip install ruff\n
Run ruff
on your project or part of your project
ruff check . # Lint all files in the current directory (and any subdirectories)\nruff check path/to/code/ # Lint all files in `/path/to/code` (and any subdirectories).\n
are you PEP8 compliant or are you a normal mortal?
You could go and fix all the small errors that ruff
is giving. However, in practice large projects instead rely on some kind of code formatter that will automatically format your code for you to be PEP8 compliant. For the longest time some of the biggest formatters in Python have been black and yapf, but we are going to use ruff
which also has a built-in formatter that should be a drop-in replacement for black
.
Try to use ruff format
to format your code
ruff format . # Format all files in the current directory.\nruff format /path/to/file.py # Format a single file.\n
By default ruff
will apply a selection of rules when checking or formatting code. However, many more rules can be activated through configuration. If you have completed module M6 on code structure you will have encountered the pyproject.toml
file, which can store not only build instructions for our package but also configuration of developer tools. Let's try to configure ruff
using the pyproject.toml
file.
One aspect that is not covered by PEP8 is how import
statements in Python should be organized. If you are like most people, you place your import
statements at the top of the file, ordered simply by when you needed them. A better practice is to introduce some clear structure into our imports. In older versions of this course we used isort for this, but here we are going to configure ruff
to do the job. In your pyproject.toml
file add the following lines
[tool.ruff]\nselect = [\"I\"]\n
and try re-running ruff check
and ruff format
. Hopefully this should reorganize your imports to follow common practice. (1)
os
) in one block, followed by third-party dependencies (like torch
) in a second block and finally imports from your own package in a third block. Each block is then put in alphabetical order. One PEP8 styling rule that is often diverged from is the recommended line length of 79 characters, which many (including myself) consider very restrictive. If your code consists of multiple levels of indentation, you can quickly run into 79 characters being limiting. For this reason many projects increase it, often to 120 characters, which seems to be the sweet spot for how many characters fit in a coding window on a laptop. Add the line
line-length=120\n
under the [tool.ruff]
section in the pyproject.toml
file and rerun ruff check
and ruff format
on your code.
Experiment yourself with further configuration of ruff
. In particular we recommend adding more rules and looking at the [tool.ruff.pydocstyle]
configuration to indicate how you have styled your documentation.
In addition to writing documentation and following a specific styling, in Python we have a third way of improving the quality of our code: through typing. Typing goes back to the earlier programming languages like c
, c++
etc., where data types needed to be explicitly stated for variables:
#include <iostream>\n\nint main() {\n int x = 5 + 6;\n float y = 0.5;\n std::cout << \"Hello World! \" << x << \" \" << y << std::endl;\n return 0;\n}\n
This is not required by Python, but it can really improve the readability of code, in that you can directly read from the code what the expected types of input arguments and return values are. In Python the :
character has been reserved for type hints. Here is one example of adding typing to a function:
def add2(x: int, y: int) -> int:\n return x+y\n
here we mark that both x
and y
are integers and using the arrow notation ->
we mark that the output type is also an integer. Assuming that we are also going to use the function for floats and torch.Tensor
s we could improve the typing by specifying a union of types. Depending on the version of Python you are using the syntax for this can be different.
from torch import Tensor # note it is Tensor with upper case T. This is the base class of all tensors\nfrom typing import Union\ndef add2(x: Union[int, float, Tensor], y: Union[int, float, Tensor]) -> Union[int, float, Tensor]:\n return x+y\n
from torch import Tensor # note it is Tensor with upper case T. This is the base class of all tensors\ndef add2(x: int | float | Tensor, y: int | float | Tensor) -> int | float | Tensor:\n return x+y\n
Finally, since this is a very generic function it also works on numpy
arrays etc. we can always default to the Any
type if we are not sure about all the specific types that a function can take
from typing import Any\ndef add2(x: Any, y: Any) -> Any:\n return x+y\n
However, in this case we are basically in the same situation as if our function were not typed, as the type hints do not help us at all. Therefore, use Any
only when necessary.
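A more precise alternative in situations like this is a TypeVar, which expresses that the return type follows the input types; a minimal sketch constrained to int and float (torch.Tensor could be added as a further constraint):
from typing import TypeVar\n\nT = TypeVar(\"T\", int, float)  # constrained to the types we actually want to support\n\n\ndef add2(x: T, y: T) -> T:\n    return x + y\n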
Exercise files
We provide a file called typing_exercise.py
. Add typing everywhere in the file. Please note that you will need the following import:
from typing import Callable, Optional, Tuple, Union, List # you will need all of them in your code\n
for it to work. This cheat sheet is a good resource on typing. We also provide typing_exercise_solution.py
, but try to solve the exercise yourself.
typing_exercise.py
typing_exercise.pyimport torch\nfrom torch import nn\n\n\nclass Network(nn.Module):\n \"\"\"Builds a feedforward network with arbitrary hidden layers.\n\n Arguments:\n input_size: integer, size of the input layer\n output_size: integer, size of the output layer\n hidden_layers: list of integers, the sizes of the hidden layers\n\n \"\"\"\n\n def __init__(self, input_size, output_size, hidden_layers, drop_p=0.5) -> None:\n super().__init__()\n # Input to a hidden layer\n self.hidden_layers = nn.ModuleList([nn.Linear(input_size, hidden_layers[0])])\n\n # Add a variable number of more hidden layers\n layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])\n self.hidden_layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes])\n\n self.output = nn.Linear(hidden_layers[-1], output_size)\n\n self.dropout = nn.Dropout(p=drop_p)\n\n def forward(self, x):\n \"\"\"Forward pass through the network, returns the output logits.\"\"\"\n for each in self.hidden_layers:\n x = nn.functional.relu(each(x))\n x = self.dropout(x)\n x = self.output(x)\n\n return nn.functional.log_softmax(x, dim=1)\n\n\ndef validation(model, testloader, criterion):\n \"\"\"Validation pass through the dataset.\"\"\"\n accuracy = 0\n test_loss = 0\n for images, labels in testloader:\n images = images.resize_(images.size()[0], 784)\n\n output = model.forward(images)\n test_loss += criterion(output, labels).item()\n\n ## Calculating the accuracy\n # Model's output is log-softmax, take exponential to get the probabilities\n ps = torch.exp(output)\n # Class with highest probability is our predicted class, compare with true label\n equality = labels.data == ps.max(1)[1]\n # Accuracy is number of correct predictions divided by all predictions, just take the mean\n accuracy += equality.type_as(torch.FloatTensor()).mean()\n\n return test_loss, accuracy\n\n\ndef train(model, trainloader, testloader, criterion, optimizer=None, epochs=5, print_every=40) -> None:\n \"\"\"Train a PyTorch Model.\"\"\"\n if optimizer is None:\n optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\n steps = 0\n running_loss = 0\n for e in range(epochs):\n # Model in training mode, dropout is on\n model.train()\n for images, labels in trainloader:\n steps += 1\n\n # Flatten images into a 784 long vector\n images.resize_(images.size()[0], 784)\n\n optimizer.zero_grad()\n\n output = model.forward(images)\n loss = criterion(output, labels)\n loss.backward()\n optimizer.step()\n\n running_loss += loss.item()\n\n if steps % print_every == 0:\n # Model in inference mode, dropout is off\n model.eval()\n\n # Turn off gradients for validation, will speed up inference\n with torch.no_grad():\n test_loss, accuracy = validation(model, testloader, criterion)\n\n print(\n f\"Epoch: {e + 1}/{epochs}.. \",\n f\"Training Loss: {running_loss / print_every:.3f}.. \",\n f\"Test Loss: {test_loss / len(testloader):.3f}.. \",\n f\"Test Accuracy: {accuracy / len(testloader):.3f}\",\n )\n\n running_loss = 0\n\n # Make sure dropout and grads are on for training\n model.train()\n
Solution typing_exercise_solution.pyfrom __future__ import annotations\n\nfrom collections.abc import Callable\n\nimport torch\nfrom torch import nn\n\n\nclass Network(nn.Module):\n \"\"\"Builds a feedforward network with arbitrary hidden layers.\n\n Arguments:\n input_size: integer, size of the input layer\n output_size: integer, size of the output layer\n hidden_layers: list of integers, the sizes of the hidden layers\n\n \"\"\"\n\n def __init__(\n self,\n input_size: int,\n output_size: int,\n hidden_layers: list[int],\n drop_p: float = 0.5,\n ) -> None:\n super().__init__()\n # Input to a hidden layer\n self.hidden_layers = nn.ModuleList([nn.Linear(input_size, hidden_layers[0])])\n\n # Add a variable number of more hidden layers\n layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])\n self.hidden_layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes])\n\n self.output = nn.Linear(hidden_layers[-1], output_size)\n\n self.dropout = nn.Dropout(p=drop_p)\n\n def forward(self, x: torch.Tensor) -> torch.Tensor:\n \"\"\"Forward pass through the network, returns the output logits.\"\"\"\n for each in self.hidden_layers:\n x = nn.functional.relu(each(x))\n x = self.dropout(x)\n x = self.output(x)\n\n return nn.functional.log_softmax(x, dim=1)\n\n\ndef validation(\n model: nn.Module,\n testloader: torch.utils.data.DataLoader,\n criterion: Callable | nn.Module,\n) -> tuple[float, float]:\n \"\"\"Validation pass through the dataset.\"\"\"\n accuracy = 0\n test_loss = 0\n for images, labels in testloader:\n images = images.resize_(images.size()[0], 784)\n\n output = model.forward(images)\n test_loss += criterion(output, labels).item()\n\n ## Calculating the accuracy\n # Model's output is log-softmax, take exponential to get the probabilities\n ps = torch.exp(output)\n # Class with highest probability is our predicted class, compare with true label\n equality = labels.data == ps.max(1)[1]\n # Accuracy is number of correct predictions divided by all predictions, just take the mean\n accuracy += equality.type_as(torch.FloatTensor()).mean().item()\n\n return test_loss, accuracy\n\n\ndef train(\n model: nn.Module,\n trainloader: torch.utils.data.DataLoader,\n testloader: torch.utils.data.DataLoader,\n criterion: Callable | nn.Module,\n optimizer: None | torch.optim.Optimizer = None,\n epochs: int = 5,\n print_every: int = 40,\n) -> None:\n \"\"\"Train a PyTorch Model.\"\"\"\n if optimizer is None:\n optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\n steps = 0\n running_loss = 0\n for e in range(epochs):\n # Model in training mode, dropout is on\n model.train()\n for images, labels in trainloader:\n steps += 1\n\n # Flatten images into a 784 long vector\n images.resize_(images.size()[0], 784)\n\n optimizer.zero_grad()\n\n output = model.forward(images)\n loss = criterion(output, labels)\n loss.backward()\n optimizer.step()\n\n running_loss += loss.item()\n\n if steps % print_every == 0:\n # Model in inference mode, dropout is off\n model.eval()\n\n # Turn off gradients for validation, will speed up inference\n with torch.no_grad():\n test_loss, accuracy = validation(model, testloader, criterion)\n\n print(\n f\"Epoch: {e + 1}/{epochs}.. \",\n f\"Training Loss: {running_loss / print_every:.3f}.. \",\n f\"Test Loss: {test_loss / len(testloader):.3f}.. \",\n f\"Test Accuracy: {accuracy / len(testloader):.3f}\",\n )\n\n running_loss = 0\n\n # Make sure dropout and grads are on for training\n model.train()\n
mypy is what is called a static type checker. If you are using typing in your code, then a static type checker can help you find common mistakes. mypy
does not run your code, but it scans it and checks that the types you have given are compatible. Install mypy
pip install mypy\n
Try to run mypy
on the typing.py
file
mypy typing_exercise.py\n
If you have solved exercise 11 correctly then you should get no errors. If not mypy
should tell you where your types are incompatible.
According to PEP8 what is wrong with the following code?
class myclass(nn.Module):\n def TrainNetwork(self, X, y):\n ...\n
Solution According to PEP8 classes should follow the CapWords convention, meaning that the first letter in each word of the class name should be capitalized. Thus myclass
should therefore be MyClass
. On the other hand, functions and methods should be full lowercase with words separated by underscore. Thus TrainNetwork
should be train_network
.
What would be the of argument x
for a function def f(x):
if it should support the following input
x1 = [1, 2, 3, 4]\nx2 = (1, 2, 3, 4)\nx3 = None\nx4 = {1: \"1\", 2: \"2\", 3: \"3\", 4: \"4\"}\n
Solution The easy solution would be to do def f(x : Any)
. But instead we could also go with:
def f(x: None | Tuple[int, ...] | List[int] | Dict[int, str]):\n
alternatively, we could also do
def f(x: None | Iterable[int]):\n
because both list
, tuple
and dict
are iterables and therefore can be covered by one type (in this specific case).
This ends the module on coding style. We again want to emphasize that a good coding style is more about having a consistent style than strictly following PEP8. A good example of this is Google, which has its own style guide for Python. This guide does not match PEP8 exactly, but it makes sure that different teams within Google that are working on different projects still to a large degree follow the same style, and therefore if a project is handed from one team to another, at least that will not be a problem.
"},{"location":"s3_reproducibility/","title":"Reproducibility","text":"Slides
Learn how to create reproducible computing environments using docker
and how to use them to run your code.
M10: Docker
Learn how to use hydra
to manage configuration files and how to integrate it with your code.
M11: Config Files
Today is all about reproducibility - one of those concepts that everyone agrees is very important and something should be done about, but the reality is that it is very hard to ensure full reproducibility. The last sessions have already touched a bit on how tools like conda
and code organization can help make code more reproducible. Today we are going all the way to ensure that our scripts and our computing environment are fully reproducible.
Reproducibility is closely related to the scientific method:
Observe -> Question -> Hypotheses -> Experiment -> Conclude -> Result -> Observe -> ...
Not having reproducible experiments essentially breaks the cycle between doing experiments and making conclusions. If experiments are not reproducible, then we do not expect that others will arrive at the same conclusion as ourselves. As machine learning experiments are fundamentally the same as doing chemical experiments in a laboratory, we should be equally careful in making sure our environments are reproducible (think of your laptop as your laboratory).
Secondly, if we focus on why reproducibility matters especially in machine learning, it is part of the bigger challenge of making sure that machine learning is trustworthy.
Many different aspects are needed if trustworthy machine learning is ever going to be a reality. We need robustness of our pipelines so we can trust that they do not fail under heavy load. We need integrity to make sure that pipelines are deployed if they are of high quality. We need explainability to make sure that we understand what our machine learning models are doing, so it is not just a black box. We need reproducibility to make sure that the results of our models can be reproduced over and over again. Finally, we need fairness to make sure that our models are not biased toward specific populations. Figure inspired by this paper.Trustworthy ML is the idea that machine learning agents can be trusted. Take the example of a machine learning agent being responsible for medical diagnoses. It is very clear that we need to be able to trust that the agent gives us the correct diagnosis for the system to work in practice. Reproducibility plays a big role here, because without we cannot be sure that the same agent deployed at two different hospitals will give the same diagnosis (given the same input).
Learning objectives
The learning objectives of this session are:
docker
to create a reproducible container, including how to build them from scratchhydra
to integrate with config filesWith docker we can make sure that our compute environment is reproducible, but that does not mean that all our experiments magically become reproducible. There are other factors that are important for creating reproducible experiments.
In this paper (highly recommended read) the authors tried to reproduce the results of 255 papers and tried to figure out which factors where significant to succeed. One of those factors were \"Hyperparameters Specified\" e.g. whether or not the authors of the paper had precisely specified the hyperparameter that was used to run the experiments. It should come as no surprise that this can be a determining factor for reproducibility, however it is not given that hyperparameters are always well specified.
"},{"location":"s3_reproducibility/config_files/#configure-experiments","title":"Configure experiments","text":"There is really no way around it: deep learning contains a lot of hyperparameters. In general, a hyperparameter is any parameter that affects the learning process (e.g. the weights of a neural network are not hyperparameters because they are a consequence of the learning process). The problem with having many hyperparameters to control in your code, is that if you are not careful and structure them it may be hard after running a experiment to figure out which hyperparameters were actually used. Lack of proper configuration management can cause serious problems with reliability, uptime, and the ability to scale a system.
One of the most basic ways of structuring hyperparameters, is just to put them directly into you train.py
script in some object:
class my_hp:\n batch_size: 64\n lr: 128\n other_hp: 12345\n\n# easy access to them\ndl = DataLoader(Dataset, batch_size=my_hp.batch_size)\n
the problem here is configuration is not easy. Each time you want to run a new experiment, you basically have to change the script. If you run the code multiple times, without committing the changes in between then the exact hyperparameter configuration for some experiments may be lost. Alright, with this in mind you change strategy to use an argument parser e.g. run experiments like this
python train.py --batch_size 256 --learning_rate 1e-4 --other_hp 12345\n
This at least solves the problem with configurability. However, we again can end up with losing experiments if we are not careful.
What we really want is some way to easily configure our experiments where the hyperparameters are systematically saved with the experiment. For this we turn our attention to Hydra, a configuration tool that is based around writing config files to keep track of hyperparameters. Hydra operates on top of OmegaConf which is a yaml
based hierarchical configuration system.
A simple yaml
configuration file could look like
#config.yaml\nhyperparameters:\n batch_size: 64\n learning_rate: 1e-4\n
with the corresponding Python code for loading the file
from omegaconf import OmegaConf\n# loading\nconfig = OmegaConf.load('config.yaml')\n\n# accessing in two different ways\ndl = DataLoader(dataset, batch_size=config.hyperparameters.batch_size)\noptimizer = torch.optim.Adam(model.parameters(), lr=config['hyperparameters']['learning_rate'])\n
or using hydra
for loading the configuration
import hydra\n\n@hydra.main(config_name=\"basic.yaml\")\ndef main(cfg):\n print(cfg.hyperparameters.batch_size, cfg.hyperparameters.learning_rate)\n\nif __name__ == \"__main__\":\n main()\n
The idea behind refactoring our hyperparameters into .yaml
files is that we disentangle the model configuration from the model. In this way it is easier to do version control of the configuration because we have it in a separate file.
Exercise files
The main idea behind the exercises is to take a single script (that we provide) and use Hydra to make sure that everything gets correctly logged such that you would be able to exactly report to others how each experiment was configured. In the provided script, the hyperparameters are hardcoded into the code and your job will be to separate them out into a configuration file.
Note that we provide a solution (in the vae_solution
folder) that can help you get through the exercise, but try to look online for your answers before looking at the solution. Remember: its not about the result, its about the journey.
Start by installing hydra:
pip install hydra-core\n
Remember to add it to your requirements.txt
file.
Next take a look at the vae_mnist.py
and model.py
file and understand what is going on. It is a model we will revisit during the course.
Identify the key hyperparameters of the script. Some of them should be easy to find, but at least 3 have made it into the core part of the code. One essential hyperparameter is also not included in the script but is needed to be completely reproducible (HINT: the weights of any neural network are initialized at random).
SolutionFrom the top of the file batch_size
, x_dim
, hidden_dim
can be found as hyperparameters. Looking through the code it can be seen that the latent_dim
of the encoder and decoder, lr
or the optimzer, epochs
in the training loop also are hyperparameters. Finally, the seed
is not included in the script but is needed to make the script fully reproducible e.g. torch.manual_seed(seed)
.
Write a configuration file config.yaml
where you write down the hyperparameters that you have found
Get the script running by loading the configuration file inside your script (using hydra) that incorporates the hyperparameters into the script. Note: you should only edit the vae_mnist.py
file and not the model.py
file.
Run the script
By default hydra will write the results to a outputs
folder, with a sub-folder for the day the experiment was run and further the time it was started. Inspect your run by going over each file the hydra has generated and check the information has been logged. Can you find the hyperparameters?
Hydra also allows for dynamically changing and adding parameters on the fly from the command-line:
Try changing one parameter from the command-line
python vae_mnist.py hyperparameters.seed=1234\n
Try adding one parameter from the command-line
python vae_mnist.py +experiment.stuff_that_i_want_to_add=42\n
By default the file vae_mnist.log
should be empty, meaning that whatever you printed to the terminal did not get picked up by Hydra. This is due to Hydra under the hood making use of the native python logging package. This means that to also save all printed output from the script we need to convert all calls to print
with log.info
Create a logger in the script:
import logging\nlog = logging.getLogger(__name__)\n
Exchange all calls to print
with calls to log.info
Try re-running the script and make sure that the output printed to the terminal also gets saved to the vae_mnist.log
file
Make sure that your script is fully reproducible. To check this you will need two runs of the script to compare. Then run the reproducibility_tester.py
script as
python reproducibility_tester.py path/to/run/1 path/to/run/2\n
the script will go over trained weights to see if the match and that the hyperparameters was the same. Note: for the script to work, the weights should be saved to a file called trained_model.pt
(this is the default of the vae_mnist.py
script, so only relevant if you have changed the saving of the weights)
Make a new experiment using a new configuration file where you have changed a hyperparameter of your own choice. You are not allowed to change the configuration file in the script but should instead be able to provide it as an argument when launching the script e.g. something like
python vae_mnist.py experiment=exp2\n
We recommend that you use a file structure like this
|--conf\n| |--config.yaml\n| |--experiments\n| |--exp1.yaml\n| |--exp2.yaml\n|--my_app.py\n
Finally, a awesome feature of hydra is the instantiate feature. This allows you to define a configuration file that can be used to directly instantiating objects in python. Try to create a configuration file that can be used to instantiating the Adam
optimizer in the vae_mnist.py
script.
The configuration file could look like this
optimizer:\n _target_: torch.optim.Adam\n lr: 1e-3\n betas: [0.9, 0.999]\n eps: 1e-8\n weight_decay: 0\n
and the python code to load the configuration file and instantiate the optimizer could look like this
import hydra\nimport torch\nfrom torch import nn\n\n@hydra.main(config_path=\".\", config_name=\"adam\")\ndef main(cfg):\n    model = nn.Linear(10, 2)  # dummy model; Adam needs some parameters to optimize\n    optimizer = hydra.utils.instantiate(cfg.optimizer, params=model.parameters())\n    print(optimizer)\n\nif __name__ == \"__main__\":\n    main()\n
This will print the optimizer object that is created from the configuration file.
Make your MNIST code reproducible! Apply what you have just done in the simple script to your MNIST code. The only requirement is that this time you use multiple configuration files, meaning that you should have at least one model_conf.yaml
file and a training_conf.yaml
file, separating out the hyperparameters that have to do with the model definition from those that have to do with the training. You can also choose to work with even more complex config setups: in the image below the configuration has two layers such that we can individually specify hyperparameters belonging to a specific model architecture and hyperparameters for each individual optimizer that we may try.
Image credit"},{"location":"s3_reproducibility/docker/","title":"M10 - Docker","text":""},{"location":"s3_reproducibility/docker/#docker","title":"Docker","text":"
Core Module
Image creditWhile the above picture may seem silly at first, it is actually pretty close to how Docker came into existence. A big part of creating an MLOps pipeline is being able to reproduce it. Reproducibility goes beyond versioning our code with git
and using conda
environments to keep track of our Python installations. To truly achieve reproducibility, we need to capture system-level components such as:
Docker provides this kind of system-level reproducibility by creating isolated program dependencies. In addition to providing reproducibility, one of the key features of Docker is scalability, which is important when we later discuss deployment. Because Docker ensures system-level reproducibility, it does not (conceptually) matter whether we try to start our program on a single machine or on 1000 machines at once.
"},{"location":"s3_reproducibility/docker/#docker-overview","title":"Docker Overview","text":"Docker has three main concepts: Dockerfile, Docker image, and Docker container:
A Dockerfile is a basic text document that contains all the commands a user could call on the command line to run an application. This includes installing dependencies, pulling data from online storage, setting up code, and specifying commands to run (e.g., python train.py
).
Running, or more correctly, building a Dockerfile will create a Docker image. An image is a lightweight, standalone/containerized, executable package of software that includes everything (application code, libraries, tools, dependencies, etc.) necessary to make an application run.
Actually running an image will create a Docker container. This means that the same image can be launched multiple times, creating multiple containers.
The exercises today will focus on how to construct the actual Dockerfile, as this is the first step to constructing your own container.
"},{"location":"s3_reproducibility/docker/#docker-sharing","title":"Docker Sharing","text":"The whole point of using Docker is that sharing applications becomes much easier. In general, we have two options:
After creating the Dockerfile
, we can simply commit it to GitHub (it's just a text file) and then ask other users to simply build the image themselves.
After building the image ourselves, we can choose to upload it to an image registry such as Docker Hub, where others can get our image by simply running docker pull
, making them able to instantaneously run it as a container, as shown in the figure below:
In the following exercises, we guide you on how to build a docker file for your MNIST repository that will make the training and prediction a self-contained application. Please make sure that you somewhat understand each step and do not just copy the exercise. Also, note that you probably need to execute the exercise from an elevated terminal e.g. with administrative privilege.
The exercises today are only an introduction to docker and some of the steps are not optimized from a production point of view. For example, we often want to keep the size of the docker image as small as possible, which we are not focusing on in these exercises.
If you are using VScode
then we recommend installing the VScode docker extension to easily get an overview of which images have been built and which containers are running. Additionally, the extension named Dev Containers may also be beneficial for you to install.
Start by installing docker. How much trouble you need to go through depends on your operating system. For Windows and Mac, we recommend installing Docker Desktop, which comes with a graphical user interface (GUI) for quickly viewing docker images and docker containers currently built/in use. Windows users who have not installed WSL yet are going to have to do it now (as docker needs it as a backend for starting virtual machines), but you do not need to install docker in WSL. After installing docker we recommend that you restart your laptop.
Try running the following to confirm that your installation is working:
docker run hello-world\n
which should give the message
Hello from Docker!\nThis message shows that your installation appears to be working correctly.\n
Next, let's try to download an image from Docker Hub. Download the busybox
image:
docker pull busybox\n
which is a very small (1-5 MB) containerized application that contains the most essential Unix file utilities, shell utilities, etc.
After pulling the image, write
docker images\n
which should show you all available images. You should see the busybox
image that we just downloaded.
Let's try to run this image
docker run busybox\n
You will see that nothing happens! The reason is that we did not provide any command to docker run
. We essentially just ask it to start the busybox
virtual machine, do nothing, and then close it again. Now, try again, this time with
docker run busybox echo \"hello from busybox\"\n
Note how fast this process is. In just a few seconds, Docker is able to start a virtual machine, execute a command, and kill it afterward.
Try running
docker ps\n
What does this command do? What if you add -a
to the end?
If we want to run multiple commands within the virtual machine, we can start it in interactive mode
docker run -it busybox\n
This can be a great way to investigate what the filesystem of our virtual machine looks like.
As you may have already noticed by now, each time we execute docker run
, we can still see small remnants of the containers using docker ps -a
. These stray containers can end up taking up a lot of disk space. To remove them, use docker rm
where you provide the container ID that you want to delete
docker rm <container_id>\n
Let's now move on to trying to construct a Dockerfile ourselves for our MNIST project. Create a file called trainer.dockerfile
. The intention is that we want to develop one Dockerfile for running our training script and one for doing predictions.
Instead of starting from scratch, we nearly always want to start from some base image. For this exercise, we are going to start from a simple python
image. Add the following to your Dockerfile
# Base image\nFROM python:3.9-slim\n
Next, we are going to install some system essentials in our image, in this case a compiler toolchain that some Python packages need during installation. These instructions may seem familiar if you are using Linux:
# Install system build dependencies\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n
The previous two steps are common for any Docker application where you want to run Python. All the remaining steps are application-specific (to some degree):
Let's copy over our application (the essential parts) from our computer to the container:
COPY requirements.txt requirements.txt\nCOPY pyproject.toml pyproject.toml\nCOPY <project-name>/ <project-name>/\nCOPY data/ data/\n
Remember that we only want the essential parts to keep our Docker image as small as possible. Why do we need each of these files/folders to run training in our Docker container?
Let's set the working directory in our container and add commands that install the dependencies (1):
We split the installation into two steps so that Docker can cache our project dependencies separately from our application code. This means that if we change our application code, we do not need to reinstall all the dependencies. This is a common strategy for Docker images.
As an alternative, you can use RUN make requirements
if you have a Makefile
that installs the dependencies. Just remember to also copy over the Makefile
into the Docker image.
WORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\nRUN pip install . --no-deps --no-cache-dir\n
The --no-cache-dir
is quite important. Can you explain what it does and why it is important in relation to Docker?
Finally, we are going to name our training script as the entrypoint for our Docker image. The entrypoint is the application that we want to run when the image is being executed:
ENTRYPOINT [\"python\", \"-u\", \"<project_name>/train_model.py\"]\n
The \"-u\" flag
here makes sure that any output from our script, e.g., any print(...)
statements, gets redirected to our terminal. If not included, you would need to use docker logs
to inspect your run.
We are now ready to build our Dockerfile into a Docker image.
docker build -f trainer.dockerfile . -t trainer:latest\n
MAC M1/M2 users In general, Docker images are built for a specific platform. For example, if you are using a Mac with an M1/M2 chip, then you are running on an ARM architecture. If you are using a Windows or Linux machine, then you are most likely running on an AMD64 architecture. This is important to know when building Docker images. Thus, Docker images you build may not work on platforms other than the one you built them on. You can specify which platform you want to build for by adding the --platform
argument to the docker build
command:
docker build --platform linux/amd64 -f trainer.dockerfile . -t trainer:latest\n
and also when running the image:
docker run --platform linux/amd64 trainer:latest\n
Note that this will significantly increase the build and run time of your Docker image when running locally, because Docker will need to emulate the other platform. In general, for the exercises today, you should not need to specify the platform, but be aware of this if you are building Docker images on your own.
Please note that here we are providing two extra arguments to docker build
. The -f trainer.dockerfile .
(the dot, which sets the build context, is important to remember) indicates which Dockerfile we want to build from (only needed if you did not name it just Dockerfile
) and the -t trainer:latest
is the respective name and tag that we see afterward when running docker images
(see image below). Please note that building a Docker image can take a couple of minutes.
Docker images and space
Docker images can take up a lot of space on your computer, especially the Docker images we are trying to build because PyTorch is a huge dependency. If you are running low on space, you can run
docker system prune\n
Alternatively, you can manually delete images using docker rmi {image_name}:{image_tag}
.
Try running docker images
and confirm that you get output similar to the one above. If you succeed with this, then try running the docker image
docker run --name experiment1 trainer:latest\n
you should hopefully see your training starting. Please note that we can start as many containers as we want at the same time by giving them all different names using the --name
flag.
You are most likely going to rebuild your Docker image multiple times, either due to an implementation error or the addition of new functionality. Therefore, instead of watching pip suffer through downloading torch
for the 20th time, you can reuse the cache from the last time the Docker image was built. To do this, replace the line in your Dockerfile that installs your requirements with:
RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt\n
which mounts a persistent pip cache into the build, so that packages downloaded during one build can be reused in the next. For building the image, you need to have the BuildKit feature enabled. If you have Docker version v23.0 or later (you can check this by running docker version
), then this is enabled by default. Otherwise, you need to enable it by setting the environment variable DOCKER_BUILDKIT=1
before building the image.
Try changing your Dockerfile and rebuilding the image. You should see that the build process is much faster.
Remember, if you ever are in doubt about how files are organized inside a Docker image, you always have the option to start the image in interactive mode:
docker run -it --entrypoint sh {image_name}:{image_tag}\n
When your training has completed you will notice that any files that are created when running your training script are not present on your laptop (for example if your script is saving the trained model to a file). This is because the files were created inside your container (which is a separate little machine). To get the files you have two options:
If you already have a completed run then you can use the command
docker cp\n
to copy the files between your container and laptop. For example to copy a file called trained_model.pt
from a folder you would do:
docker cp {container_name}:{dir_path}/{file_name} {local_dir_path}/{local_file_name}\n
Try this out.
A much more efficient strategy is to mount a volume that is shared between the host (your laptop) and the container. This can be done with the -v
option for the docker run
command. For example, if we want to automatically get the trained_model.pt
file after running our training script we could simply execute the container as
docker run --name {container_name} -v %cd%/models:/models/ trainer:latest\n
this command mounts our local models
folder as a corresponding models
folder in the container. Any file saved by the container to this folder will be synchronized back to our host machine. Try this out! Note that if you have multiple files/folders that you want to mount, you can simply repeat the -v option for each of them (if in doubt about the file organization in the container, try to do the next exercise first). Also note that the %cd%
syntax needs to change depending on your OS (for example $(pwd) on Linux/Mac), see this page for help.
With training done we also need to write an application for prediction. Create a new docker image called predict.dockerfile
. This file should call your <project_name>/models/predict_model.py
script instead. This image will need some trained model weights to work. Feel free to either include these during the build process or mount them afterwards. When you create the file try to build
and run
it to confirm that it works. Hint: if you are passing in the model checkpoint and prediction data as arguments to your script, your docker run
probably needs to look something like
docker run --name predict --rm \\\n -v %cd%/trained_model.pt:/models/trained_model.pt \\ # mount trained model file\n -v %cd%/data/example_images.npy:/example_images.npy \\ # mount data we want to predict on\n predict:latest \\\n ../../models/trained_model.pt \\ # argument to script, path relative to script location in container\n ../../example_images.npy\n
(Optional, requires GPU support) By default, a virtual machine created by docker only has access to your cpu
and not your gpu
. While you may not have a laptop with a GPU that supports training of neural networks (e.g. one from Nvidia), it is beneficial that you understand how to construct a docker image that can take advantage of a GPU, in case you later run your code on a machine that has one (e.g. in the cloud). It does take a bit more work, but many of the steps are similar to building a normal docker image.
There are three prerequisites for working with Nvidia GPU-accelerated docker containers: you need to have the Docker Engine installed (already taken care of), have an Nvidia GPU with updated drivers, and finally have the Nvidia container toolkit installed. The last part you most likely have not installed yet and need to do now. Some distros of Linux have known problems with the installation process, so you may have to search through the known issues in the nvidia-docker repository to find a solution.
To test that everything is working start by pulling a relevant Nvidia docker image. In my case this is the correct image:
docker pull nvidia/cuda:11.0.3-base-ubuntu20.04\n
but it may differ based on what CUDA version you have. You can find all the different official Nvidia images here. After pulling the image, try running the nvidia-smi
command inside a container based on the image you just pulled. It should look something like this:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi\n
and should show an image like below:
If it does not work, try redoing the steps.
We should hopefully have a working setup now for running Nvidia accelerated docker containers. The next step is to get PyTorch inside our container, such that our PyTorch implementation also correctly identifies the GPU. Luckily for us, Nvidia provides a set of docker images for GPU-optimized software for AI, HPC and visualizations through their NGC Catalog. The containers that have to do with PyTorch can be seen here. Try pulling the latest:
docker pull nvcr.io/nvidia/pytorch:22.07-py3\n
It may take some time because the NGC images include a lot of other software for optimizing PyTorch applications. It may be possible for you to find other images for running GPU-accelerated applications that have a smaller memory footprint, but NGC is the recommended and supported way.
Let's test that this container works:
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.07-py3\n
this should run the container in interactive mode attached to your current terminal. Try opening python
in the container and try writing:
import torch\nprint(torch.cuda.is_available())\n
which hopefully should return True
.
Finally, we need to incorporate all this into our already developed docker files for our application. This is also fairly easy as we just need to change our FROM
statement at the beginning of our docker file:
FROM python:3.9-slim\n
change to
FROM nvcr.io/nvidia/pytorch:22.07-py3\n
try doing this to one of your docker files, build the image and run the container. Remember to check that your application is using GPU by printing torch.cuda.is_available()
.
(Optional) Another way you can use Dockerfiles in your day-to-day work is for Dev-containers. Developer containers allow you to develop code directly inside a container, making sure that your code is running in the same environment as it will when deployed. This is especially useful if you are working on a project that has a lot of dependencies that are hard to install on your local machine. Setup instructions for VS Code and PyCharm can be found here (should be simple since we have already installed Docker):
We will focus on the VS Code setup here.
First, install the Remote - Containers extension.
Create a .devcontainer
folder in your project root and create a Dockerfile
inside it. We will keep this file very barebones for now, so let's just define a base installation of Python:
FROM python:3.11-slim-buster\n\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n
Create a devcontainer.json
file in the .devcontainer
folder. This file should look something like this:
{\n \"name\": \"my_working_env\",\n \"dockerFile\": \"Dockerfile\",\n \"postCreateCommand\": \"pip install -r requirements.txt\"\n}\n
This file tells VS Code that we want to use the Dockerfile
that we just created and that we want to install our Python dependencies after the container has been created.
After creating these files, you should be able to open the command palette in VS Code (F1) and search for the option Remote-Containers: Reopen in Container
or Remote-Containers: Rebuild and Reopen in Container
. Choose either of these options.
This will start a new VS Code instance inside a Docker container. You should be able to see this in the bottom left corner of your VS Code window. You should also be able to see that the Python interpreter has changed to the one inside the container.
You are now ready to start developing inside the container. Try opening a terminal and run python
and import torch
to confirm that everything is working.
(Optional) In M8 on Data version control you learned about the framework dvc
for version controlling data. A neutral question at this point would then be how to incorporate dvc
into our docker image. We need to do two things:
dvc
has all the correct files to pull data from our remote storage
dvc
has the correct credentials to pull data from our remote storage
We are going to assume that dvc
(and any dvc
extension needed) is part of your requirements.txt
file and that it is already being installed in a RUN pip install -r requirements.txt
command in your Dockerfile. If not, then you need to add it.
Add the following lines to your Dockerfile
RUN dvc init --no-scm\nCOPY .dvc/config .dvc/config\nCOPY *.dvc .dvc/\nRUN dvc config core.no_scm true\nRUN dvc pull\n
The first line initializes dvc
in the Docker image. The --no-scm
option is needed because normally dvc
can only be initialized inside a git repository, but this option allows initializing dvc
without being in one. The second and third lines copy over the dvc
config file and the dvc
metadata files that are needed to pull data from your remote storage. The last line pulls the data.
If your data is not public, we need to provide credentials in some way to pull the data. We are for now going to do it in a not-so-secure way. When dvc
first connected to your drive, a credential file was created. This file is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
macOS: ~/Library/Caches
Linux: ~/.cache (this is the typical location, but it may vary depending on what distro you are running)
Windows: {user}/AppData/Local
Find the file. The content should look similar to this (only some fields are shown):
{\n \"access_token\": ...,\n \"client_id\": ...,\n \"client_secret\": ...,\n \"refresh_token\": ...,\n ...\n}\n
We are going to copy the file into our Docker image. This, of course, is not a secure way of doing it, but it is the easiest way to get started. As long as you are not sharing your Docker image with anyone else, then it is fine. Add the following lines to your Dockerfile before the RUN dvc pull
command:
COPY <path_to_default.json> default.json\nRUN dvc remote modify myremote --local gdrive_service_account_json_file_path default.json\n
where <path_to_default.json>
is the path to the default.json
file that you just found. The last line tells dvc
to use the default.json
file as the credentials for pulling data from your remote storage. You can confirm that this works by running dvc pull
in your Docker image.
What is the difference between a docker image and a docker container?
SolutionA Docker image is a template for a Docker container. A Docker container is a running instance of a Docker image. A Docker image is a static file, while a Docker container is a running process.
What are the 3 steps involved in containerizing an application?
SolutionThe three steps are: 1) write a Dockerfile, 2) build the Dockerfile into a Docker image, and 3) run the image to create a Docker container.
What advantage is there to running your application inside a Docker container instead of running the application directly on your machine?
SolutionRunning inside a Docker container gives you a consistent and independent environment for your application. This means that you can be sure that your application will run the same way on your machine as it will on another machine. Thus, Docker gives the ability to abstract away the differences between different machines.
A Docker container is built from a series of layers that are stacked on top of each other. This should be clear if you look at the output when building a Docker image. What is the advantage of this?
SolutionThe advantage is efficiency and reusability. When a change is made to a Docker image, only the layer(s) that are changed need to be updated. For example, if you update the application code in your Docker image, which usually is the last layer, then only that layer needs to be rebuilt, making the process much faster. Additionally, if you have multiple Docker images that share the same base image, then the base image only needs to be downloaded once.
This covers the absolute minimum you should know about Docker to get a working image and container. If you want to really deep dive into this topic, you can find a copy of the Docker Cookbook by S\u00e9bastien Goasguen in the literature folder.
If you are actively going to be using Docker in the future, one thing to consider is the image size. Even these simple images that we have built still take up GB in size. Several optimization steps can be taken to reduce the image size for you or your end user. If you have time, you can read this article on different approaches to reducing image size. Additionally, you can take a look at the dive-in extension for Docker Desktop that lets you explore in depth your Docker images.
"},{"location":"s4_debugging_and_logging/","title":"Debugging, Profiling, Logging and Boilerplate","text":"Slides
Learn how to use the debugger in your editor to find bugs in your code.
M12: Debugging
Learn how to use a profiler to identify bottlenecks in your code and from those profiles optimize the runtime of your programs.
M13: Profiling
Learn how to systematically log experiments and hyperparameters to make your code reproducible.
M14: Logging
Learn how to use pytorch-lightning
framework to minimize boilerplate code and structure deep learning models.
M15: Boilerplate
Today we are initially going to go over three different topics that are all fundamentally necessary for any data scientist or DevOps engineer:
All three topics can be characterized by something you probably already are familiar with. Since you started programming, you have done debugging, as nobody can write perfect code on the first try. Similarly, even if you have not directly profiled your code, I bet that you at some point have had some very slow code and optimized it to run faster. Identifying and improving are the fundamentals of profiling code. Finally, logging is a very broad term and refers to any kind of output from your applications that helps you at a later point identify the \"performance\" of your application.
However, while we expect you to already be familiar with these topics, we do not expect all of you to be experts as it is very rare that these topics are focused on. Today we are going to introduce some best practices and tools to help you overcome every one of these three important topics. As the final topic for today, we are going to learn about how we can minimize boilerplate and focus on coding what matters for our project instead of all the boilerplate to get it working.
Learning objectives
The learning objectives of this session are:
pytorch-lightning
framework to minimize boilerplate code and structure deep learning modelsBoilerplate is a general term that describes any standardized text, copy, documents, methods, or procedures that may be used over again without making major changes to the original. But how does this relate to doing machine learning projects? If you have already tried doing a couple of projects within machine learning you will probably have seen a pattern: every project usually consist of these three aspects of code:
While the latter two certainly seems important, in most cases the actual development or research often revolves around defining the model. In this sense, both the training code and the utilities becomes boilerplate that should just carry over from one project to another. But the problem usually is that we have not generalized our training code to take care of the small adjusted that may be required in future projects and we therefore end up implementing it over and over again every time that we start a new project. This is of course a waste of our time that we should try to find a solution to.
This is where high-level frameworks comes into play. High-level frameworks are build on top of another framework (PyTorch in this case) and tries to abstract/standardize how to do particular tasks such as training. At first it may seem irritating that you need to comply to someone else code structure, however there is a good reason for that. The idea is that you can focus on what really matters (your task, model architecture etc.) and do not have to worry about the actual boilerplate that comes with it.
The most popular high-level (training) frameworks within the PyTorch
ecosystem are:
They all offer many of the same features, so choosing one over the other for most projects should not matter that much. We are here going to use PyTorch Lightning
, as it offers all the functionality that we are going to need later in the course.
In general we refer to the documentation from PyTorch lightning if in doubt about how to format your code for doing specific tasks. We are here going to explain the key concepts of the API that you need to understand to use the framework, starting with the LightningModule
and the Trainer
.
The LightningModule
is a subclass of a standard nn.Module
that basically adds additional structure. In addition to the standard __init__
and forward
methods that need to be implemented in a nn.Module
, a LightningModule
further requires two more methods implemented:
training_step
: should contain your actual training code e.g. given a batch of data this should return the loss that you want to optimize
configure_optimizers
: should return the optimizer that you want to use
Below is shown these two methods added to standard MNIST classifier
Compared to a standard nn.Module
, the additional methods in the LightningModule
basically specifies exactly how you want to optimize your model.
The second component to lightning is the Trainer
object. As the name suggest, the Trainer
object takes care of the actual training, automizing everything that you do not want to worry about.
from pytorch_lightning import Trainer\nmodel = MyAwesomeModel() # this is our LightningModule\ntrainer = Trainer()\ntraier.fit(model)\n
That's is essentially all that you need to specify in lightning to have a working model. The trainer object does not have methods that you need to implement yourself, but it have a bunch of arguments that can be used to control how many epochs that you want to train, if you want to run on gpu etc. To get the training of our model to work we just need to specify how our data should be feed into the lighning framework.
"},{"location":"s4_debugging_and_logging/boilerplate/#data","title":"Data","text":"For organizing our code that has to do with data in Lightning
we essentially have three different options. However, all three assume that we are using torch.utils.data.DataLoader
for the dataloading.
If we already have a train_dataloader
and possible also a val_dataloader
and test_dataloader
defined we can simply add them to our LightningModule
using the similar named methods:
def train_dataloader(self):\n return DataLoader(...)\n\ndef val_dataloader(self):\n return DataLoader(...)\n\ndef test_dataloader(self):\n return DataLoader(...)\n
Maybe even simpler, we can directly feed such dataloaders in the fit
method of the Trainer
object:
trainer.fit(model, train_dataloader, val_dataloader)\ntrainer.test(model, test_dataloader)\n
Finally, Lightning
also have the LightningDataModule
that organizes data loading into a single structure, see this page for more info. Putting data loading into a DataModule
makes sense as it is then can be reused between projects.
Callbacks is one way to add additional functionality to your model, that strictly speaking is not already part of your model. Callbacks should therefore be seen as self-contained feature that can be reused between projects. You have the option to implement callbacks yourself (by inheriting from the pytorch_lightning.callbacks.Callback
base class) or use one of the build in callbacks. Of particular interest are ModelCheckpoint
and EarlyStopping
callbacks:
The ModelCheckpoint
makes sure to save checkpoints of you model. This is in principal not hard to do yourself, but the ModelCheckpoint
callback offers additional functionality by saving checkpoints only when some metric improves, or only save the best K
performing models etc.
model = MyModel()\ncheckpoint_callback = ModelCheckpoint(\n dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[checkpoint_callbacks])\ntrainer.fit(model)\n
The EarlyStopping
callback can help you prevent overfitting by automatically stopping the training if a certain value is not improving anymore:
model = MyModel()\nearly_stopping_callback = EarlyStopping(\n monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback])\ntrainer.fit(model)\n
Multiple callbacks can be used by passing them all in a list e.g.
trainer = Trainer(callbacks=[checkpoint_callbacks, early_stopping_callback])\n
"},{"location":"s4_debugging_and_logging/boilerplate/#exercises","title":"\u2754 Exercises","text":"Please note that the in following exercise we will basically ask you to reformat all your MNIST code to follow the lightning standard, such that we can take advantage of all the tricks the framework has to offer. The reason we did not implement our model in lightning
to begin with, is that to truly understand why it is beneficially to use a high-level framework to do some of the heavy lifting you need to have gone through some of implementation troubles yourself.
Install pytorch lightning:
pip install pytorch-lightning # (1)!\n
pip install lightning
which includes more than just the PyTorch Lightning
package. This also includes Lightning Fabric
and Lightning Apps
which you can read more about here and here.Convert your corrupted MNIST model into a LightningModule
. You can either choose to completely override your old model or implement it in a new file. The bare minimum that you need to add while converting to get it working with the rest of lightning:
The training_step
method. This function should contain essentially what goes into a single training step and should return the loss at the end
The configure_optimizers
method
Please read the documentation for more info.
Solution lightning.pyimport pytorch_lightning as pl\nimport torch\nfrom torch import nn\n\n\nclass MyAwesomeModel(pl.LightningModule):\n \"\"\"My awesome model.\"\"\"\n\n def __init__(self) -> None:\n super().__init__()\n self.conv1 = nn.Conv2d(1, 32, 3, 1)\n self.conv2 = nn.Conv2d(32, 64, 3, 1)\n self.conv3 = nn.Conv2d(64, 128, 3, 1)\n self.dropout = nn.Dropout(0.5)\n self.fc1 = nn.Linear(128, 10)\n\n self.loss_fn = nn.CrossEntropyLoss()\n\n def forward(self, x: torch.Tensor) -> torch.Tensor:\n \"\"\"Forward pass.\"\"\"\n x = torch.relu(self.conv1(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.relu(self.conv2(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.relu(self.conv3(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.flatten(x, 1)\n x = self.dropout(x)\n return self.fc1(x)\n\n def training_step(self, batch):\n \"\"\"Training step.\"\"\"\n img, target = batch\n y_pred = self(img)\n return self.loss_fn(y_pred, target)\n\n def configure_optimizers(self):\n \"\"\"Configure optimizer.\"\"\"\n return torch.optim.Adam(self.parameters(), lr=1e-3)\n\n\nif __name__ == \"__main__\":\n model = MyAwesomeModel()\n print(f\"Model architecture: {model}\")\n print(f\"Number of parameters: {sum(p.numel() for p in model.parameters())}\")\n\n dummy_input = torch.randn(1, 1, 28, 28)\n output = model(dummy_input)\n print(f\"Output shape: {output.shape}\")\n
Make sure your data is formatted such that it can be loaded using the torch.utils.data.DataLoader
object.
Instantiate a Trainer
object. It is recommended to take a look at the trainer arguments (there are many of them) and maybe adjust some of them:
Investigate what the default_root_dir
flag does
As default lightning will run for 1000 epochs. This may be too much (for now). Change this by changing the appropriate flag. Additionally, there also exist a flag to set the maximum number of steps that we should train for.
SolutionSetting the max_epochs
will accomplish this.
trainer = Trainer(max_epochs=10)\n
Additionally, you may consider instead setting the max_steps
flag to limit based on the number of steps or max_time
to limit based on time. Similarly, the flags min_epochs
, min_steps
and min_time
can be used to set the minimum number of epochs, steps or time.
To start with we also want to limit the amount of training data to 20% of its original size. which trainer flag do you need to set for this to work?
SolutionSetting the limit_train_batches
flag will accomplish this.
trainer = Trainer(limit_train_batches=0.2)\n
Similarly, you can also set the limit_val_batches
and limit_test_batches
flags to limit the validation and test data.
Try fitting your model: trainer.fit(model)
Now try adding some callbacks
to your trainer.
early_stopping_callback = EarlyStopping(\n monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ncheckpoint_callback = ModelCheckpoint(\n dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback, checkpoint_callback])\n
The privous module was all about logging in wandb
, so the question is naturally how does lightning
support this. Lightning does not only support wandb
, but also many others. Common for all of them, is that logging just need to happen through the self.log
method in your LightningModule
:
Add self.log
to your `LightningModule. Should look something like this:
def training_step(self, batch, batch_idx):\n data, target = batch\n preds = self(data)\n loss = self.criterion(preds, target)\n acc = (target == preds.argmax(dim=-1)).float().mean()\n self.log('train_loss', loss)\n self.log('train_acc', acc)\n return loss\n
Add the wandb
logger to your trainer
trainer = Trainer(logger=pl.loggers.WandbLogger(project=\"dtu_mlops\"))\n
and try to train the model. Confirm that you are seeing the scalars appearing in your wandb
portal.
self.log
does sadly only support logging scalar tensors. Luckily, for logging other quantities we can still access the standard wandb.log
through our model
def training_step(self, batch, batch_idx):\n ...\n # self.logger.experiment is the same as wandb.log\n self.logger.experiment.log({'logits': wandb.Histrogram(preds)})\n
try doing this, by logging something else than scalar tensors.
Finally, we maybe also want to do some validation or testing. In lightning we just need to add the validation_step
and test_step
to our lightning module and supply the respective data in form of a separate dataloader. Try to at least implement one of them.
Both validation and test steps can be implemented in the same way as the training step:
def validation_step(self, batch) -> None:\n data, target = batch\n preds = self(data)\n loss = self.criterion(preds, target)\n acc = (target == preds.argmax(dim=-1)).float().mean()\n self.log('val_loss', loss, on_epoch=True)\n self.log('val_acc', acc, on_epoch=True)\n
two things to take note of here is that we are setting the on_epoch
flag to True
in the self.log
method. This is because we want to log the validation loss and accuracy only once per epoch. Additionally, we are not returning anything from the validation_step
method, because we do not optimize over the loss.
(Optional, requires GPU) One of the big advantages of using lightning
is that you no more need to deal with device placement e.g. called .to('cuda')
everywhere. If you have a GPU, try to set the gpus
flag in the trainer. If you do not have one, do not worry, we are going to return to this when we are going to run training in the cloud.
The two arguments accelerator
and devices
can be used to specify which devices to run on and how many to run on. For example, to run on a single GPU you can do
trainer = Trainer(accelerator=\"gpu\", devices=1)\n
as an alternative the accelerator can just be set to accelerator=\"auto\"
to automatically detect the best available device.
(Optional) As default PyTorch uses float32
for representing floating point numbers. However, research have shown that neural network training is very robust towards a decrease in precision. The great benefit going from float32
to float16
is that we get approximately half the memory consumption. Try out half-precision training in PyTorch lightning. You can enable this by setting the precision flag in the Trainer
.
Lightning supports four different types of mixed precision training (16-bit and 16-bit bfloat) and two types of:
# 16-bit mixed precision (model weights remain in torch.float32)\ntrainer = Trainer(precision=\"16-mixed\", devices=1)\n\n# 16-bit bfloat mixed precision (model weights remain in torch.float32)\ntrainer = Trainer(precision=\"bf16-mixed\", devices=1)\n\n# 16-bit precision (model weights get cast to torch.float16)\ntrainer = Trainer(precision=\"16-true\", devices=1)\n\n# 16-bit bfloat precision (model weights get cast to torch.bfloat16)\ntrainer = Trainer(precision=\"bf16-true\", devices=1)\n
(Optional) Lightning also have built-in support for profiling. Checkout how to do this using the profiler argument in the Trainer
object.
(Optional) Another great feature of Lightning is that the allow for easily defining command line interfaces through the Lightning CLI feature. The Lightning CLI is essentially a drop in replacement for defining command line interfaces (covered in this module) and can also replace the need for config files (covered in this module) for securing reproducibility when working inside the Lightning framework. We highly recommend checking out the feature and that you try to refactor your code such that you do not need to call trainer.fit
anymore but it is instead directly controlled from the Lightning CLI.
Free exercise: Experiment with what the lightning framework is capable of. Either try out more of the trainer flags, some of the other callbacks, or maybe look into some of the other methods that can be implemented in your lightning module. Only your imagination is the limit!
That covers everything for today. It has been a mix of topics that all should help you write \"better\" code (by some objective measure). If you want to deep dive more into the PyTorch lightning framework, we highly recommend looking at the different tutorials in the documentation that covers more advanced models and training cases. Additionally, we also want to highlight other frameworks in the lightning ecosystem:
Debugging is very hard to teach and is one of the skills that just comes with experience. That said, there are good and bad ways to debug a program. We are all probably familiar with just inserting print(...)
statements everywhere in our code. It is easy and can many times help narrow down where the problem happens. That said, this is not a great way of debugging when dealing with a very large codebase. You should therefore familiarize yourself with the built-in python debugger as it may come in handy during the course.
To invoke the build in Python debugger you can either:
Set a trace directly with the Python debugger by calling
import pdb\npdb.set_trace()\n
anywhere you want to stop the code. Then you can use different commands (see the python_debugger_cheatsheet.pdf
) to step through the code.
If you are using an editor, then you can insert inline breakpoints (in VS code this can be done by pressing F9
) and then execute the script in debug mode (inline breakpoints can often be seen as small red dots to the left of your code). The editor should then offer some interface to allow you step through your code. Here is a guide to using the build in debugger in VScode.
Additionally, if your program is stopping on an error and you automatically want to start the debugger where it happens, then you can simply launch the program like this from the terminal
python -m pdb -c continue my_script.py\n
Exercise files
We here provide a script vae_mnist_bugs.py
which contains a number of bugs to get it running. Start by going over the script and try to understand what is going on. Hereafter, try to get it running by solving the bugs. The following bugs exist in the script:
Some of the bugs prevents the script from even running, while some of them influences the training dynamics. Try to find them all. We also provide a working version called vae_mnist_working.py
(but please try to find the bugs before looking at the script). Successfully debugging and running the script should produce three files:
orig_data.png
containing images from the standard MNIST training setreconstructions.png
reconstructions from the modelgenerated_samples.png
samples from the modelAgain, we cannot stress enough that the exercise is actually not about finding the bugs but using a proper debugger to find them.
"},{"location":"s4_debugging_and_logging/logging/","title":"M14 - Logging","text":""},{"location":"s4_debugging_and_logging/logging/#logging","title":"Logging","text":"Core Module
Logging in general refers to the practise of recording events activities over time. Having proper logging in your applications can be extremely beneficial for a few reasons:
Debugging becomes easier because we in a more structure way can output information about the state of our program, variables, values etc. to help identify and fix bugs or unexpected behavior.
When we move into a more production environment, proper logging is essential for monitoring the health and performance of our application.
It can help in auditing as logging info about specific activities etc. can help keeping a record of who did what and when.
Having proper logging means that info is saved for later, that can be analysed to gain insight into the behavior of our application, such as trends.
We are in this course going to divide the kind of logging we can do into categories: application logging and experiment logging. In general application logging is important regardless of the kind of application you are developing, whereas experiment logging is important machine learning based projects where we are doing experiments.
"},{"location":"s4_debugging_and_logging/logging/#application-logging","title":"Application logging","text":"The most basic form of logging in Python applications is the good old print
statement:
for batch_idx, batch in enumerate(dataloader):\n print(f\"Processing batch {batch_idx} out of {len(dataloader)}\")\n ...\n
This will keep a \"record\" of the events happening in our script, in this case how far we have progressed. We could even change the print to include something like batch.shape
to also have information about the current data being processed.
Using print
statements is fine for small applications, but to have proper logging we need a bit more functionality than what print
can offer. Python actually comes with a great logging module, that defines functions for flexible logging. It is exactly this we are going to look at in this module.
The four main components to the Python logging module are:
Logger: The main entry point for using the logging system. You create instances of the Logger class to emit log messages.
Handler: Defines where the log messages go. Handlers send the log messages to specific destinations, such as the console or a file.
Formatter: Specifies the layout of the log messages. Formatters determine the structure of the log records, including details like timestamps and log message content.
Level: Specifies the severity of a log message.
Especially, the last point is important to understand. Levels essentially allows of to get rid of statements like this:
if debug:\n print(x.shape)\n
where the logging is conditional on the variable debug
which we can set a runtime. Thus, it is something we can disable for users of our application (debug=False
) but have enabled when we develop the application (debug=True
). And it makes sense that not all things logged, should be available to all stakeholders of a codebase. We as developers probably always wants the highest level of logging, whereas users of the our code need less info and we may want to differentiate this based on users.
It is also important to understand the different between logging and error handling. Error handling Python is done using raise
statements and try/catch
like:
def f(x: int):\n if not isinstance(x, int):\n raise ValueError(\"Expected an integer\")\n return 2 * x\n\ntry:\n f(5):\nexcept ValueError:\n print(\"I failed to do a thing, but continuing.\")\n
Why would we evere need log warning
, error
, critical
levels of information, if we are just going to handle it? The reason is that raising exceptions are meant to change the program flow at runtime e.g. things we do not want the user to do, but we can deal with in some way. Logging is always for after a program have run, to inspect what went wrong. Sometimes you need one, sometimes the other, sometimes both.
Exercises are inspired by this made with ml module on the same topic. If you need help for the exercises you can find a simple solution script here.
As logging is a built-in module in Python, nothing needs to be installed. Instead start a new file called my_logger.py
and start out with the following code:
import logging\nimport sys\n\n# Create super basic logger\nlogging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\nlogger = logging.getLogger(__name__) # (1)\n\n# Logging levels (from lowest to highest priority)\nlogger.debug(\"Used for debugging your code.\")\nlogger.info(\"Informative messages from your code.\")\nlogger.warning(\"Everything works but there is something to be aware of.\")\nlogger.error(\"There's been a mistake with the process.\")\nlogger.critical(\"There is something terribly wrong and process may terminate.\")\n
__name__
always contains the record of the script or module that is currently being run. Therefore if we initialize our logger base using this variable, it will always be unique to our application and not conflict with logger setup by any third-party package.Try running the code. Than try changing the argument level
when creating the logger. What happens when you do that?
Instead of sending logs to the terminal, we may also want to send them to a file. This can be beneficial, such that only warning
level logs and higher are available to the user, but debug
and info
is still saved when the application is running.
Try adding the following dict to your logger.py
file:
logging_config = {\n \"version\": 1,\n \"formatters\": { # (1)\n \"minimal\": {\"format\": \"%(message)s\"},\n \"detailed\": {\n \"format\": \"%(levelname)s %(asctime)s [%(name)s:%(filename)s:%(funcName)s:%(lineno)d]\\n%(message)s\\n\"\n },\n },\n \"handlers\": { # (2)\n \"console\": {\n \"class\": \"logging.StreamHandler\",\n \"stream\": sys.stdout,\n \"formatter\": \"minimal\",\n \"level\": logging.DEBUG,\n },\n \"info\": {\n \"class\": \"logging.handlers.RotatingFileHandler\",\n \"filename\": Path(LOGS_DIR, \"info.log\"),\n \"maxBytes\": 10485760, # 1 MB\n \"backupCount\": 10,\n \"formatter\": \"detailed\",\n \"level\": logging.INFO,\n },\n \"error\": {\n \"class\": \"logging.handlers.RotatingFileHandler\",\n \"filename\": Path(LOGS_DIR, \"error.log\"),\n \"maxBytes\": 10485760, # 1 MB\n \"backupCount\": 10,\n \"formatter\": \"detailed\",\n \"level\": logging.ERROR,\n },\n },\n \"root\": {\n \"handlers\": [\"console\", \"info\", \"error\"],\n \"level\": logging.INFO,\n \"propagate\": True,\n },\n}\n
The formatter section determines how logs should be formatted. Here we define two separate formatters, called minimal
and detailed
which we can use in the next part of the code.
The handlers is in charge of what should happen to different level of logging. console
uses the minimal
format we defined and sens logs to the stdout
stream for messages of level DEBUG
and higher. The info
handler uses the detailed
format and sends messages of level INFO
and higher to a separate info.log
file. The error
handler does the same for messages of level ERROR
and higher to a file called error.log
.
you will need to set the LOGS_DIR
variable and also figure out how to add this logging_config
using the logging config submodule to your logger.
When the code successfully runs, check the LOGS_DIR
folder and make sure that a info.log
and error.log
file was created with the appropriate content.
Finally, lets try to add a little bit of style and color to our logging. For this we can use the great package rich which is a great package for rich text and beautiful formatting in terminals. Install rich
and add the following line to your my_logger.py
script:
logger.root.handlers[0] = RichHandler(markup=True) # set rich handler\n
and try re-running the script. Hopefully you should see something beautiful in your terminal like this:
(Optional) We already briefly touched on logging during the module on config files using hydra. If you want to configure hydra to use custom logging scheme as the one we setup in the last two exercises, you can take a look at this page. In hydra you will need to provide the configuration of the logger as config file. You can find examples of such config file here.
When most people think machine learning, we think about the training phase. Being able to track and log experiments is an important part of understanding what is going on with your model while you are training. It can help you debug your model and help tweak your models to perfection. Without proper logging of experiments, it can be really hard to iterate on the model because you do not know what changes lead to increase or decrease in performance.
The most basic logging we can do when running experiments is writing the metrics that our model is producing e.g. the loss or the accuracy to the terminal or a file for later inspection. We can then also use tools such as matplotlib for plotting the progression of our metrics over time. This kind of workflow may be enough when doing smaller experiments or working alone on a project, but there is no way around using a proper experiment tracker and visualizer when doing large scale experiments in collaboration with others. It especially becomes important when you want to compare performance between different runs.
There exist many tools for logging your experiments, with some of them being:
All of the frameworks offers many of the same functionalities, you can see a (bias) review here. We are going to use Weights and Bias (wandb), as it support everything we need in this course. Additionally, it is an excellent tool for collaboration and sharing of results.
Using the Weights and Bias (wandb) dashboard we can quickly get an overview and compare many runs over different metrics. This allows for better iteration of models and training procedure."},{"location":"s4_debugging_and_logging/logging/#exercises_1","title":"\u2754 Exercises","text":"Start by creating an account at wandb. I recommend using your GitHub account but feel free to choose what you want. When you are logged in you should get an API key of length 40. Copy this for later use (HINT: if you forgot to copy the API key, you can find it under settings), but make sure that you do not share it with anyone or leak it in any way.
.env fileA good place to store not only your wandb API key but also other sensitive information is in a .env
file. This file should be added to your .gitignore
file to make sure that it is not uploaded to your repository. You can then load the variables in the .env
file using the python-dotenv
package. For more information see this page.
.env
WANDB_API_KEY=your-api-key\nWANDB_PROJECT=my_project\nWANDB_ENTITY=my_entity\n...\n
load_from_env_file.pyfrom dotenv import load_dotenv\nload_dotenv()\nimport os\napi_key = os.getenv(\"WANDB_API_KEY\")\n
Next install wandb on your laptop
pip install wandb\n
Now connect to your wandb account
wandb login\n
you will be asked to provide the 40-character API key. The connection should remain open to the wandb server even when you close the terminal, such that you do not have to log in each time. If using wandb
in a notebook you need to manually close the connection using wandb.finish()
.
We are now ready for incorporating wandb
into our code. We are going to continue development on our corrupt MNIST codebase from the previous sessions. For help, we recommend looking at this quickstart and this guide for PyTorch applications. Your first job is to alter your training script to include wandb
logging, at least for the training loss.
import click\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(f\"{lr=}, {batch_size=}, {epochs=}\")\n wandb.init(\n project=\"corrupt_mnist\",\n config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n )\n\n model = MyAwesomeModel().to(DEVICE)\n train_set, _ = corrupt_mnist()\n\n train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n loss_fn = torch.nn.CrossEntropyLoss()\n optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n for epoch in range(epochs):\n model.train()\n for i, (img, target) in enumerate(train_dataloader):\n img, target = img.to(DEVICE), target.to(DEVICE)\n optimizer.zero_grad()\n y_pred = model(img)\n loss = loss_fn(y_pred, target)\n loss.backward()\n optimizer.step()\n accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n if i % 100 == 0:\n print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n\nif __name__ == \"__main__\":\n train()\n
After running your model, check out the webpage. Hopefully, you will be able to see at least one run with something logged.
Now log something other than scalar values. This could be an image, a histogram or a matplotlib figure. In all cases the logging still goes through wandb.log
but you need extra calls to wandb.Image
etc. depending on what you choose to log.
In this solution we log the input images to the model every 100 steps. Additionally, we also log a histogram of the gradients to inspect whether the model is converging. Finally, we create a ROC curve, which is a matplotlib figure, and log that as well.
train.pyimport click\nimport matplotlib.pyplot as plt\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\nfrom sklearn.metrics import RocCurveDisplay\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(f\"{lr=}, {batch_size=}, {epochs=}\")\n wandb.init(\n project=\"corrupt_mnist\",\n config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n )\n\n model = MyAwesomeModel().to(DEVICE)\n train_set, _ = corrupt_mnist()\n\n train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n loss_fn = torch.nn.CrossEntropyLoss()\n optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n for epoch in range(epochs):\n model.train()\n\n preds, targets = [], []\n for i, (img, target) in enumerate(train_dataloader):\n img, target = img.to(DEVICE), target.to(DEVICE)\n optimizer.zero_grad()\n y_pred = model(img)\n loss = loss_fn(y_pred, target)\n loss.backward()\n optimizer.step()\n accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n preds.append(y_pred.detach().cpu())\n targets.append(target.detach().cpu())\n\n if i % 100 == 0:\n print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n # add a plot of the input images\n images = wandb.Image(img[:5].detach().cpu(), caption=\"Input images\")\n wandb.log({\"images\": images})\n\n # add a plot of histogram of the gradients\n grads = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None], 0)\n wandb.log({\"gradients\": wandb.Histogram(grads)})\n\n # add a custom matplotlib plot of the ROC curves\n preds = torch.cat(preds, 0)\n targets = torch.cat(targets, 0)\n\n for class_id in range(10):\n one_hot = torch.zeros_like(targets)\n one_hot[targets == class_id] = 1\n _ = RocCurveDisplay.from_predictions(\n one_hot,\n preds[:, class_id],\n name=f\"ROC curve for {class_id}\",\n plot_chance_level=(class_id == 2),\n )\n\n wandb.plot({\"roc\": plt})\n # alternatively the wandb.plot.roc_curve function can be used\n\n\nif __name__ == \"__main__\":\n train()\n
Finally, we want to log the model itself. This is done by saving the model as an artifact and then logging the artifact. You can read much more about what artifacts are here, but they are essentially one or more files logged together with a run, which can be versioned and equipped with metadata. Log the model after training and see if you can find it in the wandb dashboard.
SolutionIn this solution we have added the calculation of final training metrics, and when we log the model we attach these as metadata to the artifact.
train.pyimport click\nimport matplotlib.pyplot as plt\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\nfrom sklearn.metrics import RocCurveDisplay, accuracy_score, f1_score, precision_score, recall_score\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(f\"{lr=}, {batch_size=}, {epochs=}\")\n run = wandb.init(\n project=\"corrupt_mnist\",\n config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n )\n\n model = MyAwesomeModel().to(DEVICE)\n train_set, _ = corrupt_mnist()\n\n train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n loss_fn = torch.nn.CrossEntropyLoss()\n optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n for epoch in range(epochs):\n model.train()\n\n preds, targets = [], []\n for i, (img, target) in enumerate(train_dataloader):\n img, target = img.to(DEVICE), target.to(DEVICE)\n optimizer.zero_grad()\n y_pred = model(img)\n loss = loss_fn(y_pred, target)\n loss.backward()\n optimizer.step()\n accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n preds.append(y_pred.detach().cpu())\n targets.append(target.detach().cpu())\n\n if i % 100 == 0:\n print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n # add a plot of the input images\n images = wandb.Image(img[:5].detach().cpu(), caption=\"Input images\")\n wandb.log({\"images\": images})\n\n # add a plot of histogram of the gradients\n grads = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None], 0)\n wandb.log({\"gradients\": wandb.Histogram(grads)})\n\n # add a custom matplotlib plot of the ROC curves\n preds = torch.cat(preds, 0)\n targets = torch.cat(targets, 0)\n\n for class_id in range(10):\n one_hot = torch.zeros_like(targets)\n one_hot[targets == class_id] = 1\n _ = RocCurveDisplay.from_predictions(\n one_hot,\n preds[:, class_id],\n name=f\"ROC curve for {class_id}\",\n plot_chance_level=(class_id == 2),\n )\n\n wandb.plot({\"roc\": plt})\n # alternatively the wandb.plot.roc_curve function can be used\n\n final_accuracy = accuracy_score(targets, preds.argmax(dim=1))\n final_precision = precision_score(targets, preds.argmax(dim=1), average=\"weighted\")\n final_recall = recall_score(targets, preds.argmax(dim=1), average=\"weighted\")\n final_f1 = f1_score(targets, preds.argmax(dim=1), average=\"weighted\")\n\n # first we save the model to a file then log it as an artifact\n torch.save(model.state_dict(), \"model.pth\")\n artifact = wandb.Artifact(\n name=\"corrupt_mnist_model\",\n type=\"model\",\n description=\"A model trained to classify corrupt MNIST images\",\n metadata={\"accuracy\": final_accuracy, \"precision\": final_precision, \"recall\": final_recall, \"f1\": final_f1},\n )\n artifact.add_file(\"model.pth\")\n run.log_artifact(artifact)\n\n\nif __name__ == \"__main__\":\n train()\n
After running the script you should be able to see the logged artifact in the wandb dashboard.
Weights and Biases was created with collaboration in mind, so let us therefore share our results with others.
Let's create a report that you can share. Click the Create report button (upper right corner when you are in a project workspace) and include some of the graphs/plots/images that you have generated in the report.
Make the report shareable by clicking the Share button and creating a view-only link. Send a link to your report to a group member, fellow student or a friend. In the worst case, if you have no one else to share with, you can send a link to my email nsde@dtu.dk
, so I can check out your awesome work \ud83d\ude03
When calling wandb.init
you can provide many additional arguments. Some of the most important are:
project
entity
job_type
Make sure you understand what these arguments do and try them out. It will come in handy for your group work, as they essentially allow multiple users to upload their own runs to the same project in wandb
.
Relevant documentation can be found here. The project
indicates what project all experiments and artifacts are logged to. We want to keep this the same for all group members. The entity
is the username of the person or team who owns the project, which should also be the same for all group members. The job type is important if you have different jobs that log to the same project. A common example is one script that trains a model and another that evaluates it. By setting the job type you can easily filter the runs in the wandb dashboard.
Wandb also comes with a built-in feature for hyperparameter sweeping, which can be beneficial for getting a better-working model. Look through the documentation on how to do a hyperparameter sweep in Wandb. You at least need to create a new file called sweep.yaml
and make sure that you call wandb.log
in your code on an appropriate value.
Start by creating a sweep.yaml
file. Relevant documentation can be found here. We recommend placing the file in a configs
folder in your project.
The sweep.yaml
file will depend on the kind of hyperparameters your model accepts as arguments and how they are passed to the model. For this solution we assume that the model accepts the hyperparameters lr
, batch_size
and epochs
and that they are passed as --args
(with hyphens) (1), e.g. this would be how we run the script:
python train.py --lr=0.01 --batch_size=32 --epochs=10\n
(1) If you instead use hydra to parse the arguments, you will need to adjust the command
config in your sweep.yaml
file. This is because wandb
uses --args
to pass hyperparameters to the script, whereas hydra
uses args
(without the hyphen). See this page for more information.
The sweep.yaml
could then look like this:
program: train.py\nname: sweepdemo\nproject: my_project # change this\nentity: my_entity # change this\nmetric:\n goal: minimize\n name: validation_loss\nparameters:\n lr:\n min: 0.0001\n max: 0.1\n distribution: log_uniform_values\n batch_size:\n values: [16, 32, 64]\n epochs:\n values: [5, 10, 15]\nrun_cap: 10\n
Afterwards, you need to create a sweep using the wandb sweep
command:
wandb sweep configs/sweep.yaml\n
this will output a sweep id that you need to use in the next step.
Finally, you need to run the sweep using the wandb agent
command:
wandb agent <sweep_id>\n
where <sweep_id>
is the id of the sweep you just created. You can find the id in the output of the wandb sweep
command. The reason that we first launch the sweep and then the agent is that we can have multiple agents running at the same time, parallelizing the search for the best hyperparameters. Try this out by opening a new terminal and running the wandb agent
command again (with the same <sweep_id>
).
Inspect the sweep results in the wandb dashboard. You should see multiple new runs under the project you are logging the sweep to, corresponding to the different hyperparameters you tried. Make sure you understand the results and can answer what hyperparameters gave the best results and what hyperparameters had the largest impact on the results.
SolutionIn the sweep dashboard you should see something like this:
Importantly you can:
Next we need to understand the model registry, which will be very important later on when we get to the deployment of our models. The model registry is a centralized place for storing and versioning models. Importantly, any model in the registry is immutable, meaning that once a model is uploaded it cannot be changed. This is important for reproducibility and traceability of models.
The model registry is in general a repository of a team's trained models, where ML practitioners publish candidates for production and share them with others. Figure from wandb.
The model registry builds on the artifact registry in wandb. Any model that is uploaded to the model registry is stored as an artifact. This means that we first need to log our trained models as artifacts before we can register them in the model registry. Make sure you have logged at least one model as an artifact before continuing.
Next, let's create a registry. Go to the model registry tab (left pane, visible from your homepage) and then click the New Registered Model
button. Fill out the form and create the registry.
We then need to link our artifact to the model registry we just created. We can do this in two ways: either through the web interface or through the wandb
API. In the web interface, go to the artifact you want to link to the model registry and click the Link to registry
button (upper right corner). If you want to use the API, you need to call the link method on an artifact object.
To use the API, create a new script called link_to_registry.py
and add the following code:
import wandb\napi = wandb.Api()\nartifact_path = \"<entity>/<project>/<artifact_name>:<version>\"\nartifact = api.artifact(artifact_path)\nartifact.link(target_path=\"<entity>/model-registry/<my_registry_name>\")\nartifact.save()\n
In the code <entity>
, <project>
, <artifact_name>
, <version>
and <my_registry_name>
should be replaced with the appropriate values.
We are now ready to consume our model, which can be done by downloading the artifact from the model registry. In this case we use the wandb API to download the artifact.
import torch\nimport wandb\n\nrun = wandb.init()\nartifact = run.use_artifact('<entity>/model-registry/<my_registry_name>:<version>', type='model')\nartifact_dir = artifact.download(\"<artifact_dir>\")\nmodel = MyModel() # replace with your own model class\nmodel.load_state_dict(torch.load(\"<artifact_dir>/model.pth\"))\n
Try running this code with the appropriate values for <entity>
, <my_registry_name>
, <version>
and <artifact_dir>
. Make sure that you can load the model and that it is the same as the one you trained.
Each model in the registry has at least one alias, which is the version of the model. The most recently added model also receives the alias latest
. Aliases are great for indicating where in the workflow a model is, e.g. whether it is a candidate for production or a model that is still being developed. Try adding an alias to one of your models in the registry.
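If you prefer to do this through the API instead of the web interface, a minimal sketch (with the placeholder values replaced by your own) could look like this:
import wandb\n\napi = wandb.Api()\nartifact = api.artifact(\"<entity>/model-registry/<my_registry_name>:<version>\")\nartifact.aliases.append(\"production\") # add an extra alias alongside e.g. \"latest\"\nartifact.save()\n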
(Optional) A model always corresponds to an artifact, and artifacts can contain metadata that we can use to automate the process of registering models. We could, for example, imagine that at the end of each week we run a script that registers the best model from that week. Try creating a small script using the wandb
API that goes over a collection of artifacts and registers the best one.
import logging\nimport operator\nimport os\n\nimport click\nimport wandb\nfrom dotenv import load_dotenv\n\nlogger = logging.getLogger(__name__)\nload_dotenv()\n\n\n@click.command()\n@click.argument(\"model-name\")\n@click.option(\"--metric_name\", default=\"accuracy\", help=\"Name of the metric to choose the best model from.\")\n@click.option(\"--higher-is-better\", default=True, help=\"Whether higher metric values are better.\")\ndef stage_best_model_to_registry(model_name, metric_name, higher_is_better) -> None:\n \"\"\"\n Stage the best model to the model registry.\n\n Args:\n model_name: Name of the model to be registered.\n metric_name: Name of the metric to choose the best model from.\n higher_is_better: Whether higher metric values are better.\n\n \"\"\"\n api = wandb.Api(\n api_key=os.getenv(\"WANDB_API_KEY\"),\n overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n )\n artifact_collection = api.artifact_collection(type_name=\"model\", name=model_name)\n\n best_metric = float(\"-inf\") if higher_is_better else float(\"inf\")\n compare_op = operator.gt if higher_is_better else operator.lt\n best_artifact = None\n for artifact in list(artifact_collection.artifacts()):\n if metric_name in artifact.metadata and compare_op(artifact.metadata[metric_name], best_metric):\n best_metric = artifact.metadata[metric_name]\n best_artifact = artifact\n\n if best_artifact is None:\n logging.error(\"No model found in registry.\")\n return\n\n logger.info(f\"Best model found in registry: {best_artifact.name} with {metric_name}={best_metric}\")\n best_artifact.link(\n target_path=f\"{os.getenv('WANDB_ENTITY')}/model-registry/{model_name}\",\n aliases=[\"best\", \"staging\"],\n )\n best_artifact.save()\n logger.info(\"Model staged to registry.\")\n\n\nif __name__ == \"__main__\":\n stage_best_model_to_registry()\n
In the future it will be important for us to be able to run Wandb inside a docker container (together with whatever training or inference we specify). The problem here is that we cannot authenticate Wandb in the same way as in the previous exercise; it needs to happen automatically. Let's therefore look into how we can do that.
First we need to generate an authentication key, or more precisely an API key. This is in general the way any service (like a docker container) can authenticate. Start by going to https://wandb.ai/home, click your profile icon in the upper right corner and then go to settings. Scroll down to the danger zone, generate a new API key and finally copy it.
Next create a new docker file called wandb.docker
and add the following code
FROM python:3.10-slim\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\nRUN pip install wandb\nCOPY s4_debugging_and_logging/exercise_files/wandb_tester.py wandb_tester.py\nENTRYPOINT [\"python\", \"-u\", \"wandb_tester.py\"]\n
Please take a look at the script being copied into the image and afterwards build the docker image.
When we want to run the image, we need to include an environment variable that contains the API key we generated. This will authenticate the docker container with the wandb server:
docker run -e WANDB_API_KEY=<your-api-key> wandb:latest\n
Try running it and confirm that the results are uploaded to the wandb server (1).
If you have stored the API key in a .env
file you can use the --env-file
flag instead of -e
to load the environment variables from the file e.g. docker run --env-file .env wandb:latest
. Feel free to experiment more with wandb
as it is a great tool for logging, organizing and sharing experiments.
That concludes the module on logging. Please note that at this point in the course you will begin to see some overlap between the different frameworks. While we mainly used hydra
for configuring our Python scripts, it can also be used to save metrics and hyperparameters, similar to how wandb
can. Similar arguments hold for dvc
, which can also be used to log metrics. In our opinion wandb
just offers a better experience when interacting with the results after logging. We want to stress that the combination of tools presented in this course may not be the best for all your future projects, and we recommend finding a setup that fits you. That said, each framework provides specific features that the others do not.
Finally, we want to note that during the course we really try to showcase a lot of open-source frameworks; Wandb is not one of them. It is free for personal usage (with a few restrictions), but enterprise use requires a license. If you are eager to work only with open-source tools, we highly recommend trying out MLFlow, which offers the same overall functionality as Wandb.
"},{"location":"s4_debugging_and_logging/profiling/","title":"M13 - Profiling","text":""},{"location":"s4_debugging_and_logging/profiling/#profilers","title":"Profilers","text":"Core Module
"},{"location":"s4_debugging_and_logging/profiling/#profilers_1","title":"Profilers","text":"In general profiling code is about improving the performance of your code. In this session we are going to take a somewhat narrow approach to what \"performance\" is: runtime, meaning the time it takes to execute your program.
At the bare minimum, the two questions a proper profiling of your program should be able to answer are:
The first question is important for prioritizing optimization. If two methods A
and B
have approximately the same runtime, but A
is called 1000 times more often than B
we should probably spend time optimizing A
over B
if we want to speed up our code. The second question answers itself, directly telling us which methods are the most expensive to call.
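As a toy illustration of this trade-off (not part of the exercise files), cProfile can be run directly from Python to show both the number of calls and the per-call cost:
import cProfile\n\n\ndef cheap(): # fast, but called many times\n return sum(range(100))\n\n\ndef expensive(): # slow, but called only once\n return sum(range(1_000_000))\n\n\ndef main():\n for _ in range(1000):\n cheap()\n expensive()\n\n\ncProfile.run(\"main()\", sort=\"cumtime\")\n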
Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, the first one being cProfile. cProfile
is Python's built-in profiler that can give you an overview of the runtime of all the functions and methods involved in your program.
Run cProfile
on the vae_mnist_working.py
script. Hint: you can directly call the profiler on a script using the -m
arg
python -m cProfile -o <output_file> -s <sort_order> myscript.py\n
Try looking at the output of the profiling. Can you figure out which function took the longest to run?
Can you explain the difference between tottime
and cumtime
? Under what circumstances do these differ, and when are they equal?
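As a small sketch (assuming you saved the profiling output to a file called profile.prof with the -o flag), the result can also be inspected programmatically with the built-in pstats module, which makes it easy to sort by either column:
import pstats\n\nstats = pstats.Stats(\"profile.prof\")\n# tottime: time spent inside the function itself, excluding sub-calls\nstats.sort_stats(\"tottime\").print_stats(10)\n# cumtime: time spent in the function including every function it calls\nstats.sort_stats(\"cumtime\").print_stats(10)\n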
To get a better feeling for the profiled result, we can try to visualize it. Python does not provide a native visualization solution, but open-source tools such as snakeviz exist. Try installing snakeviz
and loading a profiled run into it (HINT: snakeviz expects the run to have the file extension .prof
).
Try optimizing the run! (Hint: the data is not stored as a torch tensor). After optimizing the code, make sure (using cProfile
and snakeviz
) that the code actually runs faster.
Profiling machine learning code can become much more complex because we suddenly begin to mix different devices (CPU + GPU) that can (and should) overlap some of their computations. When profiling this kind of machine learning code we are often looking for bottlenecks. A bottleneck is simply the place in your code that prevents other processes from performing their best. This is the reason that all major deep learning frameworks also include their own profilers that can help with profiling more complex applications.
The image below shows a typical report using the built-in profiler in PyTorch. As the image shows, the profiler looks both at the kernel
time (this is the time spent doing actual computations) and at transfer times such as memcpy
(where we are copying data between devices). It can even analyze your code and give recommendations.
Using the profiler can be as simple as wrapping the code that you want to profile with the torch.profiler.profile
context manager
with torch.profiler.profile(...) as prof:\n # code that I want to profile\n output = model(data)\n
"},{"location":"s4_debugging_and_logging/profiling/#exercises_1","title":"\u2754 Exercises","text":"Exercise files
In these exercises we investigate the profiler that is built into PyTorch. Note that these exercises require PyTorch v1.8.1 (or higher) to be installed. You can always check which version you currently have installed by writing (in a Python interpreter):
import torch\nprint(torch.__version__)\n
But we always recommend updating to the latest PyTorch version for the best experience. Additionally, to display the results nicely (like snakeviz
for cProfile
) we are also going to use the tensorboard profiler extension
pip install torch_tb_profiler\n
A good starting point is to look at the API for the profiler. Here the important class to look at is the torch.profiler.profile
class.
Let's try out a simple example (taken from here):
Try to run the following code
import torch\nimport torchvision.models as models\nfrom torch.profiler import profile, ProfilerActivity\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\nwith profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:\n model(inputs)\n
This will profile the forward
pass of a ResNet-18 model.
Running this code will produce a prof
object that contains all the relevant information about the profiling. Try writing the following code:
print(prof.key_averages().table(sort_by=\"cpu_time_total\", row_limit=10))\n
What operation is taking up most of the CPU time?
Try running
print(prof.key_averages(group_by_input_shape=True).table(sort_by=\"cpu_time_total\", row_limit=30))\n
Can you see any correlation between the shape of the input and the cost of the operation?
(Optional) If you have a GPU you can also profile the operations on that device:
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:\n model(inputs)\n
(Optional) As an alternative to using profile
as a context manager, we can also use its .start
and .stop
methods:
prof = profile(...)\nprof.start()\n... # code I want to profile\nprof.stop()\n
Try doing this on the above example.
The torch.profiler.profile
class takes some additional arguments. What argument would you need to set to also profile memory usage? (Hint: this page) Try adding it to the simple example above and make sure to sort the results by self_cpu_memory_usage
.
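A minimal sketch of what this could look like (reusing the ResNet-18 example from above):
import torch\nimport torchvision.models as models\nfrom torch.profiler import profile, ProfilerActivity\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\nwith profile(activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True) as prof:\n model(inputs)\n\nprint(prof.key_averages().table(sort_by=\"self_cpu_memory_usage\", row_limit=10))\n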
As mentioned, we can also get a graphical output for better inspection. After having done a profiling run, try to export the results with:
prof.export_chrome_trace(\"trace.json\")\n
You should be able to visualize the file by going to chrome://tracing
in any Chromium-based web browser. Can you still identify the information printed in the previous exercises from the visualizations?
Running profiling on a single forward step can produce misleading results, as it only provides a single sample that may depend on what background processes are running on your computer. Therefore, it is recommended to profile multiple iterations of your model. In that case we need to include prof.step()
to tell the profiler when we are doing a new iteration
with profile(...) as prof:\n for i in range(10):\n model(inputs)\n prof.step()\n
Try doing this. Is the conclusion the same regarding which operations take up most of the time? Have the percentages changed significantly?
Additionally, we can also visualize the profiling results using the profiling viewer in tensorboard.
Start by initializing the profile
class with an additional argument:
from torch.profiler import profile, tensorboard_trace_handler\nwith profile(..., on_trace_ready=tensorboard_trace_handler(\"./log/resnet18\")) as prof:\n ...\n
Try running a profiling (using a couple of iterations) and make sure that a file with the .pt.trace.json
extension is produced in the log/resnet18
folder.
Now try launching tensorboard
tensorboard --logdir=./log\n
and open the page http://localhost:6006/#pytorch_profiler, where you should hopefully see an image similar to the one below:
Image credit
Try poking around in the interface.
Tensorboard has a nice feature for comparing runs under the diff
tab. Try redoing a profiling run but use model = models.resnet34()
instead. Load up both runs and try to look at the diff
between them.
As a final exercise, try to use the profiler on the vae_mnist_working.py
file from the previous module on debugging, profiling a whole training run (not only the forward pass). What is the bottleneck during training? Is it still the forward pass, or is it something else? Can you improve the code somehow based on the information from the profiler?
This ends the module on profiling. If you want to go into more detail on this topic, we can recommend looking into line_profiler and kernprof. A downside of using Python's cProfile
is that it can only profile at a function/module level, which is great for identifying hotspots in your code. However, sometimes the cause of a computational hotspot is a single line of code in a function, which will not be caught by cProfile
. An example would be a simple indexing operation such as a[idx] = b
, which for large arrays and non-sequential indices is really expensive. For these cases, line_profiler and kernprof are excellent tools to have in your toolbox. Additionally, if you do not like cProfile, we can also recommend py-spy, which is another open-source profiling tool for Python programs.
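As a small sketch of how line_profiler is typically used (the script name and slow_function are hypothetical), you decorate the function of interest with @profile and run the script through kernprof, which reports the time spent on each individual line:
# my_script.py -- run with: kernprof -l -v my_script.py\nimport numpy as np\n\n\n@profile # the decorator is injected by kernprof at runtime, no import needed\ndef slow_function():\n a = np.zeros(10_000_000)\n idx = np.random.permutation(10_000_000)\n b = np.random.rand(10_000_000)\n a[idx] = b # line-level profiling reveals that this single line dominates the runtime\n return a\n\n\nif __name__ == \"__main__\":\n slow_function()\n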
Slides
Learn how to write unit tests that cover both data and models in your ML pipeline.
M16: Unit testing
Learn how to implement continuous integration using Github actions such that tests are automatically executed on code changes.
M17: Github Actions
Learn how to use pre-commit to ensure that code that is not up to standard does not get committed.
M18: Pre-commit
Learn how to implement continuous machine learning pipelines in Github actions.
M19: Continuous Machine Learning
Continuous integration is a sub-discipline of the general field of Continuous X. Continuous X is one of the core elements of modern DevOps, and by extension MLOps. Continuous X assumes that we have a (long) developer pipeline (see image below) where we want to make some changes to our code, e.g.:
Basically, we expect any code change to have an influence on the final result. The problem with making changes at the start of our pipeline is that we want the change to propagate all the way through to the end of the pipeline.
Image creditThis is where continuous X comes into play. The word continuous here refers to the fact that the pipeline should be continuously updated as we make code changes. You can also choose to think of this as the automation of processes. The X then covers the fact that the process we need to go through to automate steps in the pipeline depends on where we are in the pipeline, e.g. the tools needed for continuous integration are different from the tools needed for continuous delivery.
In this session, we are going to focus on continuous integration (CI). As indicated in the image above, continuous integration usually takes care of the first part of the developer pipeline, which has to do with the code base, code building and code testing. This is a paramount step in automation, as we would rather catch bugs at the beginning of our pipeline than at the end.
Learning objectives
The learning objectives of this session are:
The continuous integration we have looked at until now is what we can consider \"classical\" continuous integration, which has its roots in DevOps and not MLOps. While the test that we have written and the containers we have developed in the previous session have been about machine learning, everything we have done translates completely to how it would be done if we had developed any other application that did not include machine learning.
In this session, we are now gonna change gears and look at continuous machine learning (CML). As the name may suggest we are now focusing on automatizing actual machine learning processes. The reason for doing this is the same as with continuous integration, namely that we often have a bunch of checks that we want our newly trained model to pass before we trust it to be ready for deployment. Writing unit tests secures that the code that we use for training our model is not broken, but there exist other failure modes of a machine learning pipeline:
All these questions are questions that we can answer by writing tests that are specific to machine learning. In this session, we are going to look at how we can begin to use Github Actions to automate these tests.
"},{"location":"s5_continuous_integration/cml/#mlops-maturity-model","title":"MLOps maturity model","text":"Before getting started with the exercises, let's first take a side step and look at what is called the MLOps maturity model. The reason here is to get a better understanding of when continuous machine learning is relevant. The main idea behind the MLOps maturity model is to help organizations understand where they are in their machine learning operations journey and what the next logical steps are. The model is divided into five stages:
Image creditLevel 0
At this level, organizations are doing machine learning in an ad-hoc manner. There is no standardization, no version control, no testing, and no monitoring.
Level 1
At this level, organizations have started to implement DevOps practices in their machine learning workflows. They have started to use version control and maybe come with basic continuous integration practices.
Level 2
At this level, organizations have started to standardize the training process and tackle the problem of creating reproducible experiments. Centralization of model artifacts and metadata is common at this level. They have started to implement model versioning and model registry practices.
Level 3
At this level, organizations have started to implement continuous integration and continuous deployment practices. They have started to automate the testing of their models and have started to monitor their models in production.
Level 4
At this level, organizations have started to implement continuous machine learning practices. They have started to automate the training, evaluation, and deployment of their models. They have started to implement automated retraining and model updates.
The MLOps maturity model tells us that continuous machine learning is the highest form of maturity in MLOps. It is the stage where we have automated the entire machine learning pipeline and the cases we will be going through in the exercises are therefore some of the last steps in the MLOps maturity model.
"},{"location":"s5_continuous_integration/cml/#exercises","title":"\u2754 Exercises","text":"In the following exercises, we are going to look at two different cases where we can use continuous machine learning. The first one is a simple case where we are automatically going to trigger some workflow (like training of a model) whenever we make changes to our data. This is a very common use case in machine learning where we have a data pipeline that is continuously updating our data. The second case is connected to staging and deploying models. In this case, we are going to look at how we can automatically do further processing of our model whenever we push a new model to our repository.
For the first set of exercises, we are going to rely on the cml
framework by iterative.ai, which is a framework that is built on top of GitHub actions. The figure below describes the overall process using the cml
framework. It should be clear that it is the very same process that we go through in the other continuous integration sessions: push code
-> trigger GitHub actions
-> do stuff
. The new part in this session is that we are only going to trigger the workflow whenever data changes.
Image credit
If you have not already created a dataset class for the corrupted MNIST data, start by doing that. Essentially, it is a class that should inherit from torch.utils.data.Dataset
and should have a __getitem__
and __len__
from __future__ import annotations\n\nimport os\nfrom typing import TYPE_CHECKING\n\nimport torch\nfrom torch import Tensor\nfrom torch.utils.data import Dataset\n\nif TYPE_CHECKING:\n import torchvision.transforms.v2 as transforms\n\n\nclass MnistDataset(Dataset):\n \"\"\"MNIST dataset for PyTorch.\n\n Args:\n data_folder: Path to the data folder.\n train: Whether to load training or test data.\n img_transform: Image transformation to apply.\n target_transform: Target transformation to apply.\n \"\"\"\n\n name: str = \"MNIST\"\n\n def __init__(\n self,\n data_folder: str = \"data\",\n train: bool = True,\n img_transform: transforms.Transform | None = None,\n target_transform: transforms.Transform | None = None,\n ) -> None:\n super().__init__()\n self.data_folder = data_folder\n self.train = train\n self.img_transform = img_transform\n self.target_transform = target_transform\n self.load_data()\n\n def load_data(self) -> None:\n \"\"\"Load images and targets from disk.\"\"\"\n images, target = [], []\n if self.train:\n nb_files = len([f for f in os.listdir(self.data_folder) if f.startswith(\"train_images\")])\n for i in range(nb_files):\n images.append(torch.load(f\"{self.data_folder}/train_images_{i}.pt\"))\n target.append(torch.load(f\"{self.data_folder}/train_target_{i}.pt\"))\n else:\n images.append(torch.load(f\"{self.data_folder}/test_images.pt\"))\n target.append(torch.load(f\"{self.data_folder}/test_target.pt\"))\n self.images = torch.cat(images, 0)\n self.target = torch.cat(target, 0)\n\n def __getitem__(self, idx: int) -> tuple[Tensor, Tensor]:\n \"\"\"Return image and target tensor.\"\"\"\n img, target = self.images[idx], self.target[idx]\n if self.img_transform:\n img = self.img_transform(img)\n if self.target_transform:\n target = self.target_transform(target)\n return img, target\n\n def __len__(self) -> int:\n \"\"\"Return the number of images in the dataset.\"\"\"\n return self.images.shape[0]\n
Then let's create a function that can report basic statistics such as the number of training samples, number of test samples and generate figures of sample images in the dataset and distribution of the classes in the dataset. This function should be called dataset_statistics
and should take a path to the dataset as input.
import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom mnist_dataset import MnistDataset\nfrom utils import show_image_and_target\n\n\n@click.command()\n@click.option(\"--datadir\", default=\"data\", help=\"Path to the data directory\")\ndef dataset_statistics(datadir: str) -> None:\n \"\"\"Compute dataset statistics.\"\"\"\n train_dataset = MnistDataset(data_folder=datadir, train=True)\n test_dataset = MnistDataset(data_folder=datadir, train=False)\n print(f\"Train dataset: {train_dataset.name}\")\n print(f\"Number of images: {len(train_dataset)}\")\n print(f\"Image shape: {train_dataset[0][0].shape}\")\n print(\"\\n\")\n print(f\"Test dataset: {test_dataset.name}\")\n print(f\"Number of images: {len(test_dataset)}\")\n print(f\"Image shape: {test_dataset[0][0].shape}\")\n\n show_image_and_target(train_dataset.images[:25], train_dataset.target[:25], show=False)\n plt.savefig(\"mnist_images.png\")\n plt.close()\n\n train_label_distribution = torch.bincount(train_dataset.target)\n test_label_distribution = torch.bincount(test_dataset.target)\n\n plt.bar(torch.arange(10), train_label_distribution)\n plt.title(\"Train label distribution\")\n plt.xlabel(\"Label\")\n plt.ylabel(\"Count\")\n plt.savefig(\"train_label_distribution.png\")\n plt.close()\n\n plt.bar(torch.arange(10), test_label_distribution)\n plt.title(\"Test label distribution\")\n plt.xlabel(\"Label\")\n plt.ylabel(\"Count\")\n plt.savefig(\"test_label_distribution.png\")\n plt.close()\n\n\nif __name__ == \"__main__\":\n dataset_statistics()\n
Next, we are going to implement a GitHub actions workflow that only activates when we make changes to our data. Create a new workflow file (call it cml_data.yaml
) and make sure it only activates on push/pull-request events when data/
changes. Relevant documentation
The secret is to use the paths
keyword in the workflow file. We here specify that the workflow should only trigger when the .dvc
folder or any file with the .dvc
extension changes, which is the case when we update our data and call dvc add data/
.
name: DVC Workflow\n\non:\n pull_request:\n branches:\n - main\n paths:\n - '**/*.dvc'\n - '.dvc/**'\n
The next step is to implement steps in our workflow that do something when data changes. This is the reason why we created the dataset_statistics
function. Implement a workflow that:
dataset_statistics
function on the dataThis solution assumes that data is stored in a GCP bucket and that the credentials are stored in a secret called GCP_SA_KEY
. If this is not the case for you, you need to adjust the workflow accordingly with the correct way to pull the data.
jobs:\n run_data_checker:\n runs-on: ubuntu-latest\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n\n - name: Install dependencies\n run: |\n make dev_requirements\n pip list\n\n - name: Auth with GCP\n uses: google-github-actions/auth@v2\n with:\n credentials_json: ${{ secrets.GCP_SA_KEY }}\n\n - name: Pull data\n run: |\n dvc pull --no-run-cache\n\n - name: Check data statistics\n run: |\n python dataset_statistics.py\n
Let's make sure that the workflow works as expected for now. Create a new branch and either add or remove a file in the data/
folder. Then run
dvc add data/\ngit add data.dvc\ngit commit -m \"Update data\"\ngit push\n
to commit the changes to data. Open a pull request with the branch and make sure that the workflow activates and runs as expected.
Let's now add the cml
framework such that we can comment the results of the dataset_statistics
function in the pull request automatically. Look at the getting started guide for help on how to do this. You will need write all the content of the dataset_statistics
function to a file called report.md
and then use the cml comment create
command to create a comment in the pull request with the content of the file.
jobs:\n dataset_statistics:\n runs-on: ubuntu-latest\n steps:\n # ...all the previous steps\n - name: Check data statistics & generate report\n run: |\n python src/example_mlops/data.py > data_statistics.md\n echo '![](./mnist_images.png \"MNIST images\")' >> data_statistics.md\n echo '![](./train_label_distribution.png \"Train label distribution\")' >> data_statistics.md\n echo '![](./test_label_distribution.png \"Test label distribution\")' >> data_statistics.md\n\n - name: Setup cml\n uses: iterative/setup-cml@v2\n\n - name: Comment on PR\n env:\n REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n run: |\n cml comment create data_statistics.md --watermark-title=\"Data Checker\" # (1)!\n
--watermark-title
flag is used to watermark the comment created by cml
. It is to make sure that no new comments are created every time the workflow runs.Make sure that the workflow works as expected. You should see a comment created by github-actions (bot)
like this if you have done everything correctly:
(Optional) Feel free to add more checks to the workflow. For example, you could add a check that runs a small baseline model on the updated data and checks that the model converges. This is a very common sanity check that is done in machine learning pipelines.
For the second set of exercises, we are going to look at how to automatically run further testing of our models whenever we add them to our model registry. For that reason, do not continue with this set of exercises before you have completed the exercises on the model registry in this module.
The model registry is in general a repository of a team's trained models where ML practitioners publish candidates for production and share them with others. Figure from wandb.
The first step is in our weights and bias account to create a team. Some of these more advanced features are only available for teams, however every user is allowed to create one team for free. Go to your weights and bias account and create a team (the option should be on the left side of the UI). Give a team name and select W&B cloud storage.
Now we need to generate a personal access token that can link our weights and bias account to our GitHub account. Go to this page and generate a new token. You can also find the page by clicking your profile icon in the upper right corner of Github and selecting Settings
, then Developer settings
, then Personal access tokens
and finally choose either Tokens (classic)
or Fine-grained tokens
(which is the safer option and also what the link points to).
Give it a name, set which repositories it should have access to and select the permissions you want it to have. In our case, if you choose to create a Fine-grained token
then it needs access to the contents:write
permission. If you choose Tokens (classic)
then it needs access to the repo
permission. After you have created the token, copy it and save it somewhere safe.
Go to the settings of your newly created team: https://wandb.ai/teamname/settings and scroll down to the Team secrets
section. Here add the token you just created as a secret with the name GITHUB_ACTIONS_TOKEN
. WANDB will now be able to use this token to trigger actions in your repository.
On the same settings page, scroll down to the Webhooks
settings. Click the New webhook
button and fill in the following information:
github_actions_dispatch
https://api.github.com/repos/<owner>/<repo>/dispatches
GITHUB_ACTIONS_TOKEN
Here you need to replace <owner>
and <repo>
with your own information. The /dispatches
endpoint is a special endpoint that all GitHub Actions workflows can listen to. Thus, if you ever want to set up a webhook in some other framework that should trigger a GitHub action, you can use this endpoint.
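To make this concrete, here is a small sketch (assuming a hypothetical personal access token stored in a GH_TOKEN environment variable and your own <owner> and <repo> values) of how any service could call the endpoint itself:
import os\n\nimport requests\n\nresponse = requests.post(\n \"https://api.github.com/repos/<owner>/<repo>/dispatches\",\n headers={\n \"Accept\": \"application/vnd.github+json\",\n \"Authorization\": f\"Bearer {os.getenv('GH_TOKEN')}\",\n },\n json={\"event_type\": \"staged_model\", \"client_payload\": {\"artifact_version_string\": \"entity/project/model:v0\"}},\n)\nresponse.raise_for_status() # the endpoint returns 204 No Content on success\n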
Next, navigate to your model registry. It should hopefully contain at least one registry with at least one model registered. If not, go back to the previous module and do that.
When you have a model in your registry, click on the View details
button. Then click the New automation
button. On the first page, select that you want to trigger the automation when an alias is added to a model version, set that alias to staging
and select the action type to be Webhook
. On the next page, select the github_actions_dispatch
webhook that you just created and add this as the payload:
{\n \"event_type\": \"staged_model\",\n \"client_payload\":\n {\n \"event_author\": \"${event_author}\",\n \"artifact_version\": \"${artifact_version}\",\n \"artifact_version_string\": \"${artifact_version_string}\",\n \"artifact_collection_name\": \"${artifact_collection_name}\",\n \"project_name\": \"${project_name}\",\n \"entity_name\": \"${entity_name}\"\n }\n}\n
Finally, on the next page give the automation a name and click Create automation
.
Make sure you understand overall what is happening here.
SolutionThe automation is set up to trigger a webhook whenever the alias staging
is added to a model version. The webhook is set up to trigger a Github action workflow that listens to the /dispatches
endpoint and has the event type staged_model
. The payload that is sent to the webhook contains information about the model that was staged.
We are now ready to create the GitHub Actions workflow
that listens to the /dispatches
endpoint and triggers whenever a model is staged. Create a new workflow file (called stage_model.yaml
) and make sure it only activates on the staged_model
event. Hint: relevant documentation
name: Check staged model\n\non:\n repository_dispatch:\n types: staged_model\n
Next, we need to implement the steps in our workflow that do something when a model is staged. The payload that is sent to the webhook contains information about the model that was staged. Implement a workflow that:
jobs:\n identify_event:\n runs-on: ubuntu-latest\n outputs:\n model_name: ${{ steps.set_output.outputs.model_name }}\n steps:\n - name: Check event type\n run: |\n echo \"Event type: repository_dispatch\"\n echo \"Payload Data: ${{ toJson(github.event.client_payload) }}\"\n\n - name: Setting model environment variable and output\n id: set_output\n run: |\n echo \"model_name=${{ github.event.client_payload.artifact_version_string }}\" >> $GITHUB_OUTPUT\n
We now need to write a script that can be executed on our staged model. In this case, we are going to run some performance tests on it to check that it is fast enough for deployment. Therefore, do the following:
In a tests/performancetests
folder, create a new file called test_model.py
Implement a test that loads the model from an wandb artifact path e.g. //: and runs it on a random input. Importantly, the artifact path should be read from an environment variable called MODEL_NAME
.
The test should assert that the model can do 100 predictions in less than X amount of time
In this solution we assume that 4 environment variables are set: WANDB_API
, WANDB_ENTITY
, WANDB_PROJECT
and MODEL_NAME
.
import os\nimport time\n\nimport torch\nimport wandb\nfrom my_project.models import MyModel\n\n\ndef load_model(artifact_path, logdir=\"models\"):\n \"\"\"Download a model checkpoint from wandb and load it.\"\"\"\n api = wandb.Api(\n api_key=os.getenv(\"WANDB_API_KEY\"),\n overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n )\n artifact = api.artifact(artifact_path)\n artifact.download(root=logdir)\n file_name = artifact.files()[0].name\n return MyModel.load_from_checkpoint(f\"{logdir}/{file_name}\")\n\n\ndef test_model_speed():\n \"\"\"Assert that the model can do 100 predictions in less than 1 second.\"\"\"\n model = load_model(os.getenv(\"MODEL_NAME\"))\n start = time.time()\n for _ in range(100):\n model(torch.rand(1, 1, 28, 28))\n end = time.time()\n assert end - start < 1\n
Let's now add another job that calls the script we just wrote. It needs to:
which is very similar to the kind of jobs we have written before.
Solutionjobs:\n identify_event:\n ...\n test_model:\n runs-on: ubuntu-latest\n needs: identify_event\n env:\n WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}\n WANDB_ENTITY: ${{ secrets.WANDB_ENTITY }}\n WANDB_PROJECT: ${{ secrets.WANDB_PROJECT }}\n MODEL_NAME: ${{ needs.identify_event.outputs.model_name }}\n steps:\n - name: Echo model name\n run: |\n echo \"Model name: $MODEL_NAME\"\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n\n - name: Install dependencies\n run: |\n pip install -r requirements.txt\n pip list\n\n - name: Test model\n run: |\n pytest tests/performancetests/test_model.py\n
Finally, we are going to assume in this setup that if the model gets this far then it is ready for deployment. We are therefore going to add a final job that will add a new alias to the model called production
. Here is some relevant Python code that can be used to add the alias:
import click\nimport os\nimport wandb\n\n@click.command()\n@click.argument(\"artifact-path\")\n@click.option(\n \"--aliases\", \"-a\", multiple=True, default=[\"staging\"], help=\"List of aliases to link the artifact with.\"\n)\ndef link_model(artifact_path: str, aliases: list[str]) -> None:\n \"\"\"\n Stage a specific model to the model registry.\n\n Args:\n artifact_path: Path to the artifact to stage.\n Should be of the format \"entity/project/artifact_name:version\".\n aliases: List of aliases to link the artifact with.\n\n Example:\n model_management link-model entity/project/artifact_name:version -a staging -a best\n\n \"\"\"\n if artifact_path == \"\":\n click.echo(\"No artifact path provided. Exiting.\")\n return\n\n api = wandb.Api(\n api_key=os.getenv(\"WANDB_API_KEY\"),\n overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n )\n _, _, artifact_name_version = artifact_path.split(\"/\")\n artifact_name, _ = artifact_name_version.split(\":\")\n\n artifact = api.artifact(artifact_path)\n artifact.link(target_path=f\"{os.getenv('WANDB_ENTITY')}/model-registry/{artifact_name}\", aliases=aliases)\n artifact.save()\n click.echo(f\"Artifact {artifact_path} linked to {aliases}\")\n
for example, you can run this script with the following command:
python link_model.py entity/project/artifact_name:version -a staging -a production\n
Implement a final job that calls this script and adds the production
alias to the model.
jobs:\n identify_event:\n ...\n test_model:\n ...\n add_production_alias:\n runs-on: ubuntu-latest\n needs: identify_event\n env:\n WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}\n WANDB_ENTITY: ${{ secrets.WANDB_ENTITY }}\n WANDB_PROJECT: ${{ secrets.WANDB_PROJECT }}\n MODEL_NAME: ${{ needs.identify_event.outputs.model_name }}\n steps:\n - name: Echo model name\n run: |\n echo \"Model name: $MODEL_NAME\"\n\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n\n - name: Install dependencies\n run: |\n pip install -r requirements.txt\n pip list\n\n - name: Add production alias\n run: |\n python link_model.py $MODEL_NAME -a production\n
Finally, make sure the workflow works as expected. To try it out again and again for testing purposes, you can just manually add and then delete the staging
alias to any model version in the model registry.
(Optional) Consider adding more checks to the workflow. For example, you could add a step that checks if the model is too large for deployment, runs some further evaluation scripts, or checks if the model is robust to adversarial attacks. Only the imagination sets the limits here.
(Optional) If you have got this far, consider combining principles from the two exercises. Here is an idea: we use the workflow from the second exercise to trigger a workflow that checks a staged model for performance. We then use the cml
framework to automatically create a pull request e.g. use cml pr create
instead of cml comment create
to create a pull request with the results of the performance test. If we are happy with the performance, we can then approve that pull request and the production alias is added to the model. This is a better workflow because it allows for human intervention before the model is deployed.
What is the difference between continuous integration and continuous machine learning?
SolutionThere are three key differences between continuous integration and continuous machine learning:
Imagine you get hired in the pharmaceutical industry and are asked to develop a machine learning pipeline that can automatically sort out which drugs are safe and which are not. What level of the MLOps maturity model would you strive to reach?
SolutionThere is really no right or wrong answer here, but in most cases we would actually not aim for level 4. The reason is that the consequences of a bad model in this case can be severe. Therefore, we would probably not want automated retraining and model updates, which is what level 4 is about. Instead, we would probably aim for level 3 where we have automated testing and monitoring of our models but there is still human oversight in the process.
This ends the module on continuous machine learning. As we have hopefully convinced you, it is only the imagination that sets the limits for what you can use Github actions for in your machine learning pipeline. However, we do want to stress that it is important that human oversight is always present in the process. Automation is great, but it should never replace human judgement. This is especially true in machine learning where the consequences of a bad model can be severe if it is used in critical decision making.
Finally, if you have completed the exercises on using the cloud, consider checking out the cml runner launch command, which allows you to run your workflows on cloud resources instead of the GitHub Actions runners.
"},{"location":"s5_continuous_integration/github_actions/","title":"M17 - Github Actions","text":""},{"location":"s5_continuous_integration/github_actions/#github-actions","title":"GitHub actions","text":"Core Module
With the tests established in the previous module, we are now ready to move on to implementing some continuous integration in our pipeline. As you have probably already realized, testing your code locally may be cumbersome, because
For these reasons, we want to automate the testing, such that it is done every time we push to our repository. If we combine this with only pushing to branches and then only merging these branches whenever all automated testing has passed, our code should be fairly safe against unwanted bugs (assuming your tests cover your code well).
"},{"location":"s5_continuous_integration/github_actions/#github-actions_1","title":"GitHub actions","text":"GitHub actions are the continuous integration solution that GitHub provides. Each of your repositories gets 2,000 minutes of free testing per month which should be more than enough for the scope of this course (and probably all personal projects you do). Getting GitHub actions set up in a repository may seem complicated at first, but workflow files that you create for one repository can more or less be reused for any other repository that you have.
Let's take a look at how a GitHub workflow file is organized:
name
runs-on
, here we can specify which operating system we want the workflow to run on.
steps
. This is where we specify the actual commands that should be run when the workflow is executed.
folder in the root of your repository. Add a sub-folder to that called workflows
.
Go over this page that explains how to do automated testing of Python code in GitHub actions. You do not have to understand everything, but at least get a feeling of what a workflow file should look like.
We have provided a workflow file called tests.yaml
that should run your tests for you. Place this file in the .github/workflows/
folder. The workflow file consists of three steps
First, a Python environment is initiated (in this case Python 3.11)
Next all dependencies required to run the test are installed
Finally, pytest
is called and our tests will be run
Go over the file and try to understand the overall structure and syntax of the file.
tests.yaml
tests.yamlname: \"Run tests\"\n\non:\n push:\n branches: [ master, main ]\n pull_request:\n branches: [ master, main ]\n\njobs:\n build:\n\n runs-on: ubuntu-latest\n\n steps:\n - name: Checkout\n uses: actions/checkout@v4\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n - name: Install dependencies\n run: |\n python -m pip install --upgrade pip\n pip install -r requirements.txt\n pip install -r requirements_tests.txt\n - name: Test with pytest\n run: |\n pytest -v\n
For the script to work you need to define the requirements.txt
and requirements_tests.txt
. The first file should contain all packages required to run your code. The second file contains all additional packages required to run the tests. In your simple case, it may very well be that the second file is empty, however, sometimes additional packages are used for testing that are not strictly required for the scripts to run.
Finally, try pushing the changes to your repository. Hopefully, your tests should just start, and you will after some time see a green check mark next to the hash of the commit. Also, try to inspect the Actions tab where you can see the history of actions run.
Normally we develop code on only one operating system and just hope that it will work on other operating systems. However, continuous integration enables us to automatically test on other systems than the one we are using.
The provided tests.yaml
only runs on one operating system. Which one?
Alter the file such that it executes the test on the two other main operating systems that exist. You can find information on available operating systems also called runners here
SolutionWe can \"parametrize\" our script to run on different operating systems by using the strategy
attribute. This attribute allows us to define a matrix of values that the workflow will run on. The following code will run the tests on ubuntu-latest
, windows-latest
, and macos-latest
:
jobs:\n build:\n runs-on: ${{ matrix.os }}\n strategy:\n matrix:\n os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n
Can you also figure out how to run the tests using different Python versions?
SolutionJust add another line to the strategy
attribute that specifies the Python version and use the value in the setup Python action. The following code will run the tests on Python versions 3.10, 3.11 and 3.12:
jobs:\n build:\n runs-on: ${{ matrix.os }}\n strategy:\n matrix:\n os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n python-version: [\"3.10\", \"3.11\", \"3.12\"]\n\n steps:\n - uses: actions/checkout@v4\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: ${{ matrix.python-version }}\n
If you push the changes above you may see that whenever one of the tests in the matrix fails, it automatically cancels the other tests. This is done to save time and resources. However, sometimes you want all the tests to run even if one fails. Can you figure out how to do that?
SolutionYou can set the fail-fast
attribute to false
under the strategy
attribute:
jobs:\n build:\n runs-on: ${{ matrix.os }}\n strategy:\n fail-fast: false\n matrix:\n os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n python-version: [\"3.10\", \"3.11\", \"3.12\"]\n
As the workflow is currently implemented, GitHub actions will destroy every downloaded package when the workflow has been executed. To improve this we can take advantage of caching
:
Figure out how to implement caching
in your workflow file. You can find a guide here and here.
steps:\n- uses: actions/checkout@v4\n- uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip' # caching pip dependencies\n- run: pip install -r requirements.txt\n
When you have implemented a caching system go to Actions->Caches
in your repository and make sure that they are correctly added. It should look something like the image below
Measure how long your workflow takes before and after adding caching
to your workflow. Did it improve the runtime of your workflow?
(Optional) Code coverage can also be added to the workflow file by uploading it as an artifact after running the coverage. Follow the instructions in this post on how to do it.
With different checks in place, it is a good time to learn about branch protection rules. A branch protection rule is essentially some kind of guarding that prevents you from merging code into a branch before certain conditions are met. In this exercise, we will create a branch protection rule that requires all checks to pass before merging code into the main branch.
Start by going into your Settings -> Rules -> Rulesets
and create a new branch ruleset. See the image below.
In the ruleset start by giving it a name and then set the target branches to be Default branch
. This means that the ruleset will be applied to your master/main branch. As shown in the image below, two rules may be particularly beneficial when you later start working with other people:
The first rule to consider is Require a pull request before merging. As the name suggests this rule requires that changes that are to be merged into the main branch must be done through a pull request. This is a good practice as it allows for code review and testing before the code is merged into the main branch. Additionally, this opens the option to specify that the code must be reviewed (or at least approved) by a certain number of people.
The second rule to consider is Require status checks to pass. This rule makes sure that our workflows are passing before we can merge code into the main branch. You can select which workflows are required, as some may be nice to have passing but not strictly needed.
Finally, if you think the rules are a bit too restrictive, you can always allow the repository admin (e.g. you) to bypass the rules by adding Repository admin
to the bypass list. Implement the following rules:
If you have created the rules correctly you should see something like the image below when you try to merge a pull request. In this case, all three checks are required to pass before the code can be merged. Additionally, a single reviewer is required to approve the code. A bypass rule is also set up for the repository admin.
One problem you may have encountered is running your tests that have to do with your data, with the core problem being that your data is not stored in GitHub (assuming you have done module M8 - DVC) and therefore cannot be tested. However, we can download data while running our continuous integration. Let's try to create that:
The first problem is that we need our continuous integration pipeline to be able to authenticate with our storage solution. We can take advantage of an authentication file that is created the first time we push with DVC. It is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
macOS: ~/Library/Caches
Linux: ~/.cache
This is the typical location, but it may vary depending on what distro you are running
Windows: {user}/AppData/Local
Find the file. The content should look similar to this (only some fields are shown):
{\n \"access_token\": ...,\n \"client_id\": ...,\n \"client_secret\": ...,\n \"refresh_token\": ...,\n ...\n}\n
The content of that file should be treated as a password and not shared with the world, so the relevant question is how to use this info in a public repository. The answer is GitHub secrets, where we can store information and access it in our workflow files without making it public. Navigate to the secrets option (as shown below) and create a secret with the name GDRIVE_CREDENTIALS_DATA
that contains the content of the file you found in the previous exercise.
Afterward, add the following code to your workflow file:
- uses: iterative/setup-dvc@v1\n- name: Get data\n run: dvc pull\n env:\n GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n
that runs dvc pull
using the secret authentication file. For help you can visit this small repository that implements the same workflow.
Finally, add the changes, commit, push and confirm that everything works as expected. You should now be able to run unit tests that depend on your input data.
In module M6 on good coding practices (optional module) of the course you were introduced to a couple of good coding practices such as being consistent with your coding style, how your Python packages are sorted and that your code follows certain standards. All this was done using the ruff
framework. In this set of exercises, we will create GitHub workflows that will automatically test for this.
Create a new workflow file called codecheck.yaml
, that implements the following three steps
Setup Python environment
Installs ruff
Runs ruff check
and ruff format
on the repository
(HINT: You should be able to just change the last steps of the tests.yaml
workflow file)
name: Code formatting\n\non:\n push:\n branches:\n - main\n pull_request:\n branches:\n - main\n\njobs:\n format:\n runs-on: ubuntu-latest\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n - name: Install dependencies\n run: |\n pip install ruff\n pip list\n - name: Ruff check\n run: ruff check .\n - name: Ruff format\n run: ruff format .\n
In addition to ruff
we also used mypy
in those sets of exercises for checking if the typing we added to our code was good enough. Add another step to the codecheck.yaml
file which runs mypy
on your repository.
Try to make sure that all steps pass on your repository. Especially mypy
can be hard to get passing, so this exercise formally only requires you to get ruff
passing.
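To give an idea of the kind of thing mypy typically complains about, here is a small hedged sketch: an argument with a None default must be typed as Optional (or float | None), and the None case must be handled before the value is used.
from typing import Optional\n\n# mypy rejects 'def scale(x: float, factor: float = None)' because None is not a float.\n# Annotating the argument as Optional and handling the None case makes the check pass.\ndef scale(x: float, factor: Optional[float] = None) -> float:\n    if factor is None:\n        factor = 1.0\n    return x * factor\n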
(Optional) As you have probably already experienced in module M9 on docker, it can be cumbersome to build docker images, sometimes taking a couple of minutes each time we make changes to our code base. For this reason, we only want to build a new image every time we commit our code, because that should mark that we believe the code to be working at that point. Thus, let's automate the process of building our docker images using GitHub actions. Do note that a future module will look at how to build containers using cloud providers, and this exercise is therefore very much optional.
Start by making sure you have a dockerfile in your repository. If you do not have one, you can use the following simple dockerfile:
FROM busybox\nCMD echo \"Howdy cowboy\"\n
Push the dockerfile to your repository
Next, create a Docker Hub account
Within Docker Hub create an access token by going to Settings -> Security
. Click the New Access Token
button and give it a name that you recognize.
Copy the newly created access token and head over to your GitHub repository online. Go to Settings -> Secrets -> Actions
and click the New repository secret
. Copy over the access token and give it the name DOCKER_HUB_TOKEN
. Additionally, add two other secrets DOCKER_HUB_USERNAME
and DOCKER_HUB_REPOSITORY
that contain your docker username and docker repository name respectively.
Next, we are going to construct the actual Github actions workflow file
name: Docker Image continuous integration\n\non:\n push:\n branches: [ master ]\n\njobs:\n build:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v4\n - name: Build the Docker image\n run: |\n echo \"${{ secrets.DOCKER_HUB_TOKEN }}\" | docker login \\\n -u \"${{ secrets.DOCKER_HUB_USERNAME }}\" --password-stdin docker.io\n docker build . --file Dockerfile \\\n --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n docker push \\\n docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n
The first part of the workflow file should look somewhat recognizable. However, the last three lines are where all the magic happens. Carefully go through them and figure out what they do. If you want some help you can look at the help page for docker login
, docker build
and docker push
.
Upload the workflow to your GitHub repository and check that it is being executed. If everything works you should be able to see the built docker image in your container repository on Docker Hub.
Make sure that you can execute docker pull
locally to pull down the image that you just continuously built
(Optional) To test that the container works directly in GitHub you can also try to include an additional step that runs the container.
- name: Run container\n run: |\n docker run ...\n
A great feature that GitHub provides is the ability to have bots help you with maintaining your repository. One of the most useful bots is called Dependabot
. As the name suggests, Dependabot
helps you keep your dependencies up to date. This is important because dependencies often either contain fixes for bugs or security vulnerabilities that you want to have in your code.
To get dependabot working in your repository, we need to add a single configuration file to your repository. Create a file called .github/dependabot.yaml
. Look through the documentation for how to set up the file such that it updates your Python dependencies on a weekly basis.
The following code will check for updates in the pip
ecosystem every week e.g. it automatically will look for requirements.txt
files and update the packages in there.
version: 2\nupdates:\n - package-ecosystem: \"pip\"\n directory: \"/\"\n schedule:\n interval: \"weekly\"\n
Insights
tab and then the Dependency graph
tab. From here, under the Dependabot
tab, you should be able to see if the bot has correctly identified what files to track and if it has found any updates.
Click the Recent update jobs
to see the history of Dependabot checking for updates. If there are no updates you can try to click the Check for updates
button to force Dependabot to check for updates.
At this point the Dependabot should hopefully have found some updates and created one or more pull requests. If it has not done so you most likely need to update your requirement file such that your dependencies are correctly restricted/specified e.g.
# lets assume pytorch v2.5 is the latest version\n\n# these different specifications will not trigger dependabot because\n# the latest version is included in the specification\ntorch\ntorch == 2.5\ntorch >= 2.5\ntorch ~= 2.5\n\n# these specifications will trigger dependabot because the latest\n# version is not included\ntorch < 2.5\ntorch == 2.4\ntorch <= 2.4\n
If you have a pull request from Dependabot, check it out and see if it looks good. If it does, you can merge it.
(Optional) Dependabot can also help keep our GitHub Actions pipelines up-to-date. As you may have realized during this module, we write statements like the following in our workflow files:
...\njobs:\n build:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v4\n...\n
The @v4
specifies that we are using version 4 of the actions/checkout
action. This means that if a new version of the action is released, we will not automatically get the new version. Dependabot can help us with this. Try adding to the dependabot.yaml
file that Dependabot should also check for updates in the GitHub Actions ecosystem.
version: 2\nupdates:\n - package-ecosystem: \"pip\"\n directory: \"/\"\n schedule:\n interval: \"weekly\"\n - package-ecosystem: \"github-actions\"\n directory: \"/\"\n schedule:\n interval: \"weekly\"\n
When working with GitHub actions you will often encounter the following 4 concepts:
Try to define them in your own words.
Solutionyaml
file that defines the instructions to be executed on specific events. Needs to be placed in the .github/workflows
folder.The on
attribute specifies upon which events the workflow will be triggered. Assume you have set the on
attribute to the following:
on:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n schedule:\n - cron: \"0 0 * * *\"\n workflow_dispatch: {}\n
What 4 events would trigger the execution of that action?
SolutionA push to the main
branch would trigger it, a pull request targeting main
would trigger it, and the cron schedule would trigger it every day at midnight. The fourth event, workflow_dispatch, means the workflow can be executed by manually triggering it through the GitHub UI, for example, as shown below
This ends the module on GitHub workflows. If you are more interested in this topic you can check out module M31 on documentation which first includes locally building some documentation for your project and afterward use GitHub actions for deploying it to GitHub Pages. Additionally, GitHub also has a lot of templates already for running different continuous integration tasks. If you try to create a workflow file directly in GitHub you may encounter the following page
We highly recommend checking this out if you want to write any other kind of continuous integration pipeline in GitHub actions. We can also recommend this repository that has a list of awesome actions and check out the act repository which is a tool for running your GitHub Actions locally!
"},{"location":"s5_continuous_integration/pre_commit/","title":"M18 - Pre-commit","text":""},{"location":"s5_continuous_integration/pre_commit/#pre-commit","title":"Pre-commit","text":"One of the cornerstones of working with git is remembering to commit your work often. Often committing makes sure that it is easier to identify and revert unwanted changes that you have introduced, because the code changes becomes smaller per commit.
However, as you have hopefully already seen in the course, there are a lot of mental tasks to do before you actually write git commit
in the terminal. The most basic thing is of course making sure that you have saved all your changes and are not committing an out-of-date file. However, this also includes tasks such as styling, formatting, making sure all tests succeed etc. All these mental to-do notes do not mix well with the principle of remembering to commit often, because you in principle have to do them every time.
The obvious solution to this problem is to automate all or some of these mental tasks every time that we do a commit. This is where pre-commit hooks come into play, as they can help us attach additional tasks that should be run every time that we do a git commit
.
Pre-commit simply works by inserting whatever workflow we want to automate in between us running git commit
and afterwards running git push
.
The system works by looking for a file called .pre-commit-config.yaml
that we can configure. If we execute
pre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8\n
you should get a sample file that looks like
# See https://pre-commit.com for more information\n# See https://pre-commit.com/hooks.html for more hooks\nrepos:\n- repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v3.2.0\n hooks:\n - id: trailing-whitespace\n - id: end-of-file-fixer\n - id: check-yaml\n - id: check-added-large-files\n
the file structure is very simple: for each repository of hooks we specify the rev (revision) to use and the id
of the different hooks we want. The id
corresponds to an id
in this file: https://github.com/pre-commit/pre-commit-hooks/blob/master/.pre-commit-hooks.yaml
When we are done defining our .pre-commit-config.yaml
we just need to install it
pre-commit install\n
this will make sure that the file is automatically executed whenever we run git commit
Install pre-commit
pip install pre-commit\n
Consider adding pre-commit
to a requirements_dev.txt
file, as it is a development tool.
Next create the sample file
pre-commit sample-config > .pre-commit-config.yaml\n
The sample file already contains 4 hooks. Make sure you understand what each of them does and whether you need them at all.
pre-commit
works by hooking into the git commit
command, running whenever that command is run. For this to work, we need to install the hooks into git commit
. Run
pre-commit install\n
to do this.
Try to commit your recently created .pre-commit-config.yaml
file. It will likely not do anything, because pre-commit
only checks files that are being committed. Instead try to run
pre-commit run --all-files\n
that will check every file in your repository.
Try adding at least another check from the base repository to your .pre-commit-config.yaml
file.
In this case we have added the check-json
hook to our .pre-commit-config.yaml
file, which will automatically check that all JSON files are valid.
repos:\n- repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v3.2.0\n hooks:\n - id: trailing-whitespace\n - id: end-of-file-fixer\n - id: check-yaml\n - id: check-added-large-files\n - id: check-json\n
If you have completed the optional module M7 on good coding practice you will have learned about the linter ruff
. ruff
comes with its own pre-commit hook. Try adding that to your .pre-commit-config.yaml
file and see what happens when you try to commit files.
This is one way to add the ruff
pre-commit hook. We run both the ruff
and ruff-format
hooks, and we also add the --fix
argument to the ruff
hook to try to fix what is possible.
repos:\n- repo: https://github.com/astral-sh/ruff-pre-commit\n rev: v0.4.7\n hooks:\n # try to fix what is possible\n - id: ruff\n args: [\"--fix\"]\n # perform formatting updates\n - id: ruff-format\n # validate if all is fine with preview mode\n - id: ruff\n
(Optional) Add more hooks to your .pre-commit-config.yaml
.
Sometimes you are in a hurry, so make sure that you can also do commits without running pre-commit
e.g.
git commit -m <message> --no-verify\n
Finally, figure out how to disable pre-commit
again (if you get tired of it).
Assuming you have completed the module on GitHub Actions, let's try to add a pre-commit
workflow that automatically runs your pre-commit
checks every time you push to your repository and then automatically commits those changes to your repository. We recommend that you make use of
the pre-commit action, which installs and runs your pre-commit
checks for you, and the git-auto-commit-action, which can commit the changes that pre-commit
makes. As an alternative you can configure the CI tool provided by the creators of pre-commit
.
The workflow first uses the pre-commit
action to install and run the pre-commit
checks. Importantly we run it with continue-on-error: true
to make sure that the workflow does not fail if the checks fail. Next, we use git diff
to list the changes that pre-commit
has made and then we use the git-auto-commit-action
to commit those changes.
name: Pre-commit CI\n\non:\n pull_request:\n push:\n branches: [main]\n\njobs:\n pre-commit:\n name: Check pre-commit\n runs-on: ubuntu-latest\n\n permissions:\n contents: write\n\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n\n - name: Install pre-commit\n uses: pre-commit/action@v3.0.1\n continue-on-error: true\n\n - name: List modified files\n run: |\n git diff --name-only\n\n - name: Commit changes\n uses: stefanzweifel/git-auto-commit-action@v5\n with:\n commit_message: Pre-commit fixes\n commit_options: '--no-verify'\n
That was all about how pre-commit
can be used to automate tasks. If you want to deep dive more into the topic you can check out this page on how to define your own pre-commit
hooks.
Core Module
What often comes to mind for many developers when discussing continuous integration (CI) is code testing. Continuous integration should ensure that whenever a codebase is updated it is automatically tested, such that if bugs have been introduced into the codebase they will be caught early on. If you look at the MLOps cycle, continuous integration is one of the cornerstones of the operations part. However, it should be noted that applying continuous integration does not magically ensure that your code does not break. Continuous integration is only as strong as the tests that are automatically executed. Continuous integration simply structures and automates this.
Quote
Continuous Integration doesn\u2019t get rid of bugs, but it does make them dramatically easier to find and remove. Martin Fowler, Chief Scientist, ThoughtWorks
The kind of tests we are going to look at are called unit tests. Unit testing refers to the practice of writing tests that test individual parts of your code base for correctness. By a unit, you can therefore think of a function, a module or in general any object. By writing tests in this way it should be very easy to isolate which part of the code broke after an update to the code base. Another way to test your code base would be through integration testing, which is equally important but not something we are going to focus on in this course.
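As a tiny made-up illustration, a unit test is just a function that checks one isolated piece of behaviour:
# a small unit: a single function with well-defined behaviour\ndef add(x: int, y: int) -> int:\n    return x + y\n\n# the corresponding unit test checks that behaviour in isolation\ndef test_add() -> None:\n    assert add(2, 3) == 5\n    assert add(-1, 1) == 0\n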
Unit tests (and integration tests) are not a concept unique to MLOps but are a core concept of DevOps. However, it is important to note that testing machine learning-based systems is much more difficult than testing traditional systems. The reason for this is that machine learning systems depend on data, which influences the state of our system. For this reason, we not only need unit tests and integration tests of our code, we also need data testing, infrastructure testing and more monitoring to check that we stay within the data distribution we are training on (more on this in module M25 on data drifting). This added complexity is illustrated in the figure below.
"},{"location":"s5_continuous_integration/unittesting/#pytest","title":"Pytest","text":"Before we can begin to automate testing of our code base we of course need to write the tests first. It is both a hard and tedious task to do but arguably the most important aspect of continuous integration. Python offers a couple of different libraries for writing tests. We are going to use pytest
.
The following exercises should be applied to your MNIST repository
The first part of doing continuous integration is writing the unit tests. We do not expect you to cover every part of the code you have developed but try to at least write tests that cover two files. Start by creating a tests
folder.
Read the getting started guide for pytest which is the testing framework that we are going to use
Install pytest:
pip install pytest\n
Write some tests. Below are some guidelines on some tests that should be implemented, but you are of course free to implement more tests. You can at any point check if your tests are passing by typing in a terminal
pytest tests/\n
When you implement a test you need to follow two standards, for pytest
to be able to find your tests. First, any files created (except __init__.py
) should always start with test_*.py
. Secondly, any test implemented needs to be wrapped into a function that again needs to start with test_*
:
# this will be found and executed by pytest\ndef test_something():\n ...\n\n# this will not be found and executed by pytest\ndef something_to_test():\n ...\n
Start by creating a tests/__init__.py
file and fill in the following:
import os\n_TEST_ROOT = os.path.dirname(__file__) # root of test folder\n_PROJECT_ROOT = os.path.dirname(_TEST_ROOT) # root of project\n_PATH_DATA = os.path.join(_PROJECT_ROOT, \"data\") # root of data\n
these can help you refer to your data files during testing. For example, in another test file, I could write
from tests import _PATH_DATA\n
which then contains the root path to my data.
Data testing: In a file called tests/test_data.py
implement at least a test that checks that data gets correctly loaded. By this, we mean that you should check
def test_data():\n dataset = MNIST(...)\n assert len(dataset) == N_train for training and N_test for test\n assert that each datapoint has shape [1,28,28] or [784] depending on how you choose to format\n assert that all labels are represented\n
where N_train
should be either 30,000 or 50,000 depending on whether you are just using the first subset of the corrupted MNIST data or also including the second subset. N_test
should be 5000.
import torch\nfrom my_project.data import corrupt_mnist\n\ndef test_data():\n    train, test = corrupt_mnist()\n    assert len(train) == 30000\n    assert len(test) == 5000\n    for dataset in [train, test]:\n        for x, y in dataset:\n            assert x.shape == (1, 28, 28)\n            assert y in range(10)\n    train_targets = torch.unique(train.tensors[1])\n    assert (train_targets == torch.arange(0, 10)).all()\n    test_targets = torch.unique(test.tensors[1])\n    assert (test_targets == torch.arange(0, 10)).all()\n
Model testing: In a file called tests/test_model.py
implement at least a test that checks for a given input with shape X that the output of the model has shape Y.
import torch\nfrom my_project.model import MyAwesomeModel\n\ndef test_model():\n    model = MyAwesomeModel()\n    x = torch.randn(1, 1, 28, 28)\n    y = model(x)\n    assert y.shape == (1, 10)\n
Training testing: In a file called tests/test_training.py
implement at least one test that asserts something about your training script. You are given free rein here on what should be tested, but try to test something that risks being broken when developing the code.
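One hedged idea for such a test, assuming your project exposes a model like the MyAwesomeModel used in the earlier exercises, is to check that a backward pass through the training loss actually produces gradients for every parameter; this catches accidentally detached or frozen parts of the model.
# tests/test_training.py - a sketch, the import path is an assumption based on the earlier exercises\nimport torch\nfrom my_project.model import MyAwesomeModel\n\ndef test_backward_pass_produces_gradients() -> None:\n    model = MyAwesomeModel()\n    x = torch.randn(16, 1, 28, 28)\n    y = torch.randint(0, 10, (16,))\n    loss = torch.nn.functional.cross_entropy(model(x), y)\n    loss.backward()\n    for name, param in model.named_parameters():\n        assert param.grad is not None, f'No gradient computed for parameter {name}'\n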
Good code raises errors and gives out warnings in appropriate places. This is often the case for invalid combinations of input to your script. For example, your model could check the size of the input given to it (see the code below) to make sure it corresponds to what you are expecting. Not implementing such errors would still result in PyTorch failing at a later point due to shape errors, however, these custom errors will probably make more sense to the end user. Implement at least one raised error or warning somewhere in your code and use either pytest.raises
or pytest.warns
to check that they are correctly raised/warned. As inspiration, the following implements ValueError
in code belonging to the model:
# src/models/model.py\ndef forward(self, x: Tensor):\n    if x.ndim != 4:\n        raise ValueError('Expected input to be a 4D tensor')\n    if x.shape[1] != 1 or x.shape[2] != 28 or x.shape[3] != 28:\n        raise ValueError('Expected each sample to have shape [1, 28, 28]')\n
Solution The above example would be captured by a test looking something like this:
# tests/test_model.py\nimport pytest\nimport torch\nfrom my_project.model import MyAwesomeModel\n\ndef test_error_on_wrong_shape():\n    model = MyAwesomeModel()\n    with pytest.raises(ValueError, match='Expected input to be a 4D tensor'):\n        model(torch.randn(1, 2, 3))\n    with pytest.raises(ValueError, match='Expected each sample to have shape'):\n        model(torch.randn(1, 1, 28, 29))\n
A test is only as good as the error message it gives, and by default, assert
will only report that the check failed. However, we can help ourselves and others by adding strings after assert
like
assert len(train_dataset) == N_train, \"Dataset did not have the correct number of samples\"\n
Add such comments to the assert statements you just did in the previous exercises.
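Applied to the earlier model test, that could look something like this small sketch:
import torch\nfrom my_project.model import MyAwesomeModel\n\ndef test_model_output_shape() -> None:\n    model = MyAwesomeModel()\n    y = model(torch.randn(1, 1, 28, 28))\n    assert y.shape == (1, 10), f'Expected output shape (1, 10) but got {tuple(y.shape)}'\n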
The tests that involve checking anything that has to do with our data, will of course fail if the data is not present. To future-proof our code, we can take advantage of the pytest.mark.skipif
decorator. Use this decorator to skip your data tests if the corresponding data files do not exist. It should look something like this
import os.path\n@pytest.mark.skipif(not os.path.exists(file_path), reason=\"Data files not found\")\ndef test_something_about_data():\n ...\n
You can read more about skipping tests here
After writing the different tests, make sure that they are passing locally.
We often want to check a function/module for various input arguments. In this case, you could write the same test over and over again for different inputs, but pytest
also has built-in support for this with the use of the pytest.mark.parametrize decorator. Implement a parametrized test and make sure that it runs for different inputs.
@pytest.mark.parametrize(\"batch_size\", [32, 64])\ndef test_model(batch_size: int) -> None:\n model = MyModel()\n x = torch.randn(batch_size, 1, 28, 28)\n y = model(x)\n assert y.shape == (batch_size, 10)\n
There is no way of measuring how good the tests you have written are. However, what we can measure is the code coverage. Code coverage refers to the percentage of your codebase that gets run when all your tests are executed. Having a high coverage at least means that most of your code actually runs when the tests are executed.
Install coverage
pip install coverage\n
Instead of running your tests directly with pytest
, now do
coverage run -m pytest tests/\n
To get a simple coverage report simply type
coverage report\n
which will give you the percentage of coverage in each of your files. You can also write
coverage report -m\n
to get the exact lines that were missed by your tests.
Finally, try to increase the coverage by writing a new test that runs some of the lines in your codebase that are not covered yet.
Often coverage
reports the code coverage for files that we do not want coverage for, for example your test files. Figure out how to configure coverage
to exclude some files.
You need to set the omit
option. This can either be done when running coverage run
or coverage report
such as:
coverage run --omit=\"tests/*\" -m pytest tests/\n# or\ncoverage report --omit=\"tests/*\"\n
As an alternative you can specify this in your pyproject.toml
file:
[tool.coverage.run]\nomit = [\"tests/*\"]\n
Assuming you have a code coverage of 100%, would you expect that no bugs are present in your code?
SolutionNo, code coverage is not a guarantee that your code is bug-free. It is just a measure of how many lines of code are run when your tests are executed. Therefore, there may still be some corner case that is not covered by your tests and will result in a bug. However, having a high code coverage is a good indicator that you have tested your code.
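As a small made-up example of this, the function below gets 100% coverage from its test because every line runs, yet it still crashes for b=0, a case the test never exercises.
def ratio(a: float, b: float) -> float:\n    return a / b  # bug: division by zero is not handled\n\ndef test_ratio() -> None:\n    # this single call runs every line of ratio, so coverage reports 100%,\n    # but ratio(1.0, 0.0) would still raise ZeroDivisionError\n    assert ratio(6.0, 3.0) == 2.0\n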
Consider the following code:
@pytest.mark.parametrize(\"network_size\", [10, 100, 1000])\n@pytest.mark.parametrize(\"device\", [\"cpu\", \"cuda\"])\nclass TestMyClass:\n    @pytest.mark.parametrize(\"network_type\", [\"alexnet\", \"squeezenet\", \"vgg\", \"resnet\"])\n    @pytest.mark.parametrize(\"precision\", [torch.half, torch.float, torch.double])\n    def test_network1(self, network_size, device, network_type, precision):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass(network_size, network_type).to(device=device, dtype=precision)\n        ...\n\n    @pytest.mark.parametrize(\"add_dropout\", [True, False])\n    def test_network2(self, network_size, device, add_dropout):\n        if device == \"cuda\" and not torch.cuda.is_available():\n            pytest.skip(\"Test requires cuda\")\n        model = MyModelClass2(network_size, add_dropout).to(device)\n        ...\n
how many tests are executed when running the above code?
SolutionThe answer depends on whether or not we are running on a GPU-enabled machine. The test_network1
has 4 parameters, network_size, device, network_type, precision
, that respectively can take on 3, 2, 4, 3
values meaning that in total that test will be running 3x2x4x3=72
times with different parameters on a GPU-enabled machine and 36 on a machine without a GPU. A similar calculation can be done for test_network2
, which only has three factors network_size, device, add_dropout
that result in 3x2x2=12
tests on a GPU-enabled machine and 6 on a machine without a GPU. In total, that means 84 tests would run on a machine with a GPU and 42 on a machine without a GPU.
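A quick way to convince yourself of the numbers is to multiply the sizes of the parameter grids:
# parameter combinations collected by pytest; the device='cuda' half is skipped without a GPU\ntest_network1 = 3 * 2 * 4 * 3  # network_size x device x network_type x precision = 72\ntest_network2 = 3 * 2 * 2      # network_size x device x add_dropout = 12\nprint(test_network1 + test_network2)  # 84 in total, of which 42 run on a machine without a GPU\n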
That covers the basics of writing unit tests for Python code. We want to note that pytest
of course is not the only framework for doing this. Python has a built-in framework called unittest for doing this also (but pytest
offers a bit more features). Another open-source framework that you could choose to check out is hypothesis, which can help catch errors in corner cases of your code. In addition to writing unit tests it is also highly recommended to test the code that you include in the docstrings belonging to your functions and modules, to make sure that any code in your documentation is also correct. For such testing, we can highly recommend using Python's built-in framework doctest.
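As a hedged example of what that could look like, the docstring below contains a small interactive session; running pytest with the --doctest-modules flag (or python -m doctest on the file) executes the example and fails if the printed output does not match.
def normalize(values: list[float]) -> list[float]:\n    \"\"\"Scale values so that they sum to 1.\n\n    Example:\n        >>> normalize([1.0, 1.0, 2.0])\n        [0.25, 0.25, 0.5]\n    \"\"\"\n    total = sum(values)\n    return [v / total for v in values]\n\nif __name__ == '__main__':\n    import doctest\n    doctest.testmod()\n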
Slides
Learn how to get started with Google Cloud Platform and how to interact with the SDK.
M20: Cloud Setup
Learn how to use different GCP services to support your machine learning pipeline.
M21: Cloud Services
Running computations locally is often sufficient when only playing around with code in the initial phase of development. However, to scale your experiments you will need more computing power than what your standard laptop/desktop can offer. You probably already have experience with running on a local cluster or similar but today's topic is about utilizing cloud computing.
There exist numerous cloud computing providers, with some of the biggest being:
They all have slight advantages and disadvantages over each other. In this course, we are going to focus on the Google Cloud Platform, because they have been kind enough to sponsor $50 of cloud credit to each student. If you happen to run out of credit, you can also get some free credit for a limited amount of time when you sign up with a new account. What's important to note is that all these different cloud providers offer largely the same set of services, and learning how to use the services of one cloud provider in many cases translates to also knowing how to use the same services at another cloud provider. The services are called something different and can have a bit of a different interface/interaction pattern, but in the end it does not matter much.
Today's exercises are about getting to know how to work with the cloud. If you are in doubt about anything or want to deep dive into some topics, I can recommend watching this series of videos or going through the general docs.
Learning objectives
The learning objectives of this session are:
Core Module
Google Cloud Platform (GCP) is the cloud service provided by Google. The key concept, or selling point, of any cloud provider, is the idea of near-infinite resources. Without the cloud, it simply is not feasible to do many modern deep learning and machine learning tasks because they cannot be scaled locally.
The image below shows all the different services that the Google Cloud platform offers. We are going to be working with around 10 of these services throughout the course. Therefore, if you get done with exercises early I highly recommend that you deep dive more into the Google cloud platform.
"},{"location":"s6_the_cloud/cloud_setup/#exercises","title":"\u2754 Exercises","text":"As the first step, we are going to get you some Google Cloud credits.
Go to https://learn.inside.dtu.dk. Go to this course. Find the recent message where there should be a download link and instructions on how to claim the $50 cloud credit. Please do not share the link anywhere as there are a limited amount of coupons. If you are not officially taking this course at DTU, Google gives $300 cloud credits whenever you sign up with a new account. NOTE that you need to provide a credit card for this so make sure to closely monitor your credit use so you do not end up spending more than the free credit.
Log in to the homepage of GCP. It should look like this:
Go to billing and make sure that your account is showing $50 of cloud credit
make sure to also check out the Reports
throughout the course. When you are starting to use some of the cloud services these tabs will update with info about how much time you can use before your cloud credit runs out. Make sure that you monitor this page as you will not be given another coupon.
One way to stay organized within GCP is to create projects.
Create a new project called dtumlops
. When you click create
you should get a notification that the project is being created. The notification bell is a good way to make sure how the processes you are running are doing throughout the course.
Next up is the local setup on your laptop. We are going to install gcloud
, which is part of the Google Cloud SDK. gcloud
is the command line interface for working with our Google Cloud account. Nearly everything that we can do through the web interface we can also do through the gcloud
interface. Follow the installation instructions here for your specific OS.
After installation, try in a terminal to type:
gcloud -h\n
the command should show the help page. If not, something went wrong in the installation (you may need to restart after installing).
Now login by typing
gcloud auth login\n
you should be sent to a web page where you link your cloud account to the gcloud
interface. Afterward, also run this command:
gcloud auth application-default login\n
If you at some point want to revoke the authentication you can type:
gcloud auth revoke\n
Next, you will need to set the project that we just created as the default project. In your web browser under project info, you should be able to see the Project ID
belonging to your dtumlops
project. Copy this and type the following command in a terminal
gcloud config set project <project-id>\n
You can also get the project info by running
gcloud projects list\n
Next, install the Google Cloud Python API:
pip install --upgrade google-api-python-client\n
Make sure that the Python interface is also installed. In a Python terminal type
import googleapiclient\n
this should work without any errors.
(Optional) If you are using VSCode you can also download the relevant extension called Cloud Code
. After installing it you should see a small Cloud Code
button in the action bar.
Finally, we need to activate a couple of developer APIs that are not activated by default. In a terminal write
gcloud services enable apigateway.googleapis.com\ngcloud services enable servicemanagement.googleapis.com\ngcloud services enable servicecontrol.googleapis.com\n
you can always check which services are enabled by typing
gcloud services list\n
After following these steps your laptop should hopefully be set up for using GCP locally. You are now ready to use their services, both locally on your laptop and in the cloud console.
"},{"location":"s6_the_cloud/cloud_setup/#iam-and-quotas","title":"IAM and Quotas","text":"A big part of using the cloud in a bigger organization has to do with Admin and quotas. Admin here in general refers to the different roles that users of GCP and quotas refer to the amount of resources that a given user has access to. For example, one employee, let's say a data scientist, may only be granted access to certain GCP services that have to do with the development and training of machine learning models, with X
amounts of GPUs available to use to make sure that the employee does not spend too much money. Another employee, a DevOps engineer, probably does not need access to the same services and not necessarily the same resources.
In this course, we are not going to focus too much on this aspect but it is important to know that it exists. One feature you are going to need for doing the project is how to share a project with other people. This is done through the IAM (Identities and Access Management) page. Simply click the Grant Access
button, search for the email of the person you want to share the project with and give them either Viewer
, Editor
or Owner
access, depending on what you want them to be able to do. The figure below shows how to do this.
What we are going to go through right now is how to increase the quotas for how many GPUs you have available for your project. By default, for any free accounts in GCP (or accounts using teaching credits) the default quota for GPUs that you can use is either 0 or 1 (their policies sometimes change). We will in the exercises below try to increase it.
"},{"location":"s6_the_cloud/cloud_setup/#exercises_1","title":"\u2754 Exercises","text":"Start by enabling the Compute Engine
service. Simply search for it in the top search bar. It should bring you to a page where you can enable the service (may take some time). We are going to look more into this service in the next module.
Next go to the IAM & Admin
page, again search for it in the top search bar. The remaining steps are illustrated in the figure below.
Go to the quotas page
In the search field search for GPUs (all regions)
(needs to match exactly, the search field is case sensitive), such that you get the same quota as in the image.
In the limit, you can see what your current quota for the number of GPUs you can use is. Additionally, to the right of the limit, you can see the current usage. It is worth checking in on if you are ever in doubt if a job is running on GPU or not.
Click the quota and afterward the Edit
quotas button.
In the pop-up window, increase your limit to either 1 or 2.
After sending your request you can try clicking the Increase requests
tab to see the status of your request
If you are ever running into errors when working in GPU that contains statements about quotas
you can always try to go to this page and see what you are allowed to use currently and try to increase it. For example, when you get to training machine learning models using Vertex AI in the next module, you would most likely need to ask for a quota increase for that service as well.
Finally, we want to note that a quota increase is sometimes not allowed within 24 hours of creating an account. If your request gets rejected, we recommend to wait a day and try again. If this does still not work, you may need to use their services some more to make sure you are not a bot that wants to mine crypto.
"},{"location":"s6_the_cloud/cloud_setup/#service-accounts","title":"Service accounts","text":"At some point, you will most likely need to use a service account. A service account is a virtual account that is used to interact with the Google Cloud API. It it intended for non-human users e.g. other machines, services, etc. For example, if you want to launch a training job from Github Actions, you will need to use a service account for authentication between Github and GCP. You can read more about how to create a service account here.
"},{"location":"s6_the_cloud/cloud_setup/#exercises_2","title":"\u2754 Exercises","text":"Go to the IAM & Admin
page and click on Service accounts
. Alternatively, you can search for it in the top search bar.
Click the Create Service Account
button. On the next page, you can give the service account a name, and id ( automatically generated, but you can change it if you want). You can also give it a description. Leave the rest as default and click Create
.
Next, let's give the service account some permissions. Click on the service account you just created. In the Permissions
tab click Add permissions
. Your job now is to give the service account the lowest possible permissions such that it can download files from a bucket. Look at this page and try to find the role that fits the description.
The role you are looking for is Storage Object Viewer
. This role allows the service account to list objects in a bucket and download objects, but nothing more. Thus even if someone gets access to the service account they cannot delete objects in the bucket.
To use the service account later we need to create a key for it. Click on the service account and then the Keys
tab. Click Add key
and then Create new key
. Choose the JSON
key type and click Create
. This will download a JSON file to your computer. This file is the key to the service account and should be kept secret. If you lose it you can always create a new one.
Finally, everything we just did from creating the service account, giving it permissions, and creating a key can also be done through the gcloud
interface. Try to find the commands to do this in the documentation.
The commands you are looking for are:
gcloud iam service-accounts create my-sa \\\n --description=\"My first service account\" --display-name=\"my-sa\"\ngcloud projects add-iam-policy-binding $(GCP_PROJECT_NAME) \\\n --member=\"serviceAccount:global-service-account@iam.gserviceaccount.com\" \\\n --role=\"roles/storage.objectViewer\"\ngcloud iam service-accounts keys create service_account_key.json \\\n --iam-account=global-service-account@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\n
where $(GCP_PROJECT_NAME)
is the name of your project. If you then want to delete the service account you can run
gcloud iam service-accounts delete global-service-account@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\n
What considerations should you take into account when choosing a GCP region for running a new application?
SolutionA series of factors may influence your choice of region, including:
The 3 major cloud providers all have the same services, but they are called something different depending on the provider. What are the corresponding names of these GCP services in AWS and Azure?
It is important to know these correspondences to navigate blogpost etc. about MLOps on the internet.
Solution
| GCP | AWS | Azure |
| --- | --- | --- |
| Compute Engine | Elastic Compute Cloud (EC2) | Virtual Machines |
| Cloud Storage | Simple Storage Service (S3) | Blob Storage |
| Cloud Functions | Lambda | Functions (serverless compute) |
| Cloud Run | App Runner, Fargate, Lambda | Container Apps, Container Instances |
| Cloud Build | CodeBuild | DevOps |
| Vertex AI | SageMaker | AI Platform |
Why is it always important to assign the lowest possible permissions to a service account?
SolutionThe reason is that if someone gets access to the service account they can only do what the service account is allowed to do. If the service account has the permission to delete objects in a bucket, the attacker can delete all the objects in the bucket. For this reason, in most cases multiple service accounts are used, each with different permissions. This setup is called the principle of least privilege.
Core Module
In this set of exercises, we are going to get more familiar with using some of the resources that GCP offers.
"},{"location":"s6_the_cloud/using_the_cloud/#compute","title":"Compute","text":"The most basic service of any cloud provider is the ability to create and run virtual machines. In GCP this service is called Compute Engine API. A virtual machine allows you to essentially run an operating system that behaves like a completely separate computer. There are many reasons why one to use virtual machines:
Virtual machines allow you to scale your operations, essentially giving you access to infinitely many individual computers
Virtual machines allow you to use large-scale hardware. For example, if you are developing a deep learning model on your laptop and want to know the inference time for a specific hardware configuration, you can just create a virtual machine with those specs and run your model.
Virtual machines allow you to run processes in the \"background\". If you want to train a model for a week or more, you do not want to do this on your laptop as you cannot move it or do anything with it while it is training. Virtual machines allow you to just launch a job and forget about it (at least until you run out of credit).
We are now going to start using the cloud.
Click on the Compute Engine
tab in the sidebar on the homepage of GCP.
Click the Create Instance
button. You will see the following image below.
Give the virtual machine a meaningful name, and set the location to some location that is closer to where you are (to reduce latency, we recommend europe-west-1
). Finally, try to adjust the configuration a bit. Can you find at least two settings that alter the price of the virtual machine?
In general, the price of a virtual machine is determined by the class of hardware attached to it. Higher class CPUs and GPUs mean higher prices. Additionally, the amount of memory and disk space also affects the price. Finally, to location of the virtual machine also affects the price.
After figuring this out, create a e2-medium
instance (leave the rest configured as default). Before clicking the Create
button make sure to check the Equivalent code
button. You should see a very long command that you could have typed in the terminal that would create a VM similar to configuring it through the UI.
After creating the virtual machine, in a local terminal type:
gcloud compute instances list\n
you should hopefully see the instance you have just created.
You can start a terminal directly by typing:
gcloud compute ssh --zone <zone> <name> --project <project-id>\n
You can always see the exact command that you need to run to ssh
to a VM by selecting the View gcloud command
option in the Compute Engine overview (see image below).
While logged into the instance, check if Python and PyTorch are installed. You should see that neither is installed. The VM we have only specified what compute resources it should have, and not what software should be in it. We can fix this by starting VMs based on specific docker images (it's all coming together).
GCP comes with several ready-to-go images for doing deep learning. More info can be found here. Try, running this line:
gcloud container images list --repository=\"gcr.io/deeplearning-platform-release\"\n
what does the output show?
SolutionThe output should show a list of images that are available for you to use. The images are essentially docker images that contain a specific software stack. The software stack is often a specific version of Python, PyTorch, TensorFlow, etc. The images are maintained by Google and are updated regularly.
Next, start (in the terminal) a new instance using a PyTorch image. The command for doing it should look something like this:
gcloud compute instances create <instance_name> \\\n --zone=<zone> \\\n --image-family=<image-family> \\\n --image-project=deeplearning-platform-release \\\n # add these arguments if you want to run on GPU and have the quota to do so\n --accelerator=\"type=nvidia-tesla-K80,count=1\" \\\n --maintenance-policy TERMINATE \\\n --metadata=\"install-nvidia-driver=True\" \\\n
You can find more info here on what <image-family>
should have as value and what extra argument you need to add if you want to run on GPU (if you have access).
The command should look something like this:
CPUGPUgcloud compute instances create my_instance \\\n --zone=europe-west1-b \\\n --image-family=pytorch-latest-cpu \\\n --image-project=deeplearning-platform-release\n
gcloud compute instances create my_instance \\\n --zone=europe-west1-b \\\n --image-family=pytorch-latest-gpu \\\n --image-project=deeplearning-platform-release \\\n --accelerator=\"type=nvidia-tesla-K80,count=1\" \\\n --maintenance-policy TERMINATE\n
ssh
to the VM as one of the previous exercises. Confirm that the container indeed contains both a Python installation and PyTorch is also installed. Hint: you also have the possibility through the web page to start a browser session directly to the VMs you create:
Everything that you have done locally can also be achieved through the web terminal, which of course comes pre-installed with the gcloud
command etc.
Try out launching this and run some of the commands from the previous exercises.
Finally, we want to make sure that we do not forget to stop our VMs. VMs are charged by the minute, so even if you are not using them you are still paying for them. Therefore, you must remember to stop your VMs when you are not using them. You can do this by either clicking the Stop
button on the VM overview page or by running the following command:
gcloud compute instances stop <instance-name>\n
Another big part of cloud computing is the storage of data. There are many reasons that you want to store your data in the cloud including:
Cloud storage is luckily also very cheap. Google Cloud only takes around $0.026 per GB per month. This means that around 1 TB of data would cost you $26 which is more than what the same amount of data would cost on Google Drive, but the storage in Google Cloud is much more focused on enterprise usage such that you can access the data through code.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_1","title":"\u2754 Exercises","text":"When we did the exercise on data version control, we made dvc
work together with our own Google Drive to store data. However, a big limitation of this is that we need to authenticate each time we try to either push or pull the data. The solution is to use an API instead, which is offered through GCP.
We are going to follow the instructions from this page
Let's start by creating a data storage. On the GCP start page, in the sidebar, click on the Cloud Storage
. On the next page click the Create bucket
:
Give the bucket a unique name, set it to a region close by and, importantly, remember to enable Object versioning under the last tab. Finally, click Create.
After creating the bucket, you should be able to see it online, and it should also show up if you type the following in your local terminal:
gsutil ls\n
gsutil is a command line tool that allows you to create, upload, download, list, move, rename and delete objects in the cloud storage. For example, you can upload a file to the cloud storage by running:
gsutil cp <file> gs://<bucket-name>\n
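The same operations can also be done directly from Python with the google-cloud-storage client library, which becomes handy once your code, rather than you, needs to read and write data. Below is a minimal sketch (not part of the official exercises), assuming the library is installed with pip install google-cloud-storage and that the bucket and file names are placeholders:
from google.cloud import storage\n\nclient = storage.Client()  # picks up your gcloud credentials\nbucket = client.bucket('<bucket-name>')  # placeholder bucket name\n\n# upload a local file to the bucket\nbucket.blob('data/file.txt').upload_from_filename('file.txt')\n\n# download it again under a new local name\nbucket.blob('data/file.txt').download_to_filename('file_copy.txt')\n\n# list all objects currently in the bucket\nfor blob in client.list_blobs('<bucket-name>'):\n    print(blob.name)\n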
Next, we need the Google storage extension for dvc
pip install dvc-gs\n
Now in your corrupt MNIST repository where you have already configured dvc
, we are going to change the storage from our Google Drive to our newly created Google Cloud storage.
dvc remote add -d remote_storage <output-from-gsutil-ls>\n
In addition, we are also going to modify the remote to support object versioning (called version_aware
in dvc
):
dvc remote modify remote_storage version_aware true\n
This will change the default way that dvc
handles data. Instead of storing the data in dvc's content-addressable format, it will now store the data exactly as it looks in our local repository, which means we are no longer limited to using dvc to download our data.
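To see what this means in practice: with version_aware enabled the objects in the bucket keep the same paths as in your repository, so a tracked file can be fetched without dvc at all. A small sketch of this idea (an assumption on our part, with data/data.pt and <bucket-name> as placeholders), using the google-cloud-storage client:
from google.cloud import storage\n\n# because of version_aware, the object path mirrors the path in the repository\nclient = storage.Client()\nclient.bucket('<bucket-name>').blob('data/data.pt').download_to_filename('data.pt')\n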
The above command will change the .dvc/config
file. git add
and git commit
the changes to that file. Finally, push the data to the cloud:
dvc push --no-run-cache # (1)!\n
The --no-run-cache flag is used to avoid pushing the run cache to the cloud, which is not supported by the Google Cloud storage remote. Finally, make sure that you can pull without having to give your credentials. The easiest way to check this is to delete the .dvc/cache folder that should be locally on your laptop and afterward do a
dvc pull --no-run-cache\n
This setup should work when accessing the data from your laptop, which we already authenticated in the previous module. However, how can you access the data from a virtual machine, inside a docker container or from a different laptop? In general, we recommend one of two approaches:
You can make the bucket publicly accessible, i.e. no authentication is needed. That means that anyone with the URL to the data can access it. This is the easiest way to do it, but also the least secure. You can read more about how to make your buckets public here.
You can use the service account that you created in the previous module to authenticate the VM. This is the most secure way to do it, but also the most complicated. You first need to give the service account the correct permissions. Then you need to authenticate using the service account. In dvc
this is done by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS
to the path of the service account key file:
export GOOGLE_APPLICATION_CREDENTIALS=\"/path/to/your/credentials.json\"\n
set GOOGLE_APPLICATION_CREDENTIALS=\"C:\\path\\to\\your\\credentials.json\"\n
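To verify that the service account key is picked up, note that the Google client libraries read GOOGLE_APPLICATION_CREDENTIALS automatically, so a quick sanity check from Python could look like the sketch below (assuming google-cloud-storage is installed):
import os\n\nfrom google.cloud import storage\n\n# the client automatically uses the key file pointed to by GOOGLE_APPLICATION_CREDENTIALS\nprint(os.environ.get('GOOGLE_APPLICATION_CREDENTIALS'))\nclient = storage.Client()\nfor bucket in client.list_buckets():  # fails if the service account lacks permissions\n    print(bucket.name)\n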
You should hopefully at this point have seen the strength of using containers to create reproducible environments. They allow us to specify exactly the software that we want to run inside our VMs. However, you should already have run into two problems with containers
For this reason, we want to move both the building process and the storage of images to the cloud. In GCP, the two services that we are going to use for this are Cloud Build, for building the containers in the cloud, and Artifact Registry, for storing the images afterward.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_2","title":"\u2754 Exercises","text":"In these exercises, I recommend that you start with a dummy version of some code to make sure that the building process does not take too long. Below is a simple Python script that does image classification using Sklearn, together with the corresponding requirements.txt
file and Dockerfile
.
from sklearn import datasets, metrics, svm\nfrom sklearn.model_selection import train_test_split\n\nif __name__ == \"__main__\":\n digits = datasets.load_digits()\n\n # flatten the images\n n_samples = len(digits.images)\n data = digits.images.reshape((n_samples, -1))\n\n # Create a classifier: a support vector classifier\n clf = svm.SVC(gamma=0.001)\n\n # Split data into 50% train and 50% test subsets\n X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)\n\n # Learn the digits on the train subset\n clf.fit(X_train, y_train)\n\n # Predict the value of the digit on the test subset\n predicted = clf.predict(X_test)\n\n print(f\"Classification report for classifier {clf}:\\n{metrics.classification_report(y_test, predicted)}\\n\")\n
requirements.txt requirements.txtscikit-learn>=1.0\n
Dockerfile DockerfileFROM python:3.11-slim\n\n# install build dependencies needed by some Python packages\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nCOPY requirements.txt requirements.txt\nCOPY main.py main.py\nWORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\n\nENTRYPOINT [\"python\", \"-u\", \"main.py\"]\n
The docker images for this application are therefore going to be substantially faster to build and smaller in size than the PyTorch-based images we are used to.
Start by enabling the services: Google Artifact Registry API
and Google Cloud Build API
. This can be done through the website (by searching for the services) or can also be enabled from the terminal:
gcloud services enable artifactregistry.googleapis.com\ngcloud services enable cloudbuild.googleapis.com\n
The first step is creating an artifact repository in the cloud. You can either do this through the UI or using gcloud
in the command line.
Find the Artifact Registry
service (search for it in the search bar) and click on it. From there click on the Create repository
button. You should see the following page:
Give the repository a name, make sure to set the format to Docker
and specify the region. At the bottom of the page you can optionally add a cleanup policy. We recommend that you add one to keep costs down. Give the policy a name, choose the Keep most recent versions
option and set the keep count to 5
. Click Create
and you should now see the repository in the list of repositories.
gcloud artifacts repositories create <registry-name> \\\n --repository-format=docker \\\n --location=europe-west1 \\\n --description=\"My docker registry\"\n
where you need to replace <registry-name>
with a name of your choice. You can read more about the command here. We recommend that after creating the repository you update it with a cleanup policy to keep costs down. You can do this by running:
gcloud artifacts repositories set-cleanup-policies <registry-name> \\\n    --project=<project-id> \\\n    --location=<region> \\\n    --policy=policy.yaml\n
where the policy.yaml
file should look something like this:
[\n {\n \"name\": \"keep-minimum-versions\",\n \"action\": {\"type\": \"Keep\"},\n \"mostRecentVersions\": {\n \"keepCount\": 5\n }\n }\n]\n
and you can read more about the command here. Whenever we want to push to or pull from this artifact repository in the future, we can refer to it using this URL:
<region>-docker.pkg.dev/<project-id>/<registry-name>\n
for example, europe-west1-docker.pkg.dev/dtumlops-335110/container-registry
would be a valid URL (this is the one I created).
We are now ready to build our containers in the cloud. In principle, GCP cloud build works out of the box with docker files. However, the recommended way is to add specialized cloudbuild.yaml
files. You can think of the cloudbuild.yaml
file as GCP's counterpart to the workflow files in GitHub Actions, which you learned about in module M16. It is essentially a file that specifies a list of steps that should be executed to do something, just with a different syntax.
Look at the documentation on how to write a cloudbuild.yaml
file for building and pushing a docker image to the artifact registry. Try to implement such a file in your repository.
For building docker images the syntax is as follows:
steps:\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Build container image'\n args: [\n 'build',\n '.',\n '-t',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n '-f',\n '<path-to-dockerfile>'\n ]\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Push container image'\n args: [\n 'push',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n ]\n
where you need to replace <registry-name>
, <image-name>
and <path-to-dockerfile>
with your own values. You can hopefully recognize the syntax from the docker exercises. In this example, we are calling the cloud-builders/docker
service with both the build
and push
arguments.
You can now try to trigger the cloudbuild.yaml
file from your local machine. What gcloud
command would you use to do this?
You can trigger a build by running the following command:
gcloud builds submit --config=cloudbuild.yaml .\n
This command will submit a build to the cloud build service using the configuration file cloudbuild.yaml
in the current directory.
Instead of relying on manually submitting builds, we can set up the building process as continuous integration such that it is triggered every time we push code to the repository. This is done by setting up a trigger in the GCP console. From the GCP homepage, navigate to the triggers panel:
Click on Manage repositories.
From there, click the Connect Repository
and go through the steps of authenticating your GitHub profile with GCP and choose the repository that you want to set up build triggers for. For now, skip the Create a trigger (optional)
part by pressing Done
in the end.
Navigate back to the Triggers
homepage and click Create trigger
. Set the following:
Push to branch
^main$
Autodetected
or Cloud build configuration file
Finally, click the Create
button and the trigger should show up on the triggers page.
To activate the trigger, push some code to the chosen repository.
Go to the Cloud Build
page and you should see the image being built and pushed.
Try clicking on the build to check out the build process and build summary. As you can see from the image, if a build is failing you will often find valuable info by looking at the build summary. If your build is failing try to configure it to run in one of these regions: us-central1, us-west2, europe-west1, asia-east1, australia-southeast1, southamerica-east1
as specified in the documentation.
If/when your build is successful, navigate to the Artifact Registry
page. You should hopefully find that the image you just built was pushed here. Congrats!
Make sure that you can pull your image down to your laptop
docker pull <region>-docker.pkg.dev/<project-id>/<registry-name>/<image-name>:<image-tag>\n
you will need to authenticate docker
with GCP first. Instructions can be found here, but the following command should hopefully be enough to make docker
and GCP talk to each other:
gcloud auth configure-docker <region>-docker.pkg.dev\n
where you need to replace <region>
with the region you are using. Do note that you need to have docker actively running in the background, just as at any other time you use docker
.
Automation through the cloud is in general the way to go, but sometimes you may want to manually create images and push them to the registry. Figure out how to push an image to your Artifact Registry
. For simplicity, you can just push the busybox
image you downloaded during the initial docker exercises. This page should help you with the exercise.
Pushing to a repository is similar to pulling. Assuming that you have already built an image called busybox
you can push it to the repository by running:
docker tag busybox <region>-docker.pkg.dev/<project-id>/<registry-name>/busybox:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<registry-name>/busybox:latest\n
where you need to replace <region>
, <project-id>
and <registry-name>
with your own values.
(Optional) Instead of using the built-in triggers in GCP, another way to trigger the build on code changes is to integrate with GitHub Actions. This has the benefit that we can make the build process depend on other steps in the pipeline. For example, in the image below we have conditioned the build to only run if tests are passing on all operating systems. Let's try to implement this.
Start by adding a new secret to Github with the name GCLOUD_SERVICE_KEY
and the value of the service account key that you created in the previous module. This is needed to authenticate the Github action with GCP.
We assume that you already have a workflow file that runs some unit tests:
name: Unit tests & build\n\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\njobs:\n test:\n
we now want to add a job that triggers the build process in GCP. How can you make the build
job depend on the test
job? Hint: Relevant documentation.
You can make the build
job depend on the test
job by adding the needs
keyword to the build
job:
name: Unit tests & build\n\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\njobs:\n test:\n ...\n build:\n needs: test\n ...\n
Additionally, we probably only want to build the image if the job is running on our main branch, i.e. not as part of a pull request. How can you make the build
job only run on the main branch?
name: Unit tests & build\n\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\njobs:\n test:\n ...\n build:\n needs: test\n if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}\n ...\n
Finally, we need to add the steps to submit the build job to GCP. You need four steps: checking out the code, authenticating with GCP, setting up the Cloud SDK and submitting the build.
How can you do this? Hint: For the first two steps these two Github actions can be useful: auth and setup-gcloud.
Solutionname: Unit tests & build\n\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\njobs:\n test:\n ...\n build:\n needs: test\n if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}\n runs-on: ubuntu-latest\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Auth with GCP\n uses: google-github-actions/auth@v2\n with:\n credentials_json: ${{ secrets.GCLOUD_SERVICE_KEY }}\n\n - name: Set up Cloud SDK\n uses: google-github-actions/setup-gcloud@v2\n\n - name: Submit build\n run: gcloud builds submit --config cloudbuild_containers.yaml\n
(Optional) The cloudbuild
specification format allows you to specify so-called substitutions. A substitution is simply a way to replace a variable in the cloudbuild.yaml
file with a value that is known only at runtime. This can be useful for using the same cloudbuild.yaml
file for multiple builds. Try to implement a substitution in your docker cloud build file such that the image name is a variable.
Built-in substitutions
You have probably already encountered substitutions like $PROJECT_ID
in the cloudbuild.yaml
file. These are substitutions that are automatically replaced by GCP. Other commonly used are $BUILD_ID
, $PROJECT_NUMBER
and $LOCATION
. You can find a full list of built-in substitutions here.
We just need to add the substitutions
field to the cloudbuild.yaml
file. For example, if we want to replace the image name with a variable called _IMAGE_NAME
we can do the following:
steps:\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Build container image'\n args: [\n 'build',\n '.',\n '-t',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/$_IMAGE_NAME',\n '-f',\n '<path-to-dockerfile>'\n ]\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Push container image'\n args: [\n 'push',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/$_IMAGE_NAME'\n ]\nsubstitutions:\n _IMAGE_NAME: 'my_image'\n
Do note that user substitutions are prefixed with an underscore _
to distinguish them from the built-in ones. You can read more here.
How would you provide the value for the _IMAGE_NAME
variable to the gcloud builds submit
command?
You can provide the value for the _IMAGE_NAME
variable by adding the --substitutions
flag to the gcloud builds submit
command:
gcloud builds submit --config=cloudbuild.yaml --substitutions=_IMAGE_NAME=my_image\n
If you want to provide more than one substitution you can do so by separating them with a comma.
As the final step in our journey through different GCP services in this module, we are going to look at training our models. This is one of the important tasks that GCP can help us with, because we can always rent more hardware as long as we have credits, meaning that we can scale both horizontally (run more experiments in parallel) and vertically (run larger experiments on more powerful hardware).
We are going to check out two ways of running our experiments. First, we are going to return to the Compute Engine service because it gives the simplest form of scaling of experiments. That is: we create a VM with an appropriate docker image, start it, log into the VM and run our experiments. Most people can run a couple of experiments in parallel this way. However, what if there was an abstraction layer that automatically created a VM for us, launched our experiments and then shut the VM down afterwards?
This is where the Vertex AI service comes into play. This is a dedicated service for handling ML models in GCP. Vertex AI is in principle an end-to-end service that can take care of everything machine learning related in the cloud. In this course, we primarily focus on just the training of our models, and then use other services for the different parts of our pipeline.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_3","title":"\u2754 Exercises","text":"Let's start by going through how we could train a model using PyTorch using the Compute Engine service:
Start by creating an appropriate VM. If you want to start a VM that has PyTorch pre-installed with only CPU support you can run the following command
gcloud compute instances create <instance-name> \\\n --zone europe-west1-b \\\n --image-family=pytorch-latest-cpu \\\n --image-project=deeplearning-platform-release\n
alternatively, if you have access to GPU in your GCP account you could start a VM in the following way
gcloud compute instances create <instance-name> \\\n --zone europe-west4-a \\\n --image-family=pytorch-latest-gpu \\\n --image-project=deeplearning-platform-release \\\n --accelerator=\"type=nvidia-tesla-v100,count=1\" \\\n --metadata=\"install-nvidia-driver=True\" \\\n --maintenance-policy TERMINATE\n
Next, log into your newly created VM. You can either open an ssh
terminal in the cloud console or run the following command
gcloud beta compute ssh <instance-name>\n
It is recommended to always check that the VM we get is actually what we asked for. In this case, the VM should have PyTorch pre-installed so let's check for that by running
python -c \"import torch; print(torch.__version__)\"\n
Additionally, if you have a VM with GPU support also try running the nvidia-smi
command.
When you have logged in to the VM, it works just like your own machine. Therefore, to run some training code you would need to do the same setup steps as on your own machine: clone your GitHub repository, install dependencies, download data, and run the code. Try doing this to make sure you can train a model.
The above exercises should hopefully have convinced you that it can be hard to scale experiments using the Compute Engine service. The reason is that you need to manually start, set up and stop a separate VM for each experiment. Instead, let's try to use the Vertex AI service to train our models.
Start by enabling it by searching for Vertex AI
in the cloud console and going to the service page, or by running the following command:
gcloud services enable aiplatform.googleapis.com\n
The way we are going to use Vertex AI is to create custom jobs, because we have already developed docker containers that contain everything needed to run our code. Thus, the only command that we need is the gcloud ai custom-jobs create
command. An example here would be:
gcloud ai custom-jobs create \\\n --region=europe-west1 \\\n --display-name=test-run \\\n --config=config.yaml \\\n # these are the arguments that are passed to the container, only needed if you want to change defaults\n --command 'python src/my_project/train.py' \\\n --args '[\"--epochs\", \"10\"]'\n
Essentially, this single command does everything: it first creates a VM with the specs given in a configuration file, then loads a container also specified in the configuration file and finally runs everything. An example of a config file could be:
CPUGPU# config_cpu.yaml\nworkerPoolSpecs:\n machineSpec:\n machineType: n1-highmem-2\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n
# config_gpu.yaml\nworkerPoolSpecs:\n machineSpec:\n machineType: n1-standard-8\n acceleratorType: NVIDIA_TESLA_T4 #(1)!\n acceleratorCount: 1\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n
you can read more about the configuration formatting here and the different types of machines here. Try to execute a job using the gcloud ai custom-jobs create
command. For additional documentation you can look at the documentation on the command and this page and this page
Assuming you manage to launch a job, you should see an output like this:
Try executing the commands that are outputted to look at both the status and the progress of your job.
In addition, you can also visit the Custom Jobs
tab in the Training
part of Vertex AI
You will need to select the specific region that you submitted your job to in order to see it.
During custom training, we do not necessarily need to use dvc
for downloading our data. A more efficient way is to use cloud storage as a mounted file system. This allows us to access data directly from cloud storage without having to download it first. All our training jobs automatically get a gcs folder mounted in the root directory. Try to access the data from your training script:
# loading from a bucket using mounted file system\ndata = torch.load('/gcs/<my-bucket-name>/data.pt')\n# writing to a bucket using mounted file system\ntorch.save(data, '/gcs/<my-bucket-name>/data.pt')\n
This should speed up the training process a bit.
Your code may depend on environment variables for authentication, for example with Weights and Biases during training. These can also be specified in the configuration file. How would you do this?
SolutionYou can specify environment variables in the configuration file by adding the env
field to the containerSpec
field. For example, if you want to specify the WANDB_API_KEY
you can do it like this:
workerPoolSpecs:\n machineSpec:\n machineType: n1-highmem-2\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n env:\n - name: WANDB_API_KEY\n value: <your-wandb-api-key>\n
You need to replace <your-wandb-api-key>
with your actual key. Also, remember that this file now contains a secret and should be treated as such.
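Inside the training container the variable can then be consumed like any other environment variable. The snippet below is just a sketch of how a training script might use it (wandb also picks up WANDB_API_KEY automatically, so the explicit login is mainly a sanity check; the project name is a placeholder):
import os\n\nimport wandb\n\n# WANDB_API_KEY is injected by Vertex AI through the env field in containerSpec\nwandb.login(key=os.environ['WANDB_API_KEY'])\nrun = wandb.init(project='<my-project>')\nrun.log({'dummy_metric': 1.0})\nrun.finish()\n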
Try to execute multiple jobs with different configurations at the same time, e.g. by changing the --args field in the gcloud ai custom-jobs create command. This should hopefully show you how easy it is to scale experiments using the Vertex AI service.
Similar to GitHub Actions, GCP also has a secrets store that can be used to keep secrets safe. This is called the Secret Manager in GCP. By using the Secret Manager, we get the option to inject secrets into our code without having to store them in the code itself.
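Besides the cloudbuild-based injection used in the exercises below, secrets can also be read directly from Python with the google-cloud-secret-manager client library. A minimal sketch (assuming the library is installed and that a secret named WANDB_API_KEY exists in your project):
from google.cloud import secretmanager\n\nclient = secretmanager.SecretManagerServiceClient()\nname = 'projects/<project-id>/secrets/WANDB_API_KEY/versions/latest'\nresponse = client.access_secret_version(request={'name': name})\nprint(response.payload.data.decode('utf-8'))  # the secret value\n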
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_4","title":"\u2754 Exercises","text":"Let's look at the example from before where we have a config file like this for custom Vertex AI jobs:
workerPoolSpecs:\n machineSpec:\n machineType: n1-highmem-2\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n env:\n - name: WANDB_API_KEY\n value: $WANDB_API_KEY\n
we do not want to store the WANDB_API_KEY
in the config file; rather, we would like to store it in the Secret Manager and inject it right before the job starts. Let's figure out how to do that.
Start by enabling the secrets manager API by running the following command:
gcloud services enable secretmanager.googleapis.com\n
Next, go to the Secret Manager in the cloud console and create a new secret. You just need to give it a name and a value and leave the rest at the defaults. Add one or more secrets, as shown in the image below.
We are going to inject the secrets into our training job by using cloudbuild. Create a new cloudbuild file called vertex_ai_train.yaml
and add the following content:
steps:\n- name: \"alpine\"\n id: \"Replace values in the training config\"\n entrypoint: \"sh\"\n args:\n - '-c'\n - |\n apk add --no-cache gettext\n envsubst < config.yaml > config.yaml.tmp\n mv config.yaml.tmp config.yaml\n secretEnv: ['WANDB_API_KEY']\n\n- name: 'alpine'\n id: \"Show config\"\n waitFor: ['Replace values in the training config']\n entrypoint: \"sh\"\n args:\n - '-c'\n - |\n cat config.yaml\n\n- name: 'gcr.io/cloud-builders/gcloud'\n id: 'Train on vertex AI'\n waitFor: ['Replace values in the training config']\n args: [\n 'ai',\n 'custom-jobs',\n 'create',\n '--region',\n 'europe-west1',\n '--display-name',\n 'example-mlops-job',\n '--config',\n '${_VERTEX_TRAIN_CONFIG}',\n ]\navailableSecrets:\n secretManager:\n - versionName: projects/$PROJECT_ID/secrets/WANDB_API_KEY/versions/latest\n env: 'WANDB_API_KEY'\n
Slowly go through the file and try to understand what each step does.
SolutionThere are two parts to using secrets in cloud build. First, there is the availableSecrets
field that specifies what secrets from the Secret Manager should be injected into the build. In this case, we are injecting the WANDB_API_KEY
and setting it as an environment variable. The second part is the secretEnv
field in the first step, which specifies which secrets should be available to that step. The steps then do the following:
The first step calls the envsubst command, which is a general Linux command that replaces environment variables in a file. In this case, it replaces $WANDB_API_KEY
with the actual value of the secret. We then save the file as config.yaml.tmp
and rename it back to config.yaml
.
The second step is just to show that the replacement was successful. This is mostly for debugging purposes and can be removed.
The third step is the actual training job. It waits for the first step to finish before running.
Finally, try to trigger the build:
gcloud builds submit --config=vertex_ai_train.yaml\n
and check that the WANDB_API_KEY
is correctly injected into the config.yaml
file.
In Compute Engine, we have the option to either stop or suspend VMs. Can you describe what the difference is?
SolutionSuspended instances preserve the guest OS memory, device state, and application state. You will not be charged for a suspended VM's compute, but you will be charged for storing the aforementioned states. Stopped instances do not preserve any of these states, and you will only be charged for the storage of the disk. However, in both cases, any resources attached to the VM instances, such as static IPs and persistent disks, are charged until they are deleted.
As seen in the exercises, a cloudbuild.yaml
file often contains multiple steps. How would you make steps dependent on each other, i.e. so that one step can only run once another step has finished? And how would you make steps execute concurrently?
In both cases, the solution is the waitFor
field. If you want a step to wait for another step to finish, you need to give the first step an id
and then specify that id
in the waitFor
field of the second step.
steps:\n- name: 'alpine'\n id: 'step1'\n entrypoint: 'sh'\n args: ['-c', 'echo \"Hello World\"']\n- name: 'alpine'\n id: 'step2'\n entrypoint: 'sh'\n args: ['-c', 'echo \"Hello World 2\"']\n waitFor: ['step1']\n
If you want steps to run concurrently you can set the waitFor
field to ['-']
:
steps:\n- name: 'alpine'\n id: 'step1'\n entrypoint: 'sh'\n args: ['-c', 'echo \"Hello World\"']\n- name: 'alpine'\n id: 'step2'\n entrypoint: 'sh'\n args: ['-c', 'echo \"Hello World 2\"']\n waitFor: ['-']\n
This ends the session on how to use Google Cloud services for now. In a future session, we are going to investigate a bit more of the services offered in GCP, in particular for deploying the models that we have just trained.
"},{"location":"s7_deployment/","title":"Model deployment","text":"Slides
Learn how requests work and how to create custom APIs
M22: Requests and APIs
Learn how to deploy custom APIs using serverless functions and serverless containers in the cloud
M23: Cloud Deployment
Learn how to test APIs for functionality and load
M24: API testing
Learn about different ways to improve the deployment of machine learning models
M25: ML Deployment
Learn how to create a frontend for your application using Streamlit
M26: Frontend
Let's say that you have spent 1000 GPU hours and trained the most awesome model that you want to share with the world. One way to do this is, of course, to just place all your code in a GitHub repository, upload a file with the trained model weights to your favorite online storage (assuming it is too big for GitHub to handle) and ask people to download the code and the weights and run everything themselves. This is a fine approach in a small research setting, but in production you need to be able to deploy the model to a fully contained environment, such that people can just execute it without looking (too hard) at the code.
Image credit
In this session we try to look at methods specialized towards deployment of models on your local machine and also how to deploy services in the cloud.
Learning objectives
The learning objectives of this session are:
fastapi
and run it locallyCore Module
Before we can get deployment of our models we need to understand concepts such as APIs and requests. The core reason for this is that we need a new abstraction layer on top of our applications that are not Python-specific. While Python is the defacto language for machine learning, we cannot expect everybody else to use it and in particular, we cannot expect network protocols (both locally and external) to be able to communicate with our Python programs out of the box. For this reason, we need to understand requests, in particular HTTP requests and how to create APIs that can interact with those requests.
"},{"location":"s7_deployment/apis/#requests","title":"Requests","text":"When we are talking about requests, we are essentially talking about the communication method used in client-server types of architectures. As shown in the image below, in this architecture, the client (user) is going to send requests to a server (our machine learning application) and the server will give a response. For example, the user may send a request to get the class of a specific image, which our application will do and then send back the response in terms of a label.
Image creditThe common way of sending requests is called HTTP (Hypertext Transfer Protocol). It is essentially a specification of the intermediary transportation method between the client and server. An HTTP request essentially consists of two parts:
The common request methods are (case sensitive):
You can read more about the different methods here. For most machine learning applications, GET and POST are the core methods to remember. Additionally, if you want to read more about HTTP in general we highly recommend that you go over this comic strip protocol, but the TLDR is that it provides privacy, integrity and identification over the web.
"},{"location":"s7_deployment/apis/#exercises","title":"\u2754 Exercises","text":"We are going to do a couple of exercises on sending requests using requests package to get familiar with the syntax.
Start by installing the `requests`` package
pip install requests\n
Afterwards, create a small script and try to execute the code
import requests\nresponse = requests.get('https://api.github.com/this-api-should-not-exist')\nprint(response.status_code)\n
As you can see from the syntax, we are sending a request using the GET method. This code should return status code 404. Take a look at this page that contains a list of status codes. Next, let's call a page that exists
import requests\nresponse = requests.get('https://api.github.com')\nprint(response.status_code)\n
What is the status code now and what does it mean? Status codes are important when you have an application that is interacting with a server and wants to make sure that it does not fail, which can be done with simple if
statements on the status codes
if response.status_code == 200:\n print('Success!')\nelif response.status_code == 404:\n print('Not Found.')\n
Next, try to call the following
response=requests.get(\"https://api.github.com/repos/SkafteNicki/dtu_mlops\")\n
which gives back a payload. Essentially, payload refers to any additional data that is sent from the client to the server or vice-versa. Try looking at the response.content
attribute. What is the type of this attribute?
You should hopefully observe that the .content
attribute is of type bytes
. It is important to note that this is the standard way of sending payloads to encode them into byte
objects. To get a more human-readable version of the response, we can convert it to JSON format
response.json()\n
Important to remember that a JSON object in Python is just a nested dictionary if you ever want to iterate over the object in some way.
When we use the GET method we can additionally provide a params
argument, that specifies what we want the server to send back for a specific request URL:
response = requests.get(\n 'https://api.github.com/search/repositories',\n params={'q': 'requests+language:python'},\n)\n
Before looking at response.json()
can you explain what the code does? You can try looking at this page for help.
Sometimes the content of a page cannot be converted into JSON, because as already stated data is sent as bytes. Say that we want to download an image, which we can do in the following way
import requests\nresponse = requests.get('https://imgs.xkcd.com/comics/making_progress.png')\n
Try calling response.json()
, what happens? Next, try calling response.content
. To get the result in this case we would need to convert from bytes to an image:
with open(r'img.png','wb') as f:\n f.write(response.content)\n
The get
method is the most useful method because it allows us to get data from the server. However, as stated in the beginning multiple request methods exist, for example, the POST method for sending data to the server. Try executing:
pload = {'username':'Olivia','password':'123'}\nresponse = requests.post('https://httpbin.org/post', data = pload)\n
Investigate the response (this is an artificial example because we do not control the server).
Finally, we should also know that requests can be sent directly from the command line using the curl
command. Sometimes it is easier to send a request directly from the terminal and sometimes it is easier to do it from a script.
Make sure you have curl
installed, or else find instruction on installing it. To check call curl -
-help` with the documentation on curl.
To execute requests.get('https://api.github.com')
using curl we would simply do
curl -X GET \"https://api.github.com\"\ncurl -X GET -I \"https://api.github.com\" # if you want the status code\n
Try it yourself.
Try to redo some of the exercises yourself using curl
.
That ends the intro session on requests
. Do not worry if you are still not completely comfortable with sending requests, we are going to return to how we do it in practice when we have created our API. If you want to learn more about the requests
package you can check out this tutorial and if you want to see more examples of how to use curl
you can check out this page
Requests are all about being on the client side of our client-server architecture. We are now going to move on to the server side where we will be learning about writing the APIs that requests can interact with. An application programming interface (API) is essentially the way for the developer (you) tells a user how to use the application that you have created. The API is an abstraction layer that allows the user to interact with our application in the way we want them to interact with it, without the user even having to look at the code.
We can take the API from GitHub as an example https://api.github.com. This API allows any user to retrieve, integrate and send data to Github without ever having to visit their webpage. The API exposes multiple endpoints that have various functions:
and we could go on. However, there may be functionality that Github is not interested in users having access to and they may therefore choose not to have endpoints for specific features (1).
The particular kind of API we are going to work with is called REST API (or RESTful API). The REST API specify specific constraints that a particular API needs to fulfill to be considered RESTful. You can read more about what the six guiding principles behind REST API on this page but one of the most important to have in mind is that the client-server architecture needs to be stateless. This means that whenever a request is send to the server it needs to be self-contained (all information included) and the server cannot rely on any previously stored information from previous requests.
To implement APIs in practice we are going to use FastAPI. FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. FastAPI is only one of many frameworks for defining APIs, however, compared to other frameworks such as Flask and django it offers a sweet spot of being flexible enough to do what you want without having many additional (unnecessary) features.
"},{"location":"s7_deployment/apis/#exercises_1","title":"\u2754 Exercises","text":"The exercises below are a condensed version of this and this tutorial. If you ever need context for the exercises, we can recommend trying to go through these. Additionally, we also provide this solution file that you can look through for help.
Install FastAPI
pip install fastapi\n
This contains the functions, modules, and variables we are going to need to define our interface.
Additionally, also install uvicorn
which is a package for defining low level server applications.
pip install uvicorn[standard]\n
Start by defining a small application like this in a file called main.py
:
from fastapi import FastAPI\napp = FastAPI()\n\n@app.get(\"/\")\ndef read_root():\n return {\"Hello\": \"World\"}\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n return {\"item_id\": item_id}\n
Importantly here is the use of the @app.get
decorator. What could this decorator refer to? Explain what the two functions are probably doing.
Next lets launch our app. Since we called our script main.py
and we inside the script initialized our API with app = FastAPI
, our application that we want to deploy can be referenced by main:app
:
uvicorn --reload --port 8000 main:app\n
this will launch a server at this page: http://localhost:8000/
. As you will hopefully see, this page will return the content of the root
function, like the image below. Remember to also check the output in your terminal as that will give info on when and how your application is being invoked.
What webpage should you open to get the server to return 1
?
Also checkout the pages: http://localhost:8000/docs
and http://localhost:8000/redoc
. What does these pages show?
The power of the docs
and redoc
pages is that they allow you to easily test your application with their simple UI. As shown in the image below, simply open the endpoint you want to test, click the Try it out
button, input any values and execute it. It will return both the corresponding curl
command for invoking your endpoint, the corresponding URL and response of you application. Try it out.
You can also checkout http://localhost:8000/openapi.json
to check out the schema that is generated which essentially is a json
file containing the overall specifications of your program.
Try to access http://localhost:8000/items/foo
, what happens in this case? When you specify types in your API, FastAPI will automatically do type validation using pydantic, making sure users can only access your API with the correct types. Therefore, remember to include types in your applications!
With the fundamentals in place let's configure it a bit more:
Lets start by changing the root function to include a bit more info. In particular we are also interested in returning the status code so the end user can easily read that. Default status codes are included in the http built-in Python package:
from http import HTTPStatus\n\n@app.get(\"/\")\ndef root():\n \"\"\" Health check.\"\"\"\n response = {\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n }\n return response\n
try to reload the app and see what is returned now. You should not have to re-launch the app because we initialized the app with the --reload
argument.
When we decorate our functions with @app.get(\"/items/{item_id}\")
, item_id
is in the case what we call a path parameters because it is a parameter that is directly included in the path of our endpoint. We have already seen how we can restrict a path to a single type, but what if we want to restrict it to specific values? This is often the case if we are working with parameters of type str
. In this case we would need to define a enum
:
from enum import Enum\nclass ItemEnum(Enum):\n alexnet = \"alexnet\"\n resnet = \"resnet\"\n lenet = \"lenet\"\n\n@app.get(\"/restric_items/{item_id}\")\ndef read_item(item_id: ItemEnum):\n return {\"item_id\": item_id}\n
Add this API, reload and execute both a valid parameter and a non-valid parameter.
In contrast to path parameters we have query parameters. In the requests exercises we saw an example of this where we were calling https://api.github.com/search/code with the query 'q': 'requests+language:python'
. Any parameter in FastAPI that is not a path parameter, will be considered a query parameter:
@app.get(\"/query_items\")\ndef read_item(item_id: int):\n return {\"item_id\": item_id}\n
Add this API, reload and figure out how to pass in a query parameter.
We have until now worked with the .get
method, but lets also see an example of the .post
method. As already described the POST request method is used for uploading data to the server. Here is a simple app that saves username and password in a database (please never implement this in real life like this):
database = {'username': [ ], 'password': [ ]}\n\n@app.post(\"/login/\")\ndef login(username: str, password: str):\n username_db = database['username']\n password_db = database['password']\n if username not in username_db and password not in password_db:\n with open('database.csv', \"a\") as file:\n file.write(f\"{username}, {password} \\n\")\n username_db.append(username)\n password_db.append(password)\n return \"login saved\"\n
Make sure you understand what the function does and then try to execute it a couple of times to see your database updating. It is important to note that we sometimes in the following exercises use the .get
method and sometimes the .post
method. For our usage it does not really matter.
We are now moving on to figuring out how to provide different standard inputs like text, images, json to our APIs. It is important that you try out each example yourself and in particular you look at the curl
commands that are necessary to invoke each application.
Here is a small application, that takes a single text input
@app.get(\"/text_model/\")\ndef contains_email(data: str):\n regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n response = {\n \"input\": data,\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n \"is_email\": re.fullmatch(regex, data) is not None\n }\n return response\n
What does the application do? Try it out yourself
Let's say we wanted to extend the application to check for a specific email domain, either gmail
or hotmail
. Assume that we want to feed this into our application as a json
object e.g.
{\n \"email\": \"mlops@gmail.com\",\n \"domain_match\": \"gmail\"\n}\n
Figure out how to alter the data
parameter such that it takes in the json
object and make sure to extend the application to check if the email and domain also match. Hint: take a look at this page
Let's move on to an application that requires a file input:
from fastapi import UploadFile, File\nfrom typing import Optional\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...)):\n with open('image.jpg', 'wb') as image:\n content = await data.read()\n image.write(content)\n image.close()\n\n response = {\n \"input\": data,\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n }\n return response\n
A couple of new things are going on here: we use the specialized UploadFile
and File
bodies in our input definition. Additionally, we added the async
/await
keywords. Figure out what everything does and try to run the application (you can use any image file you like).
The above application actually does not do anything. Let's add opencv as a package and let's resize the image. It can be done with the following three lines:
import cv2\nimg = cv2.imread(\"image.jpg\")\nres = cv2.resize(img, (h, w))\n
Figure out where to add them in the application and additionally add h
and w
as optional parameters, with a default value of 28. Try running the application where you specify everything and one more time where you leave out h
and w
.
Finally, let's also figure out how to return a file from our application. You will need to add the following lines:
from fastapi.responses import FileResponse\ncv2.imwrite('image_resize.jpg', res)\nFileResponse('image_resize.jpg')\n
Figure out where to add them to the code and try running the application one more time to see that you get a file back with the resized image.
A common pattern in most applications is that we want some code to run on startup and some code to run on shutdown. FastAPI allows us to do this by controlling the lifespan of our application. This is done by implementing the lifespan
function. Look at the documentation for lifespan events and implement a small application that prints Hello
on startup and Goodbye
on shutdown.
Here is a simple example that will print Hello
on startup and Goodbye
on shutdown.
from contextlib import asynccontextmanager\nfrom fastapi import FastAPI\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n print(\"Hello\")\n yield\n print(\"Goodbye\")\n\napp = FastAPI(lifespan=lifespan)\n\n@app.get(\"/\")\ndef read_root():\n return {\"Hello\": \"World\"}\n
Let's try to figure out how to use FastAPI in a Machine learning context. Below is a script that downloads a VisionEncoderDecoder
from huggingface . The model can be used to create captions for a given image. Thus calling
predict_step(['s7_deployment/exercise_files/my_cat.jpg'])\n
returns a list of strings like ['a cat laying on a couch with a stuffed animal']
(try this yourself). Create a FastAPI application that can do inference using this model e.g. it should take in an image, preferably some optional hyperparameters (like max_length
) and should return a string (or list of strings) containing the generated caption.
simple ML application
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer\nimport torch\nfrom PIL import Image\n\nmodel = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\nfeature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\ntokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\ngen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\ndef predict_step(image_paths):\n images = []\n for image_path in image_paths:\n i_image = Image.open(image_path)\n if i_image.mode != \"RGB\":\n i_image = i_image.convert(mode=\"RGB\")\n\n images.append(i_image)\n pixel_values = feature_extractor(images=images, return_tensors=\"pt\").pixel_values\n pixel_values = pixel_values.to(device)\n output_ids = model.generate(pixel_values, **gen_kwargs)\n preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n preds = [pred.strip() for pred in preds]\n return preds\n\nif __name__ == \"__main__\":\n print(predict_step(['s7_deployment/exercise_files/my_cat.jpg']))\n
Solution ml_app.pyfrom contextlib import asynccontextmanager\n\nimport torch\nfrom fastapi import FastAPI, File, UploadFile\nfrom PIL import Image\nfrom transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTFeatureExtractor\n\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n \"\"\"Load and clean up model on startup and shutdown.\"\"\"\n global model, feature_extractor, tokenizer, device, gen_kwargs\n print(\"Loading model\")\n model = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n feature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n tokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n model.to(device)\n gen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\n\n yield\n\n print(\"Cleaning up\")\n del model, feature_extractor, tokenizer, device, gen_kwargs\n\n\napp = FastAPI(lifespan=lifespan)\n\n\n@app.post(\"/caption/\")\nasync def caption(data: UploadFile = File(...)):\n \"\"\"Generate a caption for an image.\"\"\"\n i_image = Image.open(data.file)\n if i_image.mode != \"RGB\":\n i_image = i_image.convert(mode=\"RGB\")\n\n pixel_values = feature_extractor(images=[i_image], return_tensors=\"pt\").pixel_values\n pixel_values = pixel_values.to(device)\n output_ids = model.generate(pixel_values, **gen_kwargs)\n preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n return [pred.strip() for pred in preds]\n
As the final step, we want to figure out how to include our FastAPI application in a docker container as it will help us when we want to deploy in the cloud because docker as always can take care of the dependencies for our application. For the following set of exercises you can take whatever previous FastAPI application as the base application for the container
Start by creating a requirement.txt
file for your application. You will at least need fastapi
and uvicorn
in the file and we always recommend that you are specific about the version you want to use
fastapi>=0.68.0,<0.69.0\nuvicorn>=0.15.0,<0.16.0\n# add anything else you application needs to be able to run\n
Next, create a Dockerfile
with the following content
FROM python:3.9\nWORKDIR /code\nCOPY ./requirements.txt /code/requirements.txt\n\nRUN pip install --no-cache-dir --upgrade -r /code/requirements.txt\nCOPY ./app /code/app\n\nCMD [\"uvicorn\", \"app.main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"80\"]\n
The above assumes that your file structure looks like this
.\n\u251c\u2500\u2500 app\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u2514\u2500\u2500 main.py\n\u251c\u2500\u2500 Dockerfile\n\u2514\u2500\u2500 requirements.txt\n
Hopefully, all these steps should look familiar if you already went through module M9, except for maybe the last line. However, this is just the standard way that we have run our FastAPI applications in the last couple of exercises, this time with some extra arguments regarding the ports we allow.
Next, build the corresponding docker image
docker build -t my_fastapi_app .\n
Finally, run the image such that a container is spun up that runs our application. The important part here is to remember to specify the -p
argument (p for port) which should be the same number as the port we have specified in the last line of our Dockerfile.
docker run --name mycontainer -p 80:80 myimage\n
Check that everything is working by going to the corresponding localhost page http://localhost/items/5?q=somequery
This ends the module on APIs. If you want to go further in this direction we highly recommend that you check out bentoml which is an API standard that focuses solely on creating easy-to-understand APIs and services for ml-applications. Additionally, we can also highly recommend checking out Postman which can help design, document and in particular test the API you are writing to make sure that it works as expected.
"},{"location":"s7_deployment/cloud_deployment/","title":"M23 - Cloud Deployment","text":""},{"location":"s7_deployment/cloud_deployment/#cloud-deployment","title":"Cloud deployment","text":"Core Module
We are now returning to using the cloud. In this module, you should have gone through the steps of having your code in your GitHub repository to automatically build into a docker container, store that, store data and pull it all together to make a training run. After the training is completed you should hopefully have a file stored in the cloud with your trained model weights.
Today's exercises will be about serving those model weights to an end user. We focus on two different ways of deploying our model: Google cloud functions and Google cloud run. Both services are serverless, meaning that you do not have to manage the server that runs your code.
GCP in general has 5 core deployment options. We are going to focus on Cloud Functions and Cloud Run, which are two of the serverless options. In contrast to these two, you have the option to deploy to Kubernetes Engine and Compute Engine which are more traditional ways of deploying your code. Here you have to manage the underlying infrastructure."},{"location":"s7_deployment/cloud_deployment/#cloud-functions","title":"Cloud Functions","text":"Google Cloud Functions, is the most simple way that we can deploy our code to the cloud. As stated above, it is a serverless service, meaning that you do not have to worry about the underlying infrastructure. You just write your code and deploy it. The service is great for small applications that can be encapsulated in a single script.
"},{"location":"s7_deployment/cloud_deployment/#exercises","title":"\u2754 Exercises","text":"Go to the start page of Cloud Functions
. Can be found in the sidebar on the homepage or you can just search for it. Activate the service in the cloud console or use the following command:
gcloud services enable cloudfunctions.googleapis.com\n
Click the Create Function
button which should take you to a screen like the image below. Make sure it is a 2nd Gen function, give it a name, set the server region to somewhere close by and change the authentication policy to Allow unauthenticated invocations
so we can access it directly from a browser. Remember to note down the
On the next page, for Runtime
pick the Python 3.11
option (or newer). This will make the inline editor show both a main.py
and requirements.py
file. Look over them and try to understand what they do. Especially, take a look at the functions-framework which is a needed requirement of any Cloud function.
After you have looked over the files, click the Deploy
button.
The functions-framework
is a lightweight, open-source framework for turning Python functions into HTTP functions. Any function that you deploy to Cloud Functions must be wrapped in the @functions_framework.http
decorator.
Afterwards, the function should begin to deploy. When it is done, you should see \u2705. Now let's test it by going to the Testing
tab.
If you know what the application does, it should come as no surprise that it can run without any input. We therefore just send an empty request by clicking the Test The Function
button. Does the function return the output you expected? Wait for the logs to show up. What do they show?
What should the Triggering event
look like in the testing prompt for the program to respond with
Hallo General Kenobi!\n
Try it out.
SolutionThe default triggering event is a JSON object with a key name
and a value. Therefore the triggering event should look like this:
{\n \"name\": \"General Kenobi\"\n}\n
Go to the trigger tab and go to the URL for the application. Execute the API a couple of times. How can you change the URL to make the application respond with the same output as above?
SolutionYou can change the URL to include a query parameter name
with the value General Kenobi
. For example
https://us-central1-my-personal-mlops-project.cloudfunctions.net/function-3?name=General%20Kanobi\n
where you would need to replace everything before the ?
with your URL.
Click on the metrics tab. You should hopefully see it being populated with a few data points. Identify what each panel is showing.
SolutionCheck out the logs tab. You should see that your application has already been invoked multiple times. Also, try to execute this command in a terminal:
gcloud functions logs read\n
Next, we are going to create our own application that takes some input so we can try to send it requests. We provide a very simple script to get started.
Simple script
sklearn_cloud_functions.py# Load data\nimport pickle\n\nimport numpy as np\nfrom sklearn import datasets\nfrom sklearn.neighbors import KNeighborsClassifier\n\niris_x, iris_y = datasets.load_iris(return_X_y=True)\n\n# Split iris data in train and test data\n# A random permutation, to split the data randomly\nnp.random.seed(0)\nindices = np.random.permutation(len(iris_x))\niris_x_train = iris_x[indices[:-10]]\niris_y_train = iris_y[indices[:-10]]\niris_x_test = iris_x[indices[-10:]]\niris_y_test = iris_y[indices[-10:]]\n\n# Create and fit a nearest-neighbor classifier\n\nknn = KNeighborsClassifier()\nknn.fit(iris_x_train, iris_y_train)\nknn.predict(iris_x_test)\n\n# save model\n\nwith open(\"model.pkl\", \"wb\") as file:\n pickle.dump(knn, file)\n
Figure out what the script does and run the script. This should create a file with a trained model.
SolutionThe file trains a simple KNN model on the iris dataset and saves it to a file called model.pkl
.
Next, create a storage bucket and upload the model file to the bucket. Try to do this using the gsutil
command and check afterward that the file is in the bucket.
gsutil mb gs://<bucket-name> # mb stands for make bucket\ngsutil cp <file-name> gs://<bucket-name> # cp stands for copy\n
Create a new cloud function with the same initial settings as the first one, e.g. Python 3.11
and HTTP
. Then implement in the main.py
file code that loads the trained model from your bucket and returns a prediction when it receives input data in a request.
In addition to writing the main.py
file, you also need to fill out the requirements.txt
file. You need at least three packages to run the application. Remember to also change the Entry point
to the name of your function. If your deployment fails, try to go to the Logs Explorer
page in gcp
which can help you identify why.
The main script should look something like this:
main.pyimport pickle\n\nimport functions_framework\nfrom google.cloud import storage\n\nBUCKET_NAME = \"my_sklearn_model_bucket\"\nMODEL_FILE = \"model.pkl\"\n\nclient = storage.Client()\nbucket = client.get_bucket(BUCKET_NAME)\nblob = bucket.get_blob(MODEL_FILE)\nmy_model = pickle.loads(blob.download_as_string())\n\n\n@functions_framework.http\ndef knn_classifier(request):\n \"\"\"Simple knn classifier function for iris prediction.\"\"\"\n request_json = request.get_json()\n if request_json and \"input_data\" in request_json:\n input_data = request_json[\"input_data\"]\n input_data = [float(in_data) for in_data in input_data]\n input_data = [input_data]\n prediction = my_model.predict(input_data)\n return {\"prediction\": prediction.tolist()}\n return {\"error\": \"No input data provided.\"}\n
And, the requirement file should look like this:
functions-framework>=3.7.0\ngoogle-cloud-storage>=2.14.0\nscikit-learn>=1.4.0\n
Importantly, make sure that you are using the same version of scikit-learn
as you used when you trained the model. Otherwise, you will most likely get an error when trying to load the model.
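If you are unsure which version you trained the model with, a quick way to check is to print it from Python and pin that exact version in the requirements file:
import sklearn\n\n# paste the printed line directly into requirements.txt\nprint(f\"scikit-learn=={sklearn.__version__}\")\n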
When you have successfully deployed the model, try to make predictions with it. What should the request look like?
SolutionIt depends on how exactly you have chosen to implement the main.py
. But for the provided solution, the payload should look like this:
{\n    \"input_data\": [1, 2, 3, 4]\n}\n
with the corresponding curl
command:
curl -X POST \\\n https://your-cloud-function-url/knn_classifier \\\n -H \"Content-Type: application/json\" \\\n -d '{\"input_data\": [5.1, 3.5, 1.4, 0.2]}'\n
Let's try to figure out how to do the above deployment using gcloud
instead of the console UI. The relevant command is gcloud functions deploy. For this function to work you will need to put the main.py
and requirements.txt
in a separate folder. Try to execute the command to successfully deploy the function.
gcloud functions deploy <func-name> \\\n --gen2 --runtime python311 --trigger-http --source <folder> --entry-point knn_classifier\n
where you need to replace <func-name>
with the name of your function and <folder>
with the path to the folder containing the main.py
and requirements.txt
files.
(Optional) You can finally try to redo the exercises by deploying a PyTorch application. You will essentially need to go through the same steps as the sklearn example, including uploading a trained model to storage and writing a cloud function that loads it and returns some output. You are free to choose whatever PyTorch model you want.
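There is no single solution for this optional exercise, but a sketch could look like the following. It assumes that you have saved a TorchScript model with torch.jit.save and uploaded it to a bucket; both the bucket and file names below are hypothetical placeholders, and everything else mirrors the sklearn example.
import io\n\nimport functions_framework\nimport torch\nfrom google.cloud import storage\n\nBUCKET_NAME = \"my_torch_model_bucket\"  # hypothetical bucket name, replace with your own\nMODEL_FILE = \"model.pt\"  # hypothetical file name of the TorchScript model\n\nclient = storage.Client()\nbucket = client.get_bucket(BUCKET_NAME)\nblob = bucket.get_blob(MODEL_FILE)\n# torch.jit.load accepts a file-like object, so we can load the model directly from memory\nmodel = torch.jit.load(io.BytesIO(blob.download_as_bytes()))\nmodel.eval()\n\n\n@functions_framework.http\ndef torch_classifier(request):\n    \"\"\"Run the TorchScript model on the provided input data.\"\"\"\n    request_json = request.get_json()\n    if request_json and \"input_data\" in request_json:\n        input_data = torch.tensor(request_json[\"input_data\"], dtype=torch.float32).unsqueeze(0)\n        with torch.no_grad():\n            prediction = model(input_data)\n        return {\"prediction\": prediction.tolist()}\n    return {\"error\": \"No input data provided.\"}\n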
Cloud functions are great for simple deployments that can be encapsulated in a single script with only simple requirements. However, they do not scale to more advanced applications that may depend on multiple programming languages. We are already familiar with how to deal with this through containers, and Cloud Run is the corresponding service in GCP for deploying containers.
"},{"location":"s7_deployment/cloud_deployment/#exercises_1","title":"\u2754 Exercises","text":"We are going to start locally by developing a small app that we can deploy. We provide two small examples to choose from: first is a small FastAPI app consisting of a single Python script and a docker file. The second is a small Streamlit app (which you can learn more about in this module) consisting of a single docker file. You can choose which one you want to work with.
Simple Fastapi app simple_fastapi_app.pyfrom fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/\")\ndef read_root():\n \"\"\"Root endpoint.\"\"\"\n return {\"Hello\": \"World\"}\n\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n \"\"\"Get an item by id.\"\"\"\n return {\"item_id\": item_id}\n
simple_fastapi_app.dockerfileFROM python:3.9-slim\n\nEXPOSE $PORT\n\nWORKDIR /app\n\nRUN apt-get update && apt-get install -y \\\n build-essential \\\n software-properties-common \\\n git \\\n && rm -rf /var/lib/apt/lists/*\n\nRUN pip install fastapi\nRUN pip install pydantic\nRUN pip install uvicorn\n\nCOPY simple_fastapi_app.py simple_fastapi_app.py\n\nCMD exec uvicorn simple_fastapi_app:app --port $PORT --host 0.0.0.0 --workers 1\n
Simple Streamlit app streamlit_app.dockerfileFROM python:3.9-slim\n\nEXPOSE $PORT\n\nWORKDIR /app\n\nRUN apt-get update && apt-get install -y \\\n build-essential \\\n software-properties-common \\\n git \\\n && rm -rf /var/lib/apt/lists/*\n\nRUN git clone https://github.com/streamlit/streamlit-example.git .\n\nRUN pip3 install -r requirements.txt\n\nENTRYPOINT [\"streamlit\", \"run\", \"streamlit_app.py\", \"--server.port=$PORT\", \"--server.address=0.0.0.0\"]\n
Start by going over the files belonging to your choice app and understand what it does.
Next, build the docker image belonging to the app
docker build -f <dockerfile> . -t gcp_test_app:latest\n
Next tag and push the image to your artifact registry
docker tag gcp_test_app <region>-docker.pkg.dev/<project-id>/<registry-name>/gcp_test_app:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<registry-name>/gcp_test_app:latest\n
Afterward, check your artifact registry contains the pushed image.
Next, go to Cloud Run
in the cloud console and enable the service or use the following command:
gcloud services enable run.googleapis.com\n
Click the Create Service
button which should bring you to a page similar to the one below
Do the following:
Click the select button, which will bring up all built containers, and pick the one you want to deploy. In the future, you probably want to choose the Continuously deploy new revisions from a source repository option, such that a new version is always deployed when a new container is built.
Hereafter, give the service a name and select the region. We recommend choosing a region close to you.
Set the authentication method to Allow unauthenticated invocations such that we can call it without providing credentials. In the future, you may want to allow only authenticated invocations.
Expand the Container, Connections, Security tab and edit the port such that it matches the port exposed in your chosen application. If your docker file exposes the env variable $PORT
you can set the port to anything.
Finally, click the create button and wait for the service to be deployed (may take some time).
Common problems
If you get an error saying The user-provided container failed to start and listen on the port defined by the PORT environment variable, there are two common reasons for this:
You need to add an EXPOSE
statement in your docker container:
EXPOSE 8080\nCMD exec uvicorn my_application:app --port 8080 --workers 1\n
and make sure that your application is also listening on that port. If you hard-code the port in your application (as in the above code) it is best to set it to 8080, which is the default port for Cloud Run. Alternatively, a better approach is to set it to the $PORT
environment variable, which is set by Cloud Run and can be accessed in your application (see the sketch further below):
EXPOSE $PORT\nCMD exec uvicorn my_application:app --port $PORT --workers 1\n
If you do this and then want to run locally you can run it as:
docker run -p 8080:8080 -e PORT=8080 <image-name>:<image-tag>\n
If you are serving a large machine-learning model, it may also be that your deployed container is running out of memory. You can try to increase the memory of the container by going to the Edit container and the Resources tab and increasing the memory.
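Related to the first problem above: if you prefer to start the web server from Python rather than from the dockerfile CMD, the same idea is to read the PORT environment variable with a sensible local fallback. A minimal sketch, where my_application is a hypothetical module containing your FastAPI app:
import os\n\nimport uvicorn\n\nfrom my_application import app  # hypothetical module with your FastAPI app\n\nif __name__ == \"__main__\":\n    # Cloud Run injects PORT; fall back to 8080 when running locally\n    port = int(os.environ.get(\"PORT\", 8080))\n    uvicorn.run(app, host=\"0.0.0.0\", port=port)\n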
If you manage to deploy the service you should see an image like this:
You can now access your application by clicking the URL. This will access the root of your application, so you may need to add /
or /<path>
to the URL depending on how the app works.
Everything we just did in the console UI we can also do with the gcloud run deploy. How would you do that?
SolutionThe command should look something like this
gcloud run deploy <service-name> \\\n --image <image-name>:<image-tag> --platform managed --region <region> --allow-unauthenticated\n
where you need to replace <service-name>
with the name of your service, <image-name>
with the name of your image and <region>
with the region you want to deploy to. The --allow-unauthenticated
flag is optional but is needed if you want to access the service without providing credentials.
After deploying using the command line, make sure that the service is up and running by using these two commands
gcloud run services list\ngcloud run services describe <service-name> --region <region>\n
Instead of deploying our docker container using the UI or command line, which is a manual operation, we can do it continuously by using cloudbuild.yaml
file we learned about in the previous section. This is called continuous deployment, and it is a way to automate the deployment process.
Image credit
Let's revise the cloudbuild.yaml
file from the artifact registry exercises in this module which will build and push a specified docker image.
cloudbuild.yaml
cloudbuild.yamlsteps:\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Build container image'\n args: [\n 'build',\n '.',\n '-t',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n '-f',\n '<path-to-dockerfile>'\n ]\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Push container image'\n args: [\n 'push',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n ]\n
Add a third step to the cloudbuild.yaml
file that deploys the container image to Cloud Run. The relevant service you need to use is called 'gcr.io/cloud-builders/gcloud'
and the command is 'gcloud run deploy'
. Afterwards, reuse the trigger you created in the previous module or create a new one to build and deploy the container image continuously. Confirm that this works by making a change to your application and pushing it to GitHub and see if the application is updated continuously.
The full cloudbuild.yaml
file should look like this:
steps:\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Build container image'\n args: [\n 'build',\n '.',\n '-t',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n '-f',\n '<path-to-dockerfile>'\n ]\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Push container image'\n args: [\n 'push',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n ]\n- name: 'gcr.io/cloud-builders/gcloud'\n id: 'Deploy to Cloud Run'\n args: [\n 'run',\n 'deploy',\n '<service-name>',\n '--image',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n '--region',\n 'europe-west1',\n '--platform',\n 'managed',\n ]\n
In the previous module on using the cloud you learned about the Secrets Manager in GCP. How can you use this service in combination with Cloud Run?
SolutionIn the cloud console, secrets can be set in the Container(s), Volumes, Networking, Security tab under the Variables & Secrets section, see image below.
In the gcloud
command, you can set the secret by using the --update-secrets
flag.
gcloud run deploy <service-name> \\\n --image <image-name>:<image-tag> --platform managed \\\n --region <region> --allow-unauthenticated \\\n --update-secrets <secret-name>=<secret-version>\n
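Assuming you exposed the secret as an environment variable (the name on the left-hand side of the --update-secrets flag), it can then be read inside the container like any other environment variable. A small sketch, where MY_SECRET is a hypothetical name:
import os\n\n# MY_SECRET is a hypothetical name, use whatever you chose in the --update-secrets flag\napi_key = os.environ.get(\"MY_SECRET\")\nif api_key is None:\n    raise RuntimeError(\"Expected the secret MY_SECRET to be set as an environment variable\")\n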
That ends the exercises on deployment. The exercises above are just a small taste of what deployment has to offer. In both sections, we have explicitly chosen to work with serverless deployments. But what if you wanted to do the opposite, i.e. be the one in charge of managing the cluster that handles the deployed services? If you are interested in taking deployment to the next level, you should get started with Kubernetes, which is the de-facto open-source container orchestration platform used in production environments. If you want to deep dive, we recommend starting here, which describes how to make pipelines that are a necessary component before you start to create your own Kubernetes cluster.
"},{"location":"s7_deployment/frontend/","title":"M26 - Frontend","text":""},{"location":"s7_deployment/frontend/#frontend","title":"Frontend","text":"If you have gone over the deployment module you should be at the point where you have a machine learning model running in the cloud. The model can be interacted with by sending HTTP requests to the API endpoint. In general we refer to this as the backend of the application. It is the part of our application that are behind-the-scene that the user does not see and it is not really that user-friendly. Instead we want to create a frontend that the user can interact with in a more user-friendly way. This is what we will be doing in this module.
Another reason for splitting our application into a frontend and a backend has to do with scalability. If we have a lot of users interacting with our application, we might want to scale only the backend and not the frontend, because that is the part that will be running our heavy machine learning model. In general, dividing an application into smaller pieces is the pattern used in microservice architectures.
In monolithic applications everything the user may be requesting of our application is handled by a single process/container. In microservice architectures the application is split into smaller pieces that can be scaled independently. This also leads to easier maintainability and faster development. Frontends have for the longest time been created using HTML, CSS and JavaScript. This is still the case, but there are now a lot of frameworks that can help us create a frontend in Python:
In this module we will be looking at streamlit
. streamlit
is an easy-to-use framework that allows us to create interactive web applications in Python. It is not at all as powerful as a framework like Django
, but it is very easy to get started with and to integrate with our machine learning models.
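To illustrate how little code is needed, a minimal streamlit \"hello world\" could look like this (run it with streamlit run app.py):
import streamlit as st\n\nst.title(\"My first streamlit app\")\nname = st.text_input(\"What is your name?\")\nif name:\n    st.write(f\"Hello {name}!\")\n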
In these exercises we go through the process of setting up a backend using fastapi
and a frontend using streamlit
, containerizing both applications and then deploying them to the cloud. We have already created an example of this which can be found in the samples/frontend_backend
folder.
Let's start by creating the backend application in a backend.py
file. You can use essentially any backend you want, but we will be using a simple imagenet classifier that we have created in the samples/frontend_backend/backend
folder.
Create a new file called backend.py
and implement a FastAPI interface with a single /predict
endpoint that takes an image as input and returns the predicted class (and probabilities) of the image.
import json\nfrom contextlib import asynccontextmanager\n\nimport anyio\nimport torch\nfrom fastapi import FastAPI, File, HTTPException, UploadFile\nfrom PIL import Image\nfrom torchvision import models, transforms\n\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n \"\"\"Context manager to start and stop the lifespan events of the FastAPI application.\"\"\"\n global model, transform, imagenet_classes\n # Load model\n model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)\n model.eval()\n\n transform = transforms.Compose(\n [\n transforms.Resize((224, 224)),\n transforms.ToTensor(),\n transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),\n ],\n )\n\n async with await anyio.open_file(\"imagenet-simple-labels.json\") as f:\n imagenet_classes = json.load(f)\n\n yield\n\n # Clean up\n del model\n del transform\n del imagenet_classes\n\n\napp = FastAPI(lifespan=lifespan)\n\n\ndef predict_image(image_path: str) -> str:\n \"\"\"Predict image class (or classes) given image path and return the result.\"\"\"\n img = Image.open(image_path).convert(\"RGB\")\n img = transform(img).unsqueeze(0)\n with torch.no_grad():\n output = model(img)\n _, predicted_idx = torch.max(output, 1)\n return output.softmax(dim=-1), imagenet_classes[predicted_idx.item()]\n\n\n@app.get(\"/\")\nasync def root():\n \"\"\"Root endpoint.\"\"\"\n return {\"message\": \"Hello from the backend!\"}\n\n\n# FastAPI endpoint for image classification\n@app.post(\"/classify/\")\nasync def classify_image(file: UploadFile = File(...)):\n \"\"\"Classify image endpoint.\"\"\"\n try:\n contents = await file.read()\n async with await anyio.open_file(file.filename, \"wb\") as f:\n f.write(contents)\n probabilities, prediction = predict_image(file.filename)\n return {\"filename\": file.filename, \"prediction\": prediction, \"probabilities\": probabilities.tolist()}\n except Exception as e:\n raise HTTPException(status_code=500) from e\n
Run the backend using uvicorn
uvicorn backend:app --reload\n
Test the backend by sending a request to the /predict
endpoint, preferably using curl
command
In this example we are sending a request to the /classify/ endpoint of the provided solution with a file called my_cat.jpg
. The response should be \"tabby cat\" for the solution we have provided.
curl -X 'POST' \\\n 'http://127.0.0.1:8000/classify/' \\\n -H 'accept: application/json' \\\n -H 'Content-Type: multipart/form-data' \\\n -F 'file=@my_cat.jpg;type=image/jpeg'\n
Create a requirements_backend.txt
file with the dependencies needed for the backend.
fastapi>=0.108.0\nuvicorn>=0.25.0\ntorch>=2.1.2\ntorchvision>=0.16.2\n
Containerize the backend into a file called backend.dockerfile
.
FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc git && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nRUN mkdir /app\n\nWORKDIR /app\n\nCOPY requirements_backend.txt /app/requirements_backend.txt\nCOPY backend.py /app/backend.py\n\nRUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements_backend.txt\n\nEXPOSE $PORT\nCMD exec uvicorn --port $PORT --host 0.0.0.0 backend:app\n
Build the backend image
docker build -t backend:latest -f backend.dockerfile .\n
Recheck that the backend works by running the image in a container
docker run --rm -p 8000:8000 -e \"PORT=8000\" backend\n
and test that it works by sending a request to the /classify/ endpoint as before.
Deploy the backend to Cloud run using the gcloud
command
Assuming that we have created an artifact registry called frontend_backend
we can deploy the backend to Cloud Run using the following commands:
docker tag \\\n    backend:latest \\\n    <region>-docker.pkg.dev/<project>/frontend-backend/backend:latest\ndocker push \\\n    <region>-docker.pkg.dev/<project>/frontend-backend/backend:latest\ngcloud run deploy backend \\\n    --image=europe-west1-docker.pkg.dev/<project>/frontend-backend/backend:latest \\\n    --region=europe-west1 \\\n    --platform=managed\n
where <region>
and <project>
should be replaced with the appropriate values.
Finally, test that the deployed backend works as expected by sending a request to the /classify/ endpoint
In this solution we are first extracting the URL of the deployed backend and then sending a request to its /classify/ endpoint.
export MYENDPOINT=$(gcloud run services describe backend --region=<region> --format=\"value(status.url)\")\ncurl -X 'POST' \\\n    $MYENDPOINT/classify/ \\\n    -H 'accept: application/json' \\\n    -H 'Content-Type: multipart/form-data' \\\n    -F 'file=@my_cat.jpg;type=image/jpeg'\n
With the backend taken care of, let's now write our frontend. Our frontend just needs to be a \"nice\" interface to our backend. Its main functionality will be to send a request to the backend and display the result. streamlit documentation
Start by installing streamlit
pip install streamlit\n
Now create a file called frontend.py
and implement a streamlit application. You can design it as you want, but we recommend that the following can be done in the frontend:
Have a file uploader that allows the user to upload an image
Display the image that the user uploaded
Have a button that sends the image to the backend and displays the result
For now just assume that a environment variable called BACKEND
is available that contains the URL of the backend. We will in the next step show how to get this URL automatically.
import os\n\nimport pandas as pd\nimport requests\nimport streamlit as st\nfrom google.cloud import run_v2\n\n\ndef get_backend_url():\n \"\"\"Get the URL of the backend service.\"\"\"\n parent = \"projects/my-personal-mlops-project/locations/europe-west1\"\n client = run_v2.ServicesClient()\n services = client.list_services(parent=parent)\n for service in services:\n if service.name.split(\"/\")[-1] == \"production-model\":\n return service.uri\n return os.environ.get(\"BACKEND\", None)\n\n\ndef classify_image(image, backend):\n \"\"\"Send the image to the backend for classification.\"\"\"\n predict_url = f\"{backend}/predict\"\n response = requests.post(predict_url, files={\"image\": image}, timeout=10)\n if response.status_code == 200:\n return response.json()\n return None\n\n\ndef main() -> None:\n \"\"\"Main function of the Streamlit frontend.\"\"\"\n backend = get_backend_url()\n if backend is None:\n msg = \"Backend service not found\"\n raise ValueError(msg)\n\n st.title(\"Image Classification\")\n\n uploaded_file = st.file_uploader(\"Upload an image\", type=[\"jpg\", \"jpeg\", \"png\"])\n\n if uploaded_file is not None:\n image = uploaded_file.read()\n result = classify_image(image, backend=backend)\n\n if result is not None:\n prediction = result[\"prediction\"]\n probabilities = result[\"probabilities\"]\n\n # show the image and prediction\n st.image(image, caption=\"Uploaded Image\")\n st.write(\"Prediction:\", prediction)\n\n # make a nice bar chart\n data = {\"Class\": [f\"Class {i}\" for i in range(10)], \"Probability\": probabilities}\n df = pd.DataFrame(data)\n df.set_index(\"Class\", inplace=True)\n st.bar_chart(df, y=\"Probability\")\n else:\n st.write(\"Failed to get prediction\")\n\n\nif __name__ == \"__main__\":\n main()\n
We need to make sure that the frontend knows where the backend is located, and we want that to happen automatically so we do not have to hardcode the URL into our frontend. We can do this by using the Python SDK for Google Cloud Run. The following code snippet shows how to get the URL of the backend service or fall back to an environment variable if the service is not found.
from google.cloud import run_v2\nimport streamlit as st\n\n@st.cache_resource # (1)!\ndef get_backend_url():\n \"\"\"Get the URL of the backend service.\"\"\"\n parent = \"projects/<project>/locations/<region>\"\n client = run_v2.ServicesClient()\n services = client.list_services(parent=parent)\n for service in services:\n if service.name.split(\"/\")[-1] == \"production-model\":\n return service.uri\n name = os.environ.get(\"BACKEND\", None)\n return name\n
st.cache_resource
is a decorator that tells streamlit
to cache the result of the function. This is useful if the function is expensive to run and we want to avoid running it multiple times.Add the above code snippet to the top of your frontend.py
file and replace <project>
and <region>
with the appropriate values. You will need to install pip install google-cloud-run
to be able to use the code snippet.
Run the frontend using streamlit
streamlit run frontend.py\n
Create a requirements_frontend.txt
file with the dependencies needed for the frontend.
streamlit>=1.28.2\npandas>=2.1.3\ngoogle-cloud-run>=0.10.5\n
Containerize the frontend into a file called frontend.dockerfile
.
FROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc git && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nRUN mkdir /app\n\nWORKDIR /app\n\nCOPY requirements_frontend.txt /app/requirements_frontend.txt\nCOPY frontend.py /app/frontend.py\n\nRUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements_frontend.txt\n\nEXPOSE $PORT\n\nCMD streamlit run frontend.py --server.port $PORT --server.address 0.0.0.0\n
Build the frontend image
docker build -t frontend:latest -f frontend.dockerfile .\n
Run the frontend image
docker run --rm -p 8001:8001 -e \"PORT=8001\" frontend\n
and check in your web browser that the frontend works as expected.
Deploy the frontend to Cloud run using the gcloud
command
Assuming that we have created an artifact registry called frontend_backend
we can deploy the frontend to Cloud Run using the following commands:
docker tag frontend:latest \\\n    <region>-docker.pkg.dev/<project>/frontend-backend/frontend:latest\ndocker push <region>-docker.pkg.dev/<project>/frontend-backend/frontend:latest\ngcloud run deploy frontend \\\n    --image=europe-west1-docker.pkg.dev/<project>/frontend-backend/frontend:latest \\\n    --region=europe-west1 \\\n    --platform=managed\n
Test that the frontend works as expected by opening the URL of the deployed frontend in your web browser.
(Optional) If you have gotten this far you have successfully created a frontend and a backend and deployed them to the cloud. Finally, it may be worth it to load test your application to see how it performs under load. Write a locust file which is covered in this module and run it against your frontend. Make sure that it can handle the load you expect it to handle.
(Optional) Feel free to experiment further with streamlit and see what you can create. For example, you can try to create an option for the user to upload a video and then display the video with the predicted class overlaid on top of it.
We have created separate requirements files for the frontend and the backend. Why is this a good idea?
SolutionThis is a good idea because the frontend and the backend may have different dependencies. By having separate requirements files we can make sure that we only install the dependencies that are needed for the specific application. This also has the positive side effect that we can keep the docker images smaller. For example, the frontend does not need the torch
library which is huge and only needed for the backend.
This ends the exercises for this module.
"},{"location":"s7_deployment/ml_deployment/","title":"M25 - ML deployment","text":""},{"location":"s7_deployment/ml_deployment/#deployment-of-machine-learning-models","title":"Deployment of Machine Learning Models","text":"In one of the previous modules you learned about how to use FastAPI to create an API to interact with your machine learning models. FastAPI is a great framework, but it is a general framework meaning that it was not developed with machine learning applications in mind. This means that there are features which you may consider to be missing when considering running large scale machine learning models:
Dynamic-batching: if you have a large number of requests coming in, you may want to process them in batches to reduce the overhead of loading the model and running the inference. This is especially true if you are running your model on a GPU, where the overhead of loading the model is significant.
Async inference: FastAPI does support async requests, but it has no built-in way to call the model asynchronously. This means that if you have a large number of requests coming in, you will have to wait for the model to finish processing (because the model call is not async) before you can start processing the next request.
Native GPU support: you can definitely run parts of your application on a GPU if you want to, but again FastAPI was not built with machine learning in mind, so you will have to do some extra work to get it to work.
It should come as no surprise that multiple frameworks have therefore sprung up that better support the deployment of machine learning models (just listing a few here):
\ud83c\udf1f Framework | \ud83e\udde9 Backend Agnostic | \ud83e\udde0 Model Agnostic | \ud83d\udcc2 Repository | \u2b50 Github Stars\nCortex | \u2705 | \u2705 | \ud83d\udd17 Link | 8.0k\nBentoML | \u2705 | \u2705 | \ud83d\udd17 Link | 7.2k\nRay Serve | \u2705 | \u2705 | \ud83d\udd17 Link | 34.1k\nTriton Inference Server | \u2705 | \u2705 | \ud83d\udd17 Link | 8.3k\nOpenVINO | \u2705 | \u2705 | \ud83d\udd17 Link | 7.3k\nSeldon-core | \u2705 | \u2705 | \ud83d\udd17 Link | 4.4k\nLitserve | \u2705 | \u2705 | \ud83d\udd17 Link | 2.5k\nTorchserve | \u274c | \u2705 | \ud83d\udd17 Link | 4.2k\nTensorFlow serve | \u274c | \u2705 | \ud83d\udd17 Link | 6.2k\nvLLM | \u274c | \u274c | \ud83d\udd17 Link | 30.5k\nThe first 7 frameworks are backend agnostic, meaning that they are intended to work with whatever computational backend your model is implemented in (TensorFlow, PyTorch, Jax, Sklearn etc.), whereas the last 3 are backend specific (PyTorch, TensorFlow and a custom framework). The first 9 frameworks are model agnostic, meaning that they are intended to work with whatever model you have implemented, whereas the last one is model specific, in this case to LLMs. When choosing a framework to deploy your model, you should consider the following:
Ease of use. Some frameworks are easier to use and get started with than others, but may have fewer features. As an example from the list above, Litserve
is very easy to get started with but is a relatively new framework and may not have all the features you need.
Performance. Some frameworks are optimized for performance, but may be harder to use. As an example from the list above, vLLM
is a very high performance framework for serving large language models but it cannot be used for other types of models.
Community. Some frameworks have a large community, which can be helpful if you run into problems. As an example from the list above, Triton Inference Server
is developed by Nvidia and has a large community of users. As a good rule of thumb, the more stars a repository has on Github, the larger the community.
In this module we are going to be looking at the BentoML
framework because it strikes a good balance between ease of use and having a lot of features that can improve the performance of serving your models. However, before we dive into this serving framework, we are going to look at a general way to package our machine learning models that should work with most of the above frameworks.
Whenever we want to serve a machine learning model, we in general need 3 things:
In the previous module on Docker we learned how to package all of these things into a container. This is a great way to package a model, but it is not the only way. The core assumption we currently have made is that the computational backend is the same as the one we trained the model on. However, this does not need to be the case. As long as we can export our model and weights to a common format, we can run the model on any backend that supports this format.
This is exactly what the Open Neural Network Exchange (ONNX) is designed to do. ONNX is a standardized format for creating and sharing machine learning models. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types. The idea behind ONNX is that a model trained with a specific framework on a specific device, let's say PyTorch on your local computer, can be exported and run with an entirely different framework and hardware easily. Learning how to export your models to ONNX is therefore a great way to increase the longevity of your models and not being locked into a specific framework for serving your models.
The ONNX format is designed to bridge the gap between development and deployment of machine learning models, by making it easy to export models between different frameworks and hardware. For example PyTorch is in general considered an developer friendly framework, however it has historically been slow to run inference with compared to a framework. Image credit"},{"location":"s7_deployment/ml_deployment/#exercises","title":"\u2754 Exercises","text":"Start by installing ONNX, ONNX runtime and ONNX script. This can be done by running the following command
pip install onnx onnxruntime onnxscript\n
the first package contains the core ONNX framework, the second package contains the runtime for running ONNX models and the third package contains a new experimental package that is designed to make it easier to export models to ONNX.
Let's start out with converting a model to ONNX. The following code snippets shows how to export a PyTorch model to ONNX.
PyTorch => 2.0PyTorch < 2.0 or WindowsPyTorch-lightningimport torch\nimport torchvision\n\nmodel = torchvision.models.resnet18(weights=None)\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nonnx_model = torch.onnx.dynamo_export(\n model=model,\n model_args=(dummy_input,),\n export_options=torch.onnx.ExportOptions(dynamic_shapes=True),\n)\nonnx_model.save(\"resnet18.onnx\")\n
import torch\nimport torchvision\n\nmodel = torchvision.models.resnet18(weights=None)\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\ntorch.onnx.export(\n model=model,\n args=(dummy_input,),\n f=\"resnet18.onnx\",\n input_names=[\"input\"],\n output_names=[\"output\"],\n dynamic_axes={\"input\": {0: \"batch_size\"}, \"output\": {0: \"batch_size\"}}\n)\n
import torch\nimport torchvision\nimport pytorch_lightning as pl\nimport onnx\nimport onnxruntime\n\nclass LitModel(pl.LightningModule):\n def __init__(self):\n super().__init__()\n self.model = torchvision.models.resnet18(pretrained=True)\n self.model.eval()\n\n def forward(self, x):\n return self.model(x)\n\nmodel = LitModel()\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nmodel.to_onnx(\n file_path=\"resnet18.onnx\",\n input_sample=dummy_input,\n input_names=[\"input\"],\n output_names=[\"output\"],\n dynamic_axes={\"input\": {0: \"batch_size\"}, \"output\": {0: \"batch_size\"}}\n)\n
Export a model of your own choice to ONNX or just try to export the resnet18
model as shown in the examples above, and confirm that the model was exported by checking that the file exists. Can you figure out what is meant by dynamic_axes
?
The dynamic_axes
argument is used to specify which axes of the input tensor that should be considered dynamic. This is useful when the model can accept inputs of different sizes, e.g. when the model is used in a dynamic batching scenario. In the example above we have specified that the first axis of the input tensor should be considered dynamic, meaning that the model can accept inputs of different batch sizes. While it may be tempting to specify all axes as dynamic, however this can lead to slower inference times, because the ONNX runtime will not be able to optimize the computational graph as well.
Check that the model was correctly exported by loading it using the onnx
package and afterwards check the graph of model using the following code:
import onnx\nmodel = onnx.load(\"resnet18.onnx\")\nonnx.checker.check_model(model)\nprint(onnx.helper.printable_graph(model.graph))\n
To get a better understanding of what is actually exported, lets try to visualize the computational graph of the model. This can be done using the open-source tool netron. You can either try it out directly in webbrowser or you can install it locally using pip install netron
and then run it using netron resnet18.onnx
. Can you figure out what method of the model is exported to ONNX?
When a PyTorch model is exported to ONNX, it is only the forward
method of the model that is exported. This means that it is the only method we have access to when we load the model later. Therefore, make sure that the forward
method of your model is implemented in a way that it can be used for inference.
After converting a model to ONNX format we can use the ONNX Runtime to run it. The benefit of this is that ONNX Runtime is able to optimize the computational graph of the model, which can lead to faster inference times. Lets try to look into that.
Figure out how to run a model using the ONNX Runtime. Relevant documentation.
SolutionTo use the ONNX runtime to run a model, we first need to start a inference session, then extract input output names of our model and finally run the model. The following code snippet shows how to do this.
import onnxruntime as rt\nort_session = rt.InferenceSession(\"<path-to-model>\")\ninput_names = [i.name for i in ort_session.get_inputs()]\noutput_names = [i.name for i in ort_session.get_outputs()]\nbatch = {input_names[0]: np.random.randn(1, 3, 224, 224).astype(np.float32)}\nout = ort_session.run(output_names, batch)\n
Let's experiment with performance of ONNX vs. PyTorch. Implement a benchmark that measures the time it takes to run a model using PyTorch and ONNX. Bonus points if you test for multiple input sizes. To get you started we have implemented a timing decorator that you can use to measure the time it takes to run a function.
from statistics import mean, stdev\nimport time\ndef timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):\n \"\"\" Decorator that times the execution of a function. \"\"\"\n def wrapper(*args, **kwargs):\n timing_results = []\n for _ in range(timing_repeat):\n start_time = time.time()\n for _ in range(function_repeat):\n result = func(*args, **kwargs)\n end_time = time.time()\n elapsed_time = end_time - start_time\n timing_results.append(elapsed_time)\n print(f\"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds\")\n return result\n return wrapper\n
Solution onnx_benchmark.pyimport sys\nimport time\nfrom statistics import mean, stdev\n\nimport onnxruntime as ort\nimport torch\nimport torchvision\n\n\ndef timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):\n \"\"\"Decorator that times the execution of a function.\"\"\"\n\n def wrapper(*args, **kwargs):\n timing_results = []\n for _ in range(timing_repeat):\n start_time = time.time()\n for _ in range(function_repeat):\n result = func(*args, **kwargs)\n end_time = time.time()\n elapsed_time = end_time - start_time\n timing_results.append(elapsed_time)\n print(f\"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds\")\n return result\n\n return wrapper\n\n\nmodel = torchvision.models.resnet18()\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nif sys.platform == \"win32\":\n # Windows doesn't support the new TorchDynamo-based ONNX Exporter\n torch.onnx.export(\n model,\n dummy_input,\n \"resnet18.onnx\",\n input_names=[\"input.1\"],\n dynamic_axes={\"input.1\": {0: \"batch_size\", 2: \"height\", 3: \"width\"}},\n )\nelse:\n torch.onnx.dynamo_export(model, dummy_input).save(\"resnet18.onnx\")\n\nort_session = ort.InferenceSession(\"resnet18.onnx\")\n\n\n@timing_decorator\ndef torch_predict(image) -> None:\n \"\"\"Predict using PyTorch model.\"\"\"\n model(image)\n\n\n@timing_decorator\ndef onnx_predict(image) -> None:\n \"\"\"Predict using ONNX model.\"\"\"\n ort_session.run(None, {\"input.1\": image.numpy()})\n\n\nif __name__ == \"__main__\":\n for size in [224, 448, 896]:\n dummy_input = torch.randn(1, 3, size, size)\n print(f\"Image size: {size}\")\n torch_predict(dummy_input)\n onnx_predict(dummy_input)\n
To get a better understanding of why running the model using the ONNX runtime is usually faster, let's try to see what happens to the computational graph. By default the ONNX Runtime will apply graph optimizations in online mode, meaning that the optimizations are applied when the model is loaded. However, it is also possible to apply the optimizations in offline mode, such that the optimized model is saved to disk. Below is an example of how to do this.
import onnxruntime as rt\nsess_options = rt.SessionOptions()\n\n# Set graph optimization level\nsess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED\n\n# To enable model serialization after graph optimization set this\nsess_options.optimized_model_filepath = \"optimized_model.onnx\"\n\nsession = rt.InferenceSession(\"<model_path>\", sess_options)\n
Try to apply the optimizations in offline mode and use netron
to visualize both the original and optimized model side by side. Can you see any differences?
You should hopefully see that the optimized model consist of fewer nodes and edges than the original model. These nodes are often called fused nodes, because they are the result of multiple nodes being fused together. In the image below we have visualized the first part of the computational graph of a resnet18 model, before and after optimization.
As mentioned in the introduction, ONNX is able to run on many different types of hardware and execution engine. You can check all providers and all the available providers by running the following code
import onnxruntime\nprint(onnxruntime.get_all_providers())\nprint(onnxruntime.get_available_providers())\n
Can you figure out how to set which provide the ONNX runtime should use?
SolutionThe provider that the ONNX runtime should use can be set by passing the providers
argument to the InferenceSession
class. A list should be provided, which prioritizes the providers in the order they are listed.
import onnxruntime as rt\nprovider_list = ['CUDAExecutionProvider', 'CPUExecutionProvider']\nort_session = rt.InferenceSession(\"<path-to-model>\", providers=provider_list)\n
In this case we will prefer CUDA Execution Provider over CPU Execution Provider if both are available.
As you have probably realised in the exercises on docker, it can take a long time to build the kind of containers we are working with and they can be quite large. There is a reason for this and that is that PyTorch is a very large framework with a lot of dependencies. ONNX on the other hand is a much smaller framework. This kind of makes sense, because PyTorch is a framework that primarily was designed for developing e.g. training models, while ONNX is a framework that is designed for serving models. Let's try to quantify this.
Construct a dockerfile that builds a docker image with PyTorch as a dependency. The dockerfile does actually not need to run anything. Repeat the same process for the ONNX runtime. Bonus point for developing a docker image that takes a build arg at build time that specifies if the image should be built with CUDA support or not.
SolutionThe dockerfile for the PyTorch image could look something like this
inference_pytorch.dockerfileFROM python:3.11-slim\n\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n\nARG CUDA\nENV CUDA=${CUDA}\nRUN echo \"CUDA is set to: ${CUDA}\"\n\nRUN echo \"CUDA is set to: ${CUDA}\" && \\\n if [ -n \"$CUDA\" ]; then \\\n pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121; \\\n else \\\n pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu; \\\n fi\n
and the dockerfile for the ONNX image could look something like this
inference_onnx.dockerfileFROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nARG CUDA\nENV CUDA=${CUDA}\n\nRUN echo \"CUDA is set to: ${CUDA}\" && \\\n    if [ -n \"$CUDA\" ]; then \\\n    pip install onnxruntime-gpu; \\\n    else \\\n    pip install onnxruntime; \\\n    fi\n
Build both containers and measure the time it takes to build them. How much faster is it to build the ONNX container compared to the PyTorch container?
SolutionOn unix/linux you can use the time command to measure the time it takes to build the containers. Building both images, with and without CUDA support, can be done with the following commands
time docker build . -t pytorch_inference_cuda:latest -f inference_pytorch.dockerfile \\\n --no-cache --build-arg CUDA=true\ntime docker build . -t pytorch_inference:latest -f inference_pytorch.dockerfile \\\n --no-cache --build-arg CUDA=\ntime docker build . -t onnx_inference_cuda:latest -f inference_onnx.dockerfile \\\n --no-cache --build-arg CUDA=true\ntime docker build . -t onnx_inference:latest -f inference_onnx.dockerfile \\\n --no-cache --build-arg CUDA=\n
the --no-cache
flag is used to ensure that the build process is not cached and ensure a fair comparison. On my laptop this respectively took 5m1s
, 1m4s
, 0m4s
, 0m50s
meaning that the ONNX container was respectively 7x (with CUDA) and 1.28x (no CUDA) faster to build than the PyTorch container.
Find out the size of the two docker images. It can be done in the terminal by running the docker images
command. How much smaller is the ONNX model compared to the PyTorch model?
As of writing the docker image containing the PyTorch framework was 5.54GB (with CUDA) and 1.25GB (no CUDA). In comparison the ONNX image was 647MB (with CUDA) and 647MB (no CUDA). This means that the ONNX image is respectively 8.5x (with CUDA) and 1.94x (no CUDA) smaller than the PyTorch image.
(Optional) Assuming you have completed the module on FastAPI try creating a small FastAPI application that serves a model using the ONNX runtime.
SolutionHere is a simple example of how to create a FastAPI application that serves a model using the ONNX runtime.
onnx_fastapi.pyimport numpy as np\nimport onnxruntime\nfrom fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/predict\")\ndef predict():\n \"\"\"Predict using ONNX model.\"\"\"\n # Load the ONNX model\n model = onnxruntime.InferenceSession(\"model.onnx\")\n\n # Prepare the input data\n input_data = {\"input\": np.random.rand(1, 3).astype(np.float32)}\n\n # Run the model\n output = model.run(None, input_data)\n\n return {\"output\": output[0].tolist()}\n
This completes the exercises on the ONNX format. Do note that one limitation of the ONNX format is that it is based on ProtoBuf, which is a binary format. A protobuf file can have a maximum size of 2GB, which means that the .onnx
format is not enough for very large models. However, through the use of external data it is possible to circumvent this limitation.
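A sketch of how this can be done with the onnx package: the weights are moved to a separate file that lives next to the (now small) .onnx graph file.
import onnx\n\n# load an existing model and re-save it with the weights stored as external data\nmodel = onnx.load(\"resnet18.onnx\")\nonnx.save_model(\n    model,\n    \"resnet18_external.onnx\",\n    save_as_external_data=True,\n    all_tensors_to_one_file=True,\n    location=\"resnet18_external.onnx.data\",  # file that will hold the raw weights\n)\n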
BentoML cloud vs BentoML OSS
We are only going to be looking at the open-source version of BentoML in this module. However, BentoML also has a cloud version that makes it very easy to deploy models that are coded in BentoML to the cloud. If you are interested in this, you can check out the BentoML cloud documentation. This business strategy of having an open-source product and a cloud product is very common in the machine learning space (HuggingFace, LightningAI, Weights and Biases etc.), because it allows companies to make money from the cloud product while still providing a free product to the community.
BentoML is a framework that is designed to make it easy to serve machine learning models. It is designed to be backend agnostic, meaning that it can be used with any computational backend. It is also model agnostic, meaning that it can be used with any machine learning model.
Let's consider a simple example of how to serve a model using BentoML. The following code snippet shows how to serve a model that uses the transformers
library to summarize text.
import bentoml\nfrom transformers import pipeline\n\nEXAMPLE_INPUT = (\n \"Breaking News: In an astonishing turn of events, the small town of Willow Creek has been taken by storm as \"\n \"local resident Jerry Thompson's cat, Whiskers, performed what witnesses are calling a 'miraculous and gravity-\"\n \"defying leap.' Eyewitnesses report that Whiskers, an otherwise unremarkable tabby cat, jumped a record-breaking \"\n \"20 feet into the air to catch a fly. The event, which took place in Thompson's backyard, is now being investigated \"\n \"by scientists for potential breaches in the laws of physics. Local authorities are considering a town festival to \"\n \"celebrate what is being hailed as 'The Leap of the Century.'\"\n)\n\n@bentoml.service(resources={\"cpu\": \"2\"}, traffic={\"timeout\": 10})\nclass Summarization:\n def __init__(self) -> None:\n self.pipeline = pipeline('summarization')\n\n @bentoml.api\n def summarize(self, text: str = EXAMPLE_INPUT) -> str:\n result = self.pipeline(text)\n return result[0]['summary_text']\n
In BentoML
we organize our services in classes, where each class is a service that we want to serve. The two important parts of the code snippet are the @bentoml.service
and @bentoml.api
decorators.
The @bentoml.service
decorator is used to specify the resources that the service should use and in general how the service should be run. In this case we are specifying that the service should use 2 CPU cores and that the timeout for the service should be 10 seconds.
The @bentoml.api
decorator is used to specify the API that the service should expose. In this case we are specifying that the service should have an API called summarize
that takes a string as input and returns a string as output.
To serve the model using BentoML
we can execute the following command, which is very similar to the command we used to serve the model using FastAPI.
bentoml serve service:Summarization\n
"},{"location":"s7_deployment/ml_deployment/#exercises_1","title":"\u2754 Exercises","text":"In general, we advise looking through the docs for Bento ML if you need help with any of the exercises. We are going to assume that you have done the exercises on ONNX and we are therefore going to be using BentoML
to serve ONNX models. If you have not done this part, you can still follow along but you will need to use a PyTorch model instead of an ONNX model.
Install BentoML
pip install bentoml\n
Remember to add the dependency to your requirements.txt
file.
You are in principal free to serve any model you like, but we recommend to just use a torchvision model as in the ONNX exercises. Write your first service in BentoML
that serves a model of your choice. We recommend experimenting with providing input/output as tensors because bentoml supports this nativly. Secondly, write a client that can send a request to the service and print the result. Here we recommend using the build in bentoml.SyncHTTPClient.
The following implements a simple BentoML service that serves a ONNX resnet18 model. The service expects the both input and output to be numpy arrays.
bentoml_service.pyfrom __future__ import annotations\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierService:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model.onnx\")\n\n @bentoml.api\n def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n
The service can be served using the following command
bentoml serve bentoml_service:ImageClassifierService\n
To test that the service works the following client can be used
bentoml_client.pyimport bentoml\nimport numpy as np\nfrom PIL import Image\n\nif __name__ == \"__main__\":\n image = Image.open(\"my_cat.jpg\")\n image = image.resize((224, 224)) # Resize to match the minimum input size of the model\n image = np.array(image)\n image = np.transpose(image, (2, 0, 1)) # Change to CHW format\n image = np.expand_dims(image, axis=0) # Add batch dimension\n\n with bentoml.SyncHTTPClient(\"http://localhost:4040\") as client:\n resp = client.predict(image=image)\n print(resp)\n
We are now going to look at features very BentoML
really sets itself apart from FastAPI
. The first is adaptive batching. As you are hopefully aware, modern machine learning models can process multiple samples at the same time and in doing so increases the throughput of the model. When we train a model we often set a fixed batch size, however we cannot do that when serving the model because that would mean that we would have to wait for the batch to be full before we can process it. Adaptive batching simply refers to the process where we specify a maximum batch size and also a timeout. When either the batch is full or the timeout is reached, however many samples we have collected are sent to the model for processing. This can be a very powerful feature because it allows us to process samples as soon as they arrive, while still taking advantage of the increased throughput of batching.
The overall architecture of the adaptive batching feature in BentoML. The feature is implemented on the server side and mainly consist of dispatcher that is in charge of collecting requests and sending them to the model server when either the batch is full or a timeout is reached. Image credit
Look through the documentation on adaptive batching and add adaptive batching to your service from the previous exercise. Make sure your service works as expected by testing it with the client from the previous exercise.
Solution bentoml_service_adaptive_batching.pyfrom __future__ import annotations\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierService:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model.onnx\")\n\n @bentoml.api(\n batchable=True,\n batch_dim=(0, 0),\n max_batch_size=128,\n max_latency_ms=1000,\n )\n def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n
Try to measure the throughput of your model with and without adaptive batching. Assuming that you have completed the module on testing APIs and therefore are familiar with the locust
framework, we recommend that you write a simple locustfile and use the locust
command to measure the throughput of your model.
The following locust file can be used to measure the throughput of the model with and without adaptive
locustfile.pyimport numpy as np\nfrom locust import HttpUser, between, task\nfrom PIL import Image\n\n\ndef prepare_image():\n \"\"\"Load and preprocess the image as required.\"\"\"\n image = Image.open(\"my_cat.jpg\")\n image = image.resize((224, 224))\n image = np.array(image)\n image = np.transpose(image, (2, 0, 1)) # Convert to CHW format\n image = np.expand_dims(image, axis=0) # Add batch dimension\n # Convert to list format for JSON serialization\n return image.tolist()\n\n\nimage = prepare_image()\n\n\nclass BentoMLUser(HttpUser):\n \"\"\"Locust user class for sending prediction requests to the server.\"\"\"\n\n wait_time = between(1, 2)\n\n @task\n def send_prediction_request(self):\n \"\"\"Send a prediction request to the server.\"\"\"\n payload = {\"image\": image} # Package the image as JSON\n self.client.post(\"/predict\", json=payload, headers={\"Content-Type\": \"application/json\"})\n
and then the following command can be used to measure the throughput of the model
locust -f locustfile_bentoml.py --host http://localhost:4040 --headless -u 50 -t 60s\n
You should hopefully see that the throughput of the model is higher when adaptive batching is enabled, but the speedup is largely dependent on the model you are running, the configuration of the adaptive batching and the hardware you are running on.
On my laptop I saw about a 1.5 - 2x speedup when adaptive batching was enabled.
(Optional, requires GPU) Look through the documentation for inference on GPU and add this to your service. Check that your service works as expected by testing it with the client from the previous exercise and make sure you are seeing a speedup when running on the GPU.
SolutionA simple change to the bentoml.service
decorator is all that is needed to run the model on the GPU.
import bentoml\nimport torch\n\n\n@bentoml.service(resources={\"gpu\": 1})\nclass MyService:\n    def __init__(self) -> None:\n        self.model = torch.load(\"model.pth\").to(\"cuda:0\")\n
Another way to speed up inference is to simply use multiple workers. This duplicates the server over multiple processes, taking advantage of modern multi-core CPUs. This is similar to running the uvicorn
command with the --workers
flag for FastAPI applications. Implement multiple workers in your service and verify that it works as expected with the client from the previous exercise. Also check that you see a speedup when running with multiple workers.
Multiple workers can be added to the bentoml.service
decorator as shown below.
@bentoml.service(workers=4)\nclass MyService:\n # Service implementation\n
Alternatively, you can set workers=\"cpu_count\"
to use all available CPU cores. The speedup depends on the model you are serving, the hardware you are running on and the number of workers you are using, but it should be higher than using a single worker.
In addition to increasing the throughput of your deployments, BentoML
can also help with ML applications that require some kind of composition of multiple models. It is very common in production setups to have multiple models that either run in sequence (the output of one model is the input to another) or run concurrently with their outputs combined afterwards.
BentoML
makes it easy to compose multiple models together.
Implement two services that run in sequence, i.e. the output of one service is used as the input of another service. As an example, you can implement some pre- or post-processing service that is used in conjunction with the model you implemented in the previous exercises.
SolutionThe following code snippet shows how to implement two services that run in sequence.
bentoml_service_composition.pyfrom __future__ import annotations\n\nfrom pathlib import Path\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\nfrom PIL import Image\n\n\n@bentoml.service\nclass ImagePreprocessorService:\n \"\"\"Image preprocessor service.\"\"\"\n\n @bentoml.api\n def preprocess(self, image_file: Path) -> np.ndarray:\n \"\"\"Preprocess the input image.\"\"\"\n image = Image.open(image_file)\n image = image.resize((224, 224))\n image = np.array(image)\n image = np.transpose(image, (2, 0, 1))\n return np.expand_dims(image, axis=0)\n\n\n@bentoml.service\nclass ImageClassifierService:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n preprocessing_service = bentoml.depends(ImagePreprocessorService)\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model.onnx\")\n\n @bentoml.api\n async def predict(self, image_file: Path) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n image = await self.preprocessing_service.to_async.preprocess(image_file)\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n
Implement three services, where two of them run concurrently and the outputs of both are combined in the third service to make a prediction. As an example, you can expand your previous service to serve two different models and then implement a third service that combines their outputs to make a prediction.
SolutionThe following code snippet shows how to implement a service that consists of two concurrent services. The example assumes that two models called model_a.onnx
and model_b.onnx
are available.
from __future__ import annotations\n\nimport asyncio\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierServiceModelA:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model_a.onnx\")\n\n @bentoml.api\n def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n\n\n@bentoml.service\nclass ImageClassifierServiceModelB:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model_b.onnx\")\n\n @bentoml.api\n def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n\n\n@bentoml.service\nclass ImageClassifierService:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n model_a = bentoml.depends(ImageClassifierServiceModelA)\n model_b = bentoml.depends(ImageClassifierServiceModelB)\n\n @bentoml.api\n async def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n result_a, result_b = await asyncio.gather(\n self.model_a.to_async.predict(image), self.model_b.to_async.predict(image)\n )\n return (result_a + result_b) / 2\n
(Optional) Implement a server that consists of both sequential and concurrent services.
Similar to deploying a FastAPI application to the cloud, deploying a BentoML
application to the cloud often requires you to first containerize the application. Because BentoML
is designed to be easy to use even for users not that familiar with Docker, it introduces the concept of a bentofile
. A bentofile
is a file that specifies how the container should be built. Below is an example of how a bentofile
could look.
service: 'service:Summarization'\nlabels:\n owner: bentoml-team\n project: gallery\ninclude:\n - '*.py'\npython:\n packages:\n - torch\n - transformers\n
which can then be used to build a bento
using the following command
bentoml build\n
A bento
is not a docker image, but it can be used to build a docker image with the following command
bentoml containerize summarization:latest\n
Can you figure out how the different parts of the bentofile
are used to build the docker image? Additionally, can you figure out from the source repository how the bentofile
is used to build the docker image?
The service
part specifies both what the container should be called and what service it should serve, e.g. the last statement in the corresponding dockerfile is CMD [\"bentoml\", \"serve\", \"service:Summarization\"]
. The labels
part is used to specify labels about the container, see this link for more info. The include
part corresponds to COPY
statements in the dockerfile and finally the python
part is used to specify what python packages should be installed in the container which corresponds to RUN pip install ...
in the dockerfile.
Regarding how the bentofile
is used to build the docker image, the bentoml
package contains a number of templates (written using the jinja2 templating language) that are used to generate the dockerfiles. The templates can be found here.
Take whatever service from the previous exercises and try to containerize it. You are free to either write a bentofile
or a dockerfile
to do this.
The following bentofile
can be used to containerize the very first service we implemented in this set of exercises.
service: 'bentoml_service:ImageClassifierService'\nlabels:\n owner: bentoml-team\n project: gallery\ninclude:\n- 'bentoml_service.py'\n- 'model.onnx'\npython:\n packages:\n - onnxruntime\n - numpy\n
The corresponding dockerfile would look something like this
FROM python:3.11-slim\nWORKDIR /bento\nCOPY bentoml_service.py .\nCOPY model.onnx .\nRUN pip install onnxruntime numpy bentoml\nCMD [\"bentoml\", \"serve\", \"bentoml_service:ImageClassifierService\"]\n
Deploy the container to GCP Run and test that it works.
SolutionThe following command can be used to deploy the container to GCP Run. We assume that you have already built the container and tagged it bentoml_service:latest
.
docker tag bentoml_service:latest \\\n <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest\ngcloud run deploy bentoml-service \\\n --image=<region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest \\\n --platform managed \\\n --port 3000 # default used by BentoML\n
where <project-id>
should be replaced with the ID of the project you are deploying to (and <region> and <repository-name> with your Artifact Registry region and repository name). The service should now be available at the URL that is printed in the terminal.
This completes the exercises on the BentoML
framework. If you want to dive deeper into this, we can recommend looking into their tasks feature for use cases with very long running times and their built-in model management feature, which unifies the way models are loaded, managed and served.
How would you export a scikit-learn
model to ONNX? What method is exported when you export a scikit-learn
model to ONNX?
It is possible to export a scikit-learn
model to ONNX using the sklearn-onnx
package. The following code snippet shows how to export a scikit-learn
model to ONNX.
import numpy as np\nfrom skl2onnx import to_onnx\nfrom sklearn.datasets import load_iris\nfrom sklearn.ensemble import RandomForestClassifier\n\n# the model must be fitted before it can be converted\nX, y = load_iris(return_X_y=True)\nmodel = RandomForestClassifier(n_estimators=2)\nmodel.fit(X, y)\n\ndummy_input = np.random.randn(1, 4).astype(np.float32)\nonx = to_onnx(model, dummy_input)\nwith open(\"model.onnx\", \"wb\") as f:\n    f.write(onx.SerializeToString())\n
The method that is exported when you export a scikit-learn
model to ONNX is the predict
method.
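If you want to check this yourself, a quick sketch (assuming the model.onnx file produced above) is to load the exported model with onnxruntime and inspect its inputs and outputs; the exact output names depend on the skl2onnx version:
import numpy as np\nfrom onnxruntime import InferenceSession\n\nsess = InferenceSession(\"model.onnx\")\nprint([o.name for o in sess.get_outputs()])  # inspect which outputs the exported graph exposes\ndummy_input = np.random.randn(1, 4).astype(np.float32)\npred = sess.run(None, {sess.get_inputs()[0].name: dummy_input})\nprint(pred[0])  # predicted class labels, i.e. what model.predict would return\n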
In your own words, describe what the concept of a computational graph means.
SolutionA computational graph is a way to represent the mathematical operations that are performed in a model. It is essentially a graph where the nodes are the operations and the edges are the data that is passed between them. The computational graph normally represents the forward pass of the model and is the reason that we can easily backpropagate through the model to train it, because the graph contains all the necessary information to calculate the gradients of the model.
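As a small illustration (a sketch, not part of the exercises) of how such a graph is built on the fly, consider the following PyTorch snippet:
import torch\n\nx = torch.tensor(2.0)\nw = torch.tensor(3.0, requires_grad=True)\nb = torch.tensor(1.0, requires_grad=True)\ny = (x * w + b) ** 2  # graph nodes: mul, add, pow; edges: the intermediate tensors\ny.backward()  # traverse the graph in reverse to compute gradients\nprint(w.grad, b.grad)  # dy/dw = 2*(x*w+b)*x = 28, dy/db = 2*(x*w+b) = 14\n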
In your own words, explain why fusing operations together in the computational graph often leads to better performance?
SolutionEach time we want to do a computation, the data needs to be loaded from memory into the CPU/GPU. This is a slow process and the more operations we have, the more times we need to load the data. By fusing operations together, we can reduce the number of times we need to load the data, because we can do multiple operations on the same data before we need to load new data.
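To make this concrete, here is a small hedged sketch using torch.compile (available from PyTorch 2.0), which among other things tries to fuse elementwise operations into a single kernel:
import torch\n\nx = torch.randn(1000, 1000)\n\ndef f(t: torch.Tensor) -> torch.Tensor:\n    # three elementwise operations; executed eagerly, each one reads and writes memory separately\n    return torch.relu(t * 2.0 + 1.0)\n\nf_compiled = torch.compile(f)  # the compiler can fuse the multiply, add and relu into one kernel\ny = f_compiled(x)\n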
This ends the module on tools specifically designed for serving machine learning models. As stated in the beginning of the module, there are a lot of different tools that can be used to serve machine learning models and the choice of tool often depends on the specific use case. In general, we recommend that whenever you want to serve a machine learning model, you should try out a few different frameworks and see which one fits your use case the best.
"},{"location":"s7_deployment/testing_apis/","title":"M24 - API Testing","text":""},{"location":"s7_deployment/testing_apis/#api-testing","title":"API testing","text":"Core Module
API testing, similar to unit testing, is a type of software testing that involves testing the application programming interface (API) directly to ensure it meets requirements for functionality, reliability, performance, and security. The core difference from the unit testing we have been implementing until now is that instead of testing the individual functions, we are testing the entire API as a whole. API testing is therefore a form of integration testing. Additionally, another difference is that we need to simulate API calls that should be as similar as possible to the ones that will be made by the users of the API.
There are in general two things that we want to test when we are working with APIs: that the API functions as expected and that it performs well under load.
In this module, we go over how to do each of them.
"},{"location":"s7_deployment/testing_apis/#testing-for-functionality","title":"Testing for functionality","text":"Similar to when we wrote unit tests for our code back in this module we can also write tests for our API that checks that our code does what it is supposed to do e.g. by using assert
statements. As always we recommend implementing the tests in a separate folder called tests
, but we recommend that you add further subfolders to separate the different types of tests. For example, for the type of machine learning projects and APIs we have been working with in this course:
my_project\n|-- src/\n| |-- train.py\n| |-- data.py\n| |-- app.py\n|-- tests/\n| |-- unittests/\n| | |-- test_train.py\n| | |-- test_data.py\n| |-- integrationtests/\n| | |-- test_apis.py\n
"},{"location":"s7_deployment/testing_apis/#exercises","title":"\u2754 Exercises","text":"In these exercises, we are going to assume that we want to test an API written in FastAPI (see this module). If the API is written in a different framework then how to write the tests may have to change.
Start by installing httpx which is the client we are going to use during testing:
pip install httpx\n
Remember to add it to your requirements.txt
file.
If you have already done the module on unittesting then you should already have a tests/
folder. If not then create one. Inside the tests/
folder create a new folder called integrationtests/
. Inside the integrationtests/
folder create a file called test_apis.py
and write the following code:
from fastapi.testclient import TestClient\nfrom app.main import app\nclient = TestClient(app)\n
this code will create a client that can be used to send requests to the API. The app
variable is the FastAPI application that we want to test.
Now, you can write tests that check that the API works as intended, much like you would write unit tests. For example, if you have a root endpoint that just returns a simple welcome message you could write a test like this:
def test_read_root():\n    \"\"\"Check that the root endpoint returns the expected welcome message.\"\"\"\n    response = client.get(\"/\")\n    assert response.status_code == 200\n    assert response.json() == {\"message\": \"Welcome to the MNIST model inference API!\"}\n
Make sure to always assert
that the status code is what you expect and that the response is what you expect. Add such tests for all the endpoints in your API.
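As an example of what a test for a non-trivial endpoint could look like, here is a sketch for a hypothetical /predict endpoint that accepts a JSON payload (adjust the route and payload to match your own API):
def test_predict():\n    \"\"\"Check that the (hypothetical) predict endpoint accepts a payload and returns a prediction.\"\"\"\n    payload = {\"data\": [0.1, 0.2, 0.3, 0.4]}  # assumed input format\n    response = client.post(\"/predict\", json=payload)\n    assert response.status_code == 200\n    assert \"prediction\" in response.json()\n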
If you have an application with lifespan events e.g. you have implemented the lifespan
function in your FastAPI application, you need to instead use the TestClient
in a with
statement. This is because the TestClient
only runs the application's startup and shutdown (lifespan) events when it is used as a context manager. Here is an example:
def test_read_root():\n    \"\"\"Check the root endpoint with lifespan events enabled.\"\"\"\n    with TestClient(app) as client:\n        response = client.get(\"/\")\n        assert response.status_code == 200\n        assert response.json() == {\"message\": \"Welcome to the MNIST model inference API!\"}\n
To run the tests, you can use the following command:
pytest tests/integrationtests/test_apis.py\n
Make sure that all your tests pass.
The next type of testing we are going to implement for our application is load testing, which is a kind of performance testing. The goal of load testing is to determine how an application behaves under both normal and peak conditions. The purpose is to identify the maximum operating capacity of an application as well as any bottlenecks and to determine which element is causing degradation.
Before we get started on the exercises, we recommend that you define an environment variable that contains the endpoint of your API, since we need the API to be running to be able to test it. To begin with, you can just run the API locally; in a terminal window, run the following command:
uvicorn app.main:app --reload\n
by default the API will be running on http://localhost:8000
which we can then define as an environment variable:
set MYENDPOINT=http://localhost:8000\n
export MYENDPOINT=http://localhost:8000\n
However, the end goal is to test an API you have deployed in the cloud. If you have used Google Cloud Run to deploy your API then you can get the endpoint by going to the UI and looking at the service details:
The endpoint can be seen in the top center. It always starts with `https://` followed by a random string and then `.a.run.app`. However, we can also use the gcloud
command to get the endpoint:
for /f \"delims=\" %i in ^\n('gcloud run services describe <name> --region=<region> --format=\"value(status.url)\"') do set MYENDPOINT=%i\n
export MYENDPOINT=$(gcloud run services describe <name> --region=<region> --format=\"value(status.url)\")\n
where you need to define <name>
and <region>
with the name of your service and the region it is deployed in.
For the exercises, we are going to use the locust framework for load testing (the name is a reference to a locust being a swarm of bugs invading your application). It is a Python framework that allows you to write tests that simulate many users interacting with your application. It is very easy to get started with and it is very easy to integrate with your CI/CD pipeline.
Install locust
pip install locust\n
Remember to add it to your requirements.txt
file.
Make sure you have written an API that you can test. Otherwise, for simplicity, you can just use this simple example
Simple hello world FastAPI example
model.pyfrom fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/\")\ndef read_root():\n \"\"\"Root endpoint.\"\"\"\n return {\"Hello\": \"World\"}\n\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n \"\"\"Get an item by id.\"\"\"\n return {\"item_id\": item_id}\n
Add a new folder to your tests/
folder called performancetests
and inside it create a file called locustfile.py
. To that file, you need to add the appropriate code to simulate the users that you want to test. You can read more about how to write a locustfile.py
here.
Here we provide a solution to the above simple example:
locustfile.pyimport random\n\nfrom locust import HttpUser, between, task\n\n\nclass MyUser(HttpUser):\n \"\"\"A simple Locust user class that defines the tasks to be performed by the users.\"\"\"\n\n wait_time = between(1, 2)\n\n @task\n def get_root(self) -> None:\n \"\"\"A task that simulates a user visiting the root URL of the FastAPI app.\"\"\"\n self.client.get(\"/\")\n\n @task(3)\n def get_item(self) -> None:\n \"\"\"A task that simulates a user visiting a random item URL of the FastAPI app.\"\"\"\n item_id = random.randint(1, 10)\n self.client.get(f\"/items/{item_id}\")\n
Then try to run the locust
command:
locust -f tests/performancetests/locustfile.py\n
and then navigate to http://localhost:8089 in your web browser. You should see a page that looks similar to the top of this figure.
Here you can define the number of users you want to simulate and how many users should be spawned per second. Finally, you can define which endpoint you want to test. When you are ready, press the Start
button.
Afterward, you should see the results of the test in the web browser. Answer the following questions:
Maybe of more use to us is running locust in the terminal. To do this you can run the following command:
WindowsMac/Linuxlocust -f tests/performancetests/locustfile.py \\\n --headless --users 10 --spawn-rate 1 --run-time 1m --host %MYENDPOINT%\n
locust -f tests/performancetests/locustfile.py \\\n --headless --users 10 --spawn-rate 1 --run-time 1m --host $MYENDPOINT\n
this will run the test with 10 users that are spawned at a rate of 1 per second for 1 minute.
(Optional) A good use case for load testing in our case is to test that our API can handle a load right after it has been deployed. To do this we need to add appropriate steps to our CI/CD pipeline. Try adding locust to an existing or new workflow file in your .github/workflows/
folder, such that it runs after the deployment step.
The solution here expects that a service called production-model
has been deployed to Google Cloud Run. Then the following steps can be added to a workflow file, to first authenticate with Google Cloud, extract the relevant URL, and then run the load test:
- name: Auth with GCP\n uses: google-github-actions/auth@v2\n with:\n credentials_json: ${{ secrets.GCP_SA_KEY }}\n\n- name: Set up Cloud SDK\n uses: google-github-actions/setup-gcloud@v2\n\n- name: Extract deployed model URL\n run: |\n DEPLOYED_MODEL_URL=$(gcloud run services describe production-model \\\n --region=europe-west1 \\\n --format='value(status.url)')\n echo \"DEPLOYED_MODEL_URL=$DEPLOYED_MODEL_URL\" >> $GITHUB_ENV\n\n- name: Run load test on deployed model\n env:\n DEPLOYED_MODEL_URL: ${{ env.DEPLOYED_MODEL_URL }}\n run: |\n locust -f tests/performance/locustfile.py \\\n --headless -u 100 -r 10 --run-time 10m --host=$DEPLOYED_MODEL_URL --csv=/locust/results\n\n- name: Upload locust results\n uses: actions/upload-artifact@v4\n with:\n name: locust-results\n path: /locust\n
The results can afterwards be downloaded from the artifacts tab in the GitHub UI.
In the locust
framework, what does the @task
decorator do and what does @task(3)
mean?
The @task
decorator is used to define a task that a user can perform. The @task(3)
decorator does the same but assigns the task a weight of 3, making it three times more likely to be picked than a task with the default weight of 1.
In the locust
framework, what does the wait_time
attribute do?
The wait_time
attribute is used to define how long a user should wait between tasks. It can either be a fixed number or a random number between two values.
from locust import HttpUser, task, between, constant\n\nclass MyUser(HttpUser):\n wait_time = between(5, 9)\n # or\n wait_time = constant(5)\n
Load testing can give numbers on average response time, 99th percentile response time, and requests per second. What do these numbers tell us about the user experience of the API?
SolutionThe average response time and 99th percentile response time are both measures of how \"snappy\" the API feels to the user. While the average response time is normally considered the most important, the 99th percentile response time is also important as it tells us whether a small number of users are experiencing very slow response times. The requests-per-second number tells us how many users the API can handle at the same time. If this number is too low it can lead to users experiencing slow response times and may indicate that something is wrong with the API.
Slides
Learn how to detect data drifting using the evidently
framework
M27: Data Drifting
Learn how to setup a prometheus monitoring system for your application
M28: System Monitoring
We have now reached the end of our machine-learning pipeline. We have successfully developed, trained and deployed a machine learning model. However, the question then becomes: can you trust that your newly deployed model still works as expected after 1 day without you intervening? What about 1 month? What about 1 year?
There may be corner cases where an ML model keeps working as expected, but the vast majority of ML models will perform worse over time because they are not generalizing well enough. For example, assume you have just deployed an application that classifies images from phones when suddenly a new phone comes out with a new kind of sensor that takes images that either have a very weird aspect ratio or something else your model is not robust towards. There is nothing wrong with this: you can essentially just retrain your model on new data that accounts for this corner case; however, you need a mechanism that informs you when this happens.
This is where monitoring comes into play. Monitoring practices are in charge of collecting information about your application in a format that can be analyzed and acted upon. Monitoring is essential to securing the longevity of your applications.
As with many other sub-fields within MLOps, we can divide monitoring into classic monitoring and ML-specific monitoring. Classic monitoring (known from classic DevOps) is often about
All of these are basic pieces of information you are interested in regardless of what type of application you are deploying. In addition, there is machine-learning-related monitoring that especially relates to data. Take the example above with the new phone: this is what we would in general consider a data drifting problem, i.e. the data you are doing inference on has drifted away from the distribution of data your model was trained on. Such monitoring problems are unique to machine learning applications and need to be handled separately.
We are in this session going to see examples of both kinds of monitoring.
Learning objectives
The learning objectives of this session are:
evidently
framework
Data drifting is one of the core reasons that model accuracy degrades over time in production. For machine learning models, data drift is the change in model input data that leads to model performance degradation. In practical terms, this means that the model is receiving input that is outside the scope of what it was trained on, as seen in the figure below, which shows the underlying distribution of a particular feature slowly increasing in value over two years.
Image credit
In some cases it may be that normalizing a feature in a better way lets your model generalize better, but this is not always the case. The reason for such a drift is commonly some external factor that you essentially have no control over. That really only leaves you with one option: retrain your model on the newly received input features and deploy that model to production. This process is probably going to repeat over the lifetime of your application if you want to keep it up-to-date with the real world.
Image credit
We have now come up with a solution to the data drift problem, but there is one important detail that we have not taken care of: when should we actually trigger the retraining? We do not want to wait around for our model performance to degrade, so we need tools that can detect when we are seeing a drift in our data.
"},{"location":"s8_monitoring/data_drifting/#exercises","title":"\u2754 Exercises","text":"For these exercises we are going to use the framework Evidently developed by EvidentlyAI. Evidently currently supports both detection for both regression and classification models. The exercises are in large taken from here and in general we recommend if you are in doubt about an exercise to look at the docs for API and examples (their documentation can be a bit lacking sometimes, so you may also have to dive into the source code).
Additionally, we want to stress that data drift detection, concept drift detection etc. is still an active field of research, and therefore multiple frameworks exist for doing this kind of detection. In addition to Evidently, we can also mention NannyML, WhyLogs and deepcheck.
Start by installing Evidently
pip install evidently\n
you will also need scikit-learn
and pandas
installed if you do not already have them.
Hopefully you have already gone through session S7 on deployment. As part of the deployment to GCP Functions you should have developed an application that can classify the iris dataset, based on a model trained by this script. We are going to convert this into a FastAPI application for our purposes here:
Convert your GCP function into a FastAPI application. The appropriate curl
command should look something like this:
curl -X 'POST' \\\n 'http://127.0.0.1:8000/iris_v1/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0' \\\n -H 'accept: application/json' \\\n -d ''\n
and the response body should look like this:
{\n \"prediction\": \"Iris-Setosa\",\n \"prediction_int\": 0\n}\n
We have implemented a solution in this file (called v1) if you need help.
Next we are going to add some functionality to our application: the user input should be saved to a database whenever our application is called. However, to not slow down the response to the user, we want to implement this as a background task. A background task is a function that is executed after the user has received their response. Implement a background task that saves the user input to a database implemented as a simple .csv
file. You can read more about background tasks here. The header of the database should look something like this:
time, sepal_length, sepal_width, petal_length, petal_width, prediction\n2022-12-28 17:24:34.045649, 1.0, 1.0, 1.0, 1.0, 1\n2022-12-28 17:24:44.026432, 2.0, 2.0, 2.0, 2.0, 1\n...\n
Thus both the input, the timestamp and the predicted value should be saved. We have implemented a solution in this file (called v2) if you need help, and a rough sketch is shown below.
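A rough sketch of what this could look like is given below; the endpoint name and the placeholder prediction are assumptions for illustration, not the course's reference implementation:
from datetime import datetime\n\nfrom fastapi import BackgroundTasks, FastAPI\n\napp = FastAPI()\n\n\ndef add_to_database(now: str, sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, prediction: int) -> None:\n    \"\"\"Append a single prediction to the csv database.\"\"\"\n    with open(\"prediction_database.csv\", \"a\") as file:\n        file.write(f\"{now}, {sepal_length}, {sepal_width}, {petal_length}, {petal_width}, {prediction}\\n\")\n\n\n@app.post(\"/iris_v2/\")\nasync def iris_inference(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, background_tasks: BackgroundTasks):\n    \"\"\"Run inference and save the input to the database as a background task.\"\"\"\n    prediction = 0  # placeholder for the actual model call\n    background_tasks.add_task(add_to_database, str(datetime.now()), sepal_length, sepal_width, petal_length, petal_width, prediction)\n    return {\"prediction\": \"Iris-Setosa\", \"prediction_int\": prediction}\n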
Call your API a number of times to generate some dummy data in the database.
Create a new data_drift.py
file where we are going to implement the data drifting detection and reporting. Start by adding both the real iris data and your generated dummy data as pandas dataframes.
import pandas as pd\nfrom sklearn import datasets\nreference_data = datasets.load_iris(as_frame=True).frame\ncurrent_data = pd.read_csv('prediction_database.csv')\n
if done correctly you will most likely end up with two dataframes that look like
# reference_data\nsepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target\n0 5.1 3.5 1.4 0.2 0\n1 4.9 3.0 1.4 0.2 0\n...\n148 6.2 3.4 5.4 2.3 2\n149 5.9 3.0 5.1 1.8 2\n[150 rows x 5 columns]\n\n# current_data\ntime sepal_length sepal_width petal_length petal_width prediction\n2022-12-28 17:24:34.045649 1.0 1.0 1.0 1.0 1\n...\n2022-12-28 17:24:34.045649 1.0 1.0 1.0 1.0 1\n[10 rows x 5 columns]\n
Standardize the dataframes such that they have the same column names and drop the time column from the current_data
dataframe.
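A minimal sketch of this standardization step, assuming the column layout shown above, could be:
current_data = current_data.drop(columns=current_data.columns[0])  # drop the time column\ncurrent_data.columns = reference_data.columns  # align the names: sepal length (cm), ..., target\n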
We are now ready to generate some reports about data drifting:
Try executing the following code:
from evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference_data, current_data=current_data)\nreport.save_html('report.html')\n
and open the generated .html
page. What does it say about your data? Has it drifted? Make sure to poke around to understand what the different plots are actually showing.
Data drifting is not the only kind of reporting Evidently can make. We can also get reports on the data quality. Try first adding a few NaN
values to your reference data. Secondly, try changing the report to
from evidently.metric_preset import DataDriftPreset, DataQualityPreset\nreport = Report(metrics=[DataDriftPreset(), DataQualityPreset()])\n
and re-run the report. Checkout the newly generated report. Again go over the generated plots and make sure that it picked up on the missing values you just added.
The final report preset we will look at is the TargetDriftPreset
. Target drift means that our model is over- or under-predicting certain classes, or in general terms that the distribution of predicted values differs from the true distribution of targets. Try adding the TargetDriftPreset
to the Report
class and re-run the analysis and inspect the result. Have your targets drifted?
Evidently reports are meant for debugging, exploration and reporting of results. However, as we stated in the beginning, what we are actually interested in is methods for automatically detecting when our data is beginning to drift. For this we will need to look at Tests and TestSuites:
Let's start with a simple test that checks if there are any missing values in our dataset:
from evidently.test_suite import TestSuite\nfrom evidently.tests import TestNumberOfMissingValues\ndata_test = TestSuite(tests=[TestNumberOfMissingValues()])\ndata_test.run(reference_data=reference_data, current_data=current_data)\n
again we could run data_test.save_html
to get a nice view of the results (feel free to try it out) but additionally we can also call data_test.as_dict()
method that will give a dict with the test results. Which dictionary key contains the information on whether all tests have passed or not?
Take a look at this colab notebook that contains all tests implemented in Evidently. Pick 5 tests of your choice, where at least 1 fails by default and implement them as a TestSuite
. Then try changing the arguments of the tests so they better fit your use case and get them all passing.
(Optional) When doing monitoring in practice, we are not always interested in running the analysis on all data collected from our API, but maybe only on the last N
entries or maybe just from the last hour of observations. Since we are already logging the timestamps of when our API is called we can use that for filtering. Implement a simple filter that either takes an integer n
and returns the last n
entries in our database or some datetime t
that filters away observations earlier than this.
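A simple sketch of such a filter, assuming the csv database with a time column from earlier, could look like this:
from __future__ import annotations\n\nfrom datetime import datetime\n\nimport pandas as pd\n\n\ndef filter_predictions(df: pd.DataFrame, n: int | None = None, t: datetime | None = None) -> pd.DataFrame:\n    \"\"\"Return either the last n entries or all entries newer than timestamp t.\"\"\"\n    df = df.copy()\n    df[\"time\"] = pd.to_datetime(df[\"time\"])\n    if t is not None:\n        df = df[df[\"time\"] > t]\n    if n is not None:\n        df = df.tail(n)\n    return df\n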
Evidently by default only supports structured data, i.e. tabular data (so does nearly every other framework). Thus, the question becomes how we can extend this to unstructured data such as images or text. The solution is to extract structured features from the data, which we can then run the analysis on.
(Optional) For images the simple solution would be to flatten the images and consider each pixel a feature, however this does not work in practice because changes in individual pixels do not really tell us anything about the image. Instead we should derive some features, such as:
These are all numbers that can make up a feature vector for an image. Try doing this yourself, for example by extracting such features from the MNIST and FashionMNIST datasets, and check if you can detect a drift between the two sets (a sketch is shown below).
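Below is a sketch of how such feature extraction could look; brightness, contrast and a crude sharpness measure are just examples of features one could choose:
import numpy as np\nimport pandas as pd\nfrom torchvision import datasets\n\n\ndef image_features(img) -> list:\n    \"\"\"Extract a few structured features from a PIL image.\"\"\"\n    arr = np.asarray(img, dtype=np.float32)\n    brightness = arr.mean()  # average intensity\n    contrast = arr.std()  # spread of intensities\n    sharpness = np.abs(np.diff(arr, axis=0)).mean()  # mean vertical gradient as a crude sharpness proxy\n    return [brightness, contrast, sharpness]\n\n\ncolumns = [\"brightness\", \"contrast\", \"sharpness\"]\nmnist = datasets.MNIST(\"data\", download=True, train=False)\nfashion = datasets.FashionMNIST(\"data\", download=True, train=False)\nreference_data = pd.DataFrame([image_features(img) for img, _ in mnist], columns=columns)\ncurrent_data = pd.DataFrame([image_features(img) for img, _ in fashion], columns=columns)\n# these two dataframes can now be compared with an Evidently report as in the earlier exercises\n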
(Optional) For text a common approach is to extract some higher-level embedding such as the classical GloVe embedding. Try following this tutorial to understand how drift detection is done on text.
Let's instead take a deep-learning-based approach to doing this. Let's consider the CLIP model, which is trained to connect images and text. For our purpose this is perfect because we can use the model to get abstract feature embeddings for both images and text:
from PIL import Image\nimport requests\n# requires transformers package: pip install transformers\nfrom transformers import CLIPProcessor, CLIPModel\n\nmodel = CLIPModel.from_pretrained(\"openai/clip-vit-base-patch32\")\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n\nurl = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# set either text=None or images=None when only the other is needed\ninputs = processor(text=[\"a photo of a cat\", \"a photo of a dog\"], images=image, return_tensors=\"pt\", padding=True)\n\nimg_features = model.get_image_features(inputs['pixel_values'])\ntext_features = model.get_text_features(inputs['input_ids'], inputs['attention_mask'])\n
Both img_features
and text_features
are in this case a (512,)
abstract feature embedding that should be able to tell us something about our data distribution. Try using this method to extract features from two different datasets, like CIFAR10 and SVHN if you want to work with vision, or IMDB movie reviews and Amazon reviews for text. After extracting the features, try running some of the data distribution testing you just learned about.
(Optional) If we have multiple applications and want to run monitoring for each application, we often also want the monitoring itself to be a deployed application (that only we can access). Implement a /monitoring/
endpoint that does all the reporting we just went through such that you have two endpoints:
http://127.0.0.1:8000/iris_infer/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0 # user endpoint\nhttp://127.0.0.1:8000/iris_monitoring/ # monitoring endpoint\n
Our monitoring endpoint should return an HTML page showing either an Evidently report or a test suite. Try implementing this endpoint; a sketch is shown below. We have implemented a solution in this file if you need help with how to return an HTML page from a FastAPI application.
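A sketch of such a monitoring endpoint, assuming the iris reference data and the prediction database from the earlier exercises, could look like this:
import pandas as pd\nfrom evidently.metric_preset import DataDriftPreset\nfrom evidently.report import Report\nfrom fastapi import FastAPI\nfrom fastapi.responses import HTMLResponse\nfrom sklearn import datasets\n\napp = FastAPI()\n\n\n@app.get(\"/iris_monitoring/\", response_class=HTMLResponse)\nasync def iris_monitoring():\n    \"\"\"Generate and return an Evidently data drift report as an HTML page.\"\"\"\n    reference_data = datasets.load_iris(as_frame=True).frame\n    current_data = pd.read_csv(\"prediction_database.csv\")\n    current_data = current_data.drop(columns=current_data.columns[0])  # drop the time column\n    current_data.columns = reference_data.columns\n\n    report = Report(metrics=[DataDriftPreset()])\n    report.run(reference_data=reference_data, current_data=current_data)\n    report.save_html(\"monitoring.html\")\n    with open(\"monitoring.html\") as f:\n        return HTMLResponse(content=f.read())\n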
As a final exercise, we recommend that you try implementing this to run directly in the cloud. You will need to implement this in a container, e.g. as a GCP Run service, because the data gathering from the endpoint should still be implemented as a background task. For this to work you will need to change the following:
Instead of saving the input to a local file, you should store it either in a GCP bucket or in a BigQuery table (the latter is a better solution, but also out of scope for this course)
You can either run the data analysis locally by just pulling the predictions and training data from cloud storage, or alternatively deploy the analysis as its own endpoint that can be invoked. For the latter option we recommend requiring authentication.
That ends the module on detection of data drifting, data quality etc. If this has not already been made clear, monitoring of machine learning applications is an extremely hard discipline because it is not clear cut when we should actually respond to a feature beginning to drift and when it is probably fine. What kind of rules should be implemented comes down to the individual application. Additionally, the tools presented here are in no way complete and are especially limited in one way: they only consider the marginal distribution of the data. Every analysis we have done has been on the distribution per feature (the marginal distribution), however as the image below shows, it is possible for data to have drifted to another distribution with the marginals being approximately the same.
There are methods such as Maximum Mean Discrepancy (MMD) tests that are able to do testing on multivariate distributions, which you are free to dive into. In this course we will simply always recommend considering multiple features when making decisions regarding your deployed applications.
"},{"location":"s8_monitoring/monitoring/","title":"M28 - System Monitoring","text":""},{"location":"s8_monitoring/monitoring/#monitoring","title":"Monitoring","text":"In this module we are going to look into more classical monitoring of applications. The key concept we are often working with here is called telemetry. Telemetry in general refer to any automatic measurement and wireless transmission of data from our application. It could be numbers such as:
In general there are three different kinds of telemetry we are interested in:
Name Description Example Purpose Metrics Metrics are quantitative measurements of the system. They are usually numbers that are aggregated over a period of time. E.g. the number of requests per minute. The number of requests per minute. Metrics are used to get an overview of the system. They are often used to create dashboards that can be used to get an overview of the system. Logs Logs are textual or structured records generated by applications. They provide a detailed account of events, errors, warnings, and informational messages that occur during the operation of the system. System logs, error logs Logs are essential for diagnosing issues, debugging, and auditing. They provide a detailed history of what happened in a system, making it easier to trace the root cause of problems and track the behavior of components over time. Traces Traces are detailed records of specific transactions or events as they move through a system. A trace typically includes information about the sequence of operations, timing, and dependencies between different components. Distributed tracing in microservices architecture Traces help in understanding the flow of a request or a transaction across different components. They are valuable for identifying bottlenecks, understanding latency, and troubleshooting issues related to the flow of data or control.We are mainly going to focus in this module on metrics.
"},{"location":"s8_monitoring/monitoring/#local-instrumentator","title":"Local instrumentator","text":"Before we look into the cloud lets at least conceptually understand how a given instance of a app can expose values that we may be interested in monitoring.
The standard framework for exposing metrics is called prometheus. Prometheus is a time series database that is designed to store metrics. It is also designed to be very easy to instrument applications with and it is designed to scale to large amounts of data. The way prometheus works is that it exposes a /metrics
endpoint that can be queried to get the current state of the metrics. The metrics are exposed in a format called prometheus text format.
Start by installing prometheus-fastapi-instrumentator
in python
pip install prometheus-fastapi-instrumentator\n
this will allow us to easily instrument our FastAPI application with prometheus.
Create a simple FastAPI application in a file called app.py
. You can reuse any application from the previous module on APIs. To that file now add the following code:
from prometheus_fastapi_instrumentator import Instrumentator\n\n# your app code here\n\nInstrumentator().instrument(app).expose(app)\n
This will instrument your application with prometheus and expose the metrics on the /metrics
endpoint.
Run the app using uvicorn
server. Make sure that the app exposes the endpoints you expect it too exposes, but make sure you also checkout the /metrics
endpoint.
The metric endpoint exposes multiple /metrics
. Metrics always looks like this:
# TYPE key <type>\nkey value\n
i.e. it is essentially a dictionary of key-value pairs with the added information of a <type>
. Look at this page on the different types prometheus metrics can have and try to understand the different metrics being exposed.
Look at the documentation for the prometheus-fastapi-instrumentator
and try to add at least one more metric to your application. Rerun the application and confirm that the new metric is being exposed.
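One way to do this (a sketch; the metric name and dummy endpoint are made up for illustration) is to define a custom metric with the underlying prometheus_client package, which is registered in the same default registry that the instrumentator exposes on /metrics:
from prometheus_client import Counter\n\n# assumes the FastAPI `app` and the Instrumentator from the previous step\nprediction_counter = Counter(\"my_app_predictions_total\", \"Number of predictions served.\")\n\n\n@app.get(\"/predict_dummy\")\ndef predict_dummy():\n    \"\"\"Dummy endpoint that increments the custom counter.\"\"\"\n    prediction_counter.inc()\n    return {\"prediction\": 0}\n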
Any cloud system with respect for itself will have some kind of monitoring system. GCP has a service called Monitoring that is designed to monitor all the different services. By default it will monitor a lot of metrics out of the box. However, the question is whether we want to monitor more than the default metrics. The complexity that comes with doing monitoring in the cloud is that we need more than one container: we need at least one container actually running the application and exposing the /metrics
endpoint, and then we need another container that collects the metrics from the first container and stores them in a database. To implement such a system of containers that need to talk to each other, we in general need a container orchestration system such as Kubernetes. This is out of scope for this course, but we can use a feature of Cloud Run
called sidecar containers
to achieve the same effect. A sidecar container is a container that is running alongside the main container and can be used to do things such as collecting metrics.
Overall we recommend that you just become familiar with the monitoring tab for your cloud run service (see image) above. Try to invoke your service a couple of times and see what happens to the metrics over time.
Try creating a service level objective (SLO). In short a SLO is a target for how well your application should be performing. Click the Create SLO
button and fill it out with what you consider to be a good SLO for your application.
(Optional) To expose our own metrics we need to setup a sidecar container. To do this follow the instructions here. We have setup a simple example that uses fastapi and prometheus that you can find here. After you have correctly setup the sidecar container you should be able to see the metrics in the monitoring tab.
A core problem within monitoring is alert systems. The alert system is in charge of sending out alerts to relevant people when some telemetry or metric we are tracking is not behaving as it should. Alert systems are a subjective choice of when and how many should be send out and in general should be proportional with how important to the of the metric/telemetry. We commonly run into what is referred to the goldielock problem where we want just the right amount of alerts however it is more often the case that we either have
Therefore, setting up proper alert systems can be as challenging as setting up the systems for actually the metrics we want to trigger alerts.
"},{"location":"s8_monitoring/monitoring/#exercises_2","title":"\u2754 Exercises","text":"We are in this exercise going to look at how we can setup automatic alerting such that we get an message every time one of our applications are not behaving as expected.
Go to the Monitoring
service. Then go to Alerting
tab.
Start by setting up an notification channel. A recommend setting up with an email.
Next lets create a policy. Clicking the Add Condition
should bring up a window as below. You are free to setup the condition as you want but the image is one way bo setup an alert that will react to the number of times an cloud function is invoked (actually it measures the amount of log entries from cloud functions).
After adding the condition, add the notification channel you created in one of the earlier steps. Remember to also add some documentation that should be send with the alert to better describe what the alert is actually doing.
When the alert is setup you need to trigger it. If you setup the condition as the image above you just need to invoke the cloud function many times. Here is a small code snippet that you can execute on your laptop to call a cloud function many time (you need to change the url and payload depending on your function):
import time\nimport requests\nurl = 'https://us-central1-dtumlops-335110.cloudfunctions.net/function-2'\npayload = {'message': 'Hello, General Kenobi'}\n\nfor _ in range(1000):\n r = requests.get(url, params=payload)\n
Make sure that you get the alert through the notification channel you setup.
Slides
Learn how to setup distributed data loading in your PyTorch application
M29: Distributed Data Loading
Learn how to do distributed training in PyTorch using pytorch-lightning
M30: Distributed Training
Learn how to do scalable inference in PyTorch
M31: Scalable Inference
This module is all about scaling the applications that we are building. We are here going to use a very narrow definition of scaling namely that we want our applications to run faster, however, one should note that in general scaling is a much broader term. There are many different ways to scale your applications and we are going to look at three of these related to different tasks in machine learning algorithms:
We are going to approach the term scaling from two different angles and both should result in your application running faster. The first approach is levering multiple devices, such as using multiple CPU cores or parallelizing training across multiple GPUs. The second approach is more analytical, where we are going to look at how we can design smaller/faster model architectures that run faster.
It should be noted that this module is specific to working with PyTorch applications. In particular, we are going to see how we can both improve base PyTorch code and how to utilize the PyTorch Lightning which we introduced in module M14 on boilerplate to improve the scaling of our applications. If your application is written using another framework we can guarantee that the same techniques in these modules transfer to that framework, but may require you to seek out how to specifically to it.
If you manage to complete all modules in this session, feel free to check out the extra module on scalable hyperparameter optimization.
Learning objectives
The learning objectives of this session are:
pytorch-lightning
Core Module
One way that deep learning fundamentally changed the way we think about data in machine learning is that more data is always better. This was very much not the case with more traditional machine learning algorithms (random forest, support vector machines etc.) where a plateau in performance was often reached for a certain amount of data and did not improve if more was added. However, as deep learning models have become deeper and deeper and thereby more and more data-hungry performance seems to be ever increasing or at least not reaching a plateau in the same way as for traditional machine learning.
Image creditAs we are trying to feed more and more data into our models, the obvious first question to ask is how to do this efficiently. As a general rule of thumb, we want the performance bottleneck to be the forward/backward e.g. the actual computation in our neural network and not the data loading. By bottleneck, we here refer to the part of our pipeline that is restricting how fast we can process data. If data loading is our bottleneck, then our compute device can sit idle while waiting for data to arrive, which is both inefficient and costly. For example, if you are using a cloud provider for training deep learning models, you are paying by the hour per device, and thus not using them fully can be costly in the long run.
In the first set of exercises, we are therefore going to focus on distributed data loading i.e. how to load data in parallel to make sure that we always have data ready for our compute devices. We are in the following going to look at what is going on behind the scenes when we use PyTorch to parallelize data loading.
"},{"location":"s9_scalable_applications/data_loading/#a-closer-look-at-data-loading","title":"A closer look at Data loading","text":"Before we talk distributed applications it is important to understand the physical layout of a standard CPU (the brain of your computer).
Most modern CPUs is a single chip that consists of multiple cores. Each core can further be divided into threads. In most laptops, the core count is 4 and commonly 2 threads per code. This means that the common laptop has 8 threads. The number of threads a compute unit has is important because that directly corresponds to the number of parallel operations that can be executed i.e. one per thread. In a Python terminal you should be able to get the number of cores in your machine by writing (try it):
import multiprocessing\ncores = multiprocessing.cpu_count()\nprint(f\"Number of cores: {cores}, Number of threads: {2*cores}\")\n
A distributed application is in general any kind of application that parallelizes some or all of its workload. We are in these exercises only focusing on distributed data loading, which happens primarily only on the CPU. In PyTorch
it is easy to parallelize data loading if you are using their dataset/data loader interface:
from torch.utils.data import Dataset, DataLoader\nclass MyDataset(Dataset):\n def __init__(self, ...):\n # whatever logic is needed to init the data set\n self.data = ...\n\n def __getitem__(self, idx):\n # return one item\n return self.data[idx]\n\ndataset = MyDataset()\ndataloader = Dataloader(\n dataset,\n batch_size=8,\n num_workers=4 # this is the number of threads we want to parallelize workload over\n)\n
Let's take a deep dive into what happens when we request a batch from our dataloader e.g. next(dataloader)
. First, we must understand that we have a thread that plays the role of the main and the remaining threads (in the above example we request 4) are called workers. When the dataloader is created, we create this structure and make sure that all threads have a copy of our dataset definition so each can call the __getitem__
method.
Then comes the actual part where we request a batch of data. Assume that we have a batch size of 8 and we do not do any shuffling. In this step, the master thread then distributes the list of requested data points ([0,1,2,3,4,5,6,7]
) to the four worker threads. With 8 indices and 4 workers, each worker will receive 2 indices.
Each worker thread then calls the __getitem__
method for all the indices it has received. When all processes are done, the loaded images data points gets sent back to the master thread and collected into a single structure/tensor.
Each arrow is corresponds to a communication between two threads, which is not a free operation. In total to get a single batch (not counting the initial startup cost) in this example we need to do 8 communication operations. This may seem like a small price to pay, but that may not be the case. If the processing time of __getitem__
is very low ( data is stored in memory, we just need to index to get it) then it does not make sense to use multiprocessing. The computational savings by doing the look-up operations in parallel are smaller than the communication cost there is between the main thread and the workers. Multiprocessing makes sense when the processing time of __getitem__
is high (data is probably stored on the hard drive).
It is this trade-off that we are going to investigate in the exercises.
"},{"location":"s9_scalable_applications/data_loading/#exercises","title":"\u2754 Exercises","text":"This exercise is intended to be done on the labeled faces in the wild (LFW) dataset. The dataset consists of images of famous people extracted from the internet. The dataset had been used to drive the field of facial verification, which you can read more about here. We are going to imagine that this dataset cannot fit in memory, and your job is therefore to construct a data pipeline that can be parallelized based on loading the raw data files (.jpg) at runtime.
Download the dataset and extract it to a folder. It does not matter if you choose the non-aligned or aligned version of the dataset.
We provide the lfw_dataset.py
file where we have started the process of defining a data class. Fill out the __init__
, __len__
and __getitem__
. Note that __getitem__
expects that you return a single img
which should be a torch.Tensor
. Loading should be done using PIL Image, as PIL
images are the default input format for torchvision for transforms (for data augmentation).
Make sure that the script runs without any additional arguments
python lfw_dataset.py\n
Visualize a single batch by filling out the codeblock after the first TODO right after defining the dataloader. The visualization should show when launching the script as
python lfw_dataset.py -visualize_batch\n
Hint: this tutorial.
Experiment how the number of workers influences the performance. We have already provide code that will pass over 100 batches from the dataset 5 times and calculate how long time it took, which you can play around with by calling
python lfw_dataset.py -get_timing -num_workers 1\n
Make a errorbar plot with number of workers along the x-axis and the timing along the y-axis. The errorbars should correspond to the standard deviation over the 5 runs. HINT: if it is taking too long to evaluate, measure the time over less batches (set the -batches_to_check
flag). Also if you are not seeing an improvement, try increasing the batch size (since data loading is parallelized per batch).
For certain machines like the Mac with M1 chipset it is necessary to set the multiprocessing_context
flag in the dataloder to \"fork\"
. This essentially tells the dataloader how the worker nodes should be created.
Retry the experiment where you change the data augmentation to be more complex:
lfw_trans = transforms.Compose([\n transforms.RandomAffine(5, (0.1, 0.1), (0.5, 2.0)),\n # add more transforms here\n transforms.ToTensor()\n])\n
by making the augmentation more computationally demanding, it should be easier to get a boost in performance when using multiple workers because the data augmentation is also executed in parallel.
(Optional, requires access to GPU) If your dataset fits in GPU memory it is beneficial to set the pin_memory
flag to True
. By setting this flag we are essentially telling PyTorch that they can lock the data in place in memory which will make the transfer between the host (CPU) and the device (GPU) faster.
This ends the module on distributed data loading in PyTorch. If you want to go into more details we highly recommend that you read this paper that goes into great detail on analyzing how data loading in PyTorch works and performance benchmarks.
"},{"location":"s9_scalable_applications/distributed_training/","title":"M30 - Distributed Training","text":""},{"location":"s9_scalable_applications/distributed_training/#distributed-training","title":"Distributed Training","text":"In this module we are going to look at distributed training. Distributed training is one of the key ingredients to all the awesome results that deep learning models are producing. For example: Alphafold the highly praised model from Deepmind that seems to have solved protein structure prediction, was trained in a distributed fashion for a few weeks. The training was done on 16 TPUv3s (specialized hardware), which is approximately equal to 100-200 modern GPUs. This means that training Alphafold without distributed training on a single GPU (probably not even possible) would take a couple of years to train! Therefore, it is simply impossible currently to train some of the state-of-the-art (SOTA) models within deep learning currently, without taking advantage of distributed training.
When we talk about distributed training, there are a number of different paradigms that we may use to parallelize our computations
In this module we are going to look at data parallel training, which is the original way of doing parallel training, and distributed data parallel training, which is an improved version of data parallel. If you want to know more about sharded training, which is the newest of the paradigms, you can read more about it in this blog post, which describes how sharded training can save over 60% of the memory used during training.
Finally, we want to note that for all the exercises in the module you are going to need a multi GPU setup. If you have not already gained access to multi GPU machines on GCP (see the quotas exercises in this module) you will need to find another way of running the exercises. For DTU Students I can recommend checking out this optional module on using the high performance cluster (HPC) where you can get access to multi GPU resources.
"},{"location":"s9_scalable_applications/distributed_training/#data-parallel","title":"Data parallel","text":"While data parallel today in general is seen as obsolete compared to distributed data parallel, we are still going to investigate it a bit since it offers the most simple form of distributed computations in deep learning pipeline.
The figure below shows both the forward and backward steps in the data parallel paradigm
The steps are the following:
Whenever we do a forward call, e.g. out=model(batch)
we take the batch and divide it equally between all devices. If we have a batch size of N
and M
devices each device will be sent N/M
datapoints.
Afterwards each device receives a copy of the model
e.g. a copy of the weights that currently parametrizes our neural network.
In this step we perform the actual forward pass in parallel. This is the step that actually helps us scale our training.
Finally we need to send back the output of each replicated model to the primary device.
Similar to the analysis we did of parallel data loading, we cannot always expect that this will actually take less time than doing the forward call on a single GPU. If we are parallelizing over M
devices, we essentially need to do 3xM
communication calls to send the batch, model and output between the devices. If the time saved by the parallel forward pass does not outweigh this overhead, the whole call will take longer.
In addition, we also have the backward pass to focus on
As the forward pass collected the outputs on the primary device, this is also where the loss is computed. Thus, the loss gradients are first calculated on the primary device
Next we scatter the gradient to all the workers
The workers then perform a parallel backward pass through their individual model
Finally, we reduce (sum) the gradients from all the workers on the main process such that we can do gradient descent.
One of the big downsides of using data parallel is that all the replicas are destroyed after each backward call. This means that we need to replicate our model and send it to the participating devices over and over again.
Even though it seems like implementing data parallel requires adding a lot of logic to your code, in PyTorch we can enable data parallel training very simply by wrapping our model in the nn.DataParallel class.
from torch import nn\nmodel = MyModelClass()\nmodel = nn.DataParallel(model, device_ids=[0, 1]) # data parallel on gpu 0 and 1\npreds = model(input) # same as usual\n
"},{"location":"s9_scalable_applications/distributed_training/#exercises","title":"\u2754 Exercises","text":"Please note that the exercise only makes sense if you have access to multiple GPUs.
Create a new script (call it data_parallel.py
) where you take a copy of model FashionCNN
from the fashion_mnist.py
script. Instantiate the model and wrap torch.nn.DataParallel
around it such that it can be executed in data parallel.
Try to run inference in parallel on multiple devices (pass a batch multiple times and time it) e.g.
import time\nstart = time.time()\nfor _ in range(n_reps):\n out = model(batch)\nend = time.time()\n
Does data parallel decrease the inference time? If no, can you explain why that may be? Try playing around with the batch size, and see if data parallel is more beneficial for larger batch sizes.
It should be clear that there is a huge disadvantage of using the data parallel paradigm to scale your applications: the model needs to be replicated on each pass (because the replicas are destroyed at the end), which requires a large transfer of data. This is the main problem that distributed data parallel tries to solve.
The key difference between distributed data parallel and data parallel is that we move the model update (the gradient step) to happen on each device in parallel instead of only on the main device. This has the consequence that we do not need to replicate and move the model on each step; instead we just keep a local version on each device that we keep updating. The full set of steps (as shown in the figure) is:
Initialize an exact copy of the model on each device
From disk (or memory) we start by loading data into a section of page-locked host memory per device. Page-locked memory is essentially a way to reserve a piece of the computer's memory for a specific transfer that is going to happen over and over again, in order to speed it up. The page-locked regions are loaded with non-overlapping data.
Transfer data from page-locked memory to each device in parallel
Perform forward pass in parallel
Do an all-reduce operation on the gradients. An all-reduce operation is a so-called all-to-all operation, meaning that all processes send their own gradients to all other processes and also receive the gradients from all other processes.
Reduce the combined gradient signal from all processes and update the individual model in parallel. Since all processes received the same gradient information, all models will still be in sync.
Thus, in distributed data parallel we end up doing only a single communication call between all processes, compared to all the communication going on in data parallel. While all-reduce is a more expensive operation than many of the other communication operations that we could do, because we only have to do a single one we gain a huge performance boost. Empirically, distributed data parallel tends to be 2-3 times faster than data parallel.
However, this performance increase does not come for free. Where we could implement data parallel in a single line in PyTorch, distributed data parallel is much more involved.
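To give an idea of what is involved, here is a minimal, self-contained sketch of a distributed data parallel training script; the toy model, dataset and hyperparameters are placeholders and not part of the provided exercise files:
import os\nimport torch\nimport torch.distributed as dist\nfrom torch.nn.parallel import DistributedDataParallel as DDP\nfrom torch.utils.data import DataLoader, DistributedSampler, TensorDataset\n\ndef main():\n    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE as environment variables\n    dist.init_process_group(backend=\"nccl\")\n    local_rank = int(os.environ[\"LOCAL_RANK\"])\n    torch.cuda.set_device(local_rank)\n\n    model = DDP(torch.nn.Linear(10, 2).to(local_rank), device_ids=[local_rank])\n    dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))\n    sampler = DistributedSampler(dataset)  # gives each process its own shard of the data\n    loader = DataLoader(dataset, batch_size=32, sampler=sampler)\n    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)\n\n    for epoch in range(2):\n        sampler.set_epoch(epoch)  # reshuffle the shards differently every epoch\n        for x, y in loader:\n            optimizer.zero_grad()\n            loss = torch.nn.functional.cross_entropy(model(x.to(local_rank)), y.to(local_rank))\n            loss.backward()  # gradients are all-reduced across processes here\n            optimizer.step()\n\n    dist.destroy_process_group()\n\nif __name__ == \"__main__\":\n    main()  # launch with e.g.: torchrun --nproc_per_node=2 ddp_sketch.py\n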
"},{"location":"s9_scalable_applications/distributed_training/#exercises_1","title":"\u2754 Exercises","text":"We have provided an example of how to do distributed data parallel training in PyTorch in the two files distributed_example.py
and distributed_example.sh
. Your objective is to get an understanding of the components in the script that are necessary to get this kind of distributed training to work. Try to answer the following questions (HINT: try to Google around):
What is the function of the DDP
wrapper?
What is the function of the DistributedSampler
?
Why is it necessary to call dist.barrier()
before passing a batch into the model?
What do the different environment variables in the .sh file do?
Try to benchmark the runs using 1 and 2 GPUs
The first exercise has hopefully convinced you that it can be quite troublesome to write distributed training applications yourself. Luckily for us, PyTorch-lightning
can take care of this for us such that we do not have to care about the specific details. To get your model training on multiple GPUs you need to change two arguments in the trainer: the accelerator
flag and the gpus
flag. In addition to this, you can read through this guide about any additional steps you may need to do (for many of you, it should just work). Try running your model on multiple GPUs.
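As a rough sketch, multi-GPU training in Lightning boils down to something like the following; note that the exact argument names depend on your Lightning version (newer releases use accelerator, devices and strategy, while older ones use the gpus flag):
import pytorch_lightning as pl\n\n# model = MyLightningModule()  # your LightningModule from the earlier modules\ntrainer = pl.Trainer(accelerator=\"gpu\", devices=2, strategy=\"ddp\", max_epochs=5)\n# trainer.fit(model)\n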
Try benchmarking your training using 1 and 2 GPUs, e.g. try running a couple of epochs and measure how long it takes. How much of a speedup can you actually get? Why can you not get a speedup of 2?
Inference is the task of applying our trained model to some new and unseen data, often called prediction. Scaling inference is different from scaling data loading and training, mainly because inference normally only uses a single data point (or a few), so we can neither parallelize the data loading nor parallelize across multiple GPUs (at least not in any efficient way). Additionally, inference is often not performed on machines that can do large computations, as most inference today is actually done either on edge devices, e.g. mobile phones, or in low-cost, low-compute cloud environments. Thus, we need to be smarter about how we scale inference than just throwing more computing power at it.
In this module, we are going to look at various ways that you can either reduce the size of your model or make your model faster. Both are important for running inference fast regardless of the setup you are running your model on. We want to note that this is still very much an active area of research and therefore best practices for what to do in a specific situation can change.
"},{"location":"s9_scalable_applications/inference/#choosing-the-right-architecture","title":"Choosing the right architecture","text":"Assume you are starting a completely new project and have to come up with a model architecture for doing this. What is your strategy? The common way to do this is to look at prior work on similar problems that you are facing and either directly choose the same architecture or create some slight variation hereof. This is a great way to get started, but the architecture that you end up choosing may be optimal in terms of performance but not inference speed.
The fact is that not all base architectures are created equal, and a 10K parameter model with one architecture can have a significantly different inference speed than another 10K parameter model with another architecture. For example, consider the figure below, which compares a number of models from the timm package, colored based on their base architecture. The general trend is that the number of images that can be processed by a model per second (y-axis) is inversely proportional to the number of parameters (x-axis). However, we generally see that convolutional base architectures (conv) are more efficient than transformer-based ones (vit) for the same parameter budget.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises","title":"\u2754 Exercises","text":"As discussed in this blogpost the largest increase in inference speed you will see (given some specific hardware) is choosing an efficient model architecture. In the exercises below we are going to investigate the inference speed of different architectures.
Start by checking out this table which contains a list of pretrained weights in torchvision
. Try finding a model that has in the range of 20-30 million parameters.
Write a small script that first initializes all models, creates a dummy input tensor of shape [100, 3, 256, 256] and then measures the time it takes to do a forward pass on the input tensor. Make sure to do this multiple times to get a good average time.
SolutionIn this solution, we have chosen to use the efficientnet b5 (30.4M parameters), resnet50 (25.6M parameters) and the swin v2 transformer tiny (28.4M parameters) models.
import time\nimport torch\nfrom torchvision import models\n\nmodel_list = [\"efficientnet_b5\", \"resnet50\", \"swin_v2_t\"]\nimage = torch.randn(100, 3, 256, 256)\n\nn_reps = 10\nfor m in model_list:\n    model = models.get_model(m)\n    model.eval()  # put the model in evaluation mode for a fair inference benchmark\n    tic = time.time()\n    with torch.no_grad():  # no gradients are needed when timing inference\n        for _ in range(n_reps):\n            _ = model(image)\n    toc = time.time()\n    print(f\"Model {m} took: {(toc - tic) / n_reps}\")\n
Does the results make sense? Based on the above figure we would expect that efficientnet is faster than resnet, which is faster than the transformer based model. Is this also what you are seeing?
To figure out why one net is more efficient than another we can try to count the operations each network need to do for inference. A operation here we can define as a FLOP (floating point operation) which is any mathematical operation (such as +, -, *, /) or assignment that involves floating-point numbers. Luckily for us someone has already created a python package for calculating this in pytorch: ptflops
Install the package
pip install ptflops\n
Try calling the get_model_complexity_info
function from the ptflops
package on the networks from the previous exercise. What are the results?
from ptflops import get_model_complexity_info\nfrom torchvision import models\n\nmodel_list = [\"efficientnet_b5\", \"resnet50\", \"swin_v2_t\"]\nfor model in model_list:\n    macs, params = get_model_complexity_info(\n        models.get_model(model),  # use the current model in the loop, not always the first one\n        (3, 256, 256), backend='pytorch', print_per_layer_stat=False,\n    )\n    print(f\"Model {model} has {params} parameters and uses {macs}\")\n
In the table from the initial exercise, you could also see the overall performance of each network on the Imagenet-1K dataset. Given this performance, the inference speed and the FLOPs count, which network would you choose to use in a production setting? Discuss when choosing one over the others should be considered.
Quantization is a technique where all computations are performed with integers instead of floats. We are essentially taking all continuous signals and converting them into discretized signals.
Image creditAs discussed in this blogpost series, while float
(32-bit) is the primarily used precision in machine learning, because it strikes a good balance between memory consumption, precision and computational requirements, it does not mean that we cannot take advantage of quantization during inference to improve the speed of our model. For instance:
Floating-point computations are slower than integer operations
Recent hardware has specialized support for doing integer operations
Many neural networks are actually not bottlenecked by how many computations they need to do, but by how fast we can transfer data, e.g. the memory bandwidth and cache of your system are the limiting factors. Therefore, working with 8-bit integers instead of 32-bit floats means that we can move data around approximately 4 times as fast.
Storing models in integers instead of floats saves us approximately 75% of the RAM/hard disk space whenever we save a checkpoint. This is especially useful in relation to deploying models using docker (as you hopefully remember), as it will lower the size of our docker images.
But how do we convert between floats and integers in quantization? In most cases we often use a linear affine quantization:
$$ x_{int} = \\text{round}\\left( \\frac{x_{float}}{s} + z \\right) $$
where $s$ is a scale and $z$ is the so-called zero point. For example, with $s=0.05$ and $z=0$, the float value $x_{float}=0.63$ maps to $x_{int}=\\text{round}(0.63/0.05 + 0)=13$. But how does this translate to doing inference in a neural network? The figure below shows all the conversions that we need to make to our standard inference pipeline to actually do the computations in quantized format.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_1","title":"\u2754 Exercises","text":"Lets look at how quantized tensors look in PyTorch
Start by creating a tensor that contains both random numbers
Next call the torch.quantize_per_tensor
function on the tensor. What does the quantized tensor look like? How do the values relate to the scale
and zero_point
arguments.
Finally, try to call the .dequantize()
method on the tensor. Do you get a tensor back that is close to what you initially started out with?
As you hopefully saw in the first exercise, we introduce a number of rounding errors when doing quantization, and naively we would expect these to accumulate and lead to a much worse model. However, in practice we observe that quantization still works, and we actually have a mathematically sound reason for this. Can you figure out why quantization still works with all the small rounding errors? HINT: it has to do with the central limit theorem
Let's move on to quantization of our model. Follow this tutorial from PyTorch on how to do quantization. The goal is to construct a model model_fp32
that works on normal floats and a quantized version model_int8
. For simplicity you can just use one of the models from the tutorial.
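If you just want to see the mechanics end to end, dynamic quantization is the simplest of the approaches covered in the tutorial; here is a minimal sketch using a toy model (not one of the exercise models). In older PyTorch versions the function lives under torch.quantization instead of torch.ao.quantization.
import torch\nfrom torch import nn\n\nmodel_fp32 = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))\n# dynamic quantization: weights are stored as int8, activations are quantized on the fly\nmodel_int8 = torch.ao.quantization.quantize_dynamic(model_fp32, {nn.Linear}, dtype=torch.qint8)\n\nx = torch.randn(1, 128)\nprint(model_fp32(x))\nprint(model_int8(x))  # close to, but not exactly equal to, the fp32 output\n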
Let's try to benchmark our quantized model and see if all the trouble that we went through actually paid off. Also perform the benchmark on the non-quantized model and see if you get a difference. If you do not get an improvement, explain why that may be.
Pruning is another way of reducing the model size and maybe improving the performance of our network. As the figure below illustrates, in pruning we simply remove weights in our network that we do not consider important for the task at hand. By removing, we here mean that the weight gets set to 0. There are many ways to determine if a weight is important, but the general rule is that the importance of a weight is proportional to its magnitude. This makes intuitive sense, since weights in all linear operations (fully connected or convolutional) are always multiplied onto the incoming value, thus a small weight means a small outgoing activation.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_2","title":"\u2754 Exercises","text":"We provide a start script that implements the famous LeNet in this file. Open and run it just to make sure that you know the network.
PyTorch already has some pruning methods implemented in its package. Import the prune
module from torch.nn.utils
in the script.
Try to prune the weights of the first convolutional layer by calling
prune.random_unstructured(module_1, name=\"weight\", amount=0.3) # (1)!\n
Try printing the named_parameters
, named_buffers
before and after the module is pruned. Can you explain the difference and what is the connection to the module_1.weight
attribute.
Try pruning the bias of the same module this time using the l1_unstructured
function from the pruning module. Again check the named_parameters
, named_buffers
argument to make sure you understand the difference between L1 pruning and unstructured pruning.
Instead of pruning only a single module in the model, let's try pruning the whole model. To do this we just need to iterate over all named_modules
in the model like this:
for name, module in new_model.named_modules():\n    # only prune layers that actually have a weight tensor (conv and linear layers)\n    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):\n        prune.l1_unstructured(module, name='weight', amount=0.2)\n
But what if we wanted to apply different amounts of pruning to different layers? Implement a pruning scheme where
amount=0.2
amount=0.4
Print print(dict(new_model.named_buffers()).keys())
after the pruning to confirm that all weights have been correctly pruned. One possible way to structure such a scheme is sketched below.
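A minimal sketch of one possible scheme; the choice of pruning convolutional layers with amount=0.2 and linear layers with amount=0.4 is only an assumption for illustration, the exercise does not prescribe which layers get which amount:
from torch import nn\nfrom torch.nn.utils import prune\n\nfor name, module in new_model.named_modules():  # new_model is the LeNet instance from the exercise\n    if isinstance(module, nn.Conv2d):\n        prune.l1_unstructured(module, name=\"weight\", amount=0.2)  # assumed amount for conv layers\n    elif isinstance(module, nn.Linear):\n        prune.l1_unstructured(module, name=\"weight\", amount=0.4)  # assumed amount for linear layers\n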
The pruning we have looked at until now has only been local in nature, e.g. we have applied the pruning independently for each layer, not accounting globally for how much we should actually prune. As you may realize, this can quickly lead to a network that is pruned too much. Instead, the more common approach is to prune globally, where we remove the smallest X
amount of connections:
Start by creating a tuple over all the weights with the following format
parameters_to_prune = (\n (model.conv1, 'weight'),\n # fill in the rest of the modules yourself\n (model.fc3, 'weight'),\n)\n
The tuple needs to have length 5. Challenge: Can you construct the tuple using for
loops, such that the code works for arbitrary size networks?
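A possible sketch of the challenge solution, assuming model is the LeNet instance from the script and that only convolutional and linear weights should be pruned:
from torch import nn\n\nparameters_to_prune = tuple(\n    (module, \"weight\")\n    for module in model.modules()\n    if isinstance(module, (nn.Conv2d, nn.Linear))  # skip containers, activations, pooling etc.\n)\nprint(len(parameters_to_prune))  # 5 for LeNet: 2 convolutional + 3 linear layers\n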
Next prune using the global_unstructured
function to globally prune the tuple of parameters
prune.global_unstructured(\n parameters_to_prune,\n pruning_method=prune.L1Unstructured,\n amount=0.2,\n)\n
Check that the amount that has been pruned is actually equal to the 20% specified in the pruning. We provide the following function that for a given submodule (for example model.conv1
) computes the amount of pruned weights
def check_prune_level(module: nn.Module):\n sparsity_level = 100 * float(torch.sum(module.weight == 0) / module.weight.numel())\n print(f\"Sparsity level of module {sparsity_level}\")\n
With a pruned network we really want to see if all our effort actually ended up with a network that is faster and/or smaller in memory. Do the following to the globally pruned network from the previous exercises:
First we need to make the pruning of our network permanent. Right now it is only semi-permanent as we are still keeping a copy of the original weights in memory. Make the change permanent by calling prune.remove
on every pruned module in the model. Hint: iterate over the parameters_to_prune
tuple.
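A short sketch of what that hint amounts to:
from torch.nn.utils import prune\n\nfor module, name in parameters_to_prune:\n    prune.remove(module, name)  # folds weight_orig and weight_mask into a plain, permanently pruned weight\n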
Next try to measure the time of a single inference (repeated 100 times) for both the pruned and non-pruned network
import time\ntic = time.time()\nfor _ in range(100):\n _ = network(torch.randn(100, 1, 28, 28))\ntoc = time.time()\nprint(toc - tic)\n
Is the pruned network actually faster? If not can you explain why?
Next lets measure the size of our network (called pruned_network
) and a freshly initialized network (called network
):
torch.save(pruned_network.state_dict(), 'pruned_network.pt')\ntorch.save(network.state_dict(), 'network.pt')\n
Look up the size of each file. Is the pruned network actually smaller? If not, can you explain why?
Repeat the last exercise, but this time convert all pruned weights to sparse format first by calling the .to_sparse()
method on each pruned weight. Is the saved model smaller now?
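One way to do this, sketched under the assumption that all pruned tensors have weight in their state dict key:
import torch\n\nsparse_state_dict = {\n    k: v.to_sparse() if \"weight\" in k else v  # only the weights were pruned, biases stay dense\n    for k, v in pruned_network.state_dict().items()\n}\ntorch.save(sparse_state_dict, \"pruned_network_sparse.pt\")\n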
This ends the exercises on pruning. As you probably realized in the last couple of exercises, pruning does not guarantee speedups out of the box. This is because linear operations in PyTorch do not handle sparse structures out of the box. To actually get speedups we would need to dive deep into sparse tensor operations, which again does not guarantee a speedup, because the performance of these operations depends on the sparsity structure of the pruned weights. Investigating this is out of scope for these exercises, but we highly recommend checking it out if you are interested in sparse networks.
"},{"location":"s9_scalable_applications/inference/#knowledge-distillation","title":"Knowledge distillation","text":"Knowledge distillation is somewhat similar to pruning, in the sense that it tries to find a smaller model that can perform equally well as a large model, however it does so in a completely different way. Knowledge distillation is a model compression technique that builds on the work of Bucila et al. in which we try do distill/compress the knowledge of a large complex model (also called the teacher model) into a simpler model (also called the student model).
The best known example of this is the DistilBERT model. The DistilBERT model is a smaller version of the large natural language processing model BERT, and it achieves 97% of the performance of BERT while only containing 40% of the weights and being 60% faster. You can see in the figure below how it is much smaller in size compared to other models developed at the same time.
Image creditKnowledge distillation works by assuming we have a big teacher model that is already performing well and that we want to compress. By running our training set through the large model we get a softmax distribution for each and every training sample. The goal of the student is to both match the original labels of the training data and match the softmax distribution of the teacher model. The intuition behind doing this is that the teacher model needs to be more complex to learn the complex inter-class relationships from just (one-hot) labels. The student, on the other hand, is directly fed softmax distributions from the teacher that explicitly encode these inter-class relationships and thus does not need the same capacity to learn the same as the teacher.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_3","title":"\u2754 Exercises","text":"Lets try implementing model distillation ourself. We are going to see if we can achieve this on the cifar10 dataset. Do note that exercise below can take quite long time to finish because it involves training multiple networks and therefore involve some waiting.
Start by installing the transformers
and datasets
packages from Huggingface
pip install transformers\npip install datasets\n
which we are going to use to download the cifar10 dataset and a teacher model.
Next download the cifar10 dataset
from datasets import load_dataset\ndataset = load_dataset(\"cifar10\")\n
Next let's initialize our teacher model. For this we consider a large transformer-based model:
from transformers import AutoFeatureExtractor, AutoModelForImageClassification\nextractor = AutoFeatureExtractor.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\nmodel = AutoModelForImageClassification.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\n
To get the logits (the unnormalized scores before the softmax) from our teacher model for a single datapoint from the training dataset, you would extract them like this:
sample_img = dataset['train'][0]['img']\npreprocessed_img = extractor(dataset['train'][0]['img'], return_tensors='pt')\noutput = model(**preprocessed_img)\nprint(output.logits)\n# tensor([[ 3.3682, -0.3160, -0.2798, -0.5006, -0.5529, -0.5625, -0.6144, -0.4671, 0.2807, -0.3066]])\n
Repeat this process for the whole training dataset and store the result somewhere.
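A rough sketch of how this could look, looping one sample at a time for clarity (in practice you would want to batch the images and run the teacher on a GPU):
import torch\n\nteacher_logits = []\nmodel.eval()\nwith torch.no_grad():\n    for sample in dataset[\"train\"]:\n        inputs = extractor(sample[\"img\"], return_tensors=\"pt\")\n        teacher_logits.append(model(**inputs).logits.squeeze(0))\ntorch.save(torch.stack(teacher_logits), \"teacher_logits.pt\")  # reload later with torch.load\n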
Implement a simple convolutional model. You can create a custom one yourself or use a small one from torchvision
.
Train the model on cifar10 to convergence, so you have a baseline result for how the model is performing.
Redo the training, but this time add knowledge distillation to your training objective. It should look like this:
for batch in dataset:\n    # ...\n    img, target, teacher_logits = batch\n    preds = model(img)\n    loss = torch.nn.functional.cross_entropy(preds, target)\n    # distillation term: match the teacher's softmax distribution (soft targets)\n    loss_teacher = torch.nn.functional.cross_entropy(preds, teacher_logits.softmax(dim=-1))\n    loss = loss + loss_teacher\n    loss.backward()\n    # ...\n
Compare the final performance obtained with and without knowledge distillation. Did the performance improve or not?
This ends the module on scaling inference in machine learning models.
"},{"location":"samples/","title":"Collection of sample applications","text":""},{"location":"tools/","title":"Tools","text":"Just a collection of tools and scripts for running the course.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"Machine Learning Operations
Repository for course 02476 at DTU.
Checkout the homepage!
"},{"location":"#i-course-information","title":"\u2139\ufe0f Course information","text":"
Recommended prerequisites: DTU course 02456 (Deep Learning) or experience with the following topics:
Start by cloning or downloading this repository
git clone https://github.com/SkafteNicki/dtu_mlops\n
If you do not have git installed (yet) we will touch upon it in the course. The folder will contain all the exercise material for this course and lectures. Additionally, you should join our Slack channel which we use for communication. The link may be expired, write to me.
"},{"location":"#course-organization","title":"\ud83d\udcc2 Course organization","text":"We highly recommend that when going through the material you use the homepage which is the corresponding GitHub Pages version of this repository that is more nicely rendered, and also includes some special HTML magic provided by Material for MkDocs.
The course is divided into sessions, denoted by capital S, and modules, denoted by capital M. A session corresponds to a full day of work if you are following the course, meaning approximately 9 hours of work. Each session (S) corresponds to a topic within MLOps and consists of multiple modules (M) that each cover a specific topic.
Importantly we differ between core modules and optional modules. Core modules will be marked by
Core Module
at the top of their corresponding page. Core modules are important to go through to be able to pass the course. You are highly recommended to still do the optional modules.
Additionally, be aware of the following icons throughout the course material:
This icon can be expanded to show code belonging to a given exercise
ExampleI will contain some code for an exercise.
This icon can be expanded to show a solution for a given exercise
SolutionI will present a solution to the exercise.
This icon (1) can be expanded to show a hint or a note for a given exercise
Machine Learning Operations (MLOps) is a rather new field that has seen its uprise as machine learning and particularly deep learning has become a widely available technology. The term itself is a compound of \"machine learning\" and \"operations\" and covers everything that has to do with the management of the production ML lifecycle.
The lifecycle of production ML can largely be divided into three phases:
Design: The initial phase starts with an investigation of the problem. Based on this analysis, several requirements can be prioritized for what we want our future model to do. Since machine learning requires data to be trained, we also investigate in this step what data we have and if we need to source it in some other way.
Model development: Based on the design phase we can begin to conjure some machine learning algorithms to solve our problems. As always, the initial step often involves doing some data analysis to make sure that our model is learning the signal that we want it to learn. Secondly, is the machine learning engineering phase, where the particular model architecture is chosen. Finally, we also need to do validation and testing to make sure that our model is generalizing well.
Operations: Based on the model development phase, we now have a model that we want to use. The operations are where create an automatic pipeline that makes sure that whenever we make changes to our codebase they get automatically incorporated into our model, such that we do not slow down production. Equally important is the ongoing monitoring of already deployed models to make sure that they behave exactly as we specified them.
It is important to note that the three steps are a cycle, meaning that when you have successfully deployed a machine learning model that is not the end of it. Your initial requirements may change, forcing you to revisit the design phase. Some new algorithms may show promising results, so you revisit the model development phase to implement this. Finally, you may try to cut the cost of running your model in production, making you revisit the operations phase, and trying to optimize some steps.
The focus in this course is particularly on the Operations part of MLOps as this is what many data scientists are missing in their toolbox to take all the knowledge they have about data processing and model development into a production setting.
"},{"location":"#learning-objectives","title":"\u2754 Learning objectives","text":"General course objective
Introduce the student to a number of coding practices that will help them organization, scale, monitor and deploy machine learning models either in a research or production setting. To provide hands-on experience with a number of frameworks, both local and in the cloud, for doing large scale machine learning models.
This includes:
Additional reading resources (in no particular order):
Ref 1 Introduction blog post for those who have never heard about MLOps and want to get an overview.
Ref 2 Great document from Google about the different levels of MLOps.
Ref 3 Another introduction to the principles of MLOps and the different stages of MLOps.
Ref 4 Great paper about the technical debt in machine learning.
Ref 5 Interview study that uncovers many of the pain points that ML engineers go through when doing MLOps.
Other courses with content similar to this:
Made with ML. Great online MLOps course that also covers additional topics on the foundations of working with ML.
Full stack deep learning. Another MLOps online course going through the whole developer pipeline.
MLOps Zoomcamp. MLOps online course that includes many of the same topics.
If you want to contribute to the course, we are happy to have you! Anything from fixing typos to adding new content is welcome. For building the course material locally, it is a simple two-step process:
pip install -r requirements.txt\nmkdocs serve\n
Which will start a local server that you can access at http://127.0.0.1:8000
and will automatically update when you make changes to the course material. When you have something that you want to contribute, please make a pull request.
I highly value open source, and the content of this course is therefore free to use under the Apache 2.0 license. If you use parts of this course in your work, please cite using:
@misc{skafte_mlops,\n author = {Nicki Skafte Detlefsen},\n title = {Machine Learning Operations},\n howpublished = {\\url{https://github.com/SkafteNicki/dtu_mlops}},\n year = {2024}\n}\n
"},{"location":"pages/faq/","title":"Frequently asked questions","text":"For further questions, please contact Nicki.
"},{"location":"pages/faq/#when-is-the-next-time-the-course-is-running","title":"When is the next time the course is running \u2754","text":"The course always runs in January, during the 3-week period at DTU. The exact dates can be found in the academic calendar.
"},{"location":"pages/faq/#is-it-possible-to-attend-the-course-fully-online","title":"Is it possible to attend the course fully online \u2754","text":"Mostly yes. All exercises are provided online and lectures will be recorded and streamed. However, do note that
Overall we try to support flexible learning as much as possible with some limitations.
"},{"location":"pages/faq/#what-are-the-prerequisites-for-taking-this-course","title":"What are the prerequisites for taking this course \u2754","text":"We recommend that you have a basic understanding of machine learning concepts such as what a dataset is, what probabilities are, what a classifier is, what overfitting means etc. This corresponds to the curriculum covered in course 02450. The actual focus of the course is not on machine learning models, but we will be using these basic concepts throughout the exercises.
Additionally, we recommend basic knowledge about deep learning and how to code in PyTorch, corresponding to the curriculum covered in 02456. From prior experience, we know that not all students have gained knowledge about deep learning models before this course, and we will be covering the basics of how to code in PyTorch in one of the first modules of the course to get everyone up to speed.
"},{"location":"pages/faq/#i-will-be-missing-x-days-of-the-course-will-that-be-a-problem","title":"I will be missing X days of the course, will that be a problem \u2754","text":"Depends. The course is fairly intensive, with most students working from 9-17 every day. If you already know that you will be missing X days of the course, then I highly recommend that you go through some of the first sessions before the course starts to give yourself a bit of breathing room. If you are not able to do so, please be aware that an additional effort may be needed from you to keep up with your fellow students.
"},{"location":"pages/faq/#how-many-should-we-be-in-a-group-for-the-projects","title":"How many should we be in a group for the projects \u2754","text":"Between 3 and 5. The projects are designed to be done in groups meaning that we intentionally make them too big for one person to do alone. Luckily, a lot of the work that you need to do can be done in parallel, so it is not as bad as it sounds.
"},{"location":"pages/faq/#when-will-the-exam-take-place","title":"When will the exam take place \u2754","text":"From 2025 and onwards, the exam only consist of a project report. The report should be handed in at midnight on the final day of the course. For January 2025, this means the 24th.
"},{"location":"pages/faq/#where-can-i-find-information-regarding-the-exam","title":"Where can I find information regarding the exam \u2754","text":"Look at the bottom of this page. Details will be updated as we get closer to the exam date.
"},{"location":"pages/faq/#can-i-use-chatgpt-or-similar-tools-for-the-exercises-project-exam-report-coding-writing","title":"Can I use ChatGPT or similar tools for the exercises, project, exam report (coding + writing) \u2754","text":"Yes, yes, and yes, but remember that its a tool and you need to validate the output before using it. We would prefer for the exam report that you formulate the answers in your own words because it is intended for you do describe what you have been doing in your project. The I in LLM stands for intelligence.
"},{"location":"pages/faq/#i-am-a-phd-student-not-enrolled-at-dtu-can-i-take-the-course","title":"I am a PhD student not enrolled at DTU, can I take the course \u2754","text":"Yes, PhD students from other universities can attend the course. You can checkout this page for more information or in general you can contact phdcourses@dtu.dk for more information. Do note that the registration deadline is usually in beginning of December.
"},{"location":"pages/faq/#i-am-a-foreign-student-and-my-home-university-doesnt-accept-passnot-pass-what-can-i-do","title":"I am a foreign student and my home university doesn't accept pass/not pass, what can I do \u2754","text":"We can give a grade on the Danish 7-point grading scale for foreign students who need it, where their home university does not accept pass/no-pass. You need to contact the course responsible Nicki within the first week of the course to request this. Secondly, we may need to further validate your work, so please be prepared for doing a short oral exam on one of the last days of the course.
"},{"location":"pages/faq/#i-am-a-euroteq-student-any-special-rules-for-me","title":"I am a EuroTEQ student, any special rules for me \u2754","text":"Not really, you will attend the course as any other student. However, we will provide a special Slack channel for you, trying to make sure that you can get the same help as students from DTU who can attend the course on campus.
"},{"location":"pages/overview/","title":"Summary of course content","text":"There are a lot of moving parts in this course, so it may be hard to understand how it all fits together. This page provides a summary of the frameworks in this course e.g. the stack of tools used. In the figure below we have provided an overview on how the different tools of the course interacts with each other. The table after the figure provides a short description of each of the parts.
The MLOps stack in the course. This is just an example of one stack, and depending on your use case you may want to use a different stack of tools that better fits your needs. Regardless of the stack, the principles of MLOps are the same. Framework Description PyTorch is the backbone of our code, it provides the computational engine and the data structures that we need to define our data structures. PyTorch lightning is a framework that provides a high-level interface to PyTorch. It provides a lot of functionality that we need to train our models, such as logging, checkpointing, early stopping, etc. such that we do not have to implement it ourselves. It also allows us to scale our models to multiple GPUs and multiple nodes. We control the dependencies and Python interpreter using Conda that enables us to construct reproducible virtual environments For configuring our experiments we use Hydra that allows us to define a hierarchical configuration structure config files Using Weights and Bias allows us to track and log any values and hyperparameters for our experiments Whenever we run into performance bottlenecks with our code we can use the Profiler to find the cause of the bottleneck When we run into bugs in our code we can use the Debugger to find the cause of the bug For organizing our code and creating templates we can use Cookiecutter Docker is a tool that allows us to create a container that contains all the dependencies and code that we need to run our code For controlling the versions of our data and synchronization between local and remote data storage, we can use DVC that makes this process easy For version control of our code we use Git (in complement with Github) that allows multiple developers to work together on a shared codebase We can use Pytest to write unit tests for our code, to make sure that new changes to the code does break the code base For linting our code and keeping a consistent coding style we can use tools such as Pylint and Flake8 that checks our code for common mistakes and style issues For running our unit tests and other checks on our code in a continuous manner e.g. 
after we commit and push our code we can use Github actions that automate this process Using Cloud build we can automate the process of building our docker images and pushing them to our artifact registry Artifact registry is a service that allows us to store our docker images for later use by other services For storing our data and trained models we can use Cloud storage that provides a scalable and secure storage solution For general compute tasks we can use Compute engine that provides a scalable and secure compute solution For training our experiments in a easy and scalable manner we can use Vertex AI For creating a REST API for our model we can use FastAPI that provides a high-level interface for creating APIs For simple deployments of our code we can use Cloud functions that allows us to run our code in response to events through simple Python functions For more complex deployments of our code we can use Cloud run that allows us to run our code in response to events through docker containers Cloud monitoring gives us the tools to keep track of important logs and errors from the other cloud services For monitoring our deployed model is experiencing any drift we can use Evidently AI that provides a framework and dashboard for monitoring drift For monitoring the telemetry of our deployed model we can use OpenTelemetry that provides a standard for collecting and exporting telemetry data"},{"location":"pages/projects/","title":"Project work","text":"Slides
Approximately 1/3 of the course time is dedicated to doing project work. The projects will serve as the basis of your exam. In the project, you will essentially re-apply everything that you learn throughout the course to a self chosen project. The overall goals with the project is:
In the projects you are free to work on whatever problem that you want. That said, we have a specific requirement, that you need to incorporate some third-party framework into your project. If you want inspiration for projects, here are some examples
Classification of tweets
Translating from English to German
Classification of scientific papers
Classification of rice types from images
We hope most students will be able to form groups by themselves. Expected group size is between 3 and 5. If you are not able to form a group, please make sure to post in the #looking-for-group
channel on Slack or make sure to be present on the 4th day of the course (the day before the project work starts) where we will help students that have not found a group yet.
We strive to keep the tools thought in this course as open-source as possible. The great thing about the open-source community is that whatever problem you are working on, there is probably some package out there that can get you at least 10% of the way. For the project, we want to enforce this point and you are required to include some third-party package, that is neither PyTorch or one of the tools already covered in the course, into your project.
If you have no idea what framework to include, the PyTorch ecosystem is a great place for finding open-source frameworks that can help you accelerate your own projects where PyTorch is the backengine. All tools in the ecosystem should work greatly together with PyTorch. However, it is important to note that the ecosystem is not a complete list of all the awesome packages that exist to extend the functionality of PyTorch. If you are still missing inspiration for frameworks to use, we highly recommend these three that have been used in previous years of the course:
PyTorch Image Models. PyTorch Image Models (also known as TIMM) is the absolutely most used computer vision package (maybe except for torchvision
). It contains models, scripts and pre trained for a lot of state-of-the-art image models within computer vision.
Transformers. The Transformers repository from the Huggingface group focuses on state-of-the-art Natural Language Processing (NLP). It provides many pre-trained model to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
PyTorch-Geometric. PyTorch Geometric (PyG) is a geometric deep learning. It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers.
Each project day is fully dedicated to project work, except for maybe external inspirational lectures in the morning. The group decides exactly where they want to work on the project, how they want to work on the project, how do distribute the workload etc. We encourage strongly to parallelize work during the project, because there are a lot of tasks to do, but it it is important that all group members at least have some understanding of the whole project.
Remember that the focus of the project work is not to demonstrate that you can work with the biggest and baddest deep learning model, but instead that you show that you can incorporate the tools that are taught throughout the course in a meaningful way.
Also note that the project is not expected to be very large in scope. It may simply be that you want to train X model on Y data. You will approximately be given 4 full days to work on the project. It is better that you start out with a smaller project and then add complexity along the way if you have time.
"},{"location":"pages/projects/#day-1","title":"Day 1","text":"The first project days is all about getting started on the projects and formulating exactly what you want to work on as a group.
Start by brainstorming projects! Try to figure out exactly what you want to work with and begin to investigate what third party package that can support the project.
When you have come up with an idea, write a project description. The description is the delivery for today and should be at least 300 words. Try to answer the following questions in the description:
(Optional) If you want to think more about the product design of your project, feel free to fill out the ML canvas (or part of it). You can read more about the different fields on canvas here.
After having done the product description, you can start on the actual coding of the project. In the next section, a to-do list is attached that summaries what we are doing in the course. You are NOT expected to fulfill all bullet points from week 1 today.
The project description will serve as an guideline for us at the exam that you have somewhat reached the goals that you set out to do. By the end of the day, you should commit your project description to the README.md
file belonging to your project repository. If you filled out the ML canvas, feel free to include that as part of the README.md
file. Also remember to commit whatever you have done on the project until now. When you have done this, go to DTU Learn and hand-in (as a group) the link to your GitHub repository as an assignment.
We will briefly (before next Monday) look over your GitHub repository and project description to check that everything is fine. If we have any questions/concerns we will contact you.
"},{"location":"pages/projects/#day-2","title":"Day 2","text":"The goal for today is simply to continue working on your project. Start with bullet points in the checklist from week 1 and continue with bullet points for week 2.
"},{"location":"pages/projects/#day-3","title":"Day 3","text":"Continue working on your project, today you should hopefully focus on the bullet points in the checklist from week 2. There is no delivery for this week, but make sure that you have committed all your progress at the end of the day. We will again briefly look over the repositories and will reach out to your group if we are worried about the progression of your project.
"},{"location":"pages/projects/#day-4","title":"Day 4","text":"We have now entered the final week of the course and the second last project day. You are most likely continuing with bullet points from week 2, but should hopefully begin to look at the bullet points from week 3 today. These are in general much more complex, so we recommend looking at them until you have completed most from week 2. We also recommend that you being to fill our report template.
"},{"location":"pages/projects/#day-5","title":"Day 5","text":"Today you are finishing your project. We recommend that you start by creating a architechtual overview of your project similar to this figure. I recommend using draw.io for creating this kind of diagram, but feel free to use any tool you like. Else you should just continue working on your project, checking of as many bullet points as possible. Finally, you should also prepare yourself for the exam tomorrow.
"},{"location":"pages/projects/#project-hints","title":"Project hints","text":"Below are listed some hints to prevent you from getting stuck during the project work with problems that previous groups have encountered.
Data
Start out small! We recommend that you start out with less than 1GB of data. If the dataset you want to work with is larger, then subsample it. You can use dvc to version control your data and only download the full dataset when you are ready to train the model.
Be aware of many smaller files. DVC
does not handle many small files well, and can take a long time to download. If you have many small files, consider zipping them together and then unzip them at runtime.
You do not need to use DVC
for everything regarding data. You workflow is to just use DVC
for version controlling the data, but when you need to get it you can just download it from the source. For example if you are storing your data in a GCP bucket, you can use the gsutil
command to download the data or directly accessing the it using the cloud storage file system
Modelling
Again, start out small! Start with a simple model and then add complexity as you go along. It is better to have a simple model that works than a complex model that does not work.
Try fine-tuning a pre-trained model. This is often much faster than training a model from scratch.
Deployment
Please note that all the lists are exhaustive meaning that I do not expect you to have completed very point on the checklist for the exam.
"},{"location":"pages/projects/#week-1","title":"Week 1","text":"make_dataset.py
file such that it downloads whatever data you need andrequirements.txt
file with whatever dependencies that you are usingpep8
) while doing the projectFrom January 2025 the exam only consist of a project report. The report should be handed in at midnight on the final day of the course. For January 2025, this means the 24th. We provide template folder called reports. As the first task you should copy the folder and all its content to your project repository. Then, you jobs is to fill out the README.md
file which contains the report template. The file itself contains instructions on how to fill it out and instructions on using the included report.py
file for validating your work. You will hand-in the template by simple including it in your project repository. By midnight on the final day of the course, we will automatically scrape the report and use it as the basis for grading you. Therefore, changes after this point are not registered.
Slides
The course is organised into exercise (2/3 of the course) days and project days (1/3 of the course).
Exercise days start at 9:00 in the morning with an lecture (usually 30-45 min) that will give some context about at least one of the topics of that day. Additionally, previous days exercises may shortly be touched upon. The remaining of the day will be spend on solving exercises either individually or in small groups. For some people the exercises may be fast to do and for others it will take the whole day. We will provide help throughout the day. We will try to answer questions on slack but help with be priorities to students physically on campus.
Project days are intended for project work and you are therefore responsible for making an agreement with your group when and where you are going to work. The first project days there will be a lecture at 9:00 with project information. Other project days we may also start the day with an external lecture, which we highly recommend that you participate in. During each project day we will have office hours where you can ask questions for the project.
Below is an overall timeplan for each day, including the presentation topic of the day and the frameworks that you will be using in the exercises.
Recodings (link to drive folder with mp4 files):
In the first week you will be introduced to a number of development practices for organizing and developing code, especially with a focus on making everything reproducible.
Date Day Presentation topic Frameworks Format 6/1/25 Monday Deep learning software\ud83d\udcdd Terminal, Conda, IDE, PyTorch Exercises 7/1/25 Tuesday MLOps: what is it?\ud83d\udcdd Git, CookieCutter, Pep8, DVC Exercises 8/1/25 Wednesday Reproducibility\ud83d\udcdd Docker, Hydra Exercises 9/1/25 Thursday Debugging\ud83d\udcdd Debugger, Profiler, Wandb, Lightning Exercises 10/1/25 Friday Project work\ud83d\udcdd - Projects"},{"location":"pages/timeplan/#week-2","title":"Week 2","text":"The second week is about automatization and the cloud. Automatization will help use making sure that our code does not break when we make changes to it. The cloud will help us scale up our applications and we learn how to use different services to help develop a full machine learning pipeline.
Date Day Presentation topic Frameworks Format 13/1/25 Monday Continuous Integration\ud83d\udcdd Pytest, Github actions, Pre-commit, CML Exercises 14/1/25 Tuesday The Cloud\ud83d\udcdd GCP Engine, Bucket, Artifact registry, Vertex AI Exercises 15/1/25 Wednesday Deployment\ud83d\udcdd FastAPI, Torchserve, GCP Functions, GCP Run Exercises 16/1/25 Thursday No lecture - Projects 17/1/25 Friday Company presentation (TBA) - Projects"},{"location":"pages/timeplan/#week-3","title":"Week 3","text":"For the final week we look into advance topics such as monitoring and scaling of applications. Monitoring is especially important for the longivity for the applications that we develop, that we can deploy them either locally or in the cloud and that we have the tools to monitor how they behave over time. Scaling of applications is an important topic if we ever want our applications to be used by many people at the same time.
Date Day Presentation topic Frameworks Format 20/1/25 Monday Monitoring\ud83d\udcdd Evidently AI, Prometheus, GCP Monitoring Exercises 21/1/25 Tuesday Scalable applications\ud83d\udcdd PyTorch, Lightning Exercises 22/1/25 Wednesday Company presentation (TBA) - Projects 23/1/25 Thursday No lecture - Projects 24/1/25 Friday No lecture - Projects"},{"location":"reports/","title":"Exam template for 02476 Machine Learning Operations","text":"This is the report template for the exam. Please only remove the text formatted as with three dashes in front and behind like:
--- question 1 fill here ---
where you instead should add your answers. Any other changes may have unwanted consequences when your report is auto-generated at the end of the course. For questions where you are asked to include images, start by adding the image to the figures
subfolder (please only use .png
, .jpg
or .jpeg
) and then add the following code in your answer:
![my_image](figures/<image>.<extension>)\n
In addition to this markdown file, we also provide the report.py
script that provides two utility functions:
Running:
python report.py html\n
will generate a .html
page of your report. After the deadline for answering this template, we will auto-scrape everything in this reports
folder and then use this utility to generate an .html
page that will serve as your final hand-in.
Running
python report.py check\n
will check your answers in this template against the constraints listed for each question e.g. is your answer too short, too long, or have you included an image when asked to.
For both functions to work you mustn't rename anything. The script has two dependencies that can be installed with
pip install click markdown\n
"},{"location":"reports/#overall-project-checklist","title":"Overall project checklist","text":"The checklist is exhaustive which means that it includes everything that you could do on the project included in the curriculum in this course. Therefore, we do not expect at all that you have checked all boxes at the end of the project.
"},{"location":"reports/#week-1","title":"Week 1","text":"make_dataset.py
file such that it downloads whatever data you need andrequirements.txt
file with whatever dependencies that you are usingpep8
) while doing the projectEnter the group number you signed up on
Answer:
--- question 1 fill here ---
"},{"location":"reports/#question-2","title":"Question 2","text":"Enter the study number for each member in the group
Example:
sXXXXXX, sXXXXXX, sXXXXXX
Answer:
--- question 2 fill here ---
"},{"location":"reports/#question-3","title":"Question 3","text":"What framework did you choose to work with and did it help you complete the project?
Recommended answer length: 100-200 words.
Example: We used the third-party framework ... in our project. We used functionality ... and functionality ... from the package to do ... and ... in our project.
Answer:
--- question 3 fill here ---
"},{"location":"reports/#coding-environment","title":"Coding environment","text":"In the following section we are interested in learning more about your local development environment.
"},{"location":"reports/#question-4","title":"Question 4","text":"Explain how you managed dependencies in your project. Explain the process a new team member would have to go through to get an exact copy of your environment.
Recommended answer length: 100-200 words
Example: We used ... for managing our dependencies. The list of dependencies was auto-generated using ... . To get a complete copy of our development environment, one would have to run the following commands
Answer:
--- question 4 fill here ---
"},{"location":"reports/#question-5","title":"Question 5","text":"We expect that you initialized your project using the cookiecutter template. Explain the overall structure of your code. Did you fill out every folder or only a subset?
Recommended answer length: 100-200 words
Example: From the cookiecutter template we have filled out the ... , ... and ... folder. We have removed the ... folder because we did not use any ... in our project. We have added an ... folder that contains ... for running our experiments. Answer:
--- question 5 fill here ---
"},{"location":"reports/#question-6","title":"Question 6","text":"Did you implement any rules for code quality and format? Additionally, explain in your own words why these concepts matter in larger projects.
Recommended answer length: 50-100 words.
Answer:
--- question 6 fill here ---
"},{"location":"reports/#version-control","title":"Version control","text":"In the following section we are interested in how version control was used in your project during development to collaborate and increase the quality of your code.
"},{"location":"reports/#question-7","title":"Question 7","text":"How many tests did you implement and what are they testing in your code?
Recommended answer length: 50-100 words.
Example: In total we have implemented X tests. Primarily we are testing ... and ... as these are the most critical parts of our application, but also ... .
Answer:
--- question 7 fill here ---
"},{"location":"reports/#question-8","title":"Question 8","text":"What is the total code coverage (in percentage) of your code? If your code had a code coverage of 100% (or close to), would you still trust it to be error free? Explain your reasoning.
Recommended answer length: 100-200 words.
Example: The total code coverage of our code is X%, which includes all our source code. We are far from 100% coverage of our code and even if we were then...
Answer:
--- question 8 fill here ---
"},{"location":"reports/#question-9","title":"Question 9","text":"Did your workflow include using branches and pull requests? If yes, explain how. If not, explain how branches and pull requests can help improve version control.
Recommended answer length: 100-200 words.
Example: We made use of both branches and PRs in our project. In our group, each member had a branch that they worked on in addition to the main branch. To merge code we ...
Answer:
--- question 9 fill here ---
"},{"location":"reports/#question-10","title":"Question 10","text":"Did you use DVC for managing data in your project? If yes, then how did it improve your project to have version control of your data. If no, explain a case where it would be beneficial to have version control of your data.
Recommended answer length: 100-200 words.
Example: We did make use of DVC in the following way: ... . In the end it helped us in ... for controlling ... part of our pipeline
Answer:
--- question 10 fill here ---
"},{"location":"reports/#question-11","title":"Question 11","text":"Discuss your continuous integration setup. What kind of continuous integration are you running (unittesting, linting, etc.)? Do you test multiple operating systems, Python versions, etc.? Do you make use of caching? Feel free to insert a link to one of your GitHub actions workflows.
Recommended answer length: 200-300 words.
Example: We have organized our continuous integration into 3 separate files: one for doing ..., one for running ... testing and one for running ... . In particular for our ..., we used ... .An example of a triggered workflow can be seen here:
Answer:
--- question 11 fill here ---
"},{"location":"reports/#running-code-and-tracking-experiments","title":"Running code and tracking experiments","text":"In the following section we are interested in learning more about the experimental setup for running your code and especially the reproducibility of your experiments.
"},{"location":"reports/#question-12","title":"Question 12","text":"How did you configure experiments? Did you make use of config files? Explain with coding examples how you would run an experiment.
Recommended answer length: 50-100 words.
Example: We used a simple argparser, that worked in the following way: Python my_script.py --lr 1e-3 --batch_size 25
Answer:
--- question 12 fill here ---
"},{"location":"reports/#question-13","title":"Question 13","text":"Reproducibility of experiments is important. Related to the last question, how did you ensure that no information is lost when running experiments and that your experiments are reproducible?
Recommended answer length: 100-200 words.
Example: We made use of config files. Whenever an experiment is run the following happens: ... . To reproduce an experiment one would have to do ...
Answer:
--- question 13 fill here ---
"},{"location":"reports/#question-14","title":"Question 14","text":"Upload 1 to 3 screenshots that show the experiments that you have done in W&B (or another experiment tracking service of your choice). This may include loss graphs, logged images, hyperparameter sweeps etc. You can take inspiration from this figure. Explain what metrics you are tracking and why they are important.
Recommended answer length: 200-300 words + 1 to 3 screenshots.
Example: As seen in the first image we have tracked ... and ... which both inform us about ... in our experiments. As seen in the second image we are also tracking ... and ...
Answer:
--- question 14 fill here ---
"},{"location":"reports/#question-15","title":"Question 15","text":"Docker is an important tool for creating containerized applications. Explain how you used docker in your experiments? Include how you would run your docker images and include a link to one of your docker files.
Recommended answer length: 100-200 words.
Example: For our project we developed several images: one for training, inference and deployment. For example to run the training docker image: docker run trainer:latest lr=1e-3 batch_size=64
. Link to docker file:
Answer:
--- question 15 fill here ---
"},{"location":"reports/#question-16","title":"Question 16","text":"When running into bugs while trying to run your experiments, how did you perform debugging? Additionally, did you try to profile your code or do you think it is already perfect?
Recommended answer length: 100-200 words.
Example: Debugging method was dependent on group member. Some just used ... and others used ... . We did a single profiling run of our main code at some point that showed ...
Answer:
--- question 16 fill here ---
"},{"location":"reports/#working-in-the-cloud","title":"Working in the cloud","text":"In the following section we would like to know more about your experience when developing in the cloud.
"},{"location":"reports/#question-17","title":"Question 17","text":"List all the GCP services that you made use of in your project and shortly explain what each service does?
Recommended answer length: 50-200 words.
Example: We used the following two services: Engine and Bucket. Engine is used for... and Bucket is used for...
Answer:
--- question 17 fill here ---
"},{"location":"reports/#question-18","title":"Question 18","text":"The backbone of GCP is the Compute engine. Explain how you made use of this service and what type of VMs you used.
Recommended answer length: 100-200 words.
Example: We used the compute engine to run our ... . We used instances with the following hardware: ... and we started the using a custom container: ...
Answer:
--- question 18 fill here ---
"},{"location":"reports/#question-19","title":"Question 19","text":"Insert 1-2 images of your GCP bucket, such that we can see what data you have stored in it. You can take inspiration from this figure.
Answer:
--- question 19 fill here ---
"},{"location":"reports/#question-20","title":"Question 20","text":"Upload one image of your GCP artifact registry, such that we can see the different images that you have stored. You can take inspiration from this figure.
Answer:
--- question 20 fill here ---
"},{"location":"reports/#question-21","title":"Question 21","text":"Upload one image of your GCP cloud build history, so we can see the history of the images that have been built in your project. You can take inspiration from this figure.
Answer:
--- question 21 fill here ---
"},{"location":"reports/#question-22","title":"Question 22","text":"Did you manage to deploy your model, either locally or in the cloud? If not, describe why. If yes, describe how and preferably how you invoke your deployed service.
Recommended answer length: 100-200 words.
Example: For deployment we wrapped our model into application using ... . We first tried locally serving the model, which worked. Afterwards we deployed it in the cloud, using ... . To invoke the service an user would call curl -X POST -F \"file=@file.json\"<weburl>
Answer:
--- question 22 fill here ---
"},{"location":"reports/#question-23","title":"Question 23","text":"Did you manage to implement monitoring of your deployed model? If yes, explain how it works. If not, explain how monitoring would help the longevity of your application.
Recommended answer length: 100-200 words.
Example: We did not manage to implement monitoring. We would like to have monitoring implemented such that over time we could measure ... and ... that would inform us about this ... behaviour of our application.
Answer:
--- question 23 fill here ---
"},{"location":"reports/#question-24","title":"Question 24","text":"How many credits did you end up using during the project and what service was most expensive?
Recommended answer length: 25-100 words.
Example: Group member 1 used ..., Group member 2 used ..., in total ... credits was spend during development. The service costing the most was ... due to ...
Answer:
--- question 24 fill here ---
"},{"location":"reports/#overall-discussion-of-project","title":"Overall discussion of project","text":"In the following section we would like you to think about the general structure of your project.
"},{"location":"reports/#question-25","title":"Question 25","text":"Include a figure that describes the overall architecture of your system and what services you make use of. You can take inspiration from this figure. Additionally, in your own words, explain the overall steps in the figure.
Recommended answer length: 200-400 words
Example:
The starting point of the diagram is our local setup, where we integrated ... and ... and ... into our code. Whenever we commit code and push to github, it auto triggers ... and ... . From there the diagram shows ...
Answer:
--- question 25 fill here ---
"},{"location":"reports/#question-26","title":"Question 26","text":"Discuss the overall struggles of the project. Where did you spend most time and what did you do to overcome these challenges?
Recommended answer length: 200-400 words.
Example: The biggest challenges in the project was using ... tool to do ... . The reason for this was ...
Answer:
--- question 26 fill here ---
"},{"location":"reports/#question-27","title":"Question 27","text":"State the individual contributions of each team member. This is required information from DTU, because we need to make sure all members contributed actively to the project
Recommended answer length: 50-200 words.
Example: Student sXXXXXX was in charge of setting up the initial cookie cutter project and developing the docker containers for training our applications. Student sXXXXXX was in charge of training our models in the cloud and deploying them afterwards. All members contributed to code by...
Answer:
--- question 27 fill here ---
"},{"location":"s10_extra/","title":"Extra learning modules","text":"None of the modules listed here are part of the core course; they expand on some of the other topics. Some of them may still be under construction and may in the future be moved into other sessions.
Learn how to setup a simple documentation system for your application
M32: Documentation
Learn how to do hyperparameter optimization using Optuna
M33: Hyperparameter Optimization
Learn how to use HPC systems that use PBS to do job scheduling
M34: High Performance Clusters
Danger
Module is still under development
"},{"location":"s10_extra/calibration/#methods","title":"Methods","text":""},{"location":"s10_extra/calibration/#exercises","title":"\u2754 Exercises","text":"Implement a script
Implement temperature scaling (a small sketch is given after this list of exercises)
Implement label smoothing
alpha = 0.1\nfor i in range(len(y_true)):\n y_true[i] = (1 - alpha) * y_true[i] + alpha / num_classes\n
Implement mixup
Implement cutmix
Implement the Focal Loss
Implement it in a continuous integration setup
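To make the temperature scaling exercise a bit more concrete, here is a minimal sketch, assuming you already have a tensor of (detached) validation logits and the corresponding labels; the names logits and labels are placeholders and not part of the exercise files:
import torch\n\ndef fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:\n    \"\"\"Fit a single temperature on detached validation logits by minimizing the NLL.\"\"\"\n    temperature = torch.nn.Parameter(torch.ones(1))\n    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)\n\n    def closure():\n        optimizer.zero_grad()\n        loss = torch.nn.functional.cross_entropy(logits / temperature, labels)\n        loss.backward()\n        return loss\n\n    optimizer.step(closure)\n    return temperature.item()\n
At inference time the calibrated probabilities are then obtained as softmax of the logits divided by the fitted temperature.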
Danger
Module is still under development
\"Machine learning engineering is 10% machine learning and 90% engineering.\" - Chip Huyen
We highly recommend that you read the book Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications by Chip Huyen, which gives a fantastic overview of the thought processes that go into designing modern machine learning systems.
"},{"location":"s10_extra/design/#the-stack","title":"The stack","text":"Have you ever encountered the concept of a full stack developer? A full stack developer is a developer who can develop both client and server software or, in more general terms, a developer who can take care of the complete development pipeline.
Below is an image of the massive number of tools that exist within the MLOps umbrella.
"},{"location":"s10_extra/design/#visualizing-the-design","title":"Visualizing the design","text":""},{"location":"s10_extra/documentation/","title":"M32 - Documentation","text":""},{"location":"s10_extra/documentation/#documentation","title":"Documentation","text":"In today's rapidly evolving software development landscape, effective documentation is a crucial component of any project. The ability to create clear, concise, and user-friendly technical documentation can make a significant difference in the success of your codebase. We have all probably encountered code that we wanted to use, only to abandon it because it was missing the documentation needed to get started with it.
Technical documentation or code documentation can be many things:
and many more. We are in this module going to focus on setting up a very basic documentation system that will automatically help you document the API of your code. For this reason we recommend that before continuing with this module that you have completed module M7 on good coding practices or have similar experience with writing docstrings for Python functions and classes.
There are different systems for writing documentation. In fact there is a lot to choose from:
It is important to note that all of these are static site generators. The word static here refers to the fact that once the content is generated and served on a website, the underlying HTML code will not change. It may contain HTML elements that are dynamic (like video), but the site itself does not change (1).
We are in this module going to look at Mkdocs, which (in my opinion) is one of the easiest systems to get started with because all documentation is written in markdown and the build system is written in Python. As an alternative, you can consider doing the exercises in Sphinx, which is probably the most used documentation system for Python code. Sphinx offers more customization than Mkdocs and is therefore generally preferred for larger projects with complex documentation, but for smaller projects Mkdocs is easier to get started with and is sufficient.
Mkdocs by default does not include many features and for that reason we are directly going to dive into using the material for mkdocs theme that provides a lot of nice customization to create professional static sites. In fact, this whole course is written in mkdocs using the material theme.
"},{"location":"s10_extra/documentation/#mkdocs","title":"Mkdocs","text":"The core file when using mkdocs is the mkdocs.yaml
file, which is the configuration file for the project:
site_name: Documentation of my project\nsite_author: Jane Doe\ndocs_dir: source # (1)!\n\ntheme:\n language: en\n name: material # (2)!\n features: # (3)!\n - content.code.copy\n - content.code.annotate\n\nplugins: # (4)!\n - search\n - mkdocstrings\n\nnav: # (5)!\n - Home: index.md\n
This indicates the source directory of our documentation. If the layout of your documentation is a bit different from what is described above, you may need to change this.
The overall theme of your documentation. We recommend the material
theme but there are many more to choose from and you can also create your own.
The features
section is where features that are supported by your given theme can be enabled. In this example we have enabled content.code.copy
feature which adds a small copy button to all code blocks and the content.code.annotate
feature which allows you to add annotations like this box to code blocks.
Plugins add new functionality to your documentation. In this case we have added two plugins that add functionality for searching through our documentation and automatically adding documentation from docstrings. Remember that some plugins require you to install additional Python packages, so remember to add them to your requirements.txt
file.
The nav
section is where you define the navigation structure of your documentation. When you add new .md
files to the source
folder you then need to add them to the nav
section.
And that is more or less what you need to get started. In general, if you need help with configuration of your documentation in mkdocs I recommend looking at this page and this page.
"},{"location":"s10_extra/documentation/#exercises","title":"Exercises","text":"In this set of exercises we assume that you have completed module M6 on code structure and therefore have a repository that at least contains the following:
\u251c\u2500\u2500 pyproject.toml <- Project configuration file with package metadata\n\u2502\n\u251c\u2500\u2500 docs <- Documentation folder\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 index.md <- Homepage for your documentation\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 mkdocs.yaml <- Configuration file for mkdocs\n\u2502 \u2502\n\u2502 \u2514\u2500\u2500 source/ <- Source directory for documentation files\n\u2502\n\u2514\u2500\u2500 src <- Source code for use in this project.\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 __init__.py <- Makes src a Python module\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 models <- model implementations, training script\n\u2502 \u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u2502 \u251c\u2500\u2500 model.py\n\u2502 \u2502 \u251c\u2500\u2500 train_model.py\n...\n
It is not important exactly what is in the src
folder for the exercises, but we are going to refer to the above structure in the exercises, so adjust accordingly if you deviate from it. Additionally, we are going to assume that your project code is installed in your environment such that it can be imported as normal Python code.
We are going to need two Python packages to get started: mkdocs and material for mkdocs. Install with
pip install \"mkdocs-material >= 4.8.0\" # (1)!\n
mkdocs
is a dependency of mkdocs-material
we only need to install the latter.Run in your terminal (from the docs
folder):
mkdocs serve # (1)!\n
mkdocs serve
will automatically rebuild the whole site whenever you save a file inside the docs
folder. This is not a problem if you have a fairly small site with not that many pages (or elements), but can take a long time for large sites. Consider running with the --dirty
option for only re-building the site for files that have been changed.which should render the index.md
file as the homepage. You can leave the documentation server running during the remaining exercises.
We are no ready to document the API of our code:
Make sure you at least have one function and class inside your src
module. If you do not have you can for simplicity copy the following module to the src/models/model.py
file
import torch\n\nclass MyNeuralNet(torch.nn.Module):\n \"\"\"Basic neural network class.\n\n Args:\n in_features: number of input features\n out_features: number of output features\n\n \"\"\"\n def __init__(self, in_features: int, out_features: int) -> None:\n self.l1 = torch.nn.Linear(in_features, 500)\n self.l2 = torch.nn.Linear(500, out_features)\n self.r = torch.nn.ReLU()\n\n def forward(self, x: torch.Tensor) -> torch.Tensor:\n \"\"\"Forward pass of the model.\n\n Args:\n x: input tensor expected to be of shape [N,in_features]\n\n Returns:\n Output tensor with shape [N,out_features]\n\n \"\"\"\n return self.l2(self.r(self.l1(x)))\n
and the following function to add src/predict_model.py
file:
def predict(\n model: torch.nn.Module,\n dataloader: torch.utils.data.DataLoader\n) -> None:\n \"\"\"Run prediction for a given model and dataloader.\n\n Args:\n model: model to use for prediction\n dataloader: dataloader with batches\n\n Returns\n Tensor of shape [N, d] where N is the number of samples and d is the output dimension of the model\n\n \"\"\"\n return [model(batch) for batch in dataloader]\n
Add a markdown file to the docs/source
folder called my_api.md
and add that file to the nav:
section in the mkdocs.yaml
file.
To that file add the following code:
# My API\n\n::: src.models.model.MyNeuralNet\n\n::: src.predict_model.predict\n
The :::
indicator tells mkdocs that it should look for the corresponding function/module and then render it on the given page. Thus, if you have a function/module located in another location change the paths accordingly.
Make sure that the documentation correctly includes your function and module on the given page.
(Optional) Include more functions/modules in your documentation.
(Optional) Look through the documentation for mkdocstrings and try to improve the layout a bit. Especially, the headings, docstrings and signatures could be of interest to adjust.
Finally, try to build a final version of your documentation
mkdocs build\n
this should result in a site
folder that contains the actual HTML code for documentation.
To publish your documentation you need a place to host your build documentation e.g. the content of the site
folder you build in the last exercise. There are many places to host your documentation, but if you only need a static site and are already hosting your code through Github, then a good option is Github Pages. Github pages is free to use for your public projects.
Before getting started with this set of exercises you should have completed module M16 on GitHub actions so you already know about workflow files.
"},{"location":"s10_extra/documentation/#exercises_1","title":"Exercises","text":"Start by adding a new file called deploy_docs.yaml
to the .github/workflows
folder. Add the following cod to that file and save it.
name: Deploy docs\n\non:\npush:\n branches:\n - main\n\npermissions:\n contents: write # (1)!\n\njobs:\n deploy:\n name: Deploy docs\n runs-on: ubuntu-latest\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n with:\n fetch-depth: 0\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n\n - name: Install dependencies\n run: pip install -r requirements.txt\n\n - name: Deploy docs\n run: mkdocs gh-deploy --force\n
write
permissions to this actions because it is not only reading your code but it will also push code.Before continuing, make sure you understand what the different steps of the workflow does and especially we recommend looking at the documentation of the mkdocs gh-deploy
command.
Commit and push the file. Check that the action is executed and if it succeeds, that your build project is pushed to a branch called gh-pages
. If the action does not succeeds, then figure out what is wrong and fix it!
After confirming that our action is working, you need to configure Github to publish the content being build by Github Actions. Do the following:
Source
setting choose the Deploy from a branch
Branch
setting choose the gh-pages
branch and /(root)
folder and save
This should then start deploying your site to https://<your-username>.github.io/<your-reponame>/
. If it does not do this you may need to recommit and trigger the GitHub actions build again.
Make sure your documentation is published and looks as it should.
This ends the module on technical documentation. We cannot stress enough how important it is to write proper documentation for larger projects that need to be maintained over a longer time. It is often a iterative process, but it is often best to do it while writing the code.
"},{"location":"s10_extra/high_performance_clusters/","title":"M34 - High Performance Clusters","text":""},{"location":"s10_extra/high_performance_clusters/#high-performance-clusters","title":"High Performance Clusters","text":"As discussed in the intro session on the cloud, cloud providers offers near infinite compute resources. However, using these resources comes at a hefty price often and it is therefore important to be aware of another resource many have access to: High Performance Clusters or HPC. HPCs exist all over the world, and many time you already have access to one or can easily get access to one. If you are an university student you most likely have a local HPC that you can access through your institution. Else, there exist public HPC resources that everybody (with a project) can apply for. As an example in the EU we have EuroHPC initiative that currently has 8 different supercomputers with a centralized location for applying for resources that are both open for research projects and start-ups.
Depending on your application, you may have different needs and it is therefore important to be aware also of the different tiers of HPC. In Europe, HPC are often categorized such that Tier-0 are European Centers with petaflop or hexascale machines,\u00a0Tier 1 are National centers of supercomputers, and Tier 2 are Regional centers. The lower the Tier, the larger applications it is possible to run.
"},{"location":"s10_extra/high_performance_clusters/#cluster-architectures","title":"Cluster architectures","text":"In very general terms, cluster can come as two different kind of systems: supercomputers and LSF (Load Sharing Facility). A supercomputer (as shown below) is organized into different modules, that are separated by network link. When you login to a supercomputer you will meet the front end which contains all the software needed to run computations. When you submit a job it will get sent to the backend modules which in most cases includes: general compute modules (CPU), acceleration modules (GPU), a memory module (RAM) and finally a storage module (HDD). Depending on your application you may need one module more than another. For example in deep learning the acceleration module is important but in physics simulation the general compute module / storage model is probably more important.
Overview of the Meluxina supercomputer that's part of EuroHPC. Image creditAlternatively, LSF are a network of computers where each computer has its own CPU, GPU, RAM etc. and the individual computes (or nodes) are then connected by network. The important different between a supercomputer and as LSF systems is how the resources are organized. When comparing supercomputers to LSF system it is generally the case that it is better to run on a LSF system if you are only requesting resources that can be handled by a single node, however it is better to run on a supercomputer if you have a resource intensive application that requires many devices to communicate with each others.
Regardless of cluster architectures, on the software side of HPC, the most important part is what's called the HPC scheduler. Without a HPC scheduler an HPC cluster would just be a bunch of servers with different jobs interfering with each other. The problem is when you have a large collection of resources and a large collection of users, you cannot rely on the users just running their applications without interfering with each other. A HPC scheduler is in charge of managing that whenever an user request to run an application, they get put in a queue and whenever the resources their application ask for are available the application gets run.
The biggest bach control systems for doing scheduling on HPC are:
We are going to take a look at PBS works as that is what is installed on our local university cluster.
"},{"location":"s10_extra/high_performance_clusters/#exercises","title":"\u2754 Exercises","text":"Exercise files
The following exercises are focused on local students at DTU that want to use our local HPC resources. That said, the steps in the exercise are fairly general to other types of cluster. For the purpose of this exercise we are going to see how we can run this image classifier script , but feel free to work with whatever application you want to.
Start by accessing the cluster. This can either be through ssh
in a terminal or if you want a graphical interface thinlinc can be installed. In general we recommend following the steps here for DTU students as the setup depends on if you are on campus or not.
When you have access to the cluster we are going to start with the setup phase. In the setup phase we are going to setup the environment necessary for our computations. If you have accessed the cluster through graphical interface start by opening a terminal.
Lets start by setting up conda for controlling our dependencies. If you have not already worked with conda
, please checkout module M2 on package managers and virtual environments. In general you should be able to setup (mini)conda through these two commands:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nsh Miniconda3-latest-Linux-x86_64.sh\n
Close the terminal and open a new for the installation to complete. Type conda
in the terminal to check that everything is fine. Go ahead and create a new environment that we can install dependencies in
conda create -n \"hpc_env\" python=3.10 --no-default-packages\n
and activate it.
Copy over any files you need. For the image classifier script you need the requirements file and the actual application.
Next, install all the requirements you need. If you want to run the image classifier script you can run this command in the terminal
pip install -r image_classifier_requirements.txt\n
using this requirements file.
That's all the setup needed. You would need to go through the creating of environment and installation of requirements whenever you start a new project (no need for reinstalling conda). For the next step we need to look at how to submit jobs on the cluster. We are now ready to submit the our first job to the cluster:
Start by checking the statistics for the different clusters. Try to use both the qstat
command which should give an overview of the different cluster, number of running jobs and number of pending jobs. For many system you can also try the much more user friendly command classstat
command.
Figure out which queue you want to use. For the sake of the exercises it needs to be one with GPU support. For DTU students, any queue that starts with gpu
are GPU accelerated.
Now we are going to develop a bash script for submitting our job. We have provided an example of such scripts. Take a careful look and go each line and make sure you understand it. Afterwards, change it to your needs (queue and student email).
Try to submit the script:
bsub < jobscript.sh\n
You can check the status of your script by running the bstat
command. Hopefully, the job should go through really quickly. Take a look at the output file, it should be called something like gpu_*.out
. Also take a look at the gpu_*.err
file. Does both files look as they should?
Lets now try to run our application on the cluster. To do that we need to take care of two things:
First we need to load the correct version of CUDA. A cluster system often contains multiple versions of specific software to suit the needs of all their users, and it is the users that are in charge of loading the correct software during job submission. The only extra software that needs to be loaded for most PyTorch applications are a CUDA module. You can check which modules are available on the cluster with
module avail\n
Afterwards, add the correct CUDA version you need to the jobscript.sh
file. If you are trying to run the provided image classifier script then the correct version is CUDA/11.7
(can be seen in the requirements file).
# add to the bottom of the file\nmodule load cuda/11.7\n
We are now ready to add in our application. The only thing we need to take care of is telling the system to run it using the python
version that is connected to our hpc_env
we created in the beginning. Try typing:
which python\n
which should give you the full path. Then add to the bottom of the jobscript
file:
~/miniconda3/envs/hpc_env/bin/python \\\n image_classifier.py \\\n --trainer.accelerator 'gpu' --trainer.devices 1 --trainer.max_epochs 5\n
which will run the image classifier script (change it if you are running something else).
Finally submit the job:
bsub < jobscript.sh\n
and check when it is done that it has produced what you expected.
(Optional) If you application supports multi GPUs also try that out. You would first need to change the jobscript to request multiple GPUs and additionally you would need to tell your application to run on multiple GPUs. For the image classifier script it can be done by changing the --trainer.devices
flag to 2
(or higher).
This ends the module on using HPC systems.
"},{"location":"s10_extra/hyperparameters/","title":"M33 - Hyperparameter optimization","text":""},{"location":"s10_extra/hyperparameters/#hyperparameter-optimization","title":"Hyperparameter optimization","text":"Outdated module
This module has not been updated for a long time and therefore some functionality of Optuna, which is used in these exercises, may not be included. If you have completed the module on Weights & Bias then we highly recommend instead using their sweep functionality.
Hyperparameter optimization is not a new idea within machine learning but have somewhat seen a renaissance with the uprise of deep learning. This can mainly be contributed to the following:
However the problem with doing hyperparameter optimization of a deep learning models is that it can take over a week to train a single model. In most cases we therefore cannot do a full grid search of all hyperparameter combinations to get the best model. Instead we have to do some tricks that will help us speed up our searching. In these exercises we are going to be integrating optuna into our different models, that will provide the tools for speeding up our search.
It should be noted that a lot of deep learning models does not optimize every hyperparameter that is included in the model but instead relies on heuristic guidelines (\"rule of thumb\") based on what seems to be working in general e.g. a learning rate of 0.01 seems to work great with the Adam optimizer. That said, these rules probably only apply to 80% of deep learning model, whereas for the last 20% the recommendations may be suboptimal Here is a great site that has collected an extensive list of these recommendations, taken from the excellent deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
In practice, I recommend trying to identify (through experimentation) which hyperparameters that are important for the performance of your model and then spend your computational budget trying to optimize them while setting the rest to a \"recommended value\".
"},{"location":"s10_extra/hyperparameters/#exercises","title":"\u2754 Exercises","text":"Exercise files
Start by installing optuna: pip install optuna
Initially we will look at the cross_validate.py
file. It implements simple K-fold cross validation of a random forest sklearn digits dataset (subset of MNIST). Look over the script and try to run it.
We will now try to write the same code in optune. Please note that the script have a variable OPTUNA=False
that you can use to change what part of the code should run. The three main concepts of optuna is
A trial: a single experiment
A study: a collection of trials
The objective: function to determine how \"good\" a trial is
Lets start by writing the objective function, which we have already started in the script. For now you do not need to care about the trial
argument, just assume that it contains the hyperparameters needed to define your random forest model. The output of the objective function should be a single number that we want to optimize. (HINT: did you remember to do K-fold crossvalidation inside your objective function?)
Next lets focus on the trial. Inside the objective
function the trial should be used to suggest what parameters to use next. Take a look at the documentation for trial or take a look at the code examples and figure out how to define the hyperparameter of the model.
Finally lets launch a study. It can be as simple as
study = optuna.create_study()\nstudy.optimize(objective, n_trials=100)\n
but lets play around a bit with it:
By default the .optimize
method will minimize the objective (by definition the optimum of an objective function is at its minimum). Is the score your objective function is returning something that should be minimized? If not, a simple solution is to put a -
in front of the metric. However, look through the documentation on how to change the direction of the optimization.
Optuna will by default do Bayesian optimization when sampling the hyperparameters (using a evolutionary algorithm for suggesting new trials). However, since this example is quite simple, we can actually perform a full grid search. How would you do this in Optuna?
Compare the performance of a single optuna run using Bayesian optimization with n_trials=10
with a exhaustive grid search that have search through all hyperparameters. What is the performance/time trade-off for these two solutions?
In addition to doing baysian optimization, the other great part about Optuna is that it have native support for Pruning unpromising trials. Pruning refers to the user stopping trials for hyperparameter combinations that does not seem to lead anywhere. You may have learning rate that is so high that training is diverging or a neural network with too many parameters so it is just overfitting to the training data. This however begs the question: what constitutes an unpromising trial? This is up to you to define based on prior experimentation.
Start by looking at the fashion_trainer.py
script. Its a simple classification network for classifying images in the FashionMNIST dataset. Run the script with the default hyperparameters to get a feeling of how the training should be progress. Note down the performance on the test set.
Start by defining a validation set and a validation dataloader that we can use for hyperparameter optimization (HINT: use 5-10% of you training data).
Now, adjust the script to use Optuna. The 5 hyperparameters listed in the table above should at least be included in the hyperparameter search. For some we have already defined the search space but for the remaining you need to come up with a good range of values to investigate. We done integrating optuna, run a small study (n_tirals=3
) to check that the code is working.
nn.ReLU
, nn.Tanh
, nn.RReLU
, nn.LeakyReLU
, nn.ELU
} If implemented correctly the number of hyperparameter combinations should be at least 1000, meaning that we not only need baysian optimization but probably also need pruning to succeed. Checkout the page for built-in pruners in Optuna. Implement pruning in the script. I recommend using either the MedianPruner
or the ProcentilePruner
.
Re-run the study using pruning with a large number of trials (n_trials>50
)
Take a look at this visualization page for ideas on how to visualize the study you just did. Make at least two visualization of the study and make sure that you understand them.
Pruning is great for better spending your computational budged, however it comes with a trade-off. What is it and what hyperparameter should one be especially careful about when using pruning?
Finally, what parameter combination achieved the best performance? What is the test set performance for this set of parameters. Did you improve over the initial set of hyperparameters?
The exercises until now have focused on doing the hyperparameter searching sequentially, meaning that we test one set of parameters at the time. It is a fine approach because you can easily let it run for a week without any interaction. However, assuming that you have the computational resources to run in parallel, how do you do that?
To run hyperparameter search in parallel we need a common database that all experiments can read and write to. We are going to use the recommended mysql
. You do not have to understand what SQL is to complete this exercise, but it is basically a language (like python) for managing databases. Install mysql.
Next we are going to initialize a database that we can read and write to. For this exercises we are going to focus on a locally stored database but it could of course also be located in the cloud.
mysql -u root -e \"CREATE DATABASE IF NOT EXISTS example\"\n
you can also do this directly in Python when calling the create_study
command by also setting the storage
and load_if_exists=True
flags.
Now we are going to create a Optuna study in our database
optuna create-study --study-name \"distributed-example\" --storage \"mysql://root@localhost/example\"\n
Change how you initialize the study to read and write to the database. Therefore, instead of doing
study = optuna.create_study()\n
then do
study = optuna.load_study(\n study_name=\"distributed-example\", storage=\"mysql://root@localhost/example\"\n)\n
where the study_name
and storage
should match how the study was created.
For running in parallel, you can either open up a extra terminal and simple launch your script once per open terminal or you can use the provided parallel_lancher.py
that will launch multiple executions of your script. It should be used as:
python parallel_lancher.py myscript.py --num_parallel 2\n
Finally, make sure that you can access the results
That's all on how to do hyperparameter optimization in a scalable way. If you feel like it you can try to apply these techniques on the ongoing corrupted MNIST example, where you are free to choose what hyperparameters that you want to use.
"},{"location":"s10_extra/infrastructure_as_code/","title":"Infrastructure as code","text":"Danger
Module is still under development
"},{"location":"s10_extra/infrastructure_as_code/#infrastructure-as-code-iac","title":"Infrastructure as Code (IaC)","text":"Infrastructure as Code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. The IT infrastructure managed by this comprises both physical equipment such as bare-metal servers as well as virtual machines and associated configuration resources. The definitions are written in a high-level programming language and can be versioned, and the code can be tested and validated.
"},{"location":"s10_extra/infrastructure_as_code/#terraform","title":"Terraform","text":"Terraform is an open-source infrastructure as code software tool created by HashiCorp. It enables users to define and provision a datacenter infrastructure using a high-level configuration language known as Hashicorp Configuration Language (HCL), or optionally JSON. It allows infrastructure to be expressed as code in a simple, human-readable language called HCL (HashiCorp Configuration Language). It supports a multitude of cloud providers, including AWS, Azure, Google Cloud, and many others.
"},{"location":"s10_extra/infrastructure_as_code/#installation","title":"Installation","text":"To install Terraform, download the appropriate package for your operating system from the official Terraform website. Once downloaded, unzip the package and move the binary to a directory included in your system's PATH.
"},{"location":"s10_extra/infrastructure_as_code/#getting-started","title":"Getting started","text":"To get started with Terraform, you need to create a configuration file. This file is a human-readable file that describes the infrastructure and set of resources to be created. The file is saved with a .tf
extension. Here is an example of a simple Terraform configuration file that creates an AWS EC2 instance:
provider \"aws\" {\n region = \"us-west-2\"\n}\n\nresource \"aws_instance\" \"example\" {\n ami = \"ami-0c55b159cbfafe1f0\"\n instance_type = \"t2.micro\"\n}\n
To create the infrastructure described in the configuration file, navigate to the directory containing the file and run the following commands:
terraform init\nterraform apply\n
The terraform init
command is used to initialize a working directory containing Terraform configuration files. This is the first command that should be run after writing a new Terraform configuration or cloning an existing one from version control. The terraform apply
command is used to apply the changes required to reach the desired state of the configuration, or the pre-determined set of actions generated by a terraform plan
execution plan.
Danger
Module is still under development
"},{"location":"s10_extra/kubernetes/#kubernetes","title":"Kubernetes","text":"Kubernetes, also known as K8s, is an open-source platform designed to automate the deployment, scaling, and operation of application containers across clusters of hosts. It provides the framework to run distributed systems resiliently, handling scaling and failover for your applications, providing deployment patterns, and more.
"},{"location":"s10_extra/kubernetes/#what-is-kubernetes","title":"What is Kubernetes?","text":""},{"location":"s10_extra/kubernetes/#brief-history","title":"Brief History","text":"Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation.
"},{"location":"s10_extra/kubernetes/#core-functions","title":"Core Functions","text":"Kubernetes makes it easier to deploy and manage containerized applications at scale.
"},{"location":"s10_extra/kubernetes/#key-concepts","title":"Key Concepts","text":"Kubernetes follows a client-server architecture. At a high level, it consists of a Control Plane (master) and Nodes (workers).
Overview of Kubernetes Architecture. Image Credit: Kubernetes Official Documentation"},{"location":"s10_extra/kubernetes/#control-plane-components","title":"Control Plane Components","text":"Minikube is a tool that allows you to run Kubernetes locally. It runs a single-node Kubernetes cluster inside a VM on your laptop for users looking to try out Kubernetes or develop with it day-to-day.
"},{"location":"s10_extra/kubernetes/#installing-minikube","title":"Installing Minikube","text":"minikube start
.minikube
in a terminal.kubectl
in a terminal.Yatai is a model serving platform, making it easier to deploy machine learning models in Kubernetes environments.
"},{"location":"s10_extra/kubernetes/#what-is-yatai","title":"What is Yatai?","text":"Yatai simplifies the deployment, management, and scaling of machine learning models in Kubernetes.
"},{"location":"s10_extra/kubernetes/#getting-started-with-yatai","title":"Getting Started with Yatai","text":"Danger
Module is still under development
"},{"location":"s10_extra/orchestration/#workflow-orchestration","title":"Workflow orchestration","text":""},{"location":"s10_extra/orchestration/#prefect","title":"Prefect","text":"If you give an MLOps engineer a job
pip install prefect\n
from prefect import task, Flow\n
"},{"location":"s10_extra/orchestration/#exercises","title":"\u2754 Exercises","text":"Start by installing prefect
:
pip install prefect\n
Start a local Prefect server instance in your virtual environment.
prefect server start\n
The great thing about Prefect is that the orchestration tasks and flows are written in pure Python.
Danger
Module is still under development
"},{"location":"s10_extra/quantization/#exercises","title":"\u2754 Exercises","text":"We are in these exercises going to be looking at two different kinds of quantization strategies: quantization-aware training and post-training quantization. As the names suggest, the quantization is either applied while training or after training. There are good reasons for doing both:
If the model you are going to deploy in the end needs to be quantized, either due to hard requirements for how the big the model can be or in the effort to optimize inference time, quantization-aware training is the better approach. The reason here being that the model is specifically optimized to always be quantized and therefore in general end up with a better model.
If the most important metric for deployment is the overall performance of the model with no regards to model size and inference speed, post-training quantization is the better option. This allows you to most likely train a better model to begin with and then try out converting the model afterwards. In the best case this can be done without any hits to performance.
Start by installing intel neural compressor
pip install neural_compressor\n
and remember to add this to your requirements.txt
file.
Let's start a new script called model_converter.py
. Start by filling it with some simple code for loading a given float32
model checkpoint. You should already have such code from earlier exercises. Preferably, add a small CLI interface to load a model by passing the filename in the command line:
python model_converter.py model_checkpoint.ckpt\n
Solution We are here going to assume that you are either loading from a onnx
model or alternatively loading a PyTorch Lightning checkpoint:
from typer import App\nimport onnx\nfrom onnx.onnx_ml_pb2 import ModelProto\nfrom pytorch_lightning import LightningModule\nfrom my_model import MyModel\napp = App()\n\n@app.command()\n@app.argument(\"model_checkpoint\")\ndef quantize(model_checkpoint: ModelProto | LightningModule) -> None:\n if isinstance(model_checkpoint, LightningModule):\n model = MyModel.load_from_checkpoint(model_checkpoint)\n else:\n model = onnx.load(model_checkpoint)\n
Next you also need to add
Finally, calculate the size (in MB) of the original model and the quantized model. How much smaller is the quantized model?
SolutionAssuming the models are saved as checkpoint.ckpt
and checkpoint_quantized.ckpt
we can calculate the size using os.path.getsize
in Python:
original_size = os.path.getsize(\"models/checkpoint.onnx\") / (1024 * 1024)\nquantized_size = os.path.getsize(\"models/checkpoint_quantized.onnx\") / (1024 * 1024)\n
The quantized model should be very close to 4 times smaller as int4
only uses 1/4 the bits to store weights compared to float32
format.
Slides
Learn the basics of the command line, and how to use it to navigate your file system and run programs.
M1: Command line
Learn how package managers work in Python and how to create reproducible virtual environments using conda
and pip
.
M2: Package Manager
Learn how to use a modern editor for code development.
M3: Editor
Refresh your PyTorch skills and implement a simple deep-learning model.
M4: Deep Learning Software
Today we start our journey into the world of machine learning operations (MLOps). However, before we can get started, we need to make sure that you have a basic understanding of a couple of topics, as we will be using these throughout the course. In particular, today is all about getting set up with a proper development environment that can support your journey. Most of you probably already have experience with these topics, and it will be mostly repetition.
The reason we are starting here is that many students are missing very basic skills that are never taught but are just expected to be picked up by yourself. This session will only cover the most basic skills to get you started on your development journey. If you wish to learn more about basic computer science skills in general, we highly recommend that you check out The Missing Semester of Your CS Education course from MIT.
Learning objectives
The learning objectives of this session are:
Core Module
Image creditContrary to popular belief, the command line (also commonly known as the terminal) is not a mythical being that has existed since the dawn of time. Instead, it was created at a time when it was not given that your computer had a graphical interface that you could interact with. Think of it as a text interface to your computer.
The terminal is a well-known concept to users of Linux; however, MAC and (especially) Windows users often do not need it and therefore encounter it. Having a basic understanding of how to use a command line can help improve your workflow. The reason that the command line is an important tool to get to know is that doing any kind of MLOps will require us to be able to interact with many different tools, many of which do not have a graphical interface. Additionally, when we get to working in the cloud later in the course, you will be forced to interact with the command line.
Note if you already are a terminal wizard then feel free to skip the exercises below. They are very elementary.
"},{"location":"s1_development_environment/command_line/#the-anatomy-of-the-command-line","title":"The anatomy of the command line","text":"Regardless of the operating system, all command lines look more or less the same:
As already stated, it is essentially just a big text interface to interact with your computer. As the image illustrates, when trying to execute a command, there are several parts to it:
$
, >
, :
are the usual ones. It can also contain other information, such as in the case of the above image which also shows the current conda
environment.ls
or cd
.ls -l
or cd ..
.ls -l figures
or cd ..
.The core difference between options and arguments is that options are optional, while arguments are not.
Image credit"},{"location":"s1_development_environment/command_line/#exercises","title":"\u2754 Exercises","text":"We have put a cheat sheet in the exercise files folder belonging to this session, that gives a quick overview of the different commands that can be executed in the command line.
Windows usersWe highly recommend that you install Windows Subsystem for Linux (WSL). This will install a full Linux system on your Windows machine. Please follow this guide. Remember to run commands from an elevated (as administrator) Windows Command Prompt. You can in general complete all exercises in the course from a normal Windows Command Prompt, but some are easier to do if you run from WSL.
If you decide to run in WSL, you need to remember that you now have two different systems, and installing a package on one system does not mean that it is installed on the other. For example, if you install pip
in WSL, you need to install it again in Windows if you want to use it there.
If you decide not to run in WSL, please always work in a Windows Command Prompt and not PowerShell.
Start by opening a terminal.
To navigate inside a terminal, we rely on the cd
command and pwd
command. Make sure you know how to go back and forth in your file system. (1)
The ls
command is important when we want to know the content of a folder. Try to use the command, and also try it with the additional option -l
. What does it show?
Make sure to familiarize yourself with the which
, echo
, cat
, wget
, less
, and top
commands. Also, familiarize yourself with the >
operator. You are probably going to use some of them throughout the course or in your future career. For Windows users, these commands may be named something else, e.g., where
command on Windows corresponds to which
.
It is also important that you know how to edit a file through the terminal. Most systems should have the nano
editor installed; if not, try to figure out which editor is installed on your system.
Type nano
in the terminal.
Write the following text in the script
if __name__ == \"__main__\":\n print(\"Hello world!\")\n
Save the script and try to execute it.
Afterward, try to edit the file through the terminal (change Hello world
to something else).
All terminals come with a programming language. The most common one is called bash
, and being able to write simple programs in bash can come in handy. For example, if you want to execute multiple Python programs sequentially, this can be done through a bash script.
Bash is not part of Windows, so you need to run this part through WSL. If you did not install WSL, you can skip this part or, as an alternative, do the exercises in PowerShell, which is the native Windows scripting language (not recommended).
Write a bash script (in nano
) and try executing it:
#!/bin/bash\n# A sample Bash script, by Ryan\necho Hello World!\n
Change the bash script to call the Python program you just wrote.
Try to Google how to write a simple for-loop that executes the Python script 10 times in a row.
A trick you may need throughout this course is setting environment variables. An environment variable is just a dynamically named value that may alter the way running processes behave on a computer. The syntax for setting an environment variable depends on your operating system:
Windows: set MY_VAR=hello\necho %MY_VAR%\n
Linux/Mac: export MY_VAR=hello\necho $MY_VAR\n
Try to set an environment variable and print it out.
To use an environment variable in a Python program, you can use the os.environ
dictionary from the os
module. Write a Python program that prints out the environment variable you just set.
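A minimal sketch of what such a program could look like (assuming you already exported MY_VAR as shown above):
import os\n\n# read the variable that was set in the shell; .get() lets you supply a fallback value\nprint(os.environ["MY_VAR"])\nprint(os.environ.get("MY_OTHER_VAR", "not set"))\n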
If you have a collection of environment variables, these can be stored in a file called .env
. The file is formatted as follows:
MY_VAR=hello\nMY_OTHER_VAR=world\n
To load the environment variables from the file, you can use the python-dotenv
package. Install it with pip install python-dotenv
and then try to load the environment variables from the file and print them out.
from dotenv import load_dotenv\nload_dotenv()\nimport os\nprint(os.environ[\"MY_VAR\"])\n
Here is one command from later in the course when we are going to work in the cloud
gcloud compute instances create-with-container instance-1 \\\n --container-image=gcr.io/<project-id>/gcp_vm_tester \\\n --zone=europe-west1-b\n
Identify the command, options, and arguments.
SolutionThe command is gcloud compute instances create-with-container
. The options are --container-image=gcr.io/<project-id>/gcp_vm_tester
and --zone=europe-west1-b
. The argument is instance-1
.The tricky part of this example is that commands can have subcommands, which are also commands. In this case, compute
is a subcommand to gcloud
, instances
is a subcommand to compute
, and create-with-container
is a subcommand to instances
.
Two common options that nearly all commands have are the -h
and -V
options. What does each of them do?
The -h
(or --help
) option prints the help message for the command, including subcommands and arguments. Try it out by executing python -h
. The -V
(or --version
) option prints the version of the installed program. Try it out by executing python --version
.
This ends the module on the command line. If you are still not comfortable working with the command line, fear not as we are going to use it extensively throughout the course. If you want to spend additional time on this topic, we highly recommend that you watch this video on how to use the command line.
If you are interested in personalizing your command line, you can check out the starship project, which allows you to customize your command line with a lot of different options.
"},{"location":"s1_development_environment/deep_learning_software/","title":"M4 - Deep Learning Software","text":""},{"location":"s1_development_environment/deep_learning_software/#deep-learning-software","title":"Deep Learning Software","text":"Core Module
Deep learning has, since its revolution back in 2012, transformed our lives. From Google Translate to driverless cars to personal assistants to protein engineering, deep learning is transforming nearly every sector of our economy and our lives. However, it did not take long before people realized that deep learning is not a simple beast to tame and it comes with its own kinds of problems, especially if you want to use it in a production setting. In particular, the concept of technical debt was borrowed from software engineering to describe the significant system-level maintenance costs of running machine learning in production. MLOps should very much be seen as the response to the concept of technical debt, namely that we should develop methods, processes, and tools (with inspiration from classical DevOps) to counter the problems we run into when working with deep learning models.
It is important to note that all the concepts and tools that have been developed for MLOps can be used together with more classical machine learning models (think K-nearest neighbor, Random forest, etc.), however, deep learning comes with its own set of problems which mostly have to do with the sheer size of the data and models we are working with. For these reasons, we are focusing on working with deep learning models in this course.
"},{"location":"s1_development_environment/deep_learning_software/#software-landscape-for-deep-learning","title":"Software Landscape for Deep Learning","text":"Regarding software for Deep Learning, the landscape is currently dominated by three software frameworks (listed in order of when they were published):
TensorFlow
PyTorch
JAX
We won't go into a longer discussion on which framework is best, as it is pointless. PyTorch and TensorFlow have been around for the longest and therefore have bigger communities and feature sets at this point in time. They are both very similar in the sense that they both have features aimed at both research and production. JAX is kind of the new kid on the block, which in many ways improves on PyTorch and TensorFlow, but is still not as mature as the other frameworks. As the frameworks use different kinds of programming principles (object-oriented vs. functional programming), comparing them is essentially meaningless.
In this course, we have chosen to work with PyTorch because we find it a bit more intuitive and it is the framework that we use for our day-to-day research life. Additionally, as of right now, it is by far the dominant framework for published models, research papers, and competition winners.
The intention behind this set of exercises is to bring everyone's PyTorch skills up to date. If you already are a PyTorch-Jedi, feel free to skip the first set of exercises, but we recommend that you still complete them. The exercises are, in large part, taken directly from the deep learning course at Udacity. Note that these exercises are given as notebooks, which is the last time we are going to use them actively in the course. Instead, after this set of exercises, we are going to focus on writing code in Python scripts.
The notebooks contain a lot of explanatory text. The exercises that you are supposed to fill out are inlined in the text in small \"exercise\" blocks:
If you need a refresher on any deep learning topic in general throughout the course, we recommend finding the relevant chapter in the deep learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (which can also be found in the literature folder). It is not necessary to be good at deep learning to pass this course as the focus is on all the software needed to get deep learning models into production. However, it's important to have a basic understanding of the concepts.
"},{"location":"s1_development_environment/deep_learning_software/#exercises","title":"\u2754 Exercises","text":"Exercise files
Start a Jupyter Notebook session in your terminal (assuming you are standing at the root of the course material). Alternatively, you should be able to open the notebooks directly in your code editor. For VS code users you can read more about how to work with Jupyter Notebooks in VS code here
Complete the Tensors in PyTorch notebook. It focuses on the basic manipulation of PyTorch tensors. You can pass this notebook if you are comfortable doing this.
Complete the Neural Networks in PyTorch notebook. It focuses on building a very simple neural network using the PyTorch nn.Module
interface.
Complete the Training Neural Networks notebook. It focuses on how to write a simple training loop for training a neural network.
Complete the Fashion MNIST notebook, which summarizes concepts learned in notebooks 2 and 3 on building a neural network for classifying the Fashion MNIST dataset.
Complete the Inference and Validation notebook. This notebook adds important concepts on how to do inference and validation on our neural network.
Complete the Saving_and_Loading_Models notebook. This notebook addresses how to save and load model weights. This is important if you want to share a model with someone else.
If tensor a
has shape [N, d]
and tensor b
has shape [M, d]
how can we calculate the pairwise distance between rows in a
and b
without using a for loop?
We can take advantage of broadcasting to do this
import torch\n\nN, M, d = 10, 20, 5\na = torch.randn(N, d)\nb = torch.randn(M, d)\n# squared pairwise distances via broadcasting; call .sqrt() on the result for the Euclidean distance\ndist = torch.sum((a.unsqueeze(1) - b.unsqueeze(0))**2, dim=2) # shape [N, M]\n
What should be the size of S
for an input image of size 1x28x28, and how many parameters does the neural network then have?
from torch import nn\nneural_net = nn.Sequential(\n nn.Conv2d(1, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3), nn.ReLU(), nn.Flatten(), nn.Linear(S, 10)\n)\n
Solution Since both convolutions have a kernel size of 3, stride 1 (default value) and no padding that means that we lose 2 pixels in each dimension, because the kernel can not be centered on the edge pixels. Therefore, the output of the first convolution would be 32x26x26. The output of the second convolution would be 64x24x24. The size of S
must therefore be 64 * 24 * 24 = 36864
. The number of parameters in a convolutional layer is kernel_size * kernel_size * in_channels * out_channels + out_channels
(last term is the bias) and the number of parameters in a linear layer is in_features * out_features + out_features
(last term is the bias). Therefore, the total number of parameters in the network is 3*3*1*32 + 32 + 3*3*32*64 + 64 + 36864*10 + 10 = 387,466
, which could be calculated by running:
from math import prod\n\nsum([prod(p.shape) for p in neural_net.parameters()])\n
A working training loop in PyTorch should have these three function calls: optimizer.zero_grad()
, loss.backward()
, optimizer.step()
. Explain what would happen in the training loop (or implement it) if you forgot each of the function calls.
optimizer.zero_grad()
is in charge of zeroing the gradients. If this is not done, gradients accumulate across steps, leading to incorrect updates and potentially exploding gradients. loss.backward()
is in charge of calculating the gradients. If this is not done, then the gradients will not be calculated and the optimizer will not be able to update the weights. optimizer.step()
is in charge of updating the weights. If this is not done, then the weights will not be updated and the model will not learn anything.
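To make the role of the three calls concrete, here is a minimal, self-contained sketch of a training loop (a toy linear model on random data, purely for illustration):
import torch\nfrom torch import nn\n\nmodel = nn.Linear(10, 1)\noptimizer = torch.optim.SGD(model.parameters(), lr=0.01)\nloss_fn = nn.MSELoss()\nx, y = torch.randn(32, 10), torch.randn(32, 1)\n\nfor step in range(100):\n    optimizer.zero_grad()  # reset gradients accumulated from the previous step\n    loss = loss_fn(model(x), y)\n    loss.backward()  # compute gradients of the loss w.r.t. the parameters\n    optimizer.step()  # update the parameters using the gradients\n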
As the final exercise, we will develop a simple baseline model that we will continue to develop during the course. For this exercise, we provide the data in the data/corruptmnist
folder. Do NOT use the data in the corruptmnist_v2
folder as that is intended for another exercise. As the name suggests, this is a (subsampled) corrupted version of the regular MNIST. Your overall task is the following:
Implement an MNIST neural network that achieves at least 85% accuracy on the test set.
Before any training can start, you should identify the corruption that we have applied to the MNIST dataset to create the corrupted version. This can help you identify what kind of neural network to use to get good performance, but any network should be able to achieve this.
One key point of this course is trying to stay organized. Spending time now organizing your code will save time in the future as you start to add more and more features. As subgoals, please fulfill the following exercises:
Implement your model in a script called model.py
.
model.py
model.pyfrom torch import nn\n\n\nclass MyAwesomeModel(nn.Module):\n \"\"\"My awesome model.\"\"\"\n\n def __init__(self) -> None:\n super().__init__()\n self.fc1 = nn.Linear(784, 128)\n
Solution The provided solution implements a convolutional neural network with 3 convolutional layers and a single fully connected layer. Because the MNIST dataset consists of images, we want an architecture that can take advantage of the spatial information in the images.
model.pyimport torch\nfrom torch import nn\n\n\nclass MyAwesomeModel(nn.Module):\n \"\"\"My awesome model.\"\"\"\n\n def __init__(self) -> None:\n super().__init__()\n self.conv1 = nn.Conv2d(1, 32, 3, 1)\n self.conv2 = nn.Conv2d(32, 64, 3, 1)\n self.conv3 = nn.Conv2d(64, 128, 3, 1)\n self.dropout = nn.Dropout(0.5)\n self.fc1 = nn.Linear(128, 10)\n\n def forward(self, x: torch.Tensor) -> torch.Tensor:\n \"\"\"Forward pass.\"\"\"\n x = torch.relu(self.conv1(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.relu(self.conv2(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.relu(self.conv3(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.flatten(x, 1)\n x = self.dropout(x)\n return self.fc1(x)\n\n\nif __name__ == \"__main__\":\n model = MyAwesomeModel()\n print(f\"Model architecture: {model}\")\n print(f\"Number of parameters: {sum(p.numel() for p in model.parameters())}\")\n\n dummy_input = torch.randn(1, 1, 28, 28)\n output = model(dummy_input)\n print(f\"Output shape: {output.shape}\")\n
Implement your data setup in a script called data.py
. The data was saved using torch.save
, so to load it you should use torch.load
.
Saving the model
When saving the model, you should use torch.save(model.state_dict(), \"model.pt\")
, and when loading the model, you should use model.load_state_dict(torch.load(\"model.pt\"))
. If you do torch.save(model, \"model.pt\")
, this can lead to problems when loading the model later on, as it will try to not only save the model weights but also the model definition. This can lead to problems if you change the model definition later on (which you most likely are going to do).
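As a small sketch of the recommended pattern (assuming the MyAwesomeModel class from the model.py exercise above):
import torch\n\nfrom model import MyAwesomeModel\n\nmodel = MyAwesomeModel()\n# ... training happens here ...\ntorch.save(model.state_dict(), "model.pt")  # save only the weights\n\n# later, e.g. in another script: re-create the architecture, then load the weights\nmodel = MyAwesomeModel()\nmodel.load_state_dict(torch.load("model.pt"))\nmodel.eval()\n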
data.py
data.pyimport torch\n\n\ndef corrupt_mnist():\n \"\"\"Return train and test dataloaders for corrupt MNIST.\"\"\"\n # exchange with the corrupted mnist dataset\n train = torch.randn(50000, 784)\n test = torch.randn(10000, 784)\n return train, test\n
Solution Data is stored in .pt
files which can be loaded using torch.load
(1). We iterate over the files, load them and concatenate them into a single tensor. In particular, we have highlighted the use of .unsqueeze
function. Convolutional neural networks (which we propose as a solution) need the data to be in the shape [N, C, H, W]
where N
is the number of samples, C
is the number of channels, H
is the height of the image and W
is the width of the image. The dataset is stored in the shape [N, H, W]
and therefore we need to add a channel.
.pt
files are nothing else than a .pickle
file in disguise. The torch.save/torch.load
function is essentially a wrapper around the pickle
module in Python, which produces serialized files. However, it is convention to use .pt
to indicate that the file contains PyTorch tensors.We have additionally in the solution added functionality for plotting the images together with the labels for inspection. Remember: all good machine learning starts with a good understanding of the data.
model.pyfrom __future__ import annotations\n\nimport matplotlib.pyplot as plt # only needed for plotting\nimport torch\nfrom mpl_toolkits.axes_grid1 import ImageGrid # only needed for plotting\n\nDATA_PATH = \"data/corruptmnist\"\n\n\ndef corrupt_mnist() -> tuple[torch.utils.data.Dataset, torch.utils.data.Dataset]:\n \"\"\"Return train and test dataloaders for corrupt MNIST.\"\"\"\n train_images, train_target = [], []\n for i in range(5):\n train_images.append(torch.load(f\"{DATA_PATH}/train_images_{i}.pt\"))\n train_target.append(torch.load(f\"{DATA_PATH}/train_target_{i}.pt\"))\n train_images = torch.cat(train_images)\n train_target = torch.cat(train_target)\n\n test_images = torch.load(f\"{DATA_PATH}/test_images.pt\")\n test_target = torch.load(f\"{DATA_PATH}/test_target.pt\")\n\n train_images = train_images.unsqueeze(1).float()\n test_images = test_images.unsqueeze(1).float()\n train_target = train_target.long()\n test_target = test_target.long()\n\n train_set = torch.utils.data.TensorDataset(train_images, train_target)\n test_set = torch.utils.data.TensorDataset(test_images, test_target)\n\n return train_set, test_set\n\n\ndef show_image_and_target(images: torch.Tensor, target: torch.Tensor) -> None:\n \"\"\"Plot images and their labels in a grid.\"\"\"\n row_col = int(len(images) ** 0.5)\n fig = plt.figure(figsize=(10.0, 10.0))\n grid = ImageGrid(fig, 111, nrows_ncols=(row_col, row_col), axes_pad=0.3)\n for ax, im, label in zip(grid, images, target):\n ax.imshow(im.squeeze(), cmap=\"gray\")\n ax.set_title(f\"Label: {label.item()}\")\n ax.axis(\"off\")\n plt.show()\n\n\nif __name__ == \"__main__\":\n train_set, test_set = corrupt_mnist()\n print(f\"Size of training set: {len(train_set)}\")\n print(f\"Size of test set: {len(test_set)}\")\n print(f\"Shape of a training point {(train_set[0][0].shape, train_set[0][1].shape)}\")\n print(f\"Shape of a test point {(test_set[0][0].shape, test_set[0][1].shape)}\")\n show_image_and_target(train_set.tensors[0][:25], train_set.tensors[1][:25])\n
Implement training and evaluation of your model in main.py
script. The main.py
script should be able to take additional subcommands indicating if the model should be trained or evaluated. It will look something like this:
python main.py train --lr 1e-4\npython main.py evaluate trained_model.pt\n
which can be implemented in various ways. We provide you with a starting script that uses the click
library to define a command line interface (CLI), which you can learn more about in this module.
If you try to execute the above code in VS code using the debugger (F5) or the build run functionality in the upper right corner:
you will get an error message saying that you need to select a command to run e.g. main.py
either needs the train
or evaluate
command. This can be fixed by adding a launch.json
to a specialized .vscode
folder in the root of the project. The launch.json
file should look something like this:
{\n \"version\": \"0.2.0\",\n \"configurations\": [\n {\n \"name\": \"Python: Current File\",\n \"type\": \"python\",\n \"request\": \"launch\",\n \"program\": \"${file}\",\n \"args\": [\n \"train\",\n \"--lr\",\n \"1e-4\"\n ],\n \"console\": \"integratedTerminal\",\n \"justMyCode\": true\n }\n ]\n}\n
This will inform VS Code that when we execute the current file (in this case main.py
) we want to run it with the train
command and additionally pass the --lr
argument with the value 1e-4
. You can read more about creating a launch.json
file here. If you want to have multiple configurations you can add them to the configurations
list as additional dictionaries.
main.py
main.pyimport click\nimport torch\nfrom data_solution import corrupt_mnist\nfrom model import MyAwesomeModel\n\n\n@click.group()\ndef cli() -> None:\n \"\"\"Command line interface.\"\"\"\n\n\n@click.command()\n@click.option(\"--lr\", default=1e-3, help=\"learning rate to use for training\")\ndef train(lr) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(lr)\n\n # TODO: Implement training loop here\n model = MyAwesomeModel()\n train_set, _ = corrupt_mnist()\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\ndef evaluate(model_checkpoint) -> None:\n \"\"\"Evaluate a trained model.\"\"\"\n print(\"Evaluating like my life depends on it\")\n print(model_checkpoint)\n\n # TODO: Implement evaluation logic here\n model = torch.load(model_checkpoint)\n _, test_set = corrupt_mnist()\n\n\ncli.add_command(train)\ncli.add_command(evaluate)\n\n\nif __name__ == \"__main__\":\n cli()\n
Solution The solution implements a simple training loop and evaluation loop. Furthermore, we have added additional hyperparameters that can be passed to the training loop. Highlighted in the solution are the different lines where we take care that our model and data are moved to GPU (or Apple MPS accelerator if you have a newer Mac) if available.
main.pyimport click\nimport matplotlib.pyplot as plt\nimport torch\nfrom model import MyAwesomeModel\n\nfrom data import corrupt_mnist\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.group()\ndef cli() -> None:\n \"\"\"Command line interface.\"\"\"\n\n\n@click.command()\n@click.option(\"--lr\", default=1e-3, help=\"learning rate to use for training\")\n@click.option(\"--batch_size\", default=32, help=\"batch size to use for training\")\n@click.option(\"--epochs\", default=10, help=\"number of epochs to train for\")\ndef train(lr, batch_size, epochs) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(f\"{lr=}, {batch_size=}, {epochs=}\")\n\n model = MyAwesomeModel().to(DEVICE)\n train_set, _ = corrupt_mnist()\n\n train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n loss_fn = torch.nn.CrossEntropyLoss()\n optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n statistics = {\"train_loss\": [], \"train_accuracy\": []}\n for epoch in range(epochs):\n model.train()\n for i, (img, target) in enumerate(train_dataloader):\n img, target = img.to(DEVICE), target.to(DEVICE)\n optimizer.zero_grad()\n y_pred = model(img)\n loss = loss_fn(y_pred, target)\n loss.backward()\n optimizer.step()\n statistics[\"train_loss\"].append(loss.item())\n\n accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n statistics[\"train_accuracy\"].append(accuracy)\n\n if i % 100 == 0:\n print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n print(\"Training complete\")\n torch.save(model.state_dict(), \"model.pth\")\n fig, axs = plt.subplots(1, 2, figsize=(15, 5))\n axs[0].plot(statistics[\"train_loss\"])\n axs[0].set_title(\"Train loss\")\n axs[1].plot(statistics[\"train_accuracy\"])\n axs[1].set_title(\"Train accuracy\")\n fig.savefig(\"training_statistics.png\")\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\ndef evaluate(model_checkpoint) -> None:\n \"\"\"Evaluate a trained model.\"\"\"\n print(\"Evaluating like my life depended on it\")\n print(model_checkpoint)\n\n model = MyAwesomeModel().to(DEVICE)\n model.load_state_dict(torch.load(model_checkpoint))\n\n _, test_set = corrupt_mnist()\n test_dataloader = torch.utils.data.DataLoader(test_set, batch_size=32)\n\n model.eval()\n correct, total = 0, 0\n for img, target in test_dataloader:\n img, target = img.to(DEVICE), target.to(DEVICE)\n y_pred = model(img)\n correct += (y_pred.argmax(dim=1) == target).float().sum().item()\n total += target.size(0)\n print(f\"Test accuracy: {correct / total}\")\n\n\ncli.add_command(train)\ncli.add_command(evaluate)\n\n\nif __name__ == \"__main__\":\n cli()\n
As documentation that your model is working when running the train
command, the script needs to produce a single plot with the training curve (training step vs training loss). When the evaluate
command is run, it should write the test set accuracy to the terminal.
It is part of the exercise not to implement this in notebooks, as code development in real life happens in scripts. As the model is simple to run (for now), you should be able to complete the exercise on your laptop, even if you are only training on CPU. That said, you are allowed to upload your scripts to your own \"Google Drive\" and then you can call your scripts from a Google Colab notebook, which is shown in the image below where all code is placed in the fashion_trainer.py
script and the Colab notebook is just used to execute it.
Be sure to have completed the final exercise before the next session, as we will be building on top of the model you have created.
"},{"location":"s1_development_environment/editor/","title":"M3 - Editor","text":""},{"location":"s1_development_environment/editor/#editoride","title":"Editor/IDE","text":"Core Module
Notebooks can be great for testing out ideas, developing simple code, and explaining and visualizing certain aspects of a codebase. Remember that Jupyter Notebook was created to \"...allows you to create and share documents that contain live code, equations, visualizations, and narrative text.\" However, any larger machine learning project will require you to work in multiple .py
files, and here notebooks will provide a suboptimal workflow. Therefore, to truly get \"work done,\" you will need a good editor/IDE.
Many opinions exist on this matter, but for simplicity, we recommend getting started with one of the following 3:
Spyder (https://www.spyder-ide.org/): a Matlab-like environment that is easy to get started with. Visual Studio Code (https://code.visualstudio.com/): support for multiple languages with a fairly easy setup. PyCharm (https://www.jetbrains.com/pycharm/): an IDE for Python professionals; it will take a bit of time to get used to. (Biased opinions.) We highly recommend Visual Studio (VS) Code if you do not already have an editor installed (or just want to try something new). We, therefore, put additional effort into explaining VS Code.
Below, you see an overview of the VS Code interface
Image creditThe main components of VS Code are:
The action bar: VS Code is not an editor meant for a single language and can do many things. One of the core reasons that VS Code has become so popular is that custom plug-ins called extensions can be installed to add functionality to VS Code. It is in the action bar that you can navigate between these different applications when you have installed them.
The sidebar: The sidebar has different functionality depending on what extension you have open. In most cases, the sidebar will just contain the file explorer.
The editor: This is where your code is. VS Code supports several layouts in the editor (one column, two columns, etc.). You can make a custom layout by dragging a file to where you want the layout to split.
The panel: The panel contains a terminal for you to interact with. This can quickly be used to try out code by opening a python
interpreter, management of environments, etc.
The status bar: The status bar contains information based on the extensions you have installed. In particular, for Python development, the status bar can be used to change the conda environment.
The overall goal of the exercises is that you should start familiarizing yourself with the editor you have chosen. If you are already an expert in one of them, feel free to skip the rest. You should at least be able to:
The instructions below are specific to Visual Studio Code, but we recommend that you try to answer the questions if using another editor. In the exercise_files
folder belonging to this session, we have put cheat sheets for VS Code (one for Windows and one for Mac/Linux) that can give you an easy overview of the different macros in VS Code. The following exercises are just to get you started, but you can find many more tutorials here.
VS Code is a general editor for many languages, and to get proper Python support, we need to install some extensions. In the action bar
, go to the extension
tab and search for python
in the marketplace. From here, we highly recommend installing the following packages:
If you install the Python
package, you should see something like this in your status bar:
which indicates that you are using the stock Python installation instead of the one you have created using conda
. Click it and change the Python environment to the one you want to use.
One of the most useful tools in VS Code is the ability to navigate the whole project using the built-in Explorer
. To take advantage of VS Code, you need to make sure what you are working on is a project. Create a folder called hello
(somewhere on your laptop) and open it in VS Code (Click File
in the menu and then select Open Folder
). You should end up with a completely clean workspace (as shown below). Click the New file
button and create a file called hello.py
.
Image credit
Finally, let's run some code. Add something simple to the hello.py
file like:
Image credit
and click the run
button as shown in the image. It should create a new terminal, activate the environment that you have chosen, and finally run your script. In addition to clicking the run
button, you can also:
Shift+Enter
to run it in the terminalThat's the basics of using VS Code. We highly recommend that you revisit this tutorial during the course when we get to topics such as debugging and version control, which VS Code can help with. We can also recommend this blog post that goes over some good extensions for AI/ML development in VS Code.
"},{"location":"s1_development_environment/editor/#a-note-on-jupyter-notebooks-in-production-environments","title":"A note on Jupyter notebooks in production environments","text":"As already stated, Jupyter Notebooks are great for development as they allow developers to easily test out new ideas. However, they often lead to pain points when models need to be deployed. We highly recommend reading section 5.1.1 of this paper by Shankar et al. which in more detail discusses the strong opinions on Jupyter notebooks that exist within the developer community.
All this said, there exists one simple tool to make notebooks work better in a production setting. It's called nbconvert
and can be installed with
pip install nbconvert\n
You may need some further dependencies such as Pandoc, TeX and Pyppeteer for it to work (see install instructions here). After this, converting a notebook to a .py
script is as simple as:
jupyter nbconvert --to=script my_notebook.ipynb\n
which will produce a similarly named script called my_notebook.py
. We highly recommend that you stick to developing scripts directly during the course to get experience with doing so, but nbconvert
can be a fantastic tool to have in your toolbox.
You are probably all familiar with using AI tools for solving different tasks in your daily life and you have most likely also used AI tools like ChatGPT or similar for programming. However, most of these tools are not directly integrated into your editor, which can lead to a lot of context-switching that in general leads to lower productivity.
We are therefore in this section going to be looking at GitHub Copilot, which is an AI tool that directly integrates into your editor, eliminating the need to switch between browser tabs or external tools. In addition, the strength of having AI directly in your editor is that it can provide suggestions based on the code you are currently writing and in general it just has access to a larger context than a standalone tool.
"},{"location":"s1_development_environment/editor/#exercises_1","title":"\u2754 Exercises","text":"As of writing this GitHub Copilot is free for all students, teachers and maintainers of popular open-source projects. As a student, sign up for the Student Developer Pack
Install the GitHub Copilot extension in your editor
GitHub Copilot has many different features, but the most important one is the ability to provide suggestions based on the code you are currently writing. Try to write some code in a new Python file and see if you can get some suggestions from GitHub Copilot on how to complete the code. If you have no idea what to try out here is a simple example of starting out coding a neural network in PyTorch:
import torch\nfrom torch import nn\nclass Net(nn.Module):\n
Github Copilot will most likely suggest you complete the code using linear layers with an input dimension of 28*28
. Can you explain why it suggests this and where this bias comes from?
The second feature that can be very useful is the ability to directly chat or ask questions regarding your code. Try highlighting (in your code editor) the code from the previous exercise and press Ctrl+i
which should open a chat window. Ask it to complete it with a convolutional neural network instead of a linear one.
Finally, let's try the built-in chat feature. You can get to this by clicking the Chat
icon in the Activity bar and begin to ask questions similar to how you would ask ChatGPT. However, we have also the option to provide context either from the code editor or the terminal. Try saving the following code in a Python script copilot.py
:
import torch\nfrom torch import nn\nclass Net(nn.Module):\n def __init__(self):\n super(Net, self).__init__()\n self.fc1 = nn.Linear(28*28, 128)\n self.fc2 = nn.Linear(128, 64)\n self.fc3 = nn.Linear(64, 10)\n def forward(self, x):\n x = x.view(-1, 28*28)\n x = torch.relu(self.fc1(x))\n x = torch.relu(self.fc2(x))\n x = self.fc3(x)\n return x\n\nmodel = Net()\nprint(model(torch.randn(1, 1, 14, 14)))\n
and run it in the terminal: python copilot.py
. It will naturally give you an error, but you can now ask GitHub Copilot for help. The easiest way to do this is by highlighting the output in the terminal and then running the Github Copilot: Explain This (Terminal)
command (see the image below, use Ctrl+Shift+P
to open the command palette and search for the command). Does the explanation make sense e.g. can you figure out what to change to get the code running?
(Optional) Just to investigate the difference between using Github Copilot and ChatGPT, try to redo the previous exercises using ChatGPT. What are the main differences between the two tools? (1)
That was a small introduction to GitHub Copilot. We highly recommend that you try to use it during the course to see how it can help you solve both the exercises and the final project. However, when using AI tools it is always important to remember that they are not perfect and that you need to critically evaluate the suggestions they provide. In the end, you are the one responsible for the code you write, not the AI tool.
"},{"location":"s1_development_environment/package_manager/","title":"M2 - Package Manager","text":""},{"location":"s1_development_environment/package_manager/#package-managers-and-virtual-environments","title":"Package managers and virtual environments","text":"Core Module
Python is a great programming language and this is mostly due to its vast ecosystem of packages. No matter what you want to do, there is probably a package that can get you started. Just try to remember when the last time you wrote a program only using the Python standard library. Probably never. For this reason, we need a way to install third-party packages and this is where package managers come into play.
You have probably already used pip
for the longest time, which is the default package manager for Python. pip
is great for beginners but it is missing one essential feature that you will need as a developer or data scientist: virtual environments. Virtual environments are an essential way to make sure that the dependencies of different projects do not cross-contaminate each other. As a naive example, consider project A that requires torch==1.3.0
and project B that requires torch==2.0
, then doing
cd project_A # move to project A\npip install torch==1.3.0 # install old torch version\ncd ../project_B # move to project B\npip install torch==2.0 # install new torch version\ncd ../project_A # move back to project A\npython main.py # try executing main script from project A\n
will mean that even though we are executing the main script from project A's folder, it will use torch==2.0
instead of torch==1.3.0
because that is the last version we installed because in both cases pip
will install the package into the same environment, in this case, the global environment. Instead, if we did something like:
cd project_A # move to project A\npython -m venv env # create a virtual environment in project A\nsource env/bin/activate # activate that virtual environment\npip install torch==1.3.0 # Install the old torch version into the virtual environment belonging to project A\ncd ../project_B # move to project B\npython -m venv env # create a virtual environment in project B\nsource env/bin/activate # activate that virtual environment\npip install torch==2.0 # Install new torch version into the virtual environment belonging to project B\ncd ../project_A # Move back to project A\nsource env/bin/activate # Activate the virtual environment belonging to project A\npython main.py # Succeed in executing the main script from project A\n
cd project_A # Move to project A\npython -m venv env # Create a virtual environment in project A\n.\\env\\Scripts\\activate # Activate that virtual environment\npip install torch==1.3.0 # Install the old torch version into the virtual environment belonging to project A\ncd ../project_B # Move to project B\npython -m venv env # Create a virtual environment in project B\n.\\env\\Scripts\\activate # Activate that virtual environment\npip install torch==2.0 # Install new torch version into the virtual environment belonging to project B\ncd ../project_A # Move back to project A\n.\\env\\Scripts\\activate # Activate the virtual environment belonging to project A\npython main.py # Succeed in executing the main script from project A\n
then we would be sure that torch==1.3.0
is used when executing main.py
in project A because we are using two different virtual environments. In the above case, we used the venv module which is the built-in Python module for creating virtual environments. venv+pip
is arguably a good combination but when working on multiple projects it can quickly become a hassle to manage all the different virtual environments yourself, remembering which Python version to use, which packages to install and so on.
For this reason, a number of package managers have been created that can help you manage your virtual environments and dependencies, with some of the most popular being:
with more being created every year (rye is looking like an interesting project). This is considered a problem in the Python community because it means that there is no standard way of managing dependencies, as there is in other languages, such as npm
for node.js
or cargo
for rust
.
In the course, we do not care about which package manager you use, but we do care that you use one. If you are already familiar with one package manager, then skip this exercise and continue to use that. The best recommendation that I can give regarding package managers, in general, is to find one you like and then stick with it. A lot of time can be wasted on trying to find the perfect package manager, but in the end, they all do the same with some minor differences. Check out this blog post if you want a fairly up-to-date evaluation of the different environment management and packaging tools that exist in the Python ecosystem.
If you are not familiar with any package managers, then we recommend that you use conda
and pip
for this course. You probably already have conda installed on your laptop, which is great. What conda does especially well, is that it allows you to create virtual environments with different Python versions, which can be really useful if you encounter dependencies that have not been updated in a long time. In this course specifically, we are going to recommend the following workflow
conda
to create virtual environments with specific Python versions and pip
to install packages in that environment.Installing packages with pip
inside conda
environments has been considered a bad practice for a long time, but since conda>=4.6
it is considered safe to do so. The reason for this is that conda
now has a built-in compatibility layer that makes sure that pip
installed packages are compatible with the other packages installed in the environment.
Before we get started with the exercises, let's first talk a bit about Python dependencies. One of the most common ways to specify dependencies in the Python community is through a requirements.txt
file, which is a simple text file that contains a list of all the packages that you want to install. The format allows you to specify the package name and version number you want, with 7 different operators:
package1 # any version\npackage2 == x.y.z # exact version\npackage3 >= x.y.z # at least version x.y.z\npackage4 > x.y.z # newer than version x.y.z\npackage5 <= x.y.z # at most version x.y.z\npackage6 < x.y.z # older than version x.y.z\npackage7 ~= x.y.z # install version newer than x.y.z and older than x.(y+1)\n
In general, all packages (should) follow the semantic versioning standard, which means that the version number is split into three parts: x.y.z
where x
is the major version, y
is the minor version and z
is the patch version.
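If you want to check how these specifiers behave programmatically, the packaging library (which pip itself builds on) can evaluate them; a small sketch:
from packaging.specifiers import SpecifierSet\nfrom packaging.version import Version\n\nspec = SpecifierSet("~=2.1.0")  # compatible release: at least 2.1.0, but still a 2.1.* version\nprint(Version("2.1.3") in spec)  # True\nprint(Version("2.2.0") in spec)  # False\n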
The reason that we need to specify the version number is that we want to make sure that we can reproduce our code at a later point. If we do not specify the version number, then we are at the mercy of the package maintainer to not change the API of the package. This is especially important when working with machine learning models, as we want to make sure that we can reproduce the exact same model at a later point.
Finally, we also need to discuss dependency resolution, which is the process of figuring out which packages are compatible. This is a non-trivial problem, and there exist a lot of different algorithms for doing this. If you have ever thought that pip
and conda
were taking a long time to install something, then it is probably because they were trying to figure out which packages are compatible with each other. For example, if you try to install
pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.19\" --dry-run\n
then it would simply fail because there are no versions of matplotlib
and numpy
under the given constraints that are compatible with each other. In this case, we would need to relax the constraints to something like
pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.21\" --dry-run\n
to make it work.
"},{"location":"s1_development_environment/package_manager/#exercises","title":"\u2754 Exercises","text":"For hints regarding how to use conda
you can check out the cheat sheet in the exercise folder.
Download and install conda
. You are free to either install full conda
or the much simpler version miniconda
. The core difference between the two packages is that conda
already comes with a lot of packages that you would normally have to install with miniconda
. The downside is that conda
is a much larger package which can be a huge disadvantage on smaller devices. Make sure that your installation is working by writing conda help
in a terminal and it should show you the help message for conda. If this does not work you probably need to set some system variable to point to the conda installation
If you have successfully installed conda, then you should be able to execute the conda
command in a terminal.
Conda will always tell you what environment you are currently in, indicated by the (env_name)
in the prompt. By default, it will always start in the (base)
environment.
Try creating a new virtual environment. Make sure that it is called my_environment
and that it installs version 3.11 of Python. What command should you execute to do this?
We highly recommend that you use Python 3.8 or higher for this course. In general, we recommend that you use the second latest version of Python that is available (currently Python 3.11 as of writing this). This is because the latest version of Python is often not supported by all dependencies. You can always check the status of different Python version support here.
Solutionconda create --name my_environment python=3.11\n
Which conda
command gives you a list of all the environments that you have created?
conda env list\n
Which conda
command gives you a list of the packages installed in the current environment?
conda list\n
How do you easily export this list to a text file? Do this, and make sure you export it to a file called environment.yaml
, as conda uses another format by default than pip
.
conda list --explicit > environment.yaml\n
Inspect the file to see what is in it.
The environment.yaml
file you have created is one way to secure reproducibility between users because anyone should be able to get an exact copy of your environment if they have your environment.yaml
file. Try creating a new environment directly from your environment.yaml
file and check that the packages being installed exactly match what you originally had.
conda env create --file environment.yaml\n
As the introduction states, it is fairly safe to use pip
inside conda
today. What is the corresponding pip
command that gives you a list of all pip
installed packages? And how do you export this to requirements.txt
file?
pip list # List all installed packages\npip freeze > requirements.txt # Export all installed packages to a requirements.txt file\n
If you look through the requirements that both pip
and conda
produce then you will see that it is often filled with a lot more packages than what you are using in your project. What you are interested in are the packages that you import in your code: from package import module
. One way to get around this is to use the package pipreqs
, which will automatically scan your project and create a requirements file specific to that. Let's try it out:
Install pipreqs
:
pip install pipreqs\n
Either try out pipreqs
on one of your own projects or try it out on some other online project. What does the requirements.txt
file pipreqs
produces look like compared to the files produced by either pip
or conda
.
Try executing the command
pip install \"pytest < 4.6\" pytest-cov==2.12.1\n
based on the error message you get, what would be a compatible way to install these?
SolutionAs pytest-cov==2.12.1
requires a version of pytest
newer than 4.6
, we can simply change the command to be:
pip install \"pytest >= 4.6\" pytest-cov==2.12.1\n
but there of course exist other solutions as well.
This ends the module on setting up virtual environments. While the methods mentioned in the exercises are great ways to construct requirements files automatically, sometimes it is just easier to manually sit down and create the files as you in that way ensure that only the most necessary requirements are installed when creating a new environment.
"},{"location":"s2_organisation_and_version_control/","title":"Organization and version control","text":"Slides
Learn the basics of version control and how to use git
to track changes to your code and collaborate with others.
M5: Git
Learn how to organize Python code into a library, package it and use templates to create new projects.
M6: Code Structure
Learn different coding practices and how to use them to improve the quality of your code.
M7: Good Coding Practice
Learn how to version control data using dvc
.
M8: Data Version Control
Learn the different ways to setup command line interfaces for your applications.
M9: Command Line Interfaces
Today we take our first steps into the world of MLOps. The set of modules in this session focuses on getting organized and making sure that you are familiar with good development practices. While many of the practices you will learn about these modules do not seem that important when you are a single person working on a project, it is crucial when working in large groups that the difference in how different people organize and write their code is minimized. The topics in this session will focus on:
Some exercises in this course are very loosely stated (including the exercises today). You are expected to seek out information before you ask for help (Google is your friend!) as you will both learn more for trying to solve the problems yourself, and it is more realistic of how the \"real world\" works.
Learning objectives
The learning objectives of this session are:
git
to track changes to your codedvc
to version control dataAs we already laid out in the very first module, the command line is a powerful tool for interacting with your computer. You should already now be familiar with running basic Python commands in the terminal:
python my_script.py\n
However, as your projects grow in size and complexity, you will often find yourself in need of more advanced ways of interacting with your code. This is where command line interface (CLI) comes into play. A CLI can be seen as a way for you to define the user interface of your application directly in the terminal. Thus, there is no right or wrong way of creating a CLI, it is all about what makes sense for your application.
In this module we are going to look at three different ways of creating a CLI for your machine learning projects. They are all serving a bit different purposes and can therefore be combined in the same project. However, you will most likely also feel that they are overlapping in some areas. That is completely fine, and it is up to you to decide which one to use in which situation.
"},{"location":"s2_organisation_and_version_control/cli/#project-scripts","title":"Project scripts","text":"You might already be familiar with the concept of executable scripts. An executable script is a Python script that can be run directly from the terminal without having to call the Python interpreter. This has been possible for a long time in Python, by the inclusion of a so-called shebang line at the top of the script. However, we are going to look at a specific way of defining executable scripts using the standard pyproject.toml
file, which you should have learned about in this module.
We are going to assume that you have a training script in your project that you would like to be able to run from the terminal directly without having to call the Python interpreter. Lets assume it is located like this
src/\n\u251c\u2500\u2500 my_project/\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u251c\u2500\u2500 train.py\npyproject.toml\n
In your pyproject.toml
file add the following lines. You will need to alter the paths to match your project.
[project.scripts]\ntrain = \"my_project.train:main\"\n
what do you think the train = \"my_project.train:main\"
line do?
The line tells Python that we want to create an executable script called train
that should run the main
function in the train.py
file located in the my_project
package.
Now, all that is left to do is install the project again in editable mode
pip install -e .\n
and you should now be able to run the following command in the terminal
train\n
Try it out and see if it works.
Add additional scripts to your pyproject.toml
file that allows you to run other scripts in your project from the terminal.
We assume that you also have a script called evaluate.py
in the my_project
package.
[project.scripts]\ntrain = \"my_project.train:main\"\nevaluate = \"my_project.evaluate:main\"\n
That is all there really is to it. You can now run your scripts directly from the terminal without having to call the Python interpreter. Some good examples of Python packages that uses this approach are numpy, pylint and kedro.
"},{"location":"s2_organisation_and_version_control/cli/#command-line-arguments","title":"Command line arguments","text":"If you have worked with Python for some time you are probably familiar with the argparse
package, which allows you to directly pass in additional arguments to your script in the terminal
python my_script.py --arg1 val1 --arg2 val2\n
argparse
is a very simple way of constructing what is called a command line interfaces. However, one limitation of argparse
is the possibility of easily defining an CLI with subcommands. If we take git
as an example, git
is the main command but it has multiple subcommands: push
, pull
, commit
etc. that all can take their own arguments. This kind of second CLI with subcommands is somewhat possible to do using only argparse
, however it requires a bit of hacks.
You could of course ask the question why we at all would like to have the possibility of defining such CLI. The main argument here is to give users of our code a single entrypoint to interact with our application instead of having multiple scripts. As long as all subcommands are proper documented, then our interface should be simple to interact with (again think git
where each subcommand can be given the -h
arg to get specific help).
Instead of using argparse
we are here going to look at the yyper package. typer
extends the functionalities of argparse
to allow for easy definition of subcommands and many other things, which we are not going to touch upon in this module. For completeness we should also mention that typer
is not the only package for doing this, and of other excellent frameworks for creating command line interfaces easily we can mention click.
Start by installing the typer
package
pip install typer\n
remember to add the package to your requirements.txt
file.
To get you started with typer
, let's just create a simple hello world type of script. Create a new Python file called greetings.py
and use the typer
package to create a command line interface such that running the following lines
python greetings.py # should print \"Hello World!\"\npython greetings.py --count=3 # should print \"Hello World!\" three times\npython greetings.py --help # should print the help message, informing the user of the possible arguments\n
executes and gives the expected output. Relevant documentation.
SolutionImportantly for typer
is that you need to provide type hints for the arguments. This is because typer
needs these to be able to work properly.
import typer\napp = typer.Typer()\n\n@app.command()\ndef hello(count: int = 1, name: str = \"World\"):\n for x in range(count):\n typer.echo(f\"Hello {name}!\")\n\nif __name__ == \"__main__\":\n app()\n
Next, lets try on a bit harder example. Below is a simple script that trains a support vector machine on the iris dataset.
iris_classifier.py
iris_classifier.pyfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\n\ndef train():\n \"\"\"Train and evaluate the model.\"\"\"\n # Load the dataset\n data = load_breast_cancer()\n x = data.data\n y = data.target\n\n # Split the dataset into training and testing sets\n x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n # Standardize the features\n scaler = StandardScaler()\n x_train = scaler.fit_transform(x_train)\n x_test = scaler.transform(x_test)\n\n # Train a Support Vector Machine (SVM) model\n model = SVC(kernel=\"linear\", random_state=42)\n model.fit(x_train, y_train)\n\n # Make predictions on the test set\n y_pred = model.predict(x_test)\n\n # Evaluate the model\n accuracy = accuracy_score(y_test, y_pred)\n report = classification_report(y_test, y_pred)\n\n print(f\"Accuracy: {accuracy:.2f}\")\n print(\"Classification Report:\")\n print(report)\n return accuracy, report\n\n\nif __name__ == \"__main__\":\n train()\n
Implement a CLI for the script such that the following commands can be run
python iris_classifier.py train --output 'model.ckpt' # should train the model and save it to 'model.ckpt'\npython iris_classifier.py train -o 'model.ckpt' # should be the same as above\n
Solution We are here making use of the short name option in typer for giving an shorter alias to the --output
option.
import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\n\n\n@app.command()\ndef train(output: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\"):\n \"\"\"Train and evaluate the model.\"\"\"\n # Load the dataset\n data = load_breast_cancer()\n x = data.data\n y = data.target\n\n # Split the dataset into training and testing sets\n x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n # Standardize the features\n scaler = StandardScaler()\n x_train = scaler.fit_transform(x_train)\n x_test = scaler.transform(x_test)\n\n # Train a Support Vector Machine (SVM) model\n model = SVC(kernel=\"linear\", random_state=42)\n model.fit(x_train, y_train)\n\n # Make predictions on the test set\n y_pred = model.predict(x_test)\n\n # Evaluate the model\n accuracy = accuracy_score(y_test, y_pred)\n report = classification_report(y_test, y_pred)\n\n with open(output, \"wb\") as f:\n pickle.dump(model, f)\n\n print(f\"Accuracy: {accuracy:.2f}\")\n print(\"Classification Report:\")\n print(report)\n return accuracy, report\n\n\nif __name__ == \"__main__\":\n app()\n
Next, let's create a CLI that has more than a single command. Continue working on the basic machine learning application from the previous exercise, but this time we want to define two separate commands
python iris_classifier.py train --output 'model.ckpt'\npython iris_classifier.py evaluate 'model.ckpt'\n
Solution The only key difference between the two is that in the train
command we define the output
argument to be an optional parameter, i.e. we provide a default, whereas for the evaluate
command it is a required parameter.
import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\n\n# Load the dataset\ndata = load_breast_cancer()\nx = data.data\ny = data.target\n\n# Split the dataset into training and testing sets\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n# Standardize the features\nscaler = StandardScaler()\nx_train = scaler.fit_transform(x_train)\nx_test = scaler.transform(x_test)\n\n\n@app.command()\ndef train(output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n \"\"\"Train the model.\"\"\"\n # Train a Support Vector Machine (SVM) model\n model = SVC(kernel=\"linear\", random_state=42)\n model.fit(x_train, y_train)\n\n with open(output_file, \"wb\") as f:\n pickle.dump(model, f)\n\n\n@app.command()\ndef evaluate(model_file):\n \"\"\"Evaluate the model.\"\"\"\n with open(model_file, \"rb\") as f:\n model = pickle.load(f)\n\n # Make predictions on the test set\n y_pred = model.predict(x_test)\n\n # Evaluate the model\n accuracy = accuracy_score(y_test, y_pred)\n report = classification_report(y_test, y_pred)\n\n print(f\"Accuracy: {accuracy:.2f}\")\n print(\"Classification Report:\")\n print(report)\n return accuracy, report\n\n\nif __name__ == \"__main__\":\n app()\n
Finally, let's try to define subcommands for our commands, e.g. something similar to how git
has the subcommand remote
which in itself has multiple subcommands like add
, rename
etc. Continue on the simple machine learning application from the previous exercises, but this time define a CLI such that
python iris_classifier.py train svm --kernel 'linear'\npython iris_classifier.py train knn -k 5\n
i.e. the train
command now has two subcommands for training different machine learning models (in this case SVM and KNN), each of which takes arguments that are unique to that model. Relevant documentation.
import pickle\nfrom typing import Annotated\n\nimport typer\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, classification_report\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\napp = typer.Typer()\ntrain_app = typer.Typer()\napp.add_typer(train_app, name=\"train\")\n\n# Load the dataset\ndata = load_breast_cancer()\nx = data.data\ny = data.target\n\n# Split the dataset into training and testing sets\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n\n# Standardize the features\nscaler = StandardScaler()\nx_train = scaler.fit_transform(x_train)\nx_test = scaler.transform(x_test)\n\n\n@train_app.command()\ndef svm(kernel: str = \"linear\", output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n \"\"\"Train a SVM model.\"\"\"\n model = SVC(kernel=kernel, random_state=42)\n model.fit(x_train, y_train)\n\n with open(output_file, \"wb\") as f:\n pickle.dump(model, f)\n\n\n@train_app.command()\ndef knn(k: int = 5, output_file: Annotated[str, typer.Option(\"--output\", \"-o\")] = \"model.ckpt\") -> None:\n \"\"\"Train a KNN model.\"\"\"\n model = KNeighborsClassifier(n_neighbors=k)\n model.fit(x_train, y_train)\n\n with open(output_file, \"wb\") as f:\n pickle.dump(model, f)\n\n\n@app.command()\ndef evaluate(model_file):\n \"\"\"Evaluate the model.\"\"\"\n with open(model_file, \"rb\") as f:\n model = pickle.load(f)\n\n # Make predictions on the test set\n y_pred = model.predict(x_test)\n\n # Evaluate the model\n accuracy = accuracy_score(y_test, y_pred)\n report = classification_report(y_test, y_pred)\n\n print(f\"Accuracy: {accuracy:.2f}\")\n print(\"Classification Report:\")\n print(report)\n return accuracy, report\n\n\nif __name__ == \"__main__\":\n app()\n
(Optional) Let's try to combine what we have learned until now. Try to make your typer
CLI into an executable script using the pyproject.toml
file and try it out!
Assuming that our iris_classifier.py
script from before is placed in src/my_project
folder, we should just add
[project.scripts]\niris_classifier = \"src.my_project.iris_classifier:app\"\n
and remember to install the project in editable mode
pip install -e .\n
and you should now be able to run the following command in the terminal
iris_classifier train knn\n
This covers the basics of typer
but feel free to dive deeper into how the package can help you customize your CLIs. Check out this page on adding colors to your CLI or this page on validating the inputs to your CLI.
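As a small taste of those customization options, colored output can be produced with typer.secho. The snippet below is only a minimal sketch; the command name and messages are made up for illustration:
import typer\n\napp = typer.Typer()\n\n@app.command()\ndef status(ok: bool = True):\n    \"\"\"Print a colored status message.\"\"\"\n    if ok:\n        typer.secho(\"Everything looks fine!\", fg=typer.colors.GREEN)\n    else:\n        typer.secho(\"Something went wrong!\", fg=typer.colors.RED, err=True)\n\nif __name__ == \"__main__\":\n    app()\n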
The two sections above have shown you how to create a simple CLI for your Python scripts. However, when doing machine learning projects, you often have a lot of non-Python code that you would like to run from the terminal. Based on the learning modules you have already completed, you have already encountered a couple of CLI tools that are used in our projects:
As we begin to move into the next couple of learning modules, we are going to encounter even more CLI tools that we need to interact with. Here is an example of a long command that you might need to run in your project in the future
docker run -v $(pwd):/app -w /app --gpus all --rm -it my_image:latest python my_script.py --arg1 val1 --arg2 val2\n
This can be a lot to remember, and it can be easy to make mistakes. Instead it would be nice if we could just do
run my_command --arg1=val1 --arg2=val2\n
i.e. easier to remember because we have removed a lot of the hard-to-remember parts, but we are still able to configure it to our liking. To help with this, we are going to look at the invoke package. invoke
is a Python package that allows you to define tasks that can be run from the terminal. It is a bit like a more advanced version of the Makefiles you may have encountered in other projects. Some good alternatives to invoke
are just and task, but we have chosen to focus on invoke
in this module because it can be installed as a Python package making installation across different systems easier.
Start by installing invoke
pip install invoke\n
remember to add the package to your requirements.txt
file.
Add a tasks.py
file to your repository and try to just run
invoke --list\n
which should work but inform you that no tasks have been added yet.
Let's now try to add a task to the tasks.py
file. The way to do this with invoke is to import the task
decorator from invoke
and then decorate a function with it:
from invoke import task\nimport os\n\n@task\ndef python(ctx):\n    \"\"\"Print the path of the current Python interpreter.\"\"\"\n    ctx.run(\"which python\" if os.name != \"nt\" else \"where python\")\n
the first argument of any task-decorated function is the ctx
context argument that implements the run
method for running any command just as we would run it in the terminal. In this case we have simply implemented a task that prints the path of the current Python interpreter, and it works on all operating systems. Check that it works by running:
invoke python\n
Let's try to create a task that simplifies the process of git add
, git commit
, git push
. Create a task such that the following command can be run
invoke git --message \"My commit message\"\n
Implement it and use the command to commit the taskfile you just created!
Solution@task\ndef git(ctx, message):\n    \"\"\"Add, commit and push all changes.\"\"\"\n    ctx.run(\"git add .\")\n    ctx.run(f\"git commit -m '{message}'\")\n    ctx.run(\"git push\")\n
As you have hopefully realized by now, the most important method in invoke
is the ctx.run
method, which actually runs the commands you want to execute in the terminal. This method takes multiple additional arguments. Try out the arguments warn
, pty
, echo
and explain in your own words what they do.
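One way to explore them is with a small throwaway task like the sketch below; the task name and commands are only for illustration, and pty=True is not supported on Windows:
from invoke import task\n\n@task\ndef demo(ctx):\n    \"\"\"Experiment with the extra arguments of ctx.run.\"\"\"\n    # echo=True prints the command before it is executed\n    # warn=True means a failing command does not raise an exception\n    result = ctx.run(\"ls this_folder_does_not_exist\", warn=True, echo=True)\n    print(f\"Did the command fail? {result.failed}\")\n    # pty=True would run the command in a pseudo-terminal (not available on Windows)\n    ctx.run(\"python --version\", echo=True)\n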
warn
: If set to True
the command will not raise an exception if it fails. This can be useful if you want to run multiple commands and do not want the whole process to stop because one of them fails.
pty
: If set to True
the command will be run in a pseudo-terminal. Whether you want to enable this depends on the command you are running. Here is a good explanation of when/why you should use it.
echo
: If set to True
the command will be printed to the terminal before it is run.
Create a command that simplifies the process of bootstrapping a conda
environment and installs the relevant dependencies of your project.
@task\ndef conda(ctx, name: str = \"dtu_mlops\"):\n    \"\"\"Create a conda environment and install the project dependencies.\"\"\"\n    ctx.run(\"conda env create -f environment.yml\", echo=True)\n    ctx.run(f\"conda activate {name}\", echo=True)\n    ctx.run(\"pip install -e .\", echo=True)\n
and try to run the following command
invoke conda\n
Assuming you have completed the exercises on using dvc for version control of data, let's also try to add a task that simplifies the process of adding new data. This is the list of commands that need to be run to add new data to a dvc repository: dvc add
, git add
, git commit
, git push
, dvc push
. Try to implement a task that simplifies this process. It needs to take two arguments for defining the folder to add and the commit message.
@task\ndef dvc(ctx, folder=\"data\", message=\"Add new data\"):\n    \"\"\"Add new data with dvc and push both the metafiles and the data.\"\"\"\n    ctx.run(f\"dvc add {folder}\")\n    ctx.run(f\"git add {folder}.dvc .gitignore\")\n    ctx.run(f\"git commit -m '{message}'\")\n    ctx.run(\"git push\")\n    ctx.run(\"dvc push\")\n
and try to run the following command
invoke dvc --folder 'data' --message 'Add new data'\n
As the final exercise, let's try to combine every way of defining CLIs we have learned about in this module. Define a task that does the following
dvc pull
to download the data, and then runs my_cli
with the subcommand train
with the arguments --output 'model.ckpt'
from invoke import task\n\n@task\ndef pull_data(ctx):\n    \"\"\"Download the data with dvc.\"\"\"\n    ctx.run(\"dvc pull\")\n\n@task(pull_data)\ndef train(ctx):\n    \"\"\"Train the model after pulling the data.\"\"\"\n    ctx.run(\"my_cli train --output 'model.ckpt'\")\n
That is all there is to it. You should now be able to define tasks that can be run from the terminal to simplify the process of running your code. We recommend that, as you go through the learning modules in this course, you slowly start to add tasks to your tasks.py
file that simplify the process of running the code you are writing.
What is the purpose of a command line interface?
SolutionA command line interface is a way for you to define the user interface of your application directly in the terminal. It allows you to interact with your code in a more advanced way than just running Python scripts.
Core Module
With a basic understanding of version control, it is now time to really begin filling up our code repository. However, the question then remains: how should we organize our code? As developers we tend not to think about code organization that much; it is instead something that is created dynamically as we need it. However, maybe we should spend some time getting organized from the start, with the hope that this makes our code easier to develop and maintain in the long run. If we do not spend time organizing our code, we may end up with a mess that is hard to understand or maintain.
Big ball of Mud
A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems. Brian Foote and Joseph Yoder, Big Ball of Mud. Fourth Conference on Patterns Languages of Programs (PLoP '97/EuroPLoP '97) Monticello, Illinois, September 1997
We are here going to focus on the organization of data science and machine learning projects. The core difference these kinds of projects introduce compared to more traditional systems is data. The key to modern machine learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that data should influence our choice of code structure. If we had another kind of application, then the layout of our codebase should probably be different.
"},{"location":"s2_organisation_and_version_control/code_structure/#cookiecutter","title":"Cookiecutter","text":"We are in this course going to use the tool cookiecutter, which is tool for creating projects from project templates. A project template is in short just an overall structure of how you want your folders, files etc. to be organised from the beginning. For this course we are going to be using a custom MLOps template. The template is essentially a fork of the cookiecutter data science template template that has been used for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.
We are not going to argue that this template is better than every other template; we are just focusing on the fact that it is a standardized way of creating project structures for machine learning projects. By standardized we mean that if two people are both using cookiecutter
with the same template, the layout of their code follows the same specific rules, making it faster to understand the other person's code. Code organization is therefore not only about making the code easier for you to maintain but also for others to read and understand.
Shown below is the default code structure of cookiecutter for data science projects.
What is important to keep in mind when using a template is that it is exactly that: a template. By definition a template is a guide to make something. Therefore, not all parts of a template may be important for your project at hand. Your job is to pick the parts from the template that are useful for organizing your machine learning project and add the parts that are missing.
"},{"location":"s2_organisation_and_version_control/code_structure/#python-projects","title":"Python projects","text":"While the same template in principal could be used regardless of what language we were using for our machine learning or data science application, there are certain considerations to take into account based on what language we are using. Python is the dominant language for machine learning and data science currently, which is why we in this section are focusing on some of the special files you will need for your Python projects.
The first file you may or may not know is the __init__.py
file. In Python the __init__.py
file is used to mark a directory as a Python package. Therefore as a bare minimum, any Python package should look something like this:
\u251c\u2500\u2500 src/\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u251c\u2500\u2500 file1.py\n\u2502 \u251c\u2500\u2500 file2.py\n\u251c\u2500\u2500 pyproject.toml\n
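Besides marking the directory as a package, the __init__.py file can also be used to control what is exposed when the package is imported. Here is a minimal sketch, where the module and function names are made up for illustration:
# src/__init__.py\nfrom .file1 import my_function  # hypothetical function defined in file1.py\n\n__all__ = [\"my_function\"]  # controls what `from src import *` exposes\n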
The second file to focus on is the pyproject.toml
. This file is important for actually converting your code into a Python project. Essentially, whenever you run pip install
, pip
is in charge of both downloading the package you want but also in charge of installing it. For pip
to be able to install a package it needs instructions on what part of the code it should install and how to install it. This is the job of the pyproject.toml
file.
Below we have both added a description of the structure of the pyproject.toml
file but also setup.py + setup.cfg
which is the \"old\" way of providing project instructions for Python projects. However, you may still encounter a lot of projects using setup.py + setup.cfg
so it is good to at least know about them.
pyproject.toml
is the new standardized way of describing project metadata in a declarative way, introduced in PEP 621. It is written in the toml format, which is easy to read. At the very least your pyproject.toml
file should include the [build-system]
and [project]
sections:
[build-system]\nrequires = [\"setuptools\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"my-package-name\"\nversion = \"0.1.0\"\nauthors = [{name = \"EM\", email = \"me@em.com\"}]\ndescription = \"Something cool here.\"\nrequires-python = \">=3.8\"\ndynamic = [\"dependencies\"]\n\n[tool.setuptools.dynamic]\ndependencies = {file = [\"requirements.txt\"]}\n
the [build-system]
informs pip
/python
that to build this Python project it needs the two packages setuptools
and wheel
and that it should call the setuptools.build_meta backend to actually build the project. The [project]
section essentially contains metadata about the package, what it is called etc., in case we ever want to publish it to PyPI.
For specifying dependencies of your project you have two options. Either you specify them in a requirements.txt
file and include it as a dynamic field in pyproject.toml
as shown above. Alternatively, you can add a dependencies
field under the [project]
header like this:
[project]\ndependencies = [\n 'torch==2.1.0',\n 'matplotlib>=3.8.1'\n]\n
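It is also possible to group dependencies that are only needed during development under the standard [project.optional-dependencies] table. The group name and packages below are just examples; such a group can then be installed with pip install -e \".[dev]\":
[project.optional-dependencies]\ndev = [\n    'ruff>=0.1.0',\n    'pytest>=7.0',\n]\n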
The improvement over setup.py + setup.cfg
is that pyproject.toml
also allows metadata for other tools to be specified in it, essentially making sure you only need a single configuration file for your project. For example, in the next module M7 on good coding practices you will learn about the tool ruff
and how it can help format your code. If we want to configure ruff
for our project we can do that directly in pyproject.toml
by adding additional headers:
[tool.ruff]\nruff_option = ...\n
To read more about how to specify pyproject.toml
this page is a good place to start.
setup.py
is the original way of describing how a Python package should be built. The most basic setup.py
file will look like this:
from setuptools import setup\n\nwith open(\"requirements.txt\") as f:\n    requirements = f.read().splitlines()\n\nsetup(\n    name=\"my-package-name\",\n    version=\"0.1.0\",\n    author=\"EM\",\n    description=\"Something cool here.\",\n    install_requires=requirements,\n)\n
Essentially, it is the exact same meta information as in pyproject.toml
, just written directly in Python syntax instead of toml
. Because there was a wish to separate this meta information into its own file, the setup.cfg
file was created which can contain the exact same information as setup.py
just in a declarative config.
[metadata]\nname = my-package-name\nversion = 0.1.0\nauthor = EM\ndescription = \"Something cool here.\"\n# ...\n
This non-standardized way of providing meta information about a package was essentially what led to the creation of pyproject.toml
.
Regardless of what way a project is configured, after creating the above files the correct way to install them would be the same
pip install .\n# or in developer mode\npip install -e . # (1)!\n
-e
is short for --editable
mode, also called developer mode. Since we will be continuously iterating on our package, this is the preferred way to install it, because it means that we do not have to run pip install
every time we make a change. Essentially, in developer mode changes in the Python source code take effect immediately without requiring a new installation. After running this your code should be available to import as from project_name import ...
like any other Python package you use. This is the most essential information you need to know about creating Python packages.
After having installed cookiecutter (exercises 1 and 2), the remaining exercises are intended to take the simple CNN MNIST classifier from yesterday's exercise and force it into this structure. You are not required to fill out every folder and file in the project structure, but try to at least follow the steps in the exercises. Whenever you need to run a file, I recommend always doing this from the root directory, e.g.
python <project_name>/data/make_dataset.py data/raw data/processed\npython <project_name>/models/train_model.py <arguments>\netc...\n
in this way paths (for saving and loading files) are always relative to the root.
Install cookiecutter framework
pip install cookiecutter\n
Start a new project using this template, that is specialized for this course (1).
You do this by running the cookiecutter command using the template url:
cookiecutter <url-to-template>\n
Valid project names
When asked for a project name you should follow the PEP8 guidelines for naming packages. This means that the name should be all lowercase and if you want to separate words, you should use underscores. For example my_project
is a valid name, while MyProject
is not. Additionally, the package name cannot start with a number.
There are two common choices for how to lay out your source directory. The first is called src-layout, where the source code is always placed in a src/<project_name>
folder, and the second is called flat-layout, where the source code is just placed in a <project_name>
folder. The template we are using in this course uses the flat-layout, but there are pros and cons to both.
After having created your new project, the first step is to also create a corresponding virtual environment and install any needed requirements. If you have a virtual environment from yesterday, feel free to use that; otherwise create a new one. Then install the project in that environment
pip install -e .\n
Start by filling out the <project_name>/data/make_dataset.py
file. When this file runs, it should take the raw data e.g. the corrupted MNIST files from yesterday (../data/corruptmnist
) which now should be located in a data/raw
folder and process them into a single tensor, normalize the tensor and save this intermediate representation to the data/processed
folder. By normalization here we refer to making sure the images have mean 0 and standard deviation 1.
import click\nimport torch\n\n\ndef normalize(images: torch.Tensor) -> torch.Tensor:\n    \"\"\"Normalize images to zero mean and unit standard deviation.\"\"\"\n    return (images - images.mean()) / images.std()\n\n\n@click.command()\n@click.option(\"--raw_dir\", default=\"data/raw\", help=\"Path to raw data directory\")\n@click.option(\"--processed_dir\", default=\"data/processed\", help=\"Path to processed data directory\")\ndef make_data(raw_dir: str, processed_dir: str) -> None:\n    \"\"\"Process raw data and save it to processed directory.\"\"\"\n    train_images, train_target = [], []\n    for i in range(5):\n        train_images.append(torch.load(f\"{raw_dir}/train_images_{i}.pt\"))\n        train_target.append(torch.load(f\"{raw_dir}/train_target_{i}.pt\"))\n    train_images = torch.cat(train_images)\n    train_target = torch.cat(train_target)\n\n    test_images: torch.Tensor = torch.load(f\"{raw_dir}/test_images.pt\")\n    test_target: torch.Tensor = torch.load(f\"{raw_dir}/test_target.pt\")\n\n    train_images = train_images.unsqueeze(1).float()\n    test_images = test_images.unsqueeze(1).float()\n    train_target = train_target.long()\n    test_target = test_target.long()\n\n    train_images = normalize(train_images)\n    test_images = normalize(test_images)\n\n    torch.save(train_images, f\"{processed_dir}/train_images.pt\")\n    torch.save(train_target, f\"{processed_dir}/train_target.pt\")\n    torch.save(test_images, f\"{processed_dir}/test_images.pt\")\n    torch.save(test_target, f\"{processed_dir}/test_target.pt\")\n\n\nif __name__ == \"__main__\":\n    make_data()\n
This template comes with a Makefile
that can be used to easily define common operations in a project. You do not have to understand the complete file but try taking a look at it. In particular the following commands may come in handy
make data # runs the make_dataset.py file, try it!\nmake clean # clean __pycache__ files\nmake requirements # install everything in the requirements.txt file\n
Windows users make
is a GNU build tool that is not available on Windows by default. There are two recommended ways to get it running on Windows. The first is leveraging the Windows Subsystem for Linux, which you may already have installed. The second option is utilizing the chocolatey package manager, which enables Windows users to install packages similarly to Linux systems.
In general we recommend that you add commands to the Makefile
as you move along in the course. If you want to know more about how to write Makefile
s then this is an excellent video.
Put your model file (model.py
) into <project_name>/models
folder and insert the relevant code from the main.py
file into the train_model.py
file. Make sure that whenever a model is trained and saved, it gets saved to the models
folder (preferably in sub-folders).
When you run train_model.py
, make sure that some statistics/visualizations from the trained model get saved to the reports/figures/
folder. This could be a simple .png
of the training curve.
(Optional) Can you figure out a way to add a train
command to the Makefile
such that training can be started using
make train\n
Solution train:\n python <project_name>/models/train_model.py\n
Fill out the newly created <project_name>/models/predict_model.py
file, such that it takes a pre-trained model file and creates predictions for some data. The recommended interface is that users can give this file either a folder with raw images that get loaded in, or a numpy
or pickle
file with already loaded images e.g. something like this
python <project_name>/models/predict_model.py \\\n models/my_trained_model.pt \\ # file containing a pretrained model\n data/example_images.npy # file containing just 10 images for prediction\n
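There is no single correct solution, but a minimal sketch could look like the following; the model class, its import path and the assumption that the data is a numpy file of already-processed images are all just for illustration:
import click\nimport numpy as np\nimport torch\n\nfrom my_project_name.model import MyAwesomeModel  # hypothetical import path\n\n\n@click.command()\n@click.argument(\"model_checkpoint\")\n@click.argument(\"data_path\")\ndef predict(model_checkpoint: str, data_path: str) -> None:\n    \"\"\"Predict labels for the images in data_path using a pre-trained model.\"\"\"\n    model = MyAwesomeModel()\n    model.load_state_dict(torch.load(model_checkpoint))\n    model.eval()\n\n    images = torch.from_numpy(np.load(data_path)).float()\n    with torch.inference_mode():\n        predictions = model(images).argmax(dim=1)\n    print(predictions)\n\n\nif __name__ == \"__main__\":\n    predict()\n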
Fill out the file <project_name>/visualization/visualize.py
with this (as minimum, feel free to add more visualizations)
reports/figures/
folder. The solution here depends a bit on the choice of model. However, in most cases the last layer in your model will be a fully connected layer, which we assume is named fc
. The easiest way to get the features before this layer is to replace the layer with torch.nn.Identity
which essentially does nothing (see highlighted line below). Alternatively, if you implemented everything in a torch.nn.Sequential
you can just remove the last layer from the Sequential
object: model = model[:-1]
.
import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom my_project_name.model import MyAwesomeModel\nfrom sklearn.decomposition import PCA\nfrom sklearn.manifold import TSNE\n\n\n@click.command()\n@click.option(\"--model_checkpoint\", default=\"model.pth\", help=\"Path to model checkpoint\")\n@click.option(\"--processed_dir\", default=\"data/processed\", help=\"Path to processed data directory\")\n@click.option(\"--figure_dir\", default=\"reports/figures\", help=\"Path to save figures\")\n@click.option(\"--figure_name\", default=\"embeddings.png\", help=\"Name of the figure\")\ndef visualize(model_checkpoint: str, processed_dir: str, figure_dir: str, figure_name: str) -> None:\n    \"\"\"Visualize model predictions.\"\"\"\n    model = MyAwesomeModel()\n    model.load_state_dict(torch.load(model_checkpoint))\n    model.eval()\n    model.fc = torch.nn.Identity()\n\n    test_images = torch.load(f\"{processed_dir}/test_images.pt\")\n    test_target = torch.load(f\"{processed_dir}/test_target.pt\")\n    test_dataset = torch.utils.data.TensorDataset(test_images, test_target)\n\n    embeddings, targets = [], []\n    with torch.inference_mode():\n        for batch in torch.utils.data.DataLoader(test_dataset, batch_size=32):\n            images, target = batch\n            predictions = model(images)\n            embeddings.append(predictions)\n            targets.append(target)\n    embeddings = torch.cat(embeddings).numpy()\n    targets = torch.cat(targets).numpy()\n\n    if embeddings.shape[1] > 500:  # reduce dimensionality for large embeddings\n        pca = PCA(n_components=100)\n        embeddings = pca.fit_transform(embeddings)\n    tsne = TSNE(n_components=2)\n    embeddings = tsne.fit_transform(embeddings)\n\n    plt.figure(figsize=(10, 10))\n    for i in range(10):\n        mask = targets == i\n        plt.scatter(embeddings[mask, 0], embeddings[mask, 1], label=str(i))\n    plt.legend()\n    plt.savefig(f\"{figure_dir}/{figure_name}\")\n\n\nif __name__ == \"__main__\":\n    visualize()\n
(Optional) Feel free to create more files/visualizations (what about investigating/explore the data distribution?)
Make sure to update the README.md
file with a short description on how your scripts should be run
Finally make sure to update the requirements.txt
file with any packages that are necessary for running your code (see this set of exercises for help)
(Optional) Let's say that you are not satisfied with the template I have recommended, which is completely fine. What should you then do? You should of course create your own template! This is actually not that hard to do.
Just as a starting point, I would recommend that you fork either the mlops template, which you have already been using, or alternatively the data science template.
After forking the template, clone it down locally and let's start modifying it. The first step is changing the cookiecutter.json
file. For the mlops template it looks like this:
{\n \"project_name\": \"project_name\",\n \"repo_name\": \"{{ cookiecutter.project_name.lower().replace(' ', '_') }}\",\n \"author_name\": \"Your name (or your organization/company/team)\",\n \"description\": \"A short description of the project.\",\n \"python_version_number\": \"3.10\",\n \"open_source_license\": [\"No license file\", \"MIT\", \"BSD-3-Clause\"]\n}\n
Simply add a new line to the json file with the name of the variable you want to add and the default value you want it to have.
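For example, adding a hypothetical choice-type variable could be done by appending a line like the following inside the existing json object (cookiecutter presents a list of values as a multiple-choice prompt):
\"include_dockerfile\": [\"yes\", \"no\"]\n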
The actual template is located in the {{ cookiecutter.project_name }}
folder. cookiecutter
works by replacing everywhere that it sees {{ cookiecutter.<variable_name> }}
with the value of the variable. Therefore, if you want to add a new file to the template, just add it to the {{ cookiecutter.project_name }}
folder and make sure to add the {{ cookiecutter.<variable_name> }}
where you want the variable to be replaced.
After you have made the changes you want to the template, you should test it locally. Just run
cookiecutter . -f --no-input\n
and it should create a new folder using the default values of the cookiecutter.json
file.
Finally, make sure to push any changes you made to the template to GitHub, such that you in the future can use it by simply running
cookiecutter https://github.com/<username>/<my_template_repo>\n
Starting completely from scratch, what are the steps needed to create a new GitHub repository and push a specific template to it as the very first commit?
SolutionCreate a completely barebone repository, either using the GitHub UI or if you have the GitHub cli installed (not git
) you can run
gh repo create <repo_name> --public --confirm\n
Run cookiecutter
with the template you want to use
cookiecutter <template>\n
The name of the folder created by cookiecutter
should be the same as you just used.
Run the following sequence of commands
cd <project_name>\ngit init\ngit add .\ngit commit -m \"Initial commit\"\ngit remote add origin https://github.com/<username>/<repo_name>\ngit push origin master\n
That's it. The template should now have been pushed to the repository as the first commit.
That ends the module on code structure and cookiecutter
. We again want to stress that the point of using cookiecutter
is not about following one specific template, but just about using any template for organizing your code. What often happens in a team is that multiple templates are needed at different stages of the development phase or for different product types, because they share a common structure while still having some specifics. Keeping templates up-to-date then becomes critical so that no team member is using an outdated template. If you ever end up in this situation, we highly recommend checking out cruft, which works alongside cookiecutter
to not only create projects but also update existing ones as the template evolves. Cruft additionally has template validation capabilities to ensure projects match the latest version of a template.
Core Module
In this module, we are going to return to version control. However, this time we are going to focus on version control of data. The reason we need to distinguish between standard version control and data version control comes down to one problem: size.
Classic version control was developed to keep track of code files, which are all simple text files. Even a codebase that contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better the more data you feed them, we are seeing models today that are being trained on petabytes of data (1,000,000 GB).
Because this is an important concept there exist a couple of frameworks that have specialized in versioning data, such as DVC, DAGsHub, Hub, Modelstore and ModelDB. Regardless of the framework, they all implement somewhat the same concept: instead of storing the actual data files, or in general any large artifact files, we instead store a pointer to these large files. We then version control the pointer instead of the artifact.
Image creditWe are in this course going to use DVC
provided by iterative.ai, as they also provide tools for automating machine learning, which we are going to focus on later.
DVC (Data Version Control) is simply an extension of git
to not only version data but also models and experiments in general. But how does it deal with these large data files? Essentially, DVC
will just keep track of a small metafile that will then point to some remote location where your original data is stored. Metafiles essentially work as placeholders for your data files. Your large data files are then stored in some remote location such as Google Drive or an S3
bucket from Amazon.
Image credit
As the figure shows, we now have two remote locations: one for code and one for data. We use git pull/push
for the code and dvc pull/push
for the data. The key concept is the connection between the data file model.pkl
which is fairly large and its respective metafile model.pkl.dvc
which is very small. The large file is stored in the data remote and the metafile is stored in the code remote.
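To make the idea of a metafile concrete, a .dvc file is just a small human-readable file that records a hash, a size and a path for the tracked artifact. It looks roughly like this (the hash and size below are placeholders):
outs:\n- md5: 3b4c1d2e5f6a7b8c9d0e1f2a3b4c5d6e\n  size: 12345678\n  path: model.pkl\n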
If in doubt about some of the exercises, we recommend checking out the documentation for DVC as it contains excellent tutorials.
For these exercises, we are going to use Google drive as a remote storage solution for our data. If you do not already have a Google account, please create one (we are going to use it again in later exercises). Please make sure that you at least have 1GB of free space.
Next, install DVC and the Google Drive extension
pip install dvc\npip install dvc-gdrive\n
If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all. If you encounter that the installation fails, we recommend that you start by updating pip and then trying to update dvc
:
pip install -U pip\npip install -U dvc-gdrive\n
If this does not work for you, it is most likely due to a problem with pygit2
and in that case we recommend that you follow the instructions here.
In your MNIST repository run the following command from the terminal
dvc init\n
this will set up dvc
for this repository (similar to how git init
will initialize a git repository). These files should be committed using standard git
to your repository.
Go to your Google Drive and create a new folder called dtu_mlops_data
. Then copy the unique identifier belonging to that folder as shown in the figure below
Using this identifier, add it as a remote storage
dvc remote add -d storage gdrive://<your_identifier>\n
Check the content of the file .dvc/config
. Does it contain a pointer to your remote storage? Afterwards, make sure to add this file to the next commit we are going to make:
git add .dvc/config\n
Call the dvc add
command on your data files exactly like you would add a file with git
(you do not need to add every file by itself as you can directly add the data/
folder). Doing this should create a human-readable file with the extension .dvc
. This is the metafile as explained earlier that will serve as a placeholder for your data. If you are on Windows and this step fails you may need to install pywin32
. At the same time, the data
folder should have been added to the .gitignore
file that marks which files should not be tracked by git. Confirm that this is correct.
Now we are going to add, commit and tag the metafiles so we can restore this stage later on. Commit and tag the files, which should look something like this:
git add data.dvc .gitignore\ngit commit -m \"First datasets, containing 25000 images\"\ngit tag -a \"v1.0\" -m \"data v1.0\"\n
Finally, push your data to the remote storage using dvc push
. You will be asked to authenticate, which involves following the prompted link and copy-pasting the resulting code. Check out your Google Drive folder. You will see that the data is no longer in a recognizable format due to the way that dvc
packs and tracks the data. The boring detail is that dvc
converts the data into content-addressable storage, which makes data much faster to get. Finally, make sure that your data is not stored in your GitHub repository.
After authenticating the first time, DVC
should be set up without having to authenticate again. If dvc for some reason fails to authenticate, you can try to reset the authentication. Locate the file $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
~/Library/Caches
~/.cache
This is the typical location, but it may vary depending on what distro you are running
{user}/AppData/Local
Delete the complete {gdrive_client_id}
folder and retry authenticating with dvc push
.
After completing the above steps, it is very easy for others (or yourself) to get set up with both code and data by simply running
git clone <my_repository>\ncd <my_repository>\ndvc pull\n
(assuming that you give them access rights to the folder in your drive). Try doing this (in some other location than your standard code) to make sure that the two commands indeed download both your code and data.
Let's now look at the process of updating our data. Remember, the important aspect of version control is that we do not need to store explicit files called data_v1.pt
, data_v2.pt
etc. but just have a single data.pt
where we can always check out earlier versions. Initially, start by copying the data/corruptmnist_v2
folder from this repository to your MNIST code. This contains 3 extra data files with 15000 additional observations. Rerun your data pipeline so these get incorporated into the files in your processed
folder.
Redo the above steps, adding the new data using dvc
, committing and tagging the metafiles e.g. the following commands should be executed (with appropriate input):
dvc add -> git add -> git commit -> git tag -> dvc push -> git push
.
Let's say that you wanted to go back to the state of your data at v1.0. If the above steps have been done correctly, you should be able to do this using:
git checkout v1.0\ndvc checkout\n
confirm that you have reverted to the original data.
(Optional) Finally, it is important to note that dvc
is not only intended to be used to store data files but also any other large files such as trained model weights (with billions of parameters these can be quite large). For example, if we always store our best-performing model in a file called best_model.ckpt
then we can use dvc
to version control it, store it online and make it easy for others to download. Feel free to experiment with this using your model checkpoints.
In general dvc
is a great framework for version-controlling data and models. However, it is important to note that it does have some performance issues when dealing with datasets that consist of many files. Therefore, if you are ever working with a dataset that consists of many small files, it can be a good idea to:
zip files into a single archive and then version control the archive. The zip
archive should be placed in a data/raw
folder and then unzipped in the data/processed
folder.
If possible turn your data into 1D arrays, then it can be stored in a single file such as .parquet
or .csv
. This is especially useful for tabular data. Then you can version control the single file instead of the many files.
How do you know that a repository is using dvc?
SolutionSimilar to a git repository having a .git
directory, a repository using dvc needs to have a .dvc
folder. Alternatively you can use the dvc status
command.
Assume you just added a folder called data/
that you want to track with dvc
. What is the sequence of 5 commands to successfully version control the folder? (assuming you have already set up a remote)
dvc add data/\ngit add .\ngit commit -m \"added raw data\"\ngit push\ndvc push\n
That's all for today. With the combined power of git
and dvc
we should be able to version control everything in our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that dvc
offers more than just data version control, so if you want to deep dive into dvc
we recommend their pipeline feature and how this can be used to setup version controlled experiments. Note that we are going to revisit dvc
later for a more permanent (and large-scale) storage solution.
Core Module
Proper collaboration with other people requires that you can work on the same codebase in an organized manner. This is the reason that version control exists. Simply stated, it is a way to keep track of:
For a full explanation please see this page
Secondly, it is important to note that GitHub is not git! GitHub is the dominating player when it comes to hosting repositories, but that does not mean that they are the only ones providing free repository hosting (see Bitbucket or GitLab for some other examples).
That said we will be using git and GitHub throughout this course. It is a requirement for passing this course that you create a public repository with your code and use git to upload any code changes. How much you choose to integrate this into your own projects depends, but you are at least expected to be familiar with git+GitHub.
Image credit"},{"location":"s2_organisation_and_version_control/git/#initial-config","title":"Initial config","text":"What does Git stand for?
The name \"git\" was given by Linus Torvalds when he wrote the very first version. He described the tool as \"the stupid content tracker\" and the name as (depending on your mood):
Install git on your computer and make sure that your installation is working by writing git help
in a terminal and it should show you the help message for git.
Create a GitHub account if you do not already have one.
To make sure that we do not have to type in our GitHub credentials every time we want to make changes, we can set them once and for all on our local machine
# type in a terminal\ngit config credential.helper store\ngit config --global user.name <username>\ngit config --global user.email <email>\n
The most simple way to think of version control, is that it is just nodes with lines connecting them
Each node, which we call a commit is uniquely identified by a hash string. Each node, stores what our code looked like at that point in time (when we made the commit) and using the hash codes we can easily revert to a specific point in time.
The commits are made up of local changes that we make to our code. A basic workflow for adding commits is seen below
Assuming that we have made some changes to our local working directory and that we want to get these updates to be online in the remote repository we have to do the following steps:
First we run the command git add
. This will move our changes to the staging area. While changes are in the staging area we can very easily revert them (using git restore
). A unique hash has therefore not been assigned to the code yet, and we can therefore still overwrite it.
To take our code from the staging area and make it into a commit, we simply run git commit
which will locally add a note to the graph. It is important again, that we have not pushed the commit to the online repository yet.
Finally, we want others to be able to use the changes that we made. We do a simple git push
and our commit gets online
Of course, the real power of version control is the ability to make branches, as in the image below
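In terms of concrete commands, the basic cycle described above could look something like this (the file name and commit message are placeholders):
git add my_file.py                     # stage the change\ngit commit -m \"Describe what changed\"  # create the commit locally\ngit push                               # upload the commit to the remote repository\n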
Image creditEach branch can contain code that is not present on other branches. This is useful when many developers are working together on the same project.
"},{"location":"s2_organisation_and_version_control/git/#exercises","title":"\u2754 Exercises","text":"In your GitHub account create an repository, where the intention is that you upload the code from the final exercise from yesterday
After creating the repository, clone it to your computer
git clone https://github.com/my_user_name/my_repository_name.git\n
Move/copy the three files from yesterday into the repository (and any other that you made)
Add the files to a commit by using git add
command
Commit the files using git commit
command where you use the -m
argument to provide a commit message (1).
Finally push the files to your repository using git push
. Make sure to check online that the files have been updated in your repository.
You can always use the command git status
to check where you are in the process of making a commit.
Also checkout the git log
command, which will show you the history of commits that you have made.
Make sure that you understand how to make branches, as this will allow you to try out code changes without messing with your working code. Creating a new branch can be done using:
# create a new branch\ngit checkout -b <my_branch_name>\n
Afterwards, you can use git checkout
(1) to change between branches (remember to commit your work!). Try adding something (a file, a new line of code etc.) to the newly created branch, commit it and try changing back to master afterwards. You should hopefully see that whatever you added on the branch is not present on the main branch.
git checkout
command is used for a lot of different things in git. It can be used to change branches, to revert changes and to create new branches. An alternative is using git switch
and git restore
which are more modern commands. If you do not already have a cloned version of this repository belonging to the course, make sure to make one! I am continuously updating/changing some of the material during the course and I therefore recommend that each day, before the lecture, you do a git pull
on your local copy
Git may seem like a waste of time when solutions like Dropbox, Google Drive etc. exist, and that is not completely untrue when only one or two people are working on a project. However, these file management systems fall short when hundreds to thousands of people work together. For this exercise you will go through the steps of making an open-source contribution:
Go online and find a project you do not own, where you can improve the code. You can either look at this page of good issues to get started with or for simplicity you can just choose the repository belonging to the course. Now fork the project by clicking the Fork button.
This will create a copy of the repository under your own GitHub account, which you have complete write access to. Note that code updates to the original repository do not automatically update the code in your fork.
Clone your local fork of the project using git clone
.
By default your local repository will be on the main branch
(HINT: you can check this with the git status
command). It is good practice to make a new branch when working on some changes. Use the git branch
command followed by the git checkout
command to create a new branch.
You are now ready to make changes to the repository. Try to find something to improve (any spelling mistakes?). When you have made the changes, do the standard git cycle: add -> commit -> push
Go online to the original repository and go to the Pull requests
tab. Find compare
button and choose the button to compare the master branch
of the original repo with the branch that you just created in your own repository. Check the diff on the page to make sure that it contains the changes you have made.
Write a bit about the changes you have made and click Create pull request
:)
Forking a repository has the consequence that your fork and the repository that you forked can diverge. To mitigate this we can set what is called a remote upstream. Take a look at this page and set a remote upstream for the repository you just forked.
Solutiongit remote add upstream <url-to-original-repo>\n
After setting the upstream branch, we need to pull and merge any update. Take a look on this page and figure out how to do this.
Solutiongit fetch upstream\ngit checkout main\ngit merge upstream/main\n
As a final exercise we want to simulate a merge conflict, which happens when two users commit changes to exactly the same lines of code in the codebase and git is not able to resolve how the different commits should be integrated.
In your browser, open your favorite repository (it could be the one you just worked on), go to any file of your choosing and click the edit button (see image below) and make some change to the file. For example, if you choose a Python file you can just import some random packages at the top of the file. Commit the change.
Make sure not to pull the change you just made to your local computer. Locally make changes to the same file in the same lines and commit them afterwards.
Now try to git pull
the online changes. What should (hopefully) happen is that git will tell you that it found a merge conflict that needs to be resolved. Open the file and you should see something like this
<<<<<<< HEAD\nthis is some content to mess with\ncontent to append\n=======\ntotally different content to merge later\n>>>>>>> master\n
this should be interpreted as: everything between <<<<<<<
and =======
are the changes made by your local commit and everything between =======
and >>>>>>>
are the changes you are trying to pull. To fix the merge conflict you simply have to make the code in the two \"cells\" work together. When you are done, remove the identifiers <<<<<<<
, =======
and >>>>>>>
.
Finally, commit the merge and try to push.
(Optional) The above exercises have focused on how to use git from the terminal, which I highly recommend learning. However, if you are using a proper editor it will also have built-in support for version control. We recommend getting familiar with these features (here is a tutorial for VS Code).
How do you know if a certain directory is a git repository?
SolutionYou can check if there is a \".git\" directory. Alternatively, you can use the git status
command.
Explain what the file gitignore
is used for?
The file gitignore
is used to tell git which files to ignore when doing a git add .
command. This is useful for files that are not part of the codebase, but are needed for the code to run (e.g. data files) or files that contain sensitive information (e.g. .env
files that contain API keys and passwords).
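As an illustration, a gitignore file for a typical machine learning project might contain entries like the following (the exact entries of course depend on your project):
__pycache__/\n*.pyc\n.env\ndata/\nmodels/\nreports/figures/\n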
You have two branches - main and devel. What sequence of commands would you need to execute to make sure that devel is in sync with main?
Solutiongit checkout main\ngit pull\ngit checkout devel\ngit merge main\n
What best practices are you familiar with regarding version control?
SolutionThat covers the basics of git to get you started. In the exercise folder you can find a git cheat sheet with the most useful commands for future reference. Finally, we want to point out another awesome feature of GitHub: the in-browser editor. Sometimes you have a small edit that you want to make but would still like to do it in an IDE/editor. Or you may be in a situation where you are working from another device than your usual developer machine. GitHub has a built-in editor that can be enabled by simply changing any URL from
https://github.com/username/repository\n
to
https://github.dev/username/repository\n
Try it out on your newly created repository.
"},{"location":"s2_organisation_and_version_control/good_coding_practice/","title":"M7 - Good coding practice","text":""},{"location":"s2_organisation_and_version_control/good_coding_practice/#good-coding-practice","title":"Good coding practice","text":"Quote
Code is read more often than it is written. Guido Van Rossum (author of Python)
It is hard to define exactly what good coding practices are. But the above quote by Guido does hint at what it could be, namely that it has to do with how others read and perceive your code. In general, good coding practice is about making sure that your code is easy to read and understand, not only by others but also by your future self. The key concept to keep in mind when we are talking about good coding practice is consistency. In many cases it does not matter exactly how you choose to style your code; the important part is that you are consistent about it.
Image credit"},{"location":"s2_organisation_and_version_control/good_coding_practice/#documentation","title":"Documentation","text":"Most programmers have a love-hate relationship with documentation: we absolutely hate writing it ourselves, but love when someone else has actually taken the time to add it to their code. There is no doubt that well-documented code is much easier to maintain, as you do not need to remember every detail about the code to keep maintaining it. It is key to remember that good documentation saves more time than it takes to write.
The problem with documentation is that there is no right or wrong way to do it. You can end up doing:
Under documentation: You document information that is clearly visible from the code and not the complex parts that are actually hard to understand.
Over documentation: Writing too much documentation will have the opposite effect on most people than what you want: there is too much to read, so people will skip it.
Writing good documentation is a skill that takes time to train, so let's try to do it.
Quote
Code tells you how; Comments tell you why. Jeff Atwood
"},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises","title":"\u2754 Exercises","text":"Go over the most complicated file in your project. Be critical and add comments where the logic behind the code is not easily understandable. (1)
In deep learning we often work with multi-dimensional tensors that constantly change shape after each operation. It is good practice to add a comment when a tensor undergoes some reshaping. In the following example we compute the pairwise Euclidean distance between two tensors using broadcasting, which results in multiple shape operations.
x = torch.randn(5, 10) # N x D\ny = torch.randn(7, 10) # M x D\nxy = x.unsqueeze(1) - y.unsqueeze(0) # (N x 1 x D) - (1 x M x D) = (N x M x D)\npairwise_euc_dist = xy.abs().pow(2.0).sum(dim=-1) # N x M\n
Add docstrings to at least two Python functions/methods. You can see here (example 5) a good example of how to use identifiable keywords such as Parameters
, Args
, Returns
which standardizes the way of writing docstrings.
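As a sketch of what such a docstring could look like, here is a made-up function documented in the Google style:
def accuracy(preds: list[int], targets: list[int]) -> float:\n    \"\"\"Compute the fraction of correct predictions.\n\n    Args:\n        preds: Predicted class labels.\n        targets: Ground-truth class labels.\n\n    Returns:\n        The accuracy as a number between 0 and 1.\n    \"\"\"\n    return sum(p == t for p, t in zip(preds, targets)) / len(targets)\n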
While Python already enforces some styling (e.g. code should be indented in a specific way), this is not enough to ensure that code from different users actually looks alike. Maybe even more troubling is that you will often see your own style of coding change as you become more and more experienced. This kind of difference in coding style is not that important to take care of when you are working on a personal project, but when multiple people are working together on the same project it is important to consider.
The question then remains what styling you should use. This is where Pep8 comes into play, which is the official style guide for python. It is essentially contains what is considered \"good practice\" and \"bad practice\" when coding python.
For many years the most commonly used tool for checking if your code is PEP8 compliant has been flake8. However, in this course we are going to be using ruff, which is quickly gaining popularity due to how fast it is and how quickly the developers are adding new features. (1)
flake8
and ruff
are what are called linters or lint tools, which are static code analysis programs used to flag programming errors, bugs, and styling errors. Install ruff
pip install ruff\n
Run ruff
on your project or part of your project
ruff check . # Lint all files in the current directory (and any subdirectories)\nruff check path/to/code/ # Lint all files in `/path/to/code` (and any subdirectories).\n
are you PEP8 compliant or are you a normal mortal?
You could go and fix all the small errors that ruff
is giving. However, in practice large projects instead rely on some kind of code formatter that will automatically format your code to be PEP8 compliant. Two of the biggest formatters in Python have for the longest time been black and yapf, but we are going to use ruff
which also has a built-in formatter that should be a drop-in replacement for black
.
Try to use ruff format
to format your code
ruff format . # Format all files in the current directory.\nruff format /path/to/file.py # Format a single file.\n
By default ruff
will apply a selection of rules when we are either checking or formatting our code. However, many more rules can be activated through configuration. If you have completed module M6 on code structure you will have encountered the pyproject.toml
file, which can store both build instructions for our package and configuration of developer tools. Let's try to configure ruff
using the pyproject.toml
file.
One aspect that is not covered by PEP8 is how import
statements in Python should be organized. If you are like most people, you place your import
statements at the top of the file, ordered simply by when you needed them. A better practice is to introduce some clear structure in our imports. In older versions of this course we used isort for this, but here we are going to configure ruff
to do the job. In your pyproject.toml
file add the following lines
[tool.ruff]\nselect = [\"I\"]\n
and try re-running ruff check
and ruff format
. Hopefully this should reorganize your imports to follow common practice. (1)
os
) in one block, followed by third-party dependencies (like torch
) in a second block and finally imports from your own package in a third block. Each block is then put in alphabetical order. One PEP8 styling rule that is often diverged from is the recommended line length of 79 characters, which many (including myself) consider very restrictive. If your code consists of multiple levels of indentation, you can quickly run into 79 characters being limiting. For this reason many projects increase it, often to 120 characters, which seems to be the sweet spot for how many characters fit in a coding window on a laptop. Add the line
line-length=120\n
under the [tool.ruff]
section in the pyproject.toml
file and rerun ruff check
and ruff format
on your code.
Experiment yourself with further configuration of ruff
. In particular, we recommend adding more rules and looking at the [tool.ruff.pydocstyle]
configuration to indicate how you have styled your documentation.
In addition to writing documentation and following a specific style, in Python we have a third way of improving the quality of our code: through typing. Typing goes back to earlier programming languages like c
, c++
etc., where data types needed to be explicitly stated for variables:
#include <iostream>\n\nint main() {\n    int x = 5 + 6;\n    float y = 0.5;\n    std::cout << \"Hello World! \" << x << \" \" << y << std::endl;\n    return 0;\n}\n
This is not required by Python, but it can really improve the readability of code: you can directly read from the code what the expected types of input arguments and return values are. In Python the :
character has been reserved for type hints. Here is one example of adding typing to a function:
def add2(x: int, y: int) -> int:\n return x+y\n
here we mark that both x
and y
are integers and using the arrow notation ->
we mark that the output type is also an integer. Assuming that we are also going to use the function for floats and torch.Tensor
s we could improve the typing by specifying a union of types. Depending on the version of Python you are using the syntax for this can be different.
from torch import Tensor # note it is Tensor with upper case T. This is the base class of all tensors\nfrom typing import Union\ndef add2(x: Union[int, float, Tensor], y: Union[int, float, Tensor]) -> Union[int, float, Tensor]:\n return x+y\n
from torch import Tensor # note it is Tensor with upper case T. This is the base class of all tensors\ndef add2(x: int | float | Tensor, y: int | float | Tensor) -> int | float | Tensor:\n return x+y\n
Finally, since this is a very generic function it also works on numpy
arrays, etc. In such cases we can always default to the Any
type if we are not sure about all the specific types that a function can take
from typing import Any\ndef add2(x: Any, y: Any) -> Any:\n return x+y\n
However, in this case we are basically in the same situation as if our function were not typed, as the type hints do not help us at all. Therefore, use Any
only when necessary.
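If we want to avoid Any but keep the function generic, one option (shown here only as a hedged sketch, not part of the original example) is a constrained TypeVar, which tells both the reader and the type checker that the output type follows the input types:
from typing import TypeVar\n\nfrom torch import Tensor\n\nT = TypeVar(\"T\", int, float, Tensor)  # the function only accepts these types\n\n\ndef add2(x: T, y: T) -> T:\n    return x + y\n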
Exercise files
We provide a file called typing_exercise.py
. Add typing everywhere in the file. Please note that you will need the following import:
from typing import Callable, Optional, Tuple, Union, List # you will need all of them in your code\n
for it to work. This cheat sheet is a good resource on typing. We also provide typing_exercise_solution.py
, but try to solve the exercise yourself.
typing_exercise.py
typing_exercise.pyimport torch\nfrom torch import nn\n\n\nclass Network(nn.Module):\n \"\"\"Builds a feedforward network with arbitrary hidden layers.\n\n Arguments:\n input_size: integer, size of the input layer\n output_size: integer, size of the output layer\n hidden_layers: list of integers, the sizes of the hidden layers\n\n \"\"\"\n\n def __init__(self, input_size, output_size, hidden_layers, drop_p=0.5) -> None:\n super().__init__()\n # Input to a hidden layer\n self.hidden_layers = nn.ModuleList([nn.Linear(input_size, hidden_layers[0])])\n\n # Add a variable number of more hidden layers\n layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])\n self.hidden_layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes])\n\n self.output = nn.Linear(hidden_layers[-1], output_size)\n\n self.dropout = nn.Dropout(p=drop_p)\n\n def forward(self, x):\n \"\"\"Forward pass through the network, returns the output logits.\"\"\"\n for each in self.hidden_layers:\n x = nn.functional.relu(each(x))\n x = self.dropout(x)\n x = self.output(x)\n\n return nn.functional.log_softmax(x, dim=1)\n\n\ndef validation(model, testloader, criterion):\n \"\"\"Validation pass through the dataset.\"\"\"\n accuracy = 0\n test_loss = 0\n for images, labels in testloader:\n images = images.resize_(images.size()[0], 784)\n\n output = model.forward(images)\n test_loss += criterion(output, labels).item()\n\n ## Calculating the accuracy\n # Model's output is log-softmax, take exponential to get the probabilities\n ps = torch.exp(output)\n # Class with highest probability is our predicted class, compare with true label\n equality = labels.data == ps.max(1)[1]\n # Accuracy is number of correct predictions divided by all predictions, just take the mean\n accuracy += equality.type_as(torch.FloatTensor()).mean()\n\n return test_loss, accuracy\n\n\ndef train(model, trainloader, testloader, criterion, optimizer=None, epochs=5, print_every=40) -> None:\n \"\"\"Train a PyTorch Model.\"\"\"\n if optimizer is None:\n optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\n steps = 0\n running_loss = 0\n for e in range(epochs):\n # Model in training mode, dropout is on\n model.train()\n for images, labels in trainloader:\n steps += 1\n\n # Flatten images into a 784 long vector\n images.resize_(images.size()[0], 784)\n\n optimizer.zero_grad()\n\n output = model.forward(images)\n loss = criterion(output, labels)\n loss.backward()\n optimizer.step()\n\n running_loss += loss.item()\n\n if steps % print_every == 0:\n # Model in inference mode, dropout is off\n model.eval()\n\n # Turn off gradients for validation, will speed up inference\n with torch.no_grad():\n test_loss, accuracy = validation(model, testloader, criterion)\n\n print(\n f\"Epoch: {e + 1}/{epochs}.. \",\n f\"Training Loss: {running_loss / print_every:.3f}.. \",\n f\"Test Loss: {test_loss / len(testloader):.3f}.. \",\n f\"Test Accuracy: {accuracy / len(testloader):.3f}\",\n )\n\n running_loss = 0\n\n # Make sure dropout and grads are on for training\n model.train()\n
Solution typing_exercise_solution.pyfrom __future__ import annotations\n\nfrom collections.abc import Callable\n\nimport torch\nfrom torch import nn\n\n\nclass Network(nn.Module):\n \"\"\"Builds a feedforward network with arbitrary hidden layers.\n\n Arguments:\n input_size: integer, size of the input layer\n output_size: integer, size of the output layer\n hidden_layers: list of integers, the sizes of the hidden layers\n\n \"\"\"\n\n def __init__(\n self,\n input_size: int,\n output_size: int,\n hidden_layers: list[int],\n drop_p: float = 0.5,\n ) -> None:\n super().__init__()\n # Input to a hidden layer\n self.hidden_layers = nn.ModuleList([nn.Linear(input_size, hidden_layers[0])])\n\n # Add a variable number of more hidden layers\n layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])\n self.hidden_layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes])\n\n self.output = nn.Linear(hidden_layers[-1], output_size)\n\n self.dropout = nn.Dropout(p=drop_p)\n\n def forward(self, x: torch.Tensor) -> torch.Tensor:\n \"\"\"Forward pass through the network, returns the output logits.\"\"\"\n for each in self.hidden_layers:\n x = nn.functional.relu(each(x))\n x = self.dropout(x)\n x = self.output(x)\n\n return nn.functional.log_softmax(x, dim=1)\n\n\ndef validation(\n model: nn.Module,\n testloader: torch.utils.data.DataLoader,\n criterion: Callable | nn.Module,\n) -> tuple[float, float]:\n \"\"\"Validation pass through the dataset.\"\"\"\n accuracy = 0\n test_loss = 0\n for images, labels in testloader:\n images = images.resize_(images.size()[0], 784)\n\n output = model.forward(images)\n test_loss += criterion(output, labels).item()\n\n ## Calculating the accuracy\n # Model's output is log-softmax, take exponential to get the probabilities\n ps = torch.exp(output)\n # Class with highest probability is our predicted class, compare with true label\n equality = labels.data == ps.max(1)[1]\n # Accuracy is number of correct predictions divided by all predictions, just take the mean\n accuracy += equality.type_as(torch.FloatTensor()).mean().item()\n\n return test_loss, accuracy\n\n\ndef train(\n model: nn.Module,\n trainloader: torch.utils.data.DataLoader,\n testloader: torch.utils.data.DataLoader,\n criterion: Callable | nn.Module,\n optimizer: None | torch.optim.Optimizer = None,\n epochs: int = 5,\n print_every: int = 40,\n) -> None:\n \"\"\"Train a PyTorch Model.\"\"\"\n if optimizer is None:\n optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\n steps = 0\n running_loss = 0\n for e in range(epochs):\n # Model in training mode, dropout is on\n model.train()\n for images, labels in trainloader:\n steps += 1\n\n # Flatten images into a 784 long vector\n images.resize_(images.size()[0], 784)\n\n optimizer.zero_grad()\n\n output = model.forward(images)\n loss = criterion(output, labels)\n loss.backward()\n optimizer.step()\n\n running_loss += loss.item()\n\n if steps % print_every == 0:\n # Model in inference mode, dropout is off\n model.eval()\n\n # Turn off gradients for validation, will speed up inference\n with torch.no_grad():\n test_loss, accuracy = validation(model, testloader, criterion)\n\n print(\n f\"Epoch: {e + 1}/{epochs}.. \",\n f\"Training Loss: {running_loss / print_every:.3f}.. \",\n f\"Test Loss: {test_loss / len(testloader):.3f}.. \",\n f\"Test Accuracy: {accuracy / len(testloader):.3f}\",\n )\n\n running_loss = 0\n\n # Make sure dropout and grads are on for training\n model.train()\n
mypy is what is called a static type checker. If you are using typing in your code, then a static type checker can help you find common mistakes. mypy
does not run your code, but it scans it and checks that the types you have given are compatible. Install mypy
pip install mypy\n
Try to run mypy
on the typing_exercise.py
file
mypy typing_exercise.py\n
If you have solved exercise 11 correctly then you should get no errors. If not, mypy
should tell you where your types are incompatible.
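For example (a made-up snippet, not one of the exercise files), running mypy on a file containing the following would flag the call because a str is passed where an int is expected:
def add2(x: int, y: int) -> int:\n    return x + y\n\n\nadd2(1, \"2\")  # mypy flags this line: the second argument is a str, not an int\n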
According to PEP8 what is wrong with the following code?
class myclass(nn.Module):\n def TrainNetwork(self, X, y):\n ...\n
Solution According to PEP8 classes should follow the CapWords convention, meaning that the first letter in each word of the class name should be capitalized. Thus myclass
should be MyClass
. On the other hand, functions and methods should be fully lowercase with words separated by underscores. Thus TrainNetwork
should be train_network
.
What would be the type of argument x
for a function def f(x):
if it should support the following input
x1 = [1, 2, 3, 4]\nx2 = (1, 2, 3, 4)\nx3 = None\nx4 = {1: \"1\", 2: \"2\", 3: \"3\", 4: \"4\"}\n
Solution The easy solution would be to do def f(x : Any)
. But instead we could also go with:
def f(x: None | tuple[int, ...] | list[int] | dict[int, str]):\n
alternatively, we could also do
from collections.abc import Iterable\n\ndef f(x: None | Iterable[int]):\n
because list
, tuple
and dict
are iterables and therefore can be covered by one type (in this specific case).
This ends the module on coding style. We again want to emphasize that good coding style is more about having a consistent style than strictly following PEP8. A good example of this is Google, which has its own style guide for Python. This guide does not match PEP8 exactly, but it makes sure that different teams within Google working on different projects still follow, to a large degree, the same style; therefore, if a project is handed from one team to another, coding style at least will not be a problem.
"},{"location":"s3_reproducibility/","title":"Reproducibility","text":"Slides
Learn how to create reproducible computing environments using docker
and how to use them to run your code.
M10: Docker
Learn how to use hydra
to manage configuration files and how to integrate it with your code.
M11: Config Files
Today is all about reproducibility - one of those concepts that everyone agrees is very important and that something should be done about, but the reality is that it is very hard to achieve full reproducibility. The last sessions have already touched a bit on how tools like conda
and code organization can help make code more reproducible. Today we are going all the way to ensure that our scripts and our computing environment are fully reproducible.
Reproducibility is closely related to the scientific method:
Observe -> Question -> Hypotheses -> Experiment -> Conclude -> Result -> Observe -> ...
Not having reproducible experiments essentially breaks the cycle between doing experiments and making conclusions. If experiments are not reproducible, then we do not expect that others will arrive at the same conclusion as ourselves. As machine learning experiments are fundamentally the same as doing chemical experiments in a laboratory, we should be equally careful in making sure our environments are reproducible (think of your laptop as your laboratory).
Secondly, if we focus on why reproducibility matters especially in machine learning, it is part of the bigger challenge of making sure that machine learning is trustworthy.
Many different aspects are needed if trustworthy machine learning is ever going to be a reality. We need robustness of our pipelines so we can trust that they do not fail under heavy load. We need integrity to make sure that pipelines are only deployed if they are of high quality. We need explainability to make sure that we understand what our machine learning models are doing, so they are not just black boxes. We need reproducibility to make sure that the results of our models can be reproduced over and over again. Finally, we need fairness to make sure that our models are not biased toward specific populations. Figure inspired by this paper. Trustworthy ML is the idea that machine learning agents can be trusted. Take the example of a machine learning agent being responsible for medical diagnoses. It is very clear that we need to be able to trust that the agent gives us the correct diagnosis for the system to work in practice. Reproducibility plays a big role here, because without it we cannot be sure that the same agent deployed at two different hospitals will give the same diagnosis (given the same input).
Learning objectives
The learning objectives of this session are:
docker
to create a reproducible container, including how to build them from scratch. Use hydra
to integrate with config files. With docker we can make sure that our compute environment is reproducible, but that does not mean that all our experiments magically become reproducible. There are other factors that are important for creating reproducible experiments.
In this paper (a highly recommended read) the authors tried to reproduce the results of 255 papers and to figure out which factors were significant for success. One of those factors was \"Hyperparameters Specified\", i.e. whether or not the authors of the paper had precisely specified the hyperparameters that were used to run the experiments. It should come as no surprise that this can be a determining factor for reproducibility; however, it is not a given that hyperparameters are always well specified.
"},{"location":"s3_reproducibility/config_files/#configure-experiments","title":"Configure experiments","text":"There is really no way around it: deep learning contains a lot of hyperparameters. In general, a hyperparameter is any parameter that affects the learning process (e.g. the weights of a neural network are not hyperparameters because they are a consequence of the learning process). The problem with having many hyperparameters to control in your code, is that if you are not careful and structure them it may be hard after running a experiment to figure out which hyperparameters were actually used. Lack of proper configuration management can cause serious problems with reliability, uptime, and the ability to scale a system.
One of the most basic ways of structuring hyperparameters is just to put them directly into your train.py
script in some object:
class my_hp:\n    batch_size = 64\n    lr = 128\n    other_hp = 12345\n\n# easy access to them\ndl = DataLoader(dataset, batch_size=my_hp.batch_size)\n
the problem here is that configuration is not easy. Each time you want to run a new experiment, you basically have to change the script. If you run the code multiple times without committing the changes in between, then the exact hyperparameter configuration for some experiments may be lost. Alright, with this in mind you change strategy and use an argument parser, e.g. run experiments like this
python train.py --batch_size 256 --learning_rate 1e-4 --other_hp 12345\n
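A minimal sketch of what the corresponding argument parsing inside train.py could look like (the argument names are just illustrative):
import argparse\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"--batch_size\", type=int, default=64)\nparser.add_argument(\"--learning_rate\", type=float, default=1e-4)\nparser.add_argument(\"--other_hp\", type=int, default=12345)\nargs = parser.parse_args()\n\n# the hyperparameters are now available as args.batch_size, args.learning_rate, args.other_hp\n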
This at least solves the problem of configurability. However, we can again end up losing experiments if we are not careful.
What we really want is some way to easily configure our experiments where the hyperparameters are systematically saved with the experiment. For this we turn our attention to Hydra, a configuration tool that is based around writing config files to keep track of hyperparameters. Hydra operates on top of OmegaConf which is a yaml
based hierarchical configuration system.
A simple yaml
configuration file could look like
#config.yaml\nhyperparameters:\n batch_size: 64\n learning_rate: 1e-4\n
with the corresponding Python code for loading the file
from omegaconf import OmegaConf\n# loading\nconfig = OmegaConf.load('config.yaml')\n\n# accessing in two different ways\ndl = DataLoader(dataset, batch_size=config.hyperparameters.batch_size)\noptimizer = torch.optim.Adam(model.parameters(), lr=config['hyperparameters']['learning_rate'])\n
or using hydra
for loading the configuration
import hydra\n\n@hydra.main(config_name=\"basic.yaml\")\ndef main(cfg):\n print(cfg.hyperparameters.batch_size, cfg.hyperparameters.learning_rate)\n\nif __name__ == \"__main__\":\n main()\n
The idea behind refactoring our hyperparameters into .yaml
files is that we disentangle the model configuration from the model. In this way it is easier to do version control of the configuration because we have it in a separate file.
Exercise files
The main idea behind the exercises is to take a single script (that we provide) and use Hydra to make sure that everything gets correctly logged such that you would be able to exactly report to others how each experiment was configured. In the provided script, the hyperparameters are hardcoded into the code and your job will be to separate them out into a configuration file.
Note that we provide a solution (in the vae_solution
folder) that can help you get through the exercise, but try to look online for your answers before looking at the solution. Remember: it's not about the result, it's about the journey.
Start by installing hydra:
pip install hydra-core\n
Remember to add it to your requirements.txt
file.
Next take a look at the vae_mnist.py
and model.py
files and understand what is going on. It is a model we will revisit during the course.
Identify the key hyperparameters of the script. Some of them should be easy to find, but at least 3 have made it into the core part of the code. One essential hyperparameter is also not included in the script but is needed for the run to be completely reproducible (HINT: the weights of any neural network are initialized at random).
Solution From the top of the file, batch_size
, x_dim
, hidden_dim
can be found as hyperparameters. Looking through the code it can be seen that the latent_dim
of the encoder and decoder, lr
of the optimizer, epochs
in the training loop also are hyperparameters. Finally, the seed
is not included in the script but is needed to make the script fully reproducible e.g. torch.manual_seed(seed)
.
Write a configuration file config.yaml
where you write down the hyperparameters that you have found
Get the script running by loading the configuration file inside your script (using hydra) that incorporates the hyperparameters into the script. Note: you should only edit the vae_mnist.py
file and not the model.py
file.
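As a hedged sketch (the config layout and hyperparameter names here are assumptions based on the exercise), the top of vae_mnist.py could end up looking something like this:
import hydra\n\n\n@hydra.main(config_path=\"conf\", config_name=\"config.yaml\")\ndef main(cfg) -> None:\n    batch_size = cfg.hyperparameters.batch_size\n    lr = cfg.hyperparameters.lr\n    epochs = cfg.hyperparameters.epochs\n    # ... the rest of the original training code, now using these values\n\n\nif __name__ == \"__main__\":\n    main()\n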
Run the script
By default hydra will write the results to an outputs
folder, with a sub-folder for the day the experiment was run and a further sub-folder for the time it was started. Inspect your run by going over each file that hydra has generated and check that the information has been logged. Can you find the hyperparameters?
Hydra also allows for dynamically changing and adding parameters on the fly from the command-line:
Try changing one parameter from the command-line
python vae_mnist.py hyperparameters.seed=1234\n
Try adding one parameter from the command-line
python vae_mnist.py +experiment.stuff_that_i_want_to_add=42\n
By default the file vae_mnist.log
should be empty, meaning that whatever you printed to the terminal did not get picked up by Hydra. This is due to Hydra under the hood making use of the native python logging package. This means that to also save all printed output from the script we need to convert all calls to print
with log.info
Create a logger in the script:
import logging\nlog = logging.getLogger(__name__)\n
Exchange all calls to print
with calls to log.info
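For example (a tiny hedged sketch with a made-up message):
import logging\n\nlog = logging.getLogger(__name__)\n\n# before: print(\"Training started\")\nlog.info(\"Training started\")  # after: also ends up in the log file that Hydra writes\n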
Try re-running the script and make sure that the output printed to the terminal also gets saved to the vae_mnist.log
file
Make sure that your script is fully reproducible. To check this you will need two runs of the script to compare. Then run the reproducibility_tester.py
script as
python reproducibility_tester.py path/to/run/1 path/to/run/2\n
the script will go over the trained weights to see if they match and that the hyperparameters were the same. Note: for the script to work, the weights should be saved to a file called trained_model.pt
(this is the default of the vae_mnist.py
script, so only relevant if you have changed the saving of the weights)
Make a new experiment using a new configuration file where you have changed a hyperparameter of your own choice. You are not allowed to change the configuration file in the script but should instead be able to provide it as an argument when launching the script e.g. something like
python vae_mnist.py experiment=exp2\n
We recommend that you use a file structure like this
|--conf\n| |--config.yaml\n| |--experiments\n| |--exp1.yaml\n| |--exp2.yaml\n|--my_app.py\n
Finally, an awesome feature of hydra is the instantiate feature. This allows you to define a configuration file that can be used to directly instantiate objects in Python. Try to create a configuration file that can be used to instantiate the Adam
optimizer in the vae_mnist.py
script.
The configuration file could look like this
optimizer:\n _target_: torch.optim.Adam\n lr: 1e-3\n betas: [0.9, 0.999]\n eps: 1e-8\n weight_decay: 0\n
and the python code to load the configuration file and instantiate the optimizer could look like this
import hydra\nimport torch.optim as optim\n\n@hydra.main(config_name=\"adam.yaml\")\ndef main(cfg):\n optimizer = hydra.utils.instantiate(cfg.optimizer)\n print(optimizer)\n\nif __name__ == \"__main__\":\n main()\n
This will print the optimizer object that is created from the configuration file.
Make your MNIST code reproducible! Apply what you have just done to the simple script to your MNIST code. The only requirement is that you this time use multiple configuration files, meaning that you should have at least one model_conf.yaml
file and a training_conf.yaml
file that separates out the hyperparameters that have to do with the model definition and those that have to do with the training. You can also choose to work with even more complex config setups: in the image below the configuration has two layers such that we can individually specify hyperparameters belonging to a specific model architecture and hyperparameters for each individual optimizer that we may try.
Image credit"},{"location":"s3_reproducibility/docker/","title":"M10 - Docker","text":""},{"location":"s3_reproducibility/docker/#docker","title":"Docker","text":"
Core Module
Image credit. While the above picture may seem silly at first, it is actually pretty close to how Docker came into existence. A big part of creating an MLOps pipeline is being able to reproduce it. Reproducibility goes beyond versioning our code with git
and using conda
environments to keep track of our Python installations. To truly achieve reproducibility, we need to capture system-level components such as:
Docker provides this kind of system-level reproducibility by creating isolated program dependencies. In addition to providing reproducibility, one of the key features of Docker is scalability, which is important when we later discuss deployment. Because Docker ensures system-level reproducibility, it does not (conceptually) matter whether we try to start our program on a single machine or on 1000 machines at once.
"},{"location":"s3_reproducibility/docker/#docker-overview","title":"Docker Overview","text":"Docker has three main concepts: Dockerfile, Docker image, and Docker container:
A Dockerfile is a basic text document that contains all the commands a user could call on the command line to run an application. This includes installing dependencies, pulling data from online storage, setting up code, and specifying commands to run (e.g., python train.py
).
Running, or more correctly, building a Dockerfile will create a Docker image. An image is a lightweight, standalone/containerized, executable package of software that includes everything (application code, libraries, tools, dependencies, etc.) necessary to make an application run.
Actually running an image will create a Docker container. This means that the same image can be launched multiple times, creating multiple containers.
The exercises today will focus on how to construct the actual Dockerfile, as this is the first step to constructing your own container.
"},{"location":"s3_reproducibility/docker/#docker-sharing","title":"Docker Sharing","text":"The whole point of using Docker is that sharing applications becomes much easier. In general, we have two options:
After creating the Dockerfile
, we can simply commit it to GitHub (it's just a text file) and then ask other users to simply build the image themselves.
After building the image ourselves, we can choose to upload it to an image registry such as Docker Hub, where others can get our image by simply running docker pull
, making them able to instantaneously run it as a container, as shown in the figure below:
In the following exercises, we guide you on how to build a docker file for your MNIST repository that will make the training and prediction a self-contained application. Please make sure that you somewhat understand each step and do not just copy the exercise. Also, note that you probably need to execute the exercise from an elevated terminal e.g. with administrative privilege.
The exercises today are only an introduction to docker and some of the steps are going to be unoptimized from a production setting view. For example, we often want to keep the size of the docker image as small as possible, which we are not focusing on for these exercises.
If you are using VScode
then we recommend installing the VScode docker extension for easy getting an overview of which images have been building and which are running. Additionally, the extension named Dev Containers may also be beneficial for you to download.
Start by installing docker. How much trouble you need to go through depends on your operating system. For Windows and Mac, we recommend they install Docker Desktop, which comes with a graphical user interface (GUI) for quickly viewing docker images and docker containers currently built/in use. Windows users that have not installed WSL yet are going to have to do it now (as docker needs it as a backend for starting virtual machines) but you do not need to install docker in WSL. After installing docker we recommend that you restart your laptop.
Try running the following to confirm that your installation is working:
docker run hello-world\n
which should give the message
Hello from Docker!\nThis message shows that your installation appears to be working correctly.\n
Next, let's try to download an image from Docker Hub. Download the busybox
image:
docker pull busybox\n
which is a very small (1-5Mb) containerized application that contains the most essential GNU file utilities, shell utilities, etc.
After pulling the image, write
docker images\n
which should show you all available images. You should see the busybox
image that we just downloaded.
Let's try to run this image
docker run busybox\n
You will see that nothing happens! The reason for that is we did not provide any commands to docker run
. We essentially just ask it to start the busybox
virtual machine, do nothing, and then close it again. Now, try again, this time with
docker run busybox echo \"hello from busybox\"\n
Note how fast this process is. In just a few seconds, Docker is able to start a virtual machine, execute a command, and kill it afterward.
Try running
docker ps\n
What does this command do? What if you add -a
to the end?
If we want to run multiple commands within the virtual machine, we can start it in interactive mode
docker run -it busybox\n
This can be a great way to investigate what the filesystem of our virtual machine looks like.
As you may have already noticed by now, each time we execute docker run
, we can still see small remnants of the containers using docker ps -a
. These stray containers can end up taking up a lot of disk space. To remove them, use docker rm
where you provide the container ID that you want to delete
docker rm <container_id>\n
Let's now move on to trying to construct a Dockerfile ourselves for our MNIST project. Create a file called trainer.dockerfile
. The intention is that we want to develop one Dockerfile for running our training script and one for doing predictions.
Instead of starting from scratch, we nearly always want to start from some base image. For this exercise, we are going to start from a simple python
image. Add the following to your Dockerfile
# Base image\nFROM python:3.9-slim\n
Next, we are going to install some essentials in our image. The essentials more or less consist of a Python installation. These instructions may seem familiar if you are using Linux:
# Install Python\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n
The previous two steps are common for any Docker application where you want to run Python. All the remaining steps are application-specific (to some degree):
Let's copy over our application (the essential parts) from our computer to the container:
COPY requirements.txt requirements.txt\nCOPY pyproject.toml pyproject.toml\nCOPY <project-name>/ <project-name>/\nCOPY data/ data/\n
Remember that we only want the essential parts to keep our Docker image as small as possible. Why do we need each of these files/folders to run training in our Docker container?
Let's set the working directory in our container and add commands that install the dependencies (1):
We split the installation into two steps so that Docker can cache our project dependencies separately from our application code. This means that if we change our application code, we do not need to reinstall all the dependencies. This is a common strategy for Docker images.
As an alternative, you can use RUN make requirements
if you have a Makefile
that installs the dependencies. Just remember to also copy over the Makefile
into the Docker image.
WORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\nRUN pip install . --no-deps --no-cache-dir\n
The --no-cache-dir
is quite important. Can you explain what it does and why it is important in relation to Docker?
Finally, we are going to name our training script as the entrypoint for our Docker image. The entrypoint is the application that we want to run when the image is being executed:
ENTRYPOINT [\"python\", \"-u\", \"<project_name>/train_model.py\"]\n
The \"u\"
here makes sure that any output from our script, e.g., any print(...)
statements, gets redirected to our terminal. If not included, you would need to use docker logs
to inspect your run.
We are now ready to build our Dockerfile into a Docker image.
docker build -f trainer.dockerfile . -t trainer:latest\n
MAC M1/M2 users In general, Docker images are built for a specific platform. For example, if you are using a Mac with an M1/M2 chip, then you are running on an ARM architecture. If you are using a Windows or Linux machine, then you are running on an AMD64 architecture. This is important to know when building Docker images. Thus, Docker images you build may not work on other platforms than the ones you build on. You can specify which platform you want to build for by adding the --platform
argument to the docker build
command:
docker build --platform linux/amd64 -f trainer.dockerfile . -t trainer:latest\n
and also when running the image:
docker run --platform linux/amd64 trainer:latest\n
Note that this will significantly increase the build and run time of your Docker image when running locally, because Docker will need to emulate the other platform. In general, for the exercises today, you should not need to specify the platform, but be aware of this if you are building Docker images on your own.
Please note that here we are providing two extra arguments to docker build
. The -f trainer.dockerfile .
(the dot is important to remember) indicates which Dockerfile we want to run (except if you named it just Dockerfile
) and the -t trainer:latest
is the respective name and tag that we see afterward when running docker images
(see image below). Please note that building a Docker image can take a couple of minutes.
Docker images and space
Docker images can take up a lot of space on your computer, especially the Docker images we are trying to build because PyTorch is a huge dependency. If you are running low on space, you can try to
docker system prune\n
Alternatively, you can manually delete images using docker rmi {image_name}:{image_tag}
.
Try running docker images
and confirm that you get output similar to the one above. If you succeed with this, then try running the docker image
docker run --name experiment1 trainer:latest\n
you should hopefully see your training starting. Please note that we can start as many containers as we want at the same time by giving them all different names using the --name
tag.
You are most likely going to rebuild your Docker image multiple times, either due to an implementation error or the addition of new functionality. Therefore, instead of watching pip suffer through downloading torch
for the 20th time, you can reuse the cache from the last time the Docker image was built. To do this, replace the line in your Dockerfile that installs your requirements with:
RUN --mount=type=cache,target=~/pip/.cache pip install -r requirements.txt --no-cache-dir\n
which mounts your local pip cache to the Docker image. For building the image, you need to have enabled the BuildKit feature. If you have Docker version v23.0 or later (you can check this by running docker version
), then this is enabled by default. Otherwise, you need to enable it by setting the environment variable DOCKER_BUILDKIT=1
before building the image.
Try changing your Dockerfile and rebuilding the image. You should see that the build process is much faster.
Remember, if you ever are in doubt about how files are organized inside a Docker image, you always have the option to start the image in interactive mode:
docker run -it --entrypoint sh {image_name}:{image_tag}\n
When your training has completed you will notice that any files that are created when running your training script are not present on your laptop (for example if your script is saving the trained model to a file). This is because the files were created inside your container (which is a separate little machine). To get the files you have two options:
If you already have a completed run then you can use
docker cp\n
to copy the files between your container and laptop. For example to copy a file called trained_model.pt
from a folder you would do:
docker cp {container_name}:{dir_path}/{file_name} {local_dir_path}/{local_file_name}\n
Try this out.
A much more efficient strategy is to mount a volume that is shared between the host (your laptop) and the container. This can be done with the -v
option for the docker run
command. For example, if we want to automatically get the trained_model.pt
file after running our training script we could simply execute the container as
docker run --name {container_name} -v %cd%/models:/models/ trainer:latest\n
this command mounts our local models
folder as a corresponding models
folder in the container. Any file saved by the container to this folder will be synchronized back to our host machine. Try this out! Note that if you have multiple files/folders that you want to mount, you simply add multiple -v flags (if in doubt about the file organization in the container, try to do the next exercise first). Also note that the %cd%
needs to change depending on your OS, see this page for help.
With training done we also need to write an application for prediction. Create a new Dockerfile called predict.dockerfile
. This file should call your <project_name>/models/predict_model.py
script instead. This image will need some trained model weights to work. Feel free to either include these during the build process or mount them afterwards. When you create the file try to build
and run
it to confirm that it works. Hint: if you are passing in the model checkpoint and prediction data as arguments to your script, your docker run
probably needs to look something like
docker run --name predict --rm \\\n -v %cd%/trained_model.pt:/models/trained_model.pt \\ # mount trained model file\n -v %cd%/data/example_images.npy:/example_images.npy \\ # mount data we want to predict on\n predict:latest \\\n ../../models/trained_model.pt \\ # argument to script, path relative to script location in container\n ../../example_images.npy\n
(Optional, requires GPU support) By default, a virtual machine created by docker only has access to your cpu
and not your gpu
. While you do not necessarily have a laptop with a GPU that supports the training of neural networks (e.g. one from Nvidia) it is beneficial that you understand how to construct a docker image that can take advantage of a GPU if you were to run this on a machine in the future that has a GPU (e.g. in the cloud). It does take a bit more work, but many of the steps will be similar to building a normal docker image.
There are three prerequisites for working with Nvidia GPU-accelerated docker containers. First, you need to have the Docker Engine installed (already taken care of), have an Nvidia GPU with updated GPU drivers and finally have the Nvidia container toolkit installed. The last part you most likely have not installed yet and will need to do. Some distros of Linux have known problems with the installation process, so you may have to search through the known issues in the nvidia-docker repository to find a solution
To test that everything is working start by pulling a relevant Nvidia docker image. In my case this is the correct image:
docker pull nvidia/cuda:11.0.3-base-ubuntu20.04\n
but it may differ based on what CUDA version you have. You can find all the different official Nvidia images here. After pulling the image, try running the nvidia-smi
command inside a container based on the image you just pulled. It should look something like this:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi\n
and should show an image like below:
If it does not work, try redoing the steps.
We should hopefully have a working setup now for running Nvidia accelerated docker containers. The next step is to get PyTorch inside our container, such that our PyTorch implementation also correctly identifies the GPU. Luckily for us, Nvidia provides a set of docker images for GPU-optimized software for AI, HPC and visualizations through their NGC Catalog. The containers that have to do with PyTorch can be seen here. Try pulling the latest:
docker pull nvcr.io/nvidia/pytorch:22.07-py3\n
It may take some time because the NGC images include a lot of other software for optimizing PyTorch applications. It may be possible for you to find other images for running GPU-accelerated applications that have a smaller memory footprint, but NGC is the recommended and supported way.
Let's test that this container works:
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.07-py3\n
this should run the container in interactive mode attached to your current terminal. Try opening python
in the container and try writing:
import torch\nprint(torch.cuda.is_available())\n
which hopefully should return True
.
Finally, we need to incorporate all this into our already developed docker files for our application. This is also fairly easy as we just need to change our FROM
statement at the beginning of our docker file:
FROM python:3.7-slim\n
change to
FROM nvcr.io/nvidia/pytorch:22.07-py3\n
try doing this to one of your docker files, build the image and run the container. Remember to check that your application is using GPU by printing torch.cuda.is_available()
.
(Optional) Another way you can use Dockerfiles in your day-to-day work is for Dev-containers. Developer containers allow you to develop code directly inside a container, making sure that your code is running in the same environment as it will when deployed. This is especially useful if you are working on a project that has a lot of dependencies that are hard to install on your local machine. Setup instructions for VS Code and PyCharm can be found here (should be simple since we have already installed Docker):
We will focus on the VS Code setup here.
First, install the Remote - Containers extension.
Create a .devcontainer
folder in your project root and create a Dockerfile
inside it. We will keep this file very barebones for now, so let's just define a base installation of Python:
FROM python:3.11-slim-buster\n\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n
Create a devcontainer.json
file in the .devcontainer
folder. This file should look something like this:
{\n \"name\": \"my_working_env\",\n \"dockerFile\": \"Dockerfile\",\n \"postCreateCommand\": \"pip install -r requirements.txt\"\n}\n
This file tells VS Code that we want to use the Dockerfile
that we just created and that we want to install our Python dependencies after the container has been created.
After creating these files, you should be able to open the command palette in VS Code (F1) and search for the option Remote-Containers: Reopen in Container
or Remote-Containers: Rebuild and Reopen in Container
. Choose either of these options.
This will start a new VS Code instance inside a Docker container. You should be able to see this in the bottom left corner of your VS Code window. You should also be able to see that the Python interpreter has changed to the one inside the container.
You are now ready to start developing inside the container. Try opening a terminal and run python
and import torch
to confirm that everything is working.
(Optional) In M8 on Data version control you learned about the framework dvc
for version controlling data. A neutral question at this point would then be how to incorporate dvc
into our docker image. We need to do two things:
dvc
has all the correct files to pull data from our remote storagedvc
has the correct credentials to pull data from our remote storageWe are going to assume that dvc
(and any dvc
extension needed) is part of your requirements.txt
file and that it is already being installed in a RUN pip install -r requirements.txt
command in your Dockerfile. If not, then you need to add it.
Add the following lines to your Dockerfile
RUN dvc init --no-scm\nCOPY .dvc/config .dvc/config\nCOPY *.dvc .dvc/\nRUN dvc config core.no_scm true\nRUN dvc pull\n
The first line initializes dvc
in the Docker image. The --no-scm
option is needed because normally dvc
can only be initialized inside a git repository, but this option allows initializing dvc
without being in one. The second and third lines copy over the dvc
config file and the dvc
metadata files that are needed to pull data from your remote storage. The last line pulls the data.
If your data is not public, we need to provide credentials in some way to pull the data. We are for now going to do it in a not-so-secure way. When dvc
first connected to your drive, a credential file was created. This file is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
~/Library/Caches
~/.cache
This is the typical location, but it may vary depending on what distro you are running.
{user}/AppData/Local
Find the file. The content should look similar to this (only some fields are shown):
{\n \"access_token\": ...,\n \"client_id\": ...,\n \"client_secret\": ...,\n \"refresh_token\": ...,\n ...\n}\n
We are going to copy the file into our Docker image. This, of course, is not a secure way of doing it, but it is the easiest way to get started. As long as you are not sharing your Docker image with anyone else, then it is fine. Add the following lines to your Dockerfile before the RUN dvc pull
command:
COPY <path_to_default.json> default.json\nRUN dvc remote modify myremote --local gdrive_service_account_json_file_path default.json\n
where <path_to_default.json>
is the path to the default.json
file that you just found. The last line tells dvc
to use the default.json
file as the credentials for pulling data from your remote storage. You can confirm that this works by running dvc pull
in your Docker image.
What is the difference between a docker image and a docker container?
Solution A Docker image is a template for a Docker container. A Docker container is a running instance of a Docker image. A Docker image is a static file, while a Docker container is a running process.
What are the 3 steps involved in containerizing an application?
Solution The three steps are: write a Dockerfile, build the Dockerfile into a Docker image, and run the image to create a Docker container. What advantage is there to running your application inside a Docker container instead of running the application directly on your machine?
Solution Running inside a Docker container gives you a consistent and independent environment for your application. This means that you can be sure that your application will run the same way on your machine as it will on another machine. Thus, Docker gives the ability to abstract away the differences between different machines.
A Docker container is built from a series of layers that are stacked on top of each other. This should be clear if you look at the output when building a Docker image. What is the advantage of this?
Solution The advantage is efficiency and reusability. When a change is made to a Docker image, only the layer(s) that are changed need to be updated. For example, if you update the application code in your Docker image, which usually is the last layer, then only that layer needs to be rebuilt, making the process much faster. Additionally, if you have multiple Docker images that share the same base image, then the base image only needs to be downloaded once.
This covers the absolute minimum you should know about Docker to get a working image and container. If you want to really deep dive into this topic, you can find a copy of the Docker Cookbook by S\u00e9bastien Goasguen in the literature folder.
If you are actively going to be using Docker in the future, one thing to consider is the image size. Even these simple images that we have built still take up GB in size. Several optimization steps can be taken to reduce the image size for you or your end user. If you have time, you can read this article on different approaches to reducing image size. Additionally, you can take a look at the dive-in extension for Docker Desktop that lets you explore in depth your Docker images.
"},{"location":"s4_debugging_and_logging/","title":"Debugging, Profiling, Logging and Boilerplate","text":"Slides
Learn how to use the debugger in your editor to find bugs in your code.
M12: Debugging
Learn how to use a profiler to identify bottlenecks in your code and from those profiles optimize the runtime of your programs.
M13: Profiling
Learn how to systematically log experiments and hyperparameters to make your code reproducible.
M14: Logging
Learn how to use pytorch-lightning
framework to minimize boilerplate code and structure deep learning models.
M15: Boilerplate
Today we are initially going to go over three different topics that are all fundamentally necessary for any data scientist or DevOps engineer:
All three topics can be characterized by something you probably already are familiar with. Since you started programming, you have done debugging as nobody can write perfect code on the first try. Similarly, while you have not directly profiled your code, I bet that you at some point have had some very slow code and optimized it to run faster. Identifying and improving are the fundamentals of profiling code. Finally, logging is a very broad term and refers to any kind of output from your applications that helps you at a later point identify the \"performance\" of you application.
However, while we expect you to already be familiar with these topics, we do not expect all of you to be experts as it is very rare that these topics are focused on. Today we are going to introduce some best practices and tools to help you overcome every one of these three important topics. As the final topic for today, we are going to learn about how we can minimize boilerplate and focus on coding what matters for our project instead of all the boilerplate to get it working.
Learning objectives
The learning objectives of this session are:
pytorch-lightning
framework to minimize boilerplate code and structure deep learning modelsBoilerplate is a general term that describes any standardized text, copy, documents, methods, or procedures that may be used over again without making major changes to the original. But how does this relate to doing machine learning projects? If you have already tried doing a couple of projects within machine learning you will probably have seen a pattern: every project usually consist of these three aspects of code:
While the latter two certainly seems important, in most cases the actual development or research often revolves around defining the model. In this sense, both the training code and the utilities becomes boilerplate that should just carry over from one project to another. But the problem usually is that we have not generalized our training code to take care of the small adjusted that may be required in future projects and we therefore end up implementing it over and over again every time that we start a new project. This is of course a waste of our time that we should try to find a solution to.
This is where high-level frameworks comes into play. High-level frameworks are build on top of another framework (PyTorch in this case) and tries to abstract/standardize how to do particular tasks such as training. At first it may seem irritating that you need to comply to someone else code structure, however there is a good reason for that. The idea is that you can focus on what really matters (your task, model architecture etc.) and do not have to worry about the actual boilerplate that comes with it.
The most popular high-level (training) frameworks within the PyTorch
ecosystem are:
They all offer many of the same features, so choosing one over the other for most projects should not matter that much. We are here going to use PyTorch Lightning
, as it offers all the functionality that we are going to need later in the course.
In general we refer to the documentation from PyTorch lightning if in doubt about how to format your code for doing specific tasks. We are here going to explain the key concepts of the API that you need to understand to use the framework, starting with the LightningModule
and the Trainer
.
The LightningModule
is a subclass of a standard nn.Module
that basically adds additional structure. In addition to the standard __init__
and forward
methods that need to be implemented in a nn.Module
, a LightningModule
further requires two more methods implemented:
training_step
: should contain your actual training code e.g. given a batch of data this should return the loss that you want to optimize
configure_optimizers
: should return the optimizer that you want to use
Below is shown these two methods added to standard MNIST classifier
Compared to a standard nn.Module
, the additional methods in the LightningModule
basically specifies exactly how you want to optimize your model.
The second component to lightning is the Trainer
object. As the name suggest, the Trainer
object takes care of the actual training, automizing everything that you do not want to worry about.
from pytorch_lightning import Trainer\nmodel = MyAwesomeModel() # this is our LightningModule\ntrainer = Trainer()\ntraier.fit(model)\n
That's is essentially all that you need to specify in lightning to have a working model. The trainer object does not have methods that you need to implement yourself, but it have a bunch of arguments that can be used to control how many epochs that you want to train, if you want to run on gpu etc. To get the training of our model to work we just need to specify how our data should be feed into the lighning framework.
"},{"location":"s4_debugging_and_logging/boilerplate/#data","title":"Data","text":"For organizing our code that has to do with data in Lightning
we essentially have three different options. However, all three assume that we are using torch.utils.data.DataLoader
for the dataloading.
If we already have a train_dataloader
and possible also a val_dataloader
and test_dataloader
defined we can simply add them to our LightningModule
using the similar named methods:
def train_dataloader(self):\n return DataLoader(...)\n\ndef val_dataloader(self):\n return DataLoader(...)\n\ndef test_dataloader(self):\n return DataLoader(...)\n
Maybe even simpler, we can directly feed such dataloaders in the fit
method of the Trainer
object:
trainer.fit(model, train_dataloader, val_dataloader)\ntrainer.test(model, test_dataloader)\n
Finally, Lightning
also have the LightningDataModule
that organizes data loading into a single structure, see this page for more info. Putting data loading into a DataModule
makes sense, as it can then be reused between projects.
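A hedged sketch of what a minimal LightningDataModule could look like (the random tensors below are just stand-ins for a real dataset):
import pytorch_lightning as pl\nimport torch\nfrom torch.utils.data import DataLoader, TensorDataset\n\n\nclass MyDataModule(pl.LightningDataModule):\n    def __init__(self, batch_size: int = 64) -> None:\n        super().__init__()\n        self.batch_size = batch_size\n\n    def setup(self, stage=None):\n        # replace with your actual data loading and splitting logic\n        self.train_set = TensorDataset(torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,)))\n        self.val_set = TensorDataset(torch.randn(20, 1, 28, 28), torch.randint(0, 10, (20,)))\n\n    def train_dataloader(self):\n        return DataLoader(self.train_set, batch_size=self.batch_size)\n\n    def val_dataloader(self):\n        return DataLoader(self.val_set, batch_size=self.batch_size)\n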
Callbacks are one way to add additional functionality to your model that, strictly speaking, is not already part of your model. Callbacks should therefore be seen as self-contained features that can be reused between projects. You have the option to implement callbacks yourself (by inheriting from the pytorch_lightning.callbacks.Callback
base class) or use one of the built-in callbacks. Of particular interest are the ModelCheckpoint
and EarlyStopping
callbacks:
The ModelCheckpoint
makes sure to save checkpoints of your model. This is in principle not hard to do yourself, but the ModelCheckpoint
callback offers additional functionality, such as saving checkpoints only when some metric improves, or only keeping the best K
performing models etc.
model = MyModel()\ncheckpoint_callback = ModelCheckpoint(\n dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[checkpoint_callback])\ntrainer.fit(model)\n
The EarlyStopping
callback can help you prevent overfitting by automatically stopping the training if a certain value is not improving anymore:
model = MyModel()\nearly_stopping_callback = EarlyStopping(\n monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback])\ntrainer.fit(model)\n
Multiple callbacks can be used by passing them all in a list e.g.
trainer = Trainer(callbacks=[checkpoint_callback, early_stopping_callback])\n
"},{"location":"s4_debugging_and_logging/boilerplate/#exercises","title":"\u2754 Exercises","text":"Please note that the in following exercise we will basically ask you to reformat all your MNIST code to follow the lightning standard, such that we can take advantage of all the tricks the framework has to offer. The reason we did not implement our model in lightning
to begin with, is that to truly understand why it is beneficially to use a high-level framework to do some of the heavy lifting you need to have gone through some of implementation troubles yourself.
Install pytorch lightning:
pip install pytorch-lightning # (1)!\n
pip install lightning
which includes more than just the PyTorch Lightning
package. This also includes Lightning Fabric
and Lightning Apps
which you can read more about here and here.Convert your corrupted MNIST model into a LightningModule
. You can either choose to completely override your old model or implement it in a new file. The bare minimum that you need to add while converting to get it working with the rest of lightning:
The training_step
method. This function should contain essentially what goes into a single training step and should return the loss at the end
The configure_optimizers
method
Please read the documentation for more info.
Solution lightning.pyimport pytorch_lightning as pl\nimport torch\nfrom torch import nn\n\n\nclass MyAwesomeModel(pl.LightningModule):\n \"\"\"My awesome model.\"\"\"\n\n def __init__(self) -> None:\n super().__init__()\n self.conv1 = nn.Conv2d(1, 32, 3, 1)\n self.conv2 = nn.Conv2d(32, 64, 3, 1)\n self.conv3 = nn.Conv2d(64, 128, 3, 1)\n self.dropout = nn.Dropout(0.5)\n self.fc1 = nn.Linear(128, 10)\n\n self.loss_fn = nn.CrossEntropyLoss()\n\n def forward(self, x: torch.Tensor) -> torch.Tensor:\n \"\"\"Forward pass.\"\"\"\n x = torch.relu(self.conv1(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.relu(self.conv2(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.relu(self.conv3(x))\n x = torch.max_pool2d(x, 2, 2)\n x = torch.flatten(x, 1)\n x = self.dropout(x)\n return self.fc1(x)\n\n def training_step(self, batch):\n \"\"\"Training step.\"\"\"\n img, target = batch\n y_pred = self(img)\n return self.loss_fn(y_pred, target)\n\n def configure_optimizers(self):\n \"\"\"Configure optimizer.\"\"\"\n return torch.optim.Adam(self.parameters(), lr=1e-3)\n\n\nif __name__ == \"__main__\":\n model = MyAwesomeModel()\n print(f\"Model architecture: {model}\")\n print(f\"Number of parameters: {sum(p.numel() for p in model.parameters())}\")\n\n dummy_input = torch.randn(1, 1, 28, 28)\n output = model(dummy_input)\n print(f\"Output shape: {output.shape}\")\n
Make sure your data is formatted such that it can be loaded using the torch.utils.data.DataLoader
object.
Instantiate a Trainer
object. It is recommended to take a look at the trainer arguments (there are many of them) and maybe adjust some of them:
Investigate what the default_root_dir
flag does
By default Lightning will run for 1000 epochs. This may be too much (for now). Change this by setting the appropriate flag. Additionally, there also exists a flag to set the maximum number of steps that we should train for.
SolutionSetting the max_epochs
will accomplish this.
trainer = Trainer(max_epochs=10)\n
Additionally, you may consider instead setting the max_steps
flag to limit based on the number of steps or max_time
to limit based on time. Similarly, the flags min_epochs
, min_steps
and min_time
can be used to set the minimum number of epochs, steps or time.
To start with we also want to limit the amount of training data to 20% of its original size. Which trainer flag do you need to set for this to work?
SolutionSetting the limit_train_batches
flag will accomplish this.
trainer = Trainer(limit_train_batches=0.2)\n
Similarly, you can also set the limit_val_batches
and limit_test_batches
flags to limit the validation and test data.
Try fitting your model: trainer.fit(model)
Now try adding some callbacks
to your trainer.
early_stopping_callback = EarlyStopping(\n monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ncheckpoint_callback = ModelCheckpoint(\n dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback, checkpoint_callback])\n
The previous module was all about logging in wandb
, so the natural question is how lightning
supports this. Lightning does not only support wandb
, but also many others. Common to all of them is that logging just needs to happen through the self.log
method in your LightningModule
:
Add self.log
to your LightningModule. It should look something like this:
def training_step(self, batch, batch_idx):\n data, target = batch\n preds = self(data)\n loss = self.criterion(preds, target)\n acc = (target == preds.argmax(dim=-1)).float().mean()\n self.log('train_loss', loss)\n self.log('train_acc', acc)\n return loss\n
Add the wandb
logger to your trainer
trainer = Trainer(logger=pl.loggers.WandbLogger(project=\"dtu_mlops\"))\n
and try to train the model. Confirm that you are seeing the scalars appearing in your wandb
portal.
self.log
sadly only supports logging scalar tensors. Luckily, for logging other quantities we can still access the standard wandb.log
through our model
def training_step(self, batch, batch_idx):\n ...\n # self.logger.experiment is the same as wandb.log\n self.logger.experiment.log({'logits': wandb.Histogram(preds)})\n
Try doing this by logging something other than scalar tensors.
Finally, we maybe also want to do some validation or testing. In lightning we just need to add the validation_step
and test_step
to our lightning module and supply the respective data in form of a separate dataloader. Try to at least implement one of them.
Both validation and test steps can be implemented in the same way as the training step:
def validation_step(self, batch) -> None:\n data, target = batch\n preds = self(data)\n loss = self.criterion(preds, target)\n acc = (target == preds.argmax(dim=-1)).float().mean()\n self.log('val_loss', loss, on_epoch=True)\n self.log('val_acc', acc, on_epoch=True)\n
Two things to take note of here: we are setting the on_epoch
flag to True
in the self.log
method. This is because we want to log the validation loss and accuracy only once per epoch. Additionally, we are not returning anything from the validation_step
method, because we do not optimize over the loss.
(Optional, requires GPU) One of the big advantages of using lightning
is that you no longer need to deal with device placement, e.g. calling .to('cuda')
everywhere. If you have a GPU, try to set the gpus
flag in the trainer. If you do not have one, do not worry, we are going to return to this when we are going to run training in the cloud.
The two arguments accelerator
and devices
can be used to specify which devices to run on and how many to run on. For example, to run on a single GPU you can do
trainer = Trainer(accelerator=\"gpu\", devices=1)\n
as an alternative the accelerator can just be set to accelerator=\"auto\"
to automatically detect the best available device.
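A minimal sketch of the fully automatic variant (assuming a reasonably recent Lightning version where devices also accepts the value auto):
trainer = Trainer(accelerator=\"auto\", devices=\"auto\")\n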
(Optional) By default PyTorch uses float32
for representing floating point numbers. However, research has shown that neural network training is quite robust towards a decrease in precision. The great benefit of going from float32
to float16
is that we get approximately half the memory consumption. Try out half-precision training in PyTorch Lightning. You can enable this by setting the precision flag in the Trainer
.
Lightning supports two types of mixed precision training (16-bit and bfloat16, where the model weights stay in float32) and two types of true half precision training (where the weights themselves are cast to 16-bit):
# 16-bit mixed precision (model weights remain in torch.float32)\ntrainer = Trainer(precision=\"16-mixed\", devices=1)\n\n# 16-bit bfloat mixed precision (model weights remain in torch.float32)\ntrainer = Trainer(precision=\"bf16-mixed\", devices=1)\n\n# 16-bit precision (model weights get cast to torch.float16)\ntrainer = Trainer(precision=\"16-true\", devices=1)\n\n# 16-bit bfloat precision (model weights get cast to torch.bfloat16)\ntrainer = Trainer(precision=\"bf16-true\", devices=1)\n
(Optional) Lightning also has built-in support for profiling. Check out how to do this using the profiler argument in the Trainer
object.
(Optional) Another great feature of Lightning is that it allows for easily defining command line interfaces through the Lightning CLI feature. The Lightning CLI is essentially a drop-in replacement for defining command line interfaces (covered in this module) and can also replace the need for config files (covered in this module) for ensuring reproducibility when working inside the Lightning framework. We highly recommend checking out the feature and that you try to refactor your code such that you do not need to call trainer.fit
anymore but it is instead directly controlled from the Lightning CLI.
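As a rough sketch of what such a refactor could look like (assuming a recent PyTorch Lightning version where the class is importable from pytorch_lightning.cli, and that the import path for your MyAwesomeModel is hypothetical):
# train.py - a minimal Lightning CLI entry point (sketch)\nfrom pytorch_lightning.cli import LightningCLI\n\nfrom my_project.lightning import MyAwesomeModel  # hypothetical import path for your LightningModule\n\n\ndef cli_main():\n    # the CLI builds the Trainer and calls fit/validate/test based on the chosen subcommand\n    LightningCLI(MyAwesomeModel)\n\n\nif __name__ == \"__main__\":\n    cli_main()\n
It could then be invoked with something like python train.py fit --trainer.max_epochs=10, where trainer and model arguments become command line flags.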
Free exercise: Experiment with what the lightning framework is capable of. Either try out more of the trainer flags, some of the other callbacks, or maybe look into some of the other methods that can be implemented in your lightning module. Only your imagination is the limit!
That covers everything for today. It has been a mix of topics that all should help you write \"better\" code (by some objective measure). If you want to dive deeper into the PyTorch Lightning framework, we highly recommend looking at the different tutorials in the documentation that cover more advanced models and training cases. Additionally, we also want to highlight other frameworks in the Lightning ecosystem:
Debugging is very hard to teach and is one of the skills that just comes with experience. That said, there are good and bad ways to debug a program. We are all probably familiar with just inserting print(...)
statements everywhere in our code. It is easy and can often help narrow down where the problem happens. However, this is not a great way of debugging when dealing with a very large codebase. You should therefore familiarize yourself with the built-in Python debugger as it may come in handy during the course.
To invoke the built-in Python debugger you can either:
Set a trace directly with the Python debugger by calling
import pdb\npdb.set_trace()\n
anywhere you want to stop the code. Then you can use different commands (see the python_debugger_cheatsheet.pdf
) to step through the code.
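Note that since Python 3.7 the same can be achieved with the built-in breakpoint() function, which drops into pdb by default:
breakpoint()  # equivalent to: import pdb; pdb.set_trace()\n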
If you are using an editor, then you can insert inline breakpoints (in VS code this can be done by pressing F9
) and then execute the script in debug mode (inline breakpoints can often be seen as small red dots to the left of your code). The editor should then offer some interface that allows you to step through your code. Here is a guide to using the built-in debugger in VS Code.
Additionally, if your program is stopping on an error and you automatically want to start the debugger where it happens, then you can simply launch the program like this from the terminal
python -m pdb -c continue my_script.py\n
Exercise files
We here provide a script vae_mnist_bugs.py
which contains a number of bugs. Start by going over the script and try to understand what is going on. Afterwards, try to get it running by solving the bugs. The following bugs exist in the script:
Some of the bugs prevent the script from even running, while some of them influence the training dynamics. Try to find them all. We also provide a working version called vae_mnist_working.py
(but please try to find the bugs before looking at the script). Successfully debugging and running the script should produce three files:
orig_data.png
containing images from the standard MNIST training setreconstructions.png
reconstructions from the modelgenerated_samples.png
samples from the modelAgain, we cannot stress enough that the exercise is actually not about finding the bugs but using a proper debugger to find them.
"},{"location":"s4_debugging_and_logging/logging/","title":"M14 - Logging","text":""},{"location":"s4_debugging_and_logging/logging/#logging","title":"Logging","text":"Core Module
Logging in general refers to the practice of recording events and activities over time. Having proper logging in your applications can be extremely beneficial for a few reasons:
Debugging becomes easier because we can output information about the state of our program, variables, values etc. in a more structured way, helping us identify and fix bugs or unexpected behavior.
When we move into a more production environment, proper logging is essential for monitoring the health and performance of our application.
It can help with auditing, as logging info about specific activities etc. helps keep a record of who did what and when.
Having proper logging means that information is saved for later analysis, which can be used to gain insight into the behavior of our application, such as trends.
We are in this course going to divide the kind of logging we can do into two categories: application logging and experiment logging. In general, application logging is important regardless of the kind of application you are developing, whereas experiment logging is important for machine learning-based projects where we are running experiments.
"},{"location":"s4_debugging_and_logging/logging/#application-logging","title":"Application logging","text":"The most basic form of logging in Python applications is the good old print
statement:
for batch_idx, batch in enumerate(dataloader):\n print(f\"Processing batch {batch_idx} out of {len(dataloader)}\")\n ...\n
This will keep a \"record\" of the events happening in our script, in this case how far we have progressed. We could even change the print to include something like batch.shape
to also have information about the current data being processed.
Using print
statements is fine for small applications, but to have proper logging we need a bit more functionality than what print
can offer. Python actually comes with a great logging module, that defines functions for flexible logging. It is exactly this we are going to look at in this module.
The four main components to the Python logging module are:
Logger: The main entry point for using the logging system. You create instances of the Logger class to emit log messages.
Handler: Defines where the log messages go. Handlers send the log messages to specific destinations, such as the console or a file.
Formatter: Specifies the layout of the log messages. Formatters determine the structure of the log records, including details like timestamps and log message content.
Level: Specifies the severity of a log message.
The last point is especially important to understand. Levels essentially allow us to get rid of statements like this:
if debug:\n print(x.shape)\n
where the logging is conditional on the variable debug
which we can set at runtime. Thus, it is something we can disable for users of our application (debug=False
) but have enabled when we develop the application (debug=True
). And it makes sense that not everything logged should be available to all stakeholders of a codebase. We as developers probably always want the highest level of logging, whereas users of our code need less info, and we may want to differentiate this based on the user.
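As a small sketch, the conditional print above could instead become a debug-level log statement (x here is just the tensor from the snippet above), and its visibility is then controlled by the configured logging level rather than a debug variable:
import logging\n\nlogger = logging.getLogger(__name__)\nlogger.debug(\"x has shape %s\", x.shape)  # only emitted if the effective level is DEBUG\n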
It is also important to understand the difference between logging and error handling. Error handling in Python is done using raise
statements and try/except blocks
like:
def f(x: int):\n if not isinstance(x, int):\n raise ValueError(\"Expected an integer\")\n return 2 * x\n\ntry:\n f(5)\nexcept ValueError:\n print(\"I failed to do a thing, but continuing.\")\n
Why would we ever need to log warning
, error
, critical
levels of information, if we are just going to handle the error anyway? The reason is that raising exceptions is meant to change the program flow at runtime, e.g. for things we do not want the user to do but can deal with in some way. Logging is for after a program has run, to inspect what went wrong. Sometimes you need one, sometimes the other, sometimes both.
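As a small sketch of combining the two, logger.exception inside an except block records the error together with its traceback while the program continues (f is the function from the example above, and passing a float triggers its ValueError):
import logging\n\nlogger = logging.getLogger(__name__)\n\ntry:\n    f(5.0)  # not an integer, so f raises ValueError\nexcept ValueError:\n    logger.exception(\"f failed, continuing with a fallback value.\")  # logs at ERROR level with the traceback\n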
Exercises are inspired by this made with ml module on the same topic. If you need help for the exercises you can find a simple solution script here.
As logging is a built-in module in Python, nothing needs to be installed. Instead start a new file called my_logger.py
and start out with the following code:
import logging\nimport sys\n\n# Create super basic logger\nlogging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\nlogger = logging.getLogger(__name__) # (1)\n\n# Logging levels (from lowest to highest priority)\nlogger.debug(\"Used for debugging your code.\")\nlogger.info(\"Informative messages from your code.\")\nlogger.warning(\"Everything works but there is something to be aware of.\")\nlogger.error(\"There's been a mistake with the process.\")\nlogger.critical(\"There is something terribly wrong and process may terminate.\")\n
__name__
always contains the name of the script or module that is currently being run. Therefore if we initialize our logger using this variable, it will always be unique to our application and not conflict with loggers set up by any third-party package.Try running the code. Then try changing the argument level
when creating the logger. What happens when you do that?
Instead of sending logs to the terminal, we may also want to send them to a file. This can be beneficial, such that only warning
level logs and higher are available to the user, but debug
and info
are still saved while the application is running.
Try adding the following dict to your logger.py
file:
logging_config = {\n \"version\": 1,\n \"formatters\": { # (1)\n \"minimal\": {\"format\": \"%(message)s\"},\n \"detailed\": {\n \"format\": \"%(levelname)s %(asctime)s [%(name)s:%(filename)s:%(funcName)s:%(lineno)d]\\n%(message)s\\n\"\n },\n },\n \"handlers\": { # (2)\n \"console\": {\n \"class\": \"logging.StreamHandler\",\n \"stream\": sys.stdout,\n \"formatter\": \"minimal\",\n \"level\": logging.DEBUG,\n },\n \"info\": {\n \"class\": \"logging.handlers.RotatingFileHandler\",\n \"filename\": Path(LOGS_DIR, \"info.log\"),\n \"maxBytes\": 10485760, # 1 MB\n \"backupCount\": 10,\n \"formatter\": \"detailed\",\n \"level\": logging.INFO,\n },\n \"error\": {\n \"class\": \"logging.handlers.RotatingFileHandler\",\n \"filename\": Path(LOGS_DIR, \"error.log\"),\n \"maxBytes\": 10485760, # 1 MB\n \"backupCount\": 10,\n \"formatter\": \"detailed\",\n \"level\": logging.ERROR,\n },\n },\n \"root\": {\n \"handlers\": [\"console\", \"info\", \"error\"],\n \"level\": logging.INFO,\n \"propagate\": True,\n },\n}\n
The formatter section determines how logs should be formatted. Here we define two separate formatters, called minimal
and detailed
which we can use in the next part of the code.
The handlers are in charge of what should happen at the different levels of logging. console
uses the minimal
format we defined and sends logs to the stdout
stream for messages of level DEBUG
and higher. The info
handler uses the detailed
format and sends messages of level INFO
and higher to a separate info.log
file. The error
handler does the same for messages of level ERROR
and higher to a file called error.log
.
you will need to set the LOGS_DIR
variable and also figure out how to add this logging_config
using the logging config submodule to your logger.
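If you get stuck, a minimal sketch could look like the following (LOGS_DIR is defined here as a local logs folder and must be set before the logging_config dict is constructed, since the dict references it):
import logging\nimport logging.config\nfrom pathlib import Path\n\nLOGS_DIR = Path(\"logs\")  # define this before building logging_config\nLOGS_DIR.mkdir(exist_ok=True)\n\n# logging_config is the dict from above\nlogging.config.dictConfig(logging_config)\nlogger = logging.getLogger(__name__)\nlogger.info(\"This message goes to the console and to info.log\")\nlogger.error(\"This message additionally ends up in error.log\")\n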
When the code successfully runs, check the LOGS_DIR
folder and make sure that an info.log
and an error.log
file were created with the appropriate content.
Finally, let's try to add a little bit of style and color to our logging. For this we can use rich, which is a great package for rich text and beautiful formatting in terminals. Install rich
and add the following line to your my_logger.py
script:
from rich.logging import RichHandler\n\nlogger.root.handlers[0] = RichHandler(markup=True) # set rich handler\n
and try re-running the script. Hopefully you should see something beautiful in your terminal like this:
(Optional) We already briefly touched on logging during the module on config files using hydra. If you want to configure hydra to use a custom logging scheme like the one we set up in the last two exercises, you can take a look at this page. In hydra you will need to provide the configuration of the logger as a config file. You can find examples of such config files here.
When most people think of machine learning, they think about the training phase. Being able to track and log experiments is an important part of understanding what is going on with your model while you are training. It can help you debug your model and help tweak your models to perfection. Without proper logging of experiments, it can be really hard to iterate on the model because you do not know which changes led to an increase or decrease in performance.
The most basic logging we can do when running experiments is writing the metrics that our model is producing e.g. the loss or the accuracy to the terminal or a file for later inspection. We can then also use tools such as matplotlib for plotting the progression of our metrics over time. This kind of workflow may be enough when doing smaller experiments or working alone on a project, but there is no way around using a proper experiment tracker and visualizer when doing large scale experiments in collaboration with others. It especially becomes important when you want to compare performance between different runs.
There exist many tools for logging your experiments, with some of them being:
All of the frameworks offer many of the same functionalities; you can see a (biased) review here. We are going to use Weights and Biases (wandb), as it supports everything we need in this course. Additionally, it is an excellent tool for collaboration and sharing of results.
Using the Weights and Bias (wandb) dashboard we can quickly get an overview and compare many runs over different metrics. This allows for better iteration of models and training procedure."},{"location":"s4_debugging_and_logging/logging/#exercises_1","title":"\u2754 Exercises","text":"Start by creating an account at wandb. I recommend using your GitHub account but feel free to choose what you want. When you are logged in you should get an API key of length 40. Copy this for later use (HINT: if you forgot to copy the API key, you can find it under settings), but make sure that you do not share it with anyone or leak it in any way.
.env fileA good place to store not only your wandb API key but also other sensitive information is in a .env
file. This file should be added to your .gitignore
file to make sure that it is not uploaded to your repository. You can then load the variables in the .env
file using the python-dotenv
package. For more information see this page.
.env
WANDB_API_KEY=your-api-key\nWANDB_PROJECT=my_project\nWANDB_ENTITY=my_entity\n...\n
load_from_env_file.pyfrom dotenv import load_dotenv\nload_dotenv()\nimport os\napi_key = os.getenv(\"WANDB_API_KEY\")\n
Next install wandb on your laptop
pip install wandb\n
Now connect to your wandb account
wandb login\n
you will be asked to provide the 40-character API key. The connection should remain open to the wandb server even when you close the terminal, such that you do not have to log in each time. If using wandb
in a notebook you need to manually close the connection using wandb.finish()
.
We are now ready for incorporating wandb
into our code. We are going to continue development on our corrupt MNIST codebase from the previous sessions. For help, we recommend looking at this quickstart and this guide for PyTorch applications. You first job is to alter your training script to include wandb
logging, at least for the training loss.
import click\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(f\"{lr=}, {batch_size=}, {epochs=}\")\n wandb.init(\n project=\"corrupt_mnist\",\n config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n )\n\n model = MyAwesomeModel().to(DEVICE)\n train_set, _ = corrupt_mnist()\n\n train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n loss_fn = torch.nn.CrossEntropyLoss()\n optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n for epoch in range(epochs):\n model.train()\n for i, (img, target) in enumerate(train_dataloader):\n img, target = img.to(DEVICE), target.to(DEVICE)\n optimizer.zero_grad()\n y_pred = model(img)\n loss = loss_fn(y_pred, target)\n loss.backward()\n optimizer.step()\n accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n if i % 100 == 0:\n print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n\nif __name__ == \"__main__\":\n train()\n
After running your model, checkout the webpage. Hopefully you should be able to see at least one run with something logged.
Now log something other than scalar values. This could be an image, a histogram or a matplotlib figure. In all cases the logging is still going to use wandb.log
but you need extra calls to wandb.Image
etc. depending on what you choose to log.
In this solution we log the input images to the model every 100 steps. Additionally, we also log a histogram of the gradients to inspect if the model is converging. Finally, we create a ROC curve, which is a matplotlib figure, and log that as well.
train.pyimport click\nimport matplotlib.pyplot as plt\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\nfrom sklearn.metrics import RocCurveDisplay\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(f\"{lr=}, {batch_size=}, {epochs=}\")\n wandb.init(\n project=\"corrupt_mnist\",\n config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n )\n\n model = MyAwesomeModel().to(DEVICE)\n train_set, _ = corrupt_mnist()\n\n train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n loss_fn = torch.nn.CrossEntropyLoss()\n optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n for epoch in range(epochs):\n model.train()\n\n preds, targets = [], []\n for i, (img, target) in enumerate(train_dataloader):\n img, target = img.to(DEVICE), target.to(DEVICE)\n optimizer.zero_grad()\n y_pred = model(img)\n loss = loss_fn(y_pred, target)\n loss.backward()\n optimizer.step()\n accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n preds.append(y_pred.detach().cpu())\n targets.append(target.detach().cpu())\n\n if i % 100 == 0:\n print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n # add a plot of the input images\n images = wandb.Image(img[:5].detach().cpu(), caption=\"Input images\")\n wandb.log({\"images\": images})\n\n # add a plot of histogram of the gradients\n grads = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None], 0)\n wandb.log({\"gradients\": wandb.Histogram(grads)})\n\n # add a custom matplotlib plot of the ROC curves\n preds = torch.cat(preds, 0)\n targets = torch.cat(targets, 0)\n\n for class_id in range(10):\n one_hot = torch.zeros_like(targets)\n one_hot[targets == class_id] = 1\n _ = RocCurveDisplay.from_predictions(\n one_hot,\n preds[:, class_id],\n name=f\"ROC curve for {class_id}\",\n plot_chance_level=(class_id == 2),\n )\n\n wandb.plot({\"roc\": plt})\n # alternatively the wandb.plot.roc_curve function can be used\n\n\nif __name__ == \"__main__\":\n train()\n
Finally, we want to log the model itself. This is done by saving the model as an artifact and then logging the artifact. You can read much more about what artifacts are here but they are essentially one or more files logged together with runs that can be versioned and equipped with metadata. Log the model after training and see if you can find it in the wandb dashboard.
SolutionIn this solution we have added the calculation of final training metrics, and when we then log the model we add these as metadata to the artifact.
train.pyimport click\nimport matplotlib.pyplot as plt\nimport torch\nimport wandb\nfrom my_project.data import corrupt_mnist\nfrom my_project.model import MyAwesomeModel\nfrom sklearn.metrics import RocCurveDisplay, accuracy_score, f1_score, precision_score, recall_score\n\nDEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\")\n\n\n@click.command()\n@click.option(\"--lr\", type=float, default=0.001, help=\"Learning rate\")\n@click.option(\"--batch_size\", type=int, default=32, help=\"Batch size\")\n@click.option(\"--epochs\", type=int, default=5, help=\"Number of epochs\")\ndef train(lr, batch_size, epochs) -> None:\n \"\"\"Train a model on MNIST.\"\"\"\n print(\"Training day and night\")\n print(f\"{lr=}, {batch_size=}, {epochs=}\")\n run = wandb.init(\n project=\"corrupt_mnist\",\n config={\"lr\": lr, \"batch_size\": batch_size, \"epochs\": epochs},\n )\n\n model = MyAwesomeModel().to(DEVICE)\n train_set, _ = corrupt_mnist()\n\n train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)\n\n loss_fn = torch.nn.CrossEntropyLoss()\n optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n\n for epoch in range(epochs):\n model.train()\n\n preds, targets = [], []\n for i, (img, target) in enumerate(train_dataloader):\n img, target = img.to(DEVICE), target.to(DEVICE)\n optimizer.zero_grad()\n y_pred = model(img)\n loss = loss_fn(y_pred, target)\n loss.backward()\n optimizer.step()\n accuracy = (y_pred.argmax(dim=1) == target).float().mean().item()\n wandb.log({\"train_loss\": loss.item(), \"train_accuracy\": accuracy})\n\n preds.append(y_pred.detach().cpu())\n targets.append(target.detach().cpu())\n\n if i % 100 == 0:\n print(f\"Epoch {epoch}, iter {i}, loss: {loss.item()}\")\n\n # add a plot of the input images\n images = wandb.Image(img[:5].detach().cpu(), caption=\"Input images\")\n wandb.log({\"images\": images})\n\n # add a plot of histogram of the gradients\n grads = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None], 0)\n wandb.log({\"gradients\": wandb.Histogram(grads)})\n\n # add a custom matplotlib plot of the ROC curves\n preds = torch.cat(preds, 0)\n targets = torch.cat(targets, 0)\n\n for class_id in range(10):\n one_hot = torch.zeros_like(targets)\n one_hot[targets == class_id] = 1\n _ = RocCurveDisplay.from_predictions(\n one_hot,\n preds[:, class_id],\n name=f\"ROC curve for {class_id}\",\n plot_chance_level=(class_id == 2),\n )\n\n wandb.plot({\"roc\": plt})\n # alternatively the wandb.plot.roc_curve function can be used\n\n final_accuracy = accuracy_score(targets, preds.argmax(dim=1))\n final_precision = precision_score(targets, preds.argmax(dim=1), average=\"weighted\")\n final_recall = recall_score(targets, preds.argmax(dim=1), average=\"weighted\")\n final_f1 = f1_score(targets, preds.argmax(dim=1), average=\"weighted\")\n\n # first we save the model to a file then log it as an artifact\n torch.save(model.state_dict(), \"model.pth\")\n artifact = wandb.Artifact(\n name=\"corrupt_mnist_model\",\n type=\"model\",\n description=\"A model trained to classify corrupt MNIST images\",\n metadata={\"accuracy\": final_accuracy, \"precision\": final_precision, \"recall\": final_recall, \"f1\": final_f1},\n )\n artifact.add_file(\"model.pth\")\n run.log_artifact(artifact)\n\n\nif __name__ == \"__main__\":\n train()\n
After running the script you should be able to see the logged artifact in the wandb dashboard.
Weights and Biases was created with collaboration in mind, so let us therefore share our results with others.
Lets create a report that you can share. Click the Create report button (upper right corner when you are in a project workspace) and include some of the graphs/plots/images that you have generated in the report.
Make the report shareable by clicking the Share button and create view-only-link. Send a link to your report to a group member, fellow student or a friend. In the worst case that you have no one else to share with you can send a link to my email nsde@dtu.dk
, so I can checkout your awesome work \ud83d\ude03
When calling wandb.init
you can provide many additional arguments. Some of the most important are
project
entity
job_type
Make sure you understand what these arguments do and try them out. It will come in handy for your group work, as these arguments essentially allow multiple users to upload their own runs to the same project in wandb
.
Relevant documentation can be found here. The project
indicates what project all experiments and artifacts are logged to. We want to keep this the same for all group members. The entity
is the username of the person or team who owns the project, which should also be the same for all group members. The job type is important if you have different jobs that log to the same project. A common example is one script that trains a model and another that evaluates it. By setting the job type you can easily filter the runs in the wandb dashboard.
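A small sketch of how these arguments could be set (the project and entity names below are placeholders):
import wandb\n\nrun = wandb.init(\n    project=\"corrupt_mnist\",  # shared project for the whole group\n    entity=\"my_team\",  # the user or team that owns the project\n    job_type=\"train\",  # e.g. distinguish training runs from evaluation runs\n)\n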
Wandb also comes with a built-in feature for doing hyperparameter sweeps, which can be beneficial for getting a better working model. Look through the documentation on how to do a hyperparameter sweep in Wandb. You at least need to create a new file called sweep.yaml
and make sure that you call wandb.log
in your code on an appropriate value.
Start by creating a sweep.yaml
file. Relevant documentation can be found here. We recommend placing the file in a configs
folder in your project.
The sweep.yaml
file will depend on kind of hyperparameters your model accepts as arguments and how they are passed to the model. For this solution we assume that the model accepts the hyperparameters lr
, batch_size
and epochs
and that they are passed as --args
(with hyphens) (1) e.g. this would be how we run the script
command
config in your sweep.yaml
file. This is because wandb
uses --args
to pass hyperparameters to the script, whereas hydra
uses args
(without the hyphen). See this page for more information.python train.py --lr=0.01 --batch_size=32 --epochs=10\n
The sweep.yaml
could then look like this:
program: train.py\nname: sweepdemo\nproject: my_project # change this\nentity: my_entity # change this\nmetric:\n goal: minimize\n name: validation_loss\nparameters:\n learning_rate:\n min: 0.0001\n max: 0.1\n distribution: log_uniform\n batch_size:\n values: [16, 32, 64]\n epochs:\n values: [5, 10, 15]\nrun_cap: 10\n
Afterwards, you need to create a sweep using the wandb sweep
command:
wandb sweep configs/sweep.yaml\n
this will output a sweep id that you need to use in the next step.
Finally, you need to run the sweep using the wandb agent
command:
wandb agent <sweep_id>\n
where <sweep_id>
is the id of the sweep you just created. You can find the id in the output of the wandb sweep
command. The reason that we first lunch the sweep and then the agent is that we can have multiple agents running at the same time, parallelizing the search for the best hyperparameters. Try this out by opening a new terminal and running the wandb agent
command again (with the same <sweep_id>
).
Inspect the sweep results in the wandb dashboard. You should see multiple new runs under the project you are logging the sweep to, corresponding to the different hyperparameters you tried. Make sure you understand the results and can answer what hyperparameters gave the best results and what hyperparameters had the largest impact on the results.
SolutionIn the sweep dashboard you should see something like this:
Importantly you can:
Next we need to understand the model registry, which will be very important later on when we get to the deployment of our models. The model registry is a centralized place for storing and versioning models. Importantly, any model in the registry is immutable, meaning that once a model is uploaded it cannot be changed. This is important for reproducibility and traceability of models.
The model registry is in general a repository of a teams trained models where ML practitioners publish candidates for production and share them with others. Figure from wandb.
The model registry builds on the artifact registry in wandb. Any model that is uploaded to the model registry is stored as an artifact. This means that we first need to log our trained models as artifacts before we can register them in the model registry. Make sure you have logged at least one model as an artifact before continuing.
Next lets create a registry. Go to the model registry tab (left pane, visible from your homepage) and then click the New Registered Model
button. Fill out the form and create the registry.
When then need to link our artifact to the model registry we just created. We can do this in two ways: either through the web interface or through the wandb
API. In the web interface, go to the artifact you want to link to the model registry and click the Link to registry
button (upper right corner). If you want to use the API you need to call the link method on a artifact object.
To use the API, create a new script called link_to_registry.py
and add the following code:
import wandb\napi = wandb.Api()\nartifact_path = \"<entity>/<project>/<artifact_name>:<version>\"\nartifact = api.artifact(artifact_path)\nartifact.link(target_path=\"<entity>/model-registry/<my_registry_name>\")\nartifact.save()\n
In the code <entity>
, <project>
, <artifact_name>
, <version>
and <my_registry_name>
should be replaced with the appropriate values.
We are now ready to consume our model, which can be done by downloading the artifact from the model registry. In this case we use the wandb API to download the artifact.
import wandb\nrun = wandb.init()\nartifact = run.use_artifact('<entity>/model-registry/<my_registry_name>:<version>', type='model')\nartifact_dir = artifact.download(\"<artifact_dir>\")\nmodel = MyModel()\nmodel.load_state_dict(torch.load(\"<artifact_dir>/model.ckpt\"))\n
Try running this code with the appropriate values for <entity>
, <my_registry_name>
, <version>
and <artifact_dir>
. Make sure that you can load the model and that it is the same as the one you trained.
Each model in the registry has at least one alias, which is the version of the model. The most recently added model also receives the alias latest
. Aliases are great for indicating where in the workflow a model is, e.g. whether it is a candidate for production or a model that is still being developed. Try adding an alias to one of your models in the registry.
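This can be done in the web interface, or through the API; a rough sketch (using the same placeholder path convention as above):
import wandb\n\napi = wandb.Api()\nartifact = api.artifact(\"<entity>/model-registry/<my_registry_name>:latest\")\nartifact.aliases.append(\"staging\")  # mark this version as e.g. a staging candidate\nartifact.save()\n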
(Optional) A model always corresponds to an artifact, and artifacts can contain metadata that we can use to automate the process of registering models. We could for example imagine that at the end of each week we run a script that registers the best model from that week. Try creating a small script using the wandb
API that goes over a collection of artifacts and registers the best one.
import logging\nimport operator\nimport os\n\nimport click\nimport wandb\nfrom dotenv import load_dotenv\n\nlogger = logging.getLogger(__name__)\nload_dotenv()\n\n\n@click.command()\n@click.argument(\"model-name\")\n@click.option(\"--metric_name\", default=\"accuracy\", help=\"Name of the metric to choose the best model from.\")\n@click.option(\"--higher-is-better\", default=True, help=\"Whether higher metric values are better.\")\ndef stage_best_model_to_registry(model_name, metric_name, higher_is_better) -> None:\n \"\"\"\n Stage the best model to the model registry.\n\n Args:\n model_name: Name of the model to be registered.\n metric_name: Name of the metric to choose the best model from.\n higher_is_better: Whether higher metric values are better.\n\n \"\"\"\n api = wandb.Api(\n api_key=os.getenv(\"WANDB_API_KEY\"),\n overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n )\n artifact_collection = api.artifact_collection(type_name=\"model\", name=model_name)\n\n best_metric = float(\"-inf\") if higher_is_better else float(\"inf\")\n compare_op = operator.gt if higher_is_better else operator.lt\n best_artifact = None\n for artifact in list(artifact_collection.artifacts()):\n if metric_name in artifact.metadata and compare_op(artifact.metadata[metric_name], best_metric):\n best_metric = artifact.metadata[metric_name]\n best_artifact = artifact\n\n if best_artifact is None:\n logging.error(\"No model found in registry.\")\n return\n\n logger.info(f\"Best model found in registry: {best_artifact.name} with {metric_name}={best_metric}\")\n best_artifact.link(\n target_path=f\"{os.getenv('WANDB_ENTITY')}/model-registry/{model_name}\",\n aliases=[\"best\", \"staging\"],\n )\n best_artifact.save()\n logger.info(\"Model staged to registry.\")\n\n\nif __name__ == \"__main__\":\n stage_best_model_to_registry()\n
In the future it will be important for us to be able to run Wandb inside a docker container (together with whatever training or inference we specify). The problem here is that we cannot authenticate Wandb in the same way as in the previous exercise; it needs to happen automatically. Let's therefore look into how we can do that.
First we need to generate an authentication key, or more precisely an API key. This is in general the way any service (like a docker container) can authenticate. Start by going to https://wandb.ai/home, click your profile icon in the upper right corner and then go to settings. Scroll down to the danger zone, generate a new API key and finally copy it.
Next create a new docker file called wandb.docker
and add the following code
FROM python:3.10-slim\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\nRUN pip install wandb\nCOPY s4_debugging_and_logging/exercise_files/wandb_tester.py wandb_tester.py\nENTRYPOINT [\"python\", \"-u\", \"wandb_tester.py\"]\n
please take a look at the script being copied into the image and afterwards build the docker image.
When we want to run the image, what we need to do is include an environment variable that contains the API key we generated. This will then authenticate the docker container with the wandb server:
docker run -e WANDB_API_KEY=<your-api-key> wandb:latest\n
Try running it and confirm that the results are uploaded to the wandb server (1).
.env
file you can use the --env-file
flag instead of -e
to load the environment variables from the file e.g. docker run --env-file .env wandb:latest
.Feel free to experiment more with wandb
as it is a great tool for logging, organizing and sharing experiments.
That is the module on logging. Please note that at this point in the course you will begin to see some overlap between the different frameworks. While we mainly used hydra
for configuring our Python scripts it can also be used to save metrics and hyperparameters similar to how wandb
can. Similar arguments hold for dvc
which can also be used to log metrics. In our opinion wandb
just offers a better experience when interacting with the results after logging. We want to stress that the combination of tools presented in this course may not be the best for all your future projects, and we recommend finding a setup that fits you. That said, each framework provides specific features that the others do not.
Finally, we want to note that during the course we really try to showcase a lot of open-source frameworks; Wandb is not one of them. It is free to use for personal usage (with a few restrictions) but for enterprise use it does require a license. If you are eager to only work with open-source tools we highly recommend trying out MLFlow, which offers much of the same overall functionality as Wandb.
"},{"location":"s4_debugging_and_logging/profiling/","title":"M13 - Profiling","text":""},{"location":"s4_debugging_and_logging/profiling/#profilers","title":"Profilers","text":"Core Module
"},{"location":"s4_debugging_and_logging/profiling/#profilers_1","title":"Profilers","text":"In general profiling code is about improving the performance of your code. In this session we are going to take a somewhat narrow approach to what \"performance\" is: runtime, meaning the time it takes to execute your program.
At the bare minimum, the two questions a proper profiling of your program should be able to answer is:
The first question is important to priorities optimization. If two methods A
and B
have approximately the same runtime, but A
is called 1000 more times than B
we should probably spend time optimizing A
over B
if we want to speedup our code. The second question is gives itself, directly telling us which methods are the expensive to call.
Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, with the first one being the cProfile. cProfile
is pythons build in profiler that can help give you an overview runtime of all the functions and methods involved in your programs.
Run the cProfile
on the vae_mnist_working.py
script. Hint: you can directly call the profiler on a script using the -m
arg
python -m cProfile -o <output_file> -s <sort_order> myscript.py\n
Try looking at the output of the profiling. Can you figure out which function took the longest to run?
Can you explain the difference between tottime
and cumtime
? Under what circumstances does these differ and when are they equal.
To get a better feeling of the profiled result we can try to visualize it. Python does not provide a native solution, but open-source solutions such as snakeviz exist. Try installing snakeviz
and load a profiled run into it (HINT: snakeviz expect the run to have the file format .prof
).
Try optimizing the run! (Hint: The data is not stored as torch tensor). After optimizing the code make sure (using cProfile
and snakeviz
) that the code actually runs faster.
Profiling machine learning code can become much more complex because we are suddenly beginning to mix different devices (CPU+GPU), that can (and should) overlap some of their computations. When profiling this kind of machine learning code we are often looking for bottlenecks. A bottleneck is simple the place in your code that is preventing other processes from performing their best. This is the reason that all major deep learning frameworks also include their own profilers that can help profiling more complex applications.
The image below show a typical report using the build in profiler in pytorch. As the image shows the profiler looks both a the kernel
time (this is the time spend doing actual computations) and also transfer times such as memcpy
(where we are copying data between devices). It can even analyze your code and give recommendations.
Using the profiler can be as simple as wrapping the code that you want to profile with the torch.profiler.profile
decorator
with torch.profiler.profile(...) as prof:\n # code that I want to profile\n output = model(data)\n
"},{"location":"s4_debugging_and_logging/profiling/#exercises_1","title":"\u2754 Exercises","text":"Exercise files
In these investigate the profiler that is build into PyTorch already. Note that these exercises requires that you have PyTorch v1.8.1 installed (or higher). You can always check which version you currently have installed by writing (in a python interpreter):
import torch\nprint(torch.__version__)\n
But we always recommend to update to the latest PyTorch version for the best experience. Additionally, to display the result nicely (like snakeviz
for cProfile
) we are also going to use the tensorboard profiler extension
pip install torch_tb_profiler\n
A good starting point is too look at the API for the profiler. Here the important class to look at is the torch.profiler.profile
class.
Lets try out an simple example (taken from here):
Try to run the following code
import torch\nimport torchvision.models as models\nfrom torch.profiler import profile, ProfilerActivity\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\nwith profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:\n model(inputs)\n
this will profile the forward
pass of Resnet 18 model.
Running this code will produce an prof
object that contains all the relevant information about the profiling. Try writing the following code:
print(prof.key_averages().table(sort_by=\"cpu_time_total\", row_limit=10))\n
what operation is taking most of the cpu?
Try running
print(prof.key_averages(group_by_input_shape=True).table(sort_by=\"cpu_time_total\", row_limit=30))\n
can you see any correlation between the shape of the input and the cost of the operation?
(Optional) If you have a GPU you can also profile the operations on that device:
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:\n model(inputs)\n
(Optional) As an alternative to using profile
as an context-manager we can also use its .start
and .stop
methods:
prof = profile(...)\nprof.start()\n... # code I want to profile\nprof.stop()\n
Try doing this on the above example.
The torch.profiler.profile
function takes some additional arguments. What argument would you need to set to also profile the memory usage? (Hint: this page) Try doing it to the simple example above and make sure to sort the sample by self_cpu_memory_usage
.
As mentioned we can also get a graphical output for better inspection. After having done a profiling try to export the results with:
prof.export_chrome_trace(\"trace.json\")\n
you should be able to visualize the file by going to chrome://tracing
in any chromium based web browser. Can you still identify the information printed in the previous exercises from the visualizations?
Running profiling on a single forward step can produce misleading results as it only provides a single sample that may depend on what background processes that are running on your computer. Therefore it is recommended to profile multiple iterations of your model. If this is the case then we need to include prof.step()
to tell the profiler when we are doing a new iteration
with profile(...) as prof:\n for i in range(10):\n model(inputs)\n prof.step()\n
Try doing this. Is the conclusion this the same on what operations that are taken up most of the time? Have the percentage changed significantly?
Additionally, we can also visualize the profiling results using the profiling viewer in tensorboard.
Start by initializing the profile
class with an additional argument:
from torch.profiler import profile, tensorboard_trace_handler\nwith profile(..., on_trace_ready=tensorboard_trace_handler(\"./log/resnet18\")) as prof:\n ...\n
Try run a profiling (using a couple of iterations) and make sure that a file with the .pt.trace.json
is produced in the log/resnet18
folder.
Now try launching tensorboard
tensorboard --logdir=./log\n
and open the page http://localhost:6006/#pytorch_profiler, where you should hopefully see an image similar to the one below:
Image credit
Try poking around in the interface.
Tensorboard have a nice feature for comparing runs under the diff
tab. Try redoing a profiling run but use model = models.resnet34()
instead. Load up both runs and try to look at the diff
between them.
As an final exercise, try to use the profiler on the vae_mnist_working.py
file from the previous module on debugging, where you profile a whole training run (not only the forward pass). What is the bottleneck during the training? Is it still the forward pass or is it something else? Can you improve the code somehow based on the information from the profiler.
This end the module on profiling. If you want to go into more details on this topic we can recommend looking into line_profiler and kernprof. A downside of using python's cProfile
is that it can only profiling at an functional/modular level, that is great for identifying hotspots in your code. However, sometimes the cause of an computationally hotspot is a single line of code in a function, which will not be caught by cProfile
. An example would be an simple index operations such as a[idx] = b
, which for large arrays and non-sequential indexes is really expensive. For these cases line_profiler and kernprof are excellent tools to have in your toolbox. Additionally, if you do not like cProfile we can also recommend py-spy which is another open-source profiling tool for Python programs.
Slides
Learn how to write unit tests that cover both data and models in your ML pipeline.
M16: Unit testing
Learn how to implement continuous integration using Github actions such that tests are automatically executed on code changes.
M17: Github Actions
Learn how to use pre-commit to ensure that code that is not up to standard does not get committed.
M18: Pre-commit
Learn how to implement continuous machine learning pipelines in Github actions.
M19: Continuous Machine Learning
Continues integration is a sub-discipline of the general field of Continues X. Continuous X is one of the core elements of modern DevOps, and by extension MLOps. Continuous X assumes that we have a (long) developer pipeline (see image below) where we want to make some changes to our code e.g:
Basically, any code change we will expect will have a influence on the final result. The problem with doing changes to the start of our pipeline is that we want the change to propagate all the way through to the end of the pipeline.
Image creditThis is where continuous X comes into play. The word continuous here refers to the fact that the pipeline should continuously be updated as we make code changes. You can also choose to think of this as the automatization of processes. The X then covers that the process we need to go through to automate steps in the pipeline depends on where we are in the pipeline e.g. the tools needed to do continuous integration are different from the tools needed to do continuous delivery.
In this session, we are going to focus on continuous integration (CI). As indicated in the image above, continuous integration usually takes care of the first part of the developer pipeline which has to do with the code base, code building and code testing. This is paramount to step in automatization as we would rather catch bugs at the beginning of our pipeline than in the end.
Learning objectives
The learning objectives of this session are:
The continuous integration we have looked at until now is what we can consider \"classical\" continuous integration, which has its roots in DevOps and not MLOps. While the test that we have written and the containers we have developed in the previous session have been about machine learning, everything we have done translates completely to how it would be done if we had developed any other application that did not include machine learning.
In this session, we are now gonna change gears and look at continuous machine learning (CML). As the name may suggest we are now focusing on automatizing actual machine learning processes. The reason for doing this is the same as with continuous integration, namely that we often have a bunch of checks that we want our newly trained model to pass before we trust it to be ready for deployment. Writing unit tests secures that the code that we use for training our model is not broken, but there exist other failure modes of a machine learning pipeline:
All of these are questions that we can answer by writing tests that are specific to machine learning. In this session, we are going to look at how we can begin to use Github Actions to automate these tests.
"},{"location":"s5_continuous_integration/cml/#mlops-maturity-model","title":"MLOps maturity model","text":"Before getting started with the exercises, let's first take a side step and look at what is called the MLOps maturity model. The reason here is to get a better understanding of when continuous machine learning is relevant. The main idea behind the MLOps maturity model is to help organizations understand where they are in their machine learning operations journey and what the next logical steps are. The model is divided into five stages:
Image creditLevel 0
At this level, organizations are doing machine learning in an ad-hoc manner. There is no standardization, no version control, no testing, and no monitoring.
Level 1
At this level, organizations have started to implement DevOps practices in their machine learning workflows. They have started to use version control and may have basic continuous integration practices in place.
Level 2
At this level, organizations have started to standardize the training process and tackle the problem of creating reproducible experiments. Centralization of model artifacts and metadata is common at this level. They have started to implement model versioning and model registry practices.
Level 3
At this level, organizations have started to implement continuous integration and continuous deployment practices. They have started to automate the testing of their models and have started to monitor their models in production.
Level 4
At this level, organizations have started to implement continuous machine learning practices. They have started to automate the training, evaluation, and deployment of their models. They have started to implement automated retraining and model updates.
The MLOps maturity model tells us that continuous machine learning is the highest form of maturity in MLOps. It is the stage where we have automated the entire machine learning pipeline and the cases we will be going through in the exercises are therefore some of the last steps in the MLOps maturity model.
"},{"location":"s5_continuous_integration/cml/#exercises","title":"\u2754 Exercises","text":"In the following exercises, we are going to look at two different cases where we can use continuous machine learning. The first one is a simple case where we are automatically going to trigger some workflow (like training of a model) whenever we make changes to our data. This is a very common use case in machine learning where we have a data pipeline that is continuously updating our data. The second case is connected to staging and deploying models. In this case, we are going to look at how we can automatically do further processing of our model whenever we push a new model to our repository.
For the first set of exercises, we are going to rely on the cml
framework by iterative.ai, which is a framework that is built on top of GitHub actions. The figure below describes the overall process using the cml
framework. It should be clear that it is the very same process that we go through in the other continuous integration sessions: push code
-> trigger GitHub actions
-> do stuff
. The new part in this session is that we only trigger the workflow whenever the data changes.
Image credit
If you have not already created a dataset class for the corrupted MNIST data, start by doing that. Essentially, it is a class that should inherit from torch.utils.data.Dataset
and should have a __getitem__
and __len__ method implemented.
from __future__ import annotations\n\nimport os\nfrom typing import TYPE_CHECKING\n\nimport torch\nfrom torch import Tensor\nfrom torch.utils.data import Dataset\n\nif TYPE_CHECKING:\n import torchvision.transforms.v2 as transforms\n\n\nclass MnistDataset(Dataset):\n \"\"\"MNIST dataset for PyTorch.\n\n Args:\n data_folder: Path to the data folder.\n train: Whether to load training or test data.\n img_transform: Image transformation to apply.\n target_transform: Target transformation to apply.\n \"\"\"\n\n name: str = \"MNIST\"\n\n def __init__(\n self,\n data_folder: str = \"data\",\n train: bool = True,\n img_transform: transforms.Transform | None = None,\n target_transform: transforms.Transform | None = None,\n ) -> None:\n super().__init__()\n self.data_folder = data_folder\n self.train = train\n self.img_transform = img_transform\n self.target_transform = target_transform\n self.load_data()\n\n def load_data(self) -> None:\n \"\"\"Load images and targets from disk.\"\"\"\n images, target = [], []\n if self.train:\n nb_files = len([f for f in os.listdir(self.data_folder) if f.startswith(\"train_images\")])\n for i in range(nb_files):\n images.append(torch.load(f\"{self.data_folder}/train_images_{i}.pt\"))\n target.append(torch.load(f\"{self.data_folder}/train_target_{i}.pt\"))\n else:\n images.append(torch.load(f\"{self.data_folder}/test_images.pt\"))\n target.append(torch.load(f\"{self.data_folder}/test_target.pt\"))\n self.images = torch.cat(images, 0)\n self.target = torch.cat(target, 0)\n\n def __getitem__(self, idx: int) -> tuple[Tensor, Tensor]:\n \"\"\"Return image and target tensor.\"\"\"\n img, target = self.images[idx], self.target[idx]\n if self.img_transform:\n img = self.img_transform(img)\n if self.target_transform:\n target = self.target_transform(target)\n return img, target\n\n def __len__(self) -> int:\n \"\"\"Return the number of images in the dataset.\"\"\"\n return self.images.shape[0]\n
Then let's create a function that can report basic statistics, such as the number of training samples and the number of test samples, and generate figures of sample images in the dataset and of the distribution of classes in the dataset. This function should be called dataset_statistics
and should take a path to the dataset as input.
import click\nimport matplotlib.pyplot as plt\nimport torch\nfrom mnist_dataset import MnistDataset\nfrom utils import show_image_and_target\n\n\n@click.command()\n@click.option(\"--datadir\", default=\"data\", help=\"Path to the data directory\")\ndef dataset_statistics(datadir: str) -> None:\n \"\"\"Compute dataset statistics.\"\"\"\n train_dataset = MnistDataset(data_folder=datadir, train=True)\n test_dataset = MnistDataset(data_folder=datadir, train=False)\n print(f\"Train dataset: {train_dataset.name}\")\n print(f\"Number of images: {len(train_dataset)}\")\n print(f\"Image shape: {train_dataset[0][0].shape}\")\n print(\"\\n\")\n print(f\"Test dataset: {test_dataset.name}\")\n print(f\"Number of images: {len(test_dataset)}\")\n print(f\"Image shape: {test_dataset[0][0].shape}\")\n\n show_image_and_target(train_dataset.images[:25], train_dataset.target[:25], show=False)\n plt.savefig(\"mnist_images.png\")\n plt.close()\n\n train_label_distribution = torch.bincount(train_dataset.target)\n test_label_distribution = torch.bincount(test_dataset.target)\n\n plt.bar(torch.arange(10), train_label_distribution)\n plt.title(\"Train label distribution\")\n plt.xlabel(\"Label\")\n plt.ylabel(\"Count\")\n plt.savefig(\"train_label_distribution.png\")\n plt.close()\n\n plt.bar(torch.arange(10), test_label_distribution)\n plt.title(\"Test label distribution\")\n plt.xlabel(\"Label\")\n plt.ylabel(\"Count\")\n plt.savefig(\"test_label_distribution.png\")\n plt.close()\n\n\nif __name__ == \"__main__\":\n dataset_statistics()\n
Next, we are going to implement a GitHub actions workflow that only activates when we make changes to our data. Create a new workflow file (call it cml_data.yaml
) and make sure it only activates on push/pull-request events when data/
changes. Relevant documentation
The secret is to use the paths
keyword in the workflow file. Here we specify that the workflow should only trigger when the .dvc
folder or any file with the .dvc
extension changes, which is the case when we update our data and call dvc add data/
.
name: DVC Workflow\n\non:\n pull_request:\n branches:\n - main\n paths:\n - '**/*.dvc'\n - '.dvc/**'\n
The next step is to implement steps in our workflow that do something when data changes. This is the reason why we created the dataset_statistics
function. Implement a workflow that runs the dataset_statistics
function on the data.
This solution assumes that data is stored in a GCP bucket and that the credentials are stored in a secret called GCP_SA_KEY
. If this is not the case for you, you need to adjust the workflow accordingly with the correct way to pull the data.
jobs:\n run_data_checker:\n runs-on: ubuntu-latest\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n\n - name: Install dependencies\n run: |\n make dev_requirements\n pip list\n\n - name: Auth with GCP\n uses: google-github-actions/auth@v2\n with:\n credentials_json: ${{ secrets.GCP_SA_KEY }}\n\n - name: Pull data\n run: |\n dvc pull --no-run-cache\n\n - name: Check data statistics\n run: |\n python dataset_statistics.py\n
Let's make sure that the workflow works as expected for now. Create a new branch and either add or remove a file in the data/
folder. Then run
dvc add data/\ngit add data.dvc\ngit commit -m \"Update data\"\ngit push\n
to commit the changes to data. Open a pull request with the branch and make sure that the workflow activates and runs as expected.
Let's now add the cml
framework such that we can automatically comment on the pull request with the results of the dataset_statistics
function. Look at the getting started guide for help on how to do this. You will need to write all the output of the dataset_statistics
function to a file called report.md
and then use the cml comment create
command to create a comment in the pull request with the content of the file.
jobs:\n dataset_statistics:\n runs-on: ubuntu-latest\n steps:\n # ...all the previous steps\n - name: Check data statistics & generate report\n run: |\n python src/example_mlops/data.py > data_statistics.md\n echo '![](./mnist_images.png \"MNIST images\")' >> data_statistics.md\n echo '![](./train_label_distribution.png \"Train label distribution\")' >> data_statistics.md\n echo '![](./test_label_distribution.png \"Test label distribution\")' >> data_statistics.md\n\n - name: Setup cml\n uses: iterative/setup-cml@v2\n\n - name: Comment on PR\n env:\n REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n run: |\n cml comment create data_statistics.md --watermark-title=\"Data Checker\" # (1)!\n
--watermark-title
flag is used to watermark the comment created by cml
. This is to make sure that a new comment is not created every time the workflow runs.
Make sure that the workflow works as expected. You should see a comment created by github-actions (bot)
like this if you have done everything correctly:
(Optional) Feel free to add more checks to the workflow. For example, you could add a check that runs a small baseline model on the updated data and checks that the model converges. This is a very common sanity check that is done in machine learning pipelines.
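As a hedged sketch of such a sanity check (not part of the official exercise material): train a tiny linear model for a handful of steps on the freshly pulled data and assert that the loss goes down. The MnistDataset import assumes the dataset class from the earlier exercise; the step count and architecture are arbitrary choices.

import torch
from torch import nn
from torch.utils.data import DataLoader

from mnist_dataset import MnistDataset  # assumption: the dataset class implemented above


def test_baseline_model_converges(datadir: str = "data") -> None:
    """Sanity check: a tiny linear classifier should reduce its loss within ~50 steps."""
    dataset = MnistDataset(data_folder=datadir, train=True)
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    losses = []
    for step, (img, target) in enumerate(loader):
        optimizer.zero_grad()
        loss = criterion(model(img.float()), target.long())
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        if step == 50:
            break
    assert losses[-1] < losses[0], "Baseline model did not reduce its loss on the new data"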
For the second set of exercises, we are going to look at how to automatically run further testing of our models whenever we add them to our model registry. For that reason, do not continue with this set of exercises before you have completed the exercises on the model registry in this module.
The model registry is in general a repository of a team's trained models where ML practitioners publish candidates for production and share them with others. Figure from wandb.
The first step is to create a team in our Weights and Biases account. Some of these more advanced features are only available for teams; however, every user is allowed to create one team for free. Go to your Weights and Biases account and create a team (the option should be on the left side of the UI). Give the team a name and select W&B cloud storage.
Now we need to generate a personal access token that can link our Weights and Biases account to our GitHub account. Go to this page and generate a new token. You can also find the page by clicking your profile icon in the upper right corner of Github and selecting Settings
, then Developer settings
, then Personal access tokens
and finally choose either Tokens (classic)
or Fine-grained tokens
(which is the safer option and is also what the link points to).
Give it a name, set what repositories it should have access to and select the permissions you want it to have. In our case, if you choose to create a Fine-grained token
then it needs access to the contents:write
permission. If you choose Tokens (classic)
then it needs access to the repo
permission. After you have created the token, copy it and save it somewhere safe.
Go to the settings of your newly created team: https://wandb.ai/teamname/settings and scroll down to the Team secrets
section. Here add the token you just created as a secret with the name GITHUB_ACTIONS_TOKEN
. WANDB will now be able to use this token to trigger actions in your repository.
On the same settings page, scroll down to the Webhooks
settings. Click the New webhook
button and fill in the following information:
github_actions_dispatch
https://api.github.com/repos/<owner>/<repo>/dispatches
GITHUB_ACTIONS_TOKEN
Here you need to replace <owner>
and <repo>
with your own information. The /dispatches
endpoint is a special endpoint that all Github actions workflows can listen to. Thus, if you ever want to set up a webhook in some other framework that should trigger a Github action, you can use this endpoint.
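If you want to check the endpoint independently of W&B (for example, while debugging the webhook setup), you can call it yourself. Below is a hedged sketch using the requests package; <owner> and <repo> are placeholders and the token is read from an environment variable, mirroring the secret created above.

import os

import requests

# placeholders: substitute your own repository
url = "https://api.github.com/repos/<owner>/<repo>/dispatches"

response = requests.post(
    url,
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_ACTIONS_TOKEN']}",
    },
    json={
        "event_type": "staged_model",
        "client_payload": {"artifact_version_string": "entity/project/model:v0"},
    },
)
response.raise_for_status()  # GitHub answers 204 No Content when the dispatch is accepted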
Next, navigate to your model registry. It should hopefully contain at least one registry with at least one model registered. If not, go back to the previous module and do that.
When you have a model in your registry, click on the View details
button. Then click the New automation
button. On the first page, select that you want to trigger the automation when an alias is added to a model version, set that alias to staging
and select the action type to be Webhook
. On the next page, select the github_actions_dispatch
webhook that you just created and add this as the payload:
{\n \"event_type\": \"staged_model\",\n \"client_payload\":\n {\n \"event_author\": \"${event_author}\",\n \"artifact_version\": \"${artifact_version}\",\n \"artifact_version_string\": \"${artifact_version_string}\",\n \"artifact_collection_name\": \"${artifact_collection_name}\",\n \"project_name\": \"${project_name}\",\n \"entity_name\": \"${entity_name}\"\n }\n}\n
Finally, on the next page give the automation a name and click Create automation
.
Make sure you understand overall what is happening here.
SolutionThe automation is set up to trigger a webhook whenever the alias staging
is added to a model version. The webhook is set up to trigger a Github action workflow that listens to the /dispatches
endpoint and has the event type staged_model
. The payload that is sent to the webhook contains information about the model that was staged.
We are now ready to create the Github actions workflow
that listens to the /dispatches
endpoint and triggers whenever a model is staged. Create a new workflow file (called stage_model.yaml
) and make sure it only activates on the staged_model
event. Hint: relevant documentation
name: Check staged model\n\non:\n repository_dispatch:\n types: staged_model\n
Next, we need to implement the steps in our workflow that do something when a model is staged. The payload that is sent to the webhook contains information about the model that was staged. Implement a workflow that:
jobs:\n identify_event:\n runs-on: ubuntu-latest\n outputs:\n model_name: ${{ steps.set_output.outputs.model_name }}\n steps:\n - name: Check event type\n run: |\n echo \"Event type: repository_dispatch\"\n echo \"Payload Data: ${{ toJson(github.event.client_payload) }}\"\n\n - name: Setting model environment variable and output\n id: set_output\n run: |\n echo \"model_name=${{ github.event.client_payload.artifact_version_string }}\" >> $GITHUB_OUTPUT\n
We now need to write a script that can be executed on our staged model. In this case, we are going to run some performance tests on it to check that it is fast enough for deployment. Therefore, do the following:
In a tests/performancetests
folder, create a new file called test_model.py
Implement a test that loads the model from a wandb artifact path, e.g. entity/project/artifact_name:version, and runs it on a random input. Importantly, the artifact path should be read from an environment variable called MODEL_NAME
.
The test should assert that the model can do 100 predictions in less than X amount of time
In this solution we assume that 4 environment variables are set: WANDB_API_KEY
, WANDB_ENTITY
, WANDB_PROJECT
and MODEL_NAME
.
import os
import time

import torch
import wandb

from my_project.models import MyModel


def load_model(artifact_path: str, logdir: str = "models") -> MyModel:
    """Download the checkpoint behind a wandb artifact and load the model from it."""
    api = wandb.Api(
        api_key=os.getenv("WANDB_API_KEY"),
        overrides={"entity": os.getenv("WANDB_ENTITY"), "project": os.getenv("WANDB_PROJECT")},
    )
    artifact = api.artifact(artifact_path)
    artifact.download(root=logdir)
    file_name = artifact.files()[0].name
    return MyModel.load_from_checkpoint(f"{logdir}/{file_name}")


def test_model_speed():
    """The staged model should be able to do 100 predictions in less than a second."""
    model = load_model(os.getenv("MODEL_NAME"))
    start = time.time()
    for _ in range(100):
        model(torch.rand(1, 1, 28, 28))
    end = time.time()
    assert end - start < 1
Let's now add another job that calls the script we just wrote. It is very similar to the kind of jobs we have written before.
Solutionjobs:\n identify_event:\n ...\n test_model:\n runs-on: ubuntu-latest\n needs: identify_event\n env:\n WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}\n WANDB_ENTITY: ${{ secrets.WANDB_ENTITY }}\n WANDB_PROJECT: ${{ secrets.WANDB_PROJECT }}\n MODEL_NAME: ${{ needs.identify_event.outputs.model_name }}\n steps:\n - name: Echo model name\n run: |\n echo \"Model name: $MODEL_NAME\"\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n\n - name: Install dependencies\n run: |\n pip install -r requirements.txt\n pip list\n\n - name: Test model\n run: |\n pytest tests/performancetests/test_model.py\n
Finally, we are going to assume in this setup that if the model gets this far then it is ready for deployment. We are therefore going to add a final job that will add a new alias to the model called production
. Here is some relevant Python code that can be used to add the alias:
import click\nimport os\nimport wandb\n\n@click.command()\n@click.argument(\"artifact-path\")\n@click.option(\n \"--aliases\", \"-a\", multiple=True, default=[\"staging\"], help=\"List of aliases to link the artifact with.\"\n)\ndef link_model(artifact_path: str, aliases: list[str]) -> None:\n \"\"\"\n Stage a specific model to the model registry.\n\n Args:\n artifact_path: Path to the artifact to stage.\n Should be of the format \"entity/project/artifact_name:version\".\n aliases: List of aliases to link the artifact with.\n\n Example:\n model_management link-model entity/project/artifact_name:version -a staging -a best\n\n \"\"\"\n if artifact_path == \"\":\n click.echo(\"No artifact path provided. Exiting.\")\n return\n\n api = wandb.Api(\n api_key=os.getenv(\"WANDB_API_KEY\"),\n overrides={\"entity\": os.getenv(\"WANDB_ENTITY\"), \"project\": os.getenv(\"WANDB_PROJECT\")},\n )\n _, _, artifact_name_version = artifact_path.split(\"/\")\n artifact_name, _ = artifact_name_version.split(\":\")\n\n artifact = api.artifact(artifact_path)\n artifact.link(target_path=f\"{os.getenv('WANDB_ENTITY')}/model-registry/{artifact_name}\", aliases=aliases)\n artifact.save()\n click.echo(f\"Artifact {artifact_path} linked to {aliases}\")\n
For example, you can run this script with the following command:
python link_model.py entity/project/artifact_name:version -a staging -a production\n
Implement a final job that calls this script and adds the production
alias to the model.
jobs:\n identify_event:\n ...\n test_model:\n ...\n add_production_alias:\n runs-on: ubuntu-latest\n needs: identify_event\n env:\n WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}\n WANDB_ENTITY: ${{ secrets.WANDB_ENTITY }}\n WANDB_PROJECT: ${{ secrets.WANDB_PROJECT }}\n MODEL_NAME: ${{ needs.identify_event.outputs.model_name }}\n steps:\n - name: Echo model name\n run: |\n echo \"Model name: $MODEL_NAME\"\n\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n\n - name: Install dependencies\n run: |\n pip install -r requirements.txt\n pip list\n\n - name: Add production alias\n run: |\n python link_model.py $MODEL_NAME -a production\n
Finally, make sure the workflow works as expected. To try it out again and again for testing purposes, you can just manually add and then delete the staging
alias to any model version in the model registry.
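If you prefer not to click around in the UI, the alias can also be toggled programmatically. The snippet below is a hedged sketch using the public wandb API; the artifact path is a placeholder that you need to replace with a model version from your own registry.

import wandb

api = wandb.Api()
artifact = api.artifact("<entity>/model-registry/<collection>:v0")  # placeholder path

artifact.aliases.append("staging")  # adding the alias fires the automation
artifact.save()

# ...and once the workflow has been tested, reset for the next round:
artifact.aliases.remove("staging")
artifact.save()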
(Optional) Consider adding more checks to the workflow. For example, you could add a step that checks if the model is too large for deployment, runs some further evaluation scripts, or checks if the model is robust to adversarial attacks. Only the imagination sets the limits here.
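For inspiration, here is a hedged sketch of a size gate. The checkpoint path and the limits are made-up numbers that you would replace with your own deployment constraints.

import os

import torch

MAX_FILE_SIZE_MB = 20        # assumed deployment limit
MAX_PARAMETERS = 5_000_000   # assumed deployment limit


def test_model_size(checkpoint_path: str = "models/model.ckpt") -> None:
    """Reject staged models that are too large to deploy."""
    file_size_mb = os.path.getsize(checkpoint_path) / 1e6
    assert file_size_mb < MAX_FILE_SIZE_MB, f"Checkpoint is {file_size_mb:.1f} MB"

    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    state_dict = checkpoint.get("state_dict", checkpoint)  # plain or Lightning-style checkpoint
    n_params = sum(t.numel() for t in state_dict.values() if isinstance(t, torch.Tensor))
    assert n_params < MAX_PARAMETERS, f"Model has {n_params} parameters"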
(Optional) If you have got this far, consider combining principles from the two exercises. Here is an idea: we use the workflow from the second exercise to trigger a workflow that checks a staged model for performance. We then use the cml
framework to automatically create a pull request e.g. use cml pr create
instead of cml comment create
to create a pull request with the results of the performance test. Then, if we are happy with the performance, we can approve that pull request and the production alias is added to the model. This is a better workflow because it allows for human intervention before the model is deployed.
What is the difference between continuous integration and continuous machine learning?
SolutionThere are three key differences between continuous integration and continuous machine learning:
Imagine you get hired in the pharmaceutical industry and are asked to develop a machine learning pipeline that can automatically sort out which drugs are safe and which are not. What level of the MLOps maturity model would you strive to reach?
SolutionThere is really no right or wrong answer here, but in most cases we would actually not aim for level 4. The reason is that the consequences of a bad model in this case can be severe. Therefore, we would probably not want automated retraining and model updates, which is what level 4 is about. Instead, we would probably aim for level 3 where we have automated testing and monitoring of our models but there is still human oversight in the process.
This ends the module on continuous machine learning. As we have hopefully convinced you, it is only the imagination that sets the limits for what you can use Github actions for in your machine learning pipeline. However, we do want to stress that it is important that human oversight is always present in the process. Automation is great, but it should never replace human judgement. This is especially true in machine learning where the consequences of a bad model can be severe if it is used in critical decision making.
Finally, if you have completed the exercises on using the cloud, consider checking out the cml runner launch command that allows you to run your workflows on cloud resources instead of the GitHub actions runners.
"},{"location":"s5_continuous_integration/github_actions/","title":"M17 - Github Actions","text":""},{"location":"s5_continuous_integration/github_actions/#github-actions","title":"GitHub actions","text":"Core Module
With the tests established in the previous module, we are now ready to move on to implementing some continuous integration in our pipeline. As you have probably already realized, testing your code locally may be cumbersome to do, because
For these reasons, we want to automate the testing, such that it is done every time we push to our repository. If we combine this with only pushing to branches and then only merging these branches whenever all automated testing has passed, our code should be fairly safe against unwanted bugs (assuming your tests are well covering your code).
"},{"location":"s5_continuous_integration/github_actions/#github-actions_1","title":"GitHub actions","text":"GitHub actions are the continuous integration solution that GitHub provides. Each of your repositories gets 2,000 minutes of free testing per month which should be more than enough for the scope of this course (and probably all personal projects you do). Getting GitHub actions set up in a repository may seem complicated at first, but workflow files that you create for one repository can more or less be reused for any other repository that you have.
Let's take a look at how a GitHub workflow file is organized:
name
runs-on
, we can specify which operating system we want the workflow to run on.steps
. This is where we specify the actual commands that should be run when the workflow is executed.Start by creating a .github
folder in the root of your repository. Add a sub-folder to that called workflows
.
Go over this page that explains how to do automated testing of Python code in GitHub actions. You do not have to understand everything, but at least get a feeling of what a workflow file should look like.
We have provided a workflow file called tests.yaml
that should run your tests for you. Place this file in the .github/workflows/
folder. The workflow file consists of three steps
First, a Python environment is initiated (in this case Python 3.11)
Next all dependencies required to run the test are installed
Finally, pytest
is called and our tests will be run
Go over the file and try to understand the overall structure and syntax of the file.
tests.yaml
tests.yamlname: \"Run tests\"\n\non:\n push:\n branches: [ master, main ]\n pull_request:\n branches: [ master, main ]\n\njobs:\n build:\n\n runs-on: ubuntu-latest\n\n steps:\n - name: Checkout\n uses: actions/checkout@v4\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n - name: Install dependencies\n run: |\n python -m pip install --upgrade pip\n pip install -r requirements.txt\n pip install -r requirements_tests.txt\n - name: Test with pytest\n run: |\n pytest -v\n
For the script to work you need to define the requirements.txt
and requirements_tests.txt
. The first file should contain all packages required to run your code. The second file contains all additional packages required to run the tests. In your simple case, it may very well be that the second file is empty, however, sometimes additional packages are used for testing that are not strictly required for the scripts to run.
Finally, try pushing the changes to your repository. Hopefully, your tests should just start, and you will after some time see a green check mark next to the hash of the commit. Also, try to inspect the Actions tab where you can see the history of actions run.
Normally we develop code on only one operating system and just hope that it will work on other operating systems. However, continuous integration enables us to automatically test on other systems than the one we are using.
The provided tests.yaml
only runs on one operating system. Which one?
Alter the file such that it executes the test on the two other main operating systems that exist. You can find information on available operating systems also called runners here
SolutionWe can \"parametrize\" of script to run on different operating systems by using the strategy
attribute. This attribute allows us to define a matrix of values that the workflow will run on. The following code will run the tests on ubuntu-latest
, windows-latest
, and macos-latest
:
jobs:\n build:\n runs-on: ${{ matrix.os }}\n strategy:\n matrix:\n os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n
Can you also figure out how to run the tests using different Python versions?
SolutionJust add another line to the strategy
attribute that specifies the Python version and use the value in the setup Python action. The following code will run the tests on Python versions 3.10, 3.11 and 3.12:
jobs:\n build:\n runs-on: ${{ matrix.os }}\n strategy:\n matrix:\n os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n python-version: [\"3.10\", \"3.11\", \"3.12\"]\n\n steps:\n - uses: actions/checkout@v4\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: ${{ matrix.python-version }}\n
If you push the changes above you will maybe see that whenever one of the tests in the matrix fails, it will automatically cancel the other tests. This is for saving time and resources. However, sometimes you want all the tests to run even if one fails. Can you figure out how to do that?
SolutionYou can set the fail-fast
attribute to false
under the strategy
attribute:
jobs:\n build:\n runs-on: ${{ matrix.os }}\n strategy:\n fail-fast: false\n matrix:\n os: [\"ubuntu-latest\", \"windows-latest\", \"macos-latest\"]\n python-version: [\"3.10\", \"3.11\", \"3.12\"]\n
As the workflow is currently implemented, GitHub actions will destroy every downloaded package when the workflow has been executed. To improve this we can take advantage of caching
:
Figure out how to implement caching
in your workflow file. You can find a guide here and here.
steps:\n- uses: actions/checkout@v4\n- uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip' # caching pip dependencies\n- run: pip install -r requirements.txt\n
When you have implemented a caching system go to Actions->Caches
in your repository and make sure that they are correctly added. It should look something like the image below
Measure how long your workflow takes before and after adding caching
to your workflow. Did it improve the runtime of your workflow?
(Optional) Code coverage can also be added to the workflow file by uploading it as an artifact after running the coverage. Follow the instructions in this post on how to do it.
With different checks in place, it is a good time to learn about branch protection rules. A branch protection rule is essentially some kind of guarding that prevents you from merging code into a branch before certain conditions are met. In this exercise, we will create a branch protection rule that requires all checks to pass before merging code into the main branch.
Start by going into your Settings -> Rules -> Rulesets
and create a new branch ruleset. See the image below.
In the ruleset start by giving it a name and then set the target branches to be Default branch
. This means that the ruleset will be applied to your master/main branch. As shown in the image below, two rules may be particularly beneficial when you later start working with other people:
The first rule to consider is Require a pull request before merging. As the name suggests this rule requires that changes that are to be merged into the main branch must be done through a pull request. This is a good practice as it allows for code review and testing before the code is merged into the main branch. Additionally, this opens the option to specify that the code must be reviewed (or at least approved) by a certain number of people.
The second rule to consider is Require status checks to pass. This rule makes sure that our workflows are passing before we can merge code into the main branch. You can select which workflows are required, as some may be nice to have passing but not strictly needed.
Finally, if you think the rules are a bit too restrictive, you can always allow the repository admin (i.e. you) to bypass the rules by adding Repository admin
to the bypass list. Implement the following rules:
If you have created the rules correctly you should see something like the image below when you try to merge a pull request. In this case, all three checks are required to pass before the code can be merged. Additionally, a single reviewer is required to approve the code. A bypass rule is also setup for the repository admin.
One problem you may have encountered is running the tests that have to do with your data, the core problem being that your data is not stored in GitHub (assuming you have done module M8 - DVC) and therefore cannot be tested. However, we can download the data while running our continuous integration. Let's try to set that up:
The first problem is that we need our continuous integration pipeline to be able to authenticate with our storage solution. We can take advantage of an authentication file that is created the first time we push with DVC. It is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
~/Library/Caches
~/.cache
This is the typical location, but it may vary depending on what distro you are running
{user}/AppData/Local
Find the file. The content should look similar to this (only some fields are shown):
{\n \"access_token\": ...,\n \"client_id\": ...,\n \"client_secret\": ...,\n \"refresh_token\": ...,\n ...\n}\n
The content of that file should be treated as a password and not shared with the world and the relevant question is therefore how to use this info in a public repository. The answer is GitHub secrets, where we can store information, and access it in our workflow files and it is still not public. Navigate to the secrets option (as shown below) and create a secret with the name GDRIVE_CREDENTIALS_DATA
that contains the content of the file you found in the previous exercise.
Afterward, add the following code to your workflow file:
- uses: iterative/setup-dvc@v1\n- name: Get data\n run: dvc pull\n env:\n GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n
that runs dvc pull
using the secret authentication file. For help you can visit this small repository that implements the same workflow.
Finally, add the changes, commit, push and confirm that everything works as expected. You should now be able to run unit tests that depend on your input data.
In module M6 on good coding practices (optional module) of the course you were introduced to a couple of good coding practices such as being consistent with your coding style, how your Python packages are sorted and that your code follows certain standards. All this was done using the ruff
framework. In this set of exercises, we will create GitHub workflows that will automatically test for this.
Create a new workflow file called codecheck.yaml
, that implements the following three steps
Setup Python environment
Installs ruff
Runs ruff check
and ruff format
on the repository
(HINT: You should be able to just change the last steps of the tests.yaml
workflow file)
name: Code formatting\n\non:\n push:\n branches:\n - main\n pull_request:\n branches:\n - main\n\njobs:\n format:\n runs-on: ubuntu-latest\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n cache: 'pip'\n cache-dependency-path: setup.py\n - name: Install dependencies\n run: |\n pip install ruff\n pip list\n - name: Ruff check\n run: ruff check .\n - name: Ruff format\n run: ruff format .\n
In addition to ruff
we also used mypy
in those sets of exercises for checking if the typing we added to our code was good enough. Add another step to the codecheck.yaml
file which runs mypy
on your repository.
Try to make sure that all steps pass on your repository. Especially mypy
can be hard to get passing, so this exercise formally only requires you to get ruff
passing.
(Optional) As you have probably already experienced in module M9 on docker, it can be cumbersome to build docker images, sometimes taking a couple of minutes to build each time we make changes to our code base. For this reason, we only want to build a new image every time we commit our code, because that should mark that we believe the code to be working at that point. Thus, let's automate the process of building our docker images using Github actions. Do note that in a future module we will look at how to build containers using cloud providers, and this exercise is therefore very much optional.
Start by making sure you have a dockerfile in your repository. If you do not have one, you can use the following simple dockerfile:
FROM busybox\nCMD echo \"Howdy cowboy\"\n
Push the dockerfile to your repository
Next, create a Docker Hub account
Within Docker Hub create an access token by going to Settings -> Security
. Click the New Access Token
button and give it a name that you recognize.
Copy the newly created access token and head over to your GitHub repository online. Go to Settings -> Secrets -> Actions
and click the New repository secret
. Copy over the access token and give it the name DOCKER_HUB_TOKEN
. Additionally, add two other secrets DOCKER_HUB_USERNAME
and DOCKER_HUB_REPOSITORY
that contain your docker username and docker repository name respectively.
Next, we are going to construct the actual Github actions workflow file
name: Docker Image continuous integration\n\non:\n push:\n branches: [ master ]\n\njobs:\n build:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v4\n - name: Build the Docker image\n run: |\n echo \"${{ secrets.DOCKER_HUB_TOKEN }}\" | docker login \\\n -u \"${{ secrets.DOCKER_HUB_USERNAME }}\" --password-stdin docker.io\n docker build . --file Dockerfile \\\n --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n docker push \\\n docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n
The first part of the workflow file should look somewhat recognizable. However, the last three lines are where all the magic happens. Carefully go through them and figure out what they do. If you want some help you can look at the help page for docker login
, docker build
and docker push
.
Upload the workflow to your GitHub repository and check that it is being executed. If everything works you should be able to see the built docker image in your container repository on Docker Hub.
Make sure that you can execute docker pull
locally to pull down the image that you just continuously built
(Optional) To test that the container works directly in GitHub you can also try to include an additional step that runs the container.
- name: Run container\n run: |\n docker run ...\n
A great feature that GitHub provides is the ability to have bots help you with maintaining your repository. One of the most useful bots is called Dependabot
. As the name suggests, Dependabot
helps you keep your dependencies up to date. This is important because updated dependencies often contain fixes for bugs or patches for security vulnerabilities that you want to have in your code.
To get dependabot working in your repository, we need to add a single configuration file to your repository. Create a file called .github/dependabot.yaml
. Look through the documentation for how to set up the file such that it updates your Python dependencies on a weekly basis.
The following code will check for updates in the pip
ecosystem every week e.g. it automatically will look for requirements.txt
files and update the packages in there.
version: 2\nupdates:\n - package-ecosystem: \"pip\"\n directory: \"/\"\n schedule:\n interval: \"weekly\"\n
Insights
tab and then the Dependency graph
tab. From here, under the Dependabot
tab, you should be able to see if the bot has correctly identified what files to track and if it has found any updates.
Click the Recent update jobs
to see the history of Dependabot checking for updates. If there are no updates you can try to click the Check for updates
button to force Dependabot to check for updates.
At this point Dependabot should hopefully have found some updates and created one or more pull requests. If it has not done so, you most likely need to update your requirements file such that your dependencies are correctly restricted/specified, e.g.
# lets assume pytorch v2.5 is the latest version\n\n# these different specifications will not trigger dependabot because\n# the latest version is included in the specification\ntorch\ntorch == 2.5\ntorch >= 2.5\ntorch ~= 2.5\n\n# these specifications will trigger dependabot because the latest\n# version is not included\ntorch < 2.5\ntorch == 2.4\ntorch <= 2.4\n
If you have a pull request from Dependabot, check it out and see if it looks good. If it does, you can merge it.
(Optional) Dependabot can also help keep our GitHub Actions pipelines up-to-date. As you may have realized during this module, we write statements like the following in our workflow files:
...\njobs:\n build:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v4\n...\n
The @v4
specifies that we are using version 4 of the actions/checkout
action. This means that if a new version of the action is released, we will not automatically get the new version. Dependabot can help us with this. Try adding to the dependabot.yaml
file that Dependabot should also check for updates in the GitHub Actions ecosystem.
version: 2\nupdates:\n - package-ecosystem: \"pip\"\n directory: \"/\"\n schedule:\n interval: \"weekly\"\n - package-ecosystem: \"github-actions\"\n directory: \"/\"\n schedule:\n interval: \"weekly\"\n
When working with GitHub actions you will often encounter the following 4 concepts:
Try to define them in your own words.
Solutionyaml
file that defines the instructions to be executed on specific events. Needs to be placed in the .github/workflows
folder.The on
attribute specifies upon which events the workflow will be triggered. Assume you have set the on
attribute to the following:
on:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n schedule:\n - cron: \"0 0 * * *\"\n workflow_dispatch: {}\n
What 4 events would trigger the execution of that action?
SolutionPushing to the main branch would trigger it. Opening a pull request against the main branch would also trigger it. The cron schedule (0 0 * * *) triggers it every day at midnight. Finally, the workflow can be executed by manually triggering it through the GitHub UI, for example, shown below
This ends the module on GitHub workflows. If you are more interested in this topic you can check out module M31 on documentation which first includes locally building some documentation for your project and afterward use GitHub actions for deploying it to GitHub Pages. Additionally, GitHub also has a lot of templates already for running different continuous integration tasks. If you try to create a workflow file directly in GitHub you may encounter the following page
We highly recommend checking this out if you want to write any other kind of continuous integration pipeline in GitHub actions. We can also recommend this repository that has a list of awesome actions and check out the act repository which is a tool for running your GitHub Actions locally!
"},{"location":"s5_continuous_integration/pre_commit/","title":"M18 - Pre-commit","text":""},{"location":"s5_continuous_integration/pre_commit/#pre-commit","title":"Pre-commit","text":"One of the cornerstones of working with git is remembering to commit your work often. Often committing makes sure that it is easier to identify and revert unwanted changes that you have introduced, because the code changes becomes smaller per commit.
However, as you have hopefully already seen in the course, there are a lot of mental tasks to do before you actually write git commit
in the terminal. The most basic thing is of course making sure that you have saved all your changes and are not committing an out-of-date file. However, this also includes tasks such as styling, formatting, making sure all tests succeed etc. All these mental to-do notes do not mix well with the principle of committing often, because in principle you have to do them every time.
The obvious solution to this problem is to automate all or some of our mental tasks every time that we do a commit. This is where pre-commit hooks come into play, as they can help us attach additional tasks that should be run every time that we do a git commit
.
Pre-commit simply works by inserting whatever workflow we want to automate in between when we run git commit
and when we would afterwards run git push
.
The system works by looking for a file called .pre-commit-config.yaml
that we can configure. If we execute
pre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8\n
you should get a sample file that looks like
# See https://pre-commit.com for more information\n# See https://pre-commit.com/hooks.html for more hooks\nrepos:\n- repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v3.2.0\n hooks:\n - id: trailing-whitespace\n - id: end-of-file-fixer\n - id: check-yaml\n - id: check-added-large-files\n
the file structure is very simple:
id
of the different hooks. The id
corresponds to an id
in this file: https://github.com/pre-commit/pre-commit-hooks/blob/master/.pre-commit-hooks.yamlWhen we are done defining our .pre-commit-config.yaml
we just need to install it
pre-commit install\n
this will make sure that the file is automatically executed whenever we run git commit
Install pre-commit
pip install pre-commit\n
Consider adding pre-commit
to a requirements_dev.txt
file, as it is a development tool.
Next create the sample file
pre-commit sample-config > .pre-commit-config.yaml\n
The sample file already contains 4 hooks. Make sure you understand what each does and whether you need them at all.
pre-commit
works by hooking into the git commit
command, running whenever that command is run. For this to work, we need to install the hooks into git commit
. Run
pre-commit install\n
to do this.
Try to commit your recently created .pre-commit-config.yaml
file. You will likely not do anything, because pre-commit
only check files that are being committed. Instead try to run
pre-commit run --all-files\n
that will check every file in your repository.
Try adding at least another check from the base repository to your .pre-commit-config.yaml
file.
In this case we have added the check-json
hook to our .pre-commit-config.yaml
file, which will automatically check that all JSON files are valid.
repos:\n- repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v3.2.0\n hooks:\n - id: trailing-whitespace\n - id: end-of-file-fixer\n - id: check-yaml\n - id: check-added-large-files\n - id: check-json\n
If you have completed the optional module M7 on good coding practice you will have learned about the linter ruff
. ruff
comes with its own pre-commit hook. Try adding that to your .pre-commit-config.yaml
file and see what happens when you try to commit files.
This is one way to add the ruff
pre-commit hook. We run both the ruff
and ruff-format
hooks, and we also add the --fix
argument to the ruff
hook to try to fix what is possible.
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
  rev: v0.4.7
  hooks:
    # try to fix what is possible
    - id: ruff
      args: ["--fix"]
    # perform formatting updates
    - id: ruff-format
    # validate if all is fine with preview mode
    - id: ruff
(Optional) Add more hooks to your .pre-commit-config.yaml
.
Sometimes you are in a hurry, so make sure that you also can do commits without running pre-commit
e.g.
git commit -m <message> --no-verify\n
Finally, figure out how to disable pre-commit
again (if you get tired of it).
Assuming you have completed the module on GitHub Actions, let's try to add a pre-commit
workflow that automatically runs your pre-commit
checks every time you push to your repository and then automatically commits those changes to your repository. We recommend that you make use of
pre-commit
pre-commit
makes. As an alternative, you can configure the CI tool provided by the creators of pre-commit
.
The workflow first uses the pre-commit
action to install and run the pre-commit
checks. Importantly we run it with continue-on-error: true
to make sure that the workflow does not fail if the checks fail. Next, we use git diff
to list the changes that pre-commit
has made and then we use the git-auto-commit-action
to commit those changes.
name: Pre-commit CI\n\non:\n pull_request:\n push:\n branches: [main]\n\njobs:\n pre-commit:\n name: Check pre-commit\n runs-on: ubuntu-latest\n\n permissions:\n contents: write\n\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Set up Python\n uses: actions/setup-python@v5\n with:\n python-version: 3.11\n\n - name: Install pre-commit\n uses: pre-commit/action@v3.0.1\n continue-on-error: true\n\n - name: List modified files\n run: |\n git diff --name-only\n\n - name: Commit changes\n uses: stefanzweifel/git-auto-commit-action@v5\n with:\n commit_message: Pre-commit fixes\n commit_options: '--no-verify'\n
That was all about how pre-commit
can be used to automate tasks. If you want to dive deeper into the topic, you can check out this page on how to define your own pre-commit
hooks.
Core Module
What often comes to mind for many developers when discussing continuous integration (CI) is code testing. Continuous integration should ensure that whenever a codebase is updated it is automatically tested, such that if bugs have been introduced into the codebase they are caught early on. If you look at the MLOps cycle, continuous integration is one of the cornerstones of the operations part. However, it should be noted that applying continuous integration does not magically ensure that your code does not break. Continuous integration is only as strong as the tests that are automatically executed. Continuous integration simply structures and automates this.
Quote
Continuous Integration doesn\u2019t get rid of bugs, but it does make them dramatically easier to find and remove. Martin Fowler, Chief Scientist, ThoughtWorks
Image creditThe kind of tests we are going to look at are called unit testing. Unit testing refers to the practice of writing tests that test individual parts of your code base for correctness. By unit, you can therefore think of a function, module or in general any object. By writing tests in this way it should be very easy to isolate which part of the code broke after an update to the code base. Another way to test your code base would be through integration testing which is equally important but we are not going to focus on it in this course.
Unit tests (and integration tests) are not a unique concept to MLOps but are a core concept of DevOps. However, it is important to note that testing machine learning-based systems is much more difficult than traditional systems. The reason for this is that machine learning systems depend on data, which influences the state of our system. For this reason, we not only need unit tests and integration tests of our code, we also need data testing, infrastructure testing and more monitoring to check that we stay within the data distribution we are training on (more on this in module M25 on data drifting). This added complexity is illustrated in the figure below.
"},{"location":"s5_continuous_integration/unittesting/#pytest","title":"Pytest","text":"Before we can begin to automate testing of our code base we of course need to write the tests first. It is both a hard and tedious task to do but arguably the most important aspect of continuous integration. Python offers a couple of different libraries for writing tests. We are going to use pytest
.
The following exercises should be applied to your MNIST repository
The first part of doing continuous integration is writing the unit tests. We do not expect you to cover every part of the code you have developed but try to at least write tests that cover two files. Start by creating a tests
folder.
Read the getting started guide for pytest which is the testing framework that we are going to use
Install pytest:
pip install pytest\n
Write some tests. Below are some guidelines on some tests that should be implemented, but you are of course free to implement more tests. You can at any point check if your tests are passing by typing in a terminal
pytest tests/\n
When you implement a test you need to follow two standards, for pytest
to be able to find your tests. First, any files created (except __init__.py
) should always start with test_*.py
. Secondly, any test implemented needs to be wrapped into a function that again needs to start with test_*
:
# this will be found and executed by pytest\ndef test_something():\n ...\n\n# this will not be found and executed by pytest\ndef something_to_test():\n ...\n
Start by creating a tests/__init__.py
file and fill in the following:
import os\n_TEST_ROOT = os.path.dirname(__file__) # root of test folder\n_PROJECT_ROOT = os.path.dirname(_TEST_ROOT) # root of project\n_PATH_DATA = os.path.join(_PROJECT_ROOT, \"data\") # root of data\n
these can help you refer to your data files during testing. For example, in another test file, I could write
from tests import _PATH_DATA\n
which then contains the root path to my data.
Data testing: In a file called tests/test_data.py
implement at least a test that checks that data gets correctly loaded. By this, we mean that you should check
def test_data():\n dataset = MNIST(...)\n assert len(dataset) == N_train for training and N_test for test\n assert that each datapoint has shape [1,28,28] or [784] depending on how you choose to format\n assert that all labels are represented\n
where N_train
should be either 30.000 or 50.000 depending on if you are just the first subset of the corrupted MNIST data or also including the second subset. N_test
should be 5000.
from my_project.data import corrupt_mnist\n\ndef test_data():\n train, test = corrupt_mnist()\n assert len(train) == 30000\n assert len(test) == 5000\n for dataset in [train, test]:\n for x, y in dataset:\n assert x.shape == (1, 28, 28)\n assert y in range(10)\n train_targets = torch.unique(train.tensors[1])\n assert (train_targets == torch.arange(0,10)).all()\n test_targets = torch.unique(test.tensors[1])\n assert (test_targets == torch.arange(0,10)).all()\n
Model testing: In a file called tests/test_model.py
implement at least a test that checks for a given input with shape X that the output of the model has shape Y.
from my_project.model import MyAwesomeModel\n\ndef test_model():\n model = MyAwesomeModel()\n x = torch.randn(1, 1, 28, 28)\n y = model(x)\n assert y.shape == (1, 10)\n
Training testing: In a file called tests/test_training.py
implement at least one test that asserts something about your training script. You are here given free hands on what should be tested but try to test something that risks being broken when developing the code.
Good code raises errors and gives out warnings in appropriate places. This is often in the case of some invalid combination of input to your script. For example, your model could check for the size of the input given to it (see code below) to make sure it corresponds to what you are expecting. Not implementing such errors would still result in PyTorch failing at a later point due to shape errors, however, these custom errors will probably make more sense to the end user. Implement at least one raised error or warning somewhere in your code and use either pytest.raises
or pytest.warns
to check that they are correctly raised/warned. As inspiration, the following implements ValueError
in code belonging to the model:
# src/models/model.py\ndef forward(self, x: Tensor):\n if x.ndim != 4:\n raise ValueError('Expected input to a 4D tensor')\n if x.shape[1] != 1 or x.shape[2] != 28 or x.shape[3] != 28:\n raise ValueError('Expected each sample to have shape [1, 28, 28]')\n
Solution The above example would be captured by a test looking something like this:
# tests/test_model.py\nimport pytest\nfrom my_project.model import MyAwesomeModel\n\ndef test_error_on_wrong_shape():\n model = MyAwesomeModel()\n with pytest.raises(ValueError, match='Expected input to a 4D tensor')\n model(torch.randn(1,2,3))\n with pytest.raises(ValueError, match='Expected each sample to have shape [1, 28, 28]')\n model(torch.randn(1,1,28,29))\n
A test is only as good as the error message it gives, and by default, assert
will only report that the check failed. However, we can help ourselves and others by adding strings after assert
like
assert len(train_dataset) == N_train, \"Dataset did not have the correct number of samples\"\n
Add such comments to the assert statements you just did in the previous exercises.
The tests that involve checking anything that has to do with our data, will of course fail if the data is not present. To future-proof our code, we can take advantage of the pytest.mark.skipif
decorator. Use this decorator to skip your data tests if the corresponding data files do not exist. It should look something like this
import os.path\n@pytest.mark.skipif(not os.path.exists(file_path), reason=\"Data files not found\")\ndef test_something_about_data():\n ...\n
You can read more about skipping tests here
After writing the different tests, make sure that they are passing locally.
We often want to check a function/module for various input arguments. In this case, you could write the same test over and over again for different inputs, but pytest
also has built-in support for this with the use of the pytest.mark.parametrize decorator. Implement a parametrized test and make sure that it runs for different inputs.
@pytest.mark.parametrize(\"batch_size\", [32, 64])\ndef test_model(batch_size: int) -> None:\n model = MyModel()\n x = torch.randn(batch_size, 1, 28, 28)\n y = model(x)\n assert y.shape == (batch_size, 10)\n
There is no direct way of measuring how good the tests you have written are. However, what we can measure is code coverage. Code coverage refers to the percentage of your codebase that gets run when all your tests are executed. Having high coverage at least means that most of your code is actually executed when your tests run.
Install coverage
pip install coverage\n
Instead of running your tests directly with pytest
, now do
coverage run -m pytest tests/\n
To get a simple coverage report, type
coverage report\n
which will give you the coverage percentage for each of your files. You can also write
coverage report -m\n
to get the exact lines that were missed by your tests.
Finally, try to increase the coverage by writing a new test that runs some of the lines in your codebase that are not covered yet.
Often coverage
reports code coverage for files that we do not want coverage for, for example your test files. Figure out how to configure coverage
to exclude some files.
You need to set the omit
option. This can either be done when running coverage run
or coverage report
such as:
coverage run --omit=\"tests/*\" -m pytest tests/\n# or\ncoverage report --omit=\"tests/*\"\n
As an alternative you can specify this in your pyproject.toml
file:
[tool.coverage.run]\nomit = [\"tests/*\"]\n
Assuming you have a code coverage of 100%, would you expect that no bugs are present in your code?
SolutionNo, code coverage is not a guarantee that your code is bug-free. It is just a measure of how many lines of code are run when your tests are executed. Therefore, there may still be some corner case that is not covered by your tests and will result in a bug. However, having a high code coverage is a good indicator that you have tested your code.
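As a small, hypothetical illustration of this point: the function below has 100% line coverage under the accompanying test, yet it still contains a bug that the test never exercises.
def relu(x: float) -> float:\n # bug: abs(x) happens to agree with relu for non-negative inputs, but not for negative ones\n return abs(x)\n\ndef test_relu():\n # both asserts pass and every line of relu is executed, yet relu(-1) == 1 is wrong\n assert relu(2) == 2\n assert relu(0) == 0\n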
Consider the following code:
@pytest.mark.parametrize(\"network_size\", [10, 100, 1000])\n@pytest.mark.parametrize(\"device\", [\"cpu\", \"cuda\"])\nclass MyTestClass:\n @pytest.mark.parametrize(\"network_type\", [\"alexnet\", \"squeezenet\", \"vgg\", \"resnet\"])\n @pytest.mark.parametrize(\"precision\", [torch.half, torch.float, torch.double])\n def test_network1(self, network_size, device, network_type, precision):\n if device == \"cuda\" and not torch.cuda.is_available():\n pytest.skip(\"Test requires cuda\")\n model = MyModelClass(network_size, network_type).to(device=device, dtype=precision)\n ...\n\n @pytest.mark.parametrize(\"add_dropout\", [True, False])\n def test_network2(self, network_size, device, add_dropout):\n if device == \"cuda\" and not torch.cuda.is_available():\n pytest.skip(\"Test requires cuda\")\n model = MyModelClass2(network_size, add_dropout).to(device)\n ...\n
how many tests are executed when running the above code?
SolutionThe answer depends on whether or not we are running on a GPU-enabled machine. The test_network1
has 4 parameters, network_size, device, network_type, precision
, that respectively can take on 3, 2, 4, 3
values meaning that in total that test will be running 3x2x4x3=72
times with different parameters on a GPU-enabled machine and 36 on a machine without a GPU. A similar calculation can be done for test_network2
, which only has three factors network_size, device, add_dropout
that result in 3x2x2=12
test on a GPU-enabled machine and 6 on a machine without a GPU. In total, that means 84 tests would run on a machine with a GPU and 42 on a machine without a GPU.
That covers the basics of writing unit tests for Python code. We want to note that pytest
of course is not the only framework for doing this. Python has a built-in framework called unittest for this as well (but pytest
offers a few more features). Another open-source framework that you could choose to check out is hypothesis, which can help catch errors in corner cases of your code. In addition to writing unit tests, it is also highly recommended to test the code that you include in the docstrings belonging to your functions and modules, to make sure that any code in your documentation is also correct. For such testing, we can highly recommend using Python's built-in framework doctest, sketched below.
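As a minimal sketch of how doctest works (the function and module here are made up for illustration), a docstring can embed example usage that is then checked automatically:
# my_project/utils.py\ndef add(a: int, b: int) -> int:\n '''Add two integers.\n\n Example:\n >>> add(2, 3)\n 5\n '''\n return a + b\n
Running python -m doctest my_project/utils.py -v should then execute the example embedded in the docstring and fail if the output does not match.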
Slides
Learn how to get started with Google Cloud Platform and how to interact with the SDK.
M20: Cloud Setup
Learn how to use different GCP services to support your machine learning pipeline.
M21: Cloud Services
Running computations locally is often sufficient when only playing around with code in the initial phase of development. However, to scale your experiments you will need more computing power than what your standard laptop/desktop can offer. You probably already have experience with running on a local cluster or similar but today's topic is about utilizing cloud computing.
Image creditThere exist numerous cloud computing providers, with some of the biggest being:
They all have slight advantages and disadvantages over each other. In this course, we are going to focus on the Google Cloud Platform, because they have been kind enough to sponsor $50 of cloud credit for each student. If you happen to run out of credit, you can also get some free credit for a limited amount of time when you sign up with a new account. What's important to note is that these different cloud providers all offer the same set of services, and learning how to use the services of one cloud provider in many cases translates to also knowing how to use the same services at another cloud provider. The services are called something different and can have a slightly different interface/interaction pattern, but in the end it does not matter much.
Today's exercises are about getting to know how to work with the cloud. If you are in doubt about anything or want to deep dive into some topics, I can recommend watching this series of videos or going through the general docs.
Learning objectives
The learning objectives of this session are:
Core Module
Google Cloud Platform (GCP) is the cloud service provided by Google. The key concept, or selling point, of any cloud provider is the idea of near-infinite resources. Without the cloud, it simply is not feasible to do many modern deep learning and machine learning tasks because they cannot be scaled locally.
The image below shows all the different services that the Google Cloud Platform offers. We are going to be working with around 10 of these services throughout the course. Therefore, if you finish the exercises early, I highly recommend that you dive deeper into the Google Cloud Platform.
Image credit"},{"location":"s6_the_cloud/cloud_setup/#exercises","title":"\u2754 Exercises","text":"As the first step, we are going to get you some Google Cloud credits.
Go to https://learn.inside.dtu.dk. Go to this course. Find the recent message where there should be a download link and instructions on how to claim the $50 cloud credit. Please do not share the link anywhere as there is a limited number of coupons. If you are not officially taking this course at DTU, Google gives $300 in cloud credit when you sign up with a new account. NOTE that you need to provide a credit card for this, so make sure to closely monitor your credit use so you do not end up spending more than the free credit.
Log in to the homepage of GCP. It should look like this:
Go to billing and make sure that your account is showing $50 of cloud credit.
Make sure to also check out the Reports
tab throughout the course. When you start using some of the cloud services, these tabs will update with info about how long you can keep going before your cloud credit runs out. Make sure that you monitor this page as you will not be given another coupon.
One way to stay organized within GCP is to create projects.
Create a new project called dtumlops
. When you click create
you should get a notification that the project is being created. The notification bell is a good way to keep track of how the processes you are running are doing throughout the course.
Next, let's do the local setup on your laptop. We are going to install gcloud
, which is part of the Google Cloud SDK. gcloud
is the command line interface for working with our Google Cloud account. Nearly everything that we can do through the web interface we can also do through the gcloud
interface. Follow the installation instructions here for your specific OS.
After installation, try in a terminal to type:
gcloud -h\n
the command should show the help page. If not, something went wrong in the installation (you may need to restart after installing).
Now login by typing
gcloud auth login\n
you should be sent to a web page where you link your cloud account to the gcloud
interface. Afterward, also run this command:
gcloud auth application-default login\n
If you at some point want to revoke the authentication you can type:
gcloud auth revoke\n
Next, you will need to set the project that we just created as the default project. In your web browser under project info, you should be able to see the Project ID
belonging to your dtumlops
project. Copy this and type the following command in a terminal
gcloud config set project <project-id>\n
You can also get the project info by running
gcloud projects list\n
Next, install the Google Cloud Python API:
pip install --upgrade google-api-python-client\n
Make sure that the Python interface is also installed. In a Python terminal type
import googleapiclient\n
this should work without any errors.
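If you want to see the client library in action, here is a minimal sketch (an illustration only, not part of the exercises) that lists the Compute Engine instances in your project; it relies on the application-default credentials you just set up, and you need to replace <project-id> and the zone with your own values:
from googleapiclient import discovery\n\n# build a client for the Compute Engine API using your application-default credentials\nservice = discovery.build('compute', 'v1')\n\n# list all instances in the given project and zone (the list will be empty until you create some)\nresult = service.instances().list(project='<project-id>', zone='europe-west1-b').execute()\nprint(result.get('items', []))\n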
(Optional) If you are using VSCode you can also download the relevant extension called Cloud Code
. After installing it you should see a small Cloud Code
button in the action bar.
Finally, we need to activate a couple of developer APIs that are not activated by default. In a terminal write
gcloud services enable apigateway.googleapis.com\ngcloud services enable servicemanagement.googleapis.com\ngcloud services enable servicecontrol.googleapis.com\n
you can always check which services are enabled by typing
gcloud services list\n
After following these steps your laptop should hopefully be set up for using GCP locally. You are now ready to use their services, both locally on your laptop and in the cloud console.
"},{"location":"s6_the_cloud/cloud_setup/#iam-and-quotas","title":"IAM and Quotas","text":"A big part of using the cloud in a bigger organization has to do with Admin and quotas. Admin here in general refers to the different roles that users of GCP and quotas refer to the amount of resources that a given user has access to. For example, one employee, let's say a data scientist, may only be granted access to certain GCP services that have to do with the development and training of machine learning models, with X
amounts of GPUs available to use to make sure that the employee does not spend too much money. Another employee, a DevOps engineer, probably does not need access to the same services and not necessarily the same resources.
In this course, we are not going to focus too much on this aspect but it is important to know that it exists. One feature you are going to need for doing the project is how to share a project with other people. This is done through the IAM (Identities and Access Management) page. Simply click the Grant Access
button, search for the email of the person you want to share the project with and give them either Viewer
, Editor
or Owner
access, depending on what you want them to be able to do. The figure below shows how to do this.
What we are going to go through right now is how to increase the quotas for how many GPUs you have available for your project. By default, for any free accounts in GCP (or accounts using teaching credits) the default quota for GPUs that you can use is either 0 or 1 (their policies sometimes change). We will in the exercises below try to increase it.
"},{"location":"s6_the_cloud/cloud_setup/#exercises_1","title":"\u2754 Exercises","text":"Start by enabling the Compute Engine
service. Simply search for it in the top search bar. It should bring you to a page where you can enable the service (may take some time). We are going to look more into this service in the next module.
Next go to the IAM & Admin
page, again search for it in the top search bar. The remaining steps are illustrated in the figure below.
Go to the quotas page
In the search field search for GPUs (all regions)
(needs to match exactly, the search field is case sensitive), such that you get the same quota as in the image.
The limit column shows your current quota for the number of GPUs you can use. Additionally, to the right of the limit, you can see the current usage. It is worth checking this if you are ever in doubt whether a job is actually running on a GPU or not.
Click the quota and afterward the Edit
quotas button.
In the pop-up window, increase your limit to either 1 or 2.
After sending your request you can try clicking the Increase requests
tab to see the status of your request
If you ever run into errors when working with GPUs that contain statements about quotas
you can always try to go to this page and see what you are allowed to use currently and try to increase it. For example, when you get to training machine learning models using Vertex AI in the next module, you would most likely need to ask for a quota increase for that service as well.
Finally, we want to note that a quota increase is sometimes not allowed within 24 hours of creating an account. If your request gets rejected, we recommend waiting a day and trying again. If this still does not work, you may need to use their services some more to make sure you are not a bot that wants to mine crypto.
"},{"location":"s6_the_cloud/cloud_setup/#service-accounts","title":"Service accounts","text":"At some point, you will most likely need to use a service account. A service account is a virtual account that is used to interact with the Google Cloud API. It it intended for non-human users e.g. other machines, services, etc. For example, if you want to launch a training job from Github Actions, you will need to use a service account for authentication between Github and GCP. You can read more about how to create a service account here.
"},{"location":"s6_the_cloud/cloud_setup/#exercises_2","title":"\u2754 Exercises","text":"Go to the IAM & Admin
page and click on Service accounts
. Alternatively, you can search for it in the top search bar.
Click the Create Service Account
button. On the next page, you can give the service account a name and an ID (automatically generated, but you can change it if you want). You can also give it a description. Leave the rest as default and click Create
.
Next, let's give the service account some permissions. Click on the service account you just created. In the Permissions
tab click Add permissions
. Your job now is to give the service account the lowest possible permissions such that it can download files from a bucket. Look at this page and try to find the role that fits the description.
The role you are looking for is Storage Object Viewer
. This role allows the service account to list objects in a bucket and download objects, but nothing more. Thus even if someone gets access to the service account they cannot delete objects in the bucket.
To use the service account later we need to create a key for it. Click on the service account and then the Keys
tab. Click Add key
and then Create new key
. Choose the JSON
key type and click Create
. This will download a JSON file to your computer. This file is the key to the service account and should be kept secret. If you lose it you can always create a new one.
Finally, everything we just did from creating the service account, giving it permissions, and creating a key can also be done through the gcloud
interface. Try to find the commands to do this in the documentation.
The commands you are looking for are:
gcloud iam service-accounts create global-service-account \\\n --description=\"My first service account\" --display-name=\"global-service-account\"\ngcloud projects add-iam-policy-binding $(GCP_PROJECT_NAME) \\\n --member=\"serviceAccount:global-service-account@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\" \\\n --role=\"roles/storage.objectViewer\"\ngcloud iam service-accounts keys create service_account_key.json \\\n --iam-account=global-service-account@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\n
where $(GCP_PROJECT_NAME)
is the name of your project. If you then want to delete the service account you can run
gcloud iam service-accounts delete global-service-account@$(GCP_PROJECT_NAME).iam.gserviceaccount.com\n
What considerations should you take into account when choosing a GCP region for running a new application?
SolutionA series of factors may influence your choice of region, including:
The 3 major cloud providers all have the same services, but they are called something different depending on the provider. What are the corresponding names of these GCP services in AWS and Azure?
It is important to know these correspondences to navigate blog posts etc. about MLOps on the internet.
Solution GCP | AWS | Azure
Compute Engine | Elastic Compute Cloud (EC2) | Virtual Machines
Cloud Storage | Simple Storage Service (S3) | Blob Storage
Cloud Functions | Lambda | Functions (Serverless Compute)
Cloud Run | App Runner, Fargate, Lambda | Container Apps, Container Instances
Cloud Build | CodeBuild | DevOps
Vertex AI | SageMaker | AI Platform
Why is it always important to assign the lowest possible permissions to a service account?
SolutionThe reason is that if someone gets access to the service account, they can only do what the service account is allowed to do. If the service account has permission to delete objects in a bucket, an attacker can delete all the objects in the bucket. For this reason, in most cases multiple service accounts are used, each with different permissions. This setup follows the principle of least privilege.
Core Module
In this set of exercises, we are going to get more familiar with using some of the resources that GCP offers.
"},{"location":"s6_the_cloud/using_the_cloud/#compute","title":"Compute","text":"The most basic service of any cloud provider is the ability to create and run virtual machines. In GCP this service is called Compute Engine API. A virtual machine allows you to essentially run an operating system that behaves like a completely separate computer. There are many reasons why one to use virtual machines:
Virtual machines allow you to scale your operations, essentially giving you access to infinitely many individual computers
Virtual machines allow you to use large-scale hardware. For example, if you are developing a deep learning model on your laptop and want to know the inference time for a specific hardware configuration, you can just create a virtual machine with those specs and run your model.
Virtual machines allow you to run processes in the \"background\". If you want to train a model for a week or more, you do not want to do this on your laptop as you cannot move it or do anything with it while it is training. Virtual machines allow you to just launch a job and forget about it (at least until you run out of credit).
We are now going to start using the cloud.
Click on the Compute Engine
tab in the sidebar on the homepage of GCP.
Click the Create Instance
button. You will see the following image below.
Give the virtual machine a meaningful name, and set the location to one that is closer to where you are (to reduce latency, we recommend europe-west1
). Finally, try to adjust the configuration a bit. Can you find at least two settings that alter the price of the virtual machine?
In general, the price of a virtual machine is determined by the class of hardware attached to it. Higher class CPUs and GPUs mean higher prices. Additionally, the amount of memory and disk space also affects the price. Finally, the location of the virtual machine also affects the price.
After figuring this out, create a e2-medium
instance (leave the rest configured as default). Before clicking the Create
button make sure to check the Equivalent code
button. You should see a very long command that you could have typed in the terminal that would create a VM similar to configuring it through the UI.
After creating the virtual machine, in a local terminal type:
gcloud compute instances list\n
you should hopefully see the instance you have just created.
You can start a terminal directly by typing:
gcloud compute ssh --zone <zone> <name> --project <project-id>\n
You can always see the exact command that you need to run to ssh
to a VM by selecting the View gcloud command
option in the Compute Engine overview (see image below).
While logged into the instance, check if Python and PyTorch are installed. You should see that neither is installed. The VM we have only specified what compute resources it should have, and not what software should be in it. We can fix this by starting VMs based on specific docker images (it's all coming together).
GCP comes with several ready-to-go images for doing deep learning. More info can be found here. Try running this line:
gcloud container images list --repository=\"gcr.io/deeplearning-platform-release\"\n
what does the output show?
SolutionThe output should show a list of images that are available for you to use. The images are essentially docker images that contain a specific software stack. The software stack is often a specific version of Python, PyTorch, TensorFlow, etc. The images are maintained by Google and are updated regularly.
Next, start (in the terminal) a new instance using a PyTorch image. The command for doing it should look something like this:
# add the accelerator, maintenance-policy and metadata arguments only if you want to run on GPU and have the quota to do so\ngcloud compute instances create <instance_name> \\\n --zone=<zone> \\\n --image-family=<image-family> \\\n --image-project=deeplearning-platform-release \\\n --accelerator=\"type=nvidia-tesla-k80,count=1\" \\\n --maintenance-policy TERMINATE \\\n --metadata=\"install-nvidia-driver=True\"\n
You can find more info here on what <image-family>
should have as value and what extra argument you need to add if you want to run on GPU (if you have access).
The command should look something like this:
CPUGPUgcloud compute instances create my-instance \\\n --zone=europe-west1-b \\\n --image-family=pytorch-latest-cpu \\\n --image-project=deeplearning-platform-release\n
gcloud compute instances create my-instance \\\n --zone=europe-west1-b \\\n --image-family=pytorch-latest-gpu \\\n --image-project=deeplearning-platform-release \\\n --accelerator=\"type=nvidia-tesla-k80,count=1\" \\\n --maintenance-policy TERMINATE\n
ssh
to the VM as in one of the previous exercises. Confirm that the VM indeed contains both a Python installation and that PyTorch is installed. Hint: you also have the possibility through the web page to start a browser session directly to the VMs you create:
Everything that you have done locally can also be achieved through the web terminal, which of course comes pre-installed with the gcloud
command etc.
Try out launching this and run some of the commands from the previous exercises.
Finally, we want to make sure that we do not forget to stop our VMs. VMs are charged by the minute, so even if you are not using them you are still paying for them. Therefore, you must remember to stop your VMs when you are not using them. You can do this by either clicking the Stop
button on the VM overview page or by running the following command:
gcloud compute instances stop <instance-name>\n
Another big part of cloud computing is the storage of data. There are many reasons that you want to store your data in the cloud including:
Cloud storage is luckily also very cheap. Google Cloud only takes around $0.026 per GB per month. This means that around 1 TB of data would cost you $26 which is more than what the same amount of data would cost on Google Drive, but the storage in Google Cloud is much more focused on enterprise usage such that you can access the data through code.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_1","title":"\u2754 Exercises","text":"When we did the exercise on data version control, we made dvc
work together with our own Google Drive to store data. However, a big limitation of this is that we need to authenticate each time we try to either push or pull the data. The fix is to use an API instead, which is offered through GCP.
We are going to follow the instructions from this page
Let's start by creating a data storage. On the GCP start page, in the sidebar, click on the Cloud Storage
. On the next page click the Create bucket
:
Give the bucket a unique name, set it to a region close by and importantly remember to enable Object versioning under the last tab. Finally, click Create.
After creating the storage, you should be able to see it online and you should be able to see it if you type in your local terminal:
gsutil ls\n
gsutil is a command line tool that allows you to create, upload, download, list, move, rename and delete objects in the cloud storage. For example, you can upload a file to the cloud storage by running:
gsutil cp <file> gs://<bucket-name>\n
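If you prefer to interact with the bucket from Python instead of through gsutil, a minimal sketch (assuming you have installed the google-cloud-storage package with pip install google-cloud-storage, and replacing the bucket and file names with your own) could look like this:
from google.cloud import storage\n\n# uses the credentials you set up with gcloud auth application-default login\nclient = storage.Client()\nbucket = client.bucket('<bucket-name>')\n\n# upload a local file to the bucket and download it again\nblob = bucket.blob('data/example.txt')\nblob.upload_from_filename('example.txt')\nblob.download_to_filename('example_copy.txt')\n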
Next, we need the Google storage extension for dvc
pip install dvc-gs\n
Now in your corrupt MNIST repository where you have already configured dvc
, we are going to change the storage from our Google Drive to our newly created Google Cloud storage.
dvc remote add -d remote_storage <output-from-gsutils>\n
In addition, we are also going to modify the remote to support object versioning (called version_aware
in dvc
):
dvc remote modify remote_storage version_aware true\n
This will change the default way that dvc
handles data. Instead of just storing the latest version of the data as content-addressable storage, it will now store the data as it looks in our local repository, which means we are no longer limited to using dvc
to download our data; the files can also be fetched directly from the bucket.
The above command will change the .dvc/config
file. git add
and git commit
the changes to that file. Finally, push data to the cloud
dvc push --no-run-cache # (1)!\n
--no-run-cache
flag is used to avoid pushing the cache file to the cloud, which is not supported by the Google Cloud storage. Finally, make sure that you can pull without having to give your credentials. The easiest way to see this is to delete the .dvc/cache
folder that should be locally on your laptop and afterward do a
dvc pull --no-run-cache\n
This setup should work when trying to access the data from your laptop, which we authenticated in the previous module. However, how can you access the data from a virtual machine, inside a docker container or from a different laptop? We in general recommend two ways:
You can make the bucket publicly accessible e.g. no authentication is needed. That means that anyone with the URL to the data can access it. This is the easiest way to do it, but also the least secure. You can read more about how to make your buckets public here.
You can use the service account that you created in the previous module to authenticate the VM. This is the most secure way to do it, but also the most complicated. You first need to give the service account the correct permissions. Then you need to authenticate using the service account. In dvc
this is done by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS
to the path of the service account key file you downloaded earlier:
export GOOGLE_APPLICATION_CREDENTIALS=\"/path/to/your/credentials.json\"\n
set GOOGLE_APPLICATION_CREDENTIALS=\"C:\\path\\to\\your\\credentials.json\"\n
You should hopefully at this point have seen the strength of using containers to create reproducible environments. They allow us to specify exactly the software that we want to run inside our VMs. However, you should already have run into two problems with containers: building them locally can take a long time, and the resulting images quickly take up a lot of local disk space when you need to store and share them.
For this reason, we want to move both the building process and the storage of images to the cloud. In GCP the two services that we are going to use for this are called Cloud Build for building the containers in the cloud and Artifact registry for storing the images afterward.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_2","title":"\u2754 Exercises","text":"In these exercises, I recommend that you start with a dummy version of some code to make sure that the building process does not take too long. Below is a simple Python script that does image classification using Sklearn, together with the corresponding requirements.txt
file and Dockerfile
.
from sklearn import datasets, metrics, svm\nfrom sklearn.model_selection import train_test_split\n\nif __name__ == \"__main__\":\n digits = datasets.load_digits()\n\n # flatten the images\n n_samples = len(digits.images)\n data = digits.images.reshape((n_samples, -1))\n\n # Create a classifier: a support vector classifier\n clf = svm.SVC(gamma=0.001)\n\n # Split data into 50% train and 50% test subsets\n X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)\n\n # Learn the digits on the train subset\n clf.fit(X_train, y_train)\n\n # Predict the value of the digit on the test subset\n predicted = clf.predict(X_test)\n\n print(f\"Classification report for classifier {clf}:\\n{metrics.classification_report(y_test, predicted)}\\n\")\n
requirements.txt requirements.txtscikit-learn>=1.0\n
Dockerfile DockerfileFROM python:3.11-slim\n\n# install build dependencies\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n\nCOPY requirements.txt requirements.txt\nCOPY main.py main.py\nWORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\n\nENTRYPOINT [\"python\", \"-u\", \"main.py\"]\n
The docker images for this application are therefore going to be substantially faster to build and smaller in size than the images we are used to that use PyTorch.
Start by enabling the service: Google Artifact Registry API
and Google Cloud Build API
. This can be done through the website (by searching for the services) or can also be enabled from the terminal:
gcloud services enable artifactregistry.googleapis.com\ngcloud services enable cloudbuild.googleapis.com\n
The first step is creating an artifact repository in the cloud. You can either do this through the UI or using gcloud
in the command line.
Find the Artifact Registry
service (search for it in the search bar) and click on it. From there click on the Create repository
button. You should see the following page:
Give the repository a name, make sure to set the format to Docker
and specify the region. At the bottom of the page you can optionally add a cleanup policy. We recommend that you add one to keep costs down. Give the policy a name, choose the Keep most recent versions
option and set the keep count to 5
. Click Create
and you should now see the repository in the list of repositories.
gcloud artifacts repositories create <registry-name> \\\n --repository-format=docker \\\n --location=europe-west1 \\\n --description=\"My docker registry\"\n
where you need to replace <registry-name>
with a name of your choice. You can read more about the command here. We recommend that after creating the repository you update it with a cleanup policy to keep costs down. You can do this by running:
gcloud artifacts repositories set-cleanup-policies <registry-name> \\\n --project=<project-id> \\\n --location=<region> \\\n --policy=policy.yaml\n
where the policy.yaml
file should look something like this:
[\n {\n \"name\": \"keep-minimum-versions\",\n \"action\": {\"type\": \"Keep\"},\n \"mostRecentVersions\": {\n \"keepCount\": 5\n }\n }\n]\n
and you can read more about the command here. Whenever we in the future want to push or pull to this artifact repository we can refer to it using this URL:
<region>-docker.pkg.dev/<project-id>/<registry-name>\n
for example, europe-west1-docker.pkg.dev/dtumlops-335110/container-registry
would be a valid URL (this is the one I created).
We are now ready to build our containers in the cloud. In principle, GCP cloud build works out of the box with docker files. However, the recommended way is to add specialized cloudbuild.yaml
files. You can think of the cloudbuild.yaml
file as GCP's counterpart to the workflow files in GitHub Actions, which you learned about in module M16. It is essentially a file that specifies a list of steps that should be executed to do something, but the syntax is different.
Look at the documentation on how to write a cloudbuild.yaml
file for building and pushing a docker image to the artifact registry. Try to implement such a file in your repository.
For building docker images the syntax is as follows:
steps:\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Build container image'\n args: [\n 'build',\n '.',\n '-t',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n '-f',\n '<path-to-dockerfile>'\n ]\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Push container image'\n args: [\n 'push',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n ]\n
where you need to replace <registry-name>
, <image-name>
and <path-to-dockerfile>
with your own values. You can hopefully recognize the syntax from the docker exercises. In this example, we are calling the cloud-builders/docker
service with both the build
and push
arguments.
You can now try to trigger the cloudbuild.yaml
file from your local machine. What gcloud
command would you use to do this?
You can trigger a build by running the following command:
gcloud builds submit --config=cloudbuild.yaml .\n
This command will submit a build to the cloud build service using the configuration file cloudbuild.yaml
in the current directory.
Instead of relying on manually submitting builds, we can set up the building process as continuous integration such that it is triggered every time we push code to the repository. This is done by setting up a trigger in the GCP console. From the GCP homepage, navigate to the triggers panel:
Click on the manage repositories.
From there, click the Connect Repository
and go through the steps of authenticating your GitHub profile with GCP and choose the repository that you want to set up build triggers for. For now, skip the Create a trigger (optional)
part by pressing Done
in the end.
Navigate back to the Triggers
homepage and click Create trigger
. Set the following:
Event: Push to branch
Branch: ^main$
Configuration: Autodetected or Cloud build configuration file
Finally, click the Create
button and the trigger should show up on the triggers page.
To activate the trigger, push some code to the chosen repository.
Go to the Cloud Build
page and you should see the image being built and pushed.
Try clicking on the build to check out the build process and build summary. As you can see from the image, if a build is failing you will often find valuable info by looking at the build summary. If your build is failing try to configure it to run in one of these regions: us-central1, us-west2, europe-west1, asia-east1, australia-southeast1, southamerica-east1
as specified in the documentation.
If/when your build is successful, navigate to the Artifact Registry
page. You should hopefully find that the image you just built was pushed here. Congrats!
Make sure that you can pull your image down to your laptop
docker pull <region>-docker.pkg.dev/<project-id>/<registry-name>/<image-name>:<image-tag>\n
you will need to authenticate docker
with GCP first. Instructions can be found here, but the following command should hopefully be enough to make docker
and GCP talk to each other:
gcloud auth configure-docker <region-docker.pkg.dev>\n
where you need to replace <region>
with the region you are using. Do note you need to have docker
actively running in the background, as any other time you want to use docker
.
Automation through the cloud is in general the way to go, but sometimes you may want to manually create images and push them to the registry. Figure out how to push an image to your Artifact Registry
. For simplicity, you can just push the busybox
image you downloaded during the initial docker exercises. This page should help you with the exercise.
Pushing to a repository is similar to pulling. Assuming that you have already built an image called busybox
you can push it to the repository by running:
docker tag busybox <region>-docker.pkg.dev/<project-id>/<registry-name>/busybox:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<registry-name>/busybox:latest\n
where you need to replace <region>
, <project-id>
and <registry-name>
with your own values.
(Optional) Instead of using the built-in trigger in GCP, another way to trigger builds on code changes is to integrate with GitHub Actions. This has the benefit that we can make the build process depend on other steps in the pipeline. For example, in the image below we have conditioned the build to only run if tests are passing on all operating systems. Let's try to implement this.
Start by adding a new secret to Github with the name GCLOUD_SERVICE_KEY
and the value of the service account key that you created in the previous module. This is needed to authenticate the Github action with GCP.
We assume that you already have a workflow file that runs some unit tests:
name: Unit tests & build\n\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\njobs:\n test:\n
we now want to add a job that triggers the build process in GCP. How can you make the build
job depend on the test
job? Hint: Relevant documentation.
You can make the build
job depend on the test
job by adding the needs
keyword to the build
job:
name: Unit tests & build\n\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\njobs:\n test:\n ...\n build:\n needs: test\n ...\n
Additionally, we probably only want to build the image if the job is running on our main branch, i.e. not as part of a pull request. How can you make the build
job only run on the main branch?
name: Unit tests & build\n\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\njobs:\n test:\n ...\n build:\n needs: test\n if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}\n ...\n
Finally, we need to add the steps to submit the build job to GCP. You need four steps:
How can you do this? Hint: For the first two steps these two Github actions can be useful: auth and setup-gcloud.
Solutionname: Unit tests & build\n\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\njobs:\n test:\n ...\n build:\n needs: test\n if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}\n runs-on: ubuntu-latest\n steps:\n - name: Checkout code\n uses: actions/checkout@v4\n\n - name: Auth with GCP\n uses: google-github-actions/auth@v2\n with:\n credentials_json: ${{ secrets.GCLOUD_SERVICE_KEY }}\n\n - name: Set up Cloud SDK\n uses: google-github-actions/setup-gcloud@v2\n\n - name: Submit build\n run: gcloud builds submit --config cloudbuild_containers.yaml\n
(Optional) The cloudbuild
specification format allows you to specify so-called substitutions. A substitution is simply a way to replace a variable in the cloudbuild.yaml
file with a value that is known only at runtime. This can be useful for using the same cloudbuild.yaml
file for multiple builds. Try to implement a substitution in your docker cloud build file such that the image name is a variable.
Built-in substitutions
You have probably already encountered substitutions like $PROJECT_ID
in the cloudbuild.yaml
file. These are substitutions that are automatically replaced by GCP. Other commonly used are $BUILD_ID
, $PROJECT_NUMBER
and $LOCATION
. You can find a full list of built-in substitutions here
We just need to add the substitutions
field to the cloudbuild.yaml
file. For example, if we want to replace the image name with a variable called _IMAGE_NAME
we can do the following:
steps:\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Build container image'\n args: [\n 'build',\n '.',\n '-t',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/$_IMAGE_NAME',\n '-f',\n '<path-to-dockerfile>'\n ]\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Push container image'\n args: [\n 'push',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/$_IMAGE_NAME'\n ]\nsubstitutions:\n _IMAGE_NAME: 'my_image'\n
Do note that user substitutions are prefixed with an underscore _
to distinguish them from the built-in ones. You can read more here
How would you provide the value for the _IMAGE_NAME
variable to the gcloud builds submit
command?
You can provide the value for the _IMAGE_NAME
variable by adding the --substitutions
flag to the gcloud builds submit
command:
gcloud builds submit --config=cloudbuild.yaml --substitutions=_IMAGE_NAME=my_image\n
If you want to provide more than one substitution you can do so by separating them with a comma.
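For example, assuming your cloudbuild.yaml also defined a (hypothetical) _TAG substitution in addition to _IMAGE_NAME, the call could look like this:
gcloud builds submit --config=cloudbuild.yaml --substitutions=_IMAGE_NAME=my_image,_TAG=latest\n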
As the final step in our journey through different GCP services in this module, we are going to look at the training of our models. This is one of the important tasks that GCP can help us with because we can always rent more hardware as long as we have credits, meaning that we can both scale horizontally (run more experiments) and vertically (run longer experiments).
We are going to check out two ways of running our experiments. First, we are going to return to the Compute Engine service because it gives the most simple form of scaling of experiments. That is: we create a VM with an appropriate docker image, start it, log into the VM and run our experiments. Most people can run a couple of experiments in parallel this way. However, what if there was an abstraction layer that automatically created a VM for us, launched our experiments and then shut the VM down afterward?
This is where the Vertex AI service comes into play. This is a dedicated service for handling ML models in the cloud on GCP. Vertex AI is in principle an end-to-end service that can take care of everything machine-learning related in the cloud. In this course, we primarily focus on just the training of our models, and then use other services for the different parts of our pipeline.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_3","title":"\u2754 Exercises","text":"Let's start by going through how we could train a model using PyTorch using the Compute Engine service:
Start by creating an appropriate VM. If you want to start a VM that has PyTorch pre-installed with only CPU support you can run the following command
gcloud compute instances create <instance-name> \\\n --zone europe-west1-b \\\n --image-family=pytorch-latest-cpu \\\n --image-project=deeplearning-platform-release\n
alternatively, if you have access to GPU in your GCP account you could start a VM in the following way
gcloud compute instances create <instance-name> \\\n --zone europe-west4-a \\\n --image-family=pytorch-latest-gpu \\\n --image-project=deeplearning-platform-release \\\n --accelerator=\"type=nvidia-tesla-v100,count=1\" \\\n --metadata=\"install-nvidia-driver=True\" \\\n --maintenance-policy TERMINATE\n
Next, log into your newly created VM. You can either open an ssh
terminal in the cloud console or run the following command
gcloud beta compute ssh <instance-name>\n
It is recommended to always check that the VM we get is actually what we asked for. In this case, the VM should have PyTorch pre-installed so let's check for that by running
python -c \"import torch; print(torch.__version__)\"\n
Additionally, if you have a VM with GPU support also try running the nvidia-smi
command.
When you have logged in to the VM, it works like your own machine. Therefore, to run some training code you would need to do the same setup steps you have done on your own machine: clone your GitHub repository, install dependencies, download data, and run the code. Try doing this to make sure you can train a model.
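As a rough sketch of what this could look like inside the VM (the repository URL, requirements file and entry point are placeholders for your own setup):
git clone https://github.com/<your-username>/<your-repo>.git\ncd <your-repo>\npip install -r requirements.txt\ndvc pull --no-run-cache # download the data from your GCP bucket\npython src/<your_project>/train.py\n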
The above exercises should hopefully have convinced you that it can be hard to scale experiments using the Compute Engine service. The reason is that you need to manually start, set up and stop a separate VM for each experiment. Instead, let's try to use the Vertex AI service to train our models.
Start by enabling it by searching for Vertex AI
in the cloud console and going to the service, or by running the following command:
gcloud services enable aiplatform.googleapis.com\n
The way we are going to use Vertex AI is to create custom jobs because we have already developed docker containers that contain everything to run our code. Thus the only command that we need to use is gcloud ai custom-jobs create
command. An example here would be:
gcloud ai custom-jobs create \\\n --region=europe-west1 \\\n --display-name=test-run \\\n --config=config.yaml \\\n # these are the arguments that are passed to the container, only needed if you want to change defaults\n --command 'python src/my_project/train.py' \\\n --args '[\"--epochs\", \"10\"]'\n
Essentially, this command combines everything into one command: it first creates a VM with the specs specified by a configuration file, then loads a container specified again in the configuration file and finally it runs everything. An example of a config file could be:
CPUGPU# config_cpu.yaml\nworkerPoolSpecs:\n machineSpec:\n machineType: n1-highmem-2\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n
# config_gpu.yaml\nworkerPoolSpecs:\n machineSpec:\n machineType: n1-standard-8\n acceleratorType: NVIDIA_TESLA_T4 #(1)!\n acceleratorCount: 1\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n
you can read more about the configuration formatting here and the different types of machines here. Try to execute a job using the gcloud ai custom-jobs create
command. For additional documentation you can look at the documentation on the command and this page and this page
Assuming you manage to launch a job, you should see an output like this:
Try executing the commands that are outputted to look at both the status and the progress of your job.
In addition, you can also visit the Custom Jobs
tab in training
part of Vertex AI
You will need to select the specific region that you submitted your job to in order to see it.
During custom training, we do not necessarily need to use dvc
for downloading our data. A more efficient way is to use cloud storage as a mounted file system. This allows us to access data directly from the cloud storage without having to download it first. All our training jobs automatically get a gcs
folder mounted in the root directory. Try to access the data from your training script:
# loading from a bucket using mounted file system\ndata = torch.load('/gcs/<my-bucket-name>/data.pt')\n# writing to a bucket using mounted file system\ntorch.save(data, '/gcs/<my-bucket-name>/data.pt')\n
This should speed up the training process a bit.
Your code may depend on environment variables for authentication, for example with Weights and Biases during training. These can also be specified in the configuration file. How would you do this?
SolutionYou can specify environment variables in the configuration file by adding the env
field to the containerSpec
field. For example, if you want to specify the WANDB_API_KEY
you can do it like this:
workerPoolSpecs:\n machineSpec:\n machineType: n1-highmem-2\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n env:\n - name: WANDB_API_KEY\n value: <your-wandb-api-key>\n
You need to replace <your-wandb-api-key>
with your actual key. Also, remember that this file now contains a secret and should be treated as such.
Try to execute multiple jobs with different configurations, e.g. by changing the --args
field in the gcloud ai custom-jobs create
command at the same time. This should hopefully show you how easy it is to scale experiments using the Vertex AI service.
Similar to GitHub Actions, GCP also has a secrets store that can be used to keep secrets safe. This is called the Secret Manager in GCP. By using the Secret Manager, we get the option to inject secrets into our code without having to store them in the code itself.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_4","title":"\u2754 Exercises","text":"Let's look at the example from before where we have a config file like this for custom Vertex AI jobs:
workerPoolSpecs:\n machineSpec:\n machineType: n1-highmem-2\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n env:\n - name: WANDB_API_KEY\n value: $WANDB_API_KEY\n
we do not want to store the WANDB_API_KEY
in the config file, rather we would like to store it in the Secret Manager and inject it right before the job starts. Let's figure out how to do that.
Start by enabling the secrets manager API by running the following command:
gcloud services enable secretmanager.googleapis.com\n
Next, go to the secrets manager in the cloud console and create a new secret. You just need to give it a name, a value and leave the rest as default. Add one or more secrets like the image below.
We are going to inject the secrets into our training job by using cloudbuild. Create a new cloudbuild file called vertex_ai_train.yaml
and add the following content:
steps:\n- name: \"alpine\"\n id: \"Replace values in the training config\"\n entrypoint: \"sh\"\n args:\n - '-c'\n - |\n apk add --no-cache gettext\n envsubst < config.yaml > config.yaml.tmp\n mv config.yaml.tmp config.yaml\n secretEnv: ['WANDB_API_KEY']\n\n- name: 'alpine'\n id: \"Show config\"\n waitFor: ['Replace values in the training config']\n entrypoint: \"sh\"\n args:\n - '-c'\n - |\n cat config.yaml\n\n- name: 'gcr.io/cloud-builders/gcloud'\n id: 'Train on vertex AI'\n waitFor: ['Replace values in the training config']\n args: [\n 'ai',\n 'custom-jobs',\n 'create',\n '--region',\n 'europe-west1',\n '--display-name',\n 'example-mlops-job',\n '--config',\n '${_VERTEX_TRAIN_CONFIG}',\n ]\navailableSecrets:\n secretManager:\n - versionName: projects/$PROJECT_ID/secrets/WANDB_API_KEY/versions/latest\n env: 'WANDB_API_KEY'\n
Slowly go through the file and try to understand what each step does.
SolutionThere are two parts to using secrets in cloud build. First, there is the availableSecrets
field that specifies what secrets from the Secret Manager should be injected into the build. In this case, we are injecting the WANDB_API_KEY
and setting it as an environment variable. The second part is the secretEnv
field in the first step. This field specifies which secrets should be available in the first step. The steps are then doing:
The first step calls the envsubst command, which is a general Linux command that replaces environment variables in a file. In this case, it replaces the $WANDB_API_KEY
with the actual value of the secret. We then save the file as config.yaml.tmp
and rename it back to config.yaml
.
The second step is just to show that the replacement was successful. This is mostly for debugging purposes and can be removed.
The third step is the actual training job. It waits for the first step to finish before running.
Finally, try to trigger the build:
gcloud builds submit --config=vertex_ai_train.yaml\n
and check that the WANDB_API_KEY
is correctly injected into the config.yaml
file.
In Compute Engine, we have the option to either stop or suspend the VMs. Can you describe what the difference is?
SolutionSuspended instances preserve the guest OS memory, device state, and application state. You will not be charged for a suspended VM but will be charged for the storage of the aforementioned states. Stopped instances do not preserve any of these states, and you will be charged for the storage of the disk. However, in both cases, if the VM instances have resources attached to them, such as static IPs and persistent disks, those resources are charged until they are deleted.
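For reference, both operations can also be done from the command line; a small sketch (depending on your gcloud version, suspend/resume may require the beta component):
gcloud compute instances suspend <instance-name>\ngcloud compute instances resume <instance-name>\ngcloud compute instances stop <instance-name>\ngcloud compute instances start <instance-name>\n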
As seen in the exercises, a cloudbuild.yaml
file often contains multiple steps. How would you make steps dependent on each other, e.g. such that one step can only run if another step has finished? And how would you make steps execute concurrently?
In both cases, the solution is the waitFor
field. If you want a step to wait for another step to finish, you need to give the first step an id
and then specify that id
in the waitFor
field of the second step.
steps:\n- name: 'alpine'\n id: 'step1'\n entrypoint: 'sh'\n args: ['-c', 'echo \"Hello World\"']\n- name: 'alpine'\n id: 'step2'\n entrypoint: 'sh'\n args: ['-c', 'echo \"Hello World 2\"']\n waitFor: ['step1']\n
If you want steps to run concurrently you can set the waitFor
field to ['-']
:
steps:\n- name: 'alpine'\n id: 'step1'\n entrypoint: 'sh'\n args: ['-c', 'echo \"Hello World\"']\n- name: 'alpine'\n id: 'step2'\n entrypoint: 'sh'\n args: ['-c', 'echo \"Hello World 2\"']\n waitFor: ['-']\n
This ends the session on how to use Google Cloud services for now. In a future session, we are going to investigate a bit more of the services offered in GCP, in particular for deploying the models that we have just trained.
"},{"location":"s7_deployment/","title":"Model deployment","text":"Slides
Learn how requests work and how to create custom APIs
M22: Requests and APIs
Learn how to deploy custom APIs using serverless functions and serverless containers in the cloud
M23: Cloud Deployment
Learn how to test APIs for functionality and load
M24: API testing
Learn about different ways to improve the deployment of machine learning models
M25: ML Deployment
Learn how to create a frontend for your application using Streamlit
M26: Frontend
Let's say that you have spent 1000 GPU hours and trained the most awesome model that you want to share with the world. One way to do this is, of course, to just place all your code in a GitHub repository, upload a file with the trained model weights to your favorite online storage (assuming it is too big for GitHub to handle) and ask people to just download your code and the weights to run the code by themselves. This is a fine approach in a small research setting, but in production, you need to be able to deploy the model to an environment that is fully contained, such that people can just execute it without looking (too hard) at the code.
Image credit
In this session we try to look at methods specialized towards deployment of models on your local machine and also how to deploy services in the cloud.
Learning objectives
The learning objectives of this session are:
fastapi
and run it locallyCore Module
Before we can get deployment of our models we need to understand concepts such as APIs and requests. The core reason for this is that we need a new abstraction layer on top of our applications that are not Python-specific. While Python is the defacto language for machine learning, we cannot expect everybody else to use it and in particular, we cannot expect network protocols (both locally and external) to be able to communicate with our Python programs out of the box. For this reason, we need to understand requests, in particular HTTP requests and how to create APIs that can interact with those requests.
"},{"location":"s7_deployment/apis/#requests","title":"Requests","text":"When we are talking about requests, we are essentially talking about the communication method used in client-server types of architectures. As shown in the image below, in this architecture, the client (user) is going to send requests to a server (our machine learning application) and the server will give a response. For example, the user may send a request to get the class of a specific image, which our application will do and then send back the response in terms of a label.
Image creditThe common way of sending requests is called HTTP (Hypertext Transfer Protocol). It is essentially a specification of the intermediary transportation method between the client and server. An HTTP request essentially consists of two parts:
The common request methods are (case sensitive):
You can read more about the different methods here. For most machine learning applications, GET and POST are the core methods to remember. Additionally, if you want to read more about HTTP in general we highly recommend that you go over this comic strip protocol, but the TLDR is that it provides privacy, integrity and identification over the web.
"},{"location":"s7_deployment/apis/#exercises","title":"\u2754 Exercises","text":"We are going to do a couple of exercises on sending requests using requests package to get familiar with the syntax.
Start by installing the `requests`` package
pip install requests\n
Afterwards, create a small script and try to execute the code
import requests\nresponse = requests.get('https://api.github.com/this-api-should-not-exist')\nprint(response.status_code)\n
As you can see from the syntax, we are sending a request using the GET method. This code should return status code 404. Take a look at this page that contains a list of status codes. Next, let's call a page that exists
import requests\nresponse = requests.get('https://api.github.com')\nprint(response.status_code)\n
What is the status code now and what does it mean? Status codes are important when you have an application that is interacting with a server and wants to make sure that it does not fail, which can be done with simple if
statements on the status codes
if response.status_code == 200:\n print('Success!')\nelif response.status_code == 404:\n print('Not Found.')\n
Next, try to call the following
response=requests.get(\"https://api.github.com/repos/SkafteNicki/dtu_mlops\")\n
which gives back a payload. Essentially, payload refers to any additional data that is sent from the client to the server or vice-versa. Try looking at the response.content
attribute. What is the type of this attribute?
You should hopefully observe that the .content
attribute is of type bytes
. It is important to note that the standard way of sending payloads is to encode them into byte
objects. To get a more human-readable version of the response, we can convert it to JSON format
response.json()\n
It is important to remember that a JSON object in Python is just a nested dictionary, which is useful to know if you ever want to iterate over the object in some way.
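For example, you can loop over the returned object just like a normal Python dictionary:
for key, value in response.json().items():\n    print(key, value)\n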
When we use the GET method we can additionally provide a params
argument, that specifies what we want the server to send back for a specific request URL:
response = requests.get(\n 'https://api.github.com/search/repositories',\n params={'q': 'requests+language:python'},\n)\n
Before looking at response.json()
can you explain what the code does? You can try looking at this page for help.
Sometimes the content of a page cannot be converted into JSON, because as already stated data is sent as bytes. Say that we want to download an image, which we can do in the following way
import requests\nresponse = requests.get('https://imgs.xkcd.com/comics/making_progress.png')\n
Try calling response.json()
, what happens? Next, try calling response.content
. To get the result in this case we would need to convert from bytes to an image:
with open(r'img.png','wb') as f:\n f.write(response.content)\n
The get
method is the most useful method because it allows us to get data from the server. However, as stated in the beginning multiple request methods exist, for example, the POST method for sending data to the server. Try executing:
pload = {'username':'Olivia','password':'123'}\nresponse = requests.post('https://httpbin.org/post', data = pload)\n
Investigate the response (this is an artificial example because we do not control the server).
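If you are unsure where to start, httpbin simply echoes back what it received, so inspecting the returned JSON (for example its form field, which contains the data we posted) is a good way to investigate the response:
print(response.status_code)\nprint(response.json()[\"form\"])  # httpbin echoes back the form data we posted\n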
Finally, we should also know that requests can be sent directly from the command line using the curl
command. Sometimes it is easier to send a request directly from the terminal and sometimes it is easier to do it from a script.
Make sure you have curl
installed, or else find instructions on installing it. To check, call curl --help, which will also print the documentation for curl.
To execute requests.get('https://api.github.com')
using curl we would simply do
curl -X GET \"https://api.github.com\"\ncurl -X GET -I \"https://api.github.com\" # if you want the status code\n
Try it yourself.
Try to redo some of the exercises yourself using curl
.
That ends the intro session on requests
. Do not worry if you are still not completely comfortable with sending requests, we are going to return to how we do it in practice when we have created our API. If you want to learn more about the requests
package you can check out this tutorial and if you want to see more examples of how to use curl
you can check out this page.
Requests are all about being on the client side of our client-server architecture. We are now going to move on to the server side where we will be learning about writing the APIs that requests can interact with. An application programming interface (API) is essentially the way the developer (you) tells a user how to use the application that you have created. The API is an abstraction layer that allows the user to interact with our application in the way we want them to interact with it, without the user even having to look at the code.
We can take the API from GitHub as an example https://api.github.com. This API allows any user to retrieve, integrate and send data to Github without ever having to visit their webpage. The API exposes multiple endpoints that have various functions:
and we could go on. However, there may be functionality that Github is not interested in users having access to and they may therefore choose not to have endpoints for specific features (1).
The particular kind of API we are going to work with is called a REST API (or RESTful API). The REST standard specifies a set of constraints that a particular API needs to fulfill to be considered RESTful. You can read more about the six guiding principles behind REST APIs on this page, but one of the most important to have in mind is that the client-server architecture needs to be stateless. This means that whenever a request is sent to the server it needs to be self-contained (all information included) and the server cannot rely on any stored information from previous requests.
To implement APIs in practice we are going to use FastAPI. FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+ based on standard Python type hints. FastAPI is only one of many frameworks for defining APIs, however, compared to other frameworks such as Flask and Django it offers a sweet spot of being flexible enough to do what you want without having many additional (unnecessary) features.
"},{"location":"s7_deployment/apis/#exercises_1","title":"\u2754 Exercises","text":"The exercises below are a condensed version of this and this tutorial. If you ever need context for the exercises, we can recommend trying to go through these. Additionally, we also provide this solution file that you can look through for help.
Install FastAPI
pip install fastapi\n
This contains the functions, modules, and variables we are going to need to define our interface.
Additionally, also install uvicorn
which is a lightweight ASGI server that we will use to run our applications.
pip install uvicorn[standard]\n
Start by defining a small application like this in a file called main.py
:
from fastapi import FastAPI\napp = FastAPI()\n\n@app.get(\"/\")\ndef read_root():\n return {\"Hello\": \"World\"}\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n return {\"item_id\": item_id}\n
Important here is the use of the @app.get
decorator. What could this decorator refer to? Explain what the two functions are probably doing.
Next lets launch our app. Since we called our script main.py
and inside the script initialized our API with app = FastAPI
, our application that we want to deploy can be referenced by main:app
:
uvicorn --reload --port 8000 main:app\n
this will launch a server at this page: http://localhost:8000/
. As you will hopefully see, this page will return the content of the root
function, like the image below. Remember to also check the output in your terminal as that will give info on when and how your application is being invoked.
What webpage should you open to get the server to return 1
?
Also checkout the pages: http://localhost:8000/docs
and http://localhost:8000/redoc
. What do these pages show?
The power of the docs
and redoc
pages is that they allow you to easily test your application with their simple UI. As shown in the image below, simply open the endpoint you want to test, click the Try it out
button, input any values and execute it. It will return the corresponding curl
command for invoking your endpoint, the corresponding URL, and the response of your application. Try it out.
You can also check out http://localhost:8000/openapi.json
to see the schema that is generated, which is essentially a json
file containing the overall specifications of your program.
Try to access http://localhost:8000/items/foo
, what happens in this case? When you specify types in your API, FastAPI will automatically do type validation using pydantic, making sure users can only access your API with the correct types. Therefore, remember to include types in your applications!
With the fundamentals in place let's configure it a bit more:
Let's start by changing the root function to include a bit more info. In particular, we are also interested in returning the status code so the end user can easily read that. Default status codes are included in the built-in http Python package:
from http import HTTPStatus\n\n@app.get(\"/\")\ndef root():\n \"\"\" Health check.\"\"\"\n response = {\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n }\n return response\n
try to reload the page and see what is returned now. You should not have to re-launch the app because we started uvicorn with the --reload
argument.
When we decorate our functions with @app.get(\"/items/{item_id}\")
, item_id
is in this case what we call a path parameter, because it is a parameter that is directly included in the path of our endpoint. We have already seen how we can restrict a path to a single type, but what if we want to restrict it to specific values? This is often the case if we are working with parameters of type str
. In this case we would need to define an enum
:
from enum import Enum\nclass ItemEnum(Enum):\n alexnet = \"alexnet\"\n resnet = \"resnet\"\n lenet = \"lenet\"\n\n@app.get(\"/restric_items/{item_id}\")\ndef read_item(item_id: ItemEnum):\n return {\"item_id\": item_id}\n
Add this API, reload and execute it with both a valid parameter and an invalid parameter.
In contrast to path parameters we have query parameters. In the requests exercises we saw an example of this where we were calling https://api.github.com/search/repositories with the query 'q': 'requests+language:python'
. Any parameter in FastAPI that is not a path parameter, will be considered a query parameter:
@app.get(\"/query_items\")\ndef read_item(item_id: int):\n return {\"item_id\": item_id}\n
Add this API, reload and figure out how to pass in a query parameter.
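If you get stuck, remember that query parameters are appended to the URL after a ?, so the endpoint above can, for example, be invoked like this:
curl -X GET \"http://localhost:8000/query_items?item_id=42\"\n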
We have until now worked with the .get
method, but lets also see an example of the .post
method. As already described, the POST request method is used for uploading data to the server. Here is a simple app that saves a username and password in a database (please never implement it like this in real life):
database = {'username': [ ], 'password': [ ]}\n\n@app.post(\"/login/\")\ndef login(username: str, password: str):\n username_db = database['username']\n password_db = database['password']\n if username not in username_db and password not in password_db:\n with open('database.csv', \"a\") as file:\n file.write(f\"{username}, {password} \\n\")\n username_db.append(username)\n password_db.append(password)\n return \"login saved\"\n
Make sure you understand what the function does and then try to execute it a couple of times to see your database updating. It is important to note that we sometimes in the following exercises use the .get
method and sometimes the .post
method. For our usage it does not really matter.
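As an example, because username and password are declared as plain function arguments, FastAPI treats them as query parameters, so the login endpoint can be invoked with a command along these lines:
curl -X POST \"http://localhost:8000/login/?username=Olivia&password=123\"\n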
We are now moving on to figuring out how to provide different standard inputs like text, images, json to our APIs. It is important that you try out each example yourself and in particular you look at the curl
commands that are necessary to invoke each application.
Here is a small application, that takes a single text input
@app.get(\"/text_model/\")\ndef contains_email(data: str):\n regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n response = {\n \"input\": data,\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n \"is_email\": re.fullmatch(regex, data) is not None\n }\n return response\n
What does the application do? Try it out yourself
Let's say we wanted to extend the application to check for a specific email domain, either gmail
or hotmail
. Assume that we want to feed this into our application as a json
object e.g.
{\n \"email\": \"mlops@gmail.com\",\n \"domain_match\": \"gmail\"\n}\n
Figure out how to alter the data
parameter such that it takes in the json
object and make sure to extend the application to check if the email and domain also match. Hint: take a look at this page
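If you need a starting point, one possible (not the only) way to solve this is to declare a small pydantic model and use it as the request body; the model and field names below are just chosen for this example:
from pydantic import BaseModel\n\nclass EmailDomainCheck(BaseModel):\n    email: str\n    domain_match: str\n\n@app.post(\"/text_model/\")\ndef contains_email_domain(data: EmailDomainCheck):\n    regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n    is_email = re.fullmatch(regex, data.email) is not None\n    return {\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n        \"is_email\": is_email,\n        \"domain_matches\": is_email and data.email.split(\"@\")[1].startswith(data.domain_match),\n    }\n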
Let's move on to an application that requires a file input:
from fastapi import UploadFile, File\nfrom typing import Optional\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...)):\n with open('image.jpg', 'wb') as image:\n content = await data.read()\n image.write(content)\n image.close()\n\n response = {\n \"input\": data,\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n }\n return response\n
A couple of new things are going on here: we use the specialized UploadFile
and File
bodies in our input definition. Additionally, we added the async
/await
keywords. Figure out what everything does and try to run the application (you can use any image file you like).
The above application actually does not do anything. Let's add opencv as a package and let's resize the image. It can be done with the following three lines:
import cv2\nimg = cv2.imread(\"image.jpg\")\nres = cv2.resize(img, (h, w))\n
Figure out where to add them in the application and additionally add h
and w
as optional parameters, with a default value of 28. Try running the application where you specify everything and one more time where you leave out h
and w
.
Finally, let's also figure out how to return a file from our application. You will need to add the following lines:
from fastapi.responses import FileResponse\ncv2.imwrite('image_resize.jpg', res)\nFileResponse('image_resize.jpg')\n
Figure out where to add them to the code and try running the application one more time to see that you get a file back with the resized image.
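If you want to check your attempt against something, a possible way to put the pieces together (using a default value of 28 for both h and w) could look like this:
import cv2\nfrom fastapi.responses import FileResponse\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...), h: int = 28, w: int = 28):\n    # save the uploaded file to disk so opencv can read it\n    with open(\"image.jpg\", \"wb\") as image:\n        content = await data.read()\n        image.write(content)\n\n    # resize the image and write the result back to disk\n    img = cv2.imread(\"image.jpg\")\n    res = cv2.resize(img, (h, w))\n    cv2.imwrite(\"image_resize.jpg\", res)\n\n    return FileResponse(\"image_resize.jpg\")\n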
A common pattern in most applications is that we want some code to run on startup and some code to run on shutdown. FastAPI allows us to do this by controlling the lifespan of our application. This is done by implementing the lifespan
function. Look at the documentation for lifespan events and implement a small application that prints Hello
on startup and Goodbye
on shutdown.
Here is a simple example that will print Hello
on startup and Goodbye
on shutdown.
from contextlib import asynccontextmanager\nfrom fastapi import FastAPI\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n print(\"Hello\")\n yield\n print(\"Goodbye\")\n\napp = FastAPI(lifespan=lifespan)\n\n@app.get(\"/\")\ndef read_root():\n return {\"Hello\": \"World\"}\n
Let's try to figure out how to use FastAPI in a Machine learning context. Below is a script that downloads a VisionEncoderDecoder
from huggingface . The model can be used to create captions for a given image. Thus calling
predict_step(['s7_deployment/exercise_files/my_cat.jpg'])\n
returns a list of strings like ['a cat laying on a couch with a stuffed animal']
(try this yourself). Create a FastAPI application that can do inference using this model e.g. it should take in an image, preferably some optional hyperparameters (like max_length
) and should return a string (or list of strings) containing the generated caption.
simple ML application
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer\nimport torch\nfrom PIL import Image\n\nmodel = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\nfeature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\ntokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\ngen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\ndef predict_step(image_paths):\n images = []\n for image_path in image_paths:\n i_image = Image.open(image_path)\n if i_image.mode != \"RGB\":\n i_image = i_image.convert(mode=\"RGB\")\n\n images.append(i_image)\n pixel_values = feature_extractor(images=images, return_tensors=\"pt\").pixel_values\n pixel_values = pixel_values.to(device)\n output_ids = model.generate(pixel_values, **gen_kwargs)\n preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n preds = [pred.strip() for pred in preds]\n return preds\n\nif __name__ == \"__main__\":\n print(predict_step(['s7_deployment/exercise_files/my_cat.jpg']))\n
Solution ml_app.pyfrom contextlib import asynccontextmanager\n\nimport torch\nfrom fastapi import FastAPI, File, UploadFile\nfrom PIL import Image\nfrom transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTFeatureExtractor\n\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n \"\"\"Load and clean up model on startup and shutdown.\"\"\"\n global model, feature_extractor, tokenizer, device, gen_kwargs\n print(\"Loading model\")\n model = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n feature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n tokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n model.to(device)\n gen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\n\n yield\n\n print(\"Cleaning up\")\n del model, feature_extractor, tokenizer, device, gen_kwargs\n\n\napp = FastAPI(lifespan=lifespan)\n\n\n@app.post(\"/caption/\")\nasync def caption(data: UploadFile = File(...)):\n \"\"\"Generate a caption for an image.\"\"\"\n i_image = Image.open(data.file)\n if i_image.mode != \"RGB\":\n i_image = i_image.convert(mode=\"RGB\")\n\n pixel_values = feature_extractor(images=[i_image], return_tensors=\"pt\").pixel_values\n pixel_values = pixel_values.to(device)\n output_ids = model.generate(pixel_values, **gen_kwargs)\n preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n return [pred.strip() for pred in preds]\n
As the final step, we want to figure out how to include our FastAPI application in a docker container as it will help us when we want to deploy in the cloud because docker as always can take care of the dependencies for our application. For the following set of exercises you can take whatever previous FastAPI application as the base application for the container
Start by creating a requirement.txt
file for your application. You will at least need fastapi
and uvicorn
in the file and we always recommend that you are specific about the version you want to use
fastapi>=0.68.0,<0.69.0\nuvicorn>=0.15.0,<0.16.0\n# add anything else you application needs to be able to run\n
Next, create a Dockerfile
with the following content
FROM python:3.9\nWORKDIR /code\nCOPY ./requirements.txt /code/requirements.txt\n\nRUN pip install --no-cache-dir --upgrade -r /code/requirements.txt\nCOPY ./app /code/app\n\nCMD [\"uvicorn\", \"app.main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"80\"]\n
The above assumes that your file structure looks like this
.\n\u251c\u2500\u2500 app\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u2514\u2500\u2500 main.py\n\u251c\u2500\u2500 Dockerfile\n\u2514\u2500\u2500 requirements.txt\n
Hopefully, all these steps should look familiar if you already went through module M9, except for maybe the last line. However, this is just the standard way that we have run our FastAPI applications in the last couple of exercises, this time with some extra arguments regarding the ports we allow.
Next, build the corresponding docker image
docker build -t my_fastapi_app .\n
Finally, run the image such that a container is spun up that runs our application. The important part here is to remember to specify the -p
argument (p for port) which should be the same number as the port we have specified in the last line of our Dockerfile.
docker run --name mycontainer -p 80:80 my_fastapi_app\n
Check that everything is working by going to the corresponding localhost page http://localhost/items/5?q=somequery
This ends the module on APIs. If you want to go further in this direction we highly recommend that you check out bentoml which is an API standard that focuses solely on creating easy-to-understand APIs and services for ml-applications. Additionally, we can also highly recommend checking out Postman which can help design, document and in particular test the API you are writing to make sure that it works as expected.
"},{"location":"s7_deployment/cloud_deployment/","title":"M23 - Cloud Deployment","text":""},{"location":"s7_deployment/cloud_deployment/#cloud-deployment","title":"Cloud deployment","text":"Core Module
We are now returning to using the cloud. In this module, you should have gone through the steps of having your code in your GitHub repository to automatically build into a docker container, store that, store data and pull it all together to make a training run. After the training is completed you should hopefully have a file stored in the cloud with your trained model weights.
Today's exercises will be about serving those model weights to an end user. We focus on two different ways of deploying our model: Google cloud functions and Google cloud run. Both services are serverless, meaning that you do not have to manage the server that runs your code.
GCP in general has 5 core deployment options. We are going to focus on Cloud Functions and Cloud Run, which are two of the serverless options. In contrast to these two, you have the option to deploy to Kubernetes Engine and Compute Engine which are more traditional ways of deploying your code. Here you have to manage the underlying infrastructure."},{"location":"s7_deployment/cloud_deployment/#cloud-functions","title":"Cloud Functions","text":"Google Cloud Functions, is the most simple way that we can deploy our code to the cloud. As stated above, it is a serverless service, meaning that you do not have to worry about the underlying infrastructure. You just write your code and deploy it. The service is great for small applications that can be encapsulated in a single script.
"},{"location":"s7_deployment/cloud_deployment/#exercises","title":"\u2754 Exercises","text":"Go to the start page of Cloud Functions
. Can be found in the sidebar on the homepage or you can just search for it. Activate the service in the cloud console or use the following command:
gcloud services enable cloudfunctions.googleapis.com\n
Click the Create Function
button which should take you to a screen like the image below. Make sure it is a 2nd Gen function, give it a name, set the server region to somewhere close by and change the authentication policy to Allow unauthenticated invocations
so we can access it directly from a browser. Remember to note down the
On the next page, for Runtime
pick the Python 3.11
option (or newer). This will make the inline editor show both a main.py
and requirements.py
file. Look over them and try to understand what they do. Especially, take a look at the functions-framework which is a needed requirement of any Cloud function.
After you have looked over the files, click the Deploy
button.
The functions-framework
is a lightweight, open-source framework for turning Python functions into HTTP functions. Any function that you deploy to Cloud Functions must be wrapped in the @functions_framework.http
decorator.
Afterwards, the function should begin to deploy. When it is done, you should see \u2705. Now let's test it by going to the Testing
tab.
If you know what the application does, it should come as no surprise that it can run without any input. We therefore just send an empty request by clicking the Test The Function
button. Does the function return the output you expected? Wait for the logs to show up. What do they show?
What should the Triggering event
look like in the testing prompt for the program to respond with
Hallo General Kenobi!\n
Try it out.
SolutionThe default triggering event is a JSON object with a key name
and a value. Therefore the triggering event should look like this:
{\n \"name\": \"General Kenobi\"\n}\n
Go to the trigger tab and go to the URL for the application. Execute the API a couple of times. How can you change the URL to make the application respond with the same output as above?
SolutionYou can change the URL to include a query parameter name
with the value General Kenobi
. For example
https://us-central1-my-personal-mlops-project.cloudfunctions.net/function-3?name=General%20Kenobi\n
where you would need to replace everything before the ?
with your URL.
Click on the metrics tab. You should hopefully see it being populated with a few data points. Identify what each panel is showing.
SolutionCheck out the logs tab. You should see that your application has already been invoked multiple times. Also, try to execute this command in a terminal:
gcloud functions logs read\n
Next, we are going to create our own application that takes some input so we can try to send it requests. We provide a very simple script to get started.
Simple script
sklearn_cloud_functions.py# Load data\nimport pickle\n\nimport numpy as np\nfrom sklearn import datasets\nfrom sklearn.neighbors import KNeighborsClassifier\n\niris_x, iris_y = datasets.load_iris(return_X_y=True)\n\n# Split iris data in train and test data\n# A random permutation, to split the data randomly\nnp.random.seed(0)\nindices = np.random.permutation(len(iris_x))\niris_x_train = iris_x[indices[:-10]]\niris_y_train = iris_y[indices[:-10]]\niris_x_test = iris_x[indices[-10:]]\niris_y_test = iris_y[indices[-10:]]\n\n# Create and fit a nearest-neighbor classifier\n\nknn = KNeighborsClassifier()\nknn.fit(iris_x_train, iris_y_train)\nknn.predict(iris_x_test)\n\n# save model\n\nwith open(\"model.pkl\", \"wb\") as file:\n pickle.dump(knn, file)\n
Figure out what the script does and run the script. This should create a file with a trained model.
SolutionThe file trains a simple KNN model on the iris dataset and saves it to a file called model.pkl
.
Next, create a storage bucket and upload the model file to the bucket. Try to do this using the gsutil
command and check afterward that the file is in the bucket.
gsutil mb gs://<bucket-name> # mb stands for make bucket\ngsutil cp <file-name> gs://<bucket-name> # cp stands for copy\n
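To verify that the upload worked you can, for example, list the contents of the bucket:
gsutil ls gs://<bucket-name>\n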
Create a new cloud function with the same initial settings as the first one, e.g. Python 3.11
and HTTP
. Then implement in the main.py
file code that loads the model from the bucket, extracts the input data from the incoming request and returns the prediction of the model.
In addition to writing the main.py
file, you also need to fill out the requirements.txt
file. You need at least three packages to run the application. Remember to also change the Entry point
to the name of your function. If your deployment fails, try to go to the Logs Explorer
page in gcp
which can help you identify why.
The main script should look something like this:
main.pyimport pickle\n\nimport functions_framework\nfrom google.cloud import storage\n\nBUCKET_NAME = \"my_sklearn_model_bucket\"\nMODEL_FILE = \"model.pkl\"\n\nclient = storage.Client()\nbucket = client.get_bucket(BUCKET_NAME)\nblob = bucket.get_blob(MODEL_FILE)\nmy_model = pickle.loads(blob.download_as_string())\n\n\n@functions_framework.http\ndef knn_classifier(request):\n \"\"\"Simple knn classifier function for iris prediction.\"\"\"\n request_json = request.get_json()\n if request_json and \"input_data\" in request_json:\n input_data = request_json[\"input_data\"]\n input_data = [float(in_data) for in_data in input_data]\n input_data = [input_data]\n prediction = my_model.predict(input_data)\n return {\"prediction\": prediction.tolist()}\n return {\"error\": \"No input data provided.\"}\n
And, the requirement file should look like this:
functions-framework>=3.7.0\ngoogle-cloud-storage>=2.14.0\nscikit-learn>=1.4.0\n
importantly make sure that you are using the same version of scikit-learn
as you used when you trained the model. Else when trying to load the model you will most likely get an error.
When you have successfully deployed the model, try to make predictions with it. What should the request look like?
SolutionIt depends on how exactly you have chosen to implement the main.py
. But for the provided solution, the payload should look like this:
{\n    \"input_data\": [1, 2, 3, 4]\n}\n
with the corresponding curl
command:
curl -X POST \\\n https://your-cloud-function-url/knn_classifier \\\n -H \"Content-Type: application/json\" \\\n -d '{\"input_data\": [5.1, 3.5, 1.4, 0.2]}'\n
Let's try to figure out how to do the above deployment using gcloud
instead of the console UI. The relevant command is gcloud functions deploy. For this function to work you will need to put the main.py
and requirements.txt
in a separate folder. Try to execute the command to successfully deploy the function.
gcloud functions deploy <func-name> \\\n --gen2 --runtime python311 --trigger-http --source <folder> --entry-point knn_classifier\n
where you need to replace <func-name>
with the name of your function and <folder>
with the path to the folder containing the main.py
and requirements.txt
files.
(Optional) You can finally try to redo the exercises by deploying a PyTorch application. You will essentially need to go through the same steps as the sklearn example, including uploading a trained model to storage and writing a cloud function that loads it and returns some output. You are free to choose whatever PyTorch model you want.
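If you attempt this, a rough sketch of such a cloud function (assuming you have uploaded a scripted model called model.pt to a bucket; the bucket and file names are just placeholders) could look something like this:
import io\n\nimport functions_framework\nimport torch\nfrom google.cloud import storage\n\nBUCKET_NAME = \"my_pytorch_model_bucket\"\nMODEL_FILE = \"model.pt\"\n\n# download the scripted model from the bucket and load it into memory\nclient = storage.Client()\nblob = client.get_bucket(BUCKET_NAME).get_blob(MODEL_FILE)\nmodel = torch.jit.load(io.BytesIO(blob.download_as_bytes()))\nmodel.eval()\n\n\n@functions_framework.http\ndef torch_classifier(request):\n    \"\"\"Run inference on a list of numbers provided in the request.\"\"\"\n    request_json = request.get_json()\n    if request_json and \"input_data\" in request_json:\n        x = torch.tensor(request_json[\"input_data\"], dtype=torch.float32).unsqueeze(0)\n        with torch.no_grad():\n            prediction = model(x)\n        return {\"prediction\": prediction.tolist()}\n    return {\"error\": \"No input data provided.\"}\n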
Cloud functions are great for simple deployments, that can be encapsulated in a single script with only simple requirements. However, they do not scale with more advanced applications that may depend on multiple programming languages. We are already familiar with how we can deal with this through containers and Cloud Run is the corresponding service in GCP for deploying containers.
"},{"location":"s7_deployment/cloud_deployment/#exercises_1","title":"\u2754 Exercises","text":"We are going to start locally by developing a small app that we can deploy. We provide two small examples to choose from: first is a small FastAPI app consisting of a single Python script and a docker file. The second is a small Streamlit app (which you can learn more about in this module) consisting of a single docker file. You can choose which one you want to work with.
Simple Fastapi app simple_fastapi_app.pyfrom fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/\")\ndef read_root():\n \"\"\"Root endpoint.\"\"\"\n return {\"Hello\": \"World\"}\n\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n \"\"\"Get an item by id.\"\"\"\n return {\"item_id\": item_id}\n
simple_fastapi_app.dockerfileFROM python:3.9-slim\n\nEXPOSE $PORT\n\nWORKDIR /app\n\nRUN apt-get update && apt-get install -y \\\n build-essential \\\n software-properties-common \\\n git \\\n && rm -rf /var/lib/apt/lists/*\n\nRUN pip install fastapi\nRUN pip install pydantic\nRUN pip install uvicorn\n\nCOPY simple_fastapi_app.py simple_fastapi_app.py\n\nCMD exec uvicorn simple_fastapi_app:app --port $PORT --host 0.0.0.0 --workers 1\n
Simple Streamlit app streamlit_app.dockerfileFROM python:3.9-slim\n\nEXPOSE $PORT\n\nWORKDIR /app\n\nRUN apt-get update && apt-get install -y \\\n build-essential \\\n software-properties-common \\\n git \\\n && rm -rf /var/lib/apt/lists/*\n\nRUN git clone https://github.com/streamlit/streamlit-example.git .\n\nRUN pip3 install -r requirements.txt\n\nENTRYPOINT [\"streamlit\", \"run\", \"streamlit_app.py\", \"--server.port=$PORT\", \"--server.address=0.0.0.0\"]\n
Start by going over the files belonging to your choice app and understand what it does.
Next, build the docker image belonging to the app
docker build -f <dockerfile> . -t gcp_test_app:latest\n
Next tag and push the image to your artifact registry
docker tag gcp_test_app <region>-docker.pkg.dev/<project-id>/<registry-name>/gcp_test_app:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<registry-name>/gcp_test_app:latest\n
Afterward, check your artifact registry contains the pushed image.
Next, go to Cloud Run
in the cloud console and enable the service or use the following command:
gcloud services enable run.googleapis.com\n
Click the Create Service
button which should bring you to a page similar to the one below
Do the following:
Click the select button, which will bring up all build containers and pick the one you want to deploy. In the future, you probably want to choose the Continuously deploy new revisions from a source repository such that a new version is always deployed when a new container is built.
Hereafter, give the service a name and select the region. We recommend choosing a region close to you.
Set the authentication method to Allow unauthenticated invocations such that we can call it without providing credentials. In the future, you may only set that authenticated invocations are allowed.
Expand the Container, Connections, Security tab and edit the port such that it matches the port exposed in your chosen application. If your docker file exposes the env variable $PORT
you can set the port to anything.
Finally, click the create button and wait for the service to be deployed (may take some time).
Common problems
If you get an error saying The user-provided container failed to start and listen on the port defined by the PORT environment variable. there are two common reasons for this:
You need to add an EXPOSE
statement in your docker container:
EXPOSE 8080\nCMD exec uvicorn my_application:app --port 8080 --workers 1 main:app\n
and make sure that your application is also listening on that port. If you hard code the port in your application (as in the above code) it is best to set it 8080 which is the default port for cloud run. Alternatively, a better approach is to set it to the $PORT
environment variable which is set by cloud run and can be accessed in your application:
EXPOSE $PORT\nCMD exec uvicorn my_application:app --port $PORT --workers 1 main:app\n
If you do this and then want to run locally you can run it as:
docker run -p 8080:8080 -e PORT=8080 <image-name>:<image-tag>\n
If you are serving a large machine-learning model, it may also be that your deployed container is running out of memory. You can try to increase the memory of the container by going to the Edit container and the Resources tab and increasing the memory.
If you manage to deploy the service you should see an image like this:
You can now access your application by clicking the URL. This will access the root of your application, so you may need to add /
or /<path>
to the URL depending on how the app works.
Everything we just did in the console UI we can also do with the gcloud run deploy. How would you do that?
SolutionThe command should look something like this
gcloud run deploy <service-name> \\\n --image <image-name>:<image-tag> --platform managed --region <region> --allow-unauthenticated\n
where you need to replace <service-name>
with the name of your service, <image-name>
with the name of your image and <region>
with the region you want to deploy to. The --allow-unauthenticated
flag is optional but is needed if you want to access the service without providing credentials.
After deploying using the command line, make sure that the service is up and running by using these two commands
gcloud run services list\ngcloud run services describe <service-name> --region <region>\n
Instead of deploying our docker container using the UI or command line, which is a manual operation, we can do it continuously by using cloudbuild.yaml
file we learned about in the previous section. This is called continuous deployment, and it is a way to automate the deployment process.
Image credit
Let's revise the cloudbuild.yaml
file from the artifact registry exercises in this module which will build and push a specified docker image.
cloudbuild.yaml
cloudbuild.yamlsteps:\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Build container image'\n args: [\n 'build',\n '.',\n '-t',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n '-f',\n '<path-to-dockerfile>'\n ]\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Push container image'\n args: [\n 'push',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n ]\n
Add a third step to the cloudbuild.yaml
file that deploys the container image to Cloud Run. The relevant service you need to use is called 'gcr.io/cloud-builders/gcloud'
and the command is 'gcloud run deploy'
. Afterwards, reuse the trigger you created in the previous module or create a new one to build and deploy the container image continuously. Confirm that this works by making a change to your application and pushing it to GitHub and see if the application is updated continuously.
The full cloudbuild.yaml
file should look like this:
steps:\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Build container image'\n args: [\n 'build',\n '.',\n '-t',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n '-f',\n '<path-to-dockerfile>'\n ]\n- name: 'gcr.io/cloud-builders/docker'\n id: 'Push container image'\n args: [\n 'push',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>'\n ]\n- name: 'gcr.io/cloud-builders/gcloud'\n id: 'Deploy to Cloud Run'\n args: [\n 'run',\n 'deploy',\n '<service-name>',\n '--image',\n 'europe-west1-docker.pkg.dev/$PROJECT_ID/<registry-name>/<image-name>',\n '--region',\n 'europe-west1',\n '--platform',\n 'managed',\n ]\n
In the previous module on using the cloud you learned about the Secrets Manager in GCP. How can you use this service in combination with Cloud Run?
SolutionIn the cloud console, secrets can be set in the Container(s), Volumes, Networking, Security tab under the Variables & Secrets section, see image below.
In the gcloud
command, you can set the secret by using the --update-secrets
flag.
gcloud run deploy <service-name> \\\n --image <image-name>:<image-tag> --platform managed \\\n --region <region> --allow-unauthenticated \\\n --update-secrets <secret-name>=<secret-version>\n
That ends the exercises on deployment. The exercises above are just a small taste of what deployment has to offer. In both sections, we have explicitly chosen to work with serverless deployments. But what if you wanted to do the opposite e.g. being the one in charge of the management of the cluster that handles the deployed services? If you are interested in taking deployment to the next level should get started on Kubernetes which is the de-facto open-source container orchestration platform that is being used in production environments. If you want to deep dive we recommend starting here which describes how to make pipelines that are a necessary component before you start to create your own Kubernetes cluster.
"},{"location":"s7_deployment/frontend/","title":"M26 - Frontend","text":""},{"location":"s7_deployment/frontend/#frontend","title":"Frontend","text":"If you have gone over the deployment module you should be at the point where you have a machine learning model running in the cloud. The model can be interacted with by sending HTTP requests to the API endpoint. In general we refer to this as the backend of the application. It is the part of our application that are behind-the-scene that the user does not see and it is not really that user-friendly. Instead we want to create a frontend that the user can interact with in a more user-friendly way. This is what we will be doing in this module.
Another point of splitting our application into a frontend and a backend has to do with scalability. If we have a lot of users interacting with our application, we might want to scale only the backend and not the frontend, because that is the part that will be running our heavy machine learning model. In general, dividing an application into smaller pieces is the pattern used in microservice architectures.
In monolithic applications everything the user may be requesting of our application is handled by a single process/container. In microservice architectures the application is split into smaller pieces that can be scaled independently. This also leads to easier maintainability and faster development. Frontends have for the longest time been created using HTML, CSS and JavaScript. This is still the case, but there are now a lot of frameworks that can help us create a frontend in Python:
In this module we will be looking at streamlit
. streamlit
is an easy-to-use framework that allows us to create interactive web applications in Python. It is not at all as powerful as a framework like Django
, but it is very easy to get started with and it is very easy to integrate with our machine learning models.
In these exercises we go through the process of setting up a backend using fastapi
and a frontend using streamlit
, containerizing both applications and then deploying them to the cloud. We have already created an example of this which can be found in the samples/frontend_backend
folder.
Lets start by creating the backend application in a backend.py
file. You can use essentially any backend you want, but we will be using a simple imagenet classifier that we have created in the samples/frontend_backend/backend
folder.
Create a new file called backend.py
and implement a FastAPI interface with a single /predict
endpoint that takes a image as input and returns the predicted class (and probabilities) of the image.
import json\nfrom contextlib import asynccontextmanager\n\nimport anyio\nimport torch\nfrom fastapi import FastAPI, File, HTTPException, UploadFile\nfrom PIL import Image\nfrom torchvision import models, transforms\n\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n \"\"\"Context manager to start and stop the lifespan events of the FastAPI application.\"\"\"\n global model, transform, imagenet_classes\n # Load model\n model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)\n model.eval()\n\n transform = transforms.Compose(\n [\n transforms.Resize((224, 224)),\n transforms.ToTensor(),\n transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),\n ],\n )\n\n async with await anyio.open_file(\"imagenet-simple-labels.json\") as f:\n imagenet_classes = json.load(f)\n\n yield\n\n # Clean up\n del model\n del transform\n del imagenet_classes\n\n\napp = FastAPI(lifespan=lifespan)\n\n\ndef predict_image(image_path: str) -> str:\n \"\"\"Predict image class (or classes) given image path and return the result.\"\"\"\n img = Image.open(image_path).convert(\"RGB\")\n img = transform(img).unsqueeze(0)\n with torch.no_grad():\n output = model(img)\n _, predicted_idx = torch.max(output, 1)\n return output.softmax(dim=-1), imagenet_classes[predicted_idx.item()]\n\n\n@app.get(\"/\")\nasync def root():\n \"\"\"Root endpoint.\"\"\"\n return {\"message\": \"Hello from the backend!\"}\n\n\n# FastAPI endpoint for image classification\n@app.post(\"/classify/\")\nasync def classify_image(file: UploadFile = File(...)):\n \"\"\"Classify image endpoint.\"\"\"\n try:\n contents = await file.read()\n async with await anyio.open_file(file.filename, \"wb\") as f:\n f.write(contents)\n probabilities, prediction = predict_image(file.filename)\n return {\"filename\": file.filename, \"prediction\": prediction, \"probabilities\": probabilities.tolist()}\n except Exception as e:\n raise HTTPException(status_code=500) from e\n
Run the backend using uvicorn
uvicorn backend:app --reload\n
Test the backend by sending a request to the /predict
endpoint, preferably using curl
command
In this example we are sending a request to the /predict
endpoint with a file called my_cat.jpg
. The response should be \"tabby cat\" for the solution we have provided.
curl -X 'POST' \\\n 'http://127.0.0.1:8000/classify/' \\\n -H 'accept: application/json' \\\n -H 'Content-Type: multipart/form-data' \\\n -F 'file=@my_cat.jpg;type=image/jpeg'\n
Create a requirements_backend.txt
file with the dependencies needed for the backend.
fastapi>=0.108.0\nuvicorn>=0.25.0\ntorch>=2.1.2\ntorchvision>=0.16.2\n
Containerize the backend into a file called backend.dockerfile
.
FROM python:3.11-slim\n\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc git && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n\nRUN mkdir /app\n\nWORKDIR /app\n\nCOPY requirements_backend.txt /app/requirements_backend.txt\nCOPY backend.py /app/backend.py\n\nRUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements_backend.txt\n\nEXPOSE $PORT\nCMD exec unicorn --port $PORT --host 0.0.0.0 backend:app\n
Build the backend image
docker build -t backend:latest -f backend.dockerfile .\n
Recheck that the backend works by running the image in a container
docker run --rm -p 8000:8000 -e \"PORT=8000\" backend\n
and test that it works by sending a request to the /predict
endpoint.
Deploy the backend to Cloud run using the gcloud
command
Assuming that we have created an artifact registry called frontend_backend
we can deploy the backend to Cloud Run using the following commands:
docker tag \\\n backend:latest \\\n <region>-docker.pkg.dev/<project>/frontend-backend/backend:latest\ndocker push \\\n <region>.pkg.dev/<project>/frontend-backend/backend:latest\ngcloud run deploy backend \\\n --image=europe-west1-docker.pkg.dev/<project>/frontend-backend/backend:latest \\\n --region=europe-west1 \\\n --platform=managed \\\n
where <region>
and <project>
should be replaced with the appropriate values.
Finally, test that the deployed backend works as expected by sending a request to the /predict
endpoint
In this solution we are first extracting the url of the deployed backend and then sending a request to the /predict
endpoint.
export MYENDPOINT=$(gcloud run services describe backend --region=<region> --format=\"value(status.url)\")\ncurl -X 'POST' \\\n $MYENDPOINT/predict \\\n -H 'accept: application/json' \\\n -H 'Content-Type: multipart/form-data' \\\n -F 'file=@my_cat.jpg;type=image/jpeg'\n
With the backend taken care of lets now write our frontend. Our frontend just needs to be a \"nice\" interface to our backend. Its main functionality will be to send a request to the backend and display the result. streamlit documentation
Start by installing streamlit
pip install streamlit\n
Now create a file called frontend.py
and implement a streamlit application. You can design it as you want, but we recommend that the following can be done in the frontend:
Have a file uploader that allows the user to upload an image
Display the image that the user uploaded
Have a button that sends the image to the backend and displays the result
For now just assume that a environment variable called BACKEND
is available that contains the URL of the backend. We will in the next step show how to get this URL automatically.
import os\n\nimport pandas as pd\nimport requests\nimport streamlit as st\nfrom google.cloud import run_v2\n\n\ndef get_backend_url():\n \"\"\"Get the URL of the backend service.\"\"\"\n parent = \"projects/my-personal-mlops-project/locations/europe-west1\"\n client = run_v2.ServicesClient()\n services = client.list_services(parent=parent)\n for service in services:\n if service.name.split(\"/\")[-1] == \"production-model\":\n return service.uri\n return os.environ.get(\"BACKEND\", None)\n\n\ndef classify_image(image, backend):\n \"\"\"Send the image to the backend for classification.\"\"\"\n predict_url = f\"{backend}/predict\"\n response = requests.post(predict_url, files={\"image\": image}, timeout=10)\n if response.status_code == 200:\n return response.json()\n return None\n\n\ndef main() -> None:\n \"\"\"Main function of the Streamlit frontend.\"\"\"\n backend = get_backend_url()\n if backend is None:\n msg = \"Backend service not found\"\n raise ValueError(msg)\n\n st.title(\"Image Classification\")\n\n uploaded_file = st.file_uploader(\"Upload an image\", type=[\"jpg\", \"jpeg\", \"png\"])\n\n if uploaded_file is not None:\n image = uploaded_file.read()\n result = classify_image(image, backend=backend)\n\n if result is not None:\n prediction = result[\"prediction\"]\n probabilities = result[\"probabilities\"]\n\n # show the image and prediction\n st.image(image, caption=\"Uploaded Image\")\n st.write(\"Prediction:\", prediction)\n\n # make a nice bar chart\n data = {\"Class\": [f\"Class {i}\" for i in range(10)], \"Probability\": probabilities}\n df = pd.DataFrame(data)\n df.set_index(\"Class\", inplace=True)\n st.bar_chart(df, y=\"Probability\")\n else:\n st.write(\"Failed to get prediction\")\n\n\nif __name__ == \"__main__\":\n main()\n
We need to make sure that the frontend knows where the backend is located, and we want that to happen automatically so we do not have to hardcode the URL into our frontend. We can do this by using the Python SDK for Google Cloud Run. The following code snippet shows how to get the URL of the backend service or fall back to an environment variable if the service is not found.
from google.cloud import run_v2\nimport streamlit as st\n\n@st.cache_resource # (1)!\ndef get_backend_url():\n \"\"\"Get the URL of the backend service.\"\"\"\n parent = \"projects/<project>/locations/<region>\"\n client = run_v2.ServicesClient()\n services = client.list_services(parent=parent)\n for service in services:\n if service.name.split(\"/\")[-1] == \"production-model\":\n return service.uri\n name = os.environ.get(\"BACKEND\", None)\n return name\n
st.cache_resource
is a decorator that tells streamlit
to cache the result of the function. This is useful if the function is expensive to run and we want to avoid running it multiple times.Add the above code snippet to the top of your frontend.py
file and replace <project>
and <region>
with the appropriate values. You will need to install pip install google-cloud-run
to be able to use the code snippet.
Run the frontend using streamlit
streamlit run frontend.py\n
Create a requirements_frontend.txt
file with the dependencies needed for the frontend.
streamlit>=1.28.2\npandas>=2.1.3\ngoogle-cloud-run>=0.10.5\n
Containerize the frontend into a file called frontend.dockerfile
.
FROM python:3.11-slim\n\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc git && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n\nRUN mkdir /app\n\nWORKDIR /app\n\nCOPY requirements_frontend.txt /app/requirements_frontend.txt\nCOPY frontend.py /app/frontend.py\n\nRUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements_frontend.txt\n\nEXPOSE $PORT\n\nCMD [\"streamlit\", \"run\", \"frontend.py\", \"--server.port\", \"$PORT\"]\n
Build the frontend image
docker build -t frontend:latest -f frontend.dockerfile .\n
Run the frontend image
docker run --rm -p 8001:8001 -e \"PORT=8001\" frontend\n
and check in your web browser that the frontend works as expected.
Deploy the frontend to Cloud run using the gcloud
command
Assuming that we have created an artifact registry called frontend_backend
we can deploy the backend to Cloud Run using the following commands:
docker tag frontend:latest \\\n <region>-docker.pkg.dev/<project>/frontend-backend/frontend:latest\ndocker push <region>.pkg.dev/<project>/frontend-backend/frontend:latest\ngcloud run deploy frontend \\\n --image=europe-west1-docker.pkg.dev/<project>/frontend-backend/frontend:latest \\\n --region=europe-west1 \\\n --platform=managed \\\n
Test that frontend works as expected by opening the URL of the deployed frontend in your web browser.
(Optional) If you have gotten this far you have successfully created a frontend and a backend and deployed them to the cloud. Finally, it may be worth it to load test your application to see how it performs under load. Write a locust file which is covered in this module and run it against your frontend. Make sure that it can handle the load you expect it to handle.
(Optional) Feel free to experiment further with streamlit and see what you can create. For example, you can try to create a option for the user to upload a video and then display the video with the predicted class overlaid on top of the video.
We have created separate requirements files for the frontend and the backend. Why is this a good idea?
SolutionThis is a good idea because the frontend and the backend may have different dependencies. By having separate requirements files we can make sure that we only install the dependencies that are needed for the specific application. This also has the positive side effect that we can keep the docker images smaller. For example, the frontend does not need the torch
library which is huge and only needed for the backend.
This ends the exercises for this module.
"},{"location":"s7_deployment/ml_deployment/","title":"M25 - ML deployment","text":""},{"location":"s7_deployment/ml_deployment/#deployment-of-machine-learning-models","title":"Deployment of Machine Learning Models","text":"In one of the previous modules you learned about how to use FastAPI to create an API to interact with your machine learning models. FastAPI is a great framework, but it is a general framework meaning that it was not developed with machine learning applications in mind. This means that there are features which you may consider to be missing when considering running large scale machine learning models:
Dynamic-batching: if you have a large number of requests coming in, you may want to process them in batches to reduce the overhead of loading the model and running the inference. This is especially true if you are running your model on a GPU, where the overhead of loading the model is significant.
Async inference: FastAPI does support async requests, but it has no way to call the model asynchronously. This means that if you have a large number of requests coming in, you will have to wait for the model to finish processing (because the model is not async) before you can start processing the next request.
Native GPU support: you can definitely run your application on a GPU with FastAPI if you want to. But again, FastAPI was not built with machine learning in mind, so you will have to do some extra work to get it to work.
It should come as no surprise that multiple frameworks have therefore sprung up that better support deployment of machine learning algorithms (just listing a few here):
\ud83c\udf1f Framework \ud83e\udde9 Backend Agnostic \ud83e\udde0 Model Agnostic \ud83d\udcc2 Repository \u2b50 Github Stars Cortex \u2705 \u2705 \ud83d\udd17 Link 8.0k BentoML \u2705 \u2705 \ud83d\udd17 Link 7.2k Ray Serve \u2705 \u2705 \ud83d\udd17 Link 34.1k Triton Inference Server \u2705 \u2705 \ud83d\udd17 Link 8.4k OpenVINO \u2705 \u2705 \ud83d\udd17 Link 7.3k Seldon-core \u2705 \u2705 \ud83d\udd17 Link 4.4k Litserve \u2705 \u2705 \ud83d\udd17 Link 2.5k Torchserve \u274c \u2705 \ud83d\udd17 Link 4.2k TensorFlow serve \u274c \u2705 \ud83d\udd17 Link 6.2k vLLM \u274c \u274c \ud83d\udd17 Link 30.6kThe first 7 frameworks are backend agnostic, meaning that they are intended to work with whatever computational backend you model is implemented in (TensorFlow, PyTorch, Jax, Sklearn etc.), whereas the last 3 are backend specific (PyTorch, TensorFlow and a custom framework). The first 9 frameworks are model agnostic, meaning that they are intended to work with whatever model you have implemented, whereas the last one is model specific in this case to LLM's. When choosing a framework to deploy your model, you should consider the following:
Ease of use. Some frameworks are easier to use and get started with than others, but may have fewer features. As an example from the list above, Litserve
is very easy to get started with but is a relatively new framework and may not have all the features you need.
Performance. Some frameworks are optimized for performance, but may be harder to use. As an example from the list above, vLLM
is a very high performance framework for serving large language models but it cannot be used for other types of models.
Community. Some frameworks have a large community, which can be helpful if you run into problems. As an example from the list above, Triton Inference Server
is developed by Nvidia and has a large community of users. As a good rule of thumb, the more stars a repository has on Github, the larger the community.
In this module we are going to be looking at the BentoML
framework because it strikes a good balance between ease of use and having a lot of features that can improve the performance of serving your models. However, before we dive into this serving framework, we are going to look at a general way to package our machine learning models that should work with most of the above frameworks.
Whenever we want to serve a machine learning model, we in general need 3 things: the code that defines the model, the trained model weights, and the computational environment (dependencies) to run them in.
In the previous module on Docker we learned how to package all of these things into a container. This is a great way to package a model, but it is not the only way. The core assumption we currently have made is that the computational backend is the same as the one we trained the model on. However, this does not need to be the case. As long as we can export our model and weights to a common format, we can run the model on any backend that supports this format.
This is exactly what the Open Neural Network Exchange (ONNX) is designed to do. ONNX is a standardized format for creating and sharing machine learning models. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types. The idea behind ONNX is that a model trained with a specific framework on a specific device, let's say PyTorch on your local computer, can be exported and run with an entirely different framework and hardware easily. Learning how to export your models to ONNX is therefore a great way to increase the longevity of your models and not being locked into a specific framework for serving your models.
The ONNX format is designed to bridge the gap between development and deployment of machine learning models, by making it easy to export models between different frameworks and hardware. For example, PyTorch is in general considered a developer-friendly framework, however it has historically been slower to run inference with than frameworks that are optimized specifically for inference. Image credit"},{"location":"s7_deployment/ml_deployment/#exercises","title":"\u2754 Exercises","text":"Start by installing ONNX, ONNX Runtime and ONNX Script. This can be done by running the following command
pip install onnx onnxruntime onnxscript\n
The first package contains the core ONNX framework, the second package contains the runtime for running ONNX models, and the third package contains a new experimental package that is designed to make it easier to export models to ONNX.
Let's start out with converting a model to ONNX. The following code snippets show how to export a PyTorch model to ONNX.
PyTorch >= 2.0 (the next two snippets show the same export for PyTorch < 2.0 or Windows, and for PyTorch Lightning):\nimport torch\nimport torchvision\n\nmodel = torchvision.models.resnet18(weights=None)\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nonnx_model = torch.onnx.dynamo_export(\n    model,\n    dummy_input,\n    export_options=torch.onnx.ExportOptions(dynamic_shapes=True),\n)\nonnx_model.save("resnet18.onnx")\n
import torch\nimport torchvision\n\nmodel = torchvision.models.resnet18(weights=None)\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\ntorch.onnx.export(\n model=model,\n args=(dummy_input,),\n f=\"resnet18.onnx\",\n input_names=[\"input\"],\n output_names=[\"output\"],\n dynamic_axes={\"input\": {0: \"batch_size\"}, \"output\": {0: \"batch_size\"}}\n)\n
import torch\nimport torchvision\nimport pytorch_lightning as pl\nimport onnx\nimport onnxruntime\n\nclass LitModel(pl.LightningModule):\n def __init__(self):\n super().__init__()\n self.model = torchvision.models.resnet18(pretrained=True)\n self.model.eval()\n\n def forward(self, x):\n return self.model(x)\n\nmodel = LitModel()\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nmodel.to_onnx(\n file_path=\"resnet18.onnx\",\n input_sample=dummy_input,\n input_names=[\"input\"],\n output_names=[\"output\"],\n dynamic_axes={\"input\": {0: \"batch_size\"}, \"output\": {0: \"batch_size\"}}\n)\n
Export a model of your own choice to ONNX or just try to export the resnet18
model as shown in the examples above, and confirm that the model was exported by checking that the file exists. Can you figure out what is meant by dynamic_axes
?
The dynamic_axes
argument is used to specify which axes of the input tensor should be considered dynamic. This is useful when the model can accept inputs of different sizes, e.g. when the model is used in a dynamic batching scenario. In the example above we have specified that the first axis of the input tensor should be considered dynamic, meaning that the model can accept inputs of different batch sizes. While it may be tempting to specify all axes as dynamic, this can lead to slower inference times, because the ONNX runtime will not be able to optimize the computational graph as well.
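As a small, hedged illustration of the trade-off (the file and axis names below are just examples), an export where the batch dimension and the spatial dimensions are all marked as dynamic could look like this:
import torch\nimport torchvision\n\nmodel = torchvision.models.resnet18(weights=None).eval()\ndummy_input = torch.randn(1, 3, 224, 224)\n\n# marking more axes than necessary as dynamic gives flexibility but limits graph optimizations\ntorch.onnx.export(\n    model,\n    (dummy_input,),\n    'resnet18_dynamic.onnx',\n    input_names=['input'],\n    output_names=['output'],\n    dynamic_axes={'input': {0: 'batch_size', 2: 'height', 3: 'width'}, 'output': {0: 'batch_size'}},\n)\n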
Check that the model was correctly exported by loading it using the onnx
package and afterwards check the graph of model using the following code:
import onnx\nmodel = onnx.load(\"resnet18.onnx\")\nonnx.checker.check_model(model)\nprint(onnx.helper.printable_graph(model.graph))\n
To get a better understanding of what is actually exported, let's try to visualize the computational graph of the model. This can be done using the open-source tool netron. You can either try it out directly in the browser or install it locally using pip install netron
and then run it using netron resnet18.onnx
. Can you figure out what method of the model is exported to ONNX?
When a PyTorch model is exported to ONNX, it is only the forward
method of the model that is exported. This means that it is the only method we have access to when we load the model later. Therefore, make sure that the forward
method of your model is implemented in a way that it can be used for inference.
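Because only forward is exported, any extra inference logic (for example a softmax over the logits) has to live inside forward if you want it included in the ONNX graph. A minimal, hedged sketch of how that could be done (the wrapper class and file name are just examples):
import torch\nimport torchvision\n\nclass InferenceWrapper(torch.nn.Module):\n    # forward includes the softmax, so it becomes part of the exported graph\n    def __init__(self) -> None:\n        super().__init__()\n        self.backbone = torchvision.models.resnet18(weights=None)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        return torch.softmax(self.backbone(x), dim=1)\n\nwrapped = InferenceWrapper().eval()\ntorch.onnx.export(wrapped, (torch.randn(1, 3, 224, 224),), 'resnet18_softmax.onnx')\n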
After converting a model to ONNX format we can use the ONNX Runtime to run it. The benefit of this is that the ONNX Runtime is able to optimize the computational graph of the model, which can lead to faster inference times. Let's try to look into that.
Figure out how to run a model using the ONNX Runtime. Relevant documentation.
SolutionTo use the ONNX runtime to run a model, we first need to start an inference session, then extract the input and output names of our model, and finally run the model. The following code snippet shows how to do this.
import numpy as np\nimport onnxruntime as rt\n\nort_session = rt.InferenceSession("<path-to-model>")\ninput_names = [i.name for i in ort_session.get_inputs()]\noutput_names = [i.name for i in ort_session.get_outputs()]\nbatch = {input_names[0]: np.random.randn(1, 3, 224, 224).astype(np.float32)}\nout = ort_session.run(output_names, batch)\n
Let's experiment with performance of ONNX vs. PyTorch. Implement a benchmark that measures the time it takes to run a model using PyTorch and ONNX. Bonus points if you test for multiple input sizes. To get you started we have implemented a timing decorator that you can use to measure the time it takes to run a function.
from statistics import mean, stdev\nimport time\ndef timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):\n \"\"\" Decorator that times the execution of a function. \"\"\"\n def wrapper(*args, **kwargs):\n timing_results = []\n for _ in range(timing_repeat):\n start_time = time.time()\n for _ in range(function_repeat):\n result = func(*args, **kwargs)\n end_time = time.time()\n elapsed_time = end_time - start_time\n timing_results.append(elapsed_time)\n print(f\"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds\")\n return result\n return wrapper\n
Solution onnx_benchmark.pyimport sys\nimport time\nfrom statistics import mean, stdev\n\nimport onnxruntime as ort\nimport torch\nimport torchvision\n\n\ndef timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):\n \"\"\"Decorator that times the execution of a function.\"\"\"\n\n def wrapper(*args, **kwargs):\n timing_results = []\n for _ in range(timing_repeat):\n start_time = time.time()\n for _ in range(function_repeat):\n result = func(*args, **kwargs)\n end_time = time.time()\n elapsed_time = end_time - start_time\n timing_results.append(elapsed_time)\n print(f\"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds\")\n return result\n\n return wrapper\n\n\nmodel = torchvision.models.resnet18()\nmodel.eval()\n\ndummy_input = torch.randn(1, 3, 224, 224)\nif sys.platform == \"win32\":\n # Windows doesn't support the new TorchDynamo-based ONNX Exporter\n torch.onnx.export(\n model,\n dummy_input,\n \"resnet18.onnx\",\n input_names=[\"input.1\"],\n dynamic_axes={\"input.1\": {0: \"batch_size\", 2: \"height\", 3: \"width\"}},\n )\nelse:\n torch.onnx.dynamo_export(model, dummy_input).save(\"resnet18.onnx\")\n\nort_session = ort.InferenceSession(\"resnet18.onnx\")\n\n\n@timing_decorator\ndef torch_predict(image) -> None:\n \"\"\"Predict using PyTorch model.\"\"\"\n model(image)\n\n\n@timing_decorator\ndef onnx_predict(image) -> None:\n \"\"\"Predict using ONNX model.\"\"\"\n ort_session.run(None, {\"input.1\": image.numpy()})\n\n\nif __name__ == \"__main__\":\n for size in [224, 448, 896]:\n dummy_input = torch.randn(1, 3, size, size)\n print(f\"Image size: {size}\")\n torch_predict(dummy_input)\n onnx_predict(dummy_input)\n
To get a better understanding of why running the model using the ONNX runtime is usually faster, let's try to see what happens to the computational graph. By default the ONNX Runtime will apply these optimizations in online mode, meaning that the optimizations are applied when the model is loaded. However, it is also possible to apply the optimizations in offline mode, such that the optimized model is saved to disk. Below is an example of how to do this.
import onnxruntime as rt\nsess_options = rt.SessionOptions()\n\n# Set graph optimization level\nsess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED\n\n# To enable model serialization after graph optimization set this\nsess_options.optimized_model_filepath = "optimized_model.onnx"\n\nsession = rt.InferenceSession("<model_path>", sess_options)\n
Try to apply the optimizations in offline mode and use netron
to visualize both the original and optimized model side by side. Can you see any differences?
You should hopefully see that the optimized model consists of fewer nodes and edges than the original model. These nodes are often called fused nodes, because they are the result of multiple nodes being fused together. In the image below we have visualized the first part of the computational graph of a resnet18 model, before and after optimization.
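If you prefer numbers over visual inspection, a small sketch like the one below (assuming the file names used earlier in these exercises) can be used to count the nodes in the two graphs:
import onnx\n\noriginal = onnx.load('resnet18.onnx')\noptimized = onnx.load('optimized_model.onnx')\n\n# fewer nodes after optimization usually means that operators have been fused\nprint(f'Original graph:  {len(original.graph.node)} nodes')\nprint(f'Optimized graph: {len(optimized.graph.node)} nodes')\n\n# operator types that only exist in the optimized graph are typically fused operators\nprint({node.op_type for node in optimized.graph.node} - {node.op_type for node in original.graph.node})\n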
As mentioned in the introduction, ONNX is able to run on many different types of hardware and execution engines. You can check all supported providers, as well as the providers available on your machine, by running the following code
import onnxruntime\nprint(onnxruntime.get_all_providers())\nprint(onnxruntime.get_available_providers())\n
Can you figure out how to set which provider the ONNX runtime should use?
SolutionThe provider that the ONNX runtime should use can be set by passing the providers
argument to the InferenceSession
class. A list should be provided, which prioritizes the providers in the order they are listed.
import onnxruntime as rt\nprovider_list = ['CUDAExecutionProvider', 'CPUExecutionProvider']\nort_session = rt.InferenceSession(\"<path-to-model>\", providers=provider_list)\n
In this case we will prefer CUDA Execution Provider over CPU Execution Provider if both are available.
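As a small sanity check, you can also ask the session (continuing from the snippet above) which providers it actually ended up with:
# prints the providers the session will use, in priority order\nprint(ort_session.get_providers())\n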
As you have probably realised in the exercises on docker, it can take a long time to build the kind of containers we are working with, and they can be quite large. The reason for this is that PyTorch is a very large framework with a lot of dependencies. ONNX, on the other hand, is a much smaller framework. This makes sense, because PyTorch is a framework that was primarily designed for developing and training models, while ONNX is a format designed for serving models. Let's try to quantify this.
Construct a dockerfile that builds a docker image with PyTorch as a dependency. The dockerfile does not actually need to run anything. Repeat the same process for the ONNX runtime. Bonus points for developing a docker image that takes a build arg at build time that specifies if the image should be built with CUDA support or not.
SolutionThe dockerfile for the PyTorch image could look something like this
inference_pytorch.dockerfileFROM python:3.11-slim\n\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n\nARG CUDA\nENV CUDA=${CUDA}\nRUN echo \"CUDA is set to: ${CUDA}\"\n\nRUN echo \"CUDA is set to: ${CUDA}\" && \\\n if [ -n \"$CUDA\" ]; then \\\n pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121; \\\n else \\\n pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu; \\\n fi\n
and the dockerfile for the ONNX image could look something like this
inference_onnx.dockerfileFROM python:3.11-slim\n\nRUN apt update && \\\n    apt install --no-install-recommends -y build-essential gcc && \\\n    apt clean && rm -rf /var/lib/apt/lists/*\n\nARG CUDA\nENV CUDA=${CUDA}\n\nRUN echo "CUDA is set to: ${CUDA}" && \\\n    if [ -n "$CUDA" ]; then \\\n        pip install onnxruntime-gpu; \\\n    else \\\n        pip install onnxruntime; \\\n    fi\n
Build both containers and measure the time it takes to build them. How much faster is it to build the ONNX container compared to the PyTorch container?
SolutionOn unix/linux you can use the time command to measure the time it takes to build the containers. Building both images, with and without CUDA support, can be done with the following commands
time docker build . -t pytorch_inference_cuda:latest -f inference_pytorch.dockerfile \\\n --no-cache --build-arg CUDA=true\ntime docker build . -t pytorch_inference:latest -f inference_pytorch.dockerfile \\\n --no-cache --build-arg CUDA=\ntime docker build . -t onnx_inference_cuda:latest -f inference_onnx.dockerfile \\\n --no-cache --build-arg CUDA=true\ntime docker build . -t onnx_inference:latest -f inference_onnx.dockerfile \\\n --no-cache --build-arg CUDA=\n
the --no-cache
flag is used to ensure that the build process is not cached and ensure a fair comparison. On my laptop this respectively took 5m1s
, 1m4s
, 0m4s
, 0m50s
meaning that the ONNX container was respectively about 75x (with CUDA) and 1.28x (no CUDA) faster to build than the PyTorch container.
Find out the size of the two docker images. It can be done in the terminal by running the docker images
command. How much smaller is the ONNX image compared to the PyTorch image?
As of writing the docker image containing the PyTorch framework was 5.54GB (with CUDA) and 1.25GB (no CUDA). In comparison the ONNX image was 647MB (with CUDA) and 647MB (no CUDA). This means that the ONNX image is respectively 8.5x (with CUDA) and 1.94x (no CUDA) smaller than the PyTorch image.
(Optional) Assuming you have completed the module on FastAPI try creating a small FastAPI application that serves a model using the ONNX runtime.
SolutionHere is a simple example of how to create a FastAPI application that serves a model using the ONNX runtime.
onnx_fastapi.pyimport numpy as np\nimport onnxruntime\nfrom fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/predict\")\ndef predict():\n \"\"\"Predict using ONNX model.\"\"\"\n # Load the ONNX model\n model = onnxruntime.InferenceSession(\"model.onnx\")\n\n # Prepare the input data\n input_data = {\"input\": np.random.rand(1, 3).astype(np.float32)}\n\n # Run the model\n output = model.run(None, input_data)\n\n return {\"output\": output[0].tolist()}\n
This completes the exercises on the ONNX format. Do note that one limitation of the ONNX format is that it is based on Protobuf, which is a binary format. A protobuf file can have a maximum size of 2GB, which means that the .onnx
format is not enough for very large models. However, through the use of external data it is possible to circumvent this limitation.
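For reference, a hedged sketch of how a model could be re-saved with its weights stored as external data (the file names are just examples) might look like this:
import onnx\n\nmodel = onnx.load('resnet18.onnx')\n# the weights are written to a separate file, keeping the .onnx graph itself below the 2GB protobuf limit\nonnx.save_model(\n    model,\n    'resnet18_external.onnx',\n    save_as_external_data=True,\n    all_tensors_to_one_file=True,\n    location='resnet18_external.data',\n)\n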
BentoML cloud vs BentoML OSS
We are only going to be looking at the open-source version of BentoML in this module. However, BentoML also has a cloud version that makes it very easy to deploy models that are coded in BentoML to the cloud. If you are interested in this, you can check out the BentoML cloud documentation. This business strategy of having an open-source product and a cloud product is very common in the machine learning space (HuggingFace, LightningAI, Weights and Biases etc.), because it allows companies to make money from the cloud product while still providing a free product to the community.
BentoML is a framework that is designed to make it easy to serve machine learning models. It is designed to be backend agnostic, meaning that it can be used with any computational backend. It is also model agnostic, meaning that it can be used with any machine learning model.
Let's consider a simple example of how to serve a model using BentoML. The following code snippet shows how to serve a model that uses the transformers
library to summarize text.
import bentoml\nfrom transformers import pipeline\n\nEXAMPLE_INPUT = (\n \"Breaking News: In an astonishing turn of events, the small town of Willow Creek has been taken by storm as \"\n \"local resident Jerry Thompson's cat, Whiskers, performed what witnesses are calling a 'miraculous and gravity-\"\n \"defying leap.' Eyewitnesses report that Whiskers, an otherwise unremarkable tabby cat, jumped a record-breaking \"\n \"20 feet into the air to catch a fly. The event, which took place in Thompson's backyard, is now being investigated \"\n \"by scientists for potential breaches in the laws of physics. Local authorities are considering a town festival to \"\n \"celebrate what is being hailed as 'The Leap of the Century.'\"\n)\n\n@bentoml.service(resources={\"cpu\": \"2\"}, traffic={\"timeout\": 10})\nclass Summarization:\n def __init__(self) -> None:\n self.pipeline = pipeline('summarization')\n\n @bentoml.api\n def summarize(self, text: str = EXAMPLE_INPUT) -> str:\n result = self.pipeline(text)\n return result[0]['summary_text']\n
In BentoML
we organize our services in classes, where each class is a service that we want to serve. The two important parts of the code snippet are the @bentoml.service
and @bentoml.api
decorators.
The @bentoml.service
decorator is used to specify the resources that the service should use and in general how the service should be run. In this case we are specifying that the service should use 2 CPU cores and that the timeout for the service should be 10 seconds.
The @bentoml.api
decorator is used to specify the API that the service should expose. In this case we are specifying that the service should have an API called summarize
that takes a string as input and returns a string as output.
To serve the model using BentoML
we can execute the following command, which is very similar to the command we used to serve the model using FastAPI.
bentoml serve service:Summarization\n
"},{"location":"s7_deployment/ml_deployment/#exercises_1","title":"\u2754 Exercises","text":"In general, we advise looking through the docs for Bento ML if you need help with any of the exercises. We are going to assume that you have done the exercises on ONNX and we are therefore going to be using BentoML
to serve ONNX models. If you have not done this part, you can still follow along but you will need to use a PyTorch model instead of an ONNX model.
Install BentoML
pip install bentoml\n
Remember to add the dependency to your requirements.txt
file.
You are in principle free to serve any model you like, but we recommend just using a torchvision model as in the ONNX exercises. Write your first service in BentoML
that serves a model of your choice. We recommend experimenting with providing input/output as tensors, because BentoML supports this natively. Secondly, write a client that can send a request to the service and print the result. Here we recommend using the built-in bentoml.SyncHTTPClient.
The following implements a simple BentoML service that serves an ONNX resnet18 model. The service expects both the input and the output to be numpy arrays.
bentoml_service.pyfrom __future__ import annotations\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierService:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model.onnx\")\n\n @bentoml.api\n def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n
The service can be served using the following command
bentoml serve bentoml_service:ImageClassifierService\n
To test that the service works the following client can be used
bentoml_client.pyimport bentoml\nimport numpy as np\nfrom PIL import Image\n\nif __name__ == "__main__":\n    image = Image.open("my_cat.jpg")\n    image = image.resize((224, 224))  # Resize to match the minimum input size of the model\n    image = np.array(image)\n    image = np.transpose(image, (2, 0, 1))  # Change to CHW format\n    image = np.expand_dims(image, axis=0)  # Add batch dimension\n\n    with bentoml.SyncHTTPClient("http://localhost:3000") as client:\n        resp = client.predict(image=image)\n        print(resp)\n
We are now going to look at features where BentoML
really sets itself apart from FastAPI
. The first is adaptive batching. As you are hopefully aware, modern machine learning models can process multiple samples at the same time, which increases the throughput of the model. When we train a model we often set a fixed batch size, however we cannot do that when serving the model, because that would mean we would have to wait for the batch to be full before we could process it. Adaptive batching simply refers to the process where we specify a maximum batch size and also a timeout. When either the batch is full or the timeout is reached, however many samples we have collected are sent to the model for processing. This can be a very powerful feature because it allows us to process samples as soon as they arrive, while still taking advantage of the increased throughput of batching.
The overall architecture of the adaptive batching feature in BentoML. The feature is implemented on the server side and mainly consists of a dispatcher that is in charge of collecting requests and sending them to the model server when either the batch is full or a timeout is reached. Image credit
Look through the documentation on adaptive batching and add adaptive batching to your service from the previous exercise. Make sure your service works as expected by testing it with the client from the previous exercise.
Solution bentoml_service_adaptive_batching.pyfrom __future__ import annotations\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierService:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model.onnx\")\n\n @bentoml.api(\n batchable=True,\n batch_dim=(0, 0),\n max_batch_size=128,\n max_latency_ms=1000,\n )\n def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n
Try to measure the throughput of your model with and without adaptive batching. Assuming that you have completed the module on testing APIs and therefore are familiar with the locust
framework, we recommend that you write a simple locustfile and use the locust
command to measure the throughput of your model.
The following locust file can be used to measure the throughput of the model with and without adaptive batching.
locustfile.pyimport numpy as np\nfrom locust import HttpUser, between, task\nfrom PIL import Image\n\n\ndef prepare_image():\n \"\"\"Load and preprocess the image as required.\"\"\"\n image = Image.open(\"my_cat.jpg\")\n image = image.resize((224, 224))\n image = np.array(image)\n image = np.transpose(image, (2, 0, 1)) # Convert to CHW format\n image = np.expand_dims(image, axis=0) # Add batch dimension\n # Convert to list format for JSON serialization\n return image.tolist()\n\n\nimage = prepare_image()\n\n\nclass BentoMLUser(HttpUser):\n \"\"\"Locust user class for sending prediction requests to the server.\"\"\"\n\n wait_time = between(1, 2)\n\n @task\n def send_prediction_request(self):\n \"\"\"Send a prediction request to the server.\"\"\"\n payload = {\"image\": image} # Package the image as JSON\n self.client.post(\"/predict\", json=payload, headers={\"Content-Type\": \"application/json\"})\n
and then the following command can be used to measure the throughput of the model
locust -f locustfile.py --host http://localhost:3000 --headless -u 50 -t 60s\n
You should hopefully see that the throughput of the model is higher when adaptive batching is enabled, but the speedup is largely dependent on the model you are running, the configuration of the adaptive batching and the hardware you are running on.
On my laptop I saw about a 1.5 - 2x speedup when adaptive batching was enabled.
(Optional, requires GPU) Look through the documentation for inference on GPU and add this to your service. Check that your service works as expected by testing it with the client from the previous exercise and make sure you are seeing a speedup when running on the GPU.
SolutionA simple change to the bento.service
decorator is all that is needed to run the model on the GPU.
@bentoml.service(resources={"gpu": 1})\nclass MyService:\n    def __init__(self) -> None:\n        self.model = torch.load('model.pth').to('cuda:0')\n
Another way to speed up inference is to use multiple workers. This duplicates the server over multiple processes, taking advantage of modern multi-core CPUs. This is similar to running the uvicorn
command with the --workers
flag for FastAPI applications. Implement multiple workers in your service and check that it works as expected by testing it with the client from the previous exercise. Also check that you are seeing a speedup when running with multiple workers.
Multiple workers can be added to the bento.service
decorator as shown below.
@bentoml.service(workers=4)\nclass MyService:\n # Service implementation\n
Alternatively, you can set workers=\"cpu_count\"
to use all available CPU cores. The speedup depends on the model you are serving, the hardware you are running on and the number of workers you are using, but it should be higher than using a single worker.
In addition to increasing the throughput of your deployments BentoML
can also help with ML applications that require some kind of composition of multiple models. It is very common in production setups to have multiple models that either run in sequence (the output of one model is the input to the next) or run in parallel (their outputs are combined into a single prediction).
BentoML
makes it easy to compose multiple models together.
Implement two services that run in sequence, e.g. the output of one service is used as the input of another service. As an example, you can implement either a pre- or post-processing service that is used in conjunction with the model you have implemented in the previous exercises.
SolutionThe following code snippet shows how to implement two services that run in sequence.
bentoml_service_composition.pyfrom __future__ import annotations\n\nfrom pathlib import Path\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\nfrom PIL import Image\n\n\n@bentoml.service\nclass ImagePreprocessorService:\n \"\"\"Image preprocessor service.\"\"\"\n\n @bentoml.api\n def preprocess(self, image_file: Path) -> np.ndarray:\n \"\"\"Preprocess the input image.\"\"\"\n image = Image.open(image_file)\n image = image.resize((224, 224))\n image = np.array(image)\n image = np.transpose(image, (2, 0, 1))\n return np.expand_dims(image, axis=0)\n\n\n@bentoml.service\nclass ImageClassifierService:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n preprocessing_service = bentoml.depends(ImagePreprocessorService)\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model.onnx\")\n\n @bentoml.api\n async def predict(self, image_file: Path) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n image = await self.preprocessing_service.to_async.preprocess(image_file)\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n
Implement three services, where two of them run concurrently and the outputs of both services are combined in the third service to make a prediction. As an example, you can expand your previous service to serve two different models and then implement a third service that combines the output of both models to make a prediction.
SolutionThe following code snippet shows how to implement a service that consists of two concurrent services. The example assumes that two models called model_a.onnx
and model_b.onnx
are available.
from __future__ import annotations\n\nimport asyncio\n\nimport bentoml\nimport numpy as np\nfrom onnxruntime import InferenceSession\n\n\n@bentoml.service\nclass ImageClassifierServiceModelA:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model_a.onnx\")\n\n @bentoml.api\n def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n\n\n@bentoml.service\nclass ImageClassifierServiceModelB:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n def __init__(self) -> None:\n self.model = InferenceSession(\"model_b.onnx\")\n\n @bentoml.api\n def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n output = self.model.run(None, {\"input\": image.astype(np.float32)})\n return output[0]\n\n\n@bentoml.service\nclass ImageClassifierService:\n \"\"\"Image classifier service using ONNX model.\"\"\"\n\n model_a = bentoml.depends(ImageClassifierServiceModelA)\n model_b = bentoml.depends(ImageClassifierServiceModelB)\n\n @bentoml.api\n async def predict(self, image: np.ndarray) -> np.ndarray:\n \"\"\"Predict the class of the input image.\"\"\"\n result_a, result_b = await asyncio.gather(\n self.model_a.to_async.predict(image), self.model_b.to_async.predict(image)\n )\n return (result_a + result_b) / 2\n
(Optional) Implement a server that consists of both sequential and concurrent services.
Similar to deploying a FastAPI application to the cloud, deploying a BentoML
framework to the cloud often requires you to first containerize the application. Because BentoML
is designed to be easy to use even for users who are not that familiar with Docker, it introduces the concept of a bentofile
. A bentofile
is a file that specifies how the container should be built. Below is an example of what a bentofile
could look like.
service: 'service:Summarization'\nlabels:\n owner: bentoml-team\n project: gallery\ninclude:\n - '*.py'\npython:\n packages:\n - torch\n - transformers\n
which can then be used to build a bento
using the following command
bentoml build\n
A bento
is not a docker image, but it can be used to build a docker image with the following command
bentoml containerize summarization:latest\n
Can you figure out how the different parts of the bentofile
are used to build the docker image? Additionally, can you figure out from the source repository how the bentofile
is used to build the docker image?
The service
part specifies both what the container should be called and also what service it should serve e.g. the last statement in the corresponding dockerfile is CMD [\"bentoml\", \"serve\", \"service:Summarization\"]
. The labels
part is used to specify labels about the container, see this link for more info. The include
part corresponds to COPY
statements in the dockerfile and finally the python
part is used to specify what python packages should be installed in the container which corresponds to RUN pip install ...
in the dockerfile.
Regarding how the bentofile
is used to build the docker image, the bentoml
package contains a number of templates (written using the jinja2 templating language) that are used to generate the dockerfiles. The templates can be found here.
Take whatever service from the previous exercises and try to containerize it. You are free to either write a bentofile
or a dockerfile
to do this.
The following bentofile
can be used to containerize the very first service we implemented in this set of exercises.
service: 'bentoml_service:ImageClassifierService'\nlabels:\n owner: bentoml-team\n project: gallery\ninclude:\n- 'bentoml_service.py'\n- 'model.onnx'\npython:\n packages:\n - onnxruntime\n - numpy\n
The corresponding dockerfile would look something like this
FROM python:3.11-slim\nWORKDIR /bento\nCOPY bentoml_service.py .\nCOPY model.onnx .\nRUN pip install onnxruntime numpy bentoml\nCMD [\"bentoml\", \"serve\", \"bentoml_service:ImageClassifierService\"]\n
Deploy the container to GCP Run and test that it works.
SolutionThe following command can be used to deploy the container to GCP Run. We assume that you have already built the container and called it bentoml_service:latest
.
docker tag bentoml_service:latest \\\n <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest\ndocker push <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest\ngcloud run deploy bentoml-service \\\n --image=<region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest \\\n --platform managed \\\n --port 3000 # default used by BentoML\n
where <project-id>
should be replaced with the id of the project you are deploying to (and <region> and <repository-name> with the region and Artifact Registry repository you are using). The service should now be available at the URL that is printed in the terminal.
This completes the exercises on the BentoML
framework. If you want to dive deeper into this, we can recommend looking into their tasks feature for use cases with very long running times and their built-in model management feature, which unifies the way models are loaded, managed and served.
How would you export a scikit-learn
model to ONNX? What method is exported when you export a scikit-learn
model to ONNX?
It is possible to export a scikit-learn
model to ONNX using the sklearn-onnx
package. The following code snippet shows how to export a scikit-learn
model to ONNX.
import numpy as np\nfrom skl2onnx import to_onnx\nfrom sklearn.datasets import load_iris\nfrom sklearn.ensemble import RandomForestClassifier\n\n# the model needs to be fitted before it can be converted\nX, y = load_iris(return_X_y=True)\nX = X.astype(np.float32)\nmodel = RandomForestClassifier(n_estimators=2)\nmodel.fit(X, y)\n\nonx = to_onnx(model, X[:1])\nwith open("model.onnx", "wb") as f:\n    f.write(onx.SerializeToString())\n
The method that is exported when you export a scikit-learn
model to ONNX is the predict
method.
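To convince yourself of this, a small hedged sketch (continuing from the export snippet above, and assuming model.onnx was written next to the script) could compare the ONNX output with model.predict:
from onnxruntime import InferenceSession\n\nsess = InferenceSession('model.onnx')\ninput_name = sess.get_inputs()[0].name\nsample = X[:5]\n\n# the first output of the exported classifier corresponds to the labels from model.predict\nonnx_labels = sess.run(None, {input_name: sample})[0]\nprint(onnx_labels, model.predict(sample))\n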
In your own words, describe what the concept of a computational graph means.
SolutionA computational graph is a way to represent the mathematical operations that are performed in a model. It is essentially a graph where the nodes are the operations and the edges are the data that is passed between them. The computational graph normally represents the forward pass of the model and is the reason that we can easily backpropagate through the model to train it, because the graph contains all the necessary information to calculate the gradients of the model.
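To make the concept a bit more concrete, here is a small sketch (using torch.fx purely as an illustration) that prints the computational graph of a tiny model, where each node is an operation and the edges are the tensors flowing between them:
import torch\n\nclass TinyNet(torch.nn.Module):\n    # a tiny model used only to illustrate what a computational graph looks like\n    def __init__(self) -> None:\n        super().__init__()\n        self.linear = torch.nn.Linear(4, 2)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        return torch.relu(self.linear(x))\n\n# symbolic tracing records the forward pass as a graph of operations\ngraph_module = torch.fx.symbolic_trace(TinyNet())\nprint(graph_module.graph)\n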
In your own words, explain why fusing operations together in the computational graph often leads to better performance?
SolutionEach time we want to do a computation, the data needs to be loaded from memory into the CPU/GPU. This is a slow process and the more operations we have, the more times we need to load the data. By fusing operations together, we can reduce the number of times we need to load the data, because we can do multiple operations on the same data before we need to load new data.
This ends the module on tools specifically designed for serving machine learning models. As stated in the beginning of the module, there are a lot of different tools that can be used to serve machine learning models and the choice of tool often depends on the specific use case. In general, we recommend that whenever you want to serve a machine learning model, you should try out a few different frameworks and see which one fits your use case the best.
"},{"location":"s7_deployment/testing_apis/","title":"M24 - API Testing","text":""},{"location":"s7_deployment/testing_apis/#api-testing","title":"API testing","text":"Core Module
API testing, similar to unit testing, is a type of software testing that involves testing the application programming interface (API) directly to ensure it meets requirements for functionality, reliability, performance, and security. The core difference from the unit testing we have been implementing until now is that instead of testing the individual functions, we are testing the entire API as a whole. API testing is therefore a form of integration testing. Additionally, another difference is that we need to simulate API calls that should be as similar as possible to the ones that will be made by the users of the API.
There are in general two things that we want to test when we are working with APIs:
In this module, we go over how to do each of them.
"},{"location":"s7_deployment/testing_apis/#testing-for-functionality","title":"Testing for functionality","text":"Similar to when we wrote unit tests for our code back in this module we can also write tests for our API that checks that our code does what it is supposed to do e.g. by using assert
statements. As always we recommend implementing the tests in a separate folder called tests
, but we recommend that you add further subfolders to separate the different types of tests. For example, for the type of machine learning projects and APIs we have been working with in this course:
my_project\n|-- src/\n| |-- train.py\n| |-- data.py\n| |-- app.py\n|-- tests/\n| |-- unittests/\n| | |-- test_train.py\n| | |-- test_data.py\n| |-- integrationtests/\n| | |-- test_apis.py\n
"},{"location":"s7_deployment/testing_apis/#exercises","title":"\u2754 Exercises","text":"In these exercises, we are going to assume that we want to test an API written in FastAPI (see this module). If the API is written in a different framework then how to write the tests may have to change.
Start by installing httpx which is the client we are going to use during testing:
pip install httpx\n
Remember to add it to your requirements.txt
file.
If you have already done the module on unittesting then you should already have a tests/
folder. If not then create one. Inside the tests/
folder create a new folder called integrationtests/
. Inside the integrationtests/
folder create a file called test_apis.py
and write the following code:
from fastapi.testclient import TestClient\nfrom app.main import app\nclient = TestClient(app)\n
this code will create a client that can be used to send requests to the API. The app
variable is the FastAPI application that we want to test.
Now, you can write tests that check that the API works as intended, much like you would write unit tests. For example, if you have a root endpoint that just returns a simple welcome message, you could write a test like this:
def test_read_root(model):\n response = client.get(\"/\")\n assert response.status_code == 200\n assert response.json() == {\"message\": \"Welcome to the MNIST model inference API!\"}\n
make sure to always assert
that the status code is what you expect and that the response is what you expect. Add such tests for all the endpoints in your API.
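For endpoints that take input the tests look very similar. As a hedged sketch, a hypothetical POST /predict endpoint taking the four iris features could be tested like this (the endpoint name and payload are assumptions about your API):
def test_predict():\n    payload = {'sepal_length': 1.0, 'sepal_width': 1.0, 'petal_length': 1.0, 'petal_width': 1.0}\n    response = client.post('/predict', json=payload)\n    assert response.status_code == 200\n    assert 'prediction' in response.json()\n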
If you have an application with lifespan events e.g. you have implemented the lifespan
function in your FastAPI application, you need to instead use the TestClient
in a with
statement. This is because the TestClient
will close the connection to the application after the test is done. Here is an example:
def test_read_root(model):\n with TestClient(app) as client:\n response = client.get(\"/\")\n assert response.status_code == 200\n assert response.json() == {\"message\": \"Welcome to the MNIST model inference API!\"}\n
To run the tests, you can use the following command:
pytest tests/integrationtests/test_apis.py\n
Make sure that all your tests pass.
The next type of testing we are going to implement for our application is load testing, which is a kind of performance testing. The goal of load testing is to determine how an application behaves under both normal and peak conditions. The purpose is to identify the maximum operating capacity of an application as well as any bottlenecks and to determine which element is causing degradation.
Before we get started on the exercises, we recommend that you start by defining an environment variable that contains the endpoint of your API; we of course need the API running to be able to test it. To begin with, you can just run the API locally, thus in a terminal window run the following command:
uvicorn app.main:app --reload\n
by default the API will be running on http://localhost:8000
which we can then define as an environment variable:
set MYENDPOINT=http://localhost:8000\n
export MYENDPOINT=http://localhost:8000\n
However, the end goal is to test an API you have deployed in the cloud. If you have used Google Cloud Run to deploy your API then you can get the endpoint by going to the UI and looking at the service details:
The endpoint can be seen in the top center. It always starts with `https://` followed by a random string and then `.a.run.app`However, we can also use the gcloud
command to get the endpoint:
for /f \"delims=\" %i in ^\n('gcloud run services describe <name> --region=<region> --format=\"value(status.url)\"') do set MYENDPOINT=%i\n
export MYENDPOINT=$(gcloud run services describe <name> --region=<region> --format=\"value(status.url)\")\n
where you need to define <name>
and <region>
with the name of your service and the region it is deployed in.
For the exercises, we are going to use the locust framework for load testing (the name is a reference to a locust being a swarm of bugs invading your application). It is a Python framework that allows you to write tests that simulate many users interacting with your application. It is very easy to get started with and it is very easy to integrate with your CI/CD pipeline.
Install locust
pip install locust\n
Remember to add it to your requirements.txt
file.
Make sure you have written an API that you can test. Otherwise, you can for simplicity just use this simple example
Simple hello world FastAPI example
model.pyfrom fastapi import FastAPI\n\napp = FastAPI()\n\n\n@app.get(\"/\")\ndef read_root():\n \"\"\"Root endpoint.\"\"\"\n return {\"Hello\": \"World\"}\n\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n \"\"\"Get an item by id.\"\"\"\n return {\"item_id\": item_id}\n
Add a new folder to your tests/
folder called performancetests
and inside it create a file called locustfile.py
. To that file, you need to add the appropriate code to simulate the users that you want to test. You can read more about how to write a locustfile.py
here.
Here we provide a solution to the above simple example:
locustfile.pyimport random\n\nfrom locust import HttpUser, between, task\n\n\nclass MyUser(HttpUser):\n \"\"\"A simple Locust user class that defines the tasks to be performed by the users.\"\"\"\n\n wait_time = between(1, 2)\n\n @task\n def get_root(self) -> None:\n \"\"\"A task that simulates a user visiting the root URL of the FastAPI app.\"\"\"\n self.client.get(\"/\")\n\n @task(3)\n def get_item(self) -> None:\n \"\"\"A task that simulates a user visiting a random item URL of the FastAPI app.\"\"\"\n item_id = random.randint(1, 10)\n self.client.get(f\"/items/{item_id}\")\n
Then try to run the locust
command:
locust -f tests/performancetests/locustfile.py\n
and then navigate to http://localhost:8089 in your web browser. You should see a page that looks similar to the top of this figure.
Here you can define the number of users you want to simulate and how many users you want to spawn per second. Finally, you can define which endpoint you want to test. When you are ready you can press the Start
button.
Afterward, you should see the results of the test in the web browser. Answer the following questions:
Maybe of more use to us is running locust in the terminal. To do this you can run the following command:
Windows:\nlocust -f tests/performancetests/locustfile.py \\\n    --headless --users 10 --spawn-rate 1 --run-time 1m --host %MYENDPOINT%\n
Mac/Linux:\nlocust -f tests/performancetests/locustfile.py \\\n    --headless --users 10 --spawn-rate 1 --run-time 1m --host $MYENDPOINT\n
this will run the test with 10 users that are spawned at a rate of 1 per second for 1 minute.
(Optional) A good use case for load testing in our case is to test that our API can handle a load right after it has been deployed. To do this we need to add appropriate steps to our CI/CD pipeline. Try adding locust to an existing or new workflow file in your .github/workflows/
folder, such that it runs after the deployment step.
The solution here expects that a service called production-model
has been deployed to Google Cloud Run. Then the following steps can be added to a workflow file, to first authenticate with Google Cloud, extract the relevant URL, and then run the load test:
- name: Auth with GCP\n uses: google-github-actions/auth@v2\n with:\n credentials_json: ${{ secrets.GCP_SA_KEY }}\n\n- name: Set up Cloud SDK\n uses: google-github-actions/setup-gcloud@v2\n\n- name: Extract deployed model URL\n run: |\n DEPLOYED_MODEL_URL=$(gcloud run services describe production-model \\\n --region=europe-west1 \\\n --format='value(status.url)')\n echo \"DEPLOYED_MODEL_URL=$DEPLOYED_MODEL_URL\" >> $GITHUB_ENV\n\n- name: Run load test on deployed model\n env:\n DEPLOYED_MODEL_URL: ${{ env.DEPLOYED_MODEL_URL }}\n run: |\n locust -f tests/performance/locustfile.py \\\n --headless -u 100 -r 10 --run-time 10m --host=$DEPLOYED_MODEL_URL --csv=/locust/results\n\n- name: Upload locust results\n uses: actions/upload-artifact@v4\n with:\n name: locust-results\n path: /locust\n
the results can afterward be downloaded from the artifacts tab in the GitHub UI.
In the locust
framework, what does the @task
decorator do and what does @task(3)
mean?
The @task
decorator is used to define a task that a user can perform. The @task(3)
decorator is used to define a task that a user can perform that is three times more likely to be performed than the other tasks.
In the locust
framework, what does the wait_time
attribute do?
The wait_time
attribute is used to define how long a user should wait between tasks. It can either be a fixed number or a random number between two values.
from locust import HttpUser, task, between, constant\n\nclass MyUser(HttpUser):\n wait_time = between(5, 9)\n # or\n wait_time = constant(5)\n
Load testing can give numbers on average response time, 99th percentile response time, and requests per second. What do these numbers tell us about the user experience of the API?
SolutionThe average response time and the 99th percentile response time are both measures of how "snappy" the API feels to the user. While the average response time is normally considered the most important, the 99th percentile response time is also important, as it tells us whether a small number of users are experiencing very slow response times. The requests per second tells us how many users the API can handle at the same time. If this number is too low it can lead to users experiencing slow response times and may indicate that something is wrong with the API.
Slides
Learn how to detect data drifting using the evidently
framework
M27: Data Drifting
Learn how to setup a prometheus monitoring system for your application
M28: System Monitoring
We have now reached the end of our machine-learning pipeline. We have successfully developed, trained and deployed a machine learning model. However, the question then becomes if you can trust that your newly deployed model still works as expected after 1 day without you intervening? What about 1 month? What about 1 year?
There may be corner cases where an ML model keeps working as expected, but the vast majority of ML models will perform worse over time because they do not generalize well enough. For example, assume you have just deployed an application that classifies images from phones, when suddenly a new phone comes out with a new kind of sensor that takes images with a very weird aspect ratio, or something else your model is not robust towards. There is nothing wrong with this; you can essentially just retrain your model on new data that accounts for this corner case, however, you need a mechanism that informs you when this happens.
This is where monitoring comes into play. Monitoring practices are in charge of collecting information about your application in some format that can then be analyzed and reacted on. Monitoring is essential to securing the longevity of your applications.
As with many other sub-fields within MLOps, we can divide monitoring into classic monitoring and ML-specific monitoring. Classic monitoring (known from classic DevOps) is often about
All of these are basic pieces of information you are interested in regardless of what application type you are trying to deploy. However, there is also machine learning related monitoring, which especially relates to data. Take the example above with the new phone: this is what we would in general consider a data drifting problem, i.e. the data you are trying to do inference on has drifted away from the distribution of data your model was trained on. Such monitoring problems are unique to machine learning applications and need to be handled separately.
We are in this session going to see examples of both kinds of monitoring.
Learning objectives
The learning objectives of this session are:
evidently
frameworkData drifting is one of the core reasons that model accuracy degrades over time in production. For machine learning models, data drift is the change in model input data that leads to model performance degradation. In practical terms, this means that the model is receiving input that is outside of the scope that it was trained on, as seen in the figure below. It shows that the underlying distribution of a particular feature has slowly been increasing in value over two years
Image creditIn some cases, it may be that normalizing some features in a better way allows your model to generalize better, but this is not always the case. The reason for such a drift is commonly some external factor that you essentially have no control over. That really only leaves you with one option: retrain your model on the newly received input features and deploy that model to production. This process is probably going to repeat over the lifetime of your application if you want to keep it up-to-date with the real world.
Image creditWe have now come up with a solution to the data drift problem, but there is one important detail that we have not taken care of: when should we actually trigger the retraining? We do not want to wait around for our model performance to degrade, thus we need tools that can detect when we are seeing a drift in our data.
"},{"location":"s8_monitoring/data_drifting/#exercises","title":"\u2754 Exercises","text":"For these exercises we are going to use the framework Evidently developed by EvidentlyAI. Evidently currently supports both detection for both regression and classification models. The exercises are in large taken from here and in general we recommend if you are in doubt about an exercise to look at the docs for API and examples (their documentation can be a bit lacking sometimes, so you may also have to dive into the source code).
Additionally, we want to stress that data drift detection, concept drift detection etc. is still an active field of research, and therefore multiple frameworks exist for doing this kind of detection. In addition to Evidently, we can also mention NannyML, WhyLogs and deepcheck.
Start by installing Evidently
pip install evidently\n
you will also need scikit-learn
and pandas
installed if you do not already have it.
Hopefully you have already gone through session S7 on deployment. As part of the deployment to GCP functions you should have developed an application that can classify the iris dataset, based on a model trained by this script. We are going to convert this into a FastAPI application for our purpose here:
Convert your GCP function into a FastAPI application. The appropriate curl
command should look something like this:
curl -X 'POST' \\\n 'http://127.0.0.1:8000/iris_v1/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0' \\\n -H 'accept: application/json' \\\n -d ''\n
and the response body should look like this:
{\n \"prediction\": \"Iris-Setosa\",\n \"prediction_int\": 0\n}\n
We have implemented a solution in this file (called v1) if you need help.
Next we are going to add some functionality to our application: we want the user input to be saved to a database whenever our application is called. However, to not slow down the response to our user, we want to implement this as a background task. A background task is a function that is executed after the user has received their response. Implement a background task that saves the user input to a database implemented as a simple .csv
file. You can read more about background tasks here. The header of the database should look something like this:
time, sepal_length, sepal_width, petal_length, petal_width, prediction\n2022-12-28 17:24:34.045649, 1.0, 1.0, 1.0, 1.0, 1\n2022-12-28 17:24:44.026432, 2.0, 2.0, 2.0, 2.0, 1\n...\n
thus both input, timestamp and predicted value should be saved. We have implemented a solution in this file (called v2) if you need help.
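If you want a starting point, a minimal hedged sketch of such a background task (the endpoint name and the dummy prediction are placeholders, not the course's reference solution) could look like this:
import csv\nfrom datetime import datetime\n\nfrom fastapi import BackgroundTasks, FastAPI\n\napp = FastAPI()\n\n\ndef save_to_database(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, prediction: int) -> None:\n    # append a single row to the csv 'database'\n    with open('prediction_database.csv', 'a', newline='') as file:\n        csv.writer(file).writerow([datetime.now(), sepal_length, sepal_width, petal_length, petal_width, prediction])\n\n\n@app.post('/iris_v2/')\ndef iris_inference(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, background_tasks: BackgroundTasks):\n    prediction = 0  # placeholder, replace with an actual model prediction\n    # the task is executed after the response has been sent to the user\n    background_tasks.add_task(save_to_database, sepal_length, sepal_width, petal_length, petal_width, prediction)\n    return {'prediction': 'Iris-Setosa', 'prediction_int': prediction}\n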
Call your API a number of times to generate some dummy data in the database.
Create a new data_drift.py
file where we are going to implement the data drifting detection and reporting. Start by adding both the real iris data and your generated dummy data as pandas dataframes.
import pandas as pd\nfrom sklearn import datasets\nreference_data = datasets.load_iris(as_frame=True).frame\ncurrent_data = pd.read_csv('prediction_database.csv')\n
if done correctly you will most likely end up with two dataframes that look like
# reference_data\nsepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target\n0 5.1 3.5 1.4 0.2 0\n1 4.9 3.0 1.4 0.2 0\n...\n148 6.2 3.4 5.4 2.3 2\n149 5.9 3.0 5.1 1.8 2\n[150 rows x 5 columns]\n\n# current_data\ntime sepal_length sepal_width petal_length petal_width prediction\n2022-12-28 17:24:34.045649 1.0 1.0 1.0 1.0 1\n...\n2022-12-28 17:24:34.045649 1.0 1.0 1.0 1.0 1\n[10 rows x 5 columns]\n
Standardize the dataframes such that they have the same column names and drop the time column from the current_data
dataframe.
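A hedged sketch of the standardization (assuming the column order from the database header shown earlier) could be:
# drop the timestamp and rename the columns so they match the reference data\ncurrent_data = current_data.drop(columns=['time'])\ncurrent_data.columns = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'target']\n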
We are now ready to generate some reports about data drifting:
Try executing the following code:
from evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference_data, current_data=current_data)\nreport.save_html('report.html')\n
and open the generated .html
page. What does it say about your data? Has it drifted? Make sure to poke around to understand what the different plots are actually showing.
Data drifting is not the only kind of reporting Evidently can make. We can also get reports on the data quality. Try first adding a few NaN
values to your reference data. Secondly, try changing the report to
from evidently.metric_preset import DataDriftPreset, DataQualityPreset\nreport = Report(metrics=[DataDriftPreset(), DataQualityPreset()])\n
and re-run the report. Check out the newly generated report. Again, go over the generated plots and make sure that it picked up on the missing values you just added.
The final report preset we will look at is the TargetDriftPreset
. Target drift means that our model is over/under predicting certain classes, or in general terms that the distribution of predicted values differs from the true distribution of the targets. Try adding the TargetDriftPreset
to the Report
class and re-run the analysis and inspect the result. Have your targets drifted?
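As a hedged sketch (reusing the dataframes from before, and assuming the prediction column in the current data has been renamed to target as part of the standardization), the report could be extended like this:
from evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset, DataQualityPreset, TargetDriftPreset\n\nreport = Report(metrics=[DataDriftPreset(), DataQualityPreset(), TargetDriftPreset()])\nreport.run(reference_data=reference_data, current_data=current_data)\nreport.save_html('report.html')\n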
Evidently reports are meant for debugging, exploration and reporting of results. However, as we stated in the beginning, what we are actually interested in is methods for automatically detecting when we are beginning to drift. For this we will need to look at Tests and TestSuites:
Let's start with a simple test that checks if there are any missing values in our dataset:
from evidently.test_suite import TestSuite\nfrom evidently.tests import TestNumberOfMissingValues\ndata_test = TestSuite(tests=[TestNumberOfMissingValues()])\ndata_test.run(reference_data=reference_data, current_data=current_data)\n
again we could run data_test.save_html
to get a nice view of the results (feel free to try it out) but additionally we can also call data_test.as_dict()
method that will give a dict with the test results. What dictionary key contains the information about whether all tests have passed or not?
Take a look at this colab notebook that contains all tests implemented in Evidently. Pick 5 tests of your choice, where at least 1 fails by default and implement them as a TestSuite
. Then try changing the arguments of the test so they better fit your usecase and get them all passing.
(Optional) When doing monitoring in practice, we are not always interested in running on all the data collected from our API, maybe only the last N
entries or maybe just the observations from the last hour. Since we are already logging the timestamps of when our API is called, we can use those for filtering. Implement a simple filter that either takes an integer n
and returns the last n
entries in our database or some datetime t
that filters away observations earlier than this.
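A minimal sketch of such a filter, assuming the database is loaded as a pandas dataframe with a time column as in the earlier snippet:
import pandas as pd\n\ndef filter_entries(df: pd.DataFrame, n: int | None = None, t: pd.Timestamp | None = None) -> pd.DataFrame:\n    # return the last n rows, or only rows with a timestamp newer than t\n    if n is not None:\n        return df.tail(n)\n    if t is not None:\n        return df[pd.to_datetime(df['time']) > t]\n    return df\n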
Evidently by default only supports structured data, i.e. tabular data (so does nearly every other framework). Thus, the question becomes how we can extend the analysis to unstructured data such as images or text. The solution is to extract structured features from the data, which we can then run the analysis on.
(Optional) For images the simple solution would be to flatten the images and consider each pixel a feature, however this does not work in practice because changes in individual pixels do not really tell us anything about the image. Instead we should derive some features such as:
These are all numbers that can make up a feature vector for an image. Try doing this yourself, for example by extracting such features from the MNIST and FashionMNIST datasets, and check if you can detect a drift between the two sets.
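As a hedged sketch of what such a feature extractor could look like (the exact features and the use of torchvision's MNIST are our own assumptions):
import numpy as np\nfrom torchvision import datasets, transforms\n\ndef image_features(img) -> list:\n    # compute a few simple structured features for a single grayscale image tensor\n    x = img.numpy().squeeze()\n    brightness = float(x.mean())                           # average pixel intensity\n    contrast = float(x.std())                              # spread of intensities\n    sharpness = float(np.abs(np.diff(x, axis=0)).mean())   # crude edge-strength proxy\n    return [brightness, contrast, sharpness]\n\nmnist = datasets.MNIST('data', download=True, transform=transforms.ToTensor())\nfeatures = np.array([image_features(mnist[i][0]) for i in range(1000)])\nprint(features.shape)  # (1000, 3) -> one structured feature vector per image\n
The same three columns computed for FashionMNIST can then be compared against these in an Evidently report.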
(Optional) For text a common approach is to extract some higher-level embedding such as the classic GloVe embedding. Try following this tutorial to understand how drift detection is done on text.
Let's instead take a deep-learning-based approach to doing this. Let's consider the CLIP model, which is normally used to connect images and text. For our purpose this is perfect because we can use the model to get abstract feature embeddings for both images and text:
from PIL import Image\nimport requests\n# requires transformers package: pip install transformers\nfrom transformers import CLIPProcessor, CLIPModel\n\nmodel = CLIPModel.from_pretrained(\"openai/clip-vit-base-patch32\")\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n\nurl = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# set either text=None or images=None when only the other is needed\ninputs = processor(text=[\"a photo of a cat\", \"a photo of a dog\"], images=image, return_tensors=\"pt\", padding=True)\n\nimg_features = model.get_image_features(inputs['pixel_values'])\ntext_features = model.get_text_features(inputs['input_ids'], inputs['attention_mask'])\n
Both img_features
and text_features
are in this case a (512,)
abstract feature embedding that should be able to tell us something about our data distribution. Try using this method to extract features on two different datasets, like CIFAR10 and SVHN if you want to work with vision, or IMDB movie reviews and Amazon reviews for text. After extracting the features, try running some of the data distribution testing you just learned about.
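A sketch of how the extracted embeddings could be fed back into Evidently; here reference_features and current_features are assumed to be (N, 512) numpy arrays obtained with the CLIP code above:
import pandas as pd\nfrom evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\n\ncolumns = [f'feature_{i}' for i in range(reference_features.shape[1])]\nreference_df = pd.DataFrame(reference_features, columns=columns)\ncurrent_df = pd.DataFrame(current_features, columns=columns)\n\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference_df, current_data=current_df)\nreport.save_html('clip_drift_report.html')\n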
(Optional) If we have multiple applications and want to run monitoring for each application, we often also want the monitoring itself to be a deployed application (that only we can access). Implement a /monitoring/
endpoint that does all the reporting we just went through such that you have two endpoints:
http://127.0.0.1:8000/iris_infer/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0 # user endpoint\nhttp://127.0.0.1:8000/iris_monitoring/ # monitoring endpoint\n
Our monitoring endpoint should return an HTML page showing either an Evidently report or test suite. Try implementing this endpoint. We have implemented a solution in this file if you need help with how to return an HTML page from a FastAPI application.
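If you want a starting point, here is a minimal sketch (not the reference solution) of such an endpoint, assuming the prediction database file from the earlier exercises:
import pandas as pd\nfrom fastapi import FastAPI\nfrom fastapi.responses import HTMLResponse\nfrom sklearn import datasets\nfrom evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\n\napp = FastAPI()\n\n@app.get('/iris_monitoring/', response_class=HTMLResponse)\nasync def iris_monitoring() -> HTMLResponse:\n    # build the report on the fly and return the generated HTML\n    reference_data = datasets.load_iris(as_frame=True).frame\n    current_data = pd.read_csv('prediction_database.csv')  # assumed path from the earlier exercises\n    current_data = current_data.drop(columns=['time'])\n    current_data.columns = reference_data.columns  # align columns, as in the earlier exercise\n    report = Report(metrics=[DataDriftPreset()])\n    report.run(reference_data=reference_data, current_data=current_data)\n    report.save_html('monitoring.html')\n    with open('monitoring.html', 'r', encoding='utf-8') as f:\n        return HTMLResponse(content=f.read())\n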
As a final exercise, we recommend that you try implementing this to run directly in the cloud. You will need to implement this in a container, e.g. as a GCP Cloud Run service, because the data gathering from the endpoint should still be implemented as a background task. For this to work you will need to change the following:
Instead of saving the input to a local file you should store it either in a GCP bucket or in a BigQuery table (the latter is the better solution, but also out of scope for this course)
You can either run the data analysis locally by pulling the predictions and training data from cloud storage, or alternatively you can deploy the analysis as its own endpoint that can be invoked. For the latter option we recommend requiring authentication.
That ends the module on detecting data drifting, data quality issues etc. If it has not already been made clear, monitoring of machine learning applications is an extremely hard discipline because it is not clear-cut when we should actually respond to a feature beginning to drift and when it is probably fine. Which rules should be implemented comes down to the individual application. Additionally, the tools presented here are in no way complete and are especially limited in one way: they only consider the marginal distribution of the data. Every analysis we have done has been on the distribution per feature (the marginal distribution), however as the image below shows it is possible for data to have drifted to another distribution while the marginals stay approximately the same.
There are methods such as Maximum Mean Discrepancy (MMD) tests that are able to do testing on multivariate distributions, which you are free to dive into. In this course we will just always recommend considering multiple features when making decisions regarding your deployed applications.
"},{"location":"s8_monitoring/monitoring/","title":"M28 - System Monitoring","text":""},{"location":"s8_monitoring/monitoring/#monitoring","title":"Monitoring","text":"In this module we are going to look into more classical monitoring of applications. The key concept we are often working with here is called telemetry. Telemetry in general refer to any automatic measurement and wireless transmission of data from our application. It could be numbers such as:
In general there are three different kinds of telemetry we are interested in:
Metrics: quantitative measurements of the system, usually numbers aggregated over a period of time (example: the number of requests per minute). Metrics are used to get an overview of the system and are often used to create dashboards.
Logs: textual or structured records generated by applications, providing a detailed account of events, errors, warnings and informational messages that occur during the operation of the system (examples: system logs, error logs). Logs are essential for diagnosing issues, debugging and auditing, since they provide a detailed history of what happened in a system, making it easier to trace the root cause of problems and track the behavior of components over time.
Traces: detailed records of specific transactions or events as they move through a system, typically including information about the sequence of operations, timing and dependencies between different components (example: distributed tracing in a microservices architecture). Traces help in understanding the flow of a request or transaction across different components and are valuable for identifying bottlenecks, understanding latency and troubleshooting issues related to the flow of data or control.
In this module we are mainly going to focus on metrics.
"},{"location":"s8_monitoring/monitoring/#local-instrumentator","title":"Local instrumentator","text":"Before we look into the cloud lets at least conceptually understand how a given instance of a app can expose values that we may be interested in monitoring.
The standard framework for exposing metrics is called Prometheus. Prometheus is a time-series database that is designed to store metrics, to be very easy to instrument applications with, and to scale to large amounts of data. The way Prometheus works is that it exposes a /metrics
endpoint that can be queried to get the current state of the metrics. The metrics are exposed in a format called prometheus text format.
Start by installing prometheus-fastapi-instrumentator
in python
pip install prometheus-fastapi-instrumentator\n
this will allow us to easily instrument our FastAPI application with prometheus.
Create a simple FastAPI application in a file called app.py
. You can reuse any application from the previous module on APIs. To that file now add the following code:
from prometheus_fastapi_instrumentator import Instrumentator\n\n# your app code here\n\nInstrumentator().instrument(app).expose(app)\n
This will instrument your application with prometheus and expose the metrics on the /metrics
endpoint.
Run the app using uvicorn
server. Make sure that the app exposes the endpoints you expect it to expose, but make sure you also check out the /metrics
endpoint.
The /metrics endpoint exposes multiple metrics. A metric
always looks like this:
# TYPE key <type>\nkey value\n
i.e. it is essentially a dictionary of key-value pairs with the added functionality of a <type>
. Look at this page describing the different types Prometheus metrics can have and try to understand the different metrics being exposed.
Look at the documentation for the prometheus-fastapi-instrumentator
and try to add at least one more metric to your application. Rerun the application and confirm that the new metric is being exposed.
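One way to do this (an assumption on our part, not the only option the instrumentator offers) is to define a custom metric with the underlying prometheus_client package, which is exposed through the same /metrics endpoint:
from fastapi import FastAPI\nfrom prometheus_client import Counter\nfrom prometheus_fastapi_instrumentator import Instrumentator\n\napp = FastAPI()\n\n# hypothetical custom counter tracking how many predictions we have served\nPREDICTION_COUNTER = Counter('model_predictions_total', 'Number of predictions served')\n\n@app.get('/predict')\nasync def predict() -> dict:\n    PREDICTION_COUNTER.inc()  # increment the counter on every request\n    return {'prediction': 0}\n\nInstrumentator().instrument(app).expose(app)\n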
Any self-respecting cloud system will have some kind of monitoring system. GCP has a service called Monitoring that is designed to monitor all the different services. By default it will monitor a lot of metrics out of the box. However, the question is if we want to monitor more than the default metrics. The complexity that comes with doing monitoring in the cloud is that we need more than one container. We at least need one container actually running the application that is also exposing the /metrics
endpoint, and then we need another container that collects the metrics from the first container and stores them in a database. To implement such a system of containers that need to talk to each other we in general need to use a container orchestration system such as Kubernetes. This is out of scope for this course, but we can use a feature of Cloud Run
called sidecar containers
to achieve the same effect. A sidecar container is a container that is running alongside the main container and can be used to do things such as collecting metrics.
Overall we recommend that you just become familiar with the monitoring tab for your Cloud Run service (see the image above). Try to invoke your service a couple of times and see what happens to the metrics over time.
Try creating a service level objective (SLO). In short, an SLO is a target for how well your application should be performing. Click the Create SLO
button and fill it out with what you consider to be a good SLO for your application.
(Optional) To expose our own metrics we need to set up a sidecar container. To do this follow the instructions here. We have set up a simple example that uses FastAPI and Prometheus that you can find here. After you have correctly set up the sidecar container you should be able to see the metrics in the monitoring tab.
A core problem within monitoring is alert systems. The alert system is in charge of sending out alerts to relevant people when some telemetry or metric we are tracking is not behaving as it should. When and how many alerts to send out is a subjective choice and should in general be proportional to how important the metric/telemetry is. We commonly run into what is referred to as the Goldilocks problem: we want just the right amount of alerts, however it is more often the case that we either have
Therefore, setting up proper alert systems can be as challenging as setting up the systems that collect the metrics we want to trigger alerts on.
"},{"location":"s8_monitoring/monitoring/#exercises_2","title":"\u2754 Exercises","text":"We are in this exercise going to look at how we can setup automatic alerting such that we get an message every time one of our applications are not behaving as expected.
Go to the Monitoring
service. Then go to Alerting
tab.
Start by setting up a notification channel. We recommend setting it up with an email.
Next, let's create a policy. Clicking the Add Condition
should bring up a window as shown below. You are free to set up the condition as you want, but the image shows one way to set up an alert that will react to the number of times a cloud function is invoked (actually it measures the number of log entries from cloud functions).
After adding the condition, add the notification channel you created in one of the earlier steps. Remember to also add some documentation that should be sent with the alert to better describe what the alert is actually doing.
When the alert is set up you need to trigger it. If you set up the condition as in the image above you just need to invoke the cloud function many times. Here is a small code snippet that you can execute on your laptop to call a cloud function many times (you need to change the url and payload depending on your function):
import time\nimport requests\nurl = 'https://us-central1-dtumlops-335110.cloudfunctions.net/function-2'\npayload = {'message': 'Hello, General Kenobi'}\n\nfor _ in range(1000):\n r = requests.get(url, params=payload)\n
Make sure that you get the alert through the notification channel you set up.
Slides
Learn how to setup distributed data loading in your PyTorch application
M29: Distributed Data Loading
Learn how to do distributed training in PyTorch using pytorch-lightning
M30: Distributed Training
Learn how to do scalable inference in PyTorch
M31: Scalable Inference
This module is all about scaling the applications that we are building. We are here going to use a very narrow definition of scaling namely that we want our applications to run faster, however, one should note that in general scaling is a much broader term. There are many different ways to scale your applications and we are going to look at three of these related to different tasks in machine learning algorithms:
We are going to approach the term scaling from two different angles and both should result in your application running faster. The first approach is leveraging multiple devices, such as using multiple CPU cores or parallelizing training across multiple GPUs. The second approach is more analytical, where we are going to look at how we can design smaller/faster model architectures.
It should be noted that this module is specific to working with PyTorch applications. In particular, we are going to see how we can both improve base PyTorch code and how to utilize PyTorch Lightning, which we introduced in module M14 on boilerplate, to improve the scaling of our applications. If your application is written using another framework, the same techniques should in most cases transfer, but it may require you to find out how to apply them specifically in that framework.
If you manage to complete all modules in this session, feel free to check out the extra module on scalable hyperparameter optimization.
Learning objectives
The learning objectives of this session are:
pytorch-lightning
Core Module
One way that deep learning fundamentally changed the way we think about data in machine learning is that more data is always better. This was very much not the case with more traditional machine learning algorithms (random forest, support vector machines etc.) where a plateau in performance was often reached for a certain amount of data and did not improve if more was added. However, as deep learning models have become deeper and deeper, and thereby more and more data-hungry, performance seems to be ever-increasing, or at least not reaching a plateau in the same way as for traditional machine learning.
Image credit As we are trying to feed more and more data into our models, the obvious first question to ask is how to do this efficiently. As a general rule of thumb, we want the performance bottleneck to be the forward/backward pass, i.e. the actual computation in our neural network, and not the data loading. By bottleneck, we here refer to the part of our pipeline that is restricting how fast we can process data. If data loading is our bottleneck, then our compute device can sit idle while waiting for data to arrive, which is both inefficient and costly. For example, if you are using a cloud provider for training deep learning models, you are paying by the hour per device, and thus not using them fully can be costly in the long run.
In the first set of exercises, we are therefore going to focus on distributed data loading i.e. how to load data in parallel to make sure that we always have data ready for our compute devices. We are in the following going to look at what is going on behind the scenes when we use PyTorch to parallelize data loading.
"},{"location":"s9_scalable_applications/data_loading/#a-closer-look-at-data-loading","title":"A closer look at Data loading","text":"Before we talk distributed applications it is important to understand the physical layout of a standard CPU (the brain of your computer).
A modern CPU is a single chip that consists of multiple cores. Each core can further be divided into threads. In most laptops, the core count is 4, commonly with 2 threads per core. This means that the common laptop has 8 threads. The number of threads a compute unit has is important because that directly corresponds to the number of parallel operations that can be executed, i.e. one per thread. In a Python terminal you should be able to get the number of cores in your machine by writing (try it):
import multiprocessing\ncores = multiprocessing.cpu_count()\nprint(f\"Number of cores: {cores}, Number of threads: {2*cores}\")\n
A distributed application is in general any kind of application that parallelizes some or all of its workload. We are in these exercises only focusing on distributed data loading, which happens primarily only on the CPU. In PyTorch
it is easy to parallelize data loading if you are using their dataset/data loader interface:
from torch.utils.data import Dataset, DataLoader\nclass MyDataset(Dataset):\n def __init__(self, ...):\n # whatever logic is needed to init the data set\n self.data = ...\n\n def __getitem__(self, idx):\n # return one item\n return self.data[idx]\n\ndataset = MyDataset()\ndataloader = DataLoader(\n dataset,\n batch_size=8,\n num_workers=4 # this is the number of workers we want to parallelize the workload over\n)\n
Let's take a deep dive into what happens when we request a batch from our dataloader e.g. next(dataloader)
. First, we must understand that we have a thread that plays the role of the main and the remaining threads (in the above example we request 4) are called workers. When the dataloader is created, we create this structure and make sure that all threads have a copy of our dataset definition so each can call the __getitem__
method.
Then comes the actual part where we request a batch of data. Assume that we have a batch size of 8 and we do not do any shuffling. In this step, the master thread then distributes the list of requested data points ([0,1,2,3,4,5,6,7]
) to the four worker threads. With 8 indices and 4 workers, each worker will receive 2 indices.
Each worker thread then calls the __getitem__
method for all the indices it has received. When all workers are done, the loaded data points get sent back to the master thread and collected into a single structure/tensor.
Each arrow corresponds to a communication between two threads, which is not a free operation. In total, to get a single batch (not counting the initial startup cost) in this example we need to do 8 communication operations. This may seem like a small price to pay, but that may not be the case. If the processing time of __getitem__
is very low (e.g. the data is stored in memory and we just need to index into it) then it does not make sense to use multiprocessing: the computational savings from doing the look-up operations in parallel are smaller than the communication cost between the main thread and the workers. Multiprocessing makes sense when the processing time of __getitem__
is high (data is probably stored on the hard drive).
It is this trade-off that we are going to investigate in the exercises.
"},{"location":"s9_scalable_applications/data_loading/#exercises","title":"\u2754 Exercises","text":"This exercise is intended to be done on the labeled faces in the wild (LFW) dataset. The dataset consists of images of famous people extracted from the internet. The dataset had been used to drive the field of facial verification, which you can read more about here. We are going to imagine that this dataset cannot fit in memory, and your job is therefore to construct a data pipeline that can be parallelized based on loading the raw data files (.jpg) at runtime.
Download the dataset and extract it to a folder. It does not matter if you choose the non-aligned or aligned version of the dataset.
We provide the lfw_dataset.py
file where we have started the process of defining a data class. Fill out the __init__
, __len__
and __getitem__
. Note that __getitem__
expects that you return a single img
which should be a torch.Tensor
. Loading should be done using PIL Image, as PIL
images are the default input format for torchvision transforms (for data augmentation).
Make sure that the script runs without any additional arguments
python lfw_dataset.py\n
Visualize a single batch by filling out the codeblock after the first TODO right after defining the dataloader. The visualization should show when launching the script as
python lfw_dataset.py -visualize_batch\n
Hint: this tutorial.
Experiment with how the number of workers influences performance. We have already provided code that will pass over 100 batches from the dataset 5 times and calculate how long it took, which you can play around with by calling
python lfw_dataset.py -get_timing -num_workers 1\n
Make an errorbar plot with the number of workers along the x-axis and the timing along the y-axis. The errorbars should correspond to the standard deviation over the 5 runs. HINT: if it is taking too long to evaluate, measure the time over fewer batches (set the -batches_to_check
flag). Also if you are not seeing an improvement, try increasing the batch size (since data loading is parallelized per batch).
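A minimal matplotlib sketch for the plot, assuming you have collected the mean and standard deviation of the timings yourself (the numbers below are placeholders, not real measurements):
import matplotlib.pyplot as plt\n\nnum_workers = [1, 2, 4, 8]\nmean_times = [12.1, 7.3, 4.8, 4.5]  # placeholder values, replace with your own measurements\nstd_times = [0.4, 0.3, 0.2, 0.3]    # placeholder values\n\nplt.errorbar(num_workers, mean_times, yerr=std_times, fmt='-o', capsize=3)\nplt.xlabel('Number of workers')\nplt.ylabel('Time to load 100 batches (s)')\nplt.savefig('worker_timing.png')\n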
For certain machines like the Mac with M1 chipset it is necessary to set the multiprocessing_context
flag in the dataloader to \"fork\"
. This essentially tells the dataloader how the worker nodes should be created.
Retry the experiment where you change the data augmentation to be more complex:
lfw_trans = transforms.Compose([\n transforms.RandomAffine(5, (0.1, 0.1), (0.5, 2.0)),\n # add more transforms here\n transforms.ToTensor()\n])\n
by making the augmentation more computationally demanding, it should be easier to get a boost in performance when using multiple workers because the data augmentation is also executed in parallel.
(Optional, requires access to a GPU) If your data needs to be transferred to GPU memory, it can be beneficial to set the pin_memory
flag to True
. By setting this flag we are essentially telling PyTorch that it can lock the data in place in (host) memory, which will make the transfer between the host (CPU) and the device (GPU) faster.
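Building on the dataloader from earlier, this is simply (a sketch):
dataloader = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=True)\n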
This ends the module on distributed data loading in PyTorch. If you want to go into more details we highly recommend that you read this paper that goes into great detail on analyzing how data loading in PyTorch works and performance benchmarks.
"},{"location":"s9_scalable_applications/distributed_training/","title":"M30 - Distributed Training","text":""},{"location":"s9_scalable_applications/distributed_training/#distributed-training","title":"Distributed Training","text":"In this module we are going to look at distributed training. Distributed training is one of the key ingredients to all the awesome results that deep learning models are producing. For example: Alphafold the highly praised model from Deepmind that seems to have solved protein structure prediction, was trained in a distributed fashion for a few weeks. The training was done on 16 TPUv3s (specialized hardware), which is approximately equal to 100-200 modern GPUs. This means that training Alphafold without distributed training on a single GPU (probably not even possible) would take a couple of years to train! Therefore, it is simply impossible currently to train some of the state-of-the-art (SOTA) models within deep learning currently, without taking advantage of distributed training.
When we talk about distributed training, there are a number of different paradigms that we may use to parallelize our computations
In this module we are going to look at data parallel training, which is the original way of doing parallel training and distributed data parallel training which is an improved version of data parallel. If you want to know more about sharded training which is the newest of the paradigms you can read more about it in this blog post, which describes how sharded can save over 60% of memory used during your training.
Finally, we want to note that for all the exercises in the module you are going to need a multi GPU setup. If you have not already gained access to multi GPU machines on GCP (see the quotas exercises in this module) you will need to find another way of running the exercises. For DTU Students I can recommend checking out this optional module on using the high performance cluster (HPC) where you can get access to multi GPU resources.
"},{"location":"s9_scalable_applications/distributed_training/#data-parallel","title":"Data parallel","text":"While data parallel today in general is seen as obsolete compared to distributed data parallel, we are still going to investigate it a bit since it offers the most simple form of distributed computations in deep learning pipeline.
The figure below shows both the forward and backward steps in the data parallel paradigm
The steps are the following:
Whenever we try to do forward call e.g. out=model(batch)
we take the batch and divide it equally between all devices. If we have a batch size of N
and M
devices each device will be sent N/M
datapoints.
Afterwards each device receives a copy of the model
e.g. a copy of the weights that currently parametrizes our neural network.
In this step we perform the actual forward pass in parallel. This is the actual steps that can help us scale our training.
Finally we need to send back the output of each replicated model to the primary device.
Similar to the analysis we did of parallel data loading, we cannot always expect that this will actually take less time than doing the forward call on a single GPU. If we are parallelizing over M
devices, we essentially need to do 3xM
communication calls to send batch, model and output between the devices. If the parallel forward call does not outweigh this, then it will take longer.
In addition, we also have the backward path to focus on
As the end of the forward pass collected the output on the primary device, this is also where the loss is accumulated. Thus, loss gradients are first calculated on the primary device
Next we scatter the gradient to all the workers
The workers then perform a parallel backward pass through their individual model
Finally, we reduce (sum) the gradients from all the workers on the main process such that we can do gradient descent.
One of the big downsides of using data parallel is that all the replicas are destroyed after each backward call. This means that we over and over again need to replicate our model and send it to the devices that are part of the computations.
Even though it seems like a lot of logic goes into implementing data parallel in your code, in PyTorch we can very simply enable data parallel training by wrapping our model in the nn.DataParallel class.
from torch import nn\nmodel = MyModelClass()\nmodel = nn.DataParallel(model, device_ids=[0, 1]) # data parallel on gpu 0 and 1\npreds = model(input) # same as usual\n
"},{"location":"s9_scalable_applications/distributed_training/#exercises","title":"\u2754 Exercises","text":"Please note that the exercise only makes sense if you have access to multiple GPUs.
Create a new script (call it data_parallel.py
) where you take a copy of the model FashionCNN
from the fashion_mnist.py
script. Instantiate the model and wrap torch.nn.DataParallel
around it such that it can be executed in data parallel.
Try to run inference in parallel on multiple devices (pass a batch multiple times and time it) e.g.
import time\nstart = time.time()\nfor _ in range(n_reps):\n out = model(batch)\nend = time.time()\n
Does data parallel decrease the inference time? If no, can you explain why that may be? Try playing around with the batch size, and see if data parallel is more beneficial for larger batch sizes.
It should be clear that there is a huge disadvantage of using the data parallel paradigm to scale your applications: the model needs to be replicated on each pass (because it is destroyed at the end), which requires a large transfer of data. This is the main problem that distributed data parallel tries to solve.
The key difference between distributed data parallel and data parallel is that we move the model update (the gradient step) to happen on each device in parallel instead of only on the main device. This has the consequence that we do not need to replicate the model on each step; instead we just keep a local version on each device that we keep updating. The full set of steps (as shown in the figure):
Initialize an exact copy of the model on each device
From disk (or memory) we start by loading data into a section of page-locked host memory per device. Page-locked memory is essentially a way to reserve a piece of a computer's memory for a specific transfer that is going to happen over and over again, to speed it up. The page-locked regions are loaded with non-overlapping data.
Transfer data from page-locked memory to each device in parallel
Perform forward pass in parallel
Do an all-reduce operation on the gradients. An all-reduce operation is a so-called all-to-all operation, meaning that all processes send their own gradient to all other processes and also receive the gradients from all other processes.
Reduce the combined gradient signal from all processes and update the individual model in parallel. Since all processes received the same gradient information, all models will still be in sync.
Thus, in distributed data parallel we end up doing only a single communication call between all processes, compared to all the communication going on in data parallel. While all-reduce is a more expensive operation than many of the other communication operations we can do, because we only have to do a single one we gain a huge performance boost. Empirically, distributed data parallel tends to be 2-3 times faster than data parallel.
However, this performance increase does not come for free. Where we could implement data parallel in a single line in PyTorch, distributed data parallel is much more involved.
"},{"location":"s9_scalable_applications/distributed_training/#exercises_1","title":"\u2754 Exercises","text":"We have provided an example of how to do distributed data parallel training in PyTorch in the two files distributed_example.py
and distributed_example.sh
. Your objective is to get an understanding of the necessary components in the script to get this kind of distributed training to work. Try to answer the following questions (HINT: try to Google around):
What is the function of the DDP
wrapper?
What is the function of the DistributedSampler
?
Why is it necessary to call dist.barrier()
before passing a batch into the model?
What do the different environment variables in the .sh
file do?
Try to benchmark the runs using 1 and 2 GPUs
The first exercise has hopefully convinced you that it can be quite the trouble to write distributed training applications yourself. Luckily for us, PyTorch-lightning
can take care of this for us such that we do not have to care about the specific details. To get your model training on multiple GPUs you need to change two arguments in the trainer: the accelerator
flag and the gpus
flag. In addition to this, you can read through this guide about any additional steps you may need to do (for many of you, it should just work). Try running your model on multiple GPUs.
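The exact arguments depend on your PyTorch Lightning version; a sketch for a recent version (our assumption) could look like this:
from pytorch_lightning import Trainer\n\n# newer versions use accelerator/devices (older versions used the gpus flag instead)\ntrainer = Trainer(accelerator='gpu', devices=2, strategy='ddp', max_epochs=5)\ntrainer.fit(model)  # model is your LightningModule\n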
Try benchmarking your training using 1 and 2 GPUs, e.g. try running a couple of epochs and measure how long it takes. How much of a speedup can you actually get? Why can you not get a speedup of 2?
Inference is the task of applying our trained model to some new and unseen data, often called prediction. Thus, scaling inference is different from scaling data loading and training, mainly due to inference normally only using a single data point (or a few). As we can neither parallelize the data loading nor parallelize using multiple GPUs (at least not in any efficient way), this is of no use to us when we are doing inference. Additionally, performing inference is often not something we do on machines that can perform large computations, as most inference today is actually either done on edge devices e.g. mobile phones or in low-cost-low-compute cloud environments. Thus, we need to be smarter about how we scale inference than just throwing more computing power at it.
In this module, we are going to look at various ways that you can either reduce the size of your model or make your model faster. Both are important for running inference fast regardless of the setup you are running your model on. We want to note that this is still very much an active area of research and therefore best practices for what to do in a specific situation can change.
"},{"location":"s9_scalable_applications/inference/#choosing-the-right-architecture","title":"Choosing the right architecture","text":"Assume you are starting a completely new project and have to come up with a model architecture for doing this. What is your strategy? The common way to do this is to look at prior work on similar problems that you are facing and either directly choose the same architecture or create some slight variation hereof. This is a great way to get started, but the architecture that you end up choosing may be optimal in terms of performance but not inference speed.
The fact is that not all base architectures are created equal, and a 10K parameter model with one architecture can have a significantly different inference speed than another 10K parameter model with another architecture. For example, consider the figure below which compares an number of models from the timm package, colored based on their base architecture. The general trend is that the number of images that can be processed by a model per sec (y-axis) is inversely proportional to the number of parameters (x-axis). However, we in general see that convolutional base architectures (conv) are more efficient than transformer (vit) for the same parameter budget.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises","title":"\u2754 Exercises","text":"As discussed in this blogpost the largest increase in inference speed you will see (given some specific hardware) is choosing an efficient model architecture. In the exercises below we are going to investigate the inference speed of different architectures.
Start by checking out this table which contains a list of pretrained weights in torchvision
. Try finding an
model that has in the range of 20-30 mio parameters.
Write a small script that first initializes all models, creates a dummy input tensor of shape [100, 3, 256, 256] and then measures the time it takes to do a forward pass on the input tensor. Make sure to do this multiple times to get a good average time.
SolutionIn this solution, we have chosen to use the efficientnet b5 (30.4M parameters), resnet50 (25.6M parameters) and the swin v2 transformer tiny (28.4M parameters) models.
import time\nimport torch\nfrom torchvision import models\n\nmodel_list = [\"efficientnet_b5\", \"resnet50\", \"swin_v2_t\"]\nimage = torch.randn(100, 3, 256, 256)\n\nn_reps = 10\nfor i, m in enumerate(model_list):\n model = models.get_model(m)\n tic = time.time()\n for _ in range(n_reps):\n _ = model(image)\n toc = time.time()\n print(f\"Model {i} took: {(toc - tic) / n_reps}\")\n
Does the results make sense? Based on the above figure we would expect that efficientnet is faster than resnet, which is faster than the transformer based model. Is this also what you are seeing?
To figure out why one net is more efficient than another we can try to count the operations each network need to do for inference. A operation here we can define as a FLOP (floating point operation) which is any mathematical operation (such as +, -, *, /) or assignment that involves floating-point numbers. Luckily for us someone has already created a python package for calculating this in pytorch: ptflops
Install the package
pip install ptflops\n
Try calling the get_model_complexity_info
function from the ptflops
package on the networks from the previous exercise. What are the results?
from ptflops import get_model_complexity_info\nimport time\nimport torch\nfrom torchvision import models\n\nmodel_list = [\"efficientnet_b5\", \"resnet50\", \"swin_v2_t\"]\nfor model in model_list:\n macs, params = get_model_complexity_info(\n models.get_model(model_list[0]), (3, 256, 256), backend='pytorch', print_per_layer_stat=False\n )\n print(f\"Model {model} have {params} parameters and uses {macs}\")\n
In the table from the initial exercise, you could also see the overall performance of each network on the Imagenet-1K dataset. Given this performance, the inference speed, the flops count what network would you choose to use in a production setting? Discuss when choosing one over another should be considered.
Quantization is a technique where all computations are performed with integers instead of floats. We are essentially taking all continuous signals and converting them into discretized signals.
Image creditAs discussed in this blogpost series, while float
(32-bit) is the primarily used precision in machine learning because is strikes a good balance between memory consumption, precision and computational requirement it does not mean that during inference we can take advantage of quantization to improve the speed of our model. For instance:
Floating-point computations are slower than integer operations
Recent hardware have specialized hardware for doing integer operations
Many neural networks are actually not bottlenecked by how many computations they need to do but by how fast we can transfer data e.g. the memory bandwidth and cache of your system is the limiting factor. Therefore working with 8-bit integers vs 32-bit floats means that we can approximately move data around 4 times as fast.
Storing models in integers instead of floats save us approximately 75% of the ram/harddisk space whenever we save a checkpoint. This is especially useful in relation to deploying models using docker (as you hopefully remember) as it will lower the size of our docker images.
But how do we convert between floats and integers in quantization? In most cases we often use a linear affine quantization:
$$ x_{int} = \\text{round}\\left( \\frac{x_{float}}{s} + z \\right) $$
where $s$ is a scale and $z$ is the so called zero point. But how does to doing inference in a neural network. The figure below shows all the conversations that we need to make to our standard inference pipeline to actually do computations in quantized format.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_1","title":"\u2754 Exercises","text":"Lets look at how quantized tensors look in PyTorch
Start by creating a tensor that contains both random numbers
Next call the torch.quantize_per_tensor
function on the tensor. What does the quantized tensor look like? How does the values relate to the scale
and zero_point
arguments.
Finally, try to call the .dequantize()
method on the tensor. Do you get a tensor back that is close to what you initially started out with.
As you hopefully saw in the first exercise we are going to perform a number of rounding errors when doing quantization and naively we would expect that this would accumulate and lead to a much worse model. However, in practice we observe that quantization still works, and we actually have a mathematically sound reason for this. Can you figure out why quantization still works with all the small rounding errors? HINT: it has to do with the central limit theorem
Lets move on to quantization of our model. Follow this tutorial from PyTorch on how to do quantization. The goal is to construct a model model_fc32
that works on normal floats and a quantized version model_int8
. For simplicity you can just use one of the models from the tutorial.
Lets try to benchmark our quantized model and see if all the trouble that we went through actually paid of. Also try to perform the benchmark on the non-quantized model and see if you get a difference. If you do not get an improvement, explain why that may be.
Pruning is another way for reducing the model size and maybe improve performance of our network. As the figure below illustrates, in pruning we are simply removing weights in our network that we do not consider important for the task at hand. By removing, we here mean that the weight gets set to 0. There are many ways to determine if a weight is important but the general rule that the importance of a weight is proportional to the magnitude of a given weight. This makes intuitively sense, since weights in all linear operations (fully connected or convolutional) are always multiplied onto the incoming value, thus a small weight means a small outgoing activation.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_2","title":"\u2754 Exercises","text":"We provide a start script that implements the famous LeNet in this file. Open and run it just to make sure that you know the network.
PyTorch have already some pruning methods implemented in its package. Import the prune
module from torch.nn.utils
in the script.
Try to prune the weights of the first convolutional layer by calling
prune.random_unstructured(module_1, name=\"weight\", amount=0.3) # (1)!\n
Try printing the named_parameters
, named_buffers
before and after the module is pruned. Can you explain the difference and what is the connection to the module_1.weight
attribute.
Try pruning the bias of the same module this time using the l1_unstructured
function from the pruning module. Again check the named_parameters
, named_buffers
argument to make sure you understand the difference between L1 pruning and unstructured pruning.
Instead of pruning only a single module in the model lets try pruning the whole model. To do this we just need to iterate over all named_modules
in the model like this:
for name, module in new_model.named_modules():\n prune.l1_unstructured(module, name='weight', amount=0.2)\n
But what if we wanted to apply different pruning to different layers. Implement a pruning scheme where
amount=0.2
amount=0.4
Print print(dict(new_model.named_buffers()).keys())
after the pruning to confirm that all weights have been correctly pruned.
The pruning we have looked at until know have only been local in nature e.g. we have applied the pruning independently for each layer, not accounting globally for how much we should actually prune. As you may realize this can quickly lead to an network that is pruned too much. Instead, the more common approach is too prune globally where we remove the smallest X
amount of connections:
Start by creating a tuple over all the weights with the following format
parameters_to_prune = (\n (model.conv1, 'weight'),\n # fill in the rest of the modules yourself\n (model.fc3, 'weight'),\n)\n
The tuple needs to have length 5. Challenge: Can you construct the tuple using for
loops, such that the code works for arbitrary size networks?
Next prune using the global_unstructured
function to globally prune the tuple of parameters
prune.global_unstructured(\n parameters_to_prune,\n pruning_method=prune.L1Unstructured,\n amount=0.2,\n)\n
Check that the amount that have been pruned is actually equal to the 20% specified in the pruning. We provide the following function that for a given submodule (for example model.conv1
) computes the amount of pruned weights
def check_prune_level(module: nn.Module):\n sparsity_level = 100 * float(torch.sum(module.weight == 0) / module.weight.numel())\n print(f\"Sparsity level of module {sparsity_level}\")\n
With a pruned network we really want to see if all our effort actually ended up with a network that is faster and/or smaller in memory. Do the following to the globally pruned network from the previous exercises:
First we need to make the pruning of our network permanent. Right now it is only semi-permanent as we are still keeping a copy of the original weights in memory. Make the change permanent by calling prune.remove
on every pruned module in the model. Hint: iterate over the parameters_to_prune
tuple.
Next try to measure the time of a single inference (repeated 100 times) for both the pruned and non-pruned network
import time\ntic = time.time()\nfor _ in range(100):\n _ = network(torch.randn(100, 1, 28, 28))\ntoc = time.time()\nprint(toc - tic)\n
Is the pruned network actually faster? If not can you explain why?
Next lets measure the size of our network (called pruned_network
) and a freshly initialized network (called network
):
torch.save(pruned_network.state_dict(), 'pruned_network.pt')\ntorch.save(network.state_dict(), 'network.pt')\n
Lookup the size of each file. Are the pruned network actually smaller? If not can you explain why?
Repeat the last exercise, but this time start by converting all pruned weights to sparse format first by calling the .to_sparse()
method on each pruned weight. Is the saved model smaller now?
This ends the exercises on pruning. As you probably realized in the last couple of exercises, then pruning does not guarantee speedups out of the box. This is because linear operations in PyTorch does not handle sparse structures out of the box. To actually get speedups we would need to deep dive into the sparse tensor operations, which again does not even guarantee that a speedup because the performance of these operations depends on the sparsity structure of the pruned weights. Investigating this is out of scope for these exercises, but we highly recommend checking it out if you are interested in sparse networks.
"},{"location":"s9_scalable_applications/inference/#knowledge-distillation","title":"Knowledge distillation","text":"Knowledge distillation is somewhat similar to pruning, in the sense that it tries to find a smaller model that can perform equally well as a large model, however it does so in a completely different way. Knowledge distillation is a model compression technique that builds on the work of Bucila et al. in which we try do distill/compress the knowledge of a large complex model (also called the teacher model) into a simpler model (also called the student model).
The best known example of this is the DistilBERT model. The DistilBERT model is a smaller version of the large natural-language procession model Bert, which achieves 97% of the performance of Bert while only containing 40% of the weights and being 60% faster. You can see in the figure below how it is much smaller in size compared to other models developed at the same time.
Image creditKnowledge distillation works by assuming we have a big teacher that is already performing well that we want to compress. By running our training set through our large model we get a softmax distribution for each and every training sample. The goal of the students, is to both match the original labels of the training data but also match the softmax distribution of the teacher model. The intuition behind doing this, is that teacher model needs to be more complex to learn the complex inter-class relasionship from just (one-hot) labels. The student on the other hand gets directly feed with softmax distributions from the teacher that explicit encodes this inter-class relasionship and thus does not need the same capasity to learn the same as the teacher.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_3","title":"\u2754 Exercises","text":"Lets try implementing model distillation ourself. We are going to see if we can achieve this on the cifar10 dataset. Do note that exercise below can take quite long time to finish because it involves training multiple networks and therefore involve some waiting.
Start by install the transformers
and datasets
packages from Huggingface
pip install transformers\npip install datasets\n
which we are going to download the cifar10 dataset and a teacher model.
Next download the cifar10 dataset
from datasets import load_dataset\ndataset = load_dataset(\"cifar10\")\n
Next lets initialize our teacher model. For this we consider a large transformer based model:
from transformers import AutoFeatureExtractor, AutoModelForImageClassification\nextractor = AutoFeatureExtractor.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\nmodel = AutoModelForImageClassification.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\n
To get the logits (un-normalized softmax scores) from our teacher model for a single datapoint from the training dataset you would extract it like this:
sample_img = dataset['train'][0]['img']\npreprocessed_img = extractor(dataset['train'][0]['img'], return_tensors='pt')\noutput = model(**preprocessed_img)\nprint(output.logits)\n# tensor([[ 3.3682, -0.3160, -0.2798, -0.5006, -0.5529, -0.5625, -0.6144, -0.4671, 0.2807, -0.3066]])\n
Repeat this process for the whole training dataset and store the result somewhere.
Implement a simple convolutional model. You can create a custom one yourself or use a small one from torchvision
.
Train the model on cifar10 to convergence, so you have a base result on how the model is performing.
Redo the training, but this time add knowledge distillation to your training objective. It should look like this:
for batch in dataset:\n # ...\n img, target, teacher_logits = batch\n preds = model(img)\n loss = torch.nn.functional.cross_entropy(preds, target)\n loss_teacher = torch.nn.functional.cross_entropy(preds, teacher_logits)\n loss = loss + loss_teacher\n loss.backward()\n # ...\n
Compare the final performance obtained with and without knowledge distillation. Did the performance improve or not?
This ends the module on scaling inference in machine learning models.
"},{"location":"samples/","title":"Collection of sample applications","text":""},{"location":"tools/","title":"Tools","text":"Just a collection of tools and scripts for running the course.
"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 9e000f5347fd8caa856fb8e292b3e03bbb414032..7607e6e63288faf0a885e256820b6f15379cc7ae 100644 GIT binary patch delta 13 Ucmb=gXP58h;9#h*o5)@P02tc?h5!Hn delta 13 Ucmb=gXP58h;Al{@oycAR02)05vj6}9